Wiley --> wiley.com

The Web Testing Companion: The Insider's Guide to Efficient and Effective Tests

Lydia Ash

Appendix G - Problem Characters and Sample Test Input

This appendix contains sample input that has a high likelihood of causing misbehavior in many different types of applications. The exact usage varies depending on the application: some will be sensitive to these cases in a URL, others through a text input field, and others will be very tolerant of the data and behave correctly. Many applications will have their own sets of problematic input, which may include these cases as well as some unique ones.

To make it easier for you to use the inputs in your own testing, this file is available for download as AppendixG.doc.

Characters from the Single-Byte Character Sets

Control Characters

The control characters in Table G.1 are often left off code page charts because these first 32 code points are common to all of them and are nonprintable.

Unicode Point Abbreviation Keystroke Name Comments
[U+0000] NUL Ctrl+@ NULL Test this everywhere data can be input or stored; many systems crash or fail when they encounter an unexpected NUL, so code needs to handle it gracefully.
[U+0001] SOH Ctrl+A START OF HEADING  
[U+0002] STX Ctrl+B START OF TEXT  
[U+0003] ETX Ctrl+C END OF TEXT  
[U+0004] EOT Ctrl+D END OF TRANSMISSION  
[U+0005] ENQ Ctrl+E ENQUIRY  
[U+0006] ACK Ctrl+F ACKNOWLEDGE  
[U+0007] BEL Ctrl+G BELL Beep; caused teletype machines to ring a bell and will cause many common terminal and terminal emulation programs to beep.
[U+0008] BS Ctrl+H BACKSPACE  
[U+0009] HT Ctrl+I HORIZONTAL TAB  
[U+000A] LF Ctrl+J LINE FEED  
[U+000B] VT Ctrl+K VERTICAL TAB  
[U+000C] FF Ctrl+L FORM FEED  
[U+000D] CR Ctrl+M CARRIAGE RETURN  
[U+000E] SO Ctrl+N SHIFT OUT Switches output device to alternate character set.
[U+000F] SI Ctrl+O SHIFT IN Switches output device to default character set.
[U+0010] DLE Ctrl+P DATA LINK ESCAPE  
[U+0011] DC1 Ctrl+Q DEVICE CONTROL 1 Also the XON command for a modem soft handshake.
[U+0012] DC2 Ctrl+R DEVICE CONTROL 2  
[U+0013] DC3 Ctrl+S DEVICE CONTROL 3 Also the XOFF command for the modem soft handshake.
[U+0014] DC4 Ctrl+T DEVICE CONTROL 4  
[U+0015] NAK Ctrl+U NEGATIVE ACKNOWLEDGE  
[U+0016] SYN Ctrl+V SYNCHRONOUS IDLE  
[U+0017] ETB Ctrl+W END OF TRANSMISSION BLOCK  
[U+0018] CAN Ctrl+X CANCEL  
[U+0019] EM Ctrl+Y END OF MEDIUM  
[U+001A] SUB Ctrl+Z SUBSTITUTE  
[U+001B] ESC Ctrl+[ ESCAPE  
[U+001C] FS Ctrl+\ FILE SEPARATOR  
[U+001D] GS Ctrl+] GROUP SEPARATOR  
[U+001E] RS Ctrl+^ RECORD SEPARATOR  
[U+001F] US Ctrl+_ UNIT SEPARATOR  
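
One way to exercise every control character in Table G.1 is to generate the inputs programmatically. The following is a minimal Python sketch; the function name `control_char_cases` and the seed word are assumptions made for illustration and are not part of the original appendix. It places each of the 32 control characters first, in the middle, and last in an otherwise ordinary string.

```python
def control_char_cases(seed="test"):
    """Yield (code_point, position, text) for U+0000 through U+001F."""
    middle = len(seed) // 2
    for code_point in range(0x20):
        ch = chr(code_point)
        yield code_point, "first", ch + seed
        yield code_point, "middle", seed[:middle] + ch + seed[middle:]
        yield code_point, "last", seed + ch


cases = list(control_char_cases())
print(len(cases))  # 32 control characters x 3 positions = 96
```

How the strings are delivered (form field, URL, file, API call) depends on the application under test, so that part is left out here.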

IBM PC Keyboard Scan Codes

For special key combinations (for example, Alt+S, F5, and so on), a special two-character escape sequence is used. Depending on the language, the escape character can be either Escape [U+001B] or NUL [U+0000]. I will assume that NUL is being used in Table G.2. Having these codes can be very useful for automation or other places where you need to send particular keys.

Key Combination Escape Sequence
Alt+A [U+0000][U+001E]
Alt+B [U+0000][U+0030]
Alt+C [U+0000][U+002E]
Alt+D [U+0000][U+0020]
Alt+E [U+0000][U+0012]
Alt+F [U+0000][U+0021]
Alt+G [U+0000][U+0022]
Alt+H [U+0000][U+0023]
Alt+I [U+0000][U+0017]
Alt+J [U+0000][U+0024]
Alt+K [U+0000][U+0025]
Alt+L [U+0000][U+0026]
Alt+M [U+0000][U+0032]
Alt+N [U+0000][U+0031]
Alt+O [U+0000][U+0018]
Alt+P [U+0000][U+0019]
Alt+Q [U+0000][U+0010]
Alt+R [U+0000][U+0013]
Alt+S [U+0000][U+001F]
Alt+T [U+0000][U+0014]
Alt+U [U+0000][U+0016]
Alt+V [U+0000][U+002F]
Alt+W [U+0000][U+0011]
Alt+X [U+0000][U+002D]
Alt+Y [U+0000][U+0015]
Alt+Z [U+0000][U+002C]
PGUP [U+0000][U+0049]
PGDN [U+0000][U+0051]
HOME [U+0000][U+0047]
END [U+0000][U+004F]
UPARRW [U+0000][U+0048]
DNARRW [U+0000][U+0050]
LFTARRW [U+0000][U+004B]
RTARRW [U+0000][U+004D]
F1 [U+0000][U+003B]
F2 [U+0000][U+003C]
F3 [U+0000][U+003D]
F4 [U+0000][U+003E]
F5 [U+0000][U+003F]
F6 [U+0000][U+0040]
F7 [U+0000][U+0041]
F8 [U+0000][U+0042]
F9 [U+0000][U+0043]
F10 [U+0000][U+0044]
F11 [U+0000][U+0085]
F12 [U+0000][U+0086]
Alt+F1 [U+0000][U+0068]
Alt+F2 [U+0000][U+0069]
Alt+F3 [U+0000][U+006A]
Alt+F4 [U+0000][U+006B]
Alt+F5 [U+0000][U+006C]
Alt+F6 [U+0000][U+006D]
Alt+F7 [U+0000][U+006E]
Alt+F8 [U+0000][U+006F]
Alt+F9 [U+0000][U+0070]
Alt+F10 [U+0000][U+0071]
Alt+F11 [U+0000][U+008B]
Alt+F12 [U+0000][U+008C]
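
For automation, the two-character sequences in Table G.2 can be built from a small lookup table. The sketch below is only an illustration: `SCAN_CODES` and `escape_sequence` are hypothetical names, only a few rows of the table are copied in, and how the sequence is actually delivered to the application depends on your automation tool.

```python
# A few entries copied from Table G.2; extend as needed.
SCAN_CODES = {
    "Alt+A": 0x1E,
    "HOME": 0x47,
    "F1": 0x3B,
    "F12": 0x86,
}


def escape_sequence(key, escape="\x00"):
    """Return the two-character sequence for a key, NUL-prefixed by default."""
    return escape + chr(SCAN_CODES[key])


print([hex(ord(c)) for c in escape_sequence("F1")])  # ['0x0', '0x3b']
```

Passing `escape="\x1b"` produces the Escape-prefixed variant mentioned above.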
 

Character Combinations

Testing each of the control characters listed previously in this appendix on its own is one type of test case. However, characters that are handled correctly individually can mean something special when used in certain combinations. Below is one key combination of control characters to test.

[U+000D][U+000A] - CRLF or (CR)(LF), a carriage return followed by a line feed. It can mean several things, such as the end of a packet segment. Two of these in a row also need to be tested as input or within a stream of input, because many protocols treat two in a row as the end of a transmission.
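
The CRLF cases can be generated in the same way as the single-character cases. This is a minimal sketch under the assumption that you want the single and doubled sequence at the start, in the middle, and at the end of a value; `crlf_cases` is a hypothetical helper name.

```python
CRLF = "\r\n"


def crlf_cases(seed="value"):
    """Return inputs with a single and a doubled CRLF at several positions."""
    cases = []
    for marker in (CRLF, CRLF * 2):
        cases.append(marker + seed)          # leading
        cases.append(seed + marker + seed)   # embedded in a stream
        cases.append(seed + marker)          # trailing
    return cases


for case in crlf_cases():
    print(repr(case))
```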

Lower ASCII

Table G.3 provides some information about each potentially problematic lower ASCII character. Depending on the usage and context, these characters can mean very different things. The notations are just suggestions about how a character could be a sensitive or unwise character.

Character Code page point Unicode point Name Comment
   0x20 [U+0020] Space Also a C reserved char; very useful for turning up problems if first, last, or only char entered; problematic in a URL
! 0x21 [U+0021] Exclamation mark Problematic in a URL
" 0x22 [U+0022] Double quotes A C reserved char and delimiter; problematic in a URL
# 0x23 [U+0023] Number sign May be a delimiter; problematic in a URL
$ 0x24 [U+0024] Dollar sign A reserved character in a query component
% 0x25 [U+0025] Percent A C reserved char or a delimiter
& 0x26 [U+0026] Ampersand Character in a query component; problematic in a URL
' 0x27 [U+0027] Apostrophe A C reserved char and unwise to leave unescaped; problematic in a URL
( 0x28 [U+0028] Left parenthesis Problematic in a URL
) 0x29 [U+0029] Right parenthesis Problematic in a URL
* 0x2A [U+002A] Asterisk  
+ 0x2B [U+002B] Plus sign Character in a query component; problematic in a URL
, 0x2C [U+002C] Comma Character in a query component; problematic in a URL
- 0x2D [U+002D] Hyphen - minus  
. 0x2E [U+002E] Full stop (period) Especially as last char of a file name
/ 0x2F [U+002F] Solidus (slash) Especially as last char of a file name; also a C reserved char or reserved in a query component; problematic in a URL
: 0x3A [U+003A] Colon A reserved character in a query component; problematic in a URL
; 0x3B [U+003B] Semicolon A valid char in a URL, however can be problematic; may want to escape anyway; reserved within a query component, can be a parameter delimiter.
< 0x3C [U+003C] Less-than sign Can be a delimiter or part of HTML or script; problematic in a URL
= 0x3D [U+003D] Equals sign Reserved character in a query component; problematic in a URL
> 0x3E [U+003E] Greater-than sign Can be a delimiter or part of HTML or script; problematic in a URL
? 0x3F [U+003F] Question mark Reserved character in a query component; problematic in a URL
@ 0x40 [U+0040] Commercial At (at sign) Reserved character in a query component; problematic in a URL unless part of the authentication
[ 0x5B [U+005B] Left square bracket An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
\ 0x5C [U+005C] Reverse solidus (backslash) Especially as last char of a file name; an unwise character to leave unescaped; problematic in a URL
] 0x5D [U+005D] Right square bracket An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
^ 0x5E [U+005E] Circumflex accent An unwise character to leave unescaped; problematic in a URL
_ 0x5F [U+005F] Low line An unwise character to leave unescaped; problematic in a URL
` 0x60 [U+0060] Grave accent An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
{ 0x7B [U+007B] Left curly brace An unwise character to leave unescaped; problematic in a URL
| 0x7C [U+007C] Vertical line (pipe) An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
} 0x7D [U+007D] Right curly brace  
~ 0x7E [U+007E] Tilde  
 0x7F [U+007F] Delete  
«  0xAB [U+00AB] Left-pointing double angle   
_ 0x1C [U+001C] File Separator  

Extended Range Problem Characters

Table G.4 contains potentially problematic extended range characters from the single-byte code pages.

Table G.4 Extended Range Problem Characters

Character Unicode point Name Comment
ö [U+00F6] Latin Small Letter O with Diaeresis Can be a problem in filenames on DBCS systems.
§ [U+00A7] Section Sign  
ß [U+00DF] Latin Small Letter Sharp S  
å [U+00E5] Latin Small Letter A with Ring Above DOS delete marker. Mostly significant if first char in a string; essentially this is a Ctrl+z.
€ [U+20AC] Euro Currency Symbol
ª [U+00AA] Feminine Ordinal Indicator This can sometimes be interpreted by Novell's NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
® [U+00AE] Registered Sign This can sometimes be interpreted by Novell's NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
¿ [U+00BF] Inverted Question Mark This can sometimes be interpreted by Novell's NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
İ [U+0130] 0xDD on 1254 code page Latin Capital Letter I with Dot Above Only found in Turkish on the 1254 code page; this can be seen being converted if the system does not properly handle this.
ı [U+0131] 0xFD on 1254 code page Latin Small Dotless Letter I Only found in Turkish on the 1254 code page; this can be seen being converted if the system does not properly handle this.
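
The conversion problem with the Turkish I's can be observed with ordinary case mapping. The sketch below uses Python's built-in, locale-independent `str.upper()` and `str.lower()`; the helper name `survives_case_round_trip` is an assumption made for illustration. Other platforms and locale-aware libraries may map these characters differently, which is exactly why they are worth testing.

```python
def survives_case_round_trip(ch):
    """True if upper/lower conversion can return to the original character."""
    return ch.lower().upper() == ch or ch.upper().lower() == ch


for ch in ("I", "i", "\u0130", "\u0131"):
    print(f"U+{ord(ch):04X}", survives_case_round_trip(ch))

# A naive case-insensitive comparison: U+0130 lowercases to "i" plus
# U+0307 (combining dot above), so the two strings do not match.
print("T\u0130TLE".lower() == "title")  # False
```

The two ASCII letters survive the round trip, while U+0130 and U+0131 do not, so data that passes through a case conversion can come back changed.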

Problem Character Combinations

Table G.5 contains problem character combinations from the lower ASCII, the extended range (or upper ASCII), and then combinations of the two.

Table G.5 Problem Character Combinations

Characters Unicode points Names Comment
::  [U+003A][U+003A] Two colons  
~1: [U+007E][U+0031][U+003A] A tilde, a number (any number), and a colon  
.. [U+002E][U+002E] Two periods This can present security problems by allowing access to files otherwise not accessible.
$$ [U+0024][U+0024] Two dollar signs  
:€� [U+003A][U+20AC][U+FFFD] Colon, Euro symbol, and [U+FFFD] Although FFFD is not a "real" character, this can present problems.
++ [U+002B][U+002B] Two pluses  
%0 [U+0025][U+0030] Percent sign, number zero Can cause problems in Perl scripts.
\n [U+005C][U+006E] Backslash, letter n Escape sequence for new line in JavaScript.
\b [U+005C][U+0062] Backslash, letter b Escape sequence for backspace in JavaScript.
%20 [U+0025][U+0032][U+0030] Percent sign, number two, number zero URL encoded sequence for a space.
00:\ [U+0030][U+0030][U+003A][U+005C] Two number zeros, colon, backslash  
& [U+0026] Ampersand  
< [U+003C] Less-than sign  
> [U+003E] Greater-than sign  
= [U+003D] Equals sign  
Ü¢£  [U+00DC][U+00A2][U+00A3] Letter U with diaeresis, cent sign, pound (currency) sign - high literals  
FFFFFFFF [U+0046][U+0046][U+0046][U+0046][U+0046][U+0046][U+0046][U+0046] Eight letter F Input as a value, especially a regkey.
::$DATA [U+003A][U+003A][U+0024][U+0044][U+0041][U+0054][U+0041] Two colons, dollar sign, letters D, A, T, A Indicates data stream.
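
The two-period entry in Table G.5 is dangerous because a path built from user input can climb out of its intended directory. The sketch below shows one way a test oracle might check for that; it is an illustration only, `stays_inside` is a hypothetical helper, and it uses POSIX-style paths without considering symbolic links, Windows paths, or encoded forms of the periods.

```python
import posixpath


def stays_inside(base, user_path):
    """Return True if base/user_path, once normalized, is still under base."""
    base = posixpath.normpath(base)
    candidate = posixpath.normpath(posixpath.join(base, user_path))
    return candidate == base or candidate.startswith(base + "/")


print(stays_inside("/srv/files", "report.txt"))        # True
print(stays_inside("/srv/files", "../../etc/passwd"))  # False
```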

Lower ASCII Character Combination Verification Cases

Table G.6 contains test cases to try in order to verify that your application properly handles various lower ASCII characters. Whereas the previous set of character combinations was chosen for its potential to break an application, these are chosen for their ability to prove that the application is properly handling valid lower ASCII input.

Table G.6 Character Combination Verification Cases

Characters Unicode points Comment
aAzZ [U+0061][U+0041][U+007A][U+005A] Tests that basic alphabetic characters are accepted.
1234 [U+0031][U+0032][U+0033][U+0034] Tests that common numbers are accepted.
12aZ [U+0031][U+0032][U+0061][U+005A] Tests that numbers and letters are accepted, starting with numbers.
aZ12 [U+0061][U+005A][U+0031][U+0032] Tests that letters and numbers are accepted, ending with numbers.
~!;:?/* [U+007E][U+0021][U+003B][U+003A][U+003F][U+002F][U+002A] Tests that common symbols are accepted.
/../ [U+002F][U+002E][U+002E][U+002F] Tests symbols, but in an arrangement that can be interpreted as a file path.
..%255c.. [U+002E][U+002E][U+0025][U+0032][U+0035][U+0035][U+0063][U+002E][U+002E] Test case for URL canonicalization.
..%%35%63.. [U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E] Test case for URL canonicalization.
..%%35c.. [U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0063][U+002E][U+002E] Test case for URL canonicalization.
..%25%35%63.. [U+002E][U+002E][U+0025][U+0032][U+0035][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E] Test case for URL canonicalization.
..%252f.. [U+002E][U+002E][U+0025][U+0032][U+0035][U+0032][U+0066][U+002E][U+002E] Test case for URL canonicalization.
..%c0%2f.. [U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0032][U+0066][U+002E][U+002E] Test case for URL canonicalization.
..%c0%af.. [U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0061][U+0066][U+002E][U+002E] Test case for URL canonicalization.
..%c1%1c.. [U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0031][U+0063][U+002E][U+002E] Test case for URL canonicalization.
..%c1%9c.. [U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0039][U+0063][U+002E][U+002E] Test case for URL canonicalization.
..%255c../..%255c../..%255c/..%c1%1c../..%c1%1c../..%c1%1c..   Test case for URL canonicalization.
/À®./ [U+002F][U+00C0][U+00AE][U+002E][U+002F] Used with the previous test, specifically to test parsers; if the previous input is not an allowed sequence, then this should probably not be an allowed sequence either.
\\?\C:\foo.txt [U+005C][U+005C][U+003F][U+005C][U+0043][U+003A][U+005C][U+0066][U+006F][U+006F][U+002E][U+0074][U+0078][U+0074] Tests the assumption that the local file location has a colon as its second character; NT specific.
\\127.0.0.1\C$\ [U+005C][U+005C][U+0031][U+0032][U+0037][U+002E][U+0030][U+002E][U+0030][U+002E][U+0031][U+005C][U+0043][U+0024][U+005C] Tests the assumption that the local file location has a colon as its second character; refers to the UNC localhost.
&lt; [U+0026][U+006C][U+0074][U+003B] HTML sequence for the less-than sign.
&nbsp; [U+0026][U+006E][U+0062][U+0073][U+0070][U+003B] HTML sequence for a non-breaking space.
<br> [U+003C][U+0062][U+0072][U+003E] HTML tag for a break.
&#65; [U+0026][U+0023][U+0036][U+0035][U+003B] Decimal HTML sequence for the letter A.
&#x0041; [U+0026][U+0023][U+0078][U+0030][U+0030][U+0034][U+0031][U+003B] Similar to the previous example, but this is the hexadecimal HTML sequence for the letter A.
0xf [U+0030][U+0078][U+0066] May be interpreted as a hexadecimal number; in this case it would be 15.
0xa [U+0030][U+0078][U+0061] May be interpreted as a hexadecimal number; in this case it would be converted to 10.
%UFF3C [U+0025][U+0055][U+0046][U+0046][U+0033][U+0043] URL encoded DBCS backslash.
Iiİı [U+0049][U+0069][U+0130][U+0131] Tests the two Latin letter I's and the two extra Turkish I's.
<script>alert('Hello')</script> [U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Script will pop up a Hello alert box if it is executed; it should not be executed.
'><script>alert('Hello')</script> [U+0027][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to the previous example, except this will attempt to close a tag before the script.
"><script>alert('Hello')</script> [U+0022][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to the previous example; this will attempt to close a tag before the script.
<Script>alert('Hello')</Script> [U+003C][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Using mixed case in the script, testing for an exact string match.
<sCript>alert('Hello')</sCript> [U+003C][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to the previous example, using mixed case in the script, testing for an exact string match.
<SCRIPT>alert('Hello')</SCRIPT> [U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E] Similar to the previous example, using all capitals in the script, testing for an exact string match.
&#60;script&#62;alert('Hello')&#60;&#47;script&#62; [U+0026][U+0023][U+0036][U+0030][U+003B][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036][U+0032][U+003B][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+0026][U+0023][U+0036][U+0030][U+003B][U+0026][U+0023][U+0034][U+0037][U+003B][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036][U+0032][U+003B] Similar to the original script example, except this string has the symbols in their decimal HTML reference.
%22><script%20for=window %20event=%22onload()%22>document.write(%22Hello%22);document.close();</script>Hello%22);document.close();</script>.write(%22Hello%22);document.close();</script> [U+0025][U+0032][U+0032][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0025][U+0032][U+0030][U+0066][U+006F][U+0072][U+003D][U+0077][U+0069][U+006E][U+0064][U+006F][U+0077][U+0020][U+0025][U+0032][U+0030][U+0065][U+0076][U+0065][U+006E][U+0074][U+003D][U+0025][U+0032][U+0032][U+006F][U+006E][U+006C][U+006F][U+0061][U+0064][U+0028][U+0029][U+0025][U+0032][U+0032][U+003E][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to the previous example, except this has all quotes and spaces URL escaped.
<script>(unencode("<script>alert('Hello')</script>"))</script> [U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to previous examples, except this attempts to use the unencode function to get script to execute.
blah<script>(unencode("<script>alert('Hello')</script>"))</script> [U+0062][U+006C][U+0061][U+0068][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to the previous examples, except this attempts to use the unencode function to get script to execute.
blah'<script>(unencode("<script>alert('Hello')</script>"))</script> [U+0062][U+006C][U+0061][U+0068][U+0027][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to previous examples, except this attempts to use the unencode function to get script to execute and a single quote.
blah"<script>(unencode("<script>alert('Hello')</script>"))</script> [U+0062][U+006C][U+0061][U+0068][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] Similar to previous examples, except this attempts to use the unencode function to get script to execute and a double quote.
<SCRIPT LANGUAGE="VBScript"> MsgBox "Hello!" </SCRIPT> [U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+0020][U+004C][U+0041][U+004E][U+0047][U+0055][U+0041][U+0047][U+0045][U+003D][U+0022][U+0056][U+0042][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+0022][U+003E][U+0020][U+004D][U+0073][U+0067][U+0042][U+006F][U+0078][U+0020][U+0022][U+0048][U+0065][U+006C][U+006C][U+006F][U+0021][U+0022][U+0020][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E] VBScript version of the previous example; a message box will pop up if it is executed.
<a href="JavaScript:alert()">link</a> [U+003C][U+0061][U+0020][U+0068][U+0072][U+0065][U+0066][U+003D][U+0022][U+004A][U+0061][U+0076][U+0061][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0029][U+0022][U+003E][U+006C][U+0069][U+006E][U+006B][U+003C][U+002F][U+0061][U+003E]
‹script›alert(‘Hello‘)‹⁄script› [U+2039][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+203A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+2018][U+0048][U+0065][U+006C][U+006C][U+006F][U+2018][U+0029][U+2039][U+2044][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+203A] Symbols have been replaced with their high-bit counterparts.
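
The URL canonicalization cases in Table G.6 rely on a value being percent-decoded more than once, so that a check made after the first pass misses what appears after the second. The sketch below illustrates this with Python's standard `urllib.parse.unquote`; `decode_stages` is a hypothetical helper, and real servers may decode differently or a different number of times.

```python
from urllib.parse import unquote


def decode_stages(text, max_rounds=5):
    """Percent-decode repeatedly, returning every intermediate form."""
    stages = [text]
    for _ in range(max_rounds):
        decoded = unquote(stages[-1])
        if decoded == stages[-1]:
            break
        stages.append(decoded)
    return stages


for sample in ("..%255c..", "..%252f..", "..%25%35%63.."):
    print(decode_stages(sample))
```

For `..%255c..` the stages are `..%255c..`, `..%5c..`, and finally `..\..`, which is why a filter that inspects only the once-decoded value can be bypassed.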

HTML tags can include script where it may not be anticipated. Because these tags, and others, can include script with their attributes, they cannot be considered safe. The following lines contain some examples of how script can appear in what appear to be safe HTML tags.

<img src="JavaScript:alert()">img src</img>

<bgsound src="JavaScript:alert()">bgsound src</bgsound>

<iframe src="JavaScript:alert()">iframe src</iframe>

<table background="JavaScript:alert()">table background</table>

<object data="JavaScript:alert()">object data</object>

<frameset onload="JavaScript:alert()">frameset onload</frameset>

<body onload="JavaScript:alert()">body onload</body>

<body background="JavaScript:alert()">body background</body>
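
When checking whether an application lets such tags through, it can help to scan the output for attributes that may carry script. The sketch below uses Python's standard `html.parser` module; the class and function names are hypothetical, and the two rules it applies (attribute names beginning with "on", values beginning with a script scheme) are assumptions that are far from complete. It is a test aid, not a sanitizer.

```python
from html.parser import HTMLParser


class ScriptAttributeFinder(HTMLParser):
    """Collect (tag, attribute) pairs that could carry script."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on") or value.startswith(("javascript:", "vbscript:")):
                self.findings.append((tag, name))


def find_script_attributes(markup):
    finder = ScriptAttributeFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings


print(find_script_attributes('<img src="JavaScript:alert()">img src</img>'))
print(find_script_attributes('<body onload="JavaScript:alert()">body onload</body>'))
```

The parser lowercases tag and attribute names, so mixed-case input such as `JavaScript:` and `OnLoad` is still reported.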

Upper ASCII Character Combinations

In Table G.7 you will find upper ASCII (extended range) character combinations for use in verifying that your application can handle various valid upper ASCII input.

Table G.7 Upper ASCII Character Combinations

Characters Unicode point Comment
öÜß [U+00F6][U+00DC][U+00DF] High literals
Ü¢£ [U+00DC][U+00A2][U+00A3] High literals
 ©® [U+00A0][U+00A9][U+00AE] Problem literals
¿¾Õ [U+00BF][U+00BE][U+00D5] Regional literals
&><" [U+0026][U+003E][U+003C][U+0022] Named entities
©®¾¿Õ [U+00A0][U+00A9][U+00AE][U+00BE][U+00BF][U+00D5] Literals
åE5å [U+00E5][U+0045][U+0035][U+00E5] Can be mistaken for the DOS delete mark
€\$\ [U+20AC][U+005C][U+0024][U+005C]  
’ [U+00E2][U+20AC][U+2122]  
 

Diacritics

Table G.8 contains the combining marks that can cause large problems and have no ANSI equivalent; these are typed in combination with another character to alter it (for example, typed with c [U+0063] to create ç).

Table G.8 Diacritics

Unicode point Name
[U+0333] Combining double lowline
[U+033F] Combining double overline
[U+0327] Combining cedilla
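
Whether a base character plus a combining mark collapses into a single code point depends on whether Unicode defines a precomposed form. The following sketch uses Python's standard `unicodedata` module to show the difference; `compose` is a hypothetical helper name.

```python
import unicodedata


def compose(base, mark):
    """Return the NFC form of base + combining mark and its length."""
    text = unicodedata.normalize("NFC", base + mark)
    return text, len(text)


print(compose("c", "\u0327"))  # cedilla: a precomposed form exists, length 1
print(compose("c", "\u0333"))  # double low line: stays two code points
```

Length checks, truncation, and comparisons can therefore behave differently depending on which mark is used and whether the application normalizes its input.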
 

High-Bit Characters

The characters listed in Table G.9 are different from their low-bit counterparts and often end up converted to those counterparts when the software cannot handle them. For instance, try taking a script and substituting the corresponding high-bit characters to see whether a filter allows them through and another component downgrades them, with the end result that the script is executed. These characters can also be problematic on their own as input.

Table G.9 High-Bit Characters

Characters Unicode point Name
­ [U+00AD] Soft hyphen (SHY)
[U+2018] Single opening quote
[U+2019] Single closing quote
[U+201C] Double opening quote
[U+201D] Double closing quote
´ [U+00B4] Acute accent
¸ [U+00B8] Cedilla
  [U+00A0] Non-Breaking Space (NBSP)
© [U+00A9] Copyright
® [U+00AE] Registered Mark
  [U+2122] Trademark
[U+2013] En-dash
[U+2014] Em-dash
[U+2026] Ellipsis
[U+2044] Fraction Slash
[U+2039] Single Left-Pointing Angle
[U+203A] Single Right-Pointing Angle
[U+2032] Prime
[U+2033] Double Prime
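
The filter-then-downgrade scenario can be simulated in a few lines. In the sketch below, `DOWNGRADE` is a hypothetical "best fit" table standing in for whatever a downstream component might do, and `naive_filter_allows` is a deliberately weak filter; neither reflects any particular product.

```python
# Hypothetical "best fit" table, standing in for whatever a downstream
# component might do when it cannot represent these characters.
DOWNGRADE = str.maketrans({
    "\u2039": "<",
    "\u203A": ">",
    "\u2018": "'",
    "\u2019": "'",
    "\u201C": '"',
    "\u201D": '"',
    "\u2044": "/",
})


def naive_filter_allows(text):
    """A deliberately weak filter: rejects only a literal '<script'."""
    return "<script" not in text.lower()


payload = "\u2039script\u203Aalert(\u2018Hello\u2018)\u2039\u2044script\u203A"
print(naive_filter_allows(payload))   # True: the filter sees no '<script'
print(payload.translate(DOWNGRADE))   # <script>alert('Hello')</script>
```

The point of the test is the ordering: the filter runs on the high-bit form, and the executable form only appears afterward.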
 

Characters from Multibyte Character Sets

The rest of the tables in this appendix deal with double-byte characters and single-byte characters from the multibyte code pages.

Boundary Cases

Table G.10 contains characters for testing the first and last characters of the various multibyte code page ranges.

Table G.10 Boundary Cases for the Multibyte Code Pages

Characters Unicode point Comment
  [U+3000] [81/40] in 932, [A1/A1] in 949 and 936, [A1/40] in 950 Ideographic space - beginning of first DBCS range on 932 code page
[U+6EEC] [9F/FC] in 932 End of first DBCS range on 932 code page
[U+FF61] [A1] in 932 Beginning of Kana (single byte range) on 932 code page
[U+FF9F] [DF] in 932 End of Kana
[U+6F3E] [E0/40] in 932 Beginning of second DBCS range on 932 code page
[U+9ED1] [FC/4B] in 932 End of second DBCS range on 932 code page
[U+E4C6] [A1/40] in 936 code page Beginning of CHS 936 code page
[U+E4C5] [FE/FE] in 936 code page End of CHS 936 code page
[U+EEB8] [81/40] in 950 code page Beginning of CHT 950 code page
[U+E310] [FE/FE] in 950 code page End of CHT 950 code page
[U+AC02] [81/41] in 949 code page Beginning of Korean 949 code page
[U+8A70] [FD/FE] in 949 code page End of Korean 949 code page
 

Testing Individual Bytes that Make up the Double-Byte Character

Because double-byte characters consist of 2 bytes that are read individually, either byte could be mistaken for a special lower ASCII character. You therefore need to look at the lower ASCII characters that have a special meaning, take the code point each one occupies, and identify double-byte characters that have that value as either a leading byte or a trailing byte (see Tables G.11 through G.16).
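
To find or confirm such characters, you can encode a candidate and inspect its bytes. The sketch below uses Python's built-in `cp932` codec; the helper names are hypothetical, and the two sample characters (U+30BD and U+8868) are not from the tables in this appendix but are commonly cited examples whose trailing byte on the 932 code page is 5C.

```python
def byte_roles(text, codec="cp932"):
    """Return (character, encoded bytes as hex) pairs for each character."""
    return [(ch, ch.encode(codec).hex()) for ch in text]


def has_trail_byte(ch, trail, codec="cp932"):
    """True if ch encodes to two bytes and the second equals trail."""
    raw = ch.encode(codec)
    return len(raw) == 2 and raw[1] == trail


print(byte_roles("\u30bd\u8868"))      # trail byte of both is 5c
print(has_trail_byte("\u30bd", 0x5C))  # True
```

The same check with a different `trail` value, or an equivalent check on `raw[0]`, covers the other special bytes such as E5 and AD.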

Table G.11 Lead Byte Is 81

Character Unicode code point Code point
[U+30FC] [81/5B] on 932 code page
[U+2010] [81/5D] on 932 code page
[U+FF3C] [81/5F] on 932 code page
[U+FF0B] [81/7B] on 932 code page
[U+FF0D] [81/7C] on 932 code page
± [U+00B1] [81/7D] on 932 code page
× [U+00D7] [81/7E] on 932 code page

Table G.12 Trailing Byte is 5C (ANSI Backslash Character - Need to Use as First, Middle, and Last Character in a String)
Character Unicode code point Code point
[U+2015] [81/5C] on 932 code page
[U+E0F7] [81/5C] on Windows 932 code page
[U+4E57] [81/5C] on 936 code page
[U+EED4]  [81/5C] on 950 code page

Table G.13 Lead Byte Is E5 - Special DOS Deletion Mark
Character Unicode code point Code point
[U+8541] [E5/40] on 932 code page
[U+86EC] [E5/7E] on 932 code page
[U+591C] [E5/A8] on 949 code page
[U+F981] [E5/FC] on 949 code page

Table G.14 Trail Byte Is AD - ANSI Soft Hyphen
Character Unicode code point Code point
[U+4F03] [81/AD] on 936 code page
[U+85C4] [F0/AD] on 950 code page

The double-byte Romaji characters are Latin-looking characters that need to be used anywhere that Latin single-byte characters are expected.

Table G.15 Romaji Characters - Latin-Looking Characters from the 932 Code Page
Character Unicode point Comment
[U+25EF] Boundary
[U+FF10] Use the double-byte numbers where any number might be expected.
[U+FF11] Use the double-byte numbers where any number might be expected.
[U+FF20] Use the double-byte symbols where any symbol might be expected.
[U+FF21] Use the double-byte letters where any letter might be expected.
[U+FF3A] Use the double-byte letters where any letter might be expected.
[U+FF41] Use the double-byte letters where any letter might be expected.
[U+FF5A] Use the double-byte letters where any letter might be expected.
[U+3041] Boundary
[U+FF0E] Use the double-byte fullwidth period where any period might be expected.
[U+FF0F] Use the double-byte fullwidth solidus where any forward-slash might be expected.
[U+FF1A] Use the double-byte fullwidth colon where any colon might be expected.
[U+FF01] Use the double-byte fullwidth exclamation mark where any exclamation mark might be expected.
[U+2018] Use the double-byte fullwidth left single quote where any quote might be expected.
[U+2019] Use the double-byte fullwidth right single quote where any quote might be expected.
[U+201C] Use the double-byte fullwidth left double quote where any quote might be expected.
[U+201D] Use the double-byte fullwidth right double quote where any quote might be expected.
[U+FF1C] Use the double-byte fullwidth less-than sign where any less-than sign might be expected.
[U+FF1E] Use the double-byte fullwidth greater-than sign where any greater-than sign might be expected.
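
One reason the fullwidth characters matter is that compatibility normalization maps them to their ASCII equivalents. The sketch below shows this with Python's standard `unicodedata` module; `to_halfwidth` is a hypothetical helper, and NFKC also changes many other characters, so it is used here only to illustrate the effect.

```python
import unicodedata


def to_halfwidth(text):
    """Apply NFKC, which maps fullwidth forms to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


sample = "\uff1c\uff53\uff43\uff52\uff49\uff50\uff54\uff1e"  # fullwidth <script>
print(to_halfwidth(sample))                                   # <script>
print(to_halfwidth("\uff10\uff11\uff21\uff3a\uff41\uff5a"))   # 01AZaz
```

If input is validated before a step like this and used after it, fullwidth input can turn into markup, digits, or path separators that the validation never saw.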

Table G.16 shows characters that represent potential problems in NetWare.

Table G.16 NetWare Potential Problem Characters
Character Unicode code point Code point
[U+FF6A] [AA] on 932 code page
[U+FF6E] [AE] on 932 code page
ｿ [U+FF7F] [BF] on 932 code page
[U+7A50] [88/AA] on 932 code page
[U+65ED] [88/AE] on 932 code page
[U+88B7] [88/BF] on 932 code page

Potential Problem Character Conversions

When the same character is assigned more than one code point in a code page, converting from the code page to Unicode and then back to the code page can return a different code point from the one you started with. Tables G.17 and G.18 contain some examples of these types of problem characters.

Table G.17 JPN-932

Character Unicode code point Code point
[U+4E28] [FA/68] which will equal [ED/4C]
[U+FFE4] [FA/55] which will equal [EE/FA]
[U+5393] [FA/8D]
[U+6659] [FA/D7]
[U+7E8A] [FA/5C]
[U+69E2] [FA/EC]

Table G.18 CHT-950
Character Unicode code point Code point
[U+2550] [A2/A4] which will equal [F9/F9]
[U+255E] [A2/A5] which will equal [F9/E9]
[U+256A] [A2/A6] which will equal [F9/EA]
[U+5341] [A2/CC] which will equal [A4/51]
[U+2561] [A2/A7] which will equal [F9/EB]
[U+5345] [A2/CE] which will equal [A4/CA]
[U+256D] [F9/FA] which will equal [A2/7E]
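A round trip of this kind is easy to script. The sketch below is a hedged Python illustration, not the author's tooling; the names round_trip, find_lossy, and fmt are hypothetical. It decodes code-page bytes to Unicode, encodes them back, and reports any byte sequence that comes back changed. It assumes Python's bundled cp932 codec; the conversion tables on the platform you are testing may differ from Python's, so treat the output as a hint about which inputs to try rather than as the expected result.

```python
def round_trip(data: bytes, codec: str) -> bytes:
    """Decode code-page bytes to Unicode, then encode back to the code page."""
    return data.decode(codec).encode(codec)


def find_lossy(samples, codec):
    """Return (original, result) pairs whose bytes change in a round trip.

    Sequences the codec cannot decode or re-encode are reported with a
    result of None instead of raising.
    """
    changed = []
    for data in samples:
        try:
            result = round_trip(data, codec)
        except UnicodeError:
            result = None
        if result != data:
            changed.append((data, result))
    return changed


def fmt(data: bytes) -> str:
    """Format bytes in the appendix's [FA/68] notation."""
    return "[" + "/".join(f"{b:02X}" for b in data) + "]"


if __name__ == "__main__":
    # Byte pairs listed in Table G.17 as sharing one Unicode code point.
    cp932_samples = [b"\xfa\x68", b"\xed\x4c", b"\xfa\x55", b"\xee\xfa"]
    for original, result in find_lossy(cp932_samples, "cp932"):
        shown = fmt(result) if result is not None else "not convertible"
        print(f"{fmt(original)} -> {shown}")
```

Running the same samples through the conversion routines of the application under test, and comparing against this baseline, shows whether stored data can change silently during conversion.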

Miscellaneous DBCS Problem Characters

Table G.19 contains a variety of other characters that may cause problems in your application. They do not necessarily fall into any one classification of problem, but they have historically been known to cause misbehavior.

Table G.19 Miscellaneous DBCS Problem Characters

Character Unicode point Comment
[U+90C2] 936 code page CHS character.
[U+33D5] 936 and 950 code pages.
[U+2574] 950 code page.
[U+FF5E] 932, 936, 949, and 950 code pages. Fullwidth tilde; the mapping between this code point and the code page tables can differ depending on the platform.
＿ [U+FF3F] 932, 936, 949, and 950 code pages.
[U+FF03] 932, 936, 949, and 950 code pages.
[U+FF06] 932, 936, 949, and 950 code pages.
[U+2593] 936 and 950 code pages.
[U+AC00] 949 code page.
耀 [U+8000] The E5 trailing byte of this Korean character can cause problems.
[U+80AD] 932, 936, and 950 code pages.

Multibyte Character Combinations

The problem characters that have been discussed in this section all come from the multibyte character sets; however, thus far I have discussed only individual code points. Table G.20 contains strings of multibyte characters to use both in verification and in testing the ability of your application to handle truly problematic characters.

Table G.20 Multibyte Character Combinations

Character Unicode points Comment
ｦｩｫｯ [U+FF66][U+FF69][U+FF6B][U+FF6F] String of four single-byte DBCS characters
ｦｩ　ｫｯ [U+FF66][U+FF69][U+3000][U+FF6B][U+FF6F] String of single-byte DBCS characters with a DBCS space in the middle
ｦｩｫｯｨ [U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68] String of five single-byte DBCS characters
ｦｩｫｯｨｪ [U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68][U+FF6A] String of six single-byte DBCS characters
黑鸙鶴滬 [U+9ED1][U+9E19][U+FA2D][U+6EEC] String of four double-byte DBCS characters
黑鸙鶴滬滸 [U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8] String of five double-byte DBCS characters
黑鸙鶴滬滸滾 [U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8][U+6EFE] String of six double-byte DBCS characters
ｦｩｫｯ黑鸙ｦｩｫｯ [U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F] String of DBCS characters starting and ending with single-byte characters, with double-byte characters in the middle
黑鸙ｦｩｫｯ黑鸙 [U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1][U+9E19] String of DBCS characters starting and ending with double-byte characters, with single-byte characters in the middle
￥\\￥ [U+FFE5][U+005C][U+005C][U+FFE5] Yen signs around two backslashes
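Glyphs like these are easily altered by copy and paste or by tools that normalize text (half-width forms, for instance, can be folded into their full-width equivalents), so it is safer to build test strings from the listed code points than from the rendered characters. The sketch below is a minimal Python illustration; the helper names from_notation and to_notation are hypothetical, and the only assumption is the [U+XXXX] notation used throughout this appendix.

```python
import re

# Matches one code point written in the appendix's [U+XXXX] notation.
CODE_POINT = re.compile(r"\[U\+([0-9A-Fa-f]{4,6})\]")


def from_notation(notation: str) -> str:
    """Turn a run such as '[U+FF66][U+3000]' into the characters it names."""
    return "".join(chr(int(h, 16)) for h in CODE_POINT.findall(notation))


def to_notation(text: str) -> str:
    """Inverse of from_notation: spell out each character's code point."""
    return "".join(f"[U+{ord(ch):04X}]" for ch in text)


if __name__ == "__main__":
    rows = [
        "[U+FF66][U+FF69][U+3000][U+FF6B][U+FF6F]",
        "[U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1][U+9E19]",
        "[U+FFE5][U+005C][U+005C][U+FFE5]",
    ]
    for row in rows:
        text = from_notation(row)
        # Confirm the string was built from exactly the listed code points.
        print(len(text), "characters; round trip ok:", to_notation(text) == row)
```

The resulting strings can be fed to the application directly, and to_notation can be applied to whatever the application returns to see exactly which code points survived.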

Unicode-Only Characters

Table G.21 contains characters that are not found in any code page, but rather exist only in Unicode. These characters are useful in identifying problems in an application that should be handling Unicode input, uncovering any potential code page dependencies it has.

Table G.21 Unicode-Only Characters

Character Unicode code point Comment
[U+2002] En space
[U+2003] Em space
[U+200E] Left-to-right mark
[U+200F] Right-to-left mark
[U+2011] Non-breaking hyphen
[U+201F] Double high-reversed-9 quotation mark
[U+202A] Left-to-right embedding
[U+202B] Right-to-left embedding
[U+FFFD] Replacement character
 [U+FEFF] Byte order mark (BOM)
[U+2028] Line separator (LSEP)
सुस्वागतम [U+0938][U+0941][U+0938][U+094D][U+0935][U+093E][U+0917][U+0924][U+092E] Devanagari characters; can be a problem and are unsupported in some areas
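Before relying on a character as a Unicode-only probe, it can help to confirm that the code pages your application actually uses cannot represent it. The following Python sketch is an illustration under stated assumptions: the function name encodable_in is hypothetical, and the check uses Python's bundled codecs (cp1252, cp932, cp936, cp949, cp950), whose tables may not match those of the system under test.

```python
def encodable_in(ch: str, codecs):
    """Return the subset of the named codecs that can encode ch."""
    supported = []
    for name in codecs:
        try:
            ch.encode(name)
        except UnicodeEncodeError:
            continue  # this code page has no slot for the character
        supported.append(name)
    return supported


if __name__ == "__main__":
    code_pages = ["cp1252", "cp932", "cp936", "cp949", "cp950"]
    for cp in (0x2002, 0x2003, 0x200E, 0x200F, 0x2028, 0xFFFD):
        found = encodable_in(chr(cp), code_pages)
        print(f"[U+{cp:04X}]", ", ".join(found) if found else "none")
```

A character that reports "none" for every code page in use is a good candidate for exposing a hidden conversion step, because any such step has to drop or replace it.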

UTF-8 Potential Problems

In UTF-8 encoding, the characters covered here fall into three ranges because they are encoded with 1, 2, or 3 bytes (characters above U+FFFF take 4 bytes). Testing the boundaries between these ranges is very important. Another good test case is to take a long string of 3-byte encoded Unicode characters and try to overrun buffers with it. This will turn up a number of missed buffer overflows because the code may be expecting at most 2 bytes per character (an assumption carried over from double-byte characters) rather than 3. (See Table G.22.)

Table G.22 UTF-8 Potential Problems

Character Unicode code point Comment
[space] [U+0020] First printable character that requires only 1-byte encoding (Basic Latin-space)
~ [U+007E] Last printable character that requires only 1-byte encoding (Basic Latin); the 1-byte range ends at U+007F
[U+0081] Near the start of the 2-byte encoding range, which begins at U+0080 (Latin-1 Supplement)
ۭ [U+06ED] Near the end of the 2-byte encoding range, which ends at U+07FF (Arabic)
[U+0901] Near the start of the 3-byte encoding range, which begins at U+0800 (Devanagari)
[U+6EEC] Character in the middle of the 3-byte encoding range (CJK Unified)
[U+FFEE] Near the end of the 3-byte encoding range, which ends at U+FFFF (Halfwidth and Fullwidth Forms)
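The encoded length at each boundary, and the size of a buffer-overrun payload, can be checked with a few lines of Python. This is a minimal sketch rather than a prescribed procedure; the names utf8_len and fits are hypothetical, and the 200-byte buffer is an assumed example of code that budgets 2 bytes per character.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def fits(text: str, buffer_size: int) -> bool:
    """True if the UTF-8 form of text fits in buffer_size bytes."""
    return len(text.encode("utf-8")) <= buffer_size


if __name__ == "__main__":
    # Table G.22 code points plus the exact range edges on either side.
    for cp in (0x0020, 0x007E, 0x007F, 0x0080, 0x0081, 0x06ED, 0x07FF,
               0x0800, 0x0901, 0x6EEC, 0xFFEE, 0xFFFF, 0x10000):
        print(f"[U+{cp:04X}] {utf8_len(chr(cp))} byte(s)")

    # 100 characters of a 3-byte code point against a buffer sized on the
    # assumption of 2 bytes per character.
    payload = "\u6eec" * 100
    print(len(payload), "characters,", len(payload.encode("utf-8")), "bytes")
    print("fits in 200 bytes:", fits(payload, 200))
```

Sizing the payload to sit just over the assumed limit, rather than far beyond it, tends to find off-by-a-factor length checks that a much larger string would be rejected by earlier.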

 

 