The Web Testing Companion: The Insider's Guide to Efficient and Effective
Tests
Lydia Ash
Appendix G - Problem Characters and Sample Test Input
This appendix contains sample input that has a high likelihood of causing misbehavior in
many different types of applications. The exact usage varies depending on the
application: some will be sensitive to these cases in a URL, others through a text input
field, and others will be very tolerant of the data and behave correctly. Many applications
will have their own sets of problematic input, which may include these cases as well as some
unique ones.
To make it easier for you to use the inputs in your own testing, this
appendix is available for download as AppendixG.doc.
Characters from the Single-Byte Character Sets
Control Characters
The control characters in Table G.1 are often left off of code page charts because these first 32 code points are common to all code pages and are nonprintable.
| Unicode Point |
Abbreviation |
Keystroke |
Name
|
Comments
|
| [U+0000] |
NUL |
Ctrl+@ |
NULL
|
Test this everywhere data can be input or stored. Many systems crash or fail when they encounter NUL because they do not expect it; code needs to handle it gracefully.
|
| [U+0001] |
SOH |
Ctrl+A |
START OF HEADING
|
|
| [U+0002] |
STX |
Ctrl+B |
START OF TEXT
|
|
| [U+0003] |
ETX |
Ctrl+C |
END OF TEXT
|
|
| [U+0004] |
EOT |
Ctrl+D |
END OF TRANSMISSION
|
|
| [U+0005] |
ENQ |
Ctrl+E |
ENQUIRY
|
|
| [U+0006] |
ACK |
Ctrl+F |
ACKNOWLEDGE
|
|
| [U+0007] |
BEL |
Ctrl+G |
BELL
|
(Beep) Caused teletype machines to ring a bell; will cause many common terminal and terminal-emulation programs to beep.
|
| [U+0008] |
BS |
Ctrl+H |
BACKSPACE
|
|
| [U+0009] |
HT |
Ctrl+I |
HORIZONTAL TAB
|
|
| [U+000A] |
LF |
Ctrl+J |
LINE FEED
|
|
| [U+000B] |
VT |
Ctrl+K |
VERTICAL TAB
|
|
| [U+000C] |
FF |
Ctrl+L |
FORM FEED
|
|
| [U+000D] |
CR |
Ctrl+M |
CARRIAGE RETURN
|
|
| [U+000E] |
SO |
Ctrl+N |
SHIFT OUT
|
Switches output device to alternate character set.
|
| [U+000F] |
SI |
Ctrl+O |
SHIFT IN
|
Switches output device to default character set.
|
| [U+0010] |
DLE |
Ctrl+P |
DATA LINK ESCAPE
|
|
| [U+0011] |
DC1 |
Ctrl+Q |
DEVICE CONTROL 1
|
Also the XON command for a modem soft handshake.
|
| [U+0012] |
DC2 |
Ctrl+R |
DEVICE CONTROL 2
|
|
| [U+0013] |
DC3 |
Ctrl+S |
DEVICE CONTROL 3
|
Also the XOFF command for the modem soft handshake.
|
| [U+0014] |
DC4 |
Ctrl+T |
DEVICE CONTROL 4
|
|
| [U+0015] |
NAK |
Ctrl+U |
NEGATIVE ACKNOWLEDGE
|
|
| [U+0016] |
SYN |
Ctrl+V |
SYNCHRONOUS IDLE
|
|
| [U+0017] |
ETB |
Ctrl+W |
END OF TRANSMISSION BLOCK
|
|
| [U+0018] |
CAN |
Ctrl+X |
CANCEL
|
|
| [U+0019] |
EM |
Ctrl+Y |
END OF MEDIUM
|
|
| [U+001A] |
SUB |
Ctrl+Z |
SUBSTITUTE
|
|
| [U+001B] |
ESC |
Ctrl+[ |
ESCAPE
|
|
| [U+001C] |
FS |
Ctrl+\ |
FILE SEPARATOR
|
|
| [U+001D] |
GS |
Ctrl+] |
GROUP SEPARATOR
|
|
| [U+001E] |
RS |
Ctrl+^ |
RECORD SEPARATOR
|
|
| [U+001F] |
US |
Ctrl+_ |
UNIT SEPARATOR
|
|
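As a rough sketch of how the rows of Table G.1 could be turned into test input, the following Python generates each of the 32 control characters placed first, in the middle, and last in a sample string. The function name `control_character_inputs` and the sample text are illustrative assumptions, not part of the original material.

```python
def control_character_inputs(sample="test"):
    """Yield (code_point, test_string) pairs for U+0000 through U+001F."""
    middle = len(sample) // 2
    for code_point in range(0x20):
        ch = chr(code_point)
        # Place the control character first, in the middle, and last.
        yield code_point, ch + sample
        yield code_point, sample[:middle] + ch + sample[middle:]
        yield code_point, sample + ch


cases = list(control_character_inputs())
for code_point, text in cases[:3]:
    # repr() keeps the nonprintable character visible in the log.
    print(f"U+{code_point:04X}", repr(text))
```

Each generated string can then be fed to whatever input path is under test (a form field, a URL parameter, a stored value).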
IBM PC Keyboard Scan Codes
For special key combinations (for example, Alt+S, F5, and so on), a special two-character escape sequence is used. Depending on the language, the escape character can be either Escape [U+001B] or NUL [U+0000]. I will assume that NUL is being used in Table G.2. Having these codes can be very useful for automation or other places where you need to send particular keys.
| Key Combination |
Escape Sequence |
| Alt+A |
[U+0000][U+001E] |
| Alt+B |
[U+0000][U+0030] |
| Alt+C |
[U+0000][U+002E] |
| Alt+D |
[U+0000][U+0020] |
| Alt+E |
[U+0000][U+0012] |
| Alt+F |
[U+0000][U+0021] |
| Alt+G |
[U+0000][U+0022] |
| Alt+H |
[U+0000][U+0023] |
| Alt+I |
[U+0000][U+0017] |
| Alt+J |
[U+0000][U+0024] |
| Alt+K |
[U+0000][U+0025] |
| Alt+L |
[U+0000][U+0026] |
| Alt+M |
[U+0000][U+0032] |
| Alt+N |
[U+0000][U+0031] |
| Alt+O |
[U+0000][U+0018] |
| Alt+P |
[U+0000][U+0019] |
| Alt+Q |
[U+0000][U+0010] |
| Alt+R |
[U+0000][U+0013] |
| Alt+S |
[U+0000][U+001F] |
| Alt+T |
[U+0000][U+0014] |
| Alt+U |
[U+0000][U+0016] |
| Alt+V |
[U+0000][U+002F] |
| Alt+W |
[U+0000][U+0011] |
| Alt+X |
[U+0000][U+002D] |
| Alt+Y |
[U+0000][U+0015] |
| Alt+Z |
[U+0000][U+002C] |
| PGUP |
[U+0000][U+0049] |
| PGDN |
[U+0000][U+0051] |
| HOME |
[U+0000][U+0047] |
| END |
[U+0000][U+004F] |
| UPARRW |
[U+0000][U+0048] |
| DNARRW |
[U+0000][U+0050] |
| LFTARRW |
[U+0000][U+004B] |
| RTARRW |
[U+0000][U+004D] |
| F1 |
[U+0000][U+003B] |
| F2 |
[U+0000][U+003C] |
| F3 |
[U+0000][U+003D] |
| F4 |
[U+0000][U+003E] |
| F5 |
[U+0000][U+003F] |
| F6 |
[U+0000][U+0040] |
| F7 |
[U+0000][U+0041] |
| F8 |
[U+0000][U+0042] |
| F9 |
[U+0000][U+0043] |
| F10 |
[U+0000][U+0044] |
| F11 |
[U+0000][U+0085] |
| F12 |
[U+0000][U+0086] |
| Alt+F1 |
[U+0000][U+0068] |
| Alt+F2 |
[U+0000][U+0069] |
| Alt+F3 |
[U+0000][U+006A] |
| Alt+F4 |
[U+0000][U+006B] |
| Alt+F5 |
[U+0000][U+006C] |
| Alt+F6 |
[U+0000][U+006D] |
| Alt+F7 |
[U+0000][U+006E] |
| Alt+F8 |
[U+0000][U+006F] |
| Alt+F9 |
[U+0000][U+0070] |
| Alt+F10 |
[U+0000][U+0071] |
| Alt+F11 |
[U+0000][U+008B] |
| Alt+F12 |
[U+0000][U+008C] |
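For automation, the two-character sequences in Table G.2 can be built programmatically. The sketch below copies a few rows from the table into a dictionary; the names `SECOND_CODE` and `escape_sequence` are hypothetical, and the default prefix follows the table's assumption that NUL is the escape character.

```python
NUL = "\x00"  # [U+0000], the prefix assumed in Table G.2
ESC = "\x1b"  # [U+001B], the alternative prefix mentioned in the text

# A few rows copied from Table G.2: key name -> second character of the sequence.
SECOND_CODE = {
    "Alt+A": 0x1E,
    "F5": 0x3F,
    "HOME": 0x47,
    "Alt+F4": 0x6B,
}


def escape_sequence(key, prefix=NUL):
    """Return the two-character sequence for a key listed in SECOND_CODE."""
    return prefix + chr(SECOND_CODE[key])


print(repr(escape_sequence("F5")))
print(repr(escape_sequence("Alt+A", prefix=ESC)))
```

How the sequence is actually delivered to the application depends on the automation tool in use, which this sketch does not cover.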
Character Combinations
Testing each of the control characters listed earlier in this appendix on its own
is one type of test case. However, characters that are handled correctly
individually can mean something special when used in certain combinations.
Below is one key combination to test that uses the control characters.
[U+000D][U+000A]: CRLF or (CR)(LF), a carriage return followed by a line feed.
It means multiple things, such as the end of a packet segment. Two of these
in a row also need to be tested as input or within a stream of input, because
many protocols treat two in a row as the end of a transmission.
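The CRLF cases described above can be sketched as a small generator of test strings. The helper name `crlf_cases` and the sample text are assumptions made for illustration.

```python
CRLF = "\r\n"  # [U+000D][U+000A]


def crlf_cases(sample="field value"):
    """Return test strings that embed one and two CRLF pairs in the sample."""
    middle = len(sample) // 2
    head, tail = sample[:middle], sample[middle:]
    cases = []
    for terminator in (CRLF, CRLF * 2):
        cases.append(terminator + sample)       # as the first characters
        cases.append(head + terminator + tail)  # within a stream of input
        cases.append(sample + terminator)       # as the last characters
    cases.append(CRLF)      # a single CRLF as the only input
    cases.append(CRLF * 2)  # two in a row as the only input
    return cases


for case in crlf_cases():
    print(repr(case))
```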
Lower ASCII
Table G.3 provides some information about each potentially problematic lower
ASCII character. Depending on the usage and context, these characters
can mean very different things. The notations are just suggestions about
how a character could be a sensitive or unwise character.
| Character |
Code page point |
Unicode point |
Name |
Comment
|
| |
0x20 |
[U+0020] |
Space |
Also a C reserved char; very useful for turning up problems if first, last, or only char entered; problematic in a URL
|
| ! |
0x21 |
[U+0021] |
Exclamation mark |
Problematic in a URL
|
| " |
0x22 |
[U+0022] |
Double quotes |
A C reserved char and delimiter; problematic in a URL
|
| # |
0x23 |
[U+0023] |
Number sign |
May be a delimiter; problematic in a URL
|
| $ |
0x24 |
[U+0024] |
Dollar sign |
A reserved character in a query component
|
| % |
0x25 |
[U+0025] |
Percent |
A C reserved char or a delimiter
|
| & |
0x26 |
[U+0026] |
Ampersand |
Character in a query component; problematic in a URL
|
| ' |
0x27 |
[U+0027] |
Apostrophe |
A C reserved char and unwise to leave unescaped; problematic in a URL
|
| ( |
0x28 |
[U+0028] |
Left parenthesis |
Problematic in a URL
|
| ) |
0x29 |
[U+0029] |
Right parenthesis |
Problematic in a URL
|
| * |
0x2A |
[U+002A] |
Asterisk |
|
| + |
0x2B |
[U+002B] |
Plus sign |
Character in a query component; problematic in a URL
|
| , |
0x2C |
[U+002C] |
Comma |
Character in a query component; problematic in a URL
|
| - |
0x2D |
[U+002D] |
Hyphen-minus |
|
| . |
0x2E |
[U+002E] |
Full stop (period) |
Especially as last char of a file name
|
| / |
0x2F |
[U+002F] |
Solidus (slash) |
Especially as last char of a file name; also a C reserved char or reserved in a query component; problematic in a URL
|
| : |
0x3A |
[U+003A] |
Colon |
A reserved character in a query component; problematic in a URL
|
| ; |
0x3B |
[U+003B] |
Semicolon |
A valid char in a URL, but it can be problematic, so you may want to escape it anyway; reserved within a query component, where it can be a parameter delimiter.
|
| < |
0x3C |
[U+003C] |
Less-than sign |
Can be a delimiter or part of HTML or script; problematic in a URL
|
| = |
0x3D |
[U+003D] |
Equals sign |
Reserved character in a query component; problematic in a URL
|
| > |
0x3E |
[U+003E] |
Greater-than sign |
Can be a delimiter or part of HTML or script; problematic in a URL
|
| ? |
0x3F |
[U+003F] |
Question mark |
Reserved character in a query component; problematic in a URL
|
| @ |
0x40 |
[U+0040] |
Commercial At (at sign) |
Reserved character in a query component; problematic in a URL unless part of the authentication
|
| [ |
0x5B |
[U+005B] |
Left square bracket |
An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
|
| \ |
0x5C |
[U+005C] |
Reverse solidus (backslash) |
Especially as last char of a file name; an unwise character to leave unescaped; problematic in a URL
|
| ] |
0x5D |
[U+005D] |
Right square bracket |
An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
|
| ^ |
0x5E |
[U+005E] |
Circumflex accent |
An unwise character to leave unescaped; problematic in a URL
|
| _ |
0x5F |
[U+005F] |
Low line |
An unwise character to leave unescaped; problematic in a URL
|
| ` |
0x60 |
[U+0060] |
Grave accent |
An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
|
| { |
0x7B |
[U+007B] |
Left curly brace |
An unwise character to leave unescaped; problematic in a URL
|
| | |
0x7C |
[U+007C] |
Vertical line (pipe) |
An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
|
| } |
0x7D |
[U+007D] |
Right curly brace |
|
| ~ |
0x7E |
[U+007E] |
Tilde |
|
| |
0x7F |
[U+007F] |
Delete |
|
| « |
0xAB |
[U+00AB] |
Left-pointing double angle |
|
| _ |
0x1C |
[U+001C] |
File Separator |
|
Extended Range Problem Characters
Table G.4 contains potentially problematic extended range characters from the
single-byte code pages.
Table G.4 Extended Range Problem Characters
| Character |
Unicode point |
Name |
Comment |
| ö |
[U+00F6] |
Latin Small Letter O with Diaeresis |
Can be a problem in filenames on DBCS systems. |
| § |
[U+00A7] |
Section Sign |
|
| ß |
[U+00DF] |
Latin Small Letter Sharp S |
|
| å |
[U+00E5] |
Latin Small Letter A with Ring Above |
DOS delete marker. Mostly significant if first char
in a string; essentially this is a Ctrl+z. |
| € |
[U+20AC] |
Euro Currency Symbol |
|
| ª |
[U+00AA] |
Feminine Ordinal Indicator |
This can sometimes be interpreted by Novell's NetWare
as a disconnect signal or other similar low-level command. If your
software will be used with NetWare, you will want to plan your tests
to include these. |
| ® |
[U+00AE] |
Registered Sign |
This can sometimes be interpreted by Novell's NetWare
as a disconnect signal or other similar low-level command. If your
software will be used with NetWare, you will want to plan your tests
to include these. |
| ¿ |
[U+00BF] |
Inverted Question Mark |
This can sometimes be interpreted by Novell's NetWare
as a disconnect signal or other similar low-level command. If your
software will be used with NetWare, you will want to plan your tests
to include these. |
| İ |
[U+0130] 0xDD on 1254 code page |
Latin Capital Letter I with Dot Above |
Only found in Turkish on the 1254 code page; a system that does not
handle it properly may convert it to a different letter I.
|
| ı |
[U+0131] 0xFD on 1254 code page |
Latin Small Dotless Letter I |
Only found in Turkish on the 1254 code page; a system that does not
handle it properly may convert it to a different letter I.
|
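The conversion problem with the Turkish letters can be observed with ordinary case mapping. The sketch below uses Python's built-in, locale-independent `str.upper()` and `str.lower()`; the helper names are hypothetical. Under these rules the dotless ı uppercases to a plain I, and İ lowercases to i followed by a combining dot above, so neither letter survives a case round trip.

```python
DOTTED_CAPITAL_I = "\u0130"  # Latin Capital Letter I with Dot Above
DOTLESS_SMALL_I = "\u0131"   # Latin Small Dotless Letter I


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]


def survives_case_round_trip(ch):
    """True if upper-then-lower and lower-then-upper both return ch's own case form."""
    if ch.isupper():
        return ch.lower().upper() == ch
    return ch.upper().lower() == ch


for ch in "Ii" + DOTTED_CAPITAL_I + DOTLESS_SMALL_I:
    print(code_points(ch), "upper:", code_points(ch.upper()),
          "lower:", code_points(ch.lower()),
          "round trip ok:", survives_case_round_trip(ch))
```

A system that is expected to be Turkish-aware needs locale-specific case rules, which this sketch deliberately does not apply.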
Problem Character Combinations
Table G.5 contains problem character combinations from the lower ASCII, the
extended range (or upper ASCII), and then combinations of the two.
Table G.5 Problem Character Combinations
| Characters |
Unicode points |
Names |
Comment |
| :: |
[U+003A][U+003A] |
Two colons |
|
| ~1: |
[U+007E][U+0031][U+003A] |
A tilde, a number (any number), and a colon |
|
| .. |
[U+002E][U+002E] |
Two periods |
This can present security problems by allowing access
to files otherwise not accessible. |
| $$ |
[U+0024][U+0024] |
Two dollar signs |
|
| :€� |
[U+003A][U+20AC][U+FFFD] |
Colon, Euro symbol, and [U+FFFD] |
Although FFFD is not a "real" character, this can present
problems. |
| ++ |
[U+002B][U+002B] |
Two pluses |
|
| %0 |
[U+0025][U+0030] |
Percent sign, number zero |
Can cause problems in Perl scripts. |
| \n |
[U+005C][U+006E] |
Backslash, letter n |
Escape sequence for new line in JavaScript. |
| \b |
[U+005C][U+0062] |
Backslash, letter b |
Escape sequence for backspace in JavaScript string literals. |
| %20 |
[U+0025][U+0032][U+0030] |
Percent sign, number two, number zero |
URL encoded sequence for a space. |
| 00:\ |
[U+0030][U+0030][U+003A][U+005C] |
Two number zeros, colon, backslash |
|
| & |
[U+0026] |
Ampersand |
|
| < |
[U+003C] |
Less-than sign |
|
| > |
[U+003E] |
Greater-than sign |
|
| = |
[U+003D] |
Equals sign |
|
| Ü¢£ |
[U+00DC][U+00A2][U+00A3] |
Letter U with diaeresis, cent sign, pound (currency)
sign - high literals |
|
| FFFFFFFF |
[U+0046][U+0046][U+0046][U+0046]
[U+0046][U+0046][U+0046][U+0046]
|
Eight letter F |
Input as a value, especially a regkey. |
| ::$DATA |
[U+003A][U+003A][U+0024][U+0044]
[U+0041][U+0054][U+0041]
|
Two colons, dollar sign, letters D, A, T, A |
Indicates data stream. |
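The two-periods row in Table G.5 warns about access to files that should not be reachable. As a minimal sketch of what a test oracle for that case might look like, the function below resolves a user-supplied path lexically against a base directory and reports whether the result stays inside it. The name `stays_within` and the example directory are assumptions; the check is purely lexical, uses POSIX-style paths, and does not account for symbolic links or for encoded forms of the periods.

```python
import posixpath


def stays_within(base_dir, user_path):
    """Return True if user_path, resolved against base_dir, stays inside it."""
    base = posixpath.normpath(base_dir)
    # An absolute user_path replaces the base entirely in posixpath.join.
    resolved = posixpath.normpath(posixpath.join(base, user_path))
    return resolved == base or resolved.startswith(base + "/")


for candidate in ("report.txt", "a/../b.txt", "../etc/passwd", "/../"):
    print(candidate, stays_within("/srv/files", candidate))
```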
Lower ASCII Character Combination Verification Cases
Table G.6 contains test cases to try in order to verify that your application
properly handles various lower ASCII characters. Whereas the previous
set of character combinations were chosen because of their potential ability
to break an application, these are chosen for their ability to prove that
the application is properly handling valid lower ASCII input.
Table G.6 Character Combination Verification Cases
| Characters |
Unicode points |
Comment |
| aAzZ |
[U+0061][U+0041][U+007A][U+005A] |
Tests that basic alphabetic characters are accepted.
|
| 1234 |
[U+0031][U+0032][U+0033][U+0034] |
Tests that common numbers are accepted. |
| 12aZ |
[U+0031][U+0032][U+0061][U+005A] |
Tests that numbers and letters are accepted, starting
with numbers. |
| aZ12 |
[U+0061][U+005A][U+0031][U+0032] |
Tests that letters and numbers are accepted, ending
with numbers. |
| ~!;:?/* |
[U+007E][U+0021][U+003B][U+003A][U+003F]
[U+002F][U+002A]
|
Tests that common symbols are accepted. |
| /../ |
[U+002F][U+002E][U+002E][U+002F] |
Tests symbols, but in an arrangement that can be interpreted
as a file path. |
| /À®./ |
[U+002F][U+00C0][U+00AE][U+002E][U+002F] |
Used with the previous test, specifically to test parsers; if
the previous input is not an allowed sequence, then this should probably
not be an allowed sequence. |
| \\?\C:\foo.txt |
[U+005C][U+005C][U+003F][U+005C][U+0043]
[U+003A][U+005C][U+0066][U+006F][U+006F]
[U+002E][U+0074][U+0078][U+0074]
|
Tests the assumption that the local file location has
the second character of a colon; NT specific. |
| \\127.0.0.1\C$\ |
[U+005C][U+005C][U+0031][U+0032][U+0037][U+002E]
[U+0030][U+002E][U+0030][U+002E][U+0031][U+005C]
[U+0043][U+0024][U+005C]
|
Tests the assumption that the local file location has
the second character of a colon; refers to the UNC localhost. |
| &lt; |
[U+0026][U+006C][U+0074][U+003B] |
HTML sequence for the less-than sign. |
| &nbsp; |
[U+0026][U+006E][U+0062][U+0073][U+0070][U+003B] |
HTML sequence for a non-breaking space. |
| <br> |
[U+003C][U+0062][U+0072][U+003E] |
HTML tag for a break. |
| &#65; |
[U+0026][U+0023][U+0036][U+0035][U+003B] |
Decimal HTML sequence for the letter A. |
| &#x0041; |
[U+0026][U+0023][U+0078][U+0030][U+0030][U+0034]
[U+0031][U+003B]
|
Similar to previous example, but this is the hexadecimal
HTML sequence for the letter A. |
| 0xf |
[U+0030][U+0078][U+0066] |
May be assumed to be the hexadecimal reference to a
number, in this case it would be 15. |
| 0xa |
[U+0030][U+0078][U+0061] |
May be assumed that this is the hexadecimal reference
to another number, in this case it would be converted to 10. |
| %UFF3C |
[U+0025][U+0055][U+0046][U+0046][U+0033][U+0043] |
URL encoded DBCS backslash. |
| Iiİı |
[U+0049][U+0069][U+0130][U+0131] |
Tests the two Latin letter I's and the two extra Turkish
I's. |
| <script>alert('Hello')</script> |
[U+003C][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C]
[U+006C][U+006F][U+0027][U+0029][U+003C][U+002F]
[U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E]
|
Script will pop up a Hello alert box if it is executed; it should
not be executed. |
| '><script>alert('Hello')</script> |
[U+0027][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to the previous example, except this will attempt
to close a tag before the script. |
| "><script>alert('Hello')</script>
|
[U+0022][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to the previous example; this will attempt
to close a tag before the script. |
| <Script>alert('Hello')</Script> |
[U+003C][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0053][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Using mixed case in the script, testing for an exact
string match. |
| <sCript>alert('Hello')</sCript> |
[U+003C][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0073][U+0043][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to the previous example, using mixed case in
the script, testing for an exact string match. |
| <SCRIPT>alert('Hello')</SCRIPT> |
[U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049]
[U+0050][U+0054][U+003E]
|
Similar to the previous example, using all capitals
in the script, testing for an exact string match. |
| &#60;script&#62;alert('Hello')&#60;&#47;script&#62; |
[U+0026][U+0023][U+0036][U+0030][U+003B][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036]
[U+0032][U+003B][U+0061][U+006C][U+0065][U+0072][U+0074]
[U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0027][U+0029][U+0026][U+0023][U+0036][U+0030][U+003B]
[U+0026][U+0023][U+0034][U+0037][U+003B][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036]
[U+0032][U+003B]
|
Similar to the original script example, except this
string has the symbols in their decimal HTML reference. |
%22><script%20for=window
%20event=%22onload()%22>
document.write(%22Hello%22);
document.close();</script>
Hello%22);document.close();
</script>.write(%22Hello%22);
document.close();</script> |
[U+0025][U+0032][U+0032][U+003E][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0025][U+0032][U+0030]
[U+0066][U+006F][U+0072][U+003D][U+0077][U+0069][U+006E]
[U+0064][U+006F][U+0077][U+0020][U+0025][U+0032][U+0030]
[U+0065][U+0076][U+0065][U+006E][U+0074][U+003D][U+0025]
[U+0032][U+0032][U+006F][U+006E][U+006C][U+006F][U+0061]
[U+0064][U+0028][U+0029][U+0025][U+0032][U+0032][U+003E]
[U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E]
[U+0074][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065]
[U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C]
[U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B]
[U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E]
[U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065]
[U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0048][U+0065]
[U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029]
[U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065]
[U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073]
[U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+002E]
[U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025]
[U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F]
[U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E]
[U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029]
[U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to the previous example, except this has all
quotes and spaces URL escaped. |
<script>(unencode("<script>
alert('Hello')</script>"))</script> |
[U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+006F]
[U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C]
[U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065]
[U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F]
[U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E]
[U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute. |
blah<script>(unencode ("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075]
[U+006E][U+0065][U+006E][U+006F][U+0064][U+0065][U+0028]
[U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074]
[U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072]
[U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029]
[U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E]
|
Similar to above examples, except this attempts to
use the unencode function to get script to execute. |
blah'<script>(unencode("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+0027][U+003C][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028]
[U+0075][U+006E][U+0065][U+006E][U+006F][U+0064][U+0065]
[U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029]
[U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute and a single
quote. |
blah"<script>(unencode("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+0022][U+003C][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028]
[U+0075][U+006E][U+0065][U+006E][U+006F][U+0064][U+0065]
[U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029]
[U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute and a double
quote. |
<SCRIPT LANGUAGE="VBScript">
MsgBox "Hello!" </SCRIPT> |
[U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054]
[U+0020][U+004C][U+0041][U+004E][U+0047][U+0055][U+0041]
[U+0047][U+0045][U+003D][U+0022][U+0056][U+0042][U+0053]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+0022][U+003E]
[U+0020][U+004D][U+0073][U+0067][U+0042][U+006F][U+0078]
[U+0020][U+0022][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0021][U+0022][U+0020][U+003C][U+002F][U+0053][U+0043]
[U+0052][U+0049][U+0050][U+0054][U+003E]
|
VBScript version of the previous example; an alert box will pop
up if it is executed. |
| <a href="JavaScript:alert()">link</a>
|
[U+003C][U+0061][U+0020][U+0068][U+0072][U+0065][U+0066]
[U+003D][U+0022][U+004A][U+0061][U+0076][U+0061][U+0053]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003A][U+0061]
[U+006C][U+0065][U+0072][U+0074][U+0028][U+0029]
[U+0022][U+003E][U+006C][U+0069][U+006E][U+006B][U+003C]
[U+002F][U+0061][U+003E]
|
|
| ‹script›alert(‘Hello‘)‹⁄script› |
[U+2039][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+203A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+2018][U+0048][U+0065][U+006C][U+006C][U+006F][U+2018]
[U+0029][U+2039][U+2044][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+203A]
|
Symbols have been replaced with their high-bit counterparts.
|
HTML tags can include script where it may not be anticipated. Because
these tags, and others, can include script with their attributes, they
cannot be considered safe. The following lines contain some examples of
how script can appear in what appear to be safe HTML tags.
<img src="JavaScript:alert()">img src</img>
<bgsound src="JavaScript:alert()">bgsound src</bgsound>
<iframe src="JavaScript:alert()">iframe src</iframe>
<table background="JavaScript:alert()">table background</table>
<object data="JavaScript:alert()">object data</object>
<frameset onload="JavaScript:alert()">frameset onload</frameset>
<body onload="JavaScript:alert()">body onload</body>
<body background="JavaScript:alert()">body background</body><span ID="ActiveX ID"></span>
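The mixed-case and tag-closing rows of Table G.6 probe filters that rely on an exact string match. The sketch below illustrates the idea with a deliberately weak, hypothetical `naive_filter` and compares it with output encoding through Python's standard `html.escape`. Neither function stands in for any particular application's defenses, and `contains_markup` is only a rough oracle invented for this example.

```python
import html

# Payloads copied from Table G.6 and from the tag examples above.
PAYLOADS = [
    "<script>alert('Hello')</script>",
    "<Script>alert('Hello')</Script>",
    "<sCript>alert('Hello')</sCript>",
    "<SCRIPT>alert('Hello')</SCRIPT>",
    "'><script>alert('Hello')</script>",
    '<img src="JavaScript:alert()">img src</img>',
]


def naive_filter(text):
    """Hypothetical weak filter: removes only the exact lowercase tags."""
    return text.replace("<script>", "").replace("</script>", "")


def contains_markup(text):
    """Rough check: a raw angle bracket means markup could still be parsed."""
    return "<" in text or ">" in text


for payload in PAYLOADS:
    filtered = naive_filter(payload)
    escaped = html.escape(payload)
    print("filter leaves markup:", contains_markup(filtered),
          "| escape leaves markup:", contains_markup(escaped),
          "|", repr(payload))
```

Only the first payload is fully neutralized by the exact-match filter, which is the kind of gap these verification cases are meant to expose.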
Upper ASCII Character Combinations
In Table G.7 you will find upper ASCII (extended range) character combinations
for use in verifying that your application can handle various valid upper
ASCII input.
Table G.7 Upper ASCII Character Combinations
|
Characters
|
Unicode point
|
Comment
|
|---|
|
öÜß
|
[U+00F6][U+00DC][U+00DF]
|
High literals
| |
Ü¢£
|
[U+00DC][U+00A2][U+00A3]
|
High literals
| |
©®
|
[U+00A0][U+00A9][U+00AE]
|
Problem literals
| |
¿¾Õ
|
[U+00BF][U+00BE][U+00D5]
|
Regional literals
| |
&><"
|
[U+0026][U+003E][U+003C][U+0022]
|
Named entities
| |
©®¾¿Õ
|
[U+00A0][U+00A9][U+00AE][U+00BE][U+00BF][U+00D5]
|
Literals
| |
åE5å
|
[U+00E5][U+0045][U+0035][U+00E5]
|
Can be mistaken for the DOS delete mark
| |
€\$\
|
[U+20AC][U+005C][U+0024][U+005C]
|
| |
â€™
|
[U+00E2][U+20AC][U+2122]
|
|
Diacritics
Table G.8 contains combining marks that can cause large problems and have
no ANSI equivalent; they are typed in combination with another character
to alter it (for example, typed with c [U+0063] and the combining cedilla to create ç).
Table G.8 Diacritics
|
Unicode point
|
Name
|
|---|
|
[U+0333]
|
Combining double lowline
| |
[U+033F]
|
Combining double overline
| |
[U+0327]
|
Combining cedilla
|
High-Bit Characters
The characters listed in Table G.9 are different from their low-bit counterparts
and often end up converted to those counterparts when the software
cannot handle them. For instance, try taking script and substituting
the corresponding high-bit characters to see whether a filter allows them through
and another component then downgrades them, with the end result that the script is
executed. These characters can also be problematic on their own as input.
Table G.9 High-Bit Characters
|
Characters
|
Unicode point
|
Name
|
|---|
|
|
[U+00AD]
|
Soft hyphen (SHY)
| |
‘
|
[U+2018]
|
Single opening quote
| |
’
|
[U+2019]
|
Single closing quote
| |
“
|
[U+201C]
|
Double opening quote
| |
”
|
[U+201D]
|
Double closing quote
| |
´
|
[U+00B4]
|
Acute accent
| |
¸
|
[U+00B8]
|
Cedilla
| |
|
[U+00A0]
|
Non-Breaking Space (NBSP)
| |
©
|
[U+00A9]
|
Copyright
| |
®
|
[U+00AE]
|
Registered Mark
| |
™
|
[U+2122]
|
Trademark
| |
–
|
[U+2013]
|
En-dash
| |
—
|
[U+2014]
|
Em-dash
| |
…
|
[U+2026]
|
Ellipsis
| |
⁄
|
[U+2044]
|
Fraction Slash
| |
‹
|
[U+2039]
|
Single Left-Pointing Angle
| |
›
|
[U+203A]
|
Single Right-Pointing Angle
| |
′
|
[U+2032]
|
Prime
| |
″
|
[U+2033]
|
Double Prime
|
Characters from Multibyte Character Sets
The rest of the tables in this appendix deal with double-byte characters and single-byte characters from the multibyte code pages.
Boundary Cases
Table G.10 contains characters for testing the first and last characters of
the various multibyte code page ranges.
Table G.10 Boundary Cases for the Multibyte Code Pages
|
Characters
|
Unicode point
|
Comment
|
|---|
|
|
[U+3000]
[81/40] in 932, [A1/A1] in 949 and 936, [A1/40] in 950
|
Ideographic space - beginning of first DBCS range on 932 code page
| |
滬
|
[U+6EEC]
[9F/FC] in 932
|
End of first DBCS range on 932 code page
| |
。
|
[U+FF61]
[A1] in 932
|
Beginning of Kana (single byte range) on 932 code page
| |
゚
|
[U+FF9F]
[DF] in 932
|
End of Kana
| |
漾
|
[U+6F3E]
[E0/40] in 932
|
Beginning of Second DBCS range on 932 code page
| |
黑
|
[U+9ED1]
[FC/4B] in 932
|
End of Second DBCS on 932 code page
| |
|
[U+E4C6]
[A1/40] in 936 code page
|
Beginning of CHS 936 code page
| |
|
[U+E4C5]
[FE/FE] in 936 code page
|
End of CHS 936 code page
| |
|
[U+EEB8]
[81/40] in 950 code page
|
Beginning of CHT 950 code page
| |
|
[U+E310]
[FE/FE] in 950 code page
|
End of CHT 950 code page
| |
갂
|
[U+AC02]
[81/41] in 949 code page
|
Beginning of Korean 949 code page
| |
詰
|
[U+8A70]
[FD/FE] in 949 code page
|
End of Korean 949 code page
|
Testing Individual Bytes that Make up the Double-Byte Character
Since the double-byte characters consist of 2 bytes read in individually, either
one of the bytes could be mistaken for a special lower ASCII character.
Because of this, you need to look at the special meaning of the lower
ASCII characters and take the code point that they occupy to identify
double-byte characters that have that code point as either a leading byte
or a trailing byte (see Tables G.11 through G.16).
Table G.11 Lead Byte Is 81
|
Character
|
Unicode code point
|
Code point
|
|---|
|
ー
|
[U+30FC]
|
[81/5B] on 932 code page
| |
‐
|
[U+2010]
|
[81/5D] on 932 code page
| |
\
|
[U+FF3C]
|
[81/5F] on 932 code page
| |
+
|
[U+FF0B]
|
[81/7B] on 932 code page
| |
-
|
[U+FF0D]
|
[81/7C] on 932 code page
| |
±
|
[U+00B1]
|
[81/7D] on 932 code page
| |
×
|
[U+00D7]
|
[81/7E] on 932 code page
|
Table G.12 Trailing Byte is 5C (ANSI Backslash Character - Need to Use as First, Middle, and Last Character in a String)
|
Character
|
Unicode code point
|
Code point
|
|---|
|
―
|
[U+2015]
|
[81/5C] on 932 code page
| |
|
[U+E0F7]
|
[81/5C] on Windows 932 code page
| |
乗
|
[U+4E57]
|
[81/5C] on 936 code page
| |
|
[U+EED4]
|
[81/5C] on 950 code page
|
Table G.13 Lead Byte Is E5 - Special DOS Deletion Mark
|
Character
|
Unicode code point
|
Code point
|
|---|
|
蕁
|
[U+8541]
|
[E5/40] on 932 code page
| |
蛬
|
[U+86EC]
|
[E5/7E] on 932 code page
| |
夜
|
[U+591C]
|
[E5/A8] on 949 code page
| |
女
|
[U+F981]
|
[E5/FC] on 949 code page
|
Table G.14 Trail Byte Is AD - ANSI Soft Hyphen
|
Character
|
Unicode code point
|
Code point
|
|---|
|
伃
|
[U+4F03]
|
[81/AD] on 936 code page
| |
藄
|
[U+85C4]
|
[F0/AD] on 950 code page
|
The double-byte Romanji characters are Latin-looking characters that need to be used anywhere that Latin single-byte characters are expected.
Table G.15 Romanji Characters - Latin-Looking Characters from the 932 Page
| Character | Unicode point | Comment |
|---|---|---|
| ◯ | [U+25EF] | Boundary |
| ０ | [U+FF10] | Use the double-byte numbers where any number might be expected. |
| １ | [U+FF11] | Use the double-byte numbers where any number might be expected. |
| ＠ | [U+FF20] | Use the double-byte symbols where any symbol might be expected. |
| Ａ | [U+FF21] | Use the double-byte letters where any letter might be expected. |
| Ｚ | [U+FF3A] | Use the double-byte letters where any letter might be expected. |
| ａ | [U+FF41] | Use the double-byte letters where any letter might be expected. |
| ｚ | [U+FF5A] | Use the double-byte letters where any letter might be expected. |
| ぁ | [U+3041] | Boundary |
| ． | [U+FF0E] | Use the double-byte fullwidth period where any period might be expected. |
| ／ | [U+FF0F] | Use the double-byte fullwidth solidus where any forward slash might be expected. |
| ： | [U+FF1A] | Use the double-byte fullwidth colon where any colon might be expected. |
| ！ | [U+FF01] | Use the double-byte fullwidth exclamation mark where any exclamation mark might be expected. |
| ‘ | [U+2018] | Use the double-byte left single quotation mark where any quote might be expected. |
| ’ | [U+2019] | Use the double-byte right single quotation mark where any quote might be expected. |
| “ | [U+201C] | Use the double-byte left double quotation mark where any quote might be expected. |
| ” | [U+201D] | Use the double-byte right double quotation mark where any quote might be expected. |
| ＜ | [U+FF1C] | Use the double-byte fullwidth less-than sign where any less-than sign might be expected. |
| ＞ | [U+FF1E] | Use the double-byte fullwidth greater-than sign where any greater-than sign might be expected. |
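To generate fullwidth test input from an ordinary ASCII string, a small helper along these lines may be convenient. It relies on the fact that the fullwidth forms U+FF01 through U+FF5E mirror ASCII U+0021 through U+007E at a fixed offset, and it maps the space to the ideographic space U+3000. The function name is hypothetical, and the curly quotation marks in the table are not produced by this offset and would need to be added by hand.

```python
import unicodedata

# U+FF01..U+FF5E mirror ASCII U+0021..U+007E at this fixed offset.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text: str) -> str:
    """Replace printable ASCII with its fullwidth counterpart."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == " ":
            out.append("\u3000")  # ideographic (double-byte) space
        elif 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


sample = to_fullwidth("A1@z")
# Print code points rather than glyphs so the output survives any console encoding.
print([f"U+{ord(c):04X}" for c in sample])
# Compatibility normalization folds the fullwidth forms back to ASCII.
print(unicodedata.normalize("NFKC", sample) == "A1@z")
```

Feeding the result into a field that validates numbers, e-mail addresses, or URLs is a quick way to see whether the application treats the fullwidth forms as equivalent, rejects them, or mishandles them.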
Table G.16 shows characters that represent potential problems in NetWare.
Table G.16 NetWare Potential Problem Characters
| Character | Unicode code point | Code point |
|---|---|---|
| ｪ | [U+FF6A] | [AA] on 932 code page |
| ｮ | [U+FF6E] | [AE] on 932 code page |
| ｿ | [U+FF7F] | [BF] on 932 code page |
| 穐 | [U+7A50] | [88/AA] on 932 code page |
| 旭 | [U+65ED] | [88/AE] on 932 code page |
| 袷 | [U+88B7] | [88/BF] on 932 code page |
Potential Problem Character Conversions
When the same character is assigned to more than one code point in a code
page, converting from the code page to Unicode and then back to the code
page can cause problems, because the conversion back may not return the
original code point. Tables G.17 and G.18 contain some examples of these
types of problem characters.
Table G.17 JPN-932
| Character | Unicode code point | Code point |
|---|---|---|
| 丨 | [U+4E28] | [FA/68] which will equal [ED/4C] |
| ￤ | [U+FFE4] | [FA/55] which will equal [EE/FA] |
| 厓 | [U+5393] | [FA/8D] |
| 晙 | [U+6659] | [FA/D7] |
| 纊 | [U+7E8A] | [FA/5C] |
| 槢 | [U+69E2] | [FA/EC] |
Table G.18 CHT-950
| Character | Unicode code point | Code point |
|---|---|---|
| ═ | [U+2550] | [A2/A4] which will equal [F9/F9] |
| ╞ | [U+255E] | [A2/A5] which will equal [F9/E9] |
| ╪ | [U+256A] | [A2/A6] which will equal [F9/EA] |
| 十 | [U+5341] | [A2/CC] which will equal [A4/51] |
| ╡ | [U+2561] | [A2/A7] which will equal [F9/EB] |
| 卅 | [U+5345] | [A2/CE] which will equal [A4/CA] |
| ╭ | [U+256D] | [F9/FA] which will equal [A2/7E] |
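One way to probe for this kind of conversion problem is a round-trip check: decode the code page bytes to Unicode, encode the result again, and compare. The sketch below does this with Python's built-in codecs. The `round_trip` helper is hypothetical, and Python's mapping tables are an assumption here; they may not match the conversion tables on the platform you are testing, so treat the output as a starting point rather than a reference.

```python
from typing import Optional, Tuple


def round_trip(raw: bytes, codec: str) -> Tuple[Optional[str], Optional[bytes], bool]:
    """Decode raw bytes to Unicode, re-encode, and report whether the bytes survived."""
    try:
        text = raw.decode(codec)
        back = text.encode(codec)
    except UnicodeError:
        # The codec has no mapping for these bytes (or for the decoded text).
        return None, None, False
    return text, back, back == raw


# Byte pairs taken from Tables G.17 and G.18. Which member of each pair
# survives the round trip depends on the codec's mapping tables.
candidates = [
    ("cp932", bytes([0xFA, 0x68])),
    ("cp932", bytes([0xED, 0x4C])),
    ("cp950", bytes([0xA2, 0xA4])),
    ("cp950", bytes([0xF9, 0xF9])),
]

for codec, raw in candidates:
    text, back, stable = round_trip(raw, codec)
    points = [f"U+{ord(c):04X}" for c in text] if text else None
    print(codec, raw.hex(), points, back.hex() if back else None, stable)
```

Any pair whose two byte sequences decode to the same Unicode code point but where only one of them reports a stable round trip is a candidate for data that changes silently when it passes through a Unicode layer.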
Miscellaneous DBCS Problem Characters
Table G.19 contains a variety of other characters that may cause problems in
your application. They do not fall neatly into any of the preceding categories,
but they are historically known to cause misbehavior.
Table G.19 Miscellaneous DBCS Problem Characters
| Character | Unicode point | Comment |
|---|---|---|
| 郂 | [U+90C2] | 936 code page CHS character. |
| ㏕ | [U+33D5] | 936 and 950 code pages. |
| ╴ | [U+2574] | 950 code page. |
| ～ | [U+FF5E] | 932, 936, 949, and 950 code pages. Fullwidth tilde; its mapping between Unicode and the code page table can differ depending on the platform. |
| ＿ | [U+FF3F] | 932, 936, 949, and 950 code pages. |
| ＃ | [U+FF03] | 932, 936, 949, and 950 code pages. |
| ＆ | [U+FF06] | 932, 936, 949, and 950 code pages. |
| ▓ | [U+2593] | 936 and 950 code pages. |
| 가 | [U+AC00] | 949 code page. |
| 耀 | [U+8000] | The E5 trailing byte of this Korean character can cause problems. |
| 肭 | [U+80AD] | 932, 936, and 950 code pages. |
Multibyte Character Combinations
The problem characters that have been discussed in this section all come from
the multibyte character sets; however, thus far I have discussed only
individual code points. Table G.20 contains strings of multibyte characters
to use both in verification and in testing the ability of your application
to handle truly problematic characters.
Table G.20 Multibyte Character Combinations
| Character | Unicode points | Comment |
|---|---|---|
| ｦｩｫｯ | [U+FF66][U+FF69][U+FF6B][U+FF6F] | String of four single-byte DBCS characters |
| ｦｩ　ｫｯ | [U+FF66][U+FF69][U+3000][U+FF6B][U+FF6F] | String of single-byte DBCS characters with a DBCS space in the middle |
| ｦｩｫｯｨ | [U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68] | String of five single-byte DBCS characters |
| ｦｩｫｯｨｪ | [U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68][U+FF6A] | String of six single-byte DBCS characters |
| 黑鸙鶴滬 | [U+9ED1][U+9E19][U+FA2D][U+6EEC] | String of four double-byte DBCS characters |
| 黑鸙鶴滬滸 | [U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8] | String of five double-byte DBCS characters |
| 黑鸙鶴滬滸滾 | [U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8][U+6EFE] | String of six double-byte DBCS characters |
| ｦｩｫｯ黑鸙ｦｩｫｯ | [U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F] | String of DBCS characters starting and ending with single-byte characters, with double-byte characters in the middle |
| 黑鸙ｦｩｫｯ黑鸙 | [U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1][U+9E19] | String of DBCS characters starting and ending with double-byte characters, with single-byte characters in the middle |
| ￥\\￥ | [U+FFE5][U+005C][U+005C][U+FFE5] | Yen signs around two backslashes |
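When using these strings for verification, it helps to know how many bytes each one occupies, since the character count and the byte count diverge as soon as single-byte and double-byte characters are mixed. The sketch below builds a few of the strings from their code points and measures both counts using Python's `cp932` codec. The `measure` helper is hypothetical, and the codec is an assumption about the target code page.

```python
def measure(text: str, codec: str) -> tuple:
    """Return (character count, byte count) for text in the given code page."""
    return len(text), len(text.encode(codec))


# Strings built from the code points listed in Table G.20.
halfwidth = "\uff66\uff69\uff6b\uff6f"             # four single-byte characters
with_space = "\uff66\uff69\u3000\uff6b\uff6f"      # double-byte space in the middle
yen_backslash = "\uffe5\\\\\uffe5"                 # yen signs around two backslashes

for label, text in [
    ("halfwidth", halfwidth),
    ("with_space", with_space),
    ("yen_backslash", yen_backslash),
]:
    chars, size = measure(text, "cp932")
    print(label, chars, size)
```

Comparing these numbers against a field's declared maximum length shows quickly whether the limit is enforced in characters or in bytes.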
Unicode-Only Characters
Table G.21 contains characters that are not found in any code page, but rather
exist only in Unicode. These characters are useful in identifying problems
in an application that should be handling Unicode input, uncovering any
potential code page dependencies it has.
Table G.21 Unicode-Only Characters
| Character | Unicode code point | Comment |
|---|---|---|
| | [U+2002] | En space |
| | [U+2003] | Em space |
| | [U+200E] | Left-to-right mark |
| | [U+200F] | Right-to-left mark |
| ‑ | [U+2011] | Non-breaking hyphen |
| ‟ | [U+201F] | Double high reversed quotation marks |
| | [U+202A] | Left-to-right embedding |
| | [U+202B] | Right-to-left embedding |
| � | [U+FFFD] | Replacement character |
| | [U+FEFF] | Byte order mark (BOM) |
| | [U+2028] | Line Separator mark (LSEP) |
| सुस्वागतम | [U+0938][U+0941][U+0938][U+094D][U+0935][U+093E][U+0917][U+0924][U+092E] | Devanagari characters; can be a problem and unsupported in some areas |
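A quick way to confirm that a given character has no home in the code pages you care about is to try encoding it and record the failures. The sketch below does this for a small, assumed list of Windows code pages using Python's built-in codecs. Both the list and the helper name are hypothetical, and a different set of code pages may give different answers for some of the characters in the table.

```python
# Assumed set of code pages to check; adjust to match the application under test.
CODE_PAGES = ["cp1252", "cp932", "cp936", "cp949", "cp950"]


def unsupported_code_pages(ch: str, code_pages=CODE_PAGES) -> list:
    """Return the code pages that cannot represent the given character."""
    missing = []
    for code_page in code_pages:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            missing.append(code_page)
    return missing


# Line separator, a Devanagari letter, and plain "A" for comparison.
for code_point in (0x2028, 0x0938, 0x0041):
    print(f"U+{code_point:04X}", unsupported_code_pages(chr(code_point)))
```

A character that fails in every code page on the list is a good probe for hidden code page dependencies: if the application stores or displays it as a question mark or drops it, a conversion through a code page is happening somewhere.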
UTF-8 Potential Problems
In UTF-8 encoding, characters in the Basic Multilingual Plane fall into three ranges because they are encoded with 1, 2, or 3 bytes: U+0000 through U+007F, U+0080 through U+07FF, and U+0800 through U+FFFF. (Supplementary characters above U+FFFF take 4 bytes.) Testing the boundaries here is very important. Another good test case is to take a long string of the 3-byte encoded Unicode characters and try to overrun buffers with it. This often turns up missed buffer overflows, because the code may be expecting 2 bytes per character (an assumption carried over from the double-byte characters) rather than 3. (See Table G.22.)
Table G.22 UTF-8 Potential Problems
| Character | Unicode code point | Comment |
|---|---|---|
| [space] | [U+0020] | First printable character that requires only 1-byte encoding (Basic Latin space) |
| ~ | [U+007E] | Last printable character that requires only 1-byte encoding (Basic Latin); the 1-byte range ends at U+007F |
| | [U+0081] | Control character at the start of the 2-byte encoding range (Latin-1 Supplement); the range begins at U+0080 |
| ۭ | [U+06ED] | Character near the end of the 2-byte encoding range (Arabic); the range ends at U+07FF |
| ँ | [U+0901] | Character near the start of the 3-byte encoding range (Devanagari); the range begins at U+0800 |
| 滬 | [U+6EEC] | Character in the middle of the 3-byte encoding range (CJK Unified) |
| ￮ | [U+FFEE] | Character near the end of the 3-byte encoding range (Halfwidth forms); the range ends at U+FFFF |
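The encoded length at each boundary is easy to confirm before building test data. The sketch below prints the UTF-8 byte count for the code points in Table G.22 together with the exact range boundaries, then shows how a buffer sized at 2 bytes per character falls short for a string of 3-byte characters. The helper name and the string length of 100 are arbitrary choices for illustration.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# Code points from Table G.22 plus the exact range boundaries.
boundaries = [
    0x0020, 0x007E, 0x007F,   # 1-byte range
    0x0080, 0x0081, 0x06ED, 0x07FF,   # 2-byte range
    0x0800, 0x0901, 0x6EEC, 0xFFEE, 0xFFFF,   # 3-byte range
]
for cp in boundaries:
    print(f"U+{cp:04X}", utf8_len(cp))

# A buffer sized at 2 bytes per character is too small for 3-byte characters.
payload = "\u6eec" * 100
assumed_capacity = 2 * len(payload)
actual_size = len(payload.encode("utf-8"))
print(assumed_capacity, actual_size, actual_size > assumed_capacity)
```

Scaling the payload length up to just below, at, and just above a field's documented limit gives a set of overrun test cases for the 3-byte range.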