The Web Testing Companion: The Insider's Guide to Efficient and
Effective Tests
Lydia Ash
Appendix G - Problem Characters and Sample Test Input
This appendix contains sample input that has a high likelihood
of causing misbehavior in many different types of applications.
The exact usage varies depending on the application: some will be
sensitive to these cases in a URL, others through a text input field,
and others will be very tolerant of the data and behave correctly.
Many applications will have their own sets of problematic input,
which may include these cases as well as some unique ones.
In order to make it easier for you to use the inputs in your own
testing, this file is available for download here. AppendixG.doc
Characters from the Single-Byte Character Sets
Control Characters
The control characters in Table G.1 are often omitted from code
page charts because these first 32 code points are common to all
code pages and are nonprintable.
| Unicode Point |
Abbreviation |
Keystroke |
Name |
Comments |
| [U+0000] |
NUL |
Ctrl+@ |
NULL |
This needs to be tested in every place where data can be
input or stored; many systems will crash or fail when this is
encountered because they are not expecting this; code needs
to handle these situations gracefully. |
| [U+0001] |
SOH |
Ctrl+A |
START OF HEADING |
|
| [U+0002] |
STX |
Ctrl+B |
START OF TEXT |
|
| [U+0003] |
ETX |
Ctrl+C |
END OF TEXT |
|
| [U+0004] |
EOT |
Ctrl+D |
END OF TRANSMISSION |
|
| [U+0005] |
ENQ |
Ctrl+E |
ENQUIRY |
|
| [U+0006] |
ACK |
Ctrl+F |
ACKNOWLEDGE |
|
| [U+0007] |
BEL |
Ctrl+G |
BELL |
Beep. Caused teletype machines to ring a bell; causes many
common terminal and terminal-emulation programs to beep. |
| [U+0008] |
BS |
Ctrl+H |
BACKSPACE |
|
| [U+0009] |
HT |
Ctrl+I |
HORIZONTAL TAB |
|
| [U+000A] |
LF |
Ctrl+J |
LINE FEED |
|
| [U+000B] |
VT |
Ctrl+K |
VERTICAL TAB |
|
| [U+000C] |
FF |
Ctrl+L |
FORM FEED |
|
| [U+000D] |
CR |
Ctrl+M |
CARRIAGE RETURN |
|
| [U+000E] |
SO |
Ctrl+N |
SHIFT OUT |
Switches output device to alternate character set. |
| [U+000F] |
SI |
Ctrl+O |
SHIFT IN |
Switches output device to default character set. |
| [U+0010] |
DLE |
Ctrl+P |
DATA LINK ESCAPE |
|
| [U+0011] |
DC1 |
Ctrl+Q |
DEVICE CONTROL 1 |
Also the XON command for a modem soft handshake. |
| [U+0012] |
DC2 |
Ctrl+R |
DEVICE CONTROL 2 |
|
| [U+0013] |
DC3 |
Ctrl+S |
DEVICE CONTROL 3 |
Also the XOFF command for the modem soft handshake. |
| [U+0014] |
DC4 |
Ctrl+T |
DEVICE CONTROL 4 |
|
| [U+0015] |
NAK |
Ctrl+U |
NEGATIVE ACKNOWLEDGE |
|
| [U+0016] |
SYN |
Ctrl+V |
SYNCHRONOUS IDLE |
|
| [U+0017] |
ETB |
Ctrl+W |
END OF TRANSMISSION BLOCK |
|
| [U+0018] |
CAN |
Ctrl+X |
CANCEL |
|
| [U+0019] |
EM |
Ctrl+Y |
END OF MEDIUM |
|
| [U+001A] |
SUB |
Ctrl+Z |
SUBSTITUTE |
|
| [U+001B] |
ESC |
Ctrl+[ |
ESCAPE |
|
| [U+001C] |
FS |
Ctrl+\ |
FILE SEPARATOR |
|
| [U+001D] |
GS |
Ctrl+] |
GROUP SEPARATOR |
|
| [U+001E] |
RS |
Ctrl+^ |
RECORD SEPARATOR |
|
| [U+001F] |
US |
Ctrl+_ |
UNIT SEPARATOR |
|
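One way to exercise every control character in Table G.1 is to generate the characters from their code points rather than typing them. The following Python sketch is an illustration only, not part of the original appendix; the `placements` helper and the sample value `"test"` are assumptions chosen for the example.

```python
# Build the 32 control characters of Table G.1 from their code points.
CONTROL_CHARS = [chr(cp) for cp in range(0x00, 0x20)]


def placements(ch, sample="test"):
    """Return ch alone, first, in the middle of, and last in a sample string."""
    mid = len(sample) // 2
    return [ch, ch + sample, sample[:mid] + ch + sample[mid:], sample + ch]


# Map each code point label (for example "U+0000") to its four test inputs.
CASES = {f"U+{ord(ch):04X}": placements(ch) for ch in CONTROL_CHARS}
```

Each list in `CASES` can then be fed to whichever input path is under test (a form field, a URL parameter, a stored value).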
IBM PC Keyboard Scan Codes
For special key combinations (for example, Alt+S, F5, and so on),
a special two-character escape sequence is used. Depending on the
language, the escape character can be either Escape [U+001B] or
NUL [U+0000]. I will assume that NUL is being used in Table G.2.
Having these codes can be very useful for automation or other places
where you need to send particular keys.
| Key Combination |
Escape Sequence |
| Alt+A |
[U+0000][U+001E] |
| Alt+B |
[U+0000][U+0030] |
| Alt+C |
[U+0000][U+002E] |
| Alt+D |
[U+0000][U+0020] |
| Alt+E |
[U+0000][U+0012] |
| Alt+F |
[U+0000][U+0021] |
| Alt+G |
[U+0000][U+0022] |
| Alt+H |
[U+0000][U+0023] |
| Alt+I |
[U+0000][U+0017] |
| Alt+J |
[U+0000][U+0024] |
| Alt+K |
[U+0000][U+0025] |
| Alt+L |
[U+0000][U+0026] |
| Alt+M |
[U+0000][U+0032] |
| Alt+N |
[U+0000][U+0031] |
| Alt+O |
[U+0000][U+0018] |
| Alt+P |
[U+0000][U+0019] |
| Alt+Q |
[U+0000][U+0010] |
| Alt+R |
[U+0000][U+0013] |
| Alt+S |
[U+0000][U+001F] |
| Alt+T |
[U+0000][U+0014] |
| Alt+U |
[U+0000][U+0016] |
| Alt+V |
[U+0000][U+002F] |
| Alt+W |
[U+0000][U+0011] |
| Alt+X |
[U+0000][U+002D] |
| Alt+Y |
[U+0000][U+0015] |
| Alt+Z |
[U+0000][U+002C] |
| PGUP |
[U+0000][U+0049] |
| PGDN |
[U+0000][U+0051] |
| HOME |
[U+0000][U+0047] |
| END |
[U+0000][U+004F] |
| UPARRW |
[U+0000][U+0048] |
| DNARRW |
[U+0000][U+0050] |
| LFTARRW |
[U+0000][U+004B] |
| RTARRW |
[U+0000][U+004D] |
| F1 |
[U+0000][U+003B] |
| F2 |
[U+0000][U+003C] |
| F3 |
[U+0000][U+003D] |
| F4 |
[U+0000][U+003E] |
| F5 |
[U+0000][U+003F] |
| F6 |
[U+0000][U+0040] |
| F7 |
[U+0000][U+0041] |
| F8 |
[U+0000][U+0042] |
| F9 |
[U+0000][U+0043] |
| F10 |
[U+0000][U+0044] |
| F11 |
[U+0000][U+0085] |
| F12 |
[U+0000][U+0086] |
| Alt+F1 |
[U+0000][U+0068] |
| Alt+F2 |
[U+0000][U+0069] |
| Alt+F3 |
[U+0000][U+006A] |
| Alt+F4 |
[U+0000][U+006B] |
| Alt+F5 |
[U+0000][U+006C] |
| Alt+F6 |
[U+0000][U+006D] |
| Alt+F7 |
[U+0000][U+006E] |
| Alt+F8 |
[U+0000][U+006F] |
| Alt+F9 |
[U+0000][U+0070] |
| Alt+F10 |
[U+0000][U+0071] |
| Alt+F11 |
[U+0000][U+008B] |
| Alt+F12 |
[U+0000][U+008C] |
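For automation, the two-character sequences in Table G.2 can be assembled from a lead character and the second code point. The sketch below assumes a small hand-copied subset of the table; `SCAN_CODES` and `key_sequence` are hypothetical names, and how the sequence is delivered to the application depends on your automation tool.

```python
# Second character of the sequence for a few keys, copied from Table G.2.
SCAN_CODES = {"Alt+A": 0x1E, "F5": 0x3F, "HOME": 0x47, "Alt+F4": 0x6B}

NUL = "\x00"  # lead character assumed in Table G.2
ESC = "\x1b"  # alternative lead character mentioned in the text


def key_sequence(key, lead=NUL):
    """Return the two-character sequence for a key listed in SCAN_CODES."""
    return lead + chr(SCAN_CODES[key])
```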
Character Combinations
Using each of the control characters listed earlier in this appendix
on its own is one type of test case. However, characters that are
handled correctly individually can still mean something special when
used in certain combinations. Below is one key combination of control
characters to test.
[U+000D][U+000A], CRLF or (CR)(LF), a carriage return followed by a
line feed, can mean several things, such as the end of a packet segment.
Two of these in a row also need to be tested, both as input and within
a stream of input, because many protocols treat two in a row as the
end of a transmission.
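The CRLF cases described above can be generated mechanically. This is a minimal sketch, assuming a hypothetical helper `crlf_cases` and an arbitrary sample value; it places a single and a doubled CRLF alone, first, in the middle, and last.

```python
CRLF = "\r\n"  # [U+000D][U+000A]


def crlf_cases(sample="field"):
    """Return inputs with one and two CRLFs alone, first, mid-string, and last."""
    mid = len(sample) // 2
    cases = []
    for sep in (CRLF, CRLF * 2):
        cases += [sep, sep + sample, sample[:mid] + sep + sample[mid:], sample + sep]
    return cases
```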
Lower ASCII
Table G.3 provides some information about each potentially problematic
lower ASCII character. Depending on the usage and context, these
characters can mean very different things. The notations are just
suggestions about how a character could be a sensitive or unwise
character.
| Character |
Code page point |
Unicode point |
Name |
Comment |
| |
0x20 |
[U+0020] |
Space |
Also a C reserved char; very useful for turning up problems
if first, last, or only char entered; problematic in a URL |
| ! |
0x21 |
[U+0021] |
Exclamation mark |
Problematic in a URL |
| " |
0x22 |
[U+0022] |
Double quotes |
A C reserved char and delimiter; problematic in a URL |
| # |
0x23 |
[U+0023] |
Number sign |
May be a delimiter; problematic in a URL |
| $ |
0x24 |
[U+0024] |
Dollar sign |
A reserved character in a query component |
| % |
0x25 |
[U+0025] |
Percent |
A C reserved char or a delimiter |
| & |
0x26 |
[U+0026] |
Ampersand |
Character in a query component; problematic in a URL |
| ' |
0x27 |
[U+0027] |
Apostrophe |
A C reserved char and unwise to leave unescaped; problematic
in a URL |
| ( |
0x28 |
[U+0028] |
Left parenthesis |
Problematic in a URL |
| ) |
0x29 |
[U+0029] |
Right parenthesis |
Problematic in a URL |
| * |
0x2A |
[U+002A] |
Asterisk |
|
| + |
0x2B |
[U+002B] |
Plus sign |
Character in a query component; problematic in a URL |
| , |
0x2C |
[U+002C] |
Comma |
Character in a query component; problematic in a URL |
| - |
0x2D |
[U+002D] |
Hyphen-minus |
|
| . |
0x2E |
[U+002E] |
Full stop (period) |
Especially as last char of a file name |
| / |
0x2F |
[U+002F] |
Solidus (slash) |
Especially as last char of a file name; also a C reserved
char or reserved in a query component; problematic in a URL
|
| : |
0x3A |
[U+003A] |
Colon |
A reserved character in a query component; problematic in
a URL |
| ; |
0x3B |
[U+003B] |
Semicolon |
A valid char in a URL, however can be problematic; may want
to escape anyway; reserved within a query component, can be
a parameter delimiter. |
| < |
0x3C |
[U+003C] |
Less-than sign |
Can be a delimiter or part of HTML or script; problematic
in a URL |
| = |
0x3D |
[U+003D] |
Equals sign |
Reserved character in a query component; problematic in a
URL |
| > |
0x3E |
[U+003E] |
Greater-than sign |
Can be a delimiter or part of HTML or script; problematic
in a URL |
| ? |
0x3F |
[U+003F] |
Question mark |
Reserved character in a query component; problematic in a
URL |
| @ |
0x40 |
[U+0040] |
Commercial At (at sign) |
Reserved character in a query component; problematic in a
URL unless part of the authentication |
| [ |
0x5B |
[U+005B] |
Left square bracket |
An unwise character to leave unescaped; problematic in a
URL ; also problematic in RTL |
| \ |
0x5C |
[U+005C] |
Reverse solidus (backslash) |
Especially as last char of a file name; an unwise character
to leave unescaped; problematic in a URL |
| ] |
0x5D |
[U+005D] |
Right square bracket |
An unwise character to leave unescaped; problematic in a
URL ; also problematic in RTL |
| ^ |
0x5E |
[U+005E] |
Circumflex accent |
An unwise character to leave unescaped; problematic in a
URL |
| _ |
0x5F |
[U+005F] |
Low line |
An unwise character to leave unescaped; problematic in a
URL |
| ` |
0x60 |
[U+0060] |
Grave accent |
An unwise character to leave unescaped; problematic in a
URL ; also problematic in RTL |
| { |
0x7B |
[U+007B] |
Left curly brace |
An unwise character to leave unescaped; problematic in a
URL |
| | |
0x7C |
[U+007C] |
Vertical line (pipe) |
An unwise character to leave unescaped; problematic in a
URL ; also problematic in RTL |
| } |
0x7D |
[U+007D] |
Right curly brace |
|
| ~ |
0x7E |
[U+007E] |
Tilde |
|
| |
0x7F |
[U+007F] |
Delete |
|
| « |
0xAB |
[U+00AB] |
Left-pointing double angle |
|
| _ |
0x1C |
[U+001C] |
File Separator |
|
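To see how the characters flagged as problematic in a URL look once escaped, they can be run through a percent-encoder. The sketch below uses Python's standard `urllib.parse.quote`; the `URL_PROBLEM_CHARS` string is an illustrative selection from Table G.3, not an exhaustive or authoritative list.

```python
from urllib.parse import quote, unquote

# Characters flagged as problematic in a URL in Table G.3 (illustrative subset).
URL_PROBLEM_CHARS = " !\"#&'()+,/:;<=>?@[\\]^`{|}"

# safe="" tells quote() to escape everything except letters, digits, and "_.-~".
ENCODED = {ch: quote(ch, safe="") for ch in URL_PROBLEM_CHARS}
```

Sending both the raw character and its escaped form from `ENCODED` helps show whether the application decodes and handles the two consistently.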
Extended Range Problem Characters
Table G.4 contains potentially problematic extended range characters
from the single-byte code pages.
Table G.4 Extended Range Problem Characters
| Character |
Unicode point |
Name |
Comment |
| ö |
[U+00F6] |
Latin Small Letter O with Diaeresis |
Can be a problem in filenames on DBCS systems.
|
| § |
[U+00A7] |
Section Sign |
|
| ß |
[U+00DF] |
Latin Small Letter Sharp S |
|
| å |
[U+00E5] |
Latin Small Letter A with Ring Above |
DOS delete marker. Mostly significant if first
char in a string; essentially this is a Ctrl+z. |
| € |
[U+20AC] |
Euro Currency Symbol |
|
| ª |
[U+00AA] |
Feminine Ordinal Indicator |
This can sometimes be interpreted by Novell's
NetWare as a disconnect signal or other similar low-level command.
If your software will be used with NetWare, you will want to
plan your tests to include these. |
| ® |
[U+00AE] |
Registered Sign |
This can sometimes be interpreted by Novell's
NetWare as a disconnect signal or other similar low-level command.
If your software will be used with NetWare, you will want to
plan your tests to include these. |
| ¿ |
[U+00BF] |
Inverted Question Mark |
This can sometimes be interpreted by Novell's
NetWare as a disconnect signal or other similar low-level command.
If your software will be used with NetWare, you will want to
plan your tests to include these. |
| İ |
[U+0130] 0xDD on 1254 code page |
Latin Capital Letter I with Dot Above |
Only found in Turkish on the 1254 code page;
a system that does not handle this character properly may
convert it to a different character. |
| ı |
[U+0131] 0xFD on 1254 code page |
Latin Small Dotless Letter I |
Only found in Turkish on the 1254 code page;
a system that does not handle this character properly may
convert it to a different character. |
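The conversion problem with the two Turkish letters shows up in ordinary case mapping. The following sketch uses Python's locale-independent `str.lower` and `str.upper` purely as an illustration; other languages and platforms may map these letters differently.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

# Lowercasing İ yields two code points: "i" plus COMBINING DOT ABOVE (U+0307).
lowered = DOTTED_CAPITAL_I.lower()

# Uppercasing ı yields a plain ASCII "I", so a round trip does not return ı.
uppered = DOTLESS_SMALL_I.upper()
round_trip = uppered.lower()
```

A comparison or lookup that normalizes case before matching can therefore treat `ı` and `i` as the same value, or change the length of a string that contains `İ`.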
Problem Character Combinations
Table G.5 contains problem character combinations from the lower
ASCII, the extended range (or upper ASCII), and then combinations
of the two.
Table G.5 Problem Character Combinations
| Characters |
Unicode points |
Names |
Comment |
| :: |
[U+003A][U+003A] |
Two colons |
|
| ~1: |
[U+007E][U+0031][U+003A] |
A tilde, a number (any number), and a colon |
|
| .. |
[U+002E][U+002E] |
Two periods |
This can present security problems by allowing
access to files otherwise not accessible. |
| $$ |
[U+0024][U+0024] |
Two dollar signs |
|
| :€� |
[U+003A][U+20AC][U+FFFD] |
Colon, Euro symbol, and [U+FFFD] |
Although FFFD is not a "real" character, this
can present problems. |
| ++ |
[U+002B][U+002B] |
Two pluses |
|
| %0 |
[U+0025][U+0030] |
Percent sign, number zero |
Can cause problems in Perl scripts. |
| \n |
[U+005C][U+006E] |
Backslash, letter n |
Escape sequence for new line in JavaScript. |
| \b |
[U+005C][U+0062] |
Backslash, letter b |
Escape sequence for backspace in JavaScript strings. |
| %20 |
[U+0025][U+0032][U+0030] |
Percent sign, number two, number zero |
URL encoded sequence for a space. |
| 00:\ |
[U+0030][U+0030][U+003A][U+005C] |
Two number zeros, colon, backslash |
|
| & |
[U+0026] |
Ampersand |
|
| < |
[U+003C] |
Less-than sign |
|
| > |
[U+003E] |
Greater-than sign |
|
| = |
[U+003D] |
Equals sign |
|
| Ü¢£ |
[U+00DC][U+00A2][U+00A3] |
Letter U with diaeresis, cent sign, pound (currency)
sign - high literals |
|
| FFFFFFFF |
[U+0046][U+0046][U+0046][U+0046]
[U+0046][U+0046][U+0046][U+0046]
|
Eight letter F |
Input as a value, especially a regkey. |
| ::$DATA |
[U+003A][U+003A][U+0024][U+0044]
[U+0041][U+0054][U+0041]
|
Two colons, dollar sign, letters D, A, T, A |
Indicates data stream. |
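The security problem noted for the two-period sequence can be checked for by resolving a user-supplied path against a root directory. This is a minimal sketch using Python's `posixpath` module; the function name `escapes_root` and the `/srv/www` root are assumptions, and the check does not cover symbolic links or Windows-specific forms such as `::$DATA`.

```python
import posixpath


def escapes_root(root, user_path):
    """Return True if joining user_path onto root resolves outside root."""
    root = posixpath.normpath(root)
    resolved = posixpath.normpath(posixpath.join(root, user_path))
    return resolved != root and not resolved.startswith(root + "/")
```

For example, `escapes_root("/srv/www", "../etc/passwd")` reports an escape, while `escapes_root("/srv/www", "img/../index.html")` does not.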
Lower ASCII Character Combination Verification Cases
Table G.6 contains test cases to try in order to verify that your
application properly handles various lower ASCII characters. Whereas
the previous set of character combinations were chosen because of
their potential ability to break an application, these are chosen
for their ability to prove that the application is properly handling
valid lower ASCII input.
Table G.6 Character Combination Verification Cases
| Characters |
Unicode points |
Comment |
| aAzZ |
[U+0061][U+0041][U+007A][U+005A] |
Tests that basic alphabetic characters are accepted.
|
| 1234 |
[U+0031][U+0032][U+0033][U+0034] |
Tests that common numbers are accepted. |
| 12aZ |
[U+0031][U+0032][U+0061][U+005A] |
Tests that numbers and letters are accepted,
starting with numbers. |
| aZ12 |
[U+0061][U+005A][U+0031][U+0032] |
Tests that letters and numbers are accepted,
ending with numbers. |
| ~!;:?/* |
[U+007E][U+0021][U+003B][U+003A][U+003F]
[U+002F][U+002A]
|
Tests that common symbols are accepted. |
| /../ |
[U+002F][U+002E][U+002E][U+002F] |
Tests symbols, but in an arrangement that can
be interpreted as a file path. |
| ..%255c.. |
[U+002E][U+002E][U+0025][U+0032][U+0035][U+0035][U+0063][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%%35%63.. |
[U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%%35c.. |
[U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0063][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%25%35%63.. |
[U+002E][U+002E][U+0025][U+0032][U+0035][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%252f.. |
[U+002E][U+002E][U+0025][U+0032][U+0035][U+0032][U+0066][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%c0%2f.. |
[U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0032][U+0066][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%c0%af.. |
[U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0061][U+0066][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%c1%1c.. |
[U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0031][U+0063][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%c1%9c.. |
[U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0039][U+0063][U+002E][U+002E]
|
Test case for URL canonicalization. |
| ..%255c../..%255c../..%255c/..%c1%1c../..%c1%1c../..%c1%1c..
|
|
Test case for URL canonicalization. |
| /À®./ |
[U+002F][U+00C0][U+00AE][U+002E][U+002F] |
Used with the previous test, specifically to
test parsers: if the previous input is not an allowed sequence,
then this should probably not be an allowed sequence. |
| \\?\C:\foo.txt |
[U+005C][U+005C][U+003F][U+005C][U+0043]
[U+003A][U+005C][U+0066][U+006F][U+006F]
[U+002E][U+0074][U+0078][U+0074]
|
Tests the assumption that the local file location
has the second character of a colon; NT specific. |
| \\127.0.0.1\C$\ |
[U+005C][U+005C][U+0031][U+0032][U+0037][U+002E]
[U+0030][U+002E][U+0030][U+002E][U+0031][U+005C]
[U+0043][U+0024][U+005C]
|
Tests the assumption that the local file location
has the second character of a colon; refers to the UNC localhost.
|
| &lt; |
[U+0026][U+006C][U+0074][U+003B] |
HTML sequence for the less-than sign. |
| &nbsp; |
[U+0026][U+006E][U+0062][U+0073][U+0070][U+003B]
|
HTML sequence for a non-breaking space. |
| <br> |
[U+003C][U+0062][U+0072][U+003E] |
HTML tag for a break. |
| &#65; |
[U+0026][U+0023][U+0036][U+0035][U+003B] |
Decimal HTML sequence for the letter A. |
| &#x0041; |
[U+0026][U+0023][U+0078][U+0030][U+0030][U+0034]
[U+0031][U+003B]
|
Similar to previous example, but this is the
hexadecimal HTML sequence for the letter A. |
| 0xf |
[U+0030][U+0078][U+0066] |
May be assumed to be the hexadecimal reference
to a number, in this case it would be 15. |
| 0xa |
[U+0030][U+0078][U+0061] |
May be assumed that this is the hexadecimal reference
to another number, in this case it would be converted to 10.
|
| %UFF3C |
[U+0025][U+0055][U+0046][U+0046][U+0033][U+0043]
|
URL encoded DBCS backslash. |
| Iiİı |
[U+0049][U+0069][U+0130][U+0131] |
Tests the two Latin letter I's and the two extra
Turkish I's. |
| <script>alert('Hello')</script> |
[U+003C][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C]
[U+006C][U+006F][U+0027][U+0029][U+003C][U+002F]
[U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E]
|
Script will pop up a Hello alert box if it is
executed; it should not be executed. |
| '><script>alert('Hello')</script>
|
[U+0027][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to the previous example, except this
will attempt to close a tag before the script. |
| "><script>alert('Hello')</script>
|
[U+0022][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to the previous example; this will attempt
to close a tag before the script. |
| <Script>alert('Hello')</Script> |
[U+003C][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0053][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Using mixed case in the script, testing for an
exact string match. |
| <sCript>alert('Hello')</sCript> |
[U+003C][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0073][U+0043][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to the previous example, using mixed
case in the script, testing for an exact string match. |
| <SCRIPT>alert('Hello')</SCRIPT> |
[U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054]
[U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027]
[U+0029][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049]
[U+0050][U+0054][U+003E]
|
Similar to the previous example, using all capitals
in the script, testing for an exact string match. |
| &#60;script&#62;alert('Hello')&#60;&#47;script&#62; |
[U+0026][U+0023][U+0036][U+0030][U+003B][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036]
[U+0032][U+003B][U+0061][U+006C][U+0065][U+0072][U+0074]
[U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0027][U+0029][U+0026][U+0023][U+0036][U+0030][U+003B]
[U+0026][U+0023][U+0034][U+0037][U+003B][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036]
[U+0032][U+003B]
|
Similar to the original script example, except
this string has the symbols in their decimal HTML reference.
|
%22><script%20for=window
%20event=%22onload()%22>
document.write(%22Hello%22);
document.close();</script>
Hello%22);document.close();
</script>.write(%22Hello%22);
document.close();</script> |
[U+0025][U+0032][U+0032][U+003E][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+0025][U+0032][U+0030]
[U+0066][U+006F][U+0072][U+003D][U+0077][U+0069][U+006E]
[U+0064][U+006F][U+0077][U+0020][U+0025][U+0032][U+0030]
[U+0065][U+0076][U+0065][U+006E][U+0074][U+003D][U+0025]
[U+0032][U+0032][U+006F][U+006E][U+006C][U+006F][U+0061]
[U+0064][U+0028][U+0029][U+0025][U+0032][U+0032][U+003E]
[U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E]
[U+0074][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065]
[U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C]
[U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B]
[U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E]
[U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065]
[U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0048][U+0065]
[U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029]
[U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065]
[U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073]
[U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+002E]
[U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025]
[U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F]
[U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E]
[U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029]
[U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to the previous example, except this
has all quotes and spaces URL escaped. |
<script>(unencode("<script>
alert('Hello')</script>"))</script> |
[U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F]
[U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C]
[U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065]
[U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F]
[U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E]
[U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute. |
| blah<script>(unencode("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+003C][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075]
[U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028]
[U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074]
[U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072]
[U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029]
[U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070]
[U+0074][U+003E]
|
Similar to above examples, except this attempts
to use the unencode function to get script to execute. |
blah'<script>(unencode("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+0027][U+003C][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028]
[U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065]
[U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029]
[U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute and a
single quote. |
blah"<script>(unencode("<script>alert('Hello')
</script>"))</script> |
[U+0062][U+006C][U+0061][U+0068][U+0022][U+003C][U+0073]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028]
[U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065]
[U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072]
[U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C]
[U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063]
[U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029]
[U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+003E]
|
Similar to previous examples, except this attempts
to use the unencode function to get script to execute and a
double quote. |
<SCRIPT LANGUAGE="VBScript">
MsgBox "Hello!" </SCRIPT> |
[U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054]
[U+0020][U+004C][U+0041][U+004E][U+0047][U+0055][U+0041]
[U+0047][U+0045][U+003D][U+0022][U+0056][U+0042][U+0053]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+0022][U+003E]
[U+0020][U+004D][U+0073][U+0067][U+0042][U+006F][U+0078]
[U+0020][U+0022][U+0048][U+0065][U+006C][U+006C][U+006F]
[U+0021][U+0022][U+0020][U+003C][U+002F][U+0053][U+0043]
[U+0052][U+0049][U+0050][U+0054][U+003E]
|
VBScript version of the previous example; an alert box will
pop up if it is executed. |
| <a href="JavaScript:alert()">link</a>
|
[U+003C][U+0061][U+0020][U+0068][U+0072][U+0065][U+0066]
[U+003D][U+0022][U+004A][U+0061][U+0076][U+0061][U+0053]
[U+0063][U+0072][U+0069][U+0070][U+0074][U+003A][U+0061]
[U+006C][U+0065][U+0072][U+0074][U+0028][U+0029]
[U+0022][U+003E][U+006C][U+0069][U+006E][U+006B][U+003C]
[U+002F][U+0061][U+003E]
|
|
| ‹script›alert(‘Hello‘)‹⁄script›
|
[U+2039][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074]
[U+203A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028]
[U+2018][U+0048][U+0065][U+006C][U+006C][U+006F][U+2018]
[U+0029][U+2039][U+2044][U+0073][U+0063][U+0072][U+0069]
[U+0070][U+0074][U+203A]
|
Symbols have been replaced with their high-bit
counterparts. |
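The URL canonicalization cases in Table G.6 target code that decodes input more than once or that accepts malformed UTF-8. The sketch below illustrates both with Python's standard `urllib.parse`; `decode_fully` and `is_valid_utf8` are hypothetical helper names, and the behavior of other decoders (in particular older web servers) may differ.

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_fully(value, limit=5):
    """Percent-decode repeatedly until the value stops changing or limit is hit."""
    for _ in range(limit):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


def is_valid_utf8(raw):
    """Return True if the bytes decode under a strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A single pass over `..%255c..` yields `..%5c..`, which looks harmless; only the second pass exposes the backslash. The bytes behind `..%c0%af..` are an overlong encoding of the slash, which a strict decoder rejects.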
HTML tags can include script where it may not be anticipated. Because
these tags, and others, can include script with their attributes,
they cannot be considered safe. The following lines contain some
examples of how script can appear in what appear to be safe HTML
tags.
<img src="JavaScript:alert()">img src</img>
<bgsound src="JavaScript:alert()">bgsound src</bgsound>
<iframe src="JavaScript:alert()">iframe src</iframe>
<table background="JavaScript:alert()">table background</table>
<object data="JavaScript:alert()">object data</object>
<frameset onload="JavaScript:alert()">frameset
onload</frameset>
<body onload="JavaScript:alert()">body onload</body>
<body background="JavaScript:alert()">body background</body><span
ID="ActiveX ID"></span>
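When checking how an application treats the script strings above, it helps to know what correctly escaped output looks like and why exact string matching on `<script>` is not enough. The following sketch uses Python's standard `html` and `re` modules; the regular expression is a simplified illustration and is not a complete script filter.

```python
import html
import re

payload = "<script>alert('Hello')</script>"

# What the payload should look like if it is echoed back safely escaped.
escaped = html.escape(payload, quote=True)

# Decimal character references turn back into markup once decoded.
decoded_reference = html.unescape("&#60;script&#62;")

# Case-insensitive check that also matches the mixed-case variants in Table G.6.
SCRIPT_TAG = re.compile(r"<\s*script\b", re.IGNORECASE)
variants = ["<script>", "<Script>", "<sCript>", "<SCRIPT>"]
matches = [bool(SCRIPT_TAG.search(v)) for v in variants]
```

Note that a tag check like `SCRIPT_TAG` still misses script carried in attributes such as `src`, `background`, or `onload`, which is the point of the tag examples listed above.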
Upper ASCII Character Combinations
In Table G.7 you will find upper ASCII (extended range) character
combinations for use in verifying that your application can handle
various valid upper ASCII input.
Table G.7 Upper ASCII Character Combinations
| Characters |
Unicode point |
Comment |
| öÜß |
[U+00F6][U+00DC][U+00DF] |
High literals |
| Ü¢£ |
[U+00DC][U+00A2][U+00A3] |
High literals |
| ©® |
[U+00A0][U+00A9][U+00AE] |
Problem literals |
| ¿¾Õ |
[U+00BF][U+00BE][U+00D5] |
Regional literals |
| &><" |
[U+0026][U+003E][U+003C][U+0022] |
Named entities |
| ©®¾¿Õ |
[U+00A0][U+00A9][U+00AE][U+00BE][U+00BF][U+00D5] |
Literals |
| åE5å |
[U+00E5][U+0045][U+0035][U+00E5] |
Can be mistaken for the DOS delete mark |
| €\$\ |
[U+20AC][U+005C][U+0024][U+005C] |
|
| â€™ |
[U+00E2][U+20AC][U+2122] |
|
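The last row of Table G.7, [U+00E2][U+20AC][U+2122], is the sequence that appears when the UTF-8 bytes of a right single quotation mark are read as code page 1252. The sketch below reproduces that mis-decoding in Python as an illustration of how such input can arise; the variable names are arbitrary.

```python
RIGHT_SINGLE_QUOTE = "\u2019"

# Encode as UTF-8 (bytes E2 80 99), then wrongly decode as Windows-1252.
utf8_bytes = RIGHT_SINGLE_QUOTE.encode("utf-8")
misread = utf8_bytes.decode("cp1252")

# Reversing the mistake recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")
```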
Diacritics
Table G.8 contains combining marks that can cause large problems
and have no ANSI equivalent; each is typed in combination with
another character to alter it (for example, typed with c [U+0063],
the combining cedilla creates ç).
Table G.8 Diacritics
| Unicode point |
Name |
| [U+0333] |
Combining double low line |
| [U+033F] |
Combining double overline |
| [U+0327] |
Combining cedilla |
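The difference between a combining sequence and a precomposed character can be seen with Unicode normalization. This sketch uses Python's standard `unicodedata` module and is an illustration only; whether an application normalizes its input at all is something to determine by testing.

```python
import unicodedata

# c + COMBINING CEDILLA composes to the single code point U+00E7 under NFC.
with_cedilla = unicodedata.normalize("NFC", "c\u0327")

# c + COMBINING DOUBLE LOW LINE has no precomposed form and stays two code points.
with_double_low_line = unicodedata.normalize("NFC", "c\u0333")
```

Strings that look identical on screen can therefore differ in length and in byte content, which matters for length limits, comparisons, and storage.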
High-Bit Characters
The characters listed in Table G.9 are different from their low-bit
counterparts and often end up converted to those counterparts
when the software cannot handle them. For instance, try taking script
and substituting the corresponding high-bit characters to see whether
a filter allows them through and another component then downgrades them,
with the end result that the script is executed. These characters can
also be problematic on their own as input.
Table G.9 High-Bit Characters
| Characters |
Unicode point |
Name |
| |
[U+00AD] |
Soft hyphen (SHY) |
| ‘ |
[U+2018] |
Single opening quote |
| ’ |
[U+2019] |
Single closing quote |
| “ |
[U+201C] |
Double opening quote |
| ” |
[U+201D] |
Double closing quote |
| ´ |
[U+00B4] |
Acute accent |
| ¸ |
[U+00B8] |
Cedilla |
| |
[U+00A0] |
Non-Breaking Space (NBSP) |
| © |
[U+00A9] |
Copyright |
| ® |
[U+00AE] |
Registered Mark |
| ™ |
[U+2122] |
Trademark |
| – |
[U+2013] |
En-dash |
| — |
[U+2014] |
Em-dash |
| … |
[U+2026] |
Ellipsis |
| ⁄ |
[U+2044] |
Fraction Slash |
| ‹ |
[U+2039] |
Single Left-Pointing Angle |
| › |
[U+203A] |
Single Right-Pointing Angle |
| ′ |
[U+2032] |
Prime |
| ″ |
[U+2033] |
Double Prime |
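The filter-then-downgrade scenario can be simulated in a few lines. In the sketch below, `naive_filter` stands in for an input filter and `downgrade` for a later component that maps high-bit characters to low-bit ones; both functions and the mapping table are hypothetical and exist only to illustrate the order-of-operations problem.

```python
# Hypothetical mapping a later component might apply to high-bit characters.
DOWNGRADE = str.maketrans({
    "\u2039": "<",
    "\u203a": ">",
    "\u2044": "/",
    "\u2018": "'",
    "\u2019": "'",
})


def naive_filter(text):
    """Return True if the text passes a simple check for an opening script tag."""
    return "<script" not in text.lower()


def downgrade(text):
    """Replace high-bit characters with their low-bit counterparts."""
    return text.translate(DOWNGRADE)


payload = "\u2039script\u203aalert(\u2018Hello\u2018)\u2039\u2044script\u203a"
```

The payload passes `naive_filter`, but `downgrade(payload)` produces an executable script string that the same filter would have rejected.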
Characters from Multibyte Character Sets
The rest of the tables in this appendix deal with double-byte
characters and single-byte characters from the multibyte code pages.
Boundary Cases
Table G.10 contains characters for testing the first and last characters
of the various multibyte code page ranges.
Table G.10 Boundary Cases for the Multibyte Code Pages
| Characters |
Unicode point |
Comment |
| |
[U+3000] [81/40] in 932, [A1/A1] in 949 and 936, [A1/40]
in 950 |
Ideographic space - beginning of first DBCS range on 932
code page |
| 滬 |
[U+6EEC] [9F/FC] in 932 |
End of first DBCS range on 932 code page |
| 。 |
[U+FF61] [A1] in 932 |
Beginning of Kana (single byte range) on 932 code page |
| ゚ |
[U+FF9F] [DF] in 932 |
End of Kana |
| 漾 |
[U+6F3E][E0/40] in 932 |
Beginning of Second DBCS range on 932 code page |
| 黑 |
[U+9ED1] [FC/4B] in 932 |
End of Second DBCS on 932 code page |
| |
[U+E4C6] [A1/40] in 936 code page |
Beginning of CHS 936 code page |
| |
[U+E4C5] [FE/FE] in 936 code page |
End of CHS 936 code page |
| |
[U+EEB8] [81/40] in 950 code page |
Beginning of CHT 950 code page |
| |
[U+E310] [FE/FE] in 950 code page |
End of CHT 950 code page |
| 갂 |
[U+AC02] [81/41] in 949 code page |
Beginning of Korean 949 code page |
| 詰 |
[U+8A70] [FD/FE] in 949 code page |
End of Korean 949 code page |
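The byte values in Table G.10 can be spot-checked by encoding a character with the corresponding codec. The sketch below uses Python's built-in `cp932`, `cp936`, `cp949`, and `cp950` codecs, which may not match every vendor's version of these code pages exactly; only a few rows are shown.

```python
IDEOGRAPHIC_SPACE = "\u3000"

# Lead/trail bytes of the ideographic space in each multibyte code page.
SPACE_BYTES = {
    codec: IDEOGRAPHIC_SPACE.encode(codec)
    for codec in ("cp932", "cp936", "cp949", "cp950")
}

# Single-byte Kana range boundaries on the 932 code page.
KANA_FIRST = "\uff61".encode("cp932")
KANA_LAST = "\uff9f".encode("cp932")
```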
Testing Individual Bytes that Make up the Double-Byte Character
Since the double-byte characters consist of 2 bytes read in individually,
either one of the bytes could be mistaken for a special lower ASCII
character. Because of this, you need to look at the special meaning
of the lower ASCII characters and take the code point that they
occupy to identify double-byte characters that have that code point
as either a leading byte or a trailing byte (see Tables G.11 through
G.16).
Table G.11 Lead Byte Is 81
| Character |
Unicode code point |
Code point |
| ー |
[U+30FC] |
[81/5B] on 932 code page |
| ‐ |
[U+2010] |
[81/5D] on 932 code page |
| ＼ |
[U+FF3C] |
[81/5F] on 932 code page |
| ＋ |
[U+FF0B] |
[81/7B] on 932 code page |
| － |
[U+FF0D] |
[81/7C] on 932 code page |
| ± |
[U+00B1] |
[81/7D] on 932 code page |
| × |
[U+00D7] |
[81/7E] on 932 code page |
Table G.12 Trailing Byte is 5C (ANSI Backslash Character - Need
to Use as First, Middle, and Last Character in a String)
| Character |
Unicode code point |
Code point |
| ― |
[U+2015] |
[81/5C] on 932 code page |
| |
[U+E0F7] |
[81/5C] on Windows 932 code page |
| 乗 |
[U+4E57] |
[81/5C] on 936 code page |
| |
[U+EED4] |
[81/5C] on 950 code page |
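The trailing-byte problem is easiest to see by processing the encoded bytes the way byte-oriented code would. This sketch uses the first row of Table G.12 with Python's `cp932` codec; `has_trail_byte` is a hypothetical helper, and the split on a backslash stands in for any code that scans bytes for a path separator or escape character.

```python
def has_trail_byte(ch, encoding, byte=0x5C):
    """Return True if ch encodes to two bytes and the second equals byte."""
    raw = ch.encode(encoding)
    return len(raw) == 2 and raw[1] == byte


# U+2015 encodes to 81 5C on the 932 code page.
raw = "\u2015".encode("cp932")

# Byte-oriented code that splits on a backslash cuts the character in half.
pieces = raw.split(b"\\")
```

The same helper can be pointed at other byte values, for example `0xAD` for the soft hyphen or `0xE5` as a lead byte with a small change to the index.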
Table G.13 Lead Byte Is E5 - Special DOS Deletion Mark
| Character |
Unicode code point |
Code point |
| 蕁 |
[U+8541] |
[E5/40] on 932 code page |
| 蛬 |
[U+86EC] |
[E5/7E] on 932 code page |
| 夜 |
[U+591C] |
[E5/A8] on 949 code page |
| 女 |
[U+F981] |
[E5/FC] on 949 code page |
Table G.14 Trail Byte Is AD - ANSI Soft Hyphen
| Character |
Unicode code point |
Code point |
| 伃 |
[U+4F03] |
[81/AD] on 936 code page |
| 藄 |
[U+85C4] |
[F0/AD] on 950 code page |
The double-byte Romaji characters look like Latin characters; use
them anywhere that single-byte Latin characters are expected.
Table G.15 Romaji Characters - Latin-Looking Characters from
the 932 Page
| Character |
Unicode point |
Comment |
| ◯ |
[U+25EF] |
Boundary |
| ０ |
[U+FF10] |
Use the double-byte numbers where any number might be expected.
|
| １ |
[U+FF11] |
Use the double-byte numbers where any number might be expected.
|
| ＠ |
[U+FF20] |
Use the double-byte symbols where any symbol might be expected.
|
| Ａ |
[U+FF21] |
Use the double-byte letters where any letter might be expected.
|
| Ｚ |
[U+FF3A] |
Use the double-byte letters where any letter might be expected.
|
| ａ |
[U+FF41] |
Use the double-byte letters where any letter might be expected.
|
| ｚ |
[U+FF5A] |
Use the double-byte letters where any letter might be expected.
|
| ぁ |
[U+3041] |
Boundary |
| ． |
[U+FF0E] |
Use the double-byte fullwidth period where any period might
be expected. |
| ／ |
[U+FF0F] |
Use the double-byte fullwidth solidus where any forward-slash
might be expected. |
| ： |
[U+FF1A] |
Use the double-byte fullwidth colon where any colon might
be expected. |
| ！ |
[U+FF01] |
Use the double-byte fullwidth exclamation mark where any
exclamation mark might be expected. |
| ‘ |
[U+2018] |
Use the double-byte fullwidth left single quote where any
quote might be expected. |
| ’ |
[U+2019] |
Use the double-byte fullwidth right single quote where any
quote might be expected. |
| “ |
[U+201C] |
Use the double-byte fullwidth left double quote where any
quote might be expected. |
| ” |
[U+201D] |
Use the double-byte fullwidth right double quote where any
quote might be expected. |
| ＜ |
[U+FF1C] |
Use the double-byte fullwidth less-than sign where any less-than
sign might be expected. |
| ＞ |
[U+FF1E] |
Use the double-byte fullwidth greater-than sign where any
greater-than sign might be expected. |
Table G.16 shows characters that represent potential problems
in NetWare.
Table G.16 NetWare Potential Problem Characters
| Character |
Unicode code point |
Code point |
| ｪ |
[U+FF6A] |
[AA] on 932 code page |
| ｮ |
[U+FF6E] |
[AE] on 932 code page |
| ｿ |
[U+FF7F] |
[BF] on 932 code page |
| 穐 |
[U+7A50] |
[88/AA] on 932 code page |
| 旭 |
[U+65ED] |
[88/AE] on 932 code page |
| 袷 |
[U+88B7] |
[88/BF] on 932 code page |
Potential Problem Character Conversions
When the same character is assigned to more than one code point in a
code page, converting from the code page to Unicode and then back to
the code page can return a different code point than the one you started
with. Tables G.17 and G.18 contain some examples of these types of
problem characters.
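A round-trip check of this kind can be sketched in a few lines of Python. The helper names `round_trips` and `find_round_trip_failures` are assumptions made for this example, and the result for any particular byte pair depends on the mapping tables of the codec in use, so treat the output as something to compare against your platform rather than as a reference.

```python
def round_trips(raw, codec):
    """Return True if decoding raw bytes and re-encoding yields the same bytes."""
    try:
        text = raw.decode(codec)
        return text.encode(codec) == raw
    except UnicodeError:
        # Bytes the codec cannot decode (or re-encode) do not round-trip.
        return False


def find_round_trip_failures(byte_sequences, codec):
    """Return the byte sequences that change when converted to Unicode and back."""
    return [raw for raw in byte_sequences if not round_trips(raw, codec)]


if __name__ == "__main__":
    # One candidate pair from Table G.17; which member of the pair survives
    # the round trip depends on the codec's mapping tables.
    candidates = [bytes([0xFA, 0x68]), bytes([0xED, 0x4C])]
    for raw in find_round_trip_failures(candidates, "cp932"):
        print(raw.hex(), "does not round-trip")
```

The same check applies to data that passes through your application: store the original bytes, read them back after the conversion, and compare.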
Table G.17 JPN-932
| Character |
Unicode code point |
Code point |
| 丨 |
[U+4E28] |
[FA/68] which will equal [ED/4C] |
| ￤ |
[U+FFE4] |
[FA/55] which will equal [EE/FA] |
| 厓 |
[U+5393] |
[FA/8D] |
| 晙 |
[U+6659] |
[FA/D7] |
| 纊 |
[U+7E8A] |
[FA/5C] |
| 槢 |
[U+69E2] |
[FA/EC] |
Table G.18 CHT-950
| Character |
Unicode code point |
Code point |
| ═ |
[U+2550] |
[A2/A4] which will equal [F9/F9] |
| ╞ |
[U+255E] |
[A2/A5] which will equal [F9/E9] |
| ╪ |
[U+256A] |
[A2/A6] which will equal [F9/EA] |
| 十 |
[U+5341] |
[A2/CC] which will equal [A4/51] |
| ╡ |
[U+2561] |
[A2/A7] which will equal [F9/EB] |
| 卅 |
[U+5345] |
[A2/CE] which will equal [A4/CA] |
| ╭ |
[U+256D] |
[F9/FA] which will equal [A2/7E] |
Miscellaneous DBCS Problem Characters
Table G.19 contains a variety of other characters that may cause
problems in your application. They do not necessarily fall into any
one class of problem, but they have historically been known to cause
misbehavior.
Table G.19 Miscellaneous DBCS Problem Characters
| Character |
Unicode point |
Comment |
| 郂 |
[U+90C2] |
936 code page CHS character. |
| ㏕ |
[U+33D5] |
936 and 950 code pages. |
| ╴ |
[U+2574] |
950 code page. |
| ～ |
[U+FF5E] |
932, 936, 949, and 950 code pages. Full-width tilde; the Unicode
code point that the code page table maps it to can differ depending
on the platform. |
| ＿ |
[U+FF3F] |
932, 936, 949, and 950 code pages. |
| ＃ |
[U+FF03] |
932, 936, 949, and 950 code pages. |
| ＆ |
[U+FF06] |
932, 936, 949, and 950 code pages. |
| ▓ |
[U+2593] |
936 and 950 code pages. |
| 가 |
[U+AC00] |
949 code page. |
| 耀 |
[U+8000] |
The E5 trailing byte of this Korean character can cause problems.
|
| 肭 |
[U+80AD] |
932, 936, and 950 code pages. |
Multibyte Character Combinations
The problem characters that have been discussed in this section
all come from the multibyte character sets; however, thus far I
have discussed only individual code points. Table G.20 contains
strings of multibyte characters to use both in verification and
in testing the ability of your application to handle truly problematic
characters.
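Because look-alike and invisible characters are easily altered when test strings are copied between tools, it can be safer to build the strings from their code points. The sketch below parses the bracketed notation used in these tables; the function name `from_code_points` and the regular expression are assumptions made for this example.

```python
import re

# Matches one bracketed code point such as [U+FF66].
CODE_POINT = re.compile(r"\[U\+([0-9A-Fa-f]{4,6})\]")


def from_code_points(notation):
    """Build a string from notation such as "[U+FF66][U+3000]"."""
    return "".join(chr(int(value, 16)) for value in CODE_POINT.findall(notation))


if __name__ == "__main__":
    # Single-byte DBCS characters with a DBCS space in the middle.
    text = from_code_points("[U+FF66][U+FF69][U+3000][U+FF6B][U+FF6F]")
    print([f"U+{ord(ch):04X}" for ch in text])
```

Printing the code points back out, as the usage example does, gives a quick check that the string you are about to submit is the one the table describes.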
Table G.20 Multibyte Character Combinations
| Character |
Unicode points |
Comment |
| ｦｩｫｯ |
[U+FF66][U+FF69][U+FF6B][U+FF6F] |
String of four single-byte DBCS characters |
| ｦｩ　ｫｯ |
[U+FF66][U+FF69][U+3000][U+FF6B][U+FF6F] |
String of single-byte DBCS characters with a DBCS space in
the middle |
| ｦｩｫｯｨ |
[U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68] |
String of five single-byte DBCS characters |
| ｦｩｫｯｨｪ |
[U+FF66][U+FF69][U+FF6B][U+FF6F][U+FF68][U+FF6A] |
String of six single-byte DBCS characters |
| 黑鸙鶴滬 |
[U+9ED1][U+9E19][U+FA2D][U+6EEC] |
String of four double-byte DBCS characters |
| 黑鸙鶴滬滸 |
[U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8] |
String of five double-byte DBCS characters |
| 黑鸙鶴滬滸滾 |
[U+9ED1][U+9E19][U+FA2D][U+6EEC][U+6EF8][U+6EFE] |
String of six double-byte DBCS characters |
| ｦｩｫｯ黑鸙ｦｩｫｯ
|
[U+FF66][U+FF69][U+FF6B][U+FF6F][U+9ED1]
[U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F] |
String of DBCS characters starting and ending with single-byte
characters, with double-byte characters in the middle |
| 黑鸙ｦｩｫｯ黑鸙
|
[U+9ED1][U+9E19][U+FF66][U+FF69][U+FF6B][U+FF6F]
[U+9ED1][U+9E19] |
String of DBCS characters starting and ending with
double-byte characters, with single-byte characters in the middle
|
| ￥\\￥ |
[U+FFE5][U+005C][U+005C][U+FFE5] |
Fullwidth yen signs around two backslashes |
Unicode-Only Characters
Table G.21 contains characters that are not found in any code page,
but rather exist only in Unicode. These characters are useful in
identifying problems in an application that should be handling Unicode
input, uncovering any potential code page dependencies it has.
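One way to check a candidate character against the code pages your application targets is to try encoding it with each one. The following sketch assumes Python's built-in codecs are close enough to the platform's code pages for a first pass; the function name `code_pages_containing` and the list of codecs are illustrative choices, not part of the original test data.

```python
def code_pages_containing(ch, codecs):
    """Return the codecs from the given list that can encode the character."""
    supported = []
    for codec in codecs:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        supported.append(codec)
    return supported


if __name__ == "__main__":
    # "gbk" is used here for the 936 code page.
    code_pages = ["cp1252", "cp932", "gbk", "cp949", "cp950"]
    for ch in "\u0938\u2028":
        print(f"U+{ord(ch):04X}", code_pages_containing(ch, code_pages))
```

A character that comes back with an empty list cannot survive a conversion through any of the listed code pages, which makes it a useful probe for hidden code page dependencies.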
Table G.21 Unicode-Only Characters
| Character |
Unicode code point |
Comment |
| |
[U+2002] |
En space |
| |
[U+2003] |
Em space |
| |
[U+200E] |
Left-to-right mark |
| |
[U+200F] |
Right-to-left mark |
| ‑ |
[U+2011] |
Non-breaking hyphen |
| ‟ |
[U+201F] |
Double high reversed quotation marks |
| |
[U+202A] |
Left-to-right embedding |
| |
[U+202B] |
Right-to-left embedding |
| � |
[U+FFFD] |
Replacement character |
| |
[U+FEFF] |
Byte order mark (BOM) |
| |
[U+2028] |
Line Separator mark (LSEP) |
| सुस्वागतम
|
[U+0938][U+0941][U+0938][U+094D][U+0935]
[U+093E][U+0917][U+0924][U+092E] |
Devanagari characters; can be a problem
and unsupported in some areas |
UTF-8 Potential Problems
In UTF-8 encoding, the characters of the Basic Multilingual Plane fall
into three ranges because they are encoded with 1, 2, or 3 bytes
(characters beyond U+FFFF take 4 bytes). Testing the boundaries of these
ranges is very important. Another good test case is to take a long
string of 3-byte encoded Unicode characters and try to overrun
buffers with it. This will turn up a number of missed buffer overflows
because the buffer sizing or error handling may assume 2 bytes per
character (an assumption carried over from the double-byte characters),
not 3. (See Table G.22.)
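The byte counts and the overrun string can be checked with a short Python sketch. The helper names `utf8_length` and `overrun_string` and the 256-byte buffer size are assumptions made for this example.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


def overrun_string(ch, buffer_bytes):
    """Repeat ch until its UTF-8 form is longer than buffer_bytes bytes."""
    count = buffer_bytes // utf8_length(ch) + 1
    return ch * count


if __name__ == "__main__":
    # Code points taken from Table G.22.
    for code_point in (0x20, 0x7E, 0x81, 0x6ED, 0x901, 0x6EEC, 0xFFEE):
        print(f"U+{code_point:04X}", utf8_length(chr(code_point)), "byte(s)")

    # A string of 3-byte characters sized to exceed a hypothetical
    # 256-byte buffer by the smallest possible margin.
    payload = overrun_string("\u6eec", 256)
    print(len(payload), "characters,", len(payload.encode("utf-8")), "bytes")
```

Sizing the payload from the byte length rather than the character count matters here: code that allows 2 bytes per character would accept this string by character count and still overflow.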
Table G.22 UTF-8 Potential Problems
| Character |
Unicode code point |
Comment |
| [space] |
[U+0020] |
First printable character that requires only
1-byte encoding (Basic Latin-space) |
| ~ |
[U+007E] |
Last printable character that requires only 1-byte encoding
(Basic Latin); the 1-byte range ends at U+007F |
| |
[U+0081] |
Character at the start of the 2-byte encoding range, which
begins at U+0080 (Latin-1 supplement) |
| ۭ |
[U+06ED] |
Character in the 2-byte encoding range, which ends
at U+07FF (Arabic) |
| ँ |
[U+0901] |
Character at the start of the 3-byte encoding range, which
begins at U+0800 (Devanagari) |
| 滬 |
[U+6EEC] |
Character in the middle of the 3-byte encoding
range (CJK Unified) |
| ￮ |
[U+FFEE] |
Near the end of the 3-byte encoding range, which ends at U+FFFF (Half-width form)
|