The Web Testing Companion: The Insider's Guide to Efficient and Effective
Tests
Lydia Ash
Code Pages
Pregenerated Windows code pages make excellent input data: they let you
paste a broad set of test characters into almost any text entry field.
They are presented here for reference, either to cut and paste through
your browser or to download, one code page at a time.
Note: Some characters may not be transmitted, received, or displayed
correctly, although every attempt has been made to ensure that they are.
Some code points may still need to be generated using a tool such as
Character Map so that you can be sure you have the proper code point for
your testing.
You can view each code page in HTML form or download it in DOC format.
The mappings between the languages used and the code pages that cover them
are not direct, but rather loose. This is a general guide to which code
pages to use to generate test data for applications that will be localized
or globalized for various languages.
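If the pasted characters do not survive the trip through your browser, test data for a single-byte code page can also be generated locally. The following is a minimal sketch, assuming Python 3 and its built-in `cp1252` codec; the function name `code_page_characters` is hypothetical and not part of the book's materials. It applies only to single-byte code pages, because multi-byte pages such as 932 or 936 cannot be enumerated one byte at a time.

```python
def code_page_characters(codec):
    """Map each byte value 0x20-0xFF to the character the codec assigns to it.

    Byte values that the codec leaves undefined are skipped.
    """
    table = {}
    for byte in range(0x20, 0x100):
        try:
            table[byte] = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # unassigned position in this code page
    return table


chars = code_page_characters("cp1252")

# One string holding every defined character, ready to paste into a field.
test_string = "".join(chars.values())

# Show the Unicode code point behind a few byte values for verification.
for byte in (0x41, 0x80, 0xE9):
    print(f"0x{byte:02X} -> U+{ord(chars[byte]):04X}")
```

Printing the Unicode code point next to each byte value gives you a way to confirm that the character you are about to enter is the one you intended, which is the same check the note above suggests doing with Character Map.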
ISO 8859
The International Organization for Standardization (ISO) lays out many
standards for the computing industry. Each part of ISO/IEC 8859 specifies
a character set that is suitable both for data- and text-processing applications
and for information interchange.
For information processing, it includes 8-bit single-byte coded graphic
character sets as follows:
- Part 1: Latin alphabet No.1 (1997) - second edition
- Part 2: Latin alphabet No.2 (1998) - second edition
- Part 3: Latin alphabet No.3 (1998) - second edition
- Part 4: Latin alphabet No.4 (1998) - second edition
- Part 5: Latin/Cyrillic alphabet (1998) - second edition
- Part 6: Latin/Arabic alphabet (1998) - second edition
- Part 7: Latin/Greek alphabet (1998) - second edition
- Part 8: Latin/Hebrew alphabet (1998) - second edition
- Part 9: Latin alphabet No.5 (1998) - second edition
- Part 10: Latin alphabet No.6 (1998) - second edition
- Part 11: Latin/Thai alphabet (1998)
- Part 12: Unassigned
- Part 13: Latin alphabet No.7 (1998)
- Part 14: Latin alphabet No.8 (1998)
- Part 15: Latin alphabet No.9 (1998)
Each part specifying a Latin Alphabet lists the languages for
which it has been designed. These are:
- Latin Alphabet No. 1. Albanian, Basque, Breton, Catalan, Danish,
Dutch, English, Faroese, Finnish, French (with restrictions), Frisian,
Galician, German, Greenlandic, Icelandic, Irish Gaelic (new orthography),
Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic,
Scottish Gaelic, Spanish, and Swedish.
- Latin Alphabet No. 2. Albanian, Croat, Czech, English, German,
Hungarian, Latin, Polish, Romanian, Slovak, Slovene, and Sorbian.
- Latin Alphabet No. 3. Esperanto and Maltese, and if needed
in conjunction with these, English, French (with restrictions), German,
Italian, Latin, and Portuguese. Coding of Turkish characters is deprecated
in this code.
- Latin Alphabet No. 4. Danish, English, Estonian, Finnish, German,
Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi (with
restrictions), Slovene, and Swedish.
- Latin Alphabet No. 5. Albanian, Basque, Breton, Catalan, Danish,
Dutch, English, Faroese, Finnish, French (with restrictions), Frisian,
Galician, German, Greenlandic, Irish Gaelic (new orthography), Italian,
Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic,
Scottish Gaelic, Spanish, Swedish, and Turkish.
- Latin Alphabet No. 6. Danish, English, Estonian, Faroese, Finnish,
German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Latin,
Lithuanian, Norwegian, Sámi (with restrictions), Slovene, and
Swedish.
- Latin Alphabet No. 7. Danish, English, Estonian, Finnish, German,
Latin, Latvian, Lithuanian, Norwegian, Polish, Slovene, and Swedish.
- Latin Alphabet No. 8. Albanian, Basque, Breton, Catalan, Cornish,
Danish, Dutch, English, French (with restrictions), Frisian, Galician,
German, Greenlandic, Irish Gaelic (old and new orthographies), Italian,
Latin, Luxemburgish, Manx Gaelic, Norwegian, Portuguese, Rhaeto-Romanic,
Scottish Gaelic, Spanish, Swedish, and Welsh.
- Latin Alphabet No. 9. Albanian, Basque, Breton, Catalan, Danish,
Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician,
German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Italian,
Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish
Gaelic, Spanish, and Swedish.
Note: For writing French, three characters not included in
Latin Alphabets 1, 3, 5, and 8 are also needed. These are included
in Latin Alphabet No. 9.
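The note about French can be checked directly. The sketch below assumes Python 3 and its built-in ISO 8859 codecs, and it assumes that the three characters in question are Œ, œ, and Ÿ; the helper name `can_encode` is hypothetical. It tries to encode the three characters in each of the Latin alphabets named in the note.

```python
# Assumption: these are the three French characters the note refers to.
FRENCH_EXTRAS = "\u0152\u0153\u0178"  # Œ, œ, Ÿ


def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Latin Alphabets No. 1, 3, 5, 8, and 9 are Parts 1, 3, 9, 14, and 15.
for codec in ("iso8859_1", "iso8859_3", "iso8859_9", "iso8859_14", "iso8859_15"):
    print(codec, can_encode(FRENCH_EXTRAS, codec))
```

A check like this is a quick way to decide whether a given character belongs in the test data for a particular code page before you paste it into a field.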
ISO 8859 versus Windows Code Pages
While the ISO standards are very clear, their mappings to a Windows
code page (or to any other corporate interpretation) are sometimes
not so precise. Because of the slight differences that may occur in
the interpretations, I refer to the relationship as a correlation
rather than a direct mapping. These correlations will be useful when
testing various languages and the globalization of your application.
The code pages and more are available on the companion Web site to
the book.
- 1252 correlates with ISO 8859-1
- 1250 correlates with ISO 8859-2
- 1257 correlates with ISO 8859-4
- 1256 correlates with ISO 8859-6
- 1253 correlates with ISO 8859-7
- 1255 correlates with ISO 8859-8
- 874 correlates with ISO 8859-11
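To see why these are correlations rather than direct mappings, you can compare how a Windows code page and its ISO counterpart decode the same byte values. This is a minimal sketch, assuming Python 3 and its built-in `cp1252` and `iso8859_1` codecs; the function name `differing_bytes` is hypothetical.

```python
def differing_bytes(codec_a, codec_b):
    """Return the byte values that the two codecs decode differently.

    Undefined positions decode to U+FFFD so that they can still be compared.
    """
    diffs = {}
    for byte in range(0x100):
        raw = bytes([byte])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs[byte] = (char_a, char_b)
    return diffs


diffs = differing_bytes("cp1252", "iso8859_1")
print(f"{len(diffs)} positions differ, "
      f"from 0x{min(diffs):02X} to 0x{max(diffs):02X}")
```

For this pair, every difference falls in the range 0x80 to 0x9F, where ISO 8859-1 has control codes and Windows 1252 places printable characters such as the euro sign. Those bytes are good candidates for test input when an application claims to handle one encoding but is fed the other.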
Additional Windows Code Pages
Other important Windows code pages are not strict interpretations
of the ISO standards or of the original standards from which they were
developed. Many have additional ranges added for better coverage of the
languages they serve.
- Windows 936 code page is GB 2312-80 (based on ISO 646) with
the Hanzi corrections. (CHS)
- Windows 932 code page is JIS X 0208-1990 plus the Microsoft extensions,
in Shift-JIS (SJIS) encoding. (JPN)
- Windows 950 is the Big Five set plus row 89 of the ETen extension.
(CHT/Taiwanese)
- Windows 949 is KS C 5601 plus extensions. (Korean)
- ISCII is a newly developed code page for Indic.
- GB 18030 is the newest revision of the CHS code page and includes
4-byte characters.
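Unlike the single-byte ISO 8859 parts, these East Asian code pages use more than one byte for most characters, which matters when a field has a length limit measured in bytes. The following is a minimal sketch, assuming Python 3 and its built-in `cp936`, `cp932`, `cp950`, `cp949`, and `gb18030` codecs; the sample characters and the name `encoded_length` are illustrative choices, not taken from the book.

```python
def encoded_length(char, codec):
    """Return the number of bytes the codec uses for a single character."""
    return len(char.encode(codec))


# One sample character per code page (illustrative choices).
samples = {
    "cp936": "\u4e2d",        # 中, Simplified Chinese
    "cp932": "\u65e5",        # 日, Japanese
    "cp950": "\u4e2d",        # 中, Traditional Chinese
    "cp949": "\ud55c",        # 한, Korean
    "gb18030": "\U00020000",  # a CJK Extension B character
}

for codec, char in samples.items():
    print(codec, f"U+{ord(char):04X}", encoded_length(char, codec), "bytes")

# The same Extension B character cannot be encoded in code page 936 at all.
try:
    "\U00020000".encode("cp936")
    fits_in_936 = True
except UnicodeEncodeError:
    fits_in_936 = False
print("U+20000 fits in cp936:", fits_in_936)
```

Characters that need four bytes in GB 18030, or that an older code page rejects outright, are useful boundary cases for length limits and for error handling.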