Letter database: languages, character sets, names etc

Data on languages using Latin script

This data was initially taken from two sources: the report of the UN Group of Experts on Geographical Names, "Toponymic Data Exchange Formats and Standards", presented to the Seventh UN Conference on the Standardization of Geographical Names (New York, January 1998; press release), and the draft "Characters and character sets for various languages" by Harald Alvestrand. The most recent copy of that draft I found is draft-alvestrand-lang-char-03.txt. Characters mentioned in the draft but missing from the geographical names report are listed as 'important'.

Characters actually used in a language (Ô in English), language codes (id/in for Indonesian), names (Faeroese/Faroese), differentiation (Sámi variants), the glyphs (what dcaron should look like) and the list of languages covered are all disputable. Try to look at the bright side. But if you think that some important aspect is missing or wrong, please don't hesitate to mail your comments to Indrek Hein, kiisu@eki.ee.

There are many existing romanisation (transliteration and transcription) systems in use for both Roman and non-Roman scripts. This database lists only the systems that are widely used in writing geographical names, hence the frequent abbreviation BGN/PCGN -- United States Board on Geographic Names / UK Permanent Committee on Geographical Names. Many other transliteration schemes, approved and used by ISO, major libraries, national bodies etc., are not covered here.

Language codes are chosen in the following order of preference:

ISO 639-1 two-letter code if available
ISO 639-2 three-letter terminology code
first four letters of the language's name in English
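
The order of preference above can be sketched as a small fallback function. This is only an illustration in Python, not the database's actual code: the function name pick_language_code and its parameters are hypothetical, and lower-casing the four-letter fallback is an assumption, since the text does not say how such codes are cased.

```python
from typing import Optional


def pick_language_code(name: str,
                       iso639_1: Optional[str] = None,
                       iso639_2: Optional[str] = None) -> str:
    """Choose a language code following the stated order of preference.

    name      -- the language's name in English
    iso639_1  -- two-letter ISO 639-1 code, if one exists
    iso639_2  -- three-letter ISO 639-2 terminology code, if one exists
    """
    if iso639_1:
        return iso639_1
    if iso639_2:
        return iso639_2
    # Fallback: first four letters of the English name.
    # Skipping non-letters and lower-casing are assumptions of this sketch.
    letters = [ch for ch in name if ch.isalpha()]
    return "".join(letters[:4]).lower()


if __name__ == "__main__":
    print(pick_language_code("Estonian", "et", "est"))   # two-letter code wins
    print(pick_language_code("Example", None, "xyz"))    # made-up three-letter code
    print(pick_language_code("Woleaian"))                # no codes supplied
```

The codes passed in are whatever the caller supplies; the function does not look anything up in ISO 639 itself.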

Too many languages in the Cyrillic section had neither an assigned code nor a settled English name. In that section, therefore, the language code, if available, is shown in brackets after the language name.

The following languages (some represented by romanization systems) do not require any characters beyond basic Latin: Armenian, Aymara, Belarusian, Creole, English, Fijian, Georgian, Greenlandic, Ikiribati, Kinyarwanda, Kirundi, Kosraean, Latin, Malay, Maldivian, Nauruan, Ndebele, Neomelanesian (Tok Pisin), Nukuoro, Palauan, Papiamento, Pedi, Ponapean, Quechua, Sesotho, siSwati, Somali, Soninke, Swahili, Thai, Toucouleur, Trukese, Tsonga, Tuvaluan, Ukrainian, Woleaian, Xhosa, Zulu. The list is incomplete, and some of the aforementioned languages are nevertheless included in the database because they have a number of 'important' characters or other possible transcription systems. The only thing that we can be reasonably sure about is that for Latin, the basic Latin alphabet should suffice... so please try to forget about the use of macron over long vowels or we have no sure things left.
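
Whether a given name stays within basic Latin can be checked mechanically. The sketch below is a minimal illustration in Python; it assumes "basic Latin" means the 26 unaccented letters A-Z and a-z, and the function name additional_letters is hypothetical.

```python
import string
from typing import List

# Assumption of this sketch: "basic Latin" = the unaccented letters A-Z, a-z.
BASIC_LATIN_LETTERS = set(string.ascii_letters)


def additional_letters(text: str) -> List[str]:
    """Return the sorted, distinct letters in text that fall outside basic Latin.

    Digits, spaces and punctuation are ignored; only alphabetic
    characters are considered.
    """
    extra = {ch for ch in text
             if ch.isalpha() and ch not in BASIC_LATIN_LETTERS}
    return sorted(extra)


if __name__ == "__main__":
    print(additional_letters("Tallinn"))       # nothing beyond basic Latin
    print(additional_letters("S\u00e1mi"))     # contains a with acute
```

An empty result means the text needs no additional characters; it says nothing about whether the language as a whole does.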

Some characters for Latin and Cyrillic are not in the UCS. These characters are of two types: they are either based on a modified shape of an existing character, or they combine a character with additional diacritical marks. All modified (and new) shapes will eventually be allocated in the UCS; characters that can be decomposed, especially those needed only in some rare transcription schemes, may not get a separate code. The general principle in this database is that every character occurring in some language's alphabet needs a separate code; combining characters are needed only for transcription schemes. Additional characters are given codes from the private use area of the UCS, starting with E000.
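
The allocation principle described above can be sketched roughly as follows. This is an illustration in Python, not the database's actual procedure: the class name PrivateUseAllocator is hypothetical, and the test for "already has its own code" relies on Python's unicodedata module, which reflects a current Unicode version rather than the UCS as it stood when this text was written. The upper bound F8FF is the end of the private use area in the Basic Multilingual Plane.

```python
import unicodedata

PUA_START = 0xE000   # first private use code, as stated in the text
PUA_END = 0xF8FF     # last code of the BMP private use area


class PrivateUseAllocator:
    """Hand out private use codes for letters that lack a single code point."""

    def __init__(self, start: int = PUA_START) -> None:
        self._next = start
        self._codes = {}   # composed sequence -> assigned private use code

    def code_for(self, sequence: str) -> int:
        """Return one code for a base letter plus optional combining marks."""
        # NFC folds base + marks into a precomposed character where one exists.
        composed = unicodedata.normalize("NFC", sequence)
        if len(composed) == 1:
            return ord(composed)            # already has its own code
        if composed not in self._codes:
            if self._next > PUA_END:
                raise ValueError("private use area exhausted")
            self._codes[composed] = self._next
            self._next += 1
        return self._codes[composed]


if __name__ == "__main__":
    alloc = PrivateUseAllocator()
    # e + combining acute has a precomposed form, so no private code is needed.
    print(hex(alloc.code_for("e\u0301")))
    # x + combining acute has no precomposed form in Unicode.
    print(hex(alloc.code_for("x\u0301")))
```

Repeated requests for the same sequence return the same private use code, so each additional character keeps one stable code within a run.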