(from the CLDR-users archives; source: CLDR-users index for Jun. 2007)
From: "Don Osborn"
Subject: CLDR and diversity of locales (not) filed
Date: Sun, 3 Jun 2007 17:02:17 -0400
Philippe posted this on Unicode@unicode.org and I hope it's okay with him and all to repost it here with a few comments (including a couple in the text).
This message calls to mind several issues. First, I think that it is not the general rule to develop locales for any international language spoken anywhere, but only for languages (1) indigenous to particular countries or (2) present there with a particular character in that country.
Second, while agreeing on the general point that there are many locales that need to be written, I am cautious about listing "missing locales" without a clear idea of the need and situation of each. This is especially the case where the people doing the listing have some information but neither expertise in nor significant first-hand familiarity with the language situations in question. Part of the reason is the potential for errors. Another part is that, particularly for many languages with complex dialect situations (such as quite a few in Africa), the existing language tags (in ISO-639) are not necessarily ideal for locales and localization. The latter cases need attention by experts - and that is not to slow things down, but to appeal for bringing such experts into the process more proactively.
And third, a "meta" observation. This discussion began on the Unicode list, and I repost it here, but it also has to do with language tagging issues dealt with on the ietf-languages and ltru lists. I will refrain from crossposting there, but for me it's another illustration of how these issues cross boundaries, and of how issues that seem to relate specifically to one domain (such as locales) quickly link to others. Again, that's not to slow work down, but to appeal for awareness.
> -----Original Message-----
> From: Philippe Verdy
> Sent: Thursday, May 31, 2007 8:35 PM
> To: 'Don Osborn'; 'Daniel Yacob'; email@example.com;
> Subject: RE: [OT]non-terrestrial writing systems
> If I just consider the current data found in the CLDR language-territory
> map, we currently have a total of 443 languages spoken by 5,972,581,880
> So there are still lots of people speaking uncovered modern languages (rough
> estimate, about 1.5 billion) possibly more because those CLDR estimates are
> only for the primary language:
> if I look for example at France, the CLDR data says that French is spoken
> only by 51 million people out of more than 71 million residents, and an
> incredibly large 16 million people (but most probably only as a secondary
> language; and it forgets more common primary language spoken in France by
> French natives: Arabic, Berber, Rom/Tzigane, Armenian, and lots of other
> languages spoken by more recent immigrants with a legal residence: the same
> languages as well as Romanian, Polish, Chinese, Vietnamese, Persian,
> Turkish, and many African languages...). (Note that the CLDR data includes
> statistics for migrants, but minors the statistics for French-speaking US
> This just confirms that the CLDR data just concentrates on the primary
> language, or on an official lingua franca for languages spoken by a community
> spread in very small minorities over a territory, and that are not directly
> identifiable. But the same data contains statistics for old regional
> languages, even though most of them are only spoken as a secondary language.
> The case of English in France is very significant.
> I'm sure that those statistics are tweaked in favour of more important
> languages, but even in this case, they are missing lots of people in the
> world; notably: there's data missing for the languages in:
> * [MX] Mexico, [BO] Bolivia and [PE] Peru: lots of Amerindian languages
> * [MQ] Martinique (France): French is given very low statistics, probably
> French Creole is missing (but in Guadeloupe, the statistics indicate
> standard French spoken by everyone, without any creole?)
> * [DZ] Algeria and [TN] Tunisia: missing Berber, Fulah (Peul)...
Fula/Peul is not in Algeria or Tunisia
> * [CG] Congo-Brazzaville, [CM] Cameroon: missing lots of African
> * [CI] Côte d'Ivoire: missing lots of African languages, or statistics are
> most probably about 100 times too low if considering only the lingua-franca
> languages (only French and Koro?), possibly an input bug! Missing English
There was a misattribution of a locale for a language in Nigeria that is also known under a similar name. I am not sure whether that has been corrected, but I will look at it again.
> * [GM] Gambia: lots of African languages
> * [ZA] South-Africa: only the official languages are listed, plus Swati,
> Swahili, South-Ndebele, Hindi being the only non African language listed
> (where is also Chinese?)
> * [RE] Reunion (France): Reunion French Creole is listed along with Tamil,
> but Chinese is missing
> * [SC] Seychelles: where are Indian languages?
> * [JE] Jersey and [GG] Guernsey: where are English, Normand, Jersiais and
> * [GI] Gibraltar: most probably, Spanish is missing there.
> * [RU] Russia: many Asian languages (including Chinese and Mongolian) and
> German, Yiddish, Hebrew...
> * [CN] China (People's Rep.), [MO] Macau SAR, [HK] Hong Kong SAR: missing
> Southern Chinese dialects, plus Hmong and Turkic languages.
> * [MY] Malaysia: lots of native languages
> * [PH] Philippines, [TH] Thailand: their native languages are spread all
> around the world through navigation
> * [ID] Indonesia: certainly lots of native languages missing
> * [CK] Cook Islands: missing Cook Islands Maori (only English listed)
> * [NC] New-Caledonia (France): missing native Polynesian languages
> * [WF] Wallis-and-Futuna (France): missing native Polynesian languages
> Most missing languages are in South-East Asian archipelagos, India, China,
> all over Africa, Central America, and North-West of South America. Only
> European languages and large Asian languages are "well" covered at least
> with the primary language plus some regional languages.
> And anyway, we still lack resources for important historic languages in the