Category Archives: ICT

Economics of language and the “long tail” effect – part 2

On the Wikinomics blog, Dan Herman responded to my discussion of use of the long tail model for languages. He raised some interesting points that I’ll come to in a moment.

Part of the reason I posted on the long-tail concept is that I believe it will be useful in various ways for analyzing the situation of less widely-spoken languages (LWSLs; previously I’ve used MINELs, a term which says less about the size of the speaking community). I deliberately framed it in the context of the economics of language because I see the long tail as a model useful in the broader context of that field. In any event, we’re just beginning to explore this, and it would be of interest to know of other efforts.

A clarification also needs to be made between what I’m seeing as two dynamics in the long tail of languages. Dan writes (referring to a previous Wikinomics posting that I referenced):

As Paul highlights in his post there are several tools and applications that, in theory, facilitate learning, or given Don’s take, not leaving, the long-tail.

It seems to me that these are really two different, although related, things. On the one hand, Paul looked more at how the potential “consumer” of language learning would perceive minority languages. On the other hand, I’m mostly interested in the view from points closer to where the language is spoken: from the individuals, households and communities who speak the language, to the regional and national entities that serve them – government, business, NGOs, education. The latter are all a different kind of “consumer” than potential language learners. (Parenthetically, I think this difference reflects one that I’ve noted in events related to the International Year of Languages: some people and organizations are focusing more on language learning, and others more on a nexus of issues relating to language rights, endangered languages, etc.)

All of these viewpoints are valid, of course, but when considering language development and indeed survival, it is useful to know whether ICT’s effect of lowering barriers to doing various things in and for less widely-spoken languages down the long tail ultimately balances or outweighs the factors that encourage speakers of those languages to focus uniquely on more widely-spoken languages at the head of the distribution. Which is to say, in effect, that the long-tail effect makes production and use of content and products in a language somewhere down the tail – say Soninke (a language spoken by about a million people in Mali, Senegal & Mauritania, with a historical link to the Ghana empire) – easier and cheaper for Soninke speakers than it was previously. But how will this affect use and development of the language?

In his Wikinomics blog article, Dan is skeptical, posing the question this way:

… in a world where the language of economics is conducted in one, perhaps two, and in the future maybe three languages, can a combination of technology, ethno-nationalism and culture trump trade and economics?

I’m not sure we can answer either question, but it might help to look at the long tail in different ways to see what’s involved. In his book The Long Tail, Chris Anderson shows that if you zero in on a section of the long tail, you find … another long tail distribution (see p. 21). One could, for instance, do the same with languages based on populations of speakers, or, to consider the viewpoint from a country and its citizens, look at just the languages in that country. For example, the following graph uses figures from Ethnologue for first language (L1) speakers of the languages of Mali:

[Graph: Languages of Mali by L1 speakers]

This is another classic long-tail distribution. I’ve used color codes for very closely related tongues that are mutually intelligible (at least to some degree – a question that could be discussed at length another time). For instance, dark blue is used for the Manding tongues like Bambara, Jula, Malinke and Khassonke. Red is used for languages not in one of those groups. Soninke (snk) is one of these, with 700,000 speakers in Mali and 1 million or so overall – pretty significant in a particular region, and fourth among the language categories Ethnologue lists for Mali.
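
As a rough illustration of how one might reproduce such a graph, here is a minimal matplotlib sketch. The speaker figures are placeholder orders of magnitude, not exact Ethnologue data, and the selection of languages is illustrative only:

```python
import matplotlib.pyplot as plt

# Placeholder L1 speaker estimates for some languages of Mali
# (illustrative orders of magnitude, NOT exact Ethnologue figures).
languages = {
    "Bambara": 4_000_000,
    "Fulfulde": 1_000_000,
    "Songhay": 800_000,
    "Soninke": 700_000,
    "Dogon": 600_000,
    "Senoufo": 500_000,
    "Tamasheq": 400_000,
    "Bozo": 300_000,
    "Bomu": 100_000,
    "Hassaniyya": 100_000,
}

# Sort descending so the long tail trails off to the right.
names, counts = zip(*sorted(languages.items(), key=lambda kv: -kv[1]))

plt.bar(names, counts)
plt.ylabel("L1 speakers")
plt.title("Languages of Mali by L1 speakers (illustrative)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```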

Of course, in multilingual societies people generally learn other languages, wherever their mother tongue may fall in the distribution. So in terms of usage it makes more sense to plot first and second (or additional) language speakers. In the following graph I plot the combined figures for the closely related groups – whether they be called “language,” “macrolanguage,” or language cluster – and add estimated second language (L2) speakers above those:

[Graph: Malian language groups by L1 & L2 speakers]

There is some uncertainty about L2 speakership – estimates of the percentage of Mali’s 10+ million people who speak Bambara run from 65% to 80%, and for French, the official language, one probably low estimate is 15%. Fulfulde has historically been a lingua franca in central Mali.
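
A stacked bar chart is one simple way to show that L1/L2 layering. The sketch below uses invented placeholder figures roughly consistent with the ranges just mentioned; they are not authoritative estimates:

```python
import matplotlib.pyplot as plt

# Illustrative L1 and estimated additional L2 speakers, in millions.
# These are rough placeholders consistent with the ranges discussed in
# the text (e.g. Bambara L1+L2 at 65-80% of ~10M; French at ~15%),
# not authoritative figures.
groups = ["Bambara/Manding", "French", "Fulfulde", "Songhay", "Soninke"]
l1 = [4.0, 0.1, 1.0, 0.8, 0.7]
l2 = [3.0, 1.4, 0.8, 0.4, 0.3]

plt.bar(groups, l1, label="L1 speakers")
plt.bar(groups, l2, bottom=l1, label="estimated L2 speakers")
plt.ylabel("Speakers (millions)")
plt.title("Malian language groups by L1 & L2 speakers (illustrative)")
plt.legend()
plt.tight_layout()
plt.show()
```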

And there are other ways we could graph long tails of language as well – for instance, at more local levels. Or, since there is a lot of trade and movement among the countries of the West African region of which Mali is a part, and many of the language communities are divided by borders, one could do regional or subregional graphs.

What is the point? First, when you narrow the geographic scale, the dominant “two or three” languages are not necessarily – and in fact usually are not – the same ones you see at the international level. English, Mandarin Chinese and Spanish may be the most significant worldwide, but none of them is major in Mali, for instance. Languages that are relatively far down the tail in the international distribution may be at the top on a country or regional scale, and some languages specific to a country or region have significant advantages in this context. Indeed, locally dominant languages do displace weaker languages to some degree. This may be the case with Bambara in Mali, or at least in much of the country.

Second, a language like Soninke, which is pretty far down the tail on the international scale, has a higher profile nationally or subregionally (remembering that it is a cross-border language).

The global distribution hides these realities. I do think it is true that the long-tail effect of advances in ICT generally lowers barriers and increases the potential for various kinds of work with LWSLs far down the tail (to the point where the main problems arise when a language has few resources) – including for language learners, among whom the particular category of “heritage language learners” deserves special note. Even so, it may be that long tail distributions at more local levels are more informative for discussions of linguistic situations and language policy.

In other words, the significance of ICT’s effect on the potential to do various work (like publishing) in LWSLs may best be seen in reference to long tail distributions on country and regional levels.

Dan suggests that

As countries migrate through the demographic transition, and subsequently become increasingly urbanized, there’s an inherent move towards common languages in order to facilitate the trade of services and goods.

Whether this means more a “trimming” of the tail or more an evolution of the language portfolios of multilingual speakers and communities is open to discussion. None of us is suggesting that speakers of LWSLs should abandon their languages in favor of languages of wider communication (LWCs); the question is whether a combination of ICTs and good language and education policies can help people keep and develop their languages, even if their numbers are few.


Economics of language and the “long tail” effect

“The economics of language has been neglected and deserves much greater attention,” wrote economist Donald Lamberton in a book he edited in 2002. That may not have been too much of a revelation at the time – only a few years earlier (1994) another economist, François Grin, wrote that this field was tolerated “as an intriguing fringe interest” by the discipline of economics. I’d like to briefly explore an intriguing idea on the fringe of that fringe: whether there are or could be “long-tail” dynamics that give some advantages to minority languages.

But first, what is “economics of language”? Grin, in the same article mentioned above, defined it as covering the study of:

…the effects of language on income (possibly revealing the presence of language-based discrimination), language learning by immigrants, patterns of language maintenance and spread in multilingual polities or between trading partners, minority language protection and promotion, the selection and design of language policies, language use in the workplace, and market equilibrium for language-specific goods and services.

Actually, some of these issues are getting increased attention (another book on the topic was published just last year by Barry R. Chiswick and Paul W. Miller, for instance), so I suspect that the economics of language is becoming a little more mainstream. (A good online review of the subject, under the title “The Economics of Multilingualism,” was written by Grin and François Vaillancourt.)

[Figure: long tail graph from Wikimedia Commons]

What does the “long tail” have to do with any of this? Well, to begin with, the distribution of languages by number of speakers, if plotted out on a graph like the figure above (from the Wikimedia Commons), is a long tail distribution. The question is whether this means anything with regard to the economics of languages – and in particular for minority or less widely-spoken languages (the ones I’ve liked to call MINELs), which are in the long tail.

By way of explanation, the “long tail” refers to a distribution in which a few categories each account for a large quantity (the green-shaded area in the figure) and many categories account for progressively smaller ones (the yellow-shaded part). It was popularized by Chris Anderson in a 2004 article, and then a 2006 book, on new marketing strategies facilitated by the internet. As such, it is a kind of economic model.
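
To make the head-versus-tail split concrete, here is a quick numeric sketch under a pure Zipf assumption – an idealization for illustration only, since real speaker figures do not follow a clean power law:

```python
# Head-vs-tail share under a Zipf-like distribution (illustrative only).
N = 6000                                        # roughly the number of living languages
sizes = [1 / rank for rank in range(1, N + 1)]  # Zipf: size proportional to 1/rank
total = sum(sizes)

head_share = sum(sizes[:20]) / total   # share of the 20 biggest categories
tail_share = 1 - head_share            # share spread over the other 5,980

print(f"Top 20 share:    {head_share:.1%}")
print(f"Long-tail share: {tail_share:.1%}")
```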

How do languages fit this pattern? I plotted out a bar graph for the 50 languages with the most mother tongue speakers using figures from Wikipedia (originally from Ethnologue) and an online utility at Shodor.org.
[Graph: the first 50 languages by number of mother tongue speakers]

It’s “quick and dirty” but gives an idea of how the actual distribution compares to the long tail model. Needless to say, there is a very long and low “tail” to the right in this distribution after the first 50 languages.

I got the idea of connecting the long tail concept with languages from Laurent Elder of IDRC. When I finally got to read up on the subject it began to make sense. At least partway…

I have been among those suggesting that information and communication technologies make a lot of things possible, or less expensive, for MINELs that were impossible or too costly before. Desktop publishing and webpages reduce the barriers to producing and sharing text in any language – critical for languages with few resources, and an example of a long tail effect. Cheaper communications via VOIP and the expanded availability of cellphones make it easier for dispersed members of a minority language community to speak their language with each other. Community radio (a new use of an old technology) opens new ways of using the oral language. And so on. To be sure, dominant languages can use the same technologies, but the real advantage, I think, is for the non-dominant languages.

On the other hand – and here the application of the long-tail concept to language runs into problems perhaps similar to other attempts to apply economic analysis to languages – people don’t move “down the tail” to niche markets with language the way they might with music or books (two of the prominent examples in Anderson’s writing on the subject). With language, the most salient fact is that people live in the long tail, as it were, and there are some incentives to move up the tail toward dominant languages. Part of the issue is how the new technologies can help people avoid abandoning their linguistic home in the long tail when dominant languages are learned and used. Most people, after all, learn more than one language.

In any event, the long tail seems to be a useful concept for looking at the present and future of world languages. When I did a little research on this last fall, I came across an article on the Wikinomics blog that looked at the distribution of languages on the internet and posed questions about language learning. In other words, is there a long tail market for language services (mainly language learning)? This is a different take than mine above, but also interesting. There may yet be others and perhaps, as the field of economics of language develops, more ambitious applications of the concept.


Reflecting on “Computing’s Final Frontier”

In the March issue of PC Magazine, John Dvorak comments on four areas of computer technology in his column entitled “Computing’s Final Frontier”: voice recognition; machine translation (MT); optical character recognition (OCR); and spell-checkers. Basically he’s decrying how little progress has been made on these in recent years relative to the vast improvements in computer capacities.

I’d like to comment briefly on all four. Two of those – voice recognition, or actually speech recognition, and MT – are areas that I think have particular importance and potential for non-dominant languages (what I’ve referred to elsewhere as “MINELs,” for minority, indigenous, national, endangered or ethnic, and local languages) including African languages on which I’ve been focusing. OCR is key to such work as getting out-of-print books in MINELs online. And spell-checkers are fundamental.

Voice recognition. Dvorak seems to see the glass as half empty. I can’t claim to know the technology as he does, and maybe my expectations are too low, but from what I’ve seen of Dragon NaturallySpeaking, the accuracy of speech recognition in that specific task environment is excellent. We may do well to separate out two kinds of expectations: one, the ability of software to act as an accurate and dutiful (though at times perhaps a bit dense) scribe, and the other, the ability to really analyze the language. For some kinds of production, the former is already useful. I’ll come back to the topic of software and language analysis towards the end of this post.

Machine translation. I’ve had a lot of conversations with people about MT, and a fair amount of experience with some uses of it. I’m convinced of its utility even today with its imperfections. It’s all too easy, however, to point out the flaws and express skepticism. Of course anyone who has used MT even moderately has encountered some hilarious results (mine include English to Portuguese “discussion on fonts” becoming the equivalent of “quarrels in baptismal sinks,” and the only Dutch to English MT I ever did which yielded “butt zen” from what I think was a town name). But apart from such absurdities, MT can do a lot – I’ll enjoy the laughs MT occasionally provides and take advantage of the glass half full here too.

But some problems with MT results are not just inadequacies of the programs. From my experience using MT, I’ve come to appreciate the fact that the quality of writing actually makes a huge difference in MT output. Run-on sentences, awkward phrasing, poor punctuation and simple spelling errors can confuse people, so how can MT be expected to do better?

Dvorak also takes a cheap shot when he considers it a “good gag” to translate with MT through a bunch of languages and back to the original. Well, you can get the same effect with the old grapevine game of whispering a message through a line of people and seeing what you get at the end – in the same language! At my son’s school they did a variant of this with a simple drawing, seen and resketched one student at a time until it got through the class. Even if MT got closer to human accuracy, you’d still have this kind of corruption of information.

A particularly critical role I see for MT is in streamlining the translation of various materials into MINELs and among related MINELs, using work systems that involve perhaps different kinds of MT software as well as people who refine the products and feed back into improvements. In my book, “smart money” would take this approach. MT may never replace the human translator, but it can do a lot that people can’t.

Optical character recognition. Dvorak finds fault with OCR, but I have to say that I’ve been quite impressed with what I’ve seen. The main problems I’ve had have been with extended Latin characters and limited dictionaries – and both of those are because I’ve been using scanners at commercial locations, not machines where I can make modifications. In other words, I’d be getting better than 99% accuracy for a lot of material if I had my own scanners.

On the other hand, when there are extraneous marks – even minor ones – in the text, the OCR may come up with the kind of example Dvorak gives of symbols mixed up with letters. If you look at the amazing work that has been done with Google Patent Search, you’ll notice on older patents a fair amount of misrecognized character strings (words). So I’d agree that one ought to be able to program the software to sort out characters from extraneous marks through some systematic analysis (a series of algorithms?) – picking form out of noise, referencing a memory of texts in the language, etc.
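
One simple version of “referencing a memory of texts in the language” is dictionary-based post-correction of OCR output. The toy sketch below is only an illustration of the idea – a real system would use a proper language model – and the word list and sample string are invented:

```python
from difflib import get_close_matches

# Toy dictionary standing in for a real word list or language model.
DICTIONARY = {"patent", "improved", "machine", "apparatus", "the", "for"}

def clean_token(token: str) -> str:
    # Drop extraneous marks the OCR may have mixed in with the letters.
    return "".join(ch for ch in token if ch.isalpha())

def correct(token: str) -> str:
    word = clean_token(token).lower()
    if word in DICTIONARY or not word:
        return word
    # Snap unknown tokens to the nearest dictionary entry, if close enough.
    matches = get_close_matches(word, DICTIONARY, n=1, cutoff=0.75)
    return matches[0] if matches else word

ocr_output = "Improv3d mach!ne apparatu5 f0r the pat3nt"
print(" ".join(correct(t) for t in ocr_output.split()))
# -> improved machine apparatus for the patent
```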

In any event, enhancing OCR would help considerably with more digitization, especially as we get to digitizing publications in extended Latin scripts on stenciled pages and poor quality print of various sorts too often used for materials in MINELs.

Spell-checkers. For someone like me concerned with less-resourced languages, the issues with spell-checkers are different and more basic – so let me get those out of the way first. For many languages it is necessary to put a dictionary together first, and that may involve complications like standard orthographies and spellings, variant forms, and even dictionary resources being copyrighted.
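
As a sketch of that first step, one can bootstrap a candidate word list from whatever corpus of text exists in the language. The file names below are hypothetical, and the frequency cutoff is an arbitrary choice meant to filter out one-off typos and OCR noise:

```python
import re
from collections import Counter

def build_wordlist(paths, min_count=2):
    """Collect a candidate spell-check word list from corpus files."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # \w is Unicode-aware in Python 3, so extended Latin
            # letters (e.g. ɛ, ɔ, ŋ) are kept intact.
            counts.update(re.findall(r"\w+", f.read().lower()))
    # Require a word to appear at least min_count times, to keep
    # one-off typos and OCR noise out of the dictionary.
    return {w for w, c in counts.items() if c >= min_count and not w.isdigit()}

# Hypothetical usage:
# wordlist = build_wordlist(["corpus1.txt", "corpus2.txt"])
# print(len(wordlist), "entries")
```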

In the context of a super-resourced language like English, Dvorak raises a very valid criticism here: the wrong word, correctly spelled, is not caught by the checker. However, it seems to me that this problem would more appropriately be addressed by a grammar-checker, which should spot words out of context.

This leads to the question of why we don’t have better grammar-checkers. I recall colleagues raving in the mid-90s about the then-new WordPerfect Grammatik, but it didn’t impress me at the time (nevertheless, one article in 2005 found it was further along than Word’s grammar checker). The difference is more than semantic – grammar checkers rely on analysis of language, which is a different matter from checking character strings against dictionary entries (i.e., spell-checking).
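
Here is a toy sketch of the context idea: flagging a correctly spelled word that is improbable next to its neighbor, using word-bigram counts. The counts and the confusable pair are invented for illustration; a real checker would estimate them from a large corpus:

```python
# Toy bigram counts; a real checker would estimate these from a corpus.
BIGRAM_COUNTS = {
    ("their", "house"): 900,
    ("there", "house"): 2,
}
CONFUSABLES = {"their": "there", "there": "their"}

def suspicious(words, ratio=50):
    """Flag words whose confusable twin is far more common in context."""
    flags = []
    for i in range(len(words) - 1):
        word, nxt = words[i], words[i + 1]
        alt = CONFUSABLES.get(word)
        if alt is None:
            continue
        seen = BIGRAM_COUNTS.get((word, nxt), 1)      # default 1 for unseen pairs
        alt_seen = BIGRAM_COUNTS.get((alt, nxt), 1)
        if alt_seen / seen >= ratio:
            flags.append((i, word, alt))
    return flags

print(suspicious("we met at there house".split()))
# -> [(3, 'there', 'their')]  ("their house" vastly outnumbers "there house")
```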

Although this is not my area of expertise, it seems that the real issue beneath all of the shortcomings Dvorak discusses is the application of language analysis in computing (human language technology). Some of the solutions could thus be related – algorithms for grammar checking could spot properly-spelled words out of place, and could also be used in OCR to analyze a sentence with an ambiguous word or character string. These may in turn relate to the quality of speech recognition. The problems in MT are more daunting, but in some ways related. So a question is: are the experts in each area approaching these with reference to the others, or as discrete and separate problems?

A final thought is that this “final frontier” – what I have sometimes referred to as “cutting edge” technologies – is particularly important for speakers of less-resourced languages in multilingual societies. MT can save costs and make people laugh in the North, but it has the potential to help save languages and make various kinds of information available to people who wouldn’t have it otherwise. Speech recognition is useful in the North, but in theory could facilitate the production of a lot of material in diverse languages that might not happen otherwise (it’s a bit more complex than that, but I’ll come back to it another time). OCR adds increments to what is available in well-resourced languages, but can make a huge difference in available materials for some less-resourced languages, for which older publications are otherwise locked away in distant libraries.

So, improvement and application of these cutting-edge technologies are vitally important for people and markets not even addressed by PC Magazine. I took issue with some of what Dvorak wrote in this column, but ultimately his main point is spot on, in ways he might not have been thinking of.
