All posts by Don

Reflecting on “Computing’s Final Frontier”

In the March issue of PC Magazine, John Dvorak comments on four areas of computer technology in his column entitled “Computing’s Final Frontier”: voice recognition; machine translation (MT); optical character recognition (OCR); and spell-checkers. Basically he’s decrying how little progress has been made on these in recent years relative to the vast improvements in computer capacities.

I’d like to comment briefly on all four. Two of those – voice recognition, or actually speech recognition, and MT – are areas that I think have particular importance and potential for non-dominant languages (what I’ve referred to elsewhere as “MINELs,” for minority, indigenous, national, endangered or ethnic, and local languages) including African languages on which I’ve been focusing. OCR is key to such work as getting out-of-print books in MINELs online. And spell-checkers are fundamental.

Voice recognition. Dvorak seems to see the glass half empty. I can’t claim to know the technology as he does, and maybe my expectations are too low, but from what I’ve seen of Dragon NaturallySpeaking, the accuracy of speech recognition in that specific task environment is excellent. We may do well to separate out two kinds of expectations: one, that software can act as an accurate and dutiful (though at times perhaps a bit dense) scribe; the other, that it can really analyze the language. For some kinds of production, the former is already useful. I’ll come back to the topic of software and language analysis towards the end of this post.

Machine translation. I’ve had a lot of conversations with people about MT, and a fair amount of experience with some uses of it. I’m convinced of its utility even today with its imperfections. It’s all too easy, however, to point out the flaws and express skepticism. Of course anyone who has used MT even moderately has encountered some hilarious results (mine include English to Portuguese “discussion on fonts” becoming the equivalent of “quarrels in baptismal sinks,” and the only Dutch to English MT I ever did which yielded “butt zen” from what I think was a town name). But apart from such absurdities, MT can do a lot – I’ll enjoy the laughs MT occasionally provides and take advantage of the glass half full here too.

But some problems with MT results are not just inadequacies of the programs. From my experience using MT, I’ve come to appreciate the fact that the quality of writing actually makes a huge difference in MT output. Run-on sentences, awkward phrasing, poor punctuation and simple spelling errors can confuse people, so how can MT be expected to do better?

Dvorak also takes a cheap shot when he considers it a “good gag” to translate with MT through a bunch of languages back to the original. Well, you can get the same effect with the old grapevine game of whispering a message through a line of people and seeing what comes out at the end – in the same language! At my son’s school they did a variant of this with a simple drawing seen and resketched one student at a time until it got through the class. Even if MT got closer to human accuracy you’d still have such corruption of information.

A particularly critical role I see for MT is in streamlining the translation of various materials into MINELs and among related MINELs, using work systems that involve perhaps different kinds of MT software as well as people to refine the products and feed back into improvements. In my book, “smart money” would take this approach. MT may never replace the human translator, but it can do a lot that people can’t.

Optical character recognition. Dvorak finds fault with OCR, but I have to say that I’ve been quite impressed with what I’ve seen. The main problems I’ve had have been with extended Latin characters and limited dictionaries – and both of those are because I’m using scanners at commercial locations, not on machines where I can make modifications. In other words I’d be doing better than 99% accuracy for a lot of material if I had my own scanners.

On the other hand, when there are extraneous marks – even minor ones – in the text, the OCR might come up with the kind of example Dvorak gives of symbols mixed up with letters. If you look at the amazing work that has been done with Google Patent Search, you’ll notice on older patents a fair amount of misrecognized character strings (words). So I’d agree that one ought to be able to program the software to sort out characters from extraneous marks through some systematic analysis (a series of algorithms?) – picking form out of noise, referencing memory of texts in the language, etc.
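To make the idea of “referencing memory of texts in the language” a bit more concrete, here is a rough sketch in Python. It is only an illustration, not how any actual OCR product works: the tiny VOCABULARY list, the correct_token function and the 0.8 similarity cutoff are all my own assumptions. It uses the standard library’s difflib to replace a garbled token with the closest known word, and leaves the token alone when nothing is close enough.

```python
import difflib

# Hypothetical stand-in for a word list built from texts in the language.
VOCABULARY = ["patent", "patient", "parent", "search"]

def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    if token in vocabulary:
        return token
    # get_close_matches ranks candidates by similarity, best match first.
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_token("pat3nt"))  # a stray mark read as "3"
print(correct_token("xq#z"))    # nothing similar, so left as is
```

A lookup like this only works where a decent word list exists, which is exactly what is missing for many MINELs.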

In any event, enhancing OCR would help considerably with more digitization, especially as we get to digitizing publications in extended Latin scripts on stenciled pages and poor quality print of various sorts too often used for materials in MINELs.

Spell-checkers. For someone like me concerned with less-resourced languages, the issues with spell-checkers are different and more basic – so let me get that out of the way first. For many languages it is necessary to get a dictionary together first, and that may have complications like issues of standard orthographies and spellings, variant forms, and even dictionary resources being copyrighted.

In the context of a super-resourced language like English, Dvorak raises a very valid criticism here regarding how the wrong word correctly spelled is not caught by the checker. However, it seems to me that the problem would be appropriately addressed by a grammar-checker, which should spot words out of context.

This leads to the question of why we don’t have better grammar-checkers. I recall colleagues raving in the mid-90s about the then new WordPerfect Grammatik, but it didn’t impress me then (nevertheless, one article in 2005 found it was further along than Word’s grammar checker). The difference is more than semantic – grammar checkers rely on analysis of language, which is a different matter from checking character strings against dictionary entries (i.e., spell-checkers).
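A small Python sketch may help show why checking character strings against dictionary entries cannot catch the wrong word correctly spelled. Everything here is a toy of my own making (the DICTIONARY set and the misspelled function are hypothetical, and a real checker would use a full word list), but the logic is the basic one.

```python
# Hypothetical toy word list; a real checker would load a full dictionary.
DICTIONARY = {"i", "went", "to", "their", "there", "house", "the", "over"}

def misspelled(sentence):
    """Return the words in the sentence that are not in DICTIONARY."""
    words = sentence.lower().split()
    return [w for w in words if w not in DICTIONARY]

print(misspelled("I went to thier house"))  # flags "thier"
print(misspelled("I went to there house"))  # flags nothing
```

The second sentence passes because “there” is a perfectly good dictionary entry; spotting that it is out of place takes some analysis of the sentence, which is the grammar-checker’s job.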

Although this is not my area of expertise, it seems that the real issue beneath all of the shortcomings Dvorak discusses is the application of language analysis in computing (human language technology). Thus some of the solutions could be related – algorithms for grammar checking could spot properly-spelled words out of place and also be used in OCR to analyze a sentence with an ambiguous word/character string. These may in turn relate to the quality of speech recognition. The problems in MT are more daunting but in some ways related. So, a question is, are the experts in each area approaching these with reference to the others, or as discrete and separate problems?

A final thought is that this “final frontier” – what I have sometimes referred to as “cutting edge” technologies – is particularly important for speakers of less-resourced languages in multilingual societies. MT can save costs and make people laugh in the North, but it has the potential to help save languages and make various kinds of information available to people who wouldn’t have it otherwise. Speech recognition is useful in the North, but in theory could facilitate the production of a lot of material in diverse languages that might not happen otherwise (it’s a bit more complex than that, but I’ll come back to it another time). OCR adds increments to what is available in well-resourced languages, but can make a huge difference in available materials for some less-resourced languages, for which older publications are otherwise locked away in distant libraries.

So, improvement and application of these cutting edge technologies is vitally important for people / markets not even addressed by PC Magazine. I took issue with some of what Dvorak wrote in this column but ultimately his main point is spot on in ways he might not have been thinking of.


International Year of Planet Earth

The International Year of Planet Earth (IYPE) is another of the several “Year” observances declared by the U.N. for 2008 (I previously mentioned the International Year of Languages [IYL], and will come to the others later). It actually runs from 2007 through 2009, but had its formal launch at UNESCO on 12-13 February 2008.

The aim of IYPE is given as: “to capture people’s imagination with the exciting knowledge we possess about our planet, and to see that knowledge used to make the Earth a safer, healthier and wealthier place for our children and grandchildren.” It is intended to “support research projects within defined themes focusing on Earth Sciences in the service of society.”

IYPE is described as “a joint initiative by UNESCO and the International Union of Geological Sciences (IUGS)” which involves “[t]welve Founding Partners, 26 Associate Partners and a growing number of International Partner organisations from all continents and representing all major geoscientific communities in the world,” as well as about 70 national committees.

The approach is explained on the IYPE website as:

The International Year of Planet Earth aims to ensure greater and more effective use by society of the knowledge accumulated by the world’s 400,000 Earth scientists. The Year’s ultimate goal of helping to build safer, healthier and wealthier societies around the globe is expressed in the Year’s subtitle ‘Earth science for Society’.

The International Year runs from January 2007 to December 2009, the central year of the triennium (2008) having been proclaimed by the UN General Assembly as the UN Year. The UN sees the Year as a contribution to their sustainable development targets as it promotes wise (sustainable) use of Earth materials and encourages better planning and management to reduce risks for the world’s inhabitants.

This is clearly a substantial and well-organized effort, with important potential benefits in terms of public awareness, organizational networking, and longer-term outcomes.

When considering IYPE and IYL, it is tempting to contrast the resources and planning, but without going into that, the differences seem to derive mainly from IYPE having had a kind of consortium in place fairly early in the process. I think this is an important lesson for the success of any “Year” observance: to have a dedicated organization that can help coordinate observance and activities. I’ll return to this topic later.

I’m also tempted to see potential connections between IYPE and IYL – how can the two themes be linked in specific ways to enrich the impact of each?


International Mother Language Day & International Year of Languages

Today is the ninth annual International Mother Language Day (IMLD), and the date of the official launch of the International Year of Languages (IYL). UNESCO also has a portal page for more info on IYL.
I’ve posted various information about IMLD and IYL in a special section of this site on Support for IYL 2008. It includes links to pages relating to IYL and IMLD on the UNESCO site, as well as a lot of other items and links.

The IYL offers the opportunity to make the case for various initiatives on language and linguistic diversity. One of the things I’m hoping for is progress towards a more effective “civil society” network linking organizations and initiatives with diverse but complementary purposes. More on that later.


New blog

This is my second blog. The first one – “Beyond Niamey” – began as an experiment. I was interested to see what I could do with the medium, and how I might show use of multiple languages (mainly French, Fulfulde/Pular and Bambara). Ultimately I have used and continue to use that one intermittently to (1) post about some of my work, (2) write on African languages and the information society, and (3) aggregate recent postings from a number of lists I contribute to.

This blog is to be another experiment, as a place to post on the several professional and disciplinary areas I am and have been involved in, and the connections among them. Namely:

  • Agriculture
  • Environment and natural resource management (NRM)
  • Education
  • Information and communication technologies (ICT)
  • Languages
  • Development (international, community, and rural)