Tag Archives: MT

After digitizing libraries, translating knowledge?

Nine years ago I asked the question “Can we localize entire libraries?” In the wake of the National Library of Norway‘s (Nasjonalbiblioteket) digitization of its holdings and collaboration with Nigeria on materials in the latter’s languages, it seems like time to review what mass digitization could mean for translation of knowledge into diverse languages.

My original question in 2008 came from looking at trends in digitizing books – notably Google Books – and machine translation (MT). It elicited some interesting responses, including Kirtee Vashi’s mention of an Asia Online project planning to link these two trends.

Book scanners. Source: TecheBlog.com

In the ensuing years, Google Books’ digitization program – the biggest and most promising book digitization effort – ran into controversy over rights to reproduce copyrighted materials beginning in 2008. This ultimately has put their entire vision of digital access to a vast library of works in doubt. And the unrelated Asia Online project, which used statistical MT to translate 3.5 million pages of the English Wikipedia into Thai, was stopped in mid-2011 in the aftermath of a changed political situation in Thailand and funding issues.  (Asia Online has since become Omniscien Technologies)

So while the technologies for digitization and for MT – the two pieces in localizing libraries of information – are established and improving, each has encountered some combination of legal, political , or funding issues limiting their use individually for mass expansion of access to knowledge, as well as their potential use in tandem.

However, could the Norwegian program, announced in 2013, and the project it has with Nigeria, announced earlier this year, introduce a new dynamic, at least for mass digitization? Could and should large national libraries take the lead in this area?

The idea of digitizing libraries has generally been advocated in terms of access to knowledge, without particular reference to the languages in which publications are written. But languages are critical not only for access to knowledge, but also for facilitating scholarship and the interfacing of ways of knowing. Hence the need to associate mass digitization and MT.

There is at least one proposed project mentioning the potential for translation of books that have been digitized – Internet Archive’s initiative to digitize 4 million books (a semifinalist in the MacArthur Foundation’s 100 & Change grant competition).

Any such digital data produced by the Nasjonalbiblioteket, Google Books, Internet Archive, or any other organization could be translated with MT into other languages, with a few caveats (quality of optical character recognition [OCR]; how well resourced a particular language is; and of course the accuracy of the MT). This means that potentially any mass digitization could be mass translated into a large number of languages, given legal cover and sufficient funding.

What about the accuracy of MT, and how useful could mass MT of mass digitization be if there are inaccuracies? These are critical questions for any project to use MT to translate digitizations. Responses could reference, for instance, domain-specific MT, which is generally more accurate than general MT, provided of course that the material matches the domain used. Or perhaps some system for post-editing could be devised.

This is an exciting area that needs more attention and policy support. Books and other production in print can be digitized on a mass scale, making the knowledge in them more widely available. Digitized text can be machine translated into other languages, and the quality of that can be made high enough for use by speakers of the target languages. As much as the printing press revolutionized access to knowledge of that age, so too the potential to digitize and translate what is in print promises another revolution benefiting more people directly.



Can we localize entire libraries?

How close are we to being able to localize entire libraries?

The question is not as crazy as it might seem. Projects for “mass digitization of books” have been using technology like robots for some years already with the idea of literally digitizing all books and entire libraries. This goes way beyond the concept of e-books championed by Michael Hart and Project Gutenberg. Currently, Google Book Search and the Open Content Alliance (OCA) seem to be the main players among a varied lot of digital library projects. Despite the closing of Microsoft’s Live Search, it seems like projects digitizing older publications plus appropriate cycling of new publications (everything today is digital before it’s printed anyway) will continue to expand vastly what is available for digital libraries and book searches.

The fact of having so much in digital form could open other possibilities besides just searching and reading online.

Consider the field of localization, which is actually a diverse academic and professional language-related field covering translation, technology, and adaptation to specific markets. The localization industry is continually developing new capacities to render material from one language in another. Technically this involves computer assisted translation tools (basically translation memory and increasingly, machine translation [MT]) and methodologies for managing content. The aims heretofore have been pretty focused on particular needs of companies and organizations to reach linguistically diverse markets (localization is relatively minor still in international development, and where markets are not so lucrative).

I suspect however that the field of localization will not remain confined to any particular area. For one thing, as the technologies it is using advance, they will find diverse uses. In my previous posting on this blog, I mentioned Lou Cremers‘ assertion that improving MT will tend to lead to a larger amount of text being translated. His context was work within organizations, but why not beyond?

Keep in mind also that there are academic programs now in localization, notably the Localisation Research Centre at the University of Limerick (Ireland), which by their nature will also explore and expand the boundaries of their field.

At what point might one consider harnessing of the steadily improving technologies and methodologies for content localization to the potential inherent in vast and increasing quantities of digitized material?


Paradigm shift on machine translation?

Multilingual #95, coverThe April-May issue of Multilingual, which I’m just catching up with, features seven articles on machine translation (MT). Having a long term interest in this area (which is not to say any expertise) and in its potential for less-widely spoken languages, and having broached the topic on this blog once previously, I thought I’d take a moment to briefly review these articles. They are (links lead to abstracts for non-subscribers):

The evolution of machine translation — Jaap van der Meer
Machine translation: not a pseudoscience — Vadim Berman
Putting MT to work — Lou Cremers
Monolingual translation: automated post-editing — Hugh Lawson-Tancred
Machine translation: is it worth the trouble? — Kerstin Berns & Laura Ramírez
Challenges of Asian-language MT — Dion Wiggins & Philipp Koehn
Advanced automatic MT post-editing — Rafael Guzmán

In the first of the articles, Jaap van der Meer characterizes changes in attitudes about MT over the last 4 years as “revolutionary” — a move “from complete denial [of MT’s utility] to complete acceptance.” What happened? The answer seems to be a number of events and changes rather than a single triggering factor, perhaps an evolution to a “tipping point” of sorts. There have been ongoing improvements in MT, there was the establishment of the Translation Automation User Society (TAUS) in 2004 which “helped stimulate a positive mindset towards MT,” and the empowerment of internet users in the use of MT. Van der Meer also points out a shift in emphasis from finding “fully automated high quality translation” (FAHQT) to what he calls “fully automated useful translation” (FAUT – an acronym that presumably should not be read in French). The latter is not only a more realistic goal, but also one that reflects needs and uses in many cases.

As for the future, van den Meer sees a “shift from traditional national languages to ever more specialized technical languages.” My question is whether we can at the same time also see significant moves for less widely spoken languages.

Van den Meer’s article sets the tone and has me asking if indeed we are at a point where a fundamental shift is occurring the way we think of MT. The other articles look at specific issues.

Vadim Berman looks at some hurdles to making MT work, highlighting the importance of educating users – including mention of a recurrent theme: the importance of clean text going into the translation.

Two of the articles, by Lou Cremers and by Kerstin Berns and Laura Ramírez, discuss the practical value of MT in enterprise settings.

Cremers has some interesting thoughts about the utility of MT in an enterprise setting, something that has long seemed impractical, certainly when compared to translation memory (TM). He begins by noting that “a high end MT system will really work if used correctly, and may save a considerable amount of time and money,” and then procedes to discuss several factors he sees as key to getting good ROI: terminologies and dictionaries; quality input text; volume (pointing out among other things the fact that good MT will tend to lead to a larger amount of text being translated – a key point for considering the value of MT in other spheres of activity I might add); and workflow.

The “correct use” of MT relates largely to the quality of the text: “surprisingly simple writing rules governing he use of articles and punctuation marks will drastically improve MT output.”

Cremers offers a summation which seems to speak for several of the articles:

It’s not the absolute quality of the MT output that is important, but rather how much time it saves the translator in completing the task. In that way it is not different from TM. In both cases, human intervention is needed to produce high-quality translations.

Berns and Ramírez walk through the costs and benefits of MT in a business context. Here the issue is investing in a system but the reasoning could be applicable to different settings. They suggest that the kind of material to be translated is (unsurprisingly) a good guide to the potential utility of MT:

Do you have large text volumes with very short translation times and a high terminology density? Then it is very likely that MT will be a good solution for you. On the other hand, if you have small text volumes with varying text types and complex sentence structures, then it probably will bu too much effot to set up an effective process.

Two of the articles, by Hugh Lawson-Tancred and Rafael Guzmán, discuss “post-editing” as a tool to improve the output of MT.

Lawson-Tancred suggests – contrary to several of the other authors – that the utility of preparing the text going into MT may not be so critical, and that “the monolingual environment of the post-editor is a better place to smooth out the wrinkles of the translation process….” Interestingly, this concept focuses on context, with the basic unit for processing being 5-20 words (that is between the word level of dictionaries and whole sentences). His concludes by speculating that automated post-editing could “develop into a whole new area of applied computational linguistics.”

Guzmán, who has written a number of other articles on post-editing, discusses the use of TM in the context of verifying (post-editing) the product of MT. This basically involves ways of lining up texts in the source and translated languages for context and disambiguation. There are several examples using Spanish and English.

Finally, Dion Wiggins and Philipp Koehn discuss MT involving Asian languages, which most often entails different scripts. There are examples from several Asian languages illustrating the challenges involved.

This is an interesting set of articles to read to get a sense of the current state of the art as regards the application and applied research on MT. It’s a bit of a stretch for a non-specialist with limited context like me to wrap his mind around the ensemble of technical concepts and practices. One does come away, though, with the impression that MT is already a practical tool for a range of real-world tasks, and that we will be seeing much more widespread and sophisticated uses of it, often in tandem with allied applications (notably TM and post-editing). Are we seeing a paradigm shift in attitudes about MT?

At this time I’d really like to see a program to encourage young computer science students from diverse linguistic backgrounds in developing countries and indigenous communities to get into the field of research on MT. I’m convinced that it has the potential if approached strategically to revolutionize the prospects for minority languages and the ways we think about “language barriers.” That is more than just words – it has to do with education, knowledge and enhanced modes of communication. By extension, the set of human language technologies of which MT is a part, can in one way or another play a significant role in the evolution of linguistic diversity and common language(s) over the coming generations.


Reflecting on “Computing’s Final Frontier”

In the March issue of PC Magazine, John Dvorak comments on four areas of computer technology in his column entitled “Computing’s Final Frontier“: voice recognition; machine translation (MT); optical character recognition (OCR); and spell-checkers. Basically he’s decrying how little progress has been made on these in recent years relative to the vast improvements in computer capacities.

I’d like to comment briefly on all four. Two of those – voice recognition, or actually speech recognition, and MT – are areas that I think have particular importance and potential for non-dominant languages (what I’ve referred to elsewhere as “MINELs,” for minority, indigenous, national, endangered or ethnic, and local languages) including African languages on which I’ve been focusing. OCR is key to such work as getting out-of-print books in MINELs online. And spell-checkers are fundamental.

Voice recognition. Dvorak seems to see the glass half empty. I can’t claim to know the technology as he does, and maybe my expectations are too low, but from what I’ve seen of Dragon NaturallySpeaking, the accuracy of speech recognition in that specific task environment is quite excellent. We may do well to separate out two kinds of expectations: one, the ability of software to act as an accurate and dutiful (though at times perhaps a bit dense) scribe, and the other as something that can really analyze the language. For some kinds of production, the former is already useful. I’ll come back to the topic of software and language analysis towards the end of this post.

Machine translation. I’ve had a lot of conversations with people about MT, and a fair amount of experience with some uses of it. I’m convinced of its utility even today with its imperfections. It’s all too easy, however, to point out the flaws and express skepticism. Of course anyone who has used MT even moderately has encountered some hilarious results (mine include English to Portuguese “discussion on fonts” becoming the equivalent of “quarrels in baptismal sinks,” and the only Dutch to English MT I ever did which yielded “butt zen” from what I think was a town name). But apart from such absurdities, MT can do a lot – I’ll enjoy the laughs MT occasionally provides and take advantage of the glass half full here too.

But some problems with MT results are not just inadequacies of the programs. From my experience using MT, I’ve come to appreciate the fact that the quality of writing actually makes a huge difference in MT output. Run-on sentences, awkward phrasing, poor punctuation and simple spelling errors can confuse people, so how can MT be expected to do better?

Dvorak also takes a cheap shot when he considers it a “good gag” to translate with MT through a bunch of languages back to the original. Well you can get the same effect with the old grapevine game of whispering a message through a line of people and see what you get at the end – in the same language! At my son’s school they did a variant of this with a simple drawing seen and resketched one student at a time until it got through the class. If MT got closer to human accuracy you’d still have such corruption of information.

A particularly critical role I see for MT is in streamlining the translation of various materials into MINELs and among related MINELs, using work systems that involve perhaps different kinds of MT software as well as people to refine the products and feedback into improvements. In my book, “smart money” would take this approach. MT may never replace the human translator, but it can do a lot that people can’t.

Optical character resolution. Dvorak finds fault with OCR, but I have to say that I’ve been quite impressed with what I’ve seen. The main problems I’ve had have been with extended Latin characters and limited dictionaries – and both of those are because I’m using scanners at commercial locations, not on machines where I can make modifications. In other words I’d be doing better than 99% accuracy for a lot of material if I had my own scanners.

On the other hand, when there are extraneous marks – even minor ones – in the text, the OCR might come up with the kind of example Dvorak gives of symbols mixed up with letters. If you look at the amazing work that has been done with Google Patent Search, you’ll notice on older patents a fair amount of misrecognized character strings (words). So I’d agree that it seems like one ought to be able to program the software to be able to sort out characters and extraneous marks through some systematic analysis (a series of algorithms?) – picking form out of noise, referencing memory of texts in the language, etc.

In any event, enhancing OCR would help considerably with more digitization, especially as we get to digitizing publications in extended Latin scripts on stenciled pages and poor quality print of various sorts too often used for materials in MINELs.

Spell-checkers. For someone like me concerned with less-resourced languages, the issues with spell-checkers are different and more basic – so let me get that out of the way first. For many languages it is necessary to get a dictionary together first, and that may have complications like issues of standard orthographies and spellings, variant forms, and even dictionary resources being copyrighted.

In the context of a super-resourced language like English, Dvorak raises a very valid criticism here regarding how the wrong word correctly spelled is not caught by the checker. However, it seems to me that the problem would be appropriately addressed by a grammar-checker, which should spot words out of context.

This leads to the question of why we don’t have better grammar-checkers? I recall colleagues raving in the mid-90s about the then new WordPerfect Grammatik, but it didn’t impress me then (nevertheless, one article in 2005 found it was further along than Word’s grammar checker). The difference is more than semantic – grammar checkers rely on analysis of language, which is a different matter than checking character strings against dictionary entries (i.e., spell-checkers).

Although this is not my area of expertise, it seems that the real issue beneath all of the shortcomings Dvorak discusses is the applications of analysis of language in computing (human language technology). Thus some of the solutions could be related – algorithms for grammar checking could spot properly-spelled words out of place and also be used in OCR to analyze a sentence with an ambiguous word/character string. These may in turn relate to the quality of speech recognition. The problems in MT are more daunting but in some ways related. So, a question is, are the experts in each area approaching these with reference to the others, or as discrete and separate problems?

A final thought is that this “final frontier” – what I have sometimes referred to as “cutting edge” technologies – is particularly important for speakers of less-resourced languages in multilingual societies. MT can save costs and make people laugh in the North, but it has the potential to help save languages and make various kinds of information available to people who wouldn’t have it otherwise. Speech recognition is useful in the North, but in theory could facilitate the production of a lot of material in diverse languages that might not happen otherwise (it’s a bit more complex than that, but I’ll come back to it another time). OCR adds increments to what is available in well-resourced languages, but can make a huge difference in available materials for some less-resourced languages, for which older publications are otherwise locked away in distant libraries.

So, improvement and application of these cutting edge technologies is vitally important for people / markets not even addressed by PC Magazine. I took issue with some of what Dvorak wrote in this column but ultimately his main point is spot on in ways he might not have been thinking of.