After digitizing libraries, translating knowledge?

Nine years ago I asked the question “Can we localize entire libraries?” In the wake of the National Library of Norway’s (Nasjonalbiblioteket) digitization of its holdings and collaboration with Nigeria on materials in the latter’s languages, it seems time to review what mass digitization could mean for translation of knowledge into diverse languages.

My original question in 2008 came from looking at trends in digitizing books – notably Google Books – and machine translation (MT). It elicited some interesting responses, including Kirtee Vashi’s mention of an Asia Online project planning to link these two trends.

Book scanners. Source: TecheBlog.com

In the ensuing years, Google Books’ digitization program – the biggest and most promising book digitization effort – ran into controversy, beginning in 2008, over rights to reproduce copyrighted materials. This has ultimately put their entire vision of digital access to a vast library of works in doubt. And the unrelated Asia Online project, which used statistical MT to translate 3.5 million pages of the English Wikipedia into Thai, was stopped in mid-2011 in the aftermath of a changed political situation in Thailand and funding issues. (Asia Online has since become Omniscien Technologies.)

So while the technologies for digitization and for MT – the two pieces in localizing libraries of information – are established and improving, each has encountered some combination of legal, political, or funding issues limiting their use individually for mass expansion of access to knowledge, as well as their potential use in tandem.

However, could the Norwegian program, announced in 2013, and the project it has with Nigeria, announced earlier this year, introduce a new dynamic, at least for mass digitization? Could and should large national libraries take the lead in this area?

The idea of digitizing libraries has generally been advocated in terms of access to knowledge, without particular reference to the languages in which publications are written. But languages are critical not only for access to knowledge, but also for facilitating scholarship and the interfacing of ways of knowing. Hence the need to associate mass digitization and MT.

There is at least one proposed project mentioning the potential for translation of books that have been digitized – Internet Archive’s initiative to digitize 4 million books (a semifinalist in the MacArthur Foundation’s 100 & Change grant competition).

Any such digital data produced by the Nasjonalbiblioteket, Google Books, Internet Archive, or any other organization could be translated with MT into other languages, with a few caveats (quality of optical character recognition [OCR]; how well resourced a particular language is; and of course the accuracy of the MT). This means that potentially any mass digitization could be mass translated into a large number of languages, given legal cover and sufficient funding.

What about the accuracy of MT, and how useful could mass MT of mass digitization be if there are inaccuracies? These are critical questions for any project to use MT to translate digitizations. Responses could reference, for instance, domain-specific MT, which is generally more accurate than general MT, provided of course that the material matches the domain used. Or perhaps some system for post-editing could be devised.

This is an exciting area that needs more attention and policy support. Books and other printed works can be digitized on a mass scale, making the knowledge in them more widely available. Digitized text can be machine translated into other languages, and the quality of the translations can be made high enough for use by speakers of the target languages. Just as the printing press revolutionized access to knowledge in its age, so too the potential to digitize and translate what is in print promises another revolution benefiting more people directly.



Can we localize entire libraries?

How close are we to being able to localize entire libraries?

The question is not as crazy as it might seem. Projects for “mass digitization of books” have been using technology like robots for some years already with the idea of literally digitizing all books and entire libraries. This goes way beyond the concept of e-books championed by Michael Hart and Project Gutenberg. Currently, Google Book Search and the Open Content Alliance (OCA) seem to be the main players among a varied lot of digital library projects. Despite the closing of Microsoft’s Live Search Books, it seems like projects digitizing older publications, plus appropriate cycling of new publications (everything today is digital before it’s printed anyway), will continue to expand vastly what is available for digital libraries and book searches.

The fact of having so much in digital form could open other possibilities besides just searching and reading online.

Consider the field of localization, which is actually a diverse academic and professional language-related field covering translation, technology, and adaptation to specific markets. The localization industry is continually developing new capacities to render material from one language in another. Technically this involves computer-assisted translation tools (basically translation memory and, increasingly, machine translation [MT]) and methodologies for managing content. The aims heretofore have been pretty focused on particular needs of companies and organizations to reach linguistically diverse markets (localization is still relatively minor in international development and wherever markets are not so lucrative).

I suspect, however, that the field of localization will not remain confined to any particular area. For one thing, as the technologies it is using advance, they will find diverse uses. In my previous posting on this blog, I mentioned Lou Cremers’ assertion that improving MT will tend to lead to a larger amount of text being translated. His context was work within organizations, but why not beyond?

Keep in mind also that there are academic programs now in localization, notably the Localisation Research Centre at the University of Limerick (Ireland), which by their nature will also explore and expand the boundaries of their field.

At what point might one consider harnessing the steadily improving technologies and methodologies for content localization to the potential inherent in vast and increasing quantities of digitized material?