Can we localize entire libraries?

How close are we to being able to localize entire libraries?

The question is not as crazy as it might seem. Projects for the “mass digitization of books” have for some years been using technology such as robots, with the idea of literally digitizing all books and entire libraries. This goes way beyond the concept of e-books championed by Michael Hart and Project Gutenberg. Currently, Google Book Search and the Open Content Alliance (OCA) seem to be the main players among a varied lot of digital library projects. Despite the closing of Microsoft’s Live Search Books, it seems likely that projects digitizing older publications, together with the routine intake of new publications (everything today is digital before it’s printed anyway), will continue to vastly expand what is available for digital libraries and book searches.

Having so much in digital form could open up other possibilities besides just searching and reading online.

Consider the field of localization, a diverse academic and professional language-related field covering translation, technology, and adaptation to specific markets. The localization industry is continually developing new capacities to render material from one language in another. Technically, this involves computer-assisted translation tools (basically translation memory and, increasingly, machine translation [MT]) and methodologies for managing content. So far the aims have focused mainly on the particular needs of companies and organizations trying to reach linguistically diverse markets; localization still plays a relatively minor role in international development and wherever markets are less lucrative.
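
For readers unfamiliar with translation memory, here is a rough sketch in Python of the basic idea: store segments that have already been translated, and when a new segment comes in, look for a stored one that is close enough to reuse. This is only an illustration, not how any particular commercial tool works. The sample segments, the `lookup` function name, and the 0.75 similarity threshold are all assumptions made for this example, and the fuzzy matching relies on the standard library's `difflib.SequenceMatcher` rather than the matcher of a real translation tool.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: English segments already translated into French.
TRANSLATION_MEMORY = {
    "Open the file.": "Ouvrez le fichier.",
    "Save the file.": "Enregistrez le fichier.",
    "Close the window.": "Fermez la fenêtre.",
}


def lookup(segment, memory=TRANSLATION_MEMORY, threshold=0.75):
    """Return (source, translation, score) for the closest stored segment,
    or None if no stored segment scores at or above the threshold."""
    best = None
    for source, translation in memory.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


if __name__ == "__main__":
    for text in ("Open the file.", "Open the files.", "Library catalogues are digitized."):
        print(text, "->", lookup(text))
```

In a workflow of this kind, an exact match would presumably be reused as it stands, a fuzzy match would be offered to a translator to edit, and a segment with no match would go to MT or to a human translator.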

I suspect, however, that the field of localization will not remain confined to any particular area. For one thing, as the technologies it is using advance, they will find diverse uses. In my previous posting on this blog, I mentioned Lou Cremers’ assertion that improving MT will tend to lead to a larger amount of text being translated. His context was work within organizations, but why not beyond?

Keep in mind also that there are academic programs now in localization, notably the Localisation Research Centre at the University of Limerick (Ireland), which by their nature will also explore and expand the boundaries of their field.

At what point might one consider harnessing the steadily improving technologies and methodologies for content localization to tap the potential inherent in vast and increasing quantities of digitized material?


10 thoughts on “Can we localize entire libraries?”

  1. Don,

    I think this is indeed a good question, and it will be interesting to see where the future of digitalization ends up. But do you think MT has really reached a level where we can digitalize and translate entire libraries of books and have decent quality (remember, non-linguists complain about one or two insignificant errors, and that is usually the higher end of the 90th percentile, which is pretty difficult to pull off in the first place, forget 100%)? It will take large leaps before MT is there.

    In addition to pioneering MT, I think for the time being there should be more emphasis placed on people translating, and on good tools being built for them (just like localization in software, with quality specialized dictionaries, translation memory applications, etc.). The technical know-how would also be of huge benefit to those involved.

    1. Thanks for this feedback. It is clear that the current state of MT is not to that level, but an optimistic reading of the trend would have it getting much better. I am thinking though not just of MT alone, but probably a more sophisticated mix of applications with vastly improved MT in the middle, drawing from (and expanding a lot on) experience with localization. Part of what I’m thinking is that while the experience base of the localization industry is far from translating books, some of their approaches (which involve people, but increasing amounts of automation and some MT already) might actually be adaptable to the task. It would be very interesting to see how far it could go.

      That said, it is admittedly hard to imagine satisfactory translations of literature coming out of an automated process, no matter how sophisticated.

  2. How about pairing books / ISBNs with their existing translations? It does not require great precision and can be done with existing technologies, even very inaccurate machine translation.

    1. Thanks, Vadim. Yes, this would seem to be a logical step to incorporate quality translations where they exist. Even without the idea of localizing mass digitized materials, pairing ISBNs could be helpful – it exists in a way in library records, at least when you are looking at a translation from the original in another tongue, but not in a systematic way.

      Oddly though, the item on “mass digitization” linked in the blog posting mentions that it might be more economical to risk duplication in scanning en masse than to determine if a duplicate exists before scanning something. Maybe a “mass localization” of digitized books would proceed on the same logic. But since one would want the best translations possible, checking for pre-existing translations would have to be part of the process.

  3. Thanks for mentioning our work at the LRC at the University of Limerick, Don. We have actually been working on the idea of Localization4all last semester with some students and experts from the industry (www.localization4all), the idea being that localisation should not remain with large multinationals who are using it as a function of their commercial globalisation efforts, but branch out to anybody who has a localisation requirement. Satisfying this requirement requires access to technology (not just MT, but incl. MT), training, and access (IPR) to digital content. We are currently planning to explore ideas around Localization4all at our upcoming 13th annual conference in Dublin (2/3 October), see See you there!?

    1. You’re welcome, Reinhard, and thank you for calling our attention to Localisation4all. Interesting initiative. There are certainly some interesting possibilities in expanding application of “L10n” methodologies.

      The library idea actually came to me in looking at a 2006 article by J.J. Britz and P.J. Lor entitled “The Role of Libraries in Combating Information Poverty in Africa” (In R.N. Sharma, ed. The Impact of Technology on Asian, African, and Middle Eastern Library Collections. Scarecrow Press). My topic is totally tangential to their article, but they do bring up local language issues in the sense not of translation or localization, but rather of, well, the library analog of local content (I finally got to read it all today while transiting around on errands).

      In the larger sense, as we know, various ICTs offer ways of expanding learning, communication, and knowledge transmission in (potentially) any and (theoretically) all languages. Facilitating their use on individual, community, CBO/NGO, and small enterprise levels, as I understand your Localisation4all effort envisions, is one dimension. Perhaps Britz and Lor’s discussion of national and local libraries doing more with local language and cultural content (technology being inevitably a tool for that) would be another dimension. Something like mass localization of mass-digitized books would be another (which starts to become supply side: what if a government decides that it wants vast amounts of health and applied science material made available in one or more of its languages?). And so on.

      I did see the CFP for LRC XIII that came to my mailbox yesterday, and would like to contribute and attend. Not sure about means and time, but am looking at the possibilities.

  4. I am with a company called Asia Online that has embarked on exactly this kind of project. We intend to convert a large amount of open source content, including things like the full English Wikipedia, Project Gutenberg and many other sources, into several Asian languages. The project combines the best SMT (statistical machine translation) with a large system infrastructure that manages web-wide collaboration, constantly improves the quality of the automated translation, and continually feeds the SMT engines with better-quality data. The whole project will also have a strong crowdsourcing aspect.

    Asia Online is undertaking the largest ever translation project, translating over 1 billion pages of content from English into Asian languages.

    Asia is a region of mixed cultures, mixed languages and mixed average per capita income ranging from extreme wealth to extreme poverty. The richer markets of China, Japan and Korea have significant industry, intellectual property and GDP, while poorer developing economies have major gaps. One of those key gaps is access to knowledge in the local language. Internet penetration in Asia as a whole has reached just 13.7%, yet already nearly 40% of all Internet users are in Asia, with forecasts expecting this to reach 50% by 2012. China recently passed the US as the largest population of Internet users, with 220 million. While Internet use increases dramatically across Asia, there is a dire lack of local language content and very few successful local websites in many significant markets. The combined content from all Asian markets outside of China, Japan and Korea makes up less than 0.03 percent of all content online. There are currently fewer than 10 million local language web pages in all of South East Asia; Indonesia has a population of 223 million and more Internet users (over 26 million) today than Australia and New Zealand combined, yet there are just 800,000 web pages in the Indonesian language. Asia Online has made delivering large volumes of content to these markets a priority.

    To bridge the knowledge gap, Asia Online has sourced 100 million pages of high quality English language content, most of which is high value and educational in nature. The content is being translated from English into Asian languages using Asia Online’s own statistical machine translation engine and a variety of techniques designed to deliver translation at an industrial scale. But technology and content alone are not enough to fill this void. The project is being delivered in each market, starting with Thailand in July 2008, as a nation building project. It is deployed as a crowdsourcing approach where users are solicited to help proofread the machine translated content, with the corrections being applied as new learning data for the statistical machine translation engine. It has gained the support of government, industry and academia due to its unique approach of helping to deliver knowledge to anyone in a country who has access to a PC. Asia Online’s approach towards nation building via crowdsourcing means that the amount of knowledge in a country is significantly improved and credit is given all around to each user who helps to improve the quality of the content. In doing so, Asia Online has become not only the world’s largest translation project, but also the largest literacy project ever undertaken.

    We will launch the Thai site by August and expect to see widespread support of this effort from the Thai educational community.

    1. Thanks, Kirti. Asia Online’s translation project sounds quite exciting, and is along the lines that I was thinking when I wrote this blog entry. The association with literacy that you mention is important, as is the reference to the discrepancy between the language of web content (mainly English) and the first language of users (mainly and increasingly other tongues).

      It does seem that we are on the verge of a pretty significant cascade of changes impelled by the mass amount of digitized data available. In another context, Wired magazine’s Chris Anderson terms the phenomenon the “Petabyte Age.” What are its implications for language and languages?
