All posts by Don

Mass digitization and oral traditions

In the previous post, I looked at a possible ramification of “mass digitization” of text. But what about the spoken word? And more precisely, verbal presentations, performances, and broadcasts in languages often described as having “oral traditions” (and generally having less material in writing)? Can we do something meaningful to capture significant amounts of speech in such languages in digital recordings?

There are some projects to digitize older recordings on tape, and certainly a need for more effort in this area, but what I am thinking of here is recording contemporary use of language that is normally ephemeral (gone once uttered), along with gaining access to recordings of spoken language that may not be publicly accessible. One place to start might be community radio stations in regions where less-resourced languages are spoken.

The object would be to build digital audio libraries for diverse languages that don’t have much, if any, publication in text. This could permit various kinds of work. In the case of endangered tongues, such recording would fall under the rubric of language documentation (for preservation and perhaps revitalization), but what I am suggesting is a resource for language development for languages spoken by wider communities.

Digital audio is more than just a newer format for recording. As I understand it, digital storage of audio has some qualitative differences, notably the potential to search by sound (without the intermediary of writing) and eventually, one presumes, to be manipulated and transformed in various ways (including rendering in text). Such a resource could be of use in other ways, such as collecting information on things like emerging terminologies in popular use (a topic that has interested me since hearing how community radio stations in Senegal come up with ways to express various new concepts in local languages). Altogether, digital audio seems to have the potential to be used in more ways than we are used to thinking about in reference to sound recordings.

Put another way, recordings can be transcribed and serve as “audio corpora” in the established way. But what if one had large volumes of untranscribed digital recordings, along with the potential to search the audio directly (without text) and later to convert it into text? (Accuracy in the latter area, which would not involve the speaker-specific training required by current speech-to-text programs, will be one of the challenges.)
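To make the idea of searching audio by sound, without any transcription, a bit more concrete, here is a toy sketch. Real systems match spectral features such as MFCCs; this illustration simply reduces each signal to per-frame energies and slides a query’s profile over a recording’s to find the best match. Everything here (the function names, frame size, and the energy-based fingerprint) is an illustrative assumption, not any particular system’s method.

```python
# Toy illustration of matching a spoken query against a longer recording
# by comparing coarse acoustic "fingerprints" (per-frame energies),
# with no text or transcription involved at any point.

def frame_energies(samples, frame_size=4):
    """Summarize a signal as a sequence of per-frame energies."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def best_match_offset(query, recording, frame_size=4):
    """Slide the query's energy profile over the recording's and return
    the frame offset where the profiles differ least."""
    q = frame_energies(query, frame_size)
    r = frame_energies(recording, frame_size)
    return min(
        range(len(r) - len(q) + 1),
        key=lambda off: sum(abs(a - b) for a, b in zip(q, r[off:off + len(q)])),
    )
```

A real audio-search system would work on spectral features and handle variation in speed, pitch, and noise, but the shape of the problem — compare sound to sound, return a location — is the same.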

Can digital technology do for audio content something analogous to what it can do for text? What sort of advantages might such an effort bring for education and development in communities which use less-resourced languages? Could it facilitate the emergence of “neo-oral” traditions that integrate somehow with developing literate traditions in the same languages?


Can we localize entire libraries?

How close are we to being able to localize entire libraries?

The question is not as crazy as it might seem. Projects for “mass digitization of books” have been using technologies such as robotic scanners for some years already, with the idea of literally digitizing whole collections and entire libraries. This goes way beyond the concept of e-books championed by Michael Hart and Project Gutenberg. Currently, Google Book Search and the Open Content Alliance (OCA) seem to be the main players among a varied lot of digital library projects. Despite the closing of Microsoft’s Live Search, it seems likely that projects digitizing older publications, plus appropriate cycling in of new publications (everything today is digital before it’s printed anyway), will continue to expand vastly what is available for digital libraries and book searches.

The fact of having so much in digital form could open other possibilities besides just searching and reading online.

Consider the field of localization, which is actually a diverse academic and professional language-related field covering translation, technology, and adaptation to specific markets. The localization industry is continually developing new capacities to render material from one language in another. Technically this involves computer-assisted translation tools (basically translation memory and, increasingly, machine translation [MT]) and methodologies for managing content. The aims heretofore have been pretty focused on the particular needs of companies and organizations to reach linguistically diverse markets (localization still plays a relatively minor role in international development and in markets that are not so lucrative).
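As a rough illustration of the translation-memory side of these tools: a TM stores previously translated segments and, for each new source segment, retrieves the closest stored match above some similarity threshold. The sketch below shows the core idea in a few lines; the memory contents, threshold, and similarity measure are illustrative assumptions, not any product’s actual behavior.

```python
# Minimal sketch of translation-memory (TM) lookup: retrieve the stored
# translation whose *source* segment best matches the new segment.
import difflib

def tm_lookup(segment, memory, threshold=0.7):
    """Return (translation, similarity) for the closest stored source
    segment, or (None, 0.0) if nothing is similar enough."""
    best_src, best_score = None, 0.0
    for src in memory:
        score = difflib.SequenceMatcher(None, segment, src).ratio()
        if score > best_score:
            best_src, best_score = src, score
    if best_score >= threshold:
        return memory[best_src], best_score
    return None, 0.0

# A tiny English-to-French memory (contents purely illustrative).
memory = {
    "Close the file before exiting.": "Fermez le fichier avant de quitter.",
    "Save your work frequently.": "Enregistrez votre travail fréquemment.",
}
```

Real TM systems work on aligned segment pairs with fuzzy-match scoring tuned for translation, but the leverage-past-work-or-fall-back-to-MT pattern is essentially this.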

I suspect, however, that the field of localization will not remain confined to any particular area. For one thing, as the technologies it uses advance, they will find diverse uses. In my previous posting on this blog, I mentioned Lou Cremers’ assertion that improving MT will tend to lead to a larger amount of text being translated. His context was work within organizations, but why not beyond?

Keep in mind also that there are academic programs now in localization, notably the Localisation Research Centre at the University of Limerick (Ireland), which by their nature will also explore and expand the boundaries of their field.

At what point might one consider harnessing of the steadily improving technologies and methodologies for content localization to the potential inherent in vast and increasing quantities of digitized material?


Paradigm shift on machine translation?

The April-May issue of Multilingual (#95), which I’m just catching up with, features seven articles on machine translation (MT). Having a long-term interest in this area (which is not to say any expertise) and in its potential for less widely spoken languages, and having broached the topic on this blog once previously, I thought I’d take a moment to briefly review these articles. They are (links lead to abstracts for non-subscribers):

The evolution of machine translation — Jaap van der Meer
Machine translation: not a pseudoscience — Vadim Berman
Putting MT to work — Lou Cremers
Monolingual translation: automated post-editing — Hugh Lawson-Tancred
Machine translation: is it worth the trouble? — Kerstin Berns & Laura Ramírez
Challenges of Asian-language MT — Dion Wiggins & Philipp Koehn
Advanced automatic MT post-editing — Rafael Guzmán

In the first of the articles, Jaap van der Meer characterizes changes in attitudes about MT over the last four years as “revolutionary” — a move “from complete denial [of MT’s utility] to complete acceptance.” What happened? The answer seems to be a number of events and changes rather than a single triggering factor, perhaps an evolution to a “tipping point” of sorts. There have been ongoing improvements in MT; there was the establishment of the Translation Automation User Society (TAUS) in 2004, which “helped stimulate a positive mindset towards MT”; and there has been the empowerment of internet users in the use of MT. Van der Meer also points out a shift in emphasis from seeking “fully automated high quality translation” (FAHQT) to what he calls “fully automated useful translation” (FAUT – an acronym that presumably should not be read in French). The latter is not only a more realistic goal, but also one that reflects needs and uses in many cases.

As for the future, van der Meer sees a “shift from traditional national languages to ever more specialized technical languages.” My question is whether we can at the same time also see significant moves for less widely spoken languages.

Van der Meer’s article sets the tone and has me asking whether we are indeed at a point where a fundamental shift is occurring in the way we think of MT. The other articles look at specific issues.

Vadim Berman looks at some hurdles to making MT work, highlighting the importance of educating users – and mentioning a recurrent theme: the need for clean text going into the translation.

Two of the articles, by Lou Cremers and by Kerstin Berns and Laura Ramírez, discuss the practical value of MT in enterprise settings.

Cremers has some interesting thoughts about the utility of MT in an enterprise setting, something that has long seemed impractical, certainly when compared to translation memory (TM). He begins by noting that “a high end MT system will really work if used correctly, and may save a considerable amount of time and money,” and then proceeds to discuss several factors he sees as key to getting good ROI: terminologies and dictionaries; quality input text; volume (pointing out, among other things, that good MT will tend to lead to a larger amount of text being translated – a key point, I might add, for considering the value of MT in other spheres of activity); and workflow.

The “correct use” of MT relates largely to the quality of the text: “surprisingly simple writing rules governing the use of articles and punctuation marks will drastically improve MT output.”

Cremers offers a summation which seems to speak for several of the articles:

It’s not the absolute quality of the MT output that is important, but rather how much time it saves the translator in completing the task. In that way it is not different from TM. In both cases, human intervention is needed to produce high-quality translations.

Berns and Ramírez walk through the costs and benefits of MT in a business context. Here the issue is investing in a system but the reasoning could be applicable to different settings. They suggest that the kind of material to be translated is (unsurprisingly) a good guide to the potential utility of MT:

Do you have large text volumes with very short translation times and a high terminology density? Then it is very likely that MT will be a good solution for you. On the other hand, if you have small text volumes with varying text types and complex sentence structures, then it probably will be too much effort to set up an effective process.

Two of the articles, by Hugh Lawson-Tancred and Rafael Guzmán, discuss “post-editing” as a tool to improve the output of MT.

Lawson-Tancred suggests – contrary to several of the other authors – that preparing the text going into MT may not be so critical, and that “the monolingual environment of the post-editor is a better place to smooth out the wrinkles of the translation process….” Interestingly, this concept focuses on context, with the basic unit for processing being 5-20 words (that is, between the word level of dictionaries and whole sentences). He concludes by speculating that automated post-editing could “develop into a whole new area of applied computational linguistics.”

Guzmán, who has written a number of other articles on post-editing, discusses the use of TM in the context of verifying (post-editing) the product of MT. This basically involves ways of lining up texts in the source and translated languages for context and disambiguation. There are several examples using Spanish and English.
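The idea of lining up source and translated texts for review can be hinted at with a small sketch. To be clear, this is not Guzmán’s procedure: it simply pairs segments one-to-one and flags pairs whose character-length ratio looks unusual – a crude cue, borrowed from length-based alignment methods, that a machine-translated segment may need closer post-editing. The function name and thresholds are illustrative assumptions.

```python
# Toy post-editing aid: pair source and target sentences one-to-one and
# flag pairs whose target/source length ratio falls outside a plausible
# band, as a cheap signal of possible mistranslation or omission.

def flag_suspect_pairs(source_sents, target_sents, low=0.5, high=2.0):
    """Return indices of sentence pairs whose character-length ratio
    is outside [low, high]."""
    suspects = []
    for i, (src, tgt) in enumerate(zip(source_sents, target_sents)):
        ratio = len(tgt) / max(len(src), 1)
        if not (low <= ratio <= high):
            suspects.append(i)
    return suspects
```

A real alignment tool would allow one-to-many and many-to-one pairings and use language-pair-specific length models, but even this crude check shows how parallel texts can be screened mechanically before a human post-editor looks at them.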

Finally, Dion Wiggins and Philipp Koehn discuss MT involving Asian languages, which most often entails different scripts. There are examples from several Asian languages illustrating the challenges involved.

This is an interesting set of articles to read to get a sense of the current state of the art as regards the application and applied research on MT. It’s a bit of a stretch for a non-specialist with limited context like me to wrap his mind around the ensemble of technical concepts and practices. One does come away, though, with the impression that MT is already a practical tool for a range of real-world tasks, and that we will be seeing much more widespread and sophisticated uses of it, often in tandem with allied applications (notably TM and post-editing). Are we seeing a paradigm shift in attitudes about MT?

At this time I’d really like to see a program to encourage young computer science students from diverse linguistic backgrounds in developing countries and indigenous communities to get into the field of research on MT. I’m convinced that, if approached strategically, it has the potential to revolutionize the prospects for minority languages and the ways we think about “language barriers.” That is more than just words – it has to do with education, knowledge, and enhanced modes of communication. By extension, the set of human language technologies of which MT is a part can, in one way or another, play a significant role in the evolution of linguistic diversity and common language(s) over the coming generations.


Distant rumblings

A quick personal note about the major earthquake in Sichuan, China. The epicenter was just west of Chengdu – the city where my wife and son are – but they are fine, as are her extended family there and their building.

I was actually in Bamako, Mali, when it happened, and heard about it at the end of a long day of meetings. Since it was the wee hours of the morning in Chengdu, I could not call, but I went to the business center of the Grand Hôtel, where I was staying. My son’s school’s website merely said it would be closed the next day – so the building was intact, unlike at least one school to the west near Dujiangyan. News reports seemed to indicate that Chengdu city had not suffered the kind of devastation experienced in the mountainous regions. Late that night, I was finally able to call and get through to verify that all were okay.

So for now I’m back in the Washington area and we’ll continue this tricontinental living until the summer as planned.

We’d done some traveling out in the areas that are now in the news. Chengdu is on a plain, a “land of abundance” with ample waters controlled by the ancient weir (dam) at Dujiangyan. But mountains rise dramatically after that, and beyond is the Tibetan plateau. Some very striking country – areas you go through and feel like one of the small figures in those wide Chinese landscape paintings. One of the places I’ve been thinking a lot about is a tiny community perched on the side of an impossibly steep mountain slope, across the valley from a village we once stayed in, in an ethnically Tibetan area of Sichuan. I can’t remember the name of the place, but one hopes that mountain didn’t shrug too hard. There are so many places out there, and their structures, and their inhabitants, who must be suffering and struggling right now, or who perished.


Opening of the National Museum of Language

I had the opportunity yesterday (April 29) to visit the National Museum of Language (NML) in College Park, Maryland (U.S.) by invitation for a special preview day. The museum formally opens to the public on Saturday, May 3.

Although the name gives the impression that it is government-owned, like the various other “national” museums and galleries in Washington, DC, it is actually a private effort by a small non-profit organization. It’s also physically rather small, with basically one main exhibit room and a small suite containing some more displays, a meeting room for activities and classes, and a small office. But it is a beginning that was a long time coming – apparently the concept goes back to 1971, and the actual organization began in 1997 (and was incorporated ten years ago).

NML’s mission is described as:

The mission of The National Museum of Language is to enhance understanding of all aspects of language in history, contemporary affairs, and the future.

… and it intends to “[foster] the study of the nature of language, its development, and its role and importance in society, and by exploring linguistic problems and ways of overcoming them” in order to serve people in diverse pursuits and walks of life, and to promote understanding.

The first exhibit of NML – “Writing Language: Passing It On” – focuses on writing systems. It has some nice examples and some interactive computer programs:

The opening exhibit … will show both alphabetic writing systems (Arabic, Latin, Greek, and Hebrew) and logographic writing systems (Chinese and Japanese). [from the press release]

In addition to looking at the exhibits, I also had the chance to talk with several of the principal leaders of the museum, notably Dr. Amelia Murdoch, the president, and Drs. Pat Barr-Harrison and Jill Robbins of the board of directors. They shared some ideas and plans about the museum project. Eventually they and their colleagues hope to be able to move into a facility of their own – either something existing or new, like the image displayed on the NML website.

The opening of the NML is a significant step, even if a small one, and hopefully it will get more attention and funding to realize its potential. Symbolically it is nice that it occurred in the International Year of Languages (IYL).

The NML is also one of only a handful of institutions around the world devoted to languages as a whole. One of the ideas that I’m personally interested in is finding a way to link these “language museums” in a productive network. In fact, this is an emerging category of museum that in its broadest sense might also include language-specific museums. The IYL would seem to be an ideal time to develop connections and put in place structures that can facilitate collaboration and assistance to new initiatives for language-related displays and institutes for public awareness and learning.


“The World of Soy” due out soon

The World of Soy, a new volume of articles about the uses of soybeans in various parts of the world, is nearing publication. It is edited by Christine M. Du Bois, Chee-Beng Tan, and Sidney Mintz, and the publisher is the University of Illinois Press (UIP).

This is the result of a project that goes back to a panel on soybeans worldwide held at the 8th Symposium on Chinese Dietary Culture in Chengdu, China on 17-19 October 2003.

The panel was a special project: the papers were destined for a book rather than the proceedings, and the presenters all had a focus on soybeans. Some papers, like mine on soybean use in West Africa, had little to do directly with Chinese dietary culture – the central link of course being the origin of the crop, and the broader interest to the symposium being comparative aspects of the use of soy among diverse cultures.

The book that is about to be published, like the individual papers in it, has evolved a bit from the original work of five years ago. Among other things, there are additional contributions. Taken as a whole, the publicity notes, the book

discusses important issues central to soy production and consumption: genetically engineered soybeans, increasing soybean cultivation, soyfood marketing techniques, the use of soybeans as an important soil restorative, and the rendering of soybeans for human consumption.

Although the table of contents is not listed in UIP’s current publicity, Christine Du Bois kindly supplied me with it so I can repost here (I’ve modified the presentation slightly):

INTRODUCTION: The Significance of Soy – Sidney W. Mintz, Chee-Beng Tan, and Christine M. Du Bois


1. Legumes in the History of Human Nutrition – Lawrence Kaplan
2. Early Uses of Soybean in Chinese History – H. T. Huang
3. Fermented Beans and Western Taste – Sidney W. Mintz
4. Genetically Engineered Soy – Christine M. Du Bois and Ivan Sergio Freire de Sousa


5. Tofu and Related Products in Chinese Foodways – Chee-Beng Tan
6. Tofu Feasts in Sichuan Cuisine – Jianhua Mao
7. Fermented Soybean Products and Japanese Standard Taste – Erino Ozeki
8. Fermented Soyfoods in South Korea: The Industrialization of Tradition – Katarzyna J. Cwiertka and Akiko Moriya
9. Tofu in Vietnamese Life – Can Van Nguyen
10. Soyfoods in Indonesia – Myra Sidharta
11. Social Context and Diet: Changing Soy Production and Consumption in the United States – Christine M. Du Bois
12. Soybeans and Soyfoods in Brazil, with Notes on Argentina: Sketch of an Expanding World Commodity – Ivan Sergio Freire de Sousa and Rita de Cássia Milagres Teixeira Vieira
13. Soy in Bangladesh: History and Prospects – Christine M. Du Bois
14. Soybeans and Soybean Products in West Africa: Adoption by Farmers and Adaptation to Foodways – Donald Z. Osborn

CONCLUSION: Soy’s Dominance and Destiny – Christine M. Du Bois and Sidney W. Mintz

Appendix A. Scientific Names for Plants and Edible Fungi
Appendix B. More on Tofu in Chengdu

Altogether I think this is a very important volume. From a personal point of view I enjoyed working on the article I contributed. More on that in a later post.