In the previous post, I looked at a possible ramification of “mass digitization” of text. But what about the spoken word? And more precisely verbal presentations, performance, and broadcast in languages often described as having “oral traditions” (and generally less material in writing)? Can we do something significant to capture large amounts of speech in such languages in digital recordings?
There are some projects to digitize older recordings on tape, and certainly a need for more effort in this area, but what I am thinking of here is recording contemporary use of language that is normally ephemeral (gone once uttered), along with gaining access to recordings of spoken language that may not be publicly accessible. One place to start might be community radio stations in regions where less-resourced languages are spoken.
The object would be to build digital audio libraries for diverse languages that don’t have much, if any, publication in text. This could permit various kinds of work. In the case of endangered tongues, this kind of thing would fall under the rubric of language documentation (for preservation and perhaps revitalization), but what I am suggesting is a resource for language development for languages spoken by wider communities.
Digital audio is more than just a newer format for recording. As I understand it, digital storage of audio has some qualitative differences, notably the potential to search by sound (without the intermediary of writing) and eventually, one presumes, to be manipulated and transformed in various ways (including rendering in text). Such a resource could be of use in other ways, such as collecting information on things like emerging terminologies in popular use (a topic that has interested me since hearing how community radio stations in Senegal come up with ways to express various new concepts in local languages). Altogether, digital audio seems to have the potential to be used in more ways than we are used to thinking about in reference to sound recordings.
Put another way, transcribed recordings can serve as “audio corpora” in the more established sense. But what if one had large volumes of untranscribed digital recordings, plus the ability to search the audio directly (without text) and later to convert it into text? (Accuracy in the latter, which would not involve the speaker-specific training required by current speech-to-text programs, will be one of the challenges.)
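To make the “search the audio directly” idea concrete, here is a minimal sketch of query-by-example search: a short spoken query is matched against a longer recording by aligning sequences of acoustic feature frames, with no transcription involved. The feature vectors, distances, and data below are toy placeholders (in a real system the frames would be spectral features such as MFCCs); the function names are my own invention.

```python
# Query-by-example audio search: find where a short spoken query best
# matches inside a longer recording, without any transcription.
# "Audio" here is a list of per-frame feature vectors (toy 1-D values;
# real systems would use MFCCs or similar spectral features).

def frame_dist(a, b):
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_cost(query, segment):
    """Dynamic-time-warping alignment cost between two frame sequences,
    normalised by combined length so different windows are comparable."""
    n, m = len(query), len(segment)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(query[i - 1], segment[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def search(query, recording, window):
    """Slide a window over the recording; return (best_start, best_cost)."""
    best_start, best_cost = None, float("inf")
    for start in range(len(recording) - window + 1):
        c = dtw_cost(query, recording[start:start + window])
        if c < best_cost:
            best_start, best_cost = start, c
    return best_start, best_cost

if __name__ == "__main__":
    # Toy data: the query pattern is embedded at frame 5 of the recording.
    recording = [[0.0]] * 5 + [[1.0], [2.0], [3.0]] + [[0.0]] * 5
    query = [[1.0], [2.0], [3.0]]
    print(search(query, recording, window=3))  # -> (5, 0.0)
```

The warping step matters because the same word is never spoken at exactly the same speed twice; DTW lets the match stretch or compress in time.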
Can digital technology do for audio content something analogous to what it can do for text? What sort of advantages might such an effort bring for education and development in communities which use less-resourced languages? Could it facilitate the emergence of “neo-oral” traditions that integrate somehow with developing literate traditions in the same languages?
Collection and archiving of audio can certainly be a means of creating a corpus for a language that is not often written, but it suffers from the same incompleteness as a text collection: a big chunk of material in a language, whether oral or written, is of little use without translation and annotation. Providing these is probably the hardest part.
Thanks Bill, for the clarification. Part of what I’m wondering is how much we can hope for from automated tools in order to address (or greatly facilitate addressing) the issues you mention. With a positive expectation (I hope realistically positive) in this area, I’m thinking it’s time to think about ways to systematically gather raw data – digital recordings of minority languages in many settings such as community radio – so that when tools are developed/refined enough, we have a mass of language material to process. Also, in the meantime it is apparently possible to search sounds/words on digital recordings that are not yet transcribed, translated or annotated.
Most of the “audio search” tools you’ll find do not actually search the sound – they search the metadata. The few that do a true audio search are not wonderfully accurate: even in the case of something easy like newscaster speech, the best I’ve seen is 93% accuracy. That’s useful for some things but not really that great. Automated transcription may be usable for languages like English, but I don’t know what it will take to get it to a reasonable level of accuracy for a wide range of languages.
Bill
This is helpful information, thanks. I guess a significant question is how quickly this sort of tool will be improved. One of the reasons I’ve suggested elsewhere that we should be thinking about how to facilitate the training of speakers of minority languages in HLT/NLP (mainly in university computer science programs) is the thought that some among them may be able to play key roles in overcoming some of the hurdles you allude to.
As regards automated transcription, Tunde Adegbola (ALT-I, Ibadan) recently mentioned to me a project that used tone (pitch contour) in processing Yoruba data – the interesting thing here is that tonal melody provides a very helpful cue in addition to the other features one can use in automated language processing. I can’t remember the name or affiliation of the project, however.
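The basic idea of using pitch contour as a cue can be sketched very simply. Yoruba has three level tones (high, mid, low), so a first approximation is to split a speaker’s pitch range into three bands and assign each syllable a tone letter by its average fundamental frequency. Everything below is illustrative: the thresholds, contours, and function names are my own assumptions, not from the project mentioned above.

```python
# Sketch: mapping a syllable's pitch (F0) contour to one of Yoruba's
# three level tones (H, M, L), given an assumed speaker pitch range.
# Thresholds and example values are invented for illustration.

def classify_tone(f0_samples, f0_low, f0_high):
    """Classify one syllable's tone by splitting the speaker's
    pitch range (f0_low..f0_high, in Hz) into three equal bands."""
    mean_f0 = sum(f0_samples) / len(f0_samples)
    band = (f0_high - f0_low) / 3.0
    if mean_f0 < f0_low + band:
        return "L"
    if mean_f0 < f0_low + 2 * band:
        return "M"
    return "H"

def tone_melody(syllable_contours, f0_low=100.0, f0_high=250.0):
    """Turn a list of per-syllable F0 contours into a tone-melody
    string such as 'LMH', which could then serve as a search key."""
    return "".join(classify_tone(c, f0_low, f0_high)
                   for c in syllable_contours)

if __name__ == "__main__":
    # Three syllables: roughly 110 Hz, 175 Hz, and 230 Hz.
    contours = [[108, 110, 112], [172, 175, 178], [228, 230, 233]]
    print(tone_melody(contours))  # -> LMH
```

A “tone corpus” in this sense would just be text indexed by such melody strings, which is also what makes the talking-drum connection below possible.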
All of this is also a matter of scale. For big languages like Hausa, Yoruba, Swahili we may hope to see something in the future, but for the smaller ones there is not nearly enough interest to warrant developing tools for automated processing.
Hi Mark, Yes, Tunde and colleagues have been working on this tonal approach. The tone patterns of course are also what make the talking drums intelligible. Really incredible notion, using tone this way – one could apparently also have a tone corpus. Thanks for bringing this up – it may be the subject of a future posting.
Re the issue of scale, I understand your logic. Part of what I am wondering, though, is whether capturing audio of diverse languages could be done relatively cheaply and widely. For instance with community radio, how about working with local stations on this (some funds for equipment, making the concept clear and getting permissions, providing feedback to them as participants in a larger project, etc.)?
Then beyond that – will we get to the point that the tools for processing less widely spoken languages become cheap and easy enough that low demand no longer matters? I.e., can we eventually have tools that do much of the work of developing programs for processing various languages?
Another question is the reverse: What future is there for languages that aren’t used in some way with advanced technology (recorded, manipulated, transformed between speech and text, translated to and from other languages)?
IOW can we afford not to put some increment of resources into digitizing all languages on a mass basis, regardless of perceived demand today?
Hi Don and Mark,
Just stumbled on these postings while searching for information on the Talking Drum. To take you back to the issue of using tone as a search cue: what I would really like to see is the use of a short speech segment to search a large speech file for relevant sections. What we are doing, however, is using speech (tone) to search text. It works as a ‘Talking Drum Recogniser’: the drum plays a tune and the corresponding text in a database is displayed on the screen or read out through the sound card. So it is currently possible to use speech to locate text, but not speech to locate speech.
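A toy version of that tone-to-text lookup might look like the following: reduce the drum’s pitch sequence to a tone-melody string (H/M/L per beat) and use it as a key into a tone-annotated text database. This is only a sketch of the general idea, not the actual recogniser described above; the database entries, pitch thresholds, and function names are all placeholders of my own.

```python
# Toy tone-to-text lookup in the spirit of a 'Talking Drum Recogniser':
# drummed pitches -> tone melody -> matching text in a database.
# All entries and thresholds here are invented placeholders.

TONE_DB = {
    "HML": "example phrase one",
    "LLH": "example phrase two",
    "MHM": "example phrase three",
}

def pitches_to_melody(pitches_hz, low=200.0, high=400.0):
    """Map each drummed pitch to H, M, or L by splitting an assumed
    pitch range into three equal bands."""
    band = (high - low) / 3.0
    letters = []
    for p in pitches_hz:
        if p < low + band:
            letters.append("L")
        elif p < low + 2 * band:
            letters.append("M")
        else:
            letters.append("H")
    return "".join(letters)

def recognise(pitches_hz):
    """Return the database text whose tone melody matches the
    drummed pattern, or a no-match marker."""
    return TONE_DB.get(pitches_to_melody(pitches_hz), "<no match>")

if __name__ == "__main__":
    print(recognise([390.0, 300.0, 210.0]))  # melody 'HML'
```

The same index structure, run in reverse over tone melodies extracted from speech, is essentially what a tone corpus for search would require; the harder speech-to-speech case needs acoustic matching rather than an exact key lookup.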