International Mother Language Day: Why AI still prefers to speak English
Computational Linguistics Professor Dietrich Klakow. Photo: Iris Maurer
Whether voice controls, chatbots, dictation tools or translation programs: many people now use software that processes natural language on a daily basis. Experience shows that most of these applications work best in English. Why that is, whether it will change in the future, and what the prospects are for less common (native) languages is explained by Saarbrücken computational linguistics professor Dietrich Klakow on the occasion of International Mother Language Day on February 21.
Native German speakers are still relatively well off, says Dietrich Klakow, professor of “Spoken Language Systems” at Saarland University, because most IT speech applications work quite well in German. “But it’s true, many systems in the field of language processing still work best in English,” confirms the professor, who conducts research at the Saarland Informatics Campus.
There are two main reasons for this. Most applications of computer-based language processing are built on machine learning, a subfield of artificial intelligence. “In machine learning, a programmer doesn’t tell the algorithm exactly what to do, but trains it with vast amounts of data from which the algorithm can learn on its own,” explains Dietrich Klakow. And that is precisely the first reason: English is the most widely spoken language in the world, so most of the available training data is in English as well. “In addition, English is comparatively straightforward in terms of grammar, which is why computers cope well with it,” says Klakow.
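The following minimal Python sketch (not taken from the article; the library, example texts and labels are purely illustrative assumptions) shows this idea in miniature: instead of a programmer writing explicit rules, a model infers them from labeled example data.

```python
# Minimal, illustrative sketch: a model learns to label short texts from examples,
# rather than following rules a programmer wrote by hand.
# Toy data and library choice (scikit-learn) are assumptions for this illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny training set; real systems learn from millions of such examples.
texts = ["great film", "wonderful day", "terrible service", "awful weather"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)                  # the model infers patterns from the data

print(model.predict(["wonderful film"]))  # typically ['positive']
```

The more example data a language has available, the better such learned models tend to become, which is why data-rich English is at an advantage.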
The second reason, he says, is the researchers themselves: “Science is an international field of work, so the working language is usually English, in computer science as well. So when researching or developing something new, one does so in a way that is easy for colleagues to follow. That’s why most researchers work and publish in English,” says Klakow. This, in turn, leads to many applications being developed in English first: the first machine-translated language pair was English-French, and the first synthesized voice was software that read out English newspaper articles. “Most applications have a multi-year head start in English. And the major European languages are usually the first to follow suit,” the professor explains.
But what about smaller languages that have few speakers? “The vast majority of the world’s languages are not supported at all. There are about 7,000 languages, of which only about 400 have more than a million speakers, and even those 400 are not all researched extensively enough to be used in natural language applications,” Klakow says. Google Translate, which gives a good first impression of which languages have been researched in computational linguistics, supports a total of 133 languages at various levels as of February 2023.
A much more serious problem than small languages that are insufficiently researched in computational linguistics is posed by very widespread languages that are barely or not at all supported, because here we quickly run into globally relevant questions of digital participation, says Dietrich Klakow. “Many African languages, for example, which easily have ten to 50 million native speakers, can hardly be processed by computers, or only very poorly,” says the professor. Together with his doctoral students Jesujoba Oluwadara Alabi, David Ifeoluwa Adelani and Marius Mosbach, Dietrich Klakow has therefore developed a method to fine-tune existing language models to the 17 most widely spoken African languages in a memory-efficient way. Last October, he and his colleagues received a Best Paper Award for this work at the International Conference on Computational Linguistics, one of the leading conferences in the field.
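The general idea behind such work, adapting an existing pre-trained multilingual model by continuing its training on text in the target languages, can be sketched roughly as follows. This is an illustration only, not the authors’ released implementation: it assumes the Hugging Face Transformers and Datasets libraries, the model name, corpus file and training settings are placeholders, and the memory-saving aspects described in the paper (see Publication below) are omitted.

```python
# Rough sketch: adapt a pre-trained multilingual model to new language data
# via continued masked-language-model training. Model name, corpus file and
# training settings are placeholders, not the authors' actual setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "xlm-roberta-base"  # a pre-trained multilingual model (placeholder choice)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Monolingual text in the target language(s); the file name is hypothetical.
dataset = load_dataset("text", data_files={"train": "target_language_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Continued masked-language-model training adapts the model to the new language data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted_model",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```

After such adaptation, the model can be fine-tuned further for downstream tasks in the newly supported languages.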
Research to expand the machines’ language horizons continues. Asked how these language capabilities might develop in the future, Klakow says: “In terms of machine processing, more efficient machine learning models that require less training data, or better methods for artificially generating training data, will certainly raise even more languages to a ‘product-ready’ level in the future. My guess is that in ten to 15 years, the 400 most common languages could all have reached this level.” However, he does not believe that all the world’s languages will ever function equally well: “There will never be enough training data to program a ‘Zulu ChatGPT’, for example. In this respect, English will probably always be ahead,” says the professor.
More Information:
Publication:
Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Questions answered by:
Prof. Dr. Dietrich Klakow
Spoken Language Systems
Universität des Saarlandes, Saarland Informatics Campus
Phone: +49 681 302 58 122
E-Mail: Dietrich.Klakow@lsv.uni-saarland.de
Background Saarland Informatics Campus:
900 scientists (including 400 PhD students) and about 2,500 students from more than 80 nations make the Saarland Informatics Campus (SIC) one of the leading locations for computer science in Germany and Europe. Four world-renowned research institutes, namely the German Research Center for Artificial Intelligence (DFKI), the Max Planck Institute for Informatics, the Max Planck Institute for Software Systems, and the Center for Bioinformatics, as well as Saarland University with three departments and 24 degree programs, cover the entire spectrum of computer science.
Editor:
Philipp Zapf-Schramm
Saarland Informatics Campus
Phone: +49 681 302-70741
E-Mail: pzapf@cs.uni-saarland.de