
Computer science professor is using AI to help endangered languages be heard


There are more than 6,000 languages spoken in the world, and almost half of them are endangered. George Mason University researcher Antonios Anastasopoulos is working to keep those endangered languages alive and has built a Natural Language Processing (NLP) group at the university devoted to this work.

The recent recipient of a $599,956 CAREER Award from the National Science Foundation and a $300,000 Small Business Innovation Research (SBIR) Award from Barron Associates, Anastasopoulos is building automatic translation tools for under-served populations, including speakers of certain Indo-Pacific languages that don’t have access to language technologies.

He is also a collaborating senior researcher at a hub in Greece that connects the global AI and data science research community.

Anastasopoulos began his work with languages when he visited local communities to record endangered dialects as an undergraduate student in his native Greece. “What makes a language endangered is when it stops getting passed down across the generations,” said Anastasopoulos, an assistant professor in George Mason’s Department of Computer Science.

Antonios Anastasopoulos (left), Indigenous Huilliche activist Marite Perez, and CU Boulder Professor Alexis Palmer in Chile. Photo provided.

“I created databases for a small Greek dialect that’s spoken in South Italy called Griko. I built that small tool for this one community, but there are thousands of similar communities all over the world that speak languages that are completely neglected. They have no institutional support, so my work feels very meaningful,” said Anastasopoulos.

Every language lies in a continuum, and no matter how small the language is, you will find variations, explained Anastasopoulos. If a system is built that has only seen data from one variety of a language, it will perform worse for variations that the model has never seen.  

“I just kept building on this concept, which is how I ended up developing this whole research program here at George Mason. The real motivation behind my work is that it's just something that simply needs to get done,” he said.

Endangered languages are very common in places such as Latin America, explained Anastasopoulos, where the government and socially powerful class often speak Spanish or Portuguese, to the exclusion of hundreds of Indigenous language communities. These Indigenous peoples still have to participate in society, so their unique languages eventually become moribund.   

“For example, I had some folks from an Indigenous community in Chile contact me and inquire about my work. The Mapuche had, historically, resisted Spanish conquest, and they reached out asking for guidance in building AI tools to help with the instruction and preservation of their language, Mapuzugun,” said Anastasopoulos.

Fahim Faisal, a PhD candidate in the Department of Computer Science who works with Anastasopoulos as part of George Mason’s Natural Language Processing (NLP) group, has experienced firsthand how modern technology can fall short for speakers of regional dialects.

Fahim Faisal. Photo provided.

“I'm from Bangladesh and my primary language is Bengali, so when I try to interact with everyday technology, for example, [Amazon’s] Alexa, it can't always understand what I’m saying because of my dialectal variety and my accent differences,” said Faisal.

“There are lots of variations when people speak in terms of accent and dialect, so we’re trying to implement that cultural variety into language modeling, so it’s still accessible when you go from one part of the world to another,” he said.

Earlier in 2024, a paper by Faisal received an award at the annual meeting of the Association for Computational Linguistics, the premier NLP conference, for showing that large language models (LLMs) cannot handle dialects as well as standard language varieties.

Computer science PhD candidate Milind Agarwal applied to George Mason specifically to work alongside Anastasopoulos.  

Agarwal works on documented digitization of archival data, as well as large-scale data extraction and language identification.

“I’m trying to assist new artificial intelligence technologies to learn better so that it’s accessible to the smaller language communities and to ensure that there aren't huge swaths of people who are completely left out of this tech revolution essentially,” said Agarwal, who has published six papers on the topic.

One of Agarwal’s papers won the 2024 Best Student Impact Paper Award at a regional conference.

“We're working with language community members directly, because they have a stake in revitalization and keeping their languages alive,” Agarwal said. “We continuously share our results with the community members to get feedback and make sure that the work we're doing is in line with the people who will use it rather than being research that's disconnected from the ground. That has been invaluable because it has grounded our work and made sure that it actually has an impact.”