
Exclusive interview: Hungary's quest to preserve linguistic heritage in the age of AI

Gong Zhe

Updated 00:11, 30-Jul-2025

As the World Artificial Intelligence Conference concludes today in Shanghai, we at CGTN Digital spoke exclusively with Tamas Varadi, PhD, senior research fellow at the Hungarian Research Center for Linguistics and a guest at the conference, about Hungary's unique approach to AI language development.

A linguistic island in the AI ocean

"Hungary is a small country with approximately 10 million speakers, and Hungarian does not belong to the Indo-European language family," Tamas Varadi explained. "It's essentially a linguistic island. From a global developer's perspective, it's a niche market."

That said, Hungary has its own strengths in developing large language models. The center shifted its research paradigm to neural deep learning methods in the 2020s, Varadi told CGTN. The core strength lies in data: "We now have the largest curated, cleaned, and deduplicated training corpus for Hungarian."

Varadi revealed that the center completed its first native Hungarian models "two weeks before ChatGPT blew up." Initially, they were confident: "GPT-3 had Hungarian in it, amounting to 128 million words as against the 32 billion Hungarian words we trained our first model on."

The dilemma of multilingual models

However, newer multilingual models changed the landscape. "When multilingual models came out in succession, particularly Meta's, we found that the whole pre-training model was scaled up to an extent that even though the ratio is still about 0.006 percent, that very small relative data amounts to 40 billion words in Hungarian."

The "extremely overwhelming pace of development" has posed many challenges for the team.

"I am quite amazed by what I've seen at this conference – what global companies like Meta and Chinese models have access to," he said, adding that his team "work on a very limited basis," under which a single model takes months to develop.

Confidence in culture

Despite this, Varadi believes in their approach: "I don't think such global models have the expertise and attention" to devote to individual language components.

"Therefore, we are proud that our curated language – which is not only harvested from the internet but complemented with data from libraries and repositories – gives us full control over representing Hungarian culture."

When we suggested that preserving linguistic diversity is a task best undertaken by local communities themselves, Varadi emphatically agreed.
