How we trained "Kamilė", a Lithuanian language correction model
2026-04-03
How do you teach computers to write?

Even though researchers have been tackling this question since the early days of electronic digital technologies, language complexity became tractable only quite recently. The most notable step-like jump in quality occurred around 2017-2018, when it started to feel like language models were beginning to understand you. The days of Clippy-like systems were finally behind us. Suddenly, language became one of the fastest ways of translating your intentions into digital form, and we started moving away from that nagging feeling of distrust when using natural language processing products (if you have forgotten what that felt like, you can still ask Siri). ** Hey, Siri, set the expiration date of this joke to the day Apple starts using Gemini models in its products.
The main drivers behind this rapid progress are (1) the relentless pace of Moore's Law, (2) massive data centers, and (3) breakthroughs in machine learning theory. Especially important was the invention of the transformer neural network architecture, which enabled modern large language models (LLMs) and made it possible to efficiently harness the scale of supercomputers.
There is little need to spend time proving the value of LLMs, as all of you have already experienced their usefulness at work or with everyday questions (or perhaps while using writing assistance tools like Kablelis/Grammarly). Some of you might be curious to dive deeper into the details: what allows LLMs to master so many languages so well, including less widely spoken ones like Lithuanian?
In fact, these systems learn much the way we do: directly from examples of spoken and written language.
In this post, we will share the lessons we learned from training over 70 iterations of grammar correction language models and how we dealt with Lithuanian-specific challenges.
Training data
When looking for training data, you do not have to look far: the internet is right here. And there is a lot of text on the internet. So much that it is hard to overstate these models' capacity for handling superhuman amounts of information. ** Reading 8 hours per day, it would take an average person about 100,000 years to fully read through the dataset shown below.
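To get a feel for the footnote's figure, here is a quick back-of-the-envelope calculation. The reading speed and corpus size are our own assumptions for illustration (a typical adult reading rate and a web-scale corpus of roughly four trillion words), not numbers from the Kablelis team:

```python
# Back-of-the-envelope check of the "100,000 years of reading" footnote.
# Both constants below are illustrative assumptions:
WORDS_PER_MINUTE = 250     # assumed average adult reading speed
HOURS_PER_DAY = 8          # the reading "workday" from the footnote
DATASET_WORDS = 4e12       # assumed corpus size, ~4 trillion words

words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * 365
years_needed = DATASET_WORDS / words_per_year
print(f"{years_needed:,.0f} years")  # on the order of 100,000 years
```

Even with generous assumptions about reading speed, the result lands in the same order of magnitude as the footnote's estimate.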
But once you look closer, you immediately notice a "problem": English dominates the data landscape - there is more of it than all other languages combined. This is especially clear in the chart below, which shows that in these internet datasets, Lithuanian is one of the least represented official languages in Europe (ahead of only Latvian, Estonian, and Slovenian).
European language data volume for LLM training (Terabytes (TB); Common Crawl 2025)
Of course, the internet is not the only source of text, but collecting and processing data from other sources is much harder and requires substantial funds and effort.
This ratio also implies something else: since English is the modern lingua franca, this imbalance will only grow over time. So if we want to preserve smaller languages in the internet era, active effort is required.
Problem #1: scarcity of publicly available Lithuanian-language data.
Lithuanian language specifics
On top of having much less Lithuanian data, our language is morphologically much more complex.
To illustrate this, let us take a simple verb - "naudoti" (to use) - and its English equivalent, "use." Let us try to count how many different words and forms we can derive in each language from the roots of these verbs.
Map of forms and derivatives of the word "use" in English (69 words).
Map of forms and derivatives of the word "naudoti" in Lithuanian (5,644 words).
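Why does one Lithuanian verb root yield thousands of forms? The forms multiply across largely independent grammatical dimensions. The sketch below is a simplified illustration with rough, assumed category counts, not an exact grammar of "naudoti":

```python
# Simplified illustration of combinatorial growth in Lithuanian
# inflection. The category counts are rough assumptions for
# illustration only, not an exact description of the language.

english_forms = ["use", "uses", "used", "using"]  # verb inflection only

# Lithuanian verb forms multiply across independent dimensions:
lithuanian_dimensions = {
    "prefixes (pa-, iš-, su-, at-, ...)": 8,  # assumed count
    "reflexive / non-reflexive": 2,
    "tenses and moods": 7,                    # assumed count
    "person/number endings": 6,
}

total = 1
for name, count in lithuanian_dimensions.items():
    total *= count

print(f"English inflected forms: {len(english_forms)}")
print(f"Lithuanian rough lower bound: {total}")  # 8 * 2 * 7 * 6 = 672
```

And this is only the verb paradigm proper: participles derived from the same root decline like adjectives (by case, gender, and number), which multiplies the count further toward the thousands seen in the map above.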
Problem #2: more data is needed to train high-quality models for Lithuanian due to its complexity.
Synthetic data generation
Kablelis.lt models specialize in correcting spelling and writing errors. This problem is much more tractable in scope than building general-purpose models, because training data can be generated synthetically: all you need is many different, statistically realistic ways to corrupt text.
Synthetic data generation process. A similar amount of data is needed to train Kablelis models.
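A minimal sketch of the corruption idea: take clean text, inject plausible errors, and use the (corrupted, clean) pair as a training example. The specific error types and probabilities below are illustrative assumptions, not the actual Kablelis recipe:

```python
import random

# Two common Lithuanian error patterns, chosen here for illustration:
# dropped diacritics and missing commas.
DIACRITIC_MAP = {"ą": "a", "č": "c", "ę": "e", "ė": "e", "į": "i",
                 "š": "s", "ų": "u", "ū": "u", "ž": "z"}

def corrupt(text, p_diacritic=0.3, p_comma=0.5, rng=None):
    """Return a plausibly misspelled version of `text`.

    The clean original and the corrupted copy form one
    (input, target) training pair for a correction model.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch == "," and rng.random() < p_comma:
            continue  # simulate a missing comma
        if ch in DIACRITIC_MAP and rng.random() < p_diacritic:
            ch = DIACRITIC_MAP[ch]  # simulate a dropped diacritic
        out.append(ch)
    return "".join(out)

rng = random.Random(0)
clean = "Ačiū, kad skaitote šį tekstą."
noisy = corrupt(clean, rng=rng)
print(noisy, "->", clean)  # the model learns to map noisy back to clean
```

A real pipeline would draw error types and rates from the statistics of genuine human mistakes, so that the model sees corruption patterns matching what users actually type.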
Why synthetic data?
Results
What next?
Have questions or suggestions? Write to us - we will gladly share more technical details.
Kablelis team

