Scaling AI for Bharat: The Blueprint for Multilingual Language Models

To ensure no Indian language is left behind, India must adopt a scalable AI strategy by combining large, mid-sized, and small models that learn from each other.

Arun Subramaniyan, Founder & CEO, Articul8

A growing number of AI practitioners believe that India should focus on small language models for its lesser-resourced tongues, arguing this is a practical way to avoid the high costs of training massive AI models. But defaulting to small models alone for low-resource Indian languages would be shortsighted.

Yes, smaller models are cheaper and easier to fine-tune, but leaning on them exclusively risks cementing a two-tier AI system: well-resourced languages get advanced AI, while others are left with second-rate tools. India’s linguistic diversity – 22 official languages and thousands of dialects – demands more ambitious solutions.

India must refuse the narrative that our languages are too “poor” in data to deserve cutting-edge AI. Instead, we should tackle the root issue – data scarcity – rather than use it as an excuse to think small.

Scaling Ambition to Data Richness

A smarter strategy is to scale our models to the richness of available data, building AI models of all sizes – large, mid-sized, and small – as warranted. Where a language has abundant digital text (for example, Hindi or Tamil), we should train large language models that capture its full nuance and vocabulary. Where data is more modest, we can opt for mid-sized models.

Crucially, if a language is truly low-resource today, the answer is not to permanently relegate it to a tiny model, but to invest in gathering more data and to leverage knowledge from related languages. The Indian government is already pushing in this direction, crowdsourcing content through initiatives like Bhasha Daan to boost datasets for Hindi, Bengali, Tamil, Telugu, and beyond. This effort acknowledges a key fact: many Indian languages currently lack sizable digital corpora – Hindi, for example, has on the order of 200,000 Wikipedia articles versus more than six million in English, and most other Indian languages have considerably less digital text.

By scaling model size to data availability, we achieve efficiency without compromising on ambition. High-resource languages can reach state-of-the-art performance with large models, while mid-resource ones get tailored models that grow as data grows. This tiered approach ensures we don’t leave any language stuck with outdated tech. Every Indian language, big or small, warrants a pathway to advanced AI, with model sizes that evolve as the language’s digital presence expands.

United by Linguistic Roots and Structures

One of India’s superpowers in AI development is the rich interconnection among its languages. Our tongues are not isolated silos – they are members of linguistic families with shared roots, overlapping vocabularies, and analogous grammars that developed together over centuries. We should be leveraging these connections to share learning across languages. For instance, most Indo-Aryan languages (Hindi, Marathi, Gujarati, Bengali, Punjabi, and others) descend from Sanskrit and Prakrit, meaning they share a core vocabulary and grammatical base. It’s common to find identical or very similar words across Hindi and Marathi; words for basic concepts like prem (love), ānand (happiness), or nadī (river) mean the same in both. Linguists have noted up to a 50–75% lexical overlap between Hindi and Marathi by some measures – a huge head start for any AI model trained on one to understand the other. In practical terms, a language model that has learned Hindi will already have a strong grasp of Marathi’s building blocks (and vice versa), thanks to these historical links.

Similarly, the Dravidian languages (Tamil, Telugu, Kannada, Malayalam, etc.) form another tight-knit family. They share distinctive structural features – for example, a subject–object–verb word order, heavy use of agglutinative morphology (adding suffixes for tense, case, and so on), and many cognate words for everyday terms. We can design our models to jointly train on related languages, allowing them to internalize these common patterns. The benefit? A model doesn’t have to start from scratch for each language – the syntax, style, and even idiomatic metaphors learned from one can assist in understanding another. This is especially valuable for low-resource tongues: a Punjabi model, for example, can improve greatly by training together with Hindi or Gujarati data, which act as a scaffold for Punjabi’s grammar and vocabulary. Indeed, research shows that Indo-Aryan languages “assist each other” in training, yielding significantly better results on NLP tasks. In one experiment, a named-entity recognition system for a low-resource language like Odia or Punjabi saw up to a 150% relative improvement in accuracy when supplemented with data from related Indic languages. That is strong validation of the power of cross-language learning, and we would be foolish not to exploit these natural alliances between our languages.
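
To make this concrete, here is a minimal sketch of how a low-resource corpus can be interleaved with a related high-resource one before continued pretraining, so the shared Indo-Aryan structure acts as a scaffold. The file names and sampling ratios are illustrative assumptions, not a prescription:

```python
# Hypothetical sketch: mix a low-resource language (Punjabi) with a related
# high-resource one (Hindi). File names and probabilities are illustrative.
from datasets import load_dataset, interleave_datasets

hi = load_dataset("text", data_files={"train": "hi_corpus.txt"}, split="train")
pa = load_dataset("text", data_files={"train": "pa_corpus.txt"}, split="train")

# Oversample Punjabi relative to its raw size so it is not drowned out by Hindi.
mixed = interleave_datasets([hi, pa], probabilities=[0.6, 0.4], seed=42)

# A multilingual checkpoint would then be continually pretrained on `mixed`,
# letting Punjabi benefit from patterns already learned from Hindi.
print(mixed)
```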

Beyond direct ancestry, India’s languages also share cultural and conceptual frameworks that AI models can capitalize on. Our epics, proverbs, and metaphors often transcend a single tongue – a Hindi model that learns the concept of “Ramayana” or “karma” is indirectly also learning something it can transfer when working in Bengali or Marathi, where these concepts exist under identical or similar names. Even the phenomenon of code-mixing (like Hinglish or mixing Telugu and English) means many Indians seamlessly blend languages in one context; an AI that is trained multilingually will handle these blends far better than separate small models ever could. The takeaway is clear: Indian languages form a connected ecosystem, and our AI models should be designed to reflect and leverage that connectivity.

India can lead in developing truly multilingual AI that treats our languages as collaborative partners, not isolated problems.

An Interoperable Multi-Model Strategy

Harnessing this linguistic synergy calls for a portfolio of interoperable models – each tuned to a different scope, yet all working in concert. We should envision an AI architecture where a handful of large models, several medium-sized models, and numerous specialist models talk to each other and learn from each other. Here’s what that strategy looks like:

Large Multilingual Foundation Models:

Train a few colossal models on all Indian languages (and dialects) together. These would be akin to broad “foundation” AI models that capture general knowledge and semantics across languages. They can learn universal representations – for example, understanding that “water” (English), pānī (Hindi), and tanni (Tamil) all denote the same concept. These big models become the knowledge banks that smaller models can draw from.
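
As a rough illustration of what such shared representations look like in practice, the sketch below uses one publicly available multilingual encoder (not an India-specific foundation model) to show that the same concept in English, Hindi, and Tamil lands on nearby vectors:

```python
# Illustrative only: an off-the-shelf multilingual encoder mapping one concept
# across three languages to nearby points in the same embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

words = ["water", "पानी", "தண்ணீர்"]  # English, Hindi, Tamil
embeddings = model.encode(words, convert_to_tensor=True)

# The pairwise cosine similarities should be high, reflecting a shared
# cross-lingual representation of the same concept.
print(util.cos_sim(embeddings, embeddings))
```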

Mid-Sized Language Family Models:

On the next tier, develop dedicated models for specific language families or groups – for instance, one model specializing in Indo-Aryan languages, another for Dravidian languages, perhaps others for smaller families (like a model for the Tibeto-Burman languages of the North-East). By focusing on a narrower set of linguistically similar languages, these mid-sized models can delve deeper into the nuances (e.g. the particular grammar rules or regional idioms) without being diluted by totally unrelated data. They’d require fewer resources than the giant foundation model yet still cover multiple languages each. Importantly, they should remain interoperable with the large model – meaning knowledge flows both ways.

We might fine-tune a family-specific model from the foundation model, or use the family model to adapt the foundation’s outputs more precisely to each language in the family. Notably, grouping languages into cohorts this way (Indo-Aryan, Dravidian, and so on) leverages the shared grammar and vocabulary within each group, achieving better accuracy than treating each language in isolation.
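
A minimal sketch of how these cohorts might be declared is shown below; the language codes and corpus paths are illustrative assumptions, and each family model would start from the foundation checkpoint and continue pretraining only on its cohort’s data:

```python
# Hypothetical cohort definitions: which languages each mid-tier model sees.
# Codes and corpus paths are illustrative, not an official grouping.
LANGUAGE_FAMILIES = {
    "indo_aryan": ["hi", "mr", "gu", "bn", "pa", "or"],
    "dravidian": ["ta", "te", "kn", "ml"],
    "tibeto_burman": ["mni", "brx"],  # North-Eastern languages, illustrative
}

def corpora_for_family(family: str) -> list[str]:
    """Return the (assumed) corpus files a family-level model trains on."""
    return [f"data/{lang}_corpus.txt" for lang in LANGUAGE_FAMILIES[family]]

# Each family model would be initialised from the multilingual foundation
# checkpoint and continually pretrained on its cohort, e.g.:
print(corpora_for_family("indo_aryan"))
```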

Fine-Tuned Specialist Models:

Finally, build specialized models for individual languages or for domain-specific tasks in those languages. These could be smaller still, even genuinely small models, but they would be fine-tuned versions derived from the two tiers above. For example, a state government might need a model tuned for Marathi legal documents, or a chatbot for Tamil healthcare information. Rather than train a tiny Marathi or Tamil model from scratch (which would be limited by Marathi-only or Tamil-only data), we would start with the Indo-Aryan or Dravidian mid-tier model, which already understands much of the language, and then fine-tune or distill it on the domain data.

The result is a compact, efficient model for the task that still benefits from the intelligence of larger models. Moreover, because it shares parentage with other models in the hierarchy, any improvements or new data can be fed back into the larger ecosystem. These specialist models ensure that no language or niche use case is left unsupported. They are the “last mile” of our AI infrastructure, bringing the power of the big models to every local context.
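
One plausible way to derive such a specialist, sketched below with a hypothetical mid-tier checkpoint name and parameter-efficient (LoRA-style) fine-tuning, keeps the parent model frozen and trains only a small set of adapter weights on the domain data:

```python
# Sketch of the specialist tier, assuming a hypothetical mid-tier checkpoint
# name ("indo-aryan-family-model") and LoRA-style fine-tuning on domain data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("indo-aryan-family-model")  # hypothetical
tokenizer = AutoTokenizer.from_pretrained("indo-aryan-family-model")    # hypothetical

# Which modules to adapt depends on the parent architecture; the query/value
# projections of the attention layers are a common choice.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
specialist = get_peft_model(base, lora)
specialist.print_trainable_parameters()

# The frozen parent plus its small trained adapters would then be fine-tuned on,
# say, Marathi legal documents, and the adapters shared back into the ecosystem.
```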

This interoperable setup is powerful. It means we don’t have to choose between one giant model for everything and many isolated small models. We can have a network of models that combines the strengths of each size. The large models provide broad understanding and cross-language transfer learning; the mid-sized models provide efficient specialization on linguistic subtleties; and the small models provide nimbleness for specific tasks – all while sharing a common knowledge base. Such a framework avoids duplication of effort (we are not training 22 separate models from zero) without collapsing into a one-size-fits-all approach.

Leading with Ambition and Inclusion

Pursuing this multi-tier model strategy positions India not as a follower, but as a global pioneer in inclusive AI development. It is an approach that says: no language left behind. This is both an efficient use of resources and a statement of ambition. Efficiency comes from the interoperability; ambition comes from refusing to cap any language’s potential. If tomorrow a wealth of Konkani text becomes available, our framework can scale up a Konkani model to take full advantage, plugging into the Indo-Aryan family model and growing with the new data.

This strategy resonates with India’s broader vision for technology. The nation’s AI roadmap – from the National Language Translation Mission to initiatives like Bhashini – explicitly aims to break down the language-based digital divide. The government has been ramping up compute infrastructure and research funding, signaling that India plans to “do everything within the AI stack” and build solutions for our own languages.

To wrap up, betting only on small language models for India’s lesser-resourced languages would be selling ourselves short. By building an ensemble of large, mid-sized, and specialized AI models, we can ensure every Indian language, from the most spoken to the most endangered, has a place in our AI future.

By Arun Subramaniyan, Founder & CEO, Articul8

Arun Subramaniyan is a pioneering Generative AI expert and the Founder and CEO of Articul8, a company backed by DigitalBridge, FinVC, Communitas, and Intel. Previously, Arun was VP and GM at Intel’s Data Center and AI Group, where he led global AI product adoption. Before Intel, he headed AWS’s Global Solutions Team, specializing in machine learning, quantum computing, and high-performance computing. Arun’s leadership continues to drive innovation in enterprise AI solutions.

(Disclaimer: The views expressed in this article are solely those of the author and do not reflect CyberMedia’s stance.)
