mmBERT: A Practical Implementation of Multilingual Encoder with Annealed Language Learning

mmBERT is a new encoder-only multilingual model trained on 3T tokens across 1,800+ languages that introduces an annealed language learning curriculum, inverse masking, and inverse temperature sampling to deliver large gains on classification and retrieval across both high- and low-resource languages. It builds on ModernBERT and shows parity with top proprietary systems on classification while surpassing prior multilingual encoders like XLM-R on XTREME/MTEB-style tasks.

Key contributions

  • Annealed Language Learning: Languages are staged across training phases (roughly 60 → 110 → 1,833), with sampling temperatures annealed from higher resource bias toward more uniform sampling, enabling efficient transfer to low-resource languages late in training.
  • Inverse masking schedule: Masking ratio decreases phase-by-phase (e.g., 30% → 15% → 5%), allowing coarse representation learning early and refinement later, improving masked-language-model pretraining efficiency.
  • Inverse temperature sampling: Sampling temperatures anneal (e.g., τ: 0.7 → 0.5 → 0.3) to gradually flatten the language distribution as more languages are introduced.
  • Late-phase low-resource injection: Over 1,700 low-resource languages are added only during the short decay phase, producing large gains from limited data via transfer from earlier phases.
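As a rough sketch of the sampling idea above (illustrative code, not the paper's implementation), exponent-style temperature sampling draws each language with probability proportional to its corpus size raised to τ; lower τ flattens the distribution toward uniform:

```python
import numpy as np

def language_sampling_probs(corpus_sizes, tau):
    """Exponent-style temperature sampling: p_i proportional to n_i ** tau.

    tau = 1 reproduces raw corpus proportions; tau -> 0 approaches uniform,
    so annealing 0.7 -> 0.5 -> 0.3 progressively upweights smaller languages.
    """
    sizes = np.asarray(corpus_sizes, dtype=float)
    weights = sizes ** tau
    return weights / weights.sum()

# Illustrative token counts for a high-, mid-, and low-resource language
sizes = [1_000_000_000, 50_000_000, 1_000_000]
for tau in (0.7, 0.5, 0.3):
    print(tau, np.round(language_sampling_probs(sizes, tau), 3))
```

Running this shows the low-resource language's share growing as τ decreases, which is the intended effect of the annealed schedule.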

Architecture and tokenizer

mmBERT follows the ModernBERT encoder design with optimizations for long-context efficiency and throughput, while adopting the Gemma 2 tokenizer to better cover multilingual scripts and expand vocabulary. Reported configurations include a base-size model with approximately 22 layers and a larger vocabulary, increasing total parameters primarily due to embeddings.

Training schedule

  • Phase 1 (Pre-train): ~2.3T tokens, 60 languages, higher mask ratio, stable LR; builds strong multilingual foundations dominated by higher-resource languages.
  • Phase 2 (Mid-train): ~600B tokens, context extension to ~8K, 110 languages, lower mask ratio, higher-quality data.
  • Phase 3 (Decay): ~100B tokens, inverse-sqrt LR decay, all 1,833 languages, lowest mask ratio; introduces 1,700+ low-resource languages for targeted transfer.
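The three phases can be summarized as a small lookup table. A minimal sketch, where the `PHASES` table and `phase_at` helper are hypothetical and the token counts are the approximate figures above:

```python
# Illustrative phase table mirroring the schedule above (token counts approximate)
PHASES = [
    ("pretrain", 2.3e12, 60,   0.30),  # name, token budget, languages, mask ratio
    ("midtrain", 6.0e11, 110,  0.15),
    ("decay",    1.0e11, 1833, 0.05),
]

def phase_at(tokens_seen):
    """Return (name, languages, mask_ratio) for a given training position."""
    consumed = 0.0
    for name, budget, langs, mask in PHASES:
        consumed += budget
        if tokens_seen < consumed:
            return name, langs, mask
    name, _, langs, mask = PHASES[-1]
    return name, langs, mask

print(phase_at(1.0e12))   # ('pretrain', 60, 0.3)
print(phase_at(2.5e12))   # ('midtrain', 110, 0.15)
print(phase_at(2.95e12))  # ('decay', 1833, 0.05)
```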

Why annealing works

Staging languages avoids over-epoching scarce low-resource corpora and mitigates catastrophic interference from noisy data by first learning robust multilingual features on richer corpora. Annealing the sampling temperature gradually shifts the distribution, enabling better balance between high-resource competence and coverage of long-tail languages.
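To see why early inclusion over-epochs scarce corpora, a back-of-the-envelope calculation with purely illustrative numbers (not from the paper):

```python
def epochs_over_corpus(total_tokens, sampling_prob, corpus_tokens):
    """How many passes a corpus gets under a fixed sampling probability."""
    return total_tokens * sampling_prob / corpus_tokens

# A 5M-token corpus sampled at 0.05% of the data: included from the start of a
# 2.3T-token phase it is repeated ~230 times; injected only into the 100B-token
# decay phase it is seen ~10 times, far less prone to memorizing noise.
print(epochs_over_corpus(2.3e12, 5e-4, 5e6))  # ~230 passes
print(epochs_over_corpus(1.0e11, 5e-4, 5e6))  # ~10 passes
```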

Benchmarks and results

The paper reports state-of-the-art or near-SOTA results among multilingual encoders, beating XLM-R on key multilingual understanding suites and achieving competitiveness with proprietary systems on classification metrics, while also excelling on retrieval-style tasks. Gains are pronounced for languages that appear only in the final phase, demonstrating the transfer benefits of the annealed curriculum.

Practical implications

  • Retrieval and classification: Encoder-only models remain ideal for embedding-heavy IR pipelines and production classifiers thanks to speed and memory efficiency; mmBERT further improves multilingual coverage and long-tail quality.
  • Low-resource NLP: The approach offers a recipe to lift many underrepresented languages without requiring massive new corpora, by leveraging transfer and carefully scheduled exposure.
  • Long context: Architecturally, mmBERT supports extended context windows (e.g., ~8K) inherited from ModernBERT optimizations, which benefits document classification and semantic search.

Fine-tuning guidance

  • Task heads: For classification, attach a pooled CLS head; for retrieval, prefer mean pooling of last-layer token embeddings or layer-mix pooling tuned on a dev set. Start with learning rates in the 1e−5 to 5e−5 range and weight decay ≈ 0.01.
  • Multilingual batching: Use temperature-based sampling when batching multilingual datasets to avoid overfitting to high-resource languages; anneal during curriculum-style fine-tuning if extending to more languages.
  • Long sequences: For 4K–8K inputs, enable fused attention kernels where available and unpadding to sustain throughput; gradient checkpointing is helpful on 24GB-class GPUs.
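The multilingual batching advice above can be sketched as a simple sampler. A minimal, hedged example where the `temperature_batches` helper and toy data are hypothetical:

```python
import random

def temperature_batches(datasets, tau=0.5, batch_size=16, steps=100, seed=0):
    """Yield batches whose examples' languages are drawn with p proportional to size**tau.

    `datasets` maps language code -> list of examples; tau < 1 upweights
    smaller languages relative to pure proportional sampling.
    """
    rng = random.Random(seed)
    langs = list(datasets)
    weights = [len(datasets[lang]) ** tau for lang in langs]
    for _ in range(steps):
        chosen = rng.choices(langs, weights=weights, k=batch_size)
        yield [(lang, rng.choice(datasets[lang])) for lang in chosen]

# Toy corpora: 'en' is 100x larger than 'sw'
data = {"en": [f"en-{i}" for i in range(1000)], "sw": [f"sw-{i}" for i in range(10)]}
counts = {"en": 0, "sw": 0}
for batch in temperature_batches(data, tau=0.3, steps=50):
    for lang, _ in batch:
        counts[lang] += 1
print(counts)  # 'sw' appears far more often than its 1% raw share
```

Annealing `tau` downward across fine-tuning stages mirrors the curriculum used in pretraining.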

Example: sentence embeddings with mmBERT

The snippet below outlines a typical retrieval embedding pipeline. It uses standard transformers APIs with mean pooling; adapt pooling to task needs and consider layerwise-weighted pooling for certain retrieval benchmarks.

import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "jhu-clsp/mmBERT-base"  # replace with the actual repo id once available
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = torch.sum(last_hidden_state * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts

texts = [
    "Neural information retrieval models are widely used in production.",
    "Les modèles d'extraction d'information neuronaux sont largement utilisés en production.",
    "النماذج العصبية للاسترجاع تستخدم على نطاق واسع في الإنتاج.",
]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)
    outputs = model(**batch)
    embeddings = mean_pool(outputs.last_hidden_state, batch["attention_mask"])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # (3, hidden_size)
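Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A short follow-up for nearest-neighbor retrieval, using random stand-in vectors in place of real model output:

```python
import torch

torch.manual_seed(0)
# Stand-in for the pooled, L2-normalized embeddings produced above
embeddings = torch.nn.functional.normalize(torch.randn(3, 768), p=2, dim=1)

scores = embeddings @ embeddings.T    # cosine similarity matrix
scores.fill_diagonal_(float("-inf"))  # exclude trivial self-matches
nearest = scores.argmax(dim=1)        # index of the closest text per row
print(nearest.shape)  # torch.Size([3])
```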

Example: multilingual text classification

This minimal trainer fine-tunes a linear head on top of the CLS token; for imbalanced multilingual datasets, consider class-weighted loss or focal loss, and use stratified sampling per language.

import torch
from torch import nn
from torch.optim import AdamW  # transformers' AdamW is deprecated; use torch's
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup

class TextCls(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts, self.labels, self.tokenizer, self.max_length = texts, labels, tokenizer, max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True, padding="max_length",
                             max_length=self.max_length, return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

model_id = "jhu-clsp/mmBERT-base"  # placeholder, update when released
tokenizer = AutoTokenizer.from_pretrained(model_id)
backbone = AutoModel.from_pretrained(model_id)

num_labels = 3
cls_head = nn.Linear(backbone.config.hidden_size, num_labels)

def forward(batch):
    outputs = backbone(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    cls = outputs.last_hidden_state[:, 0]  # CLS token
    return cls_head(cls)

# Prepare data
texts = ["I love multilingual IR.", "Det här är fantastiskt.", "Это неоднозначно."]
labels = [2, 2, 0]  # example
train_ds = TextCls(texts * 64, labels * 64, tokenizer)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)

# Optimizer and schedule
optimizer = AdamW(list(backbone.parameters()) + list(cls_head.parameters()), lr=3e-5, weight_decay=0.01)
num_epochs = 3
num_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(0.1 * num_steps),
                                            num_training_steps=num_steps)

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone.to(device); cls_head.to(device)

backbone.train(); cls_head.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = forward(batch)
        loss = nn.CrossEntropyLoss()(logits, batch["labels"])
        optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
    print(f"epoch {epoch+1} loss {loss.item():.4f}")

Research takeaways

  • Encoders are back: With efficient long-context training and annealed curricula, encoders can match or exceed large decoders on NLU while being far cheaper to run in retrieval/classification stacks.
  • Curriculum beats brute force: Strategically delaying low-resource languages until the decay phase yields larger aggregate gains than uniform multilingual pretraining from scratch.
  • Open recipe lineage: The training stack cites ModernBERT/Ettin-style open data and techniques, suggesting reproducibility and extensibility for future multilingual encoders.

Limitations and open questions

  • Script coverage vs. vocabulary growth: Adopting a broader tokenizer improves coverage but inflates embedding tables; exploring adaptive or subword-agnostic tokenization could reduce this overhead.
  • Data quality variance: Late-phase inclusion of many low-resource languages may import noise; robust filtering or teacher reranking could further stabilize training.
  • Downstream adaptation: While zero-shot gains are strong, careful task-specific heads and pooling strategies remain important to fully realize retrieval gains.

Final thoughts

mmBERT provides a compelling, scalable blueprint for multilingual encoders: stage languages with annealed sampling, reduce masking over time, and leverage long-context encoders to transfer knowledge efficiently to the long tail. For practitioners, it is an immediately useful base for multilingual retrieval, classification, and document understanding workloads.

Notebook: Google Colab


mmBERT: A Practical Implementation of Multilingual Encoder with Annealed Language Learning was originally published in Data Science in Your Pocket on Medium.
