ATE-2: State-of-the-Art Armenian Text Embeddings and the ArmBench-TextEmbed Benchmark

Community Article Published March 19, 2026

Low-resource languages (LRLs) usually face a high barrier to entry in modern NLP. The prevailing assumption is that training effective text embedding models for tasks like Retrieval-Augmented Generation (RAG) and semantic search requires massive and/or high-quality datasets.

With the release of our ATE-2 (Armenian Text Embeddings 2) models, we are challenging that assumption.

We are open-sourcing a complete ecosystem for Armenian text embeddings: new base and large models, the ArmBench-TextEmbed standardized benchmark, and the underlying training dataset.

Insight: 10k Noisy Pairs is All You Need

In our upcoming paper (accepted to LoResLM @ EACL 2026), we generated small-scale, noisy synthetic data by translating English Reddit title-body pairs using open-weights models.
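The pair-construction step can be sketched as follows. This is a minimal, hedged illustration of the idea (translated title-body pairs become query/positive training examples), not the paper's actual pipeline: `translate_to_armenian` is a hypothetical stub standing in for an open-weights MT model.

```python
import json

# Hypothetical stub: in practice this would call an open-weights
# translation model; here it just passes the text through.
def translate_to_armenian(text: str) -> str:
    return text

def build_pairs(reddit_posts):
    """Turn English (title, body) posts into query/positive pairs,
    in the spirit of the synthetic data described above."""
    pairs = []
    for title, body in reddit_posts:
        pairs.append({
            "query": translate_to_armenian(title),
            "positive": translate_to_armenian(body),
        })
    return pairs

posts = [("How do I sort a list?", "Use the built-in sorted() function.")]
print(json.dumps(build_pairs(posts)[0], ensure_ascii=False))
```

The key point is that the pairs can be noisy: translation artifacts in either field do not prevent the model from learning the query-passage alignment.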

Our experiments revealed a surprising "Less is More" phenomenon:

  • Fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yielded 11-12% average improvements across our benchmark.
  • It drove a 20%+ relative improvement in retrieval performance.
  • The model trained on 10k noisy pairs matched the performance of models trained on 1 million examples or on comparable high-quality data.
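Fine-tuning an embedding model on such pairs typically uses a contrastive objective with in-batch negatives: each query should score its own passage higher than every other passage in the batch. As a minimal sketch (plain NumPy, not the actual mE5 training code), the loss looks like this:

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: query i's positive is passage i."""
    # L2-normalize so the dot product is cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = q @ p.T / temperature
    # softmax cross-entropy with the diagonal (matching passage) as target
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 32))
positives = queries + 0.01 * rng.normal(size=(8, 32))  # well-aligned pairs
print(info_nce_loss(queries, positives))
```

Because every other pair in the batch serves as a negative, even a small dataset of 10k pairs yields far more contrastive signal than its raw size suggests.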

Scaling the data, using SOTA LLMs for better translation quality, or diversifying domains did not yield significant gains over this 10k baseline. Semantic alignment for LRLs saturates early and is highly robust to noise.

SOTA Performance: Native and Transliterated

A major challenge in practical Armenian NLP is the widespread use of transliterated text (writing Armenian words using the Latin alphabet). ATE-2 was built to handle both native and transliterated queries.

On the new ArmBench-TextEmbed, ATE-2-large not only outperforms its predecessor but also substantially outperforms leading open-source and proprietary models, as shown in the leaderboard snapshot below (selected models only).

| Model | Size | Armenian Script | Translit |
|---|---|---|---|
| ATE-2-large (Ours) | 560M | 0.805 | 0.461 |
| Gemini-embedding-001 | - | 0.774 | 0.373 |
| Qwen3-Embedding-8B | 7.6B | 0.756 | 0.295 |
| ATE-2-base (Ours) | 278M | 0.767 | 0.278 |
| OpenAI text-embed-3-large | - | 0.296 | 0.088 |
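The script-level scores above are averages over tasks such as retrieval. As a minimal illustration of how a single retrieval score can be computed from embeddings (toy random vectors here, not the actual ArmBench-TextEmbed protocol), here is top-1 accuracy over cosine similarities:

```python
import numpy as np

def top1_accuracy(query_emb, doc_emb):
    """Fraction of queries whose nearest document (by cosine) is the gold one.
    Convention: query i's gold document is document i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    nearest = (q @ d.T).argmax(axis=1)
    return float((nearest == np.arange(len(q))).mean())

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
queries = docs + 0.01 * rng.normal(size=(8, 16))  # near-duplicate queries
print(top1_accuracy(queries, docs))
```

Benchmarks like ArmBench-TextEmbed use richer rank-aware metrics (e.g. nDCG), but the core mechanic (embed, score by cosine, rank) is the same.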

Note: For a granular breakdown of Retrieval, STS, and Classification tasks, please check the ArmBench-TextEmbed Leaderboard.

Democratizing LRL Embeddings

By proving that high-performance semantic alignment can be achieved with tiny, noisy, synthetic datasets, we hope this framework provides a blueprint for resource-constrained communities working on other LRLs.

Explore the Ecosystem:
