Paper: SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects (arXiv:2309.07445)
AfroXLMR-base-114L was created by masked language modeling (MLM) adaptation of the expanded XLM-R-base model on 114 languages widely spoken in Africa, including 4 high-resource languages.
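To illustrate what MLM adaptation trains the model to do, here is a minimal sketch of the token-masking step. It assumes the conventional BERT-style recipe (select about 15% of tokens, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged); this card does not state the rates actually used for AfroXLMR, and `mask_tokens`, `MASK_TOKEN`, and the toy vocabulary are hypothetical names introduced only for this example.

```python
import random

# Assumption: XLM-R style mask token; used here for illustration only.
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Build one MLM training example from a list of tokens.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict `tok`.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # 80%: mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # 10%: random token
            else:
                masked.append(tok)                 # 10%: keep original
        else:
            # Not selected: input is unchanged and contributes no loss.
            labels.append(None)
            masked.append(tok)
    return masked, labels


# Toy whitespace-tokenized sentence; a real setup would use subword tokens.
tokens = "habari ya asubuhi rafiki yangu".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=0)
```

During adaptation, the loss is computed only at positions where `labels` is not `None`, so the model learns to recover the original tokens from their context in each of the target languages.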
Training data: a mix of mC4, Wikipedia, and OPUS data.
There are 76 languages available:
We would like to thank Google Cloud for providing us access to TPU v4-8 through free cloud credits. The model was trained using Flax and then converted to PyTorch.
@misc{adelani2023sib200,
title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects},
author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
year={2023},
eprint={2309.07445},
archivePrefix={arXiv},
primaryClass={cs.CL}
}