AMHTokenizer
Open-source Amharic tokenizer: decomposes 300+ Fidel syllables into base consonants and vowels, shrinking the vocabulary to about 40 tokens for faster AI training, smaller models, and scalable NLP.
News from AMHTokenizer
Updates on our activities and progress.
AMHTokenizer Now on Open Collective – Join and Support!
We’ve added AMHTokenizer to Open Collective! This means our community can now support the project directly and help us continue improving Amharic language tokenization. Stay tuned for more updates...
Published on October 20, 2025 by Sefineh Tesfa
Conversations
Let’s get the discussion going! This is a space for the community to converse, ask questions, say thank you, and get things done together.
Thank you!
Published on October 17, 2025 by Sefineh Tesfa
Thank you for your dedication to advancing BPE with support for syllabic languages.
About
Amharic, spoken by over 58 million people in Ethiopia and the diaspora, has been largely overlooked in modern AI. While English and other Latin-based languages benefit from advanced NLP tools, Amharic presents a unique challenge: its Fidel script contains over 300 distinct characters, each representing a consonant–vowel syllable. Standard tokenizers treat each symbol separately, leading to large vocabularies, sparse data, and slow, inefficient model training.
To address this, I developed the Amharic Tokenizer, an open-source Python library designed specifically for Amharic. By decomposing each Fidel into its base consonant and vowel components, the tokenizer reduces the effective vocabulary from over 300 tokens to just 40, while preserving the full meaning of the language.
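The core idea can be illustrated in a few lines of Python. The sketch below relies only on the Unicode layout of the Ethiopic block, where each base consonant occupies a row of eight consecutive code points (one per vowel order) starting at U+1200; the function names and vowel labels are illustrative assumptions, not necessarily the actual AMHTokenizer API.

```python
# A minimal sketch of consonant–vowel decomposition for Fidel syllables,
# assuming the standard Unicode Ethiopic layout (rows of 8 code points per
# base consonant, starting at U+1200). `decompose` and `compose` are
# hypothetical names, not the confirmed AMHTokenizer API.

ETHIOPIC_START = 0x1200
ROW_SIZE = 8
# Conventional vowel orders of the Fidel; the 8th slot holds -wa forms
# in some consonant rows.
VOWEL_ORDERS = ["ä", "u", "i", "a", "e", "ə", "o", "wa"]

def decompose(syllable: str) -> tuple[int, str]:
    """Split one Fidel syllable into (consonant row index, vowel order)."""
    offset = ord(syllable) - ETHIOPIC_START
    row, order = divmod(offset, ROW_SIZE)
    return row, VOWEL_ORDERS[order]

def compose(row: int, order: int) -> str:
    """Rebuild a syllable from its consonant row and vowel-order index."""
    return chr(ETHIOPIC_START + row * ROW_SIZE + order)

# ሰላም ("selam") maps from 3 syllable characters to 3 consonant–vowel pairs.
for ch in "ሰላም":
    print(ch, decompose(ch))
```

Because every syllable maps to a small (consonant, vowel) inventory, even a syllable that never appears in the training data can still be encoded and reconstructed from components the model already knows.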
This approach improves AI training efficiency by up to 70%, enables rare syllables to be composed dynamically, and significantly reduces model size. It also lays the foundation for scaling to other African languages that use similar scripts, potentially reaching millions more speakers.
Funding will support continued development, dataset creation, and open-source model training, helping African languages achieve parity with English in AI systems. With your support, we can ensure these languages are not left behind, but become central to the next generation of AI innovation.