AMHTokenizer

Open-source Amharic tokenizer: decomposes 300+ Fidel syllables into base consonants and vowels, shrinking the vocabulary to about 40 tokens to speed up training, cut model size, and scale NLP.

Contribute


Become a financial contributor.

Financial Contributions

Goal

LIMITED: 100 LEFT OUT OF 100

Support the development of AMH Tokenizer, the first open-source Amharic tokenization library for NLP. Your contribution helps us maintain the codebase.

$0.00 USD of $1,000 USD raised (0%)

Starts at
$5 USD
Goal
Sponsor

LIMITED: 1000 LEFT OUT OF 1000

Become a sponsor for $100.00 and support us

$0.00 USD of $1,000 USD raised (0%)

Starts at
$100 USD
Custom contribution
Donation
Make a custom one-time or recurring contribution.

AMHTokenizer is all of us

Our contributors 1

Thank you for supporting AMHTokenizer.

Sefineh Tesfa

Admin
Excited to launch AMHTokenizer on Open Collective!


News from AMHTokenizer

Updates on our activities and progress.

AMHTokenizer Now on Open Collective – Join and Support!

We’ve added AMHTokenizer to Open Collective! This means our community can now support the project directly and help us continue improving Amharic language tokenization. Stay tuned for more updates...
Published on October 20, 2025 by Sefineh Tesfa

Conversations

Let’s get the discussion going! This is a space for the community to converse, ask questions, say thank you, and get things done together.

Thank you!

Published on October 17, 2025 by Sefineh Tesfa

Thank you for your dedication to advancing BPE with support for syllabic languages.

About


Amharic, spoken by over 58 million people in Ethiopia and the diaspora, has been largely overlooked in modern AI. While English and other Latin-based languages benefit from advanced NLP tools, Amharic presents a unique challenge: its Fidel script contains over 300 distinct characters, each representing a consonant–vowel syllable. Standard tokenizers treat each symbol separately, leading to large vocabularies, sparse data, and slow, inefficient model training.

To address this, I developed the Amharic Tokenizer, an open-source Python library designed specifically for Amharic. By decomposing each Fidel into its base consonant and vowel components, the tokenizer reduces the effective vocabulary from over 300 tokens to just 40, while preserving the full meaning of the language.

This approach improves AI training efficiency by up to 70%, enables rare syllables to be composed dynamically, and significantly reduces model size. It also sets the foundation for scaling to other African languages, which share similar scripts, potentially impacting millions of speakers.

Funding will support continued development, dataset creation, and open-source model training, helping African languages achieve parity with English in AI systems. With your support, we can ensure these languages are not left behind, but become central to the next generation of AI innovation.
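The decomposition idea can be sketched in a few lines of Python. This is not the library's actual API, just an illustration under one assumption: the Unicode Ethiopic block (U+1200–U+137F) lays out syllables in rows of 8 codepoints, one row per base consonant and one column per vowel order, so consonant and vowel indices fall out of simple arithmetic on the codepoint. The `decompose` and `tokenize` names and the romanized vowel labels are hypothetical.

```python
# Illustrative sketch (not AMHTokenizer's real API): split each Fidel
# syllable into a (consonant, vowel) pair using the row/column layout
# of the Unicode Ethiopic block, U+1200-U+137F.
ETHIOPIC_START = 0x1200
ETHIOPIC_END = 0x137F

# Seven canonical vowel orders plus the eighth "labialized" slot;
# the romanizations here are rough, illustrative labels.
VOWEL_ORDERS = ["e", "u", "i", "a", "ie", "ə", "o", "wa"]

def decompose(char):
    """Return (consonant_index, vowel_index) for one Fidel syllable,
    or None if the character is not in the Ethiopic block."""
    cp = ord(char)
    if not (ETHIOPIC_START <= cp <= ETHIOPIC_END):
        return None  # pass non-Ethiopic characters through unchanged
    offset = cp - ETHIOPIC_START
    return offset // 8, offset % 8  # row = consonant, column = vowel

def tokenize(text):
    """Emit a (consonant_index, vowel_label) pair per syllable; any
    non-Ethiopic character is kept as a literal token."""
    tokens = []
    for ch in text:
        parts = decompose(ch)
        if parts is None:
            tokens.append(ch)
        else:
            consonant, vowel = parts
            tokens.append((consonant, VOWEL_ORDERS[vowel]))
    return tokens

# "selam" (ሰላም) becomes three (consonant, vowel) pairs instead of
# three opaque single-character tokens.
print(tokenize("ሰላም"))
```

Because every syllable reduces to a consonant index and one of 8 vowel slots, the model only ever needs the small combined inventory rather than 300+ standalone symbols, and rare syllables can be rebuilt from parts the model has already seen.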

Our team

Sefineh Tesfa

Admin
Excited to launch AMHTokenizer on Open Collective!