Finally, a v1.0 release with 3x more data
The largest Russian STT dataset to date
- ~16 million utterances;
- ~20 000 hours;
- 2.3 TB of data (.wav files, int16 samples);
- A wide variety of practical, close to real-life domains;
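As a rough sanity check, the headline figures above are consistent with each other. The sketch below assumes 16 kHz mono audio, which these notes do not state; only the hour count, the utterance count, and the int16 sample format come from the list above. The function names are illustrative, and .wav header overhead is ignored.

```python
# Back-of-the-envelope check of the dataset size figures.
# From the release notes: ~20 000 hours, ~16 million utterances, int16 samples.
# ASSUMPTIONS (not stated in the notes): 16 kHz sample rate, mono audio.

HOURS = 20_000
UTTERANCES = 16_000_000
BYTES_PER_SAMPLE = 2        # int16
SAMPLE_RATE_HZ = 16_000     # assumption
CHANNELS = 1                # assumption


def estimate_size_tb(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                     bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Estimate raw PCM size in terabytes (10**12 bytes), ignoring .wav headers."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels / 1e12


def mean_utterance_seconds(hours, utterances):
    """Average utterance duration implied by total hours and utterance count."""
    return hours * 3600 / utterances


if __name__ == "__main__":
    print(f"Estimated size: {estimate_size_tb(HOURS):.2f} TB")
    print(f"Mean utterance length: {mean_utterance_seconds(HOURS, UTTERANCES):.1f} s")
```

Under these assumptions the estimate comes to about 2.3 TB, matching the stated size, and the average utterance is about 4.5 seconds long.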
Major highlights
- ~3 000 hours of a completely new domain - public speech;
- A huge Radio dataset update with **10 000+ hours**;
- A 5% demo version of new Radio/Public Speech datasets;
- Vastly improved dataset normalization;
- Overall annotation quality is improved:
  - Upstream model quality improvement;
  - No more "dangling" letters;
  - Improved voice activity detection;
See the TL;DR bullets above.
Next steps
- A major clean-up of past errors is planned for v1.1;
- Refine and publish speaker labels, probably add speakers for old datasets;
- Improve / re-upload some of the existing datasets, refine the STT labels;
- Probably add new languages;
- Add pre-trained models;