Finally a v1.0 release with 3x more data

Published on November 5, 2019 by Alexander Veysov

The largest Russian STT dataset to date

- ~16 million utterances;

- ~20 000 hours;

- 2.3 TB of data (in .wav format, int16);

- A wide variety of practical, close to real-life domains;
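As a rough sanity check, the headline figures above are consistent with each other if the audio is stored as 16 kHz mono PCM. Note that the sample rate and channel count are assumptions made for this sketch; the post only states the int16 .wav format. WAV header overhead is ignored.

```python
# Back-of-the-envelope size of uncompressed int16 audio.
# ASSUMPTIONS (not stated in the post): 16 kHz sample rate, mono.
HOURS = 20_000            # from the post (~20 000 hours)
UTTERANCES = 16_000_000   # from the post (~16 million utterances)
BYTES_PER_SAMPLE = 2      # int16, from the post
SAMPLE_RATE_HZ = 16_000   # assumed
CHANNELS = 1              # assumed


def pcm_size_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                   channels=CHANNELS, bytes_per_sample=BYTES_PER_SAMPLE):
    """Return the raw PCM payload size in bytes (WAV headers ignored)."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * channels * bytes_per_sample


total_bytes = pcm_size_bytes(HOURS)
total_tb = total_bytes / 1e12                  # decimal terabytes
avg_utterance_s = HOURS * 3600 / UTTERANCES    # mean utterance length

print(f"Estimated size: {total_tb:.2f} TB")
print(f"Average utterance length: {avg_utterance_s:.1f} s")
```

Under these assumptions the estimate lands at about 2.3 TB, with an average utterance length of roughly 4.5 seconds. A different sample rate or channel count would scale the size proportionally.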

Major highlights

- ~3 000 hours of a completely new domain - public speech;

- A huge Radio dataset update with **10 000+ hours**;

- A 5% demo version of new Radio/Public Speech datasets;

- Vastly improved dataset normalization;

- Overall annotation quality is improved:

  - Upstream model quality improvement;

  - No more "dangling" letters;

  - Improved voice activity detection;

See the TL;DR bullets above.

Next steps

- A major clean-up of past errors is planned for v1.1;

- Refine and publish speaker labels, and possibly add speaker labels for the older datasets;

- Improve / re-upload some of the existing datasets, refine the STT labels;

- Probably add new languages;

- Add pre-trained models;