Here you can access the training data and validation data needed to participate in the TidyLang 2026 Challenge. Only the official Tidy-X partition may be used from the Mozilla Common Voice dataset.
This is the complete dataset package, containing both the training and validation data for the TidyLang 2026 Challenge. It is the only official data from Mozilla Common Voice that participants may use.
Contents:
Download:
Click the link below to access the dataset page or download the API script:
Available via Mozilla Data Collective
Trial pairs for the verification task and manifest files for the train/validation data portion (language verification: same-language vs different-language pairs).
Download the validation trial pairs and the training/validation manifest:
Available via TidyLang2026-baseline
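As an illustration of what a same-language vs different-language trial means, the sketch below derives a label for a pair of utterances from their paths. It assumes each utterance is referenced by a relative path of the form speakerID/languageID/file.wav, matching the folder layout of the dataset. The format of the official trial and manifest files is not described on this page, so the function names `language_of` and `is_same_language` are hypothetical, and the labels in the official trial lists remain authoritative.

```python
from pathlib import PurePosixPath


def language_of(rel_path):
    """Return the languageID folder of a 'speakerID/languageID/file.wav' path.

    Assumption: paths follow the speakerID/languageID/file layout of the
    dataset folders; the official manifest format may differ.
    """
    parts = PurePosixPath(rel_path).parts
    if len(parts) < 3:
        raise ValueError(f"expected speakerID/languageID/file, got {rel_path!r}")
    return parts[-2]


def is_same_language(path_a, path_b):
    """True if both utterances sit in a languageID folder with the same name."""
    return language_of(path_a) == language_of(path_b)
```

For example, `is_same_language("speaker_001/en/file1.wav", "speaker_002/en/file1.wav")` is a same-language pair from two different speakers, while swapping the second path for `"speaker_001/fa/file1.wav"` gives a different-language pair from a single speaker.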
The evaluation set is not released before the evaluation phase. Details about the evaluation data (including size, languages, format, and trial structure) are kept confidential to ensure a fair and unbiased benchmark. We will release the evaluation data and the evaluation trial pair lists when the evaluation phase opens. At that time, registered participants will receive instructions on how to access the evaluation set and submit results. Please follow the Important Dates and Registration pages for updates.
The evaluation trial pair list will be released together with the evaluation data when the evaluation phase opens.
The submission format will be published when the evaluation phase opens. No details are disclosed beforehand.
Coming soon — stay tuned!
Registration: Please complete the registration process before downloading the dataset.
pip install datacollective
Download the download_tidyvoice.py script from the dataset download section above, then:
Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key.
Set OUTPUT_DIR to your desired download location.
Run the script:
python download_tidyvoice.py
The dataset is organized with speakerID folders directly inside each dataset folder. Each speakerID folder contains languageID subfolders, which hold that speaker's audio files in that language.
Tidy-X_Train/Valid
├── speaker_001/
│ ├── en/ # English recordings
│ │ ├── file1.wav
│ │ ├── file2.wav
│ │ └── ...
│ ├── fa/ # Persian recordings
│ │ ├── file1.wav
│ │ └── ...
│ └── fr/ # French recordings
│ └── ...
├── speaker_002/
│ ├── de/ # German recordings
│ ├── it/ # Italian recordings
│ └── ...
└── ...
Structure explanation:
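The layout above can be traversed with a few lines of Python. This is a minimal sketch, not part of the official tooling: it assumes the two-level speakerID/languageID layout shown in the tree and that audio files use the .wav extension. The function name `index_dataset` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map speakerID -> languageID -> sorted list of .wav paths.

    Assumption: `root` is a dataset folder whose direct children are
    speakerID folders, each containing languageID subfolders.
    """
    index = defaultdict(dict)
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # skip stray files at the top level
        for lang_dir in sorted(speaker_dir.iterdir()):
            if not lang_dir.is_dir():
                continue
            wavs = sorted(lang_dir.glob("*.wav"))
            if wavs:  # ignore language folders without audio
                index[speaker_dir.name][lang_dir.name] = wavs
    return dict(index)
```

Calling `index_dataset(OUTPUT_DIR_SUBFOLDER)` on one of the downloaded dataset folders (the exact folder name depends on your download location) returns a nested dictionary, so `index["speaker_001"]["en"]` would list the English recordings of that speaker.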
If you encounter any issues with the dataset download or have questions about the data format, please contact:
If you use the Tidy-X / TidyVoice dataset in your research, please cite:
@misc{farhadi2026tidy,
title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
year={2026},
journal={ICASSP2026},
url={https://arxiv.org/abs/2601.16358},
}