Download Tidy-X Dataset

This page provides the training and validation data needed to participate in the TidyLang 2026 Challenge. From the Mozilla Common Voice dataset, only the official Tidy-X partition may be used.


Dataset Components

Tidy-X Dataset (Train + Validation)

The complete dataset package containing both training and validation data for the TidyLang 2026 Challenge. This is the only official data from Mozilla Common Voice that participants may use.

Contents:

  • Tidy-X Train: Training dataset with multi-lingual speaker recordings (40 languages)
  • Tidy-X Valid: Validation dataset for system tuning and development
  • Speaker identity labels and language annotations
  • Cross-lingual speaker samples across both splits (each speaker recorded in 2–10 languages)

Download:

  • Size: approximately 50 GB
  • Format: .wav files, 16 kHz sampling frequency

📥 Tidy-X Complete Dataset (Train + Validation)

Click the link below to access the dataset page or download the API script:

🔗 View Dataset Page 📥 Download Script

Available via Mozilla Data Collective


Trial Pairs and Manifests for Validation (Development)

Trial pairs for the language verification task (same-language vs. different-language pairs) and the manifest for the training/validation data.

📥 Trial Pairs & Manifests (Validation)

Download the validation trial pairs and the training/validation manifest:

🔗 Trial Pairs (trials_val_lang.zip) 🔗 Manifest (training_manifest.txt)

Available via TidyLang2026-baseline
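
The same-language vs. different-language labelling can be illustrated with a minimal sketch. It assumes that each side of a trial is a relative path following the speakerID/languageID/file.wav layout described under Data Structure. The actual column layout of the trial list is not specified on this page, so the function names below are hypothetical and the files in trials_val_lang.zip remain the authoritative reference.

```python
from pathlib import PurePosixPath


def language_of(rel_path: str) -> str:
    """Return the languageID folder name of a speakerID/languageID/file.wav path.

    Assumption: paths follow the directory layout shown under Data Structure.
    """
    parts = PurePosixPath(rel_path).parts
    if len(parts) < 3:
        raise ValueError(f"expected speakerID/languageID/file.wav, got {rel_path!r}")
    return parts[-2]


def is_same_language(path_a: str, path_b: str) -> bool:
    """True if both recordings sit in languageID folders with the same name."""
    return language_of(path_a) == language_of(path_b)
```

For example, is_same_language("speaker_001/en/file1.wav", "speaker_002/en/file2.wav") compares the two languageID folders ("en" and "en") and ignores the speakerID.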


Evaluation Data

The evaluation set is not released before the evaluation phase. Details about the evaluation data (including size, languages, format, and trial structure) are kept confidential to ensure a fair and unbiased benchmark. We will release the evaluation data and the evaluation trial pair lists when the evaluation phase opens. At that time, registered participants will receive instructions on how to access the evaluation set and submit results. Please follow the Important Dates and Registration pages for updates.


Trial Pairs for Evaluation

The evaluation trial pair list will be released together with the evaluation data when the evaluation phase opens.

📋 Evaluation Trial Pairs

The submission format will be published when the evaluation phase opens. No details are disclosed beforehand.

Coming soon — stay tuned!


Download Instructions

  1. Registration: Please complete the registration process before downloading the dataset.

  2. Create a Mozilla Data Collective API key.
  3. Install Required Package:
    pip install datacollective
    
  4. Download Using Python Script: Download the download_tidyvoice.py script from the dataset download section above, then:
    • Replace YOUR_API_KEY_HERE with your Mozilla Data Collective API key
    • Update OUTPUT_DIR to your desired download location
    • Run: python download_tidyvoice.py
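
Once the download has finished, a quick sanity check can confirm that the audio matches the stated format (.wav, 16 kHz). The sketch below is not part of download_tidyvoice.py; the function name and the limit parameter are illustrative. It uses only the Python standard library and assumes the files are PCM-encoded, since the wave module raises wave.Error for other encodings.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # 16 kHz, as stated in the dataset description


def find_unexpected_rates(root, limit=None):
    """Return (path, sample_rate) for .wav files under root that are not 16 kHz.

    limit optionally caps how many files are inspected, which helps when
    spot-checking a large download.
    """
    mismatches = []
    for i, wav_path in enumerate(sorted(Path(root).rglob("*.wav"))):
        if limit is not None and i >= limit:
            break
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != EXPECTED_RATE:
            mismatches.append((wav_path, rate))
    return mismatches
```

Calling find_unexpected_rates(OUTPUT_DIR, limit=1000) with the download location used in the script inspects the first 1000 files in sorted order; an empty list means every inspected file reported a 16 kHz sample rate.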


Data Structure

Each dataset folder contains one folder per speaker (speakerID). Each speakerID folder contains one subfolder per language (languageID), which holds that speaker's audio files in that language.

Tidy-X_Train/Valid
├── speaker_001/
│   ├── en/          # English recordings
│   │   ├── file1.wav
│   │   ├── file2.wav
│   │   └── ...
│   ├── fa/          # Persian recordings
│   │   ├── file1.wav
│   │   └── ...
│   └── fr/          # French recordings
│       └── ...
├── speaker_002/
│   ├── de/          # German recordings
│   ├── it/          # Italian recordings
│   └── ...
└── ...

Structure explanation:

  • Tidy-X Train: Training data with speakerID folders at the root
  • Tidy-X Valid: Validation data with speakerID folders at the root
  • Each speakerID folder contains all recordings for that speaker
  • languageID subfolders organize recordings by language (en, fa, fr, de, it, etc.)
  • This structure makes it easy to access each speaker's recordings across languages, which supports language recognition under controlled speaker overlap
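
As an illustration of the layout above, the following sketch walks one split and builds a speakerID → languageID → files index. The function names are hypothetical, and the root path passed in is whatever folder the split was extracted to. Only the speakerID/languageID/*.wav nesting is taken from the description above.

```python
from collections import defaultdict
from pathlib import Path


def index_split(root):
    """Map speakerID -> languageID -> sorted list of .wav paths for one split."""
    index = defaultdict(dict)
    for speaker_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for lang_dir in sorted(p for p in speaker_dir.iterdir() if p.is_dir()):
            wavs = sorted(lang_dir.glob("*.wav"))
            if wavs:  # skip empty language folders
                index[speaker_dir.name][lang_dir.name] = wavs
    return dict(index)


def languages_per_speaker(index):
    """Return the number of languages found for each speaker in the index."""
    return {speaker: len(langs) for speaker, langs in index.items()}
```

Running languages_per_speaker(index_split(path_to_split)) gives a per-speaker language count, which can be compared against the stated 2–10 languages per speaker.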


Support

If you encounter any issues with the dataset download or have questions about the data format, please contact:

  • Email: aref.farhadipour@uzh.ch


Citation

If you use the Tidy-X / TidyVoice dataset in your research, please cite:

@misc{farhadi2026tidy,
      title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
      author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
      year={2026},
      journal={ICASSP2026},
      url={https://arxiv.org/abs/2601.16358},
}