
The TidyLang Challenge addresses the critical problem of language recognition when the same speaker speaks multiple languages. Language recognition systems are typically evaluated under the assumption that speaker identity is a nuisance variable. However, in realistic multilingual environments, speakers often switch languages across different contexts, creating a risk that models rely on speaker-specific traits (“shortcut learning”) rather than robust linguistic cues.
This challenge uses the Tidy-X dataset, a curated, large-scale multilingual corpus derived from Mozilla Common Voice that emphasizes language switching and multilingual-per-speaker data (each speaker contributes utterances in 2–10 languages). Participants will build systems that disentangle speaker identity from language and generalize to completely unseen (zero-shot) languages. Performance is evaluated on two tasks: (1) language identification on the 35 seen (training) languages, reported as Macro accuracy; (2) unseen language recognition on 40 unseen languages, enrollment-based (20–65 s of audio per enrollment ID, compared with each test utterance), reported as EER. More detail is given on the Evaluation Plan and Baseline Systems pages.
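As a concrete reference for the two metrics, the minimal sketch below computes a mean of per-language accuracies and an EER from target/non-target trial scores. It is not the official scoring tool; the function names and the exact reading of "Macro accuracy" as mean per-language accuracy are our assumptions, so check the Evaluation Plan and the baseline repository for the authoritative definitions.

```python
# Minimal metric sketch. Assumptions: "Macro accuracy" = mean of per-language
# accuracies, and EER is computed from raw trial scores with binary
# target/non-target labels. The official scoring scripts are authoritative.
import numpy as np

def macro_accuracy(y_true, y_pred):
    """Mean of per-language accuracies over the seen languages."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    langs = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == l] == l).mean() for l in langs]))

def equal_error_rate(scores, labels):
    """EER for the enrollment-based task; labels: 1 = target, 0 = non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(-scores)]  # accept top scores first
    n_tar, n_non = (labels == 1).sum(), (labels == 0).sum()
    far = np.cumsum(labels == 0) / n_non        # false-acceptance rate after accepting top-k
    frr = 1.0 - np.cumsum(labels == 1) / n_tar  # miss rate after accepting top-k
    k = int(np.argmin(np.abs(far - frr)))
    return float((far[k] + frr[k]) / 2.0)
```

For example, `macro_accuracy(["en", "fr", "fr"], ["en", "fr", "en"])` returns 0.75 (English 1.0, French 0.5), regardless of how many utterances each language contributes.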
By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards trustworthy, identity-invariant, and linguistically grounded language recognition technologies.
The TidyLang Challenge is a Speaker-Controlled and Zero-Shot Language Recognition challenge. The only permitted data from Mozilla Common Voice is the official Tidy-X training and validation partition; all other Common Voice data is strictly forbidden. The core task is spoken language recognition at the utterance level under controlled speaker-overlap conditions.
Evaluation tasks (in both the closed and open conditions): (1) language identification on 35 seen languages, reported as Macro accuracy, and (2) unseen language recognition on 40 unseen languages, enrollment-based and reported as EER.
See the Evaluation Plan and Baseline Systems for protocols and trial formats.
The challenge uses the Tidy-X dataset, a curated partition from Mozilla Common Voice featuring:
- Multiple languages per speaker: each speaker contributes utterances in 2–10 languages, with an emphasis on language switching.
- 35 seen languages available for training and validation.
- 40 additional unseen languages reserved for zero-shot evaluation.
Note on splits: The training and validation portions used in this challenge are different from the original splits of the Tidyvox dataset. Participants must follow the official manifest provided in the baseline repository: training_manifest.txt.
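A minimal sketch of how a training pipeline might restrict itself to the official manifest is shown below. The actual column layout of training_manifest.txt is not specified here, so we only assume (hypothetically) that the first whitespace-separated field on each non-empty line is a clip path; verify against the baseline repository before use.

```python
# Hypothetical sketch: keep only clips listed in the official manifest.
# Assumption: the first whitespace-separated field of each line in
# training_manifest.txt is the clip path; check the baseline repository.
def load_official_clips(manifest_path="training_manifest.txt"):
    allowed = set()
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split()
            if fields:
                allowed.add(fields[0])
    return allowed

allowed_clips = load_official_clips()
# e.g. filter your own dataset index:
# train_items = [item for item in all_items if item["path"] in allowed_clips]
```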
Evaluation set: Details about the evaluation data (including size, languages, and trial structure) are not disclosed before the evaluation phase to ensure a fair and unbiased benchmark. We will release the evaluation data and the evaluation trial pair lists when the evaluation phase opens.
Participants can submit to two conditions. The same Common Voice rule applies to both: no additional data from the Common Voice dataset may be used for training beyond the official Tidy-X partition.
Closed condition. Goal: a level playing field so that participants focus on methodological innovation (e.g., architecture, training strategy, loss design) rather than on extra data.
Open condition. Goal: explore how much language recognition can be improved in a general setting by leveraging additional LID-oriented data (public or private), while keeping the evaluation comparable and the Common Voice boundary clear.
For each condition (closed and, if submitted, open), the final evaluation consists of two tasks:
1. Language identification on the 35 seen languages, reported as Macro accuracy.
2. Unseen language recognition on 40 unseen languages, enrollment-based (20–65 s of audio per enrollment ID, compared with each test utterance), reported as EER.
More detail on protocols and trial formats is given in the Evaluation Plan. Rankings may be computed per task and/or per condition; details will be given when the evaluation phase opens.
During validation (development phase): Both an identification set (35 seen languages) and an enrollment-based verification set (enrollment IDs + trial file) are provided, so participants can evaluate both tasks locally before the final evaluation phase.
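One common way to score the enrollment-based verification trials is to average the language embeddings of all audio belonging to an enrollment ID and compare the result with each test utterance by cosine similarity. The sketch below illustrates that idea under assumptions: `embed` is a hypothetical placeholder for whatever embedding extractor your system provides, and cosine scoring is one choice among many, not a prescribed protocol.

```python
# Sketch of enrollment-based trial scoring. `embed` is a hypothetical
# placeholder: it maps a waveform (or file path) to a language embedding.
import numpy as np

def enrollment_model(embed, enrollment_wavs):
    """Length-normalized mean embedding over the 20-65 s of enrollment audio."""
    embs = np.stack([embed(w) for w in enrollment_wavs])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def score_trial(embed, enrollment_wavs, test_wav):
    """Cosine similarity between the enrollment model and one test utterance."""
    enroll = enrollment_model(embed, enrollment_wavs)
    test = embed(test_wav)
    return float(np.dot(enroll, test / np.linalg.norm(test)))
```

Scores from all trials, together with the target/non-target labels from the trial file, then feed directly into the EER computation sketched earlier.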
Development Phase: Participants use the provided training and validation data to develop and tune their systems. Validation includes both an identification set (35 seen languages) and an enrollment-based verification set, so you can measure Macro accuracy and EER on the validation data. You can experiment with different approaches, architectures, and hyperparameters using the official Tidy-X splits.
Evaluation Phase: When the evaluation phase opens, the evaluation set and submission procedure (including the CodaBench link) will be announced. Participants must submit results for the closed condition; submitting to the open condition is optional. Results will be reported for both language identification (Macro accuracy) and unseen language recognition (EER) in each condition. Rankings will be determined by performance on the evaluation set.
A speech signal from a single person carries multiple types of information: the speaker's identity, the content of the speech, emotional state, the language being spoken, and more. In this challenge, we aim to develop systems that, given a speech signal from a human, recognize the language in a way that is independent of speaker identity, relying on phonetic and phonotactic cues rather than speaker-specific shortcuts and generalizing to unseen languages.
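As one illustration of what "independent of speaker identity" can mean in practice, the sketch below uses speaker-adversarial training with a gradient-reversal layer: a shared encoder is trained to support language classification while being penalized whenever its embedding lets an auxiliary head predict the speaker. This is only one possible approach, not the baseline or a required method; all module and parameter names are illustrative.

```python
# Speaker-adversarial sketch (PyTorch). The encoder learns language-relevant
# features while the gradient-reversal layer pushes it to discard speaker cues.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

class SpeakerInvariantLID(nn.Module):
    def __init__(self, encoder, emb_dim, n_langs, n_speakers, lam=1.0):
        super().__init__()
        self.encoder = encoder                      # any utterance-level encoder
        self.lang_head = nn.Linear(emb_dim, n_langs)
        self.spk_head = nn.Linear(emb_dim, n_speakers)
        self.lam = lam

    def forward(self, x):
        z = self.encoder(x)                         # (batch, emb_dim)
        lang_logits = self.lang_head(z)
        spk_logits = self.spk_head(GradReverse.apply(z, self.lam))
        return lang_logits, spk_logits
```

Training minimizes cross-entropy on both heads; because of the reversal, the encoder is simultaneously optimized to make the speaker head fail, which discourages speaker-specific shortcuts.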