BirdCLEF 2026
Acoustic species ID in Pantanal soundscapes — research-track Kaggle. Pipeline live with a class-prior baseline; deep audio modeling is next.
The challenge
The task works on five-second windows of long soundscape recordings,
one row per (recording, window). Fill a probability for every
(window, species) cell of a dense submission matrix. Train labels are
weak: the primary_label is what the field
recorder believed dominated the clip, not a frame-level annotation.
The taxonomy mixes birds with mammals and amphibians, so a
clipwise classifier alone will leak non-bird taxa
into bird probability columns.
Research-track. Prize 75,000 USD. Deadline 2026-06-03. Recordings from Mato Grosso do Sul, Brazil — the Pantanal.
Data schema
train.csv: lives in the Kaggle drop, not in this repo. Columns above are what inspect_birdclef_data.py reads (lines 24-25).
taxonomy.csv
sample_submission.csv: the row_id naming scheme is the strongest signal. Scoring happens per 5-second window, not per recording.
Dataset spotlight
The taxonomy mixes 162 birds with non-bird taxa. A clipwise bird classifier will leak probability mass into mammal/amphibian columns unless explicitly filtered.
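A minimal sketch of how the submission columns could be partitioned by taxon before any filtering. The `class_name` column is the one the inspection script reads from taxonomy.csv; the `primary_label` column name in taxonomy.csv, the `"Aves"` class value, and the toy rows are assumptions made for illustration, not values taken from the competition data.

```python
from collections import defaultdict


def columns_by_class(taxonomy_rows, label_key="primary_label", class_key="class_name"):
    """Group species labels by taxonomic class (column names are assumed)."""
    groups = defaultdict(list)
    for row in taxonomy_rows:
        groups[row[class_key]].append(row[label_key])
    return dict(groups)


def non_bird_columns(groups, bird_class="Aves"):
    """Return the sorted labels whose class differs from the assumed bird class."""
    return sorted(
        label
        for cls, labels in groups.items()
        if cls != bird_class
        for label in labels
    )


# Toy taxonomy rows, invented for this example.
toy_taxonomy = [
    {"primary_label": "sp_a", "class_name": "Aves"},
    {"primary_label": "sp_b", "class_name": "Aves"},
    {"primary_label": "sp_c", "class_name": "Amphibia"},
    {"primary_label": "sp_d", "class_name": "Mammalia"},
]
groups = columns_by_class(toy_taxonomy)
gated = non_bird_columns(groups)
```

With the column groups in hand, a filter can treat bird and non-bird columns separately instead of relying on a single classifier to keep them apart.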
Top labels are tightly packed (≈490 each) — the head of the distribution is well-balanced, but the tail of the 234 species drops off sharply.
XenoCanto deployments cluster in the Pantanal; iNaturalist recordings extend across South America.
The orange box is the Pantanal recorder-deployment area called out in the data brief. XenoCanto submissions cluster there; iNaturalist contributions are scattered across the continent — useful for species-presence priors, less useful for soundscape acoustics.
The current baseline
The shipped baseline is intentionally minimal. It does the boring-but-important thing first: confirm that the submission shape, the row-ID scheme, and the inference loop work end-to-end before any model is in the picture.
```python
#!/usr/bin/env python3
import csv
import json
from collections import Counter
from pathlib import Path

ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUTPUT_DIR = ROOT / "outputs/analysis"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def read_rows(path: Path) -> list[dict[str, str]]:
    # Load a CSV file as a list of dicts keyed by the header row.
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def main() -> None:
    train_rows = read_rows(DATA_DIR / "train.csv")
    taxonomy_rows = read_rows(DATA_DIR / "taxonomy.csv")
    sample_rows = read_rows(DATA_DIR / "sample_submission.csv")

    # Frequency tables; rows with an empty label or collection are skipped.
    class_counts = Counter(row["class_name"] for row in taxonomy_rows)
    label_counts = Counter(row["primary_label"] for row in train_rows if row.get("primary_label"))
    collection_counts = Counter(row["collection"] for row in train_rows if row.get("collection"))

    summary = {
        "train_rows": len(train_rows),
        "taxonomy_rows": len(taxonomy_rows),
        "submission_rows": len(sample_rows),
        # Every submission column except the row ID is a class column.
        "num_submission_classes": max(len(sample_rows[0]) - 1, 0) if sample_rows else 0,
        "taxonomy_class_counts": dict(class_counts.most_common()),
        "collection_counts": dict(collection_counts.most_common()),
        "top_primary_labels": label_counts.most_common(15),
        "recording_location": (DATA_DIR / "recording_location.txt").read_text(encoding="utf-8").strip(),
    }

    output_path = OUTPUT_DIR / "data_summary.json"
    output_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()
```

train.py counts primary_label occurrences and writes a priors JSON.
predict.py fills every (window, class) cell of sample_submission.csv
with those priors. The leaderboard score is whatever a class-prior
table earns — but the pipeline ships.
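The two steps just described can be sketched as follows. This is an illustration under assumptions, not the repo's train.py or predict.py: the function names are hypothetical, the prior is taken to be each class's relative frequency among the submission classes, and the ID column is assumed to be named `row_id`.

```python
from collections import Counter


def compute_priors(train_rows, classes):
    """Relative frequency of each submission class among the train labels."""
    counts = Counter(
        row["primary_label"] for row in train_rows if row.get("primary_label")
    )
    total = sum(counts[c] for c in classes)
    if total == 0:
        # No usable labels: fall back to a uniform table.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: counts[c] / total for c in classes}


def fill_submission(sample_rows, priors, id_column="row_id"):
    """Write the same prior vector into every (window, class) cell."""
    filled = []
    for row in sample_rows:
        new_row = {id_column: row[id_column]}
        for column in row:
            if column != id_column:
                new_row[column] = priors.get(column, 0.0)
        filled.append(new_row)
    return filled


# Toy data, invented for this example.
toy_train = [
    {"primary_label": "sp_a"},
    {"primary_label": "sp_a"},
    {"primary_label": "sp_b"},
    {"primary_label": ""},
]
toy_sample = [{"row_id": "rec1_5", "sp_a": "0", "sp_b": "0", "sp_c": "0"}]
classes = [c for c in toy_sample[0] if c != "row_id"]
priors = compute_priors(toy_train, classes)
filled = fill_submission(toy_sample, priors)
```

Every window receives an identical row, so the table carries no per-window information. Its value is that it exercises the submission shape and the row-ID handling end to end.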
Why priors-first
Audio modeling on weak labels is expensive: mel-spec generation, RAM for waveforms, GPU for whatever CNN/transformer head we end up with, training time per backbone. Going priors-first defers all of that until the submission shape, the dashboard logging, and the remote-training path are all known-good. The discipline is the same as the jaguar competition: cheap, correct plumbing first, expensive modeling second.
Planned approach
- 01 · Confirm the official metric (before any optimization decisions). Kaggle's evaluation page is the source of truth; macro-F1 vs CMAP vs weighted-AP change the loss choice.
- 02 · Time-windowed validation (mimic per-window leaderboard scoring). Split by recording (not by row) to prevent leakage; score the validation set with the same row_id grid as test.
- 03 · Audio preprocess pipeline (mel-spec vs raw waveform vs hybrid). Decide window size, hop, mel bins; persist preprocessed features to disk to avoid a GPU bottleneck.
- 04 · Clipwise vs event detection (the modeling decision). A clipwise classifier (one prediction per 5s window) is the simplest; event detection (frame-level) is more powerful but harder to train on weak labels.
- 05 · Handle weak labels + mixed taxa (non-bird filter + label noise). A non-bird gate (mammal/amphibian filter) reduces false positives in the bird probability columns.
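The split-by-recording idea in step 02 can be sketched with a deterministic hash of the recording identifier, so that every window of a recording lands on the same side of the split. The `filename` key and the function name are hypothetical; substitute whichever column identifies a recording in the actual data.

```python
import hashlib


def split_by_recording(rows, group_key="filename", val_fraction=0.2):
    """Assign each recording, with all of its rows, to train or validation."""
    train, val = [], []
    for row in rows:
        digest = hashlib.sha256(row[group_key].encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (val if bucket < val_fraction else train).append(row)
    return train, val


# Toy rows, invented: 50 recordings with 12 five-second windows each.
toy_rows = [
    {"filename": f"rec_{i:03d}.ogg", "window_end_s": str(5 * (j + 1))}
    for i in range(50)
    for j in range(12)
]
train_rows, val_rows = split_by_recording(toy_rows)
train_recordings = {row["filename"] for row in train_rows}
val_recordings = {row["filename"] for row in val_rows}
```

Hashing rather than shuffling keeps the assignment stable across runs and machines without storing a split file. The validation share is only approximately `val_fraction`, since it depends on where the hashes fall.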
What’s distinct about this competition
Three things, none of which a standard ImageNet-style classifier template handles cleanly:
- Mixed taxa. The taxonomy is not bird-only. A naive multi-class softmax will dilute bird probability mass across non-bird columns.
- Weak labels. primary_label is recording-level, not frame-level. The model has to learn frame-level relevance from recording-level supervision.
- Open-set risk. The Pantanal will record species not in the training taxonomy. The submission format is closed-set probability, but the field recordings are open-set acoustic.
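The dilution point in the first bullet can be checked numerically. The sketch below compares a softmax, which forces the class probabilities to sum to one, with independent per-class sigmoids on the same logits. The three-class logits are invented: two taxa are strongly present in the same window and one is absent.

```python
import math


def softmax(logits):
    """Softmax over all classes; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """One independent probability per class."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Invented logits: classes 0 and 1 both present, class 2 absent.
logits = [4.0, 4.0, -4.0]
soft = softmax(logits)   # the two present classes split the mass
indep = sigmoid(logits)  # each present class can be near 1 on its own
```

Under softmax neither present class can exceed one half here, however confident the model is, because the two compete for the same unit of mass. Per-class sigmoids avoid that competition, which is one reason multi-label outputs are a common choice when several taxa vocalize in the same window.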