BirdCLEF 2026 · kaggle

The challenge

Five-second windows of long soundscape recordings, one row per (recording, window): fill a probability for every (window, species) cell of a dense submission matrix. Train labels are weak: the primary_label is what the field recorder believed dominated the clip, not a frame-level annotation. The taxonomy mixes birds with amphibians, insects, mammals, and one reptile, so a clipwise classifier alone will let non-bird sounds leak into bird probability columns.

Research-track. Prize 75,000 USD. Deadline 2026-06-03. Recordings from Mato Grosso do Sul, Brazil — the Pantanal.

Data schema

data schema train.csv

column	type	note
primary_label	categorical	the species (or non-bird taxon) the recorder annotated as dominant
collection	categorical	recording campaign / source identifier; useful for grouped CV
filename / recording_id	identifier	audio file pointer
(other Kaggle-standard cols)	text	rating, lat/lon, date; present but not used in baseline

train.csv lives in the Kaggle drop, not in this repo. Of the columns above, inspect_birdclef_data.py reads only primary_label and collection (lines 25-26 of the listing below).

data schema taxonomy.csv

column	type	note
class_name	categorical	taxonomic class: Aves, Amphibia, Insecta, Mammalia, Reptilia
primary_label	identifier	matches train.csv primary_label
scientific_name / common_name	text	species labels for human consumption

data schema sample_submission.csv

column	type	note
row_id	identifier	BC2026_Test_XXXX_SYY_TIMESTAMP_WINDOW · windowed scoring
<species_1>	numeric	predicted probability for this window
...	numeric	one column per species in taxonomy

The row_id naming scheme is the strongest signal: scoring happens per 5-second window, not per recording.
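A minimal sketch of how the row_id grid could be regrouped by recording, assuming only the layout shown in the schema above: the final underscore-separated token identifies the window and everything before it identifies the recording. The helper names (`WindowKey`, `split_row_id`, `group_by_recording`) are hypothetical, and the window token is kept as text because its exact meaning (index or end time) is not confirmed here.

```python
from typing import NamedTuple


class WindowKey(NamedTuple):
    recording: str  # everything before the final underscore
    window: str     # trailing token, kept as text (meaning unconfirmed)


def split_row_id(row_id: str) -> WindowKey:
    """Split a row_id into its recording stem and trailing window token."""
    recording, sep, window = row_id.rpartition("_")
    if not sep or not recording or not window:
        raise ValueError(f"unexpected row_id layout: {row_id!r}")
    return WindowKey(recording, window)


def group_by_recording(row_ids: list[str]) -> dict[str, list[str]]:
    """Map each recording stem to its window tokens, in first-seen order."""
    groups: dict[str, list[str]] = {}
    for row_id in row_ids:
        key = split_row_id(row_id)
        groups.setdefault(key.recording, []).append(key.window)
    return groups
```

Grouping this way makes it easy to check that every recording in sample_submission.csv has the same number of windows before any model output is written.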

Dataset spotlight

234 species across 5 taxa

chart data

label	value
Aves	162 species
Amphibia	35 species
Insecta	28 species
Mammalia	8 species
Reptilia	1 species

The taxonomy mixes 162 birds with 72 non-bird species (amphibians, insects, mammals, one reptile). A clipwise bird classifier will leak probability mass into the non-bird columns unless explicitly filtered.
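One way such a filter could look, as a hedged sketch rather than the project's implementation: group the submission columns by class_name, keep model outputs only for the bird class, and fall back to some other value (for example a class prior) everywhere else. The function names are hypothetical, and the sketch assumes that species columns are named by primary_label and that birds carry the class_name `Aves`, as in the chart above.

```python
def labels_by_class(taxonomy_rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Group primary_label values by their taxonomic class_name."""
    groups: dict[str, list[str]] = {}
    for row in taxonomy_rows:
        groups.setdefault(row["class_name"], []).append(row["primary_label"])
    return groups


def gate_non_birds(
    probs: dict[str, float],
    taxonomy_rows: list[dict[str, str]],
    fallback: dict[str, float],
    bird_class: str = "Aves",  # assumed class_name for birds
) -> dict[str, float]:
    """Keep model probabilities for bird columns; use fallback values elsewhere."""
    bird_labels = set(labels_by_class(taxonomy_rows).get(bird_class, []))
    return {
        label: (p if label in bird_labels else fallback.get(label, 0.0))
        for label, p in probs.items()
    }
```

Passing the priors table as `fallback` keeps the non-bird columns at their base rates instead of zeroing them, which is the gentler choice while the official metric is still unconfirmed.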

top 15 most-recorded primary_labels

chart data

label	value
rubthr1	499 rows
banana	498 rows
fepowl	497 rows
soulap1	497 rows
houspa	496 rows
coffal1	495 rows
osprey	495 rows
socfly1	494 rows
compau	493 rows
yeofly1	493 rows
bncfly	492 rows
bobfly1	492 rows
bbwduc	491 rows
trsowl	491 rows
whtdov	491 rows

Top labels are tightly packed (491 to 499 rows each): the head of the distribution is well-balanced, but the tail of the 234 species drops off sharply.

source breakdown · 35,549 total recordings

chart data

label	value
XenoCanto	23,043 recordings
iNaturalist	12,506 recordings

XenoCanto deployments cluster in the Pantanal; iNaturalist recordings extend across South America.

recording locations (sampled, n=240)

point summary

group	points
iNaturalist	76
XenoCanto	164
latitude bounds	-30.00 to -10.00
longitude bounds	-75.00 to -40.00

The orange box is the Pantanal recorder-deployment area called out in the data brief. XenoCanto submissions cluster there; iNaturalist contributions are scattered across the continent — useful for species-presence priors, less useful for soundscape acoustics.

The current baseline

The shipped baseline is intentionally minimal. It does the boring-but-important thing first: confirm that the submission shape, the row-ID scheme, and the inference loop work end-to-end before any model is in the picture.

inspect_birdclef_data.py python · 46 lines

#!/usr/bin/env python3
import csv
import json
from collections import Counter
from pathlib import Path


ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUTPUT_DIR = ROOT / "outputs/analysis"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def read_rows(path: Path) -> list[dict[str, str]]:
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def main() -> None:
    train_rows = read_rows(DATA_DIR / "train.csv")
    taxonomy_rows = read_rows(DATA_DIR / "taxonomy.csv")
    sample_rows = read_rows(DATA_DIR / "sample_submission.csv")

    class_counts = Counter(row["class_name"] for row in taxonomy_rows)
    label_counts = Counter(row["primary_label"] for row in train_rows if row.get("primary_label"))
    collection_counts = Counter(row["collection"] for row in train_rows if row.get("collection"))

    summary = {
        "train_rows": len(train_rows),
        "taxonomy_rows": len(taxonomy_rows),
        "submission_rows": len(sample_rows),
        "num_submission_classes": max(len(sample_rows[0]) - 1, 0) if sample_rows else 0,
        "taxonomy_class_counts": dict(class_counts.most_common()),
        "collection_counts": dict(collection_counts.most_common()),
        "top_primary_labels": label_counts.most_common(15),
        "recording_location": (DATA_DIR / "recording_location.txt").read_text(encoding="utf-8").strip(),
    }

    output_path = OUTPUT_DIR / "data_summary.json"
    output_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()

train.py counts primary_label occurrences and writes a priors JSON. predict.py fills every (window, class) cell of sample_submission.csv with those priors. The leaderboard score is whatever a class-prior table earns — but the pipeline ships.
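The two steps can be sketched as follows. This is not the contents of train.py or predict.py, which are not shown here; it is a minimal illustration that assumes priors are relative label frequencies with additive smoothing (an assumption of this sketch) and that the id column is named row_id. `fit_priors` and `fill_submission` are hypothetical names.

```python
from collections import Counter


def fit_priors(
    train_rows: list[dict[str, str]],
    class_columns: list[str],
    smoothing: float = 1.0,  # assumed additive smoothing; keeps unseen classes above zero
) -> dict[str, float]:
    """Turn primary_label counts into one prior per submission column."""
    counts = Counter(
        row["primary_label"] for row in train_rows if row.get("primary_label")
    )
    total = sum(counts[c] for c in class_columns) + smoothing * len(class_columns)
    if class_columns and total == 0:
        raise ValueError("no labelled rows and no smoothing: priors are undefined")
    return {c: (counts[c] + smoothing) / total for c in class_columns}


def fill_submission(
    sample_rows: list[dict[str, str]],
    priors: dict[str, float],
    id_column: str = "row_id",
) -> list[dict[str, object]]:
    """Give every window the same prior vector, preserving the row_id order."""
    return [{id_column: row[id_column], **priors} for row in sample_rows]
```

Every window receives an identical row of probabilities, so the result exercises the submission shape and write path without saying anything about the audio.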

Why priors-first

Audio modeling on weak labels is expensive: mel-spec generation, RAM for waveforms, GPU for whatever CNN/transformer head we end up with, training time per backbone. Going priors-first defers all of that until the submission shape, the dashboard logging, and the remote-training path are all known-good. The discipline is the same as the jaguar competition: cheap, correct plumbing first, expensive modeling second.

Planned approach

next-five-moves roadmap

01 · confirm the official metric (before any optimization decisions). Kaggle's evaluation page is the source of truth; whether scoring uses macro-F1, CMAP, or weighted AP changes the loss choice.
02 · time-windowed validation (mimic per-window leaderboard scoring). Split by recording (not by row) to prevent leakage; score the validation set with the same row_id grid as test.
03 · audio preprocess pipeline (mel-spec vs raw waveform vs hybrid). Decide window size, hop, mel bins; persist preprocessed features to disk to avoid a GPU bottleneck.
04 · clipwise vs event detection (the modeling decision). A clipwise classifier (one prediction per 5s window) is the simplest; event detection (frame-level) is more powerful but harder to train on weak labels.
05 · handle weak labels + mixed taxa (non-bird filter + label noise). A non-bird gate (a filter over the amphibian, insect, mammal, and reptile classes) reduces false positives in the bird probability columns.
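For step 02, a recording-level split could be sketched like this, assuming each training row (or each window cut from it) carries a `filename` that identifies its source recording. Hashing the group key gives a deterministic fold that never separates windows of the same recording; the function names and the default of five folds are assumptions of this sketch.

```python
import hashlib


def fold_for_group(group_id: str, n_folds: int = 5) -> int:
    """Deterministically map a recording identifier to a fold index."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def assign_folds(
    rows: list[dict[str, str]],
    group_column: str = "filename",  # assumed recording identifier column
    n_folds: int = 5,
) -> list[int]:
    """Return one fold index per row; rows sharing a recording share a fold."""
    return [fold_for_group(row[group_column], n_folds) for row in rows]
```

Hash-based folds are stable across runs but not guaranteed to be equal in size; scikit-learn's `GroupKFold` is an alternative when balanced fold sizes matter. Grouping on `collection` instead of `filename` would give a stricter, campaign-level split.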

What's distinct about this competition

Three things, none of which a standard ImageNet-style classifier template handles cleanly:

Mixed taxa. The taxonomy is not bird-only. A naive multi-class softmax will dilute bird probability mass across non-bird columns.
Weak labels. primary_label is recording-level, not frame-level. The model has to learn frame-level relevance from recording-level supervision.
Open-set risk. Pantanal soundscapes will contain species that are not in the training taxonomy. The submission format is a closed-set probability matrix, but the field recordings are acoustically open-set.
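To make the weak-label point concrete, one common option (not something the baseline does) is to pool per-window probabilities up to the recording level and apply the loss there, so that only the recording-level label is needed. The sketch below shows max pooling and noisy-or pooling for a single species, plus a binary cross-entropy on the pooled value; all names are hypothetical.

```python
import math
from typing import Callable


def max_pool(window_probs: list[float]) -> float:
    """Recording-level probability as the most confident window."""
    return max(window_probs)


def noisy_or_pool(window_probs: list[float]) -> float:
    """P(species present in at least one window), treating windows as independent."""
    absent = 1.0
    for p in window_probs:
        absent *= 1.0 - p
    return 1.0 - absent


def clip_bce(
    window_probs: list[float],
    clip_label: int,
    pool: Callable[[list[float]], float] = max_pool,
    eps: float = 1e-7,
) -> float:
    """Binary cross-entropy between the pooled probability and the recording label."""
    p = min(max(pool(window_probs), eps), 1.0 - eps)
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))
```

With max pooling, a positive recording only pushes up its single most confident window, while noisy-or spreads credit across windows; which behaves better on these labels is an open question for validation.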