§ Audio In progress

BirdCLEF 2026

Acoustic species ID in Pantanal soundscapes — research-track Kaggle. Pipeline live with a class-prior baseline; deep audio modeling is next.

dataset: weak labels at recording level · dense per-window submission · field-recording audio (FLAC / OGG)
metric: TBA per Kaggle eval · higher is better
infra: sbl1 · local
last touched: 2026-03-12
technique stack: class-prior baseline · windowed inference

The challenge

The task is to score five-second windows of long soundscape recordings. The submission has one row per (recording, window), and every (window, species) cell of that dense matrix needs a probability. Train labels are weak: primary_label is what the field recorder believed dominated the clip, not a frame-level annotation. The taxonomy mixes birds with mammals and amphibians, so a clipwise classifier alone will leak non-bird taxa into bird probability columns.

Research-track. Prize 75,000 USD. Deadline 2026-06-03. Recordings from Mato Grosso do Sul, Brazil — the Pantanal.

Data schema

data schema train.csv
column | type | note
primary_label | categorical | the species (or non-bird taxon) the recorder annotated as dominant
collection | categorical | recording campaign / source identifier; useful for grouped CV
filename / recording_id | identifier | audio file pointer
(other Kaggle-standard cols) | text | rating, lat/lon, date; present but not used in baseline

train.csv lives in the Kaggle drop, not in this repo. Of the columns above, inspect_birdclef_data.py reads only primary_label and collection (lines 25-26).
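
Since collection is flagged as useful for grouped CV, here is a minimal sketch of one way to do that: every collection is assigned to exactly one fold, so no recording campaign is split between training and validation. The function name assign_group_folds, the greedy size balancing, and the demo values are assumptions for illustration; nothing like this is confirmed to exist in the repo.

```python
from collections import Counter


def assign_group_folds(
    rows: list[dict[str, str]],
    n_folds: int = 5,
    group_key: str = "collection",
) -> dict[str, int]:
    """Map each group value to one fold, balancing fold sizes greedily."""
    group_sizes = Counter(row[group_key] for row in rows if row.get(group_key))
    fold_sizes = [0] * n_folds
    group_to_fold: dict[str, int] = {}
    # Largest groups first; ties broken by name so the result is deterministic.
    for group, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        group_to_fold[group] = fold
        fold_sizes[fold] += size
    return group_to_fold


# Hypothetical collection values, for illustration only.
demo_rows = [{"collection": "A"}] * 4 + [{"collection": "B"}] * 3 + [{"collection": "C"}] * 2
demo_folds = assign_group_folds(demo_rows, n_folds=2)
```

With very few distinct collections the folds will be uneven, and rows with an empty collection are left unassigned, so the caller has to decide where those go.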

data schema taxonomy.csv
column | type | note
class_name | categorical | taxonomic class: birds, mammals, amphibians
primary_label | identifier | matches train.csv primary_label
scientific_name / common_name | text | species labels for human consumption

data schema sample_submission.csv
column | type | note
row_id | identifier | BC2026_Test_XXXX_SYY_TIMESTAMP_WINDOW · windowed scoring
<species_1> | numeric | predicted probability for this window
... | numeric | one column per species in taxonomy

The row_id naming scheme is the strongest signal: scoring happens per 5-second window, not per recording.
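
A rough sketch of how a row_id could be split into its recording and window parts. It assumes the final underscore-separated field is an integer giving the window end time in seconds and that everything before it identifies the recording. The template above does not confirm that, so check it against the real sample_submission.csv. The helper names and the example row_id are hypothetical.

```python
WINDOW_SECONDS = 5  # window length stated in the challenge description


def split_row_id(row_id: str) -> tuple[str, int]:
    """Split a row_id into (recording_id, window_end_seconds).

    Assumption: the last underscore-separated field is an integer end time.
    """
    recording_id, _, window_field = row_id.rpartition("_")
    if not recording_id or not window_field.isdecimal():
        raise ValueError(f"unexpected row_id format: {row_id!r}")
    return recording_id, int(window_field)


def window_bounds(window_end: int) -> tuple[int, int]:
    """Return (start, end) in seconds for a window ending at window_end."""
    return max(window_end - WINDOW_SECONDS, 0), window_end


# Hypothetical row_id following the template above; not a real test row.
recording_id, window_end = split_row_id("BC2026_Test_0001_S01_20250101_000000_15")
start_s, end_s = window_bounds(window_end)
```

Grouping rows by recording_id this way lets inference decode each recording once and slice it per window, instead of reopening the file for every row.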

The current baseline

The shipped baseline is intentionally minimal. It does the boring-but-important thing first: confirm that the submission shape, the row-ID scheme, and the inference loop work end-to-end before any model is in the picture.

inspect_birdclef_data.py · python · 45 lines
#!/usr/bin/env python3
import csv
import json
from collections import Counter
from pathlib import Path


ROOT = Path(__file__).resolve().parent
DATA_DIR = ROOT / "data"
OUTPUT_DIR = ROOT / "outputs/analysis"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def read_rows(path: Path) -> list[dict[str, str]]:
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def main() -> None:
    train_rows = read_rows(DATA_DIR / "train.csv")
    taxonomy_rows = read_rows(DATA_DIR / "taxonomy.csv")
    sample_rows = read_rows(DATA_DIR / "sample_submission.csv")

    class_counts = Counter(row["class_name"] for row in taxonomy_rows)
    label_counts = Counter(row["primary_label"] for row in train_rows if row.get("primary_label"))
    collection_counts = Counter(row["collection"] for row in train_rows if row.get("collection"))

    summary = {
        "train_rows": len(train_rows),
        "taxonomy_rows": len(taxonomy_rows),
        "submission_rows": len(sample_rows),
        "num_submission_classes": max(len(sample_rows[0]) - 1, 0) if sample_rows else 0,
        "taxonomy_class_counts": dict(class_counts.most_common()),
        "collection_counts": dict(collection_counts.most_common()),
        "top_primary_labels": label_counts.most_common(15),
        "recording_location": (DATA_DIR / "recording_location.txt").read_text(encoding="utf-8").strip(),
    }

    output_path = OUTPUT_DIR / "data_summary.json"
    output_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()
inspect_birdclef_data.py: the deterministic EDA that runs on every push (45 lines, no audio dependencies)

train.py counts primary_label occurrences and writes a priors JSON. predict.py fills every (window, class) cell of sample_submission.csv with those priors. The leaderboard score is whatever a class-prior table earns — but the pipeline ships.
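
The two steps can be sketched as follows. This is not the code in train.py or predict.py, which are not shown here. It assumes the priors are counts normalized to relative frequencies, that the submission's id column is named row_id, and that classes absent from train.csv get 0.0. The function names and demo rows are hypothetical.

```python
from collections import Counter


def compute_priors(train_rows: list[dict[str, str]]) -> dict[str, float]:
    """Relative frequency of each primary_label in the training rows."""
    counts = Counter(
        row["primary_label"] for row in train_rows if row.get("primary_label")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def fill_with_priors(
    sample_rows: list[dict[str, str]],
    priors: dict[str, float],
    id_column: str = "row_id",
) -> list[dict[str, object]]:
    """Copy each submission row, replacing every class cell with its prior."""
    filled = []
    for row in sample_rows:
        out: dict[str, object] = {id_column: row[id_column]}
        for column in row:
            if column != id_column:
                # Assumption: classes never seen in train.csv fall back to 0.0.
                out[column] = priors.get(column, 0.0)
        filled.append(out)
    return filled


# Hypothetical rows, for illustration only.
demo_train = [{"primary_label": label} for label in ["sp_a", "sp_a", "sp_b", "sp_a"]]
demo_sample = [{"row_id": "rec1_5", "sp_a": "0", "sp_b": "0", "sp_c": "0"}]
demo_priors = compute_priors(demo_train)
demo_filled = fill_with_priors(demo_sample, demo_priors)
```

Every window receives the same vector of priors, so the output exercises the submission shape without carrying any per-window information.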

Why priors-first

Audio modeling on weak labels is expensive: mel-spec generation, RAM for waveforms, GPU for whatever CNN/transformer head we end up with, training time per backbone. Going priors-first defers all of that until the submission shape, the dashboard logging, and the remote-training path are all known-good. The discipline is the same as the jaguar competition: cheap, correct plumbing first, expensive modeling second.

Planned approach

What’s distinct about this competition

Three things, none of which a standard ImageNet-style classifier template handles cleanly:

  • Mixed taxa. The taxonomy is not bird-only. A naive multi-class softmax will dilute bird probability mass across non-bird columns.
  • Weak labels. primary_label is recording-level, not frame-level. The model has to learn frame-level relevance from recording-level supervision.
  • Open-set risk. Pantanal recordings will contain species that are not in the training taxonomy. The submission format asks for closed-set probabilities, but the field recordings are acoustically open-set.
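
To make the mixed-taxa point concrete, here is a small numeric sketch comparing a softmax, where all classes compete for one unit of probability, with independent per-class sigmoids. The logits are made up and the sigmoid head is one common alternative, not a confirmed part of the planned model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Probabilities that sum to 1 across all classes."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: list[float]) -> list[float]:
    """Independent per-class probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits: a bird and a frog both clearly audible, one absent class.
logits = [4.0, 4.0, -4.0]
softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)
```

With these logits the softmax caps each of the two audible classes just under 0.5, while the sigmoids score both above 0.98. That is the dilution described in the first bullet.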