Jaguar Re-ID · kaggle

The challenge

Kaggle hands you train.csv — every image labeled with the jaguar identity it depicts — and test.csv, which contains millions of (query_image, gallery_image) pairs. For each pair, return a single similarity value. The leaderboard scores those similarities directly. Not classification accuracy. Similarity. Most entrants walked in treating it as classification — but this is a re-identification task, and that was the trap.

The dataset shape

1,895 training rows across 31 unique jaguars — Marcela (183 frames), Ousado (179), Medrosa (170) at one end; Bernard and Ipepo (13 each) at the other. Mean is 61 frames per identity, median 45. That long tail dictates how the offline validation split has to be built: stratify per identity or your val set will under-sample the minority animals and your selection metric will pick the wrong checkpoint.

frames per identity (all 31 jaguars)

chart data

identity	frames	preview
Marcela	183 frames	R2 bucket candidates
Ousado	179 frames	R2 bucket candidates
Medrosa	170 frames	R2 bucket candidates
Lua	120 frames	R2 bucket candidates
Kwang	113 frames	R2 bucket candidates
Kamaikua	105 frames	R2 bucket candidates
Jaju	104 frames	R2 bucket candidates
Benita	86 frames	R2 bucket candidates
Ti	86 frames	R2 bucket candidates
Saseka	79 frames	R2 bucket candidates
Katniss	63 frames	R2 bucket candidates
Tomas	63 frames	R2 bucket candidates
Overa	62 frames	R2 bucket candidates
Bagua	60 frames	R2 bucket candidates
Pixana	50 frames	R2 bucket candidates
Solar	45 frames	R2 bucket candidates
Pyte	36 frames	R2 bucket candidates
Ariely	34 frames	R2 bucket candidates
Guaraci	31 frames	R2 bucket candidates
Madalena	27 frames	R2 bucket candidates
Alira	23 frames	R2 bucket candidates
Bororo	22 frames	R2 bucket candidates
Abril	21 frames	R2 bucket candidates
Apeiara	20 frames	R2 bucket candidates
Akaloi	19 frames	R2 bucket candidates
Patricia	19 frames	R2 bucket candidates
Oxum	17 frames	R2 bucket candidates
Estella	16 frames	R2 bucket candidates
Pollyanna	16 frames	R2 bucket candidates
Bernard	13 frames	R2 bucket candidates
Ipepo	13 frames	R2 bucket candidates

mean 61.1 · median 45 · max 183 (Marcela) · min 13 · identities with sample frames available reveal the visual variation behind the retrieval task

identity counts by frame-bucket

chart data

label	value
1–5	0 identities
6–10	0 identities
11–25	11 identities
26–50	6 identities
51+	14 identities

how many of the 31 jaguars fall in each frame-count bucket — 14 well-represented (51+ frames), 6 mid, 11 underrepresented

The architectural pivot — classification to retrieval

The first local baseline was an ImageNet-pretrained ResNet-50 with a triplet loss + cross-entropy combo. Public score: 0.421. It plateaued fast.

The unlock was reframing the task. The leaderboard scores similarity geometry, so the model should be optimized for embedding structure, not for class accuracy. Three moves followed in lockstep:

Replace the offline metric. pair_auc and retrieval_map computed on a fixed validation split with leaderboard-like pair sampling. Without this, every cloud-trained model looked roughly the same and we were burning GPU-hours ranking noise.
Replace the architecture stack. EVA-02 Large at 448², GeM pooling, ArcFace head. The backbone choice mattered more than any single hyperparameter; ArcFace structured the embedding space the way the leaderboard actually scored it.
Postprocess the similarity matrix. TTA by horizontal flip; average query expansion; k-reciprocal reranking; a final blend of raw and reranked scores.

Model architecture

EVA-02 winning config (model)

backbone

family: EVA-02 Large patch-14
weights: eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
input: 448 × 448
embedding dim: backbone default (1024)

head

pooling: GeM — learnable p (init 3.0)
classifier: ArcFace (ArcMarginProduct)
arcface scale: 30.0
arcface margin: 0.5
loss: cross-entropy on margin logits only
label smoothing: 0.0

The baseline model, kept on disk for reproducibility, is just a ResNet-50 with embedding_dim=512, cross-entropy + 0.2 · triplet_loss (batch-hard, margin=0.2), label smoothing 0.1. It exists to be beaten.

Training protocol

EVA-02 winning config (training)

optimization

optimizer: AdamW (fused on CUDA)
learning rate: 2e-5
weight decay: 1e-3
schedule: Cosine annealing · T_max = epochs
warmup: none

runtime

batch size: 4
grad accumulation: 8 (effective 32)
epochs: 10
AMP dtype: bfloat16
memory format: channels-last
TF32: on
torch.compile: off
split seed: 314159 (fixed)
train seed: 42 (per-seed)

Source: queue-vast-eva02-v1.sh + train_baseline.py defaults. split_seed is decoupled from train_seed so multi-seed sweeps stay comparable.

Validation strategy

Selection metric: pair_auc (NOT classification accuracy). Secondary: retrieval_map · retrieval_top1 · pair_ap.

Split: 80/20 stratified per identity, governed by a fixed split_seed=314159. This is the single most important repo-wide decision — when split-seed and train-seed are decoupled, multi-seed sweeps are directly comparable across runs.

Multi-seed: the eva02 plan supports SEEDS="42 1337"-style env overrides. The reranker pass shows that variance across seeds is real and a 3-seed average is the safer leaderboard bet.

Score progression

public leaderboard progression

chart data

run	public mAP
simple CNN · ImageNet-pretrained CNN with triplet loss — the local baseline	0.421
retrieval pipeline · EVA-02 Large 448 + GeM pooling + ArcFace head + TTA (flip)	0.807
+ similarity postproc · AQE + k-reciprocal reranking + blended raw/reranked scores	0.871

where each leap came from

chart data

step	public mAP
baseline	0.421
retrieval-first pipeline · EVA-02 L/448 backbone, GeM pool, ArcFace head, TTA flip	0.807 (+0.386)
AQE + k-reciprocal + blending · average query expansion, k-reciprocal reranking, raw/reranked blend	0.871 (+0.064)

The inference pipeline

The 0.807 → 0.871 leap was not in the model. It was in what the model's embeddings get fed through after they leave the GPU. The postprocessing is transductive — it exploits the full structure of the test set, not just per-query nearest neighbors.

from embedding to submission

01
test images queries + gallery
02
backbone forward EVA-02 L @ 448²
03
L2 normalize cosine-ready vectors
04
TTA flip average two passes
05
AQE k=3 · α=3.0
06
k-reciprocal rerank k1=20 · k2=6 · λ=0.3
07
blend raw / reranked ratio = 0.2
08
submission score_transform = clip

Every box is a flag in queue-vast-eva02-v1.sh. The pipeline is reproducible end-to-end from one shell script.

A representative slice of the loss

The baseline combines softmax cross-entropy with a batch-hard triplet term. The eva02 plan turns the triplet weight off (ArcFace structures the embedding geometry on its own); the snippet below shows the fallback path that the lighter ConvNeXt sweeps still rely on.

train_baseline.py L1–L40 python · 40 lines

#!/usr/bin/env python3
from __future__ import annotations

import argparse
import inspect
import json
import sys
from contextlib import nullcontext
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm

sys.path.append(str(Path(__file__).resolve().parent / "src"))

from jaguar_reid_utils import (  # noqa: E402
    JaguarTrainDataset,
    PKBatchSampler,
    batch_hard_triplet_loss,
    build_model,
    build_transforms,
    compute_retrieval_metrics,
    load_label_mapping,
    load_train_rows,
    make_split,
)


def parse_args():
    root = Path(__file__).resolve().parent
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-csv", type=Path, default=root / "data" / "extracted" / "train.csv")
    parser.add_argument("--image-dir", type=Path, default=root / "data" / "extracted" / "train" / "train")
    parser.add_argument("--output-dir", type=Path, default=root / "outputs" / "baseline")
    parser.add_argument("--model", default="resnet50.a1_in1k")
    parser.add_argument("--epochs", type=int, default=8)

Lessons banked for next time

Validate the way the leaderboard scores. A wrong offline metric costs more than a wrong hyperparameter; it can hide a 0.2 mAP gap for weeks.
Embedding geometry beats label accuracy. Once the head moved from softmax to ArcFace, the score curve resumed.
Transductive postprocessing was the second engine. AQE + k-reciprocal reranking + blending added more in postproc than another epoch of training ever did.
Decouple split-seed and train-seed. It is the cheapest reform a Kaggle pipeline can make.
Cheap GPUs first, expensive GPUs only for high-value hypotheses. The retrospective in apps/kaggle/docs/jaguar-retrospective.md is blunt: long speculative remote sweeps had poor ROI.