Jaguar Re-ID
Pairwise jaguar re-identification — scored as image similarity, not classification. The retrieval-first refactor moved a 0.421 public baseline to 0.871.
The challenge
Kaggle hands you train.csv — every image labeled with the jaguar
identity it depicts — and test.csv, which contains millions of
(query_image, gallery_image) pairs. For each pair, return a single
similarity value. The leaderboard scores those similarities
directly. Not classification accuracy. Similarity. Most entrants
walked in treating it as classification — but this is a re-identificationRe-identification (Re-ID)computer visionGiven a query image of an entity (a person, animal, vehicle), retrieve other images of the same entity from a gallery. Scored by similarity, not by class label.full entry →Wikipedia
task, and that was the trap.
The dataset shape
1,895 training rows across 31 unique jaguars — Marcela (183 frames), Ousado (179), Medrosa (170) at one end; Bernard and Ipepo (13 each) at the other. Mean is 61 frames per identity, median 45. That long tail dictates how the offline validation split has to be built: stratify per identity or your val set will under-sample the minority animals and your selection metric will pick the wrong checkpoint.
mean 61.1 · median 45 · max 183 (Marcela) · min 13
how many of the 31 jaguars fall in each frame-count bucket — 14 well-represented (51+ frames), 6 mid, 11 underrepresented
Four identities · four frames each
Real training frames at 96×96. Same individual, different lighting, posture, occlusion — this is why the task is similarity, not classification.
The architectural pivot — classification to retrieval
The first local baseline was an ImageNet-pretrained ResNet-50 with a triplet lossTriplet lossgeneral mlA metric-learning loss that pulls an anchor's embedding toward a positive (same class) and pushes it away from a negative (different class), enforcing a margin. Batch-hard variants pick the hardest positive and negative within each minibatch.full entry →FaceNet arxiv 1503.03832 + cross-entropy combo. Public score: 0.421. It plateaued fast.
The unlock was reframing the task. The leaderboard scores similarity geometry, so the model should be optimized for embeddingEmbeddinggeneral mlA fixed-length numerical vector representing an input. The geometry between vectors (cosine distance, Euclidean distance) encodes similarity between inputs.full entry → structure, not for class accuracy. Three moves followed in lockstep:
- Replace the offline metric.
pair_aucPair AUCcomputer visionAUC computed over all (query, gallery) pairs, treating same-identity as positive and different-identity as negative. The leaderboard-aligned offline metric for pairwise similarity tasks.full entry → andretrieval_mapRetrieval mAPcomputer visionMean Average Precision — the standard retrieval metric. For each query, compute precision at every recall level, average them, then average over all queries. 1.0 is perfect.full entry →Wikipedia computed on a fixed validation split with leaderboard-like pair sampling. Without this, every cloud-trained model looked roughly the same and we were burning GPU-hours ranking noise. - Replace the architecture stack. EVA-02EVA-02computer visionA vision transformer pretrained at large scale with masked image modeling. The 448²-input Large variant is a strong general-purpose backbone for downstream image tasks including retrieval.full entry →arxiv 2303.11331 Large at 448², GeM poolingGeM poolingcomputer visionGeneralized mean pooling — a learnable interpolation between average-pooling (p=1) and max-pooling (p=∞). The exponent p is trained jointly with the network.full entry →arxiv 1711.02512, ArcFaceArcFacecomputer visionAn additive-angular-margin softmax loss for metric learning. Adds a fixed angular margin to the ground-truth class logit in cosine space, encouraging tighter intra-class and wider inter-class angular separation.full entry →arxiv 1801.07698 head. The backboneBackbonegeneral mlThe feature-extractor portion of a model (e.g. ResNet50, EVA-02). Outputs intermediate feature maps that downstream heads (classifier, detector, segmentation) consume.full entry → choice mattered more than any single hyperparameter; ArcFace structured the embedding space the way the leaderboard actually scored it.
- Postprocess the similarity matrix. TTATTA — test-time augmentationgeneral mlApply multiple augmentations to each test input (horizontal flip, crops, color jitter), run the model on each, and average the predictions or embeddings. Trades inference compute for accuracy.full entry → by horizontal flip; average query expansionAQE — average query expansioncomputer visionA retrieval-postprocessing trick: replace each query embedding with a weighted average of itself and its top-k nearest gallery embeddings, then re-search. Often gives a free retrieval mAP gain.full entry →Chum et al. 2007 (PDF); k-reciprocal rerankingk-reciprocal rerankingcomputer visionA retrieval-postprocessing method that reorders results by whether two items are in each other's k-nearest-neighbor sets — a stronger 'mutual closeness' signal than raw similarity.full entry →arxiv 1701.08398; a final blend of raw and reranked scores.
Model architecture
- family
- EVA-02 Large patch-14
- weights
- eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
- input
- 448 × 448
- embedding dim
- backbone default (1024)
- pooling
- GeM — learnable p (init 3.0)
- classifier
- ArcFace (ArcMarginProduct)
- arcface scale
- 30.0
- arcface margin
- 0.5
- loss
- cross-entropy on margin logits only
- label smoothing
- 0.0
The baseline model, kept on disk for reproducibility, is just a
ResNet-50 with embedding_dim=512, cross-entropySoftmaxmathsoftmax(z)_i = exp(z_i) / Σ exp(z_j). Turns a vector of logits into a probability distribution. Paired with cross-entropy loss for classification.full entry → +
0.2 · triplet_loss (batch-hard, margin=0.2), label smoothingLabel smoothingtrainingSoften one-hot training labels by mixing in a small uniform distribution (e.g. true class = 0.9, others = 0.1/N). Prevents overconfident logits and acts as a mild regularizer.full entry →arxiv 1906.02629
0.1. It exists to be beaten.
Training protocol
- optimizer
- AdamW (fused on CUDA)
- learning rate
- 2e-5
- weight decay
- 1e-3
- schedule
- Cosine annealing · T_max = epochs
- warmup
- none
- batch size
- 4
- grad accumulation
- 8 (effective 32)
- epochs
- 10
- AMP dtype
- bfloat16
- memory format
- channels-last
- TF32
- on
- torch.compile
- off
- split seed
- 314159 (fixed)
- train seed
- 42 (per-seed)
Source: queue-vast-eva02-v1.sh + train_baseline.py defaults. split_seed is decoupled from train_seed so multi-seed sweeps stay comparable.
Validation strategy
Selection metric: pair_auc (NOT classification accuracy). Secondary: retrieval_map · retrieval_top1 · pair_ap.
Split: 80/20 stratifiedStratified splitgeneral mlA train/validation split that preserves the class proportions of the full dataset in each subset. For imbalanced multi-class problems, the default random split will under-sample rare classes.full entry →scikit-learn docs per identity, governed by a fixed split_seed=314159.
This is the single most important repo-wide decision — when split-seed and
train-seed are decoupled, multi-seedMulti-seed protocolgeneral mlRunning the same experiment with multiple random seeds and averaging or ensembling the results. Reveals genuine improvements (consistent across seeds) vs. lucky noise (one good seed).full entry → sweeps are
directly comparable across runs.
Multi-seed: the eva02 plan supports SEEDS="42 1337"-style env overrides.
The reranker pass shows that variance across seeds is real and a 3-seed
average is the safer leaderboard bet.
Score progression
The inference pipeline
The 0.807 → 0.871 leap was not in the model. It was in what the model’s embeddings get fed through after they leave the GPU. The postprocessing is transductiveTransductive inferencegeneral mlAn inference paradigm that uses the structure of the test set itself (not just trained model weights) to refine predictions. AQE and k-reciprocal reranking are transductive techniques.full entry →Wikipedia — it exploits the full structure of the test set, not just per-query nearest neighbors.
- 01 test images queries + gallery
- 02 backbone forward EVA-02 L @ 448²
- 03 L2 normalize cosine-ready vectors
- 04 TTA flip average two passes
- 05 AQE k=3 · α=3.0
- 06 k-reciprocal rerank k1=20 · k2=6 · λ=0.3
- 07 blend raw / reranked ratio = 0.2
- 08 submission score_transform = clip
Every box is a flag in queue-vast-eva02-v1.sh. The pipeline is reproducible end-to-end from one shell script.
A representative slice of the loss
The baseline combines softmax cross-entropy with a batch-hard triplet term. The eva02 plan turns the triplet weight off (ArcFace structures the embedding geometry on its own); the snippet below shows the fallback path that the lighter ConvNeXt sweeps still rely on.
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import inspect
import json
import sys
from contextlib import nullcontext
from datetime import datetime, timezone
from pathlib import Path
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm
sys.path.append(str(Path(__file__).resolve().parent / "src"))
from jaguar_reid_utils import ( # noqa: E402
JaguarTrainDataset,
PKBatchSampler,
batch_hard_triplet_loss,
build_model,
build_transforms,
compute_retrieval_metrics,
load_label_mapping,
load_train_rows,
make_split,
)
def parse_args():
root = Path(__file__).resolve().parent
parser = argparse.ArgumentParser()
parser.add_argument("--train-csv", type=Path, default=root / "data" / "extracted" / "train.csv")
parser.add_argument("--image-dir", type=Path, default=root / "data" / "extracted" / "train" / "train")
parser.add_argument("--output-dir", type=Path, default=root / "outputs" / "baseline")
parser.add_argument("--model", default="resnet50.a1_in1k")
parser.add_argument("--epochs", type=int, default=8) Lessons banked for next time
- Validate the way the leaderboard scores. A wrong offline metric costs more than a wrong hyperparameter; it can hide a 0.2 mAP gap for weeks.
- Embedding geometry beats label accuracy. Once the head moved from softmax to ArcFace, the score curve resumed.
- Transductive postprocessing was the second engine. AQE + k-reciprocal reranking + blending added more in postproc than another epoch of training ever did.
- Decouple split-seed and train-seed. It is the cheapest reform a Kaggle pipeline can make.
- Cheap GPUs first, expensive GPUs only for high-value hypotheses.
The retrospective in
apps/kaggle/docs/jaguar-retrospective.mdis blunt: long speculative remote sweeps had poor ROI.