§ Computer Vision Strong

Jaguar Re-ID

Pairwise jaguar re-identification — scored as image similarity, not classification. The retrieval-first refactor moved a 0.421 public baseline to 0.871.

baseline 0.421
best 0.871
metric mAP (retrieval) Δ +0.450 · +106.9%
dataset 1,895 train · 31 identities · ~187k test pairs images
metric mAP (retrieval) higher is better
infra sbl1 · local vast.ai · remote GPU
last touched 2026-03-08 kaggle competition ↗
technique stack
EVA-02 Large 448GeM poolingArcFace headTTA flipAQEk-reciprocal rerankraw/reranked blend

The challenge

Kaggle hands you train.csv — every image labeled with the jaguar identity it depicts — and test.csv, which contains millions of (query_image, gallery_image) pairs. For each pair, return a single similarity value. The leaderboard scores those similarities directly. Not classification accuracy. Similarity. Most entrants walked in treating it as classification — but this is a re-identification task, and that was the trap.

The dataset shape

1,895 training rows across 31 unique jaguars — Marcela (183 frames), Ousado (179), Medrosa (170) at one end; Bernard and Ipepo (13 each) at the other. Mean is 61 frames per identity, median 45. That long tail dictates how the offline validation split has to be built: stratify per identity or your val set will under-sample the minority animals and your selection metric will pick the wrong checkpoint.

Four identities · four frames each

The architectural pivot — classification to retrieval

The first local baseline was an ImageNet-pretrained ResNet-50 with a triplet loss + cross-entropy combo. Public score: 0.421. It plateaued fast.

The unlock was reframing the task. The leaderboard scores similarity geometry, so the model should be optimized for embedding structure, not for class accuracy. Three moves followed in lockstep:

  1. Replace the offline metric. pair_auc and retrieval_map computed on a fixed validation split with leaderboard-like pair sampling. Without this, every cloud-trained model looked roughly the same and we were burning GPU-hours ranking noise.
  2. Replace the architecture stack. EVA-02 Large at 448², GeM pooling, ArcFace head. The backbone choice mattered more than any single hyperparameter; ArcFace structured the embedding space the way the leaderboard actually scored it.
  3. Postprocess the similarity matrix. TTA by horizontal flip; average query expansion; k-reciprocal reranking; a final blend of raw and reranked scores.

Model architecture

EVA-02 winning config (model)
backbone
family
EVA-02 Large patch-14
weights
eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
input
448 × 448
embedding dim
backbone default (1024)
head
pooling
GeM — learnable p (init 3.0)
classifier
ArcFace (ArcMarginProduct)
arcface scale
30.0
arcface margin
0.5
loss
cross-entropy on margin logits only
label smoothing
0.0

The baseline model, kept on disk for reproducibility, is just a ResNet-50 with embedding_dim=512, cross-entropy + 0.2 · triplet_loss (batch-hard, margin=0.2), label smoothing 0.1. It exists to be beaten.

Training protocol

EVA-02 winning config (training)
optimization
optimizer
AdamW (fused on CUDA)
learning rate
2e-5
weight decay
1e-3
schedule
Cosine annealing · T_max = epochs
warmup
none
runtime
batch size
4
grad accumulation
8 (effective 32)
epochs
10
AMP dtype
bfloat16
memory format
channels-last
TF32
on
torch.compile
off
split seed
314159 (fixed)
train seed
42 (per-seed)

Source: queue-vast-eva02-v1.sh + train_baseline.py defaults. split_seed is decoupled from train_seed so multi-seed sweeps stay comparable.

Validation strategy

Selection metric: pair_auc (NOT classification accuracy). Secondary: retrieval_map · retrieval_top1 · pair_ap.

Split: 80/20 stratified per identity, governed by a fixed split_seed=314159. This is the single most important repo-wide decision — when split-seed and train-seed are decoupled, multi-seed sweeps are directly comparable across runs.

Multi-seed: the eva02 plan supports SEEDS="42 1337"-style env overrides. The reranker pass shows that variance across seeds is real and a 3-seed average is the safer leaderboard bet.

Score progression

The inference pipeline

The 0.807 → 0.871 leap was not in the model. It was in what the model’s embeddings get fed through after they leave the GPU. The postprocessing is transductive — it exploits the full structure of the test set, not just per-query nearest neighbors.

A representative slice of the loss

The baseline combines softmax cross-entropy with a batch-hard triplet term. The eva02 plan turns the triplet weight off (ArcFace structures the embedding geometry on its own); the snippet below shows the fallback path that the lighter ConvNeXt sweeps still rely on.

train_baseline.py L1–L40 python · 40 lines
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import inspect
import json
import sys
from contextlib import nullcontext
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm

sys.path.append(str(Path(__file__).resolve().parent / "src"))

from jaguar_reid_utils import (  # noqa: E402
    JaguarTrainDataset,
    PKBatchSampler,
    batch_hard_triplet_loss,
    build_model,
    build_transforms,
    compute_retrieval_metrics,
    load_label_mapping,
    load_train_rows,
    make_split,
)


def parse_args():
    root = Path(__file__).resolve().parent
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-csv", type=Path, default=root / "data" / "extracted" / "train.csv")
    parser.add_argument("--image-dir", type=Path, default=root / "data" / "extracted" / "train" / "train")
    parser.add_argument("--output-dir", type=Path, default=root / "outputs" / "baseline")
    parser.add_argument("--model", default="resnet50.a1_in1k")
    parser.add_argument("--epochs", type=int, default=8)

Lessons banked for next time

  • Validate the way the leaderboard scores. A wrong offline metric costs more than a wrong hyperparameter; it can hide a 0.2 mAP gap for weeks.
  • Embedding geometry beats label accuracy. Once the head moved from softmax to ArcFace, the score curve resumed.
  • Transductive postprocessing was the second engine. AQE + k-reciprocal reranking + blending added more in postproc than another epoch of training ever did.
  • Decouple split-seed and train-seed. It is the cheapest reform a Kaggle pipeline can make.
  • Cheap GPUs first, expensive GPUs only for high-value hypotheses. The retrospective in apps/kaggle/docs/jaguar-retrospective.md is blunt: long speculative remote sweeps had poor ROI.