# AQE — average query expansion
also: average query expansion, query expansion
A retrieval-postprocessing trick: replace each query embedding with a weighted average of itself and its top-k nearest gallery embeddings, then re-search. Often improves retrieval mAP with no retraining.
Average query expansion (AQE) is transductive: it exploits the structure of the test set to refine each query. After an initial search, take the top-k results, blend their embeddings into the query (e.g. weighted by similarity^α with α=3.0, k=3), and search again. This pulls queries closer to dense clusters of related images without retraining the model.
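The blend-and-re-search step can be sketched in a few lines. This is a minimal pure-Python illustration, assuming cosine similarity on L2-normalized embeddings and a query self-weight of 1; the function names (`normalize`, `cosine`, `expand_query`) and the clamping of negative similarities are assumptions of this sketch, not part of any particular library.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def expand_query(query, gallery, k=3, alpha=3.0):
    """Blend a query with its top-k gallery neighbors, weighted by sim**alpha."""
    q = normalize(query)
    g = [normalize(v) for v in gallery]
    # Initial search: rank gallery items by cosine similarity to the query.
    sims = sorted(((cosine(q, v), i) for i, v in enumerate(g)), reverse=True)
    # The query itself enters the blend with weight 1.
    blended = list(q)
    for sim, i in sims[:k]:
        w = max(sim, 0.0) ** alpha  # clamp so negative similarities get zero weight
        blended = [b + w * x for b, x in zip(blended, g[i])]
    # Re-normalize so the expanded query can be searched like the original.
    return normalize(blended)
```

The expanded query returned here would then be used for a second search over the same gallery.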
# ArcFace
An additive-angular-margin softmax loss for metric learning. Adds a fixed margin to the angle between an embedding and its ground-truth class vector, encouraging tighter intra-class clustering and wider inter-class angular separation.
ArcFace (Deng et al. 2019) is one of the dominant losses for face recognition and image retrieval. It reframes the final classifier as an angle-based formulation: each class is a unit vector on a hypersphere; logits are cosines of angles between the input embedding and each class vector. ArcFace then adds a fixed margin to the angle of the ground-truth class (m=0.5 is common), which lowers that class's logit, before multiplying all logits by a scale factor (s=30 is common; the original paper uses s=64). The margin pushes embeddings to be both class-discriminative and geometrically clean, which is useful for any downstream retrieval.
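The logit computation can be sketched as follows. This is a simplified pure-Python illustration for a single embedding, assuming the margin is added to the ground-truth angle as cos(θ + m); the function name `arcface_logits` is hypothetical, and the sketch omits the special handling that practical implementations apply when θ + m exceeds π.

```python
import math

def arcface_logits(embedding, class_weights, target, m=0.5, s=30.0):
    """Return scaled logits with an additive angular margin on the target class."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    e = unit(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        # Cosine of the angle between the embedding and class vector j.
        cos_theta = sum(a * b for a, b in zip(e, unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == target:
            theta = math.acos(cos_theta)
            # Adding m to the angle lowers the target logit, so the network must
            # shrink the angle further to reach the same loss.
            cos_theta = math.cos(theta + m)
        logits.append(s * cos_theta)
    return logits
```

The returned logits would be passed to an ordinary softmax cross-entropy loss.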
# EVA-02
also: eva02, eva02_large_patch14
A vision transformer pretrained at large scale with masked image modeling. The 448²-input Large variant is a strong general-purpose backbone for downstream image tasks including retrieval.
EVA-02 is a series of vision transformers from BAAI pretrained with masked image modeling on large image collections and fine-tuned at high input resolutions. The Large variant at 448×448 input, fine-tuned on ImageNet-22k and then ImageNet-1k, is one of the strongest open-weights image encoders for transfer learning at the time of writing. It is the timm model `eva02_large_patch14_448.mim_m38m_ft_in22k_in1k`.
# GeM pooling
also: gem, generalized mean pooling
Generalized mean pooling — a learnable interpolation between average-pooling (p=1) and max-pooling (p=∞). The exponent p is trained jointly with the network.
GeM pooling computes (mean(x^p))^(1/p) over the spatial dimensions, with p a learnable scalar (initialized to 3.0 by convention). At p=1 it reduces to average pooling; as p→∞ it approaches max pooling. Larger p emphasizes the most-activated spatial regions, useful for retrieval tasks where a distinctive part of the image matters more than its overall appearance.
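The formula above can be checked numerically with a small sketch. It is written in pure Python for a single channel so that it runs without dependencies; in a real network this would be a tensor operation with p as a trainable parameter. The function name `gem_pool` and the `eps` clamp (which keeps fractional powers real-valued) are assumptions of this sketch.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean over all spatial positions of one channel.

    feature_map: 2-D list (H x W) of activations for a single channel.
    """
    # Clamp to a small positive value so x**p is well defined for any p.
    values = [max(x, eps) for row in feature_map for x in row]
    mean_pow = sum(v ** p for v in values) / len(values)
    return mean_pow ** (1.0 / p)
```

For the map `[[1.0, 2.0], [3.0, 4.0]]`, p=1 gives the plain mean 2.5, and the result moves toward the maximum 4.0 as p grows.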
# k-reciprocal reranking
also: k-reciprocal, reranking
A retrieval-postprocessing method that reorders results by whether two items are in each other's k-nearest-neighbor sets — a stronger 'mutual closeness' signal than raw similarity.
k-reciprocal reranking (Zhong et al. 2017) defines the k-reciprocal neighbor set of an item as the items in its top-k that also have it in their own top-k. The reranked distance is a weighted sum of the original distance and a Jaccard distance between these sets, which are first expanded with close reciprocal neighbors and smoothed by local query expansion. Hyperparameters k1 (reciprocal-neighbor set size), k2 (local query expansion size), and λ (weight on the original distance) control the strength. It is a standard final step in Re-ID pipelines.
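The core idea can be sketched on a small distance matrix. This is a deliberately reduced illustration: it builds plain k-reciprocal sets and blends their Jaccard distance with the original distance, and it leaves out the set expansion, distance-based weighting, and k2 smoothing of the full method. The function names and the single `k` parameter are assumptions of this sketch.

```python
def top_k(dist, i, k):
    """Indices of the k nearest items to i (i itself included, at distance 0)."""
    return sorted(range(len(dist)), key=lambda j: dist[i][j])[:k]

def k_reciprocal_set(dist, i, k):
    """Items j such that j is in i's top-k and i is in j's top-k."""
    return {j for j in top_k(dist, i, k) if i in top_k(dist, j, k)}

def reranked_distance(dist, i, j, k=3, lam=0.3):
    """Blend the original distance with the Jaccard distance of reciprocal sets."""
    a, b = k_reciprocal_set(dist, i, k), k_reciprocal_set(dist, j, k)
    jaccard = 1.0 - len(a & b) / len(a | b)
    # lam weights the original distance; (1 - lam) weights the Jaccard term.
    return (1.0 - lam) * jaccard + lam * dist[i][j]

# Toy data: five points on a line, forming two well-separated groups.
points = [0.0, 1.0, 2.0, 50.0, 51.0]
dist = [[abs(p - q) for q in points] for p in points]
```

On the toy data, item 2 appears in the top-3 of item 3, but item 3 is not in the top-3 of item 2, so the pair is not reciprocal and the Jaccard term keeps the two groups apart.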
# Pair AUC
also: pairwise auc
AUC computed over all (query, gallery) pairs, treating same-identity as positive and different-identity as negative. The leaderboard-aligned offline metric for pairwise similarity tasks.
Pair AUC enumerates all pairs from the validation set (or samples a subset when there are too many), labels them positive (same identity) or negative (different identity), and computes the standard ROC-AUC over their similarity scores. Unlike retrieval mAP, it does not specifically reward correct ordering of the top-k results for each query, but it does correlate closely with leaderboards that score (query, gallery) pairs directly.
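A minimal sketch of the computation, assuming a precomputed symmetric similarity matrix and one identity label per item. It uses the rank-based definition of ROC-AUC (the probability that a random positive pair scores higher than a random negative pair, with ties counted as half) and enumerates every pair, so it is only practical for small validation sets. The function name `pair_auc` is hypothetical.

```python
def pair_auc(similarity, identities):
    """ROC-AUC over all unordered pairs; same identity = positive."""
    pos, neg = [], []
    n = len(identities)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = pos if identities[i] == identities[j] else neg
            bucket.append(similarity[i][j])
    # Count positive/negative comparisons won by the positive pair; ties count half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every same-identity pair scores above every different-identity pair; 0.5 corresponds to random scores.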
# Re-identification (Re-ID)
also: re-id
Given a query image of an entity (a person, animal, vehicle), retrieve other images of the same entity from a gallery. Scored by similarity, not by class label.
Re-identification (Re-ID) is a retrieval task: the model sees images of an entity at training time and must match new images of the same entity at test time, often without that entity ever being a 'class' the model was explicitly trained to predict. The model's job is to produce embeddings such that same-entity pairs are close in embedding space and different-entity pairs are far apart. Distinct from classification: a closed label set is replaced by an open set of identities.
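As an illustration of scoring by similarity rather than by class label, the sketch below ranks a gallery against one query embedding. The embeddings are made-up two-dimensional vectors and the identity strings are hypothetical; a real system would obtain the embeddings from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query_embedding, gallery):
    """Return gallery identity labels ordered from most to least similar.

    gallery: list of (identity, embedding) pairs. Identities only label the
    output; matching relies on embedding similarity alone, so identities
    never seen during training can still be retrieved.
    """
    scored = sorted(
        gallery,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [identity for identity, _ in scored]
```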
# Retrieval mAP
also: mean average precision, map
Mean Average Precision: the standard retrieval metric. For each query, compute precision at the rank of every correct match, average those values, then average over all queries. Ranges from 0 to 1; 1.0 is perfect.
For a single query, Average Precision is the area under the precision-recall curve over the ranked retrieval list. Retrieval mAP averages AP across all queries. It rewards models that put correct matches at the top of the ranked list (not just somewhere in the top-k). Distinct from classification mAP, which averages per-class AP.
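A minimal sketch of the calculation, assuming each query's results are given as a ranked list of booleans (True where the retrieved item is a correct match) and that every correct match appears somewhere in that list. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is True if the i-th result matches."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this correct match
    # A query with no correct matches in the list contributes 0.
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_ranked_relevance):
    """Mean of per-query AP over all queries."""
    aps = [average_precision(r) for r in all_ranked_relevance]
    return sum(aps) / len(aps)
```

For the ranked list `[True, False, True]`, the precisions at the two matches are 1/1 and 2/3, so AP is about 0.833. Moving the second match up to rank 2 would raise AP to 1.0, which is how the metric rewards placing matches at the top.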
# timm
PyTorch Image Models: Ross Wightman's library of pretrained vision backbones with a unified loading interface. The de facto registry for transfer learning in PyTorch.
timm (PyTorch Image Models) hosts hundreds of pretrained vision architectures with a single `timm.create_model('name', pretrained=True)` API. Model name strings encode the architecture, pretraining dataset, and fine-tuning lineage (e.g. `eva02_large_patch14_448.mim_m38m_ft_in22k_in1k`).
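To make the naming scheme concrete, the sketch below splits a model name into its architecture part and its pretrained-tag tokens. It assumes the `architecture.pretrained_tag` convention visible in the example name; the helper `split_model_name` is hypothetical and is not part of timm, and the tokens are returned without interpretation.

```python
def split_model_name(name):
    """Split a timm-style model name into (architecture, tag tokens)."""
    # Everything before the first dot names the architecture and input size;
    # everything after it describes the pretrained weights.
    architecture, _, tag = name.partition(".")
    return architecture, tag.split("_") if tag else []

arch, tag_tokens = split_model_name("eva02_large_patch14_448.mim_m38m_ft_in22k_in1k")
# The full string is what would be passed to timm.create_model(name, pretrained=True).
```

Here `arch` is `eva02_large_patch14_448` and `tag_tokens` is `['mim', 'm38m', 'ft', 'in22k', 'in1k']`, which reads left to right as the pretraining stage followed by the fine-tuning stages.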
# Vision Transformer (ViT)
also: vit
An image model architecture that splits an image into fixed-size patches and feeds them as a token sequence to a standard transformer encoder. Patch-14 means each patch is 14×14 pixels.
Introduced by Dosovitskiy et al. 2020, the Vision Transformer (ViT) replaces convolutions with a standard transformer encoder applied to image patches embedded as tokens. It can match or outperform comparable CNNs when pretrained on enough data, including with self-supervised objectives like masked image modeling. EVA-02 is a ViT variant.
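The patch size determines how many tokens the encoder sees. The sketch below works through that arithmetic for a square input, assuming non-overlapping patches and ignoring any extra tokens (such as a class token) that a given model may add; the function name `patch_grid` is hypothetical.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side
```

A 448×448 input with patch-14 gives a 32×32 grid, or 1024 patch tokens, while a 224×224 input with patch-16 gives 14×14 = 196. Since self-attention cost grows with the square of the token count, higher resolution and smaller patches are considerably more expensive.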