kaggle · shivam bhardwaj filed 2026-05-18

REF / GLOSSARY · LEARNING SURFACE id 47 entries

01 · reference

Glossary

Every jargon term used in the writeups, defined briefly. Inline terms carry short definitions; this page keeps the longer reference entries, sources, and competition links in one place.

47 terms

02 · Computer Vision 10 entries

# AQE — average query expansion

also: average query expansion, query expansion

A retrieval-postprocessing trick: replace each query embedding with a weighted average of itself and its top-k nearest gallery embeddings, then re-search. Often gives a free retrieval mAP gain.

Average query expansion (AQE) is transductive: it exploits the structure of the test set to refine each query. After an initial search, take the top-k results, blend their embeddings into the query (e.g. weighted by similarity^α with α=3.0, k=3), and search again. This pulls queries closer to dense clusters of related images without retraining the model.
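The blend step can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, embeddings are assumed to be L2-normalized lists of floats, and negative similarities are clamped to zero (an assumption made here so that the power weighting stays well defined).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def expand_query(query, gallery, k=3, alpha=3.0):
    """Blend a unit-norm query with its top-k gallery neighbours,
    each weighted by similarity**alpha, then re-normalize."""
    sims = [dot(query, g) for g in gallery]
    top = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:k]
    blended = list(query)  # the query itself enters with weight 1
    for i in top:
        w = max(sims[i], 0.0) ** alpha  # clamp: negative similarity adds nothing
        blended = [b + w * g for b, g in zip(blended, gallery[i])]
    return l2_normalize(blended)

gallery = [[0.8, 0.6], [0.6, 0.8], [-1.0, 0.0]]
query = [1.0, 0.0]
expanded = expand_query(query, gallery, k=2)
```

The expanded query is then used for a second search over the same gallery.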

# ArcFace

An additive-angular-margin softmax loss for metric learning. Adds a fixed angular margin to the ground-truth class logit in cosine space, encouraging tighter intra-class and wider inter-class angular separation.

ArcFace (Deng et al. 2019) is one of the dominant losses for face recognition and image retrieval. It reframes the final classifier in terms of angles: each class is a unit vector on a hypersphere, and logits are cosines of the angles between the input embedding and each class vector. ArcFace then adds a fixed margin to the angle of the ground-truth class (m=0.5 is common), which lowers that class's logit, before multiplying all logits by a scale factor (s=30 is common). To overcome the margin, the network has to produce embeddings that are both class-discriminative and geometrically clean, which is useful for any downstream retrieval.
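A minimal sketch of the logit computation, using the cos(θ + m) form of the loss. The function name is hypothetical, and the code assumes the cosines have already been computed from unit-norm embeddings and class vectors. Production implementations also handle the case where θ + m exceeds π, which is omitted here.

```python
import math

def arcface_logits(cosines, target, margin=0.5, scale=30.0):
    """cosines[j] is cos(angle) between the embedding and class vector j."""
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))
            c = math.cos(theta + margin)  # larger angle, so a lower target logit
        logits.append(scale * c)
    return logits

logits = arcface_logits([0.8, 0.1, -0.3], target=0)
```

The resulting logits go into an ordinary softmax cross-entropy; only the ground-truth entry is penalized by the margin.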

# EVA-02

also: eva02, eva02_large_patch14

A vision transformer pretrained at large scale with masked image modeling. The 448²-input Large variant is a strong general-purpose backbone for downstream image tasks including retrieval.

EVA-02 is a series of vision transformers from BAAI pretrained with masked image modeling, with variants ranging from Tiny to Large and fine-tuned checkpoints at high input resolutions. The Large variant at 448×448 input, fine-tuned on ImageNet-22k and then ImageNet-1k, is one of the strongest open-weights image encoders for transfer learning. In timm it is `eva02_large_patch14_448.mim_m38m_ft_in22k_in1k`.

# GeM pooling

also: gem, generalized mean pooling

Generalized mean pooling — a learnable interpolation between average-pooling (p=1) and max-pooling (p=∞). The exponent p is trained jointly with the network.

GeM pooling computes (mean(x^p))^(1/p) over the spatial dimensions, with p a learnable scalar (initialized to 3.0 by convention). At p=1 it reduces to average pooling; as p→∞ it approaches max pooling. Larger p emphasizes the most-activated spatial regions, useful for retrieval tasks where a distinctive part of the image matters more than its overall appearance.
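The formula for one channel, as a small sketch. The function name and the `eps` clamp are assumptions of this example; the clamp keeps the power well defined for non-positive activations. In a real network p would be a trainable parameter rather than an argument.

```python
def gem_pool(values, p=3.0, eps=1e-6):
    """values: activations of one channel, flattened over spatial positions."""
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)

avg_like = gem_pool([1.0, 2.0, 3.0], p=1.0)    # equals the mean
default = gem_pool([1.0, 2.0, 3.0], p=3.0)     # between mean and max
max_like = gem_pool([1.0, 2.0, 3.0], p=200.0)  # close to the max
```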

# k-reciprocal reranking

also: k-reciprocal, reranking

A retrieval-postprocessing method that reorders results by whether two items are in each other's k-nearest-neighbor sets — a stronger 'mutual closeness' signal than raw similarity.

k-reciprocal reranking (Zhong et al. 2017) defines the k-reciprocal neighbor set as items that are both in each other's top-k. The reranked distance combines the original distance with a Jaccard-style overlap of these sets, optionally smoothed by an expanded k. Hyperparameters k1 (initial neighbors), k2 (expansion), and λ (blend weight) control the strength. Standard final step in Re-ID pipelines.
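A simplified sketch of the core idea: build k-reciprocal sets, then blend the original distance with a Jaccard distance between sets. It deliberately omits the local query expansion (k2) and the distance-weighted soft sets of the full method, and all function names are hypothetical.

```python
def top_k(dist_row, k):
    return sorted(range(len(dist_row)), key=lambda j: dist_row[j])[:k]

def k_reciprocal_sets(dist, k):
    """dist: square matrix of pairwise distances over queries and gallery."""
    nn = [set(top_k(row, k)) for row in dist]
    return [{j for j in nn[i] if i in nn[j]} for i in range(len(dist))]

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def rerank(dist, k=2, lam=0.3):
    sets = k_reciprocal_sets(dist, k)
    n = len(dist)
    return [[lam * dist[i][j] + (1 - lam) * jaccard_distance(sets[i], sets[j])
             for j in range(n)] for i in range(n)]

points = [0.0, 1.0, 2.5, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
sets = k_reciprocal_sets(dist, k=2)
reranked = rerank(dist, k=2, lam=0.3)
```

Items 0 and 1 are in each other's top-2, so their Jaccard term is zero and they move closer together after reranking; item 2 has no reciprocal partner and is pushed away.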

# Pair AUC

also: pairwise auc

AUC computed over all (query, gallery) pairs, treating same-identity as positive and different-identity as negative. The leaderboard-aligned offline metric for pairwise similarity tasks.

Pair AUC samples (or enumerates) all pairs from the validation set, labels them positive (same identity) or negative (different identity), and computes the standard ROC-AUC over their similarity scores. Unlike retrieval mAP it doesn't reward correct ordering of top-k results — but it does correlate closely with leaderboards that score (query, gallery) pairs directly.
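ROC-AUC over pairs equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The sketch below computes it by brute force; the function name is hypothetical and the quadratic loop is only suitable for small validation sets.

```python
def pair_auc(scores, labels):
    """scores: similarity per pair; labels: 1 = same identity, 0 = different."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = pair_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```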

# Re-identification (Re-ID)

also: re-id

Given a query image of an entity (a person, animal, vehicle), retrieve other images of the same entity from a gallery. Scored by similarity, not by class label.

Re-identification (Re-ID) is a retrieval task: the model sees images of an entity at training time and must match new images of the same entity at test time, often without that entity ever being a 'class' the model was explicitly trained to predict. The model's job is to produce embeddings such that same-entity pairs are close in embedding space and different-entity pairs are far apart. Distinct from classification: a closed label set is replaced by an open set of identities.

# Retrieval mAP

also: mean average precision, map

Mean Average Precision — the standard retrieval metric. For each query, compute precision at every recall level, average them, then average over all queries. 1.0 is perfect.

For a single query, Average Precision is the area under the precision-recall curve over the ranked retrieval list. Retrieval mAP averages AP across all queries. It rewards models that put correct matches at the top of the ranked list (not just somewhere in the top-k). Distinct from classification mAP, which averages per-class AP.
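A small sketch of the computation, using the common convention of averaging precision at the rank of each correct match. The function names are hypothetical, and each query's result list is assumed to be given as relevance flags in ranked order.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 1/0 flags for the ranked gallery list, best first."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this correct match
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_ranked):
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)

ap_a = average_precision([1, 0, 1, 0])  # matches at ranks 1 and 3
ap_b = average_precision([0, 1])        # single match at rank 2
m_ap = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```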

# timm

PyTorch Image Models — Ross Wightman's library of pretrained vision backbones with a unified loading interface. The de facto registry for transfer learning in PyTorch.

timm (PyTorch Image Models) hosts hundreds of pretrained vision architectures with a single `timm.create_model('name', pretrained=True)` API. Model name strings encode the architecture, pretraining dataset, and fine-tuning lineage (e.g. `eva02_large_patch14_448.mim_m38m_ft_in22k_in1k`).

# Vision Transformer (ViT)

also: vit

An image classifier that treats images as sequences of fixed-size patches fed to a standard transformer encoder. Patch-14 means each patch is 14×14 pixels.

Introduced by Dosovitskiy et al. 2020, the Vision Transformer (ViT) replaces convolutions with a standard transformer encoder applied to image patches embedded as tokens. It outperforms CNNs at large scale, especially when pretrained with self-supervised objectives like masked image modeling. EVA-02 is a ViT variant.

02 · Bioinformatics 7 entries

# k-mer

A substring of length k over a biological alphabet (e.g. {A,C,G,U} for RNA). The set of k-mers in a sequence, or the histogram of their counts, is a common feature representation.

k-mer features are the bioinformatics analog of word n-grams. For RNA with k=3 and alphabet size 4, there are 4³ = 64 possible k-mers, and each sequence maps to a 64-dimensional count vector. Useful for fast sequence similarity (cosine on normalized k-mer vectors) without needing full alignment.
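The count vector can be built in a few lines. This is an illustrative sketch: the function name is hypothetical, and windows containing characters outside the alphabet are simply skipped (one possible choice).

```python
from itertools import product

def kmer_counts(seq, k=3, alphabet="ACGU"):
    """Return a count vector with one slot per possible k-mer."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows with unknown characters, e.g. N
            counts[index[kmer]] += 1
    return counts

vec = kmer_counts("ACGUACG")  # windows: ACG, CGU, GUA, UAC, ACG
```

Normalizing two such vectors and taking their dot product gives the alignment-free similarity described above.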

# MSA — multiple sequence alignment

also: multiple sequence alignment

An alignment of three or more biological sequences (DNA, RNA, protein) that exposes conserved positions and evolutionary relationships. Strong feature for structure prediction.

An MSA places each input sequence in a column-aligned table, with gaps inserted to maximize alignment of homologous positions. Columns where one residue dominates across many sequences reveal positions under selective pressure — these typically participate in structural constraints (base pairs, active sites). MSAs are the key feature behind AlphaFold's success on proteins and remain useful for RNA.

# Needleman–Wunsch alignment

also: global alignment

A dynamic-programming algorithm for optimal global alignment of two biological sequences. Fills an O(mn) table of partial-alignment scores and traces back the best path.

Needleman–Wunsch (1970) is the first dynamic-programming sequence alignment algorithm. Given a substitution matrix and gap penalty, it computes the highest-scoring global alignment of two sequences. Modern toolkits (Biopython, EMBOSS) implement it; for very long sequences faster heuristic aligners (BLAST, minimap2) replace it.
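A compact sketch of the table fill and traceback, with a flat match/mismatch score standing in for a real substitution matrix and a linear gap penalty. The scoring values are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, top, bottom = m, n, [], []
    while i > 0 or j > 0:
        diag = i > 0 and j > 0
        sub = match if diag and a[i - 1] == b[j - 1] else mismatch
        if diag and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = needleman_wunsch("ACGU", "AGU")
```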

# RibonanzaNet

A sequence-only transformer for RNA structure understanding, trained on the Ribonanza chemical-mapping Kaggle dataset by NVIDIA. RibonanzaNet2 / 3D add 3D coordinate prediction heads.

RibonanzaNet grew out of the Ribonanza Kaggle competition, where the task was to predict the chemical reactivity of every nucleotide; it consolidates ideas from the top-ranked solutions rather than being a single winning entry. The pretrained model produces per-position embeddings useful for downstream RNA structure tasks; RibonanzaNet2 and RibonanzaNet3D extend it with 3D coordinate prediction heads via additional fine-tuning.

# RMSD — root mean square deviation

The standard structural similarity metric. Compute the root-mean-square of pairwise atom-coordinate differences between two superposed structures. Lower is better; 0 Å is identical structures.

RMSD = √(mean(‖x_i^A − x_i^B‖²)) over corresponding atoms i, after rigid-body superposition of structure B onto structure A. Typical thresholds: <1 Å is near-perfect, 1–3 Å is acceptable for proteins, >10 Å is incorrect. For RNA, RMSDs tend to be higher because RNA secondary/tertiary structure is more flexible than proteins.
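The formula itself is short. The sketch below assumes the two structures are already superposed and that atoms correspond by index; the superposition step (for example the Kabsch algorithm) is not shown, and the function name is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """coords_*: equal-length lists of (x, y, z) tuples, already superposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    sq = [sum((p - q) ** 2 for p, q in zip(a, b))
          for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(sq) / len(sq))

ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]  # every atom displaced by 5 Å
value = rmsd(ref, shifted)
```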

# Stoichiometry (structural)

The chain composition of a macromolecular complex. 'A:1' means a single chain labeled A; 'A:2' is a homodimer; 'A:1;B:1' is a heterodimer with two distinct chains.

In structural biology, stoichiometry describes how many copies of each distinct chain are present in a complex. RNA structures are often single-chain (A:1) but can be multi-chain (A:1;B:1 indicates two different RNA molecules in complex). Stoichiometry is metadata that helps narrow candidate templates during retrieval-based prediction.

# Watson–Crick base pair

Canonical RNA/DNA base-pairing geometry: A pairs with U (or T), G pairs with C. Hydrogen-bond geometry constrains the relative position and orientation of paired nucleotides.

Watson–Crick pairing is the canonical hydrogen-bonding pattern between complementary bases. In RNA secondary structure, predicting which residues pair (A·U, G·C, sometimes G·U wobble) determines the entire fold of stem-loop motifs. In 3D coordinate prediction, enforcing Watson–Crick geometry between predicted pairs is a strong physical prior.

02 · Audio 5 entries

# Clipwise classification

A single classification prediction per audio clip (or per fixed-length window). Treats the clip as one bag of features; doesn't try to localize where in the clip a sound occurs.

Clipwise classification produces one prediction vector per clip (or per fixed-length window like 5 seconds). Simpler than event detection — but loses temporal precision. Standard architecture: a CNN over the mel-spectrogram, global pooling, then a linear classifier head.

# Event detection

also: sound event detection, sed

Predict not just what sounds are present but when they start and end. Frame-level predictions instead of clip-level. Harder than clipwise but localizes vocalizations within a long recording.

Sound event detection (SED) outputs a binary or probability time-series per class at frame-level resolution. Models typically combine convolutional feature extraction with a recurrent or transformer head that predicts onset/offset frames. SED is the natural fit for soundscape monitoring but requires either frame-level labels or careful weak-label training.

# Mel-spectrogram

also: mel-spec

A time-frequency representation of audio where the frequency axis is warped to the mel scale (roughly linear below 1 kHz and logarithmic above it, approximating human pitch perception). The default input for neural network audio classifiers.

Compute the short-time Fourier transform of the waveform, then map the linear frequency bins to mel bins via a triangular filter bank. The result is a 2D image (time × mel-frequency) suitable for any CNN or vision transformer trained on standard image tasks. Hyperparameters: window size, hop, number of mel bins (64–128 typical).
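The frequency warping behind the filter bank can be sketched as follows. This uses the HTK-style mel formula, which is one of two common conventions (librosa's default is the Slaney variant), and the function names are hypothetical. The returned edges are where the triangular filters would start, peak, and end.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """n_mels + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1))
            for i in range(n_mels + 2)]

edges = mel_band_edges(4, 0.0, 8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Equal spacing in mel means the bands get wider in Hz as frequency rises, so low frequencies get finer resolution.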

# Soundscape

A long continuous field recording capturing all ambient acoustic activity at a location — many overlapping vocalizations, mixed taxa, environmental noise. Distinct from single-species clip datasets.

Soundscape recordings, typically minutes to hours long and made with autonomous recording units in the field, are the realistic deployment scenario for acoustic monitoring. They differ from curated single-species clip datasets (Xeno-canto-style) in that multiple species vocalize simultaneously, weather and ambient noise are uncontrolled, and most of the recording contains no target species at all.

# Weak labels

Labels at a coarser granularity than the prediction target. A recording-level 'this clip contains species X' label is weak for a 5-second-window prediction task.

Weak supervision is supervision where the label is correct but imprecise: a clip is labeled with the dominant species, but the actual vocalization may only occupy a fraction of the recording. Multiple-instance learning, attention pooling, and pseudo-labeling are common ways to learn frame-level predictions from clip-level labels.

02 · Tabular & Classical ML 5 entries

# K-means clustering

Partition n samples into k clusters by iteratively assigning each sample to its nearest centroid and re-computing centroids as cluster means. Minimizes within-cluster variance.

K-means alternates two steps until convergence: (1) assign each point to the nearest of k centroids by Euclidean distance, (2) update each centroid to the mean of its assigned points. Sensitive to feature scales (standardize first), sensitive to initialization (use multiple n_init runs and keep the best), and assumes spherical clusters of similar size. K is a hyperparameter — the elbow plot or silhouette score can help choose it.
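The two alternating steps, as a dependency-free sketch. The initial centroids are passed in explicitly to keep the example deterministic; real use would add random or k-means++ initialization with several restarts. Function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm; n_iter must be at least 1."""
    for _ in range(n_iter):
        # (1) assign each point to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # (2) move each centroid to the mean of its assigned points
        updated = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(list(centroids[c]))  # leave empty clusters alone
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, cents = kmeans(pts, [[0.0, 0.0], [10.0, 10.0]])
```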

# One-hot encoding

also: onehotencoder, one-hot

Turn a categorical column with N distinct values into N binary columns, one per value. Needed for models that expect numeric features: linear models require it, and scikit-learn's Random Forest accepts the resulting binary columns as-is.

One-hot encoding lets an arbitrary categorical column feed a numeric-only model. With drop='first', the first category is dropped to avoid the dummy-variable trap (perfect collinearity). Distinct from label encoding (assigning integer codes), which implies an ordering that usually isn't there. For high-cardinality categoricals, target encoding or embeddings are common alternatives.
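A hand-rolled sketch of the encoding, including the drop-first option. The function is hypothetical and only mirrors the idea; categories are ordered alphabetically here, so "first" means the alphabetically first value.

```python
def one_hot(values, drop_first=False):
    """Return (kept categories, one row of 0/1 flags per input value)."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return kept, [[1 if v == c else 0 for c in kept] for v in values]

cats, rows = one_hot(["red", "blue", "red"])
cats_d, rows_d = one_hot(["red", "blue", "red"], drop_first=True)
```

With the first category dropped, a row of all zeros stands for that category, which removes the redundant column.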

# PCA — Principal Component Analysis

also: principal component analysis

A linear dimensionality reduction technique: find orthogonal directions of maximum variance in the data, project onto the top-k. Used for visualization (k=2) and feature compression.

PCA computes the eigenvectors of the data covariance matrix; the top-k by eigenvalue are the directions that capture the most variance. Projecting the data onto these directions gives a k-dimensional approximation. The explained-variance ratio of each component tells you what fraction of total variance it accounts for. Always standardize features first.

# Random Forest

An ensemble of decision trees, each trained on a bootstrap sample of the data and a random subset of features. Aggregates by majority vote (classification) or mean (regression). Robust default.

Random Forest (Breiman 2001) builds N decision trees independently, randomizing both the training data (bagging) and the candidate features at each split. The final prediction is the mean (regression) or majority vote (classification) across trees. Feature importances are estimated from how much each feature reduces impurity averaged across trees. Strong out-of-the-box baseline for tabular problems with no tuning.

# StandardScaler

scikit-learn's z-score standardizer: subtract the mean and divide by the standard deviation of each feature. Strongly recommended before scale-sensitive algorithms (k-means, k-NN, PCA, SVM).

StandardScaler transforms each feature x to (x − μ) / σ where μ and σ are computed from the training set. Always fit on the train fold only, then transform both train and test — fitting on the test would leak distributional information. Distance-based and gradient-based algorithms are sensitive to feature scale; tree-based algorithms (Random Forest, gradient boosting) generally are not.
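The fit-on-train, transform-both discipline for a single feature, sketched without scikit-learn. The helper names are hypothetical. The population standard deviation is used, which matches scikit-learn's behaviour, and a zero standard deviation is replaced by 1 to avoid division by zero.

```python
import math

def fit_scaler(train_column):
    """Compute mean and standard deviation from the training fold only."""
    mu = sum(train_column) / len(train_column)
    var = sum((x - mu) ** 2 for x in train_column) / len(train_column)
    sigma = math.sqrt(var)
    return mu, (sigma if sigma > 0 else 1.0)

def transform(column, mu, sigma):
    return [(x - mu) / sigma for x in column]

train, test = [1.0, 2.0, 3.0], [2.0, 5.0]
mu, sigma = fit_scaler(train)           # statistics come from train only
train_z = transform(train, mu, sigma)
test_z = transform(test, mu, sigma)     # test reuses the train statistics
```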

02 · Training Infrastructure 7 entries

# AdamW

The default optimizer in modern transformer training. Adam with a corrected weight-decay term — decoupling regularization from gradient updates fixes Adam's known weight-decay bug.

Loshchilov & Hutter (2017) showed that Adam's L2-regularization-via-loss is mathematically distinct from true weight decay. AdamW applies decay directly to the parameters (θ ← θ − lr·decay·θ) instead of through the gradient — this restores the equivalence to SGD-with-momentum-and-decay and improves generalization. Now the standard for almost all deep learning.
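The difference shows up in a single scalar update. The sketch below is an illustration with hypothetical names, not an optimizer for real use: with `decoupled=False` the decay term is folded into the gradient (classic Adam with L2), and with `decoupled=True` it is applied directly to the parameter.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    if not decoupled:
        grad = grad + weight_decay * theta  # L2 penalty enters via the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_theta -= lr * weight_decay * theta  # decay applied to the parameter
    return new_theta, m, v

# Zero task gradient, so only the decay term acts on theta = 1.0.
adamw_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                              weight_decay=0.1, decoupled=True)
adam_l2_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                                weight_decay=0.1, decoupled=False)
```

In the coupled case the decay gradient is rescaled by Adam's normalization, so the parameter moves by roughly lr regardless of the decay coefficient; in the decoupled case it shrinks by exactly lr · decay · θ.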

# channels-last memory format

A PyTorch tensor memory layout (NHWC instead of NCHW) that matches cuDNN's preferred order on modern GPUs. Often a ~10–30% throughput win for convolutional or vision-transformer training.

`tensor.contiguous(memory_format=torch.channels_last)` reorders 4D tensors from PyTorch's default NCHW to NHWC. Modern cuDNN kernels run faster on NHWC; the model and inputs need to opt in. Pairs naturally with bfloat16. Net effect: free throughput improvement if the model supports it (most CNNs and ViTs do).

# Cosine annealing

also: cosineannealinglr, cosine schedule

A learning-rate schedule that decays the lr from its initial value to (typically) zero following a half cosine curve over the training duration. No tuning needed beyond initial lr and total steps.

Cosine annealing sets lr_t = lr_0 · (1 + cos(π · t / T)) / 2 where t is the current step (or epoch) and T is the total. Decay is smooth: slow at the start, fastest around the midpoint, and slow again as the lr approaches zero. Variants include cosine with warm restarts (SGDR) and cosine with linear warmup. A drop-in replacement for step decay with one fewer hyperparameter.
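The schedule formula evaluated directly, as a small sketch with a hypothetical function name. It decays to zero; a nonzero floor would need an extra term.

```python
import math

def cosine_lr(step, total_steps, lr_0):
    """lr_t = lr_0 * (1 + cos(pi * t / T)) / 2"""
    return lr_0 * (1 + math.cos(math.pi * step / total_steps)) / 2

schedule = [cosine_lr(t, 100, 1.0) for t in range(101)]
early_drop = schedule[0] - schedule[10]   # change over the first 10 steps
mid_drop = schedule[45] - schedule[55]    # change over the middle 10 steps
```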

# Gradient accumulation

Run several smaller minibatches and sum their gradients before stepping the optimizer. Achieves the effect of a larger effective batch on hardware that can't fit it in memory.

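For an effective batch size B with N accumulation steps, run N forward/backward passes on micro-batches of size B/N, calling `loss.backward()` each time so the gradients accumulate, then call `optimizer.step()` and `optimizer.zero_grad()` once after the Nth pass. With a mean-reduced loss, divide each micro-batch loss by N so the accumulated gradient matches that of a single batch of size B. The result is mathematically equivalent for layers without batch statistics (BatchNorm sees only the micro-batch), at the cost of more sequential passes per optimizer step. Necessary for fitting big models (EVA-02 L 448²) into consumer GPU memory.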
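The equivalence can be checked on a toy one-parameter model without any framework. The sketch accumulates gradients over N micro-batches of size B/N and scales each by 1/N, which plays the role of `(loss / N).backward()` in PyTorch. Names are hypothetical and the batch size is assumed to be divisible by N.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, n_steps):
    micro = len(xs) // n_steps  # micro-batch size B / N
    total = 0.0
    for s in range(n_steps):
        part = slice(s * micro, (s + 1) * micro)
        # Scale by 1/N so the sum of micro-batch means equals the full mean.
        total += grad_mse(w, xs[part], ys[part]) / n_steps
    return total

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0]
full = grad_mse(1.5, xs, ys)
accumulated = accumulated_grad(1.5, xs, ys, n_steps=2)
```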

# Label smoothing

Soften one-hot training labels by mixing in a small uniform distribution (e.g. with ε = 0.1 the true class gets about 0.9 and every other class 0.1/N). Prevents overconfident logits and acts as a mild regularizer.

Standard cross-entropy with one-hot labels pushes the model toward infinitely large logits for the correct class. Label smoothing replaces the target distribution (1, 0, 0, …) with (1−ε+ε/N, ε/N, ε/N, …) over N classes for some small ε (often 0.1), so the targets still sum to 1. Improves calibration and generalization. Sometimes counterproductive with margin-based losses like ArcFace, which already shape the logit geometry.
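A sketch of the smoothed target under the uniform-mixture convention, where the one-hot vector is scaled by 1 − ε and ε/N is added to every class (the convention PyTorch's `CrossEntropyLoss(label_smoothing=...)` follows). The function name is hypothetical.

```python
def smooth_labels(target, n_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    return [(1 - eps) * (1.0 if j == target else 0.0) + eps / n_classes
            for j in range(n_classes)]

smoothed = smooth_labels(target=0, n_classes=4)
```

With four classes and ε = 0.1 the true class gets 0.925 and each other class 0.025.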

# Mixed precision (AMP)

also: amp, bfloat16, bf16, fp16

Training with reduced-precision floats (bfloat16 or fp16) for forward/backward passes while keeping certain ops in fp32. Cuts memory and accelerates training on modern GPUs.

PyTorch's automatic mixed precision (AMP) wraps training in `torch.autocast` to cast eligible ops to bf16 or fp16. bfloat16 has the same exponent range as fp32, which avoids fp16's overflow and underflow problems on losses and gradients, and is the usual choice on GPUs that support it (A100/H100/4090). fp16 needs gradient scaling; bf16 typically doesn't. Channels-last memory format pairs well with bf16 for further throughput gains.

# Weight decay

A regularization term that shrinks weights toward zero each optimizer step: θ ← θ − lr · decay · θ. Equivalent to L2 regularization under plain SGD, but applied directly rather than through the loss; under adaptive optimizers like Adam the two differ.

Weight decay pulls model parameters toward zero by a small multiplicative factor each step (typical values 1e-4 to 1e-3 for transformers). Encourages simpler hypotheses. AdamW decouples weight decay from the adaptive gradient step, which makes the hyperparameter behave more consistently across optimizers.

02 · General ML 9 entries

# Backbone

The feature-extractor portion of a model (e.g. ResNet50, EVA-02). Outputs intermediate feature maps that downstream heads (classifier, detector, segmentation) consume.

In transfer learning, the 'backbone' is the part of the model pretrained on a large source task (commonly ImageNet for vision) and reused, with either frozen or fine-tuned weights, for a downstream task. The choice of backbone usually dominates final performance in image tasks; head architecture is secondary.

# Embedding

also: embeddings, feature vector

A fixed-length numerical vector representing an input. The geometry between vectors (cosine distance, Euclidean distance) encodes similarity between inputs.

An embedding is the output of a model's penultimate layer — a vector (typically 128 to 2048 dimensions) that summarizes the input. Good embeddings have the property that semantically similar inputs produce nearby vectors. Most retrieval, recommendation, and metric-learning systems are built on embeddings rather than raw model outputs.

# Ensemble

Combining predictions from multiple models (different seeds, different architectures, different folds). Usually beats any single model; the cheapest accuracy gain in Kaggle.

Ensembling averages probabilities (classification), embeddings (retrieval), or coordinates (regression) across multiple models. Diversity matters more than individual model strength — three different architectures often beat three seeds of the same architecture. Late-stage Kaggle competitions are largely ensembling exercises.

# Multi-seed protocol

also: multi-seed sweep

Running the same experiment with multiple random seeds and averaging or ensembling the results. Reveals genuine improvements (consistent across seeds) vs. lucky noise (one good seed).

Single-seed results in deep learning are noisy enough that two experiments can differ by 1–2% just from initialization randomness. A multi-seed protocol runs the full pipeline 3–5 times with different seeds and reports mean ± std. For ensembling, the predictions from each seed are averaged at inference — usually a free accuracy improvement over the best single seed.

# Stratified split

A train/validation split that preserves the class proportions of the full dataset in each subset. For imbalanced multi-class problems, the default random split will under-sample rare classes.

When a dataset is class-imbalanced, a uniform random 80/20 split will produce validation sets that randomly under- or over-represent minority classes — making the validation metric noisy and hard to compare across runs. A stratified split groups by class first, then samples 80/20 within each group, ensuring class frequencies match.
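The group-then-sample procedure, sketched with the standard library only. The function name is hypothetical, and forcing at least one validation sample per class is a design choice of this example; a class with a single sample would then be missing from the training side.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train indices, validation indices), sampled within each class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_fraction))
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return sorted(train), sorted(val)

labels = ["common"] * 90 + ["rare"] * 10
train_idx, val_idx = stratified_split(labels)
rare_in_val = sum(1 for i in val_idx if labels[i] == "rare")
```

The validation fold keeps the 90/10 class ratio of the full dataset, whatever the seed.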

# Transductive inference

An inference paradigm that uses the structure of the test set itself (not just trained model weights) to refine predictions. AQE and k-reciprocal reranking are transductive techniques.

In transductive inference the predictor has access to all test inputs at once and can exploit their relationships. This is in contrast with inductive inference where each test input is scored independently against a fixed model. Transductive postprocessing is powerful in retrieval (the gallery's structure refines query results) but doesn't generalize to streaming or online prediction.

# Triplet loss

also: batch-hard triplet

A metric-learning loss that pulls an anchor's embedding toward a positive (same class) and pushes it away from a negative (different class), enforcing a margin. Batch-hard variants pick the hardest positive and negative within each minibatch.

Triplet loss: L = max(0, d(anchor, positive) − d(anchor, negative) + margin). The 'batch-hard' variant mines, within each minibatch, the positive with the largest distance and the negative with the smallest distance — making the loss focus on the most informative triplets rather than averaging over easy ones. Common alternative to ArcFace; often combined with cross-entropy.
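A dependency-free sketch of the batch-hard variant with Euclidean distance. The function names and the margin value are assumptions of this example; anchors that lack a positive or a negative in the batch are skipped.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    losses = []
    for i, (anchor, y) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] == y and j != i]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != y]
        if pos and neg:
            # hardest positive = farthest, hardest negative = closest
            losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0

separated = batch_hard_triplet_loss(
    [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]], [0, 0, 1, 1])
interleaved = batch_hard_triplet_loss(
    [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]], [0, 0, 1, 1])
```

Well-separated classes already satisfy the margin and contribute zero loss; interleaved classes are penalized.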

# TTA — test-time augmentation

also: test-time augmentation

Apply multiple augmentations to each test input (horizontal flip, crops, color jitter), run the model on each, and average the predictions or embeddings. Trades inference compute for accuracy.

Test-time augmentation reduces single-input variance by ensembling the model's predictions across plausible transformations of the same input. For classification, average the class probabilities; for retrieval, average the L2-normalized embeddings. Horizontal flip alone often gives most of the gain for object-symmetric tasks.

# Validation set

also: val set

A held-out subset of the labeled training data, used to choose hyperparameters and select the best checkpoint without touching the test set. Should mimic the test distribution as closely as possible.

The validation set is used during training and model selection; the test set is only used once, at the end, for reporting. A poorly designed validation set is one of the most common ways to mislead yourself — if it doesn't represent the test distribution (different class balance, different difficulty, different time period), choosing the 'best' model by val metric can pick a worse model on test.

02 · Math & Metrics 4 entries

# Cosine similarity

The cosine of the angle between two vectors: cos(θ) = (a · b) / (‖a‖‖b‖). Ranges from −1 (opposite) through 0 (orthogonal) to 1 (identical direction). The standard similarity metric for unit-norm embeddings.

Cosine similarity ignores vector magnitude and measures only direction, making it appropriate when only the angular relationship between embeddings is meaningful. After L2 normalization of all vectors, cosine similarity is simply the dot product, which is what most retrieval pipelines actually compute.

# L2 normalization

also: l2 norm, unit-norm

Scaling a vector to unit length: v ← v / ‖v‖₂. After normalization, cosine similarity equals the dot product. A standard preprocessing step before retrieval.

L2 normalization is essential for embedding-based retrieval: it removes the confounding effect of vector magnitude (which can encode confidence or feature scale), leaving only direction. After normalization, the dot product between two embeddings exactly equals their cosine similarity, and the inner-product index becomes a cosine-similarity index.
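The identity between the dot product of normalized vectors and cosine similarity can be checked numerically. A minimal sketch with hypothetical helper names:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine_similarity(a, b)
dot_unit = dot(l2_normalize(a), l2_normalize(b))
```

Scaling either input leaves both values unchanged, which is the magnitude invariance described above.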

# Pearson correlation

also: correlation coefficient

Standard linear-correlation coefficient: cov(X, Y) / (σ_X · σ_Y). Ranges from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). Doesn't detect non-linear patterns.

Pearson's r measures the strength and direction of the linear relationship between two variables. r² is the fraction of variance in one variable explained by a linear fit on the other. Pearson is sensitive to outliers and only captures linear structure — for monotonic but non-linear relationships, Spearman's rank correlation is more appropriate.
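A direct implementation of the coefficient, followed by a case that shows the linear-only limitation. The function name is hypothetical, and inputs are assumed to have nonzero variance.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

linear = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
parabola = pearson_r(xs, [x * x for x in xs])  # y depends on x, yet r is 0
```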

# Softmax

also: cross-entropy

softmax(z)_i = exp(z_i) / Σ exp(z_j). Turns a vector of logits into a probability distribution. Paired with cross-entropy loss for classification.

Softmax exponentiates each logit and normalizes by the sum — converting raw model outputs into a categorical probability distribution. Cross-entropy loss with one-hot targets reduces to −log p_y, where p_y is the predicted probability of the correct class. ArcFace and similar metric losses are reformulations of the same softmax-cross-entropy structure in cosine space.
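A numerically stable sketch of both functions. Subtracting the maximum logit before exponentiating is a standard trick that leaves the result unchanged and prevents overflow. The function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # shifting by the max does not change the output
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """-log p_y for a one-hot target at index `target`."""
    return -math.log(softmax(z)[target])

probs = softmax([0.0, 0.0])
loss = cross_entropy([0.0, 0.0], target=0)  # -log(0.5)
extreme = softmax([1000.0, 0.0, -1000.0])   # no overflow thanks to the shift
```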