Stanford RNA 3D Folding 2 · Kaggle

The challenge

Predict five 3D coordinate sets per target (one x, y, z position per residue in each set), evaluated as best-of-five structural RMSD against held-out experimentally determined structures. Five guesses per target hedge against structural ambiguity, but the leaderboard still rewards a strong top-1 prediction. The competition is research-track, with a 75,000 USD prize and a 2026-03-25 deadline.
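One way to read the metric, as a minimal sketch: superpose each candidate onto the reference, compute RMSD, and keep the smallest of the five. The function names and the choice of Kabsch superposition are assumptions for illustration; the official scorer may align or weight residues differently.

```python
import numpy as np

def kabsch_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    p = pred - pred.mean(axis=0)          # remove translation
    q = ref - ref.mean(axis=0)
    # SVD of the 3x3 covariance matrix gives the optimal rotation (Kabsch).
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip one axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = p @ rot - q
    return float(np.sqrt((diff ** 2).sum() / len(ref)))

def best_of_five(candidates, ref: np.ndarray) -> float:
    """Score a target by its best candidate (lowest RMSD)."""
    return min(kabsch_rmsd(c, ref) for c in candidates)
```

Because only the minimum counts, a diverse set of five can beat five near-copies of one good guess.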

The dataset shape

The train set has 5,716 targets. Sequence lengths span four orders of magnitude. The MSAs are deep (up to 14,000 aligned sequences per target) and take gigabytes on disk. The detail that shapes any serious solution: more than half of the train targets have incomplete coordinates.

raw size of the train side

chart data

metric	description	value
train targets	RNA sequences with experimental structure	5,716
MSA depth (max)	aligned sequences per target	14,000
longest seq (nt)	ribonucleotides	125,000

MSA depth is the long pole on RAM/disk; long sequences are the long pole on inference compute.

train targets by coordinate completeness

chart data

label	value
with full coords	2,671
incomplete coords	3,045

More than half of train targets are missing chunks of the experimental structure. Any loss function that doesn't mask is implicitly fitting noise.
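A minimal sketch of what masking means here, assuming missing coordinates are encoded as NaN (an assumption, not confirmed by the source) and that prediction and target are already in the same frame:

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared distance over residues whose target x, y, z are all present.

    pred, target: (N, 3) arrays; missing target residues are rows containing NaN.
    """
    mask = ~np.isnan(target).any(axis=1)   # True where the residue was resolved
    if not mask.any():
        return 0.0                          # nothing to supervise on this target
    diff = pred[mask] - target[mask]
    return float((diff ** 2).sum(axis=1).mean())
```

Without the mask, any placeholder value standing in for an unresolved residue would be treated as ground truth and pull the fit toward it.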

train target count by sequence length (nucleotides)

chart data

label	value
≤50	1,736 targets
51–200	1,743 targets
201–1k	348 targets
1k–5k	1,557 targets
5k–20k	331 targets
20k+	1 target

Spans four orders of magnitude. Min 10 nt, max 125,580 nt, mean 1,364 nt. Most targets are short ribozymes; the tail is full-length viral RNA.

top 10 ligands across the train set

chart data

label	value
MG	364 targets
MG;ZN	331 targets
ZN	277 targets
K;MG	65 targets
SO4	62 targets
GTP;MG;ZN	32 targets
CA	30 targets
NCO	29 targets
K	28 targets
MN	26 targets

MG dominates as the structural cofactor; ZN and combinations follow. Single targets often have multiple ligands (MG;ZN, K;MG, GTP;MG;ZN).

top 8 chain stoichiometries

chart data

label	value
A:1	1,287 targets
B:1	511 targets
A:2	386 targets
B:1;A:1	216 targets
C:1	162 targets
A:1;B:1	150 targets
C:2	101 targets
R:1	86 targets

A:1 monomeric chains dominate; A:2 and B:1;A:1 indicate dimer/co-fold structures.

A three-tier hybrid

A pure deep-learning model would ignore the strongest signal in the data — that training targets with similar sequence often have similar backbones. A pure classical retriever leaves the leaderboard tail on the table. So the repo runs all three: retrieve, refine, rerank.

three-tier hybrid

01
Tier 1 — Classical sequence-NN
k-mer retrieval + coord warping
Cheap, parallelisable on CPU. Establishes the strongest baseline by reusing training-set 3D structure where sequences match.
02
Tier 2 — RibonanzaNet3D refinement
fine-tuned NVIDIA RibonanzaNet2 + Linear(256→3) head
Dropout sampling at inference produces 4 stochastic + 1 deterministic candidate per target.
03
Tier 3 — Neural reranker
128 prefiltered → 24 reranked → 5 submitted
MLP over kmer + sequence stats + structural stats. Picks the final five from the candidate pool.

Tier 1 — Classical sequence-NN

The strongest baseline in the repo is not a neural network. It's a classical retrieve-and-warp pipeline that uses k-mer features and sequence alignment (Python's SequenceMatcher by default, Needleman-Wunsch via Biopython as the alternative) to find similar training RNAs and copy their coordinates. It runs on CPU and finishes a full validation pass in minutes.

sequence-NN baseline

feature space

k-mer order: k = 3
alphabet: {A, C, G, U}
histogram dim: 4³ = 64
normalization: L2
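The feature space above can be sketched in a few lines of standard-library Python. The function names are illustrative, and skipping windows that contain non-ACGU symbols is an assumption about how such characters would be handled.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"
K = 3
# Fixed ordering of all 4**3 = 64 possible 3-mers: AAA, AAC, ..., UUU.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

def kmer_histogram(seq: str) -> list[float]:
    """L2-normalised 64-dim count vector of overlapping 3-mers."""
    counts = [0.0] * len(KMER_INDEX)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:               # assumption: ignore non-ACGU windows
            counts[idx] += 1.0
    norm = sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for L2-normalised inputs."""
    return sum(x * y for x, y in zip(a, b))
```

With unit-length vectors, nearest-neighbour search reduces to a dot product, which is cheap enough to run against every train target on CPU.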

candidate retrieval

primary: SequenceMatcher (Python stdlib)
alternative: Biopython PairwiseAligner (Needleman-Wunsch)
top-k: configurable
metadata weights: ligand Jaccard · stoichiometry · diversity penalty
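A hedged sketch of how sequence similarity and metadata could be combined into one retrieval score. The weights, the dict keys (`seq`, `ligands`, `stoich`) and the additive form are all hypothetical; the diversity penalty is left out for brevity.

```python
from difflib import SequenceMatcher

def ligand_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of two ';'-separated ligand strings such as 'MG;ZN'."""
    sa = set(filter(None, a.split(";")))
    sb = set(filter(None, b.split(";")))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def score(query: dict, template: dict, w_ligand: float = 0.1, w_stoich: float = 0.05) -> float:
    """Sequence similarity plus small metadata bonuses (weights are made up)."""
    seq_sim = SequenceMatcher(None, query["seq"], template["seq"], autojunk=False).ratio()
    lig = ligand_jaccard(query["ligands"], template["ligands"])
    stoich = 1.0 if query["stoich"] == template["stoich"] else 0.0
    return seq_sim + w_ligand * lig + w_stoich * stoich

def top_k(query: dict, templates: list, k: int = 5) -> list:
    """Return the k highest-scoring templates for a query."""
    return sorted(templates, key=lambda t: score(query, t), reverse=True)[:k]
```

`autojunk=False` turns off SequenceMatcher's popularity heuristic, which would otherwise treat frequent characters as junk on sequences of 200 or more items; with a four-letter alphabet that would affect almost every nucleotide.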

coordinate adaptation

length match: block-warping (overlay aligned positions)
length mismatch: linear interpolation along 3 axes
backbone constraint: 5.5 – 6.5 Å between consecutive residues
base-pair geometry: Watson–Crick on i+3..i+24 for short molecules
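For the length-mismatch case, linear interpolation along the three axes can be sketched as resampling the template's ordered trace to the query length. This is an illustration of the idea under that reading, not the code in baseline_sequence_nn.py.

```python
def resample_coords(coords, new_len: int):
    """Linearly interpolate an ordered list of (x, y, z) points to new_len points."""
    n = len(coords)
    if n == 0 or new_len <= 0:
        raise ValueError("need at least one input point and a positive target length")
    if n == 1 or new_len == 1:
        return [tuple(coords[0])] * new_len
    out = []
    for j in range(new_len):
        t = j * (n - 1) / (new_len - 1)   # fractional index into the template
        i = min(int(t), n - 2)            # left neighbour, clamped at the end
        frac = t - i
        a, b = coords[i], coords[i + 1]
        # Interpolate x, y and z independently.
        out.append(tuple(a[d] + frac * (b[d] - a[d]) for d in range(3)))
    return out
```

Resampling changes the spacing between consecutive points, which is one reason a distance constraint is needed afterwards.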

Source: baseline_sequence_nn.py, 758 lines. The constraint enforcement at output time is the difference between predictions that hold up under structural review and ones that don't.
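One simple way to enforce the 5.5 to 6.5 Å backbone constraint is to walk the chain, keep each step's direction, and clamp its length into range. The sketch below assumes that approach; the repo's actual enforcement may differ.

```python
from math import dist

def enforce_step_distance(coords, lo: float = 5.5, hi: float = 6.5):
    """Rebuild a chain so every consecutive distance lies in [lo, hi] (angstroms)."""
    if not coords:
        return []
    out = [tuple(coords[0])]
    for prev_in, cur_in in zip(coords, coords[1:]):
        step = [c - p for p, c in zip(prev_in, cur_in)]
        length = dist(prev_in, cur_in)
        if length == 0.0:
            # Coincident points have no direction; pick an arbitrary one.
            step, length = [1.0, 0.0, 0.0], 1.0
        target = min(max(length, lo), hi)   # clamp the step length into range
        scale = target / length
        last = out[-1]
        out.append(tuple(l + s * scale for l, s in zip(last, step)))
    return out
```

Because each residue is placed relative to the previous output point, corrections accumulate along the chain; local geometry is repaired at the cost of some global drift.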

top-1 RMSD — baseline vs reranked

chart data

run	top-1 RMSD
sequence-NN · classical retrieval baseline	33.98
+ reranking · rerank candidates by sequence similarity	33.89

best-of-5 RMSD — baseline vs reranked

chart data

run	best-of-5 RMSD
sequence-NN · five closest neighbors warped to query	26.05
+ reranking · selected best-of-5 after rerank	25.43

Tier 2 — RibonanzaNet3D refinement

Fine-tune NVIDIA's RibonanzaNet2, a pretrained sequence model from the Ribonanza Kaggle competition, with a 3D coordinate head. The base provides context-aware per-residue embeddings; a single linear layer maps each 256-dimensional embedding to its (x, y, z).

RibonanzaNet3D

model

base: NVIDIA RibonanzaNet2 (pretrained)
head: Linear(256, 3)
max sequence length: 512 nt
above max: fall through to classical baseline
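The head and the length-based fallback can be sketched with NumPy standing in for the real framework. `embed_fn` and `classical_fn` are hypothetical placeholders for the pretrained encoder and the Tier 1 baseline; the weight initialisation is arbitrary.

```python
import numpy as np

MAX_LEN = 512      # sequences longer than this skip the neural path
EMBED_DIM = 256    # per-residue embedding width fed to the head

class LinearCoordHead:
    """Equivalent of Linear(256, 3): one affine map per residue."""
    def __init__(self, rng: np.random.Generator):
        self.weight = rng.normal(scale=0.02, size=(EMBED_DIM, 3))
        self.bias = np.zeros(3)

    def __call__(self, embeddings: np.ndarray) -> np.ndarray:
        # (L, 256) @ (256, 3) + (3,) -> (L, 3) coordinates
        return embeddings @ self.weight + self.bias

def predict(seq: str, embed_fn, head, classical_fn) -> np.ndarray:
    """Route long sequences to the classical baseline, others through the head."""
    if len(seq) > MAX_LEN:
        return classical_fn(seq)
    return head(embed_fn(seq))
```

The head has only 256 × 3 + 3 = 771 parameters, so nearly all of the modelling capacity sits in the pretrained base.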

training

optimizer: AdamW, lr=2e-4, wd=1e-4
dropout: 0.1
batch size: 2
grad accumulation: 4 (effective 8)
epochs: 24

inference (candidate diversity)

stochastic passes: 4 (dropout-on)
deterministic pass: 1 (eval-mode)
output: 5 candidates per target
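The candidate-diversity protocol amounts to running the same model several times with dropout left on, plus once with it off. A toy NumPy sketch, where dropout is applied to the embeddings and a fixed `weight` matrix stands in for the trained model (both are simplifying assumptions):

```python
import numpy as np

def dropout(x: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Inverted dropout: zero a random fraction of entries and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def sample_candidates(embeddings: np.ndarray, weight: np.ndarray,
                      n_stochastic: int = 4, rate: float = 0.1, seed: int = 0):
    """Return n_stochastic dropout-on predictions followed by one deterministic one."""
    rng = np.random.default_rng(seed)
    candidates = [dropout(embeddings, rate, rng) @ weight for _ in range(n_stochastic)]
    candidates.append(embeddings @ weight)   # eval-mode pass, dropout off
    return candidates
```

Under a best-of-five metric, the stochastic passes are useful only to the extent that they differ from the deterministic one in ways that sometimes land closer to the truth.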

Sources: train_ribonanzanet3d_v2.py + ribonanzanet3d_hybrid_infer.py + start_ribonanzanet3d_v2.sh.

Tier 3 — Neural reranker

The reranker is small. The job is to pick well, not to predict. It takes 128 prefiltered candidates from the classical pipeline, scores each, keeps 24, and surfaces the final 5. It is trained on a 3-way temporal split, with a leakage guard so that no template dated on or after the query's cutoff is ever considered.
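The funnel can be sketched as two sorts and three cuts. How the final five are chosen from the 24 is not stated in the source; taking the top five by reranker score is a placeholder assumption, and both scoring callables are hypothetical.

```python
def funnel(candidates, prefilter_score, rerank_score, sizes=(128, 24, 5)):
    """Cheap prefilter to 128, reranker score to 24, then submit 5."""
    n_pre, n_rerank, n_submit = sizes
    # Stage 1: cheap classical score keeps the pool small.
    prefiltered = sorted(candidates, key=prefilter_score, reverse=True)[:n_pre]
    # Stage 2: the learned reranker scores every prefiltered candidate.
    reranked = sorted(prefiltered, key=rerank_score, reverse=True)[:n_rerank]
    # Stage 3: placeholder selection, simply the five best by reranker score.
    return reranked[:n_submit]
```

Keeping the reranker behind a cheap prefilter means the learned model only ever scores a bounded number of candidates per query.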

hybrid reranker

training split (temporal)

core training set: 88% — oldest cutoffs
training queries: 2% — middle cutoffs
validation queries: 10% — newest cutoffs
leakage guard: exclude templates >= query cutoff
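A sketch of the split and the leakage guard, assuming each target carries a `cutoff` date (the field name and the rounding of split sizes are assumptions):

```python
from datetime import date

def temporal_split(targets, core_frac: float = 0.88, train_q_frac: float = 0.02):
    """Sort by cutoff date, then cut into core set, training queries, validation queries."""
    ordered = sorted(targets, key=lambda t: t["cutoff"])
    n = len(ordered)
    a = round(n * core_frac)               # oldest 88%
    b = a + round(n * train_q_frac)        # next 2%
    return ordered[:a], ordered[a:b], ordered[b:]   # remainder is the newest 10%

def allowed_templates(query, templates):
    """Leakage guard: drop any template whose cutoff is on or after the query's."""
    return [t for t in templates if t["cutoff"] < query["cutoff"]]
```

The strict `<` mirrors the "exclude templates >= query cutoff" rule, so a template sharing the query's date is also rejected.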

feature stack

sequence: query + template kmer (k=3, 64-dim each)
seq stats: length · GC% · AU%
structure stats: radius of gyration · mean/std step distance · end-to-end
length-ratio gate: 20% minimum
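The hand-built part of the feature stack is simple enough to write out. The sketch below uses only the standard library; reading the length-ratio gate as "shorter length divided by longer length must be at least 0.2" is an assumption.

```python
from math import dist, sqrt
from statistics import mean, pstdev

def seq_stats(seq: str) -> list:
    """Length, GC fraction and AU fraction of a sequence."""
    n = len(seq)
    gc = sum(seq.count(b) for b in "GC") / n
    au = sum(seq.count(b) for b in "AU") / n
    return [float(n), gc, au]

def structure_stats(coords) -> list:
    """Radius of gyration, mean and std of step distance, end-to-end distance."""
    centroid = tuple(mean(c[d] for c in coords) for d in range(3))
    rg = sqrt(mean(dist(c, centroid) ** 2 for c in coords))
    steps = [dist(a, b) for a, b in zip(coords, coords[1:])]
    return [rg, mean(steps), pstdev(steps), dist(coords[0], coords[-1])]

def passes_length_gate(query_len: int, template_len: int, min_ratio: float = 0.2) -> bool:
    """Reject query/template pairs whose lengths are too different."""
    return min(query_len, template_len) / max(query_len, template_len) >= min_ratio
```

These statistics describe the template's overall shape without needing the true query structure, so they are available at inference time.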

training

optimizer: AdamW, lr=3e-4, wd=1e-4
batch size: 64
epochs: 24
embedding batch: 16 · max_len 1536 · overlap 256
candidate funnel: 128 → 24 → 5
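If `max_len 1536 · overlap 256` describes sliding windows for embedding long sequences (an interpretation, not something the source states), the window boundaries would look like this:

```python
def windows(n: int, max_len: int = 1536, overlap: int = 256):
    """Half-open (start, end) spans covering range(n) with overlapping windows."""
    if n <= max_len:
        return [(0, n)]
    stride = max_len - overlap        # each new window starts 1280 nt further on
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n)
        spans.append((start, end))
        if end == n:
            break
        start += stride
    return spans
```

For a 4,000 nt sequence this gives three windows, with each adjacent pair sharing 256 positions whose embeddings would need to be merged, for example by averaging.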

Source: train_hybrid_reranker.py + start_hybrid_reranker_overnight.sh.

Distributed search infrastructure

The classical search across 5,716 train targets is embarrassingly parallel and pure CPU. The repo runs it on three surfaces, followed by a merge step:

search topology

01
sbl1 (dev box): orchestration + merge
02
sbl2 / sbl3 / sbl4: tailnet tmux shards
03
GCP Batch: spot CPU bursts
04
merge step: merge_sequence_nn_search.py

No cluster manager. rsync + tmux + a fleet config array. The simplest thing that works.
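The shard-index / num-shards contract can be illustrated in Python. Round-robin assignment and dict-shaped shard results are assumptions; the real logic lives in the search script and merge_sequence_nn_search.py, which are not shown here.

```python
def shard_targets(target_ids, shard_index: int, num_shards: int):
    """Round-robin slice: shard i takes items i, i + num_shards, i + 2*num_shards, ..."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return target_ids[shard_index::num_shards]

def merge_shards(shard_results):
    """Combine per-shard {target_id: result} dicts, refusing duplicate targets."""
    merged = {}
    for result in shard_results:
        overlap = merged.keys() & result.keys()
        if overlap:
            raise ValueError(f"targets appear in more than one shard: {sorted(overlap)}")
        merged.update(result)
    return merged
```

Round-robin slicing needs no coordination between machines: each box only has to know its own index and the total shard count.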

run_sequence_nn_shard.sh L1–L30 bash · 30 lines

#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
OUT_DIR="$ROOT/outputs/analysis/shards"

if [[ $# -lt 2 ]]; then
  echo "usage: $0 <shard-index> <num-shards> [workers] [extra search args...]" >&2
  exit 1
fi

SHARD_INDEX="$1"
NUM_SHARDS="$2"
WORKERS="${3:-$(nproc)}"
if [[ $# -ge 3 ]]; then
  shift 3
else
  shift 2
fi

mkdir -p "$OUT_DIR"
PYTHON_BIN="${PYTHON_BIN:-python}"
if [[ -f "$ROOT/.venv/bin/activate" ]]; then
  # Prefer the challenge-local environment when present.
  # shellcheck disable=SC1091
  source "$ROOT/.venv/bin/activate"
  PYTHON_BIN="python"
fi

search_cmd=(

What's in flight

RibonanzaNet3D training is the active workstream; the hybrid reranker runs overnight. The measured gain so far is small: reranking moves top-1 RMSD from 33.98 to 33.89 (about a 0.26% relative gain) and best-of-5 RMSD from 26.05 to 25.43 (about 2.4%). The structural pieces (the 3-tier hybrid, the temporal-split reranker training, the candidate diversity protocol) are the levers that compound across future runs.
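For reference, the relative gains can be recomputed from the four RMSD values reported in the charts. The helper name is illustrative.

```python
def relative_gain_pct(before: float, after: float) -> float:
    """Percentage reduction in RMSD (lower is better)."""
    return (before - after) / before * 100.0

top1 = relative_gain_pct(33.98, 33.89)       # reranking effect on top-1 RMSD
best5 = relative_gain_pct(26.05, 25.43)      # reranking effect on best-of-5 RMSD
print(f"top-1: {top1:.2f}%  best-of-5: {best5:.2f}%")
```

The best-of-5 improvement is roughly nine times the top-1 improvement in relative terms, so the reranker currently helps more with which five to submit than with which one to rank first.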
