§ infrastructure

A pipeline that scales across four kinds of compute.

Everything here is reproducible. The same scripts run on the dev laptop, on a rented H100, on a transient GCP Batch worker, and across the sbl tailnet. The competitions are the proof; the workflow is the moat.


§ 01 · local-first iteration sbl1 · local

Run it on the laptop before you rent a GPU.

The repo's discipline is local-first: every competition has a workable, end-to-end baseline that runs on sbl1 — an RTX 4070 SUPER and an i9-14900KF. The baseline doesn't have to compete; it has to compile, train, predict, and submit without touching a remote machine. That's the contract.

The reason isn't cost — it's tightness of the feedback loop. A wrong assumption discovered in five minutes locally is an order of magnitude cheaper than the same assumption discovered after two hours of remote training plus the round trip to pull logs back. Cloud compute follows local correctness, never the other way around.

Orchestration is a Makefile. make competition-inspect stages a new challenge. make bootstrap wakes a remote. make train, make pull, make submit all take a single competition= argument and route correctly. The shell scripts under it are composable building blocks.

Makefile L1–L32 makefile · 32 lines
.PHONY: kaggle-install kaggle-auth gcp-install gcp-setup gcp-submit gcp-status gcp-pull gcp-stop competition-inspect competition-new watch-local bootstrap sync train train-1 train-1-auto monitor pull predict predict-smoke submit deploy stop e2e-smoke e2e-full all

kaggle-install:
	./scripts/install-kaggle-cli.sh

kaggle-auth:
	./scripts/setup-kaggle-auth.sh local

gcp-install:
	./scripts/install-gcloud-cli.sh

gcp-setup:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-setup.sh

gcp-submit:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-submit.sh

gcp-status:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-status.sh

gcp-pull:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-pull.sh

gcp-stop:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-stop.sh

competition-inspect:
	./scripts/inspect-competition.sh "$(competition)"

competition-new:
	./scripts/new-competition.sh "$(competition)" "$(name)"

§ 02 · remote gpu bursts vast.ai · remote GPU

Rent only when the experiment design is strong.

Vast.ai is the workhorse for short, high-conviction runs such as an EVA-02 448 sweep or an overnight ArcFace seed pass. The bootstrap script does seven things in one command: install the toolchain, push Kaggle auth, sync the repo, create a venv, download the dataset, set up tmux, and start the trainer. The dev box never has to copy data manually.
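The remote-side half of that list (venv, dataset, tmux, trainer) could be sketched roughly as below. This is a minimal sketch under assumptions, not the real bootstrap-vast.sh: run_remote, bootstrap_remote_steps, BOOTSTRAP_DRY_RUN, the tmux session name, and the exact remote commands are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: step order follows the prose; the commands are assumptions.

# Run a command on the remote, or just print it when BOOTSTRAP_DRY_RUN=1.
run_remote() {
  local host="$1" cmd="$2"
  if [[ "${BOOTSTRAP_DRY_RUN:-0}" == "1" ]]; then
    printf 'ssh %s %s\n' "$host" "$cmd"
  else
    ssh "$host" "$cmd"
  fi
}

# Steps that follow the repo sync: venv, dataset download, detached trainer.
# Stops at the first step that fails.
bootstrap_remote_steps() {
  local host="$1" root="$2" slug="$3"
  run_remote "$host" \
    "cd '$root' && python3 -m venv .venv && .venv/bin/pip install kaggle" || return 1
  run_remote "$host" \
    "cd '$root' && . .venv/bin/activate && kaggle competitions download -c '$slug' -p 'competitions/$slug/data'" || return 1
  run_remote "$host" \
    "tmux new-session -d -s train \"cd '$root' && make train competition=$slug\"" || return 1
}
```

Running `BOOTSTRAP_DRY_RUN=1 bootstrap_remote_steps vast /workspace/kaggle jaguar-re-id` prints the three ssh commands without touching the remote, which keeps the local-first habit intact even for the bootstrap itself.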

The retrospective is blunt about Vast economics: long speculative remote sweeps had poor ROI. Vast belongs to hypotheses that have already passed an offline validation gate. When that gate is honest, an H100 hour is the cheapest gain on the board. When it isn't, the same hour is the most expensive way to learn nothing.

bootstrap-vast.sh L1–L30 bash · 30 lines
#!/usr/bin/env bash
set -euo pipefail

REMOTE_HOST="${1:-${KAGGLE_REMOTE_HOST:-vast}}"
REMOTE_REPO_ROOT="${KAGGLE_REMOTE_ROOT:-/workspace/kaggle}"
REMOTE_COMP_DIR="${REMOTE_REPO_ROOT}/competitions/jaguar-re-id"
LOCAL_REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"

echo "[bootstrap] remote host: ${REMOTE_HOST}"
echo "[bootstrap] remote repo root: ${REMOTE_REPO_ROOT}"

"${LOCAL_REPO_ROOT}/scripts/install-codex-remote.sh" "${REMOTE_HOST}" "${REMOTE_REPO_ROOT}"
"${LOCAL_REPO_ROOT}/scripts/push-kaggle-auth-remote.sh" "${REMOTE_HOST}"

ssh "${REMOTE_HOST}" "mkdir -p '${REMOTE_REPO_ROOT}'"

echo "[bootstrap] syncing repo to remote"
rsync -avz \
  --exclude .git \
  --exclude __pycache__ \
  --exclude .venv \
  --exclude .venv-dashboard \
  --exclude 'competitions/*/data' \
  --exclude 'competitions/*/outputs' \
  --exclude dashboard/outputs \
  --exclude dashboard/node_modules \
  --exclude dashboard/playwright-report \
  --exclude dashboard/test-results \
  "${LOCAL_REPO_ROOT}/" "${REMOTE_HOST}:${REMOTE_REPO_ROOT}/"

§ 03 · transient cpu bursts gcp batch · cpu

For embarrassingly parallel CPU work, never pay for an idle VM.

The Stanford RNA search, sharded across 5,716 train targets, is pure CPU work: classical sequence alignment, no gradient updates. The right tool isn't a beefy GPU; it's a batch of spot instances that live for the duration of the search and then disappear.
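One way to picture the sharding: each worker gets a shard index and a shard count, and derives its own slice of the targets from those two numbers. The sketch below assumes contiguous slices; how run_sequence_nn_shard.sh actually divides the targets is not shown here, and shard_range is a hypothetical helper.

```shell
# Sketch: contiguous slice of targets owned by one shard (assumed scheme).
# Prints "start end", with end exclusive. Integer division spreads the
# remainder so every target lands in exactly one shard.
shard_range() {
  local index="$1" count="$2" total="$3"
  local start=$(( index * total / count ))
  local end=$(( (index + 1) * total / count ))
  printf '%d %d\n' "$start" "$end"
}

# Example: the first and last of 8 shards over 5,716 targets.
shard_range 0 8 5716   # prints: 0 714
shard_range 7 8 5716   # prints: 5001 5716
```

On GCP Batch the index and count would come from the BATCH_TASK_INDEX and BATCH_TASK_COUNT variables that Batch sets for each task.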

GCP Batch handles it. gcp-batch-submit.sh packages the same shell script the cluster uses, hands it to Batch with a task count and a parallelism limit, and the artifacts come back through GCS. No idle VM. No keep-alive container. The bill is a function of actual CPU-seconds.
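A job submission of that shape might be built from a config like the one rendered below. It is a sketch, not the contents of gcp-batch-submit.sh: render_batch_job is a hypothetical helper, and the machine type, retry count, and script path are placeholder assumptions. The field names follow the Batch job JSON format.

```shell
# Sketch: emit a Batch job config as JSON on stdout.
# Assumed values: machine type, retry count, and the script invocation.
render_batch_job() {
  local task_count="$1" parallelism="$2"
  cat <<JSON
{
  "taskGroups": [
    {
      "taskCount": ${task_count},
      "parallelism": ${parallelism},
      "taskSpec": {
        "maxRetryCount": 2,
        "runnables": [
          { "script": { "text": "bash run_gcp_batch_task.sh" } }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      { "policy": { "machineType": "e2-standard-8", "provisioningModel": "SPOT" } }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
JSON
}
```

The rendered file would then go to `gcloud batch jobs submit <job-name> --location <region> --config job.json`. Spot provisioning plus a retry count is what lets a preempted task restart without anyone watching it.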

run_gcp_batch_task.sh L1–L25 bash · 25 lines
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
cd "$ROOT"

workers="${RNA_GCP_WORKERS:-8}"
search_args=()
while IFS= read -r arg; do
  search_args+=("$arg")
done < <(bash -lc 'source competitions/stanford-rna-3d-folding-2/cluster_fleet_config.sh && search_args')

PYTHON_BIN="${PYTHON_BIN:-python3}"
if [[ -f .venv/bin/activate ]]; then
  # shellcheck disable=SC1091
  source .venv/bin/activate
  PYTHON_BIN="python"
fi

PYTHON_BIN="$PYTHON_BIN" bash \
  competitions/stanford-rna-3d-folding-2/run_sequence_nn_shard.sh \
  "${BATCH_TASK_INDEX:-0}" \
  "${BATCH_TASK_COUNT:-1}" \
  "$workers" \
  "${search_args[@]}"

§ 04 · distributed shard search sbl cluster · distributed

The sbl tailnet absorbs overnight work for free.

sbl2, sbl3, and sbl4 sit on the same tailnet as the dev box. When a long classical search kicks off, launch_cluster_search.sh rsyncs the project to each host, opens a tmux session, and runs one shard per box. The fleet config maps shard index to hostname.
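The fleet config can be as small as an array and a naming function. The sketch below reuses the names the launcher expects (RNA_CLUSTER_HOSTS_ARR, tmux_session_name, REMOTE_PROJECT_DIR, REMOTE_SHARD_DIR), but the values, the session-name format, and the host_for_shard helper are assumptions, not the contents of the real cluster_fleet_config.sh.

```shell
# Hypothetical cluster_fleet_config.sh: names match the launcher, values are assumed.

# Shard index is the array position: shard 0 -> sbl2, shard 1 -> sbl3, ...
RNA_CLUSTER_HOSTS_ARR=(sbl2 sbl3 sbl4)

# Assumed remote layout; override from the environment if needed.
REMOTE_PROJECT_DIR="${REMOTE_PROJECT_DIR:-/srv/rna-search}"
REMOTE_SHARD_DIR="${REMOTE_SHARD_DIR:-$REMOTE_PROJECT_DIR/outputs/shards}"

# One tmux session per shard, named after the shard index.
tmux_session_name() {
  printf 'rna-search-shard%s\n' "$1"
}

# Resolve a shard index to its host; fails for an out-of-range index.
host_for_shard() {
  local index="$1"
  if (( index < 0 || index >= ${#RNA_CLUSTER_HOSTS_ARR[@]} )); then
    echo "[cluster] no host for shard ${index}" >&2
    return 1
  fi
  printf '%s\n' "${RNA_CLUSTER_HOSTS_ARR[$index]}"
}
```

Adding a box to the fleet is then a one-word change to the array, and the shard count follows from its length.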

The result of the search is a JSON shard file per host. A merge step on sbl1 reconciles them into a single sequence_nn_search_merged.json. No cluster manager, no scheduler, no helm chart — just rsync, tmux, and a config array. The simplest thing that works.
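With no scheduler in the loop, the merge step is the natural place to notice a host that never finished. The merge logic itself is not shown here; the sketch below covers only a pre-merge completeness check. missing_shards is a hypothetical helper, and the filename pattern is taken from the launcher script.

```shell
# Sketch: list the shard indices whose JSON output is missing or empty.
# Prints one index per line; prints nothing when every shard is present.
missing_shards() {
  local shard_dir="$1" num_shards="$2" i
  for (( i = 0; i < num_shards; i++ )); do
    if [[ ! -s "${shard_dir}/sequence_nn_search.shard${i}.json" ]]; then
      printf '%d\n' "$i"
    fi
  done
}
```

A merge script could call `missing_shards "$shard_dir" "$NUM_SHARDS"` first and refuse to write sequence_nn_search_merged.json while the output is non-empty.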

launch_cluster_search.sh L1–L30 bash · 30 lines
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$ROOT/cluster_fleet_config.sh"

if [[ "${#RNA_CLUSTER_HOSTS_ARR[@]}" -eq 0 ]]; then
  echo "[cluster] no hosts configured" >&2
  exit 1
fi

SYNC_EXCLUDES=(
  --exclude '__pycache__'
  --exclude 'outputs'
  --exclude '.pytest_cache'
  --exclude '.mypy_cache'
)

NUM_SHARDS="${#RNA_CLUSTER_HOSTS_ARR[@]}"

for shard_index in "${!RNA_CLUSTER_HOSTS_ARR[@]}"; do
  host="${RNA_CLUSTER_HOSTS_ARR[$shard_index]}"
  session_name="$(tmux_session_name "$shard_index")"
  shard_output="$REMOTE_SHARD_DIR/sequence_nn_search.shard${shard_index}.json"

  echo "[cluster] syncing to $host"
  ssh -o BatchMode=yes -o ConnectTimeout=8 "$host" "mkdir -p $REMOTE_PROJECT_DIR"
  rsync -az "${SYNC_EXCLUDES[@]}" "$ROOT/" "$host:$REMOTE_PROJECT_DIR/"

  search_args_quoted=""

§ 05 · operator dashboard

One pane of glass over every training run.

The dashboard/ directory is a Flask web UI with browser tests. It probes the remote GPU over SSH for live training status, renders training logs in real time, tracks submissions, and exposes auth-gated buttons for the common operations: sync, pull, submit, stop. It's not a demo surface; it's the operator console.
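The SSH probe can be pictured as one nvidia-smi query plus a small parser. This is a shell sketch of the kind of command such a probe might wrap, not the dashboard's actual code: probe_gpu and parse_gpu_line are hypothetical names, and the choice of fields is an assumption.

```shell
# Sketch: turn one line of nvidia-smi CSV ("util, mem_used, mem_total")
# into key=value pairs.
parse_gpu_line() {
  local util used total
  IFS=', ' read -r util used total <<<"$1"
  printf 'util_pct=%s mem_used_mib=%s mem_total_mib=%s\n' "$util" "$used" "$total"
}

# Query the first GPU on a remote host; fails fast if SSH cannot connect.
probe_gpu() {
  local host="$1" out
  out="$(ssh -o BatchMode=yes -o ConnectTimeout=8 "$host" \
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits)" || return 1
  # Keep only the first line in case the host has several GPUs.
  parse_gpu_line "${out%%$'\n'*}"
}
```

BatchMode=yes matters for a dashboard: a probe that stops to ask for a password would hang the request instead of reporting the host as unreachable.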

It is deliberately not public. The competitions site you're reading now is the public face; the dashboard runs locally or behind a Cloudflare Tunnel for SSH-keyed users. Mixing those audiences would be a category error.


§ 06 · reusable competition scaffold

A new competition is one command, then prose.

scripts/new-competition.sh <slug> reads the Kaggle metadata for the slug, creates competitions/<slug>/, writes a starter challenge.json, drops a templated AGENTS.md, and registers the project with the dashboard. The first commit is the smallest workable baseline; the dashboard registers the new training log within minutes.

The challenge.json schema is intentionally small — just enough metadata for the dashboard to know which remote host to talk to, where logs live, and which scripts wire the sync/pull/submit verbs. Everything else lives in the source.
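For a sense of scale, a starter file along those lines might look like the output of the sketch below. Every key and path in it is illustrative: the real schema is not shown in this section, and write_starter_challenge is a hypothetical helper, not part of new-competition.sh.

```shell
# Sketch: write a starter challenge.json unless one already exists.
# All keys and paths are illustrative assumptions, not the real schema.
# Assumes slug and name contain no characters that need JSON escaping.
write_starter_challenge() {
  local path="$1" slug="$2" name="$3"
  if [[ -f "$path" ]]; then
    return 0  # never clobber a file someone has already edited
  fi
  mkdir -p "$(dirname "$path")"
  cat >"$path" <<JSON
{
  "slug": "${slug}",
  "name": "${name}",
  "remote_host": "vast",
  "log_path": "competitions/${slug}/outputs/train.log",
  "scripts": {
    "sync": "scripts/sync.sh",
    "pull": "scripts/pull.sh",
    "submit": "scripts/submit.sh"
  }
}
JSON
}
```

The existence check keeps the scaffold safe to re-run: a second invocation leaves a hand-edited file alone.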

new-competition.sh L1–L30 bash · 30 lines
#!/usr/bin/env bash
set -euo pipefail

source "$(dirname "$0")/_env.sh"

competition="${1:-${KAGGLE_COMPETITION:-}}"
name="${2:-}"

if [[ -z "${competition}" ]]; then
  echo "[new-competition] usage: $0 <competition-slug> [display-name]" >&2
  exit 1
fi

"${SCRIPT_DIR}/inspect-competition.sh" "${competition}"

competition_dir="${REPO_ROOT}/competitions/${competition}"
challenge_json="${competition_dir}/challenge.json"
agents_file="${competition_dir}/AGENTS.md"

if [[ -z "${name}" ]]; then
  name="$(python3 - <<'PY' "${competition}"
import sys
slug = sys.argv[1]
print(" ".join(part.capitalize() for part in slug.split("-")))
PY
)"
fi

if [[ ! -f "${challenge_json}" ]]; then
  python3 - <<'PY' "${challenge_json}" "${competition}" "${name}"