A pipeline that scales across four kinds of compute.
Everything here is reproducible. The same scripts run on the dev laptop, on a rented H100, on a transient GCP Batch worker, and across the sbl tailnet. The competitions are the proof; the workflow is the moat.
Run it on the laptop before you rent a GPU.
The repo's discipline is local-first: every competition has a
workable, end-to-end baseline that runs on sbl1 — an
RTX 4070 SUPER and an i9-14900KF. The baseline doesn't have to
compete; it has to compile, train, predict, and submit without
touching a remote machine. That's the contract.
The reason isn't cost — it's tightness of the feedback loop. A wrong assumption discovered in five minutes locally is an order of magnitude cheaper than the same assumption discovered after two hours of remote training plus the round trip to pull logs back. Cloud compute follows local correctness, never the other way around.
Orchestration is a Makefile. make competition-inspect
stages a new challenge. make bootstrap wakes a
remote. make train, make pull,
make submit all take a single competition=
argument and route correctly. The shell scripts under it are
composable building blocks.
.PHONY: kaggle-install kaggle-auth gcp-install gcp-setup gcp-submit gcp-status gcp-pull gcp-stop competition-inspect competition-new watch-local bootstrap sync train train-1 train-1-auto monitor pull predict predict-smoke submit deploy stop e2e-smoke e2e-full all

kaggle-install:
	./scripts/install-kaggle-cli.sh

kaggle-auth:
	./scripts/setup-kaggle-auth.sh local

gcp-install:
	./scripts/install-gcloud-cli.sh

gcp-setup:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-setup.sh

gcp-submit:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-submit.sh

gcp-status:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-status.sh

gcp-pull:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-pull.sh

gcp-stop:
	KAGGLE_COMPETITION="$(competition)" ./scripts/gcp-batch-stop.sh

competition-inspect:
	./scripts/inspect-competition.sh "$(competition)"

competition-new:
	./scripts/new-competition.sh "$(competition)" "$(name)"

Rent only when the experiment design is strong.
Vast.ai is the workhorse for short, high-conviction runs: an EVA-02 448 sweep, an overnight ArcFace seed pass. The bootstrap script handles the whole setup in one command: install the toolchain, push Kaggle auth, sync the repo, create a venv, download the dataset, set up tmux, and start the trainer. The dev box never has to copy data manually.
The retrospective is blunt about Vast economics: long speculative remote sweeps had poor ROI. Vast belongs to hypotheses that have already passed an offline validation gate. When that gate is honest, an H100 hour is the cheapest gain on the board. When it isn't, the same hour is the most expensive way to learn nothing.
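One way to make that gate explicit is a small check that refuses to launch a rented run unless the offline score clears the baseline by a margin. The sketch below is an assumption, not the repo's actual gate: `passes_gate`, the scores, and the `min_delta` threshold are all hypothetical, and it assumes a metric where higher is better.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical gate: succeed only if the candidate's offline validation
# score beats the baseline by at least min_delta (higher is better).
passes_gate() {
  local baseline="$1" candidate="$2" min_delta="$3"
  # awk does the floating-point comparison; bash arithmetic is integer-only.
  awk -v b="$baseline" -v c="$candidate" -v d="$min_delta" \
    'BEGIN { exit !((c - b) >= d) }'
}

# Example: only bootstrap the remote when the gate passes.
if passes_gate 0.812 0.825 0.010; then
  echo "gate passed: worth renting a GPU"
else
  echo "gate failed: keep iterating locally"
fi
```

Wiring a check like this in front of the bootstrap step would turn the offline gate from a habit into a precondition.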
#!/usr/bin/env bash
set -euo pipefail

REMOTE_HOST="${1:-${KAGGLE_REMOTE_HOST:-vast}}"
REMOTE_REPO_ROOT="${KAGGLE_REMOTE_ROOT:-/workspace/kaggle}"
REMOTE_COMP_DIR="${REMOTE_REPO_ROOT}/competitions/jaguar-re-id"
LOCAL_REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"

echo "[bootstrap] remote host: ${REMOTE_HOST}"
echo "[bootstrap] remote repo root: ${REMOTE_REPO_ROOT}"

"${LOCAL_REPO_ROOT}/scripts/install-codex-remote.sh" "${REMOTE_HOST}" "${REMOTE_REPO_ROOT}"
"${LOCAL_REPO_ROOT}/scripts/push-kaggle-auth-remote.sh" "${REMOTE_HOST}"
ssh "${REMOTE_HOST}" "mkdir -p '${REMOTE_REPO_ROOT}'"

echo "[bootstrap] syncing repo to remote"
rsync -avz \
  --exclude .git \
  --exclude __pycache__ \
  --exclude .venv \
  --exclude .venv-dashboard \
  --exclude 'competitions/*/data' \
  --exclude 'competitions/*/outputs' \
  --exclude dashboard/outputs \
  --exclude dashboard/node_modules \
  --exclude dashboard/playwright-report \
  --exclude dashboard/test-results \
  "${LOCAL_REPO_ROOT}/" "${REMOTE_HOST}:${REMOTE_REPO_ROOT}/"

For embarrassingly parallel CPU work, never pay for an idle VM.
The Stanford RNA search, sharded across 5,716 train targets, is pure CPU work: classical sequence alignment, no gradient updates. The right tool isn't a beefy GPU; it's a batch of spot instances that live for the duration of the search and then disappear.
GCP Batch handles it. gcp-batch-submit.sh packages the
same shell script the cluster uses, hands it to Batch with a task
count and a parallelism, and the artifacts come back through GCS.
No idle VM. No keep-alive container. The bill is a function of
actual CPU-seconds.
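How a task turns its index into a slice of the targets is not shown in this excerpt. A minimal sketch, assuming contiguous, near-equal slices: `shard_bounds` is a hypothetical helper and not the logic inside run_sequence_nn_shard.sh, while BATCH_TASK_INDEX and BATCH_TASK_COUNT are the variables GCP Batch sets for each task.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print "start end" (end exclusive) for one shard,
# spreading any remainder over the lowest-numbered shards.
shard_bounds() {
  local index="$1" count="$2" total="$3"
  local base=$(( total / count ))
  local rem=$(( total % count ))
  local extra=$(( index < rem ? index : rem ))
  local start=$(( index * base + extra ))
  local size=$(( base + (index < rem ? 1 : 0) ))
  echo "$start $(( start + size ))"
}

# Same fallbacks as the worker script: a lone local run is shard 0 of 1.
read -r start end < <(shard_bounds "${BATCH_TASK_INDEX:-0}" "${BATCH_TASK_COUNT:-1}" 5716)
echo "this task handles targets [${start}, ${end})"
```

With 8 tasks, the first four shards get 715 targets and the remaining four get 714, so the eight ranges cover all 5,716 without overlap.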
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
cd "$ROOT"

workers="${RNA_GCP_WORKERS:-8}"

search_args=()
while IFS= read -r arg; do
  search_args+=("$arg")
done < <(bash -lc 'source competitions/stanford-rna-3d-folding-2/cluster_fleet_config.sh && search_args')

PYTHON_BIN="${PYTHON_BIN:-python3}"
if [[ -f .venv/bin/activate ]]; then
  # shellcheck disable=SC1091
  source .venv/bin/activate
  PYTHON_BIN="python"
fi

PYTHON_BIN="$PYTHON_BIN" bash \
  competitions/stanford-rna-3d-folding-2/run_sequence_nn_shard.sh \
  "${BATCH_TASK_INDEX:-0}" \
  "${BATCH_TASK_COUNT:-1}" \
  "$workers" \
  "${search_args[@]}"

The sbl tailnet absorbs overnight work for free.
sbl2, sbl3, and sbl4 sit on
the same tailnet as the dev box. When a long classical search
kicks off, launch_cluster_search.sh rsyncs the project
to each host, opens a tmux session, and runs one shard per box.
The fleet config maps shard index to hostname.
The result of the search is a JSON shard file per host. A merge
step on sbl1 reconciles them into a single
sequence_nn_search_merged.json. No cluster manager,
no scheduler, no helm chart — just rsync, tmux, and a config
array. The simplest thing that works.
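The config array mentioned above could look something like the following. This is a sketch under assumptions: the names RNA_CLUSTER_HOSTS_ARR, REMOTE_PROJECT_DIR, REMOTE_SHARD_DIR, and tmux_session_name appear in the launcher, but every value and the session-name format here are illustrative guesses, not the contents of cluster_fleet_config.sh.

```bash
#!/usr/bin/env bash
# Sketch of a fleet config; all values are illustrative assumptions.

# Array position doubles as the shard index: shard 0 -> sbl2, and so on.
RNA_CLUSTER_HOSTS_ARR=(sbl2 sbl3 sbl4)

# Assumed remote layout, relative to the remote user's home directory.
REMOTE_PROJECT_DIR="kaggle/stanford-rna-3d-folding-2"
REMOTE_SHARD_DIR="${REMOTE_PROJECT_DIR}/outputs/shards"

# One predictable tmux session name per shard (format is hypothetical).
tmux_session_name() {
  echo "rna-search-shard${1}"
}

# Print the shard-to-host mapping the launcher would iterate over.
for shard_index in "${!RNA_CLUSTER_HOSTS_ARR[@]}"; do
  echo "shard ${shard_index} -> ${RNA_CLUSTER_HOSTS_ARR[$shard_index]} ($(tmux_session_name "$shard_index"))"
done
```

Because the shard count is just the array length, adding a host to the array is enough to add a shard.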
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$ROOT/cluster_fleet_config.sh"

if [[ "${#RNA_CLUSTER_HOSTS_ARR[@]}" -eq 0 ]]; then
  echo "[cluster] no hosts configured" >&2
  exit 1
fi

SYNC_EXCLUDES=(
  --exclude '__pycache__'
  --exclude 'outputs'
  --exclude '.pytest_cache'
  --exclude '.mypy_cache'
)

NUM_SHARDS="${#RNA_CLUSTER_HOSTS_ARR[@]}"

for shard_index in "${!RNA_CLUSTER_HOSTS_ARR[@]}"; do
  host="${RNA_CLUSTER_HOSTS_ARR[$shard_index]}"
  session_name="$(tmux_session_name "$shard_index")"
  shard_output="$REMOTE_SHARD_DIR/sequence_nn_search.shard${shard_index}.json"

  echo "[cluster] syncing to $host"
  ssh -o BatchMode=yes -o ConnectTimeout=8 "$host" "mkdir -p $REMOTE_PROJECT_DIR"
  rsync -az "${SYNC_EXCLUDES[@]}" "$ROOT/" "$host:$REMOTE_PROJECT_DIR/"

  search_args_quoted=""

One pane of glass over every training run.
The dashboard/ directory is a browser-tested Flask
web UI. It probes the remote GPU over SSH for live training status,
renders training logs in real time, tracks submissions, and exposes
auth-gated buttons for the common operations: sync, pull, submit,
stop. It's not a demo surface; it's the operator console.
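The SSH probe can be approximated from a shell. This is only a sketch, not the dashboard's implementation (which lives in the Flask app): `probe_gpu` and `parse_gpu_line` are hypothetical names, and it assumes `nvidia-smi` is on the remote PATH. The `--query-gpu` and `--format=csv,noheader,nounits` options are standard nvidia-smi flags.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical probe: ask the remote for utilization and memory of GPU 0.
# Not called here, since it needs a reachable host.
probe_gpu() {
  local host="$1"
  ssh -o BatchMode=yes -o ConnectTimeout=8 "$host" \
    "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits" \
    | head -n 1
}

# Turn one CSV line such as "87, 40213, 81559" into a readable status string.
parse_gpu_line() {
  local util mem_used mem_total
  IFS=', ' read -r util mem_used mem_total <<< "$1"
  echo "util=${util}% mem=${mem_used}/${mem_total}MiB"
}

# Example with a canned line instead of a live SSH call.
parse_gpu_line "87, 40213, 81559"
```

Against a live host the two compose as `parse_gpu_line "$(probe_gpu vast)"`, where `vast` is the SSH alias the bootstrap script defaults to.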
It is deliberately not public. The competitions site you're reading now is the public face; the dashboard runs locally or behind a Cloudflare Tunnel for SSH-keyed users. Mixing those audiences would be a category error.
A new competition is one command, then prose.
scripts/new-competition.sh <slug> reads the
Kaggle metadata for the slug, creates competitions/<slug>/,
writes a starter challenge.json, drops a templated
AGENTS.md, and registers the project with the
dashboard. The first commit is the smallest workable baseline; the
dashboard registers the new training log within minutes.
The challenge.json schema is intentionally small —
just enough metadata for the dashboard to know which remote host
to talk to, where logs live, and which scripts wire the
sync/pull/submit verbs. Everything else lives in the source.
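To make the shape concrete, here is a sketch of a writer for a starter file. Every field name, path, and default in it is an assumption made for illustration; the actual schema is defined by new-competition.sh and the dashboard, and `write_starter_challenge` is a hypothetical helper.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical starter writer; every field name below is an assumption.
write_starter_challenge() {
  local path="$1" slug="$2" name="$3"
  # Never clobber a file someone has already edited.
  if [[ -f "$path" ]]; then
    return 0
  fi
  mkdir -p "$(dirname "$path")"
  # Plain heredoc: assumes slug and name contain no JSON-special characters.
  cat > "$path" <<EOF
{
  "slug": "${slug}",
  "name": "${name}",
  "remote_host": "vast",
  "log_path": "competitions/${slug}/outputs/train.log",
  "scripts": {
    "sync": "scripts/sync.sh",
    "pull": "scripts/pull.sh",
    "submit": "scripts/submit.sh"
  }
}
EOF
}

# Demo in a throwaway directory so nothing in a real repo is touched.
demo_dir="$(mktemp -d)"
write_starter_challenge "${demo_dir}/competitions/jaguar-re-id/challenge.json" jaguar-re-id "Jaguar Re Id"
cat "${demo_dir}/competitions/jaguar-re-id/challenge.json"
```

The existence check mirrors the guard in the script below, so rerunning the command on an existing competition leaves its metadata alone.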
#!/usr/bin/env bash
set -euo pipefail

source "$(dirname "$0")/_env.sh"

competition="${1:-${KAGGLE_COMPETITION:-}}"
name="${2:-}"

if [[ -z "${competition}" ]]; then
  echo "[new-competition] usage: $0 <competition-slug> [display-name]" >&2
  exit 1
fi

"${SCRIPT_DIR}/inspect-competition.sh" "${competition}"

competition_dir="${REPO_ROOT}/competitions/${competition}"
challenge_json="${competition_dir}/challenge.json"
agents_file="${competition_dir}/AGENTS.md"

if [[ -z "${name}" ]]; then
  name="$(python3 - <<'PY' "${competition}"
import sys
slug = sys.argv[1]
print(" ".join(part.capitalize() for part in slug.split("-")))
PY
)"
fi

if [[ ! -f "${challenge_json}" ]]; then
  python3 - <<'PY' "${challenge_json}" "${competition}" "${name}"