Scaling Pre-Training to 500M Sequences: Infrastructure Lessons

When we scaled our pre-training corpus from 200M to 500M plant sequences this autumn, we expected the main challenge to be compute. It was not. The harder problems were data pipeline throughput, storage architecture, and a set of failure modes that only appear at scale and that are nearly impossible to anticipate from smaller runs. This post is an honest account of what changed and what broke.

Why 500M and Why Now

Our 200M-sequence model — the version we described in our February post — was trained on publicly available plant genome assemblies from a curated set of roughly 80 species. The representation was reasonable for the major cereals and some vegetable crops, but thin for legumes and almost nonexistent for several important African and South Asian dryland crops. Two things happened in parallel that made expansion feasible: a large batch of previously restricted sequence data from a multi-institution consortium became accessible under a data sharing framework we were part of, and the cost of the compute required to train at this scale dropped to a point where it was no longer the binding constraint.

The new corpus includes sequences from over 140 species, with substantially better representation of tetraploid and hexaploid genomes. Polyploid crops are where a lot of the breeding action is — wheat, oilseed rape, cotton, and several others — and our 200M model handled them poorly because the training data skewed toward simpler diploid genomes where assembly quality is higher.

Data Pipeline: Where the First Bottleneck Appeared

Our original pipeline was designed to tokenise and shard sequences on the fly during training. At 200M sequences this was fine: with a reasonable number of preprocessing workers, tokenisation throughput stayed ahead of the rate at which the GPUs consumed batches. At 500M, we hit a point where the preprocessing workers could not keep up, and the GPUs were spending a meaningful fraction of their time idle waiting for the next batch. The symptom was a characteristic sawtooth pattern in our GPU utilisation graphs: high utilisation for a few minutes, then a brief dip while the GPUs waited for data, then back up.
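One way to put a number on this kind of input starvation is to time how long the training loop blocks on the data iterator compared with the total wall-clock time. The sketch below is illustrative only: `measure_data_wait` and its parameters are hypothetical names, not part of our harness, and it assumes a simple synchronous loop in which fetching a batch and running a step happen one after the other.

```python
import time


def measure_data_wait(batches, train_step, clock=time.perf_counter):
    """Return the fraction of wall-clock time spent waiting for the next batch.

    batches    : any iterable yielding batches (hypothetical data loader)
    train_step : callable that consumes one batch (hypothetical training step)
    clock      : time source, injectable so the function can be tested
    """
    waited = 0.0
    start = clock()
    iterator = iter(batches)
    while True:
        fetch_started = clock()
        try:
            batch = next(iterator)  # blocks if preprocessing is behind
        except StopIteration:
            break
        waited += clock() - fetch_started
        train_step(batch)
    total = clock() - start
    return waited / total if total > 0 else 0.0
```

A value near zero suggests preprocessing is keeping up. A value that climbs as the corpus grows is the numeric counterpart of the dips in the utilisation graph.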

The fix was to pre-tokenise the entire corpus and write it to disk before starting the training run. This sounds obvious in retrospect. The cost is storage: pre-tokenised sequences in our format run about 40% larger than the raw FASTA input because we store token IDs alongside sequence metadata needed for our attention masking scheme. At 500M sequences, that overhead is material. We ended up restructuring our storage to separate hot pre-tokenised data on fast NVMe from cold raw FASTA archives on cheaper object storage, with a background sync process that pre-tokenises new data as it arrives.
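To make the idea of pre-tokenising to disk concrete, here is a minimal sketch of a shard writer and reader. It is not our production format: the five-symbol `VOCAB`, the `.bin` plus `.idx.json` file layout, and the function names are all assumptions made for illustration, and the per-sequence metadata used by our attention masking scheme is left out entirely.

```python
import json
from pathlib import Path

# Hypothetical nucleotide vocabulary; the real tokeniser is not shown in this post.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenise(seq):
    """Map each base to a one-byte token ID; unknown symbols fall back to N."""
    return bytes(VOCAB.get(base, VOCAB["N"]) for base in seq.upper())


def write_shard(prefix, seqs):
    """Write all token IDs to <prefix>.bin and the sequence boundaries to <prefix>.idx.json.

    Sequence i occupies bytes offsets[i]:offsets[i + 1] of the .bin file.
    """
    offsets = [0]
    with open(f"{prefix}.bin", "wb") as out:
        for seq in seqs:
            tokens = tokenise(seq)
            out.write(tokens)
            offsets.append(offsets[-1] + len(tokens))
    Path(f"{prefix}.idx.json").write_text(json.dumps(offsets))
    return offsets


def read_sequence(prefix, i):
    """Read back the token IDs of sequence i without loading the whole shard."""
    offsets = json.loads(Path(f"{prefix}.idx.json").read_text())
    with open(f"{prefix}.bin", "rb") as shard:
        shard.seek(offsets[i])
        return shard.read(offsets[i + 1] - offsets[i])
```

The point of the offset index is that training workers can seek directly to a sequence and skip tokenisation altogether, which is what moves the cost from training time to a one-off preprocessing pass.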

Distributed Training: The Problems That Only Appear at This Scale

We run training across multiple nodes. At 200M sequences, a training run completes in a time window where transient hardware failures are rare. At 500M, the runs are long enough that hardware failures become statistically normal events. In our most recent pre-training run, we had two node failures and one storage mount timeout during a six-week job. None of these are catastrophic if your checkpointing strategy is correct — and ours was not, for the first run.

The issue was checkpoint frequency. We were checkpointing every 24 hours, which seemed conservative. A node failure six hours into a checkpoint interval means losing six hours of training progress and restarting from the last checkpoint, and a failure just before the next checkpoint would lose almost the full 24. On a multi-node job, six hours of progress is not trivial to recompute, and the cost of the lost compute is real. We moved to checkpointing every four hours, which adds modest storage overhead but caps the loss from any single failure at four hours of training, a level we can absorb.
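The trade-off can be worked through with some back-of-the-envelope arithmetic. The sketch below assumes that a failure is equally likely at any moment within a checkpoint interval, and it ignores restart overhead and the time taken to write a checkpoint. The function names are ours for illustration only.

```python
def expected_lost_hours(interval_hours, n_failures):
    """Expected recompute time if each failure lands uniformly within an interval."""
    return n_failures * interval_hours / 2


def worst_case_lost_hours(interval_hours, n_failures):
    """Upper bound: every failure lands just before the next checkpoint."""
    return n_failures * interval_hours


def checkpoints_written(run_hours, interval_hours):
    """Number of checkpoints written over an uninterrupted run."""
    return int(run_hours // interval_hours)
```

For a six-week job (1008 hours) with three interruptions, a 24-hour interval gives an expected loss of 36 hours and a worst case of 72, against 42 checkpoints written. A four-hour interval gives an expected loss of 6 hours and a worst case of 12, at the price of 252 checkpoints. Whether the storage mount timeout we saw would force a restart in the same way as a node failure is an assumption in that count of three.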

The second distributed-specific failure mode: gradient synchronisation timeouts. At scale, slow nodes — nodes that are performing correctly but running a few percent slower than the median due to hardware variance — can cause all other nodes to wait at synchronisation barriers. We added timeout and node-exclusion logic to our training harness so that a single slow node can be dropped from the run and replaced rather than stalling the entire job.

Data Quality Changes at 500M

Scaling the corpus required ingesting data from sources with lower average assembly quality than our original curated set. The filtering pipeline we described in a previous post was designed around a quality threshold calibrated for high-coverage Illumina assemblies. Third-generation long-read assemblies from some of the newly included species have a different error profile — lower substitution error rates but more structural variants and occasional chimeric contigs that pass our original filters.

We added a contig-level length and GC-content consistency check, and a pairwise overlap filter to catch obvious chimeras. These are not expensive checks but they required re-processing a significant fraction of the new data. The lesson: when the data source distribution changes, the filtering calibration needs to be revisited — it does not transfer automatically.
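As a rough illustration of what a length and GC-consistency check might look like, the sketch below rejects short contigs and contigs whose GC fraction jumps sharply between adjacent windows, which is one crude signal of a chimeric join. The window size and thresholds are placeholder values rather than our calibrated settings, and the pairwise overlap filter is not shown.

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases; returns 0.0 if there are none."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def max_gc_shift(contig, window=1000):
    """Largest GC-fraction jump between adjacent non-overlapping full windows."""
    fractions = [
        gc_fraction(contig[i:i + window])
        for i in range(0, len(contig) - window + 1, window)
    ]
    return max((abs(b - a) for a, b in zip(fractions, fractions[1:])), default=0.0)


def passes_contig_checks(contig, min_length=2000, window=1000, max_shift=0.15):
    """Hypothetical thresholds: reject short contigs and abrupt GC changes."""
    if len(contig) < min_length:
        return False
    return max_gc_shift(contig, window) <= max_shift
```

Any real threshold would need calibrating per data source, because genuine GC variation differs between species and between genomic regions.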

Evaluation: Checking That Scale Helped

Scale is not automatically beneficial. Adding poorly curated or heavily redundant data can degrade downstream fine-tuning performance even if pre-training loss continues to decrease, because the model learns to represent noise or over-represents common sequences at the expense of rare but informative ones. We run a standard evaluation suite after each pre-training run against a held-out set of phenotype prediction tasks to check that the new model is actually better, not just larger.

On the 500M model, performance improved on legume and dryland crop tasks — which is what we expected given the improved species coverage — and held flat on the major cereals. We did see a slight regression on one tomato disease resistance task that we are still investigating. Our hypothesis is that the increased polyploid content in the training corpus shifts the embedding geometry in ways that affect the tomato fine-tuning, but we have not confirmed this. We are running ablations on the polyploid fraction of the training data to isolate the effect.
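The improved, flat, or regressed reading above amounts to a per-task comparison between two runs with a noise tolerance. A minimal sketch follows, assuming each task yields a single score where higher is better. `compare_runs` and the default tolerance are illustrative, not our evaluation suite.

```python
def compare_runs(baseline, candidate, tolerance=0.01):
    """Classify each task present in both runs as improved, flat, or regressed.

    baseline, candidate : dicts mapping task name -> score (higher is better)
    tolerance           : absolute change treated as noise (assumed value)
    """
    report = {}
    for task in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[task] - baseline[task]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "flat"
        report[task] = (delta, status)
    return report
```

In practice the tolerance should come from the run-to-run variance of each task, for example across fine-tuning seeds, rather than from one fixed constant.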

What We Would Do Differently

If we were starting this scaling run again, we would build the pre-tokenisation pipeline and the fast-checkpoint infrastructure before beginning, not in response to problems discovered mid-run. Both were predictable needs in hindsight. We would also spend more time upfront characterising the quality distribution of the new data sources before beginning corpus integration: a sampling pass before full ingestion should have revealed the need for the chimera filter.

The infrastructure lessons from this run are being codified into our standard pre-training setup. We are not planning a further corpus expansion in the near term — the constraint on model improvement has shifted from data volume back to fine-tuning methodology and evaluation quality. But when we do scale again, the infrastructure should handle it without the mid-run firefighting that characterised this run.