How we train genomic foundation models on plant sequences.
A methodological overview of our data curation pipeline, tokenization strategy, pre-training objectives, and how we validate predictions against independent phenotype sets.
Four steps from raw sequence to validated model
Data collection and curation
We aggregate plant genomic sequences from Ensembl Plants, NCBI SRA, GRIN, Sol Genomics Network, and LegumeInfo. Each sequence passes a 5-stage quality filter: completeness check, contamination screen, assembly-level scoring, duplicate detection (MinHash), and cultivar-label verification against reference accession records. Sequences failing any stage are discarded. This pipeline runs before every training iteration — the corpus is a living artefact, not a one-time snapshot.
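The duplicate-detection stage can be illustrated with a minimal MinHash sketch. The shingle length (`k=16`), the number of hash functions (64), the 0.9 similarity threshold, and all function names below are assumptions chosen for illustration; the text above does not specify the parameters the pipeline uses.

```python
import hashlib


def kmer_set(seq: str, k: int = 16) -> set[str]:
    """Return the set of overlapping k-mers (shingles) in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq: str, num_hashes: int = 64, k: int = 16) -> list[int]:
    """For each salted hash function, keep the minimum hash over all k-mers."""
    kmers = kmer_set(seq, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for km in kmers
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of k-mer set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(seq_a: str, seq_b: str, threshold: float = 0.9) -> bool:
    """Flag a pair as duplicates when the estimated similarity reaches the threshold (assumed value)."""
    return estimated_jaccard(minhash_signature(seq_a), minhash_signature(seq_b)) >= threshold
```

At corpus scale, signatures would normally be computed once per sequence and compared through an index rather than pair by pair; the pairwise helper here only shows the comparison itself.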
Tokenization
We use a hybrid k-mer vocabulary (k=6 base + k=8 extended) built from a representative subsample of the training corpus, stratified by species and repeat density. BPE-derived vocabularies designed for protein sequences or natural-language text over-merge repetitive regions, a serious problem in plant genomics, where transposable elements and tandem repeats constitute up to 80% of some genomes. Our tokenizer preserves these regions as distinct tokens rather than compressing them into a single UNK-equivalent.
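One way a hybrid k=6/k=8 vocabulary could be applied is sketched below. The matching rule is an assumption, since the text does not say how the base and extended vocabularies are combined: here the tokenizer walks the sequence without overlap, emits an 8-mer when it appears in the extended vocabulary, and otherwise falls back to a 6-mer. The frequency-based `build_extended_vocab` helper and its `top_n` cutoff are likewise hypothetical.

```python
from collections import Counter


def build_extended_vocab(corpus: list[str], ext_k: int = 8, top_n: int = 1000) -> set[str]:
    """Collect the most frequent 8-mers in a subsample (assumed selection rule)."""
    counts: Counter[str] = Counter()
    for seq in corpus:
        seq = seq.upper()
        counts.update(seq[i:i + ext_k] for i in range(len(seq) - ext_k + 1))
    return {kmer for kmer, _ in counts.most_common(top_n)}


def tokenize(seq: str, extended_vocab: set[str], base_k: int = 6, ext_k: int = 8) -> list[str]:
    """Non-overlapping tokenization: prefer a known 8-mer, otherwise emit a 6-mer."""
    seq = seq.upper()
    tokens = []
    i = 0
    while i < len(seq):
        candidate = seq[i:i + ext_k]
        if len(candidate) == ext_k and candidate in extended_vocab:
            tokens.append(candidate)
            i += ext_k
        else:
            # The final token may be shorter than base_k if the sequence runs out.
            tokens.append(seq[i:i + base_k])
            i += base_k
    return tokens
```

For example, with `extended_vocab = {"ACACACAC"}`, the sequence `ACACACACGGTTAA` is split into `ACACACAC` and `GGTTAA`. Every token has a fixed maximum length, so a long repeat is represented as a run of tokens rather than being merged into one.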
Pre-training
Two pre-training objectives run jointly. Masked sequence modeling (15% of tokens masked, using both uniform and span-based masking) teaches local variant context. Next-region prediction (predicting the token distribution of a downstream genomic window given an upstream window) encodes long-range genomic structure across introns, promoter regions, and regulatory elements. Training batches are assembled so that each gradient step includes sequences from at least 8 species, which prevents the model from over-fitting to the most data-rich crops.
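The species-diversity constraint on batches can be sketched as a two-step sampler. This is an illustrative assumption about how such a constraint might be enforced, not a description of the production data loader; `assemble_batch` and its parameters are hypothetical names, and only the minimum of 8 species comes from the text above.

```python
import random
from collections import defaultdict


def assemble_batch(records, batch_size, min_species=8, rng=None):
    """Build one batch of (species, sequence) pairs covering at least `min_species` species.

    `records` is a list of (species, sequence) tuples.
    """
    rng = rng or random.Random()
    by_species = defaultdict(list)
    for species, seq in records:
        by_species[species].append(seq)
    if len(by_species) < min_species:
        raise ValueError("corpus contains fewer species than min_species")
    if batch_size < min_species:
        raise ValueError("batch_size must be at least min_species")

    # Step 1: guarantee diversity by drawing one sequence from each of
    # `min_species` distinct, randomly chosen species.
    chosen = rng.sample(sorted(by_species), min_species)
    batch = [(sp, rng.choice(by_species[sp])) for sp in chosen]

    # Step 2: fill the remaining slots by sampling uniformly (with replacement)
    # from the whole corpus, so data-rich species still dominate the remainder.
    batch += [rng.choice(records) for _ in range(batch_size - min_species)]
    rng.shuffle(batch)
    return batch
```

Step 1 alone satisfies the constraint, so the guarantee holds however skewed the corpus is; step 2 could be swapped for any other sampling scheme without affecting it.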
Validation against independent phenotype records
Model quality is not measured by training loss. We hold out a 10% split of the phenotype records from GRIN and SeedNet, then compute Pearson r² between our embedding-derived predictions and measured trait values. Current reported figures are internal benchmarks only — not peer-reviewed. We publish these results openly because we believe transparency before peer review builds more trust than releasing only after journal acceptance.
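The validation metric itself is simple to reproduce. The sketch below shows a seeded 10% hold-out split and a squared Pearson correlation in plain Python; the function names and the seeding scheme are assumptions for illustration, and how predictions are derived from embeddings is not shown.

```python
import math
import random


def holdout_split(records, fraction=0.10, seed=0):
    """Shuffle with a fixed seed and set aside `fraction` of the records for evaluation."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(records) * fraction))
    test = [records[i] for i in indices[:n_test]]
    train = [records[i] for i in indices[n_test:]]
    return train, test


def pearson_r2(predicted, measured):
    """Squared Pearson correlation between predicted and measured trait values."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    r = cov / math.sqrt(var_p * var_m)
    return r * r
```

Note that squared Pearson correlation is unchanged by any linear rescaling or offset of the predictions, and a perfectly inverted prediction also scores 1.0. It measures how well predictions rank and track the measured values, not whether they are calibrated.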
Performance versus baseline methods
Internal benchmark only — not peer-reviewed. Results on held-out GRIN + SeedNet phenotype records.
| Trait | Our model r² | GWAS-only r² | gBLUP r² | Species tested (accessions) |
|---|---|---|---|---|
| Drought Tolerance Index | 0.847 | 0.61 | 0.73 | Maize (12,000 acc.) |
| Yield Stability Score | 0.791 | 0.54 | 0.68 | Wheat (8,400 acc.) |
| Disease Resistance — Fusarium | 0.812 | 0.58 | 0.71 | Tomato (3,200 acc.) |
| Flowering Time Variance | 0.883 | 0.72 | 0.81 | Soybean (6,100 acc.) |
r² = squared Pearson correlation coefficient. gBLUP = genomic Best Linear Unbiased Prediction (the standard genomic selection baseline). acc. = accessions. Results are means across 5-fold cross-validation. The full methodology and raw results are available on request.
What the model does and doesn't know
We enumerate known model boundaries because a prediction without declared scope is not a prediction — it is guesswork. Three limitations are material enough to communicate before you build a pipeline on our API.
Variants with a population frequency below ~2% are sparsely represented in our training corpus. For breeding programs working with landrace accessions or wild-relative introgressions, prediction confidence intervals will be wider and should be interpreted with caution. We flag this automatically in API responses via the rare_allele_proportion field.
Our current model was pre-trained primarily on inbred lines and reference-quality accessions. Highly heterozygous materials (autotetraploid potato, some sugarcane accessions) have lower prediction r², typically 0.10 to 0.15 below the values reported for diploid crops. We are actively developing a heterozygote-aware encoding layer for a future model version.
Formal validation exists for wheat, maize, tomato, soybean, sorghum, barley, pepper, and rice. Predictions for other crops rely on transfer from related species and should be treated as exploratory until independent validation is completed. The API returns a validation_tier field (A/B/C) indicating validation depth for each prediction.
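A downstream pipeline might use the two response fields mentioned above to decide which predictions to act on. The sketch below is hypothetical throughout: only the field names `rare_allele_proportion` and `validation_tier` come from this page, while the response shape (a flat dictionary), the 0.2 cutoff, and the assumption that tier A denotes the deepest validation are placeholders to adapt to the actual API documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Triage:
    usable: bool
    reasons: list[str] = field(default_factory=list)


def triage_prediction(
    response: dict,
    max_rare_allele_proportion: float = 0.2,  # hypothetical cutoff
    accepted_tiers: tuple[str, ...] = ("A",),  # assumes A is the deepest tier
) -> Triage:
    """Decide whether a prediction is within declared scope, and say why not."""
    reasons = []

    rare = response.get("rare_allele_proportion")
    if rare is None:
        reasons.append("rare_allele_proportion missing")
    elif rare > max_rare_allele_proportion:
        reasons.append(f"rare_allele_proportion {rare} exceeds {max_rare_allele_proportion}")

    tier = response.get("validation_tier")
    if tier not in accepted_tiers:
        reasons.append(f"validation_tier {tier!r} not in accepted tiers {accepted_tiers}")

    return Triage(usable=not reasons, reasons=reasons)
```

Returning the reasons alongside the verdict lets a breeding pipeline route out-of-scope predictions to manual review instead of silently dropping them.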
How we share our work before peer review
We publish before peer review
Architecture notes, validation benchmark results, and training data documentation are posted to bioRxiv before journal submission. Genomics researchers evaluating tools for production pipelines should not have to wait 12–18 months for a journal to confirm what the pre-print already shows.
Our evaluation set is public
The held-out phenotype evaluation set used for all validation benchmarks reported on this site is available under CC BY 4.0 on our Research page. Other groups can run any method — GWAS-only, gBLUP, or their own models — against the same test set and report directly comparable r² values.
We co-develop with university labs
University genomics labs contribute phenotype records and sequenced accessions; we provide pre-training compute and model access. Co-authored pre-prints result from successful collaborations. See the Research page for current collaboration proposals.