Methodology

How we train genomic foundation models on plant sequences.

An overview of our data curation pipeline, tokenization strategy, and pre-training objectives, and of how we validate predictions against independent phenotype sets.

Training methodology

Four steps from raw sequence to validated model

Data collection and curation

We aggregate plant genomic sequences from Ensembl Plants, NCBI SRA, GRIN, Sol Genomics Network, and LegumeInfo. Each sequence passes a 5-stage quality filter: completeness check, contamination screen, assembly-level scoring, duplicate detection (MinHash), and cultivar-label verification against reference accession records. Sequences failing any stage are discarded. This pipeline runs before every training iteration — the corpus is a living artefact, not a one-time snapshot.
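As an illustration of the duplicate-detection stage, here is a minimal MinHash sketch. It is not the production filter: the k-mer length (16), the number of hash functions (64), the 0.9 similarity threshold, and every function name are assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k=16):
    """Return the set of distinct k-mers in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=64, k=16):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    shingles = kmers(seq, k)
    if not shingles:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of k-mer set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_duplicate(seq_a, seq_b, threshold=0.9):
    """Flag a pair as duplicates when the estimated similarity reaches the threshold."""
    return estimated_jaccard(minhash_signature(seq_a), minhash_signature(seq_b)) >= threshold
```

Comparing fixed-length signatures instead of full k-mer sets keeps the cost per pair constant, which is the usual reason to choose MinHash for corpus-scale deduplication.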

Tokenization

We use a hybrid k-mer vocabulary (k=6 base, k=8 extended) built from a representative subsample of the training corpus, stratified by species and repeat density. BPE-derived vocabularies designed for protein sequences or natural-language text over-merge repetitive regions. That is a serious problem in plant genomics, where transposable elements and tandem repeats constitute up to 80% of some genomes. Our tokenizer preserves these regions as distinct tokens rather than compressing them into a single UNK-equivalent.
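The text above does not specify how the base and extended vocabularies are combined, so the sketch below makes an assumption: a greedy, non-overlapping scan that emits an 8-mer when it appears in the extended vocabulary and otherwise falls back to a 6-mer. The function names, the frequency-based vocabulary selection, and the `top_n` cutoff are all hypothetical.

```python
from collections import Counter


def build_extended_vocab(sequences, k=8, top_n=1000):
    """Pick the most frequent k-mers in a subsample as the extended vocabulary."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {kmer for kmer, _ in counts.most_common(top_n)}


def tokenize(seq, extended_vocab, base_k=6, ext_k=8):
    """Greedy tokenization: prefer a known 8-mer, otherwise emit a 6-mer.

    The final token may be shorter than base_k when the sequence length
    does not divide evenly; a real tokenizer would need a padding policy.
    """
    tokens = []
    i = 0
    while i < len(seq):
        candidate = seq[i:i + ext_k]
        if len(candidate) == ext_k and candidate in extended_vocab:
            tokens.append(candidate)
            i += ext_k
        else:
            tokens.append(seq[i:i + base_k])
            i += base_k
    return tokens
```

Because every token is a literal substring, a tandem repeat such as `ATATAT...` stays visible as a run of repeated tokens, and concatenating the tokens reproduces the input exactly.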

Pre-training

Two pre-training objectives run jointly. Masked sequence modeling masks 15% of tokens, using both uniform and span-based masking, and teaches local variant context. Next-region prediction, which predicts the token distribution of a downstream genomic window given an upstream window, encodes long-range genomic structure across introns, promoter regions, and regulatory elements. Training batches are assembled so that each gradient step includes sequences from at least 8 species, which keeps the model from over-fitting to the most data-rich crops.
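Two of the mechanics described above, mixed uniform and span masking at a 15% rate and batches that cover at least 8 species, can be sketched as follows. Only the 15% rate and the 8-species minimum come from the text. The span length, the even split between span and uniform masking, the `[MASK]` token name, the batch size, and the function names are assumptions for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name


def mask_tokens(tokens, rate=0.15, span_len=3, span_fraction=0.5, rng=None):
    """Mask about `rate` of the tokens: part as contiguous spans, the rest uniformly."""
    rng = rng or random.Random()
    n = len(tokens)
    if n == 0:
        return [], {}
    budget = min(n, max(1, round(n * rate)))      # total positions to mask
    span_budget = int(budget * span_fraction)     # positions reserved for spans
    masked = set()
    while len(masked) < span_budget:
        start = rng.randrange(n)
        for pos in range(start, min(start + span_len, n)):
            if len(masked) < span_budget:
                masked.add(pos)
    # Fill the rest of the budget with uniformly sampled, not yet masked positions.
    remaining = [i for i in range(n) if i not in masked]
    masked.update(rng.sample(remaining, budget - len(masked)))
    labels = {i: tokens[i] for i in sorted(masked)}  # position -> original token
    corrupted = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return corrupted, labels


def assemble_batch(pool_by_species, batch_size=32, min_species=8, rng=None):
    """Build a batch of (species, sequence) pairs covering at least `min_species` species."""
    rng = rng or random.Random()
    if len(pool_by_species) < min_species:
        raise ValueError("not enough species in the pool")
    if batch_size < min_species:
        raise ValueError("batch_size must be at least min_species")
    # Guarantee coverage first: one sequence from each of min_species species.
    chosen = rng.sample(sorted(pool_by_species), min_species)
    batch = [(sp, rng.choice(pool_by_species[sp])) for sp in chosen]
    # Fill the remaining slots from the whole pool.
    everything = [(sp, s) for sp, seqs in pool_by_species.items() for s in seqs]
    batch += [rng.choice(everything) for _ in range(batch_size - min_species)]
    rng.shuffle(batch)
    return batch
```

In this sketch the remaining slots are filled in proportion to how much data each species has, so only the minimum coverage is guaranteed. A stricter sampler could cap the share of any single species.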

Validation against independent phenotype records

Model quality is not measured by training loss. We hold out a 10% split of the phenotype records from GRIN and SeedNet, then compute Pearson r² between our embedding-derived predictions and measured trait values. Current reported figures are internal benchmarks only — not peer-reviewed. We publish these results openly because we believe transparency before peer review builds more trust than releasing only after journal acceptance.
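The metric itself is simple to reproduce. The sketch below shows a 10% hold-out split and a squared Pearson correlation in plain Python. The function names and the fixed seed are assumptions, and the step that turns embeddings into trait predictions is not shown because it is not described here.

```python
import random


def holdout_split(records, fraction=0.10, seed=0):
    """Shuffle the records and set aside `fraction` of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def pearson_r2(predicted, measured):
    """Squared Pearson correlation between predicted and measured trait values."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return (cov * cov) / (var_p * var_m)
```

One property worth keeping in mind when reading the figures: squared Pearson correlation is insensitive to a constant offset or rescaling of the predictions, and it discards the sign of the correlation, so it measures how well predictions rank and track trait values rather than how close they are in absolute terms.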

Validation results

Performance versus baseline methods

Internal benchmark only — not peer-reviewed. Results on held-out GRIN + SeedNet phenotype records.

Trait	Our model r²	GWAS-only r²	gBLUP r²	Species tested
Drought Tolerance Index	0.847	0.61	0.73	Maize (12,000 acc.)
Yield Stability Score	0.791	0.54	0.68	Wheat (8,400 acc.)
Disease Resistance — Fusarium	0.812	0.58	0.71	Tomato (3,200 acc.)
Flowering Time Variance	0.883	0.72	0.81	Soybean (6,100 acc.)

r² = squared Pearson correlation coefficient. gBLUP = genomic Best Linear Unbiased Prediction (standard genomic selection baseline). acc. = accessions. Results are means across 5-fold cross-validation. Full methodology and raw results are available on request.

Transparency

What the model does and doesn't know

We enumerate known model boundaries because a prediction without declared scope is not a prediction — it is guesswork. Three limitations are material enough to communicate before you build a pipeline on our API.

Rare alleles are underrepresented in training

Variants with population frequency below ~2% are statistically sparse in our training corpus. For breeding programs working with landrace accessions or wild-relative introgressions, prediction confidence intervals will be wider and should be interpreted with caution. We flag this automatically in API responses via the rare_allele_proportion field.

Prediction accuracy drops for highly heterozygous species

Our current model was pre-trained primarily on inbred lines and reference-quality accessions. Highly heterozygous materials (autotetraploid potato, some sugarcane accessions) show lower prediction r², typically 0.10 to 0.15 below the values reported for diploid crops. We are actively developing a heterozygote-aware encoding layer for a future model version.

Currently validated on 8 crops only

Formal validation exists for wheat, maize, tomato, soybean, sorghum, barley, pepper, and rice. Predictions for other crops rely on transfer from related species and should be treated as exploratory until independent validation is completed. The API returns a validation_tier field (A/B/C) indicating validation depth for each prediction.
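A downstream pipeline can use the two response fields named above, rare_allele_proportion and validation_tier, to decide how much weight a prediction deserves. The sketch below is hypothetical: it assumes the response has been parsed into a dictionary, that rare_allele_proportion is a fraction between 0 and 1, and that tier A denotes the deepest validation. The 0.2 cutoff and the function name are illustrative choices, not documented API behavior.

```python
def triage_prediction(response, max_rare=0.2, trusted_tiers=("A",)):
    """Classify a parsed API response as 'usable' or 'exploratory'.

    Assumed fields (names from the text, semantics assumed):
      validation_tier         -- "A", "B", or "C"
      rare_allele_proportion  -- fraction of rare alleles in the input, 0 to 1
    """
    reasons = []
    tier = response.get("validation_tier")
    if tier not in trusted_tiers:
        reasons.append(f"validation tier {tier!r} is outside the trusted set")
    rare = response.get("rare_allele_proportion")
    if rare is None:
        reasons.append("rare_allele_proportion is missing")
    elif rare > max_rare:
        reasons.append(f"rare allele proportion {rare:.2f} exceeds {max_rare:.2f}")
    status = "exploratory" if reasons else "usable"
    return status, reasons
```

Treating a missing field as a reason for caution, rather than silently passing the prediction through, keeps the check conservative if the response format changes.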

How we share our work before peer review

01 — Pre-prints

We publish before peer review

Architecture notes, validation benchmark results, and training data documentation are posted to bioRxiv before journal submission. Genomics researchers evaluating tools for production pipelines should not have to wait 12–18 months for a journal to confirm what the pre-print already shows.

02 — Benchmark dataset

Our evaluation set is public

The held-out phenotype evaluation set used for all validation benchmarks reported on this site is available under CC BY 4.0 on our Research page. Other groups can run any method — GWAS-only, gBLUP, or their own models — against the same test set and report directly comparable r² values.

03 — Collaboration model

We co-develop with university labs

University genomics labs contribute phenotype records and sequenced accessions; we provide pre-training compute and model access. Co-authored pre-prints result from successful collaborations. See the Research page for current collaboration proposals.