Benchmarking Genomic Models on Crop Trait Data

There is no ImageNet equivalent for genomic crop trait prediction. This is a genuine problem. When a new model claims to outperform prior art on agronomic trait prediction, the comparison is almost always against a different dataset, a different trait decomposition, a different cross-validation scheme, and sometimes a different definition of what "prediction accuracy" means. The field is producing results that are not comparable, and researchers — ourselves included — are complicit in this because we evaluate on whatever labeled data we have access to.

This post describes the evaluation suite we built and how we use it internally. We are sharing the methodology because we think standardization in this space matters, and because we would rather invite criticism of our benchmark design than keep it proprietary.

Why Standard ML Evaluation Protocols Fail Here

The first instinct of an ML engineer evaluating a genomic model is to do an 80/20 random train/test split and report Pearson r on the held-out set. This is wrong for plant breeding data in at least three ways.

First, random splits leak population structure. If line A and line B are 90% genomically similar — siblings in a breeding program — a model that sees A in training will predict B trivially from similarity alone, not from learned biology. Random-split accuracy numbers overestimate performance in prospective deployment, where the target lines are genuinely novel.

Second, breeding programs want to know how well the model generalizes across environments, not just across lines in the same environment. A model with r = 0.91 on held-out lines from the same field trial may degrade to r = 0.62 on the same lines grown under different management practices or in a different agro-climatic zone. Most published benchmarks report the first number. Breeders need the second.

Third, crop trait phenotyping datasets are typically small by ML standards: 200 to 5,000 labeled accessions for most crops outside of maize and wheat. At these sample sizes, confidence intervals on Pearson r are often too wide to separate models whose point estimates differ by only a few points. A single r value reported without an interval estimate is not useful information.
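To make the third point concrete, here is a minimal sketch that computes an approximate 95% interval for Pearson r using the Fisher z-transformation. This is a textbook approximation chosen for illustration, not necessarily the interval method used in our suite; the function name and the example values (r = 0.60 at n = 200, 1,000, and 5,000) are assumptions.

```python
import math


def fisher_z_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation, which assumes roughly
    bivariate-normal data and n > 3. z_crit=1.96 gives ~95% coverage.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                # variance-stabilising transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Illustrative values only: the same point estimate at three sample sizes.
    for n in (200, 1000, 5000):
        lo, hi = fisher_z_interval(0.60, n)
        print(f"n={n:>5}: r=0.60, approx. 95% CI [{lo:.2f}, {hi:.2f}]")
```

At n = 200 the interval around r = 0.60 runs from roughly 0.50 to 0.68, which is wider than the difference between many competing models.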

Our Benchmark Suite: Three Datasets, Four Protocols

Dataset 1: GRIN Phenotype Records

The USDA GRIN (Germplasm Resources Information Network) database contains phenotypic evaluation records for tens of thousands of accessions across dozens of crops. We use GRIN records for maize, sorghum, and soybean where (a) accessions have publicly available whole-genome resequencing data in NCBI SRA and (b) evaluation records were collected under controlled water stress protocols, giving a standardized stress intensity rather than opportunistic drought variation.

For maize drought, we have 12,000 accessions after filtering. For sorghum stay-green, 4,200. For soybean seed protein under heat stress, 1,800. These are not large numbers, but they are among the best-characterized public collections for these traits.

Dataset 2: CIMMYT Historical Trial Data

CIMMYT releases subsets of historical nursery data through their data portal for academic use. We obtained multi-environment trial records for Triticum aestivum yield and quality traits from nurseries across 8 countries from 2014–2020, matched to genotype data from the wheat 90K SNP array. This gives us 3,700 variety-environment combinations, which is sufficient for the cross-environment generalization analysis.

Dataset 3: Internal Phenotype Collection

We have built a modest in-house dataset through collaboration with three early-stage plant breeding organizations in France and the Netherlands (not named here, per their request). Combined, this covers approximately 2,400 accessions across tomato (Solanum lycopersicum) and sunflower (Helianthus annuus) with field phenotyping records for disease resistance and oil composition traits respectively. This dataset is not public, but we use it for internal ablations and for trait categories not well-represented in GRIN.

The Four Evaluation Protocols

Protocol A: Relatedness-Aware Cross-Validation

We estimate pairwise genomic relatedness from the SNP matrix and use k-means clustering on the relatedness matrix to define folds for cross-validation. Lines in the same fold have high pairwise relatedness; lines in different folds do not. This prevents the training-test leakage from population structure described above. Performance under Protocol A is consistently 8–18 percentage points lower than random-split performance for all models tested. We consider Protocol A numbers the honest estimate of prospective accuracy.
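The fold construction can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: the relatedness estimator is a VanRaden-style genomic relationship matrix (the choice of estimator is an assumption), the k-means is a small hand-rolled version with deterministic farthest-point initialisation, and all function names are hypothetical.

```python
import numpy as np


def genomic_relationship(genotypes):
    """VanRaden-style relationship matrix from an (n_lines, n_snps)
    array of 0/1/2 allele counts. Estimator choice is an assumption."""
    genotypes = np.asarray(genotypes, dtype=float)
    p = genotypes.mean(axis=0) / 2.0            # allele frequency per SNP
    centered = genotypes - 2.0 * p              # centre each SNP column
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))


def relatedness_folds(grm, k, n_iter=100):
    """Cluster lines by their rows of the relationship matrix (Lloyd's
    k-means) and return one fold label per line."""
    grm = np.asarray(grm, dtype=float)
    # Farthest-point initialisation: deterministic, no random seed needed.
    centers = [grm[0]]
    while len(centers) < k:
        d2 = np.min([((grm - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(grm[int(np.argmax(d2))])
    centers = np.array(centers)
    labels = np.zeros(len(grm), dtype=int)
    for _ in range(n_iter):
        d2 = ((grm[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)              # assign each line to nearest centre
        updated = np.array([
            grm[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):       # converged
            break
        centers = updated
    return labels


def cv_splits(labels):
    """Yield (train_idx, test_idx), holding out one relatedness cluster at a time."""
    for fold in np.unique(labels):
        yield np.flatnonzero(labels != fold), np.flatnonzero(labels == fold)
```

Because whole clusters are held out together, close relatives of a test line never appear in its training set. In practice a library k-means (for example scikit-learn's KMeans) would replace the hand-rolled loop.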

Protocol B: Cross-Environment Prediction

Train on phenotype records from environments E_1 through E_(n-1), predict for environment E_n. Cycle through all environments so that each is held out once, and average. This directly measures GxE generalization. Note that sequence-based models have no information about environment, so they can only capture the genotypic main effect, not the GxE interaction itself. Protocol B accuracies are systematically lower than Protocol A accuracies; the gap gives a rough indication of how much of the variance is GxE-driven and thus inherently unpredictable from sequence alone.
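The leave-one-environment-out loop can be sketched as below. The record layout (line, environment, phenotype), the function names, and the line-mean placeholder model are all assumptions made for illustration; a real run would plug a genomic model into the `fit_predict` slot.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def leave_one_environment_out(records):
    """records: list of (line_id, environment, phenotype) tuples.
    Yields (held_out_env, train_records, test_records)."""
    for held_out in sorted({env for _, env, _ in records}):
        train = [r for r in records if r[1] != held_out]
        test = [r for r in records if r[1] == held_out]
        yield held_out, train, test


def cross_environment_score(records, fit_predict):
    """Hold out each environment once; return (mean r, per-environment r)."""
    scores = {}
    for env, train, test in leave_one_environment_out(records):
        predicted = fit_predict(train, [line for line, _, _ in test])
        observed = [y for _, _, y in test]
        scores[env] = pearson_r(predicted, observed)
    return sum(scores.values()) / len(scores), scores


def line_mean_baseline(train, test_lines):
    """Placeholder model (not a genomic model): predict each line's mean
    phenotype across the training environments."""
    seen = {}
    for line, _, y in train:
        seen.setdefault(line, []).append(y)
    return [sum(seen[line]) / len(seen[line]) for line in test_lines]
```

Reporting the per-environment scores alongside the mean shows whether a single difficult environment is driving the average.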

Protocol C: Low-Sample Fine-Tuning Ladder

Train a trait prediction head on n = 50, 100, 200, 500, 1000, and 2000 labeled accessions. Plot accuracy versus n. The key number is the inflection point: how many labeled accessions are needed to reach 80% of the maximum achievable accuracy? For our embedding-based approach on GRIN maize drought, the inflection point is around 300 labeled accessions. For a random-forest model on raw SNP features, it is around 1,200. Protocol C is the benchmark most relevant to breeders who are considering integrating genomic prediction but have limited historical phenotype data.
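One way to read the 80% threshold off the ladder is sketched below. Two assumptions are built in: the highest accuracy observed on the ladder stands in for "maximum achievable accuracy", and values between ladder steps are interpolated linearly in log(n). The function name is hypothetical and the example curve is invented purely for illustration; it is not one of our results.

```python
import math


def samples_to_reach(curve, fraction=0.8):
    """Estimate how many labeled accessions are needed to reach
    `fraction` of the best accuracy on the ladder.

    curve: dict mapping n_labeled -> accuracy (e.g. Pearson r).
    Interpolates linearly in log(n) between adjacent ladder steps.
    """
    target = fraction * max(curve.values())
    prev_n = prev_acc = None
    for n, acc in sorted(curve.items()):
        if acc >= target:
            if prev_n is None:          # already above target at the smallest n
                return float(n)
            t = (target - prev_acc) / (acc - prev_acc)
            log_n = math.log(prev_n) + t * (math.log(n) - math.log(prev_n))
            return math.exp(log_n)
        prev_n, prev_acc = n, acc
    raise ValueError("no ladder step reaches the target accuracy")


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    toy_curve = {50: 0.10, 100: 0.20, 200: 0.30, 500: 0.40, 1000: 0.45, 2000: 0.50}
    print(f"~{samples_to_reach(toy_curve):.0f} labeled accessions")
```

Interpolating on a log scale matches the roughly geometric spacing of the ladder, which is why the estimate can fall between two of the sampled values of n.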

Protocol D: Held-Out Species Transfer

Train on labeled data from species S_1 through S_(k-1), evaluate zero-shot on species S_k. This measures whether the foundation model's pre-trained representations transfer across species without any species-specific fine-tuning. Our current model achieves Pearson r of 0.48–0.62 on held-out species for drought-related traits. That is substantially above chance, but not production-ready for deployment without at least minimal fine-tuning. We are not claiming that zero-shot cross-species prediction is solved; we are tracking it as a long-term goal.

How We Report Numbers

For every evaluation, we report: Pearson r with 95% bootstrap confidence interval (1,000 bootstrap resamples), Spearman ρ for rank-order accuracy (more relevant for breeding applications where you care about relative ranking, not absolute value), and the fraction of variance explained (r²) as a single-number summary. We also report the RMSE in phenotype units where the units are meaningful, which is not always the case for dimensionless scores.
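A minimal sketch of these four numbers, using only the Python standard library. The percentile bootstrap is an assumption (other bootstrap interval methods exist), and the function names and the dictionary layout of the report are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def bootstrap_ci(pred, obs, stat=pearson_r, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample (pred, obs) pairs with replacement."""
    rng = random.Random(seed)
    n = len(obs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([pred[i] for i in idx], [obs[i] for i in idx])
        if not math.isnan(value):       # drop degenerate (constant) resamples
            stats.append(value)
    if not stats:
        raise ValueError("all bootstrap resamples were degenerate")
    stats.sort()
    m = len(stats)
    return stats[int(alpha / 2 * m)], stats[min(m - 1, int((1 - alpha / 2) * m))]


def report(pred, obs):
    r = pearson_r(pred, obs)
    return {
        "pearson_r": r,
        "ci95": bootstrap_ci(pred, obs),
        "spearman_rho": spearman_rho(pred, obs),
        "r2": r * r,
        "rmse": rmse(pred, obs),
    }
```

Resampling prediction and observation as pairs keeps each accession's two values together, which is what the interval on r requires.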

We compare against three baselines: random-forest on raw SNP features (PLINK-format), standard gBLUP implemented in BGLR, and the best prior genomic prediction model we are aware of for the specific trait/species combination. We do not compare against methods we cannot reproduce from published descriptions or available code.

An Honest Assessment of Where We Are

Under Protocol A (the honest protocol), our model outperforms gBLUP by 6–14 percentage points across the GRIN traits we have evaluated. Under Protocol B, the gap narrows to 3–8 points because both methods are constrained by the GxE variance that neither can model from sequence. Under Protocol C, the advantage is largest at low n (50–200 samples), which is where it matters most in practice. Under Protocol D, we are above baseline but not at the accuracy level needed for production deployment without fine-tuning.

We think these are honest numbers. They are not the best possible numbers — there are tricks that inflate accuracy on specific benchmarks that we have deliberately excluded from our evaluation design. We would rather publish a standardized benchmark with lower headline numbers than claim performance that does not replicate in a breeder's actual pipeline.

If you are evaluating genomic prediction tools and comparing across vendors or methods, ask for Protocol A numbers. If they only have random-split accuracy, that is not sufficient to make an informed decision about whether the tool will work in your program.
