Predicting Drought Tolerance from Sequence Data Alone

Field phenotyping for drought tolerance is among the most resource-intensive activities in crop breeding. Replicated multi-environment trials across years cost tens of thousands of euros per cohort and require 4–6 growing seasons before a breeder can make reliable selection decisions. The question we have been working on is whether genomic embeddings derived purely from sequencing data can substitute — even partially — for that phenotypic data in early-generation selection.

Why Drought Tolerance Is Particularly Hard to Model

Drought tolerance is not a single trait. It is a composite phenotype driven by root architecture, stomatal regulation, osmotic adjustment capacity, stay-green duration, and reproductive-stage stress tolerance, each with a distinct genetic architecture and each expressed differently across environments. Genome-wide association studies (GWAS) on Zea mays and Sorghum bicolor have identified hundreds of QTL associations for drought-related traits, but the variance explained per locus is typically small (r² values of 0.01–0.04 for individual SNPs), and genetic background effects are large.

This means that simple marker-assisted selection — picking candidate lines based on presence or absence of a few well-validated SNPs — is inadequate for most practical drought breeding contexts. You need to aggregate signal across thousands of variants simultaneously, which is precisely what genomic best linear unbiased prediction (gBLUP) and its machine learning variants attempt to do.

The challenge with standard gBLUP is that it requires a training population with both genomic marker data and reliable phenotype measurements. The model learns marker effects from the training set, then extrapolates to unobserved lines. When the training set is small relative to the dimensionality of the SNP space, or when the target environment differs substantially from where the training phenotypes were collected, prediction accuracy degrades sharply.
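For readers who have not implemented gBLUP, the following is a minimal sketch of the baseline described above, assuming NumPy is available. It builds a VanRaden-style genomic relationship matrix from 0/1/2 allele dosages and predicts unobserved lines from their relationship to the training lines. The function names, the fixed variance ratio `lam`, and the simulated data are illustrative assumptions; a production pipeline would normally estimate the variance ratio by REML and is not necessarily structured this way.

```python
import numpy as np


def vanraden_grm(dosages):
    """Genomic relationship matrix from an (n_lines, n_snps) array of
    alt-allele dosages coded 0/1/2."""
    p = dosages.mean(axis=0) / 2.0           # alt-allele frequency per SNP
    z = dosages - 2.0 * p                    # centre each SNP column
    denom = 2.0 * np.sum(p * (1.0 - p))      # scales G to the additive variance
    return z @ z.T / denom


def gblup_predict(grm, y_train, train_idx, test_idx, lam=1.0):
    """Predict phenotypes of test lines from training lines.

    lam is the ratio of residual to genetic variance. It is fixed here
    for illustration (an assumption), not estimated from the data.
    """
    mu = y_train.mean()
    g_tt = grm[np.ix_(train_idx, train_idx)]   # train x train relationships
    g_st = grm[np.ix_(test_idx, train_idx)]    # test x train relationships
    alpha = np.linalg.solve(g_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + g_st @ alpha


if __name__ == "__main__":
    # Simulated toy data, purely illustrative.
    rng = np.random.default_rng(0)
    dosages = rng.binomial(2, 0.3, size=(200, 1000)).astype(float)
    effects = rng.normal(0.0, 0.05, size=1000)
    y = dosages @ effects + rng.normal(0.0, 0.5, size=200)

    train_idx = np.arange(160)
    test_idx = np.arange(160, 200)
    grm = vanraden_grm(dosages)
    y_hat = gblup_predict(grm, y[train_idx], train_idx, test_idx)
    print("held-out Pearson r:", np.corrcoef(y_hat, y[test_idx])[0, 1])
```

The sketch makes the dependency visible: every prediction is a weighted combination of training phenotypes, with weights set by genomic relationship, so lines with little relationship to the training set are pulled toward the training mean.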

Our Approach: Embedding-First Genomic Prediction

Rather than learning marker effects directly from sparse labeled data, we first project each accession's VCF-format genotype data through our pre-trained plant genomic foundation model to obtain a 512-dimensional embedding. This embedding is not the raw SNP matrix. It is a compact representation of the sequence variation in a biologically meaningful latent space, learned from the structure of plant genomes across 47 species rather than from phenotypic variation in a single crop population.

The practical effect is that the embedding captures epistatic context. Two accessions may carry identical alleles at a drought-associated QTL while differing substantially in the surrounding haplotype context — and that context influences whether the allele is actually expressed under stress. The embedding encodes this context implicitly because the pre-training objective forces the model to learn positional dependencies across kilobase-scale windows.

We then train a lightweight prediction head, a two-layer MLP, on top of the frozen embeddings, using whatever labeled phenotype data is available. For this evaluation we used GRIN's publicly available germplasm evaluation records for Zea mays, specifically the drought stress scores from accessions evaluated at multiple locations in the southeastern United States between 2017 and 2022. After filtering for accessions with both high-quality genotype data in NCBI SRA and corresponding GRIN evaluation records, we obtained 12,000 usable accessions.
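A two-layer MLP head on frozen embeddings can be sketched in a few lines of NumPy. The version below is an illustration under stated assumptions, not the production training code: the hidden width, learning rate, epoch count, ReLU activation, and full-batch gradient descent are all placeholder choices, and `train_mlp_head` and `predict_head` are hypothetical names. The embeddings enter only as a fixed input matrix, which is what "frozen" means here.

```python
import numpy as np


def train_mlp_head(emb, y, hidden=64, lr=0.005, epochs=500, seed=0):
    """Fit a one-hidden-layer regression head on fixed embeddings.

    emb: (n_lines, emb_dim) array of precomputed embeddings (never updated).
    y:   (n_lines,) array of phenotype scores.
    All hyperparameters are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    n, d = emb.shape
    w1 = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, np.sqrt(1.0 / hidden), size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h_pre = emb @ w1 + b1
        h = np.maximum(h_pre, 0.0)              # ReLU hidden layer
        pred = h @ w2 + b2
        g_pred = (pred - y) / n                 # gradient of 0.5 * mean(err**2)
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum()
        g_h = np.outer(g_pred, w2) * (h_pre > 0)  # backprop through ReLU
        g_w1 = emb.T @ g_h
        g_b1 = g_h.sum(axis=0)
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return w1, b1, w2, b2


def predict_head(params, emb):
    """Apply the trained head to new embeddings."""
    w1, b1, w2, b2 = params
    return np.maximum(emb @ w1 + b1, 0.0) @ w2 + b2
```

Because only the head is trained, the number of parameters fitted to phenotype labels stays small relative to a model that learns an effect for every marker, which is the motivation for the embedding-first design.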

Correlation Results Across 12,000 Maize Accessions

Against the GRIN drought stress phenotype scores, our embedding-derived predictions achieved a Pearson correlation of 0.79 on the held-out test set (20% random split). For comparison, a standard gBLUP model trained on the same 80% training data produced a correlation of 0.71. The embedding approach improves most on lines that are genetically distant from the training population centroid — exactly the regime where gBLUP degrades fastest.
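The evaluation protocol described above, a random 80/20 split scored by Pearson correlation on the held-out part, can be written with the standard library alone. This is a sketch of the metric and the split, not the evaluation harness behind the reported numbers; the function names and the seed are assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_split(n, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a random split of n items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # seeded for reproducibility
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


# With 12,000 accessions, a 20% hold-out leaves 9,600 for training.
train_idx, test_idx = random_split(12000)
print(len(train_idx), len(test_idx))
```

Both models being compared should be fitted on the same training indices and scored on the same test indices, so that the difference in correlation reflects the model and not the split.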

The improvement is not uniformly distributed across trait components. For stay-green duration (measured as greenness score at R5 under water-limited conditions), the embedding approach achieved 0.84 correlation versus 0.76 for gBLUP. For root architecture proxies (estimated from above-ground recovery scores), the gap narrowed to 0.73 versus 0.68. Root architecture has a strong GxE component that sequence information alone cannot fully resolve — you genuinely need soil depth, texture, and temperature data to predict root behavior, none of which is in the VCF.

The Polyploid Complication

Maize is a diploid and represents the cleanest test case we have. When we applied a similar analysis to Triticum aestivum — hexaploid bread wheat — using publicly available drought trial data from CIMMYT nurseries, the correlation dropped to 0.67 despite wheat having better-characterized drought QTL in the literature. The degradation comes from polyploidy: our current model treats homoeologous chromosome sets independently, so the same functional variant at a locus is embedded differently depending on which subgenome it resides in. We are actively working on a polyploid-aware tokenization scheme, but it is not in the current production release.

We want to be explicit about this limitation: if drought tolerance prediction in hexaploid wheat is your primary use case, the current model will underperform relative to what is achievable with a polyploid-aware architecture. The diploid species in our evaluation suite (maize, sorghum, chickpea, tomato) show consistent results, but hexaploid wheat is the edge case where we are still below the accuracy bar we consider acceptable for production use.

Reducing Phenotyping Burden in Practice

Consider a practical scenario: a breeding organization with a population of 3,500 early-generation lines that have been sequenced but not yet phenotyped. Traditional breeding would advance 200–400 lines to replicated field trials, wait 2 growing seasons, and make selection decisions based on those results. With embedding-derived predictions, you can rank all 3,500 lines on predicted drought tolerance before a single field trial. The question is not whether the rankings are perfect — they are not — but whether they are informative enough to improve the efficiency of the next trial stage.

From the analysis above, the answer is yes for diploid species. When a breeder selects the top 400 lines by embedding-predicted drought score, roughly 78% of those selected lines would have ranked in the top quartile by actual field phenotyping, against the 25% expected from a random draw. That is not a replacement for field trials. It is a pre-filter that substantially increases the density of good material going into the expensive phenotyping stage.
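One way to check how informative a pre-filter is on a cohort with known phenotypes is to compare the predicted ranking against the observed one. The sketch below reports two numbers: the share of selected lines that fall in the observed top fraction, and the share of that top fraction captured by the selection. It assumes that a higher score means greater tolerance, which may need to be inverted for a given scoring scale, and `prefilter_overlap` is a hypothetical helper name.

```python
def prefilter_overlap(predicted, observed, n_select, top_fraction=0.25):
    """Compare a prediction-based selection with the observed top fraction.

    Returns (share_of_selection_in_top, share_of_top_captured).
    Assumes higher scores are better for both inputs.
    """
    n = len(predicted)
    n_top = int(n * top_fraction)
    order_pred = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    order_obs = sorted(range(n), key=lambda i: observed[i], reverse=True)
    selected = order_pred[:n_select]
    true_top = set(order_obs[:n_top])
    hits = sum(1 for i in selected if i in true_top)
    return hits / n_select, hits / n_top


# Toy check with a perfect predictor on 100 lines: all 10 selected lines
# are in the observed top 25, and they cover 10 of those 25.
scores = list(range(100))
print(prefilter_overlap(scores, scores, n_select=10))
```

For the scenario above, `n_select` would be 400 and the observed top quartile of 3,500 lines holds 875 lines.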

What the Model Still Cannot Do

Sequence-based prediction cannot recover information that is not encoded in the genome. Epigenetic state — particularly stress-induced methylation changes at drought-responsive loci — influences tolerance phenotypes in ways that whole-genome resequencing does not capture. Phenotypic plasticity responses to gradual versus sudden drought onset are invisible to a sequence model. And accession-specific microbiome associations (rhizosphere composition under water stress) can shift apparent tolerance scores by amounts large enough to affect ranking within a breeding cohort.

We also cannot predict drought tolerance in completely novel genetic backgrounds with no similarity to anything in the training data. Extrapolation in genomic prediction is always risky; extrapolation from embedding space is somewhat safer than extrapolation from sparse SNP matrices, but the principle still applies. When a wild species accession from a GRIN holding has an embedding cosine similarity below 0.4 to every accession in our training set, prediction accuracy is essentially unknown and the prediction should not be used for selection decisions.
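The similarity rule above translates into a simple gate that can run before any prediction is reported. The sketch below flags a query accession as out of domain when its best cosine similarity to the training embeddings is below 0.4. The function names are hypothetical, and the linear scan over the training set is an illustrative simplification; a large training set would call for a vectorised or indexed search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def in_training_domain(query, training_embeddings, threshold=0.4):
    """Return (is_in_domain, best_similarity) for one query embedding.

    The query is in domain if at least one training embedding reaches
    the similarity threshold.
    """
    best = max(cosine(query, t) for t in training_embeddings)
    return best >= threshold, best


# Toy 2-D example: the query matches the first training vector exactly.
print(in_training_domain([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

Returning the best similarity alongside the flag lets a breeder see how close a borderline accession is to the threshold instead of receiving only a yes or no.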

The correct framing is that this tool accelerates early-generation screening for breeders who are already integrating genomics into their pipeline. It does not compress the whole breeding cycle — it compresses one expensive and time-consuming step in the early stages while leaving the necessary field validation intact.