From SNP Arrays to Breeding Value: What Embedding Geometry Tells Us

Breeding programs that adopted genotyping arrays in the 2010s made a practical tradeoff: instead of full-genome resequencing, they genotyped at 50,000 to 600,000 SNP positions selected to capture population-level diversity at lower cost. The resulting SNP arrays — the Illumina 90K wheat chip, the MaizeSNP50 array, the SoySNP50 — became the de facto data format for genomic selection. Most production breeding pipelines today run on this data, and it makes sense to ask whether our foundation model can work with it directly or whether it requires full-genome sequences.

From Array Data to Embeddings: The Practical Path

SNP array data arrives as genotype calls at fixed marker positions — a vector of {0, 1, 2} values encoding homozygous reference, heterozygous, and homozygous alternate at each locus. This is not natively the input format our model expects. Our model ingests sequence context: a 512-nucleotide window around each variant, which requires knowing what sequence surrounds the SNP.

For array-genotyped accessions, we construct pseudo-sequence input by (1) lifting each SNP position over to the current species reference, (2) extracting the flanking sequence from the reference FASTA (about 256 nucleotides on each side, so that each window totals 512 nucleotides), and (3) substituting the genotype-specific allele at the SNP position. The result is not a whole-genome sequence; it is a set of 512-nucleotide windows, one per genotyped SNP. We embed each window independently and then aggregate across SNP windows using a mean weighted by each SNP's minor allele frequency in the population, which prioritizes informative variants over common-variant flanking noise.
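
The window construction and aggregation steps can be sketched as follows. This is a minimal illustration, not the production pipeline: the function names are hypothetical, the 256/255 split of the flanks is an assumption made so that the window totals 512 nucleotides, and the handling of heterozygous calls (not specified here) is left to the caller, who passes in the allele to substitute.

```python
import numpy as np

FLANK_LEFT = 256   # assumption: 256 nt upstream of the SNP
FLANK_RIGHT = 255  # assumption: 255 nt downstream, so the window totals 512 nt


def build_window(reference: str, pos: int, allele: str) -> str:
    """Return the window around 0-based `pos` with `allele` placed at the SNP."""
    start, end = pos - FLANK_LEFT, pos + FLANK_RIGHT + 1
    if start < 0 or end > len(reference):
        raise ValueError("SNP too close to a contig end for a full window")
    return reference[start:pos] + allele + reference[pos + 1:end]


def minor_allele_frequency(genotypes: np.ndarray) -> np.ndarray:
    """Per-SNP MAF from an (accessions x SNPs) matrix of 0/1/2 calls."""
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def aggregate(window_embeddings: np.ndarray, maf: np.ndarray) -> np.ndarray:
    """MAF-weighted mean over SNP windows: (SNPs x dims) -> (dims,)."""
    return np.average(window_embeddings, axis=0, weights=maf)


# Toy data: an 800 nt contig and a 4-accession, 3-SNP genotype matrix.
reference = "ACGT" * 200
window = build_window(reference, pos=400, allele="T")

genotypes = np.array([[0, 1, 2], [0, 0, 2], [1, 0, 2], [0, 1, 2]])
maf = minor_allele_frequency(genotypes)

# Stand-in window embeddings for one accession (3 SNP windows x 2 dims).
window_embeddings = np.array([[3.0, 0.0], [0.0, 3.0], [9.0, 9.0]])
accession_embedding = aggregate(window_embeddings, maf)
```

In this toy panel the third SNP is fixed for the alternate allele, so its MAF is zero and its window contributes nothing to the aggregate. `np.average` raises an error if every weight is zero, so a real pipeline would need to filter monomorphic markers first.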

This approach produces a single 512-dimensional embedding per accession from array data, directly comparable to the embedding derived from whole-genome sequencing (WGS). Our internal validation shows a Pearson r of 0.94 between array-derived and WGS-derived embeddings for the same accessions in maize. That is high enough for most downstream applications, though not lossless.
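
One way to run this kind of concordance check on your own data is sketched below. How the 0.94 figure was computed is not specified here, so the per-accession correlation across embedding dimensions is an assumption, and the arrays are synthetic stand-ins rather than real embeddings.

```python
import numpy as np


def rowwise_pearson(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson r between matching rows of two (accessions x dims) arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    numerator = (a * b).sum(axis=1)
    denominator = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return numerator / denominator


# Synthetic stand-ins: "array" embeddings are WGS embeddings plus small noise.
rng = np.random.default_rng(0)
wgs_embs = rng.normal(size=(5, 512))
array_embs = wgs_embs + 0.1 * rng.normal(size=wgs_embs.shape)

per_accession_r = rowwise_pearson(array_embs, wgs_embs)
print(f"mean per-accession Pearson r: {per_accession_r.mean():.3f}")
```

Looking at the distribution of per-accession values, rather than only the mean, helps spot individual accessions whose array-derived embedding diverges from the WGS one.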

What Clustering Reveals Without Labels

The first thing we do with a new species dataset is visualize the embedding space — UMAP projection to 2D, colored by phenotype if labels are available, or by geographic origin or pedigree if they are not. The patterns that emerge tell you immediately whether the embedding is doing something useful.

In a maize panel of 8,200 accessions drawn from the GRIN collection, the UMAP shows clean geographic clustering corresponding roughly to temperate, subtropical, and tropical adaptation groups. This is expected and reflects known population structure. More interesting is what happens within the temperate cluster: a subgroup of lines with above-average drought tolerance (measured by GRIN evaluation scores) forms a distinct subregion of the embedding space despite having no drought labels in the data the model was trained on. The model did not learn "drought tolerance" explicitly — it learned sequence-level features that happen to be enriched in drought-tolerant genotypes because drought tolerance genes have detectable sequence signatures.

We do not always see this. In a smaller sunflower panel (n = 420 accessions, internal dataset), the embedding space structure was dominated by population of origin and did not show meaningful separation by oil composition traits. The signal-to-noise ratio is lower in small panels because population structure explains most of the variance.
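
A quick way to quantify how much of the embedding variance is explained by population of origin is the between-group share of the total sum of squares. The sketch below is an illustrative diagnostic, not a description of the analysis behind the sunflower result; the function name is hypothetical.

```python
import numpy as np


def between_group_fraction(embs, labels) -> float:
    """Share of total embedding variance explained by group membership."""
    embs = np.asarray(embs, dtype=float)
    labels = np.asarray(labels)
    grand_mean = embs.mean(axis=0)
    total_ss = ((embs - grand_mean) ** 2).sum()
    between_ss = 0.0
    for group in np.unique(labels):
        members = embs[labels == group]
        # Squared distance of the group centroid from the grand mean,
        # weighted by group size.
        between_ss += len(members) * ((members.mean(axis=0) - grand_mean) ** 2).sum()
    return between_ss / total_ss


# Toy example: two tight clusters.
toy_embs = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
by_origin = between_group_fraction(toy_embs, ["A", "A", "B", "B"])
by_unrelated = between_group_fraction(toy_embs, ["x", "y", "x", "y"])
```

A value near 1 for population labels, combined with a value near 0 for the trait grouping, is the pattern described above: structure dominates and the trait signal is hard to see.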

Embedding Geometry and Breeding Value Prediction

The connection between embedding geometry and estimated breeding value (EBV) is more direct than it might seem. In classical genomic selection, EBV is estimated as the sum of marker effects across the genome — a linear function of the SNP vector. In our framework, the embedding is a nonlinear transformation of the SNP context, and the EBV prediction is a function of the embedding.
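
Written out, the contrast looks like this. The notation is introduced here for illustration only: x_ij is the 0/1/2 genotype of accession i at marker j, beta_j is the estimated marker effect, e_i is the accession embedding, and h is whatever prediction head is fitted on top of it.

```latex
\hat{g}_i^{\,\text{additive}} = \sum_{j=1}^{m} x_{ij}\,\beta_j
\qquad\text{versus}\qquad
\hat{g}_i^{\,\text{embedding}} = h(\mathbf{e}_i),
\quad \mathbf{e}_i = f\bigl(\text{SNP context windows of accession } i\bigr)
```

If f and h were both linear, the second form would reduce to the first; any gain from the embedding has to come from the nonlinearity in f, in h, or in both.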

For traits with predominantly additive genetic architecture, such as yield components in wheat or seed protein in soybean, the linear approximation is nearly as accurate as the nonlinear one. The advantage of the embedding approach is marginal (typically 3–5 points of Pearson r, that is, 0.03–0.05) and may not justify the additional complexity if you have a well-calibrated gBLUP baseline.

For traits with substantial epistatic architecture — disease resistance loci in most species, some abiotic stress responses — the nonlinear embedding representation captures variance that the additive model misses. In our Fusarium resistance evaluation on wheat (internal dataset, 1,100 accessions), embedding-based prediction outperformed gBLUP by 11 percentage points under Protocol A cross-validation. The resistance mechanism involves interaction between multiple NLR gene copies, and that interaction is represented in the 512-nucleotide context windows but collapsed to a single allele effect in the SNP model.
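
To see why a purely additive model can miss interaction variance, consider a deliberately extreme toy case: two loci whose effect on the phenotype depends entirely on their combination. This is not the Fusarium dataset or the NLR mechanism, only an illustration of the statistical point.

```python
import numpy as np


def r_squared(features: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    centered = y - y.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)


# All four combinations of two biallelic loci (0 = absent, 1 = present).
locus_a = np.array([0.0, 0.0, 1.0, 1.0])
locus_b = np.array([0.0, 1.0, 0.0, 1.0])
# Toy phenotype: expressed only when exactly one of the two alleles is present.
phenotype = np.array([0.0, 1.0, 1.0, 0.0])

additive_r2 = r_squared(np.column_stack([locus_a, locus_b]), phenotype)
interaction_r2 = r_squared(
    np.column_stack([locus_a, locus_b, locus_a * locus_b]), phenotype
)
print(f"additive R^2: {additive_r2:.2f}, with interaction term: {interaction_r2:.2f}")
```

The additive fit explains none of the variance here, while adding the product term explains all of it. Real epistatic traits sit somewhere between these extremes, so the additive model usually captures part of the signal.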

Interpreting Subspace Structure

A recurring question from breeders integrating our API is: can we identify which embedding dimensions correspond to specific traits? The answer is: partially, and with important caveats.

We trained linear probes (logistic or linear regression from embedding dimensions to trait labels) for each annotated trait in the GRIN maize dataset. For drought tolerance, the highest-weight dimensions in the linear probe, when traced back to the input sequence, corresponded to known drought-responsive gene families: aquaporin family members, dehydrin genes, and several well-characterized loci in the maize ABA signaling pathway. This is encouraging for interpretability: the embedding is encoding something biologically real, not arbitrary.

However, linear probes cannot isolate traits that are distributed across many small-effect loci with no individual dimension dominating. Quantitative yield traits in highly polygenic crops like wheat have no clean embedding subspace associated with them. The trait signal is diffuse across most of the 512 dimensions, which is the embedding-space counterpart of the trait being highly polygenic at the sequence level. For these traits, you cannot interpret the embedding in terms of individual known genes; you have to trust the aggregate prediction accuracy and accept that the mechanism is opaque.

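import numpy as np
from sklearn.linear_model import Ridge
from livingmodels import EmbeddingClient

# Assumed to return one embedding row per accession, in VCF sample order
client = EmbeddingClient(api_key="...")
embs = client.embed_vcf("panel.vcf", reference="ZmB73v5")

# Hypothetical phenotype file: one drought score per accession, same order as embs
drought_scores = np.loadtxt("drought_scores.txt")

# Hold out a random 20% of accessions for evaluation
idx = np.random.default_rng(0).permutation(len(drought_scores))
test_idx, train_idx = idx[: len(idx) // 5], idx[len(idx) // 5 :]

# Linear probe on drought score
probe = Ridge(alpha=0.01).fit(embs[train_idx], drought_scores[train_idx])
r = np.corrcoef(probe.predict(embs[test_idx]), drought_scores[test_idx])[0, 1]
print(f"Linear probe Pearson r: {r:.3f}")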
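
Once a probe is fitted, the first interpretability step is to rank embedding dimensions by the magnitude of their probe weight. The helper below is a hypothetical sketch of that ranking. With a scikit-learn `Ridge` probe, the weights are available as `probe.coef_`; the synthetic vector here only stands in for them. Tracing dimensions back to input sequence is a separate step that is not shown.

```python
import numpy as np


def top_dimensions(coef: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k embedding dimensions with the largest absolute weight."""
    order = np.argsort(-np.abs(coef))
    return order[:k]


# Synthetic stand-in for probe.coef_ (one weight per embedding dimension).
example_coef = np.array([0.1, -0.9, 0.5, 0.0])
top_two = top_dimensions(example_coef, k=2)
```

Ranking by absolute value matters because a strongly negative weight is as informative as a strongly positive one. For diffuse, polygenic traits the sorted weights decay slowly and no short list of dimensions stands out.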

The Correlation Drops in Specific Regimes

We have mapped where the embedding-to-breeding-value correlation degrades. Three regimes produce reliably lower accuracy:

Highly admixed populations: Lines with complex multi-way ancestry do not cluster cleanly in the embedding space, and the aggregate embedding may blend signals from incompatible genetic backgrounds. Protocol A cross-validation accuracy for admixed panels is typically 10–15 points below what we see in structured populations.
SNP arrays with low marker density: Arrays below ~30,000 markers provide insufficient coverage to construct representative embeddings via the flanking-sequence approach. The missing context between sparsely spaced markers causes the aggregated embedding to be less informative. For 50K+ arrays, this is not a concern; for lower-density marker panels common in some legume programs, results should be interpreted cautiously.
Traits with strong cytoplasmic inheritance: Mitochondrial and chloroplast-encoded variation is not well-represented in nuclear SNP arrays, and our current model does not natively incorporate organellar sequences. For traits with cytoplasmic effects — some male sterility loci, certain quality traits with chloroplast involvement — the nuclear embedding misses part of the relevant variation.
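
These three checks are easy to encode as a pre-flight step before running predictions. The sketch below is illustrative only: the class and function names are hypothetical, the 30,000-marker threshold is the approximate figure quoted above, and whether a panel counts as highly admixed or a trait as cytoplasmic is a judgment the user supplies.

```python
from dataclasses import dataclass

MIN_MARKERS = 30_000  # approximate density threshold quoted in the text


@dataclass
class PanelInfo:
    n_markers: int
    highly_admixed: bool
    cytoplasmic_trait: bool


def regime_warnings(panel: PanelInfo) -> list[str]:
    """Return one warning per low-accuracy regime the panel falls into."""
    warnings = []
    if panel.highly_admixed:
        warnings.append("highly admixed population: expect lower accuracy")
    if panel.n_markers < MIN_MARKERS:
        warnings.append(f"marker density below ~{MIN_MARKERS}: interpret cautiously")
    if panel.cytoplasmic_trait:
        warnings.append("cytoplasmic inheritance: nuclear embedding misses organellar variation")
    return warnings


dense_panel = regime_warnings(PanelInfo(50_000, False, False))
risky_panel = regime_warnings(PanelInfo(12_000, True, True))
for message in risky_panel:
    print("WARNING:", message)
```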

These limitations do not make the approach impractical: the majority of economically important breeding traits in the crops we work with fall outside these three regimes. But breeders applying the model to novel contexts should explicitly check whether their target trait and population fit the conditions under which the correlations hold.

Practical Workflow Integration

For a breeding program that already runs Illumina 90K arrays on wheat lines and uses BGLR for gBLUP prediction, the integration path is: (1) export VCF from your standard genotype calling pipeline, (2) call our embed_vcf endpoint to produce 512-dim embeddings, (3) add the embeddings as an additional term alongside your existing genomic relationship matrix in BGLR, or pass them to a standalone prediction head. The embedding typically adds 5–9 Pearson r points on top of gBLUP alone on traits with epistatic architecture, while adding negligible computation overhead at inference time.
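
For readers who want to prototype step (3) outside R, the sketch below shows the idea of combining a marker-based kernel with an embedding-based kernel in a single kernel ridge predictor. It is not BGLR and not our recommended configuration: it sums the two kernels with equal weight and a fixed ridge penalty, whereas BGLR would estimate a separate variance component for each term. All data are synthetic and all function names are hypothetical.

```python
import numpy as np


def linear_kernel(features: np.ndarray) -> np.ndarray:
    """Centered linear kernel, scaled so that its mean diagonal is 1."""
    centered = features - features.mean(axis=0)
    kernel = centered @ centered.T
    return kernel / np.mean(np.diag(kernel))


def kernel_ridge_predict(kernel, y, train, test, lam=1.0):
    """Fit kernel ridge on the training accessions and predict the test ones."""
    y_mean = y[train].mean()
    k_train = kernel[np.ix_(train, train)]
    alpha = np.linalg.solve(k_train + lam * np.eye(len(train)), y[train] - y_mean)
    return kernel[np.ix_(test, train)] @ alpha + y_mean


# Synthetic stand-ins for a genotype matrix, accession embeddings, and a trait.
rng = np.random.default_rng(0)
n_accessions = 40
genotypes = rng.integers(0, 3, size=(n_accessions, 200)).astype(float)
embeddings = rng.normal(size=(n_accessions, 16))
trait = rng.normal(size=n_accessions)

# GRM-like kernel from markers plus a second kernel from embeddings.
combined = linear_kernel(genotypes) + linear_kernel(embeddings)
train, test = np.arange(30), np.arange(30, 40)
preds = kernel_ridge_predict(combined, trait, train, test)
```

Comparing held-out accuracy with the marker kernel alone against the combined kernel, on your own data and cross-validation scheme, is the direct way to check whether the embedding term adds anything for a given trait.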

We do not recommend replacing gBLUP entirely with embedding-only prediction in established programs with well-calibrated training sets. The additive model has decades of statistical theory and validated performance behind it. The embedding is best understood as a complementary feature that captures what the additive model cannot represent — not as a replacement for it.