Embedding Drift in Longitudinal Breeding Programs

A genomic prediction model is not static in the way a weather forecasting model is static. The population it is deployed on changes over time. A breeding program that runs genomic selection is, by design, shifting its genetic composition each cycle toward higher predicted performance. After two or three selection cycles, the lines being evaluated look less and less like the lines the model was fine-tuned on. We call this process embedding drift, and it is one of the less-discussed failure modes in applied genomic prediction.

What Embedding Drift Actually Means

When we fine-tune a genomic prediction model on a breeding population, the model learns a mapping from sequence embeddings to phenotype values that is calibrated to the genetic architecture of that specific population at that specific time. The embedding space — the 512-dimensional vector representation our model assigns to each accession — captures variation relative to what was in the training data. If the population mean for a key drought-tolerance locus shifts from 0.3 to 0.6 across selection cycles, the model is now making predictions in a region of embedding space that was underrepresented in its training data. The predictions may still be directionally correct, but their precision degrades.

This is not specific to our approach — it affects any genomic prediction method, including classical genomic best linear unbiased prediction. What is different with embedding-based methods is that the drift is measurable in a way that is more interpretable than the equivalent shift in a marker-effect matrix. We can compute, for each new batch of lines, where their embeddings fall in the distribution of training embeddings and how far they have drifted from the training centroid.

A Concrete Case from a European Cereal Programme

We have been working with a European cereal breeder who has been using our platform for several breeding cycles. After the third cycle of genomic selection, they noticed that the model's correlation with field observations was lower than it had been in the first cycle. The raw numbers: cycle 1 validation gave Pearson r = 0.67; cycle 3 validation gave r = 0.54. That is a meaningful degradation — the difference between a model that is actively useful for candidate selection and one that is only marginally better than simpler baselines.
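A rough way to size that drop, assuming the standard breeder's equation applies and treating the validation correlation as a stand-in for selection accuracy (both are assumptions on our part, not something measured in the programme): with selection intensity i and additive genetic standard deviation sigma_A held fixed, expected response per cycle R scales linearly with accuracy r.

```latex
R = i \, r \, \sigma_A
\quad\Rightarrow\quad
\frac{R_{\text{cycle 3}}}{R_{\text{cycle 1}}}
  = \frac{r_3}{r_1}
  = \frac{0.54}{0.67}
  \approx 0.81
```

Under those assumptions, the cycle 3 model would be expected to deliver roughly 19% less gain per cycle than the cycle 1 model. In terms of variance explained, r squared falls from about 0.45 to about 0.29.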

When we looked at the embedding distributions, the pattern was clear. The cycle 3 evaluation set had embeddings that had shifted significantly along the drought-tolerance axis compared to the cycle 1 training set — exactly as expected from sustained directional selection. The model had been trained on a population where roughly 40% of accessions fell in the high-tolerance range; in cycle 3, close to 65% of candidates were in that range. The model was being asked to discriminate within a region of embedding space where it had relatively little training signal.

How We Detect Drift in Deployment

We built a drift monitoring component into our platform that runs automatically each time a new batch of lines is submitted for prediction. The monitor computes three things: the Mahalanobis distance from the new batch's embedding centroid to the training population centroid, the overlap coefficient between the new batch's embedding distribution and the training distribution along the primary prediction axis, and the fraction of new-batch embeddings that fall outside the convex hull of the training embeddings (the out-of-distribution fraction).
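The first two metrics can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the platform's implementation: the function names, the covariance shrinkage value, the histogram bin count, and the synthetic 8-dimensional data are all hypothetical, and the "primary prediction axis" is represented by an arbitrary unit vector. The convex-hull test is left out because an exact membership check in a high-dimensional space needs a linear-programming solver.

```python
import numpy as np


def centroid_mahalanobis(train, batch, shrinkage=0.1):
    """Mahalanobis distance from the batch centroid to the training centroid."""
    diff = batch.mean(axis=0) - train.mean(axis=0)
    cov = np.cov(train, rowvar=False)
    # Assumption: shrink the covariance toward a scaled identity so it stays
    # invertible when there are fewer training lines than dimensions.
    dim = cov.shape[0]
    target = np.eye(dim) * (np.trace(cov) / dim)
    cov_reg = (1.0 - shrinkage) * cov + shrinkage * target
    return float(np.sqrt(diff @ np.linalg.solve(cov_reg, diff)))


def overlap_coefficient(train_scores, batch_scores, bins=30):
    """Shared area of two normalised histograms: 1 = identical, 0 = disjoint."""
    lo = min(train_scores.min(), batch_scores.min())
    hi = max(train_scores.max(), batch_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(train_scores, bins=edges)
    q, _ = np.histogram(batch_scores, bins=edges)
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())


# Synthetic stand-in data: 8 dimensions instead of 512, purely for illustration.
rng = np.random.default_rng(0)
axis = np.zeros(8)
axis[0] = 1.0  # hypothetical "primary prediction axis"
train = rng.normal(size=(400, 8))
same = rng.normal(size=(200, 8))                  # batch with no drift
shifted = rng.normal(size=(200, 8)) + 1.5 * axis  # batch pushed along the axis

for name, batch in [("same", same), ("shifted", shifted)]:
    d = centroid_mahalanobis(train, batch)
    ov = overlap_coefficient(train @ axis, batch @ axis)
    print(f"{name}: mahalanobis={d:.2f} overlap={ov:.2f}")
```

The shifted batch should show a larger centroid distance and a smaller overlap than the undrifted one. Note that the two metrics move in opposite directions as drift grows.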

We flag a batch when any of these metrics crosses its threshold (the centroid distance or the out-of-distribution fraction rising too high, or the overlap coefficient falling too low) and surface a warning to the user: "This batch shows elevated embedding drift from your training population. Consider a fine-tuning update before relying on these predictions for final selection decisions." The thresholds are not universal: they depend on the trait and species, and we currently set them conservatively based on empirical observation of when drift starts to measurably degrade prediction accuracy. This is an area where we expect to refine the calibration as we accumulate more longitudinal data from programmes that have been running for multiple cycles.
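The flagging step might look like the sketch below. The class, the function names, and above all the threshold values are placeholders invented for illustration; real thresholds would be set per trait and species. The one design point worth showing is that the overlap coefficient is a similarity, so its check runs in the opposite direction from the other two.

```python
from dataclasses import dataclass

WARNING = (
    "This batch shows elevated embedding drift from your training population. "
    "Consider a fine-tuning update before relying on these predictions for "
    "final selection decisions."
)


@dataclass(frozen=True)
class DriftThresholds:
    # Placeholder values for illustration only, not calibrated thresholds.
    max_mahalanobis: float = 3.0
    min_overlap: float = 0.6
    max_ood_fraction: float = 0.1


def drift_flags(mahalanobis, overlap, ood_fraction, thresholds):
    """Return the names of the metrics that crossed their threshold."""
    flags = []
    if mahalanobis > thresholds.max_mahalanobis:
        flags.append("mahalanobis")
    # Overlap is a similarity, so it is flagged when it drops too low.
    if overlap < thresholds.min_overlap:
        flags.append("overlap")
    if ood_fraction > thresholds.max_ood_fraction:
        flags.append("ood_fraction")
    return flags


def drift_warning(flags):
    """Return the user-facing warning if any metric was flagged, else None."""
    return WARNING if flags else None


flags = drift_flags(4.2, 0.5, 0.02, DriftThresholds())
print(flags, drift_warning(flags) is not None)
```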

Response Options: When to Retrain and When Not To

The obvious response to drift is to retrain. But retraining requires labeled data from the new population — phenotype records from field evaluation of cycle N lines — and those records are often available only after the window for cycle N+1 selection has closed. The practical question is what to do in the meantime.

We have had reasonable results with a lightweight recalibration approach: rather than retraining the full model, we fit a new regression head on a small set of cycle N lines for which preliminary phenotype estimates are available — often from rapid greenhouse assays rather than full field evaluation. This recalibrates the model's output scale to the shifted population without the compute and data requirements of full fine-tuning. It is not a substitute for eventual retraining, but it narrows the gap in prediction quality by a meaningful margin in the cases we have tested.
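As a rough sketch of what fitting a new regression head can look like, the example below fits a closed-form ridge regression on frozen embeddings for a small calibration set and compares it with the old head on held-out lines from the shifted population. Everything here is an assumption made for illustration: the ridge penalty, the function names, the 60-line calibration set, and the synthetic data, in which the shifted population has a different output scale and offset.

```python
import numpy as np


def fit_ridge_head(embeddings, phenotypes, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean = embeddings.mean(axis=0)
    y_mean = phenotypes.mean()
    xc = embeddings - x_mean
    yc = phenotypes - y_mean
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    return weights, float(y_mean - x_mean @ weights)


def predict_head(embeddings, weights, intercept):
    return embeddings @ weights + intercept


def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


# Synthetic stand-in for frozen embeddings (8 dimensions for brevity).
rng = np.random.default_rng(1)
dim = 8
w_true = rng.normal(size=dim)

# Original training population and the head fitted on it.
old_x = rng.normal(size=(300, dim))
old_y = old_x @ w_true + rng.normal(scale=0.3, size=300)
old_head = fit_ridge_head(old_x, old_y)

# Shifted population: embeddings moved, output scale and offset changed.
new_x = rng.normal(loc=1.0, size=(560, dim))
new_y = 0.7 * (new_x @ w_true) + 2.0 + rng.normal(scale=0.3, size=560)

# Only 60 lines have preliminary phenotype estimates; the rest are held out.
calib_x, calib_y = new_x[:60], new_y[:60]
eval_x, eval_y = new_x[60:], new_y[60:]
new_head = fit_ridge_head(calib_x, calib_y)

rmse_old = rmse(predict_head(eval_x, *old_head), eval_y)
rmse_new = rmse(predict_head(eval_x, *new_head), eval_y)
print(f"old head RMSE={rmse_old:.2f}  recalibrated head RMSE={rmse_new:.2f}")
```

With real 512-dimensional embeddings and only a few dozen calibration lines, the penalty would matter far more than it does in this toy setting, and an equivalent dual-form solve would be cheaper than inverting a 512 by 512 matrix.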

A second option, when the drift is primarily in the mean and not in the variance structure, is to apply a post-hoc adjustment to the predictions using the embedding shift as a covariate. This is more of an empirical correction than a principled update, but it is fast and requires no additional phenotype data. We treat it as a stopgap and recommend against using it as a long-term solution.

Implications for Breeding Program Design

The existence of embedding drift has implications for how a genomic prediction deployment should be planned, not just monitored. A breeding programme that anticipates multiple cycles of genomic selection should plan, from the beginning, a phenotyping strategy that generates fine-tuning data at regular intervals — not just a one-time calibration set at the start of deployment. The cadence does not need to be every cycle: phenotyping a representative subset of lines every second or third cycle, combined with continuous drift monitoring, keeps the model calibrated at manageable cost.

We raise this with new collaborators at the project scoping stage because it affects the cost model for long-term platform use. A programme that plans a single calibration run and expects the model to stay accurate indefinitely is setting itself up for the kind of quiet degradation that we saw in the cereal programme case. The model will continue to produce numbers — but those numbers will become progressively less reliable without anyone necessarily noticing until a field-season result is surprising.
