Why Pangenome Training Changes Everything for Crop Diversity

For the first three years of training plant genomic models, we used single reference assemblies — one genome per species, representing the canonical accession from the sequencing consortium that produced it. Maize was B73. Wheat was the IWGSC reference. Soybean was Williams 82. This is how most plant genomics computation works, and for many analytical tasks it is adequate. For foundation model training, we have become convinced it is not.

What a Single Reference Assembly Misses

A reference assembly represents one genotype. In any crop species with significant breeding history or wild diversity, the reference genotype is typically an elite variety or model accession, often chosen precisely because it is well-behaved genomically (regular ploidy, manageable repeat content, minimal heterozygosity). The genetic content of that single genotype is not the genetic content of the species.

This is not a new observation in plant genomics. The concept of the pangenome — the full set of sequences present in at least one member of a population — has been formalised for plants primarily through work on maize, tomato, and more recently wheat. The key finding across these species: a substantial fraction of genes and regulatory sequences are "dispensable," meaning they are absent from some accessions but present in others. Estimates for dispensable gene content range from roughly 15% in tomato to over 40% in maize, depending on the panel size and analytical method.
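To make the core versus dispensable distinction concrete, here is a minimal sketch that classifies genes from a presence/absence table. The accession names, gene IDs, and the `classify_genes` helper are hypothetical and purely illustrative; real estimates depend heavily on panel size and on how gene presence is called.

```python
from typing import Dict, Set


def classify_genes(presence: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split a pangenome gene set into core and dispensable genes.

    `presence` maps an accession name to the set of gene IDs found in it.
    """
    if not presence:
        raise ValueError("need at least one accession")
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)        # present in at least one accession
    core = set.intersection(*gene_sets)  # present in every accession
    return {"pan": pan, "core": core, "dispensable": pan - core}


# Toy, made-up panel of three accessions (not real data).
toy_panel = {
    "acc_A": {"g1", "g2", "g3", "g4"},
    "acc_B": {"g1", "g2", "g4", "g5"},
    "acc_C": {"g1", "g2", "g3"},
}

result = classify_genes(toy_panel)
dispensable_fraction = len(result["dispensable"]) / len(result["pan"])
print(sorted(result["core"]), dispensable_fraction)
```

In this toy panel two of the five genes are core and three are dispensable. Adding accessions can only grow the pan set and shrink the core set, which is one reason published dispensable fractions vary with panel size.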

For our modelling purposes, the consequence is direct. If we train on B73 maize, we have no representation of sequence context for any gene that is absent from B73. When we encounter a diverse line at inference time (an African landrace, a tropical open-pollinated variety, an introgression line carrying wild-relative sequence), we are asking our model to make predictions in regions of sequence space it has never seen. The model has no learned representation for those k-mers or their surrounding context, because those sequences never appeared in training.
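One rough way to quantify how much of a new accession falls outside the training distribution is the fraction of its k-mers that never occur in the training sequences. The sketch below assumes plain string sequences and a hypothetical `novel_kmer_fraction` helper; the sequences are toy examples, and a real genome-scale version would need a memory-efficient k-mer counter.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmer_fraction(query: str, training_seqs: Iterable[str], k: int) -> float:
    """Fraction of the query's distinct k-mers absent from all training sequences."""
    seen: Set[str] = set()
    for s in training_seqs:
        seen |= kmers(s, k)
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers - seen) / len(query_kmers)


# Toy sequences (made up): the diverse line carries an insertion
# that the single reference lacks.
reference = "ACGTACGTAC"
diverse_line = "ACGTTTGGACGTAC"
second_accession = "CGTTTGGACG"  # another accession carrying the same insertion

single_ref_fraction = novel_kmer_fraction(diverse_line, [reference], k=4)
pan_fraction = novel_kmer_fraction(diverse_line, [reference, second_accession], k=4)
print(single_ref_fraction, pan_fraction)
```

With only the reference in training, most of the diverse line's 4-mers are unseen; adding one accession that carries the insertion brings the unseen fraction to zero in this toy case.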

The Pangenome as Training Input

A pangenome, in its graph-based representation, encodes the full sequence diversity of a set of accessions in a single data structure. The graph stores sequence as nodes: sequence shared between accessions appears once, and each accession is recorded as a path through the graph, with accession-specific sequence forming branches off the shared backbone. Such graphs are produced by tools such as minigraph-cactus (used in the Human Pangenome Reference Consortium's work), which can be adapted for crop genomes with modifications to handle the polyploidy and repeat expansion common in plant species.
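The node-and-path structure is easiest to see in a tiny example. Pangenome graphs are commonly exchanged in GFA format, where `S` lines hold node sequences, `L` lines hold edges, and `P` lines hold each accession's walk through the nodes. The sketch below parses a made-up three-node GFA and spells out each accession's sequence; the parser is deliberately minimal (it ignores links, tags, and overlaps) and the accession names are hypothetical.

```python
from typing import Dict, List, Tuple

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Toy GFA 1 graph: nodes 1 and 3 are shared, node 2 is an insertion
# carried only by alt_accession.
TOY_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTTGG",
    "S\t3\tGACC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\tref_accession\t1+,3+\t*",
    "P\talt_accession\t1+,2+,3+\t*",
])

Step = Tuple[str, str]  # (segment ID, orientation "+" or "-")


def parse_gfa(text: str) -> Tuple[Dict[str, str], Dict[str, List[Step]]]:
    """Collect segment sequences (S lines) and paths (P lines) only."""
    segments: Dict[str, str] = {}
    paths: Dict[str, List[Step]] = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return segments, paths


def path_sequence(segments: Dict[str, str], steps: List[Step]) -> str:
    """Spell out the linear sequence of one path through the graph."""
    parts = []
    for seg_id, orient in steps:
        seq = segments[seg_id]
        if orient == "-":  # reverse-complement segments traversed backwards
            seq = seq.translate(COMPLEMENT)[::-1]
        parts.append(seq)
    return "".join(parts)


segments, paths = parse_gfa(TOY_GFA)
node_sets = [{seg for seg, _ in steps} for steps in paths.values()]
shared_nodes = set.intersection(*node_sets)
for name, steps in paths.items():
    print(name, path_sequence(segments, steps))
print("shared nodes:", sorted(shared_nodes))
```

Nodes visited by every path correspond to shared sequence, and nodes visited by only some paths correspond to accession-specific sequence.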

We represent pangenome training data not as a single linear FASTA but as a set of haplotype-resolved sequences from multiple accessions, concatenated with positional embeddings that encode the accession of origin and the location within the pangenome graph. During pre-training, the model sees the same genomic interval from multiple accession backgrounds, allowing it to learn which sequence variations are associated with which haplotype contexts.
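As an illustration of what such a training record could look like, the sketch below attaches an accession index and a per-base graph coordinate to each haplotype sequence. The `HaplotypeExample` structure, the `(node ID, offset)` coordinate scheme, and the toy graph are assumptions made for this example, not a description of a specific production encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class HaplotypeExample:
    accession: str
    accession_id: int  # index that an accession-of-origin embedding could use
    bases: str
    graph_positions: List[Tuple[str, int]]  # (node ID, offset within node) per base


def build_examples(
    nodes: Dict[str, str], paths: Dict[str, List[str]]
) -> List[HaplotypeExample]:
    """Turn each accession's path into a sequence annotated with graph coordinates."""
    accession_ids = {name: i for i, name in enumerate(sorted(paths))}
    examples = []
    for name, node_ids in paths.items():
        bases: List[str] = []
        positions: List[Tuple[str, int]] = []
        for node_id in node_ids:
            for offset, base in enumerate(nodes[node_id]):
                bases.append(base)
                positions.append((node_id, offset))
        examples.append(
            HaplotypeExample(name, accession_ids[name], "".join(bases), positions)
        )
    return examples


# Toy graph (made up): n2 is an insertion present only in acc_alt.
nodes = {"n1": "ACGT", "n2": "TTGG", "n3": "GACC"}
paths = {"acc_ref": ["n1", "n3"], "acc_alt": ["n1", "n2", "n3"]}

by_name = {ex.accession: ex for ex in build_examples(nodes, paths)}
print(by_name["acc_ref"].graph_positions[4], by_name["acc_alt"].graph_positions[8])
```

The first base of the shared node `n3` sits at linear offset 4 in one accession and offset 8 in the other, yet both records give it the same graph coordinate. That shared coordinate is what lets a model line up the same genomic interval across accession backgrounds.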

The practical effect is that our model develops representations of allelic series rather than single alleles. Where a single-reference model sees "the Rht-B1a locus," our pangenome-trained model sees "Rht-B1a, Rht-B1b, and intermediate alleles, all in the context of their surrounding haplotype backgrounds across 40 wheat accessions." The downstream benefit for trait prediction is that the model can embed a novel accession's variant at that locus into a representation space that is informed by observed allelic diversity rather than deviations from one canonical sequence.

Which Species We Have Pangenome Depth For

Building a pangenome requires chromosome-level assemblies from multiple accessions. As of 2025, the crop species with sufficient multi-accession, chromosome-level data for pangenome construction form a short list that is heavily biased toward the major crops.

Zea mays has the most developed crop pangenome resources: multiple North American inbred line assemblies, and Chia et al.'s earlier work characterising the structural variant landscape across diverse lines. Triticum aestivum has an IWGSC-coordinated multi-accession assembly effort, though the hexaploid genome's complexity (17 Gb, A/B/D subgenome disambiguation) makes pangenome graph construction computationally intensive. Oryza sativa has multiple high-quality indica and japonica assemblies. Solanum lycopersicum has pangenome characterisation work across cultivated and wild accessions.

For most other crop species, including chickpea, sunflower, oilseed rape, and most other Brassica crops, multi-accession chromosome-level assemblies are only now becoming available, and pangenome-scale training for those species remains prospective work for us.

A Concrete Example: Soybean Diversity

Consider Glycine max. The Williams 82 reference assembly comes from a North American elite variety. It was the first soybean genome assembled and has been comprehensively annotated. But soybean was domesticated in East Asia, and the diversity in Chinese and Japanese landrace accessions is substantially wider than in North American elite germplasm. Several genes relevant to flowering time control and seed composition are structurally polymorphic between the Williams 82 background and Asian landrace backgrounds.

We added four soybean accessions to our pangenome training set beyond Williams 82 — all publicly available from NCBI with chromosome-level assemblies. Training with this five-accession set, compared to Williams 82 alone, improved embedding separation between East Asian landraces and North American elite lines in our sequence representation space. Practically, predictions for soybean accessions with significant Asian landrace ancestry had lower uncertainty when that diversity was present in pre-training.
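"Embedding separation" can be measured in several ways. One simple option, shown here purely as an assumed illustration rather than the metric behind the result above, is the distance between the two group centroids divided by the mean distance of points to their own centroid. The vectors are toy values, not real embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Per-dimension mean of a group of embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_score(group_a: List[Vector], group_b: List[Vector]) -> float:
    """Between-centroid distance divided by mean within-group spread."""
    ca, cb = centroid(group_a), centroid(group_b)
    between = euclidean(ca, cb)
    spread = sum(euclidean(v, ca) for v in group_a) + sum(
        euclidean(v, cb) for v in group_b
    )
    within = spread / (len(group_a) + len(group_b))
    return between / within if within > 0 else math.inf


# Toy 2-D "embeddings" for two germplasm groups (made-up values).
landrace_embeddings = [[0.0, 0.0], [0.0, 2.0]]
elite_embeddings = [[4.0, 0.0], [4.0, 2.0]]

score = separation_score(landrace_embeddings, elite_embeddings)
print(score)
```

A higher score means the two groups sit further apart relative to their internal spread. Comparing the score for the same accessions under two pre-training regimes is one way to turn a claim about improved separation into a number.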

Pangenome Complexity and Its Training Costs

We want to be clear about the limitations of our current approach. Pangenome-aware training is more expensive than single-reference training, both in data preparation time and in compute. For polyploid species such as wheat, constructing a well-phased pangenome across multiple accessions is a substantial bioinformatics effort before any model training begins. The graph pangenome representation requires custom tokenisation that preserves path identity — a detail that is nontrivial to implement correctly and that we got wrong on our first attempt.

Additionally, pangenome graphs for highly repetitive genomes can produce training sequences with ambiguous repeat context — regions where multiple accessions contribute divergent sequences but the divergence reflects transposable element insertion polymorphism rather than functional sequence variation. We apply repeat masking to these regions before tokenisation, but the boundary between "informative structural variant" and "repeat insertion noise" is not always clear. This is an ongoing problem in the pangenome training literature, not just in our implementation.
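A minimal sketch of a masking step is shown below. It assumes the input is soft-masked in the common convention where repeats appear in lowercase, and it replaces lowercase runs with `N` before tokenisation. The `min_len` threshold is a hypothetical knob that illustrates the boundary problem: any cutoff between masked repeat and retained variant is a judgement call.

```python
import re
from typing import List, Tuple


def repeat_intervals(soft_masked: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of lowercase (soft-masked) runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", soft_masked)]


def hard_mask(soft_masked: str, min_len: int = 1) -> str:
    """Replace soft-masked runs of at least `min_len` bases with N.

    Shorter lowercase runs are kept and simply uppercased.
    """
    out = list(soft_masked)
    for start, end in repeat_intervals(soft_masked):
        if end - start >= min_len:
            out[start:end] = "N" * (end - start)
    return "".join(out).upper()


# Toy soft-masked sequence (made up): one 8 bp and one 2 bp lowercase run.
toy = "ACGTacgtacgtGGCCaaTT"
print(repeat_intervals(toy))
print(hard_mask(toy, min_len=4))
```

With `min_len=4` the 8 bp run is masked and the 2 bp run is kept. Whether a given transposable element insertion should survive this step as an informative structural variant is exactly the question a fixed threshold cannot answer.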

Why We Believe This Matters Long-Term

The trajectory of crop breeding is toward wider germplasm utilisation. As climate adaptation pressure increases, programmes are revisiting wild relatives and landraces that carry traits not present in their elite pools — drought tolerance alleles from Oryza glaberrima for rice improvement, heat tolerance from Solanum pennellii for tomato, nitrogen-use efficiency variation from African sorghum landraces. These materials carry sequence diversity that a single-reference-trained model has not seen.

A model that was trained on pangenome data across the range of available accessions for a given species is better positioned to embed novel germplasm into a prediction space that was built on observed diversity rather than extrapolated from a single reference. That is the core motivation for our pangenome training strategy, and why we are investing disproportionate effort in acquiring and processing multi-accession assemblies for the species we care about most, even when the data preparation burden is high.