The Hidden Problem in Plant Genomic Training Data

Before we trained a single model weight, we spent four months cleaning data. Spending that long on data before any training surprised even us. The received wisdom in machine learning is that data quality matters less at scale, because a large enough corpus can absorb noise. For plant genomics, we found this assumption to be wrong, for reasons specific to how plant sequence databases are organized and curated.

Where the Contamination Comes From

Public plant sequence archives (NCBI SRA, Ensembl Plants, GenBank) are genuinely valuable resources. They represent decades of sequencing effort from thousands of research groups. They are also aggregations of data produced under wildly varying quality standards, with metadata that was entered by humans in a hurry, and with alignments to reference genomes that were state-of-the-art in 2012 and have not been redone since.

The problem categories we identified in our pre-training corpus, in rough order of severity:

Assembly Artifacts

Plant genome assembly is hard. Polyploidy, high repeat content, and GC extremes combine to produce misassemblies that look like legitimate sequence but represent chimeric joins between distant chromosomal regions. When a model sees a 512-token window that spans an assembly artifact, it is learning a sequence grammar that does not exist in any real plant cell. The model has no way to know this. It simply memorizes the pattern.

We identified assembly artifacts primarily by cross-referencing against updated reference versions. A locus in the IWGSC wheat reference version 1.0 that was substantially revised or relocated in version 2.1 is likely an artifact in the original assembly. We flagged approximately 3.8% of wheat sequences this way. For older legume references — some soybean (Glycine max) chromosomal scaffolds from pre-2016 assemblies — the flagging rate was over 11%.
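The cross-referencing step can be sketched roughly as follows. This is an illustration, not our production code: it assumes each locus has already been mapped from the old reference to the new one by an external liftover or whole-genome alignment step, and the `LocusMapping` record, the `flag_probable_artifacts` helper, and both thresholds are hypothetical. A large coordinate shift is only a crude proxy for relocation, since upstream insertions also shift coordinates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocusMapping:
    """One locus in the old reference and where it landed in the new one.

    Hypothetical record; in practice it would be filled in from the output
    of whatever liftover or alignment tool is used.
    """
    locus_id: str
    old_chrom: str
    old_start: int
    new_chrom: Optional[str]   # None if the locus could not be mapped
    new_start: Optional[int]   # None if the locus could not be mapped
    identity: float            # fraction of identical bases after mapping


def flag_probable_artifacts(mappings, min_identity=0.95, max_shift=1_000_000):
    """Return IDs of loci that were dropped, relocated, or heavily revised.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    flagged = []
    for m in mappings:
        unmapped = m.new_chrom is None or m.new_start is None
        moved_chrom = not unmapped and m.new_chrom != m.old_chrom
        shifted = (not unmapped and not moved_chrom
                   and abs(m.new_start - m.old_start) > max_shift)
        revised = m.identity < min_identity
        if unmapped or moved_chrom or shifted or revised:
            flagged.append(m.locus_id)
    return flagged
```

Sequences overlapping a flagged locus would then be excluded or down-weighted before tokenization.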

Mislabeled Cultivar Records

Cultivar and accession metadata in SRA is notoriously unreliable. We found cases where the same cultivar name was used for genetically distinct breeding lines deposited by different institutions, cases where accession identifiers had been reused across different submission batches, and cases where the species label was clearly wrong (a maize record labeled as Sorghum bicolor, a sweet potato sequence filed under Solanum tuberosum). The last category is particularly damaging for pre-training because it teaches the model incorrect cross-species correspondences.

We validated a random sample of 40,000 records against GRIN Taxonomy and Ensembl Plants species identifiers. 6.2% had metadata discrepancies ranging from minor (cultivar synonym ambiguity) to severe (wrong species). For pre-training, we excluded any record where the species assignment could not be confirmed against an authoritative taxonomy source.
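A minimal sketch of the species-confirmation rule, assuming the accepted names and synonyms have been exported from the taxonomy source into a local lookup table. The table contents, record layout, and function names here are hypothetical, and no real GRIN or Ensembl API is called.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so 'Zea  Mays ' matches 'zea mays'."""
    return " ".join(name.lower().split())


def build_taxonomy_index(accepted_to_synonyms):
    """Map every accepted name and synonym to its accepted species name."""
    index = {}
    for accepted, synonyms in accepted_to_synonyms.items():
        index[normalize_name(accepted)] = accepted
        for synonym in synonyms:
            index[normalize_name(synonym)] = accepted
    return index


def confirm_species(record_label, index):
    """Return the accepted species name, or None if it cannot be confirmed."""
    return index.get(normalize_name(record_label))


# Illustrative table only; a real one would be exported from the taxonomy source.
TAXONOMY = build_taxonomy_index({
    "Zea mays": ["Zea mays subsp. mays"],
    "Glycine max": [],
    "Ipomoea batatas": [],
})

# Hypothetical records; only those with a confirmable label are kept.
records = [
    {"id": "rec1", "species": "zea mays"},
    {"id": "rec2", "species": "Zea mays subsp. mays"},
    {"id": "rec3", "species": "Unknown grass"},
]
kept = [r["id"] for r in records if confirm_species(r["species"], TAXONOMY)]
```

A name check like this only confirms that the label is a recognized species. Catching a maize record filed under a valid but wrong species name requires a sequence-level check on top of it.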

Outdated Reference Alignments

Many publicly available resequencing datasets include both raw reads and derived files, such as BAM alignments or VCF variant calls, produced against an older reference. When we used the derived sequences (reconstructed by applying VCF records to the reference) rather than raw reads, we sometimes got artifacts from indel representation differences between reference versions. A deletion called against Zm-B73-REFERENCE-GRAMENE-4.0 may be represented differently against Zm-B73-REFERENCE-NAM-5.0 at the same locus, and mixing data aligned against different references introduced systematic biases into k-mer distributions at indel sites.

Our solution was to exclude derived sequences entirely where raw reads were available, and to re-align raw reads to the most current reference for each species before extracting training sequences. This added weeks of compute time but eliminated the reference version contamination.
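The decision rule can be sketched as below. The `Dataset` record and `plan_action` helper are hypothetical. The text above does not say what happens to datasets that have only derived sequences, so the last two branches are assumptions: keep derived sequences only when they already sit on the current reference, and exclude everything else.

```python
from dataclasses import dataclass

# Assumed lookup of the most current reference per species; the maize entry
# is the reference named in the text, other species would be added alongside.
CURRENT_REFERENCE = {"Zea mays": "Zm-B73-REFERENCE-NAM-5.0"}


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of one deposited resequencing dataset."""
    accession: str
    species: str
    has_raw_reads: bool
    derived_reference: str  # reference the BAM/VCF files were produced against


def plan_action(ds):
    """Decide how a dataset enters the training corpus."""
    current = CURRENT_REFERENCE.get(ds.species)
    if ds.has_raw_reads:
        # Raw reads always win: drop derived sequences and re-align the reads.
        return "realign_raw_reads"
    if current is not None and ds.derived_reference == current:
        # Assumption: derived sequences on the current reference are usable.
        return "use_derived"
    # Assumption: without reads, an old-reference dataset cannot be normalized.
    return "exclude"
```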

The Near-Duplicate Problem

Plant breeding programs sequence many related lines. A seed company resequencing 500 lines from an elite maize panel will produce 500 datasets that share 95–99% of their genomic content, because the lines are closely related. If all 500 are included in pre-training data without deduplication, the model effectively sees the same sequence hundreds of times with minor variation. This biases the pre-training distribution toward elite germplasm and away from the wild relatives and landraces that carry most of the allelic diversity we want the model to understand.

We applied MinHash-based near-duplicate detection with a Jaccard similarity threshold of 0.85 at the 8-mer level, grouping near-duplicate sequences and retaining one representative per group, selected by assembly quality metrics. After deduplication, our corpus shrank by 34% by record count but by less than 12% by unique sequence content, which confirms the extent of elite germplasm over-representation in public databases.
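For readers unfamiliar with MinHash, here is a small pure-Python sketch of the idea. The 8-mer length and the 0.85 threshold come from the description above; the signature length of 128, the BLAKE2b-based hash family, and the greedy grouping strategy are assumptions made for illustration. The sketch also skips the quality-based choice of representative.

```python
import hashlib

K = 8              # k-mer length, from the text
THRESHOLD = 0.85   # Jaccard similarity threshold, from the text
NUM_HASHES = 128   # assumption: number of hash functions per signature


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, num_hashes=NUM_HASHES):
    """Signature = minimum hash value over all k-mers, for each seed."""
    ks = kmers(seq)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def group_near_duplicates(seqs, threshold=THRESHOLD):
    """Greedy grouping: join the first group whose representative matches."""
    groups = []  # list of (representative_signature, member_ids)
    for seq_id, seq in seqs.items():
        sig = minhash_signature(seq)
        for rep_sig, members in groups:
            if estimated_jaccard(sig, rep_sig) >= threshold:
                members.append(seq_id)
                break
        else:
            groups.append((sig, [seq_id]))
    return [members for _, members in groups]
```

Comparing every sequence against every group representative scales poorly. At corpus scale, locality-sensitive hashing over signature bands is the usual way to avoid all-pairs comparison.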

Our Filtering Pipeline

The full filtering pipeline runs in four stages. The API call for users integrating their own private datasets follows the same logic:

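from livingmodels import SequenceFilter

# Configure the filter for maize against the NAM-5.0 reference.
sf = SequenceFilter(
    species="Zea mays",
    reference_version="NAM-5.0",
    min_contig_n50=50_000,
    repeat_mask_model="species_specific"
)

# Run the pipeline on a FASTA file of private accessions and print a summary.
filtered = sf.run(input_fasta="my_accessions.fa")
print(filtered.stats())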

The four stages are: (1) assembly quality scoring against N50, L90, and BUSCO completeness thresholds; (2) metadata validation against GRIN Taxonomy and Ensembl species identifiers; (3) reference version normalization — re-aligning to the current reference where discrepancies are detected; (4) near-duplicate removal with taxonomically balanced resampling to maintain species coverage.
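The contiguity metrics in stage (1) are simple to compute from contig lengths. The sketch below shows the standard definitions of N50 and L90. The `passes_assembly_quality` helper and its `max_l90` cutoff are hypothetical; the `min_n50` default only mirrors the `min_contig_n50` value in the example above. BUSCO completeness needs the external BUSCO tool and is left out.

```python
def nx_lx(contig_lengths, fraction):
    """Return (Nx, Lx) for a coverage fraction such as 0.5 or 0.9.

    Contigs are sorted longest first. Lx is the number of contigs needed to
    cover `fraction` of the total assembly length, and Nx is the length of
    the last contig added to reach that coverage.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    lengths = sorted(contig_lengths, reverse=True)
    target = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, count


def passes_assembly_quality(contig_lengths, min_n50=50_000, max_l90=10_000):
    """Illustrative gate on contiguity; both cutoffs are assumptions."""
    n50, _ = nx_lx(contig_lengths, 0.5)
    _, l90 = nx_lx(contig_lengths, 0.9)
    return n50 >= min_n50 and l90 <= max_l90
```

Higher N50 and lower L90 both indicate a more contiguous assembly, which is why the gate checks them in opposite directions.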

After all four stages, our 200M+ sequence corpus represented a genuinely diverse sample of plant sequence space rather than a repeat-heavy, elite-germplasm-biased collection that would have introduced systematic prediction errors in any model trained on it.

What Dirty Data Does to Model Behavior

We ran an ablation: we trained the same architecture on the uncleaned corpus and on the cleaned corpus, then evaluated both on held-out trait prediction tasks. The uncleaned model performed comparably on traits well represented in elite germplasm (yield in temperate maize) and substantially worse on traits where diversity matters (disease resistance, drought tolerance, and other abiotic stress responses in non-elite backgrounds). On those diversity-dependent traits, the cleaned model's Pearson r was consistently 8 to 14 points higher (that is, 0.08 to 0.14 on the correlation scale).
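For reference, this kind of comparison can be computed as in the sketch below. It assumes that a "point" means one hundredth of Pearson r, and the helper names are hypothetical. It is not our evaluation harness.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between observed and predicted trait values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def improvement_in_points(observed, preds_baseline, preds_candidate):
    """Difference in Pearson r between two models, in hundredths of r."""
    baseline = pearson_r(observed, preds_baseline)
    candidate = pearson_r(observed, preds_candidate)
    return 100 * (candidate - baseline)
```

Here `observed` would be the held-out trait measurements, `preds_baseline` the predictions of the model trained on the uncleaned corpus, and `preds_candidate` those of the model trained on the cleaned corpus.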

This is the expected direction of effect, but the magnitude surprised us. We had not anticipated that elite germplasm bias in pre-training would affect downstream fine-tuning so severely. Our working hypothesis is that the uncleaned model learned embedding representations that cluster by genetic distance to elite lines rather than by functional biology. When fine-tuning data comes from a diverse wild germplasm collection, the embeddings separate samples along dimensions that carry little information about the trait being predicted.

A Note on Private Data Integration

Breeding organizations often ask whether they can include proprietary resequencing data from their own lines in the fine-tuning stage without running the full filtering pipeline. The short answer is: it depends on how the data was generated. If the data was aligned to a current reference and if the metadata was entered carefully, the assembly artifact and reference mismatch problems are unlikely to apply. The near-duplicate problem is still a concern for tightly related elite panels, and we recommend running the deduplication stage even on private data before fine-tuning.

We are not saying that every plant genomics dataset is low quality. Many groups deposit carefully curated, well-documented sequences. The problem is that you cannot tell which datasets are which without checking, and at the scale required for foundation model pre-training, manual curation is not an option. The pipeline exists because automated filtering is the only approach that scales.