
The State of Open Plant Sequence Datasets in 2025

Cyril Veran

Training a plant genomics foundation model requires a dataset that is phylogenetically broad, consistent in assembly quality, and accessible under terms that permit publication of derivative models. We have spent the better part of two years mapping what is actually available. This article is a snapshot of that landscape as of mid-2025. It is not a comprehensive database survey but a practitioner's account of what we found useful, what is theoretically available but practically difficult to use, and where the gaps remain that matter most for crop-relevant genomic modelling.

The Major Public Repositories and What They Contain

NCBI SRA and GenBank. The Sequence Read Archive remains the largest single repository of raw plant sequencing data. Because we pre-train on assembled or near-assembled plant genomes, we primarily use NCBI's RefSeq plant genome collection and GenBank's WGS (whole genome shotgun) division. As of mid-2025, RefSeq includes annotated reference assemblies for approximately 250 plant species at chromosome or near-chromosome level, with Poaceae (grasses) and Solanaceae substantially overrepresented relative to most other plant families. Raw SRA reads are useful for token diversity in pre-training, but the heterogeneity in assembly quality requires careful filtering, a point we addressed in a prior post on training data quality.

Ensembl Plants. Ensembl Plants provides consistently formatted gene annotations and genome assemblies for approximately 80 crop and model plant species, with particular strength in the major cereals: Triticum aestivum (IWGSC RefSeq v2.1), Hordeum vulgare (Morex v3), Zea mays (B73 RefGen v5), Oryza sativa (IRGSP-1.0), and Glycine max (Wm82.a4). Its value for foundation model training lies in two things. The first is the consistent annotation schema: GFF3 files follow a defined structure that simplifies feature extraction. The second is the curated repeat masking, which matters when you are deciding whether to train on masked or unmasked sequence. The limitation is that 80 species is a narrow slice of plant diversity, and genome version updates can lag community releases by 12–18 months.

GRIN (Germplasm Resources Information Network). GRIN holds genotype data for germplasm collections maintained by the USDA National Plant Germplasm System. This is less useful for raw sequence pre-training and more relevant for trait association studies and fine-tuning with phenotyped accession data. GRIN genotype data typically comes from marker arrays or genotyping-by-sequencing (GBS), not whole-genome resequencing, which limits its direct utility for sequence-level foundation model inputs. We use GRIN primarily as a source of phenotyped accessions for fine-tuning validation rather than pre-training.

SOL Genomics Network (SGN). SOL Genomics is the primary community resource for Solanaceae: tomato, potato, pepper, and related species. The SGN genome browser and bulk download resources provide access to Solanum lycopersicum (SL4.0, SL5.0), Solanum pennellii, Solanum tuberosum (DM v6.1), and Capsicum annuum assemblies, along with associated RNA-seq and SNP datasets. For anyone working on Solanaceae prediction tasks, SGN is more current and better curated than the same species pulled from a generic GenBank search.

LegumeInfo. The legume genomics community has invested significantly in coordinated data resources. LegumeInfo provides Glycine max, Phaseolus vulgaris, Medicago truncatula, Cicer arietinum, and over 30 other legume genomes in a consistent format. Legume genomes are valuable for model training because they include diploid and polyploid species, nitrogen fixation gene families, and trait associations distinct from the grass-dominated training sets of most plant genomics models.

Assembly Quality: What "Available" Actually Means for Training

The nominal count of available plant genomes in public repositories is misleading for training purposes. A significant fraction of what appears in GenBank as a "genome assembly" is a scaffold-level draft with N50 values in the hundreds of kilobases, high repeat content, and no chromosome-level scaffolding. We have found — empirically, through pre-training experiments — that including low-quality assemblies in training data degrades model performance on downstream crop prediction tasks. The signal from fragmented assemblies with poor repeat resolution competes with the regulatory region and coding sequence context that matters for trait-relevant predictions.
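For readers less familiar with the N50 statistic mentioned above, the following is a minimal sketch of how it is computed from a list of scaffold or contig lengths. The `n50` function name and both example length lists are invented for illustration; they do not describe any real assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical scaffold-level draft: many sub-megabase pieces.
draft_lengths = [800_000, 600_000, 400_000, 200_000, 100_000, 100_000]

# Hypothetical chromosome-level assembly: a few very long sequences.
chromosome_lengths = [60_000_000, 50_000_000, 40_000_000]

draft_n50 = n50(draft_lengths)
chromosome_n50 = n50(chromosome_lengths)
print(draft_n50, chromosome_n50)
```

N50 alone says nothing about correctness or completeness, which is one reason it is best used alongside other criteria rather than as the sole threshold.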

Our practical filtering criteria: chromosome-level assembly (NCBI Assembly level = "Chromosome" or "Complete Genome"), BUSCO completeness greater than 90% against the embryophyta_odb10 lineage dataset, and available repeat masking annotation or a RepeatModeler-based repeat library for the species. By these criteria, the usable set of publicly available plant genomes for foundation model pre-training narrows to approximately 80–110 species as of mid-2025, depending on how strictly you apply quality thresholds.
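The three criteria above can be sketched as a single predicate over assembly metadata. This is an illustration under stated assumptions: the `AssemblyRecord` fields are hypothetical names for metadata you would have to collect yourself (NCBI assembly level, a BUSCO completeness percentage against embryophyta_odb10, and repeat resource availability), and the example records are placeholders, not real species.

```python
from dataclasses import dataclass

ACCEPTED_LEVELS = {"Chromosome", "Complete Genome"}
MIN_BUSCO_COMPLETE = 90.0  # percent complete BUSCOs, embryophyta_odb10


@dataclass(frozen=True)
class AssemblyRecord:
    # All field names are hypothetical; populate them from your own metadata.
    species: str
    assembly_level: str          # NCBI "Assembly level" value
    busco_complete_pct: float    # BUSCO completeness, in percent
    has_repeat_annotation: bool  # repeat masking annotation is available
    has_repeat_library: bool     # RepeatModeler-based library is available


def passes_filter(rec: AssemblyRecord, min_busco: float = MIN_BUSCO_COMPLETE) -> bool:
    """Apply the three criteria: level, BUSCO completeness, repeat resources."""
    return (
        rec.assembly_level in ACCEPTED_LEVELS
        and rec.busco_complete_pct > min_busco
        and (rec.has_repeat_annotation or rec.has_repeat_library)
    )


# Placeholder records for illustration only.
candidates = [
    AssemblyRecord("species_a", "Chromosome", 97.2, True, False),
    AssemblyRecord("species_b", "Scaffold", 95.0, True, True),
    AssemblyRecord("species_c", "Chromosome", 88.5, True, False),
    AssemblyRecord("species_d", "Complete Genome", 93.1, False, False),
    AssemblyRecord("species_e", "Chromosome", 91.0, False, True),
]

usable = [rec.species for rec in candidates if passes_filter(rec)]
print(usable)
```

Keeping the BUSCO threshold as a parameter makes it easy to see how the usable set shrinks or grows as the threshold is tightened or relaxed, which is where the 80–110 range comes from.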

Licensing and Data Use Constraints

This is a topic that the genomics community does not discuss enough in the context of model training. A meaningful fraction of high-quality plant genome assemblies in public databases were produced under sequencing consortium agreements that specify data use terms extending beyond standard GenBank deposition. Some Triticeae assemblies, for example, were released under Fort Lauderdale Agreement-style terms, which expect users to acknowledge the data producers and to respect their right to publish the first genome-wide analyses. Whether those expectations extend to derived data, arguably including models trained on the assemblies, is not settled.

We are not lawyers, and this landscape is genuinely ambiguous. What we have done practically: for species where consortium agreement terms are clearly permissive (Creative Commons or equivalent), we include the assembly unconditionally. For species with ambiguous terms, we include the data in pre-training and document it in our model provenance record, treating the provenance transparency as sufficient for academic use. For commercial deployment contexts, we advise clients to review their specific use case against the original data deposition terms for any assembly that comes from a sequencing consortium rather than a single lab's direct deposition.

Gaps That Matter for Crop Genomics Specifically

Several gaps in the public dataset landscape have direct consequences for crop trait prediction quality. First, there is a near-complete absence of high-quality assemblies for African orphan crops such as teff (Eragrostis tef), fonio (Digitaria exilis), and African nightshade (Solanum scabrum). The assemblies that do exist are at scaffold level with low BUSCO completeness. Any foundation model trained on currently available public data will perform poorly on these species, which matters as climate adaptation drives breeding programmes toward more heat- and drought-tolerant orphan species.

Second, intraspecific diversity is underrepresented. Most species contribute one or two reference assemblies to public databases, typically derived from a single elite cultivar or model accession. For crop prediction tasks that require generalisation across landrace and wild-relative germplasm, a single reference assembly provides insufficient haplotype context. Pangenome projects are beginning to address this — the wheat pangenome consortium has released multiple accession assemblies, and similar efforts exist for maize — but for most species, public pangenome-scale data does not yet exist. This is a fundamental limitation for building models that generalise beyond elite germplasm.

Third, RNA-seq datasets for less-studied tissue types are sparse. Our model makes use of gene expression context where available, and that context is uneven: root, pollen, and stress-induced transcriptomes are available for the major crop species but thin for nearly everything else. Gene models in many non-crop assemblies are computationally predicted only, with little RNA-seq support, which leaves exon boundaries uncertain in those assemblies.

What We Use and What We Would Prioritise If We Had More Time

Our current pre-training corpus spans 93 plant species meeting our quality thresholds, with particular depth in Poaceae (23 species), Solanaceae (12 species), and Fabaceae (18 species). We supplement this with curated raw read datasets from NCBI SRA for 40 additional species where only draft-level assemblies exist, using only reads that map to known gene space from closely related species as a quality filter.
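As an illustration of the gene-space read filter, here is a minimal sketch that keeps only reads whose alignment overlaps an annotated gene on a related reference. It assumes alignments have already been produced by an aligner and parsed into `(read_id, seqid, start, end)` tuples with 0-based, half-open coordinates. The function names and toy coordinates are hypothetical, and this is not a description of our production pipeline.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_gene_index(genes):
    """Build a per-sequence index from (seqid, start, end) gene records."""
    by_seq = defaultdict(list)
    for seqid, start, end in genes:
        by_seq[seqid].append((start, end))
    index = {}
    for seqid, intervals in by_seq.items():
        merged = merge_intervals(intervals)
        index[seqid] = ([s for s, _ in merged], merged)
    return index


def overlaps_gene(index, seqid, start, end):
    """True if the half-open alignment [start, end) overlaps any gene."""
    if seqid not in index:
        return False
    starts, merged = index[seqid]
    # Last merged gene interval that begins before the alignment ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start


# Toy gene space and alignments, invented for illustration.
genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
alignments = [
    ("r1", "chr1", 290, 340),  # overlaps the merged chr1 gene region
    ("r2", "chr1", 300, 350),  # starts exactly where the gene region ends
    ("r3", "chr2", 10, 51),    # overlaps the chr2 gene by one base
    ("r4", "chr3", 0, 100),    # sequence with no annotated genes
]

gene_index = build_gene_index(genes)
kept = [rid for rid, seqid, s, e in alignments if overlaps_gene(gene_index, seqid, s, e)]
print(kept)
```

Merging the gene intervals first keeps each lookup to a single binary search, which matters when the number of reads is far larger than the number of genes.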

If we were allocating effort to expand the public data landscape, we would prioritise: (1) chromosome-level assemblies for the major African cereal landraces, (2) multi-accession assemblies for chickpea and cowpea to enable pangenome-scale representation, and (3) standardised phenotype records linked to existing germplasm bank accessions — currently, the sequence-phenotype connection requires extensive manual curation even when both data types nominally exist in public databases.

The data is getting better each year, and the trajectory is positive. But for the specific task of building crop trait prediction models that work across diverse germplasm, we are still building on a narrower genomic foundation than the nominal assembly counts suggest.
