PlantSeqBERT: A Foundation Model for Multi-Species Plant Genomics Pre-trained on 200M+ Sequences
Veran, C., Marchetti, S., Boudoure, A. — Living Models SAS, Lyon, France
bioRxiv · Submitted January 2025
We describe the architecture, training protocol, and evaluation results for PlantSeqBERT, a 24-layer transformer encoder pre-trained on 200M+ plant genomic sequences across 47 crop species and 120 wild relative taxa. We report r² ≥ 0.79 on four held-out phenotype datasets and describe the k-mer tokenization strategy developed to handle the highly repetitive sequence content of plant genomes.
doi:10.1101/2025.01.14.plant-seq-bert [synthetic — not peer-reviewed]
Genotype-to-Phenotype Prediction for Drought Tolerance in Maize: Benchmarking Genomic Foundation Models Against GWAS and gBLUP
Veran, C., Fenouillat, P. — Living Models SAS; INRAE Auvergne Unit, Clermont-Ferrand
bioRxiv · Submitted March 2025
Using 12,000 maize accessions from the GRIN phenotype database as a held-out test set, we compare our foundation model embeddings against GWAS-only and gBLUP baselines for drought tolerance index prediction. Our model achieves r² = 0.847 versus 0.73 for gBLUP, with particularly strong gains for accessions from underrepresented genetic backgrounds.
doi:10.1101/2025.03.05.drought-maize-benchmark [synthetic — not peer-reviewed]
Adapter-Based Fine-Tuning of Genomic Foundation Models for Breeding Programs with 200–500 Labeled Accessions
Veran, C., Marchetti, S. — Living Models SAS, Lyon, France
bioRxiv · Submitted June 2025
We describe a parameter-efficient fine-tuning approach (adapter layers, LoRA-style injection into the attention pathway) that enables stable adaptation of our foundation model to breeding program-specific phenotype targets with as few as 200 labeled accessions. We evaluate the approach on four internal breeding program datasets and detail its calibration and out-of-distribution detection.
doi:10.1101/2025.06.20.adapter-fine-tuning [synthetic — not peer-reviewed]
Data Quality and Curation Practices for Plant Genomic Foundation Model Training: An Empirical Analysis
Boudoure, A., Veran, C. — Living Models SAS; Institut Pasteur de Lyon Computational Genomics Division
bioRxiv · Submitted September 2025
We characterize the distribution of quality-affecting artifacts in publicly available plant genomic sequence databases and describe the five-stage curation pipeline used to filter our 200M+ sequence training corpus. We find that 38% of raw sequences fail at least one quality criterion, a substantially higher rejection rate than those reported for protein sequence databases.
doi:10.1101/2025.09.11.data-curation-plant [synthetic — not peer-reviewed]