Publications

Technical notes and pre-prints from Living Models.

We share methodology openly before peer review. Below are our current pre-prints, the open evaluation datasets behind them, and details on how to collaborate with us.


Pre-prints

PlantSeqBERT: A Foundation Model for Multi-Species Plant Genomics Pre-trained on 200M+ Sequences

Veran, C., Marchetti, S., Boudoure, A. — Living Models SAS, Lyon, France

bioRxiv · Submitted January 2025

We describe the architecture, training protocol, and evaluation results for PlantSeqBERT — a 24-layer transformer encoder pre-trained on 200M+ plant genomic sequences across 47 crop species and 120 wild relative taxa. We report r² ≥ 0.79 on four held-out phenotype datasets and describe the k-mer tokenization strategy developed for plant genomic repetitiveness.

doi:10.1101/2025.01.14.plant-seq-bert [synthetic — not peer-reviewed]
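
For readers unfamiliar with k-mer tokenization, the idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the PlantSeqBERT tokenizer: the function name `kmer_tokenize`, the default `k=6`, the overlapping stride, and the `[UNK]` placeholder are all assumptions chosen for illustration, and the repetitiveness-specific handling described in the pre-print is not reproduced here.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mers of length k.

    Hypothetical helper: k, stride, and the "[UNK]" token are illustrative
    assumptions, not values taken from the pre-print.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    tokens = []
    # Slide a window of width k over the sequence, advancing by `stride`.
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        # Windows containing ambiguous bases (e.g. N) map to a placeholder.
        tokens.append(window if set(window) <= valid else "[UNK]")
    return tokens


if __name__ == "__main__":
    print(kmer_tokenize("ACGTAC", k=3))            # overlapping windows
    print(kmer_tokenize("ACGTAC", k=3, stride=3))  # non-overlapping windows
```

With `stride=1` every base appears in up to k tokens, which lengthens the input but keeps positional context; a stride equal to k gives shorter inputs at the cost of frame sensitivity.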

Genotype-to-Phenotype Prediction for Drought Tolerance in Maize: Benchmarking Genomic Foundation Models Against GWAS and gBLUP

Veran, C., Fenouillat, P. — Living Models SAS; INRAE Auvergne Unit, Clermont-Ferrand

bioRxiv · Submitted March 2025

Using 12,000 maize accessions from the GRIN phenotype database as a held-out test set, we compare our foundation model embeddings against GWAS-only and gBLUP baselines for drought tolerance index prediction. Our model achieves r² = 0.847 versus 0.73 for gBLUP, with particularly strong gains for accessions from underrepresented genetic backgrounds.

doi:10.1101/2025.03.05.drought-maize-benchmark [synthetic — not peer-reviewed]
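
When replicating a comparison like the one above, it helps to fix which definition of r² is being reported, since the coefficient of determination and the squared Pearson correlation can differ substantially. The summary above does not state which one is used, so the sketch below computes both; the function names are hypothetical and the toy values are not taken from the benchmark.

```python
from statistics import mean


def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_pearson(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]   # toy phenotype values
    predicted = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated but mis-scaled
    print(r2_pearson(observed, predicted))
    print(r2_determination(observed, predicted))
```

In the toy example the predictions are perfectly correlated with the observations but badly scaled, so the squared correlation is high while the coefficient of determination is negative. Reporting the same metric for every method keeps a comparison like this one interpretable.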

Adapter-Based Fine-Tuning of Genomic Foundation Models for Breeding Programs with 200–500 Labeled Accessions

Veran, C., Marchetti, S. — Living Models SAS, Lyon, France

bioRxiv · Submitted June 2025

We describe a parameter-efficient fine-tuning approach (adapter layers, LoRA-style injection into the attention pathway) that enables stable adaptation of our foundation model to breeding program-specific phenotype targets with as few as 200 labeled accessions. We evaluate the approach on four internal breeding program datasets and detail its calibration and out-of-distribution detection.

doi:10.1101/2025.06.20.adapter-fine-tuning [synthetic — not peer-reviewed]
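
As background for the LoRA-style injection mentioned above, the standard low-rank update leaves a pre-trained weight matrix W0 frozen and learns two small matrices A and B, so that the layer computes h = W0·x + (alpha / r)·B·(A·x). The sketch below shows that generic formulation in plain Python. It is not our adapter implementation: which attention projections are adapted, the rank r, and the scaling alpha are not specified here, and every name and value in the code is an illustrative assumption.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, W0, A, B, alpha):
    """Generic LoRA forward pass: h = W0 x + (alpha / r) * B (A x).

    W0 is the frozen pre-trained weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the only trainable parameters. All values used with
    this function here are hypothetical.
    """
    r = len(A)  # adapter rank = number of rows in A
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, r):
    """Share of parameters trained when adapting one d_out x d_in matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    W0 = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]
    B_zero = [[0.0], [0.0]]  # B starts at zero, so the adapter is a no-op
    print(lora_forward([1.0, 2.0], W0, A, B_zero, alpha=2.0))
    print(trainable_fraction(1024, 1024, 8))
```

Because B is conventionally initialized to zero, the adapted model starts out identical to the pre-trained one, and only a small fraction of weights is updated. That is one reason this family of methods is attractive when only a few hundred labeled accessions are available.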

Data Quality and Curation Practices for Plant Genomic Foundation Model Training: An Empirical Analysis

Boudoure, A., Veran, C. — Living Models SAS; Institut Pasteur de Lyon Computational Genomics Division

bioRxiv · Submitted September 2025

We characterize the distribution of quality-affecting artifacts in publicly available plant genomic sequence databases and describe the five-stage curation pipeline used to filter our 200M+ sequence training corpus. We find that 38% of raw sequences fail at least one quality criterion — a substantially higher rejection rate than reported for protein sequence databases.

doi:10.1101/2025.09.11.data-curation-plant [synthetic — not peer-reviewed]
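
The rejection rate quoted above counts a sequence as rejected if it fails at least one quality criterion. A minimal sketch of that bookkeeping follows. The three checks and their thresholds are hypothetical placeholders, not the five stages of the pipeline described in the pre-print.

```python
def passes_min_length(seq, min_len=50):
    """Hypothetical criterion: discard very short fragments."""
    return len(seq) >= min_len


def passes_ambiguity(seq, max_n_frac=0.05):
    """Hypothetical criterion: limit the share of ambiguous 'N' bases."""
    return seq.upper().count("N") <= max_n_frac * len(seq)


def passes_alphabet(seq):
    """Hypothetical criterion: only DNA symbols are allowed."""
    return set(seq.upper()) <= set("ACGTN")


# Placeholder checks for illustration only.
CHECKS = [passes_min_length, passes_ambiguity, passes_alphabet]


def rejection_rate(sequences, checks=CHECKS):
    """Fraction of sequences that fail at least one check."""
    if not sequences:
        return 0.0
    failed = sum(1 for s in sequences if not all(c(s) for c in checks))
    return failed / len(sequences)


if __name__ == "__main__":
    sample = ["ACGT" * 20, "ACGT", "N" * 60, "ACGU" * 20]
    print(rejection_rate(sample))
```

Counting "fails at least one criterion" rather than summing per-criterion failures avoids double-counting sequences that trip several checks.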

Open data

Open benchmark datasets

Evaluation sets used for our validation benchmarks — available under CC BY 4.0 for independent replication and method comparison.

LM-Maize-Drought-12k

12,000 maize accessions with drought tolerance phenotype records from GRIN. Includes 4-season field measurements across 3 water-stress environments. Sequence: SNP array + imputed WGS.

CC BY 4.0

Download dataset (Zenodo) →

LM-Tomato-Disease-3k

3,200 tomato accessions with inoculation trial results for Fusarium wilt and late blight. Phenotype from 2-year greenhouse inoculation study. Sequence: WGS, 15× coverage.

CC BY 4.0

Download dataset (Zenodo) →

LM-MultiCrop-Stability-8k

8,400 accessions across wheat, soybean, and sorghum with multi-location yield trial data from CIMMYT and national programs, spanning 3 climate zones across 6 countries.

CC BY 4.0

Download dataset (Zenodo) →

We co-develop training data with genomics labs.

The exchange is direct: your lab contributes phenotyped accessions (minimum 500 lines with measured trait values, genotyped either by WGS at 10× coverage or higher or by imputed SNP array). We contribute pre-training compute, model access during development, and co-authorship on resulting pre-prints. We currently have capacity for 2 new academic partnerships in 2026.

Propose a collaboration · Read our methodology