A foundation model built for plant genomes, not repurposed from protein.
Designed from the ground up for plant genomics: intron-aware tokenization, multi-species joint training, and variant-level embeddings validated against GWAS hits across five crops. Purpose-built, not adapted.
Three design decisions that matter
Every architecture choice was made specifically for plant genomics — not carried over from NLP or protein models.
k-mer tokenizer tuned for plant genomic repetitiveness
Many crop genomes are 40-80% repetitive sequence: transposons, tandem repeats, centromeric arrays. Standard BPE tokenizers tend to collapse these into noise. Our k-mer vocabulary was built from a representative cross-species corpus to preserve signal from repetitive regions rather than discard it.
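To make the idea concrete, here is a minimal sketch of overlapping k-mer tokenization. It is an illustration only, not the production tokenizer: the value k=6, the stride of 1, the `[UNK]` and `[MASK]` token names, and the function names `kmers`, `build_vocab`, and `encode` are all assumptions made for this example.

```python
from collections import Counter

def kmers(sequence, k=6, stride=1):
    """Yield overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = sequence.upper()
    for i in range(0, len(seq) - k + 1, stride):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield window

def build_vocab(sequences, k=6, min_count=1):
    """Map every k-mer seen at least min_count times to an integer id."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    # Ids 0 and 1 are reserved for special tokens (hypothetical names).
    vocab = {"[UNK]": 0, "[MASK]": 1}
    for kmer, count in sorted(counts.items()):
        if count >= min_count:
            vocab[kmer] = len(vocab)
    return vocab

def encode(sequence, vocab, k=6):
    """Convert a sequence to token ids; unseen k-mers map to [UNK]."""
    return [vocab.get(kmer, vocab["[UNK]"]) for kmer in kmers(sequence, k)]

# A tandem repeat stays visible as a periodic pattern of token ids.
tandem_repeat = "ACGACGACGACG"
vocab = build_vocab([tandem_repeat], k=3)
print(encode(tandem_repeat, vocab, k=3))  # [2, 3, 4, 2, 3, 4, 2, 3, 4, 2]
```

Because every position gets its own token, a repeat of period 3 shows up as a repeat of period 3 in the token stream instead of being merged into a few long subword units.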
Masked sequence modeling + next-region prediction
Two complementary objectives: masked k-mer prediction teaches local variant context; next-region prediction learns long-range genomic structure across introns and regulatory elements. Pre-trained on 200M+ sequences across 47 species in joint multi-species batches — population variation is encoded, not averaged out.
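The masked objective can be sketched as a simple input-corruption step. The snippet below only shows how training pairs might be prepared; it does not include a model, and the 15% mask rate, the `[MASK]` token, and the function name `mask_tokens` are assumptions for illustration, not details of the actual training setup.

```python
import random

MASK = "[MASK]"  # hypothetical special token

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Hide a random subset of tokens.

    Returns the corrupted token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate)) if tokens else 0
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets

tokens = ["ACG", "CGA", "GAC", "ACG", "CGT", "GTT", "TTA", "TAC", "ACC", "CCG"]
masked, targets = mask_tokens(tokens, mask_rate=0.2, seed=0)
print(masked)
print(targets)
```

One caveat: with overlapping k-mers, the neighbours of a masked token reveal most of its bases, so masking contiguous spans rather than isolated tokens is a common alternative.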
512-dim space validated against GWAS hit loci
Embedding quality was validated by checking whether known GWAS-significant variant positions cluster in interpretable sub-regions of the embedding space — without label data. Correlation with GRIN phenotype records: ~91% across 12,000 maize accessions. This is not a proxy metric — it's a direct alignment test.
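One way to test a label-free clustering claim of this kind is a permutation test: compare how tightly the GWAS-hit embeddings sit together against random variant subsets of the same size. The sketch below is a generic illustration and not the validation procedure behind the figures above; the function names and the choice of mean pairwise Euclidean distance as the statistic are assumptions.

```python
import math
import random
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def clustering_p_value(embeddings, hit_indices, n_permutations=1000, seed=0):
    """Permutation test: are the hit embeddings tighter than random subsets?

    Returns the observed mean pairwise distance among hits and the fraction
    of random same-size subsets that are at least as tight.
    """
    rng = random.Random(seed)
    hit_indices = list(hit_indices)
    observed = mean_pairwise_distance([embeddings[i] for i in hit_indices])
    as_tight = 0
    for _ in range(n_permutations):
        sample = rng.sample(range(len(embeddings)), len(hit_indices))
        if mean_pairwise_distance([embeddings[i] for i in sample]) <= observed:
            as_tight += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (as_tight + 1) / (n_permutations + 1)
```

A small p-value says the known hit loci occupy a tighter region of the embedding space than chance would produce. It uses only variant positions, not trait labels.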
200M+ sequences. 47 crops. 120 wild relatives.
Scale and species breadth are not marketing metrics — they are the precondition for transfer learning to work in low-label plant breeding contexts.
| Species group | Sequences | Assembly quality | Primary source |
|---|---|---|---|
| Cereals (wheat, maize, barley, sorghum, rice) | ~82M | Chromosome-level (T2T for wheat) | Ensembl Plants + URGI |
| Solanaceae (tomato, pepper, potato) | ~31M | Chromosome-level + scaffold | NCBI SRA + Sol Genomics |
| Legumes (soybean, pea, chickpea) | ~44M | Chromosome-level | LegumeInfo + NCBI |
| Brassicas + root crops | ~19M | Scaffold + chromosome | Ensembl Plants |
| Wild relatives + landraces | ~24M | Mixed (scaffold-level accepted) | GRIN + internal curation |
All sequences passed our deduplication and quality-filtering pipeline. Assembly artifacts, mislabeled cultivars, and contaminant sequences were removed before tokenization. Data sourcing details available on request.
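As a rough illustration of what sequence-level deduplication and quality filtering can look like, the sketch below drops short or N-heavy sequences and removes exact duplicates, treating a sequence and its reverse complement as the same record. It is not the pipeline described above: the thresholds and function names are assumptions, and it does not attempt to detect assembly artifacts, mislabeled cultivars, or contaminants.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def canonical(seq):
    """Return the lexicographically smaller of a sequence and its reverse complement."""
    s = seq.upper()
    reverse_complement = s.translate(_COMPLEMENT)[::-1]
    return min(s, reverse_complement)

def filter_and_dedupe(records, max_n_fraction=0.05, min_length=50):
    """Keep the first copy of each sequence that passes basic quality checks.

    records is an iterable of (name, sequence) pairs.
    """
    seen = set()
    kept = []
    for name, seq in records:
        s = seq.upper()
        if len(s) < min_length:
            continue  # too short to be informative
        if s.count("N") / len(s) > max_n_fraction:
            continue  # too many ambiguous bases
        digest = hashlib.sha256(canonical(s).encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate, possibly on the opposite strand
        seen.add(digest)
        kept.append((name, seq))
    return kept
```

Hashing the canonical form keeps memory use bounded by the number of unique sequences rather than their total length.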
Three primary endpoints
The Living Models API is REST-based with a Python SDK wrapper. Authentication uses an API key sent in the X-API-Key request header. All responses include confidence bounds and known-limitation flags.
Full API reference documentation available after access request approval.
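```python
import os

import requests

# Read the key from the environment rather than hard-coding it.
# LIVING_MODELS_API_KEY is a placeholder variable name.
api_key = os.environ['LIVING_MODELS_API_KEY']

response = requests.post(
    'https://api.livingmodles.com/v1/predict',
    headers={'X-API-Key': api_key},
    json={
        'vcf_path': 'candidates_batch.vcf',
        'traits': ['drought_tolerance', 'yield_stability'],
        'species': 'triticum_aestivum',
    },
    timeout=30,
)
response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
result = response.json()

# Example shape of result['predictions'][0]:
# {'drought_tolerance': {'score': 0.847, 'ci': [0.81, 0.88]},
#  'yield_stability': {'score': 0.712, 'ci': [0.67, 0.76]},
#  'out_of_distribution': False}
```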
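Downstream code will usually want to act on the confidence bounds and the out-of-distribution flag rather than on the raw score alone. The helper below is a sketch that assumes every prediction follows the shape of the example response above; the function name `usable_predictions` and the 0.10 interval-width cutoff are illustrative choices, not part of the API.

```python
def usable_predictions(result, max_ci_width=0.10):
    """Return (index, trait, score, ci) tuples that look safe to act on.

    Skips samples flagged as out of distribution and traits whose
    confidence interval is wider than max_ci_width.
    """
    usable = []
    for index, prediction in enumerate(result.get("predictions", [])):
        if prediction.get("out_of_distribution"):
            continue
        for trait, value in prediction.items():
            # Ignore flags and any other non-trait fields.
            if not isinstance(value, dict) or "ci" not in value:
                continue
            low, high = value["ci"]
            if high - low <= max_ci_width:
                usable.append((index, trait, value["score"], (low, high)))
    return usable
```

Applied to the example response, a cutoff of 0.08 would keep drought_tolerance (interval width 0.07) and drop yield_stability (interval width 0.09).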
The four input formats plant genomics actually uses
No custom conversion required for VCF, FASTA, SNP array CSV, or PLINK binary. We handle format parsing on ingest so you start from your existing pipeline output.
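If you want to check a batch of files before submitting, format detection by file extension is easy to sketch on the client side. This is an illustration only and says nothing about how parsing works on ingest; the format labels, the extension list, and the function names are assumptions for this example.

```python
from pathlib import Path

# Extension-to-format table. The labels are hypothetical names for this
# sketch, not identifiers used by the API. Note that ".bed" is treated as
# PLINK binary here, although the same extension is also used for UCSC BED.
_FORMATS = {
    ".vcf": "vcf",
    ".fa": "fasta",
    ".fasta": "fasta",
    ".fna": "fasta",
    ".csv": "snp_array_csv",
    ".bed": "plink_binary",
    ".bim": "plink_binary",
    ".fam": "plink_binary",
}

def detect_format(path):
    """Guess the input format from the file name, ignoring a trailing .gz."""
    name = Path(path).name.lower()
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    suffix = Path(name).suffix
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unrecognised input format: {path!r}") from None

def missing_plink_parts(prefix, available_names):
    """List which of the .bed/.bim/.fam files are absent for a PLINK fileset."""
    expected = {f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")}
    return sorted(expected - set(available_names))
```

A PLINK binary dataset is only usable when all three files are present, which is why the second helper checks the fileset as a unit.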
Standard HTTP endpoints. Language-agnostic. Works with any stack that can make POST requests. Status: Available.
pip install livingmodels. High-level wrapper with batch processing, retry logic, and pandas DataFrame output. Status: Available.
Direct file upload for standard plant genomics formats. VCF 4.2+, multi-FASTA, SNP array CSV. No pre-processing required. Status: Available.
.bed/.bim/.fam binary PLINK format for labs whose population structure workflows already output PLINK files. Parsed on ingest, with no manual conversion to VCF required. Status: In development.
Start with a benchmark run.
Submit a sample dataset and receive trait predictions to evaluate model quality before committing to an integration. No contract required for the initial benchmark.