Platform

A foundation model built for plant genomes, not repurposed from protein models.

Designed from the ground up for plant genomics: intron-aware tokenization, multi-species joint training, and variant-level embeddings validated against GWAS hits across five crops. Purpose-built, not adapted.

Three design decisions that matter

Every architecture choice was made specifically for plant genomics — not carried over from NLP or protein models.

01 — Tokenization

k-mer tokenizer tuned for plant genomic repetitiveness

Many plant genomes are 40-80% repetitive sequence: transposons, tandem repeats, and centromeric arrays. Standard BPE tokenizers collapse these into noise. Our k-mer vocabulary was built from a representative cross-species corpus to preserve signal from repetitive regions rather than discard it.
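To make the idea concrete, here is a minimal sketch of overlapping k-mer tokenization. The k-mer size (k=6), the stride, and the function name kmer_tokenize are illustrative assumptions; the production tokenizer's k and vocabulary are not stated on this page.

```python
from collections import Counter


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (k and stride are assumed values)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# A short tandem repeat: the same 4-base unit repeated five times.
tandem_repeat = "ACGT" * 5
tokens = kmer_tokenize(tandem_repeat, k=6)

# The repeat shows up as a small set of recurring tokens, so its
# periodic structure is still visible after tokenization.
counts = Counter(tokens)
print(len(tokens), len(counts))
```

In this toy case the repeat produces only a handful of distinct tokens that recur at a fixed period, which is the kind of signal a fixed-width vocabulary can keep.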

02 — Pre-training objective

Masked sequence modeling + next-region prediction

Two complementary objectives: masked k-mer prediction teaches local variant context; next-region prediction learns long-range genomic structure across introns and regulatory elements. Pre-trained on 200M+ sequences across 47 crop species and 120 wild relative taxa in joint multi-species batches, so population variation is encoded, not averaged out.
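The masked k-mer objective can be illustrated with a small sketch that hides a fraction of tokens and records what the model would have to recover. The 15% mask rate, the "[MASK]" token name, and the function name are assumptions borrowed from common masked-language-model practice, not details confirmed for this model; next-region prediction is not shown.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # what the model must predict
        masked[pos] = MASK          # what the model actually sees
    return masked, targets


tokens = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA",
          "ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During training, the loss would be computed only at the masked positions, so the model has to infer each hidden k-mer from its neighbors.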

03 — Embedding geometry

512-dim space validated against GWAS hit loci

Embedding quality was validated by checking whether known GWAS-significant variant positions cluster in interpretable sub-regions of the embedding space — without label data. Correlation with GRIN phenotype records: ~91% across 12,000 maize accessions. This is not a proxy metric — it's a direct alignment test.
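One simple way to ask whether a set of variants clusters in embedding space without labels is to compare their mean pairwise cosine similarity against a background set. The sketch below does this on toy 3-dimensional vectors standing in for 512-dimensional embeddings. The statistic, the function names, and the numbers are all illustrative assumptions, not the validation procedure or data described above.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins: made-up vectors, not real model output.
gwas_hits = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]]
background = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]

hit_cohesion = mean_pairwise_cosine(gwas_hits)
background_cohesion = mean_pairwise_cosine(background)
print(hit_cohesion > background_cohesion)
```

A real test would also need a null distribution, for example from repeated random draws of background variants, before reading anything into the gap.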

200M+ sequences. 47 crops. 120 wild relatives.

Scale and species breadth are not marketing metrics — they are the precondition for transfer learning to work in low-label plant breeding contexts.

200M+ plant sequences in pre-training corpus
47 crop species with full-genome coverage
120 wild relative taxa for diversity capture
| Species group | Sequences | Assembly quality | Primary source |
| --- | --- | --- | --- |
| Cereals (wheat, maize, barley, sorghum, rice) | ~82M | Chromosome-level (T2T for wheat) | Ensembl Plants + URGI |
| Solanaceae (tomato, pepper, potato) | ~31M | Chromosome-level + scaffold | NCBI SRA + Sol Genomics |
| Legumes (soybean, pea, chickpea) | ~44M | Chromosome-level | LegumeInfo + NCBI |
| Brassicas + root crops | ~19M | Scaffold + chromosome | Ensembl Plants |
| Wild relatives + landraces | ~24M | Mixed (scaffold-level accepted) | GRIN + internal curation |

All sequences passed our deduplication and quality-filtering pipeline. Assembly artifacts, mislabeled cultivars, and contaminant sequences were removed before tokenization. Data sourcing details available on request.
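As a rough illustration of what exact-duplicate removal and basic quality filtering can look like, here is a minimal sketch. The thresholds, the function names, and the choice of SHA-256 hashing are assumptions for illustration; the actual pipeline's rules are not described here, and this sketch does not attempt cultivar-label or contaminant screening.

```python
import hashlib


def n_fraction(seq: str) -> float:
    """Fraction of ambiguous 'N' bases in a sequence."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def dedup_and_filter(records, max_n_fraction=0.05, min_length=20):
    """Keep the first copy of each exact sequence that passes basic checks.

    records: iterable of (identifier, sequence) pairs.
    Thresholds are illustrative assumptions, not real pipeline settings.
    """
    seen = set()
    kept = []
    for identifier, seq in records:
        normalized = seq.upper()
        if len(normalized) < min_length or n_fraction(normalized) > max_n_fraction:
            continue  # too short or too many ambiguous bases
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a sequence already kept
        seen.add(digest)
        kept.append((identifier, normalized))
    return kept


records = [
    ("acc1", "ACGTACGTACGTACGTACGTACGT"),
    ("acc2", "acgtacgtacgtacgtacgtacgt"),  # duplicate of acc1 after case folding
    ("acc3", "ACGTNNNNNNNNACGTACGTACGT"),  # too many ambiguous bases
    ("acc4", "ACGT"),                      # too short
    ("acc5", "TTGACCATGGCATTGACCATGGCA"),
]
kept = dedup_and_filter(records)
print([identifier for identifier, _ in kept])
```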

Three primary endpoints

The Living Models API is REST-based with a Python SDK wrapper. Authentication uses an API key passed in the X-API-Key request header. All responses include confidence bounds and known-limitation flags.

POST /v1/embed
Submit a VCF, FASTA, or SNP array CSV. Returns a 512-dimensional embedding vector for each input accession.
POST /v1/predict
Submit a VCF and a trait target list. Returns predicted trait scores with confidence intervals, plus a limitation flag if the input falls outside the model's training distribution.
POST /v1/fine-tune
Submit labeled phenotype data (minimum 200 accessions). Trains an adapter layer on top of the frozen foundation model. Returns a fine-tuned model ID for your breeding program.
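Because fine-tuning requires at least 200 labeled accessions, a client-side check before submission can save a round trip. The sketch below is one way to do that. The ENDPOINTS mapping mirrors the three paths listed above, while the helper name and the accession-to-phenotype dictionary shape are assumptions, since the request schema is not published on this page.

```python
MIN_FINE_TUNE_ACCESSIONS = 200  # minimum stated for /v1/fine-tune

# Paths as listed above; the dictionary keys are illustrative.
ENDPOINTS = {
    "embed": "/v1/embed",
    "predict": "/v1/predict",
    "fine_tune": "/v1/fine-tune",
}


def check_fine_tune_labels(labels: dict) -> int:
    """Validate a mapping of accession ID -> phenotype value before submission.

    Returns the number of labeled accessions, or raises ValueError when the
    dataset is smaller than the documented minimum.
    """
    labeled = {acc: value for acc, value in labels.items() if value is not None}
    if len(labeled) < MIN_FINE_TUNE_ACCESSIONS:
        raise ValueError(
            f"{len(labeled)} labeled accessions; "
            f"{ENDPOINTS['fine_tune']} needs at least {MIN_FINE_TUNE_ACCESSIONS}"
        )
    return len(labeled)


# Hypothetical phenotype table with 250 labeled accessions.
labels = {f"accession_{i:03d}": 0.5 for i in range(250)}
print(check_fine_tune_labels(labels))
```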

Full API reference documentation available after access request approval.

api_example.py
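import os

import requests

# Read the API key from the environment (the variable name is an assumption).
api_key = os.environ['LIVING_MODELS_API_KEY']

response = requests.post(
    'https://api.livingmodels.com/v1/predict',
    headers={'X-API-Key': api_key},
    json={
        'vcf_path': 'candidates_batch.vcf',
        'traits': ['drought_tolerance', 'yield_stability'],
        'species': 'triticum_aestivum'
    },
    timeout=30,
)
response.raise_for_status()  # stop here on HTTP errors instead of parsing an error body

result = response.json()
# result['predictions'][0], shown as the returned JSON:
# {'drought_tolerance': {'score': 0.847, 'ci': [0.81, 0.88]},
#  'yield_stability': {'score': 0.712, 'ci': [0.67, 0.76]},
#  'out_of_distribution': false}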
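A caller will usually want to act on the confidence interval and the out-of-distribution flag rather than on the raw score alone. The sketch below assumes the response shape shown in the commented example above; the helper name and the "trusted" field are hypothetical.

```python
def summarize_prediction(prediction: dict) -> dict:
    """Flatten one entry of result['predictions'] into per-trait summaries."""
    out_of_distribution = bool(prediction.get("out_of_distribution", False))
    summary = {}
    for trait, value in prediction.items():
        if trait == "out_of_distribution":
            continue  # a flag for the whole prediction, not a trait
        low, high = value["ci"]
        summary[trait] = {
            "score": value["score"],
            "ci_width": round(high - low, 3),
            # Treat scores as unreliable when the input was flagged.
            "trusted": not out_of_distribution,
        }
    return summary


# Same values as the commented example response.
example = {
    "drought_tolerance": {"score": 0.847, "ci": [0.81, 0.88]},
    "yield_stability": {"score": 0.712, "ci": [0.67, 0.76]},
    "out_of_distribution": False,
}
summary = summarize_prediction(example)
print(summary["drought_tolerance"])
```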

The four input formats plant genomics actually uses

No custom conversion is required for VCF, FASTA, or SNP array CSV, and PLINK binary support is in development. We handle format parsing on ingest so you start from your existing pipeline output.

REST API

Standard HTTP endpoints. Language-agnostic. Works with any stack that can make POST requests.

Available
Python SDK

Install with pip install livingmodels. High-level wrapper with batch processing, retry logic, and pandas DataFrame output.

Available
VCF / FASTA upload

Direct file upload for standard plant genomics formats. VCF 4.2+, multi-FASTA, SNP array CSV. No pre-processing required.

Available
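Since uploads expect VCF 4.2 or later, it can be worth checking a file's declared version before sending it. The VCF specification requires the first line to be a ##fileformat header, which the sketch below reads from an uncompressed file handle. The function names are illustrative, and the check is a local convenience, not part of the Living Models API.

```python
import io
import re


def vcf_version(handle) -> tuple[int, int]:
    """Read the ##fileformat header of an uncompressed VCF and return (major, minor)."""
    first_line = handle.readline().strip()
    match = re.fullmatch(r"##fileformat=VCFv(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("missing or malformed ##fileformat header")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(4, 2)) -> bool:
    """Tuple comparison: (4, 3) >= (4, 2) is True, (4, 1) >= (4, 2) is False."""
    return version >= minimum


# In-memory stand-in for open("candidates_batch.vcf").
sample = io.StringIO(
    "##fileformat=VCFv4.3\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
print(meets_minimum(vcf_version(sample)))
```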
PLINK format

.bed/.bim/.fam binary PLINK format for labs whose population structure workflows already output PLINK files. Parsed on ingest — no manual conversion to VCF required.

In development
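For readers who have not looked inside a PLINK binary file, the sketch below decodes the publicly documented PLINK 1 .bed layout: a three-byte header followed by one block per variant, two bits per sample, lowest-order bits first. It is meant only to show what parsing on ingest involves. It is not Living Models' ingest code, and the function names and genotype labels are illustrative.

```python
PLINK_BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 .bed header, variant-major mode
GENOTYPE_CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_variant_block(block: bytes, n_samples: int) -> list[str]:
    """Decode one variant's genotypes: 2 bits per sample, lowest-order bits first."""
    calls = []
    for sample_index in range(n_samples):
        byte = block[sample_index // 4]
        code = (byte >> (2 * (sample_index % 4))) & 0b11
        calls.append(GENOTYPE_CODES[code])
    return calls


def decode_bed(data: bytes, n_samples: int) -> list[list[str]]:
    """Decode every variant in a .bed payload; n_samples comes from the .fam file."""
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    if data[:3] != PLINK_BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # each block is padded to whole bytes
    body = data[3:]
    if len(body) % bytes_per_variant:
        raise ValueError("truncated .bed payload")
    return [
        decode_variant_block(body[i:i + bytes_per_variant], n_samples)
        for i in range(0, len(body), bytes_per_variant)
    ]


# One variant, five samples: 0xE4 packs codes 00, 01, 10, 11; 0x02 packs code 10.
data = PLINK_BED_MAGIC + bytes([0xE4, 0x02])
print(decode_bed(data, n_samples=5))
```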

Start with a benchmark run.

Submit a sample dataset and receive trait predictions to evaluate model quality before committing to an integration. No contract required for the initial benchmark.