Predicting Disease Resistance in Tomato Without Field Inoculation

Field inoculation trials for disease resistance screening in Solanum lycopersicum are among the most resource-intensive experiments in vegetable breeding. A controlled inoculation study for Fusarium wilt (Fusarium oxysporum f. sp. lycopersici) requires pathogen maintenance, greenhouse space, three to five weeks of post-inoculation observation, and a trained pathologist to score symptom severity. For a programme running 600+ candidate lines, that is a substantial constraint. We have been investigating whether sequence-based predictions can identify resistance-associated genotype classes with enough confidence to guide which lines deserve priority in inoculation trials. The aim is not to replace the trial but to stratify access to it.

The Genetic Architecture of Tomato Disease Resistance

Tomato disease resistance presents an interesting modelling challenge because it is governed partly by well-characterised major resistance (R) genes (the I, I-2, and I-3 loci for Fusarium wilt races 1, 2, and 3 respectively) and partly by quantitative resistance loci (QRLs) that modify the response even in the presence of R gene alleles. The I-2 locus alone accounts for a large portion of race 2 Fusarium wilt resistance in modern tomato breeding lines, making sequence-level detection of the functional allele highly predictive for that specific pathotype. Late blight (Phytophthora infestans) resistance, by contrast, is predominantly quantitative in cultivated tomato: the Ph-3 locus provides partial resistance at best, and the broader resistance spectrum requires many QRLs acting in combination.

We find this distinction important to state clearly at the outset: not all disease resistances are equally tractable from sequence data. Major-gene resistances with well-defined haplotype signatures are much more predictable than quantitative resistances where we are estimating aggregate genetic potential.

Data Sources and Input Representation

Our starting dataset for this work came from two sources. The SOL Genomics Network (SGN) hosts phenotyped tomato accessions with associated sequence data spanning multiple disease resistance evaluations, and the NCBI SRA contains resequencing data for domesticated tomato panels that include genotype calls at known resistance loci. We used these public resources to construct a reference set of approximately 1,400 S. lycopersicum accessions with documented disease resistance scores.

Input to our model is a fixed-length encoding derived from resequencing data or high-density SNP arrays. For disease resistance specifically, we include marker positions covering the I, I-2, I-3, Ph-2, and Ph-3 loci at high density, plus genome-wide background markers at lower density. The genome-wide markers matter because race specificity and quantitative modifier effects are distributed across the genome. Looking only at the canonical resistance locus underestimates total resistance potential in lines that carry partial QRL stacks without the major R allele.
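As a rough illustration of what a fixed-length encoding of this kind might look like, the sketch below maps diploid genotype calls to alternate-allele dosages (0, 1, 2) over an ordered marker list, with a sentinel for missing calls. The marker names, the dosage scheme, the sentinel value, and the encode_genotypes helper are all assumptions made for illustration; they are not the model's actual input format.

```python
from typing import Dict, List, Optional

# Hypothetical ordered marker panel: dense markers at resistance loci first,
# then sparser genome-wide background markers. Names are illustrative only.
PANEL: List[str] = [
    "I2_m01", "I2_m02", "I2_m03",   # dense markers around I-2
    "Ph3_m01", "Ph3_m02",           # dense markers around Ph-3
    "bg_chr01_a", "bg_chr05_a",     # low-density background markers
]

MISSING = -1  # assumed sentinel for markers with no usable call


def encode_genotypes(calls: Dict[str, Optional[str]],
                     panel: List[str] = PANEL) -> List[int]:
    """Return one dosage value per panel marker, in panel order.

    Calls are VCF-style biallelic diploid strings such as "0/0", "0/1",
    "1|1". Anything absent or unparseable becomes MISSING, so the output
    length always equals the panel length.
    """
    vector: List[int] = []
    for marker in panel:
        call = calls.get(marker)
        if not call:
            vector.append(MISSING)
            continue
        alleles = call.replace("|", "/").split("/")
        if len(alleles) != 2 or not all(a in ("0", "1") for a in alleles):
            vector.append(MISSING)
            continue
        vector.append(sum(int(a) for a in alleles))
    return vector


# Toy accession with calls at only some markers (made-up values).
example = {"I2_m01": "1/1", "I2_m02": "0/1", "Ph3_m01": "0|0", "bg_chr01_a": "./."}
encoded = encode_genotypes(example)
```

Keeping the panel order fixed is what makes the vector comparable across accessions genotyped on different platforms; markers a platform does not cover simply show up as the missing sentinel.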

Fusarium Wilt: Where Sequence Prediction Works Well

For Fusarium wilt race 2 resistance, our model achieves strong discrimination between susceptible and resistant genotype classes. The underlying reason is straightforward: the I-2 allele has a highly characteristic haplotype structure in modern breeding material, and the surrounding linkage disequilibrium block is sufficiently conserved that our marker panel captures it with low ambiguity. In a retrospective evaluation against the SGN phenotype records, we achieved greater than 92% concordance between predicted class and field-recorded resistance category.
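For readers who want to reproduce this kind of check on their own records, concordance here can be read as the fraction of accessions whose predicted class matches the recorded category. The sketch below computes it under that assumption; the class labels and the toy data are invented for illustration and are not drawn from the SGN records.

```python
from typing import Sequence


def concordance(predicted: Sequence[str], recorded: Sequence[str]) -> float:
    """Fraction of accessions whose predicted class equals the recorded one."""
    if len(predicted) != len(recorded):
        raise ValueError("predicted and recorded must have the same length")
    if len(predicted) == 0:
        raise ValueError("no accessions to compare")
    matches = sum(p == r for p, r in zip(predicted, recorded))
    return matches / len(predicted)


# Toy example with made-up labels: "R" = resistant, "S" = susceptible.
predicted = ["R", "R", "S", "S", "R"]
recorded = ["R", "S", "S", "S", "R"]
toy_concordance = concordance(predicted, recorded)  # 4 of 5 labels agree
```

A single concordance figure hides which class the errors fall in, so it is worth tabulating resistant and susceptible mismatches separately when class frequencies are unbalanced.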

Race 3 resistance is harder. The I-3 gene has a narrower distribution in public tomato diversity and was introgressed more recently into cultivated backgrounds. The haplotype in commercial material is therefore more uniform, which helps detectability, but the public data available for validation is sparser. We attach wider confidence intervals to race 3 predictions and recommend treating them as prior information rather than definitive classification.

Late Blight: Quantitative Resistance and Its Limits

Late blight is where honest boundaries matter most. Ph-3 contributes meaningfully to resistance but has not prevented late blight epidemics in commercial production. Breeding for durable late blight resistance in tomato involves accumulating multiple QRLs, many of which were identified in wild Solanum relatives — particularly S. habrochaites and S. pimpinellifolium — and progressively introgressed into cultivated backgrounds over decades.

Our sequence predictions for late blight capture the Ph-2 and Ph-3 haplotypes accurately when they are present. But predicting the effective QRL load in a complex introgression line requires that those QRL regions be included in our marker panel and that our training data contains sufficient examples of accessions with documented QRL stacking. In our current model version, the prediction for late blight is best interpreted as an estimate of major-gene resistance status, with a secondary polygenic modifier score that carries substantial uncertainty for lines with unusual introgression histories.

We are not saying sequence prediction cannot eventually capture quantitative late blight resistance well. We are saying that for this trait, the training data requirement is higher and the current prediction should be understood accordingly.

Practical Application: Stratifying Inoculation Trial Access

Consider a tomato breeding cycle with 620 F3 families, each represented by a few plants from which leaf tissue is collected for genotyping. A Fusarium wilt race 2 inoculation trial has greenhouse capacity for 180 entries per season. The question is which 180 to prioritise.

Using our sequence predictions as a pre-screening filter, the programme deprioritises lines predicted as clearly susceptible at the I-2 locus and focuses inoculation capacity on three groups: (1) lines predicted as I-2 resistant, to confirm I-2 function in their specific genetic background; (2) lines with ambiguous or heterozygous I-2 predictions, which require phenotypic resolution; and (3) a random selection of susceptible-predicted lines, retained as experimental controls. This approach channels inoculation resources toward lines where phenotypic information is most decision-relevant.
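The three-group allocation can be sketched in a few lines of Python. Everything specific in this sketch is an assumption: the label names, the number of controls, the toy class counts, and in particular the fill order (controls first, then ambiguous lines, then resistant-predicted lines), which a programme would set according to its own priorities.

```python
import random
from typing import Dict, List


def select_trial_entries(predictions: Dict[str, str], capacity: int,
                         n_controls: int, seed: int = 0) -> List[str]:
    """Choose lines for inoculation within a fixed greenhouse capacity.

    predictions maps line ID -> "resistant", "ambiguous" or "susceptible"
    (hypothetical labels; any other label raises KeyError).
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    groups: Dict[str, List[str]] = {"resistant": [], "ambiguous": [], "susceptible": []}
    for line_id in sorted(predictions):
        groups[predictions[line_id]].append(line_id)

    # Group 3: random susceptible-predicted lines kept as controls.
    k = min(n_controls, capacity, len(groups["susceptible"]))
    selected = rng.sample(groups["susceptible"], k)

    # Groups 2 and 1: fill remaining capacity, ambiguous lines first.
    for label in ("ambiguous", "resistant"):
        room = capacity - len(selected)
        if room <= 0:
            break
        pool = groups[label]
        selected.extend(pool if len(pool) <= room else rng.sample(pool, room))
    return selected


# Toy cycle: 620 families with made-up class counts (150 / 60 / 410).
predictions = {}
for i in range(620):
    label = "resistant" if i < 150 else "ambiguous" if i < 210 else "susceptible"
    predictions[f"F3_{i:03d}"] = label

entries = select_trial_entries(predictions, capacity=180, n_controls=20)
```

With these toy counts the 180 slots hold 20 controls, all 60 ambiguous lines, and a random 100 of the 150 resistant-predicted lines; the remaining resistant-predicted lines would wait for a later season or rely on the prediction alone.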

The reduction in inoculation trial entries achieved by this approach depends on the frequency of the I-2 allele in the breeding programme's starting material. In a programme that has been selecting on I-2 for several cycles, the clearly susceptible proportion may be low and the pre-screening value modest. In a programme introducing diverse germplasm from wild relatives or landrace sources, where resistance allele status is unknown, pre-screening has more impact.
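A back-of-envelope calculation makes the dependence concrete. The sketch below counts how many entries remain after setting aside susceptible-predicted lines and adding back a fixed number of controls; the susceptible fractions and the control count are illustrative assumptions, not measured values from any programme.

```python
def entries_after_prescreen(n_lines: int, susceptible_fraction: float,
                            n_controls: int) -> int:
    """Entries left once clearly susceptible lines are set aside,
    plus susceptible-predicted controls added back in."""
    if not 0.0 <= susceptible_fraction <= 1.0:
        raise ValueError("susceptible_fraction must be between 0 and 1")
    n_susceptible = round(n_lines * susceptible_fraction)
    kept = n_lines - n_susceptible
    return kept + min(n_controls, n_susceptible)


# Programme already selected on I-2: few susceptible lines to remove.
selected_programme = entries_after_prescreen(620, 0.10, 20)

# Programme with diverse new germplasm: most lines predicted susceptible.
diverse_programme = entries_after_prescreen(620, 0.75, 20)
```

Under these assumed fractions, the first case still leaves 578 candidate entries for 180 greenhouse slots, while the second leaves 175, which fits within a single season's capacity.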

Input Data Requirements and Practical Minimum

Our disease resistance predictions for tomato work best with resequencing data at moderate coverage (5–10×) or high-density SNP array data covering the major resistance loci. Genotyping-by-sequencing (GBS) data works if the target loci are captured at sufficient depth. However, GBS restriction enzyme cut-site distributions do not guarantee coverage at all resistance marker positions, so we advise checking marker coverage at the I, I-2, I-3, Ph-2, and Ph-3 positions explicitly before interpreting predictions for those loci.

For low-coverage or GBS data with gaps at key loci, our model flags individual locus predictions as low_coverage_uncertain rather than returning a single aggregate confidence score. We introduced this distinction after early feedback that aggregate scores masked locus-specific data quality problems.
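A per-locus coverage check of this kind can be sketched as follows. Only the low_coverage_uncertain label comes from our model's output; the depth threshold, the required fraction of covered markers, the "ok" label, and the flag_loci helper are hypothetical choices made for this example.

```python
from typing import Dict, List

MIN_DEPTH = 5               # assumed minimum read depth per marker
MIN_FRACTION_COVERED = 0.8  # assumed share of a locus's markers that must pass


def flag_loci(depth_by_locus: Dict[str, List[int]],
              min_depth: int = MIN_DEPTH,
              min_fraction: float = MIN_FRACTION_COVERED) -> Dict[str, str]:
    """Return a per-locus flag instead of one aggregate score.

    depth_by_locus maps a locus name to the read depths observed at each
    of its panel markers for one sample.
    """
    flags: Dict[str, str] = {}
    for locus, depths in depth_by_locus.items():
        if not depths:
            # No markers captured at all, e.g. no GBS cut sites nearby.
            flags[locus] = "low_coverage_uncertain"
            continue
        covered = sum(d >= min_depth for d in depths) / len(depths)
        flags[locus] = "ok" if covered >= min_fraction else "low_coverage_uncertain"
    return flags


# Toy GBS sample with made-up depths: good at I-2, patchy at Ph-3, empty at I.
sample_depths = {
    "I-2": [8, 9, 7, 6, 10],
    "Ph-3": [0, 2, 9, 0, 1],
    "I": [],
}
locus_flags = flag_loci(sample_depths)
```

Reporting the flag per locus means a sample with good I-2 coverage can still be used for race 2 decisions even when its Ph-3 prediction has to be set aside.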

What We Are Working on Next

The most pressing gap in our current tomato model is Tomato Yellow Leaf Curl Virus (TYLCV) resistance. The Ty gene series (Ty-1 through Ty-6) is commercially critical and the resistance alleles have reasonable haplotype characterisation in the literature. We are assembling phenotyped accession data to add TYLCV resistance as a predicted trait in our next model update. Bacterial spot resistance (Xanthomonas spp.) mediated by the Bs gene series is further behind: the genetic complexity and race diversity in European field populations make this a harder modelling target than the fungal and oomycete resistances we have addressed first.