Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine
The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas aeruginosa, an important human pathogen, against four antibiotics. Our results demonstrate that extremely sparse models which are biologically relevant can be learnt using this approach.
💡 Research Summary
The paper addresses the challenge of building predictive models for discrete phenotypes directly from whole‑genome sequencing (WGS) data while retaining interpretability. The authors propose a pipeline that combines a k‑mer based representation of genomes with the Set Covering Machine (SCM), a rule‑based learning algorithm that produces highly sparse classifiers in the form of a conjunction or disjunction of Boolean rules. Each rule corresponds to the presence or absence of a specific k‑mer (k = 31) in a genome, allowing the final model to be expressed as a simple logical statement that can be readily examined by biologists.
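To make the rule-based representation concrete, here is a minimal Python sketch of how a genome could be reduced to its set of k-mers and how a conjunction or disjunction of presence/absence rules could be evaluated. The names `kmer_set` and `predict` and the toy sequence are assumptions made for illustration and do not come from the paper. The sketch also ignores reverse complements, which a real pipeline may need to handle.

```python
from typing import List, Set, Tuple

# A rule is (k-mer, must_be_present): True = presence rule, False = absence rule.
# This encoding is a hypothetical choice made for this sketch.
Rule = Tuple[str, bool]


def kmer_set(sequence: str, k: int = 31) -> Set[str]:
    """Return the set of distinct k-mers that occur in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def predict(genome_kmers: Set[str], rules: List[Rule], conjunction: bool = True) -> int:
    """Evaluate a conjunction (all rules hold) or disjunction (any rule holds)."""
    outcomes = [(kmer in genome_kmers) == must_be_present
                for kmer, must_be_present in rules]
    return int(all(outcomes) if conjunction else any(outcomes))


# Toy example with k = 3 instead of 31 so the k-mers stay readable.
genome = kmer_set("ACGTAC", k=3)
model = [("CGT", True), ("TTT", False)]  # "CGT present AND TTT absent"
print(predict(genome, model))
```

Because the model is just a short list of k-mers with a presence or absence flag, each rule can be looked up directly in an annotated reference genome.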
Methodologically, the genomes are transformed into binary vectors φ(x) where each dimension indicates whether a particular k‑mer appears in the sequence. From the full set K of unique k‑mers across all samples, two types of rules are generated: p_k (k‑mer present) and a_k (k‑mer absent). The SCM algorithm iteratively selects the rule that maximally reduces classification error on the yet‑uncovered training examples, balancing true positives against false positives through a trade‑off parameter p. A second parameter s caps the maximum number of rules, providing explicit control over model complexity. The greedy selection process approximates the NP‑hard set‑cover problem with a worst‑case guarantee and runs in O(|K|·|S|·s) time, which scales linearly with the number of k‑mers and samples. To handle the massive feature space (millions of k‑mers), the authors store the binary matrices on disk using HDF5 compression and process them in blocks, keeping memory usage modest (≈3 GB).
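The greedy selection described above can be sketched as follows for the conjunction case. This is an illustrative pure-Python version under stated assumptions, not the authors' implementation: it uses the commonly cited SCM utility (negative examples rejected by a rule minus p times the positive examples it wrongly rejects), keeps everything in memory rather than in blocked HDF5 storage, and the names `satisfies`, `train_scm_conjunction`, and `predict` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, bool]  # (k-mer, True = presence rule / False = absence rule)


def satisfies(genome: Set[str], rule: Rule) -> bool:
    kmer, must_be_present = rule
    return (kmer in genome) == must_be_present


def train_scm_conjunction(genomes: Sequence[Set[str]], labels: Sequence[int],
                          p: float = 1.0, s: int = 3) -> List[Rule]:
    """Greedily build a conjunction of at most s rules (assumed utility, see text)."""
    all_kmers = sorted(set().union(*genomes))
    candidates = [(kmer, flag) for kmer in all_kmers for flag in (True, False)]
    neg_left = {i for i, y in enumerate(labels) if y == 0}  # negatives still accepted
    pos_left = {i for i, y in enumerate(labels) if y == 1}  # positives still accepted
    model: List[Rule] = []
    while neg_left and len(model) < s:
        best_rule, best_utility, best_rejected = None, float("-inf"), set()
        for rule in candidates:
            rejected = {i for i in neg_left | pos_left
                        if not satisfies(genomes[i], rule)}
            covered_neg = len(rejected & neg_left)  # negatives correctly rejected
            lost_pos = len(rejected & pos_left)     # positives wrongly rejected
            utility = covered_neg - p * lost_pos
            if covered_neg > 0 and utility > best_utility:
                best_rule, best_utility, best_rejected = rule, utility, rejected
        if best_rule is None:  # no rule rejects any remaining negative example
            break
        model.append(best_rule)
        neg_left -= best_rejected
        pos_left -= best_rejected
    return model


def predict(genome: Set[str], model: List[Rule]) -> int:
    return int(all(satisfies(genome, rule) for rule in model))


# Toy data: resistance (label 1) coincides with the absence of k-mer "AAA".
genomes = [{"AAA", "CCC"}, {"AAA", "GGG"}, {"AAA"},
           {"GGG", "CCC"}, {"CCC"}, {"GGG"}]
labels = [0, 0, 0, 1, 1, 1]
model = train_scm_conjunction(genomes, labels, p=1.0, s=3)
print(model)
```

Each pass scans every candidate rule against every remaining example, and at most s passes are made, which matches the O(|K|·|S|·s) cost stated above.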
The approach is evaluated on a clinically relevant dataset comprising 390 Pseudomonas aeruginosa isolates with measured resistance to four antibiotics: amikacin, doripenem, levofloxacin, and meropenem. For each drug, intermediate‑resistance isolates are removed, yielding a binary classification problem (resistant = 1, sensitive = 0). Nested 5‑fold cross‑validation is employed: an inner loop tunes p (10⁻¹ to 10¹) and s (1–10), while the outer loop estimates generalization risk. The SCM is compared against a linear‑kernel Support Vector Machine (SVM) and a majority‑class baseline.
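The nested cross-validation protocol can be sketched in plain Python as below. The learner is passed in as a hypothetical `fit(train_X, train_y, p=..., s=...)` callable that returns a prediction function. The three values of p in the grid are an assumption, since the summary gives only the range 10⁻¹ to 10¹, and the fold assignment by shuffled striding is one simple choice among many.

```python
import random
from itertools import product
from statistics import mean


def split_folds(indices, k, seed):
    """Shuffle the indices and deal them into k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def error_rate(predict, X, y, idx):
    return mean(int(predict(X[i]) != y[i]) for i in idx)


def cross_val_risk(fit, params, X, y, indices, k, seed):
    """Mean held-out error of one hyperparameter setting over k folds."""
    risks = []
    for fold in split_folds(indices, k, seed):
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train], **params)
        risks.append(error_rate(predict, X, y, fold))
    return mean(risks)


def nested_cv(fit, grid, X, y, k=5, seed=0):
    """Outer loop estimates risk; inner loop picks hyperparameters on training data only."""
    all_idx = list(range(len(X)))
    outer_risks = []
    for fold in split_folds(all_idx, k, seed):
        held_out = set(fold)
        train = [i for i in all_idx if i not in held_out]
        best = min(grid, key=lambda params: cross_val_risk(
            fit, params, X, y, train, k, seed + 1))
        predict = fit([X[i] for i in train], [y[i] for i in train], **best)
        outer_risks.append(error_rate(predict, X, y, fold))
    return mean(outer_risks)


# Grid over the trade-off p and the rule cap s (the p values are assumed).
grid = [{"p": p, "s": s} for p, s in product([0.1, 1.0, 10.0], range(1, 11))]
```

The held-out fold of the outer loop is never seen while p and s are tuned, so the reported risk is not biased by the hyperparameter search.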
Results show that the SCM achieves risk lower than or comparable to that of the SVM on three of the four antibiotics, despite using only 2–5 rules per model. For levofloxacin, the SCM risk is 0.075 ± 0.025 versus 0.232 ± 0.029 for the SVM. The learned rules are biologically meaningful: the levofloxacin model consists of two absence rules targeting k‑mers that map to the quinolone‑resistance‑determining regions of DNA gyrase subunit A (covering Thr‑83 and Asp‑87) and subunit B (covering Ser‑468 and Glu‑470), both known mutation sites conferring fluoroquinolone resistance. For the other antibiotics, the models also involve DNA gyrase k‑mers, which the authors attribute to the high prevalence of multi‑drug resistance in the dataset: many isolates resistant to one drug are also resistant to others, causing the SCM to select the most globally discriminative rule set.
The authors discuss a key limitation: SCM can only produce a single conjunction or disjunction, which may be insufficient to capture multiple, distinct resistance mechanisms simultaneously. They propose extending the method to learn disjunctions of conjunctions, preserving sparsity while allowing richer logical structures.
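As a rough illustration of what such a richer model family would look like at prediction time, the sketch below evaluates a disjunction of conjunctions over k-mer rules. It only shows the model form and does not cover how such a model would be learned; the function name `predict_dnf` and the toy k-mers are hypothetical.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, bool]  # (k-mer, True = presence rule / False = absence rule)


def predict_dnf(genome_kmers: Set[str], clauses: List[List[Rule]]) -> int:
    """Return 1 if at least one clause has all of its rules satisfied."""
    return int(any(
        all((kmer in genome_kmers) == must_be_present
            for kmer, must_be_present in clause)
        for clause in clauses
    ))


# Two hypothetical resistance mechanisms: "AAA absent" OR ("CCC present" AND "GGG present").
clauses = [[("AAA", False)], [("CCC", True), ("GGG", True)]]
print(predict_dnf({"AAA", "CCC", "GGG"}, clauses))
```

Each clause can stand for one mechanism, so the model stays readable as long as the clauses remain few and short.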
In conclusion, the study demonstrates that a k‑mer based representation combined with the Set Covering Machine can yield highly sparse, accurate, and biologically interpretable models directly from whole‑genome data. This approach offers a practical alternative to dense classifiers such as SVMs, especially in clinical settings where model transparency and low validation cost are essential. Future work will focus on scaling the method to larger cohorts and more complex logical model families, potentially enhancing its utility for a broad range of genotype‑phenotype association studies.