Llettuce: An Open Source Natural Language Processing Tool for the Translation of Medical Terms into Uniform Clinical Encoding
This paper introduces Llettuce, an open-source tool designed to address the complexities of converting medical terms into OMOP standard concepts. Unlike existing solutions such as the Athena database search and Usagi, which struggle with semantic nuances and require substantial manual input, Llettuce leverages advanced natural language processing, including large language models and fuzzy matching, to automate and enhance the mapping process. Developed with a focus on GDPR compliance, Llettuce can be deployed locally, ensuring data protection while maintaining high performance in converting informal medical terms to standardised concepts.
💡 Research Summary
The paper presents Llettuce, an open‑source, GDPR‑compliant tool that automates the mapping of informal medical terms to standardised OMOP concepts. Existing solutions such as the Athena database search and the OHDSI Usagi tool rely heavily on lexical string matching, which often fails to capture semantic nuances and requires extensive manual validation. Llettuce addresses these shortcomings by integrating three distinct retrieval pathways: (1) a traditional keyword‑based search that leverages pre‑computed lexical features, (2) a dense‑vector semantic search using pre‑computed embeddings stored via PGVector, and (3) a Retrieval‑Augmented Generation (RAG) pipeline that incorporates large language models (LLMs) to reason over candidate concepts when lexical and vector scores are insufficient.
Architecture and Implementation
Llettuce builds on a standard OMOP‑CDM database, extending the concept table with two auxiliary columns: (i) a column of lemmatized lexemes for rapid keyword matching, and (ii) a column of dense embeddings generated by a sentence‑transformer (e.g., BGE‑small‑en‑v1.5). These embeddings are pre‑computed for all concepts, enabling exact cosine‑similarity searches without resorting to approximate nearest‑neighbor indexes, thereby preserving high accuracy. The system offers both a command‑line interface and a FastAPI‑based HTTP API, allowing flexible integration into pipelines or interactive GUIs.
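The exact cosine‑similarity search over pre‑computed embeddings can be sketched in plain Python. In Llettuce the vectors come from a sentence‑transformer and live in a PGVector column, with the similarity computed inside Postgres; the tiny 3‑dimensional vectors and concept names below are illustrative stand‑ins only:

```python
import math

def cosine(a, b):
    """Exact cosine similarity between two vectors (no ANN approximation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed concept embeddings; in Llettuce these would be
# produced once by a sentence-transformer (e.g., BGE-small-en-v1.5) for the
# whole OMOP vocabulary and stored alongside each concept row.
concept_embeddings = {
    "Fish oil": [0.9, 0.1, 0.0],
    "Calcium":  [0.0, 0.9, 0.2],
}

def rank_concepts(query_vec, embeddings):
    """Score every concept against the query embedding, best match first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_concepts([0.85, 0.15, 0.05], concept_embeddings)[0][0])  # → Fish oil
```

Because every concept is scored exactly, the ranking has no recall loss from an approximate index; the trade-off is the upfront cost of embedding the full vocabulary, noted under limitations below.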
Retrieval Pathways
- Keyword Search: The input term is tokenized, lemmatized, and compared against the stored lexemes. This method excels when the source term shares wording with an OMOP concept but struggles with synonyms, abbreviations, or brand names.
- Vector‑Based Semantic Search: The same sentence‑transformer encodes the input term, and cosine similarity is computed against all concept embeddings. The approach successfully groups semantically equivalent terms (e.g., “paracetamol” and “acetaminophen”) even when their surface forms differ.
- RAG‑Enabled LLM Search: If the top‑k vector similarity scores fall below a configurable threshold, Llettuce constructs a prompt that includes the most relevant candidate concepts and asks the LLM to suggest the appropriate OMOP term. The LLM’s output is then validated against the OMOP vocabulary; if it does not directly match a concept name, a secondary keyword search is performed on the LLM’s suggestion to produce a ranked list.
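The cascade across the three pathways can be sketched as follows. The function names, signatures, and threshold value are hypothetical stand‑ins for Llettuce’s actual interfaces, and the validation step is simplified to a check against the retrieved candidates rather than the full OMOP vocabulary:

```python
# Illustrative threshold; in Llettuce this is configurable, the value is made up.
SIMILARITY_THRESHOLD = 0.7

def map_term(term, vector_search, llm_suggest, keyword_search,
             threshold=SIMILARITY_THRESHOLD):
    """Cascade: vector search first, RAG/LLM fallback when confidence is low,
    then a secondary keyword search on the LLM's suggestion."""
    candidates = vector_search(term)   # [(concept_name, score), ...] best first
    if candidates and candidates[0][1] >= threshold:
        return candidates              # vector search is confident enough
    # Ground the LLM in the retrieved candidates (RAG) and ask for a concept.
    suggestion = llm_suggest(term, context=[name for name, _ in candidates])
    # Simplified validation: the real tool checks the full OMOP vocabulary.
    if any(name == suggestion for name, _ in candidates):
        return [(suggestion, 1.0)]
    # Suggestion is not a known concept name: fall back to keyword search.
    return keyword_search(suggestion)
```

The key design point is that the LLM is only consulted when the cheaper retrieval stages are uncertain, and its free-text output is never trusted directly: it is always resolved back to a vocabulary entry.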
GDPR‑Focused Deployment
All components run locally; the tool is distributed under the MIT license and can be containerised with Docker or deployed via omop-lite. This design eliminates the need to send protected health information (PHI) to external APIs, addressing the major compliance concerns associated with proprietary LLM services such as OpenAI’s ChatGPT.
Evaluation
Two drug‑name datasets were used: one containing formal RxNorm terms and another comprising self‑reported brand or colloquial names (e.g., “Now Foods omega‑3”). Llettuce’s semantic pipelines achieved up to a two‑fold increase in the proportion of correct concepts appearing within the top‑10 results compared with pure lexical search. In the illustrative case of “Now Foods omega‑3”, the system correctly mapped the term to the OMOP concept “Fish oil”, whereas Athena’s string match returned unrelated calcium or ubiquinone products. The RAG pipeline further improved recall for ambiguous or low‑similarity inputs by leveraging LLM reasoning.
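The “correct concept within the top‑10 results” figure is a top‑k accuracy metric, which can be computed as below; the queries and mappings shown are toy data, not the paper’s evaluation set:

```python
def top_k_accuracy(results, gold, k=10):
    """Fraction of queries whose gold-standard concept appears among the
    top-k returned candidates."""
    hits = sum(1 for query, truth in gold.items()
               if truth in results.get(query, [])[:k])
    return hits / len(gold)

# Toy example with made-up query-to-concept mappings.
gold = {"Now Foods omega-3": "Fish oil", "paracetamol": "Acetaminophen"}
results = {
    "Now Foods omega-3": ["Fish oil", "Calcium"],
    "paracetamol": ["Ubiquinone", "Acetaminophen"],
}
print(top_k_accuracy(results, gold, k=10))  # → 1.0
print(top_k_accuracy(results, gold, k=1))   # → 0.5
```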
Key Insights and Limitations
- Combining lexical, vector, and LLM‑augmented retrieval yields a robust, multi‑layered approach that adapts to varying degrees of term formality.
- Pre‑computing embeddings for the entire OMOP vocabulary ensures high‑precision similarity calculations but incurs an upfront computational cost.
- The current implementation is English‑centric; extending to multilingual vocabularies would require additional language‑specific embeddings and LLMs.
- Prompt engineering and similarity thresholds are manually set; automated hyper‑parameter optimisation could further streamline deployment across domains.
Conclusion and Future Work
Llettuce demonstrates that open‑source, locally‑hosted NLP pipelines can substantially reduce manual effort in OMOP concept mapping while maintaining compliance with data‑privacy regulations. The authors plan to broaden language support, explore automatic threshold tuning, and evaluate the system in real‑world clinical data pipelines. By making the code publicly available, the authors invite the community to extend, benchmark, and integrate Llettuce into broader health‑informatics ecosystems.