Title: Darth Vecdor: An Open-Source System for Generating Knowledge Graphs Through Large Language Model Queries
ArXiv ID: 2512.15906
Date: 2025-12-17
Authors: Jonathan A. Handler, MD; 1) Keylog Solutions LLC, Northbrook, IL, USA (jhandler@gmail.com); 2) Clinical Intelligence and Advanced Data Lab, OSF HealthCare, Peoria, IL, USA (jonathan.a.handler@osfhealthcare.org)
📝 Abstract
Many large language models (LLMs) are trained on a massive body of knowledge present on the Internet. Darth Vecdor (DV) was designed to extract this knowledge into a structured, terminology-mapped, SQL database ("knowledge base" or "knowledge graph"). Knowledge graphs may be useful in many domains, including healthcare. Although one might query an LLM directly rather than a SQL-based knowledge graph, concerns such as cost, speed, safety, and confidence may arise, especially in high-volume operations. These may be mitigated when the information is pre-extracted from the LLM and becomes queryable through a standard database. However, the author found it necessary to address several issues: erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, as well as the need to allow for multi-element responses. DV was built with features intended to mitigate these issues. To facilitate ease of use, and to allow for prompt engineering by those with domain expertise but little technical background, DV provides a simple, browser-based graphical user interface. DV has been released as free, open-source, extensible software, on an "as is" basis, without warranties or conditions of any kind, either express or implied. Users need to be cognizant of the potential risks and benefits of using DV and its outputs, and users are responsible for ensuring any use is safe and effective. DV should be assumed to have bugs, potentially very serious ones. However, the author hopes that appropriate use of current and future versions of DV and its outputs can help improve healthcare.
📄 Full Content
INTRODUCTION
Large language models (LLMs) have already had a significant impact in healthcare, and many more uses
are reported in development and in planned implementation. Since LLMs are trained on huge volumes of
data, they encode a significant swath of the knowledge present on the Internet and possibly other
sources. Therefore, the author hypothesized that LLMs can be used as a source to populate
knowledge graphs in a database (or "knowledge base") for various uses. For example, a knowledge graph
of medications used to treat diseases might be used as a part of a research effort in which a database that
includes the knowledge graph along with patient data is queried to find which patients have potentially
untreated diseases (i.e., no medication has been prescribed that treats that disease).
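To make this example concrete, consider the minimal sketch below, written in Python with the standard
library's sqlite3 module. The treats, diagnoses, and prescriptions tables and their columns are
hypothetical illustrations, not DV's actual schema: a knowledge graph of medication-treats-disease
relationships is joined against patient records to flag diagnoses with no matching prescription.

import sqlite3

# Hypothetical schema for illustration only; not DV's actual output format.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE treats (medication TEXT, disease TEXT)")               # knowledge graph edges
cur.execute("CREATE TABLE diagnoses (patient_id INTEGER, disease TEXT)")         # patient diagnoses
cur.execute("CREATE TABLE prescriptions (patient_id INTEGER, medication TEXT)")  # patient prescriptions

cur.executemany("INSERT INTO treats VALUES (?, ?)",
                [("lisinopril", "hypertension"), ("metformin", "type 2 diabetes")])
cur.executemany("INSERT INTO diagnoses VALUES (?, ?)",
                [(1, "hypertension"), (2, "type 2 diabetes")])
cur.executemany("INSERT INTO prescriptions VALUES (?, ?)",
                [(1, "lisinopril")])  # patient 2 has no prescription on record

# A diagnosis is potentially untreated when no medication prescribed to that
# patient is recorded in the knowledge graph as treating that disease.
cur.execute("""
    SELECT d.patient_id, d.disease
    FROM diagnoses d
    WHERE NOT EXISTS (
        SELECT 1 FROM prescriptions p
        JOIN treats t ON t.medication = p.medication
        WHERE p.patient_id = d.patient_id AND t.disease = d.disease
    )
""")
print(cur.fetchall())  # [(2, 'type 2 diabetes')]

Here the correlated NOT EXISTS subquery expresses "no prescribed medication treats this disease"
directly in SQL; in a real deployment the same join would run against institutional patient tables and a
DV-generated knowledge graph.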
Querying a knowledge graph previously created through LLM queries, rather than just querying an LLM
directly as needed, may have several potential advantages:
1. Cheaper
a. Lower compute costs: In some cases, the cost of computation to query a knowledge
graph may be dramatically lower than querying an LLM.
b. Lower hardware costs and complexity: If the LLM would otherwise be run on
institutionally controlled servers, the costs, complexity, and management burden of the
hardware stack required to achieve rapid responses may be prohibitive for many. The
hardware costs and complexity needed to query a knowledge graph in a database may,
in many cases, be much lower than those needed to support many LLMs.
2. Faster
a. Faster query speed: In many cases, querying a knowledge graph (e.g., via a vector
database) may be orders of magnitude faster than querying an LLM.
b. Facilitation of development and implementation: The people, processes, and
technologies for building and implementing systems built on a knowledge graph (perhaps
especially if implemented through a SQL database) may be better developed and more
readily available than those for systems using LLMs.
3. Safer
a. Reduction of privacy and confidentiality risks: If the LLM that would have been used is a
third party’s commercial service, querying a knowledge graph running on institutionally
controlled servers instead may reduce the risks of submitting potentially sensitive data to
a commercial third-party service.
b. Reduction of many business risks: If the LLM is controlled by a third party, then using a
knowledge graph in operational use rather than directly querying the LLM may reduce
many business risks such as the third party deprecating the product, increasing pricing, or
modifying functionality.
4. Surer
a. More explainable responses: LLMs are often considered “black boxes” since the actual
logic for producing a given output commonly cannot be provided in a format meaningful to
humans. Although an LLM’s population of the knowledge graph may be considered “black
box,” the downstream usage of that knowledge graph can often be more explainable
since