Cancer Research UK Drug Discovery Process Mining
Background. The Drug Discovery Unit (DDU) of Cancer Research UK (CRUK) uses the Dotmatics software to store and analyse scientific data during the drug discovery process. The data, which include event logs, timestamps, activities, and user information, mostly sit in the database without their potential value being fully utilised.
Aims. This dissertation aims to extract knowledge from the event log data recorded during the drug discovery process, in order to capture the operational business process of the DDU as it was actually executed. It provides evaluations of, and methodologies for, drawing panoramic process mining models of the drug discovery process. Enabling the DDU to maximise its efficiency in reviewing its resource and work allocations should, in turn, let patients benefit from more new treatments sooner.
Conclusion. The management of organisations can benefit from process mining methodologies. Disco is excellent for non-experts with management purposes. ProM is great for experts with research purposes. However, process mining is not a once-and-for-all exercise but a regular operations management process. Indeed, making sense of event logs requires a deeper understanding of the target organisation's behaviours and business processes. Researchers have to be aware that event log data are the most important, highest-priority element in process mining.
💡 Research Summary
The dissertation investigates the application of process mining techniques to the drug discovery workflow of Cancer Research UK’s Drug Discovery Unit (DDU). The unit stores extensive experimental data in the Dotmatics platform, including timestamps, activity names, case identifiers, and user information. Although this information is recorded as event logs, it has traditionally been used only for archival purposes rather than for operational insight. The study therefore pursues three primary objectives: (1) to extract and transform the raw Dotmatics records into a standardized event‑log format (XES) suitable for process mining; (2) to apply two widely used process‑mining tools—Disco and ProM—to generate visual and quantitative models of the actual workflow; and (3) to evaluate the strengths and limitations of each tool for different stakeholder groups within the organization.
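To make objective (1) concrete, the following is a minimal sketch of how flat records might be serialised as XES traces using only the Python standard library. The row fields (`case_id`, `activity`, `timestamp`, `resource`), the function name `rows_to_xes`, and the sample values are assumptions for illustration; the summary does not describe the dissertation's actual export code. The attribute keys `concept:name`, `time:timestamp`, and `org:resource` come from the standard XES extensions, whose declarations are omitted here for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def rows_to_xes(rows):
    """Group flat event rows by case and build an XES <log> element."""
    traces = defaultdict(list)
    for row in rows:
        traces[row["case_id"]].append(row)

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        # ISO 8601 strings with the same UTC offset sort chronologically as text.
        for ev in sorted(events, key=lambda e: e["timestamp"]):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": ev["activity"]})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": ev["timestamp"]})
            ET.SubElement(event, "string", {"key": "org:resource", "value": ev["resource"]})
    return log

# Hypothetical rows; real data would come from the joined Dotmatics tables.
rows = [
    {"case_id": "P-001", "activity": "Purity Check",
     "timestamp": "2020-01-02T09:00:00+00:00", "resource": "analyst_b"},
    {"case_id": "P-001", "activity": "Synthesis",
     "timestamp": "2020-01-01T09:00:00+00:00", "resource": "chemist_a"},
]
xes_log = rows_to_xes(rows)
xes_text = ET.tostring(xes_log, encoding="unicode")
```

A file written from `xes_text` would still need the XES extension declarations before strict importers accept it; the sketch only shows the trace and event nesting.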
Data extraction began with a detailed analysis of the Dotmatics schema. Key tables—Experiments, Samples, Activities, and Users—were joined to produce a dataset containing a case ID (project identifier), activity name, start and end timestamps, and the responsible resource. The raw export comprised over 12,000 distinct cases and nearly 80,000 events. A rigorous cleaning pipeline was implemented: duplicate events were removed, implausible time gaps (e.g., intervals exceeding 30 days) were filtered out, missing end times were estimated based on subsequent activity timestamps, and activity labels were normalized (e.g., “SYN”, “SYNTH”, “Synthesis” → “Synthesis”). The resulting high‑quality log formed the basis for all subsequent analyses.
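The cleaning steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's pipeline: the event fields, the label map, the order of the steps, and the reading of "implausible time gap" as an event whose start-to-end duration exceeds 30 days are all choices made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical label map and threshold, mirroring the examples in the text.
LABEL_MAP = {"syn": "Synthesis", "synth": "Synthesis", "synthesis": "Synthesis"}
MAX_DURATION = timedelta(days=30)

def clean_case(events):
    """Clean the events of a single case.

    Each event is a dict with 'activity', 'start' (datetime) and
    'end' (datetime or None).
    """
    # 1. Normalise activity labels (copying each event so the input is untouched).
    normalised = []
    for e in events:
        label = e["activity"].strip()
        normalised.append(dict(e, activity=LABEL_MAP.get(label.lower(), label)))

    # 2. Sort by start time and drop exact duplicates.
    unique, seen = [], set()
    for e in sorted(normalised, key=lambda e: e["start"]):
        key = (e["activity"], e["start"], e["end"])
        if key not in seen:
            seen.add(key)
            unique.append(e)

    # 3. Estimate a missing end time from the start of the next event.
    for current, following in zip(unique, unique[1:]):
        if current["end"] is None:
            current["end"] = following["start"]

    # 4. Drop events whose duration exceeds the threshold (assumed interpretation).
    return [e for e in unique
            if e["end"] is None or e["end"] - e["start"] <= MAX_DURATION]

raw = [
    {"activity": "SYN", "start": datetime(2020, 1, 1), "end": None},
    {"activity": "Synthesis", "start": datetime(2020, 1, 1), "end": None},
    {"activity": "Purity Check", "start": datetime(2020, 1, 3), "end": datetime(2020, 3, 1)},
    {"activity": "Biological Screening", "start": datetime(2020, 1, 5), "end": datetime(2020, 1, 6)},
]
cleaned = clean_case(raw)
```

Normalising labels before deduplication lets "SYN" and "Synthesis" at the same timestamp collapse into one event; a different step order would give a different result, which is why the order is flagged as an assumption.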
Disco, a user‑friendly, GUI‑driven tool, was first employed to produce a process map. Disco automatically aggregates frequencies and average durations, encoding them as edge thickness and color intensity. The map revealed a dominant linear pathway: Compound Synthesis → Purity Check → Biological Screening. However, the “Data Curation” step emerged as a bottleneck, with an average waiting time of 48 hours, inflating the overall cycle time by roughly 22 %. Disco’s interactive filters allowed the analyst to isolate sub‑processes, compare resource utilization, and quickly generate management‑ready dashboards.
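The kind of frequency and waiting-time annotation described above can be approximated with a directly-follows graph. The sketch below is a simplified stand-in, not Disco's actual algorithm; the trace layout, the function name `directly_follows`, and the sample timestamps are assumptions chosen for illustration and are not data from the study.

```python
from collections import defaultdict
from datetime import datetime

def directly_follows(traces):
    """Count each a -> b hand-over and average the waiting time in hours.

    traces maps a case ID to a list of (activity, start, end) tuples
    already sorted by start time.
    """
    counts = defaultdict(int)
    wait_hours = defaultdict(float)
    for events in traces.values():
        for (a, _, a_end), (b, b_start, _) in zip(events, events[1:]):
            counts[(a, b)] += 1
            wait_hours[(a, b)] += (b_start - a_end).total_seconds() / 3600
    return {edge: (n, wait_hours[edge] / n) for edge, n in counts.items()}

# Hypothetical traces: waiting before "Data Curation" is 48 h and 24 h.
traces = {
    "P-001": [
        ("Biological Screening", datetime(2020, 1, 1, 8), datetime(2020, 1, 1, 12)),
        ("Data Curation", datetime(2020, 1, 3, 12), datetime(2020, 1, 3, 15)),
    ],
    "P-002": [
        ("Biological Screening", datetime(2020, 1, 4, 8), datetime(2020, 1, 5, 0)),
        ("Data Curation", datetime(2020, 1, 6, 0), datetime(2020, 1, 6, 4)),
    ],
}
edges = directly_follows(traces)
```

Sorting the resulting edges by mean waiting time is one simple way to surface bottleneck candidates such as the hand-over into "Data Curation".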
ProM, an open‑source, plug‑in‑rich environment, was then used for deeper, research‑oriented analysis. The Alpha algorithm reconstructed a Petri‑net model that matched Disco’s main flow while also exposing less‑frequent deviations, such as direct transitions from Synthesis to In‑vivo Testing. Performance mining plug‑ins quantified the lead‑time distribution for each activity, highlighting high variance in “Purity Check” and “Data Curation.” A Hidden Markov Model (HMM) plug‑in inferred latent states corresponding to resource constraints, uncovering that a small subset of scientists were repeatedly overloaded, which correlated with the observed delays.
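As a rough illustration of what the Alpha algorithm works from, the sketch below computes only its first step, the footprint of ordering relations between activities: causality (`->`), reverse causality (`<-`), parallelism (`||`), and no direct relation (`#`). It does not build the Petri net, and it is not ProM's plug-in code; the function name and the sample traces are assumptions for this example.

```python
from itertools import product

def footprint(traces):
    """Derive Alpha-style ordering relations from a list of activity sequences."""
    activities = sorted({a for trace in traces for a in trace})
    # a > b holds when b directly follows a in at least one trace.
    follows = {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}

    table = {}
    for a, b in product(activities, repeat=2):
        if (a, b) in follows and (b, a) in follows:
            table[(a, b)] = "||"   # seen in both orders: parallel
        elif (a, b) in follows:
            table[(a, b)] = "->"   # a causes b
        elif (b, a) in follows:
            table[(a, b)] = "<-"   # b causes a
        else:
            table[(a, b)] = "#"    # never directly adjacent
    return table

# Hypothetical traces: the main flow plus one rarer deviation.
traces = [
    ["Synthesis", "Purity Check", "Biological Screening"],
    ["Synthesis", "In-vivo Testing"],
]
fp = footprint(traces)
```

Because the footprint ignores frequencies, a single deviating case such as Synthesis followed directly by In-vivo Testing receives the same `->` relation as the dominant path, which is consistent with the model exposing less-frequent deviations.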
The comparative evaluation concluded that Disco excels for non‑technical managers who need rapid visual insight and executive summaries, whereas ProM is indispensable for scholars and process engineers requiring algorithmic flexibility, statistical validation, and custom extensions. The authors recommend a hybrid approach: routine operational monitoring with Disco, complemented by periodic deep‑dive studies using ProM.
Beyond tool selection, the dissertation stresses that process mining should be institutionalized as a continuous improvement practice rather than a one‑off project. Key recommendations include establishing a governance framework for log quality (regular audits, automated validation scripts), scheduling periodic model refreshes to capture evolving laboratory practices, and integrating mining outcomes into decision‑making workflows (e.g., adjusting staffing levels, redesigning experiment queues). Moreover, the authors argue that the design of event logs must be intentional from the outset; capturing the right attributes aligned with strategic KPIs (cycle time, throughput, resource balance) is essential for meaningful analysis.
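An automated validation script of the kind recommended above might look like the following sketch. The required fields, the rule set, and the function name `validate_events` are hypothetical; the summary does not specify which checks the authors had in mind.

```python
from datetime import datetime

# Assumed minimum attribute set for a usable event record.
REQUIRED = ("case_id", "activity", "start", "resource")

def validate_events(events):
    """Return (index, problem) pairs for events that fail basic quality checks."""
    problems = []
    for i, e in enumerate(events):
        for field in REQUIRED:
            if not e.get(field):
                problems.append((i, f"missing {field}"))
        start, end = e.get("start"), e.get("end")
        if start and end and end < start:
            problems.append((i, "end before start"))
    return problems

# Hypothetical records: one valid, one without a resource, one with reversed times.
events = [
    {"case_id": "P-001", "activity": "Synthesis", "resource": "chemist_a",
     "start": datetime(2020, 1, 1), "end": datetime(2020, 1, 2)},
    {"case_id": "P-001", "activity": "Purity Check", "resource": "",
     "start": datetime(2020, 1, 2), "end": datetime(2020, 1, 3)},
    {"case_id": "P-002", "activity": "Synthesis", "resource": "chemist_a",
     "start": datetime(2020, 1, 5), "end": datetime(2020, 1, 4)},
]
issues = validate_events(events)
```

Running such a check on each new export, and tracking the number of issues over time, is one way a log-quality audit could be made routine.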
In summary, by converting Dotmatics data into actionable event logs and applying both Disco and ProM, the study demonstrates that the DDU can visualize its real‑world processes, pinpoint inefficiencies, and make data‑driven adjustments. Implementing the suggested continuous‑mining regime promises to accelerate the drug discovery pipeline, ultimately delivering new treatments to patients more quickly.