Monitoring Deployed AI Systems in Health Care
Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.
💡 Research Summary
The paper presents a pragmatic framework for post‑deployment monitoring of artificial intelligence (AI) systems in health care, drawing on real‑world experience at Stanford Health Care (SHC). Recognizing that governance decisions—whether to update, modify, or retire an AI tool—require reliable, actionable data, the authors organize monitoring around three complementary principles: system integrity, performance, and impact.
System integrity monitoring ensures that the technical pipeline (data extraction, feature generation, model inference, and output delivery) remains operational. Key metrics include service uptime, API request latency, data‑retrieval failures, and inference‑time errors. Pre‑defined thresholds trigger automated alerts to engineering, data‑science, or application teams for rapid remediation. The approach aligns with MLOps best practices and is applied both to internally hosted models and to externally hosted large‑language‑model (LLM) APIs.
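The threshold-and-alert pattern described above can be sketched in a few lines. This is an illustrative assumption, not SHC's actual implementation: the metric names, threshold values, and the shape of the metrics snapshot are all hypothetical.

```python
# Minimal sketch of pre-defined-threshold integrity checks.
# Metric names and limits are hypothetical examples, not from the paper.
THRESHOLDS = {
    "uptime_pct": ("min", 99.5),          # alert if service uptime falls below
    "api_latency_ms_p95": ("max", 2000),  # alert if 95th-percentile latency exceeds
    "inference_error_rate": ("max", 0.01) # alert if inference-time errors exceed
}

def check_integrity(metrics: dict) -> list:
    """Return the names of metrics that breach their pre-defined threshold."""
    breaches = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            breaches.append(name)  # a missing metric is itself a failure signal
        elif kind == "min" and value < limit:
            breaches.append(name)
        elif kind == "max" and value > limit:
            breaches.append(name)
    return breaches

# Usage: a breach list would be routed to the engineering, data-science,
# or application team responsible for remediation.
snapshot = {"uptime_pct": 99.9, "api_latency_ms_p95": 3200,
            "inference_error_rate": 0.002}
alerts = check_integrity(snapshot)  # latency breach detected here
```

In practice such checks would run on a schedule against the monitoring platform's metric store, with breaches feeding the automated alerting the summary describes.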
Performance monitoring tracks whether the model’s predictive behavior stays accurate over time despite evolving clinical practices, documentation habits, and patient populations—a phenomenon known as dataset shift or concept drift. For traditional machine‑learning models, the framework monitors time‑series of ROC‑AUC, sensitivity, specificity, and calibration metrics, prompting retraining or feature redesign when degradation exceeds set limits. For generative AI, the authors differentiate fixed‑prompt systems (e.g., an LLM‑driven hospice eligibility screen) from open‑prompt systems (e.g., an EHR‑integrated chatbot). Fixed‑prompt tools are evaluated on consistency of output against a known rubric, while open‑prompt tools are assessed via token usage, error codes, latency, and user‑feedback loops.
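The degradation check for traditional models can be sketched as a comparison of a recent window's ROC-AUC against a baseline. The tolerance value and function names below are illustrative assumptions; the paper specifies only that retraining is prompted when degradation exceeds set limits.

```python
# Sketch of ROC-AUC drift detection on a recent window of labeled outcomes.
# The 0.05 tolerance is a hypothetical limit, not a value from the paper.

def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("window must contain both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def degraded(baseline_auc, window_auc, tolerance=0.05):
    """Flag the model for retraining review when the recent window's
    AUC has fallen more than `tolerance` below the validation baseline."""
    return baseline_auc - window_auc > tolerance

# Usage: perfectly separated scores give AUC 1.0; a tied pair gives 0.875.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])  # 1.0
needs_retrain = degraded(baseline_auc=0.85, window_auc=0.78)  # True
```

A full monitoring plan would track this as a time series alongside sensitivity, specificity, and calibration, since dataset shift can degrade calibration well before discrimination metrics move.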
Impact monitoring links AI outputs to downstream clinical or operational outcomes. The framework defines concrete key performance indicators (KPIs) such as reduction in time‑to‑diagnosis, changes in treatment patterns, documentation time saved, or patient‑level health outcomes. Regular review of these KPIs informs decisions about workflow redesign, user training, or tool decommissioning, ensuring that AI delivers sustained value rather than merely functioning technically.
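A KPI review of this kind can be reduced to a pre/post comparison against a minimum expected effect. Everything below is a hypothetical sketch: the KPI values, the 5% minimum effect, and the function names are assumptions for illustration, not figures from the paper.

```python
# Sketch of an impact-KPI regression check. Baselines, direction of
# improvement, and the minimum effect size are illustrative assumptions.

def kpi_change(baseline: float, current: float) -> float:
    """Relative change vs. baseline (negative = reduction)."""
    return (current - baseline) / baseline

def needs_review(baseline: float, current: float,
                 improvement_is_decrease: bool,
                 min_effect: float = 0.05) -> bool:
    """Flag a KPI whose expected improvement has not materialized,
    triggering review of workflow design, training, or decommissioning."""
    change = kpi_change(baseline, current)
    if improvement_is_decrease:
        return change > -min_effect  # e.g. time-to-diagnosis did not fall 5%
    return change < min_effect       # e.g. adoption rate did not rise 5%

# Usage with hypothetical numbers: time-to-diagnosis fell from 72h to 60h,
# a ~17% reduction, so this KPI would not be flagged for review.
flag = needs_review(baseline=72.0, current=60.0, improvement_is_decrease=True)
```

The point of the sketch is the governance link: a flagged KPI routes to the review-and-action step, not just to a dashboard.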
Implementation leverages existing institutional data platforms to avoid “point‑solution” sprawl. SHC uses Epic’s Model and Feature Management dashboards for Epic‑hosted models, Databricks notebooks and dashboards for in‑house services, and ServiceNow as a centralized intake for incident tickets and change requests. Where legacy dashboards exist (Tableau, PowerBI), they are incorporated into the monitoring plan, but the authors recommend consolidating tools whenever possible to reduce cross‑system permissions and maintenance overhead.
The paper also candidly discusses challenges: limited staffing and budget, the complexity of integrating monitoring into a health system with thousands of applications and interfaces, and the difficulty of aligning diverse stakeholder definitions of success. To mitigate these, the authors advocate for automated alerting, pre‑specified response protocols, and continuous education of clinical and operational partners.
In sum, the proposed framework translates technical monitoring recommendations into actionable governance processes. By coupling MLOps‑style integrity checks, rigorous performance drift detection, and outcome‑focused impact assessment, it offers a scalable template for health systems seeking to keep AI tools safe, accurate, and beneficial throughout their lifecycle.