Engineering AI Agents for Clinical Workflows: A Case Study in Architecture, MLOps, and Governance
The integration of Artificial Intelligence (AI) into clinical settings presents a software engineering challenge, demanding a shift from isolated models to robust, governable, and reliable systems. However, industrial applications are often plagued by brittle, prototype-derived architectures and a lack of systemic oversight, creating a "responsibility vacuum" where safety and accountability are compromised. This paper presents an industry case study of the "Maria" platform, a production-grade AI system in primary healthcare that addresses this gap. Our central hypothesis is that trustworthy clinical AI is achieved through the holistic integration of four foundational engineering pillars. We present a synergistic architecture that combines Clean Architecture for maintainability with an Event-Driven Architecture for resilience and auditability. We introduce the Agent as the primary unit of modularity, each possessing its own autonomous MLOps lifecycle. Finally, we show how a Human-in-the-Loop governance model is technically integrated not merely as a safety check, but as a critical, event-driven data source for continuous improvement. We present the platform as a reference architecture, offering practical lessons for engineers building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains.
💡 Research Summary
The paper presents a comprehensive industry case study of the “Maria” platform, a production‑grade AI system deployed in primary healthcare in Brazil. Its central thesis is that trustworthy clinical AI cannot be achieved by isolated model development alone; instead, it requires the holistic integration of four engineering pillars: (1) a synergistic architecture that combines Clean Architecture with an Event‑Driven Architecture (EDA), (2) an agent‑centric design that treats AI components as autonomous agents, (3) robust, agent‑level MLOps pipelines, and (4) a Human‑in‑the‑Loop (HITL) governance model tightly coupled with the MLOps infrastructure.
Architectural Integration
The authors adopt Clean Architecture to isolate core domain logic (entities and use‑cases) from volatile outer layers such as frameworks, databases, and large language model (LLM) providers. This separation protects clinical business rules from technology churn and enables extensive unit testing. Around this core they overlay an EDA, using an asynchronous event bus (AWS EventBridge/SNS) to decouple communication between the core and infrastructure adapters. Events represent clinical triggers—patient check‑in, lab result arrival, etc.—allowing the system to react in real time while maintaining a complete audit trail. The combination yields a system that is both maintainable (thanks to clean boundaries) and resilient (thanks to asynchronous, loosely‑coupled components).
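The pattern described above can be sketched in a few lines. In this illustrative Python sketch (all class and event names are my own assumptions, not identifiers from the paper), the core use-case depends only on an abstract `EventPublisher` port; an in-memory adapter stands in for a production adapter that would wrap AWS EventBridge/SNS, which is what lets the clinical business rules stay free of cloud-specific code and be unit-tested in isolation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalEvent:
    """A domain event, e.g. a patient check-in or a lab result arrival."""
    name: str
    payload: dict


class EventPublisher(ABC):
    """Port: the core depends on this abstraction, not on EventBridge/SNS."""
    @abstractmethod
    def publish(self, event: ClinicalEvent) -> None: ...


class InMemoryEventPublisher(EventPublisher):
    """Test adapter; a production adapter would call the AWS SDK instead."""
    def __init__(self) -> None:
        self.published: list[ClinicalEvent] = []

    def publish(self, event: ClinicalEvent) -> None:
        self.published.append(event)


class CheckInPatient:
    """Use-case in the core layer: pure business logic, no AWS imports."""
    def __init__(self, publisher: EventPublisher) -> None:
        self._publisher = publisher

    def execute(self, patient_id: str) -> None:
        # Domain rules would run here; the outcome is announced as an event,
        # which also forms the audit trail mentioned in the paper.
        self._publisher.publish(
            ClinicalEvent("patient.checked_in", {"patient_id": patient_id})
        )


bus = InMemoryEventPublisher()
CheckInPatient(bus).execute("p-123")
```

Because the dependency points inward (adapter implements the port the core defines), swapping the event bus changes only the outermost layer.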
Agent‑Based Design
The platform defines two primary agents: a Pre‑Appointment agent that interprets patient intent and prepares the encounter, and a Post‑Appointment agent that generates summaries, diagnoses, and follow‑up recommendations. Each agent encapsulates its own business logic and owns a dedicated MLOps lifecycle. By treating agents as first‑class modular units, the system achieves true separation of concerns: a change in one agent’s model or data pipeline does not impact the other, and responsibility for each agent can be traced independently.
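One way to picture agents as first-class modular units is an event-to-agent routing table, where each agent carries its own independently versioned model. This is a minimal sketch under my own assumptions (the event names, version strings, and routing mechanism are illustrative, not taken from the paper):

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """First-class modular unit: each agent owns its logic and model version."""
    model_version: str

    @abstractmethod
    def handle(self, event: dict) -> dict: ...


class PreAppointmentAgent(Agent):
    model_version = "pre-appt-v3"  # versioned independently of other agents

    def handle(self, event: dict) -> dict:
        # Would interpret patient intent and prepare the encounter.
        return {"agent": "pre", "model_version": self.model_version}


class PostAppointmentAgent(Agent):
    model_version = "post-appt-v7"

    def handle(self, event: dict) -> dict:
        # Would generate summaries and follow-up recommendations.
        return {"agent": "post", "model_version": self.model_version}


# Event-to-agent routing: upgrading one agent never touches the other,
# and every output is attributable to exactly one agent and model version.
ROUTES: dict[str, Agent] = {
    "patient.checked_in": PreAppointmentAgent(),
    "encounter.closed": PostAppointmentAgent(),
}

result = ROUTES["patient.checked_in"].handle({"patient_id": "p-123"})
```

The routing table makes the separation of concerns concrete: redeploying the Post-Appointment agent's model replaces one entry without rebuilding or retesting the other.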
MLOps Implementation
MLOps pipelines are built around CI/CD, model registries, data minimization, drift detection, and observability (logs, metrics, traces). Crucially, the pipelines operate at the agent level, providing versioned model artifacts, performance dashboards, and automated rollback capabilities for each agent separately. This granularity enables “graceful degradation” when a model drifts or a new version fails validation, without bringing down the entire platform. The infrastructure layer implements abstract interfaces (ports) for persistence, model calls, and event publishing, allowing easy swapping of underlying technologies (e.g., Aurora vs. DynamoDB, OpenAI vs. Bedrock).
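The agent-level rollback idea can be illustrated with a toy per-agent model registry. This is a hypothetical sketch (the `ModelRegistry` class, the drift metric, and the threshold value are my assumptions, not the paper's implementation): when a drift score for one agent exceeds its threshold, only that agent's model reverts, which is the "graceful degradation" the paper describes.

```python
class ModelRegistry:
    """Per-agent registry: versioned artifacts with automated rollback."""
    def __init__(self) -> None:
        self._versions: list[str] = []

    def register(self, version: str) -> None:
        self._versions.append(version)

    @property
    def current(self) -> str:
        return self._versions[-1]

    def rollback(self) -> str:
        # Revert to the previous artifact, if one exists.
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current


def check_drift(metric: float, threshold: float, registry: ModelRegistry) -> str:
    """If the drift metric breaches the threshold, revert this agent only;
    other agents' registries are untouched."""
    if metric > threshold:
        return registry.rollback()
    return registry.current


reg = ModelRegistry()
reg.register("post-appt-v6")
reg.register("post-appt-v7")
active = check_drift(metric=0.42, threshold=0.30, registry=reg)  # reverts to v6
```

In the real platform this logic would sit behind the abstract ports mentioned above, so the same rollback policy works whether artifacts live in a managed registry or plain object storage.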
Human‑in‑the‑Loop Governance
Recognizing that fully autonomous AI is unsafe in medicine, the authors embed a HITL workflow at critical decision points. Clinicians review AI outputs via a Step Functions orchestrated UI, and their feedback is captured as events that feed back into the training data pipeline. Because every prediction is linked to a specific model version, training dataset snapshot, and performance metrics, the HITL process becomes auditable and satisfies regulatory requirements such as HIPAA and GDPR. The paper argues that a mature MLOps foundation is a prerequisite for any meaningful HITL system, as it supplies the necessary context for clinicians to make informed judgments.
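The linkage between a prediction and its provenance can be made concrete with a small data model. In this sketch (field and class names are illustrative assumptions, not the platform's schema), every AI suggestion carries its model version and dataset snapshot, and the clinician's verdict is joined to that record before being emitted as a training event, making each review auditable end to end.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Every AI output carries the provenance clinicians need for review."""
    prediction_id: str
    model_version: str
    dataset_snapshot: str
    suggestion: str


@dataclass(frozen=True)
class ClinicianFeedback:
    """Review outcome, emitted as an event back into the data pipeline."""
    prediction_id: str
    approved: bool
    correction: Optional[str]


def to_training_event(pred: Prediction, fb: ClinicianFeedback) -> dict:
    """Join feedback with full model provenance so the stored record can
    answer: which model, trained on which data, suggested what, and what
    did the clinician decide?"""
    return {**asdict(pred), **asdict(fb)}


pred = Prediction("pr-1", "post-appt-v7", "snap-2024-05", "Order HbA1c test")
fb = ClinicianFeedback("pr-1", approved=False, correction="Order fasting glucose")
event = to_training_event(pred, fb)
```

Because rejections and corrections flow back as structured events rather than free-text notes, they can be filtered into the next training snapshot automatically, which is the sense in which HITL acts as a data source and not just a safety check.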
Research Questions and Findings
- RQ1 (Architecture): The combined Clean + EDA pattern delivers a resilient, maintainable, and auditable system, as demonstrated by the clear separation of core business rules from infrastructure and the event‑driven coordination of components.
- RQ2 (Design & MLOps): Modeling AI components as autonomous agents with independent MLOps lifecycles improves modularity, facilitates governance, and reduces technical debt.
- RQ3 (Governance): Integrating HITL with MLOps provides the traceability and context required for safe clinical oversight, effectively filling the “responsibility vacuum” often observed in AI deployments.
Lessons Learned and Limitations
The authors highlight several practical insights: (i) architectural decoupling dramatically lowers long‑term maintenance costs; (ii) agent‑level MLOps enables rapid response to model drift while preserving system stability; (iii) HITL becomes a true safety net when each AI suggestion is linked to versioned artifacts. However, the paper lacks quantitative performance metrics (e.g., model accuracy, latency, drift detection sensitivity), making it difficult to assess clinical impact. Dependence on specific cloud services (AWS) raises concerns about cost and vendor lock‑in. The HITL workflow’s impact on clinician workload and UI/UX design is not explored, and regulatory certification pathways (FDA, CE) are only mentioned superficially.
Conclusion
“Maria” serves as a concrete reference architecture for building trustworthy AI in high‑stakes domains. By marrying Clean Architecture with Event‑Driven design, treating AI components as autonomous agents, provisioning granular MLOps pipelines, and embedding a data‑rich HITL governance loop, the platform demonstrates a viable path toward safe, maintainable, and auditable clinical AI. The approach is extensible to other regulated sectors where reliability, transparency, and human oversight are paramount.