13  Deploying AI in Clinical and Public-Health Practice

13.1 Learning objectives

By the end of this chapter you should be able to:

  • Recognise that interface design, not raw model capability, is now the binding constraint on real-world AI value, and design deployments accordingly.
  • Apply a shadow-mode rollout pattern: run the AI in parallel with the existing workflow, log outputs, and promote to production only after a quantified comparison.
  • Design drift and regression monitoring appropriate to external-API-backed models that change underneath you, including version pinning and rollback procedures.
  • Apply Mollick’s 15 Times to Use AI, and 5 Not To framework as a practical decision filter for analytic workflows.

13.2 Orientation

This chapter closes the book by integrating the threads from earlier chapters into one operational question: how does a research-analytics team, a clinical institution, or a research consortium actually run AI in production over time? The answer is more about interface design, change management, and ongoing verification than about model selection. The researcher who has chosen the right capability class (Ch 2), set up appropriate context (Ch 3), designed the RAG pipeline (Ch 4), and audited for bias (Ch 10) still has to deploy the system so that people use it correctly and it stays trustworthy as the underlying technology shifts.

Mollick’s 2026 framing is direct (Mollick, 2026): ‘people don’t want a chatbot. They want an agent that works on their actual files, with their actual tools, accessible the way they talk to people’. The binding constraint on real-world AI value is no longer model capability (frontier models in 2026 are extraordinarily capable) but the interfaces, harnesses, and integrations that put that capability in front of users in a usable form.

The chapter develops four threads:

  • Interfaces are the bottleneck: why the interface matters more than model selection at this stage.
  • Shadow-mode rollout: the pattern that has emerged for safe deployment of AI in clinical and research workflows.
  • Drift and version management: how to keep a deployment working as external models change.
  • When to use AI, and when not: the practical decision framework that closes the book.

13.3 The researcher’s contribution

Three judgements are not delegable.

(Judgement 1.) Deployment is a continuous practice. The temptation to treat deployment as a one-time event (‘we built the AI tool and rolled it out’) misses that the underlying technology drifts, the team’s needs evolve, and the failure modes accumulate without active maintenance. The researcher treats deployment as ongoing operational work, like maintaining a clinical-trial database, not like publishing a paper.

(Judgement 2.) The decision to deploy is not a technology decision. Whether to deploy AI for a specific workflow depends on whether deployment improves the work, fits the team’s capacity to verify, and is appropriate to the regulatory and ethical context. The researcher asks the deployment-decision question explicitly rather than defaulting to ‘we have the capability, we should use it’.

(Judgement 3.) Some workflows should not be deployed. Not every task is a candidate for AI assistance, even when the capability exists. Mollick’s 15 Times to Use AI, and 5 Not To (Mollick, 2024a) articulates the negative cases: learning that requires personal struggle, high-accuracy work where plausible hallucinations slip past readers, decisions requiring human ethical judgement. The researcher decides what to leave untouched.

These judgements are what distinguish deployment that creates lasting value from deployment that produces a brief productivity bump followed by operational debt.

13.4 Interfaces are the bottleneck

The thesis is empirical: by 2026, frontier AI capability substantially exceeds what most users extract from it. The gap is not in the model; it is in the interface, the harness, and the integration into the user’s actual work.

Mollick’s Claude Dispatch and the Power of Interfaces (Mollick, 2026) documents the pattern. Claude Dispatch combines Claude Code’s agentic backend with mobile messaging so users can drive desktop AI agents from a phone. The capability a user actually extracts from the AI rises substantially when the interface meets them where they are (in a chat app, on their phone) rather than requiring them to come to the AI (open a browser tab, log in, paste content).

The implication for analytic deployment: invest in the interface as much as in the model. Three patterns matter for research-analytics teams.

The model lives where the work lives. A researcher’s day-to-day work happens in RStudio, a SAS terminal, a Quarto editor, an EHR view, or a protocol-development tool. AI assistance accessed in those environments is used; AI assistance requiring a context switch is used less. The right deployments embed AI in the existing environments (RStudio plugins, SAS macros that call APIs, EHR-integrated AI scribes) rather than requiring a separate window.

The handoff between AI and human is fast. AI assistance that requires the user to copy output, paste it elsewhere, run something, and copy back is high-friction. Assistance that integrates the verification step (run the code in-place, see the output, edit) is low-friction. The empirical pattern is that low-friction interfaces get used; high-friction ones get bypassed.

The interface respects the user’s mental model. A researcher thinks in terms of cohorts, models, analyses, and reports, not in terms of prompts, tokens, and contexts. Interfaces that present AI assistance in cohort-and-model language (‘extract a risk-of-bias assessment from this paper into the team’s standard table’) are more usable than ones that require the user to translate their work into prompt-engineering vocabulary.

For a research-analytics team’s deployment:

  • Integrate AI capability into the team’s primary tools (RStudio extensions, custom SAS macros, Quarto-integrated review tools).
  • Build minimum-viable interfaces for the team’s recurring workflows; iterate based on use.
  • Avoid forcing the team into a separate AI application unless the workflow specifically benefits from it.

Question. A research-analytics team has access to a frontier AI through a web chat interface. Adoption is patchy: some team members use it heavily, others not at all. The team lead is considering whether to mandate use. What is the more productive intervention?

Answer. The more productive intervention is to study why the non-users are not using it and to address the interface friction those users face, rather than mandating use. Common causes:

  • The non-users work primarily in tools that are not the web chat (a SAS programmer in a terminal, a protocol writer in Word). The web-chat workflow is high-friction for them.
  • The non-users do work that is not well-suited to the team’s current AI applications (they do more domain-judgement-heavy work; AI is less immediately useful).
  • The non-users have tried it, found verification burdensome, and stopped.

Each cause has a different intervention: integrate AI into their tools, identify their use cases and develop applications, or improve the verification workflow. Mandating use does not address any of these and produces resentment without productivity.

13.5 Shadow-mode rollout

A shadow-mode rollout runs the AI in parallel with the existing workflow, logs both outputs, and uses the comparison to evaluate whether the AI should replace, augment, or be discarded. The pattern is borrowed from clinical decision support and from financial machine learning; it generalises to any deployment where the existing workflow is the standard against which the AI is measured.

The pipeline:

  1. Identify the existing workflow’s output (the manual triage decision, the manual code, the manual abstract screening, etc.).
  2. Set up the AI to produce the same output on the same inputs in parallel; the AI’s output does not affect any decision.
  3. Log both outputs alongside the input and any ground-truth signal that becomes available later.
  4. Periodically (weekly, monthly) compute agreement metrics: kappa for categorical outputs, ICC or correlation for continuous, accuracy against ground truth where available.
  5. Investigate disagreements: are they cases where the AI was wrong, where the human was wrong, or where the question was genuinely ambiguous?
  6. Decide based on the analysis: promote the AI to replace or augment the manual workflow, continue shadow mode, or discontinue.

A specific example: a research-analytics team is considering AI-assisted abstract screening for a systematic review. The shadow-mode setup:

  • Two senior reviewers screen abstracts as they always have.
  • The AI screens the same abstracts in parallel; AI decisions are logged but do not affect the review.
  • After 200 abstracts, kappa between AI and human reviewers is 0.78 (substantial agreement).
  • Disagreements (28 of 200) are reviewed: 16 are cases where the AI made a defensible but stricter decision, 8 are AI mistakes, 4 are human mistakes.
  • The team decides: AI is promoted from shadow to augmentation. AI screens, then a human reviews AI decisions (especially the unclear cases). Throughput increases; accuracy is maintained.

The shadow period is the basis for the deployment decision. Without it, the decision is anecdotal; with it, the decision is empirical.
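
A minimal R sketch of the agreement analysis in step 4, assuming the shadow log is a CSV with illustrative columns abstract_id, human_decision, and ai_decision (the file name and column names are placeholders for the team’s own logging format):

    # Read the shadow-mode log: one row per abstract, both decisions recorded.
    log <- read.csv("shadow_log.csv", stringsAsFactors = FALSE)

    # Cohen's kappa for the categorical include/exclude decision.
    # irr::kappa2() expects a two-column set of ratings, one column per rater.
    library(irr)
    print(kappa2(log[, c("human_decision", "ai_decision")]))

    # Pull out the disagreements for the manual review in step 5.
    disagreements <- log[log$human_decision != log$ai_decision, ]
    cat(sprintf("%d of %d abstracts disagree; review each one.\n",
                nrow(disagreements), nrow(log)))
    write.csv(disagreements, "disagreements_for_review.csv", row.names = FALSE)

The kappa value and the categorised disagreements are the inputs to the promotion decision in step 6.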

For high-stakes deployments (clinical decision support, regulatory submissions), shadow mode is the appropriate default. The added latency and cost are small relative to the value of the empirical basis for the deployment decision.

13.6 Drift, regression, and version pinning

External-API-backed models change. The model identifier ‘claude-opus-4-7’ refers to a model that is updated periodically (with notification, but updates happen). ‘gpt-5’ is similarly versioned. A deployment that worked on the model as it existed last quarter may behave subtly differently this quarter even if the identifier is the same.

Three operational concerns.

Drift detection. Run the team’s evaluation suite (per Chapter 9) periodically against the deployed model. If the score drops materially (say, by more than the suite’s noise level), investigate whether the underlying model has changed. Check the provider’s release notes; if the model has changed and the new behaviour is not what was expected, the deployment may need adjustment.
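
A minimal sketch of the drift check, assuming the evaluation suite writes one summary score per run to a history file; the file name, column names, and the two-standard-deviation threshold are illustrative choices rather than prescriptions:

    # eval_history.csv: one row per evaluation run, columns run_date and score.
    history <- read.csv("eval_history.csv", stringsAsFactors = FALSE)
    history <- history[order(history$run_date), ]

    baseline <- head(history$score, -1)   # every run except the latest
    latest   <- tail(history$score, 1)    # the most recent run

    # Flag the run if it falls more than two standard deviations below the
    # baseline mean: a crude stand-in for 'more than the suite's noise level'.
    threshold <- mean(baseline) - 2 * sd(baseline)
    if (latest < threshold) {
      message(sprintf(
        "Possible drift: latest score %.3f is below threshold %.3f; check the provider's release notes.",
        latest, threshold))
    }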

Version pinning. Some API providers offer pinned model versions (‘claude-opus-4-7-20260301’ rather than ‘claude-opus-4-7’). Pinned versions are stable; the team controls when to upgrade. The trade-off is that pinned versions are eventually deprecated, and the team must migrate to a current version on the provider’s schedule. For high-stakes deployments, pinning is appropriate; the migration overhead is the price of control.

Rollback procedures. When a model update produces a material problem, the team needs to be able to revert. For pinned versions, this means identifying the previous version and switching back. For unpinned, this may mean switching to a different provider or falling back to a pre-AI workflow. The rollback procedure is documented before the first deployment; discovering it during a crisis is too late.
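
One way to keep the pin and the rollback to a single configuration change, sketched in R; the current identifier follows the chapter’s example and the previous identifier is hypothetical:

    # model_config.R: the one place the team's tooling reads model IDs from.
    model_config <- list(
      current  = "claude-opus-4-7-20260301",  # pinned version used in production
      previous = "claude-opus-4-7-20251101",  # hypothetical prior pin, kept for rollback
      rollback = FALSE                        # flip to TRUE to revert every call
    )

    # Every API wrapper calls active_model() rather than hard-coding a string,
    # so reverting after a bad update is one documented change plus a redeploy.
    active_model <- function(cfg = model_config) {
      if (isTRUE(cfg$rollback)) cfg$previous else cfg$current
    }

For unpinned identifiers, the same indirection is where a switch to a different provider or to the pre-AI workflow would live.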

A concrete example: a research-analytics team’s RoB assessment workflow uses an LLM with a published evaluation showing kappa 0.78 against expert reviewers. Six weeks after deployment, the weekly evaluation drops to kappa 0.69, a 9-point drop. The team:

  1. Reviews the model provider’s release notes; confirms a new model version was released two weeks earlier.
  2. Switches to the pinned previous version (kappa recovers to 0.78).
  3. Tests the new version on a larger evaluation sample; finds the new version’s behaviour has shifted on certain RoB domains.
  4. Adjusts the prompt to compensate for the shift; re-evaluates with the new version and the adjusted prompt; achieves kappa 0.81 (better than before).
  5. Migrates to the new version with the adjusted prompt.

The total time: about 4 hours of investigation and adjustment. Without the evaluation harness and the pinned-version capability, the team would either have proceeded with degraded performance (unaware) or suffered a longer disruption.

13.7 When to use AI, and when not

Mollick’s 15 Times to Use AI, and 5 Not To (Mollick, 2024a) is a practical decision framework. The fifteen positive cases (paraphrased for analytic context):

  1. Brainstorming research questions or analysis approaches.
  2. Translating across vocabularies (statistical to clinical, technical to lay).
  3. Summarising long documents (papers, protocols, reports).
  4. Generating boilerplate code in standard languages.
  5. Producing first drafts of routine prose.
  6. Formatting and reformatting documents.
  7. Extracting structured data from text.
  8. Cross-checking work against a reference.
  9. Generating examples for training or illustration.
  10. Translating between formats (LaTeX to Word, plot specs to ggplot2 code).
  11. Quick literature searches (deep research).
  12. Generating starting points for critique.
  13. Producing variants of an idea or design.
  14. Coding boilerplate plots and tables.
  15. Routine project-management tasks.

The five not-to cases:

  1. Learning something new where the struggle is the learning. Asking AI for the answer short-circuits the cognitive work that builds understanding.
  2. High-accuracy work where plausible hallucinations slip past readers. Citation work, regulatory submissions, anywhere the reader will not check.
  3. Decisions requiring human ethical judgement. IRB protocol decisions, decisions about research subjects, decisions about authorship.
  4. Communication that should reflect a personal relationship. A condolence note, a recommendation letter, a reference letter: the human authorship is part of the value.
  5. High-stakes situations where errors are catastrophic and the AI’s failure mode is silent. Anywhere the cost of a confident wrong answer exceeds the value of the work.

The framework is decision-oriented; for each task on the team’s plate, the question is which list it falls in. ‘Use AI for this’ is a decision, not a default.

13.8 Worked example: a complete deployment

A research-analytics group at an academic medical centre deploys AI assistance across its workflow over six months. The deployment integrates everything from the preceding chapters.

Context (from earlier chapters): The team has built its evaluation suite (Ch 9), conducted bias audits on the models they use (Ch 10), set up HIPAA-compliant API configurations (Ch 11), and adopted a context-engineering-first stance with a shared prompt library (Ch 12). Now they are operationalising.

Deployment 1: code generation in RStudio.

  • Interface: a custom RStudio addin that takes highlighted text or selected lines, sends it to the team’s API, and returns suggested code in a side panel (a minimal sketch of the request logic follows this list).
  • Use case: generate dplyr / data.table code from natural-language description, debug failing code, refactor for clarity.
  • Verification: the user runs the suggested code before accepting it.
  • Shadow-mode period: 1 month. Comparison against manually-written code on a sample. Acceptance rate: 78%; bug rate in accepted code: 4% (caught by running). Deemed acceptable.
  • Rollout: full team after the shadow period.
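
A minimal sketch of the addin’s request logic, assuming the rstudioapi and httr2 packages and an Anthropic-style messages endpoint; the function name, prompt wrapper, and model identifier are illustrative, and the real addin renders the response in a side panel rather than printing it:

    library(rstudioapi)
    library(httr2)

    suggest_code <- function() {
      # Grab the text the user has highlighted in the active editor.
      selected <- primary_selection(getActiveDocumentContext())$text

      resp <- request("https://api.anthropic.com/v1/messages") |>
        req_headers(
          "x-api-key"         = Sys.getenv("ANTHROPIC_API_KEY"),
          "anthropic-version" = "2023-06-01"
        ) |>
        req_body_json(list(
          model      = "claude-opus-4-7-20260301",
          max_tokens = 1024,
          messages   = list(list(
            role    = "user",
            content = paste("Suggest dplyr or data.table code for the following",
                            "request. Return only code.", selected)
          ))
        )) |>
        req_perform()

      # The user still runs the suggestion before accepting it: that is the
      # verification step, not an optional extra.
      suggestion <- resp_body_json(resp)$content[[1]]$text
      cat(suggestion, "\n")
    }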

Deployment 2: methods drafting.

  • Interface: a Quarto extension that, given an analysis plan, produces a methods-section draft.
  • Use case: first-pass draft of a methods section, edited by the investigator.
  • Verification: investigator review. The methods section is the investigator’s responsibility; AI-drafted text gets the same scrutiny as any other draft.
  • Shadow-mode period: 2 months on internal reports before extending to publications.
  • Rollout: full team for internal reports; publication-specific review for journal submissions.

Deployment 3: RoB assessment for systematic reviews.

  • Interface: a structured-output RoB tool that takes PDF input and returns a Cochrane RoB 2 assessment in JSON (a minimal output check is sketched after this list).
  • Use case: first-pass RoB for systematic reviews, reviewed by the senior reviewer.
  • Verification: confidence-flagged outputs receive 100% review; high-confidence outputs receive 20% review.
  • Shadow-mode period: kappa against expert review on 100 papers; kappa = 0.74. Disagreements analysed.
  • Rollout: as augmentation, not replacement.
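
A minimal sketch of the shape check run on each returned assessment before it reaches the reviewer, assuming the jsonlite package; the domain labels follow Cochrane RoB 2, while the field names (judgement, confidence) are illustrative rather than a fixed schema:

    library(jsonlite)

    rob2_domains <- c("randomization_process",
                      "deviations_from_intended_interventions",
                      "missing_outcome_data",
                      "measurement_of_the_outcome",
                      "selection_of_the_reported_result",
                      "overall")
    allowed_judgements <- c("Low", "Some concerns", "High")

    check_rob2 <- function(json_text) {
      out     <- fromJSON(json_text, simplifyVector = FALSE)
      missing <- setdiff(rob2_domains, names(out))
      present <- intersect(rob2_domains, names(out))

      invalid <- present[!vapply(present, function(d)
        isTRUE(out[[d]]$judgement %in% allowed_judgements), logical(1))]

      # Low-confidence domains route the whole assessment to 100% human review;
      # clean, high-confidence assessments are sampled at 20% per the regime above.
      low_conf <- present[vapply(present, function(d)
        identical(out[[d]]$confidence, "low"), logical(1))]

      list(missing_domains    = missing,
           invalid_judgements = invalid,
           needs_full_review  = length(missing) > 0 || length(invalid) > 0 ||
                                length(low_conf) > 0)
    }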

Deployment 4: literature scoping (deep research).

  • Interface: direct use of the provider’s deep-research tool.
  • Use case: scoping reviews for protocol design.
  • Verification: 100% citation verification on cited papers; 20% deep verification of extracted data.
  • Shadow-mode period: not applicable (the use case is not displacing an existing workflow).
  • Rollout: per-project, with documented disclosure.

Operational practices:

  • Weekly evaluation suite run; alerts on degradation (a minimal scheduling sketch follows this list).
  • Monthly review of incident logs (AI errors, unexpected outputs, user-reported issues).
  • Quarterly review of the prompt library; updates based on observed patterns.
  • Annual review of the deployment overall: which uses are providing value, which are not, what to add or remove.
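
A minimal sketch of the weekly run that feeds the drift check in section 13.6, assuming a hypothetical eval_suite() wrapper around the Chapter 9 suite and the same illustrative history file:

    # run_weekly_eval.R: invoked once a week by the team's scheduler; appends one
    # summary row per run to the history file read by the drift check in 13.6.
    source("model_config.R")   # the version-pinning sketch from section 13.6
    source("eval_suite.R")     # hypothetical wrapper defining eval_suite()

    score <- eval_suite(model = active_model())
    row   <- data.frame(run_date = as.character(Sys.Date()), score = score)

    if (file.exists("eval_history.csv")) {
      write.table(row, "eval_history.csv", sep = ",", append = TRUE,
                  col.names = FALSE, row.names = FALSE)
    } else {
      write.csv(row, "eval_history.csv", row.names = FALSE)
    }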

Six-month outcomes:

  • Throughput on routine analytical work: estimated +35% (a qualitative estimate; no precise measurement was attempted).
  • Quality (errors caught in review): no measurable change, suggesting AI-assisted work is at least as accurate as before given the verification regime.
  • Team satisfaction: high; the verification burden is real but feels manageable.
  • Cost: $1,200/month in API charges, well below the operational savings.

The deployment is operational. The team continues to refine. The deployment is not ‘done’; it is continuous practice.

13.9 Collaborating with an LLM on deployment

Three prompt patterns illustrate working with LLMs on deployment tasks.

Prompt 1: ‘Plan the rollout for this AI deployment.’ Provide the deployment description and the team context.

What to watch for. The LLM produces a competent plan. It tends to under-specify the verification regime and the rollback procedure. Push back on both.

Verification. The plan needs to include shadow-mode metrics, acceptance criteria, and rollback steps. If any of these is vague, refine.

Prompt 2: ‘Audit my deployment for failure modes I’m not considering.’ Provide the deployment description and the current verification regime.

What to watch for. The LLM is good at identifying potential failure modes in the abstract. It is less good at identifying institution-specific issues (your IRB’s specific requirements, your IT department’s specific constraints). Supplement with local knowledge.

Verification. For each failure mode the LLM flags, ask: is this realistic for our deployment? What is the cost of preparing for it? Decide which to address.

Prompt 3: ‘Draft the rollback procedure for this deployment.’ Provide the deployment details.

What to watch for. The LLM produces a reasonable draft procedure. The procedure needs to be tested in practice before the first crisis; an untested rollback is at best speculative.

Verification. Run the rollback procedure in a controlled scenario before deployment. Document what worked and what did not.

The meta-pattern: deployment is operational discipline. The LLM accelerates planning; the discipline of running the deployment over time remains the work.

13.10 Principle in use

Three habits define defensible deployment work:

  1. Shadow before promote. Every new AI deployment runs in parallel with the existing workflow long enough to produce empirical metrics. Promotion is based on the metrics, not on the model’s reputation.

  2. Pin the version, monitor for drift. Use pinned model versions for high-stakes deployments. Run the evaluation suite periodically. Update intentionally, not automatically.

  3. Plan the rollback before the launch. Every deployment has a documented rollback procedure tested in advance; a crisis is not the time to figure out the rollback.

13.11 Exercises

  1. For an AI deployment in your team, document the shadow-mode rollout plan (or, if shadow mode was not used, the alternative validation approach).

  2. Design the rollback procedure for an AI deployment in your team. Test it in a controlled scenario. Document what you learned.

  3. Run your team’s evaluation suite once a week for a month. Plot the scores over time; investigate any movements outside the noise band.

  4. For a workflow you are considering for AI assistance, classify it against Mollick’s 15-and-5 framework. Justify your classification.

  5. Write a one-page ‘AI deployment summary’ for your team’s most important AI use. Include the purpose, interface, verification regime, monitoring, and rollback. Have a colleague review it for completeness.

13.12 Further reading

  • Mollick (2026), Claude Dispatch and the Power of Interfaces. The thesis that interface design is the binding constraint on real-world AI value.
  • Mollick (2024a), 15 Times to Use AI, and 5 Not To. The decision framework for whether to use AI for a specific task.
  • Mollick (2024b), Latent Expertise: Everyone is in R&D. The bottom-up adoption pattern that underwrites long-term deployment success.
  • Finlayson et al. (2021), The Clinician and Dataset Shift in Artificial Intelligence. The reference treatment of clinical-deployment drift.
  • Tierney et al. (2024), Ambient AI Scribes and Clinician Documentation. An applied case study of clinical-AI deployment with empirical outcomes.