8  Deep Research and Evidence Synthesis Pipelines

8.1 Learning objectives

By the end of this chapter you should be able to:

  • Use deep-research modes (OpenAI Deep Research, Elicit, Scite, Claude with extended thinking + web tools) for scoping reviews, background searches, and evidence synthesis.
  • Design a living-review pipeline that combines LLM screening with human adjudication and tracks PRISMA-aligned documentation.
  • Apply LLM-assisted risk-of-bias assessment with appropriate human verification.
  • Distinguish tasks where deep research adds value (background searches, broad-scope mapping) from tasks where it does not (formal systematic-review methodology, meta-analysis pooling).

8.2 Orientation

Deep research is the name that emerged in 2025 for a class of narrow agent that combines a reasoning model with autonomous web access, runs for tens of minutes, and produces a structured report with citations. Examples include OpenAI Deep Research, Anthropic’s research mode, and Google’s research-with-agents, plus Elicit and Scite at the research-tool layer. The capability has changed what is possible at the literature-search and evidence-synthesis end of applied analytic work.

Mollick (Mollick, 2025) frames the shift: ‘for the first time, an AI isn’t just summarizing research, it’s actively engaging with it at a level that actually approaches human scholarly work’. The empirical evidence matches the framing. A graduate-level research question that would once have cost a senior researcher a day of literature work (pulling and reading 60 papers, organising the themes, drafting a synthesis) can be handled by Deep Research in 15 minutes, with output that requires verification but is broadly defensible.

The chapter develops three threads. Deep research as a class of narrow agent: what these systems do, when they succeed, and how their failure modes differ from chat or reasoning models. Evidence synthesis pipelines: how to use LLM-assisted approaches in formal systematic-review work, with the PRISMA-aligned documentation that distinguishes such work from informal literature reviews. Risk-of-bias assessment: a specific application where LLM assistance has been formally evaluated and where the verification regime is well-defined.

The framing inherits Mollick’s Four Singularities for Research (Mollick, 2024): the four ways AI is reshaping academic research, of which the methods-singularity (autonomous experiments and analysis) is closest to this chapter’s topic.

8.3 The researcher’s contribution

Three judgements are not delegable.

(Judgement 1.) Deep research is a draft, not a finished product. The output of a deep-research run is a structured document with citations and synthesis. The researcher verifies that the citations exist, that the claims attributed to them are accurate, and that the synthesis does not over- or under-state the evidence. A deep-research output forwarded to a collaborator without verification is the analytic equivalent of submitting a draft your intern wrote without reading it.

(Judgement 2.) Search comprehensiveness is a methodological choice. Formal systematic reviews require a comprehensive search of multiple databases, explicit search strings, and deduplication. Deep research is good at scoping searches and broad-scope mapping; it is not a substitute for the formal search step. The researcher decides which kind of search the work requires and uses deep research where appropriate. A clinical guideline informed by a deep-research-only search is not the same artefact as one informed by a PRISMA-aligned systematic review.

(Judgement 3.) The interpretation of evidence remains the researcher’s. Deep research can produce a synthesis paragraph with citations. The conclusion ‘the evidence supports treatment X for indication Y’ is a clinical judgement informed by the evidence, not a literal output of the synthesis tool. The researcher weighs the evidence, considers the limitations, and forms the conclusion, with the deep-research output as an input to that judgement rather than a substitute for it.

These judgements are what distinguish deep-research-assisted work that informs decisions from work that launders unsupported claims through the appearance of literature support.

8.4 Deep research as a class of narrow agent

Deep-research products share a common architecture:

  1. Query decomposition. The user provides a research question; the system breaks it into sub-questions.
  2. Iterative search. For each sub-question, the system performs web searches, reads results, identifies follow-up searches.
  3. Synthesis. The system organises findings into a structured report, with citations to specific sources for specific claims.
  4. Self-critique. Some implementations include a final pass where the system evaluates its own output for gaps, contradictions, or weak support.

The user-visible output: a 5–30 page document with inline citations, structured headings, and a bibliography. The latency is 5–30 minutes; the cost is typically $1–$5 per run; the citation count is typically 30–100 sources per report.
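
The architecture can be made concrete in a few dozen lines. Below is a minimal sketch of the four-step loop in Python, with llm_call() and web_search() as hypothetical placeholders for a model-provider API and a search API; it illustrates the control flow, not any vendor’s implementation.

def llm_call(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: plug in a model provider

def web_search(query: str) -> list[tuple[str, str]]:
    raise NotImplementedError  # hypothetical: returns (url, text) pairs

def deep_research(question: str, max_rounds: int = 3) -> str:
    # 1. Query decomposition
    sub_questions = llm_call(
        "Break this research question into 3-6 sub-questions, "
        f"one per line:\n{question}"
    ).splitlines()

    notes: list[str] = []
    for sq in sub_questions:
        query = sq
        # 2. Iterative search: search, read, decide on follow-ups
        for _ in range(max_rounds):
            sources = web_search(query)
            summary = llm_call(
                f"Summarise what these sources say about: {sq}\n{sources}\n"
                "Cite a URL for each claim. If coverage is thin, end with "
                "FOLLOW-UP: <next query>; otherwise end with DONE."
            )
            notes.append(summary)
            if "FOLLOW-UP:" not in summary:
                break
            query = summary.rsplit("FOLLOW-UP:", 1)[1].strip()

    # 3. Synthesis into a structured, cited report
    report = llm_call(
        f"Write a structured report answering: {question}\n"
        f"Use only these notes and keep their citations:\n{notes}"
    )
    # 4. Self-critique: flag gaps, contradictions, weak support
    critique = llm_call(
        f"List gaps, contradictions, and weakly supported claims in:\n{report}"
    )
    return report + "\n\nSELF-CRITIQUE:\n" + critique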

For clinical and public-health researchers, three classes of task are well suited.

Background scoping. ‘What’s known about X in indication Y’ as the starting point for a research project. Deep research produces a competent landscape view in less time than manual searching takes to even formulate the right search terms.

Methodological lookups. ‘What approaches have been used to analyse Z type of trial’ or ‘how have other groups handled the missing-data problem in W kind of study’. Deep research produces a synthesis of methods across multiple papers more efficiently than manual extraction.

Living-review updates. Periodic refresh of a literature base on a defined topic. A monthly deep-research run on the cohort’s research question identifies new papers and incorporates them into the team’s working bibliography.

Three classes of task are poorly suited.

Formal systematic reviews. Deep research does not produce a comprehensive search across multiple databases, deduplicated, with explicit inclusion/exclusion decisions documented. PRISMA-aligned systematic reviews require the formal search; deep research can supplement it but not replace it.

Meta-analyses. Deep research can identify candidate studies and extract reported statistics, but the quantitative pooling, heterogeneity assessment, and sensitivity analysis remain applied analytic work. Deep research’s tabular extraction is helpful as a draft input.

Niche or recent literature. Deep research depends on web indexing. Very recent papers (last few weeks), papers in subscription-only journals not indexed in PMC, and papers in non-English literatures are under-represented. The researcher notices the gap and supplements.

Question. A clinical team needs literature evidence to support a decision about whether to add an adaptive-design feature to a planned Phase III trial. The team has 2 weeks before the protocol must be finalised. Should they use deep research or commission a formal systematic review?

Answer. For a 2-week timeline, deep research is the right tool. A systematic review takes 3–9 months including title/abstract screening, full-text review, data extraction, risk-of-bias assessment, and synthesis. Deep research, with the researcher’s verification, produces a competent evidence summary in 1–3 days. The trade-off: the deep-research output is not PRISMA-compliant; if the resulting protocol is later challenged on its evidence base, the team has weaker methodological standing than it would with a formal review. The right disclosure is ‘a deep-research-assisted scoping review of the literature, with key citations verified manually’. For decisions with major regulatory implications, a formal review may still be warranted post-hoc; for protocol design, deep research is usually adequate and the trade-off is reasonable.

8.5 Living reviews: LLM screening with human adjudication

A living systematic review is one that is updated continuously as new papers are published rather than producing a single static output. Living reviews are better matched to fast-moving fields than traditional fixed reviews; they are also more labour-intensive. LLM assistance changes the labour math in ways that make living reviews tractable.

The pipeline:

  1. Continuous search. A scheduled query against PubMed, Cochrane, and EMBASE retrieves new papers since the last update. Standard library tools (e.g., pubmedR, easyPubMed) handle this.
  2. LLM-assisted screening. Each new paper’s abstract is scored against the inclusion criteria. The LLM produces a recommendation (include / exclude / unclear) with reasoning.
  3. Human adjudication. A reviewer (or two, with conflict resolution) confirms include/exclude decisions. The agreement statistic against the LLM is tracked over time.
  4. Data extraction. For included papers, the LLM produces a structured extraction (study design, sample size, primary outcome, key result). The reviewer verifies.
  5. Synthesis update. The new papers are incorporated into the running synthesis. The output is versioned; the diff against the previous version is tracked.
  6. Documentation. PRISMA-aligned flow diagram is updated; the LLM involvement is documented per PRISMA-AI.

The agent of Chapter 7 can be set up to run this pipeline weekly. The researcher’s time per update is on the order of 1–2 hours; an LLM-assisted extraction takes 30 seconds rather than the 30 minutes of a manual one. The trade is the verification overhead; at a five-paper-per-week update rate, the math works out favourably.
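
A sketch of the continuous-search step (step 1) against the NCBI E-utilities API; the query string and update window are placeholders, and EMBASE and Cochrane need their own connectors.

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def new_pmids(query: str, days_back: int = 7) -> list[str]:
    """Return PMIDs indexed in PubMed in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days_back,  # restrict to the last N days
        "datetype": "edat",    # by Entrez (indexing) date
        "retmax": 200,
        "retmode": "json",
    }
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# e.g. run weekly: new_pmids('"condition X"[tiab] AND randomized')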

A worked construction of the screening prompt:

You are screening abstracts for a systematic review on
[topic].

Inclusion criteria:
- Population: adults with [condition X]
- Intervention: [intervention class Y]
- Comparator: any (active or placebo)
- Outcome: [primary outcome Z] reported
- Study design: randomised controlled trial

Exclusion criteria:
- Animal studies, in vitro studies
- Conference abstracts without full data
- Case reports or case series
- Studies in pediatric populations only

For the abstract below, return JSON:
{
  "decision": "include" | "exclude" | "unclear",
  "rationale": "one sentence",
  "reviewer_attention_needed": true | false
}

Mark "reviewer_attention_needed: true" for any unusual
or borderline case.

ABSTRACT: [...]

The structured output makes downstream processing straightforward: counts of include/exclude/unclear, filtering by reviewer_attention_needed, automatic tracking of decision rationale.
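
A sketch of that downstream step, assuming each screening response is the JSON object specified in the prompt above:

import json
from collections import Counter

def triage(responses: list[str]) -> tuple[Counter, list[dict]]:
    """Tally decisions and collect abstracts needing reviewer attention."""
    decisions = [json.loads(r) for r in responses]
    counts = Counter(d["decision"] for d in decisions)
    flagged = [
        d for d in decisions
        if d["reviewer_attention_needed"] or d["decision"] == "unclear"
    ]
    return counts, flagged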

The agreement metric (Cohen’s kappa between LLM decisions and human reviewer decisions on a sample) is the headline number. For the pipeline to be accepted in a methods section, kappa needs to sit in the substantial-agreement range (0.6–0.8 typically). Published evaluations of LLM-assisted screening show modern reasoning models can reach this range on well-specified inclusion criteria; reaching it is not automatic and depends on prompt engineering.
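
Once paired decisions are collected, the agreement statistic is a one-liner; a sketch with scikit-learn, using illustrative decision lists:

from sklearn.metrics import cohen_kappa_score

# Paired decisions on the same sample of abstracts (illustrative values)
llm_decisions   = ["include", "exclude", "exclude", "include", "unclear"]
human_decisions = ["include", "exclude", "include", "include", "exclude"]

kappa = cohen_kappa_score(llm_decisions, human_decisions)
print(f"LLM-vs-human kappa: {kappa:.2f}")  # track per update cycle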

8.6 Risk-of-bias assessment with verification

Risk-of-bias assessment is one of the most labour-intensive steps in a systematic review. Each included study is rated across multiple domains (sequence generation, allocation concealment, blinding, missing data, reporting bias, etc.) using a structured tool (Cochrane RoB 2 for RCTs, ROBINS-I for non-randomised studies). Manual rating takes 30–60 minutes per study.

LLM-assisted rating has been formally evaluated and performs at acceptable levels for screening but requires human verification. The published validation work shows kappa around 0.7 for individual domains between LLM and expert reviewers, with characteristic weaknesses on domains that require careful reading of the methods (e.g., ‘allocation concealment’, where plausible-sounding methods sections may have hidden flaws).

A working pattern:

You are rating risk of bias in this RCT using the
Cochrane RoB 2 tool. For each domain, return:
{
  "domain": "<domain name>",
  "judgement": "low" | "some concerns" | "high",
  "rationale": "specific quote or paraphrase from
                methods supporting the judgement",
  "confidence": "high" | "medium" | "low"
}

Use confidence: low when the methods do not provide
enough information to make a confident judgement;
this flags for reviewer attention.

DOMAINS:
1. Bias from randomisation
2. Bias due to deviations from intervention
3. Bias due to missing outcome data
4. Bias in measurement of outcome
5. Bias in selection of reported result

STUDY METHODS: [...]

The structured rationale field is what makes verification tractable. The reviewer can read the quote, judge whether it actually supports the judgement, and override where appropriate. The confidence field is the auto-flag for unclear methods sections; reviewers focus their time on the flagged studies.

The end-to-end pipeline:

  1. LLM produces RoB judgements for all studies.
  2. Reviewer audits 100% of low-confidence judgements.
  3. Reviewer spot-checks 20% of high-confidence judgements.
  4. Disagreements between LLM and reviewer are adjudicated by a second reviewer.
  5. The final RoB table for the review uses adjudicated judgements.
  6. Methods section discloses LLM-assisted RoB with audit and adjudication procedure.

The labour math: manual rating at 30 minutes/study × 50 studies = 25 hours. LLM-assisted: 30 seconds/study × 50 studies, plus ~15 minutes per study auditing the ~30% that require it (15 studies), plus ~15 minutes per case adjudicating the 5–10 disagreements = 5–6 hours. The savings are real.
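
The audit-selection logic (steps 2 and 3 above) is simple to script; a sketch, assuming each judgement dict carries the confidence field from the prompt:

import random

def select_for_audit(judgements: list[dict], spot_rate: float = 0.20,
                     seed: int = 42) -> list[dict]:
    """All low-confidence judgements plus a seeded random spot-check
    of the remainder, for a reproducible audit trail."""
    low = [j for j in judgements if j["confidence"] == "low"]
    rest = [j for j in judgements if j["confidence"] != "low"]
    rng = random.Random(seed)
    k = min(len(rest), max(1, round(spot_rate * len(rest))))
    spot = rng.sample(rest, k)
    return low + spot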

8.7 Worked example: a deep-research-assisted scoping review

A research team is preparing a Phase III trial in a specific cancer indication. They need a scoping review of:

  • Recent (last 5 years) Phase II/III trials in the indication.
  • Primary endpoints used and the rationale.
  • Sample sizes and the assumptions used.
  • Approaches to multiplicity for secondary endpoints.

A traditional scoping review for this would take 4–6 weeks. The team has 10 days.

Step 1: structure the deep-research query. The team writes a one-paragraph briefing:

Conduct a scoping review of Phase II and Phase III
trials in [specific cancer indication] published
2020-2026. For each identified trial, extract: primary
endpoint, sample size, sample-size assumptions
(effect size, power), approach to multiplicity for
secondary endpoints, and whether the primary endpoint
was met.

Output format: structured table (one row per trial)
plus a 1-page synthesis paragraph identifying common
choices and outliers.

Search PubMed, ClinicalTrials.gov, and the FDA
review documents (where available). Cite specific
sources for each claim.

Step 2: run. OpenAI Deep Research, ~25 minutes, ~$3.20 in API charges. Output: a 14-page report with a 32-row trial table and a synthesis paragraph.

Step 3: verify. The researcher verifies:

  • The 32 cited trials exist (check PMIDs).
  • For 6 randomly selected trials, the extracted primary endpoint matches the published trial.
  • The synthesis paragraph’s claims (e.g., ‘most trials in the indication used overall survival as the primary endpoint with a hazard-ratio assumption of 0.75’) are consistent with the table.

The verification takes ~2 hours. Two extracted endpoints are slightly mis-stated and are corrected. One cited trial is real but is a Phase Ib (not Phase II/III) and is removed. The synthesis paragraph is accurate.
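
The existence check in step 3 can be scripted against E-utilities esummary; a sketch (matching titles against the report is still done by eye):

import requests

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def check_pmids(pmids: list[str]) -> dict[str, str | None]:
    """Map each PMID to its indexed title, or None if it does not resolve."""
    resp = requests.get(
        ESUMMARY,
        params={"db": "pubmed", "id": ",".join(pmids), "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["result"]
    return {
        pid: result[pid].get("title")
        if pid in result and "error" not in result[pid] else None
        for pid in pmids
    }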

Step 4: supplement. The deep-research output missed two recent trials the team knows about (one at ASCO 2025, not yet indexed; one in a Japanese journal). The researcher adds them manually to the table.

Step 5: use in protocol. The scoping review informs the team’s choice of primary endpoint (progression-free survival, consistent with most recent trials), sample size assumption (hazard ratio 0.70, conservative relative to the median 0.65), and multiplicity strategy (gatekeeping for secondary endpoints, common in the field).

Step 6: document. The methods section of the protocol cites the scoping review and discloses ‘an AI-assisted scoping review with manual verification of all citations and 20% verification of extracted fields’. The detailed methodology lives in an internal document for institutional review.

Total time: ~10 hours of researcher time (verification + supplementation + protocol writing), versus an estimated 4–6 weeks for a traditional scoping review. The trade-off is the PRISMA-incompleteness; the team is comfortable with this for the protocol-development purpose.

8.8 Collaborating with an LLM on deep research workflows

Three prompt patterns illustrate working with deep-research tools effectively.

Prompt 1: ‘Write the deep-research brief for this question.’ Provide the question, the use case, and constraints.

What to watch for. The LLM writes a competent brief. It tends to under-specify the verification expectations. Push back: ‘What should the deep-research output explicitly cite versus assert? What should be flagged for verification?’

Verification. The brief becomes the contract for the deep-research run. Run a small test with the brief and inspect the output’s structure; refine the brief if the structure is not what you wanted.

Prompt 2: ‘Audit this deep-research output for citation fidelity.’ Provide the output and ask the LLM to identify claims that may be over-stated relative to the cited source.

What to watch for. The LLM is reasonably good at identifying obvious misalignments but may miss subtler over-statements. The audit is a useful first pass; manual citation-by-citation verification remains the gold standard for high-stakes use.

Verification. Confirm the LLM-flagged citations. Manually verify a 20% random sample of the unflagged citations. The combination is more efficient than full manual verification.

Prompt 3: ‘Compare this LLM-assisted RoB judgement with the published Cochrane assessment.’ Provide the LLM’s RoB output and the Cochrane assessment for the same study.

What to watch for. The LLM is good at noting specific differences but may rationalise its own judgement when pressed. For high-stakes use, a second LLM (different family) provides a more independent comparison.

Verification. If LLM and Cochrane disagree, investigate. The Cochrane reviewers are usually right but not always; the disagreement may indicate a recent re-evaluation of the methodology.

The meta-pattern: deep research is a force multiplier for the literature-engagement phase of applied analytic work, not a replacement for the researcher’s judgement about the evidence. Use it where the speed-up is substantial, the verification is tractable, and the use case can absorb a deep-research-shaped artefact rather than a formal systematic review.

8.9 Principle in use

Three habits define defensible work in this area:

  1. Verify citations before quoting. Every citation in deep-research output that will appear in a paper, protocol, or report must exist and say what it is claimed to say. The verification cost is small; the cost of a mis-cited claim in print is not.

  2. Use deep research for the kinds of questions it handles well. Scoping reviews, methodological surveys, background searches, living-review updates. Not formal systematic reviews, not meta-analysis pooling, not niche-literature identification.

  3. Disclose AI-assistance proportionate to use. Internal scoping documents need lighter disclosure than published protocols, which need lighter disclosure than systematic reviews. The PRISMA-AI extension and similar reporting guidelines specify the standard for each context.

8.10 Exercises

  1. Run a deep-research query on a topic in your area and verify the output. Document the verification findings: how many citations exist, how many accurately support the claim, how many are misattributed. Compute a verification-rate statistic.

  2. Set up a living-review pipeline for a research question of your team’s. Run it for one month and compute the LLM-vs-human kappa on the screening decisions.

  3. Use an LLM-assisted RoB approach on a study you have manually rated. Compare per-domain judgements; identify systematic differences.

  4. Compare deep-research outputs across two providers (e.g., OpenAI Deep Research vs Anthropic’s research mode) on the same question. Document structural differences in coverage and citation accuracy.

  5. Draft an AI-assistance disclosure for a protocol that used deep research for the literature review section. Submit to a colleague for review.

8.11 Further reading

  • Marshall & Wallace (2019), Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. The pre-LLM reference for systematic-review automation, useful context.
  • Mollick (2025), The End of Search, The Beginning of Research. The flagship contemporary treatment of deep research.
  • Mollick (2024), Four Singularities for Research. Mollick’s framing of how AI is reshaping academic work.
  • The Cochrane Handbook’s chapter on systematic-review methodology (current edition) is the reference for what formal systematic reviews require, useful for understanding what deep research does and does not substitute for.