No. 9 - The Sovereign Stack

Headline Insight

This week's issue carries a signal that no single paper announces but all of them collectively confirm: the architecture of deployable clinical AI has resolved. It is not a cloud API. It is not a frontier model with a BAA. It is a small, locally runnable model distilled from a large one, grounded in quantitative retrieval rather than generative hallucination, and wrapped in an auditable reasoning chain. SHIELD shows the distillation economics are compelling: $10 of Gemini 2.5 Flash labeling, then negligible marginal inference cost on hospital hardware. MedScribe shows the grounding architecture works for 3D imaging. ADAPTS shows modular item-level agents outperform monolithic prompts for structured clinical assessment. The intraoperative infection paper shows that routinely collected EHR time-series, engineered properly, approach postoperative lab data for risk prediction. These are four independent teams reaching the same destination from four different directions.

The frame worth tracking is what these papers collectively displace. Every one of them is, at its core, a response to the same deployment barrier: PHI cannot leave the building, frontier models are too expensive to run at scale, and regulatory-grade clinical AI requires an auditable chain from input to output. The sovereign stack (small distilled model, on-premise inference, structured retrieval, explicit reasoning trace) satisfies all three constraints simultaneously. SHIELD's DeBERTa v3 runs on a workstation CPU at 88% precision and 86% recall after a one-time distillation cost. MedScribe's Llama 3.1-8B produces CT reports that outperform Gemini 2.5 Flash on pathology classification without a single additional training example. ADAPTS achieves ICC(2,1) of 0.877 on psychiatric rating by injecting clinical conventions into prompts rather than fine-tuning on sensitive patient data. The intraoperative infection model achieves AUROC 0.880 from intraoperative vital-sign dynamics already captured in the OR data warehouse.

What this means for procurement and investment decisions: The hospital that waits for a foundation model vendor to solve PHI compliance is solving the wrong problem. The infrastructure problem — diverse, well-annotated local corpora, distillation pipelines, on-premise compute — is already solvable with today's methods, as SHIELD demonstrates. The grounding problem — replacing hallucinated synthesis with quantitative evidence retrieval — is already solvable, as MedScribe demonstrates. The regulatory auditability problem — producing item-level justifications traceable to source evidence — is already solvable, as ADAPTS demonstrates. None of these systems are clinically validated at the standard required for SaMD clearance, but all of them are architecturally correct. The gap between where these papers sit and where a cleared product sits is a validation and regulatory engagement gap, not a scientific one. That is a shorter gap than it has ever been.

Pre-Print Intelligence (arXiv)

DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset

Brief: DALPHIN is a multicentric open benchmark designed to evaluate Vision-Language Models (VLMs) in digital pathology across 130 diagnoses and 14 subspecialties. It compares general-purpose models like GPT-5 and Gemini 2.5 Pro against pathology-specific models like PathChat+ using a dataset of 1,236 images from 300 cases.
Methodological Integrity: High integrity due to multicentric data sourcing across 6 countries and validation by 31 pathologists. Sequestered ground truth reduces the risk of data leakage and of models overfitting to the benchmark.
Strategic Implication: Shifts the pathology AI paradigm from narrow, single-task classifiers to generalist 'copilots' capable of complex visual question answering. This enables a transition toward ambient diagnostic support and reduces the reliance on highly specialized human labor for rare diagnoses.
Executive Summary: The study establishes a robust, open-access framework for benchmarking pathology AI, demonstrating that domain-specific models (PathChat+) significantly outperform general-purpose VLMs in clinical accuracy. It provides a scalable validation pipeline for integrating AI into routine diagnostic workflows.

Innovation: 8/10 | Applicability: 9/10 | Commercial Viability: 8/10

MedScribe: Clinically Grounded CT Reporting through Agentic Workflows

Brief: MedScribe reformulates 3D CT report generation as an iterative, hypothesis-driven evidence acquisition loop in which an LLM dynamically invokes pathology-specific segmentation and radiomics tools (via TotalSegmentator), retrieves nearest-neighbor textual evidence from a quantitative RAG space, and synthesizes findings only after accumulating grounded volumetric descriptors. Evaluated on CT-RATE (1,304 scans) and RadChestCT (360 scans) against CT-CHAT, Gemini 2.5 Flash, MedGemma 1.5, and VILA-M3.
Methodological Integrity: RAG reference space and evaluation labels both derive from CT-RATE training data via the same RadBERT classifier, creating a circularity risk the authors acknowledge but do not fully resolve; the RadChestCT cross-dataset results partially mitigate this, though segmentation-error sensitivity analysis shows a consistent 8–9 point macro F1 drop on scans with unreliable lung segmentation, exposing a direct pipeline dependency on upstream model quality.
Strategic Implication: Decoupling linguistic synthesis from visual encoding addresses a genuine hallucination bottleneck in 3D radiology AI, but clinical deployment requires PACS integration, prospective radiologist evaluation, and regulatory clearance per indication — none initiated.
Executive Summary: MedScribe demonstrates statistically significant macro F1 improvements over four state-of-the-art VLMs on chest CT reporting via tool-grounded agentic reasoning without task-specific fine-tuning; pipeline robustness remains contingent on segmentation quality and the framework is pre-clinical.

Innovation: 8/10 | Applicability: 5/10 | Commercial Viability: 6/10
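
To make the "quantitative RAG space" idea concrete, here is a minimal sketch of nearest-neighbor retrieval over volumetric descriptors. It is not MedScribe's implementation: the descriptor layout (lesion volume, mean attenuation), the reference snippets, the z-score scaling, and the function names are all assumptions chosen for illustration.

```python
import math

def column_stats(vectors):
    """Per-feature mean and standard deviation, used to z-score descriptors."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((v[j] - means[j]) ** 2 for v in vectors) / n
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid divide-by-zero
    return means, stds

def retrieve(query, reference, k=2):
    """Return the k snippets whose descriptors lie closest to `query`."""
    means, stds = column_stats([vec for vec, _ in reference])

    def scale(vec):
        return [(x - m) / s for x, m, s in zip(vec, means, stds)]

    q = scale(query)
    ranked = sorted(reference, key=lambda item: math.dist(q, scale(item[0])))
    return [text for _, text in ranked[:k]]

# Hypothetical reference space: (lesion volume in mL, mean attenuation in HU)
REFERENCE = [
    ((0.4, -650.0), "Small ground-glass nodule."),
    ((12.0, 35.0), "Soft-tissue mass."),
    ((150.0, 10.0), "Large pleural effusion."),
    ((0.6, -600.0), "Subcentimeter ground-glass opacity."),
]

evidence = retrieve((0.5, -620.0), REFERENCE, k=2)
```

The design point this illustrates: the language model only sees text retrieved by measured similarity, so a bad upstream segmentation shifts the query vector and silently changes which evidence comes back. That is the pipeline dependency flagged under Methodological Integrity.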

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

Brief: SHIELD is a 1,394-note clinical de-identification benchmark built via diversity-optimized set-cover sampling across six demographic and document-type strata from Stanford's STARR-OMOP warehouse, annotated with 10,505 gold-standard PHI spans across 9 categories by 12 human labelers with adjudication. A teacher-student distillation pipeline transfers Gemini 2.5 Flash capabilities into a locally deployable DeBERTa v3 model achieving micro-averaged span-level precision of 0.88 and recall of 0.86 on SHIELD, at an estimated 68,800× reduction in cloud API cost.
Methodological Integrity: All notes originate from a single institution (Stanford Health Care), limiting generalizability claims; cross-dataset evaluation confirms that institution-specific entities (HOSPITAL, LOCATION) resist transfer in both directions regardless of training diversity, while low-support categories (WEB: 82 spans, LOCATION: 495 spans) yield bootstrap CIs spanning 30 percentage points.
Strategic Implication: The distillation approach directly resolves the PHI-to-cloud-API barrier that blocks EHR data secondary use at scale, and public release of both the dataset and model lowers the barrier for institutional adoption; LOCATION recall ceiling of 0.53 remains a gap for full-automation use cases.
Executive Summary: SHIELD delivers a modern, demographically diverse de-identification benchmark and a locally deployable DeBERTa v3 model that closely matches its LLM teacher on structured PHI categories at negligible marginal inference cost, with demonstrated cross-institutional generalization on universal categories.

Innovation: 7/10 | Applicability: 8/10 | Commercial Viability: 8/10
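
The "diversity-optimized set-cover sampling" in the Brief can be read as a greedy set-cover pass: keep picking the note that covers the most strata values not yet represented. The sketch below shows that general technique only. The note IDs, strata labels, and the `greedy_cover` helper are hypothetical, and SHIELD's actual sampler may differ in objective and tie-breaking.

```python
def greedy_cover(notes, budget):
    """Greedy set cover: select up to `budget` notes, each time taking the
    note that adds the most strata values not yet covered.

    `notes` maps a note ID to the set of strata values it exhibits.
    Returns (chosen note IDs in pick order, strata values still uncovered).
    """
    uncovered = set().union(*notes.values())
    chosen = []
    while uncovered and len(chosen) < budget:
        # sorted() makes ties resolve to the alphabetically first note ID
        candidates = [n for n in sorted(notes) if n not in chosen]
        best = max(candidates, key=lambda n: len(notes[n] & uncovered))
        chosen.append(best)
        uncovered -= notes[best]
    return chosen, uncovered

# Hypothetical notes tagged with demographic and document-type strata
NOTES = {
    "note_a": {"age:65+", "type:discharge", "sex:F"},
    "note_b": {"age:65+", "type:discharge"},
    "note_c": {"age:18-40", "type:radiology", "sex:M"},
    "note_d": {"age:18-40", "sex:F"},
}

chosen, missing = greedy_cover(NOTES, budget=2)
```

With a budget of two, the sketch picks `note_a` and `note_c`, which together cover every stratum value in this toy corpus; a random pair could easily have missed one.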

ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

Brief: ADAPTS is a mixture-of-agents LLM framework for automated re-rating of depression and anxiety severity from clinical interview transcripts. It decomposes HAM-D and HAM-A assessments into 15 and 13 item-level sub-agents respectively, each performing targeted evidence retrieval and explicit justification before scoring. Evaluated across two independent datasets (N=204) with distinct interview protocols, with benchmarking across Claude Sonnet 4.5, Gemini 3 Pro, DeepSeek R1, Llama Scout 4, and GPT OSS.
Methodological Integrity: All authors are company employees evaluating their own commercial system; ground-truth raters were unanimous on only 42.8% of samples (the Illiad dataset initially relied on single raters), creating a noisy reference standard; models exhibit a consistent positive error bias on insomnia and anxiety items, and the datasets originate from structured research settings that do not reflect routine clinical variability.
Strategic Implication: The 0.877 absolute agreement ICC under the Extended protocol and demonstrated protocol-agnosticism are genuine steps toward scalable psychiatric endpoint measurement in clinical trials, where rater variance is a documented contributor to trial failure; FDA pathway for autonomous psychiatric rating AI is undefined.
Executive Summary: ADAPTS demonstrates LLM-based psychiatric interview re-rating approaching expert rank-order reliability across heterogeneous protocols, with systematic calibration bias addressable via clinical knowledge injection, but the company-funded design and structured-research-only validation constrain deployment claims.

Innovation: 7/10 | Applicability: 6/10 | Commercial Viability: 7/10
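
For readers weighing the 0.877 figure, the sketch below computes ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form from Shrout and Fleiss, using the standard ANOVA mean squares. The rating matrices are made up for illustration and have nothing to do with the ADAPTS data.

```python
def icc2_1(ratings):
    """ICC(2,1) for an n-subjects x k-raters matrix of scores.

    Assumes n >= 2, k >= 2, and that the scores are not all identical
    (otherwise the denominator is zero).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data: rater 2 agrees exactly, then scores one point high throughout
identical = icc2_1([[1, 1], [2, 2], [3, 3]])
offset = icc2_1([[1, 2], [2, 3], [3, 4]])
```

In the second matrix the two raters rank every subject identically, yet the coefficient falls to about 0.67 because absolute agreement penalizes the constant one-point offset. A systematic scoring bias therefore shows up directly in this statistic, whereas a pure rank-order measure would hide it.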

PubMed Gems

OncoPT: long-context transformer models for in hospital tumor phenotype extraction from pathology reports.

Brief: OncoPT utilizes Longformer and BigBird architectures to extract five key tumor phenotypes from unstructured, long-form pathology reports. The models are designed for on-premises deployment to ensure PHI privacy while handling sequences up to 4,096 tokens.
Methodological Integrity: The use of a private dataset limits external reproducibility, and the reliance on weighted F1 scores may mask performance gaps in rare malignancy classes.
Strategic Implication: Shifts tumor registry from manual abstraction to automated, onsite extraction, reducing administrative burden while bypassing cloud-based privacy hurdles.
Executive Summary: OncoPT outperforms GPT-4o and o1 on the CORAL dataset for oncology phenotype extraction. It enables local execution of long-context medical NLP without violating data residency requirements.

Innovation: 6/10 | Applicability: 8/10 | Commercial Viability: 7/10
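
The concern about weighted F1 masking rare-class gaps is easy to see with a toy example. The sketch below computes per-class, macro, and support-weighted F1 in plain Python; the class names and counts are invented for illustration and are not drawn from the OncoPT evaluation.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 for each label seen in either the truth or the predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class frequency in the ground truth."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Invented example: 95 common-class reports, 5 rare-class reports,
# and a model that misses 4 of the 5 rare cases.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 95 + ["rare"] + ["common"] * 4

weighted = weighted_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
```

Here the weighted score stays near 0.95 while the macro score drops to roughly 0.66, even though the model found only one of five rare cases. Reporting both, or per-class scores, would close the gap this brief flags.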

End-of-surgery prediction of postoperative infectious complications from intraoperative vital-sign dynamics.

Brief: A Random Forest model integrates intraoperative vital-sign time-series features, including summary, trend, and higher-order statistics (kurtosis, skewness, entropy) derived from arterial blood pressure, heart rate, SpO₂, nasopharyngeal temperature, and EtCO₂, to predict postoperative bacterial infections at the moment of skin closure. Across 10,719 surgical cases from a single Swiss tertiary center it achieves AUROC 0.880 (95% CI 0.850–0.911) versus 0.806 for the preoperative-only baseline.
Methodological Integrity: Requiring complete high-fidelity invasive monitoring introduces significant selection bias — included patients were older, had higher ASA scores, and underwent longer operations than excluded cases, with a 4.7% vs. 2.8% infection rate differential; single-center retrospective design, no prospective validation, and systematic overestimation of infection risk across all procedure clusters indicate the model requires recalibration before clinical deployment.
Strategic Implication: Shifting infection risk stratification to the end-of-surgery timepoint creates a 48–72 hour early warning window for targeted surveillance and antimicrobial intervention; integration into perioperative monitoring systems is technically tractable but requires prospective multi-center validation and regulatory engagement not yet initiated.
Executive Summary: A well-executed single-center study demonstrating that intraoperative vital-sign dynamics encode meaningful infection risk signal beyond preoperative factors, with AUROC approaching postoperative laboratory-based models; external validation and selection bias resolution are prerequisites for clinical translation.

Innovation: 7/10 | Applicability: 6/10 | Commercial Viability: 7/10
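
As a rough picture of what "summary, trend, and higher-order statistics" could look like for one vital-sign channel, here is a minimal feature extractor. The specific definitions (population moments, least-squares slope against sample index, Shannon entropy of an 8-bin histogram) and the function name are assumptions; the paper's exact feature set and windowing are not described in this brief.

```python
import math

def vital_sign_features(values, bins=8):
    """Summary, trend, and higher-order features for one series (needs >= 2 samples)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)

    # Trend: least-squares slope of value against sample index
    t_mean = (n - 1) / 2
    slope = (sum((i - t_mean) * (v - mean) for i, v in enumerate(values))
             / sum((i - t_mean) ** 2 for i in range(n)))

    # Higher-order shape: standardized third and fourth moments
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3 if std else 0.0
    kurt = sum((v - mean) ** 4 for v in values) / n / std ** 4 - 3 if std else 0.0

    # Shannon entropy (bits) of an equal-width histogram of the values
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c)

    return {"mean": mean, "std": std, "slope": slope,
            "skewness": skew, "excess_kurtosis": kurt, "entropy": entropy}

# Hypothetical heart-rate samples rising steadily by 2 bpm per reading
features = vital_sign_features([60, 62, 64, 66, 68])
```

One such dictionary per channel, concatenated with preoperative variables, gives a fixed-length row that a tree ensemble can consume regardless of how long the operation ran.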

AI Clinical Trials (ClinicalTrials.gov)

MARVEL Project: Manual vs. VELYS Robotic-Assisted Arthroplasty for Knee Osteoarthritis — NCT07551856

Summary: MARVEL is a multicentre, pragmatic, parallel-group, blinded RCT (n=346) comparing VELYS imageless robotic-assisted TKA (raTKA) using functional alignment against conventional manual TKA (mTKA) in primary knee osteoarthritis. Primary endpoint is change in Forgotten Joint Score (FJS) at 6 months; secondary endpoints span PROMs, 3D motion analysis, intraoperative kinematics, health economics, human factors (NASA Task Load Index, System Usability Scale), and surgical safety. Ten-year registry linkage provides long-term follow-up beyond the 12-month active follow-up window. Surgeons must complete ≥20 non-trial VELYS cases before recruiting, and an internal pilot will evaluate learning curve effects.

Methodological Integrity: Industry collaboration with J&J/DePuy introduces commercial COI, though sponsorship is NHS-led; the blinding strategy is unclear given the visible intraoperative use of robotic hardware; 173 participants per arm provides 80% power to detect the FJS MCID of 7 points, though the 15% dropout assumption may be optimistic for a 12-month follow-up with a predominantly older surgical cohort.

Strategic Implication: Robust pragmatic RCT evidence on imageless robotic TKA is the key missing data point for NHS value-based procurement and payer reimbursement decisions; the multi-domain design (biomechanics, health economics, human factors) positions this trial to influence both commissioning policy and surgical adoption curves beyond a single outcome measure.

Executive Summary: MARVEL is one of the first adequately powered pragmatic RCTs of imageless robotic TKA with a multi-domain endpoint suite; its NHS sponsorship and registry linkage design address a genuine evidentiary gap, with results expected to directly inform health technology assessment and procurement strategy in the UK and beyond.

Innovation: 6/10 | Applicability: 8/10 | Commercial Viability: 10/10
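
The power statement above can be sanity-checked with the standard two-sample normal approximation, n per arm = 2 * ((z_alpha/2 + z_beta) * SD / delta)^2, inflated for dropout. The sketch below is a back-of-envelope aid only: the two-sided alpha of 0.05 and especially the SD of 20 FJS points are placeholders, not values taken from the MARVEL protocol.

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Evaluable participants per arm to detect a mean difference `delta`
    (two-sided test, equal allocation, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def inflate_for_dropout(n, dropout):
    """Recruitment target so that `n` participants remain after dropout."""
    return math.ceil(n / (1 - dropout))

def evaluable(n_recruited, dropout):
    """Participants expected to remain from `n_recruited` after dropout."""
    return math.floor(n_recruited * (1 - dropout))

# delta = 7 is the FJS MCID quoted above; sd = 20 is a placeholder assumption
needed = n_per_arm(delta=7, sd=20)
recruit = inflate_for_dropout(needed, dropout=0.15)
remaining = evaluable(173, dropout=0.15)
```

Under the trial's stated 15% dropout, 173 recruited per arm leaves about 147 evaluable. Re-running `evaluable` with a higher dropout rate shows how quickly that margin erodes, which is the risk noted under Methodological Integrity.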