No. 12 - Your LLM Summarizer Is Leaking Patient Race

Headline Insight

Health systems procuring LLM-based clinical summarization tools are routinely validating them on summarization quality metrics — ROUGE scores, clinician preference, drafting time reduction. This paper from Vanderbilt/VUMC demonstrates that those evaluations are missing a material security property: the vector representations that LLM summarization systems export as standard operational artifacts encode recoverable sensitive demographic information, specifically EHR-recorded race, even when the underlying source documents are access-restricted.

The mechanism is not exotic. When a clinical LLM processes a document, it produces not just text but intermediate representations (the last-token hidden state and mean-pooled prompt embeddings) that are routinely retained for downstream RAG retrieval, audit logging, monitoring dashboards, and similarity search. These artifacts are treated as anonymized numerical outputs and typically fall outside the access controls applied to raw clinical text. The Vanderbilt team shows that a linear probe trained on these artifacts recovers EHR-recorded race at rates substantially above chance on MIMIC-IV data. The exposure exists in the infrastructure, not in the model's text output.
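To make the mechanism concrete, here is a minimal sketch of how the two artifact types are derived from per-token hidden states and how a linear probe is fit on them. It is an illustration under stated assumptions, not the paper's protocol: the "hidden states" are synthetic Gaussian vectors with a planted attribute signal, and every name (`last_token`, `mean_pool`, `train_probe`, `synthetic_doc`) is hypothetical.

```python
import math
import random

def last_token(hidden, mask):
    """Return the hidden state of the final non-padding token."""
    last = max(i for i, m in enumerate(mask) if m)
    return hidden[last]

def mean_pool(hidden, mask):
    """Average the hidden states of all non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression (linear) probe by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = 1.0 / (1.0 + math.exp(-score(w, b, x))) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def probe_accuracy(w, b, X, y):
    return sum((score(w, b, x) > 0) == bool(t) for x, t in zip(X, y)) / len(y)

def synthetic_doc(label, rng, seq_len=6, dim=4, pad=2):
    """Toy stand-in for hidden states: the attribute shifts dimension 0 of every real token."""
    hidden = [[rng.gauss(0, 1) + (1.0 if label and j == 0 else 0.0)
               for j in range(dim)] for _ in range(seq_len)]
    hidden += [[0.0] * dim for _ in range(pad)]  # padding rows, excluded by the mask
    mask = [1] * seq_len + [0] * pad
    return hidden, mask

rng = random.Random(0)
labels = [i % 2 for i in range(400)]          # balanced binary attribute (assumption)
docs = [synthetic_doc(t, rng) for t in labels]

results = {}
for name, pool in (("lasttok", last_token), ("meanpool", mean_pool)):
    X = [pool(h, m) for h, m in docs]
    w, b = train_probe(X[:200], labels[:200])  # probe sees only vectors, never text
    results[name] = probe_accuracy(w, b, X[200:], labels[200:])
    print(name, "held-out probe accuracy:", round(results[name], 3), "(chance = 0.5)")
```

The point of the sketch is the audit shape rather than the numbers: the probe needs only the stored vectors and a label column, so anyone with read access to the vector store and a demographic field can run it. Any gap between the two artifacts here is a property of the toy data, not evidence about real systems.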

The mitigation the authors propose — SurfaceLoRA, a gradient-reversal adversarial PEFT approach — reduces recoverability on the last-token artifact toward chance level while preserving ROUGE summarization performance. This is a meaningful result: it demonstrates that the tradeoff between privacy and utility is not binary. However, the mitigation is artifact-specific. Protecting the last-token representation does not protect mean-pooled embeddings, which remain substantially above chance after the intervention. Health systems deploying vector stores, RAG pipelines, or any workflow that exports and persists LLM intermediate representations will need to audit and mitigate each artifact class independently.
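For readers unfamiliar with gradient reversal, the following is a minimal hand-computed sketch of the general idea, assuming a scalar toy model with squared-error losses. It is not SurfaceLoRA and involves no LoRA adapters; `GradientReversal`, `one_step_grads`, and the parameter `lam` are hypothetical names chosen for illustration.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, z):
        return z

    def backward(self, grad):
        return -self.lam * grad


def one_step_grads(w, v, x, y, a, lam):
    """Toy scalar model: encoder z = w * x, task output is z, adversary predicts a as v * z.

    Returns (gradient for the encoder weight w, gradient for the adversary weight v).
    """
    grl = GradientReversal(lam)
    z = w * x
    # Task loss: 0.5 * (z - y)^2
    d_task_dz = z - y
    # The adversary reads the feature through the reversal layer (forward is the identity).
    residual = v * grl.forward(z) - a        # adversary loss: 0.5 * residual^2
    d_adv_dv = residual * z                  # adversary descends its own loss as usual
    d_adv_dz = grl.backward(residual * v)    # encoder receives the sign-flipped gradient
    d_w = (d_task_dz + d_adv_dz) * x
    return d_w, d_adv_dv


d_w, d_v = one_step_grads(w=1.0, v=2.0, x=1.0, y=0.0, a=0.0, lam=0.5)
print("encoder gradient:", d_w, "adversary gradient:", d_v)
```

The adversary still learns to predict the attribute, while the encoder is pushed in the direction that makes that prediction harder. This also gives some intuition for the artifact-specific finding: the reversed gradient flows only through the representation the adversary reads, so a differently pooled artifact is not directly constrained.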

What this means operationally for health system AI governance:

The HIPAA Safe Harbor standard specifies 18 categories of identifiers that must be removed for de-identification; race is not among them, because it is not traditionally considered a direct identifier. This paper introduces a structural problem: LLM vector artifacts may act as an inference channel for demographic attributes that fall outside the standard de-identification framework, under access-control assumptions that health systems currently treat as sufficient.

Any health system that has deployed or is evaluating an LLM summarization tool with downstream vector export — including Epic-embedded AI summarization, vendor-provided ambient documentation, or internally developed RAG pipelines — should treat this paper as a vendor diligence trigger. The relevant questions are: what intermediate representations does the system export, where are they stored, who has access, and has the vendor conducted an artifact-level demographic leakage audit?

The paper's scope is single-attribute and single-institution, and the authors are explicit about these limits. Generalizability to other demographic variables — age, insurance status, socioeconomic proxies — is undemonstrated, though technically plausible by the same mechanism. The governance implication does not require broader generalizability to be actionable: a single recoverable sensitive attribute from a nominally access-restricted artifact is sufficient to constitute a material risk under most enterprise security frameworks.

This is not a theoretical vulnerability. It is a demonstrated empirical finding on real EHR data (MIMIC-IV), with a proposed, if incomplete, mitigation, reported in an arXiv preprint. The appropriate institutional response is an audit of deployed summarization infrastructure, not a monitoring brief.

Pre-Print Intelligence (arXiv)

VesselSim: Learning 3D Blood Vessel Segmentation Without Expert Annotations

Brief: VesselSim is a two-stage framework for annotation-free 3D vascular segmentation that trains a 3D U-Net exclusively on 16,500 procedurally generated synthetic angiography volumes and deploys a self-supervised mask-reconstruction decoder for test-time adaptation to unseen clinical scans. Evaluated zero-shot on HiP-CT kidney and TopCoW brain datasets (CTA and MRA), it matches or exceeds VesselFM and other foundation models trained on large real-data corpora — despite VesselFM having seen both evaluation datasets during training.

Methodological Integrity: The primary comparison is not like-for-like, and the asymmetry runs against VesselSim: VesselFM's training overlap with both test sets gives that baseline a significant advantage the paper does not sufficiently acknowledge, which makes the competitive result harder to interpret. The evaluation sets are small (3 volumes for HiP-CT and 125 each for TopCoW), and clDice variance is high, particularly on HiP-CT (±25%), limiting statistical confidence.

Strategic Implication: Eliminating the annotation bottleneck for vascular segmentation directly addresses a frequently cited barrier to scaling AI in interventional radiology and vascular surgery planning; a commercially viable module could be licensed into existing PACS and surgical navigation platforms without requiring site-specific labeled data.

Executive Summary: A credible proof-of-concept that geometry-driven synthetic pretraining can substitute for large annotated vascular datasets with competitive zero-shot generalization; deployment readiness requires prospective clinical validation and regulatory engagement not yet initiated.

Innovation: 7/10 | Applicability: 6/10 | Commercial Viability: 6/10

Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization

Brief: This Vanderbilt/VUMC study demonstrates that prompt-derived vector artifacts from LLM-based clinical summarization systems, specifically the last-token hidden state (lasttok) and mean-pooled prompt representations (meanpool), retain recoverable sensitive demographic information (EHR-recorded race) even when source documents remain access-restricted. The proposed mitigation, SurfaceLoRA, applies gradient-reversal adversarial PEFT to drive race recoverability on the targeted lasttok artifact toward chance level while preserving ROUGE summarization utility, though meanpool representations remain substantially above chance.

Methodological Integrity: The audit is single-attribute (race) and single-institution (MIMIC-IV), which limits generalizability across demographic variables (e.g., age, insurance status, socioeconomic proxies) and non-US EHR environments; EHR-recorded race is an administrative field with documented noise and inconsistency, not a ground-truth identity label. Mitigation is shown to be artifact-specific — reducing leakage on lasttok does not protect meanpool — a finding with direct governance implications that health systems deploying RAG or vector-store workflows will need to address independently per exported artifact.

Strategic Implication: For health systems building LLM summarization pipelines with downstream RAG, monitoring, or audit workflows, this constitutes a material HIPAA-adjacent governance risk requiring artifact-specific auditing and mitigation protocols before deployment; the framework is directly actionable for enterprise AI governance teams.

Executive Summary: A rigorous technical demonstration that clinical LLM summarization systems expose sensitive demographic information through exported vector artifacts under standard access-control assumptions, with a lightweight PEFT mitigation method that reduces but does not eliminate the risk; the findings are immediately relevant to health system AI procurement and vendor security diligence.

Innovation: 7/10 | Applicability: 7/10 | Commercial Viability: 6/10

SAR2Mesh: Towards 3D Heart Mesh Generation Using Contactless Radar Imaging and Physics-Informed Neural Network

Brief: SAR2Mesh proposes reconstructing high-fidelity 3D cardiac meshes from multi-view millimeter-wave SAR images using a coarse-to-fine graph convolutional deformation network, a geometry-aware multi-view feature projection module, and a differentiable physics-informed radar loss enforcing consistency between predicted geometry and raw radar echoes. The framework introduces Cardiac Mesh-SAR, a paired synthetic dataset derived from MRI-derived meshes simulated through a physics-based forward SAR model.

Methodological Integrity: All three evaluation datasets — MMWHS-SSM, ACDC, and the private dataset — use SAR data synthesized from MRI via the authors' own forward simulation rather than real radar acquisitions; this is a simulation-to-simulation validation with an unquantified sim-to-real gap that is the central unresolved limitation. No real in-vivo SAR hardware validation is presented, and there is no COI disclosure despite what appears to be an in-house clinical dataset.

Strategic Implication: If the sim-to-real gap can be closed, contactless cardiac monitoring via mmWave radar has a credible commercial pathway in ICU continuous monitoring and point-of-care settings where MRI is inaccessible — but the current absence of real-hardware validation leaves this entirely speculative.

Executive Summary: A technically novel first attempt at radar-to-3D cardiac mesh reconstruction with strong simulation-domain results, but the complete reliance on synthetic radar data means clinical applicability and deployment readiness cannot be assessed from this work alone.

Innovation: 8/10 | Applicability: 3/10 | Commercial Viability: 4/10

PathWISE: Multi-Agent Cancer Pathway Triaging Ontology Learning from Clinical Flowcharts

Brief: PathWISE is a five-phase multi-agent pipeline that converts visual NHS cancer pathway flowcharts into executable HL7 CQL libraries deployable as FHIR CDS Hooks services, combining VLM-based visual parsing, deterministic DFS graph traversal, SNOMED-CT-constrained CQL generation, and a Java CQL-to-ELM compiler-as-critic loop. Evaluated across five UK cancer pathways (colorectal, lung, skin, upper GI, breast), the system achieved 100% CQL compilation success across all three LLM configurations and identified 544 structured governance findings, with a Hybrid configuration (Gemini Pro Vision + Claude Sonnet) achieving zero unmapped terminology concepts on the lung pathway.

Methodological Integrity: Evaluation is confined to five pathways from a single national system (NHS England), with routing correctness assessed against only 25 synthetic FHIR R4 patient records — a sample inadequate for clinical generalization claims. Node-level semantic audit pass rates remain below 50% across all configurations (best: 48.9%), meaning the majority of pathway logic requires false placeholders rather than computable CQL; the system produces compilable but partially hollow artefacts. No live EHR deployment or clinician usability study is reported.

Strategic Implication: The compiler-as-critic design pattern — substituting a deterministic Java compiler for stochastic LLM critics — is the architecturally significant contribution and represents a credible blueprint for regulated clinical AI; the NHS England AIQCoP independent corroboration of content error findings provides meaningful external validation of audit reliability.
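The compiler-as-critic pattern can be sketched generically as a bounded generate, compile, repair loop. The sketch below is an assumption-laden illustration, not PathWISE's code: `refine_with_compiler`, `toy_generate`, and `toy_compile` are hypothetical, the "compiler" is a one-line string check standing in for a real CQL-to-ELM translator, and the "generator" is a stub standing in for an LLM call.

```python
from typing import Callable, List, Tuple

def refine_with_compiler(
    generate: Callable[[str, List[str]], str],
    compile_fn: Callable[[str], List[str]],
    spec: str,
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Regenerate until the deterministic checker reports no errors or rounds run out."""
    errors: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(spec, errors)   # errors from the previous round are fed back
        errors = compile_fn(draft)       # deterministic verdict, same input gives same output
        if not errors:
            return draft, errors, round_no
    return draft, errors, max_rounds

def toy_compile(draft: str) -> List[str]:
    """Stand-in checker (assumption): only verifies that a library header is present."""
    if not draft.startswith("library "):
        return ["missing 'library' declaration"]
    return []

def toy_generate(spec: str, errors: List[str]) -> str:
    """Stand-in generator (assumption): omits the header until told about it."""
    body = f'define "InScope": true  // from: {spec}'
    if any("library" in e for e in errors):
        return "library Demo version '1.0.0'\n" + body
    return body

draft, errors, rounds = refine_with_compiler(toy_generate, toy_compile, "urgent referral node")
print(f"accepted after {rounds} round(s); remaining errors: {errors}")
```

The design choice worth noting is that the acceptance test is deterministic and external to the model. It guarantees that the artefact compiles, but says nothing about whether the logic is clinically correct, which is consistent with the gap between the 100% compilation rate and the sub-50% semantic audit pass rate reported above.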

Executive Summary: PathWISE demonstrates a principled architecture for converting non-computable visual clinical guidelines into standards-based CDS artefacts with deterministic compilation guarantees; the 48.9% node-level semantic pass rate and absence of live clinical validation constrain near-term deployment claims, but the framework has direct relevance for NHS and equivalent national health system digitization programs.

Innovation: 8/10 | Applicability: 5/10 | Commercial Viability: 6/10

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

Brief: This paper from Innoloft Inc. proposes a conditional diffusion framework that forecasts 12-month DaTscan SPECT images in Parkinson's disease patients by conditioning a 2D U-Net on baseline screening DaTscan images and monthly levodopa equivalent daily dose (LEDD) vectors processed through a contrastive Transformer autoencoder. Evaluated on 212 PPMI participants (2,968 image pairs across 14 striatal slices), the model achieves 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM versus a no-progression static baseline.

Methodological Integrity: The baseline is the no-progression assumption — a deliberately weak comparator that does not include any prior DaTscan progression model, making the improvement magnitude difficult to contextualize against the published literature. The cohort of 212 subjects from a single longitudinal study (PPMI) is small for a generative imaging model, and evaluation uses only pixel-level metrics without clinical outcome validation (e.g., UPDRS correlation, progression subtype stratification). COI: lead author is affiliated with Innoloft Inc., a commercial entity; no independent validation dataset is used.
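To show what a comparison against a no-progression baseline involves, here is a minimal sketch assuming the reported percentages are relative error reductions (the digest does not state how the paper computed them). The two-pixel "scans" and all function names are hypothetical toy values, not PPMI data.

```python
def mse(pred, target):
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def mae(pred, target):
    """Mean absolute error between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def relative_reduction(model_err, baseline_err):
    """Percentage by which the model's error is lower than the baseline's."""
    return 100.0 * (baseline_err - model_err) / baseline_err

# Toy values (assumption): uptake declines between the baseline and follow-up scans.
baseline_scan = [1.0, 1.0]
followup_scan = [0.0, 0.0]
model_forecast = [0.5, 0.5]

# The no-progression comparator simply reuses the baseline scan as its forecast.
static_forecast = baseline_scan

mse_gain = relative_reduction(mse(model_forecast, followup_scan),
                              mse(static_forecast, followup_scan))
mae_gain = relative_reduction(mae(model_forecast, followup_scan),
                              mae(static_forecast, followup_scan))
print(f"MSE reduction vs static baseline: {mse_gain:.1f}%")
print(f"MAE reduction vs static baseline: {mae_gain:.1f}%")
```

Because the static comparator assumes no change at all, any forecast that moves in the right direction will show a gain on these metrics, which is why the absence of a stronger progression model as comparator matters for interpreting the headline percentages.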

Strategic Implication: Treatment-conditioned neuroimaging progression forecasting has a clear application in clinical trial enrichment and personalized levodopa titration, but without comparison to established progression models and without clinical endpoint validation, the current work does not yet support procurement or investment decisions.

Executive Summary: A technically coherent proof-of-concept for treatment-conditioned DaTscan forecasting in Parkinson's disease with modest quantitative improvements over a static baseline; the weak comparator, small cohort, commercial author affiliation, and absence of clinical outcome validation limit actionable conclusions at this stage.

Innovation: 7/10 | Applicability: 4/10 | Commercial Viability: 5/10

A Signal–Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

Brief: ECGCLIP is a CLIP-style contrastive learning framework that aligns 12-lead ECG waveforms with expert-authored diagnostic text reports, pre-trained on 2.8 million ECG–text pairs from Zhongshan Hospital (Fudan University). Evaluated across 89 downstream tasks — 45 standard ECG tasks, 39 echocardiography-derived structural phenotypes, and 5 rare cardiomyopathies — across nine independent external cohorts, ECGCLIP-R34 consistently outperforms random initialization and the Merl-R18 baseline, with particular gains in low-prevalence conditions (Ebstein anomaly PRAUC 0.253, cardiac amyloidosis 0.201) where baselines perform near chance. The model also demonstrates strong data efficiency: fine-tuning on 10% of labeled data matches or exceeds full-dataset baseline performance.
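For orientation, the generic CLIP-style objective can be sketched as a symmetric cross-entropy over a signal-by-text similarity matrix. This is a minimal illustration of that general technique, assuming toy 2-D embeddings and a temperature of 0.1; ECGCLIP's actual encoders, temperature, and loss details are not given here, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def clip_loss(signal_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss: pair i in one modality should match pair i in the other."""
    s = [normalize(v) for v in signal_emb]
    t = [normalize(v) for v in text_emb]
    n = len(s)
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    signal_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_signal = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                         for j in range(n)) / n
    return 0.5 * (signal_to_text + text_to_signal)

# Toy embeddings (assumption): two ECGs and their two reports.
ecg = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

aligned = clip_loss(ecg, aligned_reports)
swapped = clip_loss(ecg, swapped_reports)
print("loss with matching reports:", aligned)
print("loss with swapped reports: ", swapped)
```

The loss is low only when each waveform embedding sits closest to its own report, which is why the quality of the report text (here, machine-translated) bears directly on what the encoder learns.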

Methodological Integrity: Pre-training data originates overwhelmingly from a single Chinese tertiary center (Zhongshan Hospital), with Chinese-language reports machine-translated via GPT-4o before text encoding — introducing an unquantified translation noise risk and limiting generalizability to Western reporting ontologies and care patterns. PRAUC values for rare diseases, while multi-fold above chance, remain low in absolute terms (0.121–0.253), and confidence intervals on these tasks are wide, reflecting limited case counts and unstable estimates.
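When reading PRAUC for rare conditions, it helps to recall that an uninformative ranker is expected to score roughly the positive prevalence, so "multi-fold above chance" and "low in absolute terms" can both be true. The sketch below computes average precision (a common PRAUC estimate) and its ratio to prevalence on toy data; the scores, labels, and function names are hypothetical, and ties in scores are not handled specially.

```python
def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive case appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without positive cases")
    return total / hits

def fold_over_chance(ap, prevalence):
    """Ratio of observed average precision to the prevalence-level chance baseline."""
    return ap / prevalence

# Toy example (assumption): four cases, two of them positive.
scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, 0, 1, 0]

ap = average_precision(scores, labels)
prevalence = sum(labels) / len(labels)
print("average precision:", round(ap, 4))
print("chance level (prevalence):", prevalence)
print("fold over chance:", round(fold_over_chance(ap, prevalence), 2))
```

Applied to a rare disease, the same arithmetic means a PRAUC in the 0.1 to 0.3 range can represent a large multiple of chance while still implying that most flagged cases are false positives.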

Strategic Implication: The echocardiography screening capability — inferring structural phenotypes from ECG alone — has a credible application in resource-limited settings and population-level cardiac risk stratification, and data efficiency findings reduce the labeled-data barrier for prospective health system adoption; however, single-center pre-training concentration and the absence of prospective clinical outcome validation mean regulatory and procurement timelines remain multi-year. COI: No author conflicts of interest declared; senior co-authors affiliated with Zhongshan Hospital and Imperial College London.

Executive Summary: A well-powered ECG foundation model demonstrating robust cross-cohort generalization and meaningful rare-disease discrimination gains, with open weights; near-term deployment readiness is constrained by single-institution pre-training, GPT-translated supervision, and the absence of prospective clinical validation.

Innovation: 8/10 | Applicability: 6/10 | Commercial Viability: 7/10

PubMed Gems

Towards Generalizable Seizure Monitoring: EpiVLM for Cross-Environment Detection and Classification

Brief: EpiVLM is a fine-tuned multimodal vision-language system (Qwen2.5-VL-32B backbone + SAM2 segmentation) for automated seizure detection and semiology classification across five seizure types from routine video. Evaluated on 232 videos from 127 patients spanning two tertiary EMUs, home recordings, and a public dataset, it achieved accuracy 0.795–0.947 and sensitivity 0.842–0.957 without site-specific recalibration across external cohorts.

Methodological Integrity: The cohort is pediatric-enriched (median age 16.3 years), with markedly degraded performance in the ≥36 age stratum, the group carrying the highest real-world epilepsy burden, and confidence intervals for that stratum are too wide to support deployment claims. A false-positive rate (FPR) of 0.291–0.366 for manual automatisms persists across all pipeline configurations, a structural failure mode unresolved by either segmentation or fine-tuning; leave-one-center-out retraining was explicitly deferred due to imbalanced site sizes.
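To translate a false-positive rate into reviewer workload, the following minimal sketch derives accuracy, sensitivity, and FPR from confusion counts and estimates how many non-event segments would be flagged. The counts and the number of negative segments are hypothetical, chosen only to land near the reported FPR range; they are not taken from the paper.

```python
def rates(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of true events that are flagged
        "fpr": fp / (fp + tn),           # share of non-events that are flagged
    }

def expected_false_flags(fpr, negative_segments):
    """Expected number of non-event segments sent for clinician review."""
    return fpr * negative_segments

# Hypothetical counts (assumption), not the paper's data.
metrics = rates(tp=90, fp=30, tn=70, fn=10)
print(metrics)
print("false flags per 1,000 non-event segments:",
      expected_false_flags(metrics["fpr"], 1000))
```

Under these assumed counts, roughly three in ten non-event segments would reach a clinician, which is the practical reason an adjunctive-review framing is easier to defend than autonomous classification.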

Strategic Implication: The adjunctive surveillance framing — flagging candidate segments for clinician adjudication rather than autonomous classification — is the only defensible near-term deployment use case, and positions EpiVLM as a workload reduction tool for epilepsy monitoring units rather than a standalone diagnostic system. PACS/EHR integration and CE/FDA SaMD pathways for a multi-semiology video AI remain unaddressed, with no regulatory engagement reported.

Executive Summary: EpiVLM demonstrates multicenter-stable, low-latency seizure semiology recognition from routine video using a fine-tuned VLM pipeline, with external validation performance statistically indistinguishable across sites; adult generalizability gaps, persistent manual automatism false positives, and absent regulatory strategy constrain near-term deployment scope.

Innovation: 7/10 | Applicability: 6/10 | Commercial Viability: 6/10