Research Deep Dive: RaDialog Copilot

AI-assisted radiology reporting with confidence-based text suggestions and a GitHub Copilot-style interface

Private academic project — no public demo or repository

👥 TUM × Johns Hopkins Collaboration 🏆 Academic Research Project 📅 2024

Project Overview

Confidence Threshold

0.7 (validated)

Latency (Fast)

~2s (entropy)

Latency (SOTA)

~30s (Duan)

Early Stopping

Hybrid ≥3 tokens

Calibration

Monotonic (Duan)

Problem

Radiology report drafting needs assistive AI that is transparent and verifiable in real time. RaDialog Copilot explores how confidence-aware NLP can support radiologists during report drafting. The system integrates calibrated confidence estimates into a feedback loop that aligns AI suggestions with clinician input, aiming to make drafting both faster and more verifiable.

Gap

No confidence‑transparent copilot UI tailored for X‑ray report writing.

Technology Stack & Methods

Medical AI

Confidence Estimation

NLP

Model Calibration

Sentence Embeddings

React

Flask

Entropy Analysis

Methods & Architecture

Confidence Estimation Methods

We compared logit-based entropy, sentence-embedding calibration (Duan), and entailment-based NLI approaches, balancing latency against calibration quality.

Method	Family	Latency	Notes	Verdict
Simple Entropy	Logit‑based baseline	~2s	Baseline calibration; low semantic signal	Baseline only
Duan (med. ST)	Sentence embeddings (medical)	~30s	Best ascending calibration; domain‑trained	Most promising
Kuhn (NLI)	Entailment (DeBERTa‑MNLI)	~9.5s	Bi‑directional entailment; non‑linear below 0.6	Inconsistent below 0.6
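
To make the logit-based baseline concrete, here is a minimal sketch of an entropy-derived confidence score. It assumes the model exposes one logit vector per generated token; the function names and the choice to average one minus normalized entropy over tokens are illustrative assumptions, not the project's actual implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_confidence(step_logits: Sequence[Sequence[float]]) -> float:
    """Hypothetical score: mean of (1 - normalized entropy) over tokens.

    Returns a value in [0, 1]; 0.0 for an empty sequence.
    """
    if not step_logits:
        return 0.0
    scores = []
    for logits in step_logits:
        max_entropy = math.log(len(logits))  # entropy of a uniform distribution
        if max_entropy == 0.0:
            scores.append(1.0)  # a single-entry vocabulary has no uncertainty
            continue
        scores.append(1.0 - token_entropy(logits) / max_entropy)
    return sum(scores) / len(scores)
```

A uniform distribution scores close to 0 and a sharply peaked one close to 1. Because the score only looks at token probabilities, it carries little semantic signal, which is consistent with the "baseline only" verdict in the table.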

System Architecture

Frontend

React + Tailwind
Rich editor with acceptance keybinds
Opacity = confidence visual feedback
Findings checklist, PDF export

Backend API

Flask + RaDialog integration
Image preprocessing pipeline
Confidence modules (entropy, embeddings, NLI)
Caching + streaming for responsiveness

AI Pipeline

Greedy base with controlled variance
Calibration analysis (binning, metrics)
Semantic rule‑based stopping
Mode switch: fast vs. SOTA
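
The calibration analysis step (binning, metrics) can be sketched as follows. This is an assumption-laden illustration: it supposes each suggestion has a confidence in [0, 1] and a boolean correctness label (for example, from reviewer judgement), and it uses equal-width bins with expected calibration error (ECE) as the metric. The source does not state which metrics were actually used.

```python
from typing import List, Sequence, Tuple

Bin = Tuple[float, float, int]  # (mean confidence, accuracy, count)


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> List[Bin]:
    """Group predictions into equal-width confidence bins; skip empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        sums[idx][0] += c
        sums[idx][1] += 1.0 if ok else 0.0
        sums[idx][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2] > 0]


def expected_calibration_error(bins: Sequence[Bin]) -> float:
    """Count-weighted mean gap between confidence and accuracy."""
    total = sum(n for _, _, n in bins)
    if total == 0:
        return 0.0
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


def is_monotonic(bins: Sequence[Bin]) -> bool:
    """True if accuracy never decreases as confidence increases."""
    accs = [acc for _, acc, _ in bins]
    return all(a <= b for a, b in zip(accs, accs[1:]))
```

A monotonic curve in this sense means that higher reported confidence never corresponds to lower observed accuracy, which is one reasonable reading of the "monotonic calibration" property attributed to the Duan method.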

Demo & Key Findings

Upload

Analysis

Export

Complete Workflow Demo

Unlike one-shot text assistants, RaDialog Copilot responds as the radiologist types, surfacing confidence-weighted completions that can be accepted with a keystroke or ignored without disruption.

Confidence-Aware Typing

Real-time confidence feedback

Complete RaDialog Copilot Interface

Research Results

• Duan method yields best monotonic calibration curves
• Higher temperature improves calibration alignment
• Domain‑trained encoders outperform general models
• A 0.7 confidence threshold cleanly separates relevant suggestions from noise

Academic Contributions

• Evaluation framework for confidence in clinical NLP
• Practical confidence‑aware copilot UI for reporting
• Evidence for domain‑trained encoders in calibration

Analysis & Impact

Before

Generic text assistant.
Long completions requiring pruning.
No confidence cues; heavy manual editing.

After

Confidence‑weighted inline suggestions.
Early‑stop at clause boundary.
Type‑constrained snippets; faster acceptance.

These gains translate into smoother workflows: fewer edits, faster acceptance, and reduced frustration with over-confident AI text.

+18%

Accept rate

Demo metric

−25%

Edit time

Demo metric

−40%

Overrun trimmed

Demo metric

220 ms

TTFS

Demo metric

Key Design Decisions

Decision	Rationale	Trade‑off
Fast vs SOTA modes	~2s for responsive UX; ~30s when high certainty is needed	Speed vs calibration quality
0.7 threshold	Balances acceptance and precision	May hide rare low‑confidence findings
Hybrid early‑stop	Avoids jitter and run‑on sentences	Occasional clipped clause edge cases
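
The threshold and early-stop decisions above can be illustrated with a short sketch. It reads "hybrid ≥3 tokens" as "stop at the first clause boundary once at least three tokens have been emitted", which is an interpretation rather than something the source spells out. The punctuation set, the `MAX_TOKENS` cap, and the function names are hypothetical.

```python
from typing import Iterable, List, Optional

CLAUSE_END = {".", ",", ";", ":"}  # assumed clause-boundary punctuation
MIN_TOKENS = 3                     # from the "hybrid ≥3 tokens" setting
MAX_TOKENS = 24                    # hypothetical cap against run-on output
THRESHOLD = 0.7                    # confidence threshold from the table


def early_stop(
    tokens: Iterable[str],
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Consume tokens until a clause boundary (after min_tokens) or the cap."""
    out: List[str] = []
    for tok in tokens:
        out.append(tok)
        at_boundary = tok[-1:] in CLAUSE_END
        if at_boundary and len(out) >= min_tokens:
            break  # long enough and at a natural stopping point
        if len(out) >= max_tokens:
            break  # hard cap to avoid run-on suggestions
    return out


def suggest(
    tokens: Iterable[str], confidence: float, threshold: float = THRESHOLD
) -> Optional[str]:
    """Return a trimmed suggestion, or None if confidence is below threshold."""
    if confidence < threshold:
        return None
    return " ".join(early_stop(tokens))
```

Under these assumptions, a stream such as "No pleural effusion, or pneumothorax." is cut after "effusion,", which shows the clipped-clause edge case named in the trade-off column. The minimum-token rule keeps very short fragments like "Normal." from ending a suggestion too early.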