Research Deep Dive: RaDialog Copilot

AI-assisted radiology reporting with confidence-based text suggestions and a GitHub Copilot-style interface
Private academic project — no public demo or repository
👥 TUM × Johns Hopkins Collaboration · 🏆 Academic Research Project · 📅 2024

Project Overview

Confidence Threshold: 0.7 (validated)
Latency (Fast): ~2s (entropy)
Latency (SOTA): ~30s (Duan)
Early Stopping: Hybrid, ≥3 tokens
Calibration: Monotonic (Duan)
Problem

Radiology report drafting needs assistive AI that is transparent and verifiable in real time. RaDialog Copilot explores how confidence-aware NLP can support radiologists during drafting: it integrates calibrated confidence methods into a feedback loop that aligns AI suggestions with clinician input, aiming to make drafting both faster and more verifiable.

Gap
There is no confidence-transparent copilot UI tailored to X-ray report writing.
Technology Stack & Methods
Medical AI
Confidence Estimation
NLP
Model Calibration
Sentence Embeddings
React
Flask
Entropy Analysis

Methods & Architecture

Confidence Estimation Methods

We compared logit-based entropy, sentence-embedding calibration (Duan), and entailment-based NLI approaches, balancing latency against calibration quality.

Method | Family | Latency | Notes | Verdict
Simple Entropy | Logit-based baseline | ~2s | Baseline calibration; low semantic signal | Baseline only
Duan (med. ST) | Sentence embeddings (medical) | ~30s | Best ascending calibration; domain-trained | Most promising
Kuhn (NLI) | Entailment (DeBERTa-MNLI) | ~9.5s | Bi-directional entailment; non-linear < 0.6 | Inconsistent below 0.6
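
To make the logit-based entropy baseline concrete, here is a minimal sketch of one way such a score could be computed. It is not the project's actual implementation: the function names, the normalization by maximum entropy, and the mean aggregation over tokens are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidence(logits: Sequence[float]) -> float:
    """Confidence as 1 minus the normalized Shannon entropy (assumed scoring rule).

    Returns roughly 0.0 for a uniform distribution and approaches 1.0
    when nearly all probability mass sits on a single token.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:  # single-token vocabulary: nothing to be unsure about
        return 1.0
    return 1.0 - entropy / max_entropy


def suggestion_confidence(step_logits: Sequence[Sequence[float]]) -> float:
    """Mean per-token confidence over a suggestion (aggregation is an assumption)."""
    if not step_logits:
        return 0.0
    scores = [token_confidence(step) for step in step_logits]
    return sum(scores) / len(scores)
```

Because this score looks only at the shape of the next-token distribution, it is cheap to compute but carries little semantic information, which is consistent with its role as a baseline.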

System Architecture

Frontend
  • React + Tailwind
  • Rich editor with acceptance keybinds
  • Opacity = confidence visual feedback
  • Findings checklist, PDF export
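
The opacity cue can be illustrated with a small mapping function. This is a sketch in Python for consistency with the backend, not the React code itself; the linear ramp, the `min_opacity` floor, and hiding suggestions below the 0.7 threshold are assumptions about how the mapping might work.

```python
def suggestion_opacity(
    confidence: float,
    threshold: float = 0.7,
    min_opacity: float = 0.35,
) -> float:
    """Map a confidence in [0, 1] to a CSS-style opacity in [0, 1].

    Assumed behavior: suggestions below the threshold are hidden (opacity 0);
    at the threshold they appear faintly, then ramp linearly to fully opaque.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < threshold:
        return 0.0
    if threshold >= 1.0:
        return 1.0
    ramp = (confidence - threshold) / (1.0 - threshold)
    return min_opacity + (1.0 - min_opacity) * ramp
```

A floor above zero keeps a just-accepted suggestion legible, while the ramp lets the radiologist see at a glance which completions the model is more sure of.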
Backend API
  • Flask + RaDialog integration
  • Image preprocessing pipeline
  • Confidence modules (entropy, embeddings, NLI)
  • Caching + streaming for responsiveness
AI Pipeline
  • Greedy base with controlled variance
  • Calibration analysis (binning, metrics)
  • Semantic rule‑based stopping
  • Mode switch: fast vs. SOTA
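
The calibration analysis step (binning, metrics) can be sketched as follows. This is a generic reliability-binning routine, not the project's evaluation code: the equal-width bins, the expected calibration error (ECE) metric, and the monotonicity check are standard choices assumed here, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Bin:
    lower: float
    upper: float
    count: int
    mean_confidence: float
    accuracy: float


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> List[Bin]:
    """Group predictions into equal-width confidence bins; empty bins are skipped."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    buckets: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)  # clamp 1.0 into last bin
        buckets[idx].append((conf, 1.0 if ok else 0.0))
    bins = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        n = len(items)
        bins.append(Bin(
            lower=i / n_bins,
            upper=(i + 1) / n_bins,
            count=n,
            mean_confidence=sum(c for c, _ in items) / n,
            accuracy=sum(a for _, a in items) / n,
        ))
    return bins


def expected_calibration_error(bins: Sequence[Bin]) -> float:
    """Count-weighted mean gap between accuracy and confidence across bins."""
    total = sum(b.count for b in bins)
    if total == 0:
        return 0.0
    return sum(b.count / total * abs(b.accuracy - b.mean_confidence) for b in bins)


def is_monotonic(bins: Sequence[Bin]) -> bool:
    """True when accuracy never decreases as the confidence bin increases."""
    accs = [b.accuracy for b in bins]
    return all(lo <= hi for lo, hi in zip(accs, accs[1:]))
```

A method whose bins pass `is_monotonic` produces an ascending calibration curve: higher reported confidence never corresponds to lower observed accuracy.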

Demo & Key Findings

Upload → Analysis → Export

Complete Workflow Demo

Unlike one-shot text assistants, RaDialog Copilot responds as the radiologist types, surfacing confidence-weighted completions that can be accepted with a keystroke or ignored without disruption.

[Demo: Confidence-Aware Typing, real-time confidence feedback]

[Figure: Complete RaDialog Copilot Interface]

Research Results
  • Duan method yields best monotonic calibration curves
  • Higher temperature improves calibration alignment
  • Domain‑trained encoders outperform general models
  • A 0.7 confidence threshold cleanly separates relevant suggestions from noise
Academic Contributions
  • Evaluation framework for confidence in clinical NLP
  • Practical confidence‑aware copilot UI for reporting
  • Evidence for domain‑trained encoders in calibration

Analysis & Impact

Before
  • Generic text assistant.
  • Long completions requiring pruning.
  • No confidence cues; manual edits heavy.
After
  • Confidence‑weighted inline suggestions.
  • Early‑stop at clause boundary.
  • Type‑constrained snippets; faster acceptance.

These gains translate into smoother workflows: fewer edits, faster acceptance, and reduced frustration with over-confident AI text.

Accept rate: +18% (demo metric)
Edit time: −25% (demo metric)
Overrun trimmed: −40% (demo metric)
TTFS: 220 ms (demo metric)

Key Design Decisions

Decision | Rationale | Trade-off
Fast vs SOTA modes | ~2s for UX; ~30s when high certainty is needed | Speed vs calibration quality
0.7 threshold | Balances acceptance and precision | May hide rare low-confidence findings
Hybrid early-stop | Avoids jitter and run-on sentences | Occasional clipped-clause edge cases
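
The hybrid early-stop decision can be illustrated with a short sketch. The exact rule used in the project is not documented here, so this is one plausible reading: always emit at least three tokens (to avoid jitter), then stop at the first clause boundary or at the first token whose confidence drops below the threshold (to avoid run-on text). The punctuation set, `max_tokens` cap, and function name are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed clause-boundary markers; the real system uses semantic rules.
CLAUSE_END = (".", ",", ";", ":")


def truncate_suggestion(
    tokens: Iterable[Tuple[str, float]],
    min_tokens: int = 3,
    threshold: float = 0.7,
    max_tokens: int = 24,
) -> List[str]:
    """Cut a stream of (token_text, confidence) pairs into one short suggestion."""
    kept: List[str] = []
    for text, confidence in tokens:
        # After the minimum length, a low-confidence token ends the suggestion
        # before it is shown.
        if len(kept) >= min_tokens and confidence < threshold:
            break
        kept.append(text)
        # After the minimum length, stop right after a clause boundary.
        if len(kept) >= min_tokens and text.rstrip().endswith(CLAUSE_END):
            break
        if len(kept) >= max_tokens:  # hard cap as a safety net
            break
    return kept
```

For example, the stream "No", "acute", "cardiopulmonary", "process", "." followed by a low-confidence "The" is cut after the period. The trade-off noted in the table shows up when a boundary marker appears inside a clause that should have continued.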

Research by Andrei Zitti, 2024

Supervised by Chantal Pellegrini & Ege Özsoy • TUM × Johns Hopkins Collaboration