Dataset Directory

Curated Guide to Public Medical Imaging Datasets

18 datasets in 8 categories, assessed for AI-readiness on four data quality dimensions.

March 2026

There is no shortage of public medical imaging data. Dozens of well-known datasets span CT, MRI, X-ray, pathology, ultrasound, and cardiac imaging. Many have thousands — even hundreds of thousands — of studies available for download. The challenge isn't access. It's readiness.

Most public datasets were designed for research benchmarks, not production AI pipelines. That means critical information is often missing: who annotated the data and what qualified them, how the images were preprocessed, whether files can be verified for integrity, and whether the structure is consistent enough to load without custom code. This directory catalogues the most important public datasets and assesses each one against the data quality dimensions that matter for real-world AI development.

Data quality assessment key

Strong — well-documented, standardized

Partial — present but incomplete

Missing — absent or undocumented

General and multi-modality repositories (3 datasets)

The Cancer Imaging Archive (TCIA)

One of the most important open repositories for medical imaging research. Hosts collections across CT, MRI, PET, and other modalities including LIDC-IDRI and TCGA imaging cohorts.

CT, MRI, PET, Multi-collection

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

UK Biobank

Over 100,000 imaging datasets spanning brain, cardiac, and abdominal MRI. Rich clinical and genomic linkage. Requires application approval.

MRI, Cardiac, Brain, Genomics, Restricted access

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

OpenNeuro

Open platform for neuroimaging data using the BIDS standard — one of the few imaging domains with a widely adopted data organization format.

MRI, BIDS format, Neuroimaging

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

CT imaging (3 datasets)

LIDC-IDRI

The gold standard for lung nodule detection. 1,018 chest CT scans with XML-based annotations from four radiologists per case. Hosted on TCIA.

Lung CT, Nodule detection, Multi-reader, Via TCIA

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

NSCLC Radiogenomics

CT with paired genomic data for non-small cell lung cancer. Useful for radiomic feature extraction and biomarker discovery.

Lung CT, Genomics, Radiomics

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

MosMedData

COVID-19 CT scans with severity scoring. Useful for classification and severity assessment research.

Lung CT, COVID-19, Severity scoring

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Brain imaging (MRI) (3 datasets)

BraTS (Brain Tumor Segmentation)

Multi-institutional brain MRI for tumor segmentation benchmarking. Annotations cover tumor sub-regions.

Brain MRI, Segmentation, Challenge

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

ADNI (Alzheimer's Disease Neuroimaging Initiative)

Longitudinal brain MRI and PET with linked clinical data. Extremely valuable for disease progression modeling.

Brain MRI, PET, Longitudinal, Restricted access

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

IXI Dataset

Nearly 600 healthy brain MRI scans from three London hospitals. Useful as a baseline and control dataset.

Brain MRI, Healthy controls, Multi-site

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Chest imaging (X-ray) (3 datasets)

NIH ChestX-ray14

Over 100,000 chest X-rays with 14 disease labels. The labels were derived from radiology reports using NLP; the dataset is widely used but known to contain label noise.

Chest X-ray, NLP labels, 100k+ images

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

CheXpert (Stanford)

Large chest X-ray dataset with improved labeling and explicit uncertainty labels.

Chest X-ray, Uncertainty labels, Stanford

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

MIMIC-CXR

Over 377,000 chest radiographs linked with ICU clinical data and free-text radiology reports via PhysioNet.

Chest X-ray, EHR linked, Reports, PhysioNet

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Cardiac imaging (2 datasets)

ACDC (Automated Cardiac Diagnosis Challenge)

Cardiac MRI segmentation benchmark. 150 exams with annotations covering ventricles and myocardium across five clinical groups.

Cardiac MRI, Segmentation, Challenge

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

EchoNet-Dynamic

10,030 echocardiogram videos with ejection fraction labels. One of the strongest cardiac ultrasound datasets available.

Echocardiography, Video, EF prediction, Stanford

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Histopathology (whole slide imaging) (2 datasets)

CAMELYON16 / CAMELYON17

Lymph node metastasis detection from whole slide images. Multi-center data in CAMELYON17 for generalizability testing.

WSI, Metastasis detection, Challenge

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

PANDA (Prostate Cancer Grade Assessment)

Over 10,000 whole slide images of prostate biopsies with Gleason grading. Intentionally noisy labels reflecting real-world variability.

WSI, Prostate, Noisy labels, Kaggle

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Ultrasound (1 dataset)

BUSI (Breast Ultrasound Images)

780 breast ultrasound images with segmentation masks across normal, benign, and malignant categories.

Breast ultrasound, Segmentation, Classification

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

Multi-modal and clinical integration (1 dataset)

PhysioNet (MIMIC-IV ecosystem)

Combines imaging (MIMIC-CXR), electronic health records (MIMIC-IV), and physiological waveforms. The closest to a real-world clinical data environment for research.

X-ray, EHR, Waveforms, ICU, Multi-modal

DICOM metadata

Annotation provenance

Structural consistency

Clinical linkage

The pattern is consistent. Across nearly every dataset in this directory, the same gaps appear: annotation provenance is partial at best, preprocessing decisions are undocumented, and integrity verification is absent. These datasets were built for research benchmarks — not for production AI pipelines where traceability, reproducibility, and regulatory compliance matter.

What this directory reveals

Scan the quality assessments across all 18 datasets and a pattern emerges. DICOM metadata and structural consistency are often adequate — these are the dimensions that dataset creators tend to address because they're required to make the data loadable.
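As a rough illustration of what "structural consistency" can mean in practice, the sketch below flags case folders whose mix of file types differs from the most common layout in a download. It assumes a hypothetical one-folder-per-case layout and compares only file extensions; the function name `inconsistent_cases` is illustrative and is not how the assessments in this directory were produced.

```python
from collections import Counter
from pathlib import Path


def inconsistent_cases(root: Path) -> list[str]:
    """Return names of case folders whose file-type mix is atypical.

    Assumption: each immediate subfolder of `root` is one case, and a
    case's "layout" is the set of file extensions found beneath it.
    """
    layouts = {
        case.name: frozenset(
            p.suffix.lower() for p in case.rglob("*") if p.is_file()
        )
        for case in sorted(root.iterdir())
        if case.is_dir()
    }
    if not layouts:
        return []
    # Treat the most frequent layout as the reference structure.
    reference, _ = Counter(layouts.values()).most_common(1)[0]
    return [name for name, layout in layouts.items() if layout != reference]
```

A check like this only shows that the folders look alike. It says nothing about whether the headers inside the files are complete or correct.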

Annotation provenance is almost always partial. You can usually find out that annotations exist, but not who created them, what their qualifications were, what guidelines they followed, or how disagreements were resolved. For challenge datasets (BraTS, CAMELYON, ACDC), the annotation protocol is better documented — but still rarely at a level that would satisfy a regulatory submission.
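The four questions above (who annotated, with what qualifications, under which guidelines, and how disagreements were resolved) can be captured as a small record. The following is a minimal sketch: the `AnnotationProvenance` fields and the `rate` rule are hypothetical and simply reuse the Strong / Partial / Missing key from this page; they are not the scoring method behind the assessments.

```python
from dataclasses import dataclass

# Hypothetical field names, one per question raised in the text.
REQUIRED_FIELDS = (
    "annotator_id",
    "qualification",
    "guideline_version",
    "adjudication",
)


@dataclass
class AnnotationProvenance:
    annotator_id: str = ""       # pseudonymous reader identifier
    qualification: str = ""      # e.g. "board-certified radiologist"
    guideline_version: str = ""  # which annotation protocol was followed
    adjudication: str = ""       # how reader disagreements were resolved


def rate(record: AnnotationProvenance) -> str:
    """Map a record to the page's key, based on how many fields are filled."""
    filled = sum(bool(getattr(record, name).strip()) for name in REQUIRED_FIELDS)
    if filled == len(REQUIRED_FIELDS):
        return "Strong"
    if filled == 0:
        return "Missing"
    return "Partial"
```

Under this illustrative rule, a dataset that names its guidelines but not its annotators would come out as "Partial".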

Clinical linkage varies widely. Some datasets (UK Biobank, ADNI, MIMIC ecosystem) excel here. Most others treat imaging in isolation from the clinical context that generated it.

Integrity verification is effectively absent from every dataset on this list. There is no standard mechanism to confirm that the files you downloaded are identical to what was originally published — no checksums, no cryptographic signatures, no tamper detection.
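One way such a mechanism could look is a checksum manifest published alongside the data and re-checked after download. The sketch below uses SHA-256 from the Python standard library; the helper names (`build_manifest`, `verify`) and the manifest shape are assumptions made for illustration, not an existing standard or any dataset's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large studies need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style path relative to root to its SHA-256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare a local copy against a previously published manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "modified": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

A manifest of plain hashes detects accidental corruption and silent changes. Tamper resistance would additionally require signing the manifest itself, which this sketch does not attempt.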

These are not criticisms of the teams who built these datasets. They are observations about what the community has standardized and what it has not. The model side of medical imaging AI has frameworks, benchmarks, and reproducibility standards. The data side does not — yet.

This is the gap Princeton Medical Systems is working to close.

Need production-grade imaging datasets?

We build datasets with documented provenance, credentialed annotations, standardized structure, and integrity verification built in.

This directory is maintained by Princeton Medical Systems and reflects publicly available information as of March 2026. Quality assessments are based on our review of published documentation, data dictionaries, and dataset structures. Assessments may not reflect unpublished internal documentation maintained by dataset creators. If you maintain a dataset listed here and believe an assessment should be updated, please contact us.