Curated Guide to Public Medical Imaging Datasets

18 datasets across 8 modalities, assessed for AI-readiness across four data quality dimensions.

Dataset Directory · March 2026

There is no shortage of public medical imaging data. Dozens of well-known datasets span CT, MRI, X-ray, pathology, ultrasound, and cardiac imaging, and many offer thousands or even hundreds of thousands of studies for download. The challenge isn't access. It's readiness.

Most public datasets were designed for research benchmarks, not production AI pipelines. That means critical information is often missing: who annotated the data and what qualified them, how the images were preprocessed, whether files can be verified for integrity, and whether the structure is consistent enough to load without custom code. This directory catalogues the most important public datasets and assesses each one against the data quality dimensions that matter for real-world AI development.

Data quality assessment key
Strong — well-documented, standardized
Partial — present but incomplete
Missing — absent or undocumented

General and multi-modality repositories (3 datasets)
One of the most important open repositories for medical imaging research. Hosts collections across CT, MRI, PET, and other modalities including LIDC-IDRI and TCGA imaging cohorts.
CT · MRI · PET · Multi-collection
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Over 100,000 imaging datasets spanning brain, cardiac, and abdominal MRI. Rich clinical and genomic linkage. Requires application approval.
MRI · Cardiac · Brain · Genomics · Restricted access
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Open platform for neuroimaging data using the BIDS standard — one of the few imaging domains with a widely adopted data organization format.
MRI · BIDS format · Neuroimaging
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

CT imaging (3 datasets)
The gold standard for lung nodule detection. 1,018 chest CT scans with XML-based annotations from four radiologists per case. Hosted on TCIA.
Lung CT · Nodule detection · Multi-reader · Via TCIA
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
CT with paired genomic data for non-small cell lung cancer. Useful for radiomic feature extraction and biomarker discovery.
Lung CT · Genomics · Radiomics
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
COVID-19 CT scans with severity scoring. Useful for classification and severity assessment research.
Lung CT · COVID-19 · Severity scoring
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Brain imaging (MRI) (3 datasets)
Multi-institutional brain MRI for tumor segmentation benchmarking. Annotations cover tumor sub-regions.
Brain MRI · Segmentation · Challenge
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Longitudinal brain MRI and PET with linked clinical data. Extremely valuable for disease progression modeling.
Brain MRI · PET · Longitudinal · Restricted access
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Nearly 600 healthy brain MRI scans from three London hospitals. Useful as a baseline and control dataset.
Brain MRI · Healthy controls · Multi-site
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Chest imaging (X-ray) (3 datasets)
Over 100,000 chest X-rays with 14 disease labels. NLP-derived from radiology reports — widely used but known to contain label noise.
Chest X-ray · NLP labels · 100k+ images
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Large chest X-ray dataset with improved labeling and explicit uncertainty labels.
Chest X-ray · Uncertainty labels · Stanford
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Over 377,000 chest radiographs linked with ICU clinical data and free-text radiology reports via PhysioNet.
Chest X-ray · EHR linked · Reports · PhysioNet
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Cardiac imaging (2 datasets)
Cardiac MRI segmentation benchmark. 150 exams with annotations covering ventricles and myocardium across five clinical groups.
Cardiac MRI · Segmentation · Challenge
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
10,030 echocardiogram videos with ejection fraction labels. One of the strongest cardiac ultrasound datasets available.
Echocardiography · Video · EF prediction · Stanford
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Histopathology (whole slide imaging) (2 datasets)
Lymph node metastasis detection from whole slide images. Multi-center data in CAMELYON17 for generalizability testing.
WSI · Metastasis detection · Challenge
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage
Over 10,000 whole slide images of prostate biopsies with Gleason grading. Intentionally noisy labels reflecting real-world variability.
WSI · Prostate · Noisy labels · Kaggle
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Ultrasound (1 dataset)
780 breast ultrasound images with segmentation masks across normal, benign, and malignant categories.
Breast ultrasound · Segmentation · Classification
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

Multi-modal and clinical integration (1 dataset)
Combines imaging (MIMIC-CXR), electronic health records (MIMIC-IV), and physiological waveforms. The closest to a real-world clinical data environment for research.
X-ray · EHR · Waveforms · ICU · Multi-modal
DICOM metadata
Annotation provenance
Structural consistency
Clinical linkage

The pattern is consistent. Across nearly every dataset in this directory, the same gaps appear: annotation provenance is partial at best, preprocessing decisions are undocumented, and integrity verification is absent. These datasets were built for research benchmarks — not for production AI pipelines where traceability, reproducibility, and regulatory compliance matter.

What this directory reveals

Scan the quality assessments across all 18 datasets and a pattern emerges. DICOM metadata and structural consistency are often adequate — these are the dimensions that dataset creators tend to address because they're required to make the data loadable.
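
One way to make the structural consistency dimension concrete is to compare the file layout of every case against the most common layout in the dataset. The sketch below assumes a hypothetical on-disk convention of one subdirectory per case, with identically named files inside each (for example `image.nii.gz` and `mask.nii.gz`). The function names are illustrative, and this is not the method used to produce the assessments in this directory.

```python
from collections import Counter
from pathlib import Path


def case_layout(case_dir: Path) -> frozenset[str]:
    """Relative file paths inside one case directory, as a comparable set."""
    return frozenset(
        p.relative_to(case_dir).as_posix()
        for p in case_dir.rglob("*")
        if p.is_file()
    )


def inconsistent_cases(root: Path) -> dict[str, list[str]]:
    """Report cases whose layout differs from the most common layout.

    Returns a mapping of case name -> sorted list of paths that are either
    missing from, or extra to, the reference (most common) layout.
    Assumes one subdirectory per case directly under `root`.
    """
    layouts = {d.name: case_layout(d) for d in sorted(root.iterdir()) if d.is_dir()}
    if not layouts:
        return {}
    # The most frequent layout is treated as the reference.
    reference, _ = Counter(layouts.values()).most_common(1)[0]
    return {
        name: sorted(layout ^ reference)  # symmetric difference: missing or extra
        for name, layout in layouts.items()
        if layout != reference
    }
```

An empty result suggests the dataset can be loaded with a single code path. Datasets that embed the case ID in file names would need those names normalized before a comparison like this is meaningful.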

Annotation provenance is almost always partial. You can usually find out that annotations exist, but not who created them, what their qualifications were, what guidelines they followed, or how disagreements were resolved. For challenge datasets (BraTS, CAMELYON, ACDC), the annotation protocol is better documented — but still rarely at a level that would satisfy a regulatory submission.
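
The four questions above (who annotated, with what qualifications, under which guidelines, and how disagreements were resolved) can be captured as a small per-annotation record. The following is a minimal sketch: the field names are hypothetical, and the mapping onto Strong, Partial, and Missing is an illustrative assumption rather than the scoring procedure behind this directory.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AnnotationProvenance:
    """Hypothetical provenance record; field names are illustrative only."""
    annotator_id: Optional[str] = None           # who created the annotation
    annotator_credentials: Optional[str] = None  # what qualified them
    guideline_version: Optional[str] = None      # which protocol they followed
    adjudication_method: Optional[str] = None    # how disagreements were resolved


def missing_fields(record: AnnotationProvenance) -> list[str]:
    """Names of provenance fields that are unset or empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def rate_provenance(record: AnnotationProvenance) -> str:
    """Map completeness onto the key's labels (an assumed, simplified rule)."""
    missing = missing_fields(record)
    if not missing:
        return "Strong"
    if len(missing) == len(fields(record)):
        return "Missing"
    return "Partial"
```

A record that names only the annotator, for instance, would rate as Partial under this rule, which mirrors the situation described above where annotations exist but their origin is only partly documented.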

Clinical linkage varies widely. Some datasets (UK Biobank, ADNI, MIMIC ecosystem) excel here. Most others treat imaging in isolation from the clinical context that generated it.

Integrity verification is effectively absent from every dataset on this list. There is no standard mechanism to confirm that the files you downloaded are identical to what was originally published — no checksums, no cryptographic signatures, no tamper detection.
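
As an illustration of what the simplest form of integrity verification could look like, the sketch below builds a SHA-256 manifest for a dataset directory and compares a downloaded copy against it. It uses only the Python standard library. The function names and the manifest shape (relative path mapped to hex digest) are assumptions made for this example, not an existing standard.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root, POSIX style) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root with a previously published manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "modified": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

A publisher would run `build_manifest` once at release time and distribute the result alongside the data; a consumer would run `verify_manifest` after download. A plain checksum manifest only detects differences relative to the manifest itself, so guarding against deliberate tampering would additionally require signing the manifest.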

These are not criticisms of the teams who built these datasets. They are observations about what the community has standardized and what it has not. The model side of medical imaging AI has frameworks, benchmarks, and reproducibility standards. The data side does not — yet.

This is the gap Princeton Medical Systems is working to close.

Need production-grade imaging datasets?

We build datasets with documented provenance, credentialed annotations, standardized structure, and integrity verification built in.

This directory is maintained by Princeton Medical Systems and reflects publicly available information as of March 2026. Quality assessments are based on our review of published documentation, data dictionaries, and dataset structures. Assessments may not reflect unpublished internal documentation maintained by dataset creators. If you maintain a dataset listed here and believe an assessment should be updated, please contact us.