How to Plot and Compare ROC Curves: Tools and Techniques
Receiver Operating Characteristic (ROC) curves are essential for evaluating the performance of binary classifiers. They visualize the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across different decision thresholds. This article explains what ROC curves represent, how to compute and plot them, how to compare multiple ROC curves, and which tools and techniques are most useful in practice.
What a ROC curve shows
A ROC curve plots:
- True Positive Rate (TPR) on the y-axis: TPR = TP / (TP + FN)
- False Positive Rate (FPR) on the x-axis: FPR = FP / (FP + TN)
Points along the curve correspond to different classification thresholds applied to model scores (probabilities or continuous outputs). The diagonal line from (0,0) to (1,1) represents a random classifier; curves above this line indicate better-than-random performance. The closer the curve hugs the top-left corner, the better the classifier.
A common scalar summary is the Area Under the ROC Curve (AUC or AUROC). AUC ranges from 0 to 1, with 0.5 representing random performance and 1.0 representing a perfect classifier. AUC is threshold-independent and summarizes the model’s ability to discriminate between classes.
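A useful way to build intuition for AUC is its pairwise (Mann–Whitney) interpretation: AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch with illustrative toy labels and scores, checked against scikit-learn:

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.9, 0.8, 0.3, 0.6, 0.2])

pos = y_scores[y_true == 1]   # scores of the positive instances
neg = y_scores[y_true == 0]   # scores of the negative instances

# Fraction of (positive, negative) pairs where the positive is ranked higher; ties count as 0.5
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_pairwise = wins / (len(pos) * len(neg))

print(auc_pairwise)                     # 1.0 on this toy data
print(roc_auc_score(y_true, y_scores))  # matches the pairwise value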
When to use ROC curves
ROC curves are especially helpful when:
- Class prevalence differs between evaluation and deployment (AUC is prevalence-independent).
- You care about ranking ability (ordering positive instances above negatives).
- Costs of false positives and false negatives are not fixed and may be explored across thresholds.
Avoid relying solely on ROC curves when:
- Classes are highly imbalanced and the positive class is rare; Precision–Recall curves may be more informative.
- You need calibration (probability accuracy) rather than ranking.
Data and metrics you need
To plot a ROC curve you need:
- Ground-truth binary labels (0/1).
- Model scores or probabilities for the positive class.
From these you can compute:
- TPR (sensitivity / recall) and FPR for many thresholds.
- AUC (using trapezoidal rule or Mann–Whitney U interpretation).
- Confidence intervals for AUC (via bootstrap or DeLong’s method).
- Statistical tests for difference between AUCs (DeLong’s test).
How to compute and plot ROC curves — step-by-step
- Obtain predicted scores and true labels.
- Sort instances by predicted score in descending order.
- For a set of thresholds (each unique score or a grid), compute TPR and FPR.
- Plot FPR on x-axis vs TPR on y-axis; connect points to form the curve.
- Compute AUC (numerical integration of curve).
- Optionally compute confidence intervals and plot them or plot multiple curves for comparison.
Pseudo-algorithm (conceptual):
- thresholds = sorted(unique(scores), descending)
- for t in thresholds:
  - predict positive if score >= t
  - compute TP, FP, TN, FN
  - compute TPR = TP/(TP+FN), FPR = FP/(FP+TN)
  - plot the point (FPR, TPR)
- AUC = integrate trapezoids under the curve
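The pseudo-algorithm translates almost line for line into Python. The sketch below is for illustration only; in practice sklearn.metrics.roc_curve and roc_auc_score (used in the examples that follow) do the same work more efficiently.

import numpy as np

def roc_points(y_true, scores):
    """Sweep a threshold over each unique score and collect (FPR, TPR) points."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    P = np.sum(y_true == 1)   # total positives
    N = np.sum(y_true == 0)   # total negatives
    fpr, tpr = [0.0], [0.0]   # threshold above every score: nothing predicted positive
    for t in np.sort(np.unique(scores))[::-1]:   # thresholds in descending order
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        tpr.append(tp / P)
        fpr.append(fp / N)
    return np.array(fpr), np.array(tpr)          # ends at (1, 1): everything positive

def auc_trapezoid(fpr, tpr):
    """Trapezoidal-rule integration of TPR over FPR."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

On any pair of label and score arrays containing both classes, auc_trapezoid(*roc_points(y_true, y_scores)) agrees with roc_auc_score up to floating-point error.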
Tools and code examples
Below are concise examples in Python and R, plus notes on GUI tools.
Python (scikit-learn + matplotlib)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true = np.array([0, 1, 1, 0, 1, 0])                 # replace with real labels
y_scores = np.array([0.1, 0.9, 0.8, 0.3, 0.6, 0.2])   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
To compare multiple classifiers:
# y_scores_1, y_scores_2 are arrays of scores from two models on the same test set
fpr1, tpr1, _ = roc_curve(y_true, y_scores_1)
fpr2, tpr2, _ = roc_curve(y_true, y_scores_2)
auc1 = roc_auc_score(y_true, y_scores_1)
auc2 = roc_auc_score(y_true, y_scores_2)
plt.plot(fpr1, tpr1, label=f'Model 1 (AUC={auc1:.3f})')
plt.plot(fpr2, tpr2, label=f'Model 2 (AUC={auc2:.3f})')
plt.legend()
plt.show()
Confidence intervals for AUC and statistical comparisons such as DeLong's test are not built into scikit-learn; they can be obtained from third-party implementations of DeLong's method or via bootstrapping, as sketched below.
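A minimal percentile-bootstrap interval for a single AUC, assuming y_true and y_scores are the NumPy arrays from the example above:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = len(y_true)
aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)              # resample instances with replacement
    if len(np.unique(y_true[idx])) < 2:      # skip resamples containing only one class
        continue
    aucs.append(roc_auc_score(y_true[idx], y_scores[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])    # 95% percentile interval
print(f"AUC 95% CI: [{lo:.3f}, {hi:.3f}]")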
R (pROC)
library(pROC)

roc_obj <- roc(response = y_true, predictor = y_scores)
plot(roc_obj, main = sprintf("ROC (AUC = %.3f)", auc(roc_obj)))

# Compare two ROC curves
roc1 <- roc(y_true, y_scores_1)
roc2 <- roc(y_true, y_scores_2)
roc.test(roc1, roc2, method = "delong")
Other tools
- MATLAB: built-in perfcurve function.
- Excel: possible but tedious—compute TPR/FPR across thresholds and chart.
- GUI platforms: many ML platforms (e.g., Weka, KNIME, RapidMiner) plot ROC curves directly.
Comparing ROC curves: techniques and statistics
Visual comparison is the first step: plot multiple curves on the same axes and compare AUCs. For rigorous comparison:
- DeLong’s test: nonparametric test for difference between correlated AUCs (same dataset). Commonly used and implemented in many libraries.
- Bootstrap test: resample the dataset with replacement, compute the distribution of the AUC difference, and derive a confidence interval and p-value (see the sketch after this list).
- Paired permutation test: shuffle model scores between models for each instance to test significance.
- Compare partial AUC: if you’re interested in a specific FPR range (e.g., FPR < 0.1), compute AUC over that segment.
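A paired-bootstrap sketch of the AUC-difference comparison described above (DeLong's test gives an analytic answer; the resampling version here is an approximation), together with partial AUC via scikit-learn's max_fpr argument. It assumes y_true, y_scores_1 and y_scores_2 are NumPy arrays for the same test instances:

import numpy as np
from sklearn.metrics import roc_auc_score

# Paired bootstrap: resample instances, recompute both AUCs on the same resample
rng = np.random.default_rng(0)
n = len(y_true)
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    if len(np.unique(y_true[idx])) < 2:
        continue
    diffs.append(roc_auc_score(y_true[idx], y_scores_1[idx]) -
                 roc_auc_score(y_true[idx], y_scores_2[idx]))
diffs = np.array(diffs)

ci = np.percentile(diffs, [2.5, 97.5])   # 95% CI for the AUC difference
p_approx = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))  # rough two-sided p-value

# Partial AUC over the FPR < 0.1 region (standardized, per roc_auc_score's max_fpr option)
pauc_1 = roc_auc_score(y_true, y_scores_1, max_fpr=0.1)
pauc_2 = roc_auc_score(y_true, y_scores_2, max_fpr=0.1)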
Report:
- AUC values with 95% confidence intervals.
- p-value for the chosen statistical test.
- Adjusted p-values when multiple comparisons are made (e.g., a Bonferroni correction, as sketched below).
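If several pairwise comparisons are reported, the adjustment can be as simple as multiplying each p-value by the number of tests; statsmodels wraps this (the p-values below are placeholders):

from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.048, 0.30]   # hypothetical p-values from three pairwise AUC tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(p_adjusted)   # each p-value multiplied by the number of tests, capped at 1
print(reject)       # which comparisons remain significant at the 0.05 level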
Practical tips and pitfalls
- Use predicted probabilities or continuous scores, not binary predictions.
- For imbalanced datasets, supplement ROC with Precision–Recall curves (see the sketch after this list); ROC AUC can look overly optimistic when the positive class is rare.
- When comparing models trained on different datasets, AUC comparisons may be invalid—ensure same test set or use cross-validation.
- For clinical or operational decisions, prefer measures at relevant thresholds (sensitivity at fixed specificity or vice versa).
- Beware overfitting: evaluate ROC on held-out test data or via cross-validation/bootstrapping.
- Plot confidence bands (bootstrapped) to visualize uncertainty.
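A minimal sketch of the Precision–Recall supplement mentioned in the list above, reusing the y_true and y_scores conventions from the earlier Python example:

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precision, recall, _ = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)   # scalar summary of the PR curve

plt.plot(recall, precision, label=f"PR curve (AP = {ap:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()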
Example workflow for model evaluation
- Split data into training and test sets (or use cross-validation).
- Train models and obtain probability scores on the test set.
- Plot ROC curves and compute AUCs and confidence intervals.
- Perform statistical tests (DeLong or bootstrap) to compare AUCs.
- Examine PR curves, calibration, and decision thresholds relevant to application.
- Document thresholds chosen and expected TPR/FPR at deployment.
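A compact end-to-end sketch of this workflow with scikit-learn; the synthetic dataset, the logistic-regression model, and the single stratified split are placeholders for whatever your project actually uses:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic, imbalanced data stands in for a real dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"Logistic regression (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()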
Summary
ROC curves are a flexible, threshold-independent way to evaluate classifier discrimination. Tools like scikit-learn (Python), pROC (R), and MATLAB's perfcurve make plotting and comparing ROC curves straightforward. For robust comparisons use DeLong's test or bootstrapping, and always consider complementary metrics (PR curves, calibration), especially when classes are imbalanced.