Analyze predictions using dependable confusion matrix metrics. See tradeoffs between sensitivity, precision, specificity, and errors. Export results, compare models, and interpret thresholds with confidence.
| Model | TP | TN | FP | FN | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| Classifier A | 120 | 135 | 15 | 30 | 85.00% | 88.89% | 80.00% | 84.21% |
| Classifier B | 128 | 126 | 24 | 22 | 84.67% | 84.21% | 85.33% | 84.77% |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall or Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 Score = 2 × Precision × Recall / (Precision + Recall)
F-beta Score = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
MCC = (TP × TN − FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
Balanced Accuracy = (Recall + Specificity) / 2
Jaccard Index = TP / (TP + FP + FN)
Cohen’s Kappa = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed agreement and pₑ is the agreement expected by chance.
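As a quick sanity check, the core formulas above can be applied to the Classifier A row of the table. This is a minimal Python sketch, not the calculator's own implementation:

```python
# Recompute Classifier A's metrics from its confusion matrix counts.
TP, TN, FP, FN = 120, 135, 15, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)            # also called sensitivity
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.2%}")     # 85.00%
print(f"Precision:   {precision:.2%}")    # 88.89%
print(f"Recall:      {recall:.2%}")       # 80.00%
print(f"Specificity: {specificity:.2%}")  # 90.00%
print(f"F1:          {f1:.2%}")           # 84.21%
```

The printed values match the table row, which is a useful habit when copying counts into any metrics tool.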
Accuracy is familiar. It is also incomplete. A model can score well simply by predicting the majority class when data are imbalanced. That happens often in fraud, safety, medicine, and fault detection. Confusion matrix metrics reveal where predictions fail. They show whether mistakes come from missed positives or false alarms. They also help teams explain results to nontechnical stakeholders without hiding model risk behind one broad average.
Precision asks how many predicted positives were correct. Recall asks how many real positives were found. These values move with the threshold. A strict threshold may raise precision but miss more positives. A loose threshold may recover positives but create extra false positives. Good evaluation weighs both. Product teams should match the chosen balance to business cost, user harm, and review capacity before deployment.
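The threshold effect described above can be seen directly by sweeping a cutoff over scored examples. The scores and labels below are made-up illustrative data, not output from any real model:

```python
# Illustrative scores (model confidence) paired with true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    0,    1,    0,    1,    0,    0,    0]

for threshold in (0.75, 0.50, 0.25):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data the strict 0.75 threshold gives perfect precision but misses two positives, while the loose 0.25 threshold recovers every positive at the cost of two false alarms.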
Specificity measures how well negatives are rejected. It matters when unnecessary alerts are costly. Screening pipelines, moderation systems, and monitoring tools all benefit from this view. False positive rate gives the same story from the error side. Seeing both helps teams explain tradeoffs clearly. Negative predictive value is also useful when users need confidence that a negative prediction can safely be trusted.
Balanced accuracy and G-Mean, the geometric mean of recall and specificity, work well when one class dominates. They reduce the illusion created by a large majority class. Jaccard index is useful when overlap matters. F1 score summarizes precision and recall in one number. F-beta shifts that summary toward recall or precision as needed. These measures are helpful when benchmark reports must compare several threshold settings across the same dataset.
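A short sketch of the imbalance-aware measures, again using the Classifier A counts from the table (G-Mean is computed here as the geometric mean of recall and specificity, its common definition):

```python
import math

TP, TN, FP, FN = 120, 135, 15, 30   # Classifier A counts from the table

recall = TP / (TP + FN)
specificity = TN / (TN + FP)

balanced_accuracy = (recall + specificity) / 2
g_mean = math.sqrt(recall * specificity)   # geometric mean of recall and specificity
jaccard = TP / (TP + FP + FN)

print(f"Balanced accuracy: {balanced_accuracy:.4f}")  # 0.8500
print(f"G-Mean:            {g_mean:.4f}")             # 0.8485
print(f"Jaccard index:     {jaccard:.4f}")            # 0.7273
```

Because G-Mean multiplies the two rates, a collapse on either class drags the score toward zero, which is exactly why it resists the majority-class illusion.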
Matthews correlation coefficient is strong for binary evaluation. It uses all four confusion matrix cells. It remains informative when classes are uneven. Cohen’s kappa also adjusts for chance agreement. These metrics are valuable when a single reliable summary is needed for ranking models. They are especially useful in validation reports where teams want deeper agreement signals beyond raw percentage accuracy alone.
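Both chance-aware summaries can be computed directly from the four cells. This sketch applies the MCC formula from the list above and derives kappa's expected agreement from the row and column totals, again using the Classifier A counts:

```python
import math

TP, TN, FP, FN = 120, 135, 15, 30   # Classifier A counts from the table
N = TP + TN + FP + FN

# Matthews correlation coefficient: uses all four cells.
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

# Cohen's kappa: observed agreement p_o corrected by chance agreement p_e,
# where p_e comes from the marginal (row/column) totals.
p_o = (TP + TN) / N
p_e = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / N**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"MCC:   {mcc:.4f}")
print(f"Kappa: {kappa:.4f}")
```

Here the two summaries land close together (roughly 0.70), which is typical when the confusion matrix is not badly skewed; they diverge more under heavy imbalance.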
No single score wins every problem. Teams should compare several metrics together. They should also record the threshold used during testing. This calculator helps you inspect popular classifier performance metrics quickly. It supports cleaner reporting, better model comparison, and better deployment decisions. Use exported tables, graphs, and documented formulas to create repeatable reviews for audits, dashboards, and model governance workflows.
The calculator evaluates binary classifier performance from confusion matrix counts. You enter true positives, true negatives, false positives, and false negatives. The tool then computes common diagnostic and machine learning metrics in one place.
Accuracy can hide weak performance when classes are imbalanced. A model may predict the majority class often and still look strong. Precision, recall, specificity, and MCC expose that weakness.
Focus on precision when false positives are expensive. Examples include spam blocking, manual review queues, and costly interventions. High precision means positive predictions are usually correct.
Focus on recall when missing a positive case is risky. Examples include disease screening, fraud detection, and safety monitoring. High recall captures more real positives.
MCC measures overall binary classification quality using all confusion matrix cells. It stays useful under class imbalance and ranges from −1 (complete disagreement) through 0 (chance-level prediction) to +1 (perfect agreement).
F-beta lets you control the balance between precision and recall. Beta above one gives more weight to recall. Beta below one favors precision more strongly.
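The beta weighting can be sketched with the F-beta formula from the list above, using the Classifier A precision and recall from the table:

```python
def f_beta(precision, recall, beta):
    # Weighted harmonic mean; beta > 1 leans toward recall, beta < 1 toward precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Classifier A values from the table above.
precision, recall = 120 / 135, 120 / 150   # 88.89%, 80.00%

print(f"F0.5: {f_beta(precision, recall, 0.5):.4f}")  # favors precision
print(f"F1:   {f_beta(precision, recall, 1.0):.4f}")
print(f"F2:   {f_beta(precision, recall, 2.0):.4f}")  # favors recall
```

Since Classifier A's recall is lower than its precision, F2 comes out below F1 and F0.5 above it, showing how beta shifts the summary toward the weaker or stronger side.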
Specificity measures how well the model identifies true negatives. It is important when false alarms waste time, money, or trust. It complements recall.
Yes. The calculator includes CSV export for spreadsheet analysis and PDF export for reports, documentation, or sharing with stakeholders and project teams.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.