Non-Inferiority Sample Size Calculator

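Plan how many evaluation samples you need to show that a new model is not meaningfully worse than the current one. Set the non-inferiority margin, one-sided alpha, power, allocation ratio, and outcome assumptions. The calculator returns raw and dropout-adjusted group sizes for machine learning experiments.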

This design assumes higher values are better. For lower-is-better metrics, convert the scale first or enter a transformed improvement score.

Example Data Table

Scenario                    Endpoint    Control  Test    Margin  Power  Adjusted Total
AUC benchmark               Continuous  0.9100   0.9000  0.0200  0.80   298
F1 score benchmark          Continuous  0.8450   0.8380  0.0150  0.90   1206
Safety pass-rate benchmark  Binary      0.8800   0.8700  0.0300  0.90   12096

Formula Used

The calculator uses a one-sided non-inferiority design for two independent groups. It assumes the test model should not be worse than the control by more than the chosen margin.

For continuous metrics:
n_control = ((z_{1-α} + z_{power})² × (σ_test² / r + σ_control²)) / (δ + Δ)²

For binary metrics:
n_control = ((z_{1-α} + z_{power})² × (p_test(1 - p_test) / r + p_control(1 - p_control))) / (δ + Δ)²

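Here, δ is the expected difference (test minus control), Δ is the absolute non-inferiority margin, and r is the allocation ratio (test sample size divided by control sample size). The z terms are standard normal quantiles at 1 - α and at the target power. σ is each group's standard deviation, and p is each group's expected success rate. The formula requires δ + Δ > 0, meaning the expected difference must lie above -Δ.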
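The formula can be traced back to the one-sided test itself. The following is a derivation sketch under the normal approximation. The symbols \hat{d} (the observed test minus control difference) and SE (its standard error) are introduced here for illustration and are not calculator inputs.

```latex
\begin{aligned}
\mathrm{SE}^2 &= \frac{\sigma_{\text{test}}^2}{r\,n_{\text{control}}}
               + \frac{\sigma_{\text{control}}^2}{n_{\text{control}}} \\
\text{Reject } H_0 &\iff \frac{\hat{d} + \Delta}{\mathrm{SE}} > z_{1-\alpha} \\
\text{Power} &\approx \Phi\!\left(\frac{\delta + \Delta}{\mathrm{SE}} - z_{1-\alpha}\right)
  \;\Longrightarrow\;
  \frac{\delta + \Delta}{\mathrm{SE}} = z_{1-\alpha} + z_{\text{power}} \\
n_{\text{control}} &= \frac{(z_{1-\alpha} + z_{\text{power}})^2
  \left(\sigma_{\text{test}}^2 / r + \sigma_{\text{control}}^2\right)}{(\delta + \Delta)^2}
\end{aligned}
```

For binary metrics, each σ² is replaced by p(1 - p) for the corresponding group.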

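The test group size is r times the control group size. The adjusted sample size divides each raw group size by (1 - dropout rate). Final group sizes are rounded upward. For example, a raw group size of 1,000 with 5% dropout becomes 1,000 / 0.95 ≈ 1,052.6, which rounds up to 1,053.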
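The formulas and the dropout adjustment above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names are hypothetical, and the defaults (alpha = 0.025, power = 0.90) are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def control_size(var_test, var_control, delta, margin, alpha, power, ratio=1.0):
    """Raw control-group size from the normal-approximation formula."""
    effect = delta + margin  # distance from the expected difference to -margin
    if effect <= 0:
        raise ValueError("expected difference must lie above -margin")
    z_sum = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z_sum ** 2 * (var_test / ratio + var_control) / effect ** 2


def plan(var_test, var_control, delta, margin, alpha=0.025, power=0.90,
         ratio=1.0, dropout=0.0):
    """Return dropout-adjusted (n_control, n_test, total), each rounded up."""
    n_control = control_size(var_test, var_control, delta, margin,
                             alpha, power, ratio)
    n_test = ratio * n_control  # ratio = test size / control size
    adj_control = math.ceil(n_control / (1 - dropout))
    adj_test = math.ceil(n_test / (1 - dropout))
    return adj_control, adj_test, adj_control + adj_test


def plan_binary(p_test, p_control, margin, **kwargs):
    """Binary endpoint: each group's variance is p(1 - p)."""
    return plan(p_test * (1 - p_test), p_control * (1 - p_control),
                p_test - p_control, margin, **kwargs)


def plan_continuous(mean_test, mean_control, sd_test, sd_control, margin, **kwargs):
    """Continuous endpoint: each group's variance is its squared SD."""
    return plan(sd_test ** 2, sd_control ** 2,
                mean_test - mean_control, margin, **kwargs)


# Assumed settings (not stated in the example table): alpha 0.025, 5% dropout.
print(plan_binary(0.87, 0.88, 0.03, alpha=0.025, power=0.90, dropout=0.05))
```

Under those assumed settings, the last line gives 6048 per group and a total of 12096, which matches the safety pass-rate row in the example table.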

How to Use This Calculator

  1. Enter the metric name so your result stays easy to interpret.
  2. Select continuous for scores like AUC, F1, or calibrated utility.
  3. Select binary for pass-rate, harm-rate complement, or success labels.
  4. Set a one-sided alpha and the power you need.
  5. Choose the largest acceptable performance loss as the non-inferiority margin.
  6. Enter control and test assumptions from prior experiments or pilot runs.
  7. Add the allocation ratio when groups will not be equal.
  8. Include dropout to protect the final evaluable sample.
  9. Click calculate and review the adjusted total sample size first.
  10. Export the result as CSV or PDF if needed.

Why This Matters in AI and Machine Learning

Non-inferiority testing is useful when a new model offers advantages beyond raw performance. A smaller model may cut latency, reduce cost, use less memory, or improve interpretability. In those cases, you may accept a tiny metric loss if the tradeoff stays within a defensible margin.

This calculator helps teams size experiments before launching expensive evaluations. It supports benchmark planning for model compression, retrieval updates, policy models, ranking systems, safety filters, and human review pipelines. By combining margin, alpha, power, and allocation assumptions, it turns model validation into a planned study rather than a rough estimate.

Use continuous mode for bounded scores such as AUC, F1, utility, or quality ratings. Use binary mode for success rates, approval outcomes, or pass and fail checks. The result gives both raw and dropout-adjusted sizes, which makes planning easier for offline experiments and staged online evaluations.

FAQs

1. What is a non-inferiority margin?

It is the largest acceptable drop from control performance. If the test stays within that loss, it can still be judged acceptable.

2. Why is alpha one-sided here?

Non-inferiority asks whether the test is not unacceptably worse. That question uses a one-sided rejection boundary by design.
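One way to picture the one-sided boundary is as a lower confidence bound on the difference, compared against the negative margin. The sketch below assumes a simple Wald-style bound for two independent proportions. The function name and the counts in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def non_inferior(success_test, n_test, success_control, n_control,
                 margin, alpha=0.025):
    """Return (decision, lower_bound) for test minus control success rates."""
    p_t = success_test / n_test
    p_c = success_control / n_control
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_test + p_c * (1 - p_c) / n_control)
    # One-sided bound: only the lower tail matters for "not worse than".
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    return lower > -margin, lower


# Hypothetical counts: 87% vs 88% pass rate on 6,000 units per group.
decision, lower = non_inferior(5220, 6000, 5280, 6000, margin=0.03)
print(decision, round(lower, 4))
```

With these made-up counts, the lower bound is about -0.0218, which sits above -0.03, so the test model would be judged non-inferior at that margin.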

3. When should I use continuous mode?

Use it for scores treated like numeric outcomes, such as AUC, F1, utility, ranking quality, or averaged judge ratings.

4. When should I use binary mode?

Use it when each evaluation unit becomes success or failure, pass or fail, safe or unsafe, or another binary outcome.

5. What does allocation ratio mean?

It is the planned test sample divided by the control sample. A value of 1 means equal group sizes.
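To see how the allocation ratio changes the total, the sketch below evaluates the binary formula at a few ratios. It is an illustration only: the function name, the assumed alpha of 0.025, and the input rates are chosen for the example, and dropout is ignored.

```python
import math
from statistics import NormalDist


def total_size(p_test, p_control, margin, ratio, alpha=0.025, power=0.90):
    """Total sample size (control + test) at a given test/control ratio."""
    z_sum = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    variance = p_test * (1 - p_test) / ratio + p_control * (1 - p_control)
    effect = (p_test - p_control) + margin
    n_control = z_sum ** 2 * variance / effect ** 2
    # Round each group up separately, as the calculator describes.
    return math.ceil(n_control) + math.ceil(ratio * n_control)


for ratio in (0.5, 1.0, 2.0):
    print(ratio, total_size(0.87, 0.88, 0.03, ratio))
```

When the two groups have similar variances, equal allocation is close to the smallest total. Unequal ratios cost extra samples overall, but can make sense when one arm is cheaper to evaluate.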

6. Why include dropout?

Some observations become unusable because of filtering, missing labels, or failed runs. Dropout adjustment protects the final evaluable sample.

7. Can I use this for lower-is-better metrics?

Yes, but transform the metric first so higher means better. That keeps the margin and interpretation consistent.
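A minimal sketch of such a transformation follows. The helper names are hypothetical, and the two conversions shown (complement for rates, negation for unbounded scores) are common choices rather than the only options.

```python
def rate_to_higher_is_better(rate):
    """Convert a lower-is-better rate (e.g. error rate) to its complement."""
    return 1.0 - rate


def score_to_higher_is_better(value):
    """Negate a lower-is-better score (e.g. loss or latency)."""
    return -value


# Error rates of 12% (control) and 13% (test) become pass rates of 88% and 87%.
p_control = rate_to_higher_is_better(0.12)
p_test = rate_to_higher_is_better(0.13)
print(round(p_test - p_control, 4))  # expected difference on the new scale
```

Both conversions leave absolute differences unchanged in size, so a margin of 0.03 on the error rate is still 0.03 on the pass rate.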

8. Is this exact for every study design?

No. It is a practical normal approximation. Complex designs may need paired formulas, cluster adjustments, or simulation-based planning.
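Simulation-based planning can also be used to sanity-check a size produced by the approximation. The sketch below is a rough Monte Carlo check for the binary case with independent groups. The function name, the Wald-style decision rule, the seed, and the inputs are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def simulate_power(p_test, p_control, margin, n_test, n_control,
                   alpha=0.025, sims=2000, seed=0):
    """Estimate the chance of concluding non-inferiority at the given sizes."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    wins = 0
    for _ in range(sims):
        # Draw one simulated experiment: success counts in each group.
        x_t = sum(rng.random() < p_test for _ in range(n_test))
        x_c = sum(rng.random() < p_control for _ in range(n_control))
        pt_hat, pc_hat = x_t / n_test, x_c / n_control
        se = math.sqrt(pt_hat * (1 - pt_hat) / n_test
                       + pc_hat * (1 - pc_hat) / n_control)
        # Conclude non-inferiority when the lower bound clears -margin.
        if se > 0 and (pt_hat - pc_hat) - z * se > -margin:
            wins += 1
    return wins / sims


# The formula gives about 337 per group for these inputs at 90% power.
print(simulate_power(0.80, 0.80, 0.10, n_test=337, n_control=337))
```

The estimate should land near the planned power of 0.90. A large gap would suggest that the normal approximation is a poor fit for the chosen rates or sample sizes.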

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.