Non-Inferiority Sample Size Calculator for AI and Machine Learning

Calculator Inputs

Metric name

Endpoint type

Non-inferiority margin

One-sided alpha

Power

Allocation ratio (test/control)

Dropout rate (%)

Continuous Metric Assumptions

Control mean

Test mean

Control standard deviation

Test standard deviation

Binary Metric Assumptions

Control pass rate

Test pass rate

This design assumes higher values are better. For lower-is-better metrics, convert the scale first or enter a transformed improvement score.

Example Data Table

Scenario	Endpoint	Control value	Test value	Margin	Power	Adjusted total
AUC benchmark	Continuous	0.9100	0.9000	0.0200	0.80	298
F1 score benchmark	Continuous	0.8450	0.8380	0.0150	0.90	1206
Safety pass-rate benchmark	Binary	0.8800	0.8700	0.0300	0.90	12096

Formula Used

The calculator uses a one-sided non-inferiority design for two independent groups. It assumes the test model should not be worse than the control by more than the chosen margin.

For continuous metrics:
n_control = (z_(1-α) + z_(power))² × (σ_test² / r + σ_control²) / (δ + Δ)²

For binary metrics:
n_control = (z_(1-α) + z_(power))² × (p_test × (1 - p_test) / r + p_control × (1 - p_control)) / (δ + Δ)²

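Here, δ is the expected difference (test minus control), Δ is the absolute non-inferiority margin entered as a positive number, and r is the allocation ratio n_test / n_control, so n_test = r × n_control. The z terms are standard normal quantiles at 1 - α and at the target power. The formulas require δ + Δ > 0: the expected test performance must sit above the non-inferiority boundary.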

The adjusted sample size divides the raw size by one minus the dropout rate, with the dropout rate expressed as a fraction (5% becomes 0.05). Final group sizes are rounded upward.
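
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the names ni_sample_size, binary_variance, and GroupSizes are hypothetical, and the sketch assumes the dropout adjustment is applied to each group before rounding up. The usage lines take the safety pass-rate row from the example table; the alpha of 0.025, equal allocation, and 5% dropout are assumptions, because the table does not list them.

```python
from dataclasses import dataclass
from math import ceil
from statistics import NormalDist


@dataclass
class GroupSizes:
    control: int
    test: int

    @property
    def total(self) -> int:
        return self.control + self.test


def binary_variance(p: float) -> float:
    """Variance of a single pass/fail observation with pass rate p."""
    return p * (1 - p)


def ni_sample_size(var_control, var_test, delta, margin,
                   alpha=0.025, power=0.90, ratio=1.0, dropout=0.0):
    """Per-group sizes for a one-sided non-inferiority comparison.

    delta  : expected test minus control (higher is better)
    margin : absolute non-inferiority margin, positive
    ratio  : n_test / n_control
    dropout: expected unusable fraction, 0 <= dropout < 1
    """
    if delta + margin <= 0:
        raise ValueError("expected difference must exceed -margin")
    if not 0 <= dropout < 1:
        raise ValueError("dropout must be a fraction in [0, 1)")
    normal = NormalDist()
    z_sum = normal.inv_cdf(1 - alpha) + normal.inv_cdf(power)
    n_control = z_sum ** 2 * (var_test / ratio + var_control) / (delta + margin) ** 2
    n_test = ratio * n_control
    keep = 1 - dropout  # fraction of observations expected to remain usable
    return GroupSizes(ceil(n_control / keep), ceil(n_test / keep))


# Safety pass-rate row: control 0.88, test 0.87, margin 0.03, power 0.90.
# alpha, ratio, and dropout below are assumed, not taken from the table.
safety = ni_sample_size(
    var_control=binary_variance(0.88),
    var_test=binary_variance(0.87),
    delta=0.87 - 0.88,
    margin=0.03,
    alpha=0.025, power=0.90, ratio=1.0, dropout=0.05,
)
print(safety.control, safety.test, safety.total)
```

Under those assumed settings the sketch gives 6048 per group and a total of 12096, which agrees with the adjusted total in the table for that row. For a continuous metric, pass the squared standard deviations as var_control and var_test.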

How to Use This Calculator

Enter the metric name so your result stays easy to interpret.
Select continuous for scores like AUC, F1, or calibrated utility.
Select binary for pass-rate, harm-rate complement, or success labels.
Set a one-sided alpha and the power you need.
Choose the largest acceptable performance loss as the non-inferiority margin.
Enter control and test assumptions from prior experiments or pilot runs.
Add the allocation ratio when groups will not be equal.
Include dropout to protect the final evaluable sample.
Click calculate and review the adjusted total sample size first.
Export the result as CSV or PDF if needed.

Why This Matters in AI and Machine Learning

Non-inferiority testing is useful when a new model offers advantages beyond raw performance. A smaller model may cut latency, reduce cost, use less memory, or improve interpretability. In those cases, you may accept a tiny metric loss if the tradeoff stays within a defensible margin.

This calculator helps teams size experiments before launching expensive evaluations. It supports benchmark planning for model compression, retrieval updates, policy models, ranking systems, safety filters, and human review pipelines. By combining margin, alpha, power, and allocation assumptions, it turns model validation into a planned study rather than a rough estimate.

Use continuous mode for bounded scores such as AUC, F1, utility, or quality ratings. Use binary mode for success rates, approval outcomes, or pass and fail checks. The result gives both raw and dropout-adjusted sizes, which makes planning easier for offline experiments and staged online evaluations.

FAQs

1. What is a non-inferiority margin?

It is the largest acceptable drop from control performance. If the test model's loss stays within that margin, it can still be judged acceptable.

2. Why is alpha one-sided here?

Non-inferiority asks whether the test is not unacceptably worse. That question uses a one-sided rejection boundary by design.
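
As a small illustration of what the one-sided alpha does in the formula, the sketch below looks up the critical value z at 1 - α using Python's standard library. The helper name critical_z is hypothetical. A one-sided alpha of 0.025 gives the same critical value as the familiar two-sided 0.05 level.

```python
from statistics import NormalDist


def critical_z(alpha_one_sided: float) -> float:
    """Standard normal quantile at 1 - alpha for a one-sided test."""
    return NormalDist().inv_cdf(1 - alpha_one_sided)


print(round(critical_z(0.025), 3))  # 1.96
print(round(critical_z(0.05), 3))   # 1.645
```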

3. When should I use continuous mode?

Use it for scores treated like numeric outcomes, such as AUC, F1, utility, ranking quality, or averaged judge ratings.

4. When should I use binary mode?

Use it when each evaluation unit becomes success or failure, pass or fail, safe or unsafe, or another binary outcome.

5. What does allocation ratio mean?

It is the planned test sample divided by the control sample. A value of 1 means equal group sizes.
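
To see what an unequal ratio costs, the sketch below compares the total raw sample size at a given ratio with the total at a ratio of 1, using the variance term from the formula. The function name relative_total is hypothetical, and the default of equal variances in both groups is an assumption made for illustration.

```python
def relative_total(ratio: float, var_test: float = 1.0, var_control: float = 1.0) -> float:
    """Total sample size at `ratio`, divided by the total at ratio = 1.

    Everything except the variance term and the group count cancels,
    so alpha, power, and margin do not appear here.
    """
    unequal = (var_test / ratio + var_control) * (1 + ratio)
    equal = (var_test + var_control) * 2
    return unequal / equal


for r in (1.0, 2.0, 3.0):
    print(r, round(relative_total(r), 3))
```

With equal variances, a 2:1 allocation needs about 12.5% more observations in total than a 1:1 split, and a 3:1 allocation about 33% more.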

6. Why include dropout?

Some observations become unusable because of filtering, missing labels, or failed runs. Dropout adjustment protects the final evaluable sample.

7. Can I use this for lower-is-better metrics?

Yes, but transform the metric first so higher means better. That keeps the margin and interpretation consistent.
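
A minimal sketch of two common transformations follows. The function names and the example rates are hypothetical. A rate such as harm rate becomes its complement, and an unbounded score such as loss or latency is negated. In both cases the size of a difference is unchanged and only its sign flips, so the margin keeps the same absolute value.

```python
def complement_rate(rate: float) -> float:
    """Turn a lower-is-better rate (e.g. harm rate) into a pass rate."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be in [0, 1]")
    return 1 - rate


def negate(score: float) -> float:
    """Turn a lower-is-better unbounded score (e.g. loss) into higher-is-better."""
    return -score


# Hypothetical harm rates: control 12%, test 13% (test is slightly worse).
control_pass = complement_rate(0.12)
test_pass = complement_rate(0.13)
delta = test_pass - control_pass  # negative: test is worse on the new scale
print(round(control_pass, 2), round(test_pass, 2), round(delta, 2))
```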

8. Is this exact for every study design?

No. It is a practical normal approximation. Complex designs may need paired formulas, cluster adjustments, or simulation-based planning.
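
One way to check the approximation is a small simulation. The sketch below is an illustration under stated assumptions, not part of the calculator: the scenario (both pass rates 0.88, margin 0.05), the function name simulate_power, and the Wald-style test with unpooled variance are all choices made for this example. It sizes the groups with the binary formula, then estimates how often the non-inferiority test succeeds.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist


def simulate_power(p_control, p_test, margin, n_per_group,
                   alpha=0.025, reps=400, seed=0):
    """Estimate power by simulating equal-sized binary groups."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    successes = 0
    for _ in range(reps):
        # Observed pass rates in one simulated experiment.
        c = sum(rng.random() < p_control for _ in range(n_per_group)) / n_per_group
        t = sum(rng.random() < p_test for _ in range(n_per_group)) / n_per_group
        se = sqrt(c * (1 - c) / n_per_group + t * (1 - t) / n_per_group)
        # Declare non-inferiority when the shifted difference clears the critical value.
        if se > 0 and (t - c + margin) / se > z_crit:
            successes += 1
    return successes / reps


# Hypothetical scenario: size with the formula (alpha 0.025, power 0.90, r = 1).
p, margin = 0.88, 0.05
z_sum = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.90)
n = ceil(z_sum ** 2 * 2 * p * (1 - p) / margin ** 2)
estimated = simulate_power(p, p, margin, n)
print(n, estimated)
```

If the estimated power lands well below the target, the normal approximation is probably optimistic for that scenario and a larger sample or a design-specific method is worth considering. Increasing reps tightens the estimate at the cost of run time.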