Calculator Inputs
Enter your current model baseline, proposed target size, data plan, and hardware assumptions. The calculator returns scaling ratios, compute estimates, memory needs, and a simple loss projection.
Example Data Table
These example planning cases show how teams might size model expansion paths before deeper architecture or budget reviews.
| Scenario | Base Params | Target Params | Training Tokens | Context | Cluster | Intent |
|---|---|---|---|---|---|---|
| Compact assistant upgrade | 7B | 13B | 260B | 4096 | 64 GPUs | Improve quality without massive deployment growth. |
| Mid-scale reasoning jump | 13B | 34B | 680B | 8192 | 128 GPUs | Push stronger benchmarks with balanced data scaling. |
| Long-context platform model | 34B | 70B | 1400B | 16384 | 256 GPUs | Target larger enterprise and agent workloads. |
| Premium frontier candidate | 70B | 120B | 2400B | 32768 | 512 GPUs | Explore premium capability with much heavier cost. |
Formulas Used
Parameter Growth = Target Parameters ÷ Base Parameters
Compute-Optimal Tokens ≈ 20 × Target Parameters
Training FLOPs ≈ 6 × Parameters × Tokens
Training Time = Training FLOPs ÷ Effective Cluster Throughput
Weight Memory = Parameters × Bytes Per Parameter
Training State Memory = Parameters × Training Bytes Per Parameter
Projected Loss = Base Loss × (Target Parameters ÷ Base Parameters)^(−α)
These formulas are planning shortcuts. They do not include exact architecture effects, communication overhead, activation checkpointing, MoE routing, or dataset quality differences.
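As a concrete illustration, the formulas above can be sketched as one function. This is a planning sketch, not the calculator's actual implementation; the function name, the default byte counts (2 bytes/parameter for inference, 16 for training state), and the example inputs (α = 0.07, 40% utilization) are assumptions chosen to match the "Mid-scale reasoning jump" row in the table.

```python
def scaling_plan(base_params, target_params, tokens, base_loss, alpha,
                 gpus, tflops_per_gpu, utilization,
                 infer_bytes=2, train_bytes=16):
    """Planning-shortcut estimates; ignores architecture and comms effects."""
    growth = target_params / base_params
    optimal_tokens = 20 * target_params              # compute-optimal heuristic
    flops = 6 * target_params * tokens               # training FLOPs estimate
    throughput = gpus * tflops_per_gpu * 1e12 * utilization  # effective FLOP/s
    train_seconds = flops / throughput
    return {
        "growth": growth,
        "optimal_tokens": optimal_tokens,
        "flops": flops,
        "train_days": train_seconds / 86400,
        "weight_gb": target_params * infer_bytes / 1e9,
        "state_gb": target_params * train_bytes / 1e9,
        "projected_loss": base_loss * growth ** (-alpha),
    }

# "Mid-scale reasoning jump": 13B -> 34B on 680B tokens, 128 GPUs.
plan = scaling_plan(base_params=13e9, target_params=34e9, tokens=680e9,
                    base_loss=2.0, alpha=0.07,
                    gpus=128, tflops_per_gpu=300, utilization=0.4)
```

Note that this scenario's 680B tokens exactly match the 20 × 34B compute-optimal target, which is why it reads as data-balanced.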
How to Use This Calculator
- Enter your current model size in billions of parameters.
- Enter the planned target model size.
- Set your available training tokens in billions.
- Provide a baseline loss from your reference model.
- Choose a scaling exponent that fits your experiments.
- Enter context length and global batch tokens.
- Set bytes per parameter for inference precision.
- Set training bytes per parameter for optimizer planning.
- Add GPU count, per-GPU TFLOPS, and expected utilization.
- Press calculate to show results above the form.
- Review the data-balance status and projected training time.
- Use the export buttons to save result summaries.
Frequently Asked Questions
1) What does model scaling mean?
Model scaling means increasing parameters, data, or compute to improve performance. Good scaling plans balance all three instead of only making the network larger.
2) Why does the calculator use 20 tokens per parameter?
That ratio is a simple planning heuristic inspired by common scaling-law discussions. It gives a fast reference target, not a strict rule for every dataset or architecture.
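A minimal sketch of how such a heuristic might drive a data-balance check. The 0.5× and 2× band thresholds are assumptions for this illustration, not values the calculator necessarily uses.

```python
def data_balance(target_params, available_tokens, ratio=20):
    """Compare available tokens against the ratio * params heuristic."""
    optimal = ratio * target_params
    if available_tokens < 0.5 * optimal:
        return "under-trained risk"
    if available_tokens > 2 * optimal:
        return "data-rich"
    return "balanced"

data_balance(34e9, 680e9)   # 680B tokens vs a 680B heuristic target
```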
3) Why are FLOPs only approximate?
Real training depends on kernels, attention variants, padding, communication overhead, and mixed-precision efficiency. The FLOPs estimate is mainly for rough cluster budgeting.
4) What does the scaling exponent control?
The exponent controls how quickly projected loss changes as parameter count grows. Smaller values imply slower improvement from added model size.
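The effect is easy to see numerically. Below, the same 7B → 13B jump is projected under three candidate exponents; the α values and the 2.2 baseline loss are illustrative, so fit yours from experiments.

```python
def projected_loss(base_loss, base_params, target_params, alpha):
    """Power-law loss projection from the formula section."""
    return base_loss * (target_params / base_params) ** (-alpha)

for alpha in (0.03, 0.07, 0.12):
    # Larger alpha -> steeper projected improvement from the same growth.
    print(alpha, round(projected_loss(2.2, 7e9, 13e9, alpha), 3))
```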
5) Why is training memory larger than weight memory?
Training needs weights, gradients, optimizer states, and sometimes extra buffers. Weight memory alone only covers stored parameters for one copy.
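A rough breakdown makes the gap concrete. The byte counts below (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights plus two Adam moments) are common mixed-precision assumptions, not a universal rule, and activations and buffers are excluded.

```python
def training_memory_gb(params):
    """Per-copy training state, excluding activations and comm buffers."""
    weights = params * 2     # bf16 weights
    grads = params * 2       # bf16 gradients
    optim = params * 12      # fp32 master weights + two Adam moments
    return (weights + grads + optim) / 1e9

training_memory_gb(13e9)     # ~208 GB vs ~26 GB for bf16 weights alone
```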
6) Does context length change parameter count?
No. Context length affects sequence size, activation load, and throughput. It does not directly change parameter count in this simplified calculator.
7) Should I always scale parameters first?
Not always. If your token supply is limited, a larger model may end up undertrained. Balanced scaling usually gives better returns than parameter growth alone.
8) Can this calculator estimate inference serving cost?
Partly. It estimates weight memory well enough for rough serving plans. Full serving cost also depends on batching, latency targets, KV cache, and hardware utilization.
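For a quick sense of the serving side, the weight-memory formula can be applied across precisions. This sketch deliberately excludes KV cache, batching, and runtime overhead, matching the calculator's scope; the precision list is illustrative.

```python
def weight_memory_gb(params, bytes_per_param):
    """Stored-weight memory only; no KV cache or runtime overhead."""
    return params * bytes_per_param / 1e9

for name, bpp in (("fp16", 2), ("int8", 1), ("int4", 0.5)):
    print(name, weight_memory_gb(70e9, bpp), "GB")
```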