Calculator Inputs
Use practical reinforcement learning values for traces and returns.
Plotly Graph
The chart compares n-step returns with their λ weights.
Formula Used
TD Error
δt = rt+1 + γV(st+1) − V(st)
This measures the gap between the current estimate and a one-step bootstrapped target.
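In code, the TD error is a single expression. A minimal sketch in plain Python (function name and argument order are illustrative), using the example inputs from this page's data table:

```python
def td_error(reward, gamma, v_next, v_current):
    """One-step TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_current

# Example inputs from the data table: r = 1.5, gamma = 0.95,
# V(s_{t+1}) = 3.6, V(s_t) = 3.2.
delta = td_error(1.5, 0.95, 3.6, 3.2)
print(round(delta, 6))  # 1.72
```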
Eligibility Trace
et = γλet−1 + xt
The trace stores recent state influence, scaled by discounting and decay.
Value Adjustment
ΔV = α · δt · et
A larger learning rate or trace amplifies the update size.
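The trace update and value adjustment together form one TD(λ) step. A short sketch (names are illustrative) assuming the accumulating trace defined above:

```python
def td_lambda_step(v, alpha, gamma, lam, delta, e_prev, x):
    """One TD(lambda) step: e_t = gamma*lam*e_{t-1} + x_t, then V += alpha*delta_t*e_t."""
    e = gamma * lam * e_prev + x   # eligibility trace update
    dv = alpha * delta * e         # value adjustment
    return v + dv, e

# Example values from this page's data table.
v_new, e = td_lambda_step(3.2, 0.12, 0.95, 0.8, 1.72, 0.5, 1.0)
print(round(e, 6), round(v_new, 6))  # 1.38 3.484832
```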
Truncated λ Return
Gλt = Σn wn G(n)t
This page blends 1-step through 5-step returns using λ-based weights: wn = (1−λ)λ^(n−1) for n = 1–4, with the remaining mass w5 = λ^4 placed on the 5-step return so the weights sum to 1.
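Assuming the standard truncated-λ weighting, where the last weight absorbs the leftover tail mass, the weights can be generated as:

```python
def lambda_weights(lam, n_steps=5):
    """Truncated-lambda weights: (1-lam)*lam**(n-1) for n < N, tail mass lam**(N-1) on n = N."""
    w = [(1 - lam) * lam ** (n - 1) for n in range(1, n_steps)]
    w.append(lam ** (n_steps - 1))  # tail weight so the weights sum to 1
    return w

w = lambda_weights(0.8)
print([round(x, 4) for x in w], round(sum(w), 6))
# [0.2, 0.16, 0.128, 0.1024, 0.4096] 1.0
```

The tail correction is what produces the weight sum check of 1.000000 in the example output table; with λ = 0.8, the 5-step return carries the single largest weight.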
How to Use This Calculator
- Enter the current value estimate for the present state.
- Set α, γ, and λ between 0 and 1.
- Type the previous eligibility trace and current state activation.
- Enter five future rewards from the rollout sequence.
- Provide the bootstrap value estimate for each matching future step.
- Press Calculate TD Lambda to show the result above the form.
- Review the value update, λ return, and decay profile.
- Use the CSV or PDF buttons when you need a saved summary.
Example Data Table
| Example Input | Value | Purpose |
|---|---|---|
| Current value V(st) | 3.2000 | Base estimate before updating. |
| Learning rate α | 0.1200 | Controls update speed. |
| Discount factor γ | 0.9500 | Discounts future information. |
| Trace decay λ | 0.8000 | Balances short and long returns. |
| Previous trace et−1 | 0.5000 | Previous eligibility memory. |
| State activation xt | 1.0000 | Current feature or state activity. |
| Reward path | 1.5000, 0.8000, 0.6000, 0.4000, 0.2000 | Observed rollout rewards. |
| Bootstrap values | 3.6000, 3.9000, 4.1000, 4.2000, 4.3000 | Value estimates after each step. |
| Example Output | Value |
|---|---|
| TD error δt | 1.720000 |
| Updated trace et | 1.380000 |
| Value adjustment | 0.284832 |
| Updated value estimate | 3.484832 |
| 5-step λ return | 6.107133 |
| Trace decay factor γλ | 0.760000 |
| Effective trace horizon | 4.166667 |
| Weight sum check | 1.000000 |
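Most of the example outputs can be reproduced in a few lines; the λ return is left out of this short check because it additionally requires all five n-step returns. A sketch, not the page's actual implementation:

```python
gamma, lam, alpha = 0.95, 0.8, 0.12
v, e_prev, x = 3.2, 0.5, 1.0
r1, v_boot1 = 1.5, 3.6  # first reward and first bootstrap value

delta = r1 + gamma * v_boot1 - v   # TD error
e = gamma * lam * e_prev + x       # updated eligibility trace
dv = alpha * delta * e             # value adjustment
horizon = 1 / (1 - gamma * lam)    # effective trace horizon

print(round(delta, 6), round(e, 6), round(dv, 6), round(v + dv, 6))
# 1.72 1.38 0.284832 3.484832
print(round(gamma * lam, 6), round(horizon, 6))
# 0.76 4.166667
```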
Frequently Asked Questions
1. What does TD(λ) combine?
TD(λ) combines one-step bootstrapping with multi-step return information. Lambda controls how much weight longer horizons receive, helping balance bias and variance during value learning.
2. Why is lambda restricted between 0 and 1?
That range keeps the trace decay interpretable and stable. Values near zero emphasize one-step updates, while values near one push the method toward longer-horizon credit assignment.
3. What is the meaning of the eligibility trace?
The eligibility trace records how strongly recent states or features should be updated. A higher trace means the current TD error influences the state more strongly.
4. Why are several bootstrap values included?
Each n-step return needs its own bootstrap estimate at the end of that horizon. Supplying values for steps one through five lets the calculator build a truncated λ return consistently.
5. What happens when lambda equals zero?
The λ return collapses to the one-step target, and trace influence decays immediately. This makes the method behave like standard TD learning without multi-step blending.
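The collapse at λ = 0 is easy to verify numerically; under the truncated weighting assumed on this page, only the one-step weight survives:

```python
lam, n_steps = 0.0, 5
w = [(1 - lam) * lam ** (n - 1) for n in range(1, n_steps)] + [lam ** (n_steps - 1)]
print(w)  # [1.0, 0.0, 0.0, 0.0, 0.0] -> the lambda return equals the 1-step target

# The trace also loses its memory: e_t = gamma * 0 * e_prev + x_t = x_t.
```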
6. What happens when lambda approaches one?
Longer-horizon returns gain more weight, so updates use more rollout information. This can reduce bootstrap bias, but it may also introduce more variance from sampled rewards.
7. Is this calculator useful for function approximation?
Yes. The state activation input lets you represent a simple active feature level, which makes the page useful for tabular intuition and lightweight feature-based learning checks.
8. Why does the page show an effective trace horizon?
The horizon is 1/(1−γλ), the sum of the geometric persistence series 1 + γλ + (γλ)² + …, and approximates how many steps past states keep influencing updates. Larger values of γλ mean the influence of past states decays more slowly across updates.
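Reading the horizon as the geometric sum 1 + γλ + (γλ)² + … = 1/(1−γλ) reproduces the example output of 4.166667; a quick numeric check:

```python
gamma, lam = 0.95, 0.8
decay = gamma * lam
closed_form = 1 / (1 - decay)                  # 1/(1 - gamma*lam)
series = sum(decay ** k for k in range(1000))  # partial geometric sum
print(round(closed_form, 6), round(series, 6))  # 4.166667 4.166667
```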