Calculator Inputs
Use practical reinforcement learning values for traces and returns.
Plotly Graph
The chart compares n-step returns with their λ weights.
Formula Used
TD Error
δt = rt+1 + γV(st+1) − V(st)
This measures the gap between the current estimate and a one-step bootstrapped target.
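In code, the TD error is a single expression. A minimal sketch in plain Python (function name and argument order are illustrative), using the example inputs from this page's data table:

```python
def td_error(reward, gamma, v_next, v_current):
    """One-step TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_current

# Example inputs from the data table: r = 1.5, gamma = 0.95,
# V(s_{t+1}) = 3.6, V(s_t) = 3.2.
delta = td_error(1.5, 0.95, 3.6, 3.2)
print(round(delta, 6))  # 1.72
```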
Eligibility Trace
et = γλet−1 + xt
The trace stores recent state influence, scaled by discounting and decay.
Value Adjustment
ΔV = α · δt · et
A larger learning rate or trace amplifies the update size.
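The trace update and value adjustment together form one TD(λ) step. A short sketch (names are illustrative) assuming the accumulating trace defined above:

```python
def td_lambda_step(v, alpha, gamma, lam, delta, e_prev, x):
    """One TD(lambda) step: e_t = gamma*lam*e_{t-1} + x_t, then V += alpha*delta_t*e_t."""
    e = gamma * lam * e_prev + x   # eligibility trace update
    dv = alpha * delta * e         # value adjustment
    return v + dv, e

# Example values from this page's data table.
v_new, e = td_lambda_step(3.2, 0.12, 0.95, 0.8, 1.72, 0.5, 1.0)
print(round(e, 6), round(v_new, 6))  # 1.38 3.484832
```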
Truncated λ Return
Gλt = Σn wn G(n)t
This page blends 1-step through 5-step returns using λ-based weights: wn = (1−λ)λ^(n−1) for n = 1–4, with the remaining mass w5 = λ^4 placed on the 5-step return so the weights sum to 1.
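Assuming the standard truncated-λ weighting, where the last weight absorbs the leftover tail mass, the weights can be generated as:

```python
def lambda_weights(lam, n_steps=5):
    """Truncated-lambda weights: (1-lam)*lam**(n-1) for n < N, tail mass lam**(N-1) on n = N."""
    w = [(1 - lam) * lam ** (n - 1) for n in range(1, n_steps)]
    w.append(lam ** (n_steps - 1))  # tail weight so the weights sum to 1
    return w

w = lambda_weights(0.8)
print([round(x, 4) for x in w], round(sum(w), 6))
# [0.2, 0.16, 0.128, 0.1024, 0.4096] 1.0
```

The tail correction is what produces the weight sum check of 1.000000 in the example output table; with λ = 0.8, the 5-step return carries the single largest weight.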
How to Use This Calculator
- Enter the current value estimate for the present state.
- Set α, γ, and λ between 0 and 1.
- Type the previous eligibility trace and current state activation.
- Enter five future rewards from the rollout sequence.
- Provide the bootstrap value estimate for each matching future step.
- Press Calculate TD Lambda to show the result above the form.
- Review the value update, λ return, and decay profile.
- Use the CSV or PDF buttons when you need a saved summary.
Example Data Table
| Example Input | Value | Purpose |
|---|---|---|
| Current value V(st) | 3.2000 | Base estimate before updating. |
| Learning rate α | 0.1200 | Controls update speed. |
| Discount factor γ | 0.9500 | Discounts future information. |
| Trace decay λ | 0.8000 | Balances short and long returns. |
| Previous trace et−1 | 0.5000 | Previous eligibility memory. |
| State activation xt | 1.0000 | Current feature or state activity. |
| Reward path | 1.5000, 0.8000, 0.6000, 0.4000, 0.2000 | Observed rollout rewards. |
| Bootstrap values | 3.6000, 3.9000, 4.1000, 4.2000, 4.3000 | Value estimates after each step. |
| Example Output | Value |
|---|---|
| TD error δt | 1.720000 |
| Updated trace et | 1.380000 |
| Value adjustment | 0.284832 |
| Updated value estimate | 3.484832 |
| 5-step λ return | 6.107133 |
| Trace decay factor γλ | 0.760000 |
| Effective trace horizon | 4.166667 |
| Weight sum check | 1.000000 |
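Most of the example outputs can be reproduced in a few lines; the λ return is left out of this short check because it additionally requires all five n-step returns. A sketch, not the page's actual implementation:

```python
gamma, lam, alpha = 0.95, 0.8, 0.12
v, e_prev, x = 3.2, 0.5, 1.0
r1, v_boot1 = 1.5, 3.6  # first reward and first bootstrap value

delta = r1 + gamma * v_boot1 - v   # TD error
e = gamma * lam * e_prev + x       # updated eligibility trace
dv = alpha * delta * e             # value adjustment
horizon = 1 / (1 - gamma * lam)    # effective trace horizon

print(round(delta, 6), round(e, 6), round(dv, 6), round(v + dv, 6))
# 1.72 1.38 0.284832 3.484832
print(round(gamma * lam, 6), round(horizon, 6))
# 0.76 4.166667
```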
Frequently Asked Questions
1. What does TD(λ) combine?
TD(λ) combines one-step bootstrapping with multi-step return information. Lambda controls how much weight longer horizons receive, helping balance bias and variance during value learning.
2. Why is lambda restricted between 0 and 1?
That range keeps the trace decay interpretable and stable. Values near zero emphasize one-step updates, while values near one push the method toward longer-horizon credit assignment.
3. What is the meaning of the eligibility trace?
The eligibility trace records how strongly recent states or features should be updated. A higher trace means the current TD error influences the state more strongly.
4. Why are several bootstrap values included?
Each n-step return needs its own bootstrap estimate at the end of that horizon. Supplying values for steps one through five lets the calculator build a truncated λ return consistently.
5. What happens when lambda equals zero?
The λ return collapses to the one-step target, and trace influence decays immediately. This makes the method behave like standard TD learning without multi-step blending.
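The collapse at λ = 0 is easy to verify numerically; under the truncated weighting assumed on this page, only the one-step weight survives:

```python
lam, n_steps = 0.0, 5
w = [(1 - lam) * lam ** (n - 1) for n in range(1, n_steps)] + [lam ** (n_steps - 1)]
print(w)  # [1.0, 0.0, 0.0, 0.0, 0.0] -> the lambda return equals the 1-step target

# The trace also loses its memory: e_t = gamma * 0 * e_prev + x_t = x_t.
```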
6. What happens when lambda approaches one?
Longer-horizon returns gain more weight, so updates use more rollout information. This can reduce bootstrap bias, but it may also introduce more variance from sampled rewards.
7. Is this calculator useful for function approximation?
Yes. The state activation input lets you represent a simple active feature level, which makes the page useful for tabular intuition and lightweight feature-based learning checks.
8. Why does the page show an effective trace horizon?
The horizon is 1/(1−γλ), the sum of the geometric persistence series 1 + γλ + (γλ)² + …, and approximates how many steps past states keep influencing updates. Larger values of γλ mean the influence of past states decays more slowly across updates.
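Reading the horizon as the geometric sum 1 + γλ + (γλ)² + … = 1/(1−γλ) reproduces the example output of 4.166667; a quick numeric check:

```python
gamma, lam = 0.95, 0.8
decay = gamma * lam
closed_form = 1 / (1 - decay)                  # 1/(1 - gamma*lam)
series = sum(decay ** k for k in range(1000))  # partial geometric sum
print(round(closed_form, 6), round(series, 6))  # 4.166667 4.166667
```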