- lower reward scale
- prediction len = 0
- tighter action range
- reward = delta wealth
- higher gamma and gae_lambda for long horizons
- more logs: position_abs, step_pnl
- tanh scaling:
  - sample actions pre-tanh
  - squash: z = tanh(u)
  - rescale: a = scale * z + bias
- fixed KL blowup
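The tanh-scaling steps above can be sketched as follows; this is a minimal numpy sketch, where `squash_action` is a made-up name and the default bounds mirror the `action_low`/`action_high` config values:

```python
import numpy as np

def squash_action(u, action_low=-5.0, action_high=5.0):
    """Map an unbounded pre-tanh sample u into [action_low, action_high]."""
    z = np.tanh(u)                            # squash: z = tanh(u), z in (-1, 1)
    scale = (action_high - action_low) / 2.0  # rescale: a = scale * z + bias
    bias = (action_high + action_low) / 2.0
    return scale * z + bias
```

With symmetric bounds the bias is zero, so `squash_action(0.0)` returns 0 and large `|u|` saturates at the bounds.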
End result
config (short version):
[env]
num_envs = 2048
time_horizon = 512
T_min = 128
T_max = 1024
hurst = 0.1
process_type = fbm
liquidate = 1
# 0: one step liq, 1: linear liq
liq_type = 1
friction_coef = 0.01
friction_power = 2
reward_scale = 0.001
action_low = -5.0
action_high = 5.0
price_window_size = 32
prediction_len = 0
normalize_observations = true
[train]
total_timesteps = 500_000_000
minibatch_size = 65536
ent_coef = 1e-4
gae_lambda = 0.99
gamma = 0.999
learning_rate = 0.001
target_kl = 0.03
max_logratio = 12.0

TODO
- give T/T_max as obs
- if we remove the trading-horizon information, the only thing the agent can learn is a homogeneous (time-independent) strategy
- try to train on only two T values, instead of a uniform distribution
- increase model size
- change remote url from personal to aielte-research
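For reference, a minimal sketch of how I read the KL-blowup fix implied by the `target_kl` / `max_logratio` config entries; the function name and the use of the k3 KL estimator are assumptions, not necessarily the repo's actual code:

```python
import numpy as np

def ppo_ratio(new_logprob, old_logprob, max_logratio=12.0):
    """Compute the PPO importance ratio with a clamped log-ratio.

    Clamping the log-ratio before exponentiating keeps exp() from
    overflowing, so one bad minibatch cannot blow up the KL estimate.
    """
    logratio = np.clip(new_logprob - old_logprob, -max_logratio, max_logratio)
    ratio = np.exp(logratio)
    # k3 estimator of approx. KL, always non-negative; compare against
    # target_kl to early-stop the update epoch.
    approx_kl = float(np.mean((ratio - 1.0) - logratio))
    return ratio, approx_kl
```

The update loop would then break out of the epoch whenever `approx_kl > target_kl` (0.03 in the config above).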