  • lower reward scale

  • prediction_len = 0

  • tighter action range

  • reward = delta wealth

  • higher gamma and gae_lambda for long horizons

  • more logs: position_abs, step_pnl

  • tanh scaling:

    • sample raw actions u pre-tanh
    • squash: z = tanh(u)
    • rescale: a = scale * z + bias

  • fixed KL blowup
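The tanh-scaling steps above can be sketched as follows. `action_low`/`action_high` mirror the env config; the helper name is illustrative (and in practice the policy's log-prob also needs the tanh Jacobian correction, omitted here):

```python
import numpy as np

def squash_and_rescale(u, action_low=-5.0, action_high=5.0):
    """Map a raw (pre-tanh) Gaussian sample to the env's action range."""
    z = np.tanh(u)                            # squash: z = tanh(u), in (-1, 1)
    scale = (action_high - action_low) / 2.0  # 5.0 for [-5, 5]
    bias = (action_high + action_low) / 2.0   # 0.0 for a symmetric range
    return scale * z + bias                   # rescale: a = scale * z + bias

# sample raw actions pre-tanh, then squash and rescale
u = np.random.default_rng(0).normal(size=4)
a = squash_and_rescale(u)
assert np.all((a >= -5.0) & (a <= 5.0))
```

This keeps actions strictly inside the configured range while letting the policy parameterize an unbounded Gaussian.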

End result

https://wandb.ai/leonardotoffalini-e-tv-s-lor-nd-university/pufferlib/runs/tznzh3wr?nw=nwuserleonardotoffalini

config (short version):

[env]
num_envs = 2048
time_horizon = 512
T_min = 128
T_max = 1024
hurst = 0.1
process_type = fbm
liquidate = 1
# 0: one step liq, 1: linear liq
liq_type = 1
friction_coef = 0.01
friction_power = 2
reward_scale = 0.001
action_low = -5.0
action_high = 5.0
price_window_size = 32
prediction_len = 0
normalize_observations = true
 
[train]
total_timesteps = 500_000_000
minibatch_size = 65536
ent_coef = 1e-4
gae_lambda = 0.99
gamma = 0.999
learning_rate = 0.001
target_kl = 0.03
max_logratio = 12.0
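A minimal sketch of how the reward and discounting settings above fit together; the function name is hypothetical, but the constants come straight from the config:

```python
def step_reward(wealth, prev_wealth, reward_scale=0.001):
    # reward = delta wealth, scaled down to keep value targets small
    return reward_scale * (wealth - prev_wealth)

gamma, gae_lambda = 0.999, 0.99
# effective credit-assignment horizon ~ 1/(1 - gamma) = 1000 steps,
# on the order of T_max = 1024 -- the reason both were raised
horizon = 1.0 / (1.0 - gamma)
```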

TODO

  • give T/T_max as obs
  • if we remove the trading-horizon information, the only thing the agent can learn is a homogeneous (time-independent) strategy
  • try to train on only two T values, instead of a uniform distribution
  • increase model size
  • change remote URL from personal to aielte-research
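The first TODO item could look roughly like this; the observation layout and function name are assumptions, with only price_window_size = 32 and T_max = 1024 taken from the config:

```python
import numpy as np

def augment_obs(price_window, t, T_max=1024):
    # append the normalized elapsed horizon t / T_max so the policy can
    # condition on time-to-horizon instead of being forced into a
    # homogeneous (time-independent) strategy
    return np.concatenate([price_window, [t / T_max]])

obs = augment_obs(np.zeros(32), t=512)
assert obs.shape == (33,) and obs[-1] == 0.5
```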