- lower reward scale
- prediction len = 0
- tighter action range
- reward = delta wealth
- higher gamma and gae_lambda for long horizons
- more logs: position_abs, step_pnl
- tanh scaling:
  - sample actions pre-tanh
  - squash: z = tanh(u)
  - rescale: a = scale * z + bias
- fixed KL blowup
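The tanh-scaling steps above can be sketched as follows; this is a minimal numpy sketch, where `squash_action` is a made-up name and the default bounds mirror the `action_low`/`action_high` config values:

```python
import numpy as np

def squash_action(u, action_low=-5.0, action_high=5.0):
    """Map an unbounded pre-tanh sample u into [action_low, action_high]."""
    z = np.tanh(u)                            # squash: z = tanh(u), z in (-1, 1)
    scale = (action_high - action_low) / 2.0  # rescale: a = scale * z + bias
    bias = (action_high + action_low) / 2.0
    return scale * z + bias
```

With symmetric bounds the bias is zero, so `squash_action(0.0)` returns 0 and large `|u|` saturates at the bounds.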
End result
config (short version):
[env]
num_envs = 2048
time_horizon = 512
T_min = 128
T_max = 1024
hurst = 0.1
process_type = fbm
liquidate = 1
# 0: one step liq, 1: linear liq
liq_type = 1
friction_coef = 0.01
friction_power = 2
reward_scale = 0.001
action_low = -5.0
action_high = 5.0
price_window_size = 32
prediction_len = 0
normalize_observations = true
[train]
total_timesteps = 500_000_000
minibatch_size = 65536
ent_coef = 1e-4
gae_lambda = 0.99
gamma = 0.999
learning_rate = 0.001
target_kl = 0.03
max_logratio = 12.0

TODO
- give T/T_max as obs
- if we remove the trading-horizon information, the only thing the agent can learn is a homogeneous (time-independent) strategy
- try to train on only two T values, instead of a uniform distribution
- increase model size
- change remote url from personal to aielte-research
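For reference, a minimal sketch of how I read the KL-blowup fix implied by the `target_kl` / `max_logratio` config entries; the function name and the use of the k3 KL estimator are assumptions, not necessarily the repo's actual code:

```python
import numpy as np

def ppo_ratio(new_logprob, old_logprob, max_logratio=12.0):
    """Compute the PPO importance ratio with a clamped log-ratio.

    Clamping the log-ratio before exponentiating keeps exp() from
    overflowing, so one bad minibatch cannot blow up the KL estimate.
    """
    logratio = np.clip(new_logprob - old_logprob, -max_logratio, max_logratio)
    ratio = np.exp(logratio)
    # k3 estimator of approx. KL, always non-negative; compare against
    # target_kl to early-stop the update epoch.
    approx_kl = float(np.mean((ratio - 1.0) - logratio))
    return ratio, approx_kl
```

The update loop would then break out of the epoch whenever `approx_kl > target_kl` (0.03 in the config above).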