Keep the naive reward for all of these plots; then maybe try different rewards, replicate the plots, and see what changes.
- Plot how the results change with the time horizon T, for different trained models (mean reward, certainty equivalent)
one training cycle for each T
(could even do a single training cycle and take a snapshot at each target T)
(the time horizon could even be a training variable)
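A minimal sketch of the two summary metrics per horizon, assuming each trained model (or snapshot) yields a list of total episode rewards for its T. The notes do not say which utility defines the certainty equivalent, so exponential (CARA) utility with a hypothetical `risk_aversion` parameter is assumed here; `certainty_equivalent`, `summarize_by_horizon` and `returns_by_T` are illustrative names, and the data is synthetic placeholder data, not results.

```python
import math
import random
from statistics import fmean


def certainty_equivalent(returns, risk_aversion=1.0):
    """CE under an ASSUMED exponential utility: -(1/a) * log E[exp(-a * X)]."""
    a = float(risk_aversion)
    z = [-a * x for x in returns]
    m = max(z)  # shift by the max exponent for numerical stability
    log_mean_exp = m + math.log(fmean([math.exp(v - m) for v in z]))
    return -log_mean_exp / a


def summarize_by_horizon(returns_by_T, risk_aversion=1.0):
    """Map each horizon T to (mean reward, certainty equivalent)."""
    return {
        T: (fmean(r), certainty_equivalent(r, risk_aversion))
        for T, r in sorted(returns_by_T.items())
    }


# Synthetic placeholder returns; replace with evaluation rollouts of the
# model trained for (or snapshotted at) each horizon T.
rng = random.Random(0)
returns_by_T = {
    T: [rng.gauss(0.01 * T, 0.05 * math.sqrt(T)) for _ in range(1000)]
    for T in (5, 10, 20)
}
summary = summarize_by_horizon(returns_by_T)
for T, (mean_reward, ce) in summary.items():
    print(f"T={T:3d}  mean={mean_reward:.4f}  CE={ce:.4f}")
```

The dictionary keyed by T can be passed straight to a plotting routine, with T on the x-axis and one curve per metric. Under this utility the certainty equivalent never exceeds the mean reward, so the gap between the two curves shows how much risk the agent is taking at each horizon.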
It would be nice to see the agent's trading strategy while it is evolving. Put everything needed in info, then recreate the rollouts to plot them.
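One way this could look, as a sketch only: the environment writes every quantity needed for plotting into the info returned by each step, and a rollout helper collects those dicts into a trajectory. `ToyTradingEnv`, its fields, and the PnL-style reward are hypothetical stand-ins for the real environment and the naive reward; the four-value `step` return is also an assumption.

```python
import random


class ToyTradingEnv:
    """Hypothetical stand-in environment; swap in the real one."""

    def __init__(self, T=10, seed=0):
        self.T = T
        self.seed = seed

    def reset(self):
        # Re-seeding on reset makes a rollout reproducible for plotting.
        self.rng = random.Random(self.seed)
        self.t, self.price, self.position, self.wealth = 0, 100.0, 0.0, 0.0
        return (self.t, self.price, self.position)

    def step(self, action):
        # action = position held over the next price move (assumption)
        new_price = self.price + self.rng.gauss(0.0, 1.0)
        reward = action * (new_price - self.price)  # stand-in for the naive reward
        self.position, self.price = action, new_price
        self.wealth += reward
        self.t += 1
        # Everything needed to plot the strategy later goes into info.
        info = {"t": self.t, "price": self.price, "position": self.position,
                "reward": reward, "wealth": self.wealth}
        done = self.t >= self.T
        return (self.t, self.price, self.position), reward, done, info


def rollout(env, policy):
    """Run one episode and return the list of per-step info dicts."""
    obs, done, trajectory = env.reset(), False, []
    while not done:
        obs, _, done, info = env.step(policy(obs))
        trajectory.append(info)
    return trajectory


def to_columns(trajectory):
    """Turn a list of info dicts into one list per key, ready for plotting."""
    return {key: [step[key] for step in trajectory] for key in trajectory[0]}


traj = rollout(ToyTradingEnv(T=10), policy=lambda obs: 1.0)
cols = to_columns(traj)
```

Running the same helper on policy snapshots saved during training gives one set of columns per snapshot, so position or wealth against t can be overlaid to show how the strategy changes.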