Keep the naive reward for all these plots; then maybe try out different rewards, replicate the plots, and see what changes.

  • plot how mean reward and certainty equivalent change with the time horizon T, for different trained models
    one training cycle for each T
    (could even do one training cycle and do a snapshot at each target T)
    (the time horizon could even be a training variable)
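
A rough sketch of how the two metrics could be computed per horizon before plotting. The notes do not say which utility defines the certainty equivalent, so this assumes exponential (CARA) utility with a hypothetical risk_aversion parameter; certainty_equivalent, summarize_by_horizon and rewards_by_T are made-up names, and rewards_by_T is assumed to map each T to a list of episode rewards from the model trained (or snapshotted) for that T.

```python
import math
from statistics import fmean


def certainty_equivalent(rewards, risk_aversion=1.0):
    """Certainty equivalent under exponential (CARA) utility (an assumption).

    CE = -(1/a) * log(mean(exp(-a * r))), computed with a log-sum-exp shift
    so that large rewards do not overflow exp().
    """
    a = risk_aversion
    exponents = [-a * r for r in rewards]
    shift = max(exponents)
    log_mean = shift + math.log(fmean([math.exp(e - shift) for e in exponents]))
    return -log_mean / a


def summarize_by_horizon(rewards_by_T, risk_aversion=1.0):
    """Return rows (T, mean reward, certainty equivalent), sorted by T."""
    return [
        (T, fmean(rewards), certainty_equivalent(rewards, risk_aversion))
        for T, rewards in sorted(rewards_by_T.items())
    ]
```

The rows can go straight into a plot with T on the x-axis and one curve per metric. For a risk-averse agent (risk_aversion > 0) the certainty equivalent never exceeds the mean reward, so the gap between the two curves shows how much risk the policy takes at each horizon.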

It would be nice to see the agent's trading strategy while it is evolving. Put everything in info and recreate the rollouts to plot them.
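
A minimal sketch of the "everything in info" idea, assuming a Gymnasium-style interface where reset() returns (obs, info) and step() returns (obs, reward, terminated, truncated, info). ToyTradingEnv is a hypothetical stand-in for the real environment, and its price dynamics, per-step PnL reward and info keys are placeholders, not the actual setup.

```python
import random


class ToyTradingEnv:
    """Hypothetical stand-in env; dynamics and info keys are assumptions."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t, self.price, self.position, self.wealth = 0, 100.0, 0.0, 0.0
        return self.price, self._info(action=0.0)

    def step(self, action):
        self.position += action
        new_price = self.price + self.rng.gauss(0.0, 1.0)
        reward = self.position * (new_price - self.price)  # placeholder: step PnL
        self.wealth += reward
        self.price = new_price
        self.t += 1
        terminated = self.t >= self.horizon
        return self.price, reward, terminated, False, self._info(action)

    def _info(self, action):
        # Everything needed to plot the strategy later goes in here.
        return {"t": self.t, "price": self.price, "action": action,
                "position": self.position, "wealth": self.wealth}


def record_rollout(env, policy):
    """Run one episode and keep the info dict from every step."""
    obs, info = env.reset()
    trajectory = [info]
    done = False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        trajectory.append(info)
        done = terminated or truncated
    return trajectory


def to_columns(trajectory):
    """Turn a list of info dicts into one list per key, ready for plotting."""
    return {key: [step[key] for step in trajectory] for key in trajectory[0]}
```

Calling record_rollout with the current policy at each training snapshot, with a fixed environment seed, would give directly comparable position and wealth paths, so the plots show the strategy changing rather than the noise changing.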