Describe the bug
For StockTradingEnv, when training in an SB3 VecEnv (created via get_sb_env), no reward is calculated at the terminal date (say day X), because the training data does not contain the next day's closing price.
However, when the SB3 algorithm runs collect_rollouts, it adds the previous reward (day X-1's reward) to the rollout_buffer.
To Reproduce
Steps to reproduce the behavior:
Pick a short training period so the terminal date is reached quickly while debugging, e.g. 5 days in total (a minimal setup sketch follows these steps).
Set a breakpoint after rollout_buffer.add() in on_policy_algorithm.py (SB3 module), line 232.
Step until after day 5 (when the 4th experience is added).
Inspect the rollout_buffer variable: rewards[3] is identical to the previous reward.
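The setup I used looked roughly like this. It is only a sketch, not tested verbatim: `train` is the preprocessed FinRL training DataFrame, `env_kwargs` is the usual tutorial configuration, and the import path may differ slightly between FinRL versions.

    # Hedged repro sketch: assumes `train` and `env_kwargs` from the standard FinRL tutorial.
    from stable_baselines3 import PPO
    from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

    five_days = sorted(train["date"].unique())[:5]          # keep only 5 trading days
    env = StockTradingEnv(df=train[train["date"].isin(five_days)], **env_kwargs)
    vec_env, _ = env.get_sb_env()                           # DummyVecEnv used by SB3

    model = PPO("MlpPolicy", vec_env, n_steps=8, batch_size=8)
    # Set the breakpoint in on_policy_algorithm.py (after rollout_buffer.add) before this call:
    model.learn(total_timesteps=16)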
Expected behavior
After examining the SB3 code, I found it already provides a way to handle the reward when the next state is unobservable at the end of an episode: it adds the discounted value prediction to the reward, though I'm not sure about its mathematical meaning.
# see GitHub issue #633
for idx, done in enumerate(dones):
    if (
        done
        and infos[idx].get("terminal_observation") is not None
        and infos[idx].get("TimeLimit.truncated", False)
    ):
        terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
        with th.no_grad():
            terminal_value = self.policy.predict_values(terminal_obs)[0]  # type: ignore[arg-type]
        rewards[idx] += self.gamma * terminal_value
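My rough understanding of the math (my own note, not from the SB3 docs): when an episode is cut off by a time limit, the tail of the return still exists, it just isn't observed, so SB3 bootstraps it with the critic:

    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...   # the true return continues past the cutoff
        ≈ r_t + gamma * V(s_{t+1})                          # unobserved tail replaced by the critic's estimate

which is exactly what rewards[idx] += self.gamma * terminal_value implements.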
But that path only triggers when the env returns terminated = False, truncated = True, while StockTradingEnv returns terminated = True, truncated = False. (In SB3, truncated and terminated are mutually exclusive.)
I think in a stock trading scenario the last day of the training data should be a time-limit truncation, not a true terminal state, because the position still has future reward (value), unless it holds zero shares and the policy would never buy again from that state.
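As an illustration only (a hypothetical sketch, not the current FinRL code), the two flags could be set like this in a gymnasium-style step():

    # Hypothetical sketch: distinguish running out of data (truncation) from a state
    # where the MDP genuinely ends (termination).
    last_day = self.day >= len(self.df.index.unique()) - 1   # same check FinRL uses for self.terminal
    bankrupt = self.asset_memory[-1] <= 0                     # example of a truly terminal condition
    terminated = bankrupt                                      # the episode really cannot continue
    truncated = last_day and not terminated                    # we only ran out of training data
    return self.state, reward, terminated, truncated, {}

With flags like these, SB3's TimeLimit.truncated bootstrap path would fire on the last training day.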
Screenshots
Additional context
As a workaround, I simply changed the return of the terminal case in the StockTradingEnv code, line 300, to
    return self.state, 0.0, False, True, {}  # reward = 0, which could cause some issues
but I didn't consider the impact on ElegantRL, Ray, or the Portfolio Management environments.
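For reference, if I read stable_baselines3/common/vec_env/dummy_vec_env.py correctly, the VecEnv layer converts the 5-tuple back to the old done + info convention, which is why returning truncated = True above should be enough to trigger the bootstrap path:

    # From my reading of SB3's DummyVecEnv.step_wait (SB3 2.x, paraphrased):
    done = terminated or truncated
    info["TimeLimit.truncated"] = truncated and not terminated   # the flag checked in collect_rollouts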