Describe the bug
For StockTradingEnv, when training in an SB3 VecEnv (created via get_sb_env), no reward is calculated at the terminal date (say day X), because the training data does not contain the next day's closing price.
However, when the SB3 algorithm runs collect_rollouts, it adds the previous reward (day X-1's reward) to the rollout_buffer.
To Reproduce
Steps to reproduce the behavior:
Pick a short training period so the terminal date is reached quickly while debugging, e.g. 5 days in total (a minimal setup sketch follows these steps).
Set a breakpoint after rollout_buffer.add() in on_policy_algorithm.py (SB3 module), line 232.
Step until after day 5 (when the 4th experience is added).
Inspect the rollout_buffer variable: rewards[3] is identical to the previous reward.
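The setup I used looked roughly like this. It is only a sketch, not tested verbatim: `train` is the preprocessed FinRL training DataFrame, `env_kwargs` is the usual tutorial configuration, and the import path may differ slightly between FinRL versions.

    # Hedged repro sketch: assumes `train` and `env_kwargs` from the standard FinRL tutorial.
    from stable_baselines3 import PPO
    from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

    five_days = sorted(train["date"].unique())[:5]          # keep only 5 trading days
    env = StockTradingEnv(df=train[train["date"].isin(five_days)], **env_kwargs)
    vec_env, _ = env.get_sb_env()                           # DummyVecEnv used by SB3

    model = PPO("MlpPolicy", vec_env, n_steps=8, batch_size=8)
    # Set the breakpoint in on_policy_algorithm.py (after rollout_buffer.add) before this call:
    model.learn(total_timesteps=16)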
Expected behavior
After examining the SB3 code, I found it already provides a way to handle the reward when the next state is unobservable at the end of an episode: it adds the discounted value prediction to the reward, though I'm not sure about its mathematical meaning.
# see GitHub issue #633
for idx, done in enumerate(dones):
    if (
        done
        and infos[idx].get("terminal_observation") is not None
        and infos[idx].get("TimeLimit.truncated", False)
    ):
        terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
        with th.no_grad():
            terminal_value = self.policy.predict_values(terminal_obs)[0]  # type: ignore[arg-type]
        rewards[idx] += self.gamma * terminal_value
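My rough understanding of the math (my own note, not from the SB3 docs): when an episode is cut off by a time limit, the tail of the return still exists, it just isn't observed, so SB3 bootstraps it with the critic:

    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...   # the true return continues past the cutoff
        ≈ r_t + gamma * V(s_{t+1})                          # unobserved tail replaced by the critic's estimate

which is exactly what rewards[idx] += self.gamma * terminal_value implements.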
But that path only triggers when the env returns terminated = False, truncated = True, while StockTradingEnv returns terminated = True, truncated = False. (In SB3, truncated and terminated are mutually exclusive.)
I think in a stock trading scenario the last day of the training data should be a time-limit truncation, not a true terminal state, because the position still has future reward (value), unless it holds zero shares and the policy would never buy again from that state.
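As an illustration only (a hypothetical sketch, not the current FinRL code), the two flags could be set like this in a gymnasium-style step():

    # Hypothetical sketch: distinguish running out of data (truncation) from a state
    # where the MDP genuinely ends (termination).
    last_day = self.day >= len(self.df.index.unique()) - 1   # same check FinRL uses for self.terminal
    bankrupt = self.asset_memory[-1] <= 0                     # example of a truly terminal condition
    terminated = bankrupt                                      # the episode really cannot continue
    truncated = last_day and not terminated                    # we only ran out of training data
    return self.state, reward, terminated, truncated, {}

With flags like these, SB3's TimeLimit.truncated bootstrap path would fire on the last training day.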
Screenshots
Additional context
As a workaround, I simply changed the return of the terminal case in the StockTradingEnv code, line 300, to
    return self.state, 0.0, False, True, {}  # reward = 0, which could cause some issues
but I didn't consider the impact on ElegantRL, Ray, or the Portfolio Management environments.
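For reference, if I read stable_baselines3/common/vec_env/dummy_vec_env.py correctly, the VecEnv layer converts the 5-tuple back to the old done + info convention, which is why returning truncated = True above should be enough to trigger the bootstrap path:

    # From my reading of SB3's DummyVecEnv.step_wait (SB3 2.x, paraphrased):
    done = terminated or truncated
    info["TimeLimit.truncated"] = truncated and not terminated   # the flag checked in collect_rollouts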