-
Hi @lamhintai, thanks for using mlforecast. The joins with the dynamic features use a left join, so if you don't provide them for all your series you'll probably get some warnings about nulls, but it should be able to continue with the prediction. Can you provide a small example?
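A minimal sketch of that left-join behavior. This assumes a recent mlforecast where `predict` takes future exogenous values via `X_df` (older releases used a `dynamic_dfs` argument instead), and the toy data, series ids, and the `price` column are all made up for illustration:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast

# Two toy series with one future-known exogenous feature, "price".
rng = np.random.RandomState(0)
dates = pd.date_range("2022-01-01", periods=100, freq="D")
df = pd.concat(
    pd.DataFrame({"unique_id": uid, "ds": dates,
                  "y": rng.rand(dates.size), "price": rng.rand(dates.size)})
    for uid in ["id_0", "id_1"]
)

fcst = MLForecast(models=[lgb.LGBMRegressor()], freq="D", lags=[1, 7])
fcst.fit(df, id_col="unique_id", time_col="ds", target_col="y", static_features=[])

# Provide future prices for id_0 only. The left join leaves id_1's price as
# NaN, so warnings about nulls are expected, but LightGBM tolerates NaN
# features and the prediction should still go through.
h = 7
future_dates = pd.date_range(dates[-1] + pd.Timedelta("1D"), periods=h, freq="D")
X_df = pd.DataFrame({"unique_id": "id_0", "ds": future_dates,
                     "price": rng.rand(h)})
preds = fcst.predict(h, X_df=X_df)
```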
-
Hi @jmoralez, I'm having a similar problem. There are indeed warnings about nulls when using LightGBM, but when trying other models, such as RandomForestRegressor or AdaBoostRegressor, an error is raised (example below). I'm thinking about using …

Example of the warning:

Example of the error:
An example of my code:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from mlforecast import MLForecast
from window_ops.rolling import rolling_mean, rolling_max, rolling_min
from window_ops.expanding import expanding_mean

# `data` is a long-format dataframe with unique_id, ds, y and exogenous columns
train = data.loc[data['ds'] < '2022-01-01'].sort_values(by=["ds", "unique_id"]).reset_index(drop=True)
valid = data.loc[(data['ds'] >= '2022-01-01') & (data['ds'] <= '2022-12-31')].reset_index(drop=True)
h = valid['ds'].nunique()

models = {
    'rf': RandomForestRegressor(n_estimators=10),
    'ada': AdaBoostRegressor(estimator=DecisionTreeRegressor()),
}
model = MLForecast(
    models=models,
    freq="M",
    lags=[3, 6, 12],
    lag_transforms={
        3: [expanding_mean],
        6: [(rolling_mean, 12), (rolling_max, 12), (rolling_min, 12)],
        12: [(rolling_mean, 24), (rolling_max, 24), (rolling_min, 24)],
    },
    target_transforms=[NullImputer()],  # NullImputer is a custom target transform (definition not shown)
    date_features=["month"],
    num_threads=6,
)
model.fit(train, id_col="unique_id", time_col="ds", target_col="y", static_features=[])
p = model.predict(horizon=h)
p = p.merge(valid[["unique_id", "ds", "y"]], on=["unique_id", "ds"], how="inner")
```
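If the error here is sklearn's usual "Input contains NaN" complaint (RandomForestRegressor and AdaBoostRegressor reject NaN inputs, unlike LightGBM), one possible workaround is to wrap each model in a Pipeline that imputes missing feature values first. This is only a sketch against that assumed failure mode, so check it matches your actual traceback:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# NaNs in the feature matrix (e.g. from missing future exogenous values) are
# imputed before they reach the regressor. A Pipeline is still a valid
# sklearn-style estimator, so it should be usable as an MLForecast model.
models = {
    'rf': make_pipeline(SimpleImputer(strategy='median'),
                        RandomForestRegressor(n_estimators=10)),
    'ada': make_pipeline(SimpleImputer(strategy='median'),
                         AdaBoostRegressor(estimator=DecisionTreeRegressor())),
}
```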
-
Is there any way that we can produce forecasts from a global model (e.g. LightGBM) using mlforecast for just a few specific time series (`id_col` values)?

This case arises when training a global model on several thousand series of mobile network cell site usage. Some sites got decommissioned, and we no longer have the projected exogenous variables (e.g. resource blocks configured) for those `id_col` values. Or think of it as a model forecasting the demand for many products with price as an exogenous variable: eventually some products get discontinued and therefore no longer appear in the future projected price book (dynamic_dfs).

It looks like the use of GroupedArray in the implementation of forecast requires making up future exogenous variables for every `id_col` in order to forecast at all (maybe just for 1 series), even if many of those ids no longer exist in the business sense. (I don't have the experiment Jupyter notebook from work at hand, but I recall receiving a warning like "new must be of size 17538".)
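For what it's worth, a sketch of one possible way out: newer mlforecast releases appear to accept an `ids` argument in `predict` to restrict forecasting to a subset of the training series, together with `X_df` for future exogenous values, so projected values would only be needed for the series that still exist. Both `ids` and `X_df` are assumptions about the current API (the code above uses the older `horizon`/`dynamic_dfs` style), so check `MLForecast.predict`'s signature in your installed version; the data below is made up:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast

# Toy setup: one surviving cell site and one decommissioned one.
rng = np.random.RandomState(1)
dates = pd.date_range("2021-01-01", periods=60, freq="D")
train = pd.concat(
    pd.DataFrame({"unique_id": uid, "ds": dates,
                  "y": rng.rand(dates.size),
                  "resource_blocks": rng.rand(dates.size)})
    for uid in ["active_site", "decommissioned_site"]
)

fcst = MLForecast(models=[lgb.LGBMRegressor()], freq="D", lags=[1])
fcst.fit(train, id_col="unique_id", time_col="ds", target_col="y", static_features=[])

# Future exogenous values only for the series we still care about; no need
# to make anything up for the decommissioned site.
h = 5
future = pd.DataFrame({
    "unique_id": "active_site",
    "ds": pd.date_range(dates[-1] + pd.Timedelta("1D"), periods=h, freq="D"),
    "resource_blocks": rng.rand(h),
})
preds = fcst.predict(h, X_df=future, ids=["active_site"])
```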