Cracking Multidimensional Time Series Forecasting with AutoML


Kaggle, the world’s largest data science community, and the University of Nicosia conducted a global competition, M5, to forecast the sales of thousands of products sold by Walmart, the world’s largest retailer. The competition drew over 5,500 teams from all over the world and ran for 4 months. The objective was to forecast daily unit sales for over 3,000 unique products across 10 Walmart locations in 3 US states. The available data included item-level, department, product category, calendar, selling price, and store details.
The dotData team used our Enterprise AutoML 2.0 platform to analyze the M5 data and delivered outstanding results, ranking in the top 1.8% (#102 out of 5,500+ teams). The whole process required less than one hour of manual effort from a single person and almost zero data preprocessing. The purpose of this blog is to demonstrate how citizen data scientists, such as business analysts and BI teams, can use our AutoML tool to quickly deliver results that are competitive with those of world-class data scientists, with minimal effort.

The Task: Super-High Dimensional Time Series Forecasting

Time-series forecasting, such as demand, revenue, and sales forecasting, is one of the most fundamental and common requirements in many businesses. While a multitude of forecasting techniques exist (classically, auto-regressive or state-space models, and more recently Long Short-Term Memory (LSTM) neural networks), real-world time-series forecasting problems have never been easy.
The task in this competition is to forecast SKU-level product demand, which is critical for retailers to optimize inventory levels. Some of the challenges of working with this data are:

  • the data essentially consists of super-high dimensional time-series (more than 3000 products x 10 stores)
  • many store-item level series are sparse and noisy
  • the data has a hierarchical nature (i.e. state-level – store-level – product category level – item level), and the scales and behaviors of the time series are diverse
  • sales of many products are intermittent (e.g. a product is sold in 2011, taken off the shelf, and sold again in 2013)
  • different products were sold during different time periods and their series have little overlap (e.g. some products started selling in 2011 whereas some started selling only in 2014)
  • there are many “external factors” (e.g. promotional effects) which are not easy to incorporate in classical time-series modeling techniques
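
Sparsity and intermittency of the kind listed above can be quantified before any modeling. The following is a minimal sketch, not part of the dotData workflow: the function names and the toy series are hypothetical, and it assumes a store-item series is simply a list of daily unit sales.

```python
from typing import Sequence


def zero_share(sales: Sequence[int]) -> float:
    """Fraction of days with no sales (a simple sparsity measure)."""
    return sum(1 for s in sales if s == 0) / len(sales)


def longest_gap(sales: Sequence[int]) -> int:
    """Longest run of zero-sales days that lies between two selling days.

    Leading zeros (product not yet launched) and trailing zeros are ignored.
    """
    longest = current = 0
    seen_sale = False
    for s in sales:
        if s > 0:
            if seen_sale:
                longest = max(longest, current)
            seen_sale = True
            current = 0
        elif seen_sale:
            current += 1
    return longest


# Toy daily unit sales for one hypothetical store-item series.
toy_series = [0, 0, 3, 1, 0, 0, 0, 0, 2, 0, 1, 0]
print(f"share of zero-sales days: {zero_share(toy_series):.2f}")
print(f"longest gap between sales: {longest_gap(toy_series)} days")
```

Series with a high share of zero-sales days or long gaps are the ones where classical per-series models tend to struggle most.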

The competition ran for 4 months, and participants tuned their models by applying different preprocessing methods and modeling algorithms. Figure 1 shows the score distribution of the top 40% of participants (scores below 1.0; lower is better):

  • Across all teams, the score distribution has a very long tail, with mean=6.09 and mode=5.39.
  • Even so, there is a high peak around 0.75 (about 900 teams = 16.3% of 5,500), and scores are widely spread even among top performers.

This shows how difficult the task is: even experienced data science teams find it hard to achieve competitive accuracy.

Figure 1 The score distribution in the top 40% (score < 1.0)

First Execution: Top-3% Score with Almost Zero Manual Preprocessing

The first execution using dotData involved building models for each store, which resulted in 10 models. Here are the steps we took: 

dotData Steps

First Execution: Deeper Dive

Model Review

Using the standard configuration, dotData automatically explored over 100,000 features for each store and validated their statistical significance to create a feature table. Using those features, dotData generated over 500 models with different combinations of hyper-parameters for both linear and non-linear machine learning algorithms, such as L1/L2-regularized regression, regression trees, LightGBM, and XGBoost, as well as models built with PyTorch and TensorFlow. dotData supports various advanced techniques for encoding temporal information into temporal features, and the ML algorithms can further optimize the combinations of such features to achieve the best forecasting accuracy.
Not surprisingly, 9 of the 10 store models used XGBoost and 1 used LightGBM, so every selected model was a gradient boosting algorithm, although the hyper-parameters and feature sets differed. Figure 2 shows the ratio of validation to training scores for each model. The validation MAE was only 1-2% worse than the training MAE. The validation/training gap in RMSE was larger (3-5%) for a few stores, but this was because the absolute sales values were large. Overall, the models automatically developed by dotData generalized well, with little overfitting.

Figure 2 The ratio of training and validation scores for each model (store) in percentage. (Validation / Training – 1) * 100.
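
The percentage in the Figure 2 caption is straightforward to reproduce for any model. The sketch below is illustrative only: the helper names and the toy numbers are hypothetical and are not taken from the competition data.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def validation_gap_pct(training_score: float, validation_score: float) -> float:
    """(Validation / Training - 1) * 100, as in the Figure 2 caption."""
    return (validation_score / training_score - 1.0) * 100.0


# Hypothetical error values for one store model.
train_mae, valid_mae = 1.00, 1.02
print(f"validation MAE is {validation_gap_pct(train_mae, valid_mae):.1f}% worse")
```

A value close to zero means the model performs almost as well on held-out data as on the data it was trained on.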

Feature Review

As you can imagine, most Kagglers probably used XGBoost, LightGBM, or neural networks, and we used the same ML algorithms. Why, then, could dotData achieve a top-3% score from the beginning? The answer lies in dotData’s ability to explore a massive number of advanced temporal features.
Table 1 shows examples of temporal features that dotData automatically discovered. As you can see, there are different types of temporal features. Interestingly, these features map onto patterns that a domain expert would engineer by hand, such as hierarchical or periodic patterns. It is worth emphasizing that dotData explored over 100,000 features for each model. Done manually, this would entail writing complex queries and hundreds of lines of code, and would require a good grasp of the domain to generate relevant features. dotData avoids the need for code, queries, and domain expertise by relying entirely on signals in the raw data.
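
To give a sense of what hand-written temporal features involve, here is a minimal sketch that derives a weekly lag, a trailing average, and a day-of-week feature from one daily sales series. It is not how dotData generates features; the function name, feature names, and toy data are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def temporal_features(sales, start, lag=7, window=7):
    """Build one feature row per day from a list of daily unit sales.

    Each row uses only sales observed strictly before that day, so the
    features do not leak the value being predicted.
    """
    rows = []
    for t in range(max(lag, window), len(sales)):
        day = start + timedelta(days=t)
        rows.append({
            "date": day,
            "day_of_week": day.weekday(),                 # 0 = Monday
            f"lag_{lag}": sales[t - lag],                 # sales `lag` days earlier
            f"rolling_mean_{window}": mean(sales[t - window:t]),
            "target": sales[t],
        })
    return rows


# Toy series: ten days of unit sales starting on 2016-01-01.
toy_sales = [3, 0, 2, 5, 1, 0, 4, 6, 1, 0]
for row in temporal_features(toy_sales, start=date(2016, 1, 1)):
    print(row)
```

Even this small example needs decisions about lags, window lengths, and leakage; repeating it across item, department, category, store, and state levels is where the manual effort grows quickly.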

Table 1 Examples of temporal features generated in CA store models 

Prediction Review

Competition submissions are evaluated on predictions for a 28-day period in 2016 (the data for the test period is not disclosed, so this is blind testing). Figure 3 shows a time series plot of “actual” sales for one California store (CA_2) in 2016 for the training period, together with “predictions” for the test period (the last 28 days). Since we do not have the “actuals” for the test period, we can only perform a qualitative evaluation of the trend. As the plot shows, the predictions capture the sales trend well.

Figure 3 – Sales during training and test for store CA_2 in 2016

Second Execution: Utilize Hierarchical Stacking to Optimize Accuracy

The top-3% score was already impressive, but we took our trial one step further. Given the hierarchical nature of the data (i.e. state-level – store-level – product category level – item level), the problem can be looked at from different angles. Sometimes models built at one level of the hierarchy perform better than those built at other levels.
Our second execution incorporated this “hierarchical information” to build more robust predictions. The steps were straightforward: we built models at different levels using the same steps as in the first execution:

  • for all states, which results in 1 model
  • for each state, which results in 3 models
  • for each store, which results in 10 models (this is equivalent to the first execution)
  • for each category (Foods, Hobbies, Household) in each store, which results in 30 models (10 stores x 3 categories)

The final prediction was the simple average of predictions in all 4 levels.
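
The averaging step can be sketched as follows. This is an illustration under assumptions rather than the exact dotData procedure: it assumes that each of the four levels yields a forecast for the same item-store series over the same days, and the level names and numbers are hypothetical.

```python
from statistics import mean

# Number of models across the four levels described above.
n_models = 1 + 3 + 10 + 10 * 3  # all states, per state, per store, per store-category


def average_ensemble(predictions_by_level):
    """Average per-day forecasts for one item-store series across levels."""
    horizons = {len(p) for p in predictions_by_level.values()}
    if len(horizons) != 1:
        raise ValueError("all levels must forecast the same number of days")
    return [mean(day) for day in zip(*predictions_by_level.values())]


# Hypothetical 3-day forecasts for one item in one store, one list per level.
toy_predictions = {
    "all_states": [2.0, 4.0, 1.0],
    "state": [3.0, 4.0, 1.0],
    "store": [2.0, 6.0, 2.0],
    "store_category": [1.0, 2.0, 0.0],
}
print(n_models)
print(average_ensemble(toy_predictions))
```

An unweighted mean needs no extra validation data to fit ensemble weights, which keeps the second execution as automated as the first.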

Figure 4 shows a time series plot of “actual” sales for the Foods and Household categories in one California store (CA_2) in 2016 for the training period, together with “predictions” for the test period (the last 28 days).

Figure 4 Sales for Foods and Household during training and test for store CA_2 in 2016

This simple-average ensemble raised our ranking from #170 to #102 (top 1.8%). More importantly, the result was achieved with little manual pre- or post-processing and in reasonable computation time. We are very excited that our automation achieved results competitive with world-class data science teams in a very practical enterprise use case.


For most retailers, forecasting requires an army of data scientists who toil for months constructing features, generating models, and delivering predictions. However, there is a better way to tackle multi-dimensional time series forecasting. Using dotData’s AutoML 2.0 platform with end-to-end automation, we delivered accurate predictions that ranked in the top 1.8% (#102 out of 5,500+ teams) with only 150 hours of computation time. Any BI analyst can deliver results like those of a data scientist with a few clicks and without any coding. dotData empowers citizen data scientists to achieve high accuracy with explainable features using powerful AI automation.
To learn more about our AI Automation Solution, please join us in our next webinar titled AI Competition: How We (Almost) Won in One Day!