Learning from the Best: Rossmann Sales Kaggle Winners


One of my favorite past Kaggle competitions is the Rossmann Store Sales competition, which ran from September 30th to December 15th, 2015. Even though this competition ran 3 years ago, there is much to learn from the approaches used and from working with the competition dataset. This competition also led to a great paper on a novel technique for neural networks, Entity Embeddings of Categorical Variables, by 3rd place winner Cheng Guo.

For starters, the competition is a great example of working with real-world business data to solve real-world business problems. When learning new techniques, it's often easier to use a nice, clean, well-covered dataset. This has its advantages, not least of which is spending less or no time on tasks like data cleaning and exploratory analysis.

But what if I want to practice my data cleaning and EDA skills? In that case, the closer my data and scenario can approximate a real-world, on-the-job situation the better!


Enter the Rossmann sales competition. Dirk Rossmann GmbH operates more than 3000 drug stores across Europe, and to help with planning they would like to accurately forecast demand for individual stores up to six weeks out.

For the competition, Rossmann provides a training set of daily sales data for 1115 stores in Germany between January 1st, 2013 and July 31st, 2015. Of these 1115 stores, 84% (935) have daily data for every date in the time period. The remaining stores have about 80% of the dates, because they were closed for 6 months in 2014 for refurbishment.

For each store, for each day we are given some basic information including the number of sales, number of customers, whether the store was open that day, whether there was a promotion running, and whether it was a holiday.

In addition to daily data for each store, we have some additional summary information about the store describing what type of store it is, how close the nearest competitor is, when that competitor opened, and whether the store participates in ‘continuing and consecutive’ promotions and when those occur.

The test set mirrors the training features, less ‘Sales’ (the feature competitors are tasked to predict), and spans dates from August 1st to September 17th, 2015.


The Competition

While 3,303 teams entered the competition, there could only be one winner. Luckily for me (and anyone else with an interest in improving their skills), Kaggle conducted interviews with the top 3 finishers exploring their approaches. In addition to these focused blog posts, EDA, discussion, and shared code from competitors are available on the competition forums and scripts/kernels (Kaggle ‘scripts’ were rebranded to ‘kernels’ in the summer of 2016).

Gert Jacobusse finished first, using an ensemble of XGBoost models. Nima Shahbazi finished 2nd, also employing an ensemble of XGBoost models. Cheng Guo and team Neokami Inc. finished third, employing the (at the time) new deep learning package Keras to develop a novel approach to categorical features in neural networks.

Haves and Have-Nots

Competitors are supplied with a good volume of data (1,017,209 samples in the train set), and a modest number of features.

It’s important to note what they’re not given: any sort of product information, sales targets, marketing budgets, or demographic information about the areas around a store. The data is aggregated, and represents a high-level view of each store.

[Image caption: Data science is 90% drawing charts on chalkboards, according to stock photos]

For a moment, put yourself in the shoes of a data scientist at Rossmann.

An enterprise of this size surely has more information available; you could mine sales receipts, inventory, budgets, targets… so many additional sources should be at your fingertips! But they aren’t, which puts you in a good simulation of an all too common scenario: there isn’t time or budget available to collect, mine, and validate all that data. What’s been made available is a good representation of data that is already on hand, validated, and enough to get started.

Without more detailed information available, feature engineering and creative use of findings from exploratory data analysis proved to be critical components of successful solutions.

Exploratory Data Analysis

With relatively few features available, it’s no surprise that the competition winners were able to deeply examine the dataset and extract useful information, identify important trends, and build new features.

Timing is everything

In his winning entry, Gert Jacobusse identified a key aspect of the data as it relates to the problem he was trying to solve. If Rossmann wants predictions from 1 day to 6 weeks out from the present, the degree to which the model can consider recent data comes into question. If the model only ever had to predict 1 or 2 weeks out, it could rely on recent trends combined with some historical indicators. At 6 weeks out, however, any ‘recent trends’ would fall outside the data available at prediction time.

This heavily influenced his feature engineering; he would go on to build features examining quarterly, half-year, full-year, and 2-year trends based on centrality (mean, median, harmonic mean) and spread (standard deviation, skew, kurtosis, percentile splits).
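To make that concrete, here is a rough sketch of window-based trend features for a single store, using only Python's standard library. The window lengths in days, the six-week gap, and the name trend_features are my own assumptions for illustration. This is not Jacobusse's actual feature code, and skew and kurtosis are left out to keep it short.

```python
from statistics import harmonic_mean, mean, median, pstdev, quantiles

# Assumed window lengths in days; the interview names the horizons, not day counts.
WINDOWS = {"quarter": 91, "half_year": 182, "year": 365, "two_years": 730}
GAP_DAYS = 42  # assumed: six weeks of recent data a 6-week-ahead forecast cannot see


def trend_features(daily_sales, windows=WINDOWS, gap=GAP_DAYS):
    """Summarize one store's sales over several look-back windows.

    daily_sales is ordered oldest to newest. The last `gap` days are skipped
    so that every feature would still be available six weeks ahead.
    """
    usable = daily_sales[:max(len(daily_sales) - gap, 0)]
    features = {}
    for name, days in windows.items():
        window = [s for s in usable[-days:] if s > 0]  # ignore closed days
        if len(window) < 2:
            continue  # not enough history for this window
        deciles = quantiles(window, n=10)  # 9 cut points: 10th..90th percentile
        features[f"{name}_mean"] = mean(window)
        features[f"{name}_median"] = median(window)
        features[f"{name}_harmonic_mean"] = harmonic_mean(window)
        features[f"{name}_std"] = pstdev(window)
        features[f"{name}_p10"] = deciles[0]
        features[f"{name}_p90"] = deciles[-1]
    return features
```

Calling trend_features on a store's sales history returns a flat dictionary (quarter_mean, year_std, and so on) that can be joined onto every row for that store before training.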

Be a rebel - keep the nulls

The competition explanation mentions that days and stores with 0 sales are ignored in evaluation (that is, if your model predicts sales for a day with 0 sales, that error is ignored). To most Kagglers, this meant to ignore or drop any days with 0 sales from their training dataset - but Nima Shahbazi is not most Kagglers.

Nima decided to investigate these days. While many showed the obvious result of 0 sales being logged when the store was closed, he started to see trends. One such trend was the abnormal behavior of the Sales response variable following a continuous period of closures. In the interview, Nima highlights a period in 2013 as an example. Looking at a single store, Nima shows that following a 10 day closure the location experienced unusually high sales volume (3 to 5x recent days). Had he simply dropped 0 sales days, his models would not have had the information needed to explain these abnormal patterns.

If I put on my armchair behavioral psychologist hat, I can see that this pattern passes the smell test. I can imagine that if my local CVS was closed for 10 days, the first day it re-opens would be a madhouse, with the entire neighborhood coming in for all the important-but-not-dire items that had stacked up over the last week and a half.
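One way to hand that signal to a model is a feature counting how long a store had been closed just before each day. The sketch below is my own illustration of the idea, not Nima's code, and the name days_closed_before is hypothetical. It only works if the 0 sales rows are kept in the data.

```python
def days_closed_before(daily_sales):
    """For each day, count the consecutive zero-sales days immediately before it.

    daily_sales is one store's sales, ordered oldest to newest, with closed
    days kept in as zeros.
    """
    closed_run = 0
    feature = []
    for sales in daily_sales:
        feature.append(closed_run)  # length of the closure that just ended (or is ongoing)
        closed_run = closed_run + 1 if sales == 0 else 0
    return feature


# A store that closes for three days and then reopens to a rush.
sales = [500, 0, 0, 0, 1800, 520]
print(days_closed_before(sales))  # [0, 0, 1, 2, 3, 0]
```

The reopening day carries a value of 3, so a tree-based model can learn to associate long closures with a jump in sales on the following day.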

Something New Entirely

While many top competitors chose to mine the available data for insights, Cheng Guo and his team chose an entirely new approach. Familiar with embedding methods such as Word2Vec for representing sparse features in a continuous vector space, and the poor performance of neural network approaches on one-hot encoded categorical features, Guo decided to take a stab at encoding categorical feature relationships into a new feature space.

Others in the competition used different methods of extracting information and relationships from structured data, such as PCA and k-means clustering. Guo’s approach proved effective at mapping the feature information to a new space in which the Euclidean distance between points serves as a measure of the relationship between stores. This gave the models a rich representation of the data, and allowed them to make accurate predictions.

Interestingly, Guo used t-SNE to project his team’s embeddings down to two dimensions, and for fun examined the representation of German regions in the embedding space compared to their locations on a map - and found striking similarities.
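To illustrate the distance idea, here is a tiny sketch that finds the store closest to a given store in embedding space. The three-dimensional vectors and store names are made up for the example; real embeddings are learned during training and come out of the network's embedding layers.

```python
from math import dist

# Made-up embeddings for illustration only.
store_embeddings = {
    "store_1": (0.10, 0.80, -0.30),
    "store_2": (0.12, 0.75, -0.28),
    "store_3": (-0.90, 0.05, 0.60),
}


def nearest_store(target, embeddings):
    """Return the other store whose embedding is closest in Euclidean distance."""
    others = (name for name in embeddings if name != target)
    return min(others, key=lambda name: dist(embeddings[target], embeddings[name]))


print(nearest_store("store_1", store_embeddings))  # store_2
```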

[Image caption: Model trains are fun, but won't win you any Kaggle competitions]


It's no surprise that the top two performers in this competition both used XGBoost (extreme gradient boosted trees) to develop their models. Since its release in March 2014, XGBoost has been one of the tools of choice for top Kaggle competitors. XGBoost provides great model performance on structured (tabular) data, the ability to handle incomplete or missing data with ease, and all the benefits of both tree-based learners and gradient descent optimization - all wrapped up in a highly optimized package.

If there’s one thing more popular than XGBoost in Kaggle competitions, it’s ensembling. Why use one model when you can use 3, or 4, or 20 (as was the case with Jacobusse’s winning submission)? Ensembling allows data scientists to combine well-performing models trained on different subsets of features or slices of the data into a single prediction - leveraging the subtleties learned in each unique model to improve their overall scores.

Jacobusse and Nima trained their models on different feature sets and time stretches in their data, to great results. While Jacobusse’s final submission used an ensemble of 20 different models, he found that some of the individual models would have placed in the top 3 by themselves! So while many of his models were highly performant, their combined effect was only a slight lift over their individual performance. Jacobusse combined his models by taking the harmonic mean of their predictions.
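Combining predictions with a harmonic mean takes only a few lines. This is a minimal sketch with made-up numbers and a hypothetical function name, not Jacobusse's code. One property worth knowing: the harmonic mean never exceeds the arithmetic mean, so it leans toward the lower predictions.

```python
from statistics import harmonic_mean


def harmonic_ensemble(model_predictions):
    """Combine predictions row by row with the harmonic mean.

    model_predictions is a list of equal-length lists, one list per model.
    All predictions must be positive.
    """
    return [harmonic_mean(row) for row in zip(*model_predictions)]


# Two hypothetical models predicting sales for the same two store-days.
model_a = [4000.0, 6000.0]
model_b = [6000.0, 6000.0]
blended = harmonic_ensemble([model_a, model_b])
```

For the first store-day the blend comes out at roughly 4800, below the arithmetic mean of 5000. Where the models agree (6000 and 6000), the blend is simply 6000.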

Guo and his team used a feed-forward neural network in combination with their entity embedding technique. Guo’s team was kind enough to share their code on GitHub. They built their models and entity embeddings with Keras (which was new at the time).

From a code standpoint, this makes their approach relatively straightforward. Each categorical feature (store number, day of week, promotion, year, month, day, state) was encoded separately, with the resulting vectors concatenated and fed into a network.

The network itself was a feed-forward network with two hidden layers of 1000 and 500 units (respectively), each with a Rectified Linear Unit (ReLU) activation function, and a single-unit output layer with a sigmoid activation.
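To show how the pieces fit together, here is a framework-free NumPy sketch of the forward pass only: embedding lookup, concatenation, two ReLU layers, and a sigmoid output. It is not the team's Keras code. The category counts, embedding sizes, and weight initialization are placeholders I picked for illustration, and training is omitted entirely. Note that a sigmoid output lies between 0 and 1, so the sales target would need to be rescaled into that range before training; how exactly is not covered here.

```python
import numpy as np

# Assumed (number of categories, embedding size) per feature; illustrative only.
CATEGORICALS = {
    "store": (1115, 10),
    "day_of_week": (7, 6),
    "promo": (2, 1),
    "year": (3, 2),
    "month": (12, 6),
    "day": (31, 10),
    "state": (12, 6),
}


def init_params(rng, hidden=(1000, 500)):
    """Create one embedding table per feature plus the dense-layer weights."""
    tables = {name: rng.normal(0, 0.05, size=(n, dim))
              for name, (n, dim) in CATEGORICALS.items()}
    sizes = [sum(dim for _, dim in CATEGORICALS.values()), *hidden, 1]
    layers = [(rng.normal(0, 0.05, size=(a, b)), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]
    return tables, layers


def forward(tables, layers, batch):
    """batch maps each feature name to an integer index array of shape (n_rows,)."""
    # Look up each feature's vector, then concatenate along the feature axis.
    x = np.concatenate([tables[name][batch[name]] for name in CATEGORICALS], axis=1)
    for weights, bias in layers[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)  # ReLU hidden layers
    weights, bias = layers[-1]
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))  # sigmoid output in (0, 1)


rng = np.random.default_rng(0)
tables, layers = init_params(rng)
batch = {name: np.zeros(4, dtype=int) for name in CATEGORICALS}
predictions = forward(tables, layers, batch)  # shape (4, 1)
```

In a real implementation the embedding tables are trained jointly with the dense layers, which is what lets each store, weekday, or state settle into a useful position in its own vector space.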

Guo’s team trained this architecture 10 times, and used the average of the 10 models as their prediction. While each model used the same features and the same data, ensembling several different trainings of the same model reduced the variance caused by randomization in the training process.

On Beyond Modeling

All three winners used great EDA, modeling, and ensembling techniques - but sometimes that isn’t enough.

In some competitions there can be issues with competitors ‘fitting the leaderboard’ instead of the data; that is, tweaking their models based on the result of submitting their predictions instead of fitting based on signal from the data.

This wasn’t the case with the Rossmann competition winners. In his interview, Jacobusse specifically called out the practice of overfitting the leaderboard and its unrealistic outcomes. Instead, to push his models over the edge, Jacobusse multiplied his predictions by a factor of 0.995 to offset the tendency of his models to slightly overpredict. Others found similar patterns, and in retrospect a factor of 0.985 would have improved Jacobusse’s ultimate score.
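The competition scored submissions with root mean square percentage error (RMSPE), skipping days with 0 sales. The sketch below shows that metric alongside a multiplicative correction like the one described above. The sales figures are made up and the function names are mine; only the 0.995 factor comes from the interview.

```python
from math import sqrt


def rmspe(actual, predicted):
    """Root mean square percentage error, skipping days with zero actual sales."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, predicted) if a != 0]
    return sqrt(sum(terms) / len(terms))


def apply_factor(predicted, factor=0.995):
    """Scale every prediction by a constant to offset systematic overprediction."""
    return [p * factor for p in predicted]


# Made-up example: the model runs 2% high on every open day.
actual = [5000.0, 0.0, 8000.0, 6000.0]
predicted = [5100.0, 300.0, 8160.0, 6120.0]

raw_score = rmspe(actual, predicted)
corrected_score = rmspe(actual, apply_factor(predicted))
```

Here the raw score is about 0.02 and the corrected score about 0.0149. The closed day (actual sales of 0) contributes nothing to either score, which is exactly why most competitors felt safe dropping those rows.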

The Big Takeaways

Looking back on the techniques employed by the winners, there are many tricks we can learn. Taking a step back, and looking at their overall approaches and thought processes, there are a few takeaways that can help in any project or situation:

• Use the question / scenario to guide your usage of the data.

While most approached the competition with the idea of “find the data that helps produce the best model”, Jacobusse considered the problem at hand and was able to engineer his data selection and train/test splits around not using the latest month of data – which not only helped his scores in the end but gave him a testing set he could count on.

• Knowing why data isn’t needed can be more important than just removing it.

Shahbazi didn’t just accept that entries with 0 sales weren’t counted during scoring for the leaderboard. Investigating why the data wasn’t being used and what insight that provided was a key part of his analysis.

• Techniques that work in one domain can be used in others.

Cheng Guo and his team took an established technique (embeddings) commonly used in Natural Language Processing and applied it in a novel manner to a sales problem. They thought outside the box, and discovered a useful technique.

Questions about this blog or just want to talk about data science? Shoot me a message on the Metis Community Slack