Predictions as a Probability Density Distribution

In Machine Learning — 20 July, 2015

Predicting a quantity (regression) in predictive and prescriptive analytics requires the prediction of a full probability density distribution to capture all relevant aspects of a specific scenario. This blog post will explain why knowledge of the full distribution is important in order to master the corresponding business case.

Predictive and prescriptive analytics use actionable data to support or automate decision making. Many use cases require the prediction of a quantity, i.e. a regression. For example, a supermarket manager needs to know how many products will be sold on any given day in any location in order to optimize the replenishment process, an insurer might want to know the size of the next claim, a tour operator will optimize the travel requirements based on the expected number of passengers, etc.

It is important to note that the prediction itself is rarely the quantity one is ultimately interested in but the actions based on these predictions.

Unfortunately, it's not as simple as predicting just a single number (e.g. “5 apples for store x on Monday”); any prediction in a realistic scenario is a combination of a deterministic part and a stochastic part. A detailed analysis of the former allows us to determine the “laws of nature” for any given process very accurately, whereas the stochastic part models the random fluctuations (also: volatility), which ultimately limits the precision of the predictions.

Moving Toward a Probability Density Distribution

Let's illustrate this with a simple thought experiment. Five customers each leave their respective homes in the morning to buy one or more units of a specific product, e.g. a can of soda. At the end of the day, seven cans will have been sold, as illustrated below:
In this thought experiment, the same day is now repeated with the same customers and the same initial conditions. This time, however, we introduce a set of random factors that influence the result as follows:

  • Some customers may forget to buy the product,
  • Some may buy two or more even if they didn't plan on doing so (e.g. because of a promotion they weren't aware of),
  • Some may not be able to go shopping at all as they're stuck in traffic or a meeting, etc.

By repeating the experiment many times, a distribution of the purchases will appear as shown below:
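This repeated experiment can be sketched as a small Monte Carlo simulation. The probabilities below are purely illustrative assumptions, not values from the original post: each customer intends to buy one can, but may forget, be prevented from shopping, or buy two because of a promotion.

```python
import random

random.seed(42)

def simulate_day(n_customers=5):
    """Simulate one day: each customer intends to buy one can, but
    random factors change the outcome (hypothetical probabilities)."""
    total = 0
    for _ in range(n_customers):
        r = random.random()
        if r < 0.15:
            total += 0   # forgets to buy, or cannot make it to the store
        elif r < 0.30:
            total += 2   # buys two because of an unexpected promotion
        else:
            total += 1   # buys one can as planned
    return total

# Repeat the day many times and tally the outcomes into a distribution.
counts = {}
for _ in range(10_000):
    sold = simulate_day()
    counts[sold] = counts.get(sold, 0) + 1

distribution = {k: v / 10_000 for k, v in sorted(counts.items())}
```

Plotting `distribution` as a bar chart reproduces the kind of histogram described above: a spread of daily totals around the originally planned five cans.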

Note that in the example above, a very simple symmetric function has been selected to illustrate the general idea. In any realistic scenario the resulting curve will be asymmetric, likely bound by zero on the lower end but exhibiting long tails towards high numbers.

The illustration above can now be turned around. Instead of looking at the distribution of historic purchases, one might ask the following question: Given the recorded data, what is the a priori probability density distribution for the sale of this item? In other words, how likely is it that 1, 2, 3, …, 8, 9 or more units of this item will be sold on any given day and in any specific store? This means that the distribution of historic events is used to describe the behavior of a single item.
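In the simplest case, this "turning around" amounts to taking relative frequencies of the recorded data as an empirical distribution. A minimal sketch, using a hypothetical sales record:

```python
from collections import Counter

# Hypothetical record of daily sales for one item in one store.
historic_sales = [5, 7, 6, 4, 7, 8, 5, 6, 6, 7, 9, 5, 6, 8, 7, 6, 4, 7, 6, 5]

# The relative frequency of each observed quantity serves as an
# empirical probability distribution for future sales of this item.
counts = Counter(historic_sales)
n = len(historic_sales)
pmf = {quantity: count / n for quantity, count in sorted(counts.items())}

for quantity, probability in pmf.items():
    print(f"P(sell {quantity} units) = {probability:.2f}")
```

Real predictive models go beyond raw frequencies, of course, but the resulting object is the same: a probability for every possible outcome, not a single number.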

However, this is not the same as asking what quantity of this item is most likely to be sold, how many units are sold on average, or what the median sale is. The answers to those questions are derived from the full probability density function and reduce the total amount of information contained in it to a single number. The next blog post in this series will focus on this aspect in detail.
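The difference between these summaries is easy to see on a concrete distribution. Below, a hypothetical (made-up) distribution is reduced to its mode, mean, and median; each is a single number extracted from the same full density:

```python
# A hypothetical discrete probability distribution over units sold.
pmf = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.25, 4: 0.20, 5: 0.12, 6: 0.08}

# Mode: the single most likely quantity.
mode = max(pmf, key=pmf.get)

# Mean: the expected quantity, averaged over all outcomes.
mean = sum(q * p for q, p in pmf.items())

# Median: the smallest quantity at which the cumulative
# probability reaches one half.
cumulative = 0.0
for q in sorted(pmf):
    cumulative += pmf[q]
    if cumulative >= 0.5:
        median = q
        break
```

For an asymmetric, long-tailed distribution, these three numbers generally differ, and none of them alone recovers the shape of the distribution they were computed from.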

Predictions as a Probability Density Distribution

Any prediction of future events, in the context of predictive and prescriptive analytics, also needs to produce a full probability density distribution. Simply producing a number, such as the mean (which is frequently provided by simple predictive tools), is not sufficient to optimize the actions derived from the predictions. For example, a supermarket has to balance the on-shelf availability of any given product against the write-off rate at which the product has to be taken off the shelf if it is not sold, which is of particular importance in the case of perishable goods. If only the mean sales figure is available as a prediction, there is no way to estimate the write-off rate and on-shelf availability based on the specific behavior of this product. Adding a second number, such as the volatility, does not improve the situation much: while some estimate of the expected fluctuations is then known, the details are still “hidden” in the exact shape of the full probability density distribution.
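This trade-off can be made concrete. Given the full distribution, both the stockout probability and the expected write-off follow directly for any candidate stock level; the mean alone determines neither. The demand distribution below is a hypothetical example with the asymmetric, long-tailed shape discussed above:

```python
# Hypothetical asymmetric demand distribution (long tail towards high sales).
pmf = {0: 0.02, 1: 0.08, 2: 0.18, 3: 0.22, 4: 0.18, 5: 0.12,
       6: 0.08, 7: 0.05, 8: 0.04, 9: 0.02, 10: 0.01}

def stockout_probability(pmf, stock):
    """Probability that demand exceeds the stocked quantity,
    i.e. the product is not available on the shelf for some customers."""
    return sum(p for demand, p in pmf.items() if demand > stock)

def expected_write_off(pmf, stock):
    """Expected number of unsold units that have to be written off."""
    return sum(p * (stock - demand) for demand, p in pmf.items() if demand < stock)

# Scan candidate stock levels to see the availability/write-off trade-off.
for stock in (3, 5, 7):
    print(f"stock {stock}: "
          f"stockout {stockout_probability(pmf, stock):.2f}, "
          f"write-off {expected_write_off(pmf, stock):.2f}")
```

Raising the stock level lowers the stockout probability but raises the expected write-off; choosing the best balance requires exactly the shape information that a single mean value discards.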

Advanced predictive models (such as NeuroBayes®) always provide the full probability density distribution, which can then be used to derive the best point estimator for a given scenario.

Next time:

The next blog post in this series will focus on the role of the cost function and how it is used to derive the optimal point estimator for a given scenario from a predicted probability density function.

Dr. Ulrich Kerzel

earned his PhD under Professor Dr Feindt at the US Fermi National Laboratory and at that time made a considerable contribution to the core technology of NeuroBayes. He continued this work as a Research Fellow at CERN before joining Blue Yonder as a Principal Data Scientist.