What does a data scientist do, exactly? Marketing, for one, often provides a false picture, as Dr. Paul Schaack, former CERN physicist and Senior Data Scientist at Blue Yonder, knows only too well.
The term data scientist is extremely vague, so it’s understandable that it’s often misunderstood. Today it’s overused, and very different people use it to describe their jobs: marketing specialists call themselves data scientists, as do data experts who work with machine learning and programming. However, there’s a world of difference between those two careers, and they can’t be lumped together.
Marketing pros see themselves as data scientists because they look at customer data. But what they’re really looking at is the business, and what they’re doing is pure analysis. That has nothing to do with data science: they don't use any statistical methods to come up with their results or to weight them.
The way I see it, data scientists are a cross between software developers, data analysts, and statisticians. A typical statistician has much more in common with a data scientist than someone from marketing does. This becomes apparent when the “data scientist” from marketing fails to fulfill a specific expectation.
My own career, which began with physics calculations at CERN, is not so far removed from what I do now, namely creating forecasts for commerce and industry. Back then I worked with the NeuroBayes algorithm, the "heart" of predictive analytics, and I still use it now at Blue Yonder.
At that time I had to identify a very small number of events, maybe 70, among millions. This meant I needed an algorithm that predicts the right signals very accurately.
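A rough calculation shows why rare signals demand such accuracy. The sketch below is only an illustration: the pool of 10 million background events and the classifier rates are assumed values, not figures from the CERN analysis, and `precision` is a hypothetical helper.

```python
def precision(n_signal, n_background, true_positive_rate, false_positive_rate):
    """Share of the selected events that are genuine signal."""
    true_positives = n_signal * true_positive_rate
    false_positives = n_background * false_positive_rate
    selected = true_positives + false_positives
    return true_positives / selected if selected else 0.0


# Assumed numbers: 70 signal events hidden among 10 million background events.
N_SIGNAL = 70
N_BACKGROUND = 10_000_000

for fpr in (1e-2, 1e-4, 1e-6):
    share = precision(N_SIGNAL, N_BACKGROUND,
                      true_positive_rate=0.9, false_positive_rate=fpr)
    print(f"false-positive rate {fpr:g}: {share:.1%} of selected events are signal")
```

Under these assumptions, even a classifier that wrongly accepts only one background event in ten thousand still returns a selection that is mostly background.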
In physics and in business alike, I "feed" the algorithm enormous quantities of data. The result is always the same: a probability. In science it can be the probability that a particle decays; in business we forecast, for example, whether or not a customer will spend money.
Either way, we need historical data to “train” the model. To capture seasonal effects, we have to go back at least two years so that we can check our model against historical data and our knowledge of what actually happened in the past. If important historical information is missing, we can’t train our model correctly and in turn can’t forecast future events.
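The two-year requirement can be illustrated with a small backtest: one year supplies the seasonal pattern, and the second year is held back to check the forecast against what actually happened. This is a minimal sketch using a seasonal-naive baseline, not NeuroBayes. The sales figures are invented, and `seasonal_naive` and `mean_absolute_error` are hypothetical helpers.

```python
def seasonal_naive(history, horizon, season_length=12):
    """Forecast each period with the value from the same period one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    start = len(history) - season_length
    return [history[start + (i % season_length)] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between what happened and what was forecast."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented monthly sales: two years, slight growth, a peak every December.
sales = [100 + 5 * (m // 12) + (80 if m % 12 == 11 else 0) for m in range(24)]

train, holdout = sales[:12], sales[12:]   # year 1 to fit, year 2 to check
forecast = seasonal_naive(train, horizon=len(holdout))
print("holdout MAE:", mean_absolute_error(holdout, forecast))
```

With only one year of data there would be nothing left to hold back, so the December peak could be learned but never verified.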
Users in business now expect detailed results. They may want a daily forecast, for example, even though there are only a few sales per week. In that case, there isn’t sufficient data.
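A back-of-the-envelope calculation shows why a daily forecast breaks down here. The sketch assumes that sales follow a Poisson distribution, average three per week, and are spread evenly across the seven days. All three are illustrative assumptions, not properties of any real data set.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


weekly_mean = 3.0              # assumed: three sales per week on average
daily_mean = weekly_mean / 7   # assumed: evenly spread over the week

p_zero = poisson_pmf(0, daily_mean)

# For a Poisson count, standard deviation / mean = 1 / sqrt(mean).
rel_sd_daily = 1 / math.sqrt(daily_mean)
rel_sd_weekly = 1 / math.sqrt(weekly_mean)

print(f"probability of a day with no sale: {p_zero:.0%}")
print(f"relative spread, daily:  {rel_sd_daily:.2f}")
print(f"relative spread, weekly: {rel_sd_weekly:.2f}")
```

Under these assumptions most days see no sale at all, and the random spread of a daily count is larger than its expected value, so aggregating to weekly figures gives the model far more to work with.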
A further important point is data quality and having as many quantitative and categorical data points as possible. As soon as texts created by humans come into play, or any form of sentiment analysis is required, the forecasts are less exact than those based on numbers.
A common misconception about predictive analytics is that events far in the future can be accurately forecast. While forecasts nine or even twelve months ahead are quite possible, they come with substantial uncertainty. In order to make sufficiently robust forecasts about Christmas business one month before the season starts, data from at least the last three Christmas seasons is required.
In order to dispel any misunderstanding and to prevent disappointment, we recommend a one- to two-day workshop at the start of a project. This gives external data scientists the chance to exchange information with the people who know the data best because they work with it every day. We are then able to identify which data projects are right for us; some enterprises would be better served by a standard analysis than by a forecasting project.
Once the potential value of a predictive analytics project has been established, the proof-of-concept phase can begin. We demonstrate the reliability of the Blue Yonder technology by carrying out blind tests to compare it with the methods previously used by the customer.
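In outline, such a blind test might look like the sketch below: both methods submit forecasts before the actual figures are revealed, and both are then scored against the same held-back values. The numbers, the method names, and the choice of mean absolute error as the metric are assumptions for illustration, not a description of the actual Blue Yonder procedure.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and forecast values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def blind_test(actual, forecasts):
    """Score every method on the same held-back actuals; lower error is better."""
    scores = {name: mean_absolute_error(actual, f) for name, f in forecasts.items()}
    best = min(scores, key=scores.get)
    return scores, best


# Invented figures: forecasts were fixed before `actual` became known.
actual = [12, 15, 9, 20, 14]
forecasts = {
    "previous_method": [10, 10, 10, 10, 10],
    "new_model": [11, 16, 10, 18, 15],
}

scores, best = blind_test(actual, forecasts)
for name, score in scores.items():
    print(f"{name}: MAE {score:.2f}")
print("lower error:", best)
```

The important part is the order of events rather than the metric: neither method may see the actual values before its forecast is fixed.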
Today people have a lot of data and feel under pressure to do something with it — they just don't know what. That’s why everyone is now talking about big data.