Democratizing Big Data and Predictive Analytics

IN Big Data General — 04 December, 2013

What should I do in the Big Data space? This is a question I often hear from C-level executives in organizations of all sizes. Our answer at Blue Yonder is to explore the predictive analytics space. Companies can use modern analytic tools to examine big data to gain insights that improve business efficiency, spot trends and opportunities, and predict how customers, suppliers and competitors will behave in the future. Predictive analytics is at the core of monetizing data, and it is where Big Data has its roots.

The origin of Big Data

Big data describes massive amounts of data, typically coming from social media such as Facebook or Twitter, or from smart machines such as satellites, mobile phones, audio and video equipment, payment terminals, CCTV and sensors. Large internet companies including Google, Facebook, Twitter and Yahoo paved the way for Big Data. Engineers at these companies developed their own technologies, or developed technologies in open source projects, to manage and process data volumes far greater than ever encountered before.

Why did they do this? Because they needed the technology to commercially leverage big data and monetize their service offerings. The services offered by internet companies are free of charge to end users, which means revenue streams need to come from elsewhere. Targeted advertising is the answer. Tracking user preferences, behaviors and actions provides the underlying data that enables targeted advertising at large scale. Presenting the ad that is most relevant to the end user is what makes these companies successful. Collecting and managing the data from millions of user interactions, mining it for patterns, predicting the behavior of end users and taking automated decisions on which ad to display is the machinery that keeps these internet businesses running. The benefits are so significant that it is impossible to imagine a world where the data capture and ad placement process is done in a manual, non-automated, non-data-driven way.

So we have it confirmed: Big Data is here to stay, and internet companies are using it to be dynamic, innovative and deliver more effective services. Now, thanks to the democratization of big data (opening up data analytics to mainstream information workers and developers, as well as data scientists and engineers), the possibilities are limitless, with companies and entire industries being transformed through data discovery. Successful internet companies share the concepts and technologies they have implemented (e.g. MapReduce [1], Bigtable [2], Cassandra [3]). These concepts have been reimplemented in open source projects (Hadoop) or released as open source right away (Cassandra).

Imitating the winners

The availability of these technologies has sparked a wave of interest and innovation. We are having lively discussions with people asking: ‘are you using Hadoop?’, or ‘are you integrating social media data into your predictions?’ We often find that the following reasoning underlies these questions: ‘Google is successful; they use MapReduce; they include social media data in their analysis. If you do not use this technology or data feed, how can I be as successful with you?’

"We thought that we had the answers, it was the questions we had wrong." – Bono by hds is licensed under CC BY 2.0

So what are the right questions to ask? Hal Varian, Chief Economist at Google, published a paper on computer-mediated transactions back in 2005 [4]. He observes that today, "most economic transactions involve a computer". While "the record-keeping role was the original motivation for adding the computer to the transaction", the data collected can now be put to additional uses. Varian identifies four main categories of uses that computer-mediated transaction data can facilitate:

  • Controlled experiments - the most prominent are A/B tests in web environments, where each variant is an experiment carried out against a control group, for example measuring the success of a new check-out process.
  • New forms of contracts - new contract forms become possible because data is being recorded that was not recorded before. This data (e.g. on proper or improper usage of a service or good) can be used to offer new contracts. Think of The Climate Corporation [TODO], which was just acquired by Monsanto; it offers insurance to farmers to hedge against crop-devastating weather conditions.
  • Data extraction and analysis - add business value by providing business insights. These can be generic BI tools for the B2B sector, specific tools like website traffic analysis, but also B2C applications like the emerging quantified-self movement (e.g. capturing data on your personal activity and sleep patterns).
  • Personalization and customization - covers personalization of content, recommendations, and prices.
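
For the controlled-experiments category, the classic check is whether the uplift between variant and control is statistically significant. Below is a minimal sketch of a two-proportion z-test for an A/B test of a new check-out process; all numbers are hypothetical, and in practice you would use a proper statistics library rather than this hand-rolled version:

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment (illustrative sketch)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se                        # z-score of the uplift

# Hypothetical numbers: old check-out (A) vs. new check-out (B)
z = ab_test_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(round(z, 2))  # |z| above roughly 1.96 indicates significance at the 5% level
```

With these made-up figures the uplift is significant; with smaller samples the same rates would not be, which is exactly why the experiment must be designed with adequate sample size up front.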

The categories defined by Varian are a great starting point for generating questions that are relevant to your business’ journey towards Big Data and predictive analytics. Look for the computer-mediated transactions at the core of your business and ask if there is potential for controlled experiments, new forms of contracts, data analysis, and personalization/customization.

From imitation to interpretation

Starting with your set of questions, you will quickly realize that questions regarding technology come second. If you identify a possibility to harvest value from a small amount of structured data, there is no reason to start talking about Hadoop. If your potential lies with large amounts of content optimized for human readability (often called "unstructured" data), picking a MapReduce-based technology could be a very good choice.
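
A MapReduce job is, at heart, a map phase emitting key-value pairs, a shuffle grouping them by key, and a reduce phase aggregating each group. The following in-process sketch shows the canonical word-count pattern that frameworks like Hadoop distribute across a cluster; it is for illustration only and is not tied to any particular framework's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (word, 1) pairs for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values; here, simple summation.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is here to stay", "big data has its roots in analytics"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
print(counts["big"], counts["data"])  # 2 2
```

The value of the model is that map and reduce are independent per key, so the framework can parallelize them over many machines and re-run failed pieces without restarting the whole job.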


„Barcelona Guitar solo“ by David Blaikie is licensed under CC BY 2.0


We often find companies are ignoring the data troves collected over years because they do not fit into the Hadoop pattern. Instead, our recommendation is to have a close look at the heart of your business' "computer-mediated transactions". These transactions could be interactions with customers, managing and dispatching staff, selling goods (online), managing supply-chain logistics, producing goods, keeping machinery and/or facilities productive. You can be sure you will find intelligence and powerful insights from data within your existing IT systems. This is your data and the competition has no access to the value you generate from it.

From backward looking to forward thinking

Transactional data allows for descriptive analytics such as reporting, dashboards or more sophisticated data mining methodologies. Reporting is good and has been done in companies for decades. But despite the work that remains in descriptive analytics, today's business environment is more demanding. In our customer engagements we find that looking at past data quickly brings up the question of what will happen in the future, and which decision we would take if we knew about it today. Predictive analytics is the key enabler. At Blue Yonder we are confident that predictive analytics is the main driver in generating value from Big Data in this era of agility and innovation. Customization or personalization is unthinkable without predicting the likelihood of potential future outcomes. Each recommendation in an online store makes assumptions about the expected preferences of a visitor; each search result delivered by a search engine is carefully optimized towards the predicted end-user need. Why are enterprise transactions not based on the same principles?

  • Recommendations to sales staff on the customer to visit or the product to offer
  • Replenishment of goods based on fine-granular, predicted demand
  • Call-center activities proposed based on customer affinity to specific offerings
  • Energy hedging in energy-intensive industries based on forecasted electricity demand
  • Spare part allocation based on predicted demand
  • Insurance contracts personalized based on predicted personal risks
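
To make the replenishment scenario concrete, here is a minimal sketch of an order-up-to rule driven by a demand prediction. The naive moving-average forecast and the safety factor are illustrative stand-ins for a real predictive model and inventory policy:

```python
def forecast_demand(history, window=3):
    """Predict next-period demand as the mean of the last `window` periods.

    A real system would use a trained model; this moving average is
    only a placeholder to show where the prediction plugs in.
    """
    recent = history[-window:]
    return sum(recent) / len(recent)

def replenishment_order(history, on_hand, safety_factor=1.2):
    """Units to order so that stock covers predicted demand plus a buffer."""
    predicted = forecast_demand(history)
    target = predicted * safety_factor       # demand plus safety stock
    return max(0, round(target - on_hand))

sales = [100, 120, 110, 130, 125]            # hypothetical daily unit sales
order = replenishment_order(sales, on_hand=60)
print(order)
```

The decision logic is trivial once the prediction exists; the hard part, and the part where prediction quality pays off directly, is the forecast feeding it.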

The list can be extended industry by industry, vertical by vertical. What do these scenarios have in common?

  • They run operational processes with thousands, millions or billions of transactions per day
  • They have the potential to automate (parts) of the operational process decisions

Optimizing each decision has a direct influence on a company’s top and bottom line - with every improved decision you grow your revenue and incur lower costs. In addition, these scenarios all depend on knowledge of the future to improve or automate decision-making. Predictive analytics is the key concept that allows you to inform or automate decisions. Depending on the scale of the decisions to be taken, automation is the only way forward. Imagine search engine giants employing humans to prepare search results for your search query, or a large online retailer providing manually crafted product recommendations.

From data to running services

"It's not that they can't see the solution. They can't see the problem." - G.K. Chesterton

So what does it take to put your scenario into production use? First, you will need a team that brings data science, software engineering, and operations experience to the table. If you are interested in quick results, make sure they can start working together quickly. Building data-driven services adds an additional dimension of complexity to software-engineering projects that is best tackled by having the team members collaborate closely. If your team starts exploring predictive analytics approaches like machine learning, they will find that it is not just about writing software that works as expected, but about writing software that derives its behavior from the data input. Over time it is not just the code base that changes but also the data input. To a certain degree this can be compared to adding a self-reconfiguring part to your software solution. Creating dependable, data-driven solutions requires skills and experience beyond software engineering – it requires data science knowledge.

Expect the data set used to start your project to evolve over time. In many cases we find that the data set we started with doesn’t deliver on expectations. The data quality might be poor, so that historical data needs to be extracted again or massaged; the data quantity might be insufficient, so that additional data sources need to be taken into account. All these circumstances make an iterative approach a necessity.

In terms of practical data wrangling, you may run into obstacles that slow you down. Missing knowledge of where to find internal data sources, or missing approval to access that data, are typical situations that slow down initiatives for data-driven services. Including external data sources is even more daunting. APIs and data services are abundant, but your team will need to check and test whether the data can be aligned with the internal data and whether adding external data sources delivers better results, and negotiate contracts with external data suppliers.
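
The data-quality and data-quantity issues just described can be caught early with simple automated checks that gate (re)training. The following sketch is illustrative only: the thresholds, record layout, and function names are assumptions, not a prescription:

```python
# Illustrative quality gate run before (re)training a model.
# MIN_ROWS and MAX_MISSING_RATIO are made-up thresholds; real values
# depend on the model and the business process behind the data.
MIN_ROWS = 1_000
MAX_MISSING_RATIO = 0.05

def quality_report(rows):
    """rows: list of dicts from some extraction step; None marks a gap."""
    total = len(rows)
    missing = sum(1 for r in rows for v in r.values() if v is None)
    cells = total * len(rows[0]) if rows else 0
    ratio = missing / cells if cells else 1.0
    return {
        "enough_data": total >= MIN_ROWS,
        "missing_ratio": ratio,
        "usable": total >= MIN_ROWS and ratio <= MAX_MISSING_RATIO,
    }

# Tiny hypothetical extract: too few rows, and one missing value.
sample = [{"demand": 10, "price": 1.0}, {"demand": None, "price": 1.1}]
report = quality_report(sample)
print(report["usable"])  # False
```

Running such checks on every extraction makes the iterative loop explicit: when a gate fails, the next iteration is about fixing the data, not tuning the model.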
Once a predictive model is established, it needs to be robust enough for productive use. A demonstration or prototype might represent just 10% of the effort required. The typical issues encountered on the way to production-readiness are:

  • Lack of stability - the solution works in the labs but cannot be operated in production
  • Lack of scalability - the solution works for a subset of data but does not scale to real-world data sets
  • Lack of precision - the predictions are not sufficiently precise
  • Lack of trust - the end users do not accept the data magic

We have seen these issues again and again and have developed a solution to avoid the uncertainties of one-off projects in the data-driven service arena.

From data to predictive applications


All information at a glance: disposition with Forward Demand.


We see predictive applications as the natural next step to democratize the use of predictive analytics and Big Data. Only a few companies, with sufficient R&D budget, are likely to invest heavily in predictive analytics and Big Data to enable their own scenarios. The majority of companies will be more inclined towards product offerings that allow them to avoid most of the risks associated with individual projects.

At Blue Yonder we build our predictive applications with an incredible team consisting of world-class data scientists, IT-infrastructure specialists with a SaaS background, brilliant user-experience designers, and experienced software engineers. Together they design, create and operate predictive applications offered as SaaS. We make applications relevant to end users by focusing on a clear business context and the needs and goals of the users. We encode our industry experience into easily consumable predictive applications and make these applications available to a broad market of customers.

We see SaaS as the natural delivery model for predictive applications. Customers benefit immediately from improvements in our predictive models, and can easily add external data sources to augment and improve their predictive applications. We operate the services on a dependable scale-out architecture for our customers, relieving them of the non-trivial task of setting up and operating such an infrastructure.

Predictive applications delivered as a service – this is how Blue Yonder makes Big Data and predictive analytics available to companies around the world. With Forward Demand we have just launched the first application in a family of predictive applications. Expect more to come, enriched by data services offered and pre-integrated by Blue Yonder.


Jan Karstens

Jan, CTO, is responsible for product and technology development and thus for the “heart” of the Blue Yonder software. He has over 15 years of experience in software engineering and in software architecture as well as in modern database technologies.