As a SaaS provider, we at Blue Yonder care about the full lifecycle of predictive applications. It is not enough to build, train and run machine learning models once; we have to continuously operate and maintain them in order to keep pace with the evolving businesses of the customers we are empowering.
From a data science perspective, maintenance work can entail monitoring data and prediction quality, A/B testing of model improvements and ad hoc inquiries into data and predictions. While our data scientists care about statistics, they should not be concerned with technicalities like maintenance of network equipment, operating system updates, or even hardware failures. In order to shield our data scientists from these matters, we are investing in the Blue Yonder data science platform.
One part of this platform is our private compute cloud. It allows our data scientists to run multiple machine learning projects on the same physical cluster, thereby greatly improving our hardware utilization.
The front end of our compute cloud consists of a domain-specific UI and REST interface. The back end is built on top of Apache Aurora, a battle-tested service scheduler for the Apache Mesos cluster manager. Aurora excels at starting processes on a cluster and keeping them alive, even in the presence of hardware and software failures.
By using Aurora we gain a clear separation of concerns:
- Data scientists specify what they want to run using the UI or a set of Ansible playbooks.
- Our middleware translates this into a specification of how it will be run (e.g. the number of instances, the destination of log files, etc.).
- Aurora then manages the placement and execution on the cluster, including the supervision of the spawned processes.
We are using Aurora to run cron jobs, Python-based microservices, and our in-house distributed batch processing framework for machine learning. Each instance of the latter consists of a master and multiple worker processes. We spawn the framework as two kinds of Aurora jobs, one for the master and one for the workers.
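To make this concrete, here is a sketch of what such a master/worker job specification could look like in Aurora's Python-based configuration DSL. The cluster, role, resource sizes, and command lines are hypothetical placeholders, not our actual production values:

```python
# jobs.aurora — hypothetical sketch of a master + workers specification

master_proc = Process(
    name='master',
    cmdline='run-framework --role master')          # placeholder command

worker_proc = Process(
    name='worker',
    cmdline='run-framework --role worker')          # placeholder command

jobs = [
  # One master instance ...
  Service(cluster='example', role='datascience', environment='prod',
          name='ml_master', instances=1,
          task=Task(processes=[master_proc],
                    resources=Resources(cpu=2, ram=4*GB, disk=8*GB))),
  # ... and a separately scalable pool of workers.
  Service(cluster='example', role='datascience', environment='prod',
          name='ml_workers', instances=16,
          task=Task(processes=[worker_proc],
                    resources=Resources(cpu=4, ram=8*GB, disk=16*GB))),
]
```

Keeping master and workers in separate jobs is what lets us later resize the worker pool independently, without touching the master.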
When multiple batch processing frameworks compete for resources, we use Aurora's job updater to dynamically adjust the number of worker instances per framework according to load, fairness, and other SLA criteria. For our data scientists, this elasticity translates into faster iteration cycles for their experiments, as they benefit from the aggregate power of the shared compute infrastructure.
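The allocation logic behind such rebalancing can be illustrated with a minimal sketch. The function below is an assumption for illustration, not our actual scheduler: it splits a fixed pool of worker instances among competing frameworks in proportion to their demand (the resulting counts would then be applied via Aurora job updates):

```python
def fair_shares(total_workers, demands):
    """Split a pool of worker instances among competing frameworks,
    proportionally to each framework's demand, never exceeding it."""
    shares = {name: 0 for name in demands}
    remaining = total_workers
    while remaining > 0:
        # Frameworks that still want more instances than they have.
        candidates = [n for n in demands if shares[n] < demands[n]]
        if not candidates:
            break  # every framework is saturated
        # Hand one instance to the framework furthest below its
        # proportional share (a simple max-min fairness heuristic).
        n = min(candidates, key=lambda n: shares[n] / demands[n])
        shares[n] += 1
        remaining -= 1
    return shares

# Two frameworks demand 8 and 4 workers, but only 10 are available:
print(fair_shares(10, {'alpha': 8, 'beta': 4}))
```

Real SLA criteria (priorities, deadlines, preemption costs) would of course complicate this, but the principle stays the same: compute target instance counts centrally, then let Aurora's updater converge each job to its target.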
For us as operators, adopting Aurora as a replacement for a custom cluster scheduler has improved the resilience and maintainability of our platform. When we want to update the kernel of a host, we can smoothly drain all tasks running on it without any impact on our users. Furthermore, by using a mediator in front of Aurora, we gain the consistency and standardization required for seamless operations. For example, we know exactly which software versions are currently running on the cluster, which is a crucial ingredient of our compliance and security standards.
We hope you have enjoyed this glimpse into our stack. If you have any questions, please talk to us. We’ll be at EuroPython 2015 in Bilbao (20–26 July 2015). See you there!