Data Science in Bilbao: Blue Yonder sharing its experience at EuroPython 2015

IN Events & Awards — 20 July, 2015

BY EuroPython

EuroPython is the largest European Python conference (1200+ attendees in 2014), the second largest Python conference world-wide and a key meeting point for all European programmers, students and companies interested in the Python programming language. Blue Yonder is a passionate user of the Python programming language and has been supporting Python/PyData events for years. Like last year, we have a booth and will share our experience in numerous talks:

Holger Peters − Data Scientist in our algorithms core team

USING SCIKIT-LEARN'S INTERFACE FOR TURNING SPAGHETTI DATA SCIENCE INTO MAINTAINABLE SOFTWARE

Finding a good structure for number-crunching code can be a problem. This especially applies to routines preceding the core algorithms: transformations such as data processing and cleanup, as well as feature construction.

With such code, the programmer faces the problem that their code easily turns into a sequence of highly interdependent operations which are hard to separate. It can be challenging to test, maintain and reuse such “Data Science Spaghetti code”.

Scikit-Learn offers a simple yet powerful interface for data science algorithms: the estimator and composite classes (called meta-estimators). By example, Holger will show how clever usage of meta-estimators can encapsulate elaborate machine learning models into a maintainable tree of objects that is both handy to use and simple to test.
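To illustrate the interface the talk builds on, here is a minimal, self-contained sketch (not Holger's actual code, and without scikit-learn itself): transformers implement `fit`/`transform`, and a Pipeline-like meta-estimator composes them with a final estimator into one maintainable, testable object.

```python
class Scaler:
    """Transformer: rescales values to [0, 1] based on the fitted range."""
    def fit(self, X):
        self.min_, self.max_ = min(X), max(X)
        return self

    def transform(self, X):
        span = (self.max_ - self.min_) or 1
        return [(x - self.min_) / span for x in X]


class Threshold:
    """Estimator: predicts 1 for values above a fixed cutoff."""
    def __init__(self, cutoff=0.5):
        self.cutoff = cutoff

    def fit(self, X):
        return self

    def predict(self, X):
        return [int(x > self.cutoff) for x in X]


class Pipeline:
    """Meta-estimator: chains transformers, then delegates to the estimator."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        *transformers, estimator = self.steps
        for t in transformers:
            X = t.fit(X).transform(X)
        estimator.fit(X)
        return self

    def predict(self, X):
        *transformers, estimator = self.steps
        for t in transformers:
            X = t.transform(X)
        return estimator.predict(X)


model = Pipeline([Scaler(), Threshold()]).fit([0, 5, 10])
print(model.predict([0, 5, 10]))  # scaled to [0, 0.5, 1] -> [0, 0, 1]
```

Each step can be unit-tested in isolation, which is exactly what makes the composed tree of objects easier to maintain than one monolithic preprocessing routine.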

Sebastian Neubauer − Data Scientist responsible for data services
A PYTHONIC APPROACH TO CONTINUOUS DELIVERY

Software development is all about writing code that delivers additional value to a customer. Following the agile and lean approach, the value created by code changes should be delivered continuously, as fast, as early and as often as possible, without any compromise on quality. Remarkably, there is a huge gap between the development of the application code and the reliable and scalable operation of the application.
Sebastian will go through the complete delivery pipeline from application development to the industrial grade operation, clearly biased towards the “DevOps” mindset.  After the talk you will know how to build such a continuous delivery pipeline with open-source tools like “Ansible”, “Devpi” and “Jenkins”.

Stephan Erb − Software Engineer working on the Blue Yonder Data Science Platform
RELEASE MANAGEMENT WITH DEVPI

Devpi is an open source PyPi-compatible package server. Its versatile features make it the Swiss Army knife of Python package and release management, enabling anyone to shape a custom release workflow.
Stephan will detail how we use our company-wide Devpi installation in order to share a large set of packages across teams, deploy binary packages to our application servers, and mix and mash open source packages with our own. With Devpi being a critical part of our release and deployment infrastructure, he will also cover our high-availability setup and how we perform major version updates with minimal downtime.

Christian Trebing − Software Engineer responsible for Big Data APIs
BUILDING A MULTI-PURPOSE PLATFORM FOR BULK DATA USING SQLALCHEMY

At Blue Yonder, we’ve built a platform that can accept and process bulk amounts of data for multiple business domains (e.g. handling retail store location and sales data) using SQLAlchemy as a database abstraction layer. We wanted to use as much of SQLAlchemy as possible, but we quickly found that the ORM (Object Relational Mapper) is not suitable for handling large amounts of data at once.
Christian will demonstrate an application architecture for multiple business domains and show how domain configuration is used to ease the implementation of new APIs.
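The core issue with ORMs and bulk data is per-object overhead. The post doesn't include the platform's code, so this stdlib-only sketch (sqlite3 instead of SQLAlchemy, with an invented `sales` schema) just shows the batching idea: stream rows as plain tuples and insert them in one batched statement, much as SQLAlchemy Core does with `connection.execute(insert_stmt, rows)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, day TEXT, amount REAL)")

# Bulk data arrives as plain tuples, not as one mapped object per row.
rows = [("store-1", "2015-07-01", 120.0),
        ("store-1", "2015-07-02", 98.5),
        ("store-2", "2015-07-01", 210.3)]

# One batched statement for the whole chunk instead of one INSERT per object.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

count, = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
print(count)  # 3
```

Skipping object construction and unit-of-work bookkeeping is what makes the Core-style path viable for bulk loads where the ORM is not.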

Patrick Mühlbauer − Software Engineer responsible for Big Data APIs
BUILDING NICE COMMAND LINE INTERFACES − A LOOK BEYOND THE STDLIB

One of the problems programmers are most often faced with is the parsing and validation of command-line arguments. Patrick will give you an overview of some popular alternatives to the standard library solutions (e.g. click, docopt and cliff), explain their basic concepts and differences and show how you can test your CLIs.
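As a taste of one of the libraries covered, here is a small hypothetical command built with click (the command name and options are made up for illustration); click also ships a `CliRunner` that lets you test such commands in-process, without spawning a subprocess.

```python
import click


@click.command()
@click.argument("name")
@click.option("--shout", is_flag=True, help="Print the greeting in caps.")
def greet(name, shout):
    """Greet NAME on the command line."""
    message = "Hello, {}!".format(name)
    click.echo(message.upper() if shout else message)


if __name__ == "__main__":
    greet()
```

Compared to argparse, the decorator style keeps the argument declarations next to the function that consumes them.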

Florian Wilhelm − Data Scientist tailoring Predictive Applications for customers
"IT'S ABOUT TIME TO TAKE YOUR MEDICATION!" OR HOW TO WRITE A FRIENDLY REMINDER BOT

Florian will show how to use the SleekXMPP library to write a small chatbot that connects to Google Hangouts and reminds you, or someone else, to take medication, for instance. He will elaborate on how to use an event-driven library to write a bot that sends scheduled messages and waits for a proper reply.
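The actual bot uses SleekXMPP; as a library-free sketch of the underlying pattern, the stdlib `sched` module can drive scheduled reminders from an event loop. A fake clock stands in for real time here so the example runs instantly.

```python
import sched


class FakeClock:
    """Simulated time source so the scheduler runs without real delays."""
    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


sent = []


def remind(message):
    # A real bot would send an XMPP message to the user here.
    sent.append(message)


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)
for hour in (8, 14, 20):  # three reminders over one simulated day
    scheduler.enter(hour * 3600, 1, remind, ("Time to take your medication!",))
scheduler.run()
print(len(sent))  # 3
```

In the real bot, the event-driven library's own loop replaces `scheduler.run()`, and the reply handling hangs off incoming-message events instead of a callback list.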

Peter Hoffmann − Software Engineer working on the Blue Yonder Data Science Platform
PYSPARK − DATA PROCESSING IN PYTHON ON TOP OF APACHE SPARK

Apache Spark is a computational engine for large-scale data processing. It is responsible for scheduling, distributing and monitoring applications that consist of many computational tasks across many worker machines in a computing cluster.

Peter will give an overview of PySpark with a focus on Resilient Distributed Datasets and the DataFrame API. While Spark Core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. It defines an API for Resilient Distributed Datasets (RDDs). RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
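Running PySpark needs a Spark cluster, so this plain-Python sketch only mirrors the shape of the RDD programming model: a chain of map/filter transformations followed by an action. In PySpark the equivalent would look like `sc.parallelize(data).filter(...).map(...).reduce(...)`, with the work distributed across the cluster.

```python
from functools import reduce

data = range(1, 11)

# Transformations: keep even numbers, square them (lazy generators here,
# distributed partitions in Spark).
squares_of_evens = map(lambda x: x * x, filter(lambda x: x % 2 == 0, data))

# Action: fold the results down to a single value.
total = reduce(lambda a, b: a + b, squares_of_evens)
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```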

Moritz Gronbach − Data Scientist tailoring Predictive Applications for customers

WHAT'S THE FUZZ ALL ABOUT? RANDOMIZED DATA GENERATION FOR ROBUST UNIT TESTING

In static unit testing, the output of a function is compared to a precomputed result. Even though such unit tests may apparently cover all the code in a function, they might cover only a small subset of the function's behaviours. This potentially allows bugs such as Heartbleed to stay undetected. Dynamic unit tests using fuzzing, which allows you to specify a data generation template, can make your test suite more robust.

Moritz will demonstrate fuzzing using the hypothesis library. Hypothesis is a Python library to automatically generate test data based on a template. Data is generated using a strategy. A strategy specifies how data is generated, and how falsifying examples can be simplified. Hypothesis provides strategies for Python’s built-in data types, and is easily customizable.
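Hypothesis generates and shrinks examples automatically; this stdlib-only sketch (with an invented run-length codec as the code under test) captures just the core idea, checking a round-trip property against many randomly generated inputs rather than one precomputed fixture.

```python
import random


def run_length_encode(s):
    """Encode a string as [[char, count], ...] runs."""
    encoded = []
    for ch in s:
        if encoded and encoded[-1][0] == ch:
            encoded[-1][1] += 1
        else:
            encoded.append([ch, 1])
    return encoded


def run_length_decode(encoded):
    return "".join(ch * count for ch, count in encoded)


random.seed(0)  # reproducible fuzzing run
for _ in range(200):  # 200 random strings instead of one static fixture
    s = "".join(random.choice("ab") for _ in range(random.randrange(20)))
    # Property: decoding an encoding must return the original string.
    assert run_length_decode(run_length_encode(s)) == s
```

With Hypothesis, the loop and generator collapse into a strategy (e.g. a strategy producing text), and failing examples are automatically minimized to the smallest falsifying input.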

We're passionate about Big Data. You too? Then meet us at our booth, talk to us during the social event on Wednesday or have a look at our career opportunities.

Blue Yonder

We enable retailers, consumer products and other companies to take a transformative approach to their core processes, automating complex decisions that deliver higher profits and customer value using artificial intelligence (AI).