[Pycon] [new paper] "Zubin John" - Automating massively-scalable operational pipelines with AutoML and Apache Airflow
info a pycon.it
Wed 14 Nov 2018 10:54:59 CET
Title: Automating massively-scalable operational pipelines with AutoML and Apache Airflow
Duration: 60 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk
Abstract: **Highlights**
- Repetitive human tasks (flagging, tagging, annotating) can drain organizational resources, especially when demand for them grows without bound
- A novel framework that automates simple binary decisions with supervised classification pipelines
- Demonstrates the design, implementation and best practices for maintaining big-data automation pipelines as code
- On-demand scaling, reusability and version control
- Tightly coupled Human-in-the-loop (HITL) system for validation and monitoring of model health
**Motivation**
The business problem of mapping simple yet repetitive real-world human tasks (labeling, tagging, sorting) to an automated system is far from trivial. Even with the assistance of supervised learning, there are technical challenges that can cripple such a project in its early stages. Are the models up to date with fresh incoming data? Are they tuned for optimal performance? Will they continue to deliver results at the desired level? Are they scalable?
However simple it is, a model is stale as soon as it encounters its first real-world data point. With this axiom in mind, our talk emphasizes the use of tools like Airflow and PySpark to automate big-data pipelines for machine learning. We present a barebones project framework that maintains the entire machine learning and deployment workflow as code, offering reusable, testable, versionable and scalable modular components. The talk will briefly walk through the use case of a modular pipeline for a common e-commerce application. It is aimed at intermediate data scientists and data engineers who want complete ownership of their data science deployments and want to stop depending on third-party model hosting and management services.
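To make the "workflow as code" idea concrete, here is a minimal sketch in plain Python. It is not the framework from the talk: `Step`, `ordered`, `run_pipeline`, `score`, `route` and the `review_band` setting are all hypothetical names chosen for illustration. Each `Step` stands in for what would be a task in an Airflow DAG, and the `route` step shows one possible form of human-in-the-loop validation, where uncertain binary predictions are queued for a person instead of being decided automatically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reusable pipeline unit (hypothetical; plays the role of an Airflow task)."""
    name: str
    run: Callable[[dict], dict]      # reads the shared context, returns new keys
    upstream: Tuple[str, ...] = ()   # names of steps that must finish first


def ordered(steps: List[Step]) -> List[Step]:
    """Return steps in dependency order; raise ValueError on a cycle."""
    by_name = {s.name: s for s in steps}
    done: List[Step] = []
    state: Dict[str, str] = {}       # name -> "visiting" | "done"

    def visit(name: str) -> None:
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            raise ValueError(f"dependency cycle at {name!r}")
        state[name] = "visiting"
        for dep in by_name[name].upstream:
            visit(dep)
        state[name] = "done"
        done.append(by_name[name])

    for s in steps:
        visit(s.name)
    return done


def run_pipeline(steps: List[Step], context: dict) -> dict:
    """Run every step in dependency order, merging each result into the context."""
    ctx = dict(context)
    for step in ordered(steps):
        ctx.update(step.run(ctx))
    return ctx


def score(ctx: dict) -> dict:
    """Score each item with the supplied model (any callable returning a probability)."""
    return {"scores": {item: ctx["model"](item) for item in ctx["items"]}}


def route(ctx: dict) -> dict:
    """Auto-decide confident cases; queue uncertain ones for human review."""
    lo, hi = ctx["review_band"]      # assumed setting: probabilities strictly inside go to a human
    auto, review = {}, []
    for item, p in ctx["scores"].items():
        if lo < p < hi:
            review.append(item)
        else:
            auto[item] = p >= hi     # True = positive decision, False = negative
    return {"auto": auto, "review": review}


# Declared out of order on purpose: dependencies, not list position, fix the run order.
PIPELINE = [
    Step("route", route, upstream=("score",)),
    Step("score", score),
]
```

Calling `run_pipeline(PIPELINE, {...})` with `items`, a `model` callable and a `review_band` such as `(0.2, 0.8)` returns a context holding the automatic decisions under `auto` and the items awaiting a human under `review`. Because the pipeline is ordinary code, it can be version-controlled and unit-tested like any other module.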
Tags: Airflow, Big-Data, AI