[Pycon] [new paper] "Jiaqi Liu" - Building a Data Pipeline with Observability in Mind
info at pycon.it
Sun 6 Jan 2019 19:18:06 CET
Title: Building a Data Pipeline with Observability in Mind
Duration: 45 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk
Abstract: It’s one thing to build a robust data pipeline in Python, but a whole other challenge to find the tooling and build out the framework that allows for testing and monitoring continuous data processes. To truly iterate on and develop a codebase, one has to be able to test confidently during development and monitor the production system. In this talk, I hope to address the key components of end-to-end testing for data pipelines by borrowing concepts from how we test Python web services. Just as we check for healthy status codes in our API responses, we want to be able to check that a pipeline works as expected given the correct inputs. The goal of the talk is to highlight the key features that allow a data pipeline to be easily testable, discuss best practices for building observable systems, and learn how to identify time-series metrics that can be used to monitor the health of a data pipeline.
This talk is for anyone who is interested in data science processes, data engineering, or using Python for analytics or data processing. Since the talk focuses on testing and monitoring rather than on how to build data pipelines, it would be beneficial for audience members to have experience using Python for analytics or to have built a data pipeline before. However, I will begin the talk with a brief overview of what data pipelines are, so that beginners can easily follow along. I hope to show the audience a couple of ideas for creating integration tests for a data process and to explain how testing a data process is similar to, and different from, testing a web service. In this talk, I also highlight how observability, as a practice from software engineering, and interpretability, as a practice from data science, both come into play when monitoring and testing data pipelines, and how to leverage both practices to be successful.
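As a rough illustration of the "status code" idea in the abstract, here is a minimal sketch in Python. It is not taken from the talk: the names clean_record, run_pipeline and RunReport, and the choice of report fields, are hypothetical and only stand in for whatever a real pipeline would expose. The sketch runs a tiny transform over known input, returns a small health report alongside the output, and checks both in an integration-style test.

```python
import time
from dataclasses import dataclass


@dataclass
class RunReport:
    """Hypothetical summary of one pipeline run, loosely analogous to an HTTP status."""
    ok: bool
    rows_in: int
    rows_out: int
    rows_rejected: int
    duration_seconds: float
    finished_at: float


def clean_record(record):
    """Hypothetical transform step: normalise one raw record or raise an exception."""
    return {"user": record["user"].strip().lower(), "amount": float(record["amount"])}


def run_pipeline(records):
    """Run the transform over a list of records and return (output_rows, RunReport)."""
    started = time.monotonic()
    output, rejected = [], 0
    for record in records:
        try:
            output.append(clean_record(record))
        except (KeyError, ValueError, TypeError, AttributeError):
            # Malformed records are counted rather than silently dropped,
            # so the rejection count can be tracked over time.
            rejected += 1
    report = RunReport(
        ok=(rejected == 0),
        rows_in=len(records),
        rows_out=len(output),
        rows_rejected=rejected,
        duration_seconds=time.monotonic() - started,
        finished_at=time.time(),
    )
    return output, report


def test_pipeline_end_to_end():
    """Integration-style check: known input in, expected output and healthy report out."""
    raw = [{"user": " Alice ", "amount": "10.5"}, {"user": "BOB", "amount": "3"}]
    rows, report = run_pipeline(raw)
    assert rows == [{"user": "alice", "amount": 10.5}, {"user": "bob", "amount": 3.0}]
    assert report.ok and report.rows_in == report.rows_out == 2
```

A test runner such as pytest would pick up test_pipeline_end_to_end by its name. Fields like rows_rejected and duration_seconds, recorded once per run together with finished_at, are one possible starting point for the time-series health metrics the abstract mentions.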
Tags: best-practices, monitoring, software-engineering, DataPipelines, Big-Data, pydata