[Pycon] [new paper] "Manoj Pandey" - Building a Data Science pipeline

Sun 7 Jan 2018 14:21:50 CET

Title: Building a Data Science pipeline
Duration: 240 (includes Q&A)
Q&A Session: 0
Language: en
Type: Training

Abstract: This **workshop** teaches the elements of an entire **data science pipeline**. Working with a few example datasets, the audience will build a small pipeline, form hypotheses and validate them, then build an algorithm or machine learning model, and finally learn how to share the findings with other users and team members.

The **take-away** of the workshop is that a data science pipeline is not just about building an algorithm or machine learning model: it involves many other steps, which are equally important.

The entire data science pipeline will consist of these steps:

1. Essential data science toolkit:
	Learning the important tools: Git, IPython, Jupyter and the SciPy ecosystem

2. Data collection and storage:
	Ways to scrape data, handling different file formats, storing data in flat files, databases etc.

3. Cleaning and wrangling the data:
	Leveraging libraries like pandas and json to wrangle the data

4. Exploratory data analysis:
	Using tools like pandas, matplotlib, seaborn etc. to perform EDA

5. Building hypotheses and collecting validations

6. Reproducibility:
	Discussing scientific reproducibility

7. Sharing the findings (visualizations, papers etc.):
	Building visualizations for the web using D3.js, charts, graphs etc.
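
To make the steps above concrete, here is a minimal sketch of such a pipeline in miniature. It is only an illustration, not the workshop material: the dataset, the function names and the toy hypothesis are all made up, and it uses the Python standard library (`json`, `sqlite3`, `csv`) in place of the pandas / matplotlib stack named above so that it runs without extra installs. For brevity, storage happens after cleaning here.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw records, standing in for scraped data (step 2).
RAW_JSON = """[
  {"city": "Florence", "temp_c": "12.5"},
  {"city": "florence ", "temp_c": "13.5"},
  {"city": "Milan", "temp_c": null},
  {"city": "Milan", "temp_c": "8.0"},
  {"city": "Milan", "temp_c": "9.0"}
]"""


def collect(raw):
    """Step 2 (collection): parse the raw JSON payload into Python dicts."""
    return json.loads(raw)


def clean(records):
    """Step 3: normalise city names, drop missing values, cast to float."""
    cleaned = []
    for rec in records:
        if rec["temp_c"] is None:
            continue  # skip rows with no measurement
        cleaned.append(
            {"city": rec["city"].strip().title(), "temp_c": float(rec["temp_c"])}
        )
    return cleaned


def store(rows, conn):
    """Step 2 (storage): persist the cleaned rows in a SQLite table."""
    conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (:city, :temp_c)", rows)
    conn.commit()


def explore(conn):
    """Step 4: a tiny piece of EDA, the mean temperature per city."""
    cur = conn.execute(
        "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
    )
    return dict(cur.fetchall())


def to_csv(summary):
    """Step 7: export the summary as CSV text that can be shared or charted."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["city", "mean_temp_c"])
    for city, mean in sorted(summary.items()):
        writer.writerow([city, mean])
    return buf.getvalue()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(clean(collect(RAW_JSON)), conn)
    summary = explore(conn)
    # Step 5: a toy hypothesis check on this made-up sample only.
    print("Florence warmer than Milan:", summary["Florence"] > summary["Milan"])
    print(to_csv(summary))
```

Keeping each step as a small function with explicit inputs and outputs also helps with step 6: the whole run can be repeated from the raw data with a single command.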

Tags: scikit-learn, collection, numpy, bokeh, selenium, d3, matplotlib, lxml, pandas