[Pycon] [new paper] "Giacomo Debidda" - Getting started with HDF5 and PyTables

info at pycon.it
Sat 6 Jan 2018 20:16:40 CET


Title: Getting started with HDF5 and PyTables
Duration: 45 (includes Q&A)
Q&A Session: 15
Language: it
Type: Talk

Abstract: 
# Description
HDF5 is a data model, a library, and a file format for storing and managing big and complex data. PyTables is a Python package built on top of the HDF5 library and NumPy. It provides a high-level interface with advanced indexing and database-like query capabilities. PyTables is both easy to use and extremely fast, so it can be an invaluable tool if you need to work with large, hierarchical datasets.
By the end of this talk you will know what HDF5 is, why it might be the right file format for you, and where PyTables fits in the Python data ecosystem.


# Target Audience (beginner/intermediate)
The audience should have a basic knowledge of Python. No prior knowledge of HDF5 or PyTables is required. Both data scientists and developers interested in building data products could benefit from this talk.
To get the most out of this talk, it would help to have at least a basic understanding of the Python data ecosystem (e.g. NumPy, pandas).


# Outline
I will use a Jupyter notebook throughout the entire talk. All code snippets will be reproducible and available on GitHub.
I will also use an HDF5 viewer like HDFView to show the generated HDF5 files.
I gave a talk on HDF5, h5py and PyTables at PyData Munich (a Meetup, not a conference). Here is the [link to the GitHub repo](https://github.com/jackdbd/hdf5-pydata-munich).
This talk will focus on PyTables. I want to provide more code examples, both for the basic operations and for the more complex ones.

The talk will be structured like this:

**What is HDF5 and who uses it (5 minutes)**
The HDF5 library was first released in 1998 and is now maintained by The HDF Group. In the last few years it has grown in popularity in both academia and industry.
A few reasons to use HDF5 as a file format are:

 - it's open source
 - it's flexible and well tested in the scientific environment
 - it can store big and complex data
 - it's portable
 - it's cross platform
 - it keeps the metadata
 - it handles hierarchical data when you can't (or don't want to) use a database

I will briefly mention institutes and companies that use HDF5 and why they use it.

**A filesystem in a file (5 minutes)**
Brief overview of the HDF5 data model: groups, datasets, attributes.
An HDF5 file can be thought of as a container (or group) that holds a variety of heterogeneous data objects (or datasets).
Attributes are the metadata that can be applied either to a group or to a dataset.
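
As a minimal sketch of the three building blocks, the snippet below uses PyTables to create one group, one dataset and two attributes, then reads them back. The file name, node names and attribute values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with tables.open_file(path, mode="w", title="Demo file") as h5:
    # A group plays the role of a directory.
    group = h5.create_group("/", "experiment", title="One experiment")
    # A dataset (here a plain array) plays the role of a file.
    array = h5.create_array(group, "readings", np.arange(10), title="Raw readings")
    # Attributes attach metadata to groups and to datasets.
    group._v_attrs.operator = "alice"
    array.attrs.units = "mV"

with tables.open_file(path, mode="r") as h5:
    readings = h5.root.experiment.readings
    values = readings.read()
    units = readings.attrs.units
    operator = h5.root.experiment._v_attrs.operator

print(values, units, operator)
```

Nodes are addressed with a path-like syntax (`h5.root.experiment.readings`), which is where the "filesystem in a file" analogy comes from.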

**HDF5 and Python (3 minutes)**
A Python developer who wants to use HDF5 can choose between h5py and PyTables.
I will mention the different philosophies of h5py and PyTables, and why I like PyTables more.

**First steps with PyTables (7 minutes)**
I will show several code snippets that perform basic operations on HDF5 files with PyTables, so the audience can familiarize themselves with the API.
I will explain the most important storage classes in PyTables and why you might want to pick one instead of another.
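
To give a flavour of two of the storage classes, here is a small sketch assuming a made-up `Particle` record: a `Table` for heterogeneous rows and an `EArray` for a homogeneous array that grows along one dimension. Column names and sizes are illustrative only.

```python
import os
import tempfile

import numpy as np
import tables


class Particle(tables.IsDescription):
    # Hypothetical record layout: one attribute per column.
    name = tables.StringCol(16)
    energy = tables.Float64Col()
    charge = tables.Int32Col()


path = os.path.join(tempfile.mkdtemp(), "storage.h5")

with tables.open_file(path, mode="w") as h5:
    # Table: rows of mixed types, similar to a database table.
    table = h5.create_table("/", "particles", Particle, title="Particles")
    row = table.row
    for i in range(5):
        row["name"] = f"p{i}".encode("ascii")
        row["energy"] = i * 1.5
        row["charge"] = i % 3 - 1
        row.append()
    table.flush()  # write buffered rows to disk
    nrows = int(table.nrows)
    energies = table.col("energy")

    # EArray: homogeneous data, extendable along the dimension of size 0.
    earray = h5.create_earray("/", "samples", atom=tables.Float64Atom(), shape=(0, 3))
    earray.append(np.ones((4, 3)))
    earray_shape = tuple(int(n) for n in earray.shape)

print(nrows, energies, earray_shape)
```

A `Table` is a natural fit when records have named fields of different types, while array classes suit purely numerical data.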

**Create a big data file with PyTables (10 minutes)**
I will talk about compression, chunking and indexing. I aim to create a "big data" file during the talk. If this is not possible I will create it beforehand.
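
The sketch below shows where these three knobs live in the PyTables API, on a deliberately small file. The compression level, chunk shape, column names and row count are arbitrary choices for illustration, not tuned recommendations.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "big.h5")

# Compression settings shared by the datasets below (zlib ships with HDF5).
filters = tables.Filters(complevel=5, complib="zlib", shuffle=True)

with tables.open_file(path, mode="w") as h5:
    # Chunking: a CArray is stored on disk in blocks of `chunkshape`.
    carray = h5.create_carray(
        "/", "matrix",
        atom=tables.Float32Atom(),
        shape=(1000, 100),
        chunkshape=(100, 100),
        filters=filters,
    )
    carray[:] = np.arange(100_000, dtype=np.float32).reshape(1000, 100)
    chunkshape = tuple(int(n) for n in carray.chunkshape)

    # A compressed table; expectedrows helps PyTables choose a chunk size.
    description = {
        "event_id": tables.Int64Col(pos=0),
        "value": tables.Float64Col(pos=1),
    }
    table = h5.create_table("/", "events", description,
                            filters=filters, expectedrows=10_000)
    records = np.empty(10_000, dtype=[("event_id", "i8"), ("value", "f8")])
    records["event_id"] = np.arange(10_000)
    records["value"] = np.linspace(0.0, 1.0, 10_000)
    table.append(records)
    table.flush()

    # Indexing: build an index on one column to speed up later queries.
    table.cols.event_id.create_index()
    indexed = bool(table.cols.event_id.is_indexed)

print(chunkshape, indexed)
```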

**PyTables tools (5 minutes)**
PyTables ships with these command-line utilities: ptdump, pttree, ptrepack.
I will talk about these and explain why and when to use them.
I will also show HDF5 viewers like ViTables and HDFView.
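
The utilities above are run from a shell. Inside a notebook, a rough equivalent of listing a file's contents can be sketched with `File.walk_nodes`. This is not what the tools do internally, only a similar view obtained from Python; the node names are made up.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "tree.h5")

# Build a tiny hierarchy to inspect.
with tables.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "raw")
    h5.create_array(group, "a", np.arange(3))
    h5.create_array(group, "b", np.arange(4))

with tables.open_file(path, mode="r") as h5:
    # Every node in the file, groups and leaves alike.
    all_paths = [node._v_pathname for node in h5.walk_nodes("/")]
    # Only the leaves (datasets), printed with their shape and type.
    leaf_paths = []
    for leaf in h5.walk_nodes("/", classname="Leaf"):
        print(leaf)
        leaf_paths.append(leaf._v_pathname)
```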

**Search big data with PyTables and NumExpr (5 minutes)**
PyTables excels at searching data. I will show all the different syntaxes for querying data and how they behave with small and big files.
I will explain what an in-kernel search is, and how NumExpr achieves C-like performance.
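
As a small sketch of the difference, the snippet below runs the same selection three ways on a made-up `events` table: a plain Python loop, an in-kernel `Table.where` iterator, and `Table.read_where`. In the in-kernel forms the condition is passed as a string and evaluated by NumExpr on whole chunks, instead of being tested row by row in the interpreter. Column names and the threshold are illustrative.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "query.h5")

with tables.open_file(path, mode="w") as h5:
    description = {
        "event_id": tables.Int64Col(pos=0),
        "value": tables.Float64Col(pos=1),
    }
    table = h5.create_table("/", "events", description)
    records = np.empty(1_000, dtype=[("event_id", "i8"), ("value", "f8")])
    records["event_id"] = np.arange(1_000)
    records["value"] = (np.arange(1_000) % 10) / 10.0
    table.append(records)
    table.flush()

    # Regular Python loop: each row is tested by the interpreter.
    slow = [int(r["event_id"]) for r in table
            if r["value"] > 0.5 and r["event_id"] < 100]

    # In-kernel query: the condition string is compiled and run by NumExpr.
    fast = [int(r["event_id"])
            for r in table.where("(value > 0.5) & (event_id < 100)")]

    # read_where returns the matching rows as a NumPy structured array;
    # condvars binds names used in the condition to Python values.
    hits = table.read_where("(value > threshold) & (event_id < 100)",
                            condvars={"threshold": 0.5})

print(len(slow), len(fast), len(hits))
```

On a table this small the timings are indistinguishable. The gap only shows up on files with many rows.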

**Where to go from here (1 minute)**
I will cite additional material about HDF5 and PyTables that I find useful and more advanced. For example:

- [Quincey Koziol's introduction on HDF5](https://www.youtube.com/watch?v=BAjsCldRMMc)
- [Andrew Collette's talk on HDF5 and h5py](https://www.youtube.com/watch?v=nddj5OA8LJo)
- [Tom Kooij's workshop at SciPy 2017](https://www.youtube.com/watch?v=ofLFhQ9yxCw)

**Q&A (4 minutes)**
Some time for questions and answers.


Tags: HDF5, PyTables, pydata

