[Pycon] [new paper] "Diego Martin" - Less Python in PySpark: Techniques to reduce UDF usage
info at pycon.it
Fri 18 Jan 2019 20:12:08 CET
Title: Less Python in PySpark: Techniques to reduce UDF usage
Duration: 45 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk
Abstract:
Description
===========
Most of the speed issues with Spark RDDs in Python have been solved by the DataFrame API, which offers similar performance in Scala and Python. Unfortunately, for User Defined Functions (UDFs), the serialization overhead can still slow down our applications compared to a Scala implementation. A significant number of UDFs can now be rewritten using Spark constructs exclusively, or sped up in those cases where no alternative is available. This talk will review some examples of these techniques and analyze their impact on performance and code legibility.
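As a minimal, hypothetical sketch of the kind of rewrite discussed here (column names and data are made up for illustration), the same transformation expressed as a row-at-a-time Python UDF and as a built-in Spark SQL function:

    # Minimal sketch (illustrative data): the same transformation as a Python UDF
    # and as a built-in function. The UDF ships every row to a Python worker and
    # back; the built-in version runs entirely inside the JVM.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # UDF version: incurs JVM <-> Python serialization for every row.
    capitalize_udf = F.udf(lambda s: s.capitalize() if s else None, StringType())
    df.select(capitalize_udf("name")).show()

    # Built-in version: planned and executed by Spark itself, no Python involved.
    df.select(F.initcap("name")).show()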
Abstract
========
UDFs are a great way to express data transformations in those cases where the Spark API cannot provide the required functionality. But they come at a cost: the communication between the JVM and the Python interpreter is expensive and, while there are some optimizations in place, we can do better. Instead of automatically defaulting to UDFs, I want to present some recent improvements in PySpark that allow for more complex data transformations. Since no Python code is involved, there are no communication costs. These constructs are additionally verified during the physical execution planning, avoiding any undesired surprises in the middle of an expensive computation. Other improvements, such as the new vectorized UDFs, can also speed up our custom computations and simplify our code.
In this talk, I will review some of these new resources, such as:
- MapType and StructType constructs.
- Basic transformations in Spark SQL (arithmetic, conditionals, ...).
- String and list manipulation in Spark SQL.
- Higher-order functions in Spark SQL.
- Vectorized UDFs.
- UDFs in Scala.
- Broadcasting.
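As a small, hypothetical sketch of two of the constructs listed above (a Spark 2.4 higher-order function and a scalar vectorized UDF; data and names are purely illustrative):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, [1.0, 2.0, 3.0])], ["x", "values"])

    # Higher-order function (Spark 2.4+): transform the array column inside the
    # JVM, no Python UDF required.
    df.select(F.expr("transform(values, v -> v * 2)").alias("doubled")).show()

    # Vectorized (pandas) UDF: when custom Python logic is unavoidable, operate
    # on whole pandas Series instead of one row at a time.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(col):
        return col * 2

    df.select(times_two("x").alias("x_doubled")).show()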
I will also review the resulting performance speed-ups and discuss other advantages, such as code legibility and error handling, as well as briefly go over the internals of UDFs. I hope this will encourage you to think twice before writing a custom transformation.
This talk is targeted at developers with at least some experience with PySpark. If you have written at least one UDF, you will be able to follow.
Additional notes
================
The talk will be example-driven:
I will propose a real-world scenario with a UDF, profile it, and then introduce a UDF-free alternative, discussing changes in speed and memory consumption, code readability, testability, etc. All examples will be profiled and made available as a benchmark.
What you will learn:
- An overview of the DataFrame API, especially the newest features, such as vectorized UDFs, higher-order functions and array operations.
- An understanding of the internals of Python-Scala communication in Spark.
- Some alternatives to common cases in which UDFs can be avoided.
- The performance impact of UDFs, with metrics.
- How to structure your UDFs and Spark code so that it is more legible and testable.
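For the last point above, a minimal sketch (function names are hypothetical) of one way to keep UDF logic testable: keep the transformation as a plain Python function and only wrap it as a UDF at the edge, so the logic can be unit-tested without a SparkSession.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def normalize_country(code):
        """Pure Python, trivially unit-testable without any Spark infrastructure."""
        return code.strip().upper() if code else None

    # Thin wrapper: the only Spark-specific piece.
    normalize_country_udf = F.udf(normalize_country, StringType())

    # A unit test simply exercises the plain function, no cluster needed:
    assert normalize_country("  it ") == "IT"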
Structure:
- Introduction and motivation:
- In which cases developers use UDFs.
- UDF tradeoffs: flexibility vs. performance, testability, verifiability.
- Spark internals:
- UDFs under the hood: Python <-> Spark communication and serialization overhead.
- The benefits of using pure Spark constructs instead.
- Hands-on:
- I will present a series of snippets in Spark using UDFs and provide an alternative using only the Spark API, reviewing a different construct in each example. Each example will be benchmarked in terms of execution time (and, if I am able to, I will also provide other performance markers, such as the execution plan, memory consumption, etc.). I will also discuss further benefits and disadvantages, such as code legibility and error handling. A rough sketch of such a timing comparison follows this outline.
- Conclusion
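As an illustration only, a rough sketch of how such a timing comparison could look (the data size and transformation are made up, not the actual examples from the talk):

    import time
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000_000).toDF("n")

    plus_one_udf = F.udf(lambda n: n + 1, LongType())

    def timed(label, column):
        # Force full evaluation with an aggregation and report wall-clock time.
        start = time.perf_counter()
        df.select(column.alias("result")).agg(F.sum("result")).collect()
        print(f"{label}: {time.perf_counter() - start:.2f}s")

    timed("Python UDF", plus_one_udf("n"))
    timed("built-in  ", F.col("n") + 1)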
Tags: spark, pandas