From Batch Processing to Real Time Analytics: Running Presto® at Scale

IEEE International Conference on Data Engineering (ICDE)

Abstract

Presto is an open source distributed query engine used widely at Facebook, Uber, Twitter, Pinterest, and many other internet companies. Since open sourced in 2013, the Presto community has made several rounds of design and implementations, to support a variety of use cases, including interactive analytics, real time reporting and dashboard, ETL workloads, A/B testing, monitoring and alerts, etc. In this paper, we’d like to introduce some of the most important features and performance improvements the open source Presto community made in recent years, which enables companies running Presto at scale, supporting millions of queries per day, with hundreds of thousands of machines. Specifically, how Presto provides unified SQL on heterogeneous storage systems without data copy; how Presto deals with complex data, including nested columnar data and schema evolution; How Presto supports geospatial queries efficiently, and how file list cache works in Presto. We also talk about cluster federation, and Presto on cloud. Experimental results and our production experience could help others running interactive SQL systems at scale.

Featured Publications