From Batch Processing to Real Time Analytics: Running Presto® at Scale

IEEE International Conference on Data Engineering (ICDE)

Abstract

Presto is an open source distributed query engine used widely at Facebook, Uber, Twitter, Pinterest, and many other internet companies. Since open sourced in 2013, the Presto community has made several rounds of design and implementations, to support a variety of use cases, including interactive analytics, real time reporting and dashboard, ETL workloads, A/B testing, monitoring and alerts, etc. In this paper, we’d like to introduce some of the most important features and performance improvements the open source Presto community made in recent years, which enables companies running Presto at scale, supporting millions of queries per day, with hundreds of thousands of machines. Specifically, how Presto provides unified SQL on heterogeneous storage systems without data copy; how Presto deals with complex data, including nested columnar data and schema evolution; How Presto supports geospatial queries efficiently, and how file list cache works in Presto. We also talk about cluster federation, and Presto on cloud. Experimental results and our production experience could help others running interactive SQL systems at scale.

Latest Publications

Sustainable AI: Environmental Implications, Challenges and Opportunities

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Max Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, Kim Hazelwood

MLSys - 2022