Notes on Google PowerDrill
Links
- Hall et al.: Processing a Trillion Cells per Mouse Click
- Hall lecture video
- Wired article (typical Wired garbage, but still contains a few details not found in the paper)
Notes
Formatted in a question / answer style
Introduction & Background
What is PowerDrill (PD)?
- A web-based analysis tool built by the Google AdWords team
- The columnar storage backend and execution engine is called "PD Serving"; it is the focus of the paper
What types of analysis can you do in PD?
- Drilldown: start with the entire dataset and perform slice/filter/aggregate operations
- UI consists of bar graphs (GROUP BY) and selections/filters (WHERE)
- Bias towards discrete/categorical data (strings, dates, etc.)
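The drilldown pattern above can be sketched in a few lines of Python. This is only an illustration of how bar charts map to GROUP BY and selections map to WHERE; the rows, the column names, and the `drill` helper are all invented here and are not taken from the paper or from PD itself.

```python
from collections import Counter

# Hypothetical ad-log rows; the column names are assumptions for illustration.
rows = [
    {"country": "US", "device": "mobile", "query": "cheap flights"},
    {"country": "US", "device": "desktop", "query": "hotel deals"},
    {"country": "DE", "device": "mobile", "query": "cheap flights"},
    {"country": "US", "device": "mobile", "query": "car rental"},
]


def drill(rows, group_by, where=None):
    """Count rows per value of `group_by`, keeping only rows that match `where`.

    `where` maps column name -> required value (an equality filter).
    """
    where = where or {}
    kept = (r for r in rows if all(r[col] == val for col, val in where.items()))
    return Counter(r[group_by] for r in kept)


# Step 1: one bar chart over the whole dataset (GROUP BY country).
by_country = drill(rows, "country")

# Step 2: selecting the "US" bar adds a filter (WHERE country = 'US'),
# and the next chart groups the remaining rows by another column.
by_device_us = drill(rows, "device", where={"country": "US"})
```

Each click in the UI narrows the `where` filter and re-runs a group-by count over the remaining rows, which is why the workload is dominated by filters and aggregations over categorical columns.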
What kind of data is being analyzed?
- Paper is not specific about this, but video is
- The most important AdWords datasets
- Log data
- lots of string columns (e.g. search query text)
- Wide datasets: thousands of columns
- Use cases given:
- responding to user requests (support requests?)
- spam analysis (somewhat interactive)
- generating alerts for mission-critical systems (click fraud, according to the video)
Who is using PD?
- Google internal only
- 800 monthly users, 4 million monthly queries (c. 2012)
Why use columnar storage?
- Compression: storing same-typed values together gives lower-entropy data, which yields a higher compression ratio
- Allows specialized compression techniques for certain datatypes (e.g. dictionary encoding, run-length encoding (RLE))
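A minimal sketch of the two techniques named above, applied to a single string column. This is the generic textbook form of dictionary encoding and RLE, not PD Serving's actual storage layout; the function names and the sample column are assumptions made for illustration.

```python
def dictionary_encode(column):
    """Replace each value with a small integer id into a sorted dictionary."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[v] for v in column]


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs


# Hypothetical column with few distinct values and long runs.
country = ["DE", "DE", "DE", "US", "US", "DE"]

dictionary, codes = dictionary_encode(country)  # strings become small ints
runs = run_length_encode(codes)                 # repeated ints become runs
```

Here `dictionary` is `["DE", "US"]`, `codes` is `[0, 0, 0, 1, 1, 0]`, and `runs` is `[[0, 3], [1, 2], [0, 1]]`. Both steps work well only because all values of one column sit next to each other; in a row-oriented layout the neighbouring values would belong to different columns and types.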