Presto 334 adds significant performance improvements for queries that access nested fields inside struct columns. These queries are optimized through the pushdown of dereference expressions: query execution now prunes structural data eagerly, extracting only the necessary fields.
Motivation #
RowType is a built-in Presto data type that stores the in-memory representation of commonly used nested connector types, e.g. the STRUCT type in Hive. Datasets often contain wide and deeply nested structural columns, i.e. a struct column with hundreds of fields, where the fields are themselves nested.
Although such RowType columns can contain plenty of data, most analytical queries access just a few fields out of them. Without dereference pushdown, Presto scans the whole column and shuffles all of that data around before projecting the necessary fields. This suboptimal execution causes higher CPU usage, higher memory usage, and higher query latencies than required. The unnecessary operations get even more expensive with wider/deeper structs and more complex query plans.
LinkedIn’s data ecosystem makes heavy use of nested columns. It is common to have 2-3 levels of nesting and up to 50 fields in most of our tracking tables. Because of the inefficient query execution for nested fields, ETL pipelines were set up at LinkedIn to copy the nested columns into a set of top-level columns corresponding to subfields. This step added overhead to our ingestion process and delayed data availability for analytics. It also caused ORC schemas to be inconsistent with the rest of the infrastructure, making it harder to migrate from existing flows on row-oriented formats.
Similarly, Lyft’s schemas make heavy use of nested data to decompose a ride into its routes, riders, segments, modes, and geo-coordinates. Prior to these performance improvements, analytical queries would either need to be run on clusters with very long timeouts, or the data would have to be flattened before being analyzed, adding an extra ETL step. Not only was this costly, it also caused the original schema to diverge in our data warehouse, making it more difficult for data scientists to understand.
The dereference pushdown optimization in Presto is having a massive impact on the ingestion story at both LinkedIn and Lyft. Nested data is now available for consumption faster, with a consistent structure across all stores, while maintaining performance parity for analytical queries.
Example #
Say we have a Hive table jobs with a struct-typed column job_info in its schema. The column job_info is wide and deeply nested, i.e. ROW(company varchar, requirements ROW(skills array(...), education ROW(...), salary ...), ...). Most queries would access a small percentage of the data in this struct using the dereference projection (the . operator). Consider such a query Q below.
SELECT A.appid id, J.job_info.company c
FROM applications A JOIN jobs J
ON A.jobid = J.jobid
LIMIT 100
It should suffice to scan only the single field company from J.job_info to execute this query. But without dereference pushdown, Presto scans and shuffles everything in job_info, only to project a single field at the end.
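You can observe this yourself with EXPLAIN, which prints the query plan; the position of the projection relative to the table scan shows whether the pushdown applied:
EXPLAIN
SELECT A.appid id, J.job_info.company c
FROM applications A JOIN jobs J
ON A.jobid = J.jobid
LIMIT 100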
Solution: Pushdown of Dereference Expressions #
With dereference pushdown, Presto optimizes queries by extracting the required fields from a ROW as early as possible. This is enforced by modifying the query plan through a set of optimizers, and can be broadly divided into two parts.
First, dereference projections are extracted in the query plan and pushed as close to the table scan as possible, independent of the connector. Second, there is a further improvement specific to Hive tables: the Hive connector and the ORC/Parquet readers have been optimized to scan only the required subfield columns.
Pushdown of predicates on subfields is also a crucial optimization. For example, if a query has filters on subfields (e.g. a.b > 5), they should be utilized by the ORC/Parquet readers while scanning files. The pushdown helps prune files, stripes, and row groups based on column-level statistics. This optimization is achieved as a byproduct of the two optimizations above.
With dereference pushdown, queries see significant gains in CPU usage, peak memory usage, and query runtime, roughly proportional to the size of the nested columns relative to the size of the accessed fields.
Pushdown in Query Plan #
The goal here is to execute dereference projections as early as possible. This usually means performing them right after the table scans.
A projection operation that performs dereferencing on input symbols (e.g. job_info.company) reduces the amount of data going up the plan tree. Pushing dereference projections down means that we prune data early, which reduces the amount of data processed and shuffled during query execution. For the example query Q, the query plan looks like the following when dereference pushdown is enabled.
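A simplified sketch of the resulting plan shape (illustrative only, not verbatim EXPLAIN output):
Output [id, c]
└─ Limit [100]
   └─ InnerJoin [A.jobid = J.jobid]
      ├─ TableScan [applications] → appid, jobid
      └─ Project [c := job_info.company]
         └─ TableScan [jobs] → jobid, job_info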
The projection job_info.company now directly follows the scan of the jobs table, avoiding the propagation of job_info through the Limit and Join nodes. Note that all of job_info is still being scanned, and pruning it in the reader requires connector-dependent optimizations.
Pushdown in the Hive Connector #
In columnar formats like ORC and Parquet, the data is laid out in a columnar fashion even for subfields. If we have a column of type STRUCT(f1, f2, f3), the subfields f1, f2, and f3 are stored as independent columns. An optimized query engine should scan only the required fields through its ORC reader, skipping the rest. This optimization has been added to the Hive connector.
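For instance, a column s of type STRUCT(f1, f2, f3) is stored roughly as three independent streams (an illustrative layout, not the exact ORC/Parquet encoding):
s.f1 → [values of f1 ...]
s.f2 → [values of f2 ...]
s.f3 → [values of f3 ...]
A reader that needs only s.f1 can skip the byte ranges of s.f2 and s.f3 entirely.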
Dereference projections above a TableScanNode are pushed down into the Hive connector as “virtual” (or “projected”) columns, and the query plan is modified to refer to these new columns. For the query Q, the jobs table is scanned differently with this optimization, as shown below. The projection is now embedded in the Hive connector. Here, job_info#company can be thought of as a virtual column representing the subfield job_info.company.
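Sketched before and after (illustrative, not verbatim plan output):
Without connector pushdown:
  Project [c := job_info.company]
  └─ TableScan [jobs] → jobid, job_info
With connector pushdown:
  TableScan [jobs] → jobid, job_info#company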
The Hive connector handles the projections before returning columns to Presto’s engine. It provides the required virtual columns to the format-specific readers, and the ORC and Parquet readers optimize their scans based on the required subfields, increasing their read throughput. Subfield pruning is not possible for row-oriented format readers (e.g. Avro); for them, the Hive connector performs an adaptation to project the required fields.
Pushdown of Predicates on Subfields #
Columnar formats store per-column statistics in the data files, which can be used by the readers for filtering. For example, if a query contains the filter y = 5 on a top-level column y, Presto’s ORC reader can skip ORC stripes and files by looking at the upper and lower bounds for y in the statistics.
The same concept of predicate-based pruning works for filters involving subfields, since statistics are also stored for subfield columns, i.e. Presto’s ORC/Parquet readers should be able to filter based on a constraint like x.f1 = 5 for more optimal scans. Good news! In the final optimized plan, predicates on a subfield are pushed down to the Hive connector as a constraint on the corresponding virtual column, and later used to optimize the scan. The complete logic is too involved to explain here, but it can be illustrated through the following example.
Given an initial plan with a predicate on a dereferenced field (x.f1 = 5), a chain of optimizers transforms it into a more optimal plan with reader-level predicates, as sketched below. In the future, the same optimization will be added to the Parquet reader.
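A sketch of that transformation, assuming a hypothetical table t with a top-level column y and a struct column x containing the field f1 (illustrative, not verbatim plan output):
Initial plan:
  Project [y]
  └─ Filter [x.f1 = 5]
     └─ TableScan [t] → x, y
Final plan:
  Project [y]
  └─ Filter [x#f1 = 5]
     └─ TableScan [t] → y, x#f1
        constraint: x#f1 = 5 (used to prune files, stripes, and row groups)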
In the final plan, the Hive connector knows to scan only the column y and the subfield x.f1. It also takes advantage of the “virtual” column constraint x#f1 = 5 for reader-level pruning.
Performance Improvement #
Dereference pushdown improves performance for queries accessing nested fields in multiple ways. First, it increases the read throughput of table scans, reducing CPU time. The pruning of fields during the scan also means less data to process for all downstream operators and tasks, so the early projections result in more efficient execution of any operations that shuffle or copy data. Moreover, for ORC/Parquet, read performance improves further in the presence of selective filters on subfields.
Below are some experimental results on a production dataset at LinkedIn which contains 3 STRUCT columns, each having ~20-30 small subfields. The example queries used in the analysis access only a few subfields, and are listed by their approximate query shape for the sake of brevity. The plots compare CPU usage, peak memory usage, and average query wall time.
CPU usage and peak memory usage show orders-of-magnitude improvements in the presence of dereference pushdown. Query wall times also drop considerably, and this improvement is more drastic for the relatively complex JOIN query, as expected.
Please note that these are not benchmarks! The performance improvement you’ll see will vary depending on how many columns your nested data contains versus how many of them you reference. At Lyft we saw improvements of 50x for some queries!
Future Work #
The pushdown of dereference expressions can be extended to arrays, i.e. dereference operations applied after unnesting an array should also get pushed down to the readers. For example, using our jobs table from before, the jobs.job_info structure may contain a repeating structure such as required_skills. With the following query, the entire required_skills structure would be read even though only a small part of it is referenced.
SELECT S.description
FROM jobs J
CROSS JOIN UNNEST (job_info.required_skills) S
WHERE S.years_of_experience >= 2
The work for this improvement is being tracked in this issue.
Similar to the Hive connector, connector-level dereference pushdown can be extended to other connectors that support nested types.
Another future improvement is the pushdown of predicates on subfields for data stored in the Parquet format. Although pruning of nested fields already works with Parquet, the predicates are not yet pushed down into the reader.
Conclusion #
Pushing down dereference operations in a query provides massive performance gains, especially when operating on large structs. At LinkedIn and Lyft, this feature has shown great impact for analytical queries on nested datasets.
We’re excited for the Presto community to try it out. Feel free to dig into this GitHub issue for technical details. Please reach out to us on Slack for further discussions or to report issues.