NOT IN over a subquery are much faster in
We ran the benchmark above with 3 workers (r3.2xlarge) and 1 coordinator (r3.xlarge) on TPC-DS scale 1000 stored in ORC format using the following queries:
SELECT count(*) FROM store_sales WHERE store_sales.ss_customer_sk IN ( SELECT c_customer_sk FROM customer )
SELECT count(*) FROM store_sales WHERE store_sales.ss_store_sk NOT IN ( SELECT s_store_sk FROM store WHERE s_hours <> '8AM-4PM' )
What was the improvement? #
We found that the optimization to use precomputed hashes, which is enabled by
default, was missing in
SemiJoin operator. Hash values were precomputed at the leaf
stages but they were not being used in the
SemiJoin operator leading to re-calculation
of the hash values at this operator. Since queries involving
NOT IN over a
SemiJoin operator, the fix to use precomputed hash in SemiJoin operator
improves the performance of such queries significantly.
How does optimize-hash-generation optimization work #
Presto divides a query plan into parts called Stages which can be run in parallel on multiple nodes, each node working on different set of data. There are two types of stages:
- Leaf Stages: these are the stages that are at the leaf of the Query Plan and read data from a datasource, like a Hive Table.
- Intermediate Stages: these are the stages other than the leaf stages and process data from other upstream stages.
Exchange operator shuffles and transfers the output from upstream stages to the
intermediate stages. For certain operators like
GROUP BY and
JOIN, output data of
the leaf stage is partitioned by the values of a column and the shuffle operation ensures
that a particular partition is always processed by the same task of the Intermediate stage.
This partitioning requires calculation of a hash on that column’s values during exchange
and later in the intermediate stage same hash is needed during the execution of
JOIN operation. To prevent redundant calculations, Presto calculates this hash value
in the leaf stage, uses it in
Exchange operator and makes it available in the output to let
GROUP BY or
JOIN operations use it in the intermediate stage.
Consider this query to count the number of stores per city:
SELECT count(*), city FROM stores GROUP BY city
The query plan (simplified) and its division into stages looks like below:
The leaf stage (
Stage2) reads the table from a data source, feeds the partially
aggregated data to
Stage1 where final aggregation happens, and finally, the result is available
Each row produced by
Stage2, needs to be partitioned by the value of
city column in it to ensure
data for same city is processed by the same task of
Stage1. After the exchange, when a row is consumed
Stage1, it needs to be hashed again to find a group for the row so that the final aggregation
accumulates results for each city in it’s corresponding group bucket. Double hash calculations on
the values of
city column is prevented by doing this calculation once while reading the data and then
using it in both
Final Aggregation operations which reduces CPU usage of the query.
Additionally, pushing this calculation into leaf stage which is better parallelized when there is
a large number of splits for this stage, improves query latency.
How to get this fix? #
This fix is available in Presto version 312 and above. The
optimize-hash-generation setting is enabled
by default so the fix will be in action as soon as you upgrade your Presto installation.