Queries involving IN
and NOT IN
over a subquery are much faster in
Presto 312.
We ran the benchmark above with 3 workers (r3.2xlarge) and 1 coordinator (r3.xlarge) on TPC-DS scale 1000 stored in ORC format using the following queries:
SELECT count(*)
FROM store_sales
WHERE store_sales.ss_customer_sk IN (
SELECT c_customer_sk FROM customer
)
SELECT count(*)
FROM store_sales
WHERE store_sales.ss_store_sk NOT IN (
SELECT s_store_sk
FROM store
WHERE s_hours <> '8AM-4PM'
)
What was the improvement? #
We found that the optimization to use precomputed hashes, which is enabled by
default, was missing in SemiJoin
operator. Hash values were precomputed at the leaf
stages but they were not being used in the SemiJoin
operator leading to re-calculation
of the hash values at this operator. Since queries involving IN
and NOT IN
over a
subquery use SemiJoin
operator, the fix to use precomputed hash in SemiJoin operator
improves the performance of such queries significantly.
How does optimize-hash-generation optimization work #
Presto divides a query plan into parts called Stages which can be run in parallel on multiple nodes, each node working on different set of data. There are two types of stages:
- Leaf Stages: these are the stages that are at the leaf of the Query Plan and read data from a datasource, like a Hive Table.
- Intermediate Stages: these are the stages other than the leaf stages and process data from other upstream stages.
The Exchange
operator shuffles and transfers the output from upstream stages to the
intermediate stages. For certain operators like GROUP BY
and JOIN
, output data of
the leaf stage is partitioned by the values of a column and the shuffle operation ensures
that a particular partition is always processed by the same task of the Intermediate stage.
This partitioning requires calculation of a hash on that column’s values during exchange
and later in the intermediate stage same hash is needed during the execution of GROUP BY
or JOIN
operation. To prevent redundant calculations, Presto calculates this hash value
in the leaf stage, uses it in Exchange
operator and makes it available in the output to let
GROUP BY
or JOIN
operations use it in the intermediate stage.
Consider this query to count the number of stores per city:
SELECT count(*), city
FROM stores
GROUP BY city
The query plan (simplified) and its division into stages looks like below:
The leaf stage (Stage2
) reads the table from a data source, feeds the partially
aggregated data to Stage1
where final aggregation happens, and finally, the result is available
via Stage0
.
Each row produced by Stage2
, needs to be partitioned by the value of city
column in it to ensure
data for same city is processed by the same task of Stage1
. After the exchange, when a row is consumed
in Stage1
, it needs to be hashed again to find a group for the row so that the final aggregation
accumulates results for each city in it’s corresponding group bucket. Double hash calculations on
the values of city
column is prevented by doing this calculation once while reading the data and then
using it in both Exchange
and Final Aggregation
operations which reduces CPU usage of the query.
Additionally, pushing this calculation into leaf stage which is better parallelized when there is
a large number of splits for this stage, improves query latency.
How to get this fix? #
This fix is available in Presto version 312 and above. The optimize-hash-generation
setting is enabled
by default so the fix will be in action as soon as you upgrade your Presto installation.