Presto nation, We want to hear from you! If you have a question or pull request that you would like us to feature on the show please join the Presto slack and go to the #presto-community-broadcast channel and let us know there. Otherwise, you can message Manfred Moser or Brian Olsen directly. Also, feel free to reach out to us on our Twitter channels Brian @bitsondatadev and Manfred @simpligility.
Release Notes discussed: https://prestosql.io/docs/current/release/release-348.html
We’re continuing our series covering some fundamental topics that build up to dynamic filtering! This week we’re discussing the cost-based optimizer with Presto co-creator Martin Traverso!
To recap, in episode 6 we discussed a little bit about the various forms a query takes from submission to the coordinator, to actually being executed. We discussed how the parser generates an abstract syntax tree (AST) and the analyzer checks for valid SQL including functions and making sure tables and columns being referenced actually exist.
Here’s an example of an abstract syntax tree from last weeks episode for query
SELECT * FROM (VALUES 1) t(a) WHERE a = 1 OR 1 = a OR a = 1;.
The next phase we discussed was the planner. Internally, the planner and optimizer overlap substantially, but you can think of the planner as the early part of the planning phase that generates the logical query, and over several optimization iterations becomes an optimized distributed query. The planner generates a new tree data structure called the plan IR (intermediate representation) that contains nodes representing the steps that need to be performed in order to answer the query. The leaves of the tree get executed first, and the parents of each node are dependent on the action of its child completing before it can start.
Here’s ab example of a logical plan tree using the same query form the AST above. Since this query isn’t pulling from a data source, the distributed plan is equivalent to the logical plan.
In the cost-based optimizer phase, there are various rules that are applied to the Plan IR that slowly optimize the structure into the final distributed plan that is then executed. To do this, the optimizer retrieves some statistical metadata of the tables and their data. This information includes, table row counts, column data size, column low/high value, distinct column value count, and the percentage of null values in a column. With the list of rules that aim to leverage these statistics, the optimizer improves the query structure that improves on parallelism based on the number of workers to the number of sources.
If you want to jump into the code, start at the entry point for the planner/optimizer and the initial planning starts on this line. This loop is where the actual optimization occurs. So if you are interested, maybe grab a brandy 🥃 and take some time to set your debugger at these points and watch the optimizer do its thing!
Refer to chapter 4 in Presto: The Definitive Guide pg. 50.
Before we can jump into this PR, let’s discuss what a subquery is and further
what a correlated subquery is. In SQL you have a nested query that runs within
another query, typically embedded within a
WHERE clause or
Take this query for example:
In this example, we have a standard non-correlated subquery that runs on
table2. The reason it is not correlated is because there are no dependencies
on the parent query that is being run on
table. This type of query enables the
SQL engine to run the subquery first and then use those results to run the
parent query after. In the case of a correlated query, you typically have at
least one criterion in the nested query that depends on the parent. This
requires that the nested query gets executed for each row of the parent query.
Take a look at this correlated query:
In this example, we are running the subquery in the context of the row in order
to evaluate the value of
t1.b. Having this query run for every row of the
parent query is certainly not ideal if it is not required and that is why
subquery decorrelation is a common optimization technique if an equivalent
non-correlated subquery exists for a given correlated subquery.
This pull request adds a rule that added the ability for Presto to handle the
decorrelation of a subquery containing a
LIMIT or (
TopN) clauses. So, the common trick during decorrelation is to turn it into a
query that can process the results from the inner table in one shot. The
approach is to flatten the results of executing the subquery for every row into
a single stream of rows before it is finally ready for execution.
This change also applies to a
LATERAL join, which behaves a lot like a nested
subquery only that it acts as a table and returns multiple rows instead of
just a single row.
After the show Kasia pointed out that the failing queries were not all failing for the same reason. The first failing query above actually gets planned and executed, but the exception occurs during the execution. The rest actually fail during the planning and optimization phase as they were unable to be decorrelated due to the issue I line out in the comments above.
In this week’s question, we answer: Will running Presto on my relational database make processing faster?
I have been going over the docs of PrestoSQL and it seems to fit some of my requirements. I am little concerned about the resources needed to run Presto in production. Because the size of my prod data is between 3-5GB and there will be very minimal data growth. Is Presto suitable for such a small data size?
Many times, the idea that Presto is fast gets conflated with the idea that Presto is a good fit for all use cases. It is important to understand that Presto is a) not a database b) not developed for OLTP workloads and c) built to handle data at the scale of Terabytes to Petabytes over distributed queries. Since Presto uses a connector framework, it also has an added benefit of running federated queries to whatever data source that returns data that can be represented in some columnar fashion.
For relatively small size data sets you should try directly using your relational database first. Doing this is better for small data sets. Database indexes are really nice if you’re not in big data world and if you give your SQL Server say 10 GB memory, it should be running fully in-memory and thus — fast.
Latest training from David, Dain, and Martin(Now with timestamps!):
Presto Summit Series - Real world usage
If you want to learn more about Presto yourself, you should check out the O’Reilly Presto Definitive guide. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.