Trino nation, we want to hear from you! If you have a question or pull request
that you would like us to feature on the show please join the
Trino slack, and go to the #trino-community-broadcast channel
and let us know there. Otherwise, you can message Manfred Moser or Brian Olsen
directly. Also, feel free to reach out to us on our Twitter channels Brian
@bitsondatadev and Manfred
If you want to show us some 💕, please give us a ⭐ on GitHub.
In this week’s concept, Manfred discusses Hive Partitioning.
Concept from RDBMS systems implemented in HDFS
Normally just multiple files in a directory per table
Lots of different file formats, but always one directory
Partitioning creates nested directories
Needs to be set up at start of table creation
Uses WITH ( partitioned_by = ARRAY[‘date’])
Results in tablename/date=2020-11-19
Can also nest deeper WITH ( partitioned_by = ARRAY[‘date’, ‘countrycode’])
Can greatly enhance performance
Optimizer can determine what directories to read based on field
Especially useful when fields are used in WHERE clauses
Also useful for historic data management over time such as moving data out
to archive, deleting data, or replacing data with aggregates, or even just
running compaction on subsets
Presto can use DELETE on partitions using DELTE FROM table WHERE date=value
Also possible to create empty partitions upfront CALL system.create_empty_partition
In this week’s pull request https://github.com/trinodb/trino/pull/223,
came from contributor Hao Luo. What this function
does is similar to Hive’s MSCK REPAIR TABLE
where if it finds a hive partition directory in the filesystem that exist but
no partition entry in the metastore, then it will add the entry to the
metastore. If there is an entry in the metastore but the partition was deleted
from the filesystem, then it will remove the metastore entry. You can find
more information about this procedure in the documentation.
Here are the commands and SQL I ran during the show on Presto
We ran some queries against the metastore database. It’s a complicated model so
here is a database diagram to show the different tables and their relations in
Why am I getting, “Query exceeded maximum columns. Please reduce the number
of columns referenced and re-run the query.”?
I’m running this query to check for duplicates. My table has approx. 650
columns and I get this error.
SELECT *, COUNT(1)
GROUP BY *
HAVING COUNT(1) > 1
getting a stacktrace like this
io.prestosql.spi.PrestoException: Compiler failed
at jdk.internal.reflect.GeneratedMethodAccessor480.invoke(Unknown Source)
The throwable that causes this error MethodTooLargeException comes from the ASM
library https://asm.ow2.io/ when you ask it to create a method with more
bytecode than is allowed by the JVM specification.
We try to generate code for handling given query and the code generated is too
large. Since the code is proportional to number of columns referenced, we
rewrap the exception in something more meaningful to the user.
The general strategy would be to lower the number of columns that you reference.
The problem is that in removing columns you will remove important information
to the query. For example, in the example looking for duplicates above, you
won’t be able to discard false positive duplicate matches, but this may be
good enough to help narrow the search space. As always, it depends…
To learn more about the JVM limit and search for code_length in the Java SE