In this week’s concept, Manfred discusses Hive Partitioning.
See here for more details: https://www.educba.com/partitioning-in-hive/
In this week’s pull request https://github.com/trinodb/trino/pull/223, came from contributor Hao Luo. What this function does is similar to Hive’s MSCK REPAIR TABLE where if it finds a hive partition directory in the filesystem that exist but no partition entry in the metastore, then it will add the entry to the metastore. If there is an entry in the metastore but the partition was deleted from the filesystem, then it will remove the metastore entry. You can find more information about this procedure in the documentation.
Here are the commands and SQL I ran during the show on Presto
SHOW CATALOGS;
SHOW SCHEMAS in minio;
SHOW TABLES IN minio.part;
CREATE SCHEMA minio.part
WITH (location = 's3a://part/');
-- Create a table with no partitions
CREATE TABLE minio.part.no_part (id int, name varchar, dt varchar)
WITH (
format = 'ORC'
);
INSERT INTO minio.part.no_part
VALUES
(1, 'part-1', '2020-11-18'),
(2, 'part-2', '2020-11-18'),
(3, 'part-3', '2020-11-19'),
(4, 'part-4', '2020-11-19'),
(5, 'part-5', '2020-11-20'),
(6, 'part-6', '2020-11-20');
CREATE TABLE minio.part.orders (id int, name varchar, dt varchar)
WITH (
format = 'ORC',
partitioned_by = ARRAY['dt']
);
INSERT INTO minio.part.orders
VALUES
(1, 'part-1', '2020-11-18'),
(2, 'part-2', '2020-11-18'),
(3, 'part-3', '2020-11-19'),
(4, 'part-4', '2020-11-19'),
(5, 'part-5', '2020-11-20'),
(6, 'part-6', '2020-11-20');
SELECT *
FROM minio.part.no_part
WHERE dt = '2020-11-20';
SELECT *
FROM minio.part.orders
WHERE dt = '2020-11-20';
DELETE FROM minio.part.orders
WHERE dt = '2020-11-18';
SELECT *
FROM minio.part.orders;
-- Make sure you are using minio (which is a rename of hive) catalog
CALL system.sync_partition_metadata('part', 'orders', 'ADD');
CALL system.sync_partition_metadata('part', 'orders', 'DROP');
CALL system.sync_partition_metadata('part', 'orders', 'FULL');
-- Create a table with multi partitions
CREATE TABLE minio.part.multi_part (id int, name varchar, year varchar, month varchar, day varchar)
WITH (
format = 'ORC',
partitioned_by = ARRAY['year', 'month', 'day']
);
INSERT INTO minio.part.multi_part
VALUES
(1, 'part-1', '2020', '11', '18'),
(2, 'part-2', '2020', '11', '18'),
(3, 'part-3', '2020', '11', '19'),
(4, 'part-4', '2020', '11', '19'),
(5, 'part-5', '2020', '11', '20'),
(6, 'part-6', '2020', '11', '20'),
(7, 'part-7', '2019', '11', '18'),
(8, 'part-8', '2019', '01', '18'),
(9, 'part-9', '2019', '11', '19'),
(10, 'part-10', '2019', '01', '19'),
(11, 'part-11', '2019', '11', '20'),
(12, 'part-12', '2019', '01', '20');
We ran some queries against the metastore database. It’s a complicated model so here is a database diagram to show the different tables and their relations in the metastore.
This diagram was generated by niftimusmaximus on
The Analytics Anvil.
MariaDB (metastore database)
USE metastore_db;
-- show database
SELECT * FROM DBS;
-- show tables given a database
SELECT t.*
FROM DBS d
JOIN TBLS t ON d.DB_ID = t.DB_ID
WHERE d.NAME = 'part';
-- show location and input format of the table given database/table names
SELECT s.SD_ID, s.INPUT_FORMAT, s.LOCATION, s.SERDE_ID
FROM DBS d
JOIN TBLS t ON d.DB_ID = t.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'orders' AND d.NAME='part';
-- show (de)serializer format of the table given database/table names
SELECT sd.SERDE_ID, sd.NAME, sd.SLIB
FROM DBS d
JOIN TBLS t ON d.DB_ID = t.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN SERDES sd ON s.SERDE_ID = sd.SERDE_ID
WHERE t.TBL_NAME = 'orders' AND d.NAME='part';
-- show columns of the table given database/table names
SELECT c.*
FROM DBS d
JOIN TBLS t ON d.DB_ID = t.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'orders' AND d.NAME='part'
ORDER by CD_ID, INTEGER_IDX;
-- show partitions of the table given database/table names
SELECT p.*, s.LOCATION
FROM DBS d
JOIN TBLS t ON d.DB_ID = t.DB_ID
JOIN PARTITIONS p ON t.TBL_ID = p.TBL_ID
JOIN SDS s ON p.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'orders' AND d.NAME='part';
In this week’s question, we answer:
Why am I getting, “Query exceeded maximum columns. Please reduce the number of columns referenced and re-run the query.”?
Example:
I’m running this query to check for duplicates. My table has approx. 650 columns and I get this error.
SELECT *, COUNT(1)
FROM tbl
GROUP BY *
HAVING COUNT(1) > 1
getting a stacktrace like this
io.prestosql.spi.PrestoException: Compiler failed
at io.prestosql.sql.planner.LocalExecutionPlanner$Visitor.visitScanFilterAndProject(LocalExecutionPlanner.java:1306)
at io.prestosql.sql.planner.LocalExecutionPlanner$Visitor.visitProject(LocalExecutionPlanner.java:1185)
at io.prestosql.sql.planner.LocalExecutionPlanner$Visitor.visitProject(LocalExecutionPlanner.java:705)
at io.prestosql.sql.planner.plan.ProjectNode.accept(ProjectNode.java:82)
at io.prestosql.sql.planner.LocalExecutionPlanner$Visitor.visitAggregation(LocalExecutionPlanner.java:1119)
at io.prestosql.sql.planner.LocalExecutionPlanner$Visitor.visitAggregation(LocalExecutionPlanner.java:705)
at io.prestosql.sql.planner.plan.AggregationNode.accept(AggregationNode.java:204)
at io.prestosql.sql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:461)
at io.prestosql.sql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:432)
at io.prestosql.execution.SqlTaskExecutionFactory.create(SqlTaskExecutionFactory.java:75)
at io.prestosql.execution.SqlTask.updateTask(SqlTask.java:382)
at io.prestosql.execution.SqlTaskManager.updateTask(SqlTaskManager.java:383)
at io.prestosql.server.TaskResource.createOrUpdateTask(TaskResource.java:128)
at jdk.internal.reflect.GeneratedMethodAccessor480.invoke(Unknown Source)
The throwable that causes this error MethodTooLargeException
comes from the ASM
library https://asm.ow2.io/ when you ask it to create a method with more
bytecode than is allowed by the JVM specification.
We try to generate code for handling given query and the code generated is too large. Since the code is proportional to number of columns referenced, we rewrap the exception in something more meaningful to the user.
The general strategy would be to lower the number of columns that you reference.
The problem is that in removing columns you will remove important information to the query. For example, in the example looking for duplicates above, you won’t be able to discard false positive duplicate matches, but this may be good enough to help narrow the search space. As always, it depends…
To learn more about the JVM limit and search for code_length in the Java SE specification.
Special thanks to Ashhar Hasan for asking this question and providing some useful context!
Release Notes discussed: https://trino.io/docs/current/release/release-346.html
Manfred’s Training - SQL at any scale https://www.simpligility.com/2020/10/join-me-for-presto-first-steps/ https://learning.oreilly.com/live-training/courses/presto-first-steps/0636920462859/
Blogs
Upcoming events
Latest training from David, Dain, and Martin(Now with timestamps!):
Presto Summit Series - Real world usage
Recent Podcasts:
If you want to learn more about Presto yourself, you should check out the O’Reilly Trino Definitive guide. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.