Trino Summit 2022 is right around the corner! This free event on November 10th takes place in person at the Commonwealth Club in San Francisco, CA, and can also be attended remotely!
Read about the recently announced speaker sessions and details in these blog posts:
You can register for the conference at any time. We must limit in-person registrations to 250 attendees, so register soon if you plan to attend in person!
Official highlights from Martin Traverso:
* Support for writing `array`, `map`, and `row` types to Parquet.
* Support for `date_trunc` predicates over partition columns in the Iceberg connector (see the example query after the release notes links below).
* Support for the `timestamp` type in the Pinot connector.
* Support for `array`, `row`, and `timestamp` columns in BigQuery.
* Fault-tolerant execution for `INSERT` and `MERGE`.

Additional highlights worth a mention according to Cole:
More detailed information is available in the release notes for Trino 396, Trino 397, Trino 398, Trino 399, Trino 400, and Trino 401.
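As a rough illustration of the Iceberg improvement mentioned above, a query like the following sketch can now prune partitions using a `date_trunc` predicate. The catalog, schema, and table names are hypothetical, and the table is assumed to be partitioned by `month(event_time)`:

SELECT count(*)
FROM iceberg.analytics.events
WHERE date_trunc('month', event_time) = TIMESTAMP '2022-10-01 00:00:00';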
This week we’re talking about the Hudi connector that was added in version 398.
Apache Hudi (pronounced “hoodie”) is a streaming data lakehouse platform that combines warehouse and database functionality. Hudi is a table format that enables transactions, efficient upserts and deletes, advanced indexing, streaming ingestion services, data clustering and compaction optimizations, and concurrency control. Hudi is not just a table format, though; it also provides many services aimed at building efficient incremental batch pipelines. Hudi was born at Uber and is used at companies like Amazon, ByteDance, and Robinhood.
The Hudi table format and services aim to provide a suite of tools that make Hudi adaptable to both real-time and batch use cases on the data lake. Hudi lays out data using either merge on read, which optimizes writes over reads, or copy on write, which optimizes reads over writes.
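The table layout is chosen when the table is created. As a minimal sketch using Hudi's Spark SQL support, with hypothetical table and column names, the `type` property selects between the two layouts:

-- Hypothetical table; type = 'mor' selects merge on read, 'cow' selects copy on write.
CREATE TABLE hudi_events (
  id BIGINT,
  payload STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);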
The Hudi metadata table can improve the read and write performance of your queries. Its main purpose is to eliminate the need for “list files” operations. That need stems from how Hive-modelled SQL tables point at entire directories rather than at specific files with ranges. Using files with ranges helps prune out files that fall outside the query criteria.
Hudi uses multiversion concurrency control (MVCC), where a compaction action merges log and base files to produce new file slices, and a cleaning action removes unused or older file slices to reclaim space on the file system.
One of the well-known users of Trino and Hudi is Robinhood. Grace (Yue) Lu, who joined us at Trino Summit 2021, covers Robinhood’s architecture and use cases for Trino and Hudi.
Robinhood ingests data via Debezium and streams it into Hudi. Then Trino is able to read data as it becomes available in Hudi.
Hudi and Trino support critical use cases like IPO company stock allocation, liquidity risk monitoring, clearing settlement reports, and generally fresher metrics reporting and analysis.
Before the official Hudi connector existed, many users, like Robinhood, had to use the Hive connector. They were therefore not able to take advantage of the metadata table and the many other optimizations Hudi provides out of the box.
The new connector gets around that and enables using some of Hudi's abstractions. However, the connector is currently limited to read-only mode and doesn't support writes; Spark remains the primary system used to write data into Hudi tables that Trino then reads. Check out the demo to see the connector in action.
First we want to improve read support and cover all of Hudi's query types. As a next step, we aim to add DDL support.
This PR of the episode was contributed by Matthew Deady (@mwd410). The improvements enable writes to PostgreSQL and MySQL when fault-tolerant execution is enabled (`retry-policy` is set to `TASK` or `QUERY`). This update included a few changes to core classes shared by connectors that use JDBC clients to connect to the underlying database. For example, Matthew was able to build on this PR with a few additional changes to get this working for SQL Server in PR 14730.
Thank you so much to Matthew for extending our fault-tolerant execution to connectors using JDBC clients! As usual, thanks to all the reviewers and maintainers who got these across the line!
Let’s start up a local Trino coordinator and Hive metastore. Clone the
repository and navigate to the `community_tutorials/hudi/trino-hudi-minio`
directory. Then
start up the containers using Docker Compose.
git clone [email protected]:bitsondatadev/trino-getting-started.git
cd community_tutorials/hudi/trino-hudi-minio
docker-compose up -d
For now, you will need to import data using the Spark and Scala method we detail in the video. In the near term we will provide a SparkSQL example, and we will update this post to show the Trino DDL support when it lands.
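As a rough idea of what that SparkSQL might look like, here is a sketch that creates and populates the `hudi_coders_hive` table from Spark. The schema, sample row, and table properties are hypothetical and omit storage location and Hive sync settings; follow the video for the exact steps:

-- Hypothetical schema and sample row; the video covers the exact table definition.
CREATE TABLE hudi_coders_hive (
  uuid STRING,
  name STRING,
  fav_language STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'cow',
  primaryKey = 'uuid',
  preCombineField = 'ts'
);

INSERT INTO hudi_coders_hive
VALUES ('c1', 'Grace', 'SQL', 1667088000);

Once data is in place, connect to the Trino coordinator and run the following queries to verify that the catalog, schema, table, and rows are all visible: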
SHOW CATALOGS;
SHOW SCHEMAS IN hudi;
SHOW TABLES IN hudi.default;
SELECT COUNT(*) FROM hudi.default.hudi_coders_hive;
SELECT * FROM hudi.default.hudi_coders_hive;
Blog posts
Check out the in-person and virtual Trino Meetup groups.
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Slowikowski.