34: A big delta for Trino

Video

Releases 372, 373, and 374: Details about features and fixes in the new releases. (125s)
Project Tardigrade update: Learn about our progress for fault-tolerant query processing. (561s)
Concept of the episode: A new connector for Delta Lake object storage. (1117s)
Pull requests of the episode: Add Delta Lake connector and documentation. (1570s)
Demo of the episode: Delta Lake connector in action. (1754s)
Question of the episode: How do I secure the connection from a Trino cluster to the data source? (3240s)

Audio

Guests

In this episode Manfred has the pleasure to chat with two colleagues, who are working on making Trino better every day:

Claudius Li, Product Manager at Starburst
Joe Lodin, Information Engineer at Starburst

Brian is out to add another member to his family!

Releases 372, 373, and 374

Official highlights from Martin Traverso:

Trino 372

New trim_array function.
Support for reading ZSTD-compressed Avro files.
Support for column comments in Iceberg.
Support for Kerberos authentication in Kudu connector.

Trino 373

New Delta Lake connector.
Improved performance of LIKE when querying Elasticsearch and PostgreSQL.
Improved performance when querying partitioned Hive tables.
Support access to S3 via HTTP proxy.

Trino 374

Faster GROUP BY queries.
Vim/Emacs editing mode for CLI.
Support for TRUNCATE TABLE in Cassandra connector.
Support uint types in ClickHouse.
Support for Glue Metastore in Iceberg connector.
Add CREATE/DROP SCHEMA, table and column comments in MongoDB
Improved pushdown for PostgreSQL

Additional highlights from Manfred

Timeout configuration for LDAP authentication.
Values related to fault-tolerant execution in Web UI.
JDBC Driver.getProperties enables more client applications like DBVisualizer.
Vi and Emacs editing modes for interactive CLI usage.
Performance improvements in PostgreSQL connector.
SingleStore JDBC driver usage, end of memsql name.
Documentation for the atop connector.

More detailed information is available in the Trino 372, Trino 373, and Trino 374 release notes.

Project Tardigrade update

The team around Project Tardigrade joined us in episode 32 to talk about fault tolerant execution of queries in Trino. Now they have posted a status update on our blog.

It looks like things are really coming along well, and Joe has joined the effort to create a first user-facing documentation set.

The team has also posted a status update on the #project-tardigrade Slack channel. Everything is ready for the community to perform first real world testing, and help us make this a great feature set for Trino.

Concept of the episode: A new connector for Delta Lake object storage

It is great to have a new connector in Trino, but what does that even mean? Let’s find out.

What is a connector?

Just a quick refresher. Trino allows you to query many different data sources with SQL statements. You enable that by creating a catalog that contains the configuration to connect to a specific data source. The data source can be a relational database, a NoSQL database, and an object storage. A connector is the translation layer that maps the concepts in the data source to the Trino concepts of schema, tables, rows, columns, data types and so on. The connector needs to know how to retrieve the data itself from data source, and also how to interact with the metadata.

Here are some examples metadata questions to answer:

What are the available tables in schema xyz?
What columns does table abc have and what are the data types?
What file format is used by the storage for table efg?

And some queries about the actual data:

Give me the top 100 rows from table A.
Give me all files in partition x in the directory y.

So having a connector for your data source in Trino is a big deal. A connector unlocks the data to all your SQL analytics powered by Trino, and the underlying data source doesn’t even have to support SQL.

What is Delta Lake?

Delta Lake is an evolution of the Hive/Hadoop object storage data source. It is an open-source storage format. Data is stored in files, typically using binary formats such as Parquet or ORC. Metadata is stored in a Hive Metastore Service (HMS).

Delta Lake supports ACID transactions, time travel, and many other features that are lacking in the legacy Hive/Hadoop setup. This combination of traditional data lake storage with data warehouse features is often called a lake house.

History of the new connector

Delta Lake is fully open source, and part of the larger enterprise platform for a lake house offered by Databricks. Starburst has supported Delta Lake users with a connector for Starburst Enterprise for nearly two years. To foster further adoption and innovation with the community, the connector was donated to Trino 373 and continues to be improved.

Pull requests of the episode: Add Delta Lake connector and documentation

Over 25 developers helped Jakob with the effort to open-source the connector. It is a heavy lift to migrate a such a full featured connectors into Trino. By comparison the documentation was easy, but it is very important to enable you. Well done everyone!

Let’s have a look at the code in a bit more detail. A couple of key facts:

The Delta Lake connector is just another plugin like all other connectors.
This is a feature-rich connector supporting read and write operations.
It shares implementation details with Hive and Iceberg connectors such as HMS access, Parquet and ORC file readers, and so on.

Demo of the episode: Delta Lake connector in action

Now let’s have a look at all this in action. In the demo Claudius uses docker-compose to start up a HMS as metastore, MinIO as object storage, and of course Trino as the query engine.

If you want to follow along, all resources used for the demo are available on our getting started repository.

Here is the sample catalog delta.properties:

connector.name=delta-lake
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
delta.enable-non-concurrent-writes=true

Once everything is up and running we can start playing.

Verify that the catalog is available:

SHOW CATALOGS;

Check if there are any schemas:

SHOW SCHEMAS FROM delta;

Lets create a new schema:

CREATE SCHEMA delta.myschema WITH (location='s3a://claudiustestbucket/myschema');

Create a table, insert some records, and then verify:

CREATE TABLE delta.myschema.mytable (name varchar, id integer);
INSERT INTO delta.myschema.mytable VALUES ( 'John', 1), ('Jane', 2);
SELECT * FROM delta.myschema.mytable;

Run a query to get more data and insert it into a new table:

CREATE TABLE delta.myschema.myothertable AS
  SELECT * FROM delta.myschema.mytable;

SELECT * FROM delta.myschema.myothertable ;

Now for some data manipulation:

UPDATE delta.myschema.myothertable set name='Jonathan' where id=1;
SELECT * FROM delta.myschema.myothertable ;
DELETE FROM delta.myschema.myothertable where id=2;
SELECT * FROM delta.myschema.myothertable ;

And finally, lets clean up:

ALTER TABLE delta.myschema.mytable EXECUTE optimize(file_size_threshold => '10MB');
ANALYZE delta.myschema.myothertable;
DROP TABLE delta.myschema.myothertable ;
DROP TABLE delta.myschema.mytable ;
DROP SCHEMA delta.myschema;

As you can see with Trino and Delta Lake you get full create, read, update, and delete operations on your lake house.

Question of the episode: How do I secure the connection from a Trino cluster to the data source

Since we talked about connectors earlier, you already know that the configuration for accessing a data source is assembled to create a catalog. This approach uses a properties file in etc/catalog. For example, let’s look at the recently updated SQL Server connector documentation:

connector.name=sqlserver
connection-url=jdbc:sqlserver://<host>:<port>;database=<database>;encrypt=false
connection-user=root
connection-password=secret

The connector uses username and password authentication. It connects using the JDBC driver, which in turn enables TLs by default. A number of other connectors also use JDBC drivers with username and password authentication, but the details vary a lot. However, for all of them you can use secrets support in Trino to use environment variable references instead of hardcoding passwords.

When it comes to other connectors the details of securing a connection vary even more. Ultimately the answer to how to secure the connection, and if that is even possible, is the usual “It depends”. Luckily you can check the documentation for each connector to find out more and ping us on Slack if you need more help.

Events, news, and various links

Trino Meetup groups

Virtual
East Coast (US)
- Trino Boston
- Trino NYC
West Coast (US)
- Trino San Fransisco
- Trino Los Angeles
Mid West (US)
- Trino Chicago

If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino Community Broadcast