In this episode, Manfred has the pleasure of chatting with two colleagues who are working on making Trino better every day:
Brian is out to add another member to his family!
Official highlights from Martin Traverso:

- trim_array function (see the example after this list).
- Pushdown of LIKE when querying Elasticsearch and PostgreSQL.
- Faster GROUP BY queries.
- TRUNCATE TABLE in the Cassandra connector.
- uint types in ClickHouse.
- CREATE/DROP SCHEMA, table and column comments in MongoDB.

Additional highlights from Manfred:

- Driver.getProperties enables more client applications like DBVisualizer.
- Deprecation of the memsql connector name.

More detailed information is available in the Trino 372, Trino 373, and Trino 374 release notes.
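As a quick look at the first highlight, the new trim_array function removes a given number of elements from the end of an array. A minimal example with made-up values:

SELECT trim_array(ARRAY[1, 2, 3, 4], 2);
-- returns [1, 2]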
The team around Project Tardigrade joined us in episode 32 to talk about fault tolerant execution of queries in Trino. Now they have posted a status update on our blog.
It looks like things are really coming along well, and Joe has joined the effort to create a first user-facing documentation set.
The team has also posted a status update on the #project-tardigrade Slack channel. Everything is ready for the community to perform the first real-world testing and help us make this a great feature set for Trino.
It is great to have a new connector in Trino, but what does that even mean? Let’s find out.
Just a quick refresher. Trino allows you to query many different data sources with SQL statements. You enable that by creating a catalog that contains the configuration to connect to a specific data source. The data source can be a relational database, a NoSQL database, or an object storage system. A connector is the translation layer that maps the concepts in the data source to the Trino concepts of schemas, tables, rows, columns, data types, and so on. The connector needs to know how to retrieve the data itself from the data source, and also how to interact with the metadata.
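To make this more concrete, here is a hypothetical federated query; the catalog names postgresql and mongodb and the tables they contain are assumptions for this sketch, not part of the demo later on:

SELECT c.name, o.total_price
FROM postgresql.public.orders AS o
JOIN mongodb.sales.customers AS c ON o.customer_id = c.customer_id;

Each catalog uses its own connector, and Trino joins the results as if everything lived in one database.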
Here are some example metadata questions to answer:

- What tables exist in the schema xyz?
- What columns does the table abc have, and what are their data types?
- Does the table efg exist?

And some queries about the actual data:

- Read the rows of the table A.
- Read the file x in the directory y.
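In Trino, these questions surface as ordinary SQL statements that each connector answers for its data source. A minimal sketch, with mycatalog, myschema, and mytable standing in as placeholder names:

SHOW SCHEMAS FROM mycatalog;
SHOW TABLES FROM mycatalog.myschema;
DESCRIBE mycatalog.myschema.mytable;
SELECT * FROM mycatalog.myschema.mytable LIMIT 10;

The first three statements only touch metadata, while the last one reads the actual data.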
So having a connector for your data source in Trino is a big deal. A connector unlocks the data to all your SQL analytics powered by Trino, and the underlying data source doesn’t even have to support SQL.
Delta Lake is an evolution of the Hive/Hadoop object storage data source. It is an open-source storage format. Data is stored in files, typically using binary formats such as Parquet or ORC. Metadata is stored in a Hive Metastore Service (HMS).
Delta Lake supports ACID transactions, time travel, and many other features that are lacking in the legacy Hive/Hadoop setup. This combination of traditional data lake storage with data warehouse features is often called a lake house.
Delta Lake is fully open source, and part of the larger enterprise platform for a lake house offered by Databricks. Starburst has supported Delta Lake users with a connector for Starburst Enterprise for nearly two years. To foster further adoption and innovation with the community, the connector was donated to the Trino project, shipped as part of Trino 373, and continues to be improved.
Over 25 developers helped Jakob with the effort to open-source the connector. It is a heavy lift to migrate such a full-featured connector into Trino. By comparison, the documentation was easy, but it is very important to enable you to use the connector. Well done, everyone!
Let’s have a look at the code in a bit more detail. A couple of key facts:
Now let’s have a look at all this in action. In the demo, Claudius uses docker-compose to start up an HMS as the metastore, MinIO as object storage, and of course Trino as the query engine.
If you want to follow along, all resources used for the demo are available on our getting started repository.
Here is the sample catalog file delta.properties:
connector.name=delta-lake
hive.metastore.uri=thrift://hive-metastore:9083
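# MinIO from the demo setup serves as S3-compatible object storage, accessed with path-style requests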
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
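# writes on S3-compatible storage require this flag, since Trino cannot guarantee safety with concurrent writers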
delta.enable-non-concurrent-writes=true
Once everything is up and running we can start playing.
Verify that the catalog is available:
SHOW CATALOGS;
Check if there are any schemas:
SHOW SCHEMAS FROM delta;
Let’s create a new schema:
CREATE SCHEMA delta.myschema WITH (location='s3a://claudiustestbucket/myschema');
Create a table, insert some records, and then verify:
CREATE TABLE delta.myschema.mytable (name varchar, id integer);
INSERT INTO delta.myschema.mytable VALUES ( 'John', 1), ('Jane', 2);
SELECT * FROM delta.myschema.mytable;
Run a query to get more data and insert it into a new table:
CREATE TABLE delta.myschema.myothertable AS
SELECT * FROM delta.myschema.mytable;
SELECT * FROM delta.myschema.myothertable;
Now for some data manipulation:
UPDATE delta.myschema.myothertable SET name = 'Jonathan' WHERE id = 1;
SELECT * FROM delta.myschema.myothertable;
DELETE FROM delta.myschema.myothertable WHERE id = 2;
SELECT * FROM delta.myschema.myothertable;
And finally, let’s optimize the table, collect statistics, and clean everything up:
ALTER TABLE delta.myschema.mytable EXECUTE optimize(file_size_threshold => '10MB');
ANALYZE delta.myschema.myothertable;
DROP TABLE delta.myschema.myothertable;
DROP TABLE delta.myschema.mytable;
DROP SCHEMA delta.myschema;
As you can see, with Trino and Delta Lake you get full create, read, update, and delete operations on your lake house.
Since we talked about connectors earlier, you already know that the configuration for accessing a data source is assembled to create a catalog. This approach uses a properties file in etc/catalog. For example, let’s look at the recently updated SQL Server connector documentation:
connector.name=sqlserver
connection-url=jdbc:sqlserver://<host>:<port>;database=<database>;encrypt=false
connection-user=root
connection-password=secret
The connector uses username and password authentication. It connects using the JDBC driver, which in turn enables TLS by default. A number of other connectors also use JDBC drivers with username and password authentication, but the details vary a lot. For all of them, however, you can use the secrets support in Trino to reference environment variables instead of hardcoding passwords.
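For example, assuming an environment variable named SQLSERVER_PASSWORD (a placeholder name for this sketch) is set on all Trino nodes, the hardcoded value can be replaced with a secrets reference in the catalog file:

connection-password=${ENV:SQLSERVER_PASSWORD}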
When it comes to other connectors, the details of securing a connection vary even more. Ultimately, the answer to how to secure the connection, and whether that is even possible, is the usual “It depends”. Luckily, you can check the documentation for each connector to find out more, and ping us on Slack if you need more help.
Trino Meetup groups
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.