Get ready for Trino Summit, coming October 21st and 22nd! This annual Trino community event is where we gather practitioners that deploy Trino at scale, and share their experiences and best practices with the rest of the community. While the planning for this event was a bit chaotic due to the pandemic, we have made the final decision to host the event virtually for the safety of all the attendees. We look forward to seeing you there, and can’t wait to share more information in the coming weeks!
LakeFS offers git-like semantics over your files in the data lake. Akin to the versioning you can do on Iceberg, you can also version your data with LakeFS, and roll back to previous commits when you make a mistake. LakeFS allows you to roll out new features in production or prod-like environments with ease and isolation from the real data. Join us as we dive into this awesome new way to approach versioning on your data!
Why we built LakeFS: Source.
In order to run Trino and LakeFS, you need Docker installed on your system with at least 4GB of memory allocated to Docker.
Let’s start up the LakeFS instance and the required PostgreSQL instance along
with the typical Trino containers used with the Hive connector.
trino-getting-started repository and navigate to the
git clone [email protected]:bitsondatadev/trino-getting-started.git cd lakefs/trino-lakefs-minio/ docker-compose up -d
Once this is done, you can navigate to the following locations to verify that everything started correctly.
AKIAIOSFODNN7EXAMPLE, and Secret Access Key,
examplerepository exists in the UI and open it.
mainin the repository, found under
example/main/, should be empty.
Once you have verified the repository exists, let’s go ahead and create a schema
under the Trino Hive catalog called
minio that was pointing to
minio but is
now wrapped by LakeFS to add the git-like layer around the file storage.
Name the schema
tiny as that is the schema we copy from the TPCH data set.
location property of the schema. It now has a namespace that is
prefixed before the actual
tiny/ table directory. The prefix contains the
repository name, then the branch name. All together this follows the pattern of
CREATE SCHEMA minio.tiny WITH (location = 's3a://example/main/tiny');
Now, create two tables,
orders by setting
using the same namespace used in the schema and adding the table name. The query
retrieves the data from the
tiny TPCH data set.
CREATE TABLE minio.tiny.customer WITH ( format = 'ORC', external_location = 's3a://example/main/tiny/customer/' ) AS SELECT * FROM tpch.tiny.customer; CREATE TABLE minio.tiny.orders WITH ( format = 'ORC', external_location = 's3a://example/main/tiny/orders/' ) AS SELECT * FROM tpch.tiny.orders;
Verify that you can see the table directories in LakeFS once they exist. http://localhost:8000/repositories/example/objects?ref=main&path=tiny%2F
Run a query on these two tables using the standard table pointing to the
SELECT ORDERKEY, ORDERDATE, SHIPPRIORITY FROM minio.tiny.customer c, minio.tiny.orders o WHERE MKTSEGMENT = 'BUILDING' AND c.CUSTKEY = o.CUSTKEY AND ORDERDATE < date'1995-03-15' GROUP BY ORDERKEY, ORDERDATE, SHIPPRIORITY ORDER BY ORDERDATE;
Open the LakeFS UI again and click on the Unversioned Changes tab. Click Commit Changes. Type a commit message on the popup and click Commit Changes.
Once the changes are commited on branch
main, click on the Branches tab.
Click Create Branch. Name a new branch
sandbox that branches off of the
main branch. Now click Create.
Although there is a branch that exists called
sandbox, this only exists
logically. We need to make Trino aware by adding another schema and tables
that point to the new branch. Do this by making a new schema called
tiny_sandbox and changing the
location property to point to the
branch instead of the
CREATE SCHEMA minio.tiny_sandbox WITH (location = 's3a://example/sandbox/tiny');
tiny_sandbox schema exists, we can copy the table definitions
orders table from the original tables created. We got
the schema for free by copying it directly from the TPCH data using the CTAS
statement. We don’t want to use CTAS in this case as it not only copies the
table definition, but also the data. This duplication of data is unnecessary and
is what creating a branch in LakeFS avoids. We want to just copy the table
definition using the
SHOW CREATE TABLE statement.
SHOW CREATE TABLE minio.tiny.customer; SHOW CREATE TABLE minio.tiny.orders;
Take the output and update the schema to
to point to
sandbox for both tables.
CREATE TABLE minio.tiny_sandbox.customer ( custkey bigint, name varchar(25), address varchar(40), nationkey bigint, phone varchar(15), acctbal double, mktsegment varchar(10), comment varchar(117) ) WITH ( external_location = 's3a://example/sandbox/tiny/customer', format = 'ORC' ); CREATE TABLE minio.tiny_sandbox.orders ( orderkey bigint, custkey bigint, orderstatus varchar(1), totalprice double, orderdate date, orderpriority varchar(15), clerk varchar(15), shippriority integer, comment varchar(79) ) WITH ( external_location = 's3a://example/sandbox/tiny/orders', format = 'ORC' );
Once these table definitions exist, go ahead and run the same query as before,
but update using the
tiny_sandbox schema instead of the
SELECT ORDERKEY, ORDERDATE, SHIPPRIORITY FROM minio.tiny_sandbox.customer c, minio.tiny_sandbox.orders o WHERE MKTSEGMENT = 'BUILDING' AND c.CUSTKEY = o.CUSTKEY AND ORDERDATE < date'1995-03-15' ORDER BY ORDERDATE;
One last bit of functionality we want to test is the merging capabilities. To
do this, create a table called
lineitem in the
sandbox branch using a CTAS
CREATE TABLE minio.tiny_sandbox.lineitem WITH ( format = 'ORC', external_location = 's3a://example/sandbox/tiny/lineitem/' ) AS SELECT * FROM tpch.tiny.lineitem;
Verify that you can see three table directories in LakeFS including
Verify that you do not see
lineitem in the table directories in LakeFS in the
You can also verify this by running queries against
lineitem in the tables
pointing to the
sandbox branch that should fail on the tables pointing to the
To merge the new table
lineitem to show up in the main branch, first commit
the new change to
sandbox by again going to Unversioned Changes tab.
Click Commit Changes. Type a commit message on the popup and click
lineitem add is committed, click on the Compare tab. Set the
base branch to
main and the compared to branch to
sandbox. You should see
the addition of a line item show up in the diff view. Click Merge and click
Once this is merged you should see the table data show up in LakeFS. Verify that
you can see
lineitem in the table directories in LakeFS in the
As before, we won’t be able to query this data from Trino until we run the
SHOW CREATE TABLE from the
tiny_sandbox schema and use the output to create
the table in the
tiny schema that is pointing to
The PR of the week adds some really useful context around query failures in the Trino Web UI. This PR was created by Pádraig O’Sullivan . For many, it can be fustrating when a query fails and you have to do a lot of digging before you understand even the type of error that is happening.This PR gives a better highlight of what failed so that you don’t have to do a lot of investigation upfront to get a sense of what is happening and where to look next.
Thank you so much Pádraig!
INSERT OVERWRITE checks whether there are entries in the target table that
exist as well in the staging table, and it first deletes the target entries,
before inserting the staging entries. Unfortunately the delete didn’t work for
RDBMS, Hive, or Iceberg. His questionis if this is a limitation of Trino for
all connectors, and how we can approach the “delete” part of
Blogs and Resources
Trino Meetup groups
If you want to learn more about Trino, check out the definitive guide from OReilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.