Looks like Commander Bun Bun is safe on this Iceberg
https://joshdata.me/iceberger.html
Release notes discussed: https://trino.io/docs/current/release/release-355.html
Martin’s list:
Manfred’s notes:
In the previous episode, we covered the differences between the Iceberg table format and the Hive table format from a technical standpoint in the context of Trino. We highly recommend watching it before this episode. In this episode, we ask Ryan about the origins of Apache Iceberg and why he started the project. We cover some details of the Iceberg specification, which is a nice change from the ad-hoc conventions that people adhere to when using Hive tables. Then Ryan dives into several amazing use cases showing how Netflix and others use Iceberg.
This week’s PR of the week was submitted by one of the Trino maintainers, Pratham Desai. Pratham is a software engineer at LinkedIn who devotes a lot of time to the Trino community, helping out on the Slack channel, contributing code, and reviewing PRs. Thank you for all you do, Pratham!
Had Brian known about this PR, he would have avoided the issue he hit when reading the empty snapshot created with the Iceberg Java API, and would have been able to read and insert into the table just fine. If you come across this issue, note that this feature was introduced in release 344!
Along with the future developments we discussed in the previous episode, another core piece of Iceberg functionality that we want to add in Trino is support for partition migration. We also discussed future support for UPDATE and MERGE capabilities in the Iceberg connector.
For this week’s demo, we continue to use the Iceberg Java API to create a table. You also have the option to use Trino, Spark, or other engines to ingest and query the data, but I wanted to use the vanilla Iceberg APIs to experience them firsthand and hopefully solidify my understanding of Iceberg concepts in the process. Make sure you follow the instructions in the repository if you don’t have Docker or Java installed.
Let’s start up a local Trino coordinator and Hive metastore. Clone this repository and navigate to the iceberg/trino-iceberg-minio directory. Then start up the containers using Docker Compose.
git clone git@github.com:bitsondatadev/trino-getting-started.git
cd iceberg/trino-iceberg-minio
docker-compose up -d
In your favorite IDE, open the files under iceberg/iceberg-java into your project and run the IcebergMain class.
This class creates a logging table, if it doesn’t exist, along with the logging schema. Once you run this code, you can check that the table exists in the metastore by looking under TABLE_PARAMS.
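If you want to peek at what was recorded, a quick sanity check might look like the following, run against the metastore’s backing relational database. This is only a sketch: it assumes a standard MySQL/MariaDB-backed Hive metastore, and the TBLS and TABLE_PARAMS tables come from the Hive metastore schema, not from this repository.

```sql
-- TBLS holds one row per table; TABLE_PARAMS holds its key/value properties.
-- An Iceberg table stores a 'metadata_location' property pointing at the
-- current metadata JSON file, plus table_type = 'ICEBERG'.
SELECT p.PARAM_KEY, p.PARAM_VALUE
FROM TABLE_PARAMS p
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'logs';
```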
Now we transition from the Java API to running queries over Iceberg using Trino.
/**
 * This is the equivalent of running IcebergMain in the iceberg-java project.
 * Go ahead and inspect the Java code you can use to interact with Iceberg
 * tables and metadata.
 */
CREATE TABLE iceberg.logging.logs (
  level varchar NOT NULL,
  event_time timestamp(6) with time zone NOT NULL,
  message varchar NOT NULL,
  call_stack array(varchar)
)
WITH (
  format = 'ORC',
  partitioning = ARRAY['hour(event_time)', 'level']
);
/**
* Read From Trino
*/
SELECT * FROM iceberg.logging.logs;
/**
* Write data from Trino and check data and snapshots
*/
INSERT INTO iceberg.logging.logs VALUES
(
  'ERROR',
  timestamp '2021-04-01' AT TIME ZONE 'America/Los_Angeles',
  'Oh noes',
  ARRAY['Exception in thread "main" java.lang.NullPointerException']
);
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
/**
* Write more data from Trino and check data and snapshots
*/
INSERT INTO iceberg.logging.logs
VALUES
(
  'ERROR',
  timestamp '2021-04-01' AT TIME ZONE 'America/Los_Angeles',
  'Oh noes',
  ARRAY['Exception in thread "main" java.lang.NullPointerException']
),
(
  'ERROR',
  timestamp '2021-04-01 15:55:23.383345' AT TIME ZONE 'America/Los_Angeles',
  'Double oh noes',
  ARRAY['Exception in thread "main" java.lang.NullPointerException']
),
(
  'WARN',
  timestamp '2021-04-01 15:55:23.383345' AT TIME ZONE 'America/Los_Angeles',
  'Maybeh oh noes?',
  ARRAY['bad things could be happening']
);
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
/**
* Read data from an old snapshot (Time travel)
*/
SELECT * FROM iceberg.logging."logs@2806470637437034115";
/**
 * Add a new column; notice that no new snapshot is created,
 * since schema changes are metadata-only operations
 */
ALTER TABLE iceberg.logging.logs ADD COLUMN severity INTEGER;
SHOW CREATE TABLE iceberg.logging.logs;
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
/**
* Insert new data with new column
*/
INSERT INTO iceberg.logging.logs VALUES
(
  'INFO',
  timestamp '2021-04-01 19:59:59.999999' AT TIME ZONE 'America/Los_Angeles',
  'es muy bueno',
  ARRAY['It is all normal'],
  1
);
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
/**
* Rename column and drop column
*/
ALTER TABLE iceberg.logging.logs RENAME COLUMN severity TO priority;
SHOW CREATE TABLE iceberg.logging.logs;
SELECT * FROM iceberg.logging.logs;
ALTER TABLE iceberg.logging.logs DROP COLUMN priority;
SHOW CREATE TABLE iceberg.logging.logs;
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
/**
* Travel back to previous snapshots
*/
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$snapshots";
SELECT * FROM iceberg.logging."logs@<insert-earlier-snapshot>";
CALL iceberg.system.rollback_to_snapshot('logging', 'logs', <insert-earlier-snapshot>);
/**
* Back to the future snapshot
*/
SELECT * FROM iceberg.logging."logs$snapshots";
SELECT * FROM iceberg.logging."logs@<insert-latest-snapshot>";
CALL iceberg.system.rollback_to_snapshot('logging', 'logs', <insert-latest-snapshot>);
SELECT * FROM iceberg.logging.logs;
SELECT * FROM iceberg.logging."logs$partitions";
When developing with Trino, an automated build verifies every PR. It is powered by a GitHub Actions definition and runs all the tests in Trino whenever developers add new code. Sometimes tests unrelated to the changes in your PR fail, making the PR appear unmergeable even though the failure has nothing to do with your changes.
Developers are aware of these flaky tests and need a mechanism to resubmit their PR and rerun the tests. Unfortunately, GitHub offers no way for users without write permissions to the Trino repository to rerun the tests, so you have to push a dummy commit. This can easily be done with this one-line hack:
git commit --amend --no-edit && git push -f
The good news is that we have gone to great lengths to identify flaky tests over the last year. These test failures are much rarer now, and we are continually improving build stability as an ongoing effort.
Interested in supporting the Trino project, but don’t know where to start? A good place to begin, with a lower barrier to entry, is contributing to the documentation. We will be supporting the writing day at the Write the Docs (WTD) Portland conference this April! Join us to learn how to get involved!
Come join us for the inaugural virtual Trino meetup on April 21st in the virtual meetup group for your region! See the community page for more details.
At these meetups, the four Trino/Presto founders will be updating everyone on the state of Trino. We’ll discuss the rebrand, talk about the recent features, and discuss the trajectory of the project. Then we will host a hangout and an ask me anything (AMA) session. Hope to see you all there!
Blogs
Videos
Trino Meetup groups
Latest training from David, Dain, and Martin (now with timestamps!):
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.