15: Iceberg right ahead!

Video

Audio

Looks like Commander Bun Bun is safe on this Iceberg
https://joshdata.me/iceberger.html

Iceberg links

Guests

Ryan Blue, creator of Iceberg, and Senior Software Engineer at Netflix (@rdblue)
David Phillips, creator of Trino/Presto, and CTO at Starburst (@electrum32)

Release 355

Release notes discussed: https://trino.io/docs/current/release/release-355.html

Martin’s list:

Multiple password authentication plugins
Column and table lineage reporting in query events
Improved planning performance for queries against Phoenix or SQL Server
Improved performance for ORDER BY … LIMIT queries against Phoenix

Manfred’s notes:

Security overview and TLS pages and authentication types
Reiterate multiple authentication providers (ldap1, ldap2, password)
Improved parallelism for table bucket count is small compared to number of nodes.
Include information about Spill to disk in EXPLAIN ANALYZE
Unixtime function changes
Hive view support improvements

Concept of the week: Apache Iceberg and the Iceberg spec

Interview with Ryan Blue

In the previous episode, we covered the differences between the Iceberg table format, and the Hive table format from a technical standpoint in the context of Trino. We highly recommend watching it before this episode. In this episode we ask Ryan about the origins of Apache Iceberg and why he started the project. We cover some details of the Iceberg specification which is a nice change from the ad-hoc specification that people adhere to when using Hive tables. Then Ryan dives into several amazing use cases how Netflix and others use Iceberg.

PR of the week: PR 7233 Fix queries on tables without snapshot id

This week’s PR of the week was submitted by one of the Trino maintainers, Pratham Desai. Pratham is a Software Engineer at LinkedIn who commits a lot of time in the Trino community helping out on the slack channel, contributing code, and doing PR reviews. Thank you for all you do Pratham!

Had Brian known about this PR, he wouldn’t have had the issue he did with reading the empty snapshot created with the Iceberg Java API and would have been able to read and insert into the table just fine. If you come across this issue, we introduced this feature in release 344!

Another future development for the Trino Iceberg connector

Along with the future developments we discussed in the previous episode, another core Iceberg functionality that we want to add in Trino is support for partition migration. We also discussed future support for UPDATE and MERGE capabilities for the Iceberg connector.

Demo: Creating tables with Iceberg and reading the data in Trino

For this weeks’ demo, we continue to use the Iceberg Java API to create a table. You also have the option to use Trino, Spark, or other to ingest and query the data, but I wanted to use vanilla Iceberg API’s to experience the API and hopefully solidify my learning of Iceberg concepts in the process. Make sure you follow the instructions in the repository if you don’t have Docker or Java installed.

Let’s start up a local Trino coordinator and Hive metastore. Clone this repository and navigate to the iceberg/trino-iceberg-minio directory. Then start up the containers using Docker Compose.

git clone [email protected]:bitsondatadev/trino-getting-started.git

cd iceberg/trino-iceberg-minio

docker-compose up -d

In your favorite IDE, open the files under iceberg/iceberg-java into your project and run the IcebergMain class.

This class creates a logging table if it doesn’t exist along with the logging schema. Once you run this code, you can check to see that the table in Trino exists in the metastore under TABLE_PARAMS.

Now we transition from the Java API to running queries over Iceberg using Trino.

/**
 * This is the equivalent of running IcebergMain in the iceberg-java project.
 * Go ahead and inspect the java code you can use to interact with Iceberg
 * tables and metadata.
 */
CREATE TABLE iceberg.logging.logs (
   level varchar NOT NULL,
   event_time timestamp(6) with time zone NOT NULL,
   message varchar NOT NULL,
   call_stack array(varchar)
)
WITH (
   format = 'ORC',
   partitioning = ARRAY['hour(event_time)','level']
)

/**
 * Read From Trino
 */

SELECT * FROM iceberg.logging.logs;

/**
 * Write data from Trino and check data and snapshots
 */

INSERT INTO iceberg.logging.logs VALUES 
(
  'ERROR', 
  timestamp '2021-04-01' AT TIME ZONE 'America/Los_Angeles', 
  'Oh noes',
  ARRAY ['Exception in thread "main" java.lang.NullPointerException']
);

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

/**
 * Write more data from Trino and check data and snapshots
 */
INSERT INTO iceberg.logging.logs 
VALUES 
(
  'ERROR', 
  timestamp '2021-04-01' AT TIME ZONE 'America/Los_Angeles', 
  'Oh noes', 
  ARRAY ['Exception in thread "main" java.lang.NullPointerException']
), 
(
  'ERROR', 
  timestamp '2021-04-01 15:55:23.383345' AT TIME ZONE 'America/Los_Angeles', 
  'Double oh noes', 
  ARRAY ['Exception in thread "main" java.lang.NullPointerException']
), 
(
  'WARN', 
  timestamp '2021-04-01 15:55:23.383345' AT TIME ZONE 'America/Los_Angeles', 
  'Maybeh oh noes?', 
  ARRAY ['bad things could be happening']
);

 
SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

/**
 * Read data from an old snapshot (Time travel)
 */

SELECT * FROM iceberg.logging."logs@2806470637437034115";

/**
 * Add new column, notice there is no snapshots of the metadata
 */

ALTER TABLE iceberg.logging.logs ADD COLUMN severity INTEGER;

SHOW CREATE TABLE iceberg.logging.logs;

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

/**
 * Insert new data with new column
 */

INSERT INTO iceberg.logging.logs VALUES 
(
  'INFO', 
  timestamp '2021-04-01 19:59:59.999999' AT TIME ZONE 'America/Los_Angeles', 
  'es muy bueno', 
  ARRAY ['It is all normal'], 
  1
);

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

/**
 * Rename column and drop column
 */

ALTER TABLE iceberg.logging.logs RENAME COLUMN severity TO priority;

SHOW CREATE TABLE iceberg.logging.logs;

SELECT * FROM iceberg.logging.logs;

ALTER TABLE iceberg.logging.logs DROP COLUMN priority;

SHOW CREATE TABLE iceberg.logging.logs;

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

/**
 * Travel back to previous snapshots
 */

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$snapshots";

SELECT * FROM iceberg.logging."logs@<insert-earlier-snapshot>";

CALL system.rollback_to_snapshot('logging', 'logs', <insert-earlier-snapshot>)

/**
 * Back to the future snapshot
 */

SELECT * FROM iceberg.logging."logs$snapshots";

SELECT * FROM iceberg.logging."logs@<insert-latest-snapshot>";

CALL system.rollback_to_snapshot('logging', 'logs', <insert-latest-snapshot>)

SELECT * FROM iceberg.logging.logs;

SELECT * FROM iceberg.logging."logs$partitions";

Question of the week: What do I do to restart the test pipeline if it fails on me?

When developing with Trino, there is an automated build that acts as verification of any PR. It is powered by a GitHub actions definition and runs all the tests in Trino when developers add new code. Sometimes test unrelated to the changes in your PR fail, which makes your PR show that it shouldn’t be merged due to a failure, but is actually unrelated.

Developers are aware of these flaky tests, and need a mechanism to resubmit their PR and rerun the tests. There is unfortunately no way to enable users to rerun tests through GitHub without write permissions to the Trino repository, so you have to do a dummy commit.

This can easily be done using this one line hack git commit --amend --no-edit && git push -f.

The good news is, we have gone through some extensive lengths to identify flaky tests in the last year. These test failures are much rarer now, and we are constantly improving the build stability as an ongoing effort.

Events, news, and various links

WTD Portland

Interested in supporting the Trino project, but don’t know where to start? A good place to start with a little less barrier to entry, is adding to the documentation. We will be supporting the writing day at the Write the Docs (WTD) Portland conference this April! Join us to learn how to get involved!

Virtual Trino meetups

Come join us for the inaugural Virtual Trino meetup on April 21st in the virtual meetup group in your region! See the community page for more details.

At these meetups, the four Trino/Presto founders will be updating everyone on the state of Trino. We’ll discuss the rebrand, talk about the recent features, and discuss the trajectory of the project. Then we will host a hangout and an ask me anything (AMA) session. Hope to see you all there!

Blogs

Videos

Trino Meetup groups

Virtual
East Coast (US)
- Trino Boston
- Trino NYC
West Coast (US)
- Trino San Fransisco
- Trino Los Angeles
Mid West (US)
- Trino Chicago

Latest training from David, Dain, and Martin(Now with timestamps!):

If you want to learn more about Trino, check out the definitive guide from OReilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino Community Broadcast