20: Trino for the Trinewbie

Guests

Marius Grama, Data Engineer at willhaben internet service GmbH & Co KG (@findinpath)

Concept of the week: Trino for the Trinewbie

One of the best and easiest ways to understand Trino and how to use it is the book Trino: The Definitive Guide. The next three sections contain a few excerpts from the book, which does an incredible job of introducing the space Trino is in. If you would like to read the book in its entirety, Starburst offers the digital copy for free.

The Problems with Big Data

Everybody is capturing more and more data from device metrics, user behavior tracking, business transactions, location data, software and system testing procedures and workflows, and much more. The insights gained from understanding that data and working with it can make or break the success of any initiative, or even a company.

At the same time, the diversity of storage mechanisms available for data has exploded: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. Many of them are necessary in today’s organizations, and it is no longer possible to use just one of them.

What is Trino?

Trino is not a database with storage, rather, it simply queries data where it lives. When using Trino, storage and compute are decoupled and can be scaled independently. Trino represents the compute layer, whereas the underlying data sources represent the storage layer.

This allows Trino to scale its compute resources for query processing up and down, based on the analytics demand to access this data. There is no need to move your data, to provision compute and storage to the exact needs of the current queries, or to change that regularly based on your changing query needs.

Trino can scale the query power by scaling the compute cluster dynamically, and the data can be queried right where it lives in the data source. This characteristic allows you to greatly optimize your hardware resource needs and therefore reduce cost.

SQL-on-Anything

Trino was initially designed to query data from HDFS. And it can do that very efficiently, as you learn later. But that is not where it ends. On the contrary, Trino is a query engine that can query data from object storage, relational database management systems (RDBMSs), NoSQL databases, and other systems.

Trino queries data where it lives and does not require a migration of data to a single location. So Trino allows you to query data in HDFS and other distributed object storage systems. It allows you to query RDBMSs and other data sources. As such, it can really query data wherever it lives and therefore be a replacement to the traditional, expensive, and heavy extract, transform, and load (ETL) processes. Or at a minimum, it can help you with them and lighten the load. So Trino is clearly not just another SQL-on-Hadoop solution.

Object storage systems include Amazon Web Services (AWS) Simple Storage Service (S3), Microsoft Azure Blob Storage, Google Cloud Storage, and S3-compatible storage such as MinIO and Ceph. Trino can query traditional RDBMSs such as Microsoft SQL Server, PostgreSQL, MySQL, Oracle, Teradata, and Amazon Redshift. Trino can also query NoSQL systems such as Apache Cassandra, Apache Kafka, MongoDB, or Elasticsearch. Trino can query virtually anything and is truly a SQL-on-Anything system.

For users, this means that suddenly they no longer have to rely on specific query languages or tools to interact with the data in those specific systems. They can simply leverage Trino and their existing SQL skills and their well-understood analytics, dashboarding, and reporting tools. These tools, built on top of using SQL, allow analysis of those additional data sets, which are otherwise locked in separate systems. Users can even use Trino to query across different systems with the SQL they know.

Contributing to Trino

In this episode, Marius Grama discusses his journey with Trino: joining the community, his first impressions and experiences, and what led him to make sixteen commits over the last three months. We also ask him where he thinks we could improve the onboarding experience.

In the Trino project there are four roles. You can immediately become a participant or reviewer. To be a contributor, you need to follow some steps that are covered later in the episode. Likewise, for maintainers, there is a path to becoming a maintainer that is discussed in detail on the roles page.

Participants

Participants are those who show up and join in discussions about the project. Users, developers, and administrators can all be participants, as can literally anyone who has the time, energy, and passion to become involved. Participants suggest improvements and new features. They report bugs, regressions, performance issues, and so on. They work to make Trino better for everyone.

Contributors

Today’s episode covers the process that a contributor goes through to make a code change, but simply put:

A contributor submits code changes to Trino.

Reviewers

A reviewer reads a proposed change to Trino, and assesses how well the change aligns with the Trino vision and guidelines. This includes everything from high level project vision to low level code style. Everyone is invited and encouraged to review others’ contributions – you don’t need to be a maintainer for that.

Maintainers

A maintainer is responsible for checking in code only after ensuring it has been reviewed thoroughly and aligns with the Trino vision and guidelines. In addition to merging code, a maintainer actively participates in discussions and reviews. Being a maintainer does not grant additional rights in the project to make changes, set direction, or anything else that does not align with the direction of the project. Instead, a maintainer is expected to bring these to the project participants as needed to gain consensus. The maintainer role is for an individual, so if a maintainer changes employers, the role is retained. However, if a maintainer is no longer actively involved in the project, their maintainer status will be reviewed.

There is a writeup on the Apache Hive process to become a committer. For context, a committer is equivalent to a maintainer in Trino. This writeup aligns precisely with the Trino philosophy. Here are a few good quotes from that article:

Contributors often ask Hive PMC members the question, “What do I need to do in order to become a committer?” The simple (though frustrating) answer to this question is, “If you want to become a committer, behave like a committer.” If you follow this advice, then rest assured that the PMC will notice, and committership will seek you out rather than the other way around.

It should go without saying, but here it is anyway: your participation in the project should be a natural part of your work with Hive; if you find yourself undertaking tasks “so that you can become a committer”, then you’re doing it wrong, young padawan. This is particularly true if your motivations for wanting to become a committer are primarily negative or self-centered

PR of the week: PR 8135 Set default time zone for the current session

The PR of the week was committed by today’s guest, Marius Grama.

This fixes issue 8112 by adding support for the SET TIME ZONE statement. The specified time zone is stored as a session property and has lower precedence than the sql.forced-session-time-zone setting.

Thanks Marius!

Demo: Contributing to Trino

Here is the video that goes into detail on the steps below on how to contribute code to Trino!

Download an IDE.

First, you need an integrated development environment (IDE) to work with the code. We recommend IntelliJ IDEA Community Edition, as it is the standard used by developers across the project. Of course, you may use any IDE you like, but you may run into issues that others cannot help with as readily.

Install Git.

Git is a distributed version control system used to collaborate on code with other developers. You must install Git in order to contribute to the project.

Install Docker.

The Trino testing framework runs Trino and other databases it connects to on Docker, a tool that runs different services in isolation using containers.
Go ahead and install Docker on your system.
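
Before moving on, it can help to confirm that the tools installed so far are available on your PATH. The following is a minimal sketch, not part of the official setup; `check_tool` is a hypothetical helper name introduced only for this illustration.

```shell
# check_tool is a hypothetical helper for this sketch:
# it reports whether a command can be found on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Check the two command-line tools installed in the steps above.
for tool in git docker; do
  check_tool "$tool"
done
```

If a tool is reported as missing, revisit the corresponding installation step before continuing.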

Create and configure your GitHub account.

GitHub is a free Git hosting service and the central point of collaboration for the Trino project. If you haven’t done so, please create and configure your GitHub account.

Make a fork of the Trino repository on GitHub.

Navigate to the Trino repository and click the “fork” button. Or you can just click it here: Fork.

You want to create a fork so that you can save your work without needing the special privileges it takes to commit code back to the Trino repository. This way, you can upload (also called a “push” in Git) your code to your fork and later open a pull request into the main Trino repository.

Clone your fork of the Trino repository to your computer and import it into IntelliJ.

Execute the following clone command in your terminal:
```
git clone git@github.com:<your_username>/trino.git
```
Open the Trino project in IntelliJ.
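
Once the clone exists, a common Git convention is to register the main Trino repository as a second remote so that your fork can be kept up to date. The sketch below is an illustration under stated assumptions: the remote name `upstream` is a convention rather than a project requirement, `your_username` is a placeholder, and a throwaway repository stands in for your real clone so that the commands can be run anywhere.

```shell
# Throwaway repository standing in for your local clone (assumption for this sketch).
clone_dir="$(mktemp -d)/trino"
git init -q "$clone_dir"
cd "$clone_dir"

# "origin" is the remote that git clone would have set up for your fork.
# your_username is a placeholder.
git remote add origin git@github.com:your_username/trino.git

# Register the main Trino repository under the conventional name "upstream".
git remote add upstream https://github.com/trinodb/trino.git

# List the configured remotes.
git remote -v

# In a real clone you would then download the latest upstream changes
# (requires network access):
# git fetch upstream
```

In your actual clone, only the `git remote add upstream` and `git fetch upstream` commands are needed, since `origin` already points at your fork.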

Add the Airlift code style checks to IntelliJ.

There are many unspoken rules about code style and formatting in any project, and Trino is no exception. To make life simpler for contributors and reviewers, there is a code style definition that you can import into IntelliJ so that the Reformat Code action formats code in the project’s desired style.

Build the project.

One of the greatest resources in Trino history is this cheat sheet created by Piotr Findeisen. I use it for some of the commands, but its most important use is the “fast” build command he adds at the top. In your terminal, make sure you are in the root directory of the Trino project, and run the following command.
```
./mvnw -pl '!:trino-server-rpm,!:trino-docs,!:trino-proxy,!:trino-verifier,!:trino-benchto-benchmarks' clean install \
-TC2 -nsu \
-DskipTests \
-Dmaven.javadoc.skip=true \
-Dmaven.source.skip=true \
-Dair.check.skip-all=true
```
This builds the modules needed to run almost everything in Trino. The build excludes some modules, runs on multiple threads, and skips the tests, Javadoc, and the Airlift code style checks. If you would like to run the code style checks on a specific module (e.g. trino-elasticsearch), you can run the following command.
```
./mvnw -pl ':trino-elasticsearch' clean install \
-TC2 -nsu \
-DskipTests \
-Dmaven.javadoc.skip=true \
-Dmaven.source.skip=true 
```

Sign the CLA.

Sign the contributor license agreement (CLA) to agree that all code you commit to the project is subject to the Apache License 2.0. Once you sign the agreement, scan and submit the form to [email protected]. This inbox gets checked every few days, and you can check whether your name has been added to the contributors list.

At this point you can look for an issue labeled “good first issue”. This label identifies issues that we think are more approachable for developers who aren’t yet familiar with the Trino repository.

One final thing before you move on to the contribution process: before you start changing the code, create a dedicated branch for your changes. A branch in Git keeps all the changes you make isolated in a separate line of work. If something goes wrong, or you need to compare with an older branch, you can do so. The default branch may be named either master or main. See more on branching in git.

To make a branch for your feature, you can run the following command:

```
git checkout -b my-feature-branch
```
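
To show how the branch fits into the rest of the workflow, here is a minimal sketch of creating the branch, committing a change on it, and inspecting the result. It is an illustration only: it uses a throwaway repository in place of your Trino clone, and the file name, commit messages, and user details are made up for the example.

```shell
# Throwaway repository standing in for your Trino clone (assumption for this sketch).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Contributor"
git config user.email "contributor@example.com"

# Initial commit on the default branch.
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"

# Create the feature branch and commit a change on it.
git checkout -q -b my-feature-branch
echo "my change" >> README.md
git commit -q -am "Describe the change in the commit message"

# Show the current branch and the commits it contains.
git rev-parse --abbrev-ref HEAD
git log --oneline

# In a real clone, you would then push the branch to your fork
# before opening a pull request:
# git push origin my-feature-branch
```

Because the work lives on its own branch, the default branch stays untouched and can still be compared against or returned to at any time.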

Follow the remaining steps in the contribution process page.

Question of the week: How do I remove nulls from an array in Trino?

A user on Stack Overflow asked the following question:

I’m extracting data from a json column in Trino and getting the output in an array like this ['AL', NULL, 'NEW']. The problem is I need to remove the null since the array has to be mapped another array. I tried several options but no luck. How can I remove the null and get only ['AL', 'NEW'] without unnesting?

Piotr Findeisen replied:

You can use filter() for this:

```
trino> SELECT filter(ARRAY['AL', NULL,'NEW'], e -> e IS NOT NULL);
   _col0
-----------
 [AL, NEW]
(1 row)
```

Events, news, and various links

News

The “frog” book has been translated into Chinese! Keep your eyes peeled for the Trino rebrand of the translation.

Trino Meetup groups

Virtual
East Coast (US)
- Trino Boston
- Trino NYC
West Coast (US)
- Trino San Francisco
- Trino Los Angeles
Mid West (US)
- Trino Chicago

Latest training from David, Dain, and Martin (now with timestamps!):

If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino Community Broadcast