In this episode we have the pleasure to chat with our colleagues, who now make the Trino community better every day:
Official highlights from Martin Traverso:
INSERTperformance in the MySQL, Oracle, and PostgreSQL connectors.
timestamp(p)type in MariaDB connector.
Additional highlights worth a mention according to Manfred:
What is community, and why has this word seen more use around technical projects, particularly those in the open-source space. There’s really no formal definition of community in the context of technology. David Spinks, author of the book, “The Business of Belonging”, defines community as:
A group of people who feel a shared sense of belonging.
For technical projects, this sense of belonging generally comes from the shared affinity towards a specific product, like Trino, or it could be a brand that hosts many products, like Google or Microsoft. There’s a lot that could be discussed here regarding why communities have become an essential ingredient to a project’s success. The quick answer I like to offer is that projects, open-source or proprietary, that have strong communities behind them innovate and grow faster, and are more successful overall.
As such, the Trino Software Foundation (TSF) recognizes that Trino will only be as successful as the health of the community that builds, tests, uses, and shares it. The activities around building a technical community fall in between engineering, marketing, and customer enablement. A common name that encompasses the individuals that work in this space is developer relations, DevRel for short. The goal of our work with the maintainers, contributors, users, and all other members of the community is the following:
Grow all aspects of the Trino project, and the Trino community to empower current and future members of the community.
We introduce some new faces who are stewards in our journey to growing the adoption of our favorite query engine, what each of them does, and how their work impacts you as a community member! Most importantly, you can learn how to get involved and help us learn how to best navigate ideas, issues, or any other contributions you may have that helps Trino to be the best query engine.
We don’t really have a seamless onboarding experience for new users. Many
members have asked questions on where to get started. One logical place people
tend to go to when browsing on the front page of the Trino site is the
getting started tab, which is ironically
still on the
trino.io/download.html page. When you open this page, you are
brought to a page primarily containing the latest binary downloads, some
community links, and some reading material to books and other resources.
The main thing you don’t really see is much getting started material. A lot of the material is intermediate level at best. There is not much beginner level guides to offer the self-service onboarding many are looking for when they just want to play around without having to bother or wait for anyone to respond. As it stands today, there is some work that Brian and Monica have started to create in this area to make the onboarding simpler.
A very common self-service getting started material is the trino-getting-started repo that Brian created to host demonstrations for the broadcast to show off some new feature or connector capabilities. This has been a good way to offer a simple environment to get them started. The only way to find this repository is to ask someone first. It would be ideal to showcase getting started materials as part of the default experience of learning about Trino.
Monica is working now on building up some demos using SaaS products like Starburst Galaxy as another method of using Trino without needing to install Docker among having to use any of your hardware to run through some examples. These options are typically more UI driven and much more approachable by members of the community that aren’t engineers or administrators.
We’ve got a handy PR template that exists for all contributors to use when they’ve submitting a pull request to Trino. Most of it is simple and self-explanatory. We ask you to describe what’s happening, where the change is happening, and what type of change it is. These are for the sake of the reviewers, giving them some important context so they understand what’s going on when they review the code. For simpler changes, it’s not usually necessary to go into a ton of detail here, but it’s nice to give a little summary for anyone looking at the PR.
The next steps are what really matter for every single PR that’s going to be merged - the documentation and release notes for a change. These are about communicating to our users. Documentation refers to Trino docs, not code comments. If Trino users need to be told how to use the feature you’re changing because of how you’re changing it, that means we need to have documentation for it. The PR template gives the options for how to go about this, but it’s incredibly helpful to have this filled out. Similarly, we ask whether or not release notes are necessary for the change, and what release notes you propose for your change. Generally speaking, if it needs to be documented, it almost always should have a release note. Even if it isn’t documented, a release note is often a good idea - things like performance improvements don’t require our users to change how they use Trino, but they won’t mind knowing that something has gotten better! The release process involves heavy editing of release notes, so it’s ok for the suggested note to be imperfect.
Trino is a technology that is built by developers, but also heavily used by developers. We want to ensure that the experience of both contributors and users of Trino is the best possible. To do that, we have to focus on many different aspects of this experience, from committing code to the CLIs and tools we offer for debugging queries and most importantly to building a sustainable community that can give answers and drive the future of the project. This is what DX is for Trino.
A while ago we started gathering metrics related to the Trino GitHub repository. This helped us identify issues like huge CI queue times. Most importantly we. can verify that the changes we made improved things, and how much.
In February this year, the 95th percentile of the CI queue time (not even the total run time!) was as high as almost 7 hours. Trino uses public GitHub runners, and there can only be 60 jobs running concurrently at the same time. This is a bottleneck because Trino has extensive test coverage for the core engine, all connectors, and other plugins. Because we can’t increase the number of runners, we looked into doing impact analysis to skip tests for modules not impacted by any change in a pull request.
Since April, the 95th percentile of the CI queue time is under 1 hour, even though the number of contributions is at an all-time high.
We keep track of these selected metrics in reports we create by running queries using the Trino CLI, saving the results in a markdown file, and publishing them as static pages using GitHub pages. The data is gathered using Trino connectors for the GitHub API and Git repositories. There’s a GitHub actions workflow running on a schedule, that spins up a Trino server, so there’s no infrastructure to maintain, except for a single S3 bucket. All of it is publicly available in the nineinchnick/trino-cicd repository. On the right, there’s a link to GitHub pages with reports.
We continue to add more reports, like tracking flaky tests or pull request activity:
By being data-driven and transparent, we make sure to provide a good experience for everyone, and this also helps us figure out where we need more resources to focus on.
We’re open to suggestions on what to track and which metrics to report on, so feel free to open issues and pull requests in the repository mentioned above, or start a thread on the Trino Slack.
One of the things we’ve been tracking over the last couple weeks has been the state of incoming PRs. We want to make sure that each PR reaches a maintainer, and that they all receive timely feedback after asking for a review. The goal in looking into this process is to help streamline and improve the time-to-initial-comment. The pleasant discovery is that it doesn’t seem like we have a lot of room to improve on that front. Not to pat ourselves on the back too heavily, but PRs find their way to maintainers, and get an initial review quite quickly, and there’s little work to be done on that front.
Our next exploration is tracking PRs that don’t quickly get approved and merged, and monitoring their life cycle and making sure follow-up reviews are happening in a timely manner as well. We now know that we are effective at giving initial feedback on a PR, but we also want to make sure that these PRs aren’t falling off a cliff or turning into a long, drawn-out process where each development iteration is slower than the last.
This months PR of the episode was contributed by alexjo2144. This feature is an exciting update on the ability to modify the partition specification of a table in Iceberg. This is an update since Brian wrote about this feature
At the time of writing, Trino is able to perform reads from tables that have multiple partition spec changes but partition evolution write support does not yet exist.
This brings us much closer to having more feature parity with other query
engines to manage Iceberg tables entirely through Trino. Thanks to our friend
findinpath for the review.
For this episode’s demo, you’ll need a local Trino coordinator, MinIO instance,
and Hive metastore backed by a database. Clone the
repository and navigate to the
iceberg/trino-iceberg-minio directory. Then
start up the containers using Docker Compose.
git clone [email protected]:bitsondatadev/trino-getting-started.git cd iceberg/trino-iceberg-minio docker-compose up -d
This demo is actually very similar to a demo we did in episode 15, except now we get to showcase one of Iceberg’s most exciting features, partition evolution.
/** * Make sure to first create a bucket names "logging" in MinIO before running */ CREATE SCHEMA iceberg.logging WITH (location = 's3a://logging/'); CREATE TABLE iceberg.logging.logs ( level varchar NOT NULL, event_time timestamp(6) with time zone NOT NULL, message varchar NOT NULL, call_stack array(varchar) ) WITH ( format = 'ORC', partitioning = ARRAY['day(event_time)'] ); /** * Inserting two records. Notice event_time is on the same day but different hours. */ INSERT INTO iceberg.logging.logs VALUES ( 'ERROR', timestamp '2021-04-01 12:23:53.383345' AT TIME ZONE 'America/Los_Angeles', '1 message', ARRAY ['Exception in thread "main" java.lang.NullPointerException'] ), ( 'ERROR', timestamp '2021-04-01 13:36:23' AT TIME ZONE 'America/Los_Angeles', '2 message', ARRAY ['Exception in thread "main" java.lang.NullPointerException'] ); SELECT * FROM iceberg.logging.logs; SELECT * FROM iceberg.logging."logs$partitions"; /** * Notice one partition was created for both records at the day granularity. */ /** * Update the partitioning from daily to hourly 🎉 */ ALTER TABLE iceberg.logging.logs SET PROPERTIES partitioning = ARRAY['hour(event_time)']; /** * Inserting three records. Notice event_time is on the same day but different hours. */ INSERT INTO iceberg.logging.logs VALUES ( 'ERROR', timestamp '2021-04-01 15:55:23' AT TIME ZONE 'America/Los_Angeles', '3 message', ARRAY ['Exception in thread "main" java.lang.NullPointerException'] ), ( 'WARN', timestamp '2021-04-01 15:55:23' AT TIME ZONE 'America/Los_Angeles', '4 message', ARRAY ['bad things could be happening'] ), ( 'WARN', timestamp '2021-04-01 16:55:23' AT TIME ZONE 'America/Los_Angeles', '5 message', ARRAY ['bad things could be happening'] ); SELECT * FROM iceberg.logging.logs; SELECT * FROM iceberg.logging."logs$partitions"; /** * Now there are three partitions: * 1) One partition at the day granularity containing our original records. * 2) One at the hour granularity for hour 15 containing two new records. * 3) One at the hour granularity for hour 16 containing the last new record. */ SELECT * FROM iceberg.logging.logs WHERE event_time < timestamp '2021-04-01 16:55:23' AT TIME ZONE 'America/Los_Angeles'; /** * This query correctly returns 4 records with only the first two partitions * being touched. */
There’s been a lot of cool things going into the Iceberg connector these days,
and another exciting one that came out in release 381 was the support for
UPDATE in Iceberg. So we’re
gonna showcase that:
/** * Update */ UPDATE iceberg.logging.logs SET call_stack = call_stack || 'WHALE HELLO THERE!' WHERE lower(level) = 'warn'; DROP TABLE iceberg.logging.logs; DROP SCHEMA iceberg.logging;
Is there a way to “quote” a sub query, to tell the Trino planner just pushdown the query and don’t bother making a sub plan?
I have a star schema, with one huge table (>100M rows) and a dimension table that has static attributes of the huge table. The dimension table is filtered to create a map, that is joined to the huge table. The result is group by on a dimension and finally some of the metrics from the huge table are aggregated to calculate stats.
Answer: We’ve recently introduced Polymorphic Table Functions to Trino in version 381.
In version 384, which was just released a few days ago, the query table function was added in PR 12325.
For a quick example in MySQL:
trino> USE mysql.tiny; USE trino:tiny> SELECT * FROM TABLE(system.query(query => 'SELECT 1 a')); a --- 1 (1 row) trino:tiny> SELECT * FROM TABLE(system.query(query => 'SELECT @@version')); @@version ----------- 8.0.29 (1 row)
So this will run exactly the command on the underlying database (not exactly a
pushdown but a pass-through) and return the results to Trino as a Table.
SELECT @@version is MySQL specific syntax that returns MySQL output as a table
that now Trino is able to further process.
Trino Meetup groups
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.