
Trino Community Broadcast

35: Packaging and modernizing Trino

Apr 21, 2022

Show notes

Trino nation, we want to hear from you! If you have a question or pull request that you would like us to feature on the show, join the Trino Slack, head to the #trino-community-broadcast channel, and let us know there. Otherwise, you can message Manfred Moser or Brian Olsen directly. Also, feel free to reach out to us on Twitter: Brian is @bitsondatadev and Manfred is @simpligility.

If you want to show us some 💕, please give us a ⭐ on GitHub.

Releases 375 to 378

Official highlights from Martin Traverso:

Trino 375

  • Support for table comments in the MySQL connector.
  • Improved predicate pushdown for PostgreSQL.
  • Performance improvements for aggregations with filters.

Trino 376

  • Better performance when reading Parquet data.
  • Join pushdown for MySQL.
  • Aggregation pushdown for Oracle.
  • Support table and column comments in ClickHouse connector.
  • Support for adding and deleting schemas in Accumulo connector.
  • Support system truststore in CLI and JDBC driver.
  • Two-way TLS/SSL certificate validation with LDAP authentication.

Trino 377

  • Add support for standard SQL trim syntax.
  • Better performance for Glue metastore.
  • Join pushdown for SQL Server connector.

Trino 378

  • New to_base32 and from_base32 functions.
  • New expire_snapshots and delete_orphan_files table procedures for Iceberg.
  • Faster planning of queries with IN predicates.
  • Faster query planning for Hive, Delta Lake, Iceberg, MySQL, PostgreSQL, and SQL Server connectors.

Additional highlights worth a mention according to Manfred:

  • Generally lots of improvements on Hive, Delta Lake, Iceberg, and main JDBC-based connectors.
  • Full Iceberg v2 table format support, first for read and later for read and write operations, is getting closer and closer.
  • Table statistics support for PostgreSQL, MySQL, and SQL Server connector including automatic join pushdown.
  • Fix failure of DISTINCT .. LIMIT operator when input data is dictionary encoded.
  • Add new page to display the runtime information of all workers in the cluster in Web UI.
  • Remove user property requirement in JDBC driver.
  • Require an internal-communication.shared-secret value when authentication is used, a breaking change for many users who have not set that secret.

More detailed information is available in the release notes for Trino 375, Trino 376, Trino 377, and Trino 378.

Concept of the episode: Packaging Trino

To adopt Trino you typically need to run it on a cluster of machines. These can be bare metal servers, virtual machines, or even containers. The Trino project provides a few binary packages to allow you to install Trino:

  • tarball
  • rpm
  • container image

All of them bundle the Java libraries that constitute Trino and all the plugins. As a result there are only a few requirements: a Linux operating system, since some of the libraries and code indirectly require Linux, and a Java 11 runtime.

Beyond that there is just the bin/launcher script, which is highly recommended but not required. It can be used as a service script or for manually starting, stopping, and checking the status of Trino, and it only needs Python.
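As a sketch, typical launcher usage from the installation directory looks like this; the subcommands follow the launcher's built-in help:

```shell
# Manage a Trino server with the launcher script.
bin/launcher start     # start Trino as a background daemon
bin/launcher status    # check whether the server is running
bin/launcher restart   # stop and then start the server
bin/launcher stop      # shut the server down
bin/launcher run       # run in the foreground, logging to the console
```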

Tarball

The tarball is a gzip-compressed tar archive. For installation you just need to extract the archive anywhere. It contains the following directory structure.

  • bin, the launcher script and related files
  • lib, all globally needed libraries
  • plugin, connectors and other plugins with their own libraries, each in a separate sub-directory
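Assuming the Trino 378 release mentioned above, installation is just download and extract; the archive name follows the trino-server-&lt;version&gt; pattern:

```shell
# Unpack the server tarball anywhere; no further installation step is needed.
tar xzf trino-server-378.tar.gz
cd trino-server-378
```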

You need to create the etc directory with the needed configuration yourself, since the tarball does not include any defaults, and you cannot start the application without them.

  • etc/catalog/*.properties
  • etc/config.properties
  • etc/jvm.config
  • etc/log.properties
  • etc/node.properties

Note that all these files live within the extracted directory, not in system-wide locations.
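As a minimal sketch, the following creates a single-node configuration inside the extracted directory. All values are illustrative; adjust memory settings, ports, and the data directory for your environment.

```shell
# Create a minimal etc directory for a single-node setup inside the
# extracted tarball directory. All values below are illustrative.
mkdir -p etc/catalog

cat > etc/node.properties <<'EOF'
node.environment=demo
node.id=trino-demo-node-1
node.data-dir=/var/trino/data
EOF

cat > etc/jvm.config <<'EOF'
-server
-Xmx4G
-XX:+ExitOnOutOfMemoryError
EOF

cat > etc/config.properties <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080
EOF

cat > etc/log.properties <<'EOF'
io.trino=INFO
EOF

# A catalog file configures one connector; tpch needs no other settings.
cat > etc/catalog/tpch.properties <<'EOF'
connector.name=tpch
EOF
```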

RPM

The RPM archive is suitable for RPM-based Linux distributions, but testing is not very thorough across different versions and distributions.

It adapts the tarball content to the Linux file system hierarchy, hooks up the launcher script as a daemon script, and adds default configuration files. That allows you to start Trino right after installing the package, and to have it start automatically after system restarts.

Locations used are /etc/trino, /var/lib/trino, and others. These are configured via the launcher script parameters.
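Installation and service management then look roughly like this; the exact RPM file name depends on the release you download:

```shell
# Install the RPM (Trino 378 used as an example) and run it as a service.
rpm -i trino-server-rpm-378.noarch.rpm
service trino start
service trino status
```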

In a nutshell the RPM adds some convenience, but narrows down the supported Linux distributions. It still requires Java and Python installation and management.

Container image

The container image for Trino adds the necessary Linux, Java, and Python, and adapts Trino to the container setup.

The container adds even more convenience, since it is ready to use out of the box. It allows usage on Kubernetes with the help of the Helm charts, and includes the required operating system and application parts automatically.
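A sketch of both paths, using the official trinodb/trino image and the community Helm charts; the release name and port mapping are illustrative:

```shell
# Run a single container; the Web UI and API listen on port 8080.
docker run -d --name trino -p 8080:8080 trinodb/trino:378

# Or deploy to Kubernetes with the Helm charts.
helm repo add trino https://trinodb.github.io/charts
helm install my-trino trino/trino
```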

Customization

All three packages Trino ships are just defaults. They all require further configuration to adapt Trino to your specific needs in terms of hardware, connected data sources, security configuration, and so on. All of this can be done manually or with many existing tools.

However, you can also take it a step further and create your own package suited to your needs. The tarball can be used as the source for any customization. The following is a list of options and scenarios:

  • Use the tarball, but remove unused plugins.
  • Use the tarball as the source to create your own distribution-specific package, for example a deb archive for Ubuntu, or an apk package for Alpine.
  • Create your own RPM similar to Manfred’s proof of concept that pulls out the Trino RPM package creation into a separate project.
  • Create your own container image with different base distro, custom set of plugins, and even with all your configuration baked into the image.
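For the container route, a hypothetical Dockerfile could look like the following. The plugin and configuration paths are assumptions based on the layout of the official image, and the removed connectors are just examples:

```dockerfile
# Hypothetical custom image: start from the official image and customize it.
FROM trinodb/trino:378

# Example: remove connectors you never use to shrink the image.
RUN rm -r /usr/lib/trino/plugin/accumulo /usr/lib/trino/plugin/kudu

# Example: bake your own configuration into the image.
COPY etc/ /etc/trino/
```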

Others

You can also use brew on macOS, but that is not suitable for production usage. It is more of a convenience to get a local Trino for playing around.
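A quick local setup could look like this; whether the background-service integration is available depends on the Homebrew formula:

```shell
# Install Trino locally for experimentation only.
brew install trino
# Start the server in the background, assuming the formula ships a
# service definition for Homebrew's service manager.
brew services start trino
# Connect with the bundled CLI once the server is up.
trino --server localhost:8080
```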

Additional topic of the episode: Modernizing Trino with Java 17

Currently Java 11 is required for Trino. Java 17 is the latest and greatest Java LTS release with lots of good performance, security, and language improvements. The community has been working hard to make Java 17 support a reality. At this stage core Trino fully supports Java 17. Starburst Galaxy for example uses Java 17.

The maintainers and contributors would like to move to fully support and also require Java 17 soon. Here is where your input comes in, and we ask that you let us know your thoughts about questions such as the following:

  • Are you looking forward to the new Java 17 language features and other improvements as a contributor to Trino?
  • Are you already using Java 17 with Trino? In production or just testing?
  • If we require Java 17 in the next months, can you update to use Java 17 with Trino?
  • If not, what are some of the hurdles?
  • Are you okay with staying at an older release, until you can use Java 17?

Let us know on the #dev channel on Trino Slack or ping us directly. You can also chime in on the roadmap issue.

Pull requests of the episode: Worker stats in the Web UI

The PR of the episode was submitted by GitHub user whutpencil, and adds a significant new feature to the Web UI. It exposes the system.runtime.nodes information, so statistics for each worker, on brand new pages. What a great effort! Special thanks also go out to Dawid Adamek (dedep) for the review.

Demo of the episode: Tarball installation and new Web UI feature

In the demo of the episode Manfred shows a worker installation added to a local tarball install of a coordinator, and then demos the Web UI with the new feature from the pull request of the episode.

Question of the episode: Are write operations in Delta Lake supported for tables stored on HDFS?

Full question from Slack: I was trying the Delta Lake connector. I noticed that write operations are supported for tables stored on Azure ADLS Gen2, S3 and S3-compatible storage. Does that mean write operations are not supported for tables stored on HDFS?

Answer: HDFS is always implicitly supported for data lake connectors. It isn’t called out because it is assumed.

The confusion actually came from an error message the user saw when they tried to insert into a Delta Lake table they had created in Spark. They attempted to insert a record into the table through IntelliJ IDEA and received the following error message:

Unsupported target SQL type: -155

They thought the problem might be a wrong data type for the birthdate column, and used the statement below to insert a record into the table.

INSERT INTO
  presto.people10m (id, firstname, middlename, lastname, gender, birthdate, ssn, salary)
VALUES (1, 'a', 'b', 'c', 'male', timestamp '1990-01-01 00:00:00 +00:00', 'd', 10);

However, they then got an error message like this:

Query 20220419_031201_00015_8qe76 failed:
Cannot write to table in hdfs://masters/presto.db/people10m; hdfs not supported

This turned out to be an issue in the IntelliJ client, not in Trino.

Trino Meetup groups

If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.