39: Raft floats on Trino to federate silos

Guests

In this episode, we talk to two engineers from Raft about how they use Trino to connect data silos that exist across different departments in various government sectors:

Edward Morgan, Senior Platform Engineer/DevSecOps Manager at Raft
Steve Morgan, Chief Data Engineer at Raft

Register for Trino Summit 2022!

Trino Summit 2022 is just around the corner! This hybrid event takes place on November 10th, in person at the Commonwealth Club in San Francisco, CA, and can also be attended remotely! If you want to present, the call for speakers is open until September 15th.

You can register for the conference at any time. We must limit in-person registrations to 250 attendees, so register soon if you plan on attending in person!

Releases 392 to 393

Official highlights from Martin Traverso:

Trino 392

Support for dynamic filtering with fault-tolerant query execution.
Support for correlated subqueries in DELETE queries.
Support for Amazon S3 Select pushdown for JSON files.
Support for Avro format in Iceberg connector.
Faster queries when filtering by __time column in Druid.

Trino 393

Add support for MERGE.
Improved performance of highly selective LIMIT queries.
Experimental docker image for ppc64le.
Dynamic filtering support for various connectors.
Support for JSON and bytes type in Pinot.

Additional highlights worth a mention according to Manfred:

Lots of other improvements on Delta Lake, Hive, and Iceberg connectors.
Merge support in a bunch of connectors.
OAuth 2.0 refresh token fixes

More detailed information is available in the release notes for Trino 392 and Trino 393.

Concept of the episode: Trino at Raft

Raft provides consulting services and is particularly skilled at DevSecOps. One particular challenge they face is dealing with fragmented government infrastructure. In this episode, we dive in to learn how Trino enables Raft to supply government sector clients with a data fabric solution. Raft takes a special stance on using and contributing to open source solutions that run well on the cloud.

Intro to software factories

“A ‘software factory’ is an organized approach to software development that provides software design and development teams a repeatable, well-defined path to create and update software. It results in a robust, compliant, and more resilient process for delivering applications to production.” – VMware

This is a push against previous attempts by larger government contractors to build one-size-fits-all solutions, which ultimately failed. The new wave of government solutions relies on methodologies similar to those of the software industry, with additional rules and standards governing which technologies can be adopted in the stack.

Software factories are now common practice for government agencies, as they can adopt standardized software stacks that go through rigorous validation to make sure they meet government standards. One important element of these stacks is that they can be deployed in virtually any environment. A common way to do this is using Kubernetes and containers.

Standards and anatomy of a stack

With the movement towards standardization, government contractors generally build their stacks using Kubernetes templates. Kubernetes underpins each of these stacks, while telemetry, monitoring, and policy agents are layered on top. Raft wanted to provide a “single pane of glass” over the existing fragmented systems that the Department of Defense (DoD) operates on. They began to develop a stack that included Trino as their method to connect data across various silos.

Data Fabric at Raft

Data Fabric is an attempt to provide government agencies the ability to set up a data mesh that is backed by Trino. Trino fits well in this narrative as it provides SQL-over-everything. Data analysts and data scientists only need to know SQL.
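To make "SQL-over-everything" a little more concrete, here is a minimal sketch of a federated query that joins two silos in a single statement. The catalog names (`postgresql`, `hive`), the schemas, tables, and columns, and the coordinator address are all assumptions made up for illustration; they are not taken from Raft's deployment.

```shell
# Hypothetical catalogs: "postgresql" (a personnel database) and "hive" (a data lake).
# Schema, table, and column names are invented for this sketch.
FEDERATED_SQL=$(cat <<'SQL'
SELECT p.unit, count(*) AS open_requests
FROM postgresql.public.personnel AS p
JOIN hive.logistics.requests AS r
  ON r.requester_id = p.id
WHERE r.status = 'open'
GROUP BY p.unit
SQL
)

# TRINO_SERVER is an assumed coordinator address.
TRINO_SERVER="${TRINO_SERVER:-http://localhost:8080}"

# Run the query if the Trino CLI is installed, otherwise just print it.
if command -v trino >/dev/null 2>&1; then
  trino --server "$TRINO_SERVER" --execute "$FEDERATED_SQL"
else
  printf '%s\n' "$FEDERATED_SQL"
fi
```

The analyst writes ordinary SQL; which system holds each table is expressed only through the catalog prefix.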

Data Fabric MVP is an end-to-end DataOps capability that can be deployed at the edge, in the cloud, and in disconnected environments within minutes. It provides a single control plane for normalizing and combining disparate data lakes, platforms, silos, and formats into SQL using Trino for batch data and Apache Pinot for user facing streaming analytics.

Data Fabric is driven by cloud native policy using Open Policy Agent (OPA) integrated with Trino and Kafka to provide row and column level obfuscation. It provides enterprise data catalog to view data lineage, properties, and data owners from multiple data platforms. – Raft

Security concerns around Trino

A common first question the Raft team hears is whether Trino poses a serious security risk. Because Trino can connect to multiple data sources from one location, people fear that individuals may gain access to information at a higher classification level than they are cleared for. The team has to educate users on the best practices that prevent this problem. You need a separate deployment of Data Fabric for each classification level, and you must correctly define policies in OPA that restrict visibility of information above a user’s clearance.

Iron Bank container repository

Iron Bank is a central repository of digitally-signed container images, including open-source and commercial off-the-shelf software, hardened to the DoD’s exacting specifications. Approved containers in Iron Bank have DoD-wide reciprocity across all classifications, accelerating the security approval process from months or even years down to weeks.

To be considered for inclusion into Iron Bank, container images must meet rigorous DoD software security standards. It is an extensive, continuous, complicated effort for even the most sophisticated IT teams. Continuously maintaining and managing hardening pipelines while incorporating evolving DoD specifications and addressing new vulnerabilities (CVEs) can severely stretch your resources, even if you have advanced tooling and experience in-house. (Source)

The Trino Docker image is available in Iron Bank and is maintained by folks at Booz Allen Hamilton. Their hard work makes it possible for Trino to be deployed in DoD environments.

Pull requests of the episode: PR 13354: Add S3 Select pushdown for JSON files

This episode’s PR was contributed by preethiratnam. It enables S3 Select pushdown for SELECT queries over JSON files. The pushdown logic is restricted to root JSON fields, similar to CSV. S3 Select does support nested column filtering on JSON files, but that is planned for a later PR to limit the scope of this one.

It’s already expensive enough to query JSON files, as you pay a hefty penalty for deserialization. With pushdown, at least a lot of rows are filtered out before they reach Trino. Thanks to Andrii Rosa (arhimondr) for the review.
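As a rough sketch of how the feature might be switched on, the snippet below appends the S3 Select pushdown property to a Hive catalog file and prints a query with a filter on a root-level JSON field. The property name is the one documented for the Hive connector around these releases, so check the documentation for your version. The catalog file path, catalog name, table, and columns are hypothetical.

```shell
# CATALOG_FILE is a hypothetical path to a Hive catalog properties file.
CATALOG_FILE="${CATALOG_FILE:-./hive.properties}"

# Enable S3 Select pushdown for the catalog, adding the line only once.
PROP='hive.s3select-pushdown.enabled=true'
grep -qxF "$PROP" "$CATALOG_FILE" 2>/dev/null || printf '%s\n' "$PROP" >> "$CATALOG_FILE"

# A filter on a root-level JSON field is the kind of predicate that can be pushed down.
# The schema, table, and column names are invented for this sketch.
PUSHDOWN_SQL="SELECT event_id FROM hive.logs.events_json WHERE status = 'error'"
printf '%s\n' "$PUSHDOWN_SQL"
```

A filter on a nested field would still be evaluated in Trino, since the PR limits pushdown to root fields.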

Demo of the episode: Running Great Expectations on a Trino Data Lakehouse Tutorial

For this episode’s demo, you’ll need a local Trino coordinator, a MinIO instance, a Hive metastore, and an edge node where data libraries like Great Expectations can run. Clone the trino-datalake repository and navigate to its root directory in your CLI. Then start the containers using Docker Compose.

git clone git@github.com:bitsondatadev/trino-datalake.git

cd trino-datalake

docker-compose up -d

The rest of the demo is available in this markdown tutorial and is covered in the video demo below.

Question of the episode: How can I deploy Trino on Kubernetes without using Helm charts?

Full question from Trino Slack

This user was not able to use Helm due to a restriction at their company. They needed the raw Kubernetes YAML files to deploy Trino.

Answer: While Helm offers convenient ways to deploy directly to a service that understands Helm charts, you can also use Helm on your own machine to generate all the Kubernetes YAML configuration files. This is done using the helm template command. The Trinetes episode covers this command in more detail.
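The approach in the answer could look roughly like the following sketch. It assumes the community chart repository at https://trinodb.github.io/charts; the release name `my-trino` and the output file name are hypothetical choices for this example.

```shell
# Render the Trino Helm chart to plain Kubernetes YAML on a workstation.
# RELEASE and OUT are hypothetical names chosen for this sketch.
RELEASE="my-trino"
OUT="trino-manifests.yaml"

render_trino_manifests() {
  # Register the chart repository, refresh the index, then render locally.
  helm repo add trino https://trinodb.github.io/charts &&
  helm repo update &&
  helm template "$RELEASE" trino/trino > "$OUT"
}

# Only attempt the render where helm is actually installed.
if command -v helm >/dev/null 2>&1; then
  render_trino_manifests && echo "wrote $OUT"
else
  echo "helm not installed; skipping render"
fi
```

The rendered file can then be reviewed, checked into version control, and applied in the restricted environment with `kubectl apply -f trino-manifests.yaml`, without Helm being present there.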

Events, news, and various links

Blogs

Check out the in-person and virtual Trino Meetup groups.

If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Slowikowski.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino Community Broadcast