In this episode, we talk to two engineers from Raft and discuss how they use Trino to connect data silos that exist across different departments in various government sectors.
Trino Summit 2022 is just around the corner! This will be a hybrid event on November 10th, taking place in person at the Commonwealth Club in San Francisco, CA, with the option to attend remotely. If you want to present, the call for speakers is open until September 15th.
You can register for the conference at any time. We must limit in-person registrations to 250 attendees, so register soon if you plan on attending in person!
Official highlights from Martin Traverso:
- `DELETE` queries.
- `__time` column in Druid.
- `MERGE`.
- `LIMIT` queries.
- ppc64le.

Additional highlights worth a mention according to Manfred:
More detailed information is available in the release notes for Trino 392 and Trino 393.
Raft provides consulting services and is particularly skilled at DevSecOps. One notable challenge they face is dealing with fragmented government infrastructure. In this episode, we dive in to learn how Trino enables Raft to supply government sector clients with a data fabric solution. Raft takes a special stance on using and contributing to open source solutions that run well on the cloud.
“A software factory is an organized approach to software development that provides software design and development teams a repeatable, well-defined path to create and update software. It results in a robust, compliant, and more resilient process for delivering applications to production.” – VMware
This is a push back against previous attempts by larger government contractors to build one-size-fits-all solutions, which ultimately failed. The new wave of government solutions relies on methodologies similar to those of the software industry, adding rules and standards around the technologies that can be adopted in the stack.
Software factories are now common practice for government agencies, as they can adopt standardized software stacks that go through rigorous validation to make sure they meet government standards. One important property of these stacks is that they can be deployed in virtually any environment. A common way to achieve this is by using Kubernetes and containers.
With the movement towards standardization, government contractors generally build their stacks from Kubernetes templates. Kubernetes underpins each of these stacks, while telemetry, monitoring, and policy agents are layered on top. Raft wanted to provide a “single pane of glass” over the fragmented systems that the Department of Defense (DoD) operates on, so they began developing a stack that includes Trino as the method to connect data across various silos.
Data Fabric is an attempt to provide government agencies the ability to set up a data mesh backed by Trino. Trino fits well in this narrative as it provides SQL-over-everything: data analysts and data scientists only need to know SQL.
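As a minimal sketch of what SQL-over-everything looks like in practice, a single Trino query can join data living in two different catalogs. The server URL, catalog, schema, and table names below are all hypothetical:

```bash
# One query across two systems: join a relational silo (postgresql catalog)
# with a data lake (hive catalog). All names here are made up for illustration.
trino --server https://trino.example.mil:8443 --execute "
SELECT u.unit_name, count(*) AS flights
FROM postgresql.hr.units u
JOIN hive.logistics.flight_logs f ON u.unit_id = f.unit_id
GROUP BY u.unit_name
"
```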
Data Fabric MVP is an end-to-end DataOps capability that can be deployed at the edge, in the cloud, and in disconnected environments within minutes. It provides a single control plane for normalizing and combining disparate data lakes, platforms, silos, and formats into SQL, using Trino for batch data and Apache Pinot for user-facing streaming analytics.
Data Fabric is driven by cloud native policy using Open Policy Agent (OPA) integrated with Trino and Kafka to provide row- and column-level obfuscation. It provides an enterprise data catalog to view data lineage, properties, and data owners from multiple data platforms. – Raft
A common first question the Raft team gets asked is whether Trino is a high security concern. The fact that Trino can connect to multiple data sources from one location raises the fear that individuals may gain access to information at a higher classification level than they are cleared for. The team has to educate users on best practices to ensure this problem doesn’t occur: you need a separate deployment of Data Fabric for each classification level, along with correctly defined OPA policies that restrict visibility to information above a user’s clearance.
Iron Bank is a central repository of digitally-signed container images, including open-source and commercial off-the-shelf software, hardened to the DoD’s exacting specifications. Approved containers in Iron Bank have DoD-wide reciprocity across all classifications, accelerating the security approval process from months or even years down to weeks.
To be considered for inclusion into Iron Bank, container images must meet rigorous DoD software security standards. It is an extensive, continuous, complicated effort for even the most sophisticated IT teams. Continuously maintaining and managing hardening pipelines while incorporating evolving DoD specifications and addressing new vulnerabilities (CVEs) can severely stretch your resources, even if you have advanced tooling and experience in-house. (Source)
The Trino Docker image is available in Iron Bank and is maintained by folks at Booz Allen Hamilton. Their hard work makes it possible for Trino to be deployed in DoD environments.
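As an illustration, pulling a hardened image from Iron Bank involves authenticating to the Registry One registry first. The repository path and tag below are assumptions, so verify the actual location in Iron Bank:

```bash
# Authenticate with your Registry One (Iron Bank) credentials.
docker login registry1.dso.mil

# Hypothetical repository path and tag; check Iron Bank for the real ones.
docker pull registry1.dso.mil/ironbank/opensource/trino/trino:latest
```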
This PR of the episode was contributed by preethiratnam. The pull request enables S3 Select pushdown during SELECT operations on JSON files. The pushdown logic is restricted to root-level JSON fields, similar to CSV. S3 Select does support filtering on nested columns in JSON files, but to limit the scope of this change, that work is planned for a later PR.

Querying JSON files is already expensive enough, as you pay a hefty penalty for deserialization, so filtering out rows at the source is a welcome improvement. Thanks to Andrii Rosa (arhimondr) for the review.
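As a rough sketch of how this feature is used, the Hive connector exposes a configuration property to turn on S3 Select pushdown. The catalog file location, schema, and table names below are assumptions:

```bash
# Enable S3 Select pushdown in the Hive catalog configuration
# (assuming the catalog file lives at etc/catalog/hive.properties).
echo "hive.s3select-pushdown.enabled=true" >> etc/catalog/hive.properties

# After restarting Trino, a filter on a root-level JSON field can be
# evaluated by S3 Select, so fewer rows leave S3. The table is hypothetical.
trino --execute "SELECT id, message FROM hive.logs.json_events WHERE level = 'ERROR'"
```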
For this episode’s demo, you’ll need a local Trino coordinator, MinIO instance, Hive metastore, and an edge node where various data libraries like Great Expectations can run. Clone the trino-datalake repository and navigate to its root directory in your CLI. Then start up the containers using Docker Compose.
git clone git@github.com:bitsondatadev/trino-datalake.git
cd trino-datalake
docker-compose up -d
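To confirm that the coordinator, MinIO, metastore, and edge node containers all came up, you can list the Compose services:

```bash
# Show the state of every service defined in docker-compose.yml.
docker-compose ps
```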
The rest of the demo is available in this markdown tutorial and is covered in the video demo below.
Full question from Trino Slack
This user was not able to use Helm due to some restrictions at their company. They needed the raw Kubernetes YAML files to deploy Trino.
Answer: While Helm offers very nice ways to deploy directly to a service that understands Helm charts, you can also use Helm on your own machine to generate all of the Kubernetes YAML configuration files. This can be done using the `helm template` command. See more on this in the Trinetes episode that details this command.
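As a quick sketch, assuming the community Trino chart from the trinodb charts repository and a release named my-trino, rendering and applying the manifests might look like this:

```bash
# Render the chart locally into plain Kubernetes YAML.
helm repo add trino https://trinodb.github.io/charts
helm template my-trino trino/trino --namespace trino > trino-manifests.yaml

# The generated manifests can then be applied without Helm.
kubectl apply --namespace trino -f trino-manifests.yaml
```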
Blogs
Check out the in-person and virtual Trino Meetup groups.
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Slowikowski.