Commander Bun Bun the speedy druid!
Release notes discussed: https://trino.io/docs/current/release/release-356.html
This week covers Apache Druid, a modern, real-time OLAP database. Joining us is the head of developer relations at Imply, the company that creates an enterprise version of Druid, to cover what Druid is, and the use cases it solves.
Here are the slides that Rachel uses in the show:
Druid has several process types:
The Druid architecture.
Druid processes can be deployed any way you like, but for ease of deployment we suggest organizing them into three server types: Master, Query, and Data.
Source: https://druid.apache.org/docs/latest/design/architecture.html.
Our guest, Samarth, is the author of this weeks PR of the week. Puneet Jaiswal is the first engineer that started work to add a Druid connector. Later, Samarth picked up the torch and the Trino Druid connector became available in release 337.
An honorable mention goes to our other guest, Parth, for doing some preliminary work that enabled aggregation pushdown in the SPI. This enabled the use of the Druid connector to actually scale well with the completion of PR 4313 (see future work below).
A third honorable PR, that was completed by @findepi, was adding pushdown to the jdbc client which appeared in release 337 along with the Druid connector.
It is incredible to see the amount of hands that various features and connectors pass through to get to the final release.
Let’s start up the Druid cluster along with the required Zookeeper and
PostgreSQL instance. Clone this repository and navigate to the trino-druid
directory.
git clone [email protected]:bitsondatadev/trino-getting-started.git
cd druid/trino-druid
docker-compose up -d
To do batch insert, navigate to the Druid Web UI once it has finished starting up at http://localhost:8888. Once that is done, click the “Load data” button, choose, “Example data”, and follow the prompts to create the native batch ingestion spec. Once the spec is created, run the job and ingest the data. More information can be found here: https://druid.apache.org/docs/latest/tutorials/index.html
The Druid architecture.
Once Druid completes the task, open up a Trino connection and validate that the
druid
catalog exists.
docker exec -it trino-druid_trino-coordinator_1 trino
trino> SHOW CATALOGS;
Catalog
---------
druid
system
tpcds
tpch
(4 rows)
Now show the tables under the druid.druid
schema.
trino> SHOW TABLES IN druid.druid;
Table
-----------
wikipedia
(1 row)
Run a SHOW CREATE TABLE
to see the column definitions.
trino> SHOW CREATE TABLE druid.druid.wikipedia;
Create Table
--------------------------------------
CREATE TABLE druid.druid.wikipedia (
__time timestamp(3) NOT NULL,
added bigint NOT NULL,
channel varchar,
cityname varchar,
comment varchar,
commentlength bigint NOT NULL,
countryisocode varchar,
countryname varchar,
deleted bigint NOT NULL,
delta bigint NOT NULL,
deltabucket bigint NOT NULL,
diffurl varchar,
flags varchar,
isanonymous varchar,
isminor varchar,
isnew varchar,
isrobot varchar,
isunpatrolled varchar,
metrocode varchar,
namespace varchar,
page varchar,
regionisocode varchar,
regionname varchar,
user varchar
)
(1 row)
Finally, query the first 5 rows of data showing the user and how much they added.
trino> SELECT user, added FROM druid.druid.wikipedia LIMIT 5;
user | added
-----------------+-------
Lsjbot | 31
ワーナー成増 | 125
181.230.118.178 | 2
JasonAQuest | 0
Kolega2357 | 0
(5 rows)
To answer this question I’m going to quote Samarth and Parth on this from this super long but enlightening thread on the subject.
Pro JDBC:
Going forward, Druid SQL is going to be the de-facto way of accessing Druid data with native JSON queries being more of an advanced level use case. A benefit of down the SQL route is that we can take advantage of all the changes made in the Druid SQL optimizer land like using vectorized query processing when possible, when to use a TopN vs group by query type, etc. If we were to hit historicals directly, which don’t support SQL querying, we potentially won’t be taking advantages of such optimizations unless we keep porting/applying them to the trino-druid connector which may not always be possible.
If we end up letting a Trino node act as a Druid broker (which is what would happen I assume when you let a Trino node do the final merging), then, you would need to allocate similar kinds of resources (direct memory buffers, etc.) to all the Trino worker nodes as a Druid broker which may not be ideal.
This is not necessarily a limitation but adds complexity - with your proposed implementation, the Trino cluster will need to maintain state about what Druid segments are hosted on what data nodes (middle managers and historicals). The Druid broker already maintains that state and having to replicate and store all that state on the Trino coordinator will demand more resources out of it.
To your point on SCAN query overwhelming the broker - that shouldn’t be the case as Druid scan query type streams results through broker instead of materializing all of them in memory. See: https://druid.apache.org/docs/latest/querying/scan-query.html
Pro HTTP:
I would advocate for going the Druid SQL route at least for the initial version of the connector. This would provide a solution for the majority of the use cases that Druid generally is used for (OLAP style queries over pre-aggregated data). We could in the next version of the connector, possibly focus on adding a new mode of the connector which can make native JSON queries directly to the Druid historicals and middle managers instead of submitting SQL queries to the broker.
Our general take is that Druid is designed as OLAP cube and so it is really fast when it comes to aggregate queries over reasonable cardinality dimensions and will not work well for use cases that are treating it like a regular data warehouse and trying to do pure select scans with filter. The primary reason most of our users would look to Trino’s Druid connector is:
To be able to join already aggregated data in Druid to some other datastore in our warehouse.
To gain access through tooling that doesn’t have good support for Druid inherently for dashboarding use cases (think Tableau).
Even if we wanted to support the use cases that Druid is not designed for in a more efficient manner by going thorough historicals directly, it has other implications. We are now talking about partial aggregation pushdown which is more complicated IMO than our current approach of complete pushdown. We could choose to take the approach that others have taken where we can incrementally add a mode to Druid connector to either use JDBC or go directly to historical, but I really don’t think it’s a good idea to block the current development in hopes of a more efficient future version specially when this is just implementation detail that we can switch anytime without breaking any user queries.
Trino Summit: http://starburst.io/trinosummit2021
Blogs
Videos
Trino Meetup groups
Latest training from David, Dain, and Martin(Now with timestamps!):
If you want to learn more about Trino, check out the definitive guide from OReilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.