23: Trino looking for patterns

Video

Audio

23_ Trino looking for patterns.mp3

1:21:40

Guests

Kasia Findeisen, Software Engineer at Starburst (@kasiafi).

Release 360

In our last episode we already had a bit of a glimpse. Now the release is really out.

Official announcement items from Martin:

Automatic configuration of TLS for internal communication.
Improved correlated subqueries with GROUP BY or LIMIT.
Support for assuming an IAM role in Elasticsearch connector.
Support for Trino views in Iceberg connector.

Manfred’s additional notes:

Documentation for materialized views SQL commands
Partial support for DELETE and batch insert support for various JDBC-based connectors
A bunch of performance and correctness fixes
Numerous improvements on Iceberg connector

More info at https://trino.io/docs/current/release/release-360.html.

Concept of the week: Row pattern matching and MATCH_RECOGNIZE

The MATCH_RECOGNIZE syntax was introduced in the latest SQL specification of 2016. It is a super powerful tool for analyzing trends in your data. We are proud to announce that Trino supports this great feature since version 356. With MATCH_RECOGNIZE, you can define a pattern using the well-known regular expression syntax, and match it to a set of rows. Upon finding a matching row sequence, you can retrieve all kinds of detailed or summary information about the match, and pass it on to be processed by the subsequent parts of your query. This is a new level of what a pure SQL statement can do.

For more details, this blog post gives you a taste of row pattern matching capabilities, and a quick overview of the MATCH_RECOGNIZE syntax.

Let’s look at an example with data similar to the TPCH data. Here is an example, and the same goal: detect a “V”-shape of the price values over time for different customers.

trino> WITH orders(customer_id, order_date, price) AS (VALUES
    ('cust_1', DATE '2020-05-11', 100),
    ('cust_1', DATE '2020-05-12', 200),
    ('cust_2', DATE '2020-05-13',   8),
    ('cust_1', DATE '2020-05-14', 100),
    ('cust_2', DATE '2020-05-15',   4),
    ('cust_1', DATE '2020-05-16',  50),
    ('cust_1', DATE '2020-05-17', 100),
    ('cust_2', DATE '2020-05-18',   6))
SELECT customer_id, start_price, bottom_price, final_price, start_date, final_date
    FROM orders
        MATCH_RECOGNIZE (
            PARTITION BY customer_id
            ORDER BY order_date
            MEASURES
                START.price AS start_price,
                LAST(DOWN.price) AS bottom_price,
                LAST(UP.price) AS final_price,
                START.order_date AS start_date,
                LAST(UP.order_date) AS final_date
            ONE ROW PER MATCH
            AFTER MATCH SKIP PAST LAST ROW
            PATTERN (START DOWN+ UP+)
            DEFINE
                DOWN AS price < PREV(price),
                UP AS price > PREV(price)
            );

 customer_id | start_price | bottom_price | final_price | start_date | final_date
-------------+-------------+--------------+-------------+------------+------------
 cust_1      |         200 |           50 |         100 | 2020-05-12 | 2020-05-17
 cust_2      |           8 |            4 |           6 | 2020-05-13 | 2020-05-18
(2 rows)

Two matches are detected, one for cust_1, and one for cust_2.

The matching algorithm was a collaboration between Martin and Kasia. This algorithm lives in the Matcher class.

The running semantics is the default both in the DEFINE and MESAURES clauses. Note that FINAL only applies to the MEASURES clause.

To sum up, here’s one complex measure expression combining different elements of the special syntax:

PR of the week: PR 8348 Document row pattern recognition in window

The PR of the week, is adding documentation for applying pattern matching over windows. This is yet another SQL functionality that Kasia added after getting the patter recognition to work with MATCH_RECOGNIZE.

Demo: Showing MATCH_RECOGNIZE functionality by example

Here are a few examples that Kasia will be running:

Demo preview:

The initial query. That’s mostly the same query that’s in the blog post, the differences being:

Usage of a real table instead of a CTE.
Additional sort key for consistent ordering
Two more measures

 SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ONE ROW PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN+ UP+)
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       )

The query returns many results (many matches). Wrap it in a count() aggregation to check how many there are:

 SELECT count() FROM (SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ONE ROW PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN+ UP+)
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       ))

Modify the PATTERN to limit the results. Now searching for a “big V”:

 SELECT count() FROM (SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ONE ROW PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN{3,} UP{4,})
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       ))

Unwrap from count() aggregation to see the actual matches:

 SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ONE ROW PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN{3,} UP{4,})
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       )

Change AFTER MATCH SKIP PAST LAST ROW to AFTER MATCH SKIP TO NEXT ROW to detect overlapping matches:

 SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ONE ROW PER MATCH
                       AFTER MATCH SKIP TO NEXT ROW
                       PATTERN (START DOWN{3,} UP{4,})
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       )

Change ONE ROW PER MATCH to ALL ROWS PER MATCH (also, revert the previous change). Discuss the classy column and explain the running semantics on the example of final_date column:

 SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ALL ROWS PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN{3,} UP{4,})
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       )

Change the semantics of the final_date column to FINAL:

 SELECT custkey, match_no, start_price, bottom_price, final_price, start_date, final_date, classy
               FROM orders
                   MATCH_RECOGNIZE (
                       PARTITION BY custkey
                       ORDER BY orderdate, orderkey
                       MEASURES
                           START.totalprice AS start_price,
                           LAST(DOWN.totalprice) AS bottom_price,
                           LAST(UP.totalprice) AS final_price,
                           START.orderdate AS start_date,
                           FINAL LAST(UP.orderdate) AS final_date,
                           MATCH_NUMBER() AS match_no,
                           CLASSIFIER() AS classy
                       ALL ROWS PER MATCH
                       AFTER MATCH SKIP PAST LAST ROW
                       PATTERN (START DOWN{3,} UP{4,})
                       DEFINE
                           DOWN AS totalprice < PREV(totalprice),
                           UP AS totalprice > PREV(totalprice)
                       )

Question of the week: How do you tag a list of rows with custom periodic rules?

A StackOverflow user asked how to tag orders in a table that meet a certain criterion that relies on periodicity. There are certainly some complicated and inefficient SQL queries that you could craft to address these issues. However, now with MATCH_RECOGNIZE it is possible to do this and take advantage of the efficient matching capabilities that Martin and Kasia have added.

Here is an example orders table represented as a csv table:

Create_time, Order_id, person_id, variable_a
'2021-06-01', 1234, 2232, 1
'2021-06-02', 1235, 2232, 0.6
'2021-06-03', 1236, 2232, 0.33
'2021-06-04', 1237, 2232, 0.7
'2021-06-05', 1238, 2232, 0.6
'2021-06-06', 1239, 2232, 0.4
'2021-06-07', 1240, 2232, 0.8
'2021-06-08', 1241, 2232, 0.7
'2021-06-09', 1242, 2232, 0.4
'2021-06-10', 1243, 2232, 0.6
'2021-06-11', 1244, 2232, 0.7
'2021-06-12', 1245, 2232, 0.6

The grace period logic will produce the final_hit column as the result of this logic:

The is_hit column equals to 1 if the variable A less than equal to 0.5
There is a grace period totaling 4 Orders after the hit, so any hit that is within the grace period will be ignored. The resulting row can be called final_hit.

Based on this logic, this is the desired result of the example is:

Create_time, Order_id, person_id, variable_a, is_hit, final_hit
'2021-06-01', 1234, 2232, 1, NULL, NULL
'2021-06-02', 1235, 2232, 0.6, NULL, NULL
'2021-06-03', 1236, 2232, 0.33, true, true
'2021-06-04', 1237, 2232, 0.7, NULL, NULL
'2021-06-05', 1238, 2232, 0.6, NULL, NULL
'2021-06-06', 1239, 2232, 0.4, true, NULL
'2021-06-07', 1240, 2232, 0.8, NULL, NULL
'2021-06-08', 1241, 2232, 0.7, NULL, NULL
'2021-06-09', 1242, 2232, 0.4, true, true
'2021-06-10', 1243, 2232, 0.6, NULL, NULL
'2021-06-11', 1244, 2232, 0.7, NULL, NULL
'2021-06-12', 1245, 2232, 0.6, NULL, NULL

To accomplish this with MATCH_RECOGNIZE, you can do the following statement, which gives us the correct answer:

WITH data(Create_time, Order_id, person_id, variable_a) AS (
    VALUES
      (DATE '2021-06-01', 1234, 2232, 1),
      (DATE '2021-06-02', 1235, 2232, 0.6),
      (DATE '2021-06-03', 1236, 2232, 0.33),
      (DATE '2021-06-04', 1237, 2232, 0.7),
      (DATE '2021-06-05', 1238, 2232, 0.6),
      (DATE '2021-06-06', 1239, 2232, 0.4),
      (DATE '2021-06-07', 1240, 2232, 0.8),
      (DATE '2021-06-08', 1241, 2232, 0.7),
      (DATE '2021-06-09', 1242, 2232, 0.4),
      (DATE '2021-06-10', 1243, 2232, 0.6),
      (DATE '2021-06-11', 1244, 2232, 0.7),
      (DATE '2021-06-12', 1245, 2232, 0.6)
)
SELECT Create_time, Order_id, person_id, variable_a, if(variable_a <= 0.5, true, null) is_hit, final_hit
FROM data
   MATCH_RECOGNIZE (
     PARTITION BY person_id
     ORDER BY Create_time
     MEASURES if(classifier() = 'HIT', true, null) AS final_hit
     ALL ROWS PER MATCH WITH UNMATCHED ROWS
     AFTER MATCH SKIP PAST LAST ROW
     PATTERN (HIT G{,4})
     DEFINE /* G -- grace period */
            HIT AS HIT.variable_a <= 0.5
  )

Check out Martin and Kasia’s full answer to this question.

Events, news, and various links

Trino Meetup groups

Virtual
East Coast (US)
- Trino Boston
- Trino NYC
West Coast (US)
- Trino San Fransisco
- Trino Los Angeles
Mid West (US)
- Trino Chicago

If you want to learn more about Trino, check out the definitive guide from OReilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino Community Broadcast