
Trino Community Broadcast

38: Trino tacks on polymorphic table functions

Jul 21, 2022


Guests

In this episode we have the pleasure of chatting with a couple of familiar faces who have been hard at work building and understanding the features we’re talking about today:

Releases 387 to 391

Trino 387

  • Support for writing ORC Bloom filters for varchar columns.
  • Support for querying Pinot via the gRPC endpoint.
  • Support for predicate pushdown on string columns in Redis.
  • Support for OPTIMIZE on Iceberg tables with non-identity partitioning.

Trino 388

  • Support for JSON output in EXPLAIN.
  • Improved performance for row data types.
  • Support for OAuth 2.0 refresh tokens.
  • Support for table and column comments in Delta Lake.

Trino 389

  • Improved performance for row type and aggregation.
  • Faster joins when spilling to disk is disabled.
  • Improved performance when writing non-structural types to Parquet.
  • New raw_query table function for full query pass-through in Elasticsearch.

Trino 390

  • Support for setting comments on views.
  • Improved UNNEST performance.
  • Support for Databricks runtime 10.4 LTS in Delta Lake connector.

Trino 391

  • Support for AWS Athena partition projection.
  • Faster writing of Parquet data in Iceberg and Delta Lake.
  • Support for reading BigQuery external tables.
  • Support for table and column comments in BigQuery.

Additional highlights and notes according to Manfred:

  • Java 17 arrived as the required runtime in 390.
  • Removed support for Elasticsearch versions below 6.6.0, and added testing for OpenSearch 1.1.0.
  • The new raw_query table function in Elasticsearch can replace the old full-text search and query pass-through support.

More detailed information is available in the release notes for Trino 387, Trino 388, Trino 389, Trino 390, and Trino 391.

Concept of the episode: Polymorphic table functions

We normally cover a broad variety of topics in the Trino community broadcast, exploring different technical details, pull requests, and neat things that are going on in Trino at large. This episode, however, we’re going to be more focused, only taking a look at a particular piece of functionality that we’re all very excited about: polymorphic table functions, or PTFs for short. If you’re unfamiliar with what this means, that can sound like technobabble word soup, so we can start exploring this with a simple question…

What is a table function?

The easiest answer to this question is that it’s a function which returns a table. Scalar, aggregate, and window functions all work a little differently, but ultimately, they all return a single value each time they are invoked. Table functions are unique in that they return an entire table. This gives them some interesting properties that we’ll dive into, but it also means that you can only invoke them in situations where you’d use a full table, such as a FROM clause:

SELECT
    *
FROM
    TABLE(my_table_function('foo'));

You can also use table functions in joins:

SELECT
    *
FROM
    TABLE(my_table_function('bar'))
JOIN
    TABLE(another_table_function(1, 2, 3))
    ON true;

And while that’s all neat, it raises the question…

What can you do with table functions?

While standard table functions are cool, they have to return a pre-defined schema, which limits their flexibility. However, they still have some interesting uses as a means of shortening queries or performing multiple operations at once. If you frequently find yourself selecting from the same table with a WHERE clause checking a specific column for equality, but with a different value each time, you could define a table function that takes that value as a parameter and skip all the copying and pasting just because one value changes. You could take an extremely lengthy sub-query with multiple joins and abbreviate it to something as short as one of the examples above, and then use that in other queries. Or, if you want to update a table, but you also want to insert into another table as part of the same operation, you could combine those two steps into one table function, ensuring that users won’t forget the second part of that process.
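
As a rough sketch of that first case - the table name and the function here are hypothetical, and defining such a function is up to a connector or plugin - the repeated query

SELECT
    *
FROM
    orders
WHERE
    customer_id = 42;

could shrink to a parameterized invocation like:

SELECT
    *
FROM
    TABLE(orders_for_customer(customer_id => 42));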

So table functions are functions that return tables. It really is that simple, and we’re already two-thirds of the way to understanding what polymorphic table functions are. And now it’s time to add in that fun ‘polymorphic’ word.

What makes a table function polymorphic?

A polymorphic table function is a type of table function where the schema of the returned table is determined dynamically. This means that the returned table data, including its schema, can be determined by the arguments you pass to the function. As you might imagine, that makes PTFs a lot more powerful than an ordinary, run-of-the-mill table function.
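
The query pass-through function we cover later in this episode is a concrete example of this. Assuming a catalog named mysql, these two invocations of the very same function return tables with completely different columns, because the output schema is derived from the query text that gets passed in:

SELECT
    *
FROM
    TABLE(mysql.system.query(query => 'SELECT 1 AS x'));

SELECT
    *
FROM
    TABLE(mysql.system.query(query => 'SELECT 1 AS x, 2 AS y, 3 AS z'));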

What can you do with polymorphic table functions?

When you’re not determining the schema of the returned table well in advance, you get the flexibility to do some pretty crazy things. It can be as simple as adding or removing columns as part of the function, or it can be as complex as building and returning an entirely new table based on some input data.

Demo of the episode: The many ways you can leverage PTFs

But we’ve talked enough at a high level about what PTFs are, so now it’s a good time to look at what PTFs can actually do for you to make your life as a Trino user easier, better, and more efficient.

Possible polymorphic table functions

One thing to note - all the examples we’re about to look at are hypothetical. We’re working to bring functions similar to these to Trino soon, but there are a few things left to implement before we get there, so for now, this is meant to highlight why we’re implementing PTFs, and we’ll take a look at what you can currently do with them a little later. When it does come time to implement these functions, they won’t look exactly the same as you see them here.

Select except

Imagine a table with 10 columns, named col1, col2, col3, etc. If you want to select all the columns except the first one from that table, you end up with a query that looks like:

SELECT
    col2, col3, col4, col5, col6, col7, col8, col9, col10
FROM
    my.table;

But that’s long, and it’s a pain to type, and it gets messy, especially if your column names aren’t extremely short due to being part of a contrived example. With a simple PTF, you could get the same result with:

SELECT
    *
FROM
    TABLE(
        excl_function(
            data => TABLE(my.table), columns_to_exclude => DESCRIPTOR("col1")
        )
    );

Now, this isn’t a great PTF, because it’s going to take more time to implement than it takes to just write out your column names, and at least when we’re using only 10 columns and short column names, invoking the function takes more writing than doing it the old-fashioned way. Also, this is going to perform worse than writing the query the ordinary way. As a rule of thumb, if it can be written with normal SQL, it will be more performant when done that way. There are plans to work on optimizing PTFs, but that’s not going to happen soon, so for the time being, we’re focusing on how they enable things which previously couldn’t be done at all, rather than making queries look nicer or cleaner.

All that said, we wanted to include this example because this does a good job at demonstrating how polymorphic table functions can work and what they can do for you. But it’s a simple example, and now we can look at some which are a little more complex and a little more practical.

CSVreader

If you’ve ever tried to create a table from a CSV file, you know it can be a painful experience. You have to be very explicit and very diligent, and there’s a lot of manual cross-checking involved in ensuring that each column aligns perfectly and is correctly typed for the data present in the CSV. Enter polymorphic table functions, here to save the day.

Remember, this is hypothetical, so by the time we get to implementing something similar to this in Trino, it will certainly look different. But a table function like this would be defined by the connector, so all the end user needs to worry about is its signature, which might look something like this:

FUNCTION CSVreader (
    Filename VARCHAR(1000),
    FloatCols DESCRIPTOR DEFAULT NULL,
    DateCols DESCRIPTOR DEFAULT NULL
    )
RETURNS TABLE

One key thing to note here is the DESCRIPTOR type. It is a type that describes a list of column names, and there will be a function to convert a parameterized list to the DESCRIPTOR type. Other than that, everything else here does what you’d expect - you pass the function the name of the CSV file, the columns which should be typed as floats, and the columns which should have a date typing. All unspecified columns will still be handled as varchar. Calling the function might look something like:

SELECT
    *
FROM
    TABLE(
        CSVreader(
            Filename => 'my_file.csv',
            FloatCols => DESCRIPTOR("principle", "interest"),
            DateCols => DESCRIPTOR("due_date")
        )
    );

Given a CSV with this content:

docno,name,due_date,principle,interest
123,Alice,01/01/2014,234.56,345.67
234,Bob,01/01/2014,654.32,543.21

Such a function would return a table that looks like:

docno  name   due_date    principle  interest
123    Alice  2014-01-01  234.56     345.67
234    Bob    2014-01-01  654.32     543.21

With a well-written PTF, the days of toiling over parsing a CSV into SQL are over!

Pivot

Pivot is an oft-requested feature which hasn’t been built in Trino because it isn’t a part of the standard SQL specification. A PIVOT keyword or built-in function isn’t planned, but with PTFs, we can support PIVOT-like functionality without needing to deviate from SQL.

A PIVOT PTF might have the following definition:

FUNCTION Pivot (
    Input_table TABLE PASS THROUGH WITH ROW SEMANTICS,
    Output_pivot_columns DESCRIPTOR,
    Input_pivot_columns1 DESCRIPTOR,
    Input_pivot_columns2 DESCRIPTOR DEFAULT NULL,
    Input_pivot_columns3 DESCRIPTOR DEFAULT NULL,
    Input_pivot_columns4 DESCRIPTOR DEFAULT NULL,
    Input_pivot_columns5 DESCRIPTOR DEFAULT NULL
)
RETURNS TABLE

But before we look at how you can invoke this, there are a few clauses here that are worth explaining…

  • PASS THROUGH means that the input data (and all of its rows) will be fully available in the output. The alternative to this is NO PASS THROUGH.
  • WITH ROW SEMANTICS means that the result will be determined on a row-by-row basis. The alternative to this is WITH SET SEMANTICS.

And of course, the function takes some parameters, so a good function author defines what those parameters do.

  • Input_table is the input table. It can be any generic table.
  • Output_pivot_columns lists the names of the columns to be created in the pivoted output.
  • Input_pivot_columns1 through Input_pivot_columns5 are the groups of columns to be pivoted into the output columns. The first group is required, but you can specify up to four more. The number of input columns in each group must match the number of output columns.

So you’ve got a PIVOT function and you understand how to invoke it; all you need to do now is listen to Ross from Friends and make it happen:

SELECT
    D.id,
    D.name,
    P.accttype,
    P.acctvalue
FROM
    TABLE(
        Pivot(
            Input_table => TABLE (My.Data) AS D,
            Output_pivot_columns => DESCRIPTOR (accttype, acctvalue),
            Input_pivot_columns1 => DESCRIPTOR (accttype1, acctvalue1),
            Input_pivot_columns2 => DESCRIPTOR (accttype2, acctvalue2)
        )
    ) AS P;

If we presume we have this data in My.Data:

ID   Name   accttype1  acctvalue1  accttype2  acctvalue2
123  Alice  external   20000       internal   350
234  Bob    external   25000       internal   120

The output of that query will be:

ID   Name   accttype  acctvalue
123  Alice  external  20000
123  Alice  internal  350
234  Bob    external  25000
234  Bob    internal  120

You can see the PASS THROUGH clause in action when you select D.id and D.name.

ExecR

As a bonus cherry on top, and as an example of something very fun that you can do with PTFs, how about executing an entire script written in R?

A connector could provide a function with the signature:

FUNCTION ExecR (
    Script VARCHAR(10000),
    Input_table TABLE NO PASS THROUGH WITH SET SEMANTICS,
    Rowtype DESCRIPTOR
)
RETURNS TABLE

The inputs here are the script, which can simply be pasted into the query as text; the input table, which contains the data for the script to run on; and a descriptor for the output row type, since there’s otherwise no way for the engine to know the shape of the result after running the R script. It’s worth pointing out that, in contrast to the PIVOT example, this function is declared NO PASS THROUGH, because the R script will not have the ability to copy input rows into output rows.

Invoking this function is relatively straightforward:

SELECT
    *
FROM
    TABLE(
        ExecR(
            Script => '...',
            Input_table => TABLE(My.Data),
            Rowtype => DESCRIPTOR(col1 VARCHAR(100), col2 REAL, col3 FLOAT)
        )
    ) AS R;

And depending on your script and your data, you can make this as simple or as extreme as you’d like!

Pull request of the episode: PR 12325: Support query pass-through for JDBC-based connectors

We’ve spent a lot of time talking about hypothetical value that we will be able to derive from polymorphic table functions sometime down the line, but we should also pump the brakes a little and take a look at what we already have in Trino in terms of polymorphic table functions. This PR, authored by Kasia Findeisen, was the first code to land in Trino that allowed access to PTFs. It’s just one particular PTF, but it’s pretty neat, so we can jump into it with a demo and an explanation for how we’re already changing the game with PTFs.

Demo of the episode #2: Using connector-specific features with query pass-through

Trino sticks to the SQL standard, which means that custom extensions and syntax aren’t supported. If you’re using a Trino connector where the underlying database has a neat feature that isn’t part of the SQL standard, you previously had no way to take advantage of it, and you knew it wasn’t going to be added to Trino. But now, with query pass-through, you can leverage any of the cool non-standard extensions the underlying data sources offer! We’ll look at a couple of different examples, but keep in mind that because this pushes an entire query down to the data source, the possibilities depend on what the underlying database is capable of.

GROUP_CONCAT() in MySQL

In a table where we have employees and their manager ID, but no direct way to list each manager with all of their employees, we can push down a query to MySQL and use GROUP_CONCAT() to combine each manager’s employees into a single column with this query:

SELECT
  *
FROM
  TABLE(
    mysql.system.query(
      query => 'SELECT
        manager_id, GROUP_CONCAT(employee_id)
      FROM
        company.employees
      GROUP BY
        manager_id'
    )
  );

MODEL clause in Oracle

The MODEL clause in Oracle is an incredibly powerful way to manipulate and view data. Since it isn’t ANSI-compliant, it’s specific to Oracle, but if you want to use it, now you can! With it, you can treat query results as multidimensional arrays and perform sophisticated calculations on them - try saying that five times fast. We don’t have the time to explain everything about how this feature works, but if you want clarification, you can check out the Oracle documentation on MODEL and try it out for yourself.

SELECT
  SUBSTR(country, 1, 20) country,
  SUBSTR(product, 1, 15) product,
  year,
  sales
FROM
  TABLE(
    oracle.system.query(
      query => 'SELECT
        *
      FROM
        sales_view
      MODEL
        RETURN UPDATED ROWS
        MAIN
          simple_model
        PARTITION BY
          country
        MEASURES
          sales
        RULES
          (sales['Bounce', 2001] = 1000,
          sales['Bounce', 2002] = sales['Bounce', 2001] + sales['Bounce', 2000],
          sales['Y Box', 2002] = sales['Y Box', 2001])
      ORDER BY
        country'
    )
  );

Funnily enough, Oracle also supports polymorphic table functions, so if you wanted to, you could use the query function to then invoke a PTF in Oracle, including any of the hypothetical examples we went into above! PTFs inside of PTFs are possible! …though probably not the best idea.
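
As a sketch of what that nesting could look like - assuming a catalog named oracle and a hypothetical Oracle-side PTF named skip_col that removes the given columns from the classic scott.emp demo table - the outer query function simply receives the inner PTF invocation as text:

SELECT
  *
FROM
  TABLE(
    oracle.system.query(
      query => 'SELECT * FROM skip_col(scott.emp, COLUMNS(comm))'
    )
  );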

Question of the episode: Where are we at, and what’s coming next?

Right now, there are a few things on the radar for moving forward with PTFs. The first and simpler task at hand is expanding the query function to other connectors. We started with the JDBC connectors, but we have also landed a similar function called raw_query for Elasticsearch, are working on a BigQuery implementation, and there may be more still to come.
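
For reference, raw_query takes the target schema and index along with an Elasticsearch query expressed as JSON. Assuming a catalog named elasticsearch and an index named orders, invoking it looks much like the JDBC query function:

SELECT
  *
FROM
  TABLE(
    elasticsearch.system.raw_query(
      schema => 'default',
      index => 'orders',
      query => '{"query": {"match": {"status": "shipped"}}}'
    )
  );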

On a broader scope, the reason this was the first PTF to be implemented is that Trino itself doesn’t have to do anything to make it work. The next big step in powering up PTFs is to create an operator and make the engine aware of them, so that the engine can handle and process PTFs itself, which will open the door to the wide array of possibilities we explored earlier.

And finally, once that’s done, we plan on empowering you, the Trino community, to go out and actually make some polymorphic table functions. You can already implement them today, but with some limitations: you can’t use table or descriptor arguments, and the connector has to perform the execution. Once the full framework for PTFs has been built, those examples from earlier (and many possible others) will still need to be implemented. There is a developer guide on implementing table functions today, and there are plans to expand it so that it’s easier to go in and add the PTFs that will make a difference for you and your workflows.

Check out the in-person and virtual Trino Meetup groups.

If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.

Music for the show is from the Megaman 6 Game Play album by Krzysztof Slowikowski.