In this episode we have the pleasure to chat with a couple familiar faces who have been hard at work building and understanding the features we’re talking about today:
OPTIMIZEon Iceberg tables with non-identity partitioning.
rowtype and aggregation.
raw_querytable function for full query pass-through in Elasticsearch.
Additional highlights and notes according to Manfred:
We normally cover a broad variety of topics in the Trino community broadcast, exploring different technical details, pull requests, and neat things that are going on in Trino at large. This episode, however, we’re going to be more focused, only taking a look at a particular piece of functionality that we’re all very excited about: polymorphic table functions, or PTFs for short. If you’re unfamiliar with what this means, that can sound like technobabble word soup, so we can start exploring this with a simple question…
The easiest answer to this question is that it’s a function which returns a
table. Scalar, aggregate, and window functions all work a little differently,
but ultimately, they all return a single value each time they are invoked. Table
functions are unique in that they return an entire table. This gives them some
interesting properties that we’ll dive into, but it also means that you can only
invoke them in situations where you’d use a full table, such as a
SELECT * FROM TABLE(my_table_function('foo'));
You can also use table functions in joins:
SELECT * FROM TABLE(my_table_function('bar')) JOIN TABLE(another_table_function(1, 2, 3)) ON true;
And while that’s all neat, it begs the question…
While standard table functions are cool, they have to return a pre-defined
schema, which limits their flexibility. However, they still have some
interesting uses as means of shortening queries or performing multiple
operations at once. If you frequently find yourself selecting from the same
table with a
WHERE clause checking equality to a specific column but with a
different value each time, you could define a table function which takes that
value as a parameter and allows you to skip all the copying and pasting just for
the sake of one line changing. You could take an extremely lengthy sub-query
with multiple joins and abbreviate it to something as short as one of the
examples above, and then use that in other queries. Or, if you want to update a
table, but you also want to insert into another table as part of the same
operation, you could combine those two steps into one table function, ensuring
that users won’t forget the second part of that process.
So table functions are functions that return tables. It really is that simple, and we’re already two-thirds of the way to understanding what polymorphic table functions are. And now it’s time to add in that fun ‘polymorphic’ word.
A polymorphic table function is a type of table function where the schema of the returned table is determined dynamically. This means that the returned table data, including its schema, can be determined by the arguments you pass to the function. And you might imagine, that makes PTFs a lot more powerful than an ordinary, run-of-the-mill table function.
When you’re not determining the schema of the returned table well in advance, you get the flexibility to do some pretty crazy things. It can be as simple as adding or removing columns as part of the function, or it can be as complex as building and returning an entirely new table based on some input data.
But we’ve talked enough at a high level about what PTFs are, so now it’s a good time to look at what PTFs can actually do for you to make your life as a Trino user easier, better, and more efficient.
One thing to note - all the examples we’re about to look at are hypothetical. We’re working to bring functions similar to these to Trino soon, but there’s a few things left to implement before we get there, so for now, this is meant to highlight why we’re implementing PTFs, and we’ll take a look at what you can currently do with them a little later. When it does come time to implement these functions, they will not be exactly the same as you see them here.
Imagine a table with 10 columns, named col1, col2, col3, etc. If you want to select all the columns except the first one from that table, you end up with a query that looks like:
SELECT col2, col3, col4, col5, col6, col7, col8, col9, col10 FROM my.table;
But that’s long, and it’s a pain to type, and it gets messy, especially if your column names aren’t extremely short due to being part of a contrived example. With a simple PTF, you could get the same result with:
SELECT * FROM TABLE( excl_function( data => TABLE(my.table), columns_to_exclude => DESCRIPTOR("col1") ) );
Now, this isn’t a great PTF, because it’s going to take more time to implement than it takes to just write out your column names, and at least when we’re using only 10 columns and short column names, invoking the function takes more writing than doing it the old-fashioned way. Also, this is going to perform worse than writing the query the ordinary way. As a rule of thumb, if it can be written with normal SQL, it will be more performant when done that way. There are plans to work on optimizing PTFs, but that’s not going to happen soon, so for the time being, we’re focusing on how they enable things which previously couldn’t be done at all, rather than making queries look nicer or cleaner.
All that said, we wanted to include this example because this does a good job at demonstrating how polymorphic table functions can work and what they can do for you. But it’s a simple example, and now we can look at some which are a little more complex and a little more practical.
If you’ve ever tried to create a table from a CSV file, you know it can be a painful experience. It has to be very explicit, very diligent, and there’s a lot of manual cross-checking involved in ensuring that each column aligns perfectly and is correctly typed for the columns present in the CSV. Enter polymorphic table functions, here to save the day.
Remember, this is hypothetical, so by the time we get to implementing something similar to this in Trino, it will certainly look different. But a table function like this will be defined on the connector, so all the end user needs to worry about is what its signature might look like:
FUNCTION CSVreader ( Filename VARCHAR(1000), FloatCols DESCRIPTOR DEFAULT NULL, DateCols DESCRIPTOR DEFAULT NULL ) RETURNS TABLE
One key thing to note here is the
DESCRIPTOR type. It is a type that describes
a list of column names, and there will be a function to convert a parameterized
list to the
DESCRIPTOR type. Other than that, everything else here does what
you’d expect - you pass the function the name of the CSV file, the columns which
should be typed as floats, and the columns which should have a date typing. All
unspecified columns will still be handled as
varchar. Calling the function
might look something like:
SELECT * FROM TABLE( CSVreader( Filename => 'my_file.csv', FloatCols => DESCRIPTOR("principle", "interest") DateCols => DESCRIPTOR("due_date") ) );
Given a CSV with this content:
docno,name,due_date,principle,interest 123,Alice,01/01/2014,234.56,345.67 234,Bob,01/01/2014,654.32,543.21
Such a function would return a table that looks like:
With a well-written PTF, the days of toiling over parsing a CSV into SQL are over!
Pivot is an oft-requested feature which hasn’t been built in Trino because it
isn’t a part of the standard SQL specification. A
PIVOT keyword or built-in
function isn’t planned, but with PTFs, we can support
without needing to deviate from SQL.
PIVOT PTF might have the following definition:
FUNCTION Pivot ( Input_table TABLE PASS THROUGH WITH ROW SEMANTICS, Output_pivot_columns DESCRIPTOR, Input_pivot_columns1 DESCRIPTOR, Input_pivot_columns2 DESCRIPTOR DEFAULT NULL, Input_pivot_columns3 DESCRIPTOR DEFAULT NULL, Input_pivot_columns4 DESCRIPTOR DEFAULT NULL, Input_pivot_columns5 DESCRIPTOR DEFAULT NULL ) RETURNS TABLE
But before we look at how you can invoke this, there’s a few clauses here that are worth explaining…
PASS THROUGHmeans that the input data (and all of its rows) will be fully available in the output. The alternative to this is
NO PASS THROUGH.
WITH ROW SEMANTICSmeans that the result will be determined on a row-by-row basis. The alternative to this is
WITH SET SEMANTICS.
And of course, the function takes some parameters, so a good function author defines what those parameters do.
So you’ve got a PIVOT function, and you understand how to invoke it, so all you need to do is listen to Ross from Friends and make it happen:
SELECT D.id, D.name, P.accttype, P.acctvalue FROM TABLE( Pivot( Input_table => TABLE (My.Data) AS D, Output_pivot_columns => DESCRIPTOR (accttype, acctvalue), Input_pivot_columns1 => DESCRIPTOR (accttype1, acctvalue1), Input_pivot_columns2 => DESCRIPTOR (accttype2, acctvalue2) ) ) AS P;
If we presume we have this data in My.Data:
The output of that query will be:
You can see the
PASS THROUGH clause in action when you select D.id and D.name.
As a bonus cherry on top, and as an example of something very fun that you can do with PTFs, how about executing an entire script written in R?
A connector could provide a function with the signature:
FUNCTION ExecR ( Script VARCHAR(10000), Input_table TABLE NO PASS THROUGH WITH SET SEMANTICS, Rowtype DESCRIPTOR ) RETURNS TABLE
The inputs here are the script, which can simply be pasted into the query as
text, the input table which contains the data for the script to run on, and then
a descriptor for row typing, as there’s otherwise no way for the engine to know
after running the R script. Worth pointing out and contrary to the PIVOT
example, this function has
NO PASS THROUGH because the R script will not have
the ability to copy input rows into output rows.
Invoking this function is relatively straightforward:
SELECT * FROM TABLE( ExecR( Script => '...', Input => TABLE(My.Data), Rowtype => DESCRIPTOR(col1 VARCHAR(100), col2 REAL, col3 FLOAT) ) ) AS R;
And depending on your script and your data, you can make this as simple or as extreme as you’d like!
We’ve spent a lot of time talking about hypothetical value that we will be able to derive from polymorphic table functions sometime down the line, but we should also pump the brakes a little and take a look at what we already have in Trino in terms of polymorphic table functions. This PR, authored by Kasia Findeisen, was the first code to land in Trino that allowed access to PTFs. It’s just one particular PTF, but it’s pretty neat, so we can jump into it with a demo and an explanation for how we’re already changing the game with PTFs.
Trino sticks to the SQL standard, which means that custom extensions and syntax aren’t supported. If you’re using a Trino connector where the underlying database has a neat feature that isn’t a part of the SQL standard, you previously were unable to take advantage of that, and you knew it wasn’t going to be added to Trino. But now with query pass-through, you can leverage any of the cool non-standard extensions that belong to connectors! We’ll look at a couple different examples, but keep in mind, because this is pushing an entire query down to the connector, the possibilities will be based on what the underlying database is capable of.
In a table where we have employees and their manager ID, but no direct way to
list managers with all their employees, we can push down a query to MySQL and
GROUP_CONCAT() to combine them all into one column with this query:
SELECT * FROM TABLE( mysql.system.query( query => 'SELECT manager_id, GROUP_CONCAT(employee_id) FROM company.employees GROUP BY manager_id' ) );
MODEL clause in Oracle is an incredibly powerful way to manipulate and
view data. As it’s non-ANSI compliant, it’s specific to Oracle, but if you want
to use it, now you can! Through polymorphic table functions, you can generate
and perform sophisticated calculations on multidimensional arrays - try saying
that five times fast. We don’t have the time to explain everything about how
this feature works, but if you want clarification, you can check out
the Oracle documentation on MODEL
and try it out for yourself.
SELECT SUBSTR(country, 1, 20) country, SUBSTR(product, 1, 15) product, year, sales FROM TABLE( oracle.system.query( query => 'SELECT * FROM sales_view MODEL RETURN UPDATED ROWS MAIN simple_model PARTITION BY country MEASURES sales RULES (sales['Bounce', 2001] = 1000, sales['Bounce', 2002] = sales['Bounce', 2001] + sales['Bounce', 2000], sales['Y Box', 2002] = sales['Y Box', 2001]) ORDER BY country' ) );
Funnily enough, Oracle also supports polymorphic table functions, so if you
wanted to, you could use the
query function to then invoke a PTF in Oracle,
including any of the hypothetical examples we went into above! PTFs inside of
PTFs are possible! …though probably not the best idea.
Right now, there’s a few things on the radar for moving forward with PTFs. The
first and more simple task at hand is expanding the
query function to other
connectors. We started with the JDBC connectors, but we have also landed a
similar function called
raw_query for ElasticSearch, are working on a BigQuery
implementation, and there may still be more yet to come.
On a broader scope, the reason this was the first PTF that was implemented is because Trino doesn’t have to do anything to make it work. The next big step in powering PTFs up is to create an operator and make the engine aware of them, so that the engine can handle and process PTFs itself, which will open the door to the wide array of possibilities we explored earlier.
And finally, once that’s done, we plan on empowering you, the Trino community, to go out and actually make some polymorphic table functions. You already can implement them today, but with those limitations: you can’t use table or descriptor arguments, and the connector has to perform the execution. But once the full framework for PTFs has been built, those examples from earlier (and many possible others) still need to be implemented. There is a developer guide on implementing table functions which exists today, but there are plans to expand it so that it’s easier to go in and add the PTFs which will make a difference for you and your workflows.
Check out the in-person and virtual Trino Meetup groups.
If you want to learn more about Trino, check out the definitive guide from O’Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Slowikowski.