Read support for original files of Hive transactional tables in Presto

Sep 23, 2020 • Harmandeep Singh, Qubole

In Presto 331, read support for Hive transactional tables was introduced. It works well, if a user creates a new Hive transactional table and reads it from Presto. However, if an existing table is converted to a Hive transactional table, Presto would fail to read data from such a table because read support for original files was missing. Original files are those files in a Hive transactional table that existed before the table was converted into a Hive transactional table. Until version 340, Presto expected all files in a Hive transactional table to be in Hive ACID format. Users would have to perform a major compaction to convert original files into ACID files (i.e. base files) in such tables. This is not always possible as the original flat table (table in non-ACID format) could be huge and converting all the existing data into ACID format can be very expensive.

This blog is an extension of the blog Hive ACID and transactional tables’ support in Presto. It first describes original files and then goes into details of read support for such files that was added in Presto 340.

What are the original files? #

Files present in non-transactional ORC tables have the standard ORC schema. When a flat table is converted into a transactional table, existing files are not converted into Hive ACID format. Such files, in a transactional table, that are not in Hive ACID format, are called original files. These files are named as 000000_X, 000000_X_copy_Y. These files don’t have ACID columns and have differences in the schema as follows:

Table Schema

n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string

Original File Schema

struct {
    n_nationkey : int,
    n_name : string,
    n_regionkey : int,
    n_comment : string
}

Delta File Schema

struct {
    operation : int,
    originalTransaction : bigint,
    bucket : int,
    rowId : bigint,
    currentTransaction : bigint,
    row : struct {
        n_nationkey : int,
        n_name : string,
        n_regionkey : int,
        n_comment : string
    }
}

Before Presto 340, Presto used to fail the query if it reads from a Hive transactional table having original files.

Update and delete support on original files #

Hive achieves updates/deletes on a row in original files by synthetically generating ACID columns for original files. Presto follows the same mechanism of generating ACID columns synthetically as discussed later.

ACID column generation on original files #

Files in Hive ACID format have 5 ACID columns, but we need only 3 columns i.e. originalTransactionId, bucketId and rowId to uniquely identify a row. In this section, we will see how these 3 columns are synthetically generated for original files.

Original transaction ID #

An original transaction ID is the write ID when a record is first created. For original files, the original transaction ID is always 0.

Bucket ID #

Bucket ID is retrieved from the original file name. For the original file 0000ABC_DEF or 0000ABC_DEF_copy_G, the bucket ID will be ABC.

Row ID #

To calculate the row ID, the total row count of all the original files, which come before the current one in lexicographical order, is calculated. Then, the row ID is equal to the sum of the value calculated and local row ID in the current original file.

Here is an example to calculate the global Row ID of the 3rd row of an original File 000000_0_copy_2.

000000_0            -> 	X1 Rows (returned by ORC footer field numberOfRows)

000000_0_copy_1     -> 	X2 Rows (returned by ORC footer field numberOfRows)

000000_0_copy_2     ->	[ Row 0 ]
                        [ Row 1 ]
                        [ Row 2 ]   <- Local Row ID (returned by filePosition in OrcRecordReader) = 2
                                       Global Row ID = (X1+X2+2)
                        [ Row 3 ]

000000_0_copy_3     ->  X4 Rows

Note: As we see, additional computations are required to generate row IDs while reading original files, therefore, read is slower than ACID format files in the transactional table.

Once Presto has the 3 ACID columns for a row, it can check for update/delete on it. Delete deltas, written by Hive for Original files, have row IDs generated by following the same strategy as discussed above, hence, the same logic of filtering out deleted rows as discussed in Hive ACID and transactional tables’ support in Presto works with the original files too.

Changes in Presto to support reading original files #

Presto split generation logic and ORC reader is modified to add read support for original files. Following are the changes done at coordinator and worker level:

Split generation #

We use a new class named AcidInfo to store OriginalFiles, DeleteDeltaFiles for HiveSplit. BackgroundSplitLoader.loadPartitions is called in an executor to create splits for each partition. In addition to the steps mentioned in blog Hive ACID and transactional tables’ support in Presto, Presto does the following:

Original files, ACID subdirectories (base, delta, delete_delta) are figured out by listing the partition location by Hive AcidUtils Helper class.
Registry for delete deltas DeleteDeltaInfo is created which has minimal information through which delete_delta path can be constructed by the workers.
Registry for original files OriginalFileInfo is created which has information like file name, size and bucket ID.
AcidInfo.Builder keeps a map AcidInfo.Builder.bucketIdToOriginalFileInfoMap of bucket ID to the list of original files belonging to the same bucket.
Hive splits are created for each original file, base and delta directories. Each hive split has a construct AcidInfo.
For an original file split, AcidInfo has:
1. Bucket ID: Bucket ID of the original file.
2. OriginalFilesList: List of all the original files belong to the same bucket calculated from AcidInfo.Builder.bucketIdToOriginalFileInfoMap.
3. DeleteDeltaFilesList: List of delete deltas.
For an base/delta file split, AcidInfo has:
1. DeleteDeltaFilesList: List of delete deltas.

Reading Hive original files data in workers #

Hive splits generated during the split generation phase make their way to worker nodes where OrcPageSourceFactory is used to create PageSource for TableScan operator. In addition to the steps mentioned in blog Hive ACID and transactional tables’ support in Presto , Presto does the following:

OrcDeletedRows is created for delete_delta locations, if any.
For original file split, OrcPageSourceFactory fetches originalFilesList from AcidInfo and calculates originalFileRowId by calling OriginalFilesUtils.getPrecedingRowCount and sends this information to OrcPageSource.
OrcPageSouce returns rows from OrcRecordReader which are not present in OrcDeletedRows.

Follow up #

For an original file split, the current implementation may take quadratic time in the worst case to calculate global row ID by reading row IDs from the original files’ footer. It may be optimized by keeping a query level cache in worker nodes or by precomputing global row IDs in coordinator during split computation.

Acknowledgements #

I would also like to express my gratitude to everyone who helped me throughout developing the feature. Thank you Shubham Tagra for brainstorming sessions and providing continuous guidance on Presto Hive ACID. Thank you Piotr Findeisen for helping me further refine the code with insightful code reviews.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino blog