In Presto 331, read support for Hive transactional tables was introduced. It works well, if a user creates a new Hive transactional table and reads it from Presto. However, if an existing table is converted to a Hive transactional table, Presto would fail to read data from such a table because read support for original files was missing. Original files are those files in a Hive transactional table that existed before the table was converted into a Hive transactional table. Until version 340, Presto expected all files in a Hive transactional table to be in Hive ACID format. Users would have to perform a major compaction to convert original files into ACID files (i.e. base files) in such tables. This is not always possible as the original flat table (table in non-ACID format) could be huge and converting all the existing data into ACID format can be very expensive.
This blog is an extension of the blog Hive ACID and transactional tables’ support in Presto. It first describes original files and then goes into details of read support for such files that was added in Presto 340.
What are the original files? #
Files present in non-transactional ORC tables have the standard ORC schema. When
a flat table is converted into a transactional table, existing files are not
converted into Hive ACID format. Such files, in a transactional table, that are
not in Hive ACID format, are called original files. These files are named as
000000_X
, 000000_X_copy_Y
. These files don’t have ACID columns and have
differences in the schema as follows:
Table Schema
n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string
Original File Schema
struct {
n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string
}
Delta File Schema
struct {
operation : int,
originalTransaction : bigint,
bucket : int,
rowId : bigint,
currentTransaction : bigint,
row : struct {
n_nationkey : int,
n_name : string,
n_regionkey : int,
n_comment : string
}
}
Before Presto 340, Presto used to fail the query if it reads from a Hive transactional table having original files.
Update and delete support on original files #
Hive achieves updates/deletes on a row in original files by synthetically generating ACID columns for original files. Presto follows the same mechanism of generating ACID columns synthetically as discussed later.
ACID column generation on original files #
Files in Hive ACID format have 5 ACID columns, but we need only 3 columns i.e.
originalTransactionId
, bucketId
and rowId
to uniquely identify a row. In
this section, we will see how these 3 columns are synthetically generated for
original files.
Original transaction ID #
An original transaction ID is the write ID when a record is first created. For original files, the original transaction ID is always 0.
Bucket ID #
Bucket ID is retrieved from the original file name. For the original file
0000ABC_DEF
or 0000ABC_DEF_copy_G
, the bucket ID will be ABC
.
Row ID #
To calculate the row ID, the total row count of all the original files, which come before the current one in lexicographical order, is calculated. Then, the row ID is equal to the sum of the value calculated and local row ID in the current original file.
Here is an example to calculate the global Row ID of the 3rd row of an original
File 000000_0_copy_2
.
000000_0 -> X1 Rows (returned by ORC footer field numberOfRows)
000000_0_copy_1 -> X2 Rows (returned by ORC footer field numberOfRows)
000000_0_copy_2 -> [ Row 0 ]
[ Row 1 ]
[ Row 2 ] <- Local Row ID (returned by filePosition in OrcRecordReader) = 2
Global Row ID = (X1+X2+2)
[ Row 3 ]
000000_0_copy_3 -> X4 Rows
Note: As we see, additional computations are required to generate row IDs while reading original files, therefore, read is slower than ACID format files in the transactional table.
Once Presto has the 3 ACID columns for a row, it can check for update/delete on it. Delete deltas, written by Hive for Original files, have row IDs generated by following the same strategy as discussed above, hence, the same logic of filtering out deleted rows as discussed in Hive ACID and transactional tables’ support in Presto works with the original files too.
Changes in Presto to support reading original files #
Presto split generation logic and ORC reader is modified to add read support for original files. Following are the changes done at coordinator and worker level:
Split generation #
We use a new class named AcidInfo
to store OriginalFiles
, DeleteDeltaFiles
for HiveSplit. BackgroundSplitLoader.loadPartitions
is called in an executor to create splits for each partition. In addition
to the steps mentioned in blog Hive ACID and transactional tables’ support in
Presto, Presto does the following:
- Original files, ACID subdirectories (
base
,delta
,delete_delta
) are figured out by listing the partition location by Hive AcidUtils Helper class. - Registry for delete deltas
DeleteDeltaInfo
is created which has minimal information through whichdelete_delta
path can be constructed by the workers. - Registry for original files
OriginalFileInfo
is created which has information like file name, size and bucket ID. AcidInfo.Builder
keeps a mapAcidInfo.Builder.bucketIdToOriginalFileInfoMap
of bucket ID to the list of original files belonging to the same bucket.- Hive splits are created for each original file, base and delta directories.
Each hive split has a construct
AcidInfo
. -
For an original file split,
AcidInfo
has:- Bucket ID: Bucket ID of the original file.
- OriginalFilesList: List of all the original files belong to the
same bucket calculated from
AcidInfo.Builder.bucketIdToOriginalFileInfoMap
. - DeleteDeltaFilesList: List of delete deltas.
-
For an base/delta file split,
AcidInfo
has:- DeleteDeltaFilesList: List of delete deltas.
Reading Hive original files data in workers #
Hive splits generated during the split generation phase make their way to worker
nodes where OrcPageSourceFactory
is used to create PageSource
for
TableScan
operator. In addition to the steps mentioned in blog Hive ACID and
transactional tables’ support in Presto
, Presto does the following:
OrcDeletedRows
is created fordelete_delta
locations, if any.- For original file split,
OrcPageSourceFactory
fetchesoriginalFilesList
fromAcidInfo
and calculatesoriginalFileRowId
by callingOriginalFilesUtils.getPrecedingRowCount
and sends this information toOrcPageSource
. OrcPageSouce
returns rows fromOrcRecordReader
which are not present inOrcDeletedRows
.
Follow up #
For an original file split, the current implementation may take quadratic time in the worst case to calculate global row ID by reading row IDs from the original files’ footer. It may be optimized by keeping a query level cache in worker nodes or by precomputing global row IDs in coordinator during split computation.
Acknowledgements #
I would also like to express my gratitude to everyone who helped me throughout developing the feature. Thank you Shubham Tagra for brainstorming sessions and providing continuous guidance on Presto Hive ACID. Thank you Piotr Findeisen for helping me further refine the code with insightful code reviews.