Improved Hive Bucketing

May 29, 2019 • David Phillips

Presto 312 adds support for the more flexible bucketing introduced in recent versions of Hive. Specifically, it allows any number of files per bucket, including zero. This allows inserting data into an existing partition without having to rewrite the entire partition, and improves the performance of writes by not requiring the creation of files for empty buckets.

Hive bucketing overview #

Hive bucketing is a simple form of hash partitioning. A table is bucketed on one or more columns with a fixed number of hash buckets. For example, a table definition in Presto syntax looks like this:

CREATE TABLE page_views (
  user_id bigint,
  page_url varchar,
  dt date
)
WITH (
  partitioned_by = ARRAY['dt'],
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
)

The bucketing happens within each partition of the table (or across the entire table if it is not partitioned). In the above example, the table is partitioned by date and is declared to have 50 buckets using the user ID column. This means that the table will have 50 buckets for each date. The assigned bucket for each row is determined by hashing the user ID value. This means that all user IDs with the same value will go into the same bucket.

Original Hive bucketing #

Originally, Hive required exactly one file per bucket. The files were named such that the bucket number was implicit based on the file’s position within the lexicographic ordering of the file names. For example, the following list of files represent buckets 0 to 2, respectively:

00000_0
00001_0
00002_0

file0
file3
file5

bucketA
bucketB
bucketD

The file names are meaningless aside from their ordering with respect to the other file names.

What’s the problem? #

The original Hive bucketing scheme has a couple of problems:

Inserting data into the table by adding additional files is not possible. Instead, an insert operation requires rewriting all of the existing files, which can be quite expensive.
If the data is sparse, some of the buckets might be empty, but because there must be a file for every bucket, the writer must create an empty file for each bucket. Some file formats, such as ORC, support zero-byte files as empty files. Other formats require writing a file with a valid header and footer. Creating these files adds latency to the write operation, and storing these tiny files is inefficient for file systems like HDFS which are designed for large files.

Improved Hive bucketing #

Newer versions of Hive support a bucketing scheme where the bucket number is included in the file name. This is the same naming scheme that Hive has always used, thus it is backwards compatible with existing data. The naming convention has the bucket number as the start of the file name, and requires that the number starts with a 0.

The following list of files shows what data written by Hive might look like for a table with a bucket count of 4:

000000_0            # bucket 0
000000_0_copy_1     # bucket 0
000000_0_copy_2     # bucket 0
000001_0            # bucket 1
000001_0_copy_1     # bucket 1
000003_0            # bucket 3

We can see that there are multiple files for buckets 0 and 1, one file for bucket 3, and no files for bucket 2.

Unfortunately, Presto used a different naming convention that was valid according to the lexicographical ordering requirement, but not the newer explicit numbering convention. File names written by Presto used to look like this:

20180102_030405_00641_x1y2z_bucket-00234

The 20180102_030405_00641_x1y2z value at the start of the file name is the Presto query ID for the query that wrote the data. This is followed by bucket- plus the padded bucket number. Presto now writes file names that match the new Hive naming convention, with the bucket number at the the start and the query ID at the end:

000234_0_20180102_030405_00641_x1y2z

When reading bucketed tables, Presto supports both the new Hive convention and the old Presto convention. Additionally, it still supports the original Hive scheme when the files do not match either of the naming conventions, keeping the requirement that there must be exactly one file per bucket.

Skipping empty buckets for faster writes #

Now that Hive and Presto no longer require files for empty buckets, Presto does not need to create them. They are still created by default for compatibility with earlier versions of Hive, Presto, and other tools, but we expect to disable it in a future release, making writes faster by default. Or you may choose to disable them now if that works for your environment. This is controlled by the hive.create-empty-bucket-files configuration property or the create_empty_bucket_files session property.

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino blog