Tardigrade Project Update

Feb 16, 2022 • Andrii Rosa

Over the last couple of months we’ve added support for full query retries, landed experimental support for task level retries and provided a proof of concept implementation of a distributed exchange plugin (description below). We are still working on improving scheduling algorithms as well as optimizing exchange plugin implementation to make the task level retries fully usable.

Here is a quick summary of our progress so far:

Added support for automatic query retries. This functionality is ready to use and can be enabled by setting the retry_policy=QUERY session property. Now it is possible to enable automatic retries for queries that produce more than 32MB of output. Dynamic filtering is now also fully supported with automatic query retries enabled.
Landed an initial set of changes to support task level retries. To be enabled, a plugin implementing the ExchangeManager interface has to be installed.
Landed a proof of concept implementation of the ExchangeManager interface. The implementation is fully functional, however we are still working on optimizing the read path. Also for now, only S3 compatible file systems are supported.
Added support for automatic retries in Hive and Iceberg. Supporting automatic retries for JDBC based connectors is up for grabs.
Implemented weight based split assignment for balanced work distribution between fault tolerant tasks.
Working on adaptive sizing strategy for intermediate tasks to minimize scheduling overhead while keeping the cost of a single task failure at minimum.
Making progress on introducing an advanced memory aware scheduling that would allow us to better support memory intensive queries, improve resource utilization and ensure fair resource allocation between queries.
Started working on supporting dynamic filtering for queries with task level retries enabled.
Working on accommodating failed attempts in various internal statistics reported by the engine (e.g.: QueryInfo, QueryCompletedEvent). UI changes will come next.

Over the next couple of weeks we are planning to focus on:

Optimizing read path for the reference implementation of the exchange plugin
Landing memory aware scheduling for fault tolerant execution
Landing adaptive sizing for intermediate tasks
Accommodating failed attempts into query statistics reporting
Making progress on supporting dynamic filtering for queries with task level retries enabled

The current state of development can be tracked by following this issue.

Stay tuned!

Do you ❤️ Trino? Give us a 🌟 on GitHub

Trino blog

Tardigrade Project Update