Official announcement items from Martin are not yet available since the release is not out… but soon.
listagg function contributed by Marius
Data discovery is a process that aids in the analysis of data in environments where siloed data has been centralized, and it is difficult to find data sets or the overlap between disparate ones. Many teams have their own view of the world when it comes to the data they need, but they commonly need to reason about how their data relates to data outside of their domain.
There are typically questions about who owns what data, which helps identify the individuals responsible for maintaining standards. There are also issues around providing documentation for the data, and identifying who to call for help if there are issues using the data. Data discovery allows analysts to find patterns in the data and periodically audit data storage practices. Interesting questions also arise around existing policies, and discovery can encourage a system of record that acts as a shared front end for those data policies.
Amundsen provides data discovery by using ETL processes to scrape metadata from all of the data sources. It creates a central location to collect all that metadata and enables search and other analytics of this metadata. Here’s how the project describes itself on the Amundsen website:
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables).
Amundsen has an architecture that interacts primarily with tables, among other metadata, depending on the data source. In Trino's case, the extractor used for Trino views connects directly to the Hive metastore database, since that is where the views are stored. Physical tables use the HiveTableMetadataExtractor to load these tables into Amundsen, which makes sense since the data is stored in the Hive table format. For non-Hive use cases, you generally want to bypass Trino (for now) and connect Amundsen directly to each data source.
Amundsen includes an ETL framework called databuilder that runs multiple jobs. Jobs contain an ETL task to extract the metadata and load it into the two databases that are central to Amundsen: Neo4j and Elasticsearch. Neo4j stores the core metadata that is presented in the UI, while Elasticsearch enables search over the many fields in the metadata. Ingestion via ETL follows these steps:
Each job contains an ETL task. The task must define an extractor and a loader, and optionally a transformer. You can see example configurations for different extractors on the website, like the example for the HiveTableMetadataExtractor, and in the sketch below.
The metadata is modeled using a graph representation in Neo4j, and optionally Apache Atlas, to model advanced concepts such as lineage and other relationships. You can learn more about the metadata models here.
A new architecture, philosophy, and, yes, buzzword that is gaining momentum is the data mesh. While it is certainly not yet concretely defined and is still in the research and development phase, data mesh is gaining a lot of attention as a potential alternative to data lakes and data warehouses for analytics solutions.
Data mesh mirrors the philosophy of microservice architecture. It argues that data should be defined and maintained by the teams responsible for their business domain, similar to how responsibility is delegated at the service layer. Since not everyone on a domain team is going to be a data engineer, there must be some consideration for the architecture of such a platform. The author of this paradigm, Zhamak Dehghani, lays out four principles that characterize a data mesh. The principles are listed below; systems that provide some or all of a solution for a principle are listed in parentheses.
As with many successful open source projects, there are enterprise products that build on and support the open source project. Stemma is the enterprise company that supports Amundsen. It was founded by Mark Grover and others central to the open source project.
The PR (or should we say commit) of the week adds the original Trino extractor. As mentioned above, this extractor is only needed for views, as the physical tables exist in Hive and are retrieved by the HiveTableMetadataExtractor.
If you want to help out, you can consider adding the Trino image, similar to this commit completed a while back.
There were technical difficulties on the day the show was broadcast, so the demo was moved to its own separate video.
The steps in this demo are adapted from the Amundsen installation page.
Clone this repository and navigate to the trino-getting-started/amundsen directory. For this demo you need at least 3GB of memory allocated to your Docker instance.

```bash
git clone git@github.com:bitsondatadev/trino-getting-started.git
cd trino-getting-started/amundsen
docker-compose up -d
```
Once all the services are running, clone the Amundsen repository in a separate terminal. Then navigate to the databuilder folder and install all the dependencies:
```bash
git clone --recursive https://github.com/amundsen-io/amundsen.git
cd amundsen/databuilder
python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
python3 setup.py install
```
Navigate to MinIO at http://localhost:9000 and create the tiny bucket for the minio.tiny schema in Trino to map to. In Trino, create a schema and a couple of tables in the minio catalog:

```sql
CREATE SCHEMA minio.tiny
WITH (location = 's3a://tiny/');

CREATE TABLE minio.tiny.customer
WITH (
    format = 'ORC',
    external_location = 's3a://tiny/customer/'
) AS
SELECT * FROM tpch.tiny.customer;

CREATE TABLE minio.tiny.orders
WITH (
    format = 'ORC',
    external_location = 's3a://tiny/orders/'
) AS
SELECT * FROM tpch.tiny.orders;
```
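If you'd like to confirm the tables exist before loading them into Amundsen, here is a minimal sketch using the trino Python client (pip install trino). The host, port, and user are assumptions based on the defaults in this demo's docker-compose setup.

```python
# Quick sanity check that the demo tables exist in the minio.tiny schema.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host='localhost',
    port=8080,      # assumed Trino port from the demo's docker-compose file
    user='admin',   # Trino accepts any user name without authentication here
    catalog='minio',
    schema='tiny',
)
cur = conn.cursor()
cur.execute('SHOW TABLES')
print(cur.fetchall())  # expect [['customer'], ['orders']]
```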
Navigate back to the trino-getting-started/amundsen directory, with the Python virtual environment you just created still active, and run the sample loader:

```bash
cd trino-getting-started/amundsen
python3 assets/scripts/sample_trino_data_loader.py
```
View the Amundsen UI at http://localhost:5000 and try searching for test; it should return the tables you just created.
You can verify dummy data has been ingested into Neo4j by visiting http://localhost:7474/browser/.
Log in as neo4j with the test password and run MATCH (n:Table) RETURN n LIMIT 25 in the query box. You should see a few tables.
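If you prefer to check from a script instead of the browser, here is a minimal sketch using the neo4j Python driver (pip install neo4j), assuming the same neo4j/test credentials and the default Bolt port from the demo.

```python
# List the table nodes ingested into Neo4j, mirroring the browser query above.
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'test'))
with driver.session() as session:
    result = session.run('MATCH (n:Table) RETURN n.name AS name LIMIT 25')
    for record in result:
        print(record['name'])
driver.close()
```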
If you have any issues, look at some of the troubleshooting steps in the Amundsen installation page.
This week's question of the week comes in from Chen Xuying on the Trino Slack.
Is there any way to register a new user defined function (UDF) and needn’t restart coordinator and worker?
Currently, no. In Java, jar files and all of the code they contain are loaded at startup, so in order to load new function code on the coordinator and all of the worker nodes, you need to restart them. There are various ways that dynamic loading of UDFs could be implemented, and we are still looking for suggestions here.
Trino Meetup groups
If you want to learn more about Trino, check out the definitive guide from O'Reilly. You can download the free PDF or buy the book online.
Music for the show is from the Megaman 6 Game Play album by Krzysztof Słowikowski.