---
description: >-
  At the heart of Datafold is the data catalog, including column-level lineage
---

# Column-level lineage

To make the most of their data, data practitioners need to be able to review and assess the various data assets they are working with. Datafold’s Catalog feature helps data practitioners discover relevant data assets and learn important information about them.

It shows what the data looks like and how it flows through the system, allowing people to quickly see where the data comes from and which tables are affected when a change is made to the pipeline.

{% embed url="https://www.youtube.com/watch?v=JqaA12uFp5k" %}

## Data Catalog

When opening the catalog, you can filter at the database/schema level to quickly see which tables live in the data warehouse.

There are two ways to search for data assets:

  • Simply use the search bar: type a keyword in the search field and find the relevant asset in the dropdown list or resulting table.
  • Use the filters on the left side of the screen to narrow down the results. You can filter for specific schemas or tags, and you can also filter by owners and asset types.

In the screenshot above you can see the public Datafold dbt-beers repository. Datafold also has a rich search engine that allows you to quickly find your data assets:

The information shown can be ingested from many sources:

  • Through the Datafold UI, you can easily add descriptions and tags, and set owners for specific tables.
  • Sourcing the table/column descriptions, tags, and owners from dbt.
  • Fetching the table/column descriptions directly from the data warehouse schemas.
  • For certain databases, we have additional ways of ingesting metadata. For example, for Snowflake we support ingesting the Query tags.
  • For full control, you can use the API to set the metadata (see the sketch after this list).
  • If there are any integrations that you would like to see, please don't hesitate to reach out!
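
For the API option, metadata can be pushed from a script. The sketch below is an illustration under loudly stated assumptions: the endpoint path, payload shape, and authentication header are hypothetical, not Datafold's documented API, so consult the API reference for the actual contract.

```python
# Hypothetical sketch: pushing table metadata to Datafold over HTTP.
# The endpoint path, payload fields, and auth header are assumptions for
# illustration only; check the Datafold API reference for the real contract.
import os

import requests

DATAFOLD_HOST = "https://app.datafold.com"  # or your self-hosted deployment
API_KEY = os.environ["DATAFOLD_API_KEY"]    # keep credentials out of code

payload = {
    "description": "Daily beer sales, aggregated per brewery",
    "tags": ["beers", "daily"],
    "owner": "data-team@example.com",
}

resp = requests.patch(
    f"{DATAFOLD_HOST}/api/v1/tables/analytics.public.beer_sales/metadata",  # hypothetical path
    headers={"Authorization": f"Key {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
```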

These options keep the catalog rich and make it easy to integrate Datafold into your existing data stack. In addition to that, Datafold gives easy insight into the data itself:

The Profile view shows:

  • When the profile was computed
  • Number of rows and columns
  • When the table was last updated
  • The data owner of the data asset
  • The upstream and downstream tables
  • Original query (click on “Show SQL query” to show the SQL statement that created the table)
  • Table/Column description, tags, etc.

In addition, it gives a profile of the data for each column:

This view gives the analyst, engineer, or scientist quick insights into:

  • Sparsity, which gives an indication of how well that field is populated. Do all rows have an entry or are there many nulls?
  • Distribution, which shows how the data is spread across the value range in the case of numerical data (both metrics are illustrated in the sketch below).
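
As a rough illustration of what these two metrics capture, here is a minimal pandas sketch; the column name and its values are made up:

```python
# Minimal sketch of the two profile metrics described above.
# The column name and values are made-up examples.
import pandas as pd

df = pd.DataFrame({"abv": [4.5, 5.2, None, 6.1, None, 4.8]})

# Sparsity: what fraction of the rows is NULL for this column?
sparsity = df["abv"].isna().mean()
print(f"sparsity: {sparsity:.0%}")  # sparsity: 33%

# Distribution: bucket the numeric values into a small histogram.
histogram = pd.cut(df["abv"].dropna(), bins=3).value_counts().sort_index()
print(histogram)
```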

The profiles are computed on an ad hoc basis by default, but it is highly recommended to compute them outside of business hours so that the profiles are already cached in Datafold.

## Column-level lineage

Most data in your database is not loaded directly from a source, but has been assembled from underlying raw data using queries. This is the T step in ELT: refining and combining data from different sources into useful information. One table may be generated by a SQL query on the raw data, and that table may in turn be used by an ETL job to generate another dataset. This creates directional relationships between the various datasets, similar to a family tree with descendants and ancestors.

This relationship is called lineage and can be shown in a lineage graph. It can be generated at the column level (showing how each column relates to columns in other tables), or at the table level (showing the relationships between tables in their entirety).
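
To make the family-tree analogy concrete, here is a minimal sketch that models table-level lineage as a directed graph and walks it in both directions. The table names are made-up examples; Datafold builds and visualizes this graph for you.

```python
# Minimal sketch: table-level lineage as a directed graph.
# The table names are made-up examples.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw_orders", "stg_orders"),       # raw -> staging
    ("raw_breweries", "stg_breweries"),
    ("stg_orders", "beer_sales"),       # staging -> mart
    ("stg_breweries", "beer_sales"),
    ("beer_sales", "sales_dashboard"),  # mart -> BI
])

# Upstream: every ancestor the table is built from.
print(nx.ancestors(lineage, "beer_sales"))
# Downstream: every descendant built from the table.
print(nx.descendants(lineage, "beer_sales"))
```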

To learn where the data comes from, you can see an overview in the Lineage tab. Datafold derives this lineage from the SQL statements that your ETL tool executes against the data warehouse.
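
Datafold's own parser is internal, but as a rough illustration of the idea, an open-source SQL parser such as sqlglot can extract the tables a statement reads from. The query below is a made-up example:

```python
# Rough illustration: deriving table-level lineage from a SQL statement.
# Datafold's own parser is internal; sqlglot stands in for the idea here.
from sqlglot import exp, parse_one

sql = """
CREATE TABLE beer_sales AS
SELECT o.order_date, b.brewery_name, SUM(o.amount) AS total_amount
FROM stg_orders AS o
JOIN stg_breweries AS b ON o.brewery_id = b.id
GROUP BY 1, 2
"""

# Collect every table referenced in the statement, then drop the target
# to leave only the upstream sources.
tables = {t.name for t in parse_one(sql).find_all(exp.Table)}
sources = sorted(tables - {"beer_sales"})
print(f"{sources} -> beer_sales")  # ['stg_breweries', 'stg_orders'] -> beer_sales
```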

This information is very powerful because you can easily trace the upstream and downstream dependencies of any dataset in your warehouse and understand exactly how the data is produced and consumed.

Navigating through the lineage is easy: by right-clicking on another table, you can see the upstream and downstream dependencies of that specific table.