
Automatic metadata generation using genAI #1599

Open
dlpzx opened this issue Oct 1, 2024 · 0 comments

dlpzx commented Oct 1, 2024

Problem statement

Is your feature request related to a problem? Please describe.
Current metadata creation processes in data.all are manual and time-consuming, leading to incomplete, inconsistent, and outdated metadata. Inconsistency in metadata across datasets makes it difficult to understand and compare the information. Incomplete metadata reduces the value and usability of the data, while outdated metadata can hinder the ability to properly utilize the datasets. Additionally, the quality of manual metadata can vary significantly from dataset to dataset, depending on the data producer's expertise and available time and resources. Crucially, the burden of this undifferentiated heavy lifting falls on data producers, who must spend valuable time and resources on manual metadata creation instead of focusing on their core business problems.

By leveraging GenAI techniques, an automated metadata recommendation feature can address these challenges: the metadata recommendation process can be streamlined, standardized, and kept up to date. This feature targets the pain point of inconsistent, incomplete, and outdated metadata caused by manual approaches, and aims to improve metadata quality and consistency across data.all while freeing producers to focus on their core competencies.

User Stories

Describe the solution you'd like

US1.

As a Data Producer, I want automated metadata recommendation for data.all datasets, including but not limited to dataset description, tags, topics, table description and column description, so that I can ensure datasets are discoverable and well-documented without manual effort.

Acceptance Criteria

  • Data producer creates or imports a new data.all dataset; once created, the user can automatically generate relevant metadata and display it, including dataset description, topics, tags, table description and column description.
  • Data producer can use automated metadata recommendation on existing datasets (backward compatibility) and republish the dataset without any extra steps.

US2.

As a data producer, I want the ability to run the automated metadata recommendation feature on demand, so that I can keep the data catalog information up-to-date as my data assets evolve.

Acceptance Criteria:

  • Data.all provides a one-click interface for data producers to initiate on-demand automated metadata recommendation and updating for a selected data.all dataset.
  • The updated metadata is reflected in the data.all catalog, allowing for user review and acceptance of the changes before they are persisted.

US3.

As a Data Producer, I want the ability to review, edit, and annotate automatically recommended metadata, so that I can ensure its accuracy and relevance while leveraging the automated process.

Acceptance Criteria:

  • Users can review and manually edit the AI-generated metadata before accepting it, with the accepted changes reflected in the metadata view of the data.all datasets.

US4.

As a Data Consumer, I want to use advanced search and filtering options based on enriched metadata to find relevant datasets quickly and efficiently.

Acceptance Criteria:

  • The interface allows for searching and filtering based on various metadata attributes.

US5.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to be secure and respect data governance access permissions.

Acceptance Criteria:

  • The automated metadata recommendation employs a least-privilege model to limit permissions and access, and complies with data.all's security posture.
  • The automated metadata recommendation is available only to authenticated data.all users and has the same data access permissions as the user; only the data owner can generate or update metadata (see the ownership-check sketch below).
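A minimal sketch of the ownership check implied above, assuming metadata generation is gated by data.all's team-based dataset ownership; the function and argument names are hypothetical, not the actual data.all permission API:

```python
# Hypothetical sketch: only authenticated users in the dataset's owner team may
# generate or update metadata. Names are illustrative, not data.all's real API.
class UnauthorizedError(Exception):
    pass


def check_can_generate_metadata(username: str, user_groups: set, dataset_owner_group: str) -> None:
    """Raise if the user is not part of the team that owns the dataset."""
    if dataset_owner_group not in user_groups:
        raise UnauthorizedError(
            f"{username} is not in owner team '{dataset_owner_group}' and cannot "
            "generate or update metadata for this dataset."
        )


# Example: a member of the owner team passes; anyone else gets UnauthorizedError.
check_can_generate_metadata("alice", {"research-team"}, "research-team")
```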

US6.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.

Acceptance Criteria:

  • The automated metadata recommendation is modularized and can be turned on and off (see the configuration sketch after this list).
  • The automated metadata recommendation can support a high load of requests and efficiently manages calls to models.
  • The automated metadata recommendation is seamlessly integrated into the data.all user interface in the dataset view without significant changes in user experience.
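As a rough illustration of the on/off requirement, the sketch below assumes the feature ships as a data.all module whose activation flag lives in the deployment's config.json; the module key and field names are assumptions for the example:

```python
# Sketch of a module toggle read from config.json. The "metadata_generation"
# key and the "active" flag are assumed names, not confirmed configuration keys.
import json
from pathlib import Path


def metadata_generation_enabled(config_path: str = "config.json") -> bool:
    config = json.loads(Path(config_path).read_text())
    module = config.get("modules", {}).get("metadata_generation", {})
    return bool(module.get("active", False))


if not metadata_generation_enabled():
    print("Automated metadata recommendation is disabled in this deployment.")
```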

US7.

As a data.all developer and maintainer, I want to be able to configure rate limits for the automated metadata recommendation feature so that I can prevent overuse and ensure responsible access to the feature.

Acceptance Criteria:

  • Maintainers can set thresholds for daily usage metrics, such as the number of times the automated metadata recommendation can be executed per user (a per-user limiter sketch follows this list).
  • Once a user hits the configured threshold, the feature notifies the user that the usage limit has been reached and blocks further metadata generation for that user until the next day.
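A minimal, in-memory sketch of the per-user daily limit described above; in practice the counter would be persisted (for example in the data.all database), and the threshold value here is arbitrary:

```python
# Illustrative per-user daily rate limiter for metadata generation requests.
from collections import defaultdict
from datetime import date


class DailyRateLimiter:
    def __init__(self, max_runs_per_day: int = 10):
        self.max_runs_per_day = max_runs_per_day
        self._counters = defaultdict(int)  # keyed by (username, day)

    def try_acquire(self, username: str) -> bool:
        """Return True if the user may run metadata generation today."""
        key = (username, date.today())
        if self._counters[key] >= self.max_runs_per_day:
            return False
        self._counters[key] += 1
        return True


limiter = DailyRateLimiter(max_runs_per_day=5)
if not limiter.try_acquire("alice"):
    print("Daily metadata-generation limit reached; try again tomorrow.")
```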

US8.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that I understand the context and boundaries of the AI-generated information.

Acceptance Criteria:

  • The automated metadata recommendation feature UI always presents a disclaimer that cannot be easily missed by the user and states: "Carefully review this AI-generated response for accuracy ..."

US9.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.

Acceptance Criteria:

  • The automated metadata recommendation feature includes a thumbs up/down widget that users can click to provide feedback on the response; this feedback is captured and used to refine and improve the automated metadata recommendation responses over time (see the sketch after this list).
  • Users receive a confirmation message after providing feedback, assuring them that their input will be used to enhance the feature.
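The feedback capture itself could be as simple as the sketch below; the record fields and confirmation message are illustrative only:

```python
# Hypothetical feedback record for the thumbs up/down widget.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MetadataFeedback:
    dataset_uri: str
    username: str
    helpful: bool  # True = thumbs up, False = thumbs down
    created_at: datetime


def record_feedback(dataset_uri: str, username: str, helpful: bool) -> str:
    feedback = MetadataFeedback(dataset_uri, username, helpful, datetime.now(timezone.utc))
    # In data.all this record would be persisted for later analysis of response quality.
    print(feedback)
    return "Thanks! Your feedback will be used to improve metadata recommendations."
```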

Scope

1/ Metadata Generation:

  • Implement a 'Generate Metadata' or 'AI Icon' action button that data producers can access after creating a new dataset or importing an existing dataset into data.all.
  • When the "Generate Metadata" action is triggered, the system should automatically generate the metadata, including dataset description, table descriptions, column descriptions, etc. (a generation sketch follows this list).
  • Allow data producers to select specific tables and/or folders within a dataset for which they want to generate metadata, or generate it for all tables and all folders by default.
  • Ensure a seamless user experience by eliminating the need for users to manually fill in metadata and avoiding any duplication or overriding of the metadata that is generated automatically by the feature.
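To make the "Generate Metadata" flow more concrete, here is a rough sketch assuming Amazon Bedrock as the model provider (the issue does not mandate a specific service); the model ID, prompt wording, and column-schema format are assumptions for illustration:

```python
# Illustrative "Generate Metadata" call, assuming Amazon Bedrock via boto3's
# converse API. Model ID, prompt, and the column-schema input are assumptions.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")


def generate_dataset_description(dataset_name: str, columns: list) -> str:
    prompt = (
        f"Write a concise description for a dataset named '{dataset_name}' "
        "with the following columns:\n" + json.dumps(columns, indent=2)
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]


# Example call with a hypothetical table schema
print(generate_dataset_description(
    "sales_orders",
    [{"Name": "order_id", "Type": "string"}, {"Name": "amount", "Type": "double"}],
))
```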

2/ Metadata Review and Acceptance:

  • After the automated metadata generation, display the recommended metadata in an interface for the data producer to review the AI-generated metadata, make edits, and annotate the information to ensure accuracy and relevance.
  • Implement "Accept Recommendation", "Edit Recommendation" and "Reject Recommendation" actions, allowing the data producer to control which metadata is persisted in the data.all catalog for their dataset (a sketch of this workflow follows below).
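A small sketch of this accept/edit/reject step, where generated metadata stays in a pending state and is only persisted once the producer accepts it; class and field names are hypothetical:

```python
# Hypothetical review-and-acceptance model: recommendations are PENDING until
# the data producer explicitly accepts (optionally after editing) or rejects them.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataRecommendation:
    dataset_uri: str
    description: str
    tags: list = field(default_factory=list)
    status: str = "PENDING"  # PENDING | ACCEPTED | REJECTED


def accept_recommendation(rec: MetadataRecommendation, edited_description: Optional[str] = None) -> dict:
    """Persist only reviewed metadata; an edited value overrides the generated one."""
    rec.status = "ACCEPTED"
    # In data.all this would update the dataset record and the catalog index.
    return {
        "datasetUri": rec.dataset_uri,
        "description": edited_description or rec.description,
        "tags": rec.tags,
    }


def reject_recommendation(rec: MetadataRecommendation) -> None:
    rec.status = "REJECTED"  # nothing is written to the catalog
```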

3/ Backward Compatibility for Existing Datasets:

  • Extend the "Generate Metadata" functionality to support data producers' existing datasets in data.all.
  • Provide a way for data producers to trigger the automated metadata generation for their existing datasets, ensuring backward compatibility and enabling them to update the metadata for existing data assets.

4/ On-demand Metadata Refresh:

  • Offer a user-friendly action for data producers to initiate on-demand automated metadata recommendation in the event of changes to an existing table schema or when new tables are added, to ensure completeness and correctness.
  • This process still needs to follow the Metadata Review and Acceptance workflow described in 2/.

5/ Metadata-driven Search and Filtering:

  • Leverage the accepted metadata to enhance the data.all search and filtering capabilities, enabling data consumers to quickly discover relevant datasets based on the enriched information (a query sketch follows below).
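As an illustration of metadata-driven filtering, the sketch below assumes the data.all catalog is backed by an OpenSearch index queried with opensearch-py; the index name and field names are assumptions:

```python
# Hypothetical catalog query filtering on AI-enriched metadata (tags, topics).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "bool": {
            "must": [{"match": {"description": "sales orders"}}],
            "filter": [
                {"terms": {"tags": ["finance", "sales"]}},
                {"term": {"topics": "Sales"}},
            ],
        }
    }
}
results = client.search(index="dataall-catalog", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_source"].get("label"), hit["_source"].get("tags"))
```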

Out of Scope

  • Bring Your Own Model: The automated metadata recommendation feature will not support the ability for users to bring their own language models.
  • Fine Tuning: This feature does not include fine-tuning an LLM to obtain a customized model. Fine-tuning is excluded because data.all is deployed in customer environments, there is little data on user-executed requests, and fine-tuning requires a significant amount of data to align a model to a particular domain or task.
  • Role Management: The automated metadata recommendation feature will assume the role of a generic data producer persona and will not be customized for different user personas.

Guardrails

  • Transparency and Disclosure: This feature is in an experimental stage. The metadata provided should be considered as a starting point, and users are encouraged to "trust but verify" the information, as there may be limitations or uncertainties in the responses.
  • Truthfulness and Integrity: The feature aims to provide truthful and complete metadata information to the best of its abilities. However, it is possible that the dataset summaries, column names, or descriptions may not be entirely accurate. Users should review the metadata carefully and report any issues or discrepancies.
  • Clear and Informative Error Messages: If the model encounters any issues or is unable to provide the requested metadata, it will return a clear error message instead of generating an incorrect response.
  • Human Review and Acceptance: After the model generates the metadata response, the user will be prompted to review the information. The user must explicitly accept the metadata before it can be used or saved. This human-in-the-loop approach ensures that the metadata is verified and approved before being utilized.
  • Cost: Usage will be restricted to a specific metric per day per user to promote responsible use. The choice of model will be determined through a frugal evaluation of functionality and cost. Estimated usage costs will be published to allow customers to make informed decisions.

Describe alternatives you've considered
See design below

Additional context
This feature will first be implemented as an MVP and then reworked to make it production-ready.


dlpzx added a commit that referenced this issue Oct 1, 2024
### Feature
- Feature

### Detail
- Automated metadata generation using gen AI. MVP phase

### Related
#1599 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <[email protected]>