Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements in Glossaries module #1557

Open
dlpzx opened this issue Sep 17, 2024 · 0 comments
Open

Improvements in Glossaries module #1557

dlpzx opened this issue Sep 17, 2024 · 0 comments

Comments

@dlpzx
Copy link
Contributor

dlpzx commented Sep 17, 2024

Here is a list of improvements that can be currently be taken in the Glossaries module:

  1. Implement server-side permissions for glossary operations
    This should be the first priority. For glossaries there are no permissions decorating the service. Which means that for example if a team owns glossary1, another team could delete glossary1. This is avoided in the UI but not programmatically.

  2. Ensure business logic is in GlossariesService and not in db.repositories
    The service layer in the glossaries module offloads all the business logic to the GlossariesRepository. Instead we should execute the business logic in the service and use the repository exclusively for db operations.

  3. Clearly split Glossaries and Catalog. At the moment in the backend there is a catalog module which is in fact handling all APIs for Glossaries which are not necessary the same. In the frontend we separate among Glossaries and Catalog.

  4. Remove duplicated gql fields for Glossaries/Categories/Terms
    Categories, Terms and Glossaries share most of their gql types and input_types. We should reduce the lines of code and re-use a single input_type and single type for them. This is also applicable to the frontend query definition.

  5. Simplify API calls and make them more straight-forward
    Secondly, the way the API calls are designed to get information from a glossary can be simplified. We currently define 3 APIs in the frontend:

  • getGlossary
  • getGlossaryTree - which in reality is getGlossary with a bunch of extra fields
  • listAssociations - which in reality is getGlossaryTree - which in reality is getGlossary

getGlossary is doing too many things. We should break it down by responsibility:

  • getGlossary - input: glossary.nodeUri, output: glossary db metadata + resolve_stats
  • NEW `getGlossaryTree - input: glossary.nodeUri, output: tree object of categories, terms with their parentUri
  • NEW listAssociations - input: glossary.nodeUri, output: paginated list of associations
  1. Remove duplicated code
    Glossaries API calls return a great number of unused fields. First of all we should remove unused fields from both backend and frontend to improve performance. I am mostly referring to the Glossary object.
dlpzx added a commit that referenced this issue Sep 20, 2024
### Feature or Bugfix
- Feature:Testing


### Detail
Implement tests for Glossaries/Catalog as part of
#1220

⚠️ To test glossary associations, we need resources that use the
glossary terms (associate a term to a resource). Possible resources
include: s3_datasets, s3_tables, s3_folders, dashboards,
redshift_datasets and redshift_tables. All of which are part of other
modules that might be enabled or disabled. In this PR we assume that the
s3_datasets module is enabled! If it was not enabled, then the glossary
tests would fail!


During the implementation several enhancement ideas came up and are
collected in #1557


Tested in local with connection to real AWS deployment:

![image](https://github.com/user-attachments/assets/d759cc3d-7e44-4057-b16b-0fc94edd2290)

### Relates
- #1220 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant