[POC] Generate mermaid diagrams from harmonized index schemas #4918

Open
wants to merge 7 commits into
base: main

Conversation

pepopowitz
Collaborator

@pepopowitz pepopowitz commented Feb 2, 2025

Should not be merged as-is. This PR exists for the purpose of soliciting feedback.

Description

My hackathon project, a proof of concept of #4870.

Adds a new page containing diagrams of the main & runtime harmonized indexes. Includes scripts to generate those diagrams from the upstream index schemas.

Preview

https://preview.docs.camunda.cloud/pr-4918/docs/next/self-managed/operational-guides/backup-restore/harmonized-indexes/

image

What's in this PR

  1. Copies the harmonized index schemas from upstream for ease of working. Long-term, I'd expect us to check out the camunda/camunda repo as part of the workflow.
  2. Adds a script (1-combine-sources) to combine all the JSON files into one big one (or technically two, to retain the distinction between main & runtime indexes). I'm undecided if this would be useful for a long-term workflow.
  3. Adds a script (2-generate-mermaid-diagrams) to generate proper markdown to emit mermaid diagrams for all schemas in a JSON file.
    • Note that I would like extra hackathon points for not only writing unit tests during a hackathon, but using TDD to derive this logic.
  4. Adds a script (2-generate-mermaid-output) to dump the generated markdown into a file.
  5. Adds markdown partials to include that output in a new page.
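As a rough illustration of steps 2 and 3, here is a minimal Python sketch, not the scripts in this PR. It assumes each upstream file is an Elasticsearch-style schema with its fields under `mappings.properties`; the function names `combine_sources` and `to_mermaid` are hypothetical.

```python
import json
from pathlib import Path


def combine_sources(source_dir: Path) -> dict:
    """Merge every *.json schema in source_dir into one dict keyed by file stem."""
    combined = {}
    for path in sorted(source_dir.glob("*.json")):
        schema = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: field definitions live under mappings.properties.
        combined[path.stem] = schema.get("mappings", {}).get("properties", {})
    return combined


def to_mermaid(index_name: str, properties: dict) -> str:
    """Render one index as a mermaid erDiagram entity, one 'type name' row per field."""
    lines = ["erDiagram", f"  {index_name} {{"]
    for field, spec in properties.items():
        # Fields without an explicit type are treated as objects here.
        lines.append(f"    {spec.get('type', 'object')} {field}")
    lines.append("  }")
    return "\n".join(lines)
```

Running `combine_sources` once per directory (main and runtime) would keep the two sets of indexes distinct, as described in step 2.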

Implementation notes

  • I put the new page under self-managed/backup and restore because that seemed like the most likely place someone might be interested in knowing what the schema looked like.
  • The schema definitions aren't truly an "Entity Relationship" diagram; that type of diagram seemed like the best fit that I could find.
  • The mermaid integration into docusaurus is such that very large diagrams become unreadably small. The diagram is limited to the width of the page, and is itself an SVG, and it gets scaled down to fit the page.
    • I intentionally left 6 entities in one diagram so you can see this scaling in action, in the 4th row of entities as you scroll down the page.
    • In response, I chose for this prototype to create diagrams of 3 schemas, and stack them on top of each other vertically. I looked briefly into adding scrollbars and making one big scrollable diagram, but I didn't have success with that.
    • We can choose the layout algorithm, but neither of the available algorithms lays the entities out in a more readable format than what you're seeing.
  • There are multiple concepts in the schemas that we'll need to address somehow.
    • object types: an index defines nested types in its schema. I chose to represent these using ERD relationships, extracting the nested type into a separate entity (see camunda-authorization index). This is inaccurate in ERD terms, because the nested type is not a separate index, but it might be a better way to describe things?
      • An example would be grouping a firstName and lastName under a name, or this from our indexes:
        "permissions": {
          "type": "object",
          "properties": {
            "type": {
              "type": "keyword"
            },
            "resourceIds": {
              "type": "keyword"
            }
          }
        }
    • join types: in this case, an item in an index references other items in the same index, through a defined relationship. This is not exactly a concept that a traditional Entity Relationship diagram handles out of the box. I tried joining a couple indexes to themselves (see camunda-group and tasklist-task indexes) to represent this; I left the others as using the type join.
      • Here's an example from a schema:
        "joinRelation": {
           "type": "join",
           "eager_global_ordinals": true,
           "relations": {
             "processInstance": ["activity", "variable"]
           }
         },
    • There may be some implied "relationships" across indexes, which I did not represent. For example, the operate-decision index includes a decisionRequirementsKey property, which I think might refer to the key of an item in the operate-decision-requirements index.
      • It would make a more complete diagram to include these implied relationships, if I am understanding them correctly.
      • However, it would destroy my workaround for mermaid/docusaurus's tendency to squish large diagrams, as I would need all entities in one diagram. That isn't a reason not to do it, but including the relationships would mean I'd have to find a different workaround.
      • The implied relationships are not represented in the indexes. If we chose to represent them in the diagrams, I'd need to maintain a list of relationships here, separate from the upstream source. Again, not a reason not to do it, but it does introduce complexity and fragility.
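To make the object and join treatments above concrete, here is a minimal Python sketch under stated assumptions. It is not the PR's implementation: the function names, the `<index>-<field>` naming for extracted entities, and the `||--||` cardinality for nested objects are all hypothetical choices. It handles only one level of nesting, and it splits index names into groups of 3 to mirror the stacking workaround.

```python
def extract_entities(index_name: str, properties: dict):
    """Split one index into mermaid entities plus relationship lines.

    Nested object fields become their own entity (a presentation choice,
    not a separate index); join fields become self-relationships.
    """
    entities = {index_name: []}
    relationships = []
    for field, spec in properties.items():
        if "properties" in spec:  # nested object type
            child = f"{index_name}-{field}"  # hypothetical naming scheme
            entities[child] = [
                (sub.get("type", "object"), name)
                for name, sub in spec["properties"].items()
            ]
            entities[index_name].append(("object", field))
            relationships.append(f'{index_name} ||--|| {child}: "{field}"')
        elif spec.get("type") == "join":
            entities[index_name].append(("join", field))
            for parent, members in spec.get("relations", {}).items():
                # A relation's children may be a single string or a list.
                for member in [members] if isinstance(members, str) else members:
                    relationships.append(
                        f'{index_name} ||--o{{ {index_name}: "{parent}:{member}"'
                    )
        else:
            entities[index_name].append((spec.get("type", "object"), field))
    return entities, relationships


def render(entities: dict, relationships: list) -> str:
    """Emit erDiagram text with 'type name' rows, then the relationship lines."""
    lines = ["erDiagram"]
    for name, rows in entities.items():
        lines.append(f"  {name} {{")
        lines.extend(f"    {ftype} {fname}" for ftype, fname in rows)
        lines.append("  }")
    lines.extend(f"  {rel}" for rel in relationships)
    return "\n".join(lines)


def chunked(names: list, size: int = 3):
    """Yield groups of index names so each diagram stays small enough to read."""
    for start in range(0, len(names), size):
        yield names[start:start + size]
```

Implied cross-index relationships are deliberately left out of this sketch, since they would have to come from a hand-maintained list rather than from the schemas.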

Other changes required before "done"

  1. The generation of the markdown from JSON schemas would happen in a GitHub workflow.
  2. Content for the page needs to be written. https://docs.google.com/document/d/1EFZ19Gx8Nf559pP_Bg8ObFMfGYdlq20age8P_WiSBOY/edit?tab=t.0#heading=h.c447h0byekxu is a good starting point for this. I will likely solicit a technical writer to help me with this 😅

Decisions to be made/feedback I'm interested in

Because I think it will be easier for reviewers, I will post a list of questions in a comment, so that you can reply to it with any of your feedback.

When should this change go live?

Never, at least not in this form!

@pepopowitz pepopowitz added the hold This issue is parked, do not merge. label Feb 2, 2025
@pepopowitz pepopowitz added the deploy Stand up a temporary docs site with this PR label Feb 2, 2025
Contributor

github-actions bot commented Feb 2, 2025

👋 🤖 🤔 Hello, @pepopowitz! Did you make your changes in all the right places?

These files were changed only in docs/. You might want to duplicate these changes in versioned_docs/version-8.6/.

  • docs/self-managed/operational-guides/backup-restore/_harmonized-indexes-main.md
  • docs/self-managed/operational-guides/backup-restore/_harmonized-indexes-runtime.md
  • docs/self-managed/operational-guides/backup-restore/harmonized-indexes.md

You may have done this intentionally, but we wanted to point it out in case you didn't. You can read more about the versioning within our docs in our documentation guidelines.

@github-actions github-actions bot temporarily deployed to camunda-docs February 2, 2025 09:36 Destroyed
Contributor

github-actions bot commented Feb 2, 2025

The preview environment relating to the commit a6a5600 has successfully been deployed. You can access it at https://preview.docs.camunda.cloud/pr-4918/index.html

@pepopowitz
Collaborator Author

pepopowitz commented Feb 3, 2025

Things I'm directly seeking feedback on

  • "Harmonized indexes" feels like a thing we call these internally, and I wonder if a user would know what that meant. Is there a better more simplified name for the page?
  • How do you feel about the location of the page? Should it be moved somewhere else?
  • Is this better than generating text-only tables? A downloadable diagram?
  • What do you think about the object representation I described? Do you think an extracted object in the diagram is confusing, since it technically lives on the original index object? Is it clear from the diagram what's going on?
  • What do you think about the join alternatives I presented in the description? Do you have a preference? Is there another way to represent this that might be better?
  • Do you think it would be better to include the implied relationships I described?
  • How many extra hackathon points will you give me for the unit tests & TDD?

And of course, anything else that's on your mind 😄

long memberKey
join join
}
camunda-group ||--o{ camunda-group: "group:member"
Collaborator Author


I forgot to mention this, but I snuck these join treatments into the generated markdown manually to see what they would look like. The scripts would need to be updated to accommodate these if we decide we like them.

@akeller
Member

akeller commented Feb 5, 2025

Note that I would like extra hackathon points for not only writing unit tests during a hackathon, but using TDD to derive this logic.

🪙 🙌

@akeller
Member

akeller commented Feb 5, 2025

  • The mermaid integration into docusaurus is such that very large diagrams become unreadably small. The diagram is limited to the width of the page, and is itself an SVG, and it gets scaled down to fit the page.

Zoomable images are back again, I see. @Sijoma was interested in Mermaid via this issue. He might have a larger example we could try in our docs to see how the experience would be. I fear most of our diagrams will end up being very large 🥲 .

"Harmonized indexes" feels like a thing we call these internally, and I wonder if a user would know what that meant. Is there a better more simplified name for the page?

This is probably a @ChrisKujawa question. It's part of platform unification and (IMO) users only really need to know about it for migrating from the multi-component concept. It's the Camunda platform core indices, but maybe with more capitalization.

How do you feel about the location of the page? Should it be moved somewhere else?

Good question. I think it's part of the architecture (self-managed/reference-architecture/#architecture) but also part of the update guide (self-managed/operational-guides/update-guide/introduction/). I don't know how often it would be referenced outside the context of updating.

Is this better than generating text-only tables? A downloadable diagram?

🤷‍♀️ Immediately, I wondered why we wouldn't offer multiple view/use options, but it doesn't have much to do with this presentation of info.

What do you think about the object representation I described? Do you think an extracted object in the diagram is confusing, since it technically lives on the original index object? Is it clear from the diagram what's going on?

I think this is clear, but I also think I cheated by looking at other representations of this data. I also don't have much feedback on the remaining questions because the diagram looks fairly simple...? But maybe I'm missing something.

How many extra hackathon points will you give me for the unit tests & TDD?

10x

@ChrisKujawa
Member

First of all, thank you @pepopowitz for looking into this and spending your hackday on this topic 🚀 Really cool. 💪🏼

Things I'm directly seeking feedback on

  • "Harmonized indexes" feels like a thing we call these internally, and I wonder if a user would know what that meant. Is there a better more simplified name for the page?

Yeah, I think it should just be something like "Indices" or maybe "Secondary Storage Schema". For C7 we call the page Database Schema.

  • How do you feel about the location of the page? Should it be moved somewhere else?

I think it might make sense to have this separate, maybe even in the Reference section 🤔 But yeah in general we can move it around I guess if we find a better spot.

  • Is this better than generating text-only tables? A downloadable diagram?

Good question. I was also wondering whether it would work if we just generated markdown tables out of this, or something. I guess it is interesting to have it visual, especially if you want to show relations.
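For comparison, the text-only alternative could be sketched roughly as follows. This is a hypothetical Python helper, not part of this PR; it assumes the same per-index `properties` mapping the diagram scripts consume, and the heading level and column names are arbitrary.

```python
def to_markdown_table(index_name: str, properties: dict) -> str:
    """Render one index as a markdown table with a row per top-level field."""
    lines = [f"### {index_name}", "", "| Property | Type |", "| --- | --- |"]
    for field, spec in properties.items():
        # Fields without an explicit type are treated as objects here.
        lines.append(f"| `{field}` | {spec.get('type', 'object')} |")
    return "\n".join(lines)
```

A table like this lists fields well but has no natural way to show relations, which is where the diagram has an advantage.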

  • What do you think about the object representation I described? Do you think an extracted object in the diagram is confusing, since it technically lives on the original index object? Is it clear from the diagram what's going on?

I agree it is a bit confusing, but we can work around this with a different form or something 🤔 In general I liked that we can see it is contained in the index, but yeah, it might not be fully clear.

  • What do you think about the join alternatives I presented in the description? Do you have a preference? Is there another way to represent this that might be better?

The visualization doesn't make the join very clear. What it actually means is that multiple entities can live in the same index/table. Maybe we can visualize this differently, via combined rows or something.

  • Do you think it would be better to include the implied relationships I described?

I think it would be interesting, but might introduce more complexity.

  • How many extra hackathon points will you give me for the unit tests & TDD?

At least 10 👍🏼 :D

And of course, anything else that's on your mind 😄

Since, as you described, there is an issue with the size of the images, I was wondering whether we could split them up by use case/context. For example, grouping the indexes related to identity, decision execution, process execution, task execution, etc. Wdyt?

Comment on lines +4 to +8
camunda-authorization {
keyword id
long ownerKey
keyword ownerType
keyword resourceType
Member


❓ One thing I was wondering is whether we could turn the values around, so that we first have the name of the property and then the type. That feels more natural to me. What are your thoughts?

For example this is also done in the C7 ER diagram https://docs.camunda.org/manual/7.22/user-guide/process-engine/database/database-schema/#entity-relationship-diagrams

@ChrisKujawa
Member

Maybe @aleksander-dytko or @ingorichtsmeier want to give some input here as well :)

@aleksander-dytko
Contributor

@pepopowitz some thoughts:

Yeah, I think it should just be something like "Indices" or maybe "Secondary Storage Schema". For C7 we call the page Database Schema.

I think we should officially introduce "Primary Storage" and "Secondary Storage" in our docs to describe the data pipeline. This would be useful for further reference, e.g. here.

Visuals

I believe a visual representation of the schema is useful for quickly getting oriented in C8. We could first show the list of indices and make each line clickable, leading to the details of the schema.
image

Labels
deploy Stand up a temporary docs site with this PR hold This issue is parked, do not merge.
Projects
Status: 👀 In Review

4 participants