Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc dag --status : Showing the status of stages in mermaid graph #9288

Closed
wants to merge 6 commits into from

Conversation

ostromann
Copy link

  • ❗ I have followed the Contributing to DVC checklist.

  • [] πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

This PR adds a feature to dvc dag to also show the status of stages in the DAG in mermaid formatting. It addresses #9287 and consequently #iterative/studio-support#81, #iterative/studio-support#59, #iterative/studio-support#40. #5369 is addressed partially, and could easily be reachable from here, in my perspective.

This PR adds a few new options to dvc dag, namely being:

  • --direction {TD, LR}: Allows to print the mermaid top-down or left-right orientation.
  • --status: Checks the status of all stages in the DAG and adds boolean attributes to them indicating:
    • deps clean: deps are unmodified and present in the local repo
    • deps pushed: deps are pushed to the remote
    • outs clean: outs are unmodified and present in the local repo
    • outs pushed: outs are pushed to the remote
    • command run: stage has an unmodified command, which has run before
    • validated: stage is valid, i.e. it is not invalidated by upstream stages that contain any changes
  • --status-import: Also checks the status of deps of import stages. I decided to separate it, in order to avoid waiting times for checking out imports from other repos.

The output adds several class definition that add styling to the mermaid graph and color the nodes accordingly. Supported colors are:

  • green: clean stage, everything is fine
  • orange: unpushed stage, probably just needs dvc push
  • red: dirty stage, probably needs dvc repro
  • blue: frozen stage that don't invalidate downstream stages

The blue stages get an additional colored border if they are not clean.

Right now the tests are still failing, which is due to a different ordering of the nodes. I reorganized the the generation of the mermaid output a bit to improve readability. However, I can't seem to figure the sorting of nodes to make it compatible with the previous output right now. Maybe you guys want to take over from here. πŸ˜„

Example image:
image

@ostromann ostromann changed the title Feature/dag status dvc dag --status : Showing the status of stages in mermaid graph Mar 30, 2023
@ostromann
Copy link
Author

Now the mermaid dag tests are all passing.

@codecov
Copy link

codecov bot commented Mar 30, 2023

Codecov Report

Patch coverage: 19.62% and project coverage change: -0.13 ⚠️

Comparison is base (fe6d543) 92.85% compared to head (5b0b155) 92.72%.

❗ Current head 5b0b155 differs from pull request most recent head 52503fd. Consider uploading reports for the commit 52503fd to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9288      +/-   ##
==========================================
- Coverage   92.85%   92.72%   -0.13%     
==========================================
  Files         459      459              
  Lines       37144    37198      +54     
  Branches     5356     5368      +12     
==========================================
+ Hits        34490    34492       +2     
- Misses       2122     2172      +50     
- Partials      532      534       +2     
Impacted Files Coverage Ξ”
dvc/commands/dag.py 57.28% <19.62%> (-40.87%) ⬇️

... and 30 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

β˜” View full report in Codecov by Sentry.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

output = f"flowchart {direction}"

if status:
output += _get_class_defs()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output += _get_class_defs()
output += "\n" + _get_class_defs()

Looks like this is missing a newline.

@dberenbaum
Copy link
Collaborator

Looks really cool as a first step towards better pipelines statuses!

Played around with it a bit and worked well on my toy example. One question I have is how to show stages downstream of red? Green/orange is informative to tell you that it is up to date as an independent stage, but there is a good chance it will be out of date once the upstream stage is fixed.

@ostromann
Copy link
Author

One question I have is how to show stages downstream of red?

Yes, right. They are shown in grey to indicate them as "invalidated". There was just a small bug.

I forgot to reverse the graph before invalidation. So right now the graph is a dependency graph (stage B is depending on stage A, so we have a directed edge B -> A). Another way to look at it is as a data flow graph (stage A gives data to stage B, so we have a directed edge A -> B).

This means we either need to invalidate "upstream stages" in a dependency graph, or we reverse the graph first to get a data flow graph and then invalidate "downstream stages". The latter I just did my last commit now :)

if validated is False and all(
[outs_clean, outs_pushed, deps_clean, deps_pushed, command_run]
):
data["status"] = "grey"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use explicit status names instead of color names and let the downstream representation map status to color or whatever (i.e. _get_class_defs in mermaid).

Comment on lines 467 to 475
dag_parser.add_argument(
"--direction",
choices=["LR", "TD"],
default="TD",
help=(
"Direction of the rendered mermaid DAG. "
"Can either be 'LR' for left-to-right or 'TD' for top-down'."
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to be opinionated on the representation.
It is hard to draw a boundary between what can be customized and what not.
In this case, I think it is not worth having an option because users could just:

dvc dag --md | sed 's/flowchart TD/flowchart LR/g'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it's a simple thing to change. Probably shouldn't bloat the options of this CLI.

Just some real life observation as in our pipeline we have a rather shallow pipeline (like 5-6 stages deep) but a large breadth (training data comes in as 15+ archives, as it is too large to handle otherwise & training and deployment stages are adapted for many different target architectures). So our DAGs look much prettier in left-to-right ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So our DAGs look much prettier in left-to-right ;)

I have nothing against making LR the default, though πŸ˜„

Comment on lines 485 to 490
dag_parser.add_argument(
"--status-import",
action="store_true",
default=False,
help="Check the dependencies of import stages. (Only compatible with --status)",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need this option.
What I do think its needed is to make the cloud status optional, to mimic dvc status {-c}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the naming of the option is perfect, and maybe the default behavior should be inverted. But I do think it is quite relevant to have it in one way or another.

In our case, we are importing large datasets from a data repo into a model repo. If we didn't skip the dependency check for these import stages, instead of ~2 seconds, we would need ~10 minutes to generate the DAG with stage statuses.

I think, actually, checking the status of imported deps should be sped up in the first place. But for now this options gives us at least a way around it.

Copy link
Contributor

@daavoo daavoo Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I do think it is quite relevant to have it in one way or another.

What I meant is that it makes more sense to me to enable/disable the entire cloud_status = repo.status(targets=target, cloud=True) and not make it specific about imports, if that makes sense. Use that condition to also skip the import check as you do know.

In your case, there might be something especially slow about imports, but in other cases cloud_status would be the part that takes significant time and people might want to make it optional.

So, make it a more straightforward choice for the user: whether to check the remote storage (cloud) or not. Without getting into DVC implementation details like deps/imports.

instead of ~2 seconds, we would need ~10 minutes to generate the DAG with stage statuses.
I think, actually, checking the status of imported deps should be sped up in the first place.

Could you open an issue sharing some more details on your setup? Those numbers look like too much time

Copy link
Author

@ostromann ostromann Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that it makes more sense to me to enable/disable the entire cloud_status = repo.status(targets=target, cloud=True) and not make it specific about imports, if that makes sense. Use that condition to also skip the import check as you do know.

Yes, that totally makes sense, and I think it a -c flag should be included here. For the reason you mentioned.

But, actually I think there are two things going on here.

As I say in #9304, calling dvc status is bad, when having multiple large imports. But, what is even worse in this PR, is where we basically do this:

from dvc.repo import Repo
repo = Repo(".")
for stage in repo.index.stages:
    for dep in stage.deps:
       dep.status()

Calling the status on the deps of an import stage is horribly slow compared to non-import stages. I'm not sure what could be a smoothest way to get the best of both worlds. Being able to check the the remote storage status, but not having to wait for minutes because you are using imports. 🀷 With having both flags -c and --skip-import-deps (as I have now renamed it in 52503fd) it would be possible. But I totally understand, it's desirable to keep the option list short and not confuse users.

One alternative I could think of, is to keep the CmdDAG class like it was before this PR and introduce a new class CmdDAGStatus that can have all the extra fluff πŸš€

@dberenbaum
Copy link
Collaborator

Thanks again for this PR @ostromann! I think it's basically a visualization of dvc status, right? Or do you see it differently?

I'm wondering if we should consider making this (or maybe the whole dvc dag command) just an output format of dvc status (or the proposed dvc stage status from #5369) instead of having it part of a separate command. If so, it might make sense for someone from the team to take it over as part of working on #5369.

What do you think?

@ostromann
Copy link
Author

I think it's basically a visualization of dvc status, right? Or do you see it differently?

I completely agree!

I'm wondering if we should consider making this (or maybe the whole dvc dag command) just an output format of dvc status (or the proposed dvc stage status from #5369) instead of having it part of a separate command. If so, it might make sense for someone from the team to take it over as part of working on #5369.

What do you think?

Please don't move the dvc dag alltogether. I find it super useful, especially when building a new pipeline, to just have a quick sanity check πŸ˜ƒ

But otherwise, I fully agree. This visualization could fit nicely into dvc status and I would very much appreciate if someone from the team would take it over from here πŸ‘

|
|
|

Not included in this PR yet, but on the same data that is extracted in this PR, I had also made a tabular overview, which I think could fit into dvc status as well.

This is how it looks an entire pipeline:
image

And this is how it looks for individual stages (could be suitable for dvc stage status):
image

Let me know if this would be of interest.

@dberenbaum
Copy link
Collaborator

Sounds good @ostromann! Maybe we could have a status format that looks similar but is a simpler CLI table.

@dberenbaum
Copy link
Collaborator

Thanks for the PR! We haven't been able to prioritize this on our end, and it looks like it's not passing checks and has no tests, so closing this, but feel free to reopen if you want to work on updates.

@dberenbaum dberenbaum closed this Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants