[Feedback requested] Proposal for updating DoWhy's API for supporting new causal tasks #429
Replies: 11 comments 26 replies
-
Thanks, @amit-sharma, for starting this discussion. Maybe to add some more context for those new to this discussion. When we discussed how we can add features using GCM-based inference, we also asked ourselves:
And one of the realizations was that the causal graph is that common entity. Effect estimation uses the causal graph in the identification step to get the estimand. GCM-based inference uses the causal graph to compose a graphical causal model together with causal mechanisms for the nodes. Another guiding principle that was important to us is that DoWhy's core business is not graphs, their manipulation, or graph algorithms. NetworkX is great at that, so let it do the heavy lifting there. Which is the reason why we'd like to go away from an own Finally, we noticed that Note that the 4-step recipe for causal prediction that DoWhy has provided (and will provide), and which is a great way for newcomers to learn DoWhy, will stay.
-
Hi, I am chiming in regarding the ability to easily represent the different types of graphs commonly used in causality (e.g. DAG, CPDAG, ADMG, PAG) in a way that is robust, lightweight and easy to extend for the other causal tasks mentioned in dowhy's API.
We Need Robust Graph Representations To Make Maintenance / Upgrades Easy
I see that there is an interest in making graph representations rely heavily on networkx, for good reasons: it is a well-tested and commonly used API for graph-related tasks. However, subclassing networkx directly is not very feasible (I have tried it) because it does not natively support mixed edges, which are needed for causal graphs (i.e. undirected edges, bidirected edges, circle endpoints). We can still use networkx by using one of the networkx Graph classes to represent each of the different types of edges. This does require a bit of book-keeping, but I do not think it is a lot and I believe it has all the advantages of a robust networkx API.
Possible Solutions
For example, in https://github.com/adam2392/causal-networkx I have implemented the basic graph classes with a focus on making them applicable and functional in structure learning algorithms. These can then easily be made compatible with dowhy's existing causal ID and estimation pipelines. Moreover, they can enable ID and estimation on partially oriented graphs (e.g. PAGs). I have also implemented and am researching structure learning algorithms which I believe are important for the PyWhy community. In summary, the API I have implemented for causal graph classes could be very useful for PyWhy: it adheres to a networkx-like API and uses networkx whenever possible to reduce the possible bug-space.
Implications for DoWhy's RoadMap
DoWhy is interested in supporting more and more causal tasks. I think the first step is to make sure we have causal graph representation in Python down, such that it opens the door for researchers to contribute.
Then I think when everyone uses a similar API, the pipelines and tasks can become easier to implement, test and teach.
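To make the book-keeping idea concrete, here is a minimal sketch of a mixed-edge graph that keeps one edge store per edge type. It is only an illustration: the class name `MixedEdgeGraph` and its methods are hypothetical, it is not the causal-networkx or DoWhy API, and it uses plain Python sets where a real implementation would probably hold one networkx graph per edge type.

```python
class MixedEdgeGraph:
    """Toy mixed-edge graph (hypothetical, for illustration only)."""

    # Edge types whose endpoints are interchangeable.
    SYMMETRIC = {"bidirected", "undirected"}

    def __init__(self, edge_types=("directed", "bidirected", "undirected")):
        self._nodes = set()
        # The book-keeping: one independent edge store per edge type.
        self._edges = {etype: set() for etype in edge_types}

    def add_edge(self, u, v, edge_type="directed"):
        if edge_type not in self._edges:
            raise ValueError(f"unknown edge type: {edge_type!r}")
        self._nodes.update((u, v))
        self._edges[edge_type].add(self._key(u, v, edge_type))

    def has_edge(self, u, v, edge_type="directed"):
        return self._key(u, v, edge_type) in self._edges[edge_type]

    def parents(self, node):
        # Only directed edges define parent/child relations.
        return {u for (u, v) in self._edges.get("directed", ()) if v == node}

    def edge_types_in_use(self):
        return {etype for etype, edges in self._edges.items() if edges}

    def _key(self, u, v, edge_type):
        # Symmetric edges are stored without orientation.
        return frozenset((u, v)) if edge_type in self.SYMMETRIC else (u, v)


# An ADMG-like graph: X -> Y plus a bidirected edge for a latent confounder.
admg = MixedEdgeGraph()
admg.add_edge("X", "Y")
admg.add_edge("X", "Y", edge_type="bidirected")
```

Keeping the edge types in separate stores means each store can reuse ordinary single-edge-type graph algorithms, while the wrapper class decides how the types interact.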
-
Are the protocols here meant to be sufficient for any implemented graph class to work with dowhy and its current set of functions? I don't think it would be sufficient(?), since for example, a CPDAG can have each of these protocols, but some of the estimation algorithms would be different. Or a PAG could have each of these protocols, but the ID algorithm would need to be adjusted to account for these circular edges.
Ideally, we'd want a downstream algorithm that is ingesting the graph to be able to easily validate that the graph satisfies its assumptions.
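One way such validation could look is sketched below, assuming graphs expose which edge types they contain. Everything here is hypothetical (the `HasEdgeTypes` protocol, `require_edge_types`, and the toy graph class are not part of DoWhy); the point is only that a structural protocol alone does not tell a PAG from a DAG, so the algorithm also checks the edge types it can handle.

```python
from typing import Iterable, Protocol, Set, runtime_checkable


@runtime_checkable
class HasEdgeTypes(Protocol):
    """Hypothetical protocol: the minimum a graph must expose to be validated."""

    def nodes(self) -> Iterable[str]: ...

    def edge_types_in_use(self) -> Set[str]: ...


class UnsupportedGraphError(TypeError):
    """Raised when a graph violates an algorithm's assumptions."""


def require_edge_types(graph, allowed, algorithm_name):
    """Fail early if `graph` contains edge types the algorithm cannot handle."""
    # runtime_checkable only verifies that the methods exist, not their semantics.
    if not isinstance(graph, HasEdgeTypes):
        raise UnsupportedGraphError(f"{algorithm_name}: graph lacks the required methods")
    unexpected = set(graph.edge_types_in_use()) - set(allowed)
    if unexpected:
        raise UnsupportedGraphError(
            f"{algorithm_name}: unsupported edge types {sorted(unexpected)}"
        )


class ToyGraph:
    """Stand-in graph used only to exercise the check."""

    def __init__(self, nodes, edge_types):
        self._nodes = list(nodes)
        self._edge_types = set(edge_types)

    def nodes(self):
        return self._nodes

    def edge_types_in_use(self):
        return set(self._edge_types)


dag = ToyGraph(["X", "Y"], {"directed"})
pag = ToyGraph(["X", "Y"], {"directed", "circle"})

# Passes silently: a DAG-only algorithm accepts a graph with directed edges only.
require_edge_types(dag, {"directed"}, "DAG-only identification")
```

Calling `require_edge_types(pag, {"directed"}, ...)` raises `UnsupportedGraphError`, which is the behaviour a DAG-only identification routine would want when handed a PAG.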
On Monday, June 6, 2022 12:05 PM, Adam Li wrote:
I see, if the plan is to support any graph library, then I have a few thoughts and questions:
1. Is the desire to have a networkx-API-like representation approach for DAG, CPDAG, ADMG, and PAG, where they are explicitly implemented in dowhy/pywhy and subclassed from an abstract class as you linked to in https://github.com/py-why/dowhy/blob/master/dowhy/gcm/graph.py#L41?
I suppose I could imagine feeding in a networkx.DiGraph instead of a causal.DAG. That's "nice", but once you get things like CPDAG, ADMG and PAG, then I don't think there are any networkx graph classes that could be passed in since those don't support mixed-edge graphs. But sure maybe we can come up with some abstract class that any user-custom-graph class must be in-line with to enable the different pipelines for the different categories of causal graphs (this is my proposed categorization):
* Markovian (i.e. causally sufficient) with no latent confounders
* Semi-Markovian (i.e. bidirected edges for latent confounders)
* Markov Equivalence class (i.e. CPDAGs and PAGs)
2. I noticed that https://py-why.github.io/dowhy/dowhy.html?highlight=graph#module-dowhy.causal_graph implements the causal graph for usage in dowhy, but it does not have the networkx-API, which would prevent someone from really just passing in a networkx.DiGraph to dowhy functions. So if the desire is to shift to something more generic that has a networkx-like API, I would already see the need for some type of API deprecation.
Taking a look at the protocols (https://github.com/py-why/dowhy/blob/master/dowhy/gcm/graph.py#L41), would this be sufficient to support your extension or do you see potential issues or aspects that are missing?
Are the protocols here meant to be sufficient for any implemented graph class to work with dowhy and its current set of functions? I don't think it would be sufficient(?), since for example, a CPDAG can have each of these protocols, but some of the estimation algorithms would be different. Or a PAG could have each of these protocols, but the ID algorithm would need to be adjusted to account for these circular edges.
-
I think there are two challenges we are trying to balance:
(i) We want to be compatible with existing graph representation libraries, whatever they might be. If we want to make it easy to call out to external libraries for critical functionality, we want to make it easy to pass information back and forth. For example, if there are existing open source libraries for advanced identification, causal discovery, or other graph-focused algorithms, why wouldn't we try to reuse them? Not forcing our own implementation into the API makes that easier.
(ii) We want to make it easier for people to build inside DoWhy. That means providing strong implementations that provide the kind of functionality that is missing and that will make development of new causal research, implementations, etc. easier.
There is some tension here, but they are not necessarily in conflict. I'd lean towards satisfying (ii) via Option 2, with us providing good baseline functionality, and then addressing concerns for (i) either through a clean abstraction interface and/or conversion functions to other common graph structures.
On Tuesday, June 7, 2022 11:25 AM, Adam Li wrote:
I think both Option 1 and 2 are fine w/ me in general depending on the team's preferences and options. I think the most important issue of course is the community support. I think having visibility inside pywhy as a separate package would be helpful for both options.
My slight preference for Option 2
I do think that allowing an abstract class like BaseCausalGraph within dowhy's pipelines would be helpful for "flexibility". Adding to that and proceeding with option 2 is slightly more desirable because it also provides a general interface for the most common types of causal graphs ppl would use within dowhy. Eventually, it might make sense to spin out the graph representations into a sep package for maintenance and extensibility reasons.
-
Agreed
On Tuesday, June 7, 2022 11:45 AM, Adam Li wrote:
(i) be compatible with existing graph representation libraries, whatever they might be. If we want to make it easy to call out to external libraries for critical functionality, we want to make it easy to pass information back and forth. For example, if there's existing open source libraries for advanced identification, causal discovery, or other graph-focused algorithms, why wouldn't we try to reuse them? Not forcing our own implementation into the API makes that easier.
i) is also accomplishable on a case-by-case basis with a transformation/loading/conversion function to and from the interface.
E.g. to_networkx would convert a causal graph that is explicitly supported by (hypothetically) pywhy to networkx, which... might be used in another package for causality on purely DAGs for example and therefore works neatly w/ networkx.
An example in an existing package: in MNE-Python (https://mne.tools/dev/reading_raw_data.html), for example, there are extensive IO functions used to go from some dataset stored in a special format to an explicit class representing the data. All these upstream dataset formats might be stored in some unique ways that have certain features that are useful for the manufacturer that uses them (e.g. format X might be used by group Y which develops a super useful GUI but requires format X), but the core necessary functionality is available for ppl to work w/.
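A conversion layer of this kind could be as small as the sketch below. The function names are hypothetical and the target format is a plain dict-of-lists adjacency rather than a networkx object, so the example runs without dependencies; networkx graph constructors accept a dict of lists, so the result could be handed to `networkx.DiGraph` if networkx is installed.

```python
def to_adjacency_dict(nodes, directed_edges, other_edges=()):
    """Convert a purely directed causal graph to {node: [children]}.

    Hypothetical helper. `other_edges` stands for any bidirected, undirected
    or circle edges; a plain directed format cannot represent them, so the
    conversion refuses instead of silently dropping information.
    """
    if other_edges:
        raise ValueError("graph has non-directed edges; cannot export as a plain DAG")
    adjacency = {node: [] for node in nodes}
    for u, v in directed_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, [])
    return adjacency


def from_adjacency_dict(adjacency):
    """Inverse conversion: recover (sorted nodes, directed edges)."""
    nodes = set(adjacency)
    edges = []
    for u, children in adjacency.items():
        for v in children:
            nodes.add(v)
            edges.append((u, v))
    return sorted(nodes), edges


adjacency = to_adjacency_dict(["X", "Y", "Z"], [("X", "Y"), ("Z", "Y")])
nodes, edges = from_adjacency_dict(adjacency)
```

Refusing to convert graphs with non-directed edges keeps the lossy cases explicit, which matters once CPDAGs, ADMGs and PAGs pass through the same interface.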
-
Hey @amit-sharma and @kailashbuki (is this Kaibud on discord?), as discussed, I am sharing some thoughts and links relevant from the previous discord discussion.
Re proposal for graph class API: I mocked up a basic "tutorial" of SCMs -> some fundamental causal graphs: https://adam2392.github.io/causal-networkx/dev/auto_examples/intro_causal_graphs.html#sphx-glr-auto-examples-intro-causal-graphs-py The basic abstract API for the graph classes I currently use is: https://github.com/adam2392/causal-networkx/blob/main/causal_networkx/graphs/base.py
Re structure learning API: To represent additional classes of causal graphs (i.e. Markov equivalence classes and any future type of graphs), it will need to (imo) go hand-in-hand with the design of the causal discovery API. I have sketched out a basic API for all conditional independence tests, which if followed, can be plugged into the PC algorithm and FCI algorithm I've implemented. These then immediately feed into the different MEC graphs (cpdag and PAG and future improved graphs :)).
Some thoughts on dowhy's current API: My intuition is that those sort of arguments belong in the downstream pipeline. For example in this function:
one would specify the treatment, etc. I think in terms of longer-term sustainability, my experience is that one would want to separate and simplify each stage of the pipeline to have the least amount of steps (arguments/parameters) needed to set up. E.g. causal-ID shouldn't even require the data, cuz it can be entirely symbolic, yet I think the current API requires you to have it (example). I understand any of these sorts of changes need to proceed "slowly with backwards compatibility". Also not trying to criticize in any way cuz I think dowhy is an excellent start to bring CausalInf to Python. Just wanted to provide some user opinions. Also happy to be convinced otw as I use dowhy more and more :p.
Re thoughts on going from the graph classes -> dowhy's pipeline: Currently, I just figured I would port the graphs into the representation needed by dowhy... But if the API I presented seems of interest, then I can help / lead integration of that API into dowhy.
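The point that identification can be entirely symbolic can be illustrated with a small sketch that takes only a graph and variable names, never a dataset. It assumes a DAG in which every variable is observed; under that assumption the parents of the treatment form a valid adjustment set as long as the outcome is not among them. The names `BackdoorEstimand` and `identify_by_parent_adjustment` are made up for this example and are not DoWhy functions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Mapping


@dataclass(frozen=True)
class BackdoorEstimand:
    """Symbolic result of identification; contains no data."""

    treatment: str
    outcome: str
    adjustment_set: FrozenSet[str]


def identify_by_parent_adjustment(
    parents_of: Mapping[str, Iterable[str]], treatment: str, outcome: str
) -> BackdoorEstimand:
    """Adjust for the parents of the treatment.

    Assumes a DAG with no unobserved variables, given as {node: parents}.
    """
    parents = frozenset(parents_of.get(treatment, ()))
    if outcome in parents:
        raise ValueError("outcome is a parent of the treatment; parent adjustment does not apply")
    return BackdoorEstimand(treatment, outcome, parents)


# Graph only: Z -> X, Z -> Y, X -> Y. No dataset is involved at this stage.
graph = {"X": ["Z"], "Y": ["X", "Z"], "Z": []}
estimand = identify_by_parent_adjustment(graph, "X", "Y")
```

An estimation step would then take this estimand together with the data, which keeps the data requirement out of the identification stage.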
-
OK - started reading the V1 API proposal - got stuck on the second paragraph: "As you can see below, we envision two ways of achieving the each task: 1) the task-specific API, and 2) using a Graphical Causal Model (GCM). Having access to a fitted GCM simplifies the computation of almost every task, but it requires knowledge of the full causal graph. Hence if the full graph is known, we suggest using the GCM API. For most other tasks, the common API can be used." In my mind (and it is likely that I am wrong), one key difference between (1) and (2) is that the original DoWhy was geared towards interventions that are binary or categorical (this could be a misconception due to the books and papers I have been reading). The GCM seems to be an extension into the full continuous world. The original documentation did not make enough noise about figuring out the structure ;-). Personally I prefer starting with a hand-crafted causal DAG and getting estimates and refutation feedback.
-
I think there is a typo in "do operation produces interventional samples":
Y1 = dowhy.do(scm, estimand, input_values=[1])
Should it not be
Y0 = dowhy.do(scm, estimand, input_values=[0]) ?
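For readers unsure why both Y1 and Y0 might appear, here is a toy sketch of interventional sampling under do(X=1) and do(X=0). It does not use DoWhy's `do` function; the structural equation Y := 2*X + U and the helper `sample_outcome` are assumptions made purely for illustration.

```python
import random


def sample_outcome(n, treatment_value, seed=0):
    """Sample Y from the toy SCM Y := 2*X + U under do(X = treatment_value).

    Hypothetical helper; U is standard normal noise. Reusing the same seed
    gives both arms the same noise draws.
    """
    rng = random.Random(seed)
    return [2.0 * treatment_value + rng.gauss(0.0, 1.0) for _ in range(n)]


y1 = sample_outcome(1000, treatment_value=1)  # interventional samples under do(X=1)
y0 = sample_outcome(1000, treatment_value=0)  # interventional samples under do(X=0)

# Average treatment effect as the difference of the two interventional means.
ate = sum(y1) / len(y1) - sum(y0) / len(y0)
```

Because both arms share the same noise draws here, the difference recovers the coefficient 2 up to floating-point error; with independent draws it would only be close to 2.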
-
On the SCM GUI.
-
Nit: I think the introduction section, which has
would be strengthened significantly if it had an example of the pain point now. As someone trying to read the ideas for the first time, I was a little lost on what the current problem is.
-
In the GCM API example we have
Is it possible that we also add an example that doesn't do this automatically and more closely matches the estimate effect API? Right now it's not clear what the relationship is between these two.
-
As DoWhy moves to new tasks like attribution and causal prediction, we are thinking of updating the API so that it can work for these tasks while keeping it compatible with the effect estimation API.
While the new API contains the same input-output signature for all methods, one breaking change that we would like to propose is moving to a functional API rather than an object-oriented one. We feel that it can help make function arguments more explicit and avoid book-keeping in the main codebase. In practice, this would mean that the same end-user code will work in the new API, often just by replacing CausalModel.method with dowhy.method([data,graph]), where one of the two parameters may be optional (e.g., dowhy.identify_effect does not require access to data).
Would love to have your feedback on the proposed API. The changes are described in the Wiki page: API proposal for v1.
You may also refer to the Roadmap that provides additional justification for the API design decisions.
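As a rough illustration of the object-oriented versus functional styles, the sketch below defines two standalone functions and a thin class that delegates to them. None of this is DoWhy code: the function bodies are placeholders (a naive difference in means, no real identification), and `CausalModelCompat` is a hypothetical name showing how an object-oriented entry point could stay available on top of a functional core.

```python
from statistics import mean


def identify_effect(graph, treatment, outcome):
    """Placeholder identification: needs the graph but no data."""
    return {"graph": graph, "treatment": treatment, "outcome": outcome}


def estimate_effect(data, estimand):
    """Placeholder estimator: naive difference in means for a binary treatment."""
    t, y = estimand["treatment"], estimand["outcome"]
    treated = [row[y] for row in data if row[t] == 1]
    control = [row[y] for row in data if row[t] == 0]
    return mean(treated) - mean(control)


class CausalModelCompat:
    """Object-oriented wrapper that only stores arguments and delegates."""

    def __init__(self, data, graph, treatment, outcome):
        self.data, self.graph = data, graph
        self.treatment, self.outcome = treatment, outcome

    def identify_effect(self):
        return identify_effect(self.graph, self.treatment, self.outcome)

    def estimate_effect(self, estimand):
        return estimate_effect(self.data, estimand)


data = [{"X": 1, "Y": 3.0}, {"X": 1, "Y": 5.0}, {"X": 0, "Y": 1.0}, {"X": 0, "Y": 2.0}]
graph = {"Y": ["X"], "X": []}

# Functional style: every input is an explicit argument.
functional = estimate_effect(data, identify_effect(graph, "X", "Y"))

# Object-oriented style: the same calls, with arguments held on the instance.
model = CausalModelCompat(data, graph, "X", "Y")
object_oriented = model.estimate_effect(model.identify_effect())
```

In the functional form it is visible at the call site that identification receives no data, which is the kind of explicitness the proposal is after.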