diff --git a/docs/source/gcm/index.rst b/docs/source/gcm/index.rst
new file mode 100644
index 000000000..bbe09ddeb
--- /dev/null
+++ b/docs/source/gcm/index.rst
@@ -0,0 +1,4 @@
.. toctree::
   :maxdepth: 2

   user_guide/index
diff --git a/docs/source/gcm/user_guide/answering_causal_questions/attribute_distributional_changes.rst b/docs/source/gcm/user_guide/answering_causal_questions/attribute_distributional_changes.rst
new file mode 100644
index 000000000..45e192657
--- /dev/null
+++ b/docs/source/gcm/user_guide/answering_causal_questions/attribute_distributional_changes.rst
@@ -0,0 +1,51 @@
Attributing Distributional Changes
==================================

When attributing distribution changes, we answer the question:

    What mechanism in my system changed between two sets of data?

For example, in a distributed computing system, we want to know why an important system metric changed in a negative way.

How to use it
^^^^^^^^^^^^^

To see how the method works, let's take the example from above and assume we have a system of three services X, Y, Z,
producing latency numbers. The first dataset ``data_old`` is from before the deployment, ``data_new`` from after the
deployment:

>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm
>>> from scipy.stats import halfnorm
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

The change here simulates an accidental conversion of multi-threaded code into sequential code (waiting for X and Y in
parallel vs. waiting for them sequentially).

Next, we'll model the cause-effect relationships as a probabilistic causal model:

>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')]))  # X -> Z <- Y
>>> gcm.auto_assign_causal_models(causal_model, based_on=data_old)

Finally, we attribute changes in distributions to changes in causal mechanisms:

>>> attributions = gcm.distribution_change(causal_model, data_old, data_new, 'Z')
>>> attributions
{'X': -0.0066425020480165905, 'Y': 0.009816959724738061, 'Z': 0.21957816956354193}

As we can see, :math:`Z` got the highest attribution score here, which matches what we would
expect, given that we changed the mechanism for variable :math:`Z` in our data generation.

As the reader may have noticed, there is no fitting step involved when using this method. The
reason is that this function calls ``fit`` internally. To be precise, it makes two copies of the
causal graph and fits one copy to the first dataset and the other to the second dataset.
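Attribution scores like these are often easier to compare side by side in a chart. Below is a
minimal sketch of how one might visualize them; it assumes ``matplotlib`` is installed, which is not
required by ``dowhy.gcm`` itself:

>>> import matplotlib.pyplot as plt
>>>
>>> # Plot the attribution score of each node's causal mechanism.
>>> plt.bar(list(attributions.keys()), list(attributions.values()))
>>> plt.ylabel('Attribution of change in Z')
>>> plt.show()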
diff --git a/docs/source/gcm/user_guide/answering_causal_questions/computing_counterfactuals.rst b/docs/source/gcm/user_guide/answering_causal_questions/computing_counterfactuals.rst
new file mode 100644
index 000000000..5021ceaa1
--- /dev/null
+++ b/docs/source/gcm/user_guide/answering_causal_questions/computing_counterfactuals.rst
@@ -0,0 +1,83 @@
Computing Counterfactuals
=========================

By computing counterfactuals, we answer the question:

    I observed a certain outcome z for a variable Z, where variable X was set to a value x. What
    would have happened to the value of Z, had I intervened on X to assign it a different value x'?

As a concrete example, we can imagine the following:

    I'm seeing unhealthily high levels of my `cholesterol LDL `_ (Z=10). I didn't take any
    medication against it in recent months (X=0). What would have happened to my cholesterol LDL
    level (Z), had I taken a medication dosage of 5g a day (X := 5)?

How to use it
^^^^^^^^^^^^^

To see how the method works, let's generate some data:

>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm
>>>
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

Next, we'll model the cause-effect relationships as an invertible SCM and fit it to the data:

>>> causal_model = gcm.InvertibleStructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))  # X -> Y -> Z
>>> gcm.auto_assign_causal_models(causal_model, training_data, gcm.AutoAssignQuality.GOOD)
>>>
>>> gcm.fit(causal_model, training_data)

Finally, let's compute the counterfactual when intervening on X:

>>> gcm.estimate_counterfactuals(
>>>     causal_model,
>>>     {'X': lambda x: 2},
>>>     observed_data=pd.DataFrame(data=dict(X=[1], Y=[2], Z=[3])))
   X         Y         Z
0  2  4.034229  9.073294

As we can see, :math:`X` takes our treatment/intervention value of 2, and :math:`Y` and :math:`Z`
take deterministic values based on our trained causal models and the fixed observed data. That is,
based on the data generation process, if :math:`X = 1` and :math:`Y = 2`, we would expect :math:`Z`
to be 6. But we *observed* :math:`Z = 3`, which means the noise value for :math:`Z` in this
particular sample is approximately -2.98. Given that we know this hidden noise factor, we can
estimate the counterfactual value of :math:`Z`, had we set :math:`X := 2`, which is approximately
9.07 (as can be seen in the result above).

This shows that the observed data is used to calculate the noise data in the system. We can also
provide these noise values directly:

>>> gcm.estimate_counterfactuals(
>>>     causal_model,
>>>     {'X': lambda x: 2},
>>>     noise_data=pd.DataFrame(data=dict(X=[0], Y=[-0.007913], Z=[-2.97568])))
   X         Y         Z
0  2  4.034229  9.073293

As we can see, with :math:`X = 2` and :math:`Y \approx 4.03`, :math:`Z` should be approximately 12.
But since we know the hidden noise for this sample is approximately -2.98, the counterfactual
outcome is again :math:`Z \approx 9.07`.

Understanding the method
^^^^^^^^^^^^^^^^^^^^^^^^

Counterfactuals are very similar to :doc:`simulate_impact_of_interventions`, with one important
difference: when performing interventions, we look into the future; for counterfactuals, we look
into an alternative past. To reflect this in the computation, when performing interventions, we
generate all noise using our causal models. For counterfactuals, we use the noise from the actual
observed data.

To expand on our example above, assume there are other factors that contribute to cholesterol
levels, e.g. exercise or genetic predisposition. While we *assume* that medication helps against
high LDL levels, it's important to take into account all other factors that could also help against
it. We want to prove *what* has helped. Hence, it's important to use the noise from the real data,
not noise generated by our models. Otherwise, I might be able to reduce my cholesterol LDL level in
the counterfactual world where I take medication (X := 5), not because I took the medication, but
because the *generated noise* of Z also just happened to be low and so caused a low value for Z. By
taking the *real* noise value of Z (derived from the observed data of Z), I can show that it was
the medication that helped.
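To build intuition for how these noise values are recovered, we can redo the computation from the
example above by hand. This is only a sketch: it assumes the fitted models recovered the
ground-truth coefficients (2 and 3) of our data generation process exactly, whereas the actual
fitted coefficients deviate slightly (which is why the estimates above read 4.03 and 9.07 rather
than exactly 4 and 9):

>>> # Observed sample: X=1, Y=2, Z=3.
>>> y_noise = 2 - 2 * 1        # Y = 2*X + noise  =>  y_noise = 0
>>> z_noise = 3 - 3 * 2        # Z = 3*Y + noise  =>  z_noise = -3
>>> # Counterfactual: set X := 2 and reuse the recovered noise values.
>>> y_cf = 2 * 2 + y_noise
>>> z_cf = 3 * y_cf + z_noise
>>> y_cf, z_cf
(4, 9)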
diff --git a/docs/source/gcm/user_guide/answering_causal_questions/index.rst b/docs/source/gcm/user_guide/answering_causal_questions/index.rst
new file mode 100644
index 000000000..cd0e017de
--- /dev/null
+++ b/docs/source/gcm/user_guide/answering_causal_questions/index.rst
@@ -0,0 +1,13 @@
Answering Causal Questions
==========================

In the following sub-sections, we'll dive deeper into the causal questions that GCM-based inference in
DoWhy can answer, explain the concepts behind them, and show how to interpret the results.


.. toctree::
   :maxdepth: 3

   simulate_impact_of_interventions
   computing_counterfactuals
   attribute_distributional_changes
diff --git a/docs/source/gcm/user_guide/answering_causal_questions/simulate_impact_of_interventions.rst b/docs/source/gcm/user_guide/answering_causal_questions/simulate_impact_of_interventions.rst
new file mode 100644
index 000000000..efd6f49c4
--- /dev/null
+++ b/docs/source/gcm/user_guide/answering_causal_questions/simulate_impact_of_interventions.rst
@@ -0,0 +1,50 @@
Simulating the Impact of Interventions
======================================

By simulating the impact of interventions, we answer the question:

    What will happen to the variable Z if I intervene on Y?

How to use it
^^^^^^^^^^^^^

To see how the method works, let's generate some data:

>>> import numpy as np, pandas as pd
>>>
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

Next, we'll model the cause-effect relationships as a probabilistic causal model and fit it to the data:

>>> import networkx as nx
>>> from dowhy import gcm
>>>
>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))  # X -> Y -> Z
>>> gcm.auto_assign_causal_models(causal_model, training_data)
>>> gcm.fit(causal_model, training_data)

Finally, let's perform an intervention on X:

>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: 1}, num_samples_to_draw=1000)
>>> samples.head()
   X         Y          Z
0  1  3.481467  12.475105
1  1  1.282945   3.279435
2  1  2.508717   7.907412
3  1  2.077061   5.506252
4  1  1.400568   6.097633

As we can see, X is now fixed at the constant value 1. This is known as an atomic intervention.
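Because we know the true data generation process in this example, we can sanity-check the
interventional samples. Under the ground-truth equations above, :math:`E[Y \mid do(X:=1)] = 2 \cdot 1 = 2`
and :math:`E[Z \mid do(X:=1)] = 3 \cdot 2 \cdot 1 = 6`. A minimal sketch of such a check (the
exact values will vary from run to run):

>>> # Sample means should be close to the theoretical interventional means.
>>> samples[['Y', 'Z']].mean()  # expect values near 2 and 6, respectively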
We can also perform shift interventions, where we shift the random variable X by some value:

>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: x + 0.5}, num_samples_to_draw=1000)
>>> samples.head()
          X         Y          Z
0 -0.542813  0.031771   1.195391
1  1.615089  2.156833   6.704683
2  1.340949  1.910316   5.882468
3  1.837919  4.360685  12.565738
4  3.791410  8.361918  25.477725
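One way to quantify the impact of such a shift is to compare the mean of Z under the intervention
with its mean in the original training data. As a sketch based on the ground-truth coefficients of
our synthetic data, shifting X by 0.5 should raise the mean of Z by roughly
:math:`3 \cdot 2 \cdot 0.5 = 3` (the exact estimate will vary from run to run):

>>> # Average effect of the shift intervention on Z, relative to the observed data.
>>> samples['Z'].mean() - training_data['Z'].mean()  # expect a value near 3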
diff --git a/docs/source/gcm/user_guide/customizing_model_assignment.rst b/docs/source/gcm/user_guide/customizing_model_assignment.rst
new file mode 100644
index 000000000..10d364d64
--- /dev/null
+++ b/docs/source/gcm/user_guide/customizing_model_assignment.rst
@@ -0,0 +1,4 @@
Customizing Model Assignment
============================

TODO
\ No newline at end of file
diff --git a/docs/source/gcm/user_guide/index.rst b/docs/source/gcm/user_guide/index.rst
new file mode 100644
index 000000000..d56f4e04f
--- /dev/null
+++ b/docs/source/gcm/user_guide/index.rst
@@ -0,0 +1,10 @@
GCM User Guide
==============

.. toctree::
   :maxdepth: 1
   :glob:

   introduction
   answering_causal_questions/index
   customizing_model_assignment
diff --git a/docs/source/gcm/user_guide/introduction.rst b/docs/source/gcm/user_guide/introduction.rst
new file mode 100644
index 000000000..46fdea3f0
--- /dev/null
+++ b/docs/source/gcm/user_guide/introduction.rst
@@ -0,0 +1,163 @@
Introduction
============

Graphical causal model-based inference, or GCM-based inference for short, is an experimental addition to DoWhy that
currently works separately from DoWhy's main API. Its experimental status also means that its API may
undergo breaking changes in the future. It will eventually form part of a joint, new API. We
welcome your comments.

The ``dowhy.gcm`` package provides a variety of ways to answer causal questions, and we'll go through them in detail in
section :doc:`answering_causal_questions/index`. Before diving into them, however, let's understand
the basic building blocks and usage patterns it is built upon.

The basic building blocks
^^^^^^^^^^^^^^^^^^^^^^^^^

All main features of the GCM-based inference in DoWhy are built around the concept of **graphical causal models**. A
graphical causal model consists of a causal directed acyclic graph (DAG) of variables and a **causal mechanism** for
each of the variables. A causal mechanism defines the conditional distribution of a variable given its parents in the
graph or, in the case of root node variables, simply its distribution.

The most general case of a GCM is a **probabilistic causal model** (PCM), where causal mechanisms are defined by
**conditional stochastic models** and **stochastic models**. In the ``dowhy.gcm`` package, these are represented by
:class:`~ProbabilisticCausalModel`, :class:`~ConditionalStochasticModel`, and :class:`~StochasticModel`.

.. image:: pcm.png
   :width: 80%
   :align: center

|

In practical terms, however, we often use **structural causal models** (SCMs) to represent our GCMs,
where the causal mechanisms are defined by **functional causal models** (FCMs) for non-root nodes and **stochastic
models** for root nodes. An SCM implements the same traits as a PCM, but on top of that, its FCMs allow us to
reason *further* about its data generation process based on parents and noise, and hence allow us, e.g., to compute
counterfactuals.

.. image:: scm.png
   :width: 80%
   :align: center

|

To keep this introduction simple, we will stick with SCMs for now.

As mentioned above, a causal mechanism describes how the values of a node are influenced by the values of its parent
nodes. We will dive much deeper into the details of causal mechanisms and their meaning in section
:doc:`customizing_model_assignment`. But for this introduction, we will treat them as an opaque thing that is needed
to answer causal questions. With that in mind, the typical steps involved in answering a causal question are:

1. **Modeling cause-effect relationships as a GCM (causal graph + causal mechanisms):**
::

    causal_model = StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))  # X -> Y -> Z
    auto_assign_causal_models(causal_model, based_on=data)

2. **Fitting the GCM to the data:**
::

    fit(causal_model, data)

3. **Answering a causal query based on the GCM:**
::

    results = <causal_query>(causal_model, ...)

Where ``<causal_query>`` can be one of multiple functions explained in
:doc:`answering_causal_questions/index`.

Let's look at each of these steps in more detail.

Step 1: Modeling cause-effect relationships as a structural causal model (SCM)
------------------------------------------------------------------------------

The first step is to model the cause-effect relationships between the variables relevant
to our use case. We do that in the form of a causal graph, i.e. a directed acyclic
graph (DAG) where an edge X→Y implies that X causes Y. Statistically, a causal graph encodes the
conditional independence relations between variables. Using the `networkx <https://networkx.org/>`__
library, we can create causal graphs. In the snippet below, we create a chain X→Y→Z:

>>> import networkx as nx
>>> causal_graph = nx.DiGraph([('X', 'Y'), ('Y', 'Z')])

To answer causal questions using causal graphs, we also need to know the nature of the underlying
data-generating process of the variables. A causal graph by itself, being a diagram, does not
contain any information about this data-generating process. To introduce it, we use an SCM that's
built on top of our causal graph:

>>> from dowhy import gcm
>>> causal_model = gcm.StructuralCausalModel(causal_graph)

This causal model now allows us to assign causal mechanisms to each node in the form of functional causal models.
Section :doc:`customizing_model_assignment` explains how this can be done explicitly, but for now, we'll rely on
our auto-assign feature, which automatically determines a good set of default functional causal
models based on the data we work with.

At this point we would normally load our dataset. For this introduction, we generate
some synthetic data instead. The API takes data in the form of Pandas DataFrames:

>>> import numpy as np, pandas as pd
>>>
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>> data.head()
          X         Y          Z
0 -2.253500 -3.638579 -10.370047
1 -1.078337 -2.114581  -6.028030
2 -0.962719 -2.157896  -5.750563
3 -0.300316 -0.440721  -2.619954
4  0.127419  0.158185   1.555927

Note how the columns X, Y, Z correspond to our nodes X, Y, Z in the graph constructed above. We can also see how the
values of X influence the values of Y and how the values of Y influence the values of Z in this dataset.

In the real world, this data comes as an opaque stream of values, where we don't know how one
variable influences another. The SCM-based approach can help us deconstruct these causal
relationships, even though we didn't know them beforehand.

Now that we have the data, let's automatically assign a functional causal model (FCM) to each node in the graph,
based on the data:

>>> gcm.auto_assign_causal_models(causal_model, based_on=data)

While this function provides a good default, section :doc:`customizing_model_assignment` explains
how we can manually optimize the choice of models for our problem and improve our results.

Step 2: Fitting the SCM to the data
-----------------------------------

With the data at hand and the graph constructed earlier, we can now train the SCM using ``fit``:

>>> gcm.fit(causal_model, data)

Fitting means we learn the generative models of the variables in the SCM according to the data.

Step 3: Answering a causal query based on the SCM
-------------------------------------------------

The last step, answering a causal question, is our actual goal. For example, we could ask the question:

    What will happen to the variable Z if I intervene on Y?

This can be done via the ``perform_intervention`` function. Here's how:

>>> samples = gcm.perform_intervention(causal_model,
>>>                                    {'Y': lambda y: 2.34},
>>>                                    num_samples_to_draw=1000)
>>> samples.head()
          X     Y         Z
0  1.186229  2.34  7.042395
1 -0.758809  2.34  6.281761
2 -1.177379  2.34  7.803705
3 -1.211356  2.34  6.901429
4 -0.100224  2.34  7.541113

This intervention says: "I'll ignore any causal effects of X on Y, and set every value of Y
to 2.34." So the distribution of X will remain unchanged, whereas Y is fixed at the value 2.34
and Z responds to the intervened Y according to its causal model.
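We can sanity-check this against the ground-truth equations of our synthetic data: with Y fixed at
2.34, Z = 3*Y + noise should average about :math:`3 \cdot 2.34 = 7.02`, while X keeps its original
standard normal distribution. A minimal sketch of such a check (exact values will vary from run to
run):

>>> samples['Z'].mean()                      # expect a value near 3 * 2.34 = 7.02
>>> samples['X'].mean(), samples['X'].std()  # expect values near 0 and 1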
With this knowledge, we can now dive deeper into the meaning and usage of causal queries in section
:doc:`answering_causal_questions/index`.
diff --git a/docs/source/gcm/user_guide/pcm.png b/docs/source/gcm/user_guide/pcm.png
new file mode 100644
index 000000000..fb9d3c41f
Binary files /dev/null and b/docs/source/gcm/user_guide/pcm.png differ
diff --git a/docs/source/gcm/user_guide/scm.png b/docs/source/gcm/user_guide/scm.png
new file mode 100644
index 000000000..0cf133bad
Binary files /dev/null and b/docs/source/gcm/user_guide/scm.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 36eba2ea6..99636c623 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -35,6 +35,12 @@

    example_notebooks/nb_advanced_index

+.. toctree::
+   :maxdepth: 2
+   :caption: GCM-based inference (Experimental)
+
+   gcm/index
+
 .. toctree::
    :maxdepth: 2
    :caption: Package