Introduce GCMs section in docs with basic GCM User Guide

Peter Goetz 2022-05-02 20:22:13 +02:00 committed by Amit Sharma
Parent 7ae118ff43
Commit fa7db57024
11 changed files: 384 additions and 0 deletions


@ -0,0 +1,4 @@
.. toctree::
:maxdepth: 2


@ -0,0 +1,51 @@
Attributing Distributional Changes
When attributing distribution changes, we answer the question:
What mechanism in my system changed between two sets of data?
For example, in a distributed computing system, we want to know why an important system metric changed in a negative way.
How to use it
To see how the method works, let's take the example from above and assume we have a system of three services X, Y, Z,
producing latency numbers. The first dataset ``data_old`` is from before the deployment, and ``data_new`` is from after the
deployment:
>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm
>>> from scipy.stats import halfnorm
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
The change here simulates an accidental conversion of multi-threaded code into sequential code (waiting for X and Y in
parallel vs. waiting for them sequentially).
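To get a feeling for how large this change is, the following standard-library sketch (independent of DoWhy) compares the mean of Z under both mechanisms while X and Y keep their distributions. The helper name ``simulate_latency`` and the use of ``abs(gauss(...))`` as a half-normal draw are choices made for this illustration only:

```python
import random
import statistics

def simulate_latency(combine, n=20000, seed=0):
    """Draw Z = combine(X, Y) + noise for half-normal latencies X and Y."""
    rng = random.Random(seed)
    zs = []
    for _ in range(n):
        x = 0.5 + abs(rng.gauss(0, 0.2))  # half-normal, loc=0.5, scale=0.2
        y = 1.0 + abs(rng.gauss(0, 0.2))  # half-normal, loc=1.0, scale=0.2
        zs.append(combine(x, y) + rng.gauss(0, 1))
    return zs

# Old mechanism: wait for X and Y in parallel.
mean_parallel = statistics.fmean(simulate_latency(max))
# New mechanism: wait for X and Y one after the other.
mean_sequential = statistics.fmean(simulate_latency(lambda x, y: x + y))
print(mean_parallel, mean_sequential)
```

Because both runs share the same seed, the difference between the two means comes only from the changed mechanism of Z, which is the kind of change the attribution below is meant to detect.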
Next, we'll model cause-effect relationships as a probabilistic causal model:
>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')])) # X -> Z <- Y
>>> gcm.auto_assign_causal_models(causal_model, based_on=data_old)
Finally, we attribute changes in distributions to changes in causal mechanisms:
>>> attributions = gcm.distribution_change(causal_model, data_old, data_new, 'Z')
>>> attributions
{'X': -0.0066425020480165905, 'Y': 0.009816959724738061, 'Z': 0.21957816956354193}
As we can see, :math:`Z` got the highest attribution score here, which matches what we would
expect, given that we changed the mechanism for variable :math:`Z` in our data generation.
As the reader may have noticed, there is no fitting step involved when using this method. The
reason is that this function calls ``fit`` internally. To be precise, this function
makes two copies of the causal graph and fits one copy to the first dataset and the other copy
to the second dataset.


@ -0,0 +1,83 @@
Computing Counterfactuals
By computing counterfactuals, we answer the question:
I observed a certain outcome z for a variable Z where variable X was set to a value x. What
would have happened to the value of Z, had I intervened on X to assign it a different value x'?
As a concrete example, we can imagine the following:
I'm seeing unhealthily high levels of my cholesterol LDL (Z=10). I didn't take any medication
against it in recent months (X=0). What would have happened to my cholesterol LDL level (Z),
had I taken a medication dosage of 5g a day (X := 5)?
How to use it
To see how the method works, let's generate some data:
>>> import networkx as nx, numpy as np, pandas as pd
>>> from dowhy import gcm
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
Next, we'll model cause-effect relationships as an invertible SCM and fit it to the data:
>>> causal_model = gcm.InvertibleStructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
>>> gcm.auto_assign_causal_models(causal_model, training_data, gcm.AutoAssignQuality.GOOD)
>>> gcm.fit(causal_model, training_data)
Finally, let's compute the counterfactual when intervening on X:
>>> gcm.estimate_counterfactuals(
...     causal_model,
...     {'X': lambda x: 2},
...     observed_data=pd.DataFrame(data=dict(X=[1], Y=[2], Z=[3])))
   X         Y         Z
0  2  4.034229  9.073294
As we can see, :math:`X` takes our treatment-/intervention-value of 2, and :math:`Y` and :math:`Z`
take deterministic values, based on our trained causal models and fixed observed data. I.e., based
on the data generation process, if :math:`X = 1`, :math:`Y = 2`, we would expect :math:`Z` to
be 6, but we *observed* :math:`Z = 3`, which means the particular noise value for :math:`Z` in this
particular sample is approximately -2.98. Now, given that we know this hidden noise factor, we can
estimate the counterfactual value of :math:`Z`, had we set :math:`X := 2`, which is approximately
9.07 (as can be seen in the result above).
This shows that the observed data is used to calculate the noise data in the system. We can also
provide these noise values directly, via:
>>> gcm.counterfactual_distribution(
...     causal_model,
...     {'X': lambda x: 2},
...     noise_data=pd.DataFrame(data=dict(X=[0], Y=[-0.007913], Z=[-2.97568])))
   X         Y         Z
0  2  4.034229  9.073293
As we see, with :math:`X = 2` and :math:`Y \approx 4.03`, :math:`Z` should be approximately 12. But
we know the hidden noise for this sample, approximately -2.98. So the counterfactual outcome
is again :math:`Z \approx 9.07`.
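The reasoning above can be retraced by hand. The sketch below is not DoWhy code: it assumes the exact ground-truth mechanisms from the data generation (Y = 2X + N_Y, Z = 3Y + N_Z) instead of fitted models, and the function name ``counterfactual`` is made up for this illustration. With the exact coefficients, the recovered noise of Z is -3 and the counterfactual Z is 9, close to the fitted values of -2.98 and 9.07:

```python
def counterfactual(observed, new_x):
    """Abduction, action, prediction for the chain X -> Y -> Z."""
    # Abduction: recover the noise terms from the observed sample.
    noise_y = observed["Y"] - 2 * observed["X"]
    noise_z = observed["Z"] - 3 * observed["Y"]
    # Action: replace X by the intervention value.
    x = new_x
    # Prediction: propagate through the mechanisms, reusing the recovered noise.
    y = 2 * x + noise_y
    z = 3 * y + noise_z
    return {"X": x, "Y": y, "Z": z}

result = counterfactual({"X": 1, "Y": 2, "Z": 3}, new_x=2)
print(result)  # {'X': 2, 'Y': 4, 'Z': 9}
```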
Understanding the method
Counterfactuals are very similar to :doc:`simulate_impact_of_interventions`, with an important
difference: when performing interventions, we look into the future, whereas for counterfactuals we look into
an alternative past. To reflect this in the computation, when performing interventions, we generate
all noise using our causal models. For counterfactuals, we use the noise from actual observed data.
To expand on our example above, we assume there are other factors that contribute to cholesterol
levels, e.g. exercising or genetic predisposition. While we *assume* medication helps against high
LDL levels, it's important to take into account all other factors that could also help against it.
We want to prove *what* has helped. Hence, it's important to use the noise from the real data,
not some generated noise from our generative models. Otherwise, I may be able to reduce my
cholesterol LDL level in the counterfactual world, where I take medication (X := 5), but not because
I took the medication, but because the *generated noise* of Z also just happened to be low and so
caused a low value for Z. By taking the *real* noise value of Z (derived from the observed data of
Z), I can prove that it was the medication that helped.
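To make the difference between generated and real noise tangible, here is a small sketch using only the standard library. It again assumes the ground-truth mechanisms Y = 2X + N_Y and Z = 3Y + N_Z rather than fitted models, and all variable names are chosen for this illustration:

```python
import random
import statistics

rng = random.Random(0)
observed = {"X": 1.0, "Y": 2.0, "Z": 3.0}

# Noise recovered from the observed sample (used for the counterfactual).
noise_y = observed["Y"] - 2 * observed["X"]  # 0.0
noise_z = observed["Z"] - 3 * observed["Y"]  # -3.0

# Intervention X := 2 with freshly generated noise: a whole distribution of Z.
interventional_z = [
    3 * (2 * 2.0 + rng.gauss(0, 1)) + rng.gauss(0, 1) for _ in range(5000)
]

# Counterfactual X := 2 with the recovered noise: a single value of Z.
counterfactual_z = 3 * (2 * 2.0 + noise_y) + noise_z

print(statistics.fmean(interventional_z), statistics.stdev(interventional_z))
print(counterfactual_z)
```

The interventional samples scatter widely around 12, so a single low draw could be due to noise alone. The counterfactual value is fixed by the observed sample, so its difference from the observed Z can only come from the changed value of X.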


@ -0,0 +1,13 @@
Answering Causal Questions
In the following sub-sections, we'll dive deep into all causal questions the GCM-based inference in
DoWhy can answer and explain the concepts behind them and how to interpret the results.
.. toctree::
:maxdepth: 3


@ -0,0 +1,50 @@
Simulating the Impact of Interventions
By simulating the impact of interventions, we answer the question:
What will happen to the variable Z if I intervene on Y?
How to use it
To see how the method works, let's generate some data:
>>> import numpy as np, pandas as pd
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
Next, we'll model cause-effect relationships as a probabilistic causal model and fit it to the data:
>>> import networkx as nx
>>> from dowhy import gcm
>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
>>> gcm.auto_assign_causal_models(causal_model, training_data)
>>> gcm.fit(causal_model, training_data)
Finally, let's perform an intervention on X:
>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: 1}, num_samples_to_draw=1000)
>>> samples.head()
   X         Y          Z
0  1  3.481467  12.475105
1  1  1.282945   3.279435
2  1  2.508717   7.907412
3  1  2.077061   5.506252
4  1  1.400568   6.097633
As we can see, X is now fixed at a constant value of 1. This is known as an atomic intervention. We can also perform
shift interventions where we shift the random variable X by some value:
>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: x + 0.5}, num_samples_to_draw=1000)
>>> samples.head()
          X         Y          Z
0 -0.542813  0.031771   1.195391
1  1.615089  2.156833   6.704683
2  1.340949  1.910316   5.882468
3  1.837919  4.360685  12.565738
4  3.791410  8.361918  25.477725
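The two kinds of interventions can also be sketched without DoWhy. The snippet below assumes the ground-truth mechanisms used to generate the training data (X standard normal, Y = 2X + noise, Z = 3Y + noise) instead of fitted models. The helper ``sample_chain`` is a hypothetical name for this illustration:

```python
import random
import statistics

def sample_chain(intervene_x, n=20000, seed=1):
    """Sample (x, y, z) rows from X -> Y -> Z after transforming X."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = intervene_x(rng.gauss(0, 1))  # apply the intervention to X
        y = 2 * x + rng.gauss(0, 1)
        z = 3 * y + rng.gauss(0, 1)
        rows.append((x, y, z))
    return rows

atomic = sample_chain(lambda x: 1.0)        # atomic intervention: X := 1
shifted = sample_chain(lambda x: x + 0.5)   # shift intervention: X := X + 0.5

atomic_z_mean = statistics.fmean(z for _, _, z in atomic)    # close to 3 * 2 * 1 = 6
shifted_z_mean = statistics.fmean(z for _, _, z in shifted)  # close to 3 * 2 * 0.5 = 3
print(atomic_z_mean, shifted_z_mean)
```

Under the atomic intervention every sample has the same X, while the shift intervention keeps the spread of X and only moves its mean.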


@ -0,0 +1,4 @@
Customizing Model Assignment


@ -0,0 +1,10 @@
GCMs User Guide
.. toctree::
:maxdepth: 1


@ -0,0 +1,163 @@
Graphical causal model-based inference, or GCM-based inference for short, is an experimental addition to DoWhy that
currently works separately from DoWhy's main API. Its experimental status also means that its API may
undergo breaking changes in the future. It will form part of a new, joint API (<link to proposal>). We
welcome your comments.
The ``dowhy.gcm`` package provides a variety of ways to answer causal questions and we'll go through them in detail in
section :doc:`answering_causal_questions/index`. However, before diving into them, let's understand
the basic building blocks and usage patterns it is built upon.
The basic building blocks
All main features of the GCM-based inference in DoWhy are built around the concept of **graphical causal models**. A
graphical causal model consists of a causal directed acyclic graph (DAG) of variables and a **causal mechanism** for
each of the variables. A causal mechanism defines the conditional distribution of a variable given its parents in the
graph, or, in case of root node variables, simply its distribution.
The most general case of a GCM is a **probabilistic causal model** (PCM), where causal mechanisms are defined by
**conditional stochastic models** and **stochastic models**. In the ``dowhy.gcm`` package, these are represented by
:class:`~ProbabilisticCausalModel`, :class:`~ConditionalStochasticModel`, and :class:`~StochasticModel`.
.. image:: pcm.png
:width: 80%
:align: center
In practical terms, however, we often use **structural causal models** (SCMs) to represent our GCMs,
and the causal mechanisms are defined by **functional causal models** (FCMs) for non-root nodes and **stochastic
models** for root nodes. An SCM implements the same traits as a PCM, but on top of that, its FCMs allow us to
reason *further* about its data generation process based on parents and noise, and hence, allow us e.g. to compute
counterfactuals.
.. image:: scm.png
:width: 80%
:align: center
To keep this introduction simple, we will stick with SCMs for now.
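To make these building blocks concrete, here is a toy sketch of an SCM in plain Python. It does not use the ``dowhy.gcm`` classes and is not how they are implemented. The dictionary layout and the name ``draw_sample`` are assumptions made for this illustration: each node stores its parents and a function of parent values and noise.

```python
import random

# Hypothetical toy SCM: node -> (parents, mechanism(parent_values, noise)).
scm = {
    "X": ((), lambda parents, noise: noise),  # root node: just its own distribution
    "Y": (("X",), lambda parents, noise: 2 * parents["X"] + noise),
    "Z": (("Y",), lambda parents, noise: 3 * parents["Y"] + noise),
}

def draw_sample(model, rng):
    """Sample one row; relies on the dict listing parents before children."""
    values = {}
    for node, (parents, mechanism) in model.items():
        parent_values = {p: values[p] for p in parents}
        values[node] = mechanism(parent_values, rng.gauss(0, 1))
    return values

sample = draw_sample(scm, random.Random(0))
print(sample)
```

Because every non-root value is an explicit function of its parents and a noise term, the noise can be reasoned about separately, which a purely conditional distribution would not allow.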
As mentioned above, a causal mechanism describes how the values of a node are influenced by the values of its parent
nodes. We will dive much deeper into the details of causal mechanisms and their meaning in section
:doc:`customizing_model_assignment`. But for this introduction, we will treat them as an opaque thing that is needed
to answer causal questions. With that in mind, the typical steps involved in answering a causal question are:
1. **Modeling cause-effect relationships as a GCM (causal graph + causal mechanisms):**
causal_model = StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
auto_assign_causal_models(causal_model, based_on=data)
2. **Fitting the GCM to the data:**
fit(causal_model, data)
3. **Answering a causal query based on the GCM:**
results = <causal_query>(causal_model, ...)
Where ``<causal_query>`` can be one of multiple functions explained in
Let's look at each of these steps in more detail.
Step 1: Modeling cause-effect relationships as a structural causal model (SCM)
The first step is to model the cause-effect relationships between variables relevant
to our use case. We do that in the form of a causal graph. A causal graph is a directed acyclic
graph (DAG) where an edge X→Y implies that X causes Y. Statistically, a causal graph encodes the
conditional independence relations between variables. Using the `networkx <https://networkx>`__
library, we can create causal graphs. In the snippet below, we create a chain X→Y→Z:
>>> import networkx as nx
>>> causal_graph = nx.DiGraph([('X', 'Y'), ('Y', 'Z')])
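As a side note, the "acyclic" requirement means the nodes can always be ordered so that every cause comes before its effects. The sketch below shows this for the same edge list. It uses Python's standard ``graphlib`` module (Python 3.9+) rather than networkx, purely to keep the illustration dependency-free:

```python
from graphlib import TopologicalSorter

edges = [("X", "Y"), ("Y", "Z")]  # same chain as above: X -> Y -> Z

# Map each node to the set of its direct causes (parents).
parents = {}
for cause, effect in edges:
    parents.setdefault(cause, set())
    parents.setdefault(effect, set()).add(cause)

# static_order() raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(parents).static_order())
print(order)  # ['X', 'Y', 'Z']
```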
To answer causal questions using causal graphs, we also have to know the nature of the underlying
data-generating process of variables. A causal graph by itself, being a diagram, does not have
any information about the data-generating process. To introduce this data-generating process, we use an SCM that's
built on top of our causal graph:
>>> from dowhy import gcm
>>> causal_model = gcm.StructuralCausalModel(causal_graph)
This causal model allows us now to assign causal mechanisms to each node in the form of functional causal models.
Section :doc:`customizing_model_assignment` explains how this can be done explicitly, but for now, we'll rely on
our auto-assign feature. The feature automatically determines a good set of default functional causal
models based on the data we work with.
Therefore, at this point we would normally load our dataset. For this introduction, we generate
some synthetic data instead. The API takes data in the form of pandas DataFrames:
>>> import numpy as np, pandas as pd
>>> X = np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
>>> Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>> data.head()
          X         Y          Z
0 -2.253500 -3.638579 -10.370047
1 -1.078337 -2.114581  -6.028030
2 -0.962719 -2.157896  -5.750563
3 -0.300316 -0.440721  -2.619954
4  0.127419  0.158185   1.555927
Note how the columns X, Y, Z correspond to our nodes X, Y, Z in the graph constructed above. We can also see how the
values of X influence the values of Y and how the values of Y influence the values of Z in that data set.
In the real world, this data comes as an opaque stream of values, where we don't know how one
variable influences another. An SCM can help us to deconstruct these causal
relationships again, even though we didn't know them before.
Now that we have the data, let's automatically assign a functional causal model (FCM) to each node in the graph,
based on the data:
>>> gcm.auto_assign_causal_models(causal_model, based_on=data)
While this function provides a good default, section :doc:`customizing_model_assignment` explains
how we can manually optimize the choice of models according to our problem and improve our results.
Step 2: Fitting the SCM to the data
With the data at hand and the graph constructed earlier, we can now train the SCM using ``fit``:
>>> gcm.fit(causal_model, data)
Fitting means that we learn the generative models of the variables in the SCM according to the data.
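As a rough intuition for what fitting a single mechanism involves, the sketch below estimates the mechanism of Y from data with ordinary least squares, using only the standard library. This is an assumption-laden illustration: the models that the auto-assignment picks and the way ``fit`` trains them may differ.

```python
import random

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]  # true mechanism: Y = 2 * X + noise

# Ordinary least squares for Y = slope * X + intercept + noise.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# The residuals are the learned model's estimate of the noise of Y.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
```

The estimated slope should land close to the true value of 2, and the residuals approximate the noise term that an FCM keeps track of.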
Step 3: Answering a causal query based on the SCM
The last step, answering a causal question, is our actual goal. For example, we could ask:
What will happen to the variable Z if I intervene on Y?
This can be done via the ``perform_intervention`` function. Here's how:
>>> samples = gcm.perform_intervention(causal_model,
...                                    {'Y': lambda y: 2.34 },
...                                    num_samples_to_draw=1000)
>>> samples.head()
0 1.186229 6.918607 20.682375
1 -0.758809 -0.749365 -2.530045
2 -1.177379 -5.678514 -17.110836
3 -1.211356 -2.152073 -6.212703
4 -0.100224 -0.285047 0.256471
This intervention says: "I'll ignore any causal effects of X on Y, and set every value of Y
to 2.34." So the distribution of X will remain unchanged, whereas values of Y will be at a fixed
value and Z will respond according to its causal model.
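The effect of this intervention can be sketched by hand. The snippet below assumes the ground-truth mechanisms from our synthetic data (Y = 2X + noise, Z = 3Y + noise) instead of the fitted SCM and uses only the standard library:

```python
import random
import statistics

rng = random.Random(0)
rows = []
for _ in range(10000):
    x = rng.gauss(0, 1)          # X keeps its own distribution
    y = 2.34                     # Y := 2.34, the influence of X on Y is cut
    z = 3 * y + rng.gauss(0, 1)  # Z responds through its own mechanism
    rows.append((x, y, z))

x_mean = statistics.fmean(r[0] for r in rows)  # close to 0
z_mean = statistics.fmean(r[2] for r in rows)  # close to 3 * 2.34 = 7.02
print(x_mean, z_mean)
```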
With this knowledge, we can now dive deep into the meaning and usages of causal queries in section

Binary data
docs/source/gcm/user_guide/pcm.png Normal file

Binary file not shown.

Size: 110 KiB

Binary data
docs/source/gcm/user_guide/scm.png Normal file

Binary file not shown.

Size: 105 KiB


@ -35,6 +35,12 @@
.. toctree::
:maxdepth: 2
:caption: GCM-based inference (Experimental)
.. toctree::
:maxdepth: 2
:caption: Package