Introduce GCMs section in docs with basic GCM User Guide
Parent: 7ae118ff43
Commit: fa7db57024
|
@ -0,0 +1,4 @@
|
|||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
user_guide/index
|
|
@ -0,0 +1,51 @@
|
|||
Attributing Distributional Changes
|
||||
==================================
|
||||
|
||||
When attributing distribution changes, we answer the question:
|
||||
|
||||
What mechanism in my system changed between two sets of data?
|
||||
|
||||
For example, in a distributed computing system, we want to know why an important system metric changed in a negative way.
|
||||
|
||||
How to use it
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
To see how the method works, let's take the example from above and assume we have a system of three services X, Y, Z,
|
||||
producing latency numbers. The first dataset ``data_old`` is before the deployment, ``data_new`` is after the
|
||||
deployment:
|
||||
|
||||
>>> import networkx as nx, numpy as np, pandas as pd
|
||||
>>> from dowhy import gcm
|
||||
>>> from scipy.stats import halfnorm
|
||||
>>>
|
||||
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
|
||||
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
|
||||
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
|
||||
>>>
|
||||
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
|
||||
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
|
||||
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
|
||||
|
||||
The change here simulates an accidental conversion of multi-threaded code into sequential code (waiting for X and Y in
parallel vs. waiting for them sequentially).
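To make the effect of this change concrete, the following sketch re-creates both regimes with the standard library only and compares the average of Z. The helper names ``halfnorm`` and ``mean_latency`` are illustrative assumptions and not part of DoWhy; the half-normal draw mimics ``scipy.stats.halfnorm`` with the same ``loc`` and ``scale``.

```python
import random


def halfnorm(rng, loc, scale):
    """Draw one half-normal sample (same parametrization as scipy.stats.halfnorm)."""
    return loc + scale * abs(rng.gauss(0.0, 1.0))


def mean_latency(combine, n=20000, seed=0):
    """Average of Z when Z = combine(X, Y) + unit Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = halfnorm(rng, 0.5, 0.2)
        y = halfnorm(rng, 1.0, 0.2)
        total += combine(x, y) + rng.gauss(0.0, 1.0)
    return total / n


mean_old = mean_latency(max)                  # parallel: wait for the slower service
mean_new = mean_latency(lambda x, y: x + y)   # sequential: wait for both in turn
shift = mean_new - mean_old
print(f"old={mean_old:.2f} new={mean_new:.2f} shift={shift:.2f}")
```

Because Y is almost always larger than X here, the old mechanism is close to ``Y + noise``, so the shift in the mean of Z is roughly the mean of X. Only the mechanism of Z differs between the two runs; X and Y are drawn the same way in both.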
|
||||
|
||||
Next, we'll model cause-effect relationships as a probabilistic causal model:
|
||||
|
||||
>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')])) # X -> Z <- Y
|
||||
>>> gcm.auto_assign_causal_models(causal_model, based_on=data_old)
|
||||
|
||||
Finally, we attribute changes in distributions to changes in causal mechanisms:
|
||||
|
||||
>>> attributions = gcm.distribution_change(causal_model, data_old, data_new, 'Z')
|
||||
>>> attributions
|
||||
{'X': -0.0066425020480165905, 'Y': 0.009816959724738061, 'Z': 0.21957816956354193}
|
||||
|
||||
As we can see, :math:`Z` got the highest attribution score here, which matches what we would
|
||||
expect, given that we changed the mechanism for variable :math:`Z` in our data generation.
|
||||
|
||||
As the reader may have noticed, there is no fitting step involved when using this method. The
reason is that this function calls ``fit`` internally. To be precise, it makes two copies of the
causal graph, fits one copy to the first dataset, and fits the other copy to the second dataset.
|
|
@ -0,0 +1,83 @@
|
|||
Computing Counterfactuals
|
||||
==========================
|
||||
|
||||
By computing counterfactuals, we answer the question:
|
||||
|
||||
I observed a certain outcome z for a variable Z where variable X was set to a value x. What
|
||||
would have happened to the value of Z, had I intervened on X to assign it a different value x'?
|
||||
|
||||
As a concrete example, we can imagine the following:
|
||||
|
||||
I'm seeing unhealthily high levels of my `LDL cholesterol
<https://www.google.com/search?q=cholesterol+ldl>`_ (Z=10). I didn't take any medication
against it in recent months (X=0). What would have happened to my LDL cholesterol level (Z),
had I taken a medication dosage of 5g a day (X := 5)?
|
||||
|
||||
How to use it
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
To see how the method works, let's generate some data:
|
||||
|
||||
>>> import networkx as nx, numpy as np, pandas as pd
|
||||
>>> from dowhy import gcm
|
||||
>>>
|
||||
>>> X = np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
|
||||
|
||||
Next, we'll model cause-effect relationships as an invertible SCM and fit it to the data:
|
||||
|
||||
>>> causal_model = gcm.InvertibleStructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
|
||||
>>> gcm.auto_assign_causal_models(causal_model, training_data, gcm.AutoAssignQuality.GOOD)
|
||||
>>>
|
||||
>>> gcm.fit(causal_model, training_data)
|
||||
|
||||
Finally, let's compute the counterfactual when intervening on X:
|
||||
|
||||
>>> gcm.estimate_counterfactuals(
...     causal_model,
...     {'X': lambda x: 2},
...     observed_data=pd.DataFrame(data=dict(X=[1], Y=[2], Z=[3])))
|
||||
X Y Z
|
||||
0 2 4.034229 9.073294
|
||||
|
||||
As we can see, :math:`X` takes our intervention value of 2, and :math:`Y` and :math:`Z`
take deterministic values based on our trained causal models and the fixed observed data. For example,
based on the data generation process, if :math:`X = 1` and :math:`Y = 2`, we would expect :math:`Z` to
be 6, but we *observed* :math:`Z = 3`, which means that the noise value for :math:`Z` in this
particular sample is approximately -2.98. Now, given that we know this hidden noise factor, we can
estimate the counterfactual value of :math:`Z`, had we set :math:`X := 2`, which is approximately
9.07 (as can be seen in the result above).
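The reasoning above can be sketched by hand as three steps: recover the noise from the observation, apply the intervention, and propagate it while reusing that noise. The sketch below is not DoWhy code. It assumes the *true* coefficients of the data generation process (2 and 3) instead of fitted models, which is why it yields exactly 9 rather than the fitted estimate of about 9.07; the function name ``counterfactual_chain`` is hypothetical.

```python
def counterfactual_chain(observed, new_x, a=2.0, b=3.0):
    """Counterfactual for X -> Y -> Z with Y = a*X + N_Y and Z = b*Y + N_Z."""
    # Step 1: recover the noise values that explain the observed sample.
    noise_y = observed["Y"] - a * observed["X"]
    noise_z = observed["Z"] - b * observed["Y"]
    # Step 2: replace the mechanism of X by the constant new_x.
    x_cf = new_x
    # Step 3: push the new value through the mechanisms, reusing the noise.
    y_cf = a * x_cf + noise_y
    z_cf = b * y_cf + noise_z
    return {"X": x_cf, "Y": y_cf, "Z": z_cf}


result = counterfactual_chain({"X": 1.0, "Y": 2.0, "Z": 3.0}, new_x=2.0)
print(result)  # {'X': 2.0, 'Y': 4.0, 'Z': 9.0}
```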
|
||||
|
||||
This shows that the observed data is used to calculate the noise data in the system. We can also
|
||||
provide these noise values directly, via:
|
||||
|
||||
>>> gcm.counterfactual_distribution(
...     causal_model,
...     {'X': lambda x: 2},
...     noise_data=pd.DataFrame(data=dict(X=[0], Y=[-0.007913], Z=[-2.97568])))
|
||||
X Y Z
|
||||
0 2 4.034229 9.073293
|
||||
|
||||
As we see, with :math:`X = 2` and :math:`Y \approx 4.03`, :math:`Z` should be approximately 12. But
|
||||
we know the hidden noise for this sample, approximately -2.98. So the counterfactual outcome
|
||||
is again :math:`Z \approx 9.07`.
|
||||
|
||||
Understanding the method
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Counterfactuals are very similar to :doc:`simulate_impact_of_interventions`, with an important
difference: when performing interventions, we look into the future, whereas for counterfactuals we
look into an alternative past. To reflect this in the computation, when performing interventions, we
generate all noise using our causal models. For counterfactuals, we use the noise from actual observed data.
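The difference in how noise is handled can be illustrated with a small standard-library sketch. It assumes the linear chain from the example above with its true coefficients (an assumption for illustration; DoWhy would use the fitted mechanisms), and the names ``A``, ``B`` and ``z_given`` are hypothetical.

```python
import random
import statistics

A, B = 2.0, 3.0  # assumed coefficients of Y = A*X + N_Y and Z = B*Y + N_Z


def z_given(x, noise_y, noise_z):
    """Value of Z for a given X and given noise values."""
    return B * (A * x + noise_y) + noise_z


rng = random.Random(0)

# Intervention X := 2: fresh noise is generated for every sample,
# so the result is a whole distribution over Z.
interventional_z = [
    z_given(2.0, rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10000)
]

# Counterfactual X := 2: the noise is fixed to what the observation
# X=1, Y=2, Z=3 implies, so the result is a single value.
noise_y = 2.0 - A * 1.0
noise_z = 3.0 - B * 2.0
counterfactual_z = z_given(2.0, noise_y, noise_z)

print(statistics.mean(interventional_z), statistics.stdev(interventional_z))
print(counterfactual_z)
```

The interventional samples scatter around 12, whereas the counterfactual stays at the single value implied by the observed noise.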
|
||||
|
||||
To expand on our example above, we assume there are other factors that contribute to cholesterol
|
||||
levels, e.g. exercising or genetic predisposition. While we *assume* medication helps against high
|
||||
LDL levels, it's important to take into account all other factors that could also help against it.
|
||||
We want to prove *what* has helped. Hence, it's important to use the noise from the real data,
|
||||
not some generated noise from our generative models. Otherwise, I may be able to reduce my
|
||||
cholesterol LDL level in the counterfactual world, where I take medication (X := 5), but not because
|
||||
I took the medication, but because the *generated noise* of Z also just happened to be low and so
|
||||
caused a low value for Z. By taking the *real* noise value of Z (derived from the observed data of
|
||||
Z), I can prove that it was the medication that helped.
|
|
@ -0,0 +1,13 @@
|
|||
Answering Causal Questions
|
||||
===========================
|
||||
|
||||
In the following subsections, we'll dive deep into the causal questions that GCM-based inference in
DoWhy can answer, explain the concepts behind them, and show how to interpret the results.
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 3
|
||||
|
||||
simulate_impact_of_interventions
|
||||
computing_counterfactuals
|
||||
attribute_distributional_changes
|
|
@ -0,0 +1,50 @@
|
|||
Simulating the Impact of Interventions
|
||||
======================================
|
||||
|
||||
By simulating the impact of interventions, we answer the question:
|
||||
|
||||
What will happen to the variable Z if I intervene on Y?
|
||||
|
||||
How to use it
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
To see how the method works, let's generate some data:
|
||||
|
||||
>>> import numpy as np, pandas as pd
|
||||
>>>
|
||||
>>> X = np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Y = 2*X + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Z = 3*Y + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> training_data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
|
||||
|
||||
Next, we'll model cause-effect relationships as a probabilistic causal model and fit it to the data:
|
||||
|
||||
>>> import networkx as nx
|
||||
>>> from dowhy import gcm
|
||||
>>>
|
||||
>>> causal_model = gcm.ProbabilisticCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
|
||||
>>> gcm.auto_assign_causal_models(causal_model, training_data)
|
||||
>>> gcm.fit(causal_model, training_data)
|
||||
|
||||
Finally, let's perform an intervention on X:
|
||||
|
||||
>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: 1}, num_samples_to_draw=1000)
|
||||
>>> samples.head()
|
||||
X Y Z
|
||||
0 1 3.481467 12.475105
|
||||
1 1 1.282945 3.279435
|
||||
2 1 2.508717 7.907412
|
||||
3 1 2.077061 5.506252
|
||||
4 1 1.400568 6.097633
|
||||
|
||||
As we can see, X is now fixed at a constant value of 1. This is known as an atomic intervention. We can also perform
|
||||
shift interventions where we shift the random variable X by some value:
|
||||
|
||||
>>> samples = gcm.perform_intervention(causal_model, {'X': lambda x: x + 0.5}, num_samples_to_draw=1000)
|
||||
>>> samples.head()
|
||||
X Y Z
|
||||
0 -0.542813 0.031771 1.195391
|
||||
1 1.615089 2.156833 6.704683
|
||||
2 1.340949 1.910316 5.882468
|
||||
3 1.837919 4.360685 12.565738
|
||||
4 3.791410 8.361918 25.477725
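What the two kinds of intervention do to the samples can be sketched without DoWhy. The following standard-library example assumes the true chain mechanisms instead of fitted ones, and ``sample_chain`` is a hypothetical helper: an intervention is just a function applied to the value a node would otherwise have taken.

```python
import random
import statistics


def sample_chain(interventions, n, seed=0):
    """Sample X -> Y -> Z, passing each node's value through an optional function."""
    rng = random.Random(seed)

    def identity(value):
        return value

    rows = []
    for _ in range(n):
        x = interventions.get("X", identity)(rng.gauss(0, 1))
        y = interventions.get("Y", identity)(2 * x + rng.gauss(0, 1))
        z = interventions.get("Z", identity)(3 * y + rng.gauss(0, 1))
        rows.append((x, y, z))
    return rows


atomic = sample_chain({"X": lambda x: 1}, n=5000)         # X is replaced by 1
shifted = sample_chain({"X": lambda x: x + 0.5}, n=5000)  # X is moved by 0.5

print(statistics.mean(row[0] for row in atomic))
print(statistics.mean(row[0] for row in shifted))
```

In the atomic case the original value of X is ignored entirely; in the shift case it is kept and moved, so X remains random and only its mean changes.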
|
|
@ -0,0 +1,4 @@
|
|||
Customizing Model Assignment
|
||||
============================
|
||||
|
||||
TODO
|
|
@ -0,0 +1,10 @@
|
|||
GCMs User Guide
|
||||
===============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:glob:
|
||||
|
||||
introduction
|
||||
answering_causal_questions/index
|
||||
customizing_model_assignment
|
|
@ -0,0 +1,163 @@
|
|||
Introduction
|
||||
============
|
||||
|
||||
Graphical causal model-based inference, or GCM-based inference for short, is an experimental addition to DoWhy that
currently works separately from DoWhy's main API. Its experimental status also means that its API may
undergo breaking changes in the future. It will form part of a new, joint API (<link to proposal>). We
welcome your comments.
|
||||
|
||||
The ``dowhy.gcm`` package provides a variety of ways to answer causal questions and we'll go through them in detail in
|
||||
section :doc:`answering_causal_questions/index`. However, before diving into them, let's understand
|
||||
the basic building blocks and usage patterns it is built upon.
|
||||
|
||||
The basic building blocks
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
All main features of the GCM-based inference in DoWhy are built around the concept of **graphical causal models**. A
|
||||
graphical causal model consists of a causal directed acyclic graph (DAG) of variables and a **causal mechanism** for
|
||||
each of the variables. A causal mechanism defines the conditional distribution of a variable given its parents in the
|
||||
graph, or, in case of root node variables, simply its distribution.
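As a rough mental model, and not a description of how ``dowhy.gcm`` is implemented, a graphical causal model can be pictured as a mapping from each node to its parents plus one sampling function per node. The sketch below assumes a chain X → Y → Z with illustrative linear mechanisms; ``parents``, ``mechanisms`` and ``draw_sample`` are hypothetical names.

```python
import random
from graphlib import TopologicalSorter

# The causal DAG, written as node -> list of parents.
parents = {"X": [], "Y": ["X"], "Z": ["Y"]}

# One causal mechanism per node: given the parent values, draw a value.
# For the root node X this is simply a draw from its distribution.
mechanisms = {
    "X": lambda pa, rng: rng.gauss(0, 1),
    "Y": lambda pa, rng: 2 * pa["X"] + rng.gauss(0, 1),
    "Z": lambda pa, rng: 3 * pa["Y"] + rng.gauss(0, 1),
}


def draw_sample(rng):
    """Draw one joint sample by visiting nodes so that parents come first."""
    sample = {}
    for node in TopologicalSorter(parents).static_order():
        parent_values = {p: sample[p] for p in parents[node]}
        sample[node] = mechanisms[node](parent_values, rng)
    return sample


sample = draw_sample(random.Random(0))
print(sample)
```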
|
||||
|
||||
The most general case of a GCM is a **probabilistic causal model** (PCM), where causal mechanisms are defined by
|
||||
**conditional stochastic models** and **stochastic models**. In the ``dowhy.gcm`` package, these are represented by
|
||||
:class:`~ProbabilisticCausalModel`, :class:`~ConditionalStochasticModel`, and :class:`~StochasticModel`.
|
||||
|
||||
.. image:: pcm.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
|
||||
|
|
||||
|
||||
In practical terms, however, we often use **structural causal models** (SCMs) to represent our GCMs,
where the causal mechanisms are defined by **functional causal models** (FCMs) for non-root nodes and **stochastic
models** for root nodes. An SCM implements the same traits as a PCM, but on top of that, its FCMs allow us to
reason *further* about its data generation process based on parents and noise and hence allow us, for example, to
compute counterfactuals.
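The extra capability of an FCM comes from expressing a node as a function of its parents and an explicit noise term. A minimal sketch, assuming a linear additive noise form with a single parent (the class name ``LinearAdditiveNoiseModel`` is hypothetical and not a DoWhy class), shows that such a function can be evaluated forwards and also solved for the noise:

```python
from dataclasses import dataclass


@dataclass
class LinearAdditiveNoiseModel:
    """Toy FCM of the form target = coefficient * parent + noise."""

    coefficient: float

    def evaluate(self, parent, noise):
        """Forward direction: compute the target from parent and noise."""
        return self.coefficient * parent + noise

    def estimate_noise(self, target, parent):
        """Inverse direction: recover the noise from an observed pair."""
        return target - self.coefficient * parent


fcm_z = LinearAdditiveNoiseModel(coefficient=3.0)
noise = fcm_z.estimate_noise(target=3.0, parent=2.0)
print(noise)                       # -3.0
print(fcm_z.evaluate(4.0, noise))  # 9.0
```

A purely conditional distribution can only be sampled from; it is the ability to recover and reuse the noise that makes counterfactual queries possible.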
|
||||
|
||||
.. image:: scm.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
|
||||
|
|
||||
|
||||
To keep this introduction simple, we will stick with SCMs for now.
|
||||
|
||||
As mentioned above, a causal mechanism describes how the values of a node are influenced by the values of its parent
|
||||
nodes. We will dive much deeper into the details of causal mechanisms and their meaning in section
|
||||
:doc:`customizing_model_assignment`. But for this introduction, we will treat them as an opaque thing that is needed
|
||||
to answer causal questions. With that in mind, the typical steps involved in answering a causal question are:
|
||||
|
||||
1. **Modeling cause-effect relationships as a GCM (causal graph + causal mechanisms):**
|
||||
::
|
||||
|
||||
causal_model = StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) # X -> Y -> Z
|
||||
auto_assign_causal_models(causal_model, based_on=data)
|
||||
|
||||
2. **Fitting the GCM to the data:**
|
||||
::
|
||||
|
||||
fit(causal_model, data)
|
||||
|
||||
3. **Answering a causal query based on the GCM:**
|
||||
::
|
||||
|
||||
results = <causal_query>(causal_model, ...)
|
||||
|
||||
Where ``<causal_query>`` can be one of multiple functions explained in
|
||||
:doc:`answering_causal_questions/index`.
|
||||
|
||||
Let's look at each of these steps in more detail.
|
||||
|
||||
Step 1: Modeling cause-effect relationships as a structural causal model (SCM)
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
The first step is to model the cause-effect relationships between variables relevant
|
||||
to our use case. We do that in the form of a causal graph. A causal graph is a directed acyclic
|
||||
graph (DAG) where an edge X→Y implies that X causes Y. Statistically, a causal graph encodes the
|
||||
conditional independence relations between variables. Using the `networkx <https://networkx.github.io/>`__
library, we can create causal graphs. In the snippet below, we create a chain
|
||||
X→Y→Z:
|
||||
|
||||
>>> import networkx as nx
|
||||
>>> causal_graph = nx.DiGraph([('X', 'Y'), ('Y', 'Z')])
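For the chain X→Y→Z, the conditional independence encoded by the graph is that X and Z are independent given Y. The sketch below checks this numerically on simulated data using only the standard library. It assumes linear Gaussian mechanisms, for which a partial correlation near zero is a reasonable proxy for conditional independence; the helper names are illustrative.

```python
import math
import random


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    a, b = center(a), center(b)
    cov = sum(x * y for x, y in zip(a, b))
    return cov / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def residuals(target, predictor):
    """What is left of target after a least-squares fit on predictor."""
    t, p = center(target), center(predictor)
    slope = sum(x * y for x, y in zip(t, p)) / sum(x * x for x in p)
    return [ti - slope * pi for ti, pi in zip(t, p)]


rng = random.Random(0)
X = [rng.gauss(0, 1) for _ in range(20000)]
Y = [2 * x + rng.gauss(0, 1) for x in X]
Z = [3 * y + rng.gauss(0, 1) for y in Y]

marginal = correlation(X, Z)                             # X and Z are dependent
partial = correlation(residuals(X, Y), residuals(Z, Y))  # but not once Y is known
print(f"corr(X, Z) = {marginal:.3f}, partial corr given Y = {partial:.3f}")
```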
|
||||
|
||||
To answer causal questions using causal graphs, we also have to know the nature of the underlying
data-generating process of the variables. A causal graph by itself, being a diagram, does not carry
any information about the data-generating process. To introduce this data-generating process, we use an SCM that's
built on top of our causal graph:
|
||||
|
||||
>>> from dowhy import gcm
|
||||
>>> causal_model = gcm.StructuralCausalModel(causal_graph)
|
||||
|
||||
This causal model allows us now to assign causal mechanisms to each node in the form of functional causal models.
|
||||
Section :doc:`customizing_model_assignment` explains how this can be done explicitly, but for now, we'll rely on
|
||||
our auto-assign feature. The feature automatically determines a good set of default functional causal
|
||||
models based on the data we work with.
|
||||
|
||||
Therefore, at this point we would normally load our dataset. For this introduction, we generate
|
||||
some synthetic data instead. The API takes data in the form of pandas DataFrames:
|
||||
|
||||
>>> import numpy as np, pandas as pd
|
||||
>>>
|
||||
>>> X = np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
|
||||
>>> data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
|
||||
>>> data.head()
|
||||
X Y Z
|
||||
0 -2.253500 -3.638579 -10.370047
|
||||
1 -1.078337 -2.114581 -6.028030
|
||||
2 -0.962719 -2.157896 -5.750563
|
||||
3 -0.300316 -0.440721 -2.619954
|
||||
4 0.127419 0.158185 1.555927
|
||||
|
||||
Note how the columns X, Y, Z correspond to our nodes X, Y, Z in the graph constructed above. We can also see how the
|
||||
values of X influence the values of Y and how the values of Y influence the values of Z in that data set.
|
||||
|
||||
In the real world, this data comes as an opaque stream of values, where we don't know how one
variable influences another. The SCM-based approach can help us to deconstruct these causal
relationships, even though we didn't know them before.
|
||||
|
||||
Now that we have the data, let's automatically assign a functional causal model (FCM) to each node in the graph,
|
||||
based on the data:
|
||||
|
||||
>>> gcm.auto_assign_causal_models(causal_model, based_on=data)
|
||||
|
||||
While this function provides a good default, section :doc:`customizing_model_assignment` explains
|
||||
how we can manually optimize the choice of models according to our problem and improve our results.
|
||||
|
||||
Step 2: Fitting the SCM to the data
|
||||
-----------------------------------
|
||||
|
||||
With the data at hand and the graph constructed earlier, we can now train the SCM using ``fit``:
|
||||
|
||||
>>> gcm.fit(causal_model, data)
|
||||
|
||||
Fitting means that we learn the generative models of the variables in the SCM from the data.
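To give an intuition for what is being learned, the sketch below fits one linear model per non-root node of the chain by ordinary least squares, using only the standard library. This is an illustrative assumption about the model class: the models that ``auto_assign_causal_models`` actually picks may differ, and ``fit_linear`` is a hypothetical helper.

```python
import random


def fit_linear(parent, child):
    """Least-squares slope and intercept for child = slope * parent + intercept."""
    n = len(parent)
    mean_p = sum(parent) / n
    mean_c = sum(child) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
    var = sum((p - mean_p) ** 2 for p in parent)
    slope = cov / var
    return slope, mean_c - slope * mean_p


rng = random.Random(0)
X = [rng.gauss(0, 1) for _ in range(5000)]
Y = [2 * x + rng.gauss(0, 1) for x in X]
Z = [3 * y + rng.gauss(0, 1) for y in Y]

# Each node is fitted using only its parents in the graph.
slope_y, intercept_y = fit_linear(X, Y)
slope_z, intercept_z = fit_linear(Y, Z)
print(f"Y = {slope_y:.2f} * X + {intercept_y:.2f}")
print(f"Z = {slope_z:.2f} * Y + {intercept_z:.2f}")
```

The recovered slopes land close to the coefficients 2 and 3 used to generate the data.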
|
||||
|
||||
Step 3: Answering a causal query based on the SCM
|
||||
-------------------------------------------------
|
||||
|
||||
The last step, answering a causal question, is our actual goal. E.g. we could ask the question:
|
||||
|
||||
What will happen to the variable Z if I intervene on Y?
|
||||
|
||||
This can be done via the ``perform_intervention`` function. Here's how:
|
||||
|
||||
>>> samples = gcm.perform_intervention(causal_model,
...                                    {'Y': lambda y: 2.34 },
...                                    num_samples_to_draw=1000)
|
||||
>>> samples.head()
|
||||
X Y Z
|
||||
0 1.186229 6.918607 20.682375
|
||||
1 -0.758809 -0.749365 -2.530045
|
||||
2 -1.177379 -5.678514 -17.110836
|
||||
3 -1.211356 -2.152073 -6.212703
|
||||
4 -0.100224 -0.285047 0.256471
|
||||
|
||||
This intervention says: "I'll ignore any causal effects of X on Y, and set every value of Y
|
||||
to 2.34." So the distribution of X will remain unchanged, whereas values of Y will be at a fixed
|
||||
value and Z will respond according to its causal model.
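The behaviour described here can be reproduced with a hand-written simulation. The sketch assumes the true mechanisms of the synthetic data rather than the fitted SCM, and ``sample_do_y`` is a hypothetical helper, not a DoWhy function.

```python
import random
import statistics


def sample_do_y(value, n=10000, seed=0):
    """Sample the chain X -> Y -> Z with Y forced to a fixed value."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)          # X keeps its own mechanism
        y = value                    # the edge X -> Y is ignored
        z = 3 * y + rng.gauss(0, 1)  # Z responds to the new Y
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


xs, ys, zs = sample_do_y(2.34)
print(statistics.mean(xs), set(ys), statistics.mean(zs))
```

Under these assumptions, every sampled value of Y equals 2.34, X is still centered around 0, and Z is centered around 3 · 2.34 = 7.02 with only its own noise left as a source of variation.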
|
||||
|
||||
With this knowledge, we can now dive deep into the meaning and usages of causal queries in section
|
||||
:doc:`answering_causal_questions/index`.
|
Binary file not shown.
After Width: | Height: | Size: 110 KiB |
Binary file not shown.
After Width: | Height: | Size: 105 KiB |
|
@ -35,6 +35,12 @@
|
|||
|
||||
example_notebooks/nb_advanced_index
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:caption: GCM-based inference (Experimental)
|
||||
|
||||
gcm/index
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:caption: Package
|
||||
|
|