Setup details are described here, and the meta-post about these solutions is here.

Materials

The summer course1 is based on the second edition of Statistical Rethinking by Richard McElreath. This post covers the following exercise questions:

  • Chapter 5
    • E{1,2,3,4}
    • M{1,2,3,5}
  • Chapter 6
    • E{1,2,3,4}
    • M{1,2,3}
  • Chapter 7
    • E{1,2,3,4}
    • M{1,2,3,4,5,6}

Packages

A colophon with details is provided at the end, but the following packages and theme parameters are used throughout.

libsUsed <- c("tidyverse", "tidybayes", "orgutils", "dagitty",
              "rethinking", "tidybayes.rethinking",
              "ggplot2", "kableExtra", "dplyr", "glue",
              "latex2exp", "data.table", "printr", "devtools")
invisible(lapply(libsUsed, library, character.only = TRUE))
theme_set(theme_grey(base_size = 24))

Chapter V: The Many Variables & The Spurious Waffles

Easy Questions (Ch5)

5E1

Which of the linear models below are multiple linear regressions?

  1. $\mu = \alpha + \beta x$
  2. $\mu = \beta_x x + \beta_z z$
  3. $\mu = \alpha + \beta(x - z)$
  4. $\mu = \alpha + \beta_x x + \beta_z z$
Solution

A multiple linear regression is one in which more than one predictor, each with its own coefficient, enters the model additively (hence “linear”). By this logic, we can analyze the options as follows:

  1. Has only one predictor variable, $x$, and is thus not a multiple regression
  2. Is a multiple linear regression, since there are two predictor variables, $x$ and $z$
  3. Is not a multiple regression, since only the difference of $x$ and $z$ enters the model (with a single slope $\beta$)
  4. Is a multiple linear regression, since there are two predictor variables, $x$ and $z$

Thus options two and four are correct.

5E2

Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

Solution

Without any further information, we can simply write a model for diversity as:

$$D_A \sim \mathrm{LogNormal}(\mu, \sigma)$$
$$\mu = \alpha + \beta_L L + \beta_{D_P} D_P$$

Where:

  • $D_A$ is the animal diversity
  • $D_P$ is the plant diversity
  • $L$ is the latitude

We have used a log-normal distribution for the animal diversity, since negative values for diversity are meaningless; diversity here is understood as a positive, class-like count. The linear model has two predictors, latitude and plant diversity, and it is the inclusion of plant diversity as a predictor that allows us to “control” for it when assessing the association with latitude.

Further details would be relegated to the choice of priors instead of the model.
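For concreteness, a minimal rethinking-style sketch of this model is shown below; the priors, the simulated data, and the variable names are placeholders of our own, since the question only asks for the model definition.

# Purely illustrative simulated data (placeholder values)
set.seed(42)
n  <- 100
L  <- rnorm(n)                              # standardized latitude
DP <- rnorm(n, 0.3 * L)                     # plant diversity, loosely tied to latitude
DA <- rlnorm(n, 0.2 * L + 0.5 * DP, 0.3)    # positive animal diversity
m5e2 <- quap(
  alist(
    DA ~ dlnorm(mu, sigma),                 # log-normal outcome
    mu <- a + bL * L + bDP * DP,            # linear model in latitude and plant diversity
    a ~ dnorm(0, 1),
    c(bL, bDP) ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = list(DA = DA, L = L, DP = DP)
)
precis(m5e2)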

5E3

Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

Solution

Without considering priors, we would like to write a linear model with two predictor variables, funding and lab size. To allow for later extensions regarding the type of funding, we will refer to “money” rather than funding, with “time” to degree as the outcome. Again, since the time to a PhD cannot be negative, we will posit a log-normal distribution.

$$T \sim \mathrm{LogNormal}(\mu, \sigma)$$
$$\mu = \alpha + \beta_M M + \beta_S S$$

Where:

  • T is the time to completion
  • M corresponds to money
  • S corresponds to the size of the lab

Since we are told that the variables considered jointly have a positive association with the time, we note that the slope parameters for both should be positive.

5E4

Suppose you have a single categorical predictor with 4 levels (unique values), labeled A,B,C and D. Let A be an indicator variable that is 1 where case i is in category A. Also suppose B, C and D for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model.

  1. $\mu = \alpha + \beta_A A + \beta_B B + \beta_D D$
  2. $\mu = \alpha + \beta_A A + \beta_B B + \beta_C C + \beta_D D$
  3. $\mu = \alpha + \beta_A A + \beta_C C + \beta_D D$
  4. $\mu = \alpha_A A + \alpha_B B + \alpha_C C + \alpha_D D$
  5. $\mu = \alpha(1 - B - C - D) + \alpha_B B + \alpha_C C + \alpha_D D$
Solution

Without the priors, it is difficult to infer much from these models. For the rest of the answer to make sense, we can assume indifferent priors, and enough data to overwhelm our priors (i.e., they are weakly informative).

The models differ only in how they parameterize the category means, through an intercept and/or indicator coefficients. We will therefore consider only the number of parameters in each and their nature.

| Model | Parameters |
|-------|------------|
| (1) $\mu = \alpha + \beta_A A + \beta_B B + \beta_D D$ | 4 |
| (2) $\mu = \alpha + \beta_A A + \beta_B B + \beta_C C + \beta_D D$ | 5 |
| (3) $\mu = \alpha + \beta_A A + \beta_C C + \beta_D D$ | 4 |
| (4) $\mu = \alpha_A A + \alpha_B B + \alpha_C C + \alpha_D D$ | 4 |
| (5) $\mu = \alpha(1 - B - C - D) + \alpha_B B + \alpha_C C + \alpha_D D$ | 4 |

Thus we can infer that, after fitting, only option two yields a posterior that cannot be computed from the others; models (1), (3), (4), and (5) parameterize the same four category means and are therefore inferentially equivalent.
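As a quick check (a sketch with simulated placeholder data), we can fit the indicator form (1) and the index-style form (4) and confirm that they recover the same category means:

# Simulated data: four categories with known means (placeholder values)
set.seed(5)
n <- 200
k <- sample(1:4, n, replace = TRUE)           # 1 = A, 2 = B, 3 = C, 4 = D
y <- rnorm(n, c(1, 2, 3, 4)[k], 1)
d <- data.frame(y = y, k = k,
                A = as.integer(k == 1), B = as.integer(k == 2), D = as.integer(k == 4))
m_ind <- quap(                                # model (1): C is the baseline category
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bA * A + bB * B + bD * D,
    a ~ dnorm(0, 5),
    c(bA, bB, bD) ~ dnorm(0, 5),
    sigma ~ dexp(1)
  ), data = d
)
m_idx <- quap(                                # model (4), written in index form
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a[k],
    a[k] ~ dnorm(0, 5),
    sigma ~ dexp(1)
  ), data = d
)
precis(m_ind)              # the mean for category A is a + bA here ...
precis(m_idx, depth = 2)   # ... and a[1] here; they agree closely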

Questions of Medium Complexity (Ch5)

5M1

Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

Solution

For this example, consider the total potential energy of a molecular system. Recall that this can be written as:

$$E_{\text{total}} = E_{\text{electrostatics}} + E_{1B} + E_{2B} + \ldots$$

where the $B$ terms indicate the correction terms. When predicting the total energy, if the electrostatic energy is a function of the atomic descriptors and is entered into a model, then it masks the effect of the correction terms, which also rely on the atomic descriptors. This means that corrections to the total energy can also be thought of as corrections to the electrostatics, thus following the pattern of the divorce-rate and waffles example in the chapter.

To put this in more context, let us introduce more explicit variables.

$$E_T = E_{\text{Elec}}(\theta) + E_{1B}(\theta) + E_{2B}(\theta) + \ldots$$

In this setting it is clear that the masking of the correction terms is artificially induced by their shared dependence on the descriptors $\theta$.

Another possible example is from Wainer’s “The Most Dangerous Equation”, where the perceived utility of having smaller schools is a function of school size and the average number of achievements. The school size also affects the average number of achievements, as well as the actual utility. This implies a spurious correlation, which disappears once the variances are taken into account.
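A generic simulation of this spurious-correlation pattern (with hypothetical variables x_real and x_spur standing in for the electrostatic term and a correction term) makes the point concrete:

# x_real drives the outcome; x_spur is only correlated with x_real
set.seed(11)
n <- 1000
x_real <- rnorm(n)
x_spur <- rnorm(n, x_real)
y      <- rnorm(n, x_real)
precis(quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bR * x_real + bS * x_spur,
    a ~ dnorm(0, 1),
    c(bR, bS) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = data.frame(y, x_real, x_spur)
))
# bS collapses toward zero once x_real is in the model,
# even though cor(y, x_spur) alone is clearly positive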

5M2

Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

Solution

Let us consider a simple case: student completion rate as influenced by college tuition and the number of faculty members. Assume that tuition is negatively correlated with the completion rate, while the number of faculty is positively correlated with it. However, since there are more wealthy people who can afford college, the two predictors end up positively associated with each other in a chosen sample, and examining either variable alone then shows only a weak correlation with the completion rate.

It is important to note that masked relationships usually arise when the population is incorrectly sampled.
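A small simulation of this masked-relationship pattern (hypothetical variables; a latent wealth variable induces the positive correlation between tuition and faculty, which act on completion in opposite directions) is shown below:

set.seed(12)
n        <- 1000
wealth   <- rnorm(n)
tuition  <- rnorm(n, wealth, 0.5)      # positively correlated via wealth
faculty  <- rnorm(n, wealth, 0.5)
complete <- rnorm(n, faculty - tuition)
cor(complete, tuition)                 # weak marginal association ...
cor(complete, faculty)                 # ... in both cases
precis(quap(
  alist(
    complete ~ dnorm(mu, sigma),
    mu <- a + bT * tuition + bF * faculty,
    a ~ dnorm(0, 1),
    c(bT, bF) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = data.frame(complete, tuition, faculty)
))
# jointly, bT is clearly negative and bF clearly positive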

5M3

It is sometimes observed that the best predictor of fire risk is the presence of firefighters: States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inferences in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

Solution

The example given simply allows for the inference that areas with a higher incidence of fires tend to allocate more money and resources to prevent them, hence the observed larger number of firefighters. Similarly, a reversal of the divorce and marriage data might focus on the possibility that divorcees tend to get married more often than other singles. However, to understand this further, more categorical variables would be required, though this information might also be best represented by a time series of life events. We can posit the following:

$$M \sim \mathrm{Normal}(\mu, \sigma)$$
$$\mu = \alpha + \beta_L L + \beta_R R$$

Where:

  • M is the marriage rate
  • L is the probability of being married based on “love”
  • R is the variable accounting for remarriage

5M5

One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

Solution

We adopt the following notation:

  • P is price (predictor)
  • O is obesity (outcome)
  • D is for driving
  • E for eating out
  • Ex for exercise

Let us try to put this in the form of a DAG.

dag5m5 <- dagitty("dag{
P -> D -> E -> O
P -> D -> Ex -> O
}")

dag5m5 %>% graphLayout %>% plot

The DAG seems straightforward, but it is worth checking whether any adjustments are required.

dag5m5 %>% adjustmentSets(exposure = "P", outcome = "O") %>% print

 {}

Now we can work our way through the implied regressions with the most basic walk through the DAG; a sketch of the corresponding model definitions follows the paths below.

  • Path One

    1. D(P) decreases
    2. Ex(D) increases
    3. O(Ex) decreases
  • Path Two

    1. D(P) decreases
    2. E(D) decreases
    3. O(E) decreases
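As a sketch, assuming standardized variables collected in a hypothetical data frame d, the two mechanisms could be outlined with model definitions such as the following: one regression of obesity on the proximate causes of both paths, and one for the shared first step from price to driving.

# Obesity regressed on the proximate causes of both mechanisms
m_direct <- alist(
  O ~ dnorm(mu, sigma),
  mu <- a + bEx * Ex + bE * E,
  a ~ dnorm(0, 0.5),
  c(bEx, bE) ~ dnorm(0, 0.5),
  sigma ~ dexp(1)
)
# The shared first step: gas price reduces driving
m_drive <- alist(
  D ~ dnorm(mu, sigma),
  mu <- a + bP * P,
  a ~ dnorm(0, 0.5),
  bP ~ dnorm(0, 0.5),
  sigma ~ dexp(1)
)
# e.g. quap(m_direct, data = d), once such a data frame d is available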

Chapter VI: The Haunted DAG & The Causal Terror

Easy Questions (Ch6)

6E1

List three mechanisms by which multiple regression can produce false inferences about causal effects.

Solution

As per chapters five and six, we have the following mechanisms:

Confounding
Where there exists an additional variable which influences exposure and outcome values
Multicollinearity
Strong associations between two or more predictor variables, which will cause the posterior distribution to suggest that none of the variables is associated with the outcome, even if all of them actually are
Post-treatment variables
This is a form of included-variable bias, where conditioning on a consequence of the treatment masks the treatment’s effect
Collider Bias
Conditioning on collider variables creates statistical but not causal associations between its causes

6E2

For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.

Solution

One of the core tenets of the field of computational chemistry is the act of fitting empirical potential models to more accurate potential data (or even experiments).

Multicollinearity
When dealing with decreasing effects, using strongly correlated variables (like distance and effective distance measures such as centroid densities) causes the overall model to suggest that none of the measures are useful
Post-treatment variables
Often, while finding minima and saddle points on a potential energy surface, adding information about the already-known minima will impede training a model that fits the whole potential energy surface, leaving it concentrated around the known minima instead

6E3

List the four elemental confounds. Can you explain the conditional dependencies of each?

Solution

The four elemental confounds are enumerated in Figure 1.

Figure 1: The four elemental confounds

In symbolic notation, we can express this as:

| Confound | Symbolic Form | Conditional Independencies |
|----------|---------------|----------------------------|
| Forks | X ← Z → Y | Y ⫫ X given Z |
| Pipes | X → Z → Y | Y ⫫ X given Z |
| Colliders | X → Z ← Y | Y ⫫̸ X given Z |
| Descendants | See Figure 1 | Weakly conditions on parent |
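We can verify these implied conditional independencies directly with dagitty (a small sketch; the object names are our own):

fork     <- dagitty("dag{ X <- Z -> Y }")
pipe     <- dagitty("dag{ X -> Z -> Y }")
collider <- dagitty("dag{ X -> Z <- Y }")
impliedConditionalIndependencies(fork)      # X _||_ Y | Z
impliedConditionalIndependencies(pipe)      # X _||_ Y | Z
impliedConditionalIndependencies(collider)  # X _||_ Y (independent only without conditioning on Z)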

6E4

How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

Solution

Recall that the biased sample in the introduction to the chapter was:

It seems like the most newsworthy scientific studies are the least trustworthy. The more likely it is to kill you, if true, the less likely it is to be true. The more boring the topic, the more rigorous the results. How could this widely believed negative correlation exist? There doesn’t seem to be any reason for studies of topics that people care about to produce less reliable results. Maybe popular topics attract more and worse researchers, like flies drawn to the smell of honey?

Note that this can also be expressed as a collider in a causal DAG as:

newsworthiness → acceptance ← trustworthiness

The idea is that a proposal will be accepted if either the newsworthiness or the trustworthiness is high. There is thus on average a negative association between these criteria among the selected set of proposals.

In essence the association in the sub-samples is not the same as the total sample, and this causes wrong inferences on the total sample set, when conditioning on collider variables.
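A quick simulation, in the spirit of the chapter’s own example, shows how selecting on the collider induces the negative association:

set.seed(1914)
n  <- 200                                    # number of proposals
nw <- rnorm(n)                               # newsworthiness
tw <- rnorm(n)                               # trustworthiness
score    <- nw + tw                          # acceptance depends on both
selected <- score >= quantile(score, 0.9)    # keep the top 10%
cor(nw, tw)                                  # approximately zero in the full pool
cor(nw[selected], tw[selected])              # clearly negative among accepted proposals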

Questions of Medium Complexity (Ch6)

6M1

Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?

Solution

Let us outline this DAG.

dag6m1 <- dagitty("dag{
U [unobserved]
V [unobserved]
X -> Y
X <- U -> B <- C -> Y
U <- A -> C
C <- V -> Y
}")
coordinates(dag6m1) <- list(
  x = c(X = 0, Y = 2, U = 0, A = 1, B = 1, C = 2, V = 2.5),
  y = c(X = 2, Y = 2, U = 1, A = 0.2, B = 1.5, C = 1, V = 1.5)
)

We can visualize this with:

dag6m1 %>% drawdag

The paths between X and Y are:

  1. X → Y
  2. X ← U ← A → C → Y
  3. X ← U ← A → C ← V → Y
  4. X ← U → B ← C → Y
  5. X ← U → B ← C ← V → Y
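We can confirm this enumeration (and see which paths are open) with dagitty’s paths() helper:

dag6m1 %>% paths(from = "X", to = "Y") %>% print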

We can leverage dagitty to check which paths should be closed.

dag6m1 %>% adjustmentSets(exposure = "X", outcome = "Y") %>% print

 { A }

Logically, conditioning on A to close the non-causal paths makes sense, as it is consistent with the understanding that only (1) is a causal path and the rest are confounding paths which must be closed.

6M2

Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

Solution

The DAG under consideration is X → Z → Y. We will simulate data first.

N <- 5000
X <- N %>% rnorm(mean = 0, sd = 1)
Z <- N %>% rnorm(mean = X, sd = 0.5)
Y <- N %>% rnorm(mean = Z, sd = 1)
cor(X, Z) %>% print

[1] 0.9987166

The variables X and Z are highly correlated. We can check with a regression model for this.

m6m2 <- quap(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX * X + bZ * Z,
    c(a, bX, bZ) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = list(X = X, Y = Y, Z = Z)
)

The regression fit is summarized below.

m6m2 %>% precis

       mean   sd  5.5% 94.5%
a     -0.01 0.01 -0.03  0.01
bX     0.06 0.03  0.00  0.11
bZ     0.95 0.03  0.90  1.00
sigma  1.02 0.01  1.00  1.04

The fit shows that X is no longer a useful predictor once Z is added: Z sits between X and Y, so it acts like a post-treatment variable for the effect of X and should not have been included when estimating that effect. Unlike the legs example, the two predictors are not redundant measurements of the same underlying cause; the posterior simply (and correctly) assigns the association to Z. In effect, we also realize from this that multicollinearity is a data- and model-dependent property, and has no interpretation outside specific model instances.

6M3

Learning to analyze DAGs requires practice. For each of the four DAGs below, state which variables, if any, you must adjust for (condition on) to estimate the total causal influence of X on Y.

Solution

We can leverage the dagitty package as well to figure out which variables should be conditioned on.
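The DAG objects themselves are not defined in the listing above; a reconstruction from the book’s figure, consistent with the adjustment sets and path descriptions below, might look like the following (the object names are our own):

# upper left
dag6m3a <- dagitty("dag{
X -> Y
X <- Z -> Y
Z <- A -> Y
}")
# upper right
dag6m3b <- dagitty("dag{
X -> Y
X -> Z -> Y
Z <- A -> Y
}")
# lower left
dag6m3c <- dagitty("dag{
X -> Y
X -> Z <- Y
X <- A -> Z
}")
# lower right
dag6m3d <- dagitty("dag{
X -> Y
X -> Z -> Y
X <- A -> Z
}")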

dag6m3a %>% adjustmentSets(exposure = "X", outcome = "Y") %>% print
dag6m3b %>% adjustmentSets(exposure = "X", outcome = "Y") %>% print
dag6m3c %>% adjustmentSets(exposure = "X", outcome = "Y") %>% print
dag6m3d %>% adjustmentSets(exposure = "X", outcome = "Y") %>% print

 { Z }

 {}

 {}

 { A }

Clearly the upper left and lower right DAGs need to be conditioned on Z and A respectively to close non-causal paths.

We can further rationalize this as follows:

Upper Left
X ← Z → Y and X ← Z ← A → Y are open, non-causal paths which need to be closed
Upper Right
Z is a collider (on X → Z ← A → Y) which ensures that only causal paths are open
Lower Left
There is a collider Z which ensures that the non-causal paths are closed
Lower Right
This figure is more complicated, so we will consider all the paths, i.e. X → Y, X → Z → Y, X ← A → Z → Y, and we clearly need to condition on either A or Z. Z is also part of a causal path, so only A is to be conditioned on

A more canonical way to do this is to enumerate all paths for every option, but dagitty is more elegant.

dag6m3d %>% graphLayout %>% plot

Chapter VII: Ulysses’ Compass

Easy Questions (Ch7)

7E1

State the three motivating criteria that define information entropy. Try to express each in your own words.

Solution

The motivating criteria for defining information entropy, or “uncertainty”, are:

Continuity
It is preferable to have a continuous function to define our informational criteria, since we can always discretize a continuous function (by binning) later, but a discrete function does not have a full range of values which can correspond to all the possible models. As a metric then, it is preferable to have a minimum and maximum bound, but define it such that it is continuous for representing arbitrary models
Positive and Monotonic
The monotonicity constraint simply ensures that as the number of possible events increases, given no other changes in the system, the uncertainty will increase. Since the function is already continuous, the increasing nature is really by construction. It should be noted that a monotonically decreasing function would also satisfy the motivating criteria, but would change the interpretation completely
Additivity
As mentioned for continuity, it is always possible to bin or discretize a continuous function. Similarly, it is desirable that the total amount of uncertainty remains consistent when categories are redefined, with the corresponding values simply added or subtracted
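The function satisfying all three criteria (up to a multiplicative constant) is the information entropy introduced in the chapter:

$$H(p) = -\mathrm{E}\left[\log p_i\right] = -\sum_{i=1}^{n} p_i \log p_i$$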

7E2

Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

Solution

We can simulate this system easily.

p <- c(0.7, 0.3)
-sum(p * log(p)) %>% print

[1] 0.6108643

7E3

Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

Solution
p <- c(0.2, 0.25, 0.25, 0.3)
-sum(p * log(p)) %>% print

[1] 1.376227

7E4

Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

Solution

We will not consider impossible events in our calculation, since $p \log p \to 0$ as $p \to 0$ and such events contribute nothing to the entropy.

p <- c(1/3, 1/3, 1/3)
-sum(p * log(p)) %>% print

[1] 1.098612

Questions of Medium Complexity (Ch7)

7M1

Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?

Solution

We know that AIC or the “Akaike Information Criterion” is defined as:

$$\mathrm{AIC} = D_{\text{train}} + 2k = -2\,\mathrm{lppd} + 2k$$

Where $k$ is the number of parameters in the model and $D_{\text{train}}$ is the in-sample deviance.

The WAIC or “Widely Applicable Information Criterion” is given by:

$$\mathrm{WAIC} = -2\left(\sum_i \log \Pr(y_i) - \sum_i V(y_i)\right)$$

where $\Pr(y_i)$ is the average likelihood of observation $i$ over the posterior and $V(y_i)$ is the variance of its log-likelihood over the posterior.

WAIC is more general than the AIC. WAIC and AIC will be approximately equivalent when the priors are effectively flat or when there is enough data to render the priors redundant. This is because the WAIC makes no assumptions about the shape of the posterior, while AIC is an approximation depending on:

  • A flat prior (or one overwhelmed by the likelihood)
  • A posterior distribution which is approximately a multivariate Gaussian
  • A sample size $N$ much greater than the number of parameters $k$

Furthermore, we note that the AIC simply estimates the penalty term as twice the number of parameters, while the WAIC uses the lppd together with a penalty given by the sum of the variances of the log-likelihood of each observation.
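As a rough check (a sketch of our own, not from the text), with weak priors and plenty of data the WAIC of a quap fit should land close to the classical AIC of the equivalent maximum-likelihood fit:

set.seed(7)
n <- 500
x <- rnorm(n)
y <- rnorm(n, 1 + 0.5 * x, 1)
m_quap <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),        # very weak priors
    b ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = list(x = x, y = y)
)
m_lm <- lm(y ~ x)
WAIC(m_quap)   # should be close to ...
AIC(m_lm)      # ... the classical AIC of the same model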

7M2

Explain the difference between model selection and model comparison. What information is lost under model selection?

Solution

Model selection involves choosing one model over the others. Ideally this occurs after appropriate model comparison. However, the chapter does mention that it is common to use heuristics like “stargazing”, which uses frequentist tools to estimate which variables are important and then chooses the model (or causal salad) with the highest number of significant variables.

Model comparison should, in theory, be based on entropic measures of the information used. The models must be trained on the same data set for the metrics to be meaningful.

Model selection loses the uncertainty information carried by the models which do not achieve the (relatively) optimal value of the comparison metric. This is important, especially since models which are parameterized for prediction often perform better without being useful for causal analysis.

7M3

When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

Solution

When using an information criterion, it is important to understand that different sets of observations define different “small worlds”.

Information criteria work on the basis of accumulated deviance values, so fitting models to varying numbers of observations effectively compares apples and oranges. Each training data set essentially defines its own fit, and comparing models trained on different data sets (even subsets of the same data) will not lead to a fundamentally sound comparison.

We also know that, in general, a model fit to fewer data points accumulates fewer deviance terms, and will therefore artificially seem to be better.

We will demonstrate this with an artificial data set.

ySmallDat <- rnorm(100)
yLargeDat <- rnorm(1000)
m7m3S <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, sigma)
  ), data = list(y = ySmallDat, sigma = 1)
)
m7m3L <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, sigma)
  ), data = list(y = yLargeDat, sigma = 1)
)
WAIC(m7m3S) %>% rbind(WAIC(m7m3L)) %>% mutate(numSamples = c(100, 1000)) %>% toOrg

| WAIC | lppd | penalty | std_err | numSamples |
|------|------|---------|---------|------------|
| 278.876677095335 | -138.576629006818 | 0.861709540849766 | 11.1055429875975 | 100 |
| 2898.5831283182 | -1448.20174278015 | 1.08982137894866 | 49.5298847525459 | 1000 |

We see that, apparently, the model with fewer data points is superior; but from the discussion above, as well as by construction, we know that the two models are the same, so the effect is clearly spurious and caused by training on different data sets.

7M4

What happens to the effective number of parameters as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

Solution

Since the strength of a prior is directly related to the degree of regularization, it is clear that as a prior becomes more concentrated, the model becomes more sceptical of new data, and therefore the effective number of parameters will drop accordingly. Another way to look at the same problem is that the prior encodes our previous beliefs, which in effect represent additional data on which the model has already been trained a priori.

We can test this simply by re-using the models we defined for 7M3.

yDat <- rnorm(5)
sigL <- 1000
sigS <- 1
m7m4S <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, sigma)
  ), data = list(y = yDat, sigma = sigS)
)
m7m4L <- quap(
  alist(
    y ~ dnorm(mu, 1),
    mu ~ dnorm(0, sigma)
  ), data = list(y = yDat, sigma = sigL)
)

Recall that the WAIC is defined by:

$$\mathrm{WAIC} = -2\left(\mathrm{lppd} - p_{\mathrm{WAIC}}\right)$$

Where $p_{\mathrm{WAIC}}$ is the effective number of parameters. So we note that:

$$p_{\mathrm{WAIC}} = \mathrm{lppd} + 0.5\,\mathrm{WAIC}$$

This is reported by WAIC as the penalty parameter.

WAIC(m7m4S) %>% rbind(WAIC(m7m4L)) %>% mutate(sigma = c(sigS, sigL)) %>% toOrg

| WAIC | lppd | penalty | std_err | sigma |
|------|------|---------|---------|-------|
| 16.4098404638955 | -7.31440324407321 | 0.890516987874561 | 2.31086161575483 | 1 |
| 16.9915093637752 | -7.3011838990595 | 1.19457078282808 | 2.5268591022024 | 1000 |

Though the effect is not very strong, it is clear that the more concentrated prior (i.e., the smaller sigma) gives a smaller effective number of parameters, as expected.

7M5

Provide an informal explanation of why informative priors reduce overfitting.

Solution

Overfitting is easier to understand in the context of data-compression. Essentially, when overfitting occurs, the data is represented in a different encoding, instead of being compressed.

We can also view overfitting in terms of the trade-off between simply fitting every data point (low bias, high variance) and being completely oblivious to the data (high bias, low variance). In another sense, overfitting occurs when the model is “overly eager” to learn from the data.

Given this understanding, informative priors essentially regularize the model by ensuring that the posterior is not dragged too far from the prior by the likelihood, and hence prevent the model from “learning” from data points which are merely idiosyncrasies of the sample.

This means that informative priors reduce overfitting by lowering the sensitivity of the model to any one sample, which implicitly contains points that are not actually features of the process that will generate future data.

7M6

Provide an informal explanation of why overly informative priors result in underfitting.

Solution

Underfitting occurs when the model is insensitive to newer samples of the data. In classical terms, this means that the model has a very high bias, and typically has a correspondingly low variance.

With the understanding that priors cause regularization, which enforces sparsity of features, it is easier to see that very strong priors ensure that the model is overly sparse and incapable of picking up relevant trends in the training data.

Overly informative priors, essentially imply that the model has “seen” a large amount of data previously, which then means that it is less sensitive to newer samples of data. This means that features present in the training data which are relevant to future data will be ignored in favor of the prior predictions.

A: Colophon

To ensure that this document is fully reproducible at a later date, we will record the session info.

devtools::session_info()

  1. Summer of 2020 ↩︎