Nima Hejazi

4 minute read

Recently, I came across the blogdown R package, a variant of RStudio’s popular bookdown R package, made by Yihui Xie and Amber Thomas. Blogdown allows the user to write blog posts with code chunks, in any of the large variety of languages supported by RMarkdown, allowing for computationally reproducible writing and programming. It also plays well with the new static site engine Hugo. Here, I’m mostly just going to take blogdown for a spin.

First, let’s generate some data and try doing some very simple summary statistics:

# we're going to be simulating...set seed and constants
set.seed(6142)
n <- 1000
tx_mean <- 25

# generate variables randomly and by structural equation
W <- replicate(3, rnorm(n))
A <- rnorm(n, mean = tx_mean, sd = 2*sqrt(tx_mean))
Y <- as.numeric(A > tx_mean & W[, 1] > 0)
O <- as.data.frame(cbind(Y, A, W))
colnames(O) <- c("Y", "A", paste0("W", seq_len(3)))
head(O)
##   Y        A         W1         W2         W3
## 1 1 28.40771  1.6361789  0.3361618  2.2859933
## 2 1 43.26174  2.0236876  0.3815559  0.1296723
## 3 1 30.45214  0.1319931  0.3985546  0.7973811
## 4 0 23.52230 -0.4502451  0.7176603 -1.0905016
## 5 1 29.88353  1.7895223 -0.3369940  0.7153515
## 6 0 21.33805  1.7653295 -0.8362112 -0.7107431
skim(O)
## Warning: map_if() is deprecated, please use `modify_if()` instead

## Warning: map_if() is deprecated, please use `modify_if()` instead

## Warning: map_if() is deprecated, please use `modify_if()` instead

## Warning: map_if() is deprecated, please use `modify_if()` instead

## Warning: map_if() is deprecated, please use `modify_if()` instead
## Numeric Variables
## # A tibble: 5 x 13
##     var    type missing complete     n        mean        sd       min
##   <chr>   <chr>   <dbl>    <dbl> <dbl>       <dbl>     <dbl>     <dbl>
## 1     A numeric       0     1000  1000 25.18186636 9.7685295 -4.678917
## 2    W1 numeric       0     1000  1000  0.04544250 0.9997996 -3.482909
## 3    W2 numeric       0     1000  1000 -0.02480579 0.9891446 -3.677962
## 4    W3 numeric       0     1000  1000 -0.02308091 1.0069851 -3.367710
## 5     Y numeric       0     1000  1000  0.27100000 0.4446985  0.000000
## # ... with 5 more variables: `25% quantile` <dbl>, median <dbl>, `75%
## #   quantile` <dbl>, max <dbl>, hist <chr>

Look at that! In the above, we generate background covariates (\(W\)) based on the standard Normal distribution, a treatment (\(A\)) based on a Normal distribution with specified mean (\(\mu = 25\)) and standard deviation (\(\sigma = 2 \cdot \sqrt{\mu} = 10\)), and an outcome (\(Y\)) based on a specified structural equation of the form:

\[Y = I(A > 25) \cdot I(W_1 > 0),\] for \(n = 100\).

Having just recently returned from ACIC ’17, I have causal inference on my mind. You’ll notice that in specifying the data generating processes above, I provided a structural equation for \(Y\) – now, let’s play with marginal structural models (MSMs) just a little bit:

msm <- glm(Y ~ ., family = "binomial", data = O)
summary(msm)
## 
## Call:
## glm(formula = Y ~ ., family = "binomial", data = O)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.67224  -0.38138  -0.10080   0.07985   2.31857  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.523376   0.687346 -13.855   <2e-16
## A            0.272499   0.021257  12.820   <2e-16
## W1           2.589087   0.200157  12.935   <2e-16
## W2           0.053711   0.119842   0.448    0.654
## W3          -0.006588   0.108947  -0.060    0.952
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1168.50  on 999  degrees of freedom
## Residual deviance:  516.94  on 995  degrees of freedom
## AIC: 526.94
## 
## Number of Fisher Scoring iterations: 7

Above, we projected the observed data (\(O = (W, A, Y)\)) onto a working MSM for the outcome, using a logistic regression with terms for all of the baseline covariates and the treatment. From the summary of the resultant glm object, we notice that the estimated coefficients for the terms corresponding to the treatment (\(A\)) and first baseline covariate (\(W_1\)) are statistically significant – recall that these were the variables we used in specifying the structural equation for \(Y\) above.

I think I’ll wrap up this experiment by trying some simple plotting:

p <- ggplot(O, aes(A, Y)) + geom_point()
p <- p + stat_smooth(method = "glm", method.args = list(family = "binomial"),
                     formula = y ~ x)
p <- p + xlab("A (treatment)") + ylab("Y (outcome)") + theme_nima()
print(p)
Logistic regression of A on Y

Figure 1: Logistic regression of A on Y

We can plot the relationship between the treatment and outcome using a logistic regression fit. Pretty awesome that blogdown supports such nice graphics so easily. In my old blog, I had to come up with a custom R script to use knitr to process R Markdown to standard markdown, allowing them to be rendered on my old Jekyll blog. Glad I never have to do that again 👍

comments powered by Disqus