# Simulating the impact of changing electoral rules in Brazil

A few weeks ago, I and two colleagues published an op-ed at Folha de São Paulo, Brazil’s  biggest newspaper, criticizing the ‘distritão’ (translates as ‘big district’) proposal. Using simple simulations, we argued that the proposal would reinforce the power of incumbent parties and incentivize even more campaign spending by candidates, which is not exactly what Brazil needs right now. Fortunately, due to remarkable opposition in the press and the general public, the ‘distritão’ proposal was eventually rejected by Congress.

Here I share the code I used to run the simulation and some graphs we can use to compare the party composition of Brazil’s Chamber of Deputies and local Legislative Assemblies under the two different systems. Long story short, the ‘distritão’ system rank-orders all the candidates in a state according to their total vote share, regardless of votes in their parties. The first ‘n’ candidates in a state will be elected. ‘N’ is the amount of seats each state has in the national Chamber of Deputies. I used this rule of the game to run the simulation, as explained in the next paragraphs.

We start by finding the candidates elected in the current system. The dataset ‘dep_fed14.csv’ used below shows the electoral results of the 2014 national elections and can be easily downloaded at Brazil’s Electoral Court website.

```d <- read.csv("dep_fed14.csv", colClasses = "character")

##Find elected candidates in the current system:
d.elect <- d[which(d\$resultado_cod %in% c("4", "3")), ]

#Find amount of elected candidates by party

part1 <- data.frame(table(d.elect\$partido_sig))
part1 <- arrange(part1, desc(Freq))
colnames(part1) <- c("party", "seats")

```

Now we move on to simulate who would be elected using the proposed ‘distritão’ rule, as explained in the second paragraph above: rank-order candidates according to vote share, select the first ‘n’ candidates. There is an obvious caveat to this simulation: while we run it using data for the 2014 election as it was, the campaign dynamic could be very different with the new rule. Nonetheless, the new dynamic would reinforce the trend we found (more incumbency advantage) thus making the simulation a valid exercise.

```##Distribuicao dos assentos pelo modelo distritao.

##split per state
d.dist <- split(d, f = as.factor(d\$sigla_Estado))

#Find how many were elected by state
d.dist.number <- c()
for (i in 1:length(d.dist)){
d.dist.number[i] <- nrow(d.dist[[i]][which(d.dist[[i]]\$resultado_cod %in% c("3", "4")),])
}

##Rank each deputy in each state:
for (i in 1:length(d.dist)){
d.dist[[i]]\$rank <- rank(-as.numeric(d.dist[[i]]\$voto_nominal))
}

##Select for candidates where rank&lt;= amount of seats in that state
d.dist.elect <- list()
for (i in 1:length(d.dist)){
d.dist.elect[[i]] <- d.dist[[i]][which(d.dist[[i]]\$rank <= d.dist.number[i]), ]
}

##Graph again amount of seats per party
d2 <- do.call(rbind, d.dist.elect)

part2 <- data.frame(table(d2\$partido_sig))
part2 <- arrange(part2, desc(Freq))
colnames(part2) <- c("party", "seats")

```

After these steps we have two datasets with the actual results of the current system (part1) and the simulated results of the ‘distritão’ system (part2). What is left to do is to join them and plot both results in a barplot using ggplot:

```part1\$regra <- "atual"
part2\$regra <- "distritao"
part <- rbind(part1, part2)

p <- ggplot(data = part, aes(x = reorder(party, -seats), y = seats, fill = regra)) +
geom_bar(stat = "identity", position = position_dodge())+
geom_text(aes(label = seats), vjust=1.6, color="black",
position = position_dodge(0.9), size= 2)+
scale_fill_brewer(palette="Reds")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x = "party")

png("atualvsdistritao.png")
p
dev.off()
```

The result is the barplot below. We used a similar code to run the simulation across all the 26 states in Brasil. The results for Rio de Janeiro are striking in terms of increasing incumbency advantage: largest party in the local legislation would increase its seats by 60%: # Presenting results of a Regression Discontinuity Design – Part 3

The plots in part 1 and part 2 will help you identify a spurious jump. Looking at the scatter plot o the outcome on the running variable enable you correctly identify the functional form of your estimation (i.e., don’t run a linear regression function when you should have added a polynomial). Plotting the point estimate and confidence intervals for different bandwidths helps  making sure that any significant effect is not a result of a particular (and unprincipled) choice of bandwidth.

However, there are more checks to be done. The core assumption of the RDD design is that observations in the treatment and control group are comparable when they close enough to the cutoff point. More technically, the distribution of their potential outcome is continuous at the cutoff point or, equivalently, the expected of value of potential outcome of both groups is the same as they approach the cutoff (go revise Neyman-Rubin causal inference framework if this ‘potential outcome’ lingo gets you confused).

A way to check if the two groups are comparable is to look if belonging to the treatment of control group changes some variables other than the outcome. Let’s look at concrete example this. Suppose you are trying to test whether the effects of different voting technology on political outcomes. You could explore, for example, a new rule stipulating that cities with more than 200K inhabitants, and these cities only, will have to adopt the electronic ballot. Great opportunity to implement an RDD approach , right?  Yes it, and that’s exactly what Thomas Fujiwara, from Princeton, did. What you should do is to look at the cities in which population is slightly below 200K and compare your outcome of interest (political competition, voting for left-wing parties, etc) in both groups.

However, what if cities that are slightly above 200K are also consistently  richer or their  population is more educated, older, etc. Than those slightly below the threshold? Well something will smell fishy… This may imply that either there is some sorting among observations, i.e., richer or more educated cities deliberately (for some weird reason) over-report their population. The consequence is that the two groups are not comparable any more. You don’t if the difference in the groups is a consequence of the voting technology or better socio-economic performance of cities above the 200K threshold. In other words, their potential outcome is no longer the same, compromising the RDD causal estimation. Finding whether covariates are balanced (i.e., their average is the same) in the control and treatment groups is pretty standard in experimental settings, but more important in the RDD approach in order to validate it’s assumption.

Below I propose a way to visually summarize covariate balance in the RDD in a single graph. All you have to do is to run an RDD regression using the covariate and the outcome. If the coefficient of the treatment on this regression (i.e., the ‘effect’ of being treated in the covariate) is statistically indistinguishable from zero (p-value > .1) than covariates are balanced.

Below the code for this graph. I basically loop a RDD regression model over the covariates, collect the p-values and plot them on the same graph.

As before, let’s start by creating some fake data for reproducibility.

```

##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))

#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)

##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]

#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))

##Now we will create data for covariates. For the RDD estimation to be valid, they are supposed to be the, which should be the case for the ones below&amp;amp;amp;amp;amp;nbsp;
set.seed(123)
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
covar3 <- c(rnorm(500, mean = 50, sd = 50), rnorm(500, mean = 55, sd = 60))
covar4 <- c(rnorm(500, mean = 50, sd = 30), rnorm(500, mean = 51, sd = 30))

##Now we add this covariates to the dataset:

data <- data.frame(cbind(outcome, running.var, treat.ind, covar1, covar2, covar3, covar4))
data\$treat.ind <- as.factor(data\$treat.ind)

```

We are interested in the impact of the treatment close to the cutoff point. So let’s subset the data for the bandwidth of 10%

```

d <- data[which(data\$running.var < .1 & data\$running.var <-.1), ]

```

We want to evaluate the ‘impact’ of the treatment condition on the covariates. So we can  use a lapply function to loop a linear regression model over the covariates.

```

#Bundle the covariates' names together&amp;amp;amp;nbsp;
covars <- c("covar1", "covar2", "covar3", "covar4")

#Loop over them using a convenient feature of the "as.formula" function

models <- lapply(covars, function(x){
lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = d)
})
names(models) <- covars

```

As said, if the treatment condition is associated with different values of the covariates in the treatment and control group, then the p-values of its coefficient must be statistically significant.

```#So, let's extract the p-values for the treatment variable in the regressions above.
bal <- sapply(models, function(x){
p_value <- summary(x)\$coefficient[3,4]
return(p_value)
})

bal <- data.frame(covariate = covars, p_value = bal)

#Reorder to have a nicer plot
bal <- bal[order(bal\$p_value), ]

```

Now we just need to plot it:

```
###Data for plotting p-values in the vertical lines&nbsp;
ps <- data.frame(value = c(0.1, 0.05, 0.01), threshold = c("10%","5%","1%" ))
##reordering
ps\$significance.level <- reorder(ps\$threshold, ps\$value)

###Plotting:
require(ggplot2)

p <- ggplot(bal, aes(x = p_value))
p + geom_point(aes(y=covariate), colour = "blue", size=4, shape=24, fill="blue")+
xlim(0, 1)+
ylab("Covariate")+
geom_vline(data=ps, aes(xintercept = value, colour=significance.level), show_guide = TRUE)

```

This should spit out a graph like this: Notice the p-value of all covariates is above significance level (10%), which mean we are ok in terms of covariate balance.

Before closing the sequence of posts, it must be said that another common practice is to  run a McCrary test, which will examine the continuity of the running variable density function across the cutoff point. The idea here is to check whether the observations are capable to manipulate the the running variable, i.e., if they can force themselves in or out of the treatment condition, compromising the quasi-randomness when we get close to the cutoff point (take a look at the original paper) .

# Presenting results of a Regression Discontinuity Design – Part 2

Now that we have graphed the basic scatter plot of a RDD estimation in this post, it is time to estimate and graph the actual effects of the treatment. The tricky point about RDD estimation is that there are different bandwidths around the cutoff point and the estimation results may vary quite a bit depending on the one you choose. Therefore, the researcher needs to be as transparent as possible and show the results (point estimate and confidence intervals) for different bandwidths.

Optimally, your plot should look like the one below. The graph depicts the point estimates and confidence interval (y-axis) for different bandwidths (x-axis). The goal is to show whether your results are consistent across them. Certainly, the closer to the cutoff point (where the bandwidth size is close to zero), the better for causal inference because the ‘randomness’ of an observation ending up in the control or treatment group is higher there. However, this region is also more likely to have fewer observations in many real-world applications. Therefore, the closer to the cutoff point, the higher the uncertainty about the estimate and the larger the confidence interval (notice this is the pattern above).

There are ways to mathematically optimize the bandwidth and a well-known one is to use the Imbens Kalyanaraman (IK) bandwidth. This graph also plots the IK estimation, which is the one highlighted by the vertical dashed line. But as we can see, the IK bandwidth in this case is too large. In sum, the most transparent way to present the RDD results in by plotting estimation for different bandwidths, as well as for the IK one.

The R code below does that. In the end I wrap up the code into single function that takes in the outcome variable, running variable, treatment indicator and bandwidths and spits out the graph above.

Let’s start by, again, creating some fake data for easier replication, as we did in Part 1.

```
#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))

```

Now we’ll choose some sensible bandwidth and loop a simple linear regression estimation at each of them. The size of these bandwidths will depend on your substantive study. Think a bit while you choose it. Here what I have in mind is an election where the running variable is the winning margin. Also, you could make the model more complex by adding a quadratic term in the regression  estimation, but I’ll keep it simple. Finally, don’t forget to make the confidence intervals heteroscedasticity-robust.

```#First, collapse the variables above in one dataset
data <- data.frame(cbind(outcome, running.var, treat.ind))

#Define the different bandwidths
bw <- seq(0.01,.5,0.05) #this will depend on your data as well as substantive question. do some thinking to define which BWs you are interested in evaluating.  #Build matrix to  store estimates => you will need one estimate per bandwidth, therefore the number of rows here will be the same amount of bandwidths.
#Create a matrix to collect the results of the regression estimations.
estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3)

#Loop over different bws to obtain a point estimate and CI for each bw
for (i in 1:length(bw)){
d <- filter(data, running.var <= bw[i] & running.var >= -bw[i]) ###Filter the dataset for ith bw
est <- lm(outcome~running.var + treat.ind, data=d)
#Making it heteroskedastic robust
require("sandwich")
require("lmtest")
est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0"))
#Prevent the function to stop in cases where lm() cannot calculate the point estimate (too few cases in that bandwidth)
if(dim(est.robust)<3){
estimates[i,1] <- NA
estimates[i,2] <- NA
estimates[i,3] <- NA
}
else{
estimates[i,1] <- est.robust[3,1]
estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2]
estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2]
}
}
```

This code block should give us a matrix with point estimate and confidence intervals for the bandwidths we chose in the beginning. What we are missing now is the estimation in the IK bandwidth:

```#We will need to 'rdd' package for that
require(rdd)
#Get the running and outcome variables to feed into the next function
X <- data\$running.var
Y <- data\$outcome
#Now use IKbandwidth() from the package 'rdd'. Tweak with the parameters as necessary.
IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular")
#Once you have established the IK band, re-run the lm estimation for the observations within it
d.IK <- filter(data, running.var <= IK & running.var >= -IK)
est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK)
est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0"))
#Include them in the matrix with estimation results.
estimates[length(bw) + 1, 1] <- est.robust.IK[3,1]
estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2]
estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2]
```

Now you have a matrix with the point estimate and confidence intervals for the bandwidths of your choice as well as the IK one. All you need to do is to plot them on the same graph, which is the figure pasted in the beginning of the post. We will do that using ggplot2. Change xlab and ylab parameters according to your study.

```require(ggplot2)
ggplot(data=d, mapping = aes(x = bandwidth, y = point.estimate, ymin = ci.lower, ymax = ci.upper)) +
geom_vline(xintercept = IK, colour = "royalblue3", linetype = "longdash") +
geom_pointrange(size = 0.8) +
geom_hline(yintercept = 0, colour="brown4") +
ggtitle("RDD results") +
xlab("Distance from Cutoff Point") +
ylab("Point estimate and confidence interval")
```

To make it easier, I’ve wrapped it all in a single function that takes in the variables in the dataset (running variable, outcome variable and treatment indicator), the bandwidths of choice, and spits out the graph above.

```rdd <- function(outcome, running.var,treat.ind, bw){
data <- data.frame(cbind(outcome, running.var, treat.ind))
estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3)
for (i in 1:length(bw)){ ###Looping over different bws to obtain a point estimate and CI for each bw
d <- filter(data, running.var <= bw[i] & running.var >= -bw[i])
est <- lm(outcome~running.var + treat.ind, data=d) ###Estimating formula
require("sandwich")
require("lmtest")
est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0"))
if(dim(est.robust)<3){
estimates[i,1] <- NA
estimates[i,2] <- NA
estimates[i,3] <- NA
}
else{
estimates[i,1] <- est.robust[3,1]
estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2]
estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2]
}
}
require(rdd)
X <- data\$running.var
Y <- data\$outcome
IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular")
d.IK <- filter(data, running.var <= IK & running.var >= -IK)
est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK)
est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0"))
estimates[length(bw) + 1, 1] <- est.robust.IK[3,1]
estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2]
estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2]
d <- data.frame(point.estimate = estimates[,1], ci.upper = estimates[,2], ci.lower = estimates[,3])
d\$bandwidth <- c(bw, IK)
require(ggplot2)
p <- ggplot(data=d, mapping=aes(x=bandwidth,y=point.estimate, ymin=ci.lower, ymax=ci.upper)) +
geom_vline(xintercept = IK, colour="royalblue3", linetype = "longdash") +
geom_pointrange(size=0.8) +
geom_hline(yintercept=0, colour="brown4") +
ggtitle("RDD results") +
xlab("Distance from Cutoff Point") +
ylab("Point estimate and confidence interval")
return(p)
}
```

# Presenting results of a Regression Discontinuity Design – Part 1

Regression discontinuity design (RDD) is a great tool for going beyond descriptive statistics and moving into causal inference. I’m assuming the reader knows the theory, assumptions, advantages and weakness of the method. Below I’ll just give a few, basic tips for how to present results and share code that graphs it. The interesting thing about RDD is that its a very visual estimation technique. You can see and, therefore, present to your audience what are the results in a very intuitive way. Here I’m proposing  that you present your RDD estimation using three types of graphs. I present the first, and simplest, one in this post. The next ones are slightly more elaborate.

In short, RDD is about evaluating whether an outcome variable changes significantly after an independent variable changes its value. There is third element: the running variable that defines the value of the independent variable (0 or 1 in the sharp RDD case, which is the only one I’m dealing with here). If the running variable is below a certain value – the cutoff point –  the independent variable takes the value of zero (the observations in this ‘area’ are the control group). If above, the independent variable is one (treated group).

We can show all the behavior of these three variable in a simple bidimensional scatter plot in which the y-axis is the outcome variable, the x-axis is the running variable and there is a vertical line signaling the cutoff point. I.e., this vertical line divides the scatter plot in half: the control group is plotted in the left-hand side, treated group in the right-hand side.

Therefore, this graph, very standard in RDD presentation, should look like the one below. Instead of a linear regression line, fit in a loess line to show the central tendency of the observations in each group. This way you will better visualize the functional form of the data and avoid imposing a linear trend while the data could actually be in, for example, a quadratic form. Below a very simple code that graphs the plot above. We first create some fake data for easier replication.

```##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10),  rnorm(500, mean = 70, sd = 10))

#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)

##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]

#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
data <- data.frame(cbind(outcome, running.var
```

Now the plot itself using ggplot2. Notice we need to tweak a bit the legend labels in this graph using the command scale_colour_discrete()

```require(ggplot2)
ggplot(data, aes(running.var, outcome, color = treat.ind)) +
geom_point() + stat_smooth(size = 1.5) +
geom_vline(xintercept=0, linetype="longdash") +
xlab("Running variable") +
ylab("Outcome variable") +
scale_colour_discrete(name="Experimental\nCondition",
breaks=c("0", "1"), labels=c("Control", "Treatment"))
```

Yes, that’s all. Now let us move to something slightly more complex in the next post.