Sunday, May 9, 2010

Clustering standard errors.

My dissertation's puzzle
Observation 1: The provision of public goods - roads, wells, schools, etc. - is crucial for economic development. Some villages in developing countries are able to provide them, while others are not. There is a large literature that asks: "Why?" The answer most often given is "(ethnic) diversity". The usual arguments are that diverse villages have a harder time sanctioning non-contributors, that preferences for public goods can differ across groups, and a million other reasons.

Observation 2: Migration is one of the principal characteristics of the developing world, especially of conflict zones. In Eastern Congo, for example, almost two-thirds of all people were displaced by war at least once between 1996 and 2007. There is, however, no work that looks at the impact of migration on public goods provision.

Puzzle: Intuitively one would expect migration to lead to less public goods provision: the arrival of migrants creates diversity (native versus immigrant), it could create tensions in the village, etc. So I carefully looked at our DRC data and found that villages with migrants actually have more public goods! Controlling for many confounds, this result held across the board: for wells, widening roads, clearing roads, patrols, schools and productive measures.

Dissertation: In my dissertation, as a result, I will ask two main questions: 1) is this a causal story? 2) if yes, how could this be? [1]

Clustered standard errors
When presenting my regression results I got the question: "Did you cluster your standard errors?". This is an important question. The data I used for these regressions come from our baseline survey: in 2007 we surveyed five households in each of over 600 randomly selected villages in Eastern Congo, thereby obtaining information on over 3,000 households and well over 20,000 people.

Running a naive regression on this data is likely to give me wrong results, because the standard errors that come with it assume that each observation is independent of all other observations in the data set. That is unlikely to hold for my data: households in the same village are likely to be more similar on a wide variety of measures than households from different villages. As a result, observations within a village are correlated - this type of correlation is called intraclass correlation. The higher the intraclass correlation, the less unique information each household provides, and the more the naive standard errors understate the true uncertainty; one has to inflate the standard errors to take this correlation into account. I therefore have to run my regression making sure that I cluster the standard errors at the village level (one can also use a multilevel model).
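A rough way to see how much information is lost is the textbook design effect for equal-sized clusters, 1 + (m - 1) * rho, where m is the number of households per village and rho is the intraclass correlation. The sketch below is only an illustration, not the calculation behind my results: it uses a round 3,000 households with five per village, the intraclass correlation values are made up, and the formula is an approximation that fits best for variables that vary mainly at the village level. The function names are my own.

```python
import math


def design_effect(households_per_village: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (households_per_village - 1.0) * icc


def effective_sample_size(n_households: float,
                          households_per_village: float,
                          icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    return n_households / design_effect(households_per_village, icc)


def se_inflation_factor(households_per_village: float, icc: float) -> float:
    """Approximate factor by which naive standard errors should be multiplied."""
    return math.sqrt(design_effect(households_per_village, icc))


if __name__ == "__main__":
    n_households = 3000          # round figure based on the survey description
    households_per_village = 5   # five households surveyed per village

    # Hypothetical intraclass correlations, chosen only for illustration.
    for icc in (0.0, 0.2, 1.0):
        n_eff = effective_sample_size(n_households, households_per_village, icc)
        factor = se_inflation_factor(households_per_village, icc)
        print(f"icc={icc:.1f}  effective n={n_eff:.0f}  SE inflation={factor:.2f}")
```

At one extreme (rho = 0) households are independent and nothing changes; at the other (rho = 1) every household in a village is a copy of its neighbours, and the sample is worth only as many observations as there are villages.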

The answer to the above question was 'yes'; I had clustered my standard errors. Despite the fact that my standard errors - as expected - were substantially higher than they would have been in the naive regression, I found not only that (as the literature suggests) ethnic diversity is negatively correlated with public goods provision, but also that migration (controlling for diversity and many other things) is positively correlated with it. Indeed a puzzle! One that I am very intrigued by, and I hope to figure out in the upcoming years why this could be the case.

[1] My proposal includes a much more detailed discussion - covering endogeneity issues, hypothesizing different mechanisms through which the relationship could be causal, etc. - and I will post it online soonish.

1 comment:

  1. Hi, I just came across your post in a discussion of standard errors. How did you cluster your standard errors at the village level? Did you accomplish this by including controls for each village (like a 0-1 dummy variable for each village)?

    Imagine that you had a dataset with village and month information. Maybe households were surveyed at different times of the year. If you included dummy variables for both village and month, and ran your regression, would this be clustering the errors at the village and month level?
