Machine Learning (ML) in Program Evaluation

Big data, and with it Artificial Intelligence (AI), is everywhere nowadays, even in impact evaluation. This blog post is inspired by an EES webinar held by Tom Zimmermann and Marcel Tkacik and intends to provide a basic understanding of how AI, or more precisely ML, can be applied to investigating treatment effects.

First, let’s clarify what we are talking about. ML is a subset of AI: a process that applies mathematical models to help the computer learn from its own experience. These models are created and optimized with the patterns discovered in the data. Organizations utilize ML to recognize objects or faces in pictures, to perform sentiment analysis (identify positive, neutral, or negative mentions in a text), or to predict trends or behavior from the relationships found in the data (which need not be causal). ML is prevalent in sales and marketing, banking and finance, cybersecurity, and transportation, just to mention a few areas.

When we want to show that a treatment was successful, we are always interested in a change. For example, we want to measure the increase in students’ calculation skills as a result of a new educational program in mathematics. Here our goal is to quantify the difference between the situation actually observed after the program and the hypothetical situation that we would have observed without the intervention/treatment (the counterfactual). In other words, we want to establish the causal relationship between an intervention and an outcome (excluding alternative explanations) and obtain a reliable estimate of the sign and the magnitude of the treatment effect.
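For readers who prefer symbols, this goal is often written in the potential-outcomes notation of the causal inference literature. The symbols below are introduced here only for illustration and do not come from the webinar:

```latex
% Y_i(1): outcome of unit i with the treatment
% Y_i(0): outcome of the same unit without the treatment (the counterfactual)
% D_i = 1 if unit i was actually treated, 0 otherwise
\tau_i = Y_i(1) - Y_i(0), \qquad
\mathrm{ATE} = \mathbb{E}\left[ Y(1) - Y(0) \right], \qquad
\mathrm{ATT} = \mathbb{E}\left[ Y(1) - Y(0) \mid D = 1 \right]
```

For each unit we only ever observe one of the two potential outcomes, which is why the individual effect cannot be computed directly and the missing outcome has to be approximated by a suitable comparison.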

Methodologically, there are different approaches to doing this more or less correctly. One way is to observe the same people before and after the intervention (pretest-posttest design, in which we do not consider the hypothetical no-treatment outcome); another is to collect data over time and analyze the trend (interrupted time series design).

An even better method is to assign people to experimental and control groups and compare their outcomes. It is best to use a chance procedure (randomization) when we assign people to the groups, because this gives us a ‘true experiment’, also known as a ‘randomized controlled trial’, with high internal validity. However, this is not always possible due to ethical, logistical, or other reasons. In such cases, we can pre-set a threshold value of a variable to create the groups (regression discontinuity design) or, say, ask the participants to select their preferred group. When we cannot randomize, we perform a so-called quasi-experiment, and selection bias can arise: the control group and the experimental/treatment group(s) might differ from each other in more than one respect (our treatment). For example, people who opt for the same group can be systematically different even without the intervention, which is a problem. Even if we can randomize, the size of the sample is very important, because imbalances can emerge across groups by chance despite randomization.
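To make the selection problem tangible, here is a small simulation sketch. Everything in it is made up for illustration: the `motivation` variable, the effect size of 5 points, and the sign-up rule are assumptions, not data from any real program. It compares the simple difference in group means under a coin-flip assignment with the same difference when motivated students choose the program themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 5.0  # assumed effect of the program on the test score

# Hypothetical confounder: motivation raises the score with or without the program.
motivation = rng.normal(size=n)
baseline = 50 + 10 * motivation + rng.normal(scale=5, size=n)


def naive_difference(treated):
    """Difference in mean outcomes between the treated and the untreated group."""
    outcome = baseline + true_effect * treated
    return outcome[treated == 1].mean() - outcome[treated == 0].mean()


# (a) Randomization: a coin flip decides who takes part.
randomized = rng.integers(0, 2, size=n)

# (b) Self-selection: motivated students are more likely to sign up.
p_signup = 1 / (1 + np.exp(-2 * motivation))
self_selected = (rng.random(n) < p_signup).astype(int)

diff_randomized = naive_difference(randomized)
diff_self_selected = naive_difference(self_selected)

print(f"true effect:          {true_effect:.2f}")
print(f"randomized groups:    {diff_randomized:.2f}")
print(f"self-selected groups: {diff_self_selected:.2f}")
```

Under the coin flip the difference in means should land close to the assumed effect, while under self-selection it mixes the program effect with the head start of the more motivated students.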

One important method that aims to avoid these issues and tries to make the groups as statistically similar as possible is matching. ‘Matched pairs design’ means that the participants are matched with nonparticipants based on key variables, or shared characteristics, relevant to the intervention. This way we can contrast the outcomes of a program’s participants with the outcomes of “comparable” nonparticipants. Matching can work for nonexperimental data too, when we identify statistical twins among observations based on measured characteristics. This is a “data-hungry” method (in the words of Heckman et al.), especially when we have a large number of conditioning variables, because the data must include a sufficient number of untreated units with characteristics that correspond to those of the treated units.

But why can searching for ‘statistical twins’ in our dataset be a solution? Matching rests on the assumption that, after controlling for a sufficient set of factors, treatment assignment is as good as random (Conditional Independence Assumption, CIA). If the CIA holds, it removes selection bias and allows inferences to be made about the average causal effect. In other words, if we have very similar people, we expect very similar outcomes if the treatment has no effect. A difference in the outcome variable is then likely to indicate a treatment effect. There are different matching algorithms we can apply with the help of ML, like propensity score matching or nearest neighbor matching.
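As a rough illustration of how the two ideas can be combined, the sketch below estimates a propensity score with a logistic regression and then pairs each participant with the nonparticipant whose score is closest. The data are simulated, and the two characteristics, the effect of 2.0, and all variable names are assumptions made for this example. It is one simple variant (one-to-one matching with replacement), not a full matching workflow with balance checks or standard errors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 5_000
true_effect = 2.0  # assumed, constant treatment effect

# Two hypothetical measured characteristics that drive both participation and outcome.
X = rng.normal(size=(n, 2))
p_treat = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1] - 0.5)))
treated = rng.random(n) < p_treat
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + true_effect * treated + rng.normal(size=n)

# Step 1: estimate the propensity score, P(treated | X).
model = LogisticRegression().fit(X, treated)
score = model.predict_proba(X)[:, 1].reshape(-1, 1)

# Step 2: for every participant, find the nonparticipant with the closest score
# (1-nearest-neighbour matching with replacement).
nn = NearestNeighbors(n_neighbors=1).fit(score[~treated])
_, idx = nn.kneighbors(score[treated])
matched_controls = y[~treated][idx.ravel()]

# Step 3: average the outcome differences within the matched pairs.
att_matched = np.mean(y[treated] - matched_controls)

# For comparison: difference in means without any matching.
naive = y[treated].mean() - y[~treated].mean()

print(f"true effect:       {true_effect:.2f}")
print(f"naive difference:  {naive:.2f}")
print(f"matched estimate:  {att_matched:.2f}")
```

The matched estimate is an estimate of the effect on the participants. It is only credible if the CIA holds, that is, if the variables in `X` really capture everything that drives both participation and the outcome, which the simulation guarantees by construction and real data do not.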

However, it is a challenge to intuitively identify the set of variables that should enter the model, in other words, the control variables that are strongly related to both the treatment and the outcome and thus are potentially important confounding factors.

Another ML algorithm can help to select the most relevant control variables out of many, possibly including terms that capture nonlinearities and interactions. One such algorithm is the lasso (least absolute shrinkage and selection operator), which was originally formulated for linear regression models and, to put it very simply, uses a penalization term to perform the variable selection.
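For those who like to see the penalization term written out: in its standard textbook form for a linear model, the lasso estimate minimizes the usual sum of squared errors plus a penalty on the absolute size of the coefficients. The notation is mine and only illustrative, and the 1/(2n) scaling is one common convention among several:

```latex
% y_i: outcome, x_{ij}: candidate variable j for unit i, \lambda \ge 0: penalty strength
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\, \beta}
    \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The larger the penalty strength, the more coefficients are set exactly to zero, and the variables whose coefficients survive are the "selected" ones.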

It works best if we have a lot of variables and a complex model. Unlike classical stepwise selection (e.g. forward selection or backward elimination), the lasso performs variable selection and regularization (reducing complexity and preventing overfitting) at the same time, in order to enhance both prediction accuracy and interpretability. The first step is dividing our data into a training set and a test set. After training the model, we predict the response variable for the test set (thus, we use the ML algorithm on data that were not used for training our model) and evaluate the lasso model.
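The steps just described can be sketched with scikit-learn as follows. The data are simulated: 50 candidate variables of which, by assumption, only the first three matter. The choice of `LassoCV` (which picks the penalty strength by cross-validation) is mine, not something prescribed in the webinar.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 500, 50

# 50 hypothetical candidate variables, of which only the first three matter.
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=n)

# Step 1: split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The penalty acts on coefficient size, so the variables are put on a common scale
# first (the scaler is fitted on the training set only).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 2: fit the lasso; the penalty strength is chosen by 5-fold cross-validation
# within the training set.
lasso = LassoCV(cv=5, random_state=0).fit(X_train_s, y_train)

# Step 3: variables with a non-zero coefficient are the "selected" ones.
selected = np.flatnonzero(lasso.coef_)

# Step 4: evaluate on data the model has not seen (R^2 on the test set).
r2_test = lasso.score(X_test_s, y_test)

print("selected variables:", selected)
print(f"test R^2: {r2_test:.2f}")
```

With a cross-validated penalty the lasso tends to keep the truly relevant variables and often a few irrelevant ones as well, so the selected set should be read as a shortlist rather than a verdict.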

The lasso is one such method; others, like random forests or neural networks, are also available. Flexible ML methods are particularly well suited to estimation in very high-dimensional cases. However, Chernozhukov and his colleagues (2018) argue that naive inference based on a direct application of ML methods to estimate the causal parameter is generally invalid, since in these cases we introduce a bias that is called regularization bias. They propose double/debiased ML for eliminating this bias. It builds on the idea of the so-called ‘post-double selection’ procedure (Belloni et al., 2014): the controls are used to predict both the outcome and the treatment, not the outcome alone. The method can be used, for example, in a partially linear setting, where the treatment effect enters linearly into the outcome equation and the treatment is exogenous given the controls.
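To give a feeling for the mechanics in the partially linear setting, here is a heavily simplified sketch of the "partialling-out" idea with cross-fitting. It is my own minimal illustration on simulated data, not the authors' implementation: the continuous treatment `d`, the lasso as the ML learner, the effect of 1.0, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, p = 2_000, 50
true_effect = 1.0  # assumed coefficient of the treatment in the outcome equation

# Hypothetical data: many candidate controls, a few of which confound the treatment.
X = rng.normal(size=(n, p))
d = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # treatment intensity
y = true_effect * d + 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)  # outcome

# Step 1 (cross-fitting): predict outcome and treatment from the controls, always
# predicting for a fold with a model that was trained on the other folds.
y_hat = cross_val_predict(LassoCV(cv=5), X, y, cv=5)
d_hat = cross_val_predict(LassoCV(cv=5), X, d, cv=5)

# Step 2: keep only the part of each variable that the controls cannot explain.
y_res = y - y_hat
d_res = d - d_hat

# Step 3: regress the outcome residuals on the treatment residuals.
theta = np.sum(d_res * y_res) / np.sum(d_res**2)

# Heteroskedasticity-robust standard error of the residual-on-residual slope.
eps = y_res - theta * d_res
se = np.sqrt(np.sum(d_res**2 * eps**2)) / np.sum(d_res**2)

# For comparison: regressing the outcome on the treatment and ignoring the controls.
naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

print(f"true effect:        {true_effect:.2f}")
print(f"naive slope:        {naive:.2f}")
print(f"double ML estimate: {theta:.2f} (se {se:.3f})")
```

The point of predicting the treatment as well as the outcome is that small mistakes of the ML learner in either prediction no longer feed directly into the effect estimate. Predicting each observation with a model that has not seen it guards against overfitting.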

By now, dear reader who is still with me, you can perhaps see the potential of ML in the field of evaluation, and perhaps also how deep the rabbit hole goes methodologically. Since this is a very (very-very) complex topic that raises more and more problems as you dive deeper, this post stops here and, in keeping with the blog genre, remains a short appetizer for always-curious researchers rather than a guideline or learning material. Below, however, I recommend some literature worth reading for those who would like to get familiar with the mentioned methods and the related concepts.


I would like to thank Prof. Dr. Tom Zimmermann (University of Cologne) and Marcel Tkacik (Prague University of Economics and Business) for sharing their valuable insights into the concept of causal machine learning, double/debiased ML, and the exciting use cases for its application during their webinar held on the 24th of May, 2022, as part of the EES TWG 6 - Methods and Design Webinar Series. This was a great introduction for me and an inspiration to keep deepening my knowledge in this field. I would also like to thank EES for accepting my registration for this exciting online event, which was originally limited to TWG 6 members.

Recommended literature:

Angrist, J. D., & Pischke, J. S. (2014). Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press.

Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608-650.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.

Duflo, E., Dupas, P., & Kremer, M. (2011). Peer effects, teacher incentives, and the impact of tracking: Evidence from a randomized evaluation in Kenya. American Economic Review, 101(5), 1739-1774.

Heckman, J. J., Ichimura, H., & Todd, P. (1998). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65(2), 261-294.