Machine learning meets social science: causal policy evaluation with text

Potential supervisors: 

Background. Policy evaluation that relies on text data faces two challenges [Angrist and Pischke 2008]. The first challenge is how evaluators should transform the full text of a policy document into inputs for statistical models [Denny and Spirling 2017; Egami et al. 2017; Fong and Grimmer 2016]. Evaluators face a tradeoff between retaining as much meaning as possible from the text and reducing complexity so that their statistical models produce interpretable results [Egami et al. 2017]. This tradeoff leads researchers into the garden of forking paths [Gelman and Loken 2014]: the statistical results depend on the selected representation of the text. For example, the International Monetary Fund's (IMF) policy programs contain a set of policies, from privatization of hospitals to liberalization of a country's trade, that evaluators commonly transform into dummy variables (the presence versus absence of a policy) in statistical models. This transformation of full text into dummies makes the statistical model highly interpretable but can discard vital information. This project will identify the sources of bias in the transformation of policy text and determine how these biases affect statistical inference, focusing on IMF research.

Practical work. This project is suitable for students with some background in machine learning, statistical (causal) inference, text mining, and/or natural language processing (NLP). A background in social science is not required, but curiosity about cross-disciplinary scientific research would be appreciated.

In the project, you will first combine modern machine learning techniques for NLP to extract the relevant pieces of text from the larger document in preparation for policy evaluation. To do this, you will apply a supervised learning approach, building on a dataset of policy texts that has been hand-annotated by a research team at the University of Cambridge.
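As a rough illustration of the supervised extraction step, the sketch below trains a small bag-of-words naive Bayes classifier to flag passages that contain a policy action. It is only a toy under stated assumptions: the example sentences and labels are invented and do not come from the annotated dataset, the class name NaiveBayesPassageClassifier is hypothetical, and a real project would likely use a stronger NLP model than naive Bayes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesPassageClassifier:
    """Multinomial naive Bayes over bag-of-words with add-one smoothing."""

    def fit(self, passages, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(passages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_score(self, text, c):
        """Unnormalized log posterior of class c; unseen words are skipped."""
        total = sum(self.word_counts[c].values())
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        for word in tokenize(text):
            if word in self.vocab:
                smoothed = (self.word_counts[c][word] + 1) / (total + len(self.vocab))
                score += math.log(smoothed)
        return score

    def predict(self, text):
        return max(self.classes, key=lambda c: self.log_score(text, c))


# Invented toy annotations: 1 = passage contains a policy action, 0 = it does not.
train = [
    ("The government will privatize state-owned hospitals by June.", 1),
    ("The authorities will liberalize trade tariffs on imports.", 1),
    ("The central bank will raise the policy interest rate.", 1),
    ("The government will reduce subsidies on fuel.", 1),
    ("Economic growth slowed during the previous year.", 0),
    ("The mission met with officials in the capital.", 0),
    ("Inflation was higher than expected last quarter.", 0),
    ("The staff report summarizes recent developments.", 0),
]

clf = NaiveBayesPassageClassifier().fit(
    [text for text, _ in train], [label for _, label in train]
)

pred_policy = clf.predict("The authorities will privatize the national airline.")
pred_other = clf.predict("Inflation was higher in the previous quarter.")
print(pred_policy, pred_other)
```

Passages predicted as 1 would then form the "policy action" text that later steps treat as one possible representation of the treatment.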

You will then evaluate how different versions of the transformed policy text affect statistical inference on different outcomes. You will use policy as a treatment represented in at least the following four ways: (1) the raw full text; (2) only the text passages containing a policy action; (3) an ordinal indicator of the number of policies contained in the raw text; (4) an indicator (dummy) of the presence or absence of a policy program. You will evaluate the effect of these four treatment representations on outcomes such as a country's economic growth, democratization, and health spending. To conduct this statistical evaluation, you are advised to use matching methods or machine learning in the service of causal inference. You will design a suitable evaluation methodology to identify how well your approach performs.
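To make representations (3) and (4) and the matching idea concrete, here is a minimal sketch. It derives a policy count and a dummy from a list of extracted policy passages and then estimates the effect of the dummy treatment on growth with one-nearest-neighbour matching on a single covariate. Everything in it is an assumption made for illustration: the records, the field names (policies, baseline, growth), and the choice of a single matching covariate are invented and do not reflect the project's data or a recommended design.

```python
from statistics import mean

# Invented toy records, one per country-year. "policies" stands in for the
# extracted policy passages, "baseline" for a pre-treatment covariate, and
# "growth" for the outcome.
records = [
    {"unit": "A", "policies": ["privatize hospitals", "liberalize trade"],
     "baseline": 1.0, "growth": 2.0},
    {"unit": "B", "policies": ["cut fuel subsidies"],
     "baseline": 2.0, "growth": 3.0},
    {"unit": "C", "policies": [], "baseline": 1.1, "growth": 3.0},
    {"unit": "D", "policies": [], "baseline": 2.2, "growth": 3.5},
    {"unit": "E", "policies": [], "baseline": 5.0, "growth": 6.0},
]


def represent(record):
    """Build the count (3) and dummy (4) representations of the treatment."""
    count = len(record["policies"])
    return {"count": count, "dummy": int(count > 0)}


def att_nearest_neighbor(rows):
    """Effect on the treated via 1-nearest-neighbour matching with
    replacement on the `baseline` covariate."""
    treated = [r for r in rows if r["dummy"] == 1]
    controls = [r for r in rows if r["dummy"] == 0]
    diffs = []
    for t in treated:
        match = min(controls, key=lambda c: abs(c["baseline"] - t["baseline"]))
        diffs.append(t["growth"] - match["growth"])
    return mean(diffs)


for r in records:
    r.update(represent(r))

# Naive comparison: difference in mean outcomes, ignoring the covariate.
naive = (mean(r["growth"] for r in records if r["dummy"] == 1)
         - mean(r["growth"] for r in records if r["dummy"] == 0))

# Matched comparison: each treated unit is compared with its closest control.
att = att_nearest_neighbor(records)

print(f"naive difference: {naive:.2f}, matched estimate: {att:.2f}")
```

On this toy data the matched estimate is -0.75, while the naive difference is about -1.67, because unit E has no comparable treated unit and inflates the control mean. The same comparison could be repeated with the count in place of the dummy to see how the choice of representation changes the estimate.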

Recommended courses:

  • One or more courses in machine learning, such as Applied Machine Learning (DAT340/DIT865) or Algorithms for Machine Learning and Inference (TDA231/DIT381).
  • Some background in statistical inference.

Contact: Richard Johansson, Department of Computer Science and Engineering; Adel Daoud, Department of Computer Science and Engineering


Angrist, Joshua D. and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Denny, Matthew and Arthur Spirling. 2017. Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It. SSRN Scholarly Paper. ID 2849145. Rochester, NY: Social Science Research Network.

Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2017. How to Make Causal Inferences Using Texts. arXiv preprint arXiv:1802.02163.

Fong, Christian and Justin Grimmer. 2016. Discovery of Treatments from Text Corpora. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Mozer et al. 2020. Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality. Political Analysis.

Feder et al. 2021. Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. arXiv preprint.

Date range: October 2021 to October 2024