Feature Selection for Causal Inference through Active Learning

Potential supervisors: 

We study feature selection for causal inference from observational data. Understanding the effects of medical treatments and how they vary between individual patients is crucial for personalizing healthcare. Traditionally, the benefits of one treatment over another have been studied in randomized clinical trials. Such trials have several limitations: they are costly and time-consuming to perform, their results typically apply to only a biased subset of patients, and they should only be used when all treatment alternatives are ethical to give—for example, the effects of smoking should not be studied by encouraging patients to smoke. In settings where these limitations rule out the use of randomized trials, we can learn from observational—non-experimental—data, recorded passively in the healthcare system.

In observational data, we do not control the treatment policy, and our inferences about the effects of treatments may be confounded by factors that influence both historical treatment choices and outcomes. To avoid this, we must adjust our estimates for such factors, which we denote X, effectively comparing patients who were similar in X but were given different treatments. Often, we do not know a priori which factors to adjust for, and the full set of available variables may be challenging to work with for statistical or interpretability reasons. Instead, we would like to identify the smallest set of factors that allows us to de-confound estimates of causal effects. One sufficient set consists of all direct causes of the treatment, X*⊂X; this is the set we wish to identify.
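To illustrate why adjustment matters, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: a single binary confounder x, the treatment probabilities, and a true treatment effect of +1.0. It compares a naive difference in mean outcomes with an estimate that compares treated and untreated patients within each level of x.

```python
import random

def simulate(n=20000, seed=0):
    """Synthetic patients: a binary confounder x raises both the
    probability of treatment and the outcome (assumed toy model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        t = int(rng.random() < (0.8 if x else 0.2))
        # Assumed ground truth: treatment adds 1.0, the confounder adds 2.0.
        y = 1.0 * t + 2.0 * x + rng.gauss(0, 1)
        rows.append((x, t, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_effect(rows):
    """Difference in mean outcome between treated and untreated patients."""
    treated = mean(y for _, t, y in rows if t == 1)
    untreated = mean(y for _, t, y in rows if t == 0)
    return treated - untreated

def adjusted_effect(rows):
    """Compare treated and untreated within each level of x,
    then average the differences weighted by stratum size."""
    total = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        diff = (mean(y for _, t, y in stratum if t == 1)
                - mean(y for _, t, y in stratum if t == 0))
        total += diff * len(stratum) / len(rows)
    return total

rows = simulate()
naive = naive_effect(rows)        # biased upward by the confounder
adjusted = adjusted_effect(rows)  # close to the assumed true effect of 1.0
```

In this toy model the naive estimate mixes the effect of the treatment with the effect of x, while the stratified estimate recovers the assumed effect up to sampling noise. With many candidate factors, stratifying on all of them quickly becomes infeasible, which is one motivation for finding a small adjustment set.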

In general, we cannot identify X* from observational data alone without strong assumptions. However, if we could intervene on a feature X(i) by setting its value to x’(i) and observe the resulting change in treatment policy, we could distinguish between association and causation. For example, a doctor who examines a medical record with blood pressure=140/90, age=52, height=180, sex=male, BMI=25 may recommend lifestyle changes aimed at weight loss. If we could alter the record by setting BMI=19 and ask the doctor to recommend a treatment for this perturbed patient, we could observe whether an intervention on the BMI variable causes a change in treatment. We assume for now that we can perform such experiments.
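The record-perturbation experiment can be sketched as follows. The rule inside `physician_oracle` is a hypothetical stand-in for a doctor's policy, and the function names and the split of blood pressure into `systolic` and `diastolic` fields are likewise assumptions made for illustration.

```python
def physician_oracle(record):
    """Hypothetical treatment policy standing in for a physician."""
    if record["systolic"] >= 140 and record["bmi"] >= 25:
        return "lifestyle changes"
    if record["systolic"] >= 140:
        return "medication"
    return "no treatment"

def intervene(record, feature, value):
    """Return a copy of the record with one feature set to a new value."""
    perturbed = dict(record)
    perturbed[feature] = value
    return perturbed

def changes_treatment(oracle, record, feature, value):
    """True if setting `feature` to `value` alters the oracle's recommendation."""
    return oracle(intervene(record, feature, value)) != oracle(record)

record = {"systolic": 140, "diastolic": 90, "age": 52,
          "height": 180, "sex": "male", "bmi": 25}

bmi_matters = changes_treatment(physician_oracle, record, "bmi", 19)        # True
height_matters = changes_treatment(physician_oracle, record, "height", 170)  # False
```

An observed change shows that the feature influences this oracle's decision. A single unchanged recommendation does not rule the feature out, since it may matter only for other records or other intervention values.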

In this project, we consider an active learning paradigm in which feature interventions are performed sequentially to identify the set X*. Our goal is to identify X* with as few interventions as possible, assuming that we have access to an oracle for the treatment assignment function (e.g., a physician). Based on the feedback from the oracle, we update our current belief about which features are relevant. We then use the updated belief to choose the next intervention and, based on the new feedback, update the belief again. After a sufficient number of interventions and rounds of feedback, we expect to obtain a reliable belief over the features, which can be used to identify those most relevant to treatment assignment.
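One possible instantiation of this loop is sketched below. It is not the algorithm the project is meant to develop. It is a simple baseline under strong assumptions: the oracle is noise-free, each feature has an independent prior probability of 0.5 of being a direct cause, and a random intervention on a true cause changes the treatment with probability at least `detect_rate`. The toy oracle, the feature ranges, and all function names are hypothetical.

```python
import random

# Hypothetical feature ranges for sampling records and intervention values.
RANGES = {"age": (20, 80), "height": (150, 200),
          "bmi": (17, 33), "systolic": (100, 180)}

def toy_oracle(record):
    """Assumed stand-in policy: only systolic pressure and BMI matter."""
    if record["systolic"] >= 140:
        return "medication"
    if record["bmi"] >= 25:
        return "lifestyle changes"
    return "no treatment"

def sample_record(rng):
    return {f: rng.uniform(lo, hi) for f, (lo, hi) in RANGES.items()}

def update_belief(p, changed, detect_rate):
    """Bayes update of P(feature is a direct cause) after one intervention."""
    if changed:
        return 1.0  # a noise-free oracle reacts only to features it uses
    # No change: likelihood is (1 - detect_rate) for a cause, 1 for a non-cause.
    return p * (1 - detect_rate) / (p * (1 - detect_rate) + (1 - p))

def identify_causes(oracle, features, budget=200, detect_rate=0.1,
                    threshold=0.02, seed=0):
    rng = random.Random(seed)
    belief = {f: 0.5 for f in features}
    queries = 0
    while queries < budget:
        undecided = [f for f in features if threshold < belief[f] < 1.0]
        if not undecided:
            break
        # Uncertainty sampling: pick the feature whose status is least certain.
        feature = min(undecided, key=lambda f: abs(belief[f] - 0.5))
        record = sample_record(rng)
        perturbed = dict(record)
        perturbed[feature] = rng.uniform(*RANGES[feature])
        changed = oracle(perturbed) != oracle(record)
        belief[feature] = update_belief(belief[feature], changed, detect_rate)
        queries += 1
    selected = {f for f in features if belief[f] == 1.0}
    return selected, belief, queries

selected, belief, queries = identify_causes(toy_oracle, list(RANGES))
```

Features are dropped once their belief falls below `threshold`, so the number of queries spent on each irrelevant feature is governed by `detect_rate` and `threshold`. A more refined strategy would also choose which record and which intervention value to query, rather than sampling them at random.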

Project goals:

  • Formalize the problem of identifying the set of features that cause treatment assignment using active learning
  • Design an algorithm that identifies these features given access to an oracle
  • Analyze the conditions under which we can recover the true set of such features
  • Perform experiments using synthetic data to validate correctness and sample complexity



Qualifications:

  • Studies in computer science, physics or mathematics
  • Courses in machine learning and AI
  • Courses in statistics/math
  • Good programming skills (preferably in Python)
  • Motivated, creative and focused, with good problem-solving skills

Number of students: 1-2 (preferably two)

Contact: Fredrik Johansson, fredrik.johansson(at)chalmers.se

Date range: October 2020 to February 2024