Optimization and Approximation Algorithms for Robust Statistical Inference (several projects)

Potential supervisors: 
Description: 

Standard methods in statistical analysis are often based on the assumption that data values are drawn independently from some underlying distribution, and they may perform poorly when this assumption is violated. However, there are natural settings in which data values are strongly correlated with sample membership, so that distributional assumptions (e.g., independence of data values) can lead to inaccurate results. For example, the tendency to vote for a particular candidate in an election may be correlated with willingness to answer telephone surveys. Similarly, standard contact tracing involves testing all social contacts of a person with a contagious disease, a practice that clearly introduces correlation between the likelihood of being tested and the chance of actually having contracted the disease.
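
To make the phenomenon concrete, the following minimal simulation shows how the standard sample mean becomes biased when the probability of being sampled grows with the data value. It is illustrative only: the logistic inclusion model and all parameters are our own choices, not taken from the literature.

```python
# Minimal sketch (hypothetical inclusion model): the naive sample mean is
# biased when data values are correlated with sample membership.
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
values = rng.normal(loc=0.0, scale=1.0, size=n)  # true population mean is 0

# Inclusion probability grows with the data value, mimicking settings where,
# e.g., supporters of one candidate are more willing to answer surveys.
p_include = 1.0 / (1.0 + np.exp(-values))        # logistic link (assumption)
sampled = rng.random(n) < p_include

print(f"true population mean: {values.mean():+.3f}")
print(f"naive sample mean:    {values[sampled].mean():+.3f}")  # noticeably biased
```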

Recently, Chen, Valiant, and Valiant [CVV20] introduced a general framework to capture statistical estimation in such difficult settings. They assume that the data values may be arbitrarily correlated with the likelihood of being sampled, but that the estimation algorithm is able to leverage knowledge of the randomized data collection process. They showed, via a connection to the Little Grothendieck Problem, that semi-definite programming algorithms can be used to compute estimators that outperform standard estimators in a variety of situations where data values and sample membership are correlated.
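
To give a flavour of the optimization involved, here is a sketch of the standard semi-definite programming relaxation of one form of the Little Grothendieck problem (maximize x^T A x over x in {-1, +1}^n for positive semidefinite A), followed by Gaussian hyperplane rounding. This illustrates the connection mentioned above; it is not the estimator of [CVV20], and the random instance and solver choices (cvxpy with its default SDP solver) are our own.

```python
# Sketch: SDP relaxation of max x^T A x over x in {-1,+1}^n, A PSD.
# The relaxation replaces the rank-one matrix xx^T by a PSD matrix X
# with unit diagonal; hyperplane rounding recovers a feasible +-1 vector
# (for PSD A this rounding loses at most a 2/pi factor in expectation).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n = 20
B = rng.normal(size=(n, n))
A = B @ B.T / n                       # a random PSD instance (assumption)

X = cp.Variable((n, n), PSD=True)
problem = cp.Problem(cp.Maximize(cp.trace(A @ X)), [cp.diag(X) == 1])
problem.solve()

# Gaussian (hyperplane) rounding: sample g ~ N(0, X) and take signs.
L = np.linalg.cholesky(X.value + 1e-6 * np.eye(n))  # jitter for numerics
x = np.sign(L @ rng.normal(size=n))

print("SDP upper bound:", problem.value)
print("rounded value:  ", x @ A @ x)
```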

Project Overview: In this project the student(s) will explore the framework of [CVV20] in one or more of the following directions.

  • The semi-definite programming algorithm used in the original paper is quite general, but also somewhat impractical due to high computation time and memory consumption. By focusing on a more specific statistical problem it may be possible to design a provably more efficient estimation algorithm.
  • It would be interesting to develop a mathematical understanding of what properties of the randomized data collection process allow for low estimation error despite the correlation of the data values.
  • Additionally, the project may evaluate the original semi-definite programming algorithm (or any algorithms newly designed for this project) on a real-world dataset.

The project is intended to be primarily theoretical/mathematical, i.e. the goal will be to prove theorems about statistical estimation algorithms in this setting. However, depending on the direction the project takes, there may be a more practical component implementing the algorithms (probably in Python) to see how they perform on real-world data.

References
[CVV20] Justin Y. Chen, Gregory Valiant, and Paul Valiant. Worst-case analysis for randomly collected data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), virtual, December 2020.

Date range: 
October, 2021 to October, 2024