Feature selection for online classification
In traditional supervised machine learning settings, we are usually given a large set of instances along with their true labels, and we learn a model (or hypothesis) that can predict the class labels of unseen data instances. The efficiency of such a learning algorithm usually depends on the number of instances and on the dimension (number of features) of each datapoint.
There are many methods for reducing the number of features. Some of them derive new features from the existing ones and are known as feature extraction methods. For instance, Principal Component Analysis (PCA) is a popular feature extraction method that is widely used for dimensionality reduction. Feature selection methods, on the other hand, keep a subset of the original features in order to reduce the number of computations. Information gain and mutual information are two popular selection criteria. See Xu et al. (2007) for details.
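To make the idea of mutual-information-based selection concrete, here is a minimal sketch in plain Python. It assumes discrete feature values and labels and estimates probabilities by simple counting; the function names `mutual_information` and `select_top_k` and the toy data are illustrative and are not taken from the cited paper.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Estimate I(X; Y) in bits from paired samples of a discrete feature and label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    count_x = Counter(feature_values)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts
        ratio = count * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log2(ratio)
    return mi


def select_top_k(rows, labels, k):
    """Return the indices of the k features with the highest estimated MI."""
    num_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(num_features)
    ]
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (assumed): feature 0 copies the label, feature 1 is independent
# of the label, and feature 2 is constant.
rows = [[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
labels = [0, 1, 0, 1]
selected = select_top_k(rows, labels, k=1)
```

On this toy data only the first feature carries information about the label, so it is the one that is kept. Real-valued features would need to be discretized first, or a different estimator would have to be used.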
Sometimes, however, we need to learn a classifier in an online manner. Specifically, we receive only a single datapoint or a small subset of the dataset at each timestep, and we are required to improve our model based on the new information. As in the offline setting, the efficiency of the learning algorithm is highly dependent on the number of features. Therefore, the goal of this master's project is to investigate and implement different methods for online feature selection in order to improve online classification. The effectiveness of these methods will be evaluated with different online learning algorithms.
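As a rough illustration of the online setting, the sketch below combines a perceptron-style update with a simple truncation step that keeps only the k largest-magnitude weights. This is just one possible heuristic, chosen here for brevity; it is not the method prescribed by the project. The names `truncate` and `online_feature_selection`, the learning rate, and the toy stream are all assumptions.

```python
def truncate(weights, k):
    """Zero out all but the k largest-magnitude weights."""
    if k >= len(weights):
        return list(weights)
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    keep = set(order[:k])
    return [w if j in keep else 0.0 for j, w in enumerate(weights)]


def online_feature_selection(stream, num_features, k, learning_rate=0.1):
    """Process (x, y) pairs one at a time, with labels y in {-1, +1}.

    Returns the final sparse weight vector and the number of mistakes made.
    """
    weights = [0.0] * num_features
    mistakes = 0
    for x, y in stream:
        score = sum(w * xi for w, xi in zip(weights, x))
        if y * score <= 0:  # wrong prediction, or zero margin
            mistakes += 1
            # Perceptron-style update, then keep at most k active features.
            weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
            weights = truncate(weights, k)
    return weights, mistakes


# Toy stream (assumed): feature 0 determines the label, the rest is small noise.
stream = [
    ([1.0, 0.1, -0.1], 1),
    ([-1.0, 0.1, 0.1], -1),
    ([1.0, -0.1, 0.1], 1),
    ([-1.0, -0.1, -0.1], -1),
] * 5
weights, mistakes = online_feature_selection(stream, num_features=3, k=1)
```

Because the model never holds more than k non-zero weights, each prediction can in principle be computed from k features only, which is the efficiency gain the project is after.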
- Studies in computer science, physics, or mathematics
- Courses in machine learning and AI
- Good programming skills (preferably in Python)
- Motivated, creative, and focused, with good problem-solving skills
Number of students: 1-2 (preferably two)
 Xu, Y., Jones, G. J., Li, J., Wang, B., and Sun, C. (2007). A study on mutual information-based feature selection for text categorization. Journal of Computational Information Systems, 3(3):1007–1012.