Identification of Video Content via Packet Trace Analysis: An ML Approach

Potential supervisors: 

Motivation: Privacy protection has been a growing customer concern over the past decade, particularly since the advent of GDPR-like legislation. The increasing availability of data and large improvements in machine learning technologies have made identifying users and their behaviors much easier, while requiring ever less data. This has led to numerous well-known cases of user and content identification using, among others, GPS traces, comments on IMDb, "anonymized" medical records and, more recently, network packet traces. The latter is a particularly interesting case: packet traces are hard to control, can easily be captured or leaked, and demonstrate to what extent one can learn about someone's identity and behaviors based solely on encrypted network traffic. This thesis explores one particular use case as an example of such an identification process, namely video content identification.

Description of the thesis: Although Netflix recently decided to use HTTPS to protect the privacy of its users, Reed and Kranch demonstrated that, using only traffic analysis of TCP/IP header data, any video being watched can be correctly identified within minutes. Their techniques, based on a simple and clever pre-processing of the full video database, depend heavily on technical details of the streaming service being used. Following ML approaches (e.g., CNNs) used in other identification problems based on time series (such as packet traces), this thesis explores the feasibility of applying ML techniques and frameworks to identify videos across different streaming services. Alternative classification tasks that could be tackled instead include classifying the genre of a video or the series to which a particular episode belongs. The thesis will focus on video content, but the developed approach could later be applied to the identification of other types of content. The direction of this thesis can easily be shaped by the student's own initiative.
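
To make the idea of treating a packet trace as a time series concrete, the following is a minimal sketch in Python. It is not the method of Reed and Kranch, nor the CNN-based approach the thesis would investigate: it only bins packet sizes into fixed time windows and matches the result against stored per-video series using Pearson correlation, as a simple stand-in for a learned classifier. All names (bin_trace, correlation, identify), the window length and the toy fingerprints are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def bin_trace(packets, window=1.0, n_bins=8):
    """Sum packet sizes per fixed-length time window.

    packets: iterable of (timestamp_seconds, size_bytes) pairs, i.e. the
    kind of information visible in TCP/IP headers without decryption.
    Packets falling outside the first n_bins windows are ignored.
    """
    series = [0.0] * n_bins
    for ts, size in packets:
        idx = int(ts // window)
        if 0 <= idx < n_bins:
            series[idx] += size
    return series


def correlation(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def identify(series, fingerprints):
    """Return the title whose stored series correlates best with the observed one."""
    return max(fingerprints, key=lambda title: correlation(series, fingerprints[title]))


# Toy data (hypothetical): two "videos" with opposite traffic patterns.
fingerprints = {
    "video_a": [100, 900, 100, 900],
    "video_b": [900, 100, 900, 100],
}
observed = [(0.2, 60), (0.7, 50), (1.1, 800), (2.5, 120), (3.3, 850)]
best_match = identify(bin_trace(observed, window=1.0, n_bins=4), fingerprints)
```

In the thesis itself, the binned series would more plausibly serve as input to a trained model, and the correlation step would be replaced by that model's prediction.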

Date range: 
October, 2020 to October, 2025