Adding meta-information to UML models using machine learning

Potential supervisors: 


Studying UML class models is needed to understand their effectiveness, and empirical studies of UML require large numbers of such models. The author, together with a colleague based in Madrid, is collecting a 'big data' set of UML models; for details and progress of this project see

We collect these UML models by crawling various internet resources, among them online code repositories such as GitHub. We search both for images (png, jpg, bmp) and for files in UML-tool formats.
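As a rough illustration of the search described above, the following Python sketch filters a list of repository file paths down to candidate UML artifacts. The image extensions come from the text; the UML-tool extensions (`.uml`, `.xmi`) and the function names are assumptions made for this example, not the project's actual crawler.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple

# Image extensions named in the text above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".bmp"}
# Assumption: example UML-tool extensions; the text does not list the formats.
UML_TOOL_EXTENSIONS = {".uml", ".xmi"}


def classify_candidate(path: str) -> Optional[str]:
    """Return 'image', 'uml_file', or None for a repository file path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in UML_TOOL_EXTENSIONS:
        return "uml_file"
    return None


def select_candidates(paths: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only the paths that might hold a UML model, with their kind."""
    result = []
    for path in paths:
        kind = classify_candidate(path)
        if kind is not None:
            result.append((path, kind))
    return result


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = ["docs/design.PNG", "src/main.py", "model/shop.xmi"]
    for path, kind in select_candidates(example):
        print(kind, path)
```

An image that passes this filter is only a candidate: deciding whether it actually shows a UML diagram needs a further step.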

The online dataset now contains about 800 models, and we have already collected an additional 20,000.

In order to enrich this dataset we would like to add meta-information, such as:

  • What type of project does the model come from? In particular, is it a student project or a 'real' project? This question is very relevant for all empirical studies of open source projects.
  • Is the UML model a forward-designed model or a reverse engineered model?
  • Is the UML diagram a teaching example?
  • What is the quality of the layout of the model, and can layout algorithms be learned?
  • Are there common subgraphs ("patterns" but not the classical ones) across different UML designs?
  • What is the language used in the diagram?
  • Which tool was used to create the diagram?

The aim is to use machine learning approaches to answer these questions.
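To make the idea concrete, here is a minimal sketch of one possible approach to a question such as "forward-designed or reverse engineered?": a nearest-centroid classifier written in plain Python. It is not the project's method. The two features (number of classes, average attributes per class), the labels, and all numbers are invented for illustration and make no claim about real models.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples):
    """Compute the mean feature vector per label from (features, label) pairs."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical toy data, NOT taken from the dataset:
# (number of classes, average attributes per class) -> label
TRAINING = [
    ((8, 2.0), "forward"),
    ((12, 3.0), "forward"),
    ((150, 9.0), "reverse"),
    ((210, 11.0), "reverse"),
]

if __name__ == "__main__":
    centroids = fit_centroids(TRAINING)
    print(predict(centroids, (15, 2.5)))
```

In practice the features would have to be extracted from the model files or images, and a real study would need labelled examples and a proper evaluation.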

Skills that we look for:

- Students must have reasonable practical skills (programming, statistics, etc.),

- familiarity with machine learning,

- image recognition/processing skills are a plus.

Location: Gothenburg (or Madrid)

Contact: Michel R. V. Chaudron,