Automatic mapping of text reuse as a tool for philology

Potential supervisors: 
Research groups/keywords: 
Description: 

Methods for automatic detection of "text reuse" are regularly used to spot plagiarism, e.g. in the URKUND system used by the University of Gothenburg. Much less attention has been given to this methodology as a tool for philological analysis. The purpose of the project is to develop a work environment for the detection and analysis of text reuse that is useful to (computer-proficient) humanistic scholars.

The project consists of the following steps, each presenting its own challenges:
1. To implement an algorithm for identification of text reuse that is useful for answering humanistic questions. This entails answering questions such as: What is text reuse? How similar must two chunks of text be for them to count as “reuse”? How long must they be? Algorithmic choices will depend on how these questions are answered. The algorithm should also be reasonably flexible, allowing for identification of different “kinds” of text reuse. This step also involves:

  • choice of programming language
  • implementation
  • system design that allows for flexible use of the algorithm
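
To make the questions in step 1 concrete, the following is a minimal sketch of one possible approach, not the algorithm the project prescribes. It assumes that chunks are compared by the overlap of their word n-grams (Jaccard similarity). The function names and the parameters `n`, `min_similarity` and `min_tokens` are hypothetical. They stand in for the open questions "how similar?" and "how long?".

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams (shingles) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 if both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_reuse(chunk_a: str, chunk_b: str, n: int = 3,
             min_similarity: float = 0.5, min_tokens: int = 5) -> bool:
    """Decide whether two chunks count as reuse (all thresholds are assumptions)."""
    tokens_a, tokens_b = tokenize(chunk_a), tokenize(chunk_b)
    # "How long must they be?" -> reject chunks shorter than min_tokens.
    if min(len(tokens_a), len(tokens_b)) < min_tokens:
        return False
    # "How similar must they be?" -> compare n-gram overlap with a threshold.
    similarity = jaccard(shingles(tokens_a, n), shingles(tokens_b, n))
    return similarity >= min_similarity
```

Exposing the thresholds as parameters is one way to get the flexibility asked for above: a stricter setting (larger `n`, higher `min_similarity`) approximates verbatim quotation, while a looser one also admits paraphrase-like reuse.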

2. To create an interface to the implementation that is user-friendly enough to be used by a reasonably computer-proficient humanist. This step involves understanding the humanistic research process and exploring how identification of text reuse can complement traditional humanist methods. To make identification of text reuse useful, it is probably necessary to:

  • design tools for suitable pre-processing of text
  • make it easy to run the algorithm with different parameter values, possibly in “batch mode”
  • design tools for exploration of results, e.g. comparing different versions of a text, visualizing the differences between them, and visualizing where different kinds of text reuse occur
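
As an illustration of the last point, comparing versions of a text and showing their differences, here is a small sketch based on `difflib.SequenceMatcher` from the Python standard library. The function name `mark_differences` and the `[-...-]` / `{+...+}` markup are assumptions chosen for the example. A real interface would more likely render the differences graphically.

```python
import difflib
from typing import List


def mark_differences(old: str, new: str) -> str:
    """Return a word-level diff with [-deleted-] and {+inserted+} segments."""
    a, b = old.split(), new.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(mark_differences("the world is all that is the case",
                       "the world is everything that is the case"))
# prints: the world is [-all-] {+everything+} that is the case
```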

3. To test the work environment on a few medium-sized corpora consisting of the collected works of some well-known philosophers. These corpora will be provided by the supervisor. This last step will be conducted in collaboration with a humanistic scholar. Possible questions to be studied in this step are:

  • Does this author reuse text? What text?
  • How are reused segments of text modified? In several steps?
  • How do the contexts of the reused segments of text differ?
  • What is the “structure of text reuse” in the corpus?
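
One possible reading of the “structure of text reuse”, offered here only as an assumption, is a graph in which passages are nodes and detected reuse relations are edges. The sketch below groups passages into connected components of such a graph, so that chains of reuse (A reused in B, B reused in C) end up in one cluster. The function name and the passage identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def reuse_clusters(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group passage ids into connected components of the reuse graph."""
    # Build an undirected adjacency map from the detected reuse pairs.
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack: List[str] = [start]
        component: Set[str] = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical passage ids: work letter plus passage number.
detected = [("A1", "B7"), ("B7", "C2"), ("D4", "E9")]
print(reuse_clusters(detected))
```

The size and shape of such clusters could then be inspected together with the humanistic scholar, for instance to see whether a passage is reused once or travels through several works.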

This project requires:

  • Skills in system design and programming
  • An interest in humanistic scholarship
  • An interest in language/linguistics

The project will be supervised by Civ. Ing., PhD. Sverker Lundin (sverker.lundin@gmail.com) in collaboration with a researcher from the department of Computer Science and Engineering.

Date range: 
November, 2018 to November, 2023