Data Science Project: Analysis of language variation and change

Data Science Master's Programme, University of Helsinki

Table of Contents

  • Description
  • Links
  • Installation
  • Usage
  • Theory
  • Credits and Licence
  • Backlog

Description

This app allows for the exploration of a corpus of historical letters using data science methods. The app has two sections.

The first is the part of speech, or POS tag, visualisation section. Here there are two tabs, containing bar and line graphs, to give the user a general overview of the dataset. This section contains various options to filter and restrict the dataset to allow the user more freedom in their exploration.

The second part of the app is the topic model section. This allows the user to generate, using the latent Dirichlet allocation algorithm, a chosen number of “topics” from the dataset. When properly filtered and parameterized, this lets the user see which topics dominated the discussion in the letters. The app offers a wide array of options so that the user can adjust the model to their own questions of interest.

Links

App

Go to http://193.166.25.206:8050/app/overview

CLAWS7 Tagset

http://ucrel.lancs.ac.uk/claws7tags.html

Installation / How to get the app working locally

  1. Clone the repository
  2. Create a virtual environment with python3 -m venv venv
  3. Activate virtual environment with source venv/bin/activate
  4. Run pip install -r requirements.txt
  5. Add the data folder TCEECE to the local project root; it is ignored by Git to avoid spreading the data (see the .gitignore file)
  6. Start app with python index.py
  7. Visit http://127.0.0.1:8050/app/overview

Usage

POS Visualisation

Line:

  • Shows the percentage of the chosen categories over time
  • User can select:
    • Year range
    • Period length (10 years, 20 years, ...)
  • User can choose up to three lines to compare; the options for each line are:
    • Sender Sex (M,F)
    • Pre-Made Class Grouping Classifications
      • Fine grained - Royalty (R), Nobility (N), Gentry Upper (GU), Gentry Lower (GL, G), Clergy Upper (CU), Clergy Lower (CL), Professional (P), Merchant (M), Other (O)
      • Regular - Royalty (R), Nobility (N), Gentry (GU, GL, G), Clergy (CU, CL), Professional (P), Merchant (M), Other (O)
      • Tripartite - Upper (R, N, GU, GL, G, CU), Middle (CL, P, M), Lower (O)
      • Bipartite - Gentry (R, N, GU, GL, G, CU), Non-Gentry (CL, P, M, O)
    • Relationship (between sender and recipient)
      • Grouped: Family, Friends, Other relationships
      • Fine grained: Nuclear family, Other family, Family servant, Close friend, Other acquaintance
    • POS-tags
  • User can set a custom name for the graph and for each line

Bar:

  • Shows the number of words, letters, or senders in the data selected in the line graph view
  • The differently-coloured bars correspond to the lines selected in the line graph view
  • Bars can be divided by:
    • Sender's sex
    • Sender's rank
    • Sender's relationship with recipient

Topic model

BASIC PARAMETERS

Number of Topics

  • Number of topics to be generated by the LDA model.

Number of Iterations

  • Maximum number of iterations through the corpus when inferring the topic distribution.

ADVANCED PARAMETERS

Alpha

  • Parameter which determines the prior distribution over topic weights in documents.
  • Auto option: Learns an asymmetric prior from the corpus.

Eta

  • Parameter which determines the prior distribution over word weights in each topic.
  • Auto option: Learns an asymmetric prior from the corpus.

Set Seed

  • Option to choose a starting point for the generation of pseudorandom numbers to be used in the algorithm.
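
Taken together, these parameters map directly onto Gensim's LDA interface. Below is a minimal sketch with a toy corpus (illustrative, not the app's actual data pipeline); note that `alpha="auto"` is only supported by the single-core `LdaModel`, not the parallelized `LdaMulticore` the app otherwise uses:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus; the real app builds this from the preprocessed letters.
docs = [["king", "court", "parliament", "tax"],
        ["plague", "physician", "remedy", "fever"],
        ["king", "parliament", "tax", "court"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,     # "Number of Topics"
    iterations=50,    # "Number of Iterations"
    alpha="auto",     # "Alpha": learn an asymmetric document-topic prior
    eta="auto",       # "Eta": learn an asymmetric topic-word prior
    random_state=42,  # "Set Seed": makes runs reproducible
)
print(lda.print_topics())
```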

FILTERING

POS tag

  • Option to instruct the algorithm to only consider words of the chosen word type.

Stopwords

  • Option to instruct the algorithm to ignore words given in the list.

Filter Below Threshold

  • Option to instruct the algorithm to ignore words appearing in fewer than the selected number of documents.

Filter Above Threshold

  • Option to instruct the algorithm to ignore words appearing in more than the selected proportion of documents. The input is a decimal between 0.01 and 1.
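
In Gensim terms, the stopword and frequency filters correspond to removing tokens before the dictionary is built and to `Dictionary.filter_extremes`. A minimal sketch, with a toy input standing in for the preprocessed letters:

```python
from gensim.corpora import Dictionary

# Toy input; in the app this comes from the preprocessing step.
tokenized_docs = [["good", "brother", "london", "plague"],
                  ["good", "sister", "london", "marriage"],
                  ["good", "cousin", "court", "plague"]]
stopwords = {"good"}  # example of a user-supplied stopword list

# Stopwords: drop the listed tokens before building the dictionary.
filtered_docs = [[tok for tok in doc if tok not in stopwords]
                 for doc in tokenized_docs]

dictionary = Dictionary(filtered_docs)
# "Filter Below Threshold": keep words appearing in at least no_below documents.
# "Filter Above Threshold": drop words appearing in more than no_above
# (a fraction between 0 and 1) of the documents.
dictionary.filter_extremes(no_below=2, no_above=0.9)

corpus = [dictionary.doc2bow(doc) for doc in filtered_docs]
```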

Sex

  • Option to instruct the algorithm to only consider letters from senders of the chosen sex.

Rank

  • Option to instruct the algorithm to only consider letters from senders of the chosen rank.

Relationship

  • Option to instruct the algorithm to only consider letters whose sender and recipient have the chosen relationship.

Time range

  • Option to instruct the algorithm to only consider letters sent between the chosen years.
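
A sketch of how these metadata filters (sex, rank, relationship, time range) might be applied, assuming the letter metadata sits in a pandas DataFrame; the column names here are hypothetical placeholders, not the app's actual schema:

```python
import pandas as pd

def filter_letters(df: pd.DataFrame, sex=None, rank=None,
                   relationship=None, year_range=None) -> pd.DataFrame:
    """Restrict the corpus to letters matching the chosen metadata.

    Column names ("sender_sex", "sender_rank", "relationship", "year")
    are hypothetical, for illustration only.
    """
    if sex is not None:
        df = df[df["sender_sex"] == sex]
    if rank is not None:
        df = df[df["sender_rank"] == rank]
    if relationship is not None:
        df = df[df["relationship"] == relationship]
    if year_range is not None:
        lo, hi = year_range
        df = df[df["year"].between(lo, hi)]
    return df

# e.g. female senders of gentry rank, 1600-1680:
# subset = filter_letters(letters, sex="F", rank="G", year_range=(1600, 1680))
```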

Theory

Topic model

Latent Dirichlet Allocation

Latent Dirichlet allocation, or LDA for short, is an algorithm used for topic modelling in natural language processing. Topics are groups of items, in this case tokens, which belong together due to their usage and prominence in the texts. Topics can be used to explore a corpus by unearthing and classifying the underlying themes present.

Preprocessing of the data plays a large part in obtaining significant results with this method, as the majority of words contribute no information about the topics themselves but serve other purposes, such as conveying the subjects in question or linking together parts of a sentence. In our implementation, preprocessing includes lowercasing, tokenization, lemmatization, and filtering out tokens that consist of only one character. The user can additionally select stopwords to be filtered out of the final data used for model training.
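
A minimal sketch of such a pipeline, using Gensim's `simple_preprocess` (which lowercases, tokenizes, and drops tokens shorter than `min_len` characters) together with NLTK's WordNet lemmatizer; the choice of lemmatizer here is an assumption for illustration, not necessarily the project's actual tooling:

```python
import nltk
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    # Lowercase, tokenize, and drop one-character tokens in one step,
    # then lemmatize each remaining token.
    tokens = simple_preprocess(text, min_len=2)
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(preprocess("My most humble duty remembered unto you"))
```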

In brief, the algorithm works by iterating over the documents. First, each word w in each document is assigned to one of k topics at random. Then conditional probabilities are calculated representing the likelihood that w belongs to each topic, the topic assignments are updated based on these probabilities, and the process repeats.
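
Gensim's actual implementation uses online variational Bayes rather than this sampling scheme (see below), but a toy collapsed-Gibbs-style version of the loop just described makes it concrete; this sketch is purely illustrative:

```python
import random
from collections import defaultdict

def toy_lda(docs, k, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Illustrative Gibbs-style LDA on a list of token lists."""
    random.seed(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [defaultdict(int) for _ in docs]       # topic counts per doc
    topic_word = [defaultdict(int) for _ in range(k)]  # word counts per topic
    topic_total = [0] * k
    z = []  # topic assignment of every word

    # Step 1: assign each word to one of k topics at random.
    for d, doc in enumerate(docs):
        z.append([random.randrange(k) for _ in doc])
        for w, t in zip(doc, z[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Step 2: repeatedly resample each word's topic from the conditional
    # probability that the word belongs to each topic.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                doc_topic[d][t] -= 1   # remove the current assignment
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # P(topic | doc) * P(word | topic), smoothed by alpha and beta
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                t = random.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1   # record the new assignment
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word
```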

The application uses a parallelized version of the LDA algorithm provided by the Gensim library for Python. More information on Gensim's implementation of the algorithm can be found in the Gensim documentation. The article Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation provides more insight into the theoretical basis of the algorithm.

Iterations

Controls how many times the algorithm repeats a certain process, called the E-step, on each document. The E-step is a process during which the optimal values of the “variational parameters” are found for a document. The variational parameters are used to compute a lower bound for the log likelihood of the data, and when optimised will produce the tightest possible lower bound. Then inferences can be made about the log likelihood of the entire data, which is necessary for predicting which words belong to which topics.

Hyperparameters - Alpha and Eta

Alpha - Interpretation

Low alpha means each document is likely to consist of a few, or even one dominant topic. High alpha means each document is likely to consist of a mix of most of the topics.

Eta - Interpretation

Low eta means each topic is likely to be composed of only a few dominant words. High eta means each topic is likely to consist of a mixture of many words.

Ideally, we would like our documents to consist of only a few topics, and the words within those topics to belong to only one or a few of them. Alpha and eta can be adjusted to suit these purposes.

Credits and Licence

This app has been created by a group of students as part of a course in the Data Science Master’s Program at the University of Helsinki. The app was created for, and in collaboration with, the Research Unit for the Study of Variation, Contacts, and Change in English (VARIENG). It allows for the exploration of a corpus of historical letters.
