Skip to content

an7tang/uom-year3-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Third Year Project

General Information

Author Information:

Author Name: Anqi Tang

Supervisor: Prof. Goran Nenadic

Project Title

Analysis of public perception and emotion of COVID-19 on Twitter

Project Description

Aims

This project is divided into two parts: sentiment analysis and topic modelling, aiming to analyse the sentiment of tweets discussing COVID-19 and to extract most commonly discussed COVID-19 related topics.

Objectives

Project Objective Graph

Table of Content

Data Preparation

Data Source

Data were collected on Twitter using a set of pre-defined keywords: (covid OR covid19 OR covid-19 OR coronavirus OR (corona virus) OR pandemic). The re-tweets were excluded and only English tweets were collected.

The meta data of tweet counts per day were saved in a dedicated file as well.

Data Storage

All data were collected and stored on DataScience Server belonging to the University of Manchester.

Training Data

The labels of training data were manually annotated by the author, Anqi Tang, as the golden standard. OpenAI (chatGPT), VADER and TextBlob were also applied to annotate the training data, which were used as the baseline for comparison.

File List

util/tweets_collector.ipynb : This is the notebook for collecting tweets from Twitter.

util/sentiment_annotator.ipynb : This is the notebook for annotating the sentiment of tweets, using ChatGPT, VADER and TextBlob.

First Part - Sentiment Analysis

Brief Introduction

In the first part of the project, I trained a model to predict whether a given tweets was negative, neutral or positive sentiment based on the text of the review.

Techniquely, I fine-tuned a pre-trained BERT model (e.g. "distilbert-base-uncased" provided by Hugging Face) through adding one extra sequential layer on top of the BERT model using PyTorch. To improve the accuracy of prediction further, I implemented ensemble learning, Bootstrap Aggregating (Bagging) algorithm, to combine multiple models as an ensemble to make the "Majority Voting" prediction.

File List

src/sentiment_analyser.ipynb : This is the main notebook for sentiment analysis task. It includes model functions of fine-tuning (training), evaluation, prediction, and so on. (The detailed instruction is inside the notebook.)

Second Part - Topic Modelling

Brief Introduction

In the second part, I implemented topic modelling to extract the most commonly discussed topics related to COVID-19 on Twitter.

Techniquely, I implemented a topic modelling model using BERTopic. To optimise the model performance, I customised the BERTopic model by using a transformer embedding model, a UMAP dimensionality reduction layer, a HDBSCAN clustering layer, a tokenisation, lemmatisation and vetorisationand layer, and a c-TF-IDF transformer layer.

Additionaly, Gensim's LDA model was also implemented to compare the performance.

File List

src/topic_modelling.ipynb : This is the main notebook for topic modelling task. It includes the implementation to create clusters of topics. (The detailed instruction is inside the notebook.)

src/gensim_topic_modelling.ipynb : This is the notebook for topic modelling using Gensim's LDA model, which may be used for comparison later.