Mining and Analysing Twitter Data on Insomnia

This is my third-year Computer Science project at the University of Manchester on analysing Twitter data about insomnia.

Code

There are five main code files:

DataCollector.py
TweetCollector.ipynb
TweetTransformersTraining.ipynb
TweetTopicModelling.ipynb
TweetExperiments.ipynb

DataCollector.py and TweetCollector.ipynb are very similar because they do the same job: collecting data from Twitter through its API. DataCollector.py, however, was created to automate data collection on the University Linux Data Science server.

TweetTransformersTraining.ipynb fine-tunes a chosen transformer model using k-fold cross-validation or bagging ensembles. It also runs tweet sentiment prediction (inference), the results of which can be stored in a given location.

TweetTopicModelling.ipynb trains a BERTopic topic model on the tweets dataset and visualises the results.

TweetExperiments.ipynb performs various experiments with sentiment- and topic-labelled tweets.
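As a rough illustration of the k-fold cross-validation mentioned for the training notebook, the sketch below splits sample indices into k train/validation folds using only the standard library. The function name `k_fold_indices` and its parameters are hypothetical and are not taken from the notebook, which may well rely on a library splitter instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once with a fixed seed so the folds are reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


if __name__ == "__main__":
    for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
        print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)}")
```

Each sample appears in exactly one validation fold, so a model fine-tuned once per fold is evaluated on every tweet exactly once.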

The collected data cannot be published because of the requirements of the approved ethics application.

Running DataCollector.py

To run the script, first add your Twitter API Bearer token to the .env file. Then install the required libraries with pip install -r requirements.txt. The script also uses NumPy and Pandas. Finally, the BASE_PATH variable can be edited to change where the data is stored.
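One possible way to read the token from the .env file is sketched below with the standard library only. The variable name `BEARER_TOKEN` and both helper functions are assumptions made for illustration; the actual script may use a different key name or a dedicated library for this step.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}

    values = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_bearer_token(path=".env", name="BEARER_TOKEN"):
    """Return the token from the environment, falling back to the .env file.

    `BEARER_TOKEN` is a hypothetical key name, not confirmed by the project.
    """
    token = os.environ.get(name) or load_env_file(path).get(name)
    if not token:
        raise RuntimeError(f"{name} is not set in the environment or in {path}")
    return token
```

Keeping the token in .env rather than in the source file means it stays out of version control as long as .env is listed in .gitignore.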

Data collection was automated with crontab jobs. This is necessary because the Twitter API only gives access to tweets that are no more than one week old. The following crontab entry was used to collect tweets every day at 9 am:

0 09 * * * /usr/bin/python3 /DataCollector.py

Logging was also set up for this script to record daily statistics about the fetched and stored tweets.
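A minimal sketch of how such daily statistics could be written to a log file follows. The logger name, the file name `collector.log`, the message format, and the `skipped` figure are all assumptions for illustration and do not come from DataCollector.py.

```python
import logging


def build_logger(log_path="collector.log"):
    """Create (or reuse) a file logger for the collection script."""
    logger = logging.getLogger("data_collector")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Attach the handler only once so repeated calls do not duplicate lines.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_daily_stats(logger, fetched, stored):
    """Record how many tweets were fetched and how many were stored."""
    logger.info(
        "fetched=%d stored=%d skipped=%d", fetched, stored, fetched - stored
    )


if __name__ == "__main__":
    log_daily_stats(build_logger(), fetched=120, stored=100)
```

Because the cron job runs unattended, a timestamped line per run makes it easy to spot a day on which nothing was fetched.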

Tweets are stored in both .json and .csv formats (the .csv file may be used as a backup). In the .json file, the data is stored as follows:

[
    {
        "Publish Date":1679561886000,
        "Location":"SOME LOCATION",
        "Tweet":"I can't sleep"
    },
    {
        "Publish Date":1679561886000,
        "Location":"SOME KINGDOM",
        "Tweet":"I can't sleep :("
    },
    ...
]

When tweets are annotated, a "Sentiment" property is added, where the integers 0, 1 and 2 correspond to negative, neutral and positive sentiment respectively.
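The stored records can be read back with a few lines of Python, sketched below. It assumes that "Publish Date" is a Unix timestamp in milliseconds, which the magnitude of the sample values suggests but the project does not state. The function name `parse_tweets` and the output field names are hypothetical, and the sample record with a "Sentiment" value is made up for illustration.

```python
import json
from datetime import datetime, timezone

# Label mapping as described above: 0 negative, 1 neutral, 2 positive.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def parse_tweets(json_text):
    """Convert stored tweet records into dicts with readable fields."""
    parsed = []
    for record in json.loads(json_text):
        # Assumption: "Publish Date" holds epoch milliseconds (UTC).
        published = datetime.fromtimestamp(
            record["Publish Date"] / 1000, tz=timezone.utc
        )
        parsed.append(
            {
                "published": published,
                "location": record["Location"],
                "tweet": record["Tweet"],
                # Unannotated tweets have no "Sentiment" key, giving None.
                "sentiment": SENTIMENT_LABELS.get(record.get("Sentiment")),
            }
        )
    return parsed


if __name__ == "__main__":
    sample = json.dumps(
        [
            {
                "Publish Date": 1679561886000,
                "Location": "SOME LOCATION",
                "Tweet": "I can't sleep",
                "Sentiment": 0,  # illustrative annotation
            }
        ]
    )
    for tweet in parse_tweets(sample):
        print(tweet["published"].isoformat(), tweet["sentiment"], tweet["tweet"])
```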

Running Notebooks

Detailed instructions on what each notebook does and what is needed to run it can be found at the beginning of that notebook. The notebooks are prepared to run on the Google Colab platform, but it should not be difficult to adapt them to other platforms such as Kaggle.

LukasRimkus/Analysing-Insomnia-Twitter-Data