Unveiling Voices: Identification of Concerns in a Social Media Breast Cancer Cohort via Natural Language Processing
![image](https://private-user-images.githubusercontent.com/145946818/318328451-79afc838-8dbe-4c34-b626-801817c22944.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAzODU4MDAsIm5iZiI6MTcyMDM4NTUwMCwicGF0aCI6Ii8xNDU5NDY4MTgvMzE4MzI4NDUxLTc5YWZjODM4LThkYmUtNGMzNC1iNjI2LTgwMTgxN2MyMjk0NC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzA3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcwN1QyMDUxNDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jYjMxNzdkMjNkNGM3ZGQ3ZjI0MTQ4ZWJlNmRkNGJkYzc0YjM3ZjEyZjUzZjk3OTU3OWZhYWQ2OGM4MTdiYTM3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.SwfowDcj4tNT0-BE_wCGk4oDGXrHWx7hmgl8LyAfJkM)
Figure 1: Natural language processing pipeline for identifying treatment discontinuation topics in a breast cancer Twitter cohort.
Our primary objectives were threefold:
- Develop a system for identifying self-reported breast cancer tweets using traditional machine learning models and RoBERTa.
- Identify breast cancer-related concern-based topics in patients’ tweets.
- Perform sentiment intensity analysis for patients who voice dissatisfaction, and identify treatment discontinuation within the self-reported tweets category.

Setup and usage notes:
- The code is written in Python 3.
- Download the pretrained word embeddings file `GoogleNews-vectors-negative300.bin`.
- Most of the code is provided as Jupyter notebooks.
- Run the files in the order given by the letter prefix in each file name (A to E).
- The dataset is available on request.
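
The embeddings file listed above is the pretrained Google News word2vec model. This README does not describe how the notebooks consume it, so the following is only a minimal sketch under stated assumptions: that `gensim` is installed, that the file sits in the working directory, and that a tweet is represented by the mean of its word vectors. The names `VECTORS_PATH`, `load_vectors`, and `average_vector` are hypothetical and do not come from the repository.

```python
from pathlib import Path

# Assumed location of the downloaded file; adjust to your own setup.
VECTORS_PATH = Path("GoogleNews-vectors-negative300.bin")


def load_vectors(path=VECTORS_PATH):
    """Load the binary word2vec file with gensim (an assumed dependency)."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"Embeddings file not found: {path}")
    # Imported lazily so the rest of this sketch works without gensim.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(path), binary=True)


def average_vector(tokens, vectors, dim=300):
    """Mean of the vectors of in-vocabulary tokens; all zeros if none match.

    `vectors` can be any mapping that supports `in` and `[]`,
    such as a gensim KeyedVectors object or a plain dict.
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]
```

With the file in place, `average_vector("my chemo starts today".split(), load_vectors())` would return a 300-dimensional feature vector that a traditional classifier could take as input. Averaging is only one possible design choice and may differ from what the notebooks do.
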
Model | Hyperparameter | F1 Micro | F1 Macro | F2 Micro | F2 Macro | Log loss |
---|---|---|---|---|---|---|
Decision Tree | criterion='gini', max_depth=10 | 0.778 | 0.608 | 0.778 | 0.596 | 0.734 |
Logistic Reg. | C=10, penalty='l2' | 0.772 | 0.576 | 0.772 | 0.570 | 0.464 |
Naïve Bayes | alpha=0.1 | 0.745 | 0.427 | 0.745 | 0.468 | 0.568 |
Random forest | max_depth=None, n_estimators=50 | 0.752 | 0.476 | 0.752 | 0.498 | 0.652 |
RoBERTa | epochs=20, batch_size=16 | 0.894 | 0.853 | 0.894 | 0.841 | 0.332 |
Table 1: Classification results across evaluation metrics.
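
For readers unfamiliar with the averaging schemes in Table 1, the sketch below shows how micro- and macro-averaged F-beta scores can be computed from label lists in plain Python. It assumes single-label classification and is not taken from the repository; `fbeta` and `micro_macro_fbeta` are hypothetical names. Under that assumption, every misclassified example counts as exactly one false positive and one false negative, so micro precision equals micro recall and the micro F-beta score equals accuracy for any beta. This is consistent with the identical F1 Micro and F2 Micro columns in the table.

```python
from collections import Counter


def fbeta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def micro_macro_fbeta(y_true, y_pred, beta=1.0):
    """Return (micro, macro) F-beta for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    # Macro: compute F-beta per class, then take the unweighted mean.
    per_class = [
        fbeta(ratio(tp[c], tp[c] + fp[c]), ratio(tp[c], tp[c] + fn[c]), beta)
        for c in labels
    ]
    macro = sum(per_class) / len(per_class)

    # Micro: pool the counts over all classes, then compute one F-beta.
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = fbeta(ratio(tp_sum, tp_sum + fp_sum), ratio(tp_sum, tp_sum + fn_sum), beta)
    return micro, macro
```

Calling `micro_macro_fbeta(y_true, y_pred, beta=2.0)` gives the F2 variants. The macro score treats every class equally, which is why it drops well below the micro score for models that perform poorly on the minority class.
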
Please consider citing 📑 our paper if our repository is helpful to your work.
@inproceedings{rajwal-etal-2024-unveiling,
title = "Unveiling Voices: Identification of Concerns in a Social Media Breast Cancer Cohort via Natural Language Processing",
author = "Rajwal, Swati and Pandey, Avinash Kumar and Han, Zhishuo and Sarker, Abeed",
editor = "Demner-Fushman, Dina and Ananiadou, Sophia and Thompson, Paul and Ondov, Brian",
booktitle = "Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.cl4health-1.32",
pages = "264--270"
}