Unveiling Voices: Identification of Concerns in a Social Media Breast Cancer Cohort via Natural Language Processing

📚 Pipeline

Figure 1: Natural language processing pipeline for identifying treatment discontinuation topics in a breast cancer Twitter cohort.

🎯 Objective

Our primary objectives were threefold:

Develop a system for identifying self-reported breast cancer tweets using traditional machine learning models and RoBERTa.
Identify breast cancer-related, concern-based topics in patients’ tweets.
Perform sentiment intensity analysis for patients who voice dissatisfaction, and identify treatment discontinuation in the self-reported tweets category.

🏃‍♂️ To run the code

The code is written in Python 3.
Download the pretrained word2vec embeddings: GoogleNews-vectors-negative300.bin
Most of the code is in Jupyter notebooks.
Run the files in the order given by the alphabetical prefix in each file name (A to E).
The dataset is available on request.
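The two setup steps above (running the files in prefix order and loading the downloaded vectors) can be sketched as follows. This is an illustrative sketch, not code from the repository: the helper names `run_order` and `load_word2vec` are hypothetical, the numeric sub-step handling (as in `B_1_...`, `B_2_...`) is an assumption about how the prefixes are meant to be read, and loading the `.bin` file through gensim is an assumption about the loader the notebooks use.

```python
import re
from pathlib import Path

# Matches a leading stage letter and an optional numeric sub-step,
# e.g. "A_" or "B_1_". Files without such a prefix are ignored.
_STAGE = re.compile(r"^([A-Z])_(?:(\d+)_)?")


def run_order(filenames):
    """Return the prefixed files sorted by stage letter, then sub-step.

    Hypothetical helper: the prefix convention is inferred from the
    file names, not taken from the repository's own tooling.
    """
    staged = []
    for name in filenames:
        match = _STAGE.match(name)
        if match:
            letter = match.group(1)
            step = int(match.group(2) or 0)  # no sub-step counts as 0
            staged.append((letter, step, name))
    return [name for _, _, name in sorted(staged)]


def load_word2vec(path="GoogleNews-vectors-negative300.bin"):
    """Load binary word2vec vectors with gensim (assumed loader)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Download the vectors first: {path}")
    # Imported lazily so run_order() works without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(str(path), binary=True)
```

Calling `run_order(p.name for p in Path(".").iterdir())` from the repository root would list the prefixed notebooks and scripts in the order to execute them, while unprefixed files such as `README.md` and `LICENSE` are skipped.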

📈 Results

Model	Hyperparameters	F1 Micro	F1 Macro	F2 Micro	F2 Macro	Log loss
Decision Tree	criterion='gini', max_depth=10	0.778	0.608	0.778	0.596	0.734
Logistic Regression	C=10, penalty='l2'	0.772	0.576	0.772	0.570	0.464
Naïve Bayes	alpha=0.1	0.745	0.427	0.745	0.468	0.568
Random Forest	max_depth=None, n_estimators=50	0.752	0.476	0.752	0.498	0.652
RoBERTa	epochs=20, batch_size=16	0.894	0.853	0.894	0.841	0.332

Table 1: Classification Results across various Evaluation Metrics
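For readers unfamiliar with the metric columns in Table 1, the sketch below shows how micro- and macro-averaged F-beta scores are computed for single-label predictions. It is a minimal pure-Python illustration of the standard definitions, not the evaluation code used in the repository; the function names and the toy labels in the usage note are hypothetical.

```python
from collections import Counter


def fbeta(precision, recall, beta):
    """F-beta from precision and recall; beta > 1 weights recall more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def micro_macro_fbeta(y_true, y_pred, beta=1.0):
    """Return (micro, macro) F-beta for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    # Macro: compute F-beta per class, then take the unweighted mean.
    per_class = [
        fbeta(
            ratio(tp[label], tp[label] + fp[label]),
            ratio(tp[label], tp[label] + fn[label]),
            beta,
        )
        for label in labels
    ]
    macro = sum(per_class) / len(per_class)

    # Micro: pool the counts over all classes, then compute one F-beta.
    all_tp, all_fp, all_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = fbeta(
        ratio(all_tp, all_tp + all_fp),
        ratio(all_tp, all_tp + all_fn),
        beta,
    )
    return micro, macro
```

For example, `micro_macro_fbeta([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0, 1], beta=2)` can be compared against the same call with `beta=1`. In single-label classification every error is one false positive and one false negative, so pooled precision equals pooled recall and the micro score is the same for any beta. If the classification task here is single-label, that would account for the identical F1 Micro and F2 Micro columns in Table 1, while the macro scores differ because per-class precision and recall are not equal.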

📑 Citation

Please consider citing 📑 our paper if our repository is helpful to your work.

@inproceedings{rajwal-etal-2024-unveiling,
    title = "Unveiling Voices: Identification of Concerns in a Social Media Breast Cancer Cohort via Natural Language Processing",
    author = "Rajwal, Swati  and Pandey, Avinash Kumar  and Han, Zhishuo  and Sarker, Abeed",
    editor = "Demner-Fushman, Dina  and Ananiadou, Sophia  and Thompson, Paul  and Ondov, Brian",
    booktitle = "Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.cl4health-1.32",
    pages = "264--270"
}

