Adapted BERTopic pipeline for Topic Modeling the arXiv dataset

smartIU/arxiv-topics

arXiv-topics

This repository contains an extensive workflow for topic modeling the entire arXiv dataset. It uses an adapted BERTopic pipeline and also includes:

  • preprocessing with nltk
  • an SQLite database to save results
  • label generation with Llama
  • trend analysis with statsmodels
  • visualization with dash/plotly

The process was designed to run locally and completed successfully on a laptop with 16 GB of RAM and an 8 GB Nvidia graphics card under Windows 10. Further optimizations are still needed to improve computation speed.

Setup for the complete workflow

Setup for the visualization of precomputed results only

  • Clone the repository
  • Install Python 3.10+ for your platform from https://www.python.org/downloads/
  • Install the Python modules dash and statsmodels
    pip install dash statsmodels
    
  • Download and unzip a trimmed-down database (without abstracts, embeddings and subsets) from the releases

Usage

The process is split up into several steps to allow intermediate evaluation of the individual results and some variations in the execution.

  • Step 1 creates the SQLite database and imports the cleaned, filtered and transformed features from the arXiv snapshot
  • Step 2 converts the abstracts into sentence embeddings
  • Step 3 trains models on subsets of the dataset, based on arXiv categories/archives
  • Step 4 trains the main model
  • Step 4b creates heatmaps, barcharts and hierarchical representations of the topics
  • Step 5 (re)generates topic labels
  • Step 6 assigns additional papers to existing topics with the original UMAP & HDBSCAN models
  • Step 7 assigns outlier papers to existing topics using a BERTopic approximation
  • Step 8 creates a new model of higher hierarchy by merging the topics of an existing model, based on cluster similarity
  • Step 9 computes the necessary statistics for all papers over all months of the dataset
  • Step 10 visualizes the topic trends

Of these, only steps 1, 2, 4 and 9 are strictly necessary before the visualization is possible. Steps 3, 6 and 7 are needed if you do not have enough RAM to train on the whole dataset at once (hint: 16 GB is not enough). Step 8 is optional, but recommended to further evaluate the clusters. Step 5 is meant to be used if you're not pleased with some of the initial labels generated by the LLM.
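The dependencies between the steps can be summarized in a small helper. This is only an illustrative sketch: the function name and its flags are hypothetical and not part of the repository, where each step is run by starting its own Python file.

```python
# Hypothetical helper (not part of the repository): lists which steps to run,
# following the rules stated in the Usage section.
STEP_ORDER = ["1", "2", "3", "4", "4b", "5", "6", "7", "8", "9", "10"]


def plan_steps(enough_ram=False, merge_topics=False, regenerate_labels=False):
    """Return the step identifiers to run, in the order listed above."""
    # Steps 1, 2, 4 and 9 are required before the visualization (step 10).
    selected = {"1", "2", "4", "9", "10"}
    if not enough_ram:
        # Train on subsets first, then assign the remaining and outlier papers.
        selected |= {"3", "6", "7"}
    if merge_topics:
        # Optional: build a model of higher hierarchy.
        selected.add("8")
    if regenerate_labels:
        # Optional: redo labels you are not pleased with.
        selected.add("5")
    return [step for step in STEP_ORDER if step in selected]


if __name__ == "__main__":
    print(plan_steps(enough_ram=False, merge_topics=True))
```

Step 4b is left out of the plan because the text does not state when it is needed; add it to the selection if you want the heatmaps, barcharts and hierarchical representations.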

Configuration

All steps are configured via the config.json file, so that each one can be run by simply starting the respective Python file.

Notable configurations include:

  • pre_year_min: the minimum year for papers from the snapshot to be included in your model

  • embeddings_precision: precision used for the SentenceTransformer encoding, according to https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html
    Use 'ubinary' for quantized embeddings.

  • train_models: settings for the selection of papers in step 4
    If you want to train on all papers at once, set "model_filter" to "None".
    Otherwise, you can choose between "archives", "outliers" or "hierarchy" and set "percent_of_papers" to the percentage of papers to be selected proportionally.

  • agg_models: settings for step 8
    parent_model: the previously trained model to use
    max_cluster_distance: the maximum distance between topics to be merged together in the resulting model

  • bert_params: various parameters for the different BERTopic components
    See the documentation of BERTopic, UMAP and HDBSCAN to understand their respective use.
    Notably, the adapted BERTopic pipeline allows you to define a whole range of min_cluster_sizes and min_sample_sizes to be used by HDBSCAN, and the one with the highest DBCV score will be chosen automatically. Also, a new hyperparameter "hdbscan_min_clusters" was added, which allows setting a minimum number of resulting clusters when using the ranges.
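The range-based HDBSCAN selection described in the last item can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name and the dictionary keys are hypothetical, the DBCV scores are assumed to have been computed already for each parameter combination, and returning None when no candidate reaches the minimum cluster count is a choice made here for illustration.

```python
# Hypothetical sketch: pick the HDBSCAN parameter combination with the highest
# DBCV score among those that produce at least `min_clusters` clusters.
def pick_hdbscan_params(candidates, min_clusters=0):
    """candidates: list of dicts with keys
    'min_cluster_size', 'min_samples', 'n_clusters' and 'dbcv'."""
    eligible = [c for c in candidates if c["n_clusters"] >= min_clusters]
    if not eligible:
        # Assumption: no fallback is applied when nothing qualifies.
        return None
    return max(eligible, key=lambda c: c["dbcv"])


if __name__ == "__main__":
    # Made-up example values, only to show the selection logic.
    results = [
        {"min_cluster_size": 50, "min_samples": 10, "n_clusters": 12, "dbcv": 0.41},
        {"min_cluster_size": 100, "min_samples": 10, "n_clusters": 4, "dbcv": 0.55},
        {"min_cluster_size": 75, "min_samples": 20, "n_clusters": 9, "dbcv": 0.47},
    ]
    print(pick_hdbscan_params(results, min_clusters=8))
```

In this example the combination with the best overall score is skipped because it yields only 4 clusters, which is the kind of situation "hdbscan_min_clusters" is described as guarding against.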

You can add any number of additional entries to train_models, agg_models (and bert_params accordingly) to train models in batches. Set "generate_labels" to false in the bert_params to exclude the LLM and improve performance for hyperparameter tuning.
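As a side note on the embeddings_precision option listed above, the following pure-Python sketch illustrates what 'ubinary' quantization does conceptually: each dimension is reduced to one bit (positive or not), and eight bits are packed into one unsigned byte. The function name is hypothetical and the code is only meant to show the idea; in the workflow itself the quantization is handled by the SentenceTransformer encoding.

```python
# Illustrative sketch of 'ubinary' quantization: threshold each value at zero
# and pack the resulting bits, most significant bit first, into bytes (0..255).
def quantize_ubinary(embedding):
    bits = [1 if value > 0 else 0 for value in embedding]
    # Pad with zeros so the number of bits is a multiple of 8.
    bits += [0] * (-len(bits) % 8)
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed


if __name__ == "__main__":
    vector = [0.5, -0.1, 0.2, 0.0, -3.0, 1.0, 1.0, -1.0]
    print(quantize_ubinary(vector))  # bits 10100110 -> [166]
```

An embedding with 8 float values thus shrinks to a single byte, which is why quantized embeddings reduce memory use considerably at the cost of some precision.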

FAQ

  • You're using quantized embeddings and encounter an error in method "search_closure" when trying to load a pickled BERTopic model?
    Until the pynndescent module gets an official update, you'll have to manually apply the fix from lmcinnes/pynndescent#240. Download https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py and replace the one throwing the error.

  • You want to update your model with a new version of the arXiv dataset?
    Simply run steps 1 and 2 again to import new papers and create their embeddings. This will not overwrite your existing data. Then run steps 6 and/or 7 to assign the new papers to the existing topics, and finally recompute the statistics with step 9.

  • You want to restart the whole process?
    Delete arxiv.db in the "database" folder as well as snapshot_update_date.txt in the "input" folder.
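The restart described in the last item can also be scripted. The sketch below is an assumption-laden convenience, not part of the repository: the function name is hypothetical, and it assumes that the "database" and "input" folders sit directly under the directory you pass in. Note that it permanently deletes the two files.

```python
from pathlib import Path


# Hypothetical helper: removes the two files named in the FAQ so that the
# whole process starts from scratch. Files that do not exist are skipped.
def reset_workflow(repo_root="."):
    root = Path(repo_root)
    targets = [
        root / "database" / "arxiv.db",
        root / "input" / "snapshot_update_date.txt",
    ]
    removed = []
    for target in targets:
        if target.is_file():
            target.unlink()
            removed.append(target)
    return removed


if __name__ == "__main__":
    for path in reset_workflow("."):
        print(f"deleted {path}")
```

Run it from the repository root, or pass the path to your clone as the argument.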

Disclaimer

No part of the code or its documentation was created by or with the help of artificial intelligence.