drob-xx/TopicModelTuning

TopicModelTuning

This repository has code that parallels the article "Using Metrics to Determine The Right LDA Topic Model Size." Users can run the notebook to re-create, step by step, the procedures described in the article.

To run the code presented here, follow this outline (details in the cells below):

  1. Download two csv files from the GitHub repository into a directory accessible to the notebook.
  2. Download the text DB csv file from Kaggle.
  3. Assign the global directory value to the location of the above files.
  4. Install the required packages.
  5. Execute the imports.
  6. Run the cells containing Python function definitions used in the notebook.
  7. Generate the six models used in the evaluation. This should take about 15 minutes on a standard Google Colab account. You can save the models for later use if desired.
  8. Run the evaluation code.
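
Step 7 mentions saving the generated models for later use. The following is a minimal sketch of one way to do that, assuming the model objects can be pickled; it is not taken from the notebook. The helper names `save_models` and `load_models` and the file name `lda_models.pkl` are hypothetical, and the modeling library you use may offer its own save and load methods that are preferable.

```python
import pickle
from pathlib import Path


def save_models(models, path):
    """Write a dict of {label: model} to a single pickle file."""
    path = Path(path)
    # Create the target directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(models, fh)


def load_models(path):
    """Read back the dict written by save_models."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Hypothetical usage, with `models` being a dict built in step 7:
# save_models(models, "/content/drive/MyDrive/TopicModelTuning/lda_models.pkl")
# models = load_models("/content/drive/MyDrive/TopicModelTuning/lda_models.pkl")
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.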

Download CSV Files

There are three csv files that are needed to run this notebook:

In the GitHub repository:

  • ExcludelistDF.csv
  • ModelRunMetrics.csv

On Kaggle:

  • NewsDF.csv

ExcludelistDF is a list of stop words that can be used when building models based on the sample text.

ModelRunMetrics contains the metrics from 90 LDA runs and can be used to re-create and explore the data from the article.

NewsDF is a copy of the 30,000-article DB that has both the original text and pre-processed versions of the articles. You will need this if you want to run your own models or explore the text that the models are built on.

It is recommended that you place all of these files in a location accessible to the Colab notebook and set the DATA_DIR variable to that location.
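
Before running the rest of the notebook, it can help to confirm that the three files are actually where DATA_DIR points. The sketch below is one way to check, under stated assumptions: the path shown is a placeholder for your own location, and `missing_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# Assumption: replace this placeholder with the directory holding your files.
DATA_DIR = Path("/content/drive/MyDrive/TopicModelTuning")

# The three file names listed above.
REQUIRED_FILES = ("ExcludelistDF.csv", "ModelRunMetrics.csv", "NewsDF.csv")


def missing_files(data_dir, required=REQUIRED_FILES):
    """Return the names of required files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


missing = missing_files(DATA_DIR)
if missing:
    print(f"Missing from {DATA_DIR}: {', '.join(missing)}")
else:
    print(f"All required files found in {DATA_DIR}")
```

If the check reports a missing file, fix the download or the DATA_DIR value before generating models, since a wrong path would otherwise only surface later, when a file is read.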