These notes are extracted from my original Colab notebook.
1. `hackernews-urls-from-browser-deduplicated.csv` -- Deduplicated URLs from my most recent browser session. Most recent date: 2022-10-16 15:20:47
2. `hackernews-data-from-phone.json` -- URLs from my phone. Still has onepoll-type data (easily filtered). Most recent date: 2022-10-11 23:20:45
3. `hackernews-stories-since-2018.csv` -- HN stories data since 2018, dumped from the BigQuery public dataset. Most recent date: 2022-10-14 00:01:15+0000
4. `hackernews-stories-since-2022-10-14.csv` -- Additional HN data for (3); should be used along with the test case (5). Most recent date: 2022-10-24 23:54:08
5. `hackernews-since-20221016` -- URLs of the articles I've most recently opened (since 2022-10-16); should be used as the test case
6. `hackernews-2019-2022-sessions.csv` -- Deduplicated URLs from my stored browser sessions (it's longer)
There are basically two datasets: the Joined dataset, built from the set union of (1 + 2 + 5 + 6), and the HN dataset (since 2018, (3), the first one I dumped, plus since 2022-10-14, (4), the most recent dump).
- Models are created using data from (3)
- The Joined dataset is split between before October 14 ((4)'s `min()`) and after that
- Profile building is done using data from before October 14
- Validation is done using data from after October 14
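A minimal sketch of that split, assuming the Joined dataset is a pandas DataFrame with a `timestamp` column (column names here are hypothetical):

```python
import pandas as pd

# Joined dataset stand-in: one row per opened URL.
joined = pd.DataFrame({
    "url": ["a", "b", "c"],
    "timestamp": pd.to_datetime(
        ["2022-10-10", "2022-10-13", "2022-10-20"]),
})

# Cutoff taken from the earliest timestamp of dataset (4), i.e. October 14.
cutoff = pd.Timestamp("2022-10-14")

profile_part = joined[joined["timestamp"] < cutoff]      # profile building
validation_part = joined[joined["timestamp"] >= cutoff]  # validation
```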
```sql
SELECT title, url, timestamp, type, id
FROM `bigquery-public-data.hacker_news.full`
WHERE
  type = 'story' AND
  title IS NOT NULL AND
  timestamp BETWEEN 'START_DATE' AND 'END_DATE' -- YYYY-MM-DD format
  -- or could be
  -- timestamp >= DATE_STRING
```
- There are titles in different languages included in the training set, like Spanish and Indonesian. I don't remove them because they don't amount to much and I don't think I have a good way to identify them. I've tried langdetect, but it doesn't work well with short texts :/
- Apparently I don't have to create my own sentence-transformer embeddings, because BERTopic uses them by default. Also, stop words are automatically handled by the vectorizer in my case
- BERTopic `cachedir` issues (coming from HDBSCAN) 1 2. Solved by installing HDBSCAN before installing BERTopic, as "If you install this dependency before running pip install it won't install it again. This means you can install your patched version and then your package" -- SO link
- BERTopic finished training, but crashes as it processes the result.
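The workaround boils down to the install order (assuming the usual PyPI package names):

```shell
# Install (a possibly patched) HDBSCAN first; a later
# `pip install bertopic` will then reuse it instead of
# pulling in its own copy.
pip install hdbscan
pip install bertopic
```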
  I tried playing with parameters such as `min_df` and `low_memory`, but they don't help. So I tried online topic modeling. I tried River, but it's too slow. IncrementalPCA and MiniBatchKMeans work nicely for my case (and they don't take too long!)
- As I'm using online modeling, the learned topics are somehow not sorted correctly. This breaks BERTopic's visualization mechanism (that's why the README is lacking in visualizations). It's probably easy to solve: sort the topics along with the sentences and feed the result back to the `topic_model`. But I haven't tried this!
- Using Kaggle and Colab interchangeably because of GPU restrictions (also, good news from Kaggle)
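The online pipeline can be sketched with scikit-learn alone (in BERTopic these models would be passed as `umap_model` and `hdbscan_model`; the embedding size, batch size, and cluster count below are placeholders):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for sentence-transformer embeddings of the HN titles.
embeddings = rng.normal(size=(1000, 384))

ipca = IncrementalPCA(n_components=30)        # dimensionality reduction
km = MiniBatchKMeans(n_clusters=10, random_state=0)

# Feed the data in batches instead of all at once (keeps memory low).
for start in range(0, len(embeddings), 200):
    ipca.partial_fit(embeddings[start:start + 200])

reduced = ipca.transform(embeddings)
for start in range(0, len(reduced), 200):
    km.partial_fit(reduced[start:start + 200])

labels = km.predict(reduced)
```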
- Colab's CPU and GPU environments differ: the `gdown` version breaks my path to Drive. Need to anticipate this.
- Some library that I wanted to try (I think it's cuML) needed a Python version above 3.7, but the Python installed in Colab and Kaggle is 3.7 :(
- Lemmatization (in my intuition) made the model better
- Somehow the `ngram` setting, when first applied, doesn't really work for me. Maybe because the sentences are too short?
- I've had issues installing BERTopic in Colab. Apparently, the transformers library must be pinned to the specific `4.20.1` version and the flag `--upgrade-strategy only-if-needed` must be passed when installing BERTopic
- I've created 10 models, and the three that are good have settings of:
- 30 components, 100 clusters, 0.646 coherence score
- 60 components, 200 clusters, 0.653 coherence score
- 200 components, 300 clusters, 0.662 coherence score
Despite the drawbacks of the coherence score as mentioned here, these three match my intuition nicely. They go from big picture (100 clusters) to specific (300 clusters). Results are in the README
- Try to visualize more things and try to understand what happened (I'm not good with visualizations...)
- The results from testing the model are good profile-wise, but not really matched to my taste. I've only got a little testing dataset with me, but it has an accuracy of 0.012 (12 per 1000). To resolve this issue, I think I could either:
- Tune the hyperparameters and focus on what matches the end testing dataset.
- Use other aggregation and weighting mechanism for the profile.
- Use the output from BERTopic as a feature for another model (try to get the "middle" point where my taste is located).
- Identify weird clusters and find a way to clear them
- Use more data (use all the data in the public dataset)
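For the aggregation/weighting option, one simple baseline to compare against (helper names are hypothetical; a plain mean of per-document topic distributions, with candidates ranked by cosine similarity):

```python
import numpy as np

def build_profile(doc_topic_dists):
    """Aggregate per-document topic distributions into one profile vector
    (a plain mean; weighting by recency would be one alternative)."""
    return np.mean(doc_topic_dists, axis=0)

def score(profile, candidate_dists):
    """Rank candidate articles by cosine similarity to the profile."""
    norms = np.linalg.norm(candidate_dists, axis=1) * np.linalg.norm(profile)
    return candidate_dists @ profile / norms

# Toy topic distributions over 3 topics.
profile_docs = np.array([[0.7, 0.2, 0.1],
                         [0.6, 0.3, 0.1]])
candidates = np.array([[0.8, 0.1, 0.1],    # close to the profile
                       [0.1, 0.1, 0.8]])   # far from the profile
profile = build_profile(profile_docs)
scores = score(profile, candidates)
```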
- Migrate the project to Kedro so that the workflow is much cleaner (plus try experiment-tracking tools like MLflow)
- Implement BTM and Top2Vec to compare the results
- Try to solve OOV(s) in case they do happen
- Use a perplexity-based measure in addition to the coherence score (Another example)
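On the perplexity idea, the usual definition is exp of the negative mean per-token log-likelihood (a sketch; in practice the log-likelihoods would come from a held-out set):

```python
import numpy as np

def perplexity(per_token_log_likelihoods):
    """Perplexity = exp(-average log-likelihood per token).
    Lower is better; a uniform model over k outcomes scores exactly k."""
    return float(np.exp(-np.mean(per_token_log_likelihoods)))

# Sanity check: a uniform model over 8 outcomes has per-token
# log-likelihood ln(1/8), so its perplexity is exactly 8.
ll = np.log(np.full(100, 1 / 8))
```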
- Time
- Colab's super feature!
- Short-text topic modeling:
- Limitations of short-text topic modeling
- Extra NLP Stuff
- OOV(s)
- Good notebook examples