Notes

These notes are extracted from my original Colab notebook

Datasets

  1. hackernews-urls-from-browser-deduplicated.csv
    Deduplicated URLs from my most recent browser session
    most recent date: 2022-10-16 15:20:47
  2. hackernews-data-from-phone.json
    URLs from my phone. Still contains one poll-type entry (easily filtered)
    most recent date: 2022-10-11 23:20:45
  3. hackernews-stories-since-2018.csv
    HN stories data since 2018, dumped from the BigQuery public dataset
    most recent date: 2022-10-14 00:01:15+0000
  4. hackernews-stories-since-2022-10-14.csv
    Additional HN data extending (3); should be used along with the test case (5)
    most recent date: 2022-10-24 23:54:08
  5. hackernews-since-20221016
    URLs of the articles I've most recently opened (since 2022-10-16); should be used as the test case
  6. hackernews-2019-2022-sessions.csv
    Deduplicated URLs from my stored browser sessions (a longer history)

There are basically two datasets: the joined dataset, built as the set union of (1), (2), (5), and (6), and the HN dataset, made up of (3) (since 2018, the first dump) and (4) (since 2022-10-14, the most recent dump)

  • Models are created using data from (3)
  • The joined dataset is split into before October 14 (the min() timestamp of (4)) and after it (see the sketch right after this list)
  • Profile building is done using data from before October 14
  • Validation is done using data from after October 14
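
A minimal sketch of this split, assuming the joined dataset lives in a pandas DataFrame; the file name of the joined dataset and its time column name are illustrative, not the actual notebook code:

import pandas as pd

# joined: deduplicated union of datasets (1), (2), (5), and (6)
joined = pd.read_csv("joined-dataset.csv", parse_dates=["time"])

# dataset (4); its earliest timestamp (~2022-10-14) is the cutoff
hn_recent = pd.read_csv("hackernews-stories-since-2022-10-14.csv", parse_dates=["timestamp"])
cutoff = hn_recent["timestamp"].min()

profile_data = joined[joined["time"] < cutoff]      # profile building
validation_data = joined[joined["time"] >= cutoff]  # validation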

Queries

SELECT title, url, timestamp, type, id
FROM `bigquery-public-data.hacker_news.full`
WHERE
    type = 'story' AND
    title IS NOT NULL AND
    timestamp BETWEEN 'START_DATE' AND 'END_DATE'  -- YYYY-MM-DD format
    -- or could be
    -- timestamp >= DATE_STRING
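
For reference, a sketch of one way to run this dump with the google-cloud-bigquery Python client (not necessarily how the original dump was made; the date literals are placeholders):

from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials are already set up

sql = """
SELECT title, url, timestamp, type, id
FROM `bigquery-public-data.hacker_news.full`
WHERE
    type = 'story' AND
    title IS NOT NULL AND
    timestamp BETWEEN '2018-01-01' AND '2022-10-14'
"""

stories = client.query(sql).to_dataframe()
stories.to_csv("hackernews-stories-since-2018.csv", index=False)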

Processes

  1. There are titles in other languages included in the training set, like Spanish and Indonesian. I don't remove them because they don't amount to much, and I don't think I have a good way to identify them. I've tried langdetect, but it doesn't work well with short texts :/
  2. Apparently I don't have to create my own sentence transformer embeddings because BERTopic uses them by default. Also, stop words are automatically handled by the vectorizer in my case (see the vectorizer sketch after this list)
  3. BERTopic cachedir issues (coming from HDBSCAN) 1 2.
    Solved by installing HDBSCAN before installing BERTopic, as "if you install this dependency before running pip install it won't install it again. This means you can install your patched version and then your package" -- SO link
  4. BERTopic finished training, but crashes as it processes the result.
    I tried playing with parameters such as min_df and low_memory, but they don't help. So I tried online topic modeling. I tried River, but it's too slow. IncrementalPCA and MiniBatchKMeans work nicely for my case (and they don't take too long!); see the sketch after this list
  5. As I'm using online modeling, the topics learned are somehow not sorted correctly. This breaks BERTopic's visualization mechanism (that's why the README is lacking in visualizations). It's probably easy to solve: sort the topics along with the sentences and feed the result back into the topic_model. But I haven't tried this!
  6. Using Kaggle and Colab interchangeably because of GPU restrictions (also, good news from Kaggle)
  7. Colab's CPU and GPU environments ship different gdown versions, which breaks my path to Drive. Need to anticipate this
  8. Some library that I wanted to try (I think it's cuML) needed a Python version above 3.7, but the Python installed in Colab and Kaggle is 3.7 :(
  9. Lemmatization (in my intuition) made the model better; a possible preprocessing sketch is after this list
  10. Somehow the n-grams, when first applied, didn't really work for me. Maybe because the sentences are too short?
  11. I've had issues installing BERTopic in Colab. Apparently, transformers must be pinned to version 4.20.1 and the flag --upgrade-strategy only-if-needed must be passed when installing BERTopic
  12. I've created 10 models, and the three good ones have these settings:
  • 30 components, 100 clusters, 0.646 coherence score
  • 60 components, 200 clusters, 0.653 coherence score
  • 200 components, 300 clusters, 0.662 coherence score
    Despite the drawbacks of the coherence score as mentioned here, these three match my intuition nicely. They go from big picture (100 clusters) to specific (300 clusters). Results are in the README; a sketch of the coherence computation is after this list
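
Re (2), a minimal sketch of relying on BERTopic's default sentence-transformers embeddings while letting the vectorizer strip stop words (the min_df value here is illustrative):

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# No embedding_model is passed, so BERTopic falls back to its default
# sentence-transformers model; the vectorizer only shapes the topic
# representations, so stop words are removed there.
vectorizer_model = CountVectorizer(stop_words="english", min_df=2)
topic_model = BERTopic(vectorizer_model=vectorizer_model)

topics, probs = topic_model.fit_transform(titles)  # titles: list of HN title strings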
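
Re (4), a sketch of the online setup along the lines of BERTopic's online topic modeling docs, using the 30-component / 100-cluster configuration from (12); the chunk size and random_state are illustrative:

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Incremental replacements for UMAP and HDBSCAN so training happens
# chunk by chunk instead of holding the full corpus in memory.
dim_model = IncrementalPCA(n_components=30)
cluster_model = MiniBatchKMeans(n_clusters=100, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=dim_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

chunk_size = 1000
for start in range(0, len(titles), chunk_size):
    topic_model.partial_fit(titles[start:start + chunk_size])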
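
Re (9), one possible lemmatization step, assuming spaCy's en_core_web_sm model (a sketch, not necessarily the exact preprocessing in the notebook):

import spacy

# The tagger is needed for accurate lemmas; the parser and NER are not.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(title: str) -> str:
    return " ".join(token.lemma_ for token in nlp(title))

lemmatized_titles = [lemmatize(t) for t in titles]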
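
Re (12), a sketch of computing a c_v coherence score for a trained BERTopic model with gensim; reusing the vectorizer's analyzer for tokenization is an assumption that keeps the topic words inside the dictionary:

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Tokenize the titles with the same analyzer the topic model's vectorizer
# uses, so every topic word appears in the gensim dictionary.
analyzer = topic_model.vectorizer_model.build_analyzer()
tokenized_titles = [analyzer(t) for t in titles]
dictionary = Dictionary(tokenized_titles)

# Top words per topic (with KMeans-style clustering there is no -1 outlier topic)
topic_words = [
    [word for word, _ in topic_model.get_topic(topic_id)]
    for topic_id in topic_model.get_topics()
]

coherence = CoherenceModel(
    topics=topic_words,
    texts=tokenized_titles,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()
print(coherence)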

Further Improvements

  1. Try to visualize more things and try to understand what happened (I'm not good with visualizations...)
  2. The results from testing the model are good profile-wise, but they don't really match my taste. I only have a small testing dataset, and on it the accuracy is 0.012 (12 per 1000). To resolve this issue, I think I could:
    1. Tune the hyperparameters and focus on what matches the end testing dataset.
    2. Use another aggregation and weighting mechanism for the profile.
    3. Use the output from BERTopic as a feature for another model (try to get the "middle" point where my taste is located).
  3. Identify weird clusters and find a way to clear them
  4. Use more data (use all the data in the public dataset)
  5. Migrate the project to Kedro so that the workflow is much cleaner (plus try experiment-tracking tools like MLflow)
  6. Implement BTM and Top2Vec to compare the results
  7. Try to handle OOVs in case they do happen
  8. Use a perplexity-based measure in addition to the coherence score (Another example)

Useful References

  1. Time
    1. Python time format reference
    2. UNIX timestamp to datetime
  2. Colab's super feature!
  3. Short-text topic modeling:
    1. Biterm Topic Model
    2. Top2Vec
    3. BERTopic
  4. Limitations of short-text topic modeling
  5. Extra NLP Stuff
    1. Kaggle's NLP
    2. NLP Specializations
  6. OOV(s)
    1. Various ways to handle it
    2. Byte Pair Encoding
  7. Good notebook examples
    1. BERTopic notebooks are great!
    2. BERTopic for covid-related tweets