Identical topics: some become outliers, some are assigned to their topic #2026
Are you familiar with the underlying algorithms of BERTopic? If not, I would highly advise reading up on the clustering algorithm (HDBSCAN), which is the algorithm that actually assigns data points to clusters (also referred to as topics). If so, let me go a bit more in depth on some of the things you shared:
Just note that it is also possible to automatically merge the topics if you are so inclined with
Absolute cosine similarities are a bit tricky to interpret since they tell you little about the distribution of similarities. For instance, it is possible that this specific embedding model simply produces very high similarities by default, making the separation of documents more difficult.
Have you tried setting the threshold to 0? Also, it would be interesting to use a different strategy ("embeddings", for instance), since your issue might relate to the embedding model that you use.

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Initialize BERTopic
#representation_model = KeyBERTInspired()
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(ctfidf_model=ctfidf_model, representation_model=representation_model, calculate_probabilities=True, language="swedish")

# Fit the model with documents
topics, probs = topic_model.fit_transform(docs, embeddings)
```

You should also set
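Conceptually, the "embeddings" strategy reassigns each outlier to its most similar topic whenever the similarity clears the threshold. Here is a minimal NumPy sketch of that idea using topic centroids — it is an illustration only, not BERTopic's actual `reduce_outliers` implementation:

```python
import numpy as np

def reassign_outliers_by_embedding(embeddings, topics, threshold=0.0):
    """Reassign outlier documents (topic -1) to the topic whose centroid
    embedding is most similar, if that similarity reaches `threshold`.
    Conceptual sketch only, not BERTopic's code."""
    topics = np.asarray(topics)
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalize so dot products equal cosine similarities
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    topic_ids = sorted(t for t in set(topics.tolist()) if t != -1)
    centroids = np.vstack([emb[topics == t].mean(axis=0) for t in topic_ids])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

    new_topics = topics.copy()
    for i in np.where(topics == -1)[0]:
        sims = centroids @ emb[i]          # cosine similarity to each centroid
        best = int(np.argmax(sims))
        if sims[best] >= threshold:        # threshold=0 reassigns aggressively
            new_topics[i] = topic_ids[best]
    return new_topics.tolist()
```

With `threshold=0`, essentially every outlier gets pulled into its nearest topic, which is why trying 0 is a useful diagnostic.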
As mentioned above, note that you are using the default multilingual embedding model for creating the embeddings used in KeyBERTInspired and MMR.
Based on what you shared, it is difficult to say what the exact "issue" is, since this might simply be a consequence of how the underlying clustering algorithm, HDBSCAN, handles these kinds of input embeddings. For instance, it has been suggested before that HDBSCAN is quite careful when assigning embeddings to clusters, thereby creating many outliers. Instead, I would advise trying out some of HDBSCAN's hyperparameters to see if that makes a difference (e.g., …
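For example, a custom HDBSCAN model can be passed to BERTopic so its clustering behavior can be tuned directly. The parameter values below are illustrative assumptions, not recommendations for this dataset:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

# Illustrative values only; tune for your own data.
hdbscan_model = HDBSCAN(
    min_cluster_size=15,   # smaller values allow smaller topics
    min_samples=5,         # lower values make clustering less conservative (fewer outliers)
    metric="euclidean",
    prediction_data=True,  # required when calculate_probabilities=True
)
topic_model = BERTopic(hdbscan_model=hdbscan_model, language="swedish")
```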
I'm using BERTopic 0.16.2 and I'm trying to understand why about a third of my documents are categorized as outliers.
len(docs) is 7578, and the number of documents in my largest topics are:

As I looked through some of the outlier documents, it immediately struck me that many of them quite obviously belong to one of the "real" topics. For example, I have a "modes of transportation" topic (my interpretation, based on the top words in that topic being words like car, bike, train, bus), and there are outliers talking about biking and taking the bus.
The clearest indicator that something is going quite wrong, however, is this outlier:
(yes, the entire document consists of these two words.)
given that I have two topics that look like this:
godmorgon = good morning in Swedish.
I don't quite understand why there are two separate good-morning topics, one where "god morgon" has mostly been transcribed as two words and the other where it has been transcribed as "godmorron", but that is probably not an error but a matter of fine-tuning (or manually merging the topics), so let's not worry about that for now.
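If merging ever becomes necessary, it conceptually amounts to relabeling one topic id to another; BERTopic's `merge_topics(docs, topics_to_merge)` does this and additionally recomputes the topic representations. A minimal relabeling sketch (topic ids here are hypothetical):

```python
def merge_topic_labels(topics, topics_to_merge):
    """Relabel every topic id in `topics_to_merge` to the first id in that list.
    Conceptual sketch only: BERTopic's merge_topics() also rebuilds the
    merged topic's representation after relabeling."""
    target = topics_to_merge[0]
    mapping = {t: target for t in topics_to_merge}
    return [mapping.get(t, t) for t in topics]
```

For instance, merging hypothetical topics 35 and 36 maps every 36 to 35 while leaving outliers (-1) and other topics untouched.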
Here is the list of all documents in topic 35:
I realize that there is no document that is 100% identical to the outlier, but some come pretty close:
outlier: "Godmorgon. Godmorgon."
two docs in topic 35: "Godmorgon godmorgon."
Cosine similarity: 0.9904226. (I'm using KBLab/sentence-bert-swedish-cased as the sentence transformer because I could not get any meaningful topics with the default multilingual embedding model.)

If we take "Godmorgon." as reference, the cosine similarity drops to 0.9797847, but I still find it puzzling that BERTopic classifies it as an outlier rather than adding it to topic 35.
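For reference, similarity numbers like these can be checked with a small cosine-similarity helper; the sentence-transformers lines are commented out since they assume that package and a downloaded model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed usage with sentence-transformers:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
# e1, e2 = model.encode(["Godmorgon. Godmorgon.", "Godmorgon godmorgon."])
# print(cosine_similarity(e1, e2))
```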
Another, probably related observation is that the number of outliers remains the same, no matter what threshold I use in
I am new to BERTopic, so I am probably misconfiguring something, and I would appreciate any hints as to where I might have gone wrong; see my code below. (That said, it would be great if BERTopic could handle this by itself somehow, given that it is rather difficult to realize what is going on when you don't have super-similar documents and a super-simple topic like I do.)
Here is my code:
And then