
Identical topics: some become outliers, some are assigned to their topic #2026

tophee opened this issue Jun 1, 2024 · 1 comment

tophee commented Jun 1, 2024

I'm using BERTopic 0.16.2 and I'm trying to understand why about a third of my documents are categorized as outliers. len(docs) is 7578, and the document counts for my largest topics are:

   Topic  Count
0     -1   2454
1      0    207
2      1    203
3      2    152
4      3    130

As I looked through some of the outlier documents, it immediately struck me that many of them quite obviously belong to one of the "real" topics. For example, I have a "modes of transportation" topic (my interpretation, based on the top words in that topic being words like car, bike, train, bus), and there are outliers talking about biking and taking the bus.

The clearest indicator that something is going quite wrong, however, is this outlier:

Document ID: sp-1-3.3
Text: Godmorgon. Godmorgon.

(yes, the entire document consists of these two words.)

This is especially puzzling given that I have two topics that look like this:

[Screenshots of the top words for Topic 35 and Topic 60, both dominated by variants of "godmorgon"]

(Godmorgon means "good morning" in Swedish.)

I don't quite understand why there are two separate good-morning topics, one where "god morgon" has mostly been transcribed as two words and the other where it has been transcribed as "godmorron", but that is probably not an error but a matter of fine-tuning (or manually merging the topics), so let's not worry about that for now.

Here is the list of all documents in topic 35:
Document ID: sp-1-6.4
Text: Godmorgon. 

Document ID: sp-14-9.3
Text: Godmorgon. 

Document ID: sp-16-30.0
Text:  Godmorgon.

Document ID: sp-16-32.8
Text: Godmorgon. 

Document ID: sp-16-47.8
Text: Godmorgon.

Document ID: sp-16-96.5
Text: Godmorgon.

Document ID: sp-16-100.1
Text:  Godmorgon.

Document ID: sp-19-66.7
Text: Godmorgon godmorgon.

Document ID: sp-19-131.0
Text: Godmorgon.

Document ID: sp-19-146.0
Text: Godmorgon. 

Document ID: sp-19-180.3
Text: Godmorgon. 

Document ID: sp-19-182.2
Text: Godmorgon. 

Document ID: sp-19-183.2
Text: Godmorgon.

Document ID: sp-21-23.0
Text: Godmorgon. 

Document ID: sp-30-230.9
Text: Godmorgon.

Document ID: sp-30-232.3
Text: Godmorgon. 

Document ID: sp-31-236.5
Text: Godmorgon.

Document ID: sp-33-20.8
Text: Godmorgon.

Document ID: sp-33-22.7
Text: Godmorgon.

Document ID: sp-34-2.9
Text: Godmorgon. 

Document ID: sp-34-4.5
Text: Godmorgon. 

Document ID: sp-38-85.2
Text: Godmorgon. 

Document ID: sp-38-93.0
Text: Godmorgon godmorgon. 

Document ID: sp-38-94.6
Text: Godmorgon godmorgon.

Document ID: sp-38-96.8
Text: Godmorgon. 

Document ID: sp-38-100.6
Text: Godmorgon. 

Document ID: sp-40-48.5
Text:  Godmorgon.

Document ID: sp-40-79.1
Text: Godmorgon. 

Document ID: sp-40-80.5
Text: Godmorgon godmorgon 

Document ID: sp-40-82.6
Text: godmorgon. 

Document ID: sp-40-91.1
Text: Godmorgon.

Document ID: sp-41-68.3
Text:  Godmorgon. 

Document ID: sp-41-84.3
Text: Godmorgon.

Document ID: sp-41-85.8
Text: Godmorgon.

Document ID: sp-41-87.1
Text: Godmorgon.

Document ID: sp-41-103.7
Text: Godmorgon.

Document ID: sp-41-105.6
Text: Godmorgon.

Document ID: sp-42-17.9
Text: Godmorgon. 

Document ID: sp-43-56.9
Text: Godmorgon.

Document ID: sp-43-58.9
Text: Godmorgon.

Document ID: sp-43-108.8
Text: Godmorgon.

Document ID: sp-43-111.1
Text: Godmorgon.

I realize that no document is 100% identical to the outlier, but some come pretty close:

outlier: "Godmorgon. Godmorgon."
two docs in topic 35: "Godmorgon godmorgon."

Cosine similarity: 0.9904226. (I'm using KBLab/sentence-bert-swedish-cased as the sentence transformer because I could not get any meaningful topics with the default multilingual embedding model.)

If we take "Godmorgon." as the reference, the cosine similarity drops to 0.9797847, but I still find it puzzling that BERTopic classifies it as an outlier rather than adding it to topic 35.
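
For reference, here is a minimal sketch of how such similarities can be computed (the snippet itself is just for illustration):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Compare the outlier against its near-duplicates from topic 35
model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
emb = model.encode(["Godmorgon. Godmorgon.", "Godmorgon godmorgon.", "Godmorgon."])

print(cosine_similarity([emb[0]], [emb[1]]))  # outlier vs. "Godmorgon godmorgon."
print(cosine_similarity([emb[0]], [emb[2]]))  # outlier vs. "Godmorgon."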

Another, probably related, observation is that the number of outliers remains the same no matter what threshold I use in:

new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities", threshold=0.5)
new_topic_info = topic_model.get_topic_info()
print(new_topic_info)

I am new to BERTopic, so I am probably misconfiguring something, and I would appreciate any hints as to where I might have gone wrong; see my code below. (That said, it would be great if BERTopic could handle this by itself somehow, given that it is rather difficult to realize what is going on when you don't have super-similar documents and a super-simple topic like I do.)

Here is my code:
# Prepare embeddings once so that they don't have to be recalculated each time
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv('data_w_transcripts.csv')

# Drop rows without a transcript, remove "[*]" markers, then drop empty transcripts
df = df.dropna(subset=["Transcript"])
df['Transcript'] = df['Transcript'].str.replace(r'\[\*\]', '', regex=True)  # raw string avoids invalid-escape warnings
df = df[df['Transcript'].str.strip().astype(bool)]

# Use the Transcript column as docs
docs = df['Transcript'].tolist()
doc_ids = df["speech_id"].tolist()
meetings = df["file"].tolist()

# Prepare embeddings
#sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # doesn't seem to work with my data
sentence_model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

And then:

import pandas as pd
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance
#from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Initialize BERTopic
#representation_model = KeyBERTInspired()
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(ctfidf_model=ctfidf_model, representation_model=representation_model, calculate_probabilities=True, language="swedish")

# Fit the model with documents
topics, probs = topic_model.fit_transform(docs, embeddings)

# Get topic information
topic_info = topic_model.get_topic_info()

# Map each document to its topic and add document IDs to topic_info
doc_topic_map = pd.DataFrame({
    "doc_id": doc_ids,
    "text": docs,
    "topic": topics,
    "meeting": meetings
})

# Create a dictionary to collect document metadata for each topic
topic_docs_metadata = doc_topic_map.groupby("topic").apply(lambda x: x.to_dict(orient='records'), include_groups=False).to_dict()

# Add a new column with document metadata to the topic_info DataFrame
topic_info["documents"] = topic_info["Topic"].map(topic_docs_metadata)

# Create a dictionary to collect document IDs for each topic
#topic_doc_ids = doc_topic_map.groupby("topic")["doc_id"].apply(list).to_dict()

# Adjust display settings for Pandas DataFrame
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_colwidth', None)  # Display full content of each column

# Print modified topic information with document IDs
print(topic_info.head())

@MaartenGr (Owner) commented:

I'm using BERTopic 0.16.2 and I'm trying to understand why about a third of my documents are categorized as outliers.

Are you familiar with the underlying algorithms of BERTopic? If not, I would highly advise reading up on the clustering algorithm, HDBSCAN, which is what actually assigns data points to clusters (also referred to as topics).

If so, let me go a bit more in-depth with some of the things you shared:

I don't quite understand why there are two separate good-morning topics, one where "god morgon" has mostly been transcribed as two words and the other where it has been transcribed as "godmorron", but that is probably not an error but a matter of fine-tuning (or manually merging the topics), so let's not worry about that for now.

Just note that it is also possible to merge such topics automatically with nr_topics="auto", although the preferred method is to use min_topic_size.
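
For example (a rough sketch; the topic numbers are just the ones from your post):

# Option 1: let BERTopic merge similar topics during fitting
topic_model = BERTopic(nr_topics="auto")

# Option 2: merge specific topics after fitting
topic_model.merge_topics(docs, topics_to_merge=[35, 60])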

Cosine similarity: 0.9904226. (I'm using KBLab/sentence-bert-swedish-cased as the sentence transformer because I could not get any meaningful topics with the default multilingual embedding model.)
If we take "Godmorgon." as the reference, the cosine similarity drops to 0.9797847, but I still find it puzzling that BERTopic classifies it as an outlier rather than adding it to topic 35.

Absolute cosine similarities are a bit tricky to interpret since they tell you little about the distribution of similarities. For instance, it is possible that this specific embedding model simply produces very high similarities across the board, making the separation of documents more difficult.
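
One quick sanity check is to look at the baseline distribution of similarities in your corpus, for example (a rough sketch, assuming the embeddings variable from your code):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample random document pairs to estimate the "background" similarity level
rng = np.random.default_rng(42)
pairs = rng.choice(len(embeddings), size=(1000, 2))
sims = np.array([cosine_similarity([embeddings[i]], [embeddings[j]])[0, 0] for i, j in pairs])
print(f"mean={sims.mean():.3f}, p95={np.percentile(sims, 95):.3f}")

If the background similarity is already very high, a value like 0.98 is less informative than it looks.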

Another, probably related observation is that the number of outliers remains the same, no matter what threshold I use in

Have you tried setting the threshold to 0? It would also be interesting to try a different strategy ("embeddings", for instance), since your issue seems to relate to the embedding model you are using.
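
Something along these lines (a sketch; note that reduce_outliers only returns new topic assignments, so you need update_topics for get_topic_info() to reflect them):

# Reassign outliers based on embedding similarity, with no threshold
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings", embeddings=embeddings, threshold=0)

# Propagate the new assignments into the model
topic_model.update_topics(docs, topics=new_topics)
print(topic_model.get_topic_info())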

# Initialize BERTopic
#representation_model = KeyBERTInspired()
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(ctfidf_model=ctfidf_model, representation_model=representation_model, calculate_probabilities=True, language="swedish")

# Fit the model with documents
topics, probs = topic_model.fit_transform(docs, embeddings)

You should also set embedding_model=sentence_model when initializing BERTopic. The reason is that both KeyBERTInspired and MaximalMarginalRelevance use the underlying embedding model in addition to the precomputed embeddings.
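
For instance (a sketch based on your snippet):

topic_model = BERTopic(
    embedding_model=sentence_model,  # so KeyBERTInspired / MMR use the Swedish model too
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    calculate_probabilities=True,
    language="swedish",
)
topics, probs = topic_model.fit_transform(docs, embeddings)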

Cosine similarity: 0.9904226. (I'm using KBLab/sentence-bert-swedish-cased as the sentence transformer because I could not get any meaningful topics with the default multilingual embedding model.)

As mentioned above, note that without embedding_model set, you are still using the default multilingual embedding model for the embeddings used inside KeyBERTInspired and MMR.

I am new to BERTopic, so I am probably misconfiguring something, and I would appreciate any hints as to where I might have gone wrong; see my code below. (That said, it would be great if BERTopic could handle this by itself somehow, given that it is rather difficult to realize what is going on when you don't have super-similar documents and a super-simple topic like I do.)

Based on what you shared, it is difficult to say what the exact "issue" is, since this might simply be a consequence of how the underlying clustering algorithm, HDBSCAN, handles these kinds of input embeddings. For instance, it has been suggested before that HDBSCAN is quite careful when assigning embeddings to clusters, thereby creating many outliers.

Instead, I would advise trying out some of HDBSCAN's hyperparameters to see if that makes a difference (e.g., min_cluster_size). Moreover, it might be helpful to also tune UMAP's n_neighbors, since that also influences how much of the local structure it "sees".
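
For instance (illustrative values only; the right ones depend on your data):

from umap import UMAP
from hdbscan import HDBSCAN

# A less conservative clustering setup to experiment with
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=5, metric="euclidean", prediction_data=True)

topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True)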
