
Additional representations did not update with topic reduction #2035

Open
vidieo opened this issue Jun 4, 2024 · 5 comments

Comments

@vidieo

vidieo commented Jun 4, 2024

Hi, I am trying to reduce the number of topics that I have with
topic_model.reduce_topics(docs, nr_topics=400)
which works fine. However, when I ran
topic_model.get_topic_info()
I got mismatched representations. Only the main representation was updated and all the other aspects were from the old topics.

[image: screenshot of the get_topic_info() output]

I understand that the preferred way to control the number of topics is min_cluster_size, which I did use, but it would be nice to know whether reduce_topics can also update the additional representations. Thanks in advance!

@MaartenGr
Owner

Strange, it seems that they are updated for some topics but not for others. If I'm not mistaken, topic 396 is not properly updated but topic 0 is, right?

Also, can you share your full code along with the versions of your environment?

@vidieo
Author

vidieo commented Jun 4, 2024

Thanks for such a great project and the quick response @MaartenGr! The additional representations do not get updated by the reduce_topics method, so, for example, topic 396 here still has the KeyBERT and MMR representations of the old topic 396. It was just a coincidence that the first three topics were similar before and after reduction. After a few more runs I learned that this happens only when loading a saved model, since no sub-models are saved with it. Is there a way to pass these sub-models so I can tweak the topics of a saved model?

I am running bertopic 0.16.2 on Python 3.10.12.

The code:

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic
import pickle

embedding_model = SentenceTransformer("all-mpnet-base-v2")

with open("/content/drive/MyDrive/code_stuff/mpnet_embeddings.pickle", "rb") as pkl:
    embeddings = pickle.load(pkl)

umap_model = UMAP(n_neighbors=20, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=10,
                        metric="euclidean", cluster_selection_method="eom",
                        prediction_data=True)

vectorizer_model = CountVectorizer(stop_words="english", min_df=5, max_df=0.9,
                                   ngram_range=(1, 3))

keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = {"KeyBERT": keybert_model,
                        "MMR": mmr_model}


topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True,
)

# docs (the list of input documents) is assumed to be defined earlier; it is not shown here.
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)

topic_model.reduce_topics(docs, nr_topics=400)

topic_model.get_topic_info()

@MaartenGr
Owner

> After a few more runs I learned that this happened only when loading a saved model since no sub-models is saved with it. Is there a way to pass these submodels so I can tweak the topics of a saved model?

I'm not sure I understand correctly. The code you shared does not show loading a saved BERTopic model, right?

Also, if you need to use nr_topics (which is not something I would recommend), you could also use that parameter in BERTopic(nr_topics=400). That might work for you.

@vidieo
Author

vidieo commented Jun 6, 2024

> I'm not sure if I understand correctly. The code you shared does not show loading a saved BERTopic model right?

Sorry, that's my bad. I shared the original code and not the code for the subsequent runs, in which I loaded the model. Again, it only happens when loading a saved model, so I will be fine. I'm still looking into the best way to reduce the number of topics for my case. I do want to keep the small clusters if they are distinct enough, which is why I'm looking into merging methods.
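
For readers who hit the same saved-model case, one possible workaround can be sketched as follows. It is not confirmed anywhere in this thread. The idea is to recompute every aspect after the reduction by handing the representation models to update_topics, which accepts representation_model and vectorizer_model arguments. The helper name reduce_and_refresh is hypothetical. Whether KeyBERTInspired works on a loaded model is an assumption that depends on an embedding model being available, for example one passed to BERTopic.load(path, embedding_model=...).

```python
def reduce_and_refresh(topic_model, docs, nr_topics, representation_model,
                       vectorizer_model=None):
    """Reduce topics, then recompute all representations (hypothetical helper).

    `topic_model` is expected to behave like a fitted or loaded BERTopic
    instance: it must provide reduce_topics, update_topics and get_topic_info.
    """
    # Step 1: merge topics down to the requested number.
    topic_model.reduce_topics(docs, nr_topics=nr_topics)

    # Step 2: recompute the main representation and every additional aspect
    # with the representation models passed in explicitly, so the result does
    # not depend on sub-models having been stored with the saved model.
    topic_model.update_topics(
        docs,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
    )
    return topic_model.get_topic_info()


# Usage sketch (assumes bertopic is installed, `docs` holds the original
# documents, and "path/to/model" is a placeholder for the saved model path):
#
# from bertopic import BERTopic
# from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
#
# topic_model = BERTopic.load("path/to/model", embedding_model=embedding_model)
# aspects = {"KeyBERT": KeyBERTInspired(),
#            "MMR": MaximalMarginalRelevance(diversity=0.3)}
# info = reduce_and_refresh(topic_model, docs, 400, aspects,
#                           vectorizer_model=vectorizer_model)
```

Passing vectorizer_model explicitly matters here because update_topics otherwise falls back to a default CountVectorizer rather than the one used during fitting.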

@MaartenGr
Owner

@vidieo Then it might indeed be helpful to start with min_topic_size to find the number of topics you are interested in and then manually merge topics instead of using .reduce_topics. If you run into any other problems, let me know!
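
The manual-merge route suggested above could be sketched like this, assuming a pairwise topic-similarity matrix is already available (for example cosine similarities between topic embeddings; computing it is left out here). The helper plan_merges and the 0.9 threshold are hypothetical, and the matrix rows are assumed to line up with topic ids 0..n-1, with the outlier topic -1 excluded. The returned groups have the list-of-lists shape that merge_topics(docs, topics_to_merge) accepts.

```python
def plan_merges(similarity, threshold=0.9):
    """Group topic ids whose pairwise similarity is >= threshold.

    `similarity` is a square list of lists where similarity[i][j] is the
    similarity between topic i and topic j. Topics linked through a chain of
    similar pairs end up in the same group. Only groups with at least two
    topics are returned, so distinct small topics are left untouched.
    """
    n = len(similarity)
    parent = list(range(n))  # union-find forest, one node per topic

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller topic id as the root of the group.
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [group for group in groups.values() if len(group) > 1]


# Usage sketch (assumes a fitted BERTopic `topic_model`, the original `docs`,
# and a precomputed `similarity` matrix):
#
# topics_to_merge = plan_merges(similarity, threshold=0.9)
# if topics_to_merge:
#     topic_model.merge_topics(docs, topics_to_merge)
```

A threshold keeps the decision explicit: only topics that are close enough get merged, instead of forcing the model down to a fixed topic count.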
