Updating and Pushing a BERTopic Model with New Documents to Hugging Face Hub still shows old no of training document #2071

Open
sdave-connexion opened this issue Jun 30, 2024 · 1 comment
Labels
bug Something isn't working

@sdave-connexion

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I have been using BERTopic for topic modelling and recently needed to update my existing BERTopic model with new documents. I want to push the updated model to the Hugging Face Hub, ensuring that it reflects the new number of documents and topics.

Here's what I've done so far:

  • Loaded my existing BERTopic model:
  • Added new documents and their embeddings:
  • Updated the model with new documents:
`new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)`
  • Saved the updated model using safetensors:
  • Pushed the updated model to Hugging Face Hub:

Despite following these steps, I still see the old number of training documents in the repository on the Hugging Face Hub. How can I ensure that the updated model reflects the new number of training documents and topics?

Any help or guidance on this would be greatly appreciated!

Reproduction

from bertopic import BERTopic

# Load your existing BERTopic model
topic_model = BERTopic.load("shantanudave/BERTopic_ArXiv", embedding_model="sentence-transformers/all-MiniLM-L6-v2")

new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)

new_model_name = "BERTopic_v2"

# Save the updated model locally using safetensors

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(new_model_name, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

from huggingface_hub import login

# Authenticate with Hugging Face
login(token="your_hugging_face_token")

# Push the updated model to Hugging Face Hub
topic_model.push_to_hf_hub(
    repo_id=f"shantanudave/{new_model_name}",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model
)

BERTopic Version

pip install -U bertopic

@MaartenGr
Owner

Updated the model with new documents:

That's the thing, you didn't update the model. When you use .transform, you are merely predicting the topics of the documents that you passed to it. .transform, as in scikit-learn, is not meant to update the underlying model. Instead, if you want to update the model, I would advise using either online topic modeling or the .merge_models technique.
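
For readers landing here, the merging approach mentioned above can be sketched roughly as follows. This is a minimal sketch, not the maintainer's exact recipe: the helper name `merge_in_new_documents` is hypothetical, and it assumes that `new_docs` is a list of strings and `new_embeddings` is a NumPy array computed with the same embedding model as the original. It fits a second model on the new documents only and then combines the two with BERTopic's `merge_models` class method, so the topic sizes of the result include documents from both models.

```python
def merge_in_new_documents(
    base_model,
    new_docs,
    new_embeddings,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    min_similarity=0.7,
):
    """Fit a model on the new documents and merge it into the existing one.

    Hypothetical helper for illustration; only BERTopic, .fit and
    BERTopic.merge_models are library calls.
    """
    # Imported lazily because loading BERTopic is comparatively slow.
    from bertopic import BERTopic

    # Train a separate model on the new documents only.
    new_model = BERTopic(embedding_model=embedding_model).fit(
        new_docs, embeddings=new_embeddings
    )

    # Topics of new_model that are dissimilar to every topic in base_model
    # (below min_similarity) are added as new topics; similar ones are merged.
    return BERTopic.merge_models([base_model, new_model], min_similarity=min_similarity)


# Usage, assuming `lemmatized_docs` and `embeddings` from the reproduction above:
#
#   from bertopic import BERTopic
#   base = BERTopic.load("shantanudave/BERTopic_ArXiv",
#                        embedding_model="sentence-transformers/all-MiniLM-L6-v2")
#   merged = merge_in_new_documents(base, lemmatized_docs, embeddings)
#   print(merged.get_topic_info()["Count"].sum())  # should now include the new docs
#   merged.push_to_hf_hub(repo_id="shantanudave/BERTopic_v2",
#                         serialization="safetensors", save_ctfidf=True)
```

The key difference from the reproduction is that the merged model, not the model `.transform` was called on, is the object that gets saved and pushed. Whether the document count shown on the Hub changes is worth verifying locally with `get_topic_info()` before pushing.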
