(Zero-shot Topic Modeling) TypeError: object of type 'numpy.float64' has no len() #2034

Open
Paignn opened this issue Jun 3, 2024 · 1 comment


Paignn commented Jun 3, 2024

Hello! I'm currently working on a project with a specific NLP task using BERTopic: zero-shot topic modeling. Unfortunately, I hit an error when I fit the model.

Here is how I build the model:

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

embedding_model_en = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_en = embedding_model_en.encode(df_comment_1['text_en'], show_progress_bar=True)
umap_model_en = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model_en = HDBSCAN(min_cluster_size=40, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model_en = CountVectorizer(min_df=2, ngram_range=(1, 2))
zeroshot_topic_list = ["good", "bad"]
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model_en = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}
topic_model_en = BERTopic(
    embedding_model=embedding_model_en,
    umap_model=umap_model_en,
    hdbscan_model=hdbscan_model_en,
    vectorizer_model=vectorizer_model_en,
    representation_model=representation_model_en,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.5,
    verbose=True,
    nr_topics=50
)

And when I run:
topics_en, probs_en = topic_model_en.fit_transform(df_comment_1['text_en'], embeddings_en)

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-2af852ba034a> in <cell line: 13>()
     11 )
     12 
---> 13 topics_en, probs_en = topic_model_en.fit_transform(df_comment_1['text_en'], embeddings_en)
     14 topic_model_en.save('my_model_en_22', serialization="safetensors")
     15 topic_model_en.get_topic_info()

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
    446         # Combine Zero-shot with outliers
    447         if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448             predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    449 
    450         return predictions, self.probabilities_

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3619         empty_dimensionality_model = BaseDimensionalityReduction()
   3620         empty_cluster_model = BaseCluster()
-> 3621         zeroshot_model = BERTopic(
   3622                 n_gram_range=self.n_gram_range,
   3623                 low_memory=self.low_memory,

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, images, y)
    314         ```
    315         """
--> 316         self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    317         return self
    318 

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
    431         else:
    432             # Extract topics by calculating c-TF-IDF
--> 433             self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    434 
    435             # Reduce topics

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _extract_topics(self, documents, embeddings, mappings, verbose)
   3784             logger.info("Representation - Extracting topics from clusters using representation models.")
   3785         documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
-> 3786         self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3787         self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3788         self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _c_tf_idf(self, documents_per_topic, fit, partial_fit)
   4006 
   4007         if fit:
-> 4008             self.ctfidf_model = self.ctfidf_model.fit(X, multiplier=multiplier)
   4009 
   4010         c_tf_idf = self.ctfidf_model.transform(X)

/usr/local/lib/python3.10/dist-packages/bertopic/vectorizers/_ctfidf.py in fit(self, X, multiplier)
     86                 idf = idf * multiplier
     87 
---> 88             self._idf_diag = sp.diags(idf, offsets=0,
     89                                       shape=(n_features, n_features),
     90                                       format='csr',

/usr/local/lib/python3.10/dist-packages/scipy/sparse/_construct.py in diags(diagonals, offsets, shape, format, dtype)
    146     if isscalarlike(offsets):
    147         # now check that there's actually only one diagonal
--> 148         if len(diagonals) == 0 or isscalarlike(diagonals[0]):
    149             diagonals = [np.atleast_1d(diagonals)]
    150         else:

TypeError: object of type 'numpy.float64' has no len()

How can I fix this error?
Thank you.
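For context on what the traceback is showing: the final frame fails because `scipy.sparse.diags` receives a bare `numpy.float64` scalar instead of a 1-D array, which can happen if the `idf` values computed in the c-TF-IDF fit collapse to a single number. The sketch below reproduces just that scipy-level failure and shows the `np.atleast_1d` wrapping that avoids it; the single-feature setup is an assumption for illustration, not the exact state inside BERTopic.

```python
import numpy as np
from scipy.sparse import diags

# A 0-d numpy scalar, as `idf` can become when only one value survives
idf = np.float64(1.5)

try:
    # Mirrors the failing call in bertopic/vectorizers/_ctfidf.py
    diags(idf, offsets=0, shape=(1, 1), format='csr')
except TypeError as e:
    print(e)  # object of type 'numpy.float64' has no len()

# Wrapping the scalar in a 1-D array sidesteps the len() check
m = diags(np.atleast_1d(idf), offsets=0, shape=(1, 1), format='csr')
print(m.toarray())  # [[1.5]]
```

This suggests the error is a symptom of the vocabulary or topic set collapsing to a single entry somewhere in the zero-shot pipeline, rather than a problem in scipy itself.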

@MaartenGr (Owner) commented:

Hmmm, not sure what is happening here. Which version of BERTopic are you using? Also, could you try again without using vectorizer_model_en and nr_topics?
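A sketch of the suggested retry: print the installed BERTopic version and rebuild the model without `vectorizer_model` and `nr_topics`. This fragment reuses the models defined in the snippet above (`embedding_model_en`, `umap_model_en`, etc.), so it is not runnable on its own.

```python
import bertopic
from bertopic import BERTopic

# Report the installed version the maintainer asked about
print(bertopic.__version__)

# Same configuration as before, minus vectorizer_model and nr_topics
topic_model_en = BERTopic(
    embedding_model=embedding_model_en,
    umap_model=umap_model_en,
    hdbscan_model=hdbscan_model_en,
    representation_model=representation_model_en,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.5,
    verbose=True,
)
```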
