Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transform method not handling single embeddings or strings given to it. #2018

Open
1jamesthompson1 opened this issue May 29, 2024 · 1 comment

Comments

@1jamesthompson1
Copy link

Issue

Currently the transform method of the BERTopic has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call the transform method on a single document.

For example it would be nicer to do this:

document = 'This is a really interesting document'

document_embedding = np.random.rand(768)

topic_model.transform(document, document_embedding)
Example of what you currently have to do
document = 'This is a really interesting document'

document_embedding = np.random.rand(768)

topic_model.transform([document], document_embedding.reshape(1,-1))

Currently if you run it you get this error.

BFile ~/code/BERTopic/bertopic/_utils.py:55, in check_embeddings_shape(embeddings, docs)
      else:
         if embeddings.shape[0] != len(docs):
--->         raise ValueError("Make sure that the embeddings are a numpy array with shape: "
                               "(len(docs), vector_dim) where vector_dim is the dimensionality "
                               "of the vector embeddings. ")

ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.

Why I think it is happening

This is because the embeddings shape check inside transform happens before:

  1. the document argument is converted into a list if it is str
  2. The embedding is reshaped from a 1d array to 2d array (Note that it currently never does this).

Thoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?

@MaartenGr
Copy link
Owner

It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

It's interesting but in my experience, it rarely happens that users try it on a single document and also pass an embedding. Transforming a single document to an embedding is incredibly cheap, so there is no need to then also pass the embedding. I think that's why this issue hasn't been caught until now if I remember most issues of the last couple of years well enough.

Either way, a fix should indeed be necessary here to transform an embedding into the right shape when there is only one document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants