`transform` method not handling single embeddings or strings given to it. #2018

1jamesthompson1 · 2024-05-29T07:03:27Z

Issue

Currently the transform method of the BERTopic has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call the transform method on a single document.

For example it would be nicer to do this:

document = 'This is a really interesting document'

document_embedding = np.random.rand(768)

topic_model.transform(document, document_embedding)

Example of what you currently have to do

document = 'This is a really interesting document'

document_embedding = np.random.rand(768)

topic_model.transform([document], document_embedding.reshape(1,-1))

Currently if you run it you get this error.

BFile ~/code/BERTopic/bertopic/_utils.py:55, in check_embeddings_shape(embeddings, docs)
      else:
         if embeddings.shape[0] != len(docs):
--->         raise ValueError("Make sure that the embeddings are a numpy array with shape: "
                               "(len(docs), vector_dim) where vector_dim is the dimensionality "
                               "of the vector embeddings. ")

ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.

Why I think it is happening

This is because the embeddings shape check inside transform happens before:

the document argument is converted into a list if it is str
The embedding is reshaped from a 1d array to 2d array (Note that it currently never does this).

Thoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?

The text was updated successfully, but these errors were encountered:

MaartenGr · 2024-05-31T13:31:05Z

It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

It's interesting but in my experience, it rarely happens that users try it on a single document and also pass an embedding. Transforming a single document to an embedding is incredibly cheap, so there is no need to then also pass the embedding. I think that's why this issue hasn't been caught until now if I remember most issues of the last couple of years well enough.

Either way, a fix should indeed be necessary here to transform an embedding into the right shape when there is only one document.

1jamesthompson1 mentioned this issue Jun 9, 2024

transform method handle a single document with provided embedidngs #2043

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`transform` method not handling single embeddings or strings given to it. #2018

`transform` method not handling single embeddings or strings given to it. #2018

1jamesthompson1 commented May 29, 2024

MaartenGr commented May 31, 2024

transform method not handling single embeddings or strings given to it. #2018

transform method not handling single embeddings or strings given to it. #2018

Comments

1jamesthompson1 commented May 29, 2024

MaartenGr commented May 31, 2024

`transform` method not handling single embeddings or strings given to it. #2018

`transform` method not handling single embeddings or strings given to it. #2018