scipy2scipy_clipped may return a matrix with a different shape to that of the input matrix #2065

Open
psorianom opened this issue May 25, 2018 · 1 comment · May be fixed by #2066
Labels
bug, difficulty medium

Comments

psorianom commented May 25, 2018

Description

The function scipy2scipy_clipped may return a clipped matrix with fewer columns than the input matrix if the last column is not among the top-n entries of any row. This can easily happen while chunking in SparseMatrixSimilarity: each chunk holds only some rows of the similarity matrix, so a chunk that does not include the last row never sees the last entry of that row (which, in a similarity matrix, usually contains 1).
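
The effect can be illustrated with SciPy alone. The sketch below is an assumption about the mechanism, not gensim's code: it builds the same CSR matrix twice from (data, indices, indptr), once letting SciPy derive the width from the largest column index present and once passing the shape explicitly. The toy values are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Retained entries of a 2 x 5 matrix whose last column (index 4) is empty.
data = np.array([0.9, 0.8, 0.7])
indices = np.array([0, 3, 1])
indptr = np.array([0, 2, 3])

# No shape given: the width is inferred as (largest column index + 1).
inferred = csr_matrix((data, indices, indptr))

# Shape given: the empty trailing column is preserved.
explicit = csr_matrix((data, indices, indptr), shape=(2, 5))

print(inferred.shape)  # (2, 4)
print(explicit.shape)  # (2, 5)
```

Any routine that rebuilds a matrix from the kept entries without carrying over the input's column count would lose trailing empty columns in the same way.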

Steps/Code/Corpus to Reproduce

Example:

from scipy.sparse import random, vstack
from gensim.matutils import scipy2scipy_clipped
from sklearn.metrics.pairwise import cosine_similarity

# Some random sparse matrix
X = random(1000, 2000, density=.2, format="csc")

# Its sparse cosine similarity matrix, shape (1000, 1000)
X_sim = cosine_similarity(X, dense_output=False)

# Split the similarity matrix by rows to simulate chunking
X_sim_chunk1 = X_sim[:500, :]
X_sim_chunk2 = X_sim[500:, :]

# Ensure that no row of the first chunk is similar to the last item
X_sim_chunk1[:, -1] = 0

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)

X_clipped2 = scipy2scipy_clipped(X_sim_chunk2, 100)
print(X_clipped2.shape) # (500, 1000)

# Stacking the clipped chunks back together fails because their widths differ
vstack([X_clipped1, X_clipped2])
# ValueError: incompatible dimensions for axis 1
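
Until the function is fixed, one possible workaround is to widen each clipped chunk back to the input's column count before stacking. This is only a sketch: pad_columns is a hypothetical helper, not part of gensim or SciPy, and the small matrices are invented to keep the example self-contained.

```python
from scipy.sparse import csr_matrix, vstack


def pad_columns(clipped, n_cols):
    """Hypothetical helper: return `clipped` as CSR with exactly `n_cols` columns."""
    clipped = csr_matrix(clipped)
    if clipped.shape[1] > n_cols:
        raise ValueError("matrix is already wider than n_cols")
    # Reuse the stored entries and only state the intended shape explicitly.
    return csr_matrix(
        (clipped.data, clipped.indices, clipped.indptr),
        shape=(clipped.shape[0], n_cols),
    )


# A chunk that lost its last column, and one that kept it.
top = csr_matrix(([1.0, 2.0], [0, 1], [0, 1, 2]))         # shape inferred as (2, 2)
bottom = csr_matrix(([3.0], [2], [0, 1]), shape=(1, 3))   # shape (1, 3)

stacked = vstack([pad_columns(top, 3), bottom])
print(stacked.shape)  # (3, 3)
```

In the example above, each clipped chunk would be passed through pad_columns together with the column count of the chunk it was clipped from before calling vstack.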

Expected Results

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 1000)
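
For comparison, here is a minimal sketch of row-wise top-n clipping that keeps the input shape. It is not gensim's implementation; clip_topn_rows is a hypothetical name, and the only point it makes is that the output shape is taken from the input rather than from the retained indices. It assumes the matrix has at least one row.

```python
import numpy as np
from scipy.sparse import csr_matrix


def clip_topn_rows(matrix, topn):
    """Hypothetical sketch: keep the `topn` largest-magnitude entries per row."""
    matrix = csr_matrix(matrix)
    data, indices, indptr = [], [], [0]
    for i in range(matrix.shape[0]):
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        row_data = matrix.data[start:end]
        row_indices = matrix.indices[start:end]
        # Positions of the topn largest absolute values, restored to column order.
        keep = np.sort(np.argsort(-np.abs(row_data))[:topn])
        data.append(row_data[keep])
        indices.append(row_indices[keep])
        indptr.append(indptr[-1] + len(keep))
    # The shape comes from the input, so empty trailing columns survive.
    return csr_matrix(
        (np.concatenate(data), np.concatenate(indices), indptr),
        shape=matrix.shape,
    )


dense = np.array([[0.5, 0.0, -0.9, 0.0],
                  [0.1, 0.2, 0.0, 0.0]])
clipped = clip_topn_rows(csr_matrix(dense), 1)
print(clipped.shape)  # (2, 4)
```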

Actual Results

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)

Versions

Linux-4.4.0-116-generic-x86_64-with-debian-stretch-sid
('Python', '2.7.14 |Anaconda, Inc.| (default, Nov 8 2017, 22:44:41) \n[GCC 7.2.0]')
('NumPy', '1.13.3')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

menshikh-iv added the bug and difficulty medium labels Jul 30, 2018
menshikh-iv (Contributor) commented

Thanks for the report, @psorianom 👍
