if sparse and not np.issubdtype(doc_word.dtype, int) issue!!!! #55

Open
ThomasADuffy opened this issue Aug 25, 2020 · 2 comments

@ThomasADuffy

ThomasADuffy commented Aug 25, 2020

Hey all, I ran into an issue but also found a fix! I was passing a sparse matrix into guidedLDA and got a ValueError, raised by the marked if statement in utils.py:


def matrix_to_lists(doc_word):
    """Convert a (sparse) matrix of counts into arrays of word and doc indices

    Parameters
    ----------
    doc_word : array or sparse matrix (D, V)
        document-term matrix of counts

    Returns
    -------
    (WS, DS) : tuple of two arrays
        WS[k] contains the kth word in the corpus
        DS[k] contains the document index for the kth word

    """
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:
        logger.warning("all zero row in document-term matrix found")
    if np.count_nonzero(doc_word.sum(axis=0)) != doc_word.shape[1]:
        logger.warning("all zero column in document-term matrix found")
    sparse = True
    try:
        # if doc_word is a scipy sparse matrix
        doc_word = doc_word.copy().tolil()
    except AttributeError:
        sparse = False
    if sparse and not np.issubdtype(doc_word.dtype, int):
        raise ValueError("expected sparse matrix with integer values, found float values")  # <-- the error is raised here

    ii, jj = np.nonzero(doc_word)
    if sparse:
        ss = tuple(doc_word[i, j] for i, j in zip(ii, jj))
    else:
        ss = doc_word[ii, jj]

    n_tokens = int(doc_word.sum())
    DS = np.repeat(ii, ss).astype(np.intc)
    WS = np.empty(n_tokens, dtype=np.intc)
    startidx = 0
    for i, cnt in enumerate(ss):
        cnt = int(cnt)
        WS[startidx:startidx + cnt] = jj[i]
        startidx += cnt
    return WS, DS

The reason is that the sparse matrix going in gets converted to a LIL matrix and has an np.int64 data type, which is not treated as a subtype of the built-in "int". I had to change the check to np.int64 to get around this, so the new function has only this line changed:


    if sparse and not np.issubdtype(doc_word.dtype, np.int64):
        raise ValueError("expected sparse matrix with integer values, found float values")

Everything is now working as usual. Let me know how I can submit a pull request if needed, as I have not done one before. I believe a better workaround would be a catch-all that checks whether the dtype is in a list of integer types, because they should all work with LDA.
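
A minimal sketch of the catch-all idea, assuming NumPy's abstract np.integer type is an acceptable stand-in for a hand-written list of integer dtypes. The helper name check_integer_counts is hypothetical and is not part of guidedlda:

```python
import numpy as np


def check_integer_counts(dtype):
    """Raise ValueError unless dtype is some NumPy integer type.

    Hypothetical helper, not part of guidedlda. np.integer is the abstract
    parent of all signed and unsigned NumPy integer types, so int32, int64,
    uint16, etc. pass, while float types are rejected.
    """
    if not np.issubdtype(dtype, np.integer):
        raise ValueError(
            "expected sparse matrix with integer values, found %s" % np.dtype(dtype)
        )


# Integer dtypes of any width pass without raising.
for dt in (np.int32, np.int64, np.uint16):
    check_integer_counts(dt)

# A float dtype is still rejected.
try:
    check_integer_counts(np.float64)
except ValueError as exc:
    print(exc)
```

In matrix_to_lists this would amount to testing doc_word.dtype against np.integer instead of int or np.int64.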

Environment: Windows 10, Python 3.8.5

@ParitoshSingh07

Would love to see this implemented. It sounds like only the faulty ValueError is stopping the use of sparse matrices, while the underlying code can handle them perfectly well.

@hhagedorn

hhagedorn commented Apr 15, 2021

Thank you for the solution! On my machine (Windows 10 & Python 3.9) np.int64 did not solve it, but substituting it with np.integer did.
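
A small comparison of the two checks, as a sketch of why they can differ. np.int64 matches only 64-bit signed integers, so a matrix stored with another integer width (for example int32, which is an assumption here and not confirmed in this thread) would still be rejected, whereas np.integer matches every integer width:

```python
import numpy as np

# Compare the narrow check (np.int64) with the broad one (np.integer).
results = {}
for dt in (np.int32, np.int64, np.uint8, np.float64):
    name = np.dtype(dt).name
    passes_int64 = bool(np.issubdtype(dt, np.int64))
    passes_integer = bool(np.issubdtype(dt, np.integer))
    results[name] = (passes_int64, passes_integer)
    print(name, "| np.int64 check:", passes_int64, "| np.integer check:", passes_integer)
```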
