if sparse and not np.issubdtype(doc_word.dtype, int) issue!!!! #55

ThomasADuffy · 2020-08-25T00:27:32Z

Hey all, I ran into an issue but also found a fix! I was passing a sparse matrix into guidedLDA and got a ValueError, raised by the if statement marked below in utils.py:


def matrix_to_lists(doc_word):
    """Convert a (sparse) matrix of counts into arrays of word and doc indices

    Parameters
    ----------
    doc_word : array or sparse matrix (D, V)
        document-term matrix of counts

    Returns
    -------
    (WS, DS) : tuple of two arrays
        WS[k] contains the kth word in the corpus
        DS[k] contains the document index for the kth word

    """
    if np.count_nonzero(doc_word.sum(axis=1)) != doc_word.shape[0]:
        logger.warning("all zero row in document-term matrix found")
    if np.count_nonzero(doc_word.sum(axis=0)) != doc_word.shape[1]:
        logger.warning("all zero column in document-term matrix found")
    sparse = True
    try:
        # if doc_word is a scipy sparse matrix
        doc_word = doc_word.copy().tolil()
    except AttributeError:
        sparse = False
    if sparse and not np.issubdtype(doc_word.dtype, int):
        raise ValueError("expected sparse matrix with integer values, found float values")  # <-- this is the line that raises

    ii, jj = np.nonzero(doc_word)
    if sparse:
        ss = tuple(doc_word[i, j] for i, j in zip(ii, jj))
    else:
        ss = doc_word[ii, jj]

    n_tokens = int(doc_word.sum())
    DS = np.repeat(ii, ss).astype(np.intc)
    WS = np.empty(n_tokens, dtype=np.intc)
    startidx = 0
    for i, cnt in enumerate(ss):
        cnt = int(cnt)
        WS[startidx:startidx + cnt] = jj[i]
        startidx += cnt
    return WS, DS

The reason is that the sparse matrix going in is converted to a LIL matrix (via tolil()) and has dtype np.int64, which this check does not treat as equivalent to the builtin "int". I changed the check to use np.int64 to get around the issue, so the only change to the function is this:


    if sparse and not np.issubdtype(doc_word.dtype, np.int64):
        raise ValueError("expected sparse matrix with integer values, found float values")
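To see how the check behaves on a given machine, here is a small diagnostic sketch. It assumes, as I understand recent NumPy versions, that `np.issubdtype` resolves a builtin type such as `int` through `np.dtype(int)`, and that result depends on platform and NumPy version (on Windows it has historically been a 32-bit integer). The helper name `check_table` is made up for illustration and is not part of guidedLDA.

```python
import numpy as np


def check_table(dtypes=("int32", "int64", "float64")):
    """Hypothetical helper: report how each dtype fares against two checks."""
    rows = {}
    for name in dtypes:
        dt = np.dtype(name)
        rows[name] = {
            # Result may differ between platforms and NumPy versions.
            "int": bool(np.issubdtype(dt, int)),
            # Only an exact int64 dtype passes this one.
            "np.int64": bool(np.issubdtype(dt, np.int64)),
        }
    return rows


if __name__ == "__main__":
    print("np.dtype(int) on this machine:", np.dtype(int))
    for name, result in check_table().items():
        print(name, result)
```

If `np.dtype(int)` prints `int32` while the matrix holds `int64` counts, the original check rejects the matrix even though its values are integers. The same logic explains why an `np.int64` check would reject an `int32` matrix.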

Everything now works as usual. Let me know how I can submit a pull request if needed, as I have not done one before. I believe a better workaround would be a catch-all that checks whether the dtype is any of the integer types, because they should all work with LDA.

Environment: Windows 10, Python 3.8.5


ParitoshSingh07 · 2021-02-24T09:36:02Z

Would love to see this implemented. It sounds like only the faulty ValueError is stopping the use of sparse matrices, while the underlying code can handle them perfectly well.

hhagedorn · 2021-04-15T09:28:36Z

Thank you for the solution! On my machine (Windows 10 & Python 3.9) np.int64 did not solve it, but substituting it with np.integer did.
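For anyone wanting a catch-all, the `np.integer` variant can be sketched roughly as below. `np.integer` is the abstract parent of NumPy's signed and unsigned integer types, so the check should not depend on the platform's default integer width. The helper name `require_integer_counts` is hypothetical and not part of guidedLDA. The demo uses dense arrays to stay self-contained, on the assumption that a scipy sparse matrix exposes `.dtype` in the same way.

```python
import numpy as np


def require_integer_counts(doc_word):
    """Hypothetical helper: raise unless doc_word has an integer dtype."""
    if not np.issubdtype(doc_word.dtype, np.integer):
        raise ValueError(
            "expected matrix with integer values, found dtype %s" % doc_word.dtype
        )
    return doc_word


# Integer dtypes of any width or signedness pass the check.
for name in ("int32", "int64", "uint16"):
    require_integer_counts(np.zeros((2, 3), dtype=name))

# A float matrix is still rejected, as the original check intended.
try:
    require_integer_counts(np.zeros((2, 3), dtype="float64"))
    rejected = False
except ValueError:
    rejected = True
```

In utils.py the equivalent one-line change would be to replace `int` (or `np.int64`) with `np.integer` in the `np.issubdtype` call.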
