
Inference against a corpus is segfaulting #181

Open
erip opened this issue Aug 1, 2022 · 4 comments
Labels
bug Something isn't working

Comments


erip commented Aug 1, 2022

I am migrating away from `model.make_doc` to `tp.utils.Corpus` and am finding that using `Corpus` segfaults. My tiny repro is here:

#!/usr/bin/env python3

import time

import numpy as np
import tomotopy as tp

# Workaround for `str.split` received unknown kwarg user_data
class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

def get_highest_lda_list(model, N, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, ll = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

def get_highest_lda_corpus(model, N, docs):
    corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
    corpus.process(doc for doc in docs)
    topic_dist, ll = model.infer(corpus)
    k = np.argmax([doc.get_topic_dist() for doc in topic_dist], axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

if __name__ == "__main__":
    docs = [line.strip() for line in open('10_line_pretokenized_corpus.tsv')]
    lda = tp.LDAModel.load('tm_model.bin')
    N = 10
    t0 = time.time()
    list_res = get_highest_lda_list(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (list)")
    t0 = time.time()
    corpus_res = get_highest_lda_corpus(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (corpus)")
    assert all(e == f for e, f in zip(corpus_res, list_res))

When I run this, I see:

Took 19.61503529548645 seconds (list)
Segmentation fault (core dumped)

Running this with catchsegv shows these relevant lines:

/usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f171bd43210]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_ZNSt6vectorIjSaIjEE12emplace_backIJRjEEEvDpOT_+0x7c)[0x7f16d662705c]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z10makeCorpusP16TopicModelObjectP7_objectS2_+0x681)[0x7f16d6db8f51]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z9LDA_inferP16TopicModelObjectP7_objectS2_+0x25a)[0x7f16d6d71b8a]

which seems to point here... maybe `d.get()` is null?
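For readers following along, the mangled frames in the `catchsegv` output can be decoded with `c++filt` (part of GNU binutils); for example, the `makeCorpus` frame above demangles like this:

```shell
# Demangle the crashing frame from the catchsegv output above
echo '_Z10makeCorpusP16TopicModelObjectP7_objectS2_' | c++filt
# makeCorpus(TopicModelObject*, _object*, _object*)
```

The `emplace_back` frame demangles the same way, showing the crash happens while appending to a `std::vector<unsigned int>` inside `makeCorpus`.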

erip commented Aug 1, 2022

It seems like my `WSTok` is the issue: it doesn't meet the expected interface (`__call__` should return `(tok, start, stop)` tuples). If I use `tp.utils.SimpleTokenizer(pattern=r"\w+")`, it seems to be OK. This is somewhat unexpected, though, so maybe the documentation could be slightly improved.
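For reference, a whitespace tokenizer that does return spans might look like the sketch below. This is an assumption based on the comment above (that the third element is an exclusive end offset); consult the `tomotopy.utils.Corpus` documentation to confirm whether it is an end offset or a length.

```python
class SpanTok:
    """Whitespace tokenizer yielding (token, start, stop) tuples.

    Sketch only: assumes the third element is an exclusive end offset,
    as described in the comment above; verify against the tomotopy docs.
    """
    def __call__(self, raw, **kwargs):
        pos = 0
        for tok in raw.split():
            start = raw.index(tok, pos)    # locate token in the raw string
            yield tok, start, start + len(tok)
            pos = start + len(tok)
```

Unlike the original `WSTok`, this preserves character offsets, which is what (per the comment above) the `Corpus` machinery appears to expect.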

bab2min added the bug label Aug 7, 2022

bab2min commented Aug 8, 2022

Hi, @erip
Could you share some pieces of the file 10_line_pretokenized_corpus.tsv for reproduction?
I can't reproduce a similar error with the sample text I have, so it is not easy to determine the cause.
If you can share a file where the problem reproduces, it would be a great help in finding the cause.


erip commented Aug 8, 2022

@bab2min are you using the `WSTok` here? It should trigger the error.


bab2min commented Sep 14, 2022

Oops, sorry @erip, I forgot this thread entirely.
Yes, I used `WSTok` and it worked well.
Since I don't have tm_model.bin and 10_line_pretokenized_corpus.tsv, I ran the code modified like this:

class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

docs = ["this is test text", "this is another text", "somewhat long text...."]

corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
corpus.process(doc for doc in docs)
for doc in corpus:
    print(doc)
# it will print
# <tomotopy.Document with words="this is test text">
# <tomotopy.Document with words="this is another text">
# <tomotopy.Document with words="somewhat long text....">

I suspect that some lines in 10_line_pretokenized_corpus.tsv corrupt the inner C++ code.
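If that suspicion is right, one way to locate the offending line is a bisection over prefixes of the corpus, running each candidate subset in a fresh subprocess (a segfault kills the interpreter, so the check must happen out of process). A minimal sketch, where `check` is a hypothetical predicate you would implement, e.g. by running the repro script on the given lines via `subprocess.run` and inspecting the return code:

```python
def find_failing_line(lines, check):
    """Return the index of the first line whose inclusion makes `check` fail.

    Assumes check([]) is clean, check(lines) fails, and that any prefix
    containing the bad line fails (a single-culprit assumption).
    `check(subset)` is a hypothetical user-supplied predicate returning
    True when the subset processes cleanly.
    """
    lo, hi = 0, len(lines)  # invariant: lines[:lo] clean, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if check(lines[:mid]):
            lo = mid        # first `mid` lines are clean; culprit is later
        else:
            hi = mid        # culprit is within the first `mid` lines
    return hi - 1
```

With only 10 lines one could simply test each line in isolation, but bisection keeps the number of subprocess runs logarithmic for larger corpora.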
