Error with saving/loading PLDAModel #204 (Open)

juhopaak opened this issue Jun 13, 2023 · 0 comments

@juhopaak
When I train a PLDAModel and then save and load it, the loaded model's properties differ from the original model's. For instance,

from tomotopy import PLDAModel
docs = [['foo'], ['bar'], ['baz'], ['foo', 'bar'], ['baz', 'bar']]
mdl = PLDAModel(latent_topics=2)
for doc in docs:
    mdl.add_doc(doc)
mdl.train(100)
print(mdl.summary())
print(mdl.perplexity)

produces

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -1.94159
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3261328688 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [3.0139716 7.3531275]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo

6.985572470333207

but after calling

mdl.save('model.bin', full=True)
mdl = PLDAModel.load('model.bin')
print(mdl.summary())
print(mdl.perplexity)

I get

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -2.19768
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3666141070 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.1 0.1]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo
|

9.004082581035151

The log-likelihood per word and perplexity diverge before and after saving/loading, and the estimated alpha values are reset to the initial prior ([0.1 0.1]) after loading. I've tried this with both full=True and full=False, and with saves()/loads() instead of save()/load(), but the issue persists.
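For reference, here is a minimal sketch of the in-memory saves()/loads() round trip mentioned above (the perplexity/alpha comparison just restates what the two summaries show):

from tomotopy import PLDAModel

docs = [['foo'], ['bar'], ['baz'], ['foo', 'bar'], ['baz', 'bar']]
mdl = PLDAModel(latent_topics=2)
for doc in docs:
    mdl.add_doc(doc)
mdl.train(100)

# record metrics before the round trip
ppl_before = mdl.perplexity
alpha_before = list(mdl.alpha)

# in-memory round trip instead of writing model.bin to disk
data = mdl.saves(full=True)
mdl2 = PLDAModel.loads(data)

# the loaded model's values differ from the originals
print('perplexity:', ppl_before, '->', mdl2.perplexity)
print('alpha:', alpha_before, '->', list(mdl2.alpha))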
