Priority is always given to the first anchor from anchor words #49

Open
ElizaLo opened this issue Apr 5, 2021 · 1 comment

ElizaLo commented Apr 5, 2021

I have a dataset of 10,000 documents, and it definitely contains documents for 16 topics. With anchor words, I want to classify the dataset into those 16 topics. For each topic I set a list of anchor words (some topics have more words, some fewer, but on average about 50 words per topic).
The anchor words for each topic go in a separate list; I then check for the presence of the anchor words in the texts and collect the per-topic lists into the overall list of lists, anchors (sketched below).
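Roughly, the construction looks like this (a simplified sketch; anchor_words_by_topic and words stand in for my actual variables):

vocab = set(words)  # vocabulary of the document-word matrix
# keep only the anchor words that actually occur in the texts,
# one list per topic, collected into a list of lists
anchors = [[w for w in topic_words if w in vocab]
           for topic_words in anchor_words_by_topic]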

But in the output, one topic always dominates (90-95% of my documents), and it is always the topic whose anchor words come first in the list of anchors (I checked this by changing the order of the anchor lists).

For example, I have a desserts topic and an alcoholic drinks topic. If I put the desserts anchor words first in the list of anchors, the desserts topic prevails in the output. If I put the alcoholic drinks anchor words first, then the alcoholic drinks topic prevails.

By "prevail" I mean that 90% or more of the documents are labeled with whichever topic comes first in the anchor lists. The other topics among the 16 also appear in the output, but much less often, and those assignments are also wrong.

Can you please tell me why this is happening and what I might be doing wrong?

Thank you in advance for your help and answer!

ryanjgallagher (Collaborator) commented:

CorEx is a bit different from LDA in the sense that the topic probabilities don't have to add up to 1. So it could be that 90% of your documents express the desserts topic, but also 90% of your documents express the drinks topic. What do those proportions look like each time you switch the order of the anchors? I think you could check by doing something like this, if you're not already:

import numpy as np

# labels is a binary (n_docs x n_topics) array of topic assignments
n_docs = topic_model.labels.shape[0]
topic_proportions = np.sum(topic_model.labels, axis=0) / n_docs
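To compare across reorderings, you could then print the proportion for each topic (a small sketch building on the lines above):

for k, prop in enumerate(topic_proportions):
    print(f"topic {k}: {prop:.1%} of documents")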

Some other thoughts that might help:

  • Do you say there are 16 topics because you have something like 16 different types of document labels? The topics don't necessarily have to line up with the 16 labels, because there may be stronger textual patterns than the labels. So, are you running with exactly 16 topics and anchoring all 16? If so, I would experiment with allowing more topics that are not anchored (see the sketch after this list). That gives CorEx more room to sort words into topics that don't necessarily line up with the document labels or anchor words. For example, if you have recipes, there might be a strong "numbers" topic (e.g. "2", "1/2", "1/4", ...) from the amounts used in each recipe.
  • Using ~50 anchor words per topic is a lot of anchor words. Since we usually look at the top 10-20 words to interpret a topic, anchoring with 50 or so per topic is almost like just defining the topics ahead of time with the words you already have. You might want to experiment with a smaller set of anchor words; for example, maybe the 5 most important ones for each topic.
  • The more anchor words you use, the lower you're probably going to want to set the anchor_strength. The anchor strength is, roughly, the relative weight to put on the anchor words versus all other words; anchor_strength=2 says to give the anchor words twice as much weight. If you're giving twice as much weight to something like 50 * 16 = 800 anchor words, then CorEx isn't really going to be able to find topics other than the ones you tell it to find. This is related to my first thought above.
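To make those three suggestions concrete, here is a minimal sketch, assuming a sparse binary document-word matrix doc_word and a vocabulary list words (e.g. from sklearn's CountVectorizer); the variable names and numbers are illustrative, not from your setup:

import corextopic.corextopic as ct

# trim each anchor list to the few most important words per topic
trimmed_anchors = [topic_words[:5] for topic_words in anchors]

# allow free topics beyond the 16 anchored ones, e.g. 30 total;
# anchors may cover only a subset of topics, the rest stay unanchored
topic_model = ct.Corex(n_hidden=30, seed=1)
topic_model.fit(
    doc_word,                 # sparse binary document-word matrix
    words=words,              # vocabulary aligned with doc_word's columns
    anchors=trimmed_anchors,  # 16 anchor lists; the other 14 topics are free
    anchor_strength=2,        # modest relative weight on anchor words
)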
