Priority is always given to the first anchor from anchor words #49

Open
ElizaLo opened this issue Apr 5, 2021 · 1 comment

ElizaLo commented Apr 5, 2021

I have a dataset of 10,000 documents, and it definitely contains documents for 16 topics. With anchor words, I want to classify the dataset into those 16 topics. For each topic I set a list of anchor words (some topics have more words, some fewer, but on average about 50 words per topic).
The anchor words for each topic go in a separate list; I then check for the presence of the anchor words in the texts and collect the per-topic lists into the overall list of lists, anchors (sketched below).
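Roughly, the construction looks like this (a simplified sketch; anchor_words_by_topic and words stand in for my actual variables):

vocab = set(words)  # vocabulary of the document-word matrix
# keep only the anchor words that actually occur in the texts,
# one list per topic, collected into a list of lists
anchors = [[w for w in topic_words if w in vocab]
           for topic_words in anchor_words_by_topic]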

But in the output, one topic always dominates (90-95% of my documents), and it is always the topic whose anchor words come first in the list of anchors (I checked this by changing the order of the anchor lists).

For example, I have a desserts topic and an alcoholic drinks topic. If I put the desserts anchor words first in the list of anchors, the desserts topic prevails in the output. If I put the alcoholic drinks anchor words first, then the alcoholic drinks topic prevails.

By "prevail" I mean that 90% or more of the documents are labeled with whichever topic comes first in the anchor lists. The other topics among the 16 also appear in the output, but much less often, and those assignments are also wrong.

Can you please tell me why this is happening and what I might be doing wrong?

Thank you in advance for your help and answer!

ryanjgallagher (Collaborator) commented:

CorEx is a bit different from LDA in the sense that the topic probabilities don't have to add up to 1. So it could be that 90% of your documents express the desserts topic, but also 90% of your documents express the drinks topic. What do those proportions look like each time you switch the order of the anchors? I think you could check by doing something like this, if you're not already:

import numpy as np

# labels is a binary (n_docs x n_topics) array of topic assignments
n_docs = topic_model.labels.shape[0]
topic_proportions = np.sum(topic_model.labels, axis=0) / n_docs
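To compare across reorderings, you could then print the proportion for each topic (a small sketch building on the lines above):

for k, prop in enumerate(topic_proportions):
    print(f"topic {k}: {prop:.1%} of documents")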

Some other thoughts that might help:

  • Do you say there are 16 topics because you have something like 16 different types of document labels? The topics don't necessarily have to line up with the 16 labels, because there may be stronger textual patterns than the labels. So, are you running with exactly 16 topics and anchoring all 16? If so, I would experiment with allowing more topics that are not anchored (see the sketch after this list). That gives CorEx more room to sort words into topics that don't necessarily line up with the document labels or anchor words. For example, if you have recipes, there might be a strong "numbers" topic (e.g. "2", "1/2", "1/4", ...) from the amounts used in each recipe.
  • Using ~50 anchor words per topic is a lot of anchor words. Since we usually look at the top 10-20 words to interpret a topic, anchoring with 50 or so per topic is almost like just defining the topics ahead of time with the words you already have. You might want to experiment with a smaller set of anchor words; for example, maybe the 5 most important ones for each topic.
  • The more anchor words you use, the lower you're probably going to want to set the anchor_strength. The anchor strength is, roughly, the relative weight to put on the anchor words versus all other words; anchor_strength=2 says to give the anchor words twice as much weight. If you're giving twice as much weight to something like 50 * 16 = 800 anchor words, then CorEx isn't really going to be able to find topics other than the ones you tell it to find. This is related to my first thought above.
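To make those three suggestions concrete, here is a minimal sketch, assuming a sparse binary document-word matrix doc_word and a vocabulary list words (e.g. from sklearn's CountVectorizer); the variable names and numbers are illustrative, not from your setup:

import corextopic.corextopic as ct

# trim each anchor list to the few most important words per topic
trimmed_anchors = [topic_words[:5] for topic_words in anchors]

# allow free topics beyond the 16 anchored ones, e.g. 30 total;
# anchors may cover only a subset of topics, the rest stay unanchored
topic_model = ct.Corex(n_hidden=30, seed=1)
topic_model.fit(
    doc_word,                 # sparse binary document-word matrix
    words=words,              # vocabulary aligned with doc_word's columns
    anchors=trimmed_anchors,  # 16 anchor lists; the other 14 topics are free
    anchor_strength=2,        # modest relative weight on anchor words
)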
