
Discrepancy in topic content when summarizing and visualizing with LDAvis #97

Open
leungi opened this issue May 2, 2019 · 1 comment

leungi commented May 2, 2019

Apologies for not providing a minimal reprex (due to size), but the code below uses the example from the textmineR package, so it should be reproducible.

Issue: reviewing model$summary for, say, topic 1 (t_1), it doesn't seem to match the topic marked as t_1 in the LDAvis plot.

I believe the definitions of phi (P(token | topic)) and theta (P(topic | document)) are the same across textmineR and LDAvis, so I'd expect similar topic/word clusters.

Note that this issue was originally posted against textmineR (TommyJones/textmineR#72), and its author suggested that the cause may lie with LDAvis.
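
For reference, both packages document phi as a topics-by-tokens matrix and theta as a documents-by-topics matrix, so every row of each should sum to 1. A minimal check of that shared convention (to run after fitting the model below):

rowSums(model$phi)[1:3]   # each ~1: P(token | topic) summed over tokens
rowSums(model$theta)[1:3] # each ~1: P(topic | document) summed over topics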

library(textmineR)

# load nih_sample data set from textmineR
data(nih_sample)

# create a document term matrix 
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system

dtm <- dtm[,colSums(dtm) > 2]

set.seed(12345)

model <- FitLdaModel(dtm = dtm, 
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2) 

model$top_terms <- GetTopTerms(phi = model$phi, M = 10)

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents. 
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05, 
                            dtm = dtm,
                            M = 1)


model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence,3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]
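
# For the comparison described above, topic t_1 can be pulled out directly
# (a quick check added here; assumes GetTopTerms keeps topic names as column names)
model$summary[model$summary$topic == "t_1", ]
model$top_terms[, "t_1"]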



# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat


library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)

serVis(json, open.browser = TRUE)
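
As a sanity check on the inputs: the createJSON documentation expects vocab in the same order as the columns of phi, and doc.length in the same order as the rows of theta. A quick sketch of the alignment checks:

# both should be TRUE if the inputs are aligned the way createJSON expects
identical(tf_mat$term, colnames(model$phi))
identical(names(doc_lengths), rownames(model$theta))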
@TommyJones

Having played with @leungi's example, it looks like the row index of the phi matrix is shuffled in LDAvis compared to the row order of the model$phi that is fed into the JSON.
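
One way to see the re-ordering is to parse the JSON and inspect its topic.order element (a sketch; it assumes the JSON from createJSON carries topic.order, which records how topics were sorted by prevalence for the plot):

# topic.order[k] is the original row of phi displayed as topic k in the plot
topic_order <- RJSONIO::fromJSON(json)$topic.order
topic_order
rownames(model$phi)[topic_order[1]]  # which original topic the plot labels "1"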
