
Topic Modeling using LDA in R

Thu Mar 26, 2015 12:37 am

Latent Dirichlet Allocation (LDA) is a technique used to identify latent topics in a large corpus of text data. The idea behind LDA is that each document in the corpus is a mixture of different latent topics, and that each topic is characterized by a probability distribution over the words in the vocabulary.
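
To make the "mixture of topics" idea concrete, here is a minimal base-R sketch of the generative story LDA assumes. It is only an illustration: the vocabulary, the number of topics, the prior value 0.5 and the helper names (rdirichlet1, generate_document) are all made up for this example and do not come from any package.

```r
set.seed(42)  # reproducible toy run

# Draw one sample from a Dirichlet distribution by normalising gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab <- c("bug", "crash", "patch", "menu", "font", "theme")  # made-up vocabulary
K <- 2                                                         # made-up number of topics

# Each topic is a probability distribution over the vocabulary (one row per topic).
topic_word <- t(replicate(K, rdirichlet1(rep(0.5, length(vocab)))))
colnames(topic_word) <- vocab

# Generate one document: draw its topic mixture, then a topic and a word per token.
generate_document <- function(n_words, alpha = rep(0.5, K)) {
  theta <- rdirichlet1(alpha)                               # topic mixture of the document
  z <- sample(K, n_words, replace = TRUE, prob = theta)     # topic of each token
  words <- vapply(z, function(k) sample(vocab, 1, prob = topic_word[k, ]),
                  character(1))
  list(theta = theta, topics = z, words = words)
}

doc <- generate_document(8)
print(round(doc$theta, 2))
print(doc$words)
```

Fitting LDA goes the other way around: given only the words, it tries to recover the topic mixtures and the topic-word distributions.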

In R, the 'topicmodels' package can be used to perform LDA on a text corpus (the 'lda' package is an alternative with its own interface, 'lda.collapsed.gibbs.sampler'). Here is an outline of how to do topic modeling with 'topicmodels':
    1- First, load the 'topicmodels' package and any other necessary packages (such as 'tm' for text pre-processing).
    2- Next, read in the text data and create a Corpus object, which can be done using the 'tm' package.
    3- Perform text pre-processing on the corpus, including cleaning, stemming, and removing stopwords.
    4- Create a document-term matrix from the corpus using the 'DocumentTermMatrix' function (or transpose a term-document matrix with 't').
    5- Perform LDA on the document-term matrix using the 'LDA' function. The function takes two main arguments: the document-term matrix, and the number of topics to extract (k).
    6- The 'LDA' function returns a fitted model object that contains the topic-word distributions, the document-topic distributions, and other information.
    7- To inspect the topics, use the 'terms' function, which returns the top words for each topic. The 'topics' function returns the most likely topic(s) for each document.
    8- To get the share of each topic in each document, use the 'posterior' function, whose 'topics' element is a matrix of document-topic probabilities.
    9- To visualize the data, you can use the 'ggplot2' package to create a heatmap of the term-document matrix, with the colour showing how often each term occurs in each document.
    10- To evaluate the quality of the topics, you can use the 'perplexity' function, which measures how well the model fits the data. Lower perplexity values indicate better model fit.
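
To show what a perplexity value means, here is a small base-R sketch that computes it by hand from a document-topic matrix and a topic-word matrix. The numbers and the function name perplexity_sketch are invented for illustration; a package's own 'perplexity' function may compute the likelihood differently, so treat this as the textbook definition rather than that function's implementation.

```r
# Hypothetical fitted quantities for 2 documents, 2 topics and 3 words.
doc_topic <- matrix(c(0.9, 0.1,
                      0.2, 0.8), nrow = 2, byrow = TRUE)        # documents x topics
topic_word <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)  # topics x words
counts <- matrix(c(5, 2, 0,
                   1, 2, 4), nrow = 2, byrow = TRUE)            # documents x words

# Perplexity = exp(-log-likelihood / number of tokens).
perplexity_sketch <- function(counts, doc_topic, topic_word) {
  word_prob <- doc_topic %*% topic_word      # p(word | document)
  nonzero <- counts > 0                      # skip words that never occur
  log_lik <- sum(counts[nonzero] * log(word_prob[nonzero]))
  exp(-log_lik / sum(counts))
}

print(perplexity_sketch(counts, doc_topic, topic_word))
```

A model that spreads probability evenly over a vocabulary of V words has perplexity V, so values well below the vocabulary size indicate that the model has learned something about which words go together.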
It is worth noting that the quality of the LDA results depends on the quality of the data, the number of topics selected, and the pre-processing steps taken. It is also sensitive to the two main hyperparameters of the model, alpha and eta.

alpha is the parameter of the Dirichlet prior on each document's topic distribution. A higher value of alpha spreads each document over more topics, while a lower value concentrates each document on a few topics.

eta is the parameter of the Dirichlet prior on each topic's word distribution. A higher value of eta spreads each topic over more words, while a lower value concentrates each topic on a few words.

The values of alpha and eta can be set by the user, and some implementations can also estimate them from the data. If set by the user, a common approach is to set alpha to a symmetric value for all topics, such as 0.1, and eta to a symmetric value for all words, such as 0.01. But it is important to note that the optimal values for alpha and eta can vary depending on the specific dataset and the number of topics being modeled.
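
The effect of alpha can be seen without fitting any model. The base-R sketch below draws topic mixtures from a symmetric Dirichlet prior with a small and a large alpha and compares how dominant the biggest topic is on average. The helper names (rdirichlet, dominant_share) and the values 0.1 and 10 are chosen only for this illustration.

```r
set.seed(1)  # reproducible toy run

# Draw n samples from a Dirichlet distribution (one sample per row).
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha, rate = 1),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

K <- 5                                         # number of topics
sparse_mix <- rdirichlet(1000, rep(0.1, K))    # small alpha
even_mix   <- rdirichlet(1000, rep(10, K))     # large alpha

# Average share of the largest topic in a document.
dominant_share <- function(m) mean(apply(m, 1, max))

print(dominant_share(sparse_mix))  # close to 1: most documents sit in one topic
print(dominant_share(even_mix))    # close to 1/K: documents mix all topics
```

eta plays the same role for the words of a topic, so the same experiment applies with words in place of topics.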


This R script uses several packages, including 'ggplot2', 'grid', 'plyr', 'reshape', 'ScottKnott', 'lda', and 'tm' to perform Latent Dirichlet Allocation (LDA) on a dataset of comments.

    1- The script sets the working directory to a specific file path on the user's computer, "D:/SecondPaper/".
    2- It then reads a csv file named 'MozillaCommentsFixed.csv' and assigns it to a variable named 'dataValues'.
    3- It then selects 1000 random rows of data using 'sample' function and assigns it to the same variable.
    4- It then performs Text Pre-processing on the data. It starts by creating a Corpus object from the 'dataValues$text' variable.
    5- It then converts all text to lowercase, removes punctuation, numbers, stopwords and performs stemming on the words.
    6- Next, it creates a term document matrix from the Corpus object and inspects the first 10 rows and columns of the matrix.
    7- It then finds the terms that occur at least 1003 times in the matrix, and checks the dimensions of the matrix.
    8- It then removes sparse terms from the matrix using the 'removeSparseTerms' function, with a sparse value of 0.88
    9- Next, it inspects the first 10 rows and 15 columns of the sparse matrix.
    10- It then creates a wordcloud of the frequent terms, saved as an image file named 'wordcloud.png'
    11- It then reformats the sparse matrix and creates a heatmap of the term-document matrix using ggplot2.
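    12- Next, it removes sparse terms from the original matrix again using the 'removeSparseTerms' function, this time with a sparse value of 0.5
    13- It then converts the term-document matrix to a data frame, and prints the number of words remaining.
    14- It transposes the matrix into a document-term matrix and computes the mean tf-idf of each term to select the vocabulary.
    15- It fits LDA models with the 'topicmodels' package and prints the top 5 words per topic.
    16- Finally, it fits a model with the collapsed Gibbs sampler of the 'lda' package and prints the top words of each topic.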
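
The sparse value in the steps above is easier to read on a toy example. The base-R sketch below mimics the idea on a plain matrix: a term is kept when the fraction of documents that do not contain it is below the threshold. The matrix and the helper keep_terms are invented for illustration and are not part of 'tm'.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 2, 0, 1, 3,
                0, 0, 0, 1, 0,
                2, 0, 1, 0, 0,
                0, 0, 0, 0, 4),
              nrow = 4, byrow = TRUE,
              dimnames = list(Terms = c("crash", "font", "menu", "theme"),
                              Docs = paste0("doc", 1:5)))

# Sparsity of a term = fraction of documents in which it does not occur.
term_sparsity <- rowMeans(tdm == 0)
print(term_sparsity)

# Keep the terms whose sparsity is below the threshold.
keep_terms <- function(m, sparse) m[rowMeans(m == 0) < sparse, , drop = FALSE]

print(rownames(keep_terms(tdm, 0.88)))  # a loose threshold keeps rare terms too
print(rownames(keep_terms(tdm, 0.5)))   # a strict threshold keeps only common terms
```

So a value of 0.88 drops only the terms missing from 88% or more of the documents, while 0.5 keeps just the terms found in more than half of them.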

There are two packages in R that support topic modeling with latent Dirichlet allocation (LDA): 1) topicmodels 2) lda. In this example I read text from a csv file and convert it to a corpus. I also pre-process the text using the R package "tm": removing stopwords, stemming, and removing numbers. After converting the corpus into a term-document matrix, I remove the low-frequency words (sparse terms). I also visualize the words per document and save the word cloud as a .png file. Both packages are used to build LDA models at the end; you can use whichever you find faster and easier to learn.
Code:
## Load the packages used below
library(ggplot2)
library(grid)
library(plyr)
library(reshape)
library(ScottKnott)
library(lda)
library(tm)

## Read the comments and keep a random sample of 1000 rows
setwd("D:/SecondPaper/")
dataValues <- read.csv('MozillaCommentsFixed.csv', sep = ",", stringsAsFactors = FALSE)
dataValues <- dataValues[sample(nrow(dataValues), size = 1000, replace = FALSE), ]


dim(dataValues)
## Text pre-processing.
## Create a corpus from the original text column;
## VectorSource interprets each element of the vector as a document
CorpusObj <- VectorSource(dataValues$text)
CorpusObj <- Corpus(CorpusObj)
## tolower is a base R function, not a tm transformation, so wrap it in content_transformer()
CorpusObj <- tm_map(CorpusObj, content_transformer(tolower)) # convert all text to lower case
CorpusObj <- tm_map(CorpusObj, removePunctuation)
CorpusObj <- tm_map(CorpusObj, removeNumbers)
CorpusObj <- tm_map(CorpusObj, removeWords, stopwords("english"))
CorpusObj <- tm_map(CorpusObj, stemDocument, language = "english") ## Stemming the words
CorpusObj <- tm_map(CorpusObj, stripWhitespace)
## Create a term-document matrix, keeping terms of at least 3 characters
CorpusObj.tdm <- TermDocumentMatrix(CorpusObj, control = list(wordLengths = c(3, Inf)))
inspect(CorpusObj.tdm[1:10,1:10])
## Terms that occur at least 1003 times
findFreqTerms(CorpusObj.tdm, lowfreq=1003)
dim(CorpusObj.tdm)
## Drop the terms that are missing from 88% or more of the documents
CorpusObj.tdm.sp <- removeSparseTerms(CorpusObj.tdm, sparse=0.88)
dim(CorpusObj.tdm.sp)
## Show the first 10 remaining words for the first 15 documents
inspect(CorpusObj.tdm.sp[1:10,1:15])







## Visualizing the term-document matrix

## Word cloud
library(wordcloud)
library(RColorBrewer)

mTDM <- as.matrix(CorpusObj.tdm)
v <- sort(rowSums(mTDM), decreasing = TRUE)   # total frequency of each term
d <- data.frame(word = names(v), freq = v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]                            # drop the two palest colours
png("wordcloud.png", width = 1280, height = 800)
wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100, random.order = TRUE, rot.per = .15, colors = pal, vfont = c("sans serif", "plain"))
dev.off()

## Reformat the sparse matrix into long format for plotting
CorpusObj.tdm.reformatted <- as.matrix(CorpusObj.tdm.sp)
## Compare the memory used by the sparse and the dense version
object.size(CorpusObj.tdm.sp)
object.size(CorpusObj.tdm.reformatted)
## melt() gives one row per term/document pair, with columns Terms, Docs and value
CorpusObj.tdm.reformatted <- melt(CorpusObj.tdm.reformatted)
head(CorpusObj.tdm.reformatted)
dim(CorpusObj.tdm.reformatted)

## There are many documents, so the document labels on the x axis are hidden.
## Note that zero counts become -Inf under log10.
ggplot(CorpusObj.tdm.reformatted, aes(x = Docs, y = Terms, fill = log10(value))) +
  geom_tile(colour = "white") +
  scale_fill_gradient(high = "#FF0000", low = "#FFFFFF") +
  ylab("") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())


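## Controlling sparse terms: keep only the terms present in more than half of the documents
CorpusObj.tdm.sp <- removeSparseTerms(CorpusObj.tdm, sparse=0.5)
## Convert the term-document matrix to a data frame
## (as.matrix() gives the full matrix; inspect() is meant for printing)
CorpusObj.tdm.sp.df <- as.data.frame(as.matrix(CorpusObj.tdm.sp))
## Number of words remaining
nrow(CorpusObj.tdm.sp.df)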

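library(slam)
# Transpose the term-document matrix into a document-term matrix; the next steps
# use the mean term frequency-inverse document frequency (tf-idf)
# to select the vocabulary for topic modeling
CorpusObj.tdm.sp.t <- t(CorpusObj.tdm.sp)
summary(col_sums(CorpusObj.tdm.sp.t))
# calculate tf-idf values
term_tfidf <- tapply(CorpusObj.tdm.sp.t$v/row_sums(CorpusObj.tdm.sp.t)[CorpusObj.tdm.sp.t$i], CorpusObj.tdm.sp.t$j,mean) * log2(nDocs(CorpusObj.tdm.sp.t)/col_sums(CorpusObj.tdm.sp.t>0))
summary(term_tfidf)
# Keep only the terms whose tf-idf reaches the cutoff. The original cutoff of 1.0
# cannot be reached here: terms kept with sparse=0.5 occur in more than half of the
# documents, so their tf-idf is below 1. The median is an assumed replacement;
# pick your own cutoff from summary(term_tfidf).
tfidf.cutoff <- median(term_tfidf)
CorpusObj.tdm.sp.t.tdif <- CorpusObj.tdm.sp.t[, term_tfidf >= tfidf.cutoff]
# Drop the documents left without any term (this must use the filtered matrix)
CorpusObj.tdm.sp.t.tdif <- CorpusObj.tdm.sp.t.tdif[row_sums(CorpusObj.tdm.sp.t.tdif) > 0, ]
summary(col_sums(CorpusObj.tdm.sp.t.tdif))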

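library(topicmodels)

## LDA() expects a document-term matrix without empty documents,
## so fit on the filtered matrix built above rather than on CorpusObj.tdm
myModel <- LDA(CorpusObj.tdm.sp.t.tdif, k = 10)
## Most likely topic of the first documents
head(topics(myModel))
## Top 5 words per topic
terms(myModel, 5)

## Fit a topic model for every number of topics between 2 and 50... it will take some time!
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(CorpusObj.tdm.sp.t.tdif, d)})
## Log-likelihood of each model, one row per number of topics
best.model.logLik <- data.frame(topics = seq(2, 50, by = 1), logLik = sapply(best.model, function(m) as.numeric(logLik(m))))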


 
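### Faster way of doing LDA
library(lda)

## lexicalize() works on a character vector, so extract the text of each document first
corpusText <- sapply(seq_len(length(CorpusObj)), function(i) paste(as.character(CorpusObj[[i]]), collapse = " "))
corpusLDA <- lexicalize(corpusText)

ldaModel <- lda.collapsed.gibbs.sampler(corpusLDA$documents, K=10, vocab=corpusLDA$vocab, burnin=9999, num.iterations=1000, alpha=1, eta=0.1)
## Top 5 words of each topic
top.words <- top.topic.words(ldaModel$topics, 5, by.score=TRUE)
print(top.words)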




Re: Topic Modeling using LDA in R

Fri Apr 10, 2015 3:29 am

If you are interested in the share of each topic in each document when using the 'lda' package, you can use:
Code:


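## Fit a model with 100 topics and keep track of the log-likelihood
ldaModel <- lda.collapsed.gibbs.sampler(corpusLDA$documents, K=100, vocab=corpusLDA$vocab, burnin=9999, num.iterations=1000, alpha=1, eta=0.1, compute.log.likelihood = TRUE)

## document_sums is topics x documents; divide each document by its word count
## to get one row of topic proportions per document
topic.proportions <- t(ldaModel$document_sums) / colSums(ldaModel$document_sums)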
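
To see what that last line does, here is the same normalisation on a small hand-made matrix in base R. The counts are invented; only the shape (topics in rows, documents in columns) follows document_sums.

```r
# Hypothetical document_sums: 3 topics (rows) x 4 documents (columns).
# Entry [k, d] is the number of words in document d assigned to topic k.
document_sums <- matrix(c(4, 0, 1, 3,
                          1, 4, 1, 0,
                          0, 2, 2, 1), nrow = 3, byrow = TRUE)

# After t() the documents are in rows; the vector of document totals is recycled
# down each column, so every row is divided by that document's word count.
topic.proportions <- t(document_sums) / colSums(document_sums)
print(round(topic.proportions, 2))

# Most prominent topic of each document.
dominant.topic <- apply(topic.proportions, 1, which.max)
print(dominant.topic)
```

Each row sums to 1, so it can be read as the membership of that document in each topic. A document with no words left after pre-processing would give NaN here, so it is worth checking for empty documents first.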


