Examples of R Scripting Language

Latent semantic analysis for Text in R using LSA

Tue Dec 16, 2014 6:21 am

Latent Semantic Analysis (LSA) is a technique used to extract latent semantic structures from a large corpus of text data. The idea behind LSA is that words that are semantically similar will occur in similar contexts across the corpus, and that these shared contexts can be used to identify latent semantic structures within the data. In R, the 'lsa' package can be used to perform LSA on a text matrix. The 'lsa' function takes two arguments: the text matrix, and the number of dimensions to reduce the data to (dims). The text matrix is typically created using the 'textmatrix' function, which takes a directory containing text files and converts them into a matrix of word counts, with one row per term and one column per document. The 'lsa' function then performs a singular value decomposition (SVD) on this matrix and keeps only the largest singular values, which reduces the dimensionality of the data while preserving as much of the original information as possible.
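
To make the SVD step more concrete, here is a minimal base R sketch of a truncated SVD on a small term-by-document matrix. It does not call the 'lsa' package; the matrix tdm is typed in by hand from the four toy documents used in the script further down, and the choice k = 2 is an arbitrary assumption for illustration. It should be read as an approximation of what 'lsa' followed by 'as.textmatrix' does conceptually, not as the package's exact code.

```r
# Toy term-by-document count matrix (rows = terms, columns = documents),
# typed in by hand from the four small documents D1..D4 of the script.
tdm <- matrix(
  c(1, 0, 0, 0,   # cat
    1, 1, 0, 1,   # donkey
    0, 1, 0, 0,   # hamster
    1, 0, 1, 2,   # man
    0, 0, 2, 0,   # monster
    0, 1, 0, 0),  # sushi
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("cat", "donkey", "hamster", "man", "monster", "sushi"),
    c("D1", "D2", "D3", "D4")
  )
)

# Full singular value decomposition: tdm = U %*% diag(d) %*% t(V)
s <- svd(tdm)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here)
k <- 2
tdm_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])
dimnames(tdm_k) <- dimnames(tdm)

# Rank-k approximation of the original counts
round(tdm_k, 2)
```

The reduced matrix has the same shape as the original, but its entries are no longer integer counts: terms that share documents pull each other's values up, which is what lets LSA relate words that never appear in exactly the same file.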

The resulting object, called myLSAspace, represents the transformed matrix in a lower-dimensional space. This transformed matrix can be used for tasks such as document retrieval, text classification, and semantic search. The 'associate' function can be used to calculate associations for specific words in the matrix, and the 'cosine' function can be used to calculate the cosine similarity between two vectors or, when given a matrix, between all of its column vectors. It is worth noting that LSA is sensitive to the quality of the text data and to preprocessing choices such as stemming and stopword removal. It also treats each document as a bag of words, so word order is ignored; only word frequency and co-occurrence matter. The script further down shows an example of how to use the Latent Semantic Analysis (LSA) approach in R.
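
As a rough illustration of the cosine measure mentioned above, the following base R sketch computes it by hand for the same three vectors that the script uses later. The helper cosine_sim is a hypothetical name introduced here, not a function of the 'lsa' package; the sketch only assumes the standard definition of cosine similarity (dot product divided by the product of the vector lengths).

```r
# Hypothetical helper (not part of lsa): cosine similarity of two vectors
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

vec1 <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
vec2 <- c(0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0)
vec3 <- c(0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0)

# Similarity of a single pair: 1 / sqrt(18), about 0.236
cosine_sim(vec1, vec2)

# All pairwise similarities between the columns of a matrix:
# scale each column to unit length, then take the cross product.
m <- cbind(vec1, vec2, vec3)
normed <- sweep(m, 2, sqrt(colSums(m^2)), "/")
sim <- crossprod(normed)
round(sim, 3)
```

The result is a symmetric 3 x 3 matrix with ones on the diagonal, since every vector has a cosine of 1 with itself.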



The script sets the working directory with the setwd() function, which tells R where to look for files. It then uses the tempfile() function to generate a temporary path, stored in the variable td, and the dir.create() function to create a directory at that path. Next, it uses the write() function to write short lists of words to four files within the temporary directory: "D1", "D2", "D3" and "D4". The textmatrix() function from the lsa library then builds a matrix of word counts from the files in the temporary directory. It accepts several arguments, including the location of the files, the minimum word length (minWordLength) and the minimum document frequency (minDocFreq). The script then uses the lsa() function, also from the lsa library, to perform Latent Semantic Analysis (LSA) on the matrix of word counts. LSA is a technique used to extract the underlying meaning of a set of documents by analyzing their word usage patterns. The function takes the matrix of word counts and the number of dimensions to keep; here dims=dimcalc_share() lets the package derive that number from the singular values.

The script then uses the associate() function from the lsa library to calculate the association of the word "donkey" with the other words in the matrix, and the unlink() function to remove the temporary directory and its contents. Next, it loads some sample data sets, builds a matrix from three numeric vectors, and uses the cosine() function to calculate the cosine similarity between those three vectors. The script then creates a temporary directory again and writes lists of words to four files within it ("D1", "D2", "D3" and "D4"), builds a matrix of word counts with textmatrix(), calls sample() on that matrix, and cleans up. It repeats the process once more with different word lists, this time passing the stopwords_en data, stemming, and the language argument to textmatrix(). Finally, the script uses the unlink() function to remove the temporary directory and its contents.

In summary, the script uses a combination of text analysis, LSA, and association analysis to examine the relationships between words in a set of documents. It loads the lsr, foreign, MASS, bootES, lsa, and corrgram libraries, although every text-analysis function it calls (textmatrix(), lsa(), as.textmatrix(), associate(), and cosine()) comes from lsa. The remaining functions (setwd(), tempfile(), dir.create(), write(), and unlink()) are part of base R and handle the working directory and the temporary files.



Code:
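# Only lsa is needed for the functions used below (textmatrix, lsa,
# associate, cosine); the other packages are loaded as in the original script.
require(lsr)
require(foreign)
require(MASS)
require(bootES)
library(lsa)
require(corrgram)
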

# Path from the original post; change it to a folder on your own machine.
setwd("D:/FirstPaper/FireFox/NewRQ/")


td = tempfile()
dir.create(td)
write( c("man", "cat", "donkey"), file=paste(td, "D1", sep="/"))
write( c("hamster", "donkey", "sushi"), file=paste(td, "D2", sep="/"))
write( c("man", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("man", "donkey", "man"), file=paste(td, "D4", sep="/"))


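# create matrices
myMatrix = textmatrix(td, minWordLength=1)
myMatrix
# reduce the term-document matrix to the LSA space
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
myLSAspace
# map the LSA space back to a term-document matrix and print it
myNewMatrix = as.textmatrix(myLSAspace)
myNewMatrix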


# calc associations for donkey
associate(myNewMatrix, "donkey")
# clean up
unlink(td, recursive=TRUE)


# load the sample data sets that ship with the lsa package
data(corpus_training)
data(corpus_essays)
data(corpus_scores)

vec1 = c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 = c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
vec3 = c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
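# named vecMatrix rather than matrix, so the name of base::matrix is not reused
vecMatrix = cbind(vec1, vec2, vec3)

# cosine similarity between the column vectors
cosine(vecMatrix)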


# create some files
td = tempfile()
dir.create(td)
write( c("man", "cat", "donkey"), file=paste(td, "D1", sep="/"))
write( c("hamster", "donkey", "sushi"), file=paste(td, "D2", sep="/"))
write( c("man", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("man", "donkey", "man"), file=paste(td, "D4", sep="/"))
# create matrices
myMatrix = textmatrix(td, minWordLength=1)
sample(myMatrix, 3)
# clean up
unlink(td, recursive=TRUE)


# load the English stopword list that ships with the lsa package
data(stopwords_en)



# create some files
td = tempfile()
dir.create(td)
write( c("while", "dance", "donkey","fifa"), file=paste(td, "D1", sep="/"))
write( c("hamster", "dance", "sushi"), file=paste(td, "D2", sep="/"))
write( c("man", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("man", "dance", "man"), file=paste(td, "D4", sep="/"))
# create matrices
myMatrix = textmatrix(td, minWordLength=1,stopwords=stopwords_en,stemming=TRUE, language="english", minDocFreq=1)
myMatrix
# clean up
unlink(td, recursive=TRUE)




This R script uses several packages to perform various text processing tasks.


    1- The script begins by loading several packages: 'lsr', 'foreign', 'MASS', 'bootES', 'lsa', and 'corrgram'.
    2- Next, it sets the working directory to a specific file path on the author's computer, "D:/FirstPaper/FireFox/NewRQ/".
    3- Then, it creates a temporary directory and writes four text files to it, each containing a short list of words.
    4- It then creates a matrix of the text data, and uses the 'lsa' package to perform Latent Semantic Analysis on the matrix, reducing the dimensionality of the data. It then converts the resulting data back into a text matrix.
    5- It then uses the 'associate' function to calculate associations for the word "donkey" in the new matrix, and removes the temporary directory.
    6- Next, the script loads several data sets: 'corpus_training', 'corpus_essays', and 'corpus_scores'.
    7- It then creates three vectors, binds them together as the columns of a matrix, and uses the 'cosine' function to calculate the cosine similarity between them.
    8- It creates another temporary directory and writes four text files to it, creates a text matrix of the data, and uses the 'sample' function to draw a random sample of size 3 from the matrix.
    9- Finally, it loads the 'stopwords_en' dataset and repeats the process of creating a temporary directory, writing text files, creating a text matrix and cleaning up the temporary directory. This time, it also applies stopword removal and stemming through the 'textmatrix' function, with the language argument set to "english".
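
To show what the word-count matrix from steps 3 and 4 contains, here is a minimal base R sketch that counts terms per document without the 'lsa' package. The helper count_terms and its stopwords argument are hypothetical names made up for this illustration; the real 'textmatrix' function reads files from disk and offers more options (such as stemming), which this sketch does not attempt to reproduce.

```r
# The four toy documents from the first part of the script, kept in memory
docs <- list(
  D1 = c("man", "cat", "donkey"),
  D2 = c("hamster", "donkey", "sushi"),
  D3 = c("man", "monster", "monster"),
  D4 = c("man", "donkey", "man")
)

# Hypothetical helper (not part of lsa): term-by-document count matrix,
# optionally dropping a caller-supplied vector of stopwords first.
count_terms <- function(docs, stopwords = character(0)) {
  docs <- lapply(docs, function(w) w[!(w %in% stopwords)])
  vocab <- sort(unique(unlist(docs)))
  counts <- vapply(
    docs,
    function(w) tabulate(match(w, vocab), nbins = length(vocab)),
    integer(length(vocab))
  )
  rownames(counts) <- vocab
  counts
}

# Rows are terms, columns are documents
m <- count_terms(docs)
m

# Same documents, with "man" treated as a stopword for illustration
m2 <- count_terms(docs, stopwords = "man")
m2
```

Each column of the result describes one document, which is why the cosine similarity in step 7 is computed between columns.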

To understand the meaning of each function used in the code above, please check the user manual of the "lsa" package.


