Thanks to the extremely active gensim community, I have made my way through some basic commands in Python and gensim.
I have a directory of text documents that I want to index and build a topic model on. Each file in the directory is a document containing plain text. Let's assume the text has already been pruned of stopwords, special characters, and so on.
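In case that pruning still needs doing, here is a minimal sketch of one way to do it using gensim's built-in stopword list; the regex tokenizer is my own choice for illustration, and nothing later in the post depends on it:

```python
import re
from gensim.parsing.preprocessing import STOPWORDS

def prune(text):
    # keep only lowercase alphanumeric tokens
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    # drop gensim's built-in stopwords
    return [t for t in tokens if t not in STOPWORDS]
```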
I need to write a custom override of the get_texts() function of TextCorpus, and this is how I do it:
```python
import gensim

def split_line(text):
    # naive whitespace tokenizer; the text is assumed
    # to be pre-cleaned, so a plain split() is enough
    return text.split()

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # self.input is the list of file names passed to the constructor
        for filename in self.input:
            with open(filename) as f:
                yield split_line(f.read())
```
If `b` is a list of file names, then

`myCorpus = MyCorpus(b)`

will create the corpus, and the dictionary is built along with it:

`myCorpus.dictionary` has all the unique words
`myCorpus.dictionary.token2id.items()` gives the word-id pairs
`myCorpus.dictionary.token2id.keys()` gives the unique words
`myCorpus.dictionary.token2id.values()` gives the corresponding ids
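Putting it together, and assuming the documents live in a directory called `docs/` (the path is purely illustrative), the whole thing can be driven like this:

```python
import glob

b = glob.glob('docs/*.txt')      # assumed location of the plain-text files
myCorpus = MyCorpus(b)

print(len(myCorpus.dictionary))  # number of unique words
for bow in myCorpus:             # each document as a bag-of-words vector
    print(bow)                   # list of (word_id, count) tuples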
One can save the corpus in Matrix Market format using the following command:

```python
gensim.corpora.MmCorpus.serialize('mycorpus.mm', myCorpus)
```
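The serialized corpus can be read back later without re-parsing the text files. The dictionary has to be saved separately, since the .mm file only stores the vectors:

```python
# save the dictionary alongside the corpus
myCorpus.dictionary.save('mycorpus.dict')

# later, load both back
mm = gensim.corpora.MmCorpus('mycorpus.mm')
dictionary = gensim.corpora.Dictionary.load('mycorpus.dict')
```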
In order to add new documents, just extend the list `b` with the new file names and redo the steps above; internally, the implementation picks up from where it left off.
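A minimal sketch of that update step, with hypothetical file names:

```python
b.extend(['docs/new_doc_1.txt', 'docs/new_doc_2.txt'])  # hypothetical new files
myCorpus = MyCorpus(b)                                   # re-run over the extended list
gensim.corpora.MmCorpus.serialize('mycorpus.mm', myCorpus)
```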
I still need to work on indexing and on LSI-based and LDA-based modeling of the corpus using the above framework, and I am hoping to add more posts as I learn about them.
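As a teaser for those posts, the standard gensim calls for the two models look roughly like this; num_topics is just a placeholder value:

```python
# both models take the bag-of-words corpus plus the dictionary
lsi = gensim.models.LsiModel(myCorpus, id2word=myCorpus.dictionary, num_topics=10)
lda = gensim.models.LdaModel(myCorpus, id2word=myCorpus.dictionary, num_topics=10)
print(lsi.show_topics())
print(lda.show_topics())
```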