Friday, April 22, 2011

Using gensim... the basics

Thanks to gensim's extremely active community, I have worked my way through some basic commands in Python and gensim.

I have a directory of text documents that I want to index and build a topic model on.
Each file in the directory is a document containing plain text.
Let's assume that the text has already been pruned of stopwords, special characters, etc.
I need to write a custom override of the get_texts() function of TextCorpus, and this is how I achieve it:

def split_line(text):
    # Split the text on whitespace and collect the tokens in a list
    words = text.split()
    out = []
    for word in words:
        out.append(word)
    return out

import gensim

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # self.input is the list of file names passed to the constructor
        for filename in self.input:
            with open(filename) as f:
                yield split_line(f.read())

If b is a list of file names, then

myCorpus = MyCorpus(b)

will create the corpus and

myCorpus.dictionary has all the unique words

myCorpus.dictionary.token2id.items()  gives the word-id pairs

myCorpus.dictionary.token2id.keys() gives the unique words

myCorpus.dictionary.token2id.values() gives the corresponding ids
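
To make the three views above concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for token2id. The words and ids are made up for illustration; a real dictionary will hold whatever gensim assigned while reading the corpus. The sketch also shows how to invert the mapping to look up a word by its id.

```python
# Hypothetical stand-in for myCorpus.dictionary.token2id (made-up words and ids)
token2id = {"topic": 0, "model": 1, "index": 2}

# Word-id pairs, sorted by id for readability
pairs = sorted(token2id.items(), key=lambda kv: kv[1])

# The unique words and their ids
words = list(token2id.keys())
ids = list(token2id.values())

# Inverted mapping: look up a word from its id
id2token = {idx: tok for tok, idx in token2id.items()}

print(pairs)
print(id2token[1])
```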

The corpus can be saved in Matrix Market format using the following command:

`gensim.corpora.MmCorpus.serialize('', myCorpus)`
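
For anyone curious what a Matrix Market file looks like, here is a rough sketch that writes the coordinate layout by hand in plain Python. It is only an illustration of the file format, not gensim's own writer. The function name to_matrix_market and the sample documents are my own assumptions; each document is a list of (term id, count) pairs, with documents as rows and terms as columns.

```python
def to_matrix_market(bow_docs, num_terms):
    """Render bag-of-words documents as Matrix Market coordinate text.

    bow_docs: list of documents, each a list of (term_id, count) pairs
    with 0-based term ids. Rows are documents, columns are terms.
    """
    entries = []
    for doc_no, doc in enumerate(bow_docs):
        for term_id, count in sorted(doc):
            # Matrix Market indices are 1-based
            entries.append(f"{doc_no + 1} {term_id + 1} {count}")
    header = "%%MatrixMarket matrix coordinate real general"
    # Size line: number of rows, number of columns, number of non-zero entries
    size_line = f"{len(bow_docs)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries) + "\n"


# Two made-up documents over a vocabulary of three terms
docs = [[(0, 2), (2, 1)], [(1, 3)]]
mm_text = to_matrix_market(docs, num_terms=3)
print(mm_text)
```

Each line after the size line is one non-zero cell: document number, term number, count.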

To add new documents, just extend the list b to include the new file names and redo all of the above. Internally, the implementation picks up from where it left off.
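
The idea of extending a vocabulary without disturbing what is already there can be sketched in plain Python. This is only an illustration of the concept, not a description of gensim's internals; the helper below and its sample documents are hypothetical. (gensim's Dictionary class does have its own add_documents method for a similar purpose.)

```python
def add_documents(token2id, documents):
    """Assign ids to unseen tokens; ids of known tokens stay unchanged."""
    for tokens in documents:
        for token in tokens:
            if token not in token2id:
                # The next free id is simply the current vocabulary size
                token2id[token] = len(token2id)
    return token2id


# Initial vocabulary from two made-up documents
token2id = add_documents({}, [["topic", "model"], ["model", "index"]])
before = dict(token2id)

# A new document arrives later: only the unseen word gets a new id
add_documents(token2id, [["index", "corpus"]])
print(token2id)
```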

I still need to work on indexing and on LSI-based and LDA-based modeling of the corpus using the above framework, and I hope to add more posts as I learn about them.
