Wednesday, September 15, 2010

Parsing custom XML files using R

I have been trying to learn R, mainly because I need to start using its text mining package (tm) to run a lot of analysis on text data. A couple of datasets are already indexed in R, but I wanted to index my own datasets, which are written in custom XML formats. The tutorial is very useful, and I had some success writing my own XML readers and XML sources. The end goal is a corpus object (say myCorpus) that holds plain text documents, so that checking the class of its first document gives:

> class(myCorpus[[1]])
[1] "PlainTextDocument" "TextDocument"      "character"

The XMLSource reads all the files and extracts a list of XMLNodes. The XML reader then processes each XMLNode in that list and returns a list of TextDocuments.
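The two stages described above can be sketched with the XML package alone, without tm. This is only an illustration of the idea, not what tm does internally: split_into_nodes and read_node are hypothetical helper names, and the two-document XML string is an inlined excerpt of the corpus file.

```r
library(XML)

# Two-document excerpt of the corpus file, inlined so the sketch is self-contained.
xml_text <- '<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
</corpus>'

# Parse once and keep the document object alive at the top level.
tree <- xmlParse(xml_text, asText = TRUE)

# "Source" step (hypothetical helper): return one node per document.
split_into_nodes <- function(tree) getNodeSet(tree, "/corpus/DOC")

# "Reader" step (hypothetical helper): turn a single node into an id plus content.
# xmlValue() on the DOC node concatenates all text below it, DOCNO included.
read_node <- function(node) {
  list(id = xmlValue(node[["DOCNO"]]), content = xmlValue(node))
}

nodes  <- split_into_nodes(tree)
parsed <- lapply(nodes, read_node)
str(parsed)
```

Because the reader takes the value of the whole DOC node, the document number ends up inside the content as well as in the id field.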

Here is the code that builds the corpus (the custom XML file itself is listed at the end of this post):


library(XML)
library(tm)
# Source: the parser function returns the list of DOC nodes under <corpus>.
mySource <- function(x, encoding = "UTF-8")
    XMLSource(x, function(tree) xmlChildren(tree[[1]]$children$corpus), myXMLReader, encoding)
# Reader: maps XPath expressions inside each DOC node to document fields.
myXMLReader <- readXML(spec = list(Content = list("node", "/DOC"), id = list("node", "/DOC/DOCNO")), doc = PlainTextDocument())
myCorpus <- Corpus(mySource("/home/shivani/research/mycode/R/query.xml"))



What I did not like about this way of creating an XMLSource is that it is extremely tailor-made. See xmlChildren(tree[[1]]$children$corpus). I arrived at it by trial and error, trying different functions until one of them returned an XMLNode list. Also, even after all this monkeying around, I have not been able to get rid of the docid in my parsed documents: the finally extracted documents contain a mix of the document number and the text.
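One way to cut down on the trial and error is to inspect the parsed tree with the XML package's accessor functions before writing the parser function. The sketch below is a minimal example on an inlined two-document excerpt. It assumes the tree is built with xmlTreeParse(); how tm parses the file internally is not shown in this post, so whether xmlRoot() reaches exactly the same node as tree[[1]]$children$corpus inside XMLSource is an assumption.

```r
library(XML)

# Two-document excerpt of the corpus file, inlined so the sketch is self-contained.
xml_text <- '<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
</corpus>'

# xmlTreeParse() builds an R-level tree; xmlRoot() returns its top-level node.
tree <- xmlTreeParse(xml_text, asText = TRUE)
root <- xmlRoot(tree)

# Name of the top-level element and of each of its children.
root_name   <- xmlName(root)
child_names <- unname(vapply(xmlChildren(root), xmlName, character(1)))

# Keep only the children that are DOC elements.
doc_nodes <- xmlChildren(root)[child_names == "DOC"]

print(root_name)
print(table(child_names))
```

Printing the element names at each level shows which children hold the documents, which makes it easier to write the node-list expression deliberately.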

If anybody could suggest a way to extract the text of each document without also extracting the document number, that would be great.
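A possible direction, sketched with the XML package only: the XPath node test text() selects just the text nodes that are direct children of DOC, so the text inside the DOCNO element is left out. The variable names and the inlined two-document excerpt are illustrative. Whether the same XPath (for example "/DOC/text()") can be used directly in the readXML spec is an untested assumption.

```r
library(XML)

# Two-document excerpt of the corpus file, inlined so the sketch is self-contained.
xml_text <- '<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
</corpus>'

doc  <- xmlParse(xml_text, asText = TRUE)
docs <- getNodeSet(doc, "/corpus/DOC")

# Document numbers come from the DOCNO child element.
ids <- sapply(docs, function(node) xmlValue(node[["DOCNO"]]))

# Document text comes only from the text nodes directly under DOC,
# so the DOCNO value is not included.
bodies <- sapply(docs, function(node) {
  text_nodes <- xpathSApply(node, "./text()", xmlValue)
  trimws(paste(text_nodes, collapse = " "))
})
names(bodies) <- ids

print(bodies)
```

Keeping the ids as names on the character vector preserves the link between each text and its document number without mixing the two.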

<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
<DOC>
<DOCNO>3</DOCNO>
global greenhouse pollutants greenhouse dioxide
</DOC>
<DOC>
<DOCNO>4</DOCNO>
soldier field combat field
</DOC>
<DOC>
<DOCNO>5</DOCNO>
student student community
</DOC>
<DOC>
<DOCNO>6</DOCNO>
emission obama battle obama afghanistan field obama soldier soldier volumes emission
</DOC>
<DOC>
<DOCNO>7</DOCNO>
afghanistan traders field dollar iraq dollar iraq
</DOC>
<DOC>
<DOCNO>8</DOCNO>
emission obama soldier warming greenhouse obama carbon
</DOC>
<DOC>
<DOCNO>9</DOCNO>
dollar market dollar america
</DOC>
<DOC>
<DOCNO>10</DOCNO>
global dioxide
</DOC>
</corpus>
