> class(myCorpus)
TextDocument "character"
The XMLSource reads all the files and extracts a list of XMLNode objects. Each XMLNode in that list is then processed by the XML reader, which returns a list of "TextDocument" objects.
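The source/reader split described above can be sketched with the XML package alone, without tm. This is only an illustration of the two steps, not how tm implements them: the names parse_nodes and read_node and the inlined two-document XML string are made up for the example.

```r
library(XML)

# A small inlined corpus so the sketch is self-contained (illustrative data).
xml_text <- "<corpus><DOC><DOCNO>1</DOCNO>emission obama dioxide</DOC><DOC><DOCNO>2</DOCNO>global dioxide</DOC></corpus>"

# Step 1 (the "source" side): take the parsed tree and return one XMLNode per document.
parse_nodes <- function(tree) xmlChildren(xmlRoot(tree))

# Step 2 (the "reader" side): turn a single XMLNode into a document-like record.
read_node <- function(node) {
  list(id = xmlValue(node[["DOCNO"]]), content = xmlValue(node))
}

tree  <- xmlTreeParse(xml_text, asText = TRUE)
nodes <- parse_nodes(tree)
docs  <- lapply(nodes, read_node)

length(docs)    # number of documents found
docs[[1]]$id    # id of the first document
```

Using xmlRoot() to reach the top-level element avoids having to know how the parsed tree is laid out internally. Note that xmlValue() on the whole DOC node also picks up the text of its DOCNO child.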
Here is the code for the custom XML source (the custom XML file it reads is listed after it):
library(XML)
library(tm)
# Source: parse the file and return one XMLNode per <DOC> under <corpus>.
mySource <- function(x, encoding = "UTF-8")
    XMLSource(x, function(tree) xmlChildren(tree[[1]]$children$corpus), myXMLReader, encoding)
# Reader: map each <DOC> node to a PlainTextDocument (content from /DOC, id from /DOC/DOCNO).
myXMLReader <- readXML(spec = list(Content = list("node", "/DOC"), id = list("node", "/DOC/DOCNO")), doc = PlainTextDocument())
myCorpus <- Corpus(mySource("/home/shivani/research/mycode/R/query.xml"))
class(myCorpus[[1]])
What I did not like about this method of creating an XMLSource is that it is extremely tailor-made. See xmlChildren(tree[[1]]$children$corpus). This was done by trial and error, trying different functions until one returned an XMLNode list. Also, even after all this monkeying around, I have not been able to get rid of the docid in my parsed document. Since there is a mix of data in the
If anybody could suggest a way to extract the text of each document without also extracting the document number, that would be great.
<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
<DOC>
<DOCNO>3</DOCNO>
global greenhouse pollutants greenhouse dioxide
</DOC>
<DOC>
<DOCNO>4</DOCNO>
soldier field combat field
</DOC>
<DOC>
<DOCNO>5</DOCNO>
student student community
</DOC>
<DOC>
<DOCNO>6</DOCNO>
emission obama battle obama afghanistan field obama soldier soldier volumes emission
</DOC>
<DOC>
<DOCNO>7</DOCNO>
afghanistan traders field dollar iraq dollar iraq
</DOC>
<DOC>
<DOCNO>8</DOCNO>
emission obama soldier warming greenhouse obama carbon
</DOC>
<DOC>
<DOCNO>9</DOCNO>
dollar market dollar america
</DOC>
<DOC>
<DOCNO>10</DOCNO>
global dioxide
</DOC>
</corpus>
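One possible way to keep the text and leave out the document number, sketched with the XML package alone rather than through tm's XMLSource, is to select only the text nodes that sit directly under each DOC with the XPath step text(). This is an untested-against-tm sketch under stated assumptions: the variable names (xml_text, doc_nodes, ids, texts) are illustrative, and the inlined XML is just the first two documents of the file above.

```r
library(XML)

# First two documents of the corpus file, inlined so the sketch runs on its own.
xml_text <- '<?xml version="1.0"?>
<corpus>
<DOC>
<DOCNO>1</DOCNO>
emission obama dioxide
</DOC>
<DOC>
<DOCNO>2</DOCNO>
wall street market stock dollar
</DOC>
</corpus>'

doc <- xmlParse(xml_text, asText = TRUE)

# One entry per <DOC> element.
doc_nodes <- getNodeSet(doc, "/corpus/DOC")

# Document numbers come from the <DOCNO> child of each <DOC>.
ids <- sapply(doc_nodes, function(node) xmlValue(node[["DOCNO"]]))

# "./text()" matches only text nodes that are direct children of <DOC>,
# so the text inside <DOCNO> is not included.
texts <- sapply(doc_nodes, function(node) {
  parts <- xpathSApply(node, "./text()", xmlValue)
  trimws(paste(parts, collapse = " "))
})

print(ids)
print(texts)
```

The ids and texts vectors line up by position, so they could be paired afterwards if the document number is still needed as metadata rather than as part of the content.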