Friday, October 23, 2009

Perl tutorial: parsing SGML files using HTML parsers

I have not come across a useful tutorial for beginners on HTML parsing using Perl.

Step 1. Create a subclass

package MyParser;
use base qw(HTML::Parser);

Step 2. Create a list of global variables

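our($self,$origtext1,$is_cdata,$tagname,$attr,$attrseq,$origtext2);

Use one $origtext variable per handler that receives the original text (here $origtext1 for text and $origtext2 for start). You can add more depending on where your text and attributes are embedded.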

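Step 3. Define three subs that override the text, start and end methods of HTML::Parser as follows

sub text {
    ($self, $origtext1, $is_cdata) = @_;
}

sub start {
    ($self, $tagname, $attr, $attrseq, $origtext2) = @_;
}

sub end {
    # end also receives the tag name, followed by the original text of the closing tag
    ($self, $tagname, $origtext2) = @_;
}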
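To see what each handler actually receives, here is a minimal sketch that records every event for a one-line input. The package name EventTracer, the traced_events helper and the id attribute in the sample input are made up for this illustration. The argument lists are the ones HTML::Parser passes to a subclass created with a plain new, as far as I understand its version 2 compatibility mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package EventTracer;    # hypothetical name, used only in this sketch
use base qw(HTML::Parser);

# Event descriptions, in the order the parser fired them.
my @events;

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    # $attr is a hash ref of attributes, $attrseq lists their names in order
    my @pairs = map { "$_=$attr->{$_}" } @$attrseq;
    push @events, join ' ', "start:$tagname", @pairs;
}

sub end {
    my ($self, $tagname, $origtext) = @_;
    push @events, "end:$tagname";
}

sub text {
    my ($self, $origtext, $is_cdata) = @_;
    # ignore whitespace-only text between tags
    push @events, "text:$origtext" if $origtext =~ /\S/;
}

sub traced_events { return @events }

package main;

my $tracer = EventTracer->new;
$tracer->parse('<doc id="7"><docno>1</docno></doc>');
$tracer->eof;    # tell the parser there is no more input
print "$_\n" for EventTracer::traced_events();
```

Running it prints one line per event, which makes it easy to check the order in which start, text and end fire for nested tags.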


Step 4. Each of these subs is nothing but an event handler. So if you want to detect a tag and anything inside it, test the tag name in start and end:
if($tagname eq 'tag_to_detect'){

}
The start sub sets a flag and the end sub resets it, so that the text sub can check the flag to know whether it is currently inside <tag_to_detect>.
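The flag idea can be sketched as follows. This is only an illustration: the package name FlagParser, the captured helper and the sample input are my own, and docno is just an example of a tag to detect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package FlagParser;     # hypothetical name, used only in this sketch
use base qw(HTML::Parser);

my $inside = 0;         # 1 while the parser is between <docno> and </docno>
my @captured;           # one entry per <docno> element seen

sub start {
    my ($self, $tagname) = @_;
    if ($tagname eq 'docno') {
        $inside = 1;            # set the flag on the opening tag
        push @captured, '';     # start a fresh buffer for this element
    }
}

sub end {
    my ($self, $tagname) = @_;
    $inside = 0 if $tagname eq 'docno';   # reset the flag on the closing tag
}

sub text {
    my ($self, $origtext) = @_;
    # Text may arrive in several pieces, so append rather than assign.
    $captured[-1] .= $origtext if $inside;
}

sub captured { return @captured }

package main;

my $parser = FlagParser->new;
$parser->parse('<doc><docno>1</docno>ignored</doc><doc><docno>2</docno></doc>');
$parser->eof;
print join(',', FlagParser::captured()), "\n";
```

Only the text seen while the flag is set is kept, so the word "ignored" in the sample input never reaches @captured.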

Step 5. Write the code that creates the parser object and parses the file.

my $fname = '';
my $parser = MyParser->new;
$parser->parse_file($fname);
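As far as I can tell from the HTML::Parser documentation, parse_file returns an undefined value when the file cannot be opened, so it is worth checking the result. Below is a minimal sketch of that check. The TagCounter package, the count_of helper and the temporary sample file are assumptions made only so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

package TagCounter;     # hypothetical name, used only in this sketch
use base qw(HTML::Parser);

my %count;              # tag name => number of opening tags seen

sub start { my ($self, $tagname) = @_; $count{$tagname}++ }
sub end   { }           # nothing to do here, defined for completeness
sub text  { }

sub count_of { return $count{ $_[0] } || 0 }

package main;

# Write a small sample file so the sketch does not depend on real data.
my ($fh, $fname) = tempfile(SUFFIX => '.sgml', UNLINK => 1);
print {$fh} "<doc><docno>1</docno>first</doc>\n<doc><docno>2</docno>second</doc>\n";
close $fh or die "Cannot close $fname: $!";

my $parser = TagCounter->new;
# parse_file returns undef if the file cannot be opened, and $! says why
defined $parser->parse_file($fname) or die "Cannot open $fname: $!";

print "doc elements: ", TagCounter::count_of('doc'), "\n";
```

Without the check, a wrong path simply produces no output, which is easy to mistake for an empty file.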


An example file and example code

file doc.sgml


<doc>
<docno>1</docno>
Stock Exchange wall street traders dollar market financial crisis public private banks investment funds regulators corporation board owenership decline competitors share broker buyers seller american capital globalize money
</doc>

<doc>
<docno>2</docno>
Iraq Afghanistan Soldier battle field combat zone wounded amputees blinded veteran base lieutenant explosion deploy force troops american british pakistan strategy civilian security general war commanders operations helicopters allies military death european army taliban terrorist islam extremist NATO Karzai al aaqeda brigade sergeant base injury bombs killed violence fatalities
</doc>




code

#!/usr/bin/perl -w
use strict;
package MyParser;
use base qw(HTML::Parser);
our($self,$origtext1,$is_cdata,$tagname,$attr,$attrseq,$origtext2);
# two kinds of text in this document I want to extract
my $flagdocno=0; # set while inside <docno>, the first text
my $flagtext=0; # set while inside <doc>, the second text

# here the text/start/end methods of HTML::Parser are overridden
sub text {
    ($self, $origtext1, $is_cdata) = @_;
    return unless $origtext1 =~ /\S/; # skip whitespace between tags
    # check the flags, not $tagname: $tagname only holds the last tag seen
    if($flagdocno){
        print "$origtext1->\t";
    }
    elsif($flagtext){
        print "$origtext1->\t";
    }
}
sub start {
    ($self, $tagname, $attr, $attrseq, $origtext2) = @_;
    if($tagname eq 'docno'){
        $flagdocno=1;
        print "found topicno\t";
    }
    if($tagname eq 'doc'){
        $flagtext=1;
        print "found text\t";
    }
}
sub end {
    # read the closing tag's name from the arguments, just as in start
    ($self, $tagname, $origtext2) = @_;
    if($tagname eq 'docno'){
        $flagdocno=0;
        print "ending topicno\n";
    }
    if($tagname eq 'doc'){
        $flagtext=0;
        print "ending text\n";
    }
}

package main;

my $fname = '/media/windows/shivani/reserach/database/myown_test/topics.sgml';
my $parser = MyParser->new;
$parser->parse_file($fname);
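If you would rather collect the data than print it, one possible variation is to keep the state inside the parser object. The sketch below assumes every record looks like <doc><docno>N</docno>text</doc>. The RecordParser package, the records method and the hash keys current, in_docno and records are names I made up. It relies on the parser object being a plain hash, and as far as I know HTML::Parser only reserves keys starting with _hparser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package RecordParser;   # hypothetical name, used only in this sketch
use base qw(HTML::Parser);

sub start {
    my ($self, $tagname) = @_;
    if ($tagname eq 'doc') {
        # begin a new record when a document opens
        $self->{current} = { docno => '', text => '' };
    }
    elsif ($tagname eq 'docno') {
        $self->{in_docno} = 1;
    }
}

sub end {
    my ($self, $tagname) = @_;
    if ($tagname eq 'docno') {
        $self->{in_docno} = 0;
    }
    elsif ($tagname eq 'doc' and $self->{current}) {
        my $rec = delete $self->{current};
        s/^\s+|\s+$//g for values %$rec;    # trim surrounding whitespace
        push @{ $self->{records} }, $rec;
    }
}

sub text {
    my ($self, $origtext) = @_;
    my $rec = $self->{current} or return;   # text outside <doc> is dropped
    my $field = $self->{in_docno} ? 'docno' : 'text';
    $rec->{$field} .= $origtext;            # append, text may come in pieces
}

# Returns the list of { docno => ..., text => ... } hash refs collected so far.
sub records { return @{ $_[0]{records} || [] } }

package main;

my $parser = RecordParser->new;
$parser->parse(<<'SGML');
<doc>
<docno>1</docno>
Stock Exchange wall street
</doc>
<doc>
<docno>2</docno>
Iraq Afghanistan Soldier
</doc>
SGML
$parser->eof;

printf "%s => %s\n", $_->{docno}, $_->{text} for $parser->records;
```

Because nothing is stored in package variables, two parser objects can be used in the same program without overwriting each other's results.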

Hopefully this is useful to somebody who is a beginner just like me.
