Step 1. Create a subclass
package MyParser;
use base qw(HTML::Parser);
Step 2. Create a list of global variables
our($self,$origtext1,$is_cdata,$tagname,$attr,$attrseq,$origtext2);
You can declare an extra $origtext variable for each handler that receives the original text, depending on where your text and attributes are embedded.
Step 3. Define three subs that override the parser's text, start, and end methods, as follows
sub text{
($self, $origtext1, $is_cdata) = @_;
}
sub start {
($self, $tagname, $attr, $attrseq, $origtext2) = @_;
}
sub end {
($self, $tagname, $origtext2) = @_;
}
Step 4. Each of these subs is simply an event handler. So if you want to detect a tag and anything in between, use
if($tagname eq 'tag_to_detect'){
}
in each of the subs. The start and end subs would typically set and reset a flag to mark the start and end of that particular <tag_to_detect>.
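The flag idea from Step 4 can be sketched roughly as below. This is only a minimal illustration, not the only way to do it: the class name FlagParser, the variable names, and the inline sample string are made up for the example, and the tag being detected is assumed to be docno.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package FlagParser;              # hypothetical class name for this sketch
use base qw(HTML::Parser);

my $inside = 0;                  # 1 while between <docno> and </docno>
our @found;                      # text chunks seen while the flag is set

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    $inside = 1 if $tagname eq 'docno';      # set the flag on the opening tag
}

sub text {
    my ($self, $origtext, $is_cdata) = @_;
    push @found, $origtext if $inside;       # keep text only inside the tag
}

sub end {
    my ($self, $tagname, $origtext) = @_;
    $inside = 0 if $tagname eq 'docno';      # reset the flag on the closing tag
}

package main;

my $parser = FlagParser->new;    # no arguments, so the subs above act as handlers
$parser->parse('<doc>skip this</doc><docno>keep me</docno>');
$parser->eof;                    # tell the parser there is no more input
print join('', @FlagParser::found), "\n";
```

Because the end sub reads the closing tag's name from its own arguments, the flag is reset by the tag that is actually closing rather than by whichever tag was opened last.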
Step 5. Write the code that creates the parser object and parses the file.
my $fname = '
my $parser = MyParser->new;
$parser->parse_file($fname);
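One thing worth guarding against in Step 5 is a wrong file name: parse_file returns an undefined value when the file cannot be opened, and $! holds the reason. The sketch below shows one possible check. CountingParser is a hypothetical stand-in for MyParser, and the temporary file exists only so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

package CountingParser;          # hypothetical stand-in for MyParser
use base qw(HTML::Parser);

our $starts = 0;
sub start { $starts++ }          # count every opening tag
sub text  { }                    # nothing to do for text in this sketch
sub end   { }                    # nothing to do for closing tags

package main;

# Write a tiny sample document (made up for this example).
my ($fh, $fname) = tempfile(SUFFIX => '.sgml', UNLINK => 1);
print {$fh} "<doc><docno>1</docno>some text</doc>\n";
close $fh or die "close failed: $!";

my $parser = CountingParser->new;
# parse_file gives back undef if the file could not be opened
defined $parser->parse_file($fname)
    or die "Cannot open '$fname': $!\n";
print "start tags seen: $CountingParser::starts\n";
```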
An example file and example code
File doc.sgml
Stock Exchange wall street traders dollar market financial crisis public private banks investment funds regulators corporation board owenership decline competitors share broker buyers seller american capital globalize money
Iraq Afghanistan Soldier battle field combat zone wounded amputees blinded veteran base lieutenant explosion deploy force troops american british pakistan strategy civilian security general war commanders operations helicopters allies military death european army taliban terrorist islam extremist NATO Karzai al aaqeda brigade sergeant base injury bombs killed violence fatalities
Code
#!/usr/bin/perl -w
use strict;
package MyParser;
use base qw(HTML::Parser);
our($self,$origtext1,$is_cdata,$tagname,$attr,$attrseq,$origtext2);
# two kinds of text in this document I want to extract
my $flagdocno=0; # set to 1 while inside <docno>
my $flagtext=0; # set to 1 while inside <doc>
# here HTML::Parser's text/start/end methods are overridden
sub text {
    ($self, $origtext1, $is_cdata) = @_;
    # print text only while one of the flags is set
    if($flagdocno){
        print "$origtext1->\t";
    }
    elsif($flagtext){
        print "$origtext1->\t";
    }
}
sub start {
    ($self, $tagname, $attr, $attrseq, $origtext2) = @_;
    if($tagname eq 'docno'){
        $flagdocno=1;
        print "found topicno\t";
    }
    if($tagname eq 'doc'){
        $flagtext=1;
        print "found text\t";
    }
    #print "$tagname\t";
}
sub end {
    # the end handler receives the closing tag's name; read it here instead of
    # relying on the value left over from the last start tag
    ($self, $tagname, $origtext2) = @_;
    if($tagname eq 'docno'){
        $flagdocno=0;
        print "ending topicno\n";
    }
    if($tagname eq 'doc'){
        $flagtext=0;
        print "ending text\n";
    }
}
package main;
my $fname = '/media/windows/shivani/reserach/database/myown_test/topics.sgml';
my $parser = MyParser->new;
$parser->parse_file($fname);
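If you would rather keep the extracted text than print it, the same three subs can fill a hash. The following is a rough sketch under assumptions that are mine, not taken from the files above: each <doc> is assumed to contain a <docno> followed by the body text, and the class name CollectingParser, the hash %text_for, and the inline sample string are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package CollectingParser;        # hypothetical name for this sketch
use base qw(HTML::Parser);

my @open_tags;                   # stack of currently open tag names
my $current;                     # docno of the <doc> being read
our %text_for;                   # docno => accumulated body text

sub start {
    my ($self, $tagname) = @_;
    push @open_tags, $tagname;
}

sub text {
    my ($self, $origtext) = @_;
    return unless @open_tags;                    # ignore text outside any tag
    if ($open_tags[-1] eq 'docno') {
        ($current = $origtext) =~ s/^\s+|\s+$//g; # trimmed docno becomes the key
        $text_for{$current} = '';
    }
    elsif ($open_tags[-1] eq 'doc' && defined $current) {
        $text_for{$current} .= $origtext;        # text may arrive in several pieces
    }
}

sub end {
    my ($self, $tagname) = @_;
    pop @open_tags if @open_tags && $open_tags[-1] eq $tagname;
    undef $current if $tagname eq 'doc';         # this document is finished
}

package main;

my $parser = CollectingParser->new;
$parser->parse("<doc><docno> 1 </docno>stock market</doc>\n<doc><docno>2</docno>troops war</doc>\n");
$parser->eof;
for my $docno (sort keys %CollectingParser::text_for) {
    print "$docno: $CollectingParser::text_for{$docno}\n";
}
```

A stack of open tags is used here instead of one flag per tag so that nested tags are handled without adding a new variable for each one.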
Hopefully this is useful to somebody who is just starting out, like me.