Thursday, March 18, 2010

Perl: Detecting two words concatenated into one, separated by an uppercase character

If you are given a word such as 'textArea' and want to split it into 'text Area', here is Perl code that does it. The example uses a three-part word, 'textAreaBuffer', to show that it handles more than one boundary.

use strict;
use warnings;

my $item = 'textAreaBuffer';

# Record the offset of every uppercase letter that directly follows a lowercase one.
my @breaks;
while ($item =~ m/[a-z][A-Z]/g) {
    push @breaks, pos($item) - 1;    # pos() points just past the uppercase letter
}

# Cut the string at each recorded offset and collect the pieces.
my @words;
my $start = 0;
for my $break (@breaks) {
    push @words, substr($item, $start, $break - $start);
    $start = $break;
}
push @words, substr($item, $start);    # whatever remains after the last boundary

print join(' ', @words), "\n";    # prints "text Area Buffer"
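The same split can also be written as a single substitution. This is only a minimal sketch: the sub name split_camel_case is made up for illustration, and it assumes plain ASCII identifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space wherever a lowercase letter
# is immediately followed by an uppercase letter.
sub split_camel_case {
    my ($word) = @_;
    # The lookbehind and lookahead match a position, not characters,
    # so no letters are consumed or lost by the substitution.
    (my $spaced = $word) =~ s/(?<=[a-z])(?=[A-Z])/ /g;
    return $spaced;
}

print split_camel_case('textAreaBuffer'), "\n";    # prints "text Area Buffer"
```

A word with no lowercase-to-uppercase boundary, such as 'text', comes back unchanged.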

Wednesday, March 17, 2010

Vocabulary handling for software reuse

Vocabulary mismatch is an even more pronounced problem when dealing with a corpus made of source code. Several tools have been used for this in the past; here are some that I need to research and learn about:

1) Soundex: Phonetic variations of the words are captured
2) Lexical affinity: A sliding window (grep-like) tool that calculates how close two identifier names are
3) Separate words using "_": counter_activity -> counter activity
4) Separate words at case changes: LoadBuffer -> Load Buffer
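
Items 3 and 4 are simple enough to sketch in Perl. The snippet below is an illustration under assumptions, not one of the tools listed above: the sub name identifier_terms is hypothetical, identifiers are assumed to be ASCII, and lowercasing the resulting terms is an added choice so that 'Load' and 'load' count as the same vocabulary term.

```perl
use strict;
use warnings;

# Hypothetical helper: break an identifier into lowercase vocabulary terms.
sub identifier_terms {
    my ($name) = @_;
    # Split on runs of underscores, or at the empty position between
    # a lowercase letter and the uppercase letter that follows it.
    my @parts = split /_+|(?<=[a-z])(?=[A-Z])/, $name;
    # Drop empty pieces (e.g. from a leading underscore) and lowercase the rest.
    return map { lc $_ } grep { length $_ } @parts;
}

print join(' ', identifier_terms('counter_activity')), "\n";    # prints "counter activity"
print join(' ', identifier_terms('LoadBuffer')), "\n";          # prints "load buffer"
```

Returning a list of terms instead of a spaced string makes it easier to feed the result into an index or a term-frequency count.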

One could additionally use WordNet, spelling correction, or missing-character handling to improve on the existing techniques.