My Research Diaries: March 2010

Thursday, March 18, 2010

Perl: Detecting two words concatenated into one, separated by an uppercase character

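If you are given a word such as 'textArea' and want to separate it into 'text Area', here is some Perl code that does it. It looks for every place where a lowercase letter is directly followed by an uppercase letter and breaks the word there.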


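   use strict;
   use warnings;

   my $item = 'textAreaBuffer';

   # Only proceed if a lowercase letter is directly followed by an uppercase one.
   if ($item =~ m/[a-z][A-Z]/) {
       # Record the offset of each uppercase letter that starts a new word.
       # After a match, pos() points just past the uppercase letter, so
       # subtract 1 to get the offset of the letter itself.
       my @breaks;
       while ($item =~ m/[a-z][A-Z]/g) {
           push @breaks, pos($item) - 1;
       }

       # Print each word, from the previous break up to the next one.
       # Output goes to STDOUT; print to your own filehandle (e.g. FILE2) if needed.
       my $start = 0;
       for my $break (@breaks) {
           print substr($item, $start, $break - $start), " ";
           $start = $break;
       }

       # Print the final word, which runs to the end of the string.
       print substr($item, $start), "\n";
   }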

Wednesday, March 17, 2010

Vocabulary handling for software reuse

Vocabulary mismatch is an even more pronounced problem when dealing with a corpus made of source code. Several tools have been used in the past; here are some that I need to research and learn about:

1) Soundex: Phonetic variations of the words are captured
2) Lexical affinity: A sliding-window (grep-like) tool that calculates how close two identifier names are
3) Separate words using "_": counter_activity -> counter activity
4) Separate words using case changes: LoadBuffer -> Load Buffer
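
Techniques 3 and 4 above are simple enough to sketch in a few lines of Perl. The following is only a rough illustration, not a finished tool: the helper name split_identifier is made up for this example, and lowercasing the resulting terms is my own assumption about how they would be indexed.

```perl
use strict;
use warnings;

# Hypothetical helper: split an identifier into lowercase vocabulary terms.
sub split_identifier {
    my ($identifier) = @_;

    # Technique 3: treat runs of underscores as word separators and
    # drop any empty pieces (e.g. from a leading underscore).
    my @parts = grep { length } split /_+/, $identifier;

    # Technique 4: break each part wherever a lowercase letter or digit
    # is directly followed by an uppercase letter.
    my @words = map { split /(?<=[a-z0-9])(?=[A-Z])/, $_ } @parts;

    # Assumption: lowercase the terms so 'Load' and 'load' count as one word.
    return map { lc $_ } @words;
}

# Prints one line of space-separated terms per identifier.
print join(' ', split_identifier($_)), "\n"
    for qw(counter_activity LoadBuffer textAreaBuffer);
```

Note that this sketch only splits on a lowercase-to-uppercase change, so a run of capitals such as 'XMLParser' stays together as one term. Handling acronyms would need an extra rule.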

One could additionally use WordNet, spelling correction, or missing-character handling to improve on these techniques.