Vocabulary mismatch is even more of a pronounced problem when dealing with corpus made of source code. There are several tools used in the past and here are some of them that I will need to research and learn about
1) Soundex: Phonetic variations of the words are captured
2) Lexical affinity: A sliding window (grep-like) tool that calculates how close are two identifier names
3) Separate words using "_": counter_activity -> counter activity
4) Separate words using case sensitivity: LoadBuffer -> Load buffer
One could additionally use WORDNET and spelling correction or missing character handling to improve upon the existing techniques
No comments:
Post a Comment