Monday, November 11, 2019

Part of Speech Recognizer

Improving Identifier Informativeness using Part of Speech Information

Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland, Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu, [email protected]

Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Abstract

Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags several rules emerge that seek to improve structure-field naming.

1 Introduction

Software engineering can benefit from leveraging tools and techniques of other disciplines. Traditionally, natural language processing (NLP) tools solve problems by processing the natural language found in documents such as news articles and web pages. One such NLP tool is a part-of-speech (POS) tagger. Tagging is, for example, crucial to Named-Entity Recognition [3], which enables information about a person to be tracked within and across documents. Many POS taggers are built using machine learning based on newswire training data. Conventional wisdom is that these taggers work well on the newswire and similar artifacts; however, their effectiveness degrades as the input moves further away from the highly structured sentences found in traditional newswire articles.

The text available in source-code artifacts, in particular a program’s identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval results from software repositories [1, 11] and to investigate the comprehensibility of source code identifiers [4, 6].

Fortunately, machine learning techniques are robust and, as reported in Section 2, good results are obtained using several sentence-forming templates. This initial investigation also suggests rules specific to software that would improve tagging. For example, the type of a declared variable can be factored into its tags. As an example application of POS tagging for source code, the tagger is then used to tag over 145,000 structure-field names. Equivalence classes of tags are then examined to produce rules for the automatic identification of poor names (as described in Section 3) and to suggest improved names, which is left to future work.

Figure 1. Process for POS tagging of field names: Source Code → Source Code Mark-up → Extract Field Names → Split Field Names → Apply Template → Part of Speech Tagging.

2 Part-of-Speech Tagging

Before a POS tagger’s output can be used as input to downstream SE tools, the POS tagger itself needs to be vetted. This section describes an experiment performed to test the accuracy of POS tagging on field names mined from source code.
The process used for mining and tagging the fields is first described, followed by the empirical results from the experiment.

Figure 1 shows the pipeline used for the POS tagging of field names. On the left, the input to the pipeline is source code. This is then marked up using XML tags by srcML [5] to identify various syntactic categories. Third, field names are extracted from the marked-up source using XPath queries. Figure 2 shows the queries for C++ and Java.

Figure 2. XML queries for extracting C++ and Java fields from srcML.

The fourth stage splits field names by replacing underscores with spaces and inserting a space where the case changes from lowercase to uppercase. For example, the names spongeBob and sponge_bob become sponge bob. After splitting, all characters are shifted to lowercase. This stage also filters names so that only those that consist entirely of dictionary words are retained. Filtering uses Debian’s American (6-2) dictionary package, which consists of the 98,569 words from Kevin Atkinson’s SCOWL word lists that have size 10 through 50 [2]. This dictionary includes some common abbreviations, which are thus included in the final data set. Future work will obviate the need for filtering through vocabulary normalization, in which non-words are split into their abbreviations and then expanded to their natural language equivalents [9].

The fifth stage applies a set of templates (described below) to each separated field name. Each template effectively wraps the words of the field name in an attempt to improve the performance of the POS tagger. Finally, POS tagging is performed by Version 1.6 of the Stanford Log-linear POS Tagger [12]. The default options are used, including the pretrained bidirectional model [10].

The remainder of this section considers empirical results concerning the effectiveness of the tagging pipeline. A total of 145,163 field names were mined from 10,985 C++ files and 9,614 Java files found in 171 programs. From this full data set, 1500 names were randomly chosen as a test set (683 came from C++ files and 817 from Java files). A human assessor (a university student majoring in English) tagged the 1500 field names with POS information, producing the oracle set. This oracle set is used to evaluate the accuracy of automatic tagging techniques when applied to the test set.

A preliminary study of the Stanford tagger indicated that it needed guidance when tagging field names. Following the work of Abebe and Tonella [1], four templates were used to provide this guidance. Each template includes a slot into which the split field name is inserted. Their accuracy is then evaluated using the oracle set.

• Sentence Template: <field name> .
• List Item Template: – <field name>
• Verb Template: Please, <field name> .
• Noun Template: <field name> is a thing .

The Sentence Template, the simplest of the four, considers the identifier itself to be a “sentence” by appending a period to the split field. The List Item Template exploits the tagger having learned about POS information found in the sentence fragments used in lists. The Verb Template tries to encourage the tagger to treat the field name as a verb or a verb phrase by prefixing it with “Please,” since a command usually follows. Finally, the Noun Template tries to encourage the tagger to treat the field as a noun by postfixing it with “is a thing,” as was done by Abebe and Tonella [1].

Table 1 shows the accuracy of each template applied to the test set, with the output compared to the oracle. The major diagonal represents each technique in isolation, while the remaining entries require two techniques to agree and thus have lower percentages. The similarity of the percentages in a column gives an indication of how similar the sets of correctly tagged names are for two techniques.
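Stages four and five of the pipeline, and the pairwise agreement reported in Table 1, are mechanical enough to sketch in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy dictionary, and the exact template punctuation are assumptions, and the per-name correctness flags are made up for the example rather than taken from the paper's data.

```python
import re

# Hypothetical stand-in for the Debian dictionary package used in the paper.
TOY_DICTIONARY = {"sponge", "bob", "delete", "button"}

# The four templates; the exact punctuation is an assumption based on the
# descriptions above. "\u2013" is an en dash, as used for a list item.
TEMPLATES = {
    "sentence": "{} .",
    "list_item": "\u2013 {}",
    "verb": "Please, {} .",
    "noun": "{} is a thing .",
}


def split_field_name(name):
    """Stage four: split on underscores and lower-to-upper case changes."""
    spaced = name.replace("_", " ")
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", spaced)
    return spaced.lower().split()


def is_dictionary_name(words, dictionary):
    """Retain a name only if every one of its words is a dictionary word."""
    return bool(words) and all(word in dictionary for word in words)


def apply_templates(words):
    """Stage five: wrap the split field name in each template."""
    phrase = " ".join(words)
    return {label: tmpl.format(phrase) for label, tmpl in TEMPLATES.items()}


def joint_accuracy(correct_a, correct_b):
    """Share of names tagged correctly by both techniques (a Table 1 entry)."""
    both = sum(1 for a, b in zip(correct_a, correct_b) if a and b)
    return both / len(correct_a)


if __name__ == "__main__":
    words = split_field_name("spongeBob")
    print(words, is_dictionary_name(words, TOY_DICTIONARY))
    print(apply_templates(split_field_name("delete_button")))
    # Toy per-name correctness flags, not the paper's results.
    sentence_ok = [True, True, False, True]
    verb_ok = [True, False, False, True]
    print(joint_accuracy(sentence_ok, verb_ok))
```

Passing the same list twice to joint_accuracy yields a technique's accuracy in isolation, which corresponds to a diagonal entry of the table. The wrapped strings would then be handed to the tagger, and only the tags of the slot words kept.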
For example, in the Sentence Template column, the Verb Template has the lowest overlap of the remaining three, as indicated by its joint percentage of 71.7%. Overall, the List Item Template performs the best, and the Sentence Template and Noun Template produce essentially identical results, getting the correct tagging on nearly all the same fields. Perhaps unsurprisingly, the Verb Template performs the worst. Nonetheless, it is interesting that this template does produce the correct output on 3.2% of the fields where no other template succeeds.

As shown in Table 2, overall at least one template correctly tagged 88% of the test set. This suggests that it may be possible to combine these results, perhaps using machine learning, to produce higher accuracy than achieved using the individual templates. Although 88% is lower than the 97% achieved by natural language taggers on newswire data, the performance is still quite high considering the lack of context provided by the words of a single structure field. As illustrated in the next section, the identification is sufficiently accurate for use by downstream consumer applications.

            Sentence   List Item   Verb     Noun
Sentence    79.1%      76.5%       71.7%    77.0%
List Item   76.5%      81.7%       71.0%    76.0%
Verb        71.7%      71.0%       76.0%    70.8%
Noun        77.0%      76.0%       70.8%    78.7%

Table 1. Each percentage is the percent of correctly tagged field names using both the row and column technique; thus the major diagonal represents each technique independently.

Correct in all templates            68.9%
Correct in at least one template    88.0%

Table 2. Correctly tagged identifiers

3 Rules for Improving Field Names

As an example application of POS tagging for source code, the 145,163 field names of the full data set were tagged using the List Item Template, which showed the best performance in Table 1. The resulting tags were then used to form equivalence classes of field names. Analysis of these classes led to four rules for improving the names of structure fields. Rule violations can be automatically identified using POS tagging. Further, as illustrated in the examples, by mining the source code it is possible to suggest potential replacements. The assumption behind each rule is that high quality field names will provide better conceptual information, which aids an engineer in the task of forming a mental understanding of the code. Correct part-of-speech information can help inform the naming of identifiers, a process that is essential in communicating intent to future programmers.

Each rule is first informally introduced and then formalized. After each rule, the percentage of fields that violate the rule is given. Finally, some rules are followed by a discussion of rule exceptions or related notions.

The first rule observes that field names represent objects, not actions; thus they should avoid present-tense verbs. For example, the field name create mp4 clearly implies an action, which is unlikely to be the intent (unless perhaps the field represents a function pointer). Inspection of the source code reveals that this field holds the desired mp4 video stream container type. Based on the context of its use, a better, less ambiguous name for this identifier is created mp4 container type, which includes the past-tense verb created. A notable exception to this is fields of type boolean, such as is logged in, where the present tense of the verb “to be” is used. A present tense verb in this context is used to represent a current state, and is therefore not confusing.

Rule 1: Non-boolean field names should never contain a present tense verb

Violations detected: 27,743 (19.1% of field names)

Looking at the violations of Rule 1, one pattern that emerges suggests an improvement to the POS tagger that would better specialize it to source code. A pattern that frequently occurs in GUI programming finds verbs used as adjectives when describing GUI elements such as buttons.
Recognizing such fields based on their type should improve tagger accuracy. Consider the fields delete button and, to a lesser extent, continue box. In isolation these appear to represent actions. However, they actually represent GUI elements. Thus, a special context-sensitive case in the POS tagger would tag such verbs as adjectives.

The second rule considers field names that contain only a verb. For example, consider the field name recycle. This name communicates little to a programmer unfamiliar with the code. Examination of the source code reveals that this variable is an integer and, based on the comments, it counts the “number of things recycled.” While this meaning can be inferred from the declaration and the comments surrounding it, field name uses often occur far from their declaration, reducing the value of the declared type and supporting comments. A potential fix in this case is to change the name to recycled count or things recycled. Both alternatives improve the clarity of the name.

Rule 2: Field names should never be only a verb

Violations detected: 4,661 (3.2% of field names)

The third rule considers field names that contain only an adjective. While adjectives are useful when used with a noun, an adjective alone relies too much on the type of the variable to fully explain its use. For example, consider the identifier interesting. In this case, the declared type of “list” provides the insight that this field holds a list of “interesting” items. Replacing this field with interesting list or interesting items should improve code understanding.

Rule 3: Field names should never be only an adjective

Violations detected: 5,487 (3.8% of field names)

An interesting exception to this rule occurs with data structures where the field name has an established conventional meaning. For example, when naming the next node in a linked list, next is commonly accepted. Other similar common names include previous and current.
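Rules 1 through 3 reduce to simple checks over the tag sequence of a field name. The sketch below is only an illustration under stated assumptions: it assumes Penn Treebank tags (the tag set produced by the Stanford tagger's English models), it treats the base form VB as present tense alongside VBP and VBZ, and it reads "only a verb" and "only an adjective" as "every word is tagged as a verb" and "every word is tagged as an adjective". The function name violated_rules and the hand-assigned tags in the usage lines are hypothetical, not output of the tagger.

```python
# Penn Treebank tag groups. Counting VB (base form) as present tense is an
# assumption made for this sketch.
PRESENT_VERB_TAGS = {"VB", "VBP", "VBZ"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def violated_rules(tags, is_boolean=False):
    """Return the numbers of the rules (1 to 3) that a tag sequence violates."""
    violations = []
    # Rule 1: non-boolean names should not contain a present tense verb.
    if not is_boolean and any(tag in PRESENT_VERB_TAGS for tag in tags):
        violations.append(1)
    # Rule 2: a name should not consist only of verbs.
    if tags and all(tag in VERB_TAGS for tag in tags):
        violations.append(2)
    # Rule 3: a name should not consist only of adjectives.
    if tags and all(tag in ADJECTIVE_TAGS for tag in tags):
        violations.append(3)
    return violations


if __name__ == "__main__":
    # Tags below are assigned by hand for illustration.
    print(violated_rules(["VB", "NN"]))                           # create mp4
    print(violated_rules(["VB"]))                                 # recycle
    print(violated_rules(["JJ"]))                                 # interesting
    print(violated_rules(["VBZ", "VBN", "IN"], is_boolean=True))  # is logged in
```

The is_boolean flag stands in for the declared type of the field, which the tagger does not see; a real checker would read it from the declaration. Conventional names such as next would need an explicit allow-list, since the tag sequence alone cannot distinguish them.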
The final rule deals with field names for booleans. Boolean variables represent a state that is or is not, and this notion needs to be obvious in the name. The identifier deleted offers a good example. By itself there is no way to know for sure what is being represented. Is this a pointer to a deleted thing? Is it a count of deleted things? Source code inspection reveals that such boolean variables tend to represent whether or not something is deleted. Thus potential improved names include is deleted or was deleted.

Rule 4: Boolean field names should contain third person forms of the verb “to be” or the auxiliary verb “should”

Violations detected: 5,487 (3.8% of field names)

Simply adding “is” or “was” to booleans does not guarantee a fix to the problem. For example, take a boolean variable that indicates whether something should be allocated in a program. In this case, the boolean captures whether some event should take place in the future. In this example an appropriate temporal sense is missing from the name. A name like allocated does not provide enough information, and naming it is allocated does not make logical sense in the context of the program. A solution to this naming problem is to change the identifier to should be allocated, which includes the necessary temporal sense, communicating that this boolean is a flag for something expected to happen in the future.

4 Related Work

This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or lookup table. First, Høst et al. study naming of Java methods using a lookup table to assign POS tags [7]. Their aim is to find what they call “naming bugs” by checking whether the method’s implementation is properly indicated by the name of the method. Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to formulate domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. Based on a case study, their approach improved concept searching. Finally, Shepherd et al. considered finding concepts in code using natural language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than the other tools and with less user effort. This is made possible by POS information applied to source code.

5 Summary

This paper presents the results of an experiment into the accuracy of the Stanford Log-linear POS Tagger applied to field names. The best template, List Item, has an accuracy of 81.7%. If an optimal combination of the four templates were used, the accuracy would rise to 88%. These POS tags were then used to develop field name formation rules that 28.9% of the identifiers violated. Thus the tagging can be used to support improved naming.

Looking forward, two avenues of future work include automating this improvement and enhancing POS tagging for source code. For the first, the source code would be mined for related terms to be used in suggested improved names. The second would explore training a POS tagger using, for example, the machine learning technique domain adaptation [8], which emphasizes the text in the training data that is most similar to identifiers, to produce a POS tagger for identifiers.

6 Acknowledgments

Special thanks to Mike Collard for his help with srcML and the XPath queries and Phil Hearn for his help with creating the oracle set. Support for this work was provided by NSF grant CCF 0916081.

References

[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. In 11th IEEE International Workshop on Program Comprehension, pages 134–143, 2003.
[6] E. Høst and B. Østvold. The programmer’s lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP 09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL 07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD 07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.
