natural language analysis - expanding identifiers to normalize source code vocabulary

Post on 13-Jan-2015

DESCRIPTION

Paper: Expanding Identifiers to Normalize Source Code Vocabulary

Authors: Dave Binkley and Dawn Lawrie

Session: Research Track 4: Natural Language Analysis

TRANSCRIPT

EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY

PRESENTED BY DAWN LAWRIE

LOYOLA UNIVERSITY MARYLAND

IN COLLABORATION WITH DAVE BINKLEY

Friday, October 7, 11

VOCABULARY MISMATCH

DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS

EXAMPLE

REQUIREMENT - “FEATURE LOCATION”

SOURCE CODE - “FEATURELOCATION”

OR WORSE “FLOC”

PURPOSE OF NORMALIZE

COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS

EXAMPLE PROBLEMS

CONSIDER IDENTIFIERS

FEATURELOCATION → FEATURE LOCATION

SPLITTING PROBLEM

FLOC → F LOC → FEATURE LOCATION

SPLITTING AND EXPANSION PROBLEM

WHY NORMALIZE?

MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES

UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS

NORMALIZE PROBLEM STATEMENT

FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS

FLOC → FEATURE LOCATION

NORMALIZE ALGORITHM

TERMINOLOGY

HARD-WORDS - WHITEHOUSE_LAWN (2 HARD WORDS: WHITEHOUSE, LAWN)

SOFT-WORDS - WHITE-HOUSE_LAWN (3 SOFT WORDS: WHITE, HOUSE, LAWN)
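The hard-word step can be sketched in a few lines. This is a minimal illustration, assuming hard words are delimited by underscores or camelCase boundaries; the function name `hard_words` is hypothetical and not taken from the talk. Recovering the soft words WHITE and HOUSE from WHITEHOUSE needs a splitter with vocabulary knowledge, which this sketch does not attempt.

```python
import re

def hard_words(identifier: str) -> list[str]:
    """Split an identifier at explicit division markers (underscores, camelCase)."""
    words = []
    for part in identifier.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

print(hard_words("whitehouse_lawn"))  # ['whitehouse', 'lawn']
print(hard_words("whiteHouse_lawn"))  # ['white', 'house', 'lawn']
```

With no division marker inside `whitehouse`, the first call yields two hard words, matching the count on the slide.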

NORMALIZE ALGORITHM

STRLEN → STRING LENGTH

MACHINE TRANSLATION APPROACH

EL PAPA VISITA LA IGLESIA

EL: THE

PAPA: FATHER, POTATO, POPE

VISITA: VISITS, VISITOR, HIT

LA IGLESIA: THE CHURCH

STRONG COHESION

NORMALIZE ALGORITHM

STRLEN

S-TRLEN ST-RLEN STR-LEN STRL-EN STRLE-N

S-T-RLEN S-TR-LEN S-TRL-EN S-TRLE-N ST-R-LEN ST-RL-EN ST-RLE-N STR-L-EN STR-LE-N STRL-E-N

S-T-R-LEN S-T-RL-EN S-T-RLE-N S-TR-L-EN S-TR-LE-N S-TRL-E-N ST-R-L-EN ST-R-LE-N ST-RL-E-N STR-L-E-N

S-T-R-L-EN S-T-R-LE-N S-TR-L-E-N ST-R-L-E-N

S-T-R-L-E-N
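The candidate splits of STRLEN can be generated mechanically. The sketch below is only an illustration of exhaustive enumeration; `all_splits` is a hypothetical helper, and the talk's own splitter (GenTest) returns a selected set of splits rather than all of them.

```python
from itertools import combinations

def all_splits(word: str) -> list[list[str]]:
    """Return every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    splits = []
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

splits = all_splits("strlen")
print(len(splits))          # 32, i.e. 2 ** (len("strlen") - 1)
print("-".join(splits[3]))  # str-len
```

The count doubles with every added character, which is why scoring each candidate split needs to be cheap.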

WILDCARD EXPANSION

R*L*E*N*

E(ST) = {SET, STOP, STRING}

E(RLEN) = {RIFLEMEN}

E(STR) = {STEER, STRING}

E(LEN) = {LENDER, LENGTH}
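Wildcard expansion of a soft word can be approximated with a regular expression. This is a sketch under assumptions: the pattern R*L*E*N* is read as "these letters in order, with anything in between", and `DICTIONARY` is a toy word list standing in for whatever vocabulary the real tool consults. The names `wildcard_pattern` and `expansions` are hypothetical.

```python
import re

def wildcard_pattern(soft_word: str) -> re.Pattern:
    """Turn 'rlen' into the regex r.*l.*e.*n.* (letters in order, gaps allowed)."""
    body = "".join(re.escape(ch) + ".*" for ch in soft_word.lower())
    return re.compile(body)

def expansions(soft_word: str, dictionary: list[str]) -> list[str]:
    """Return dictionary words that match the wildcard pattern in full."""
    pattern = wildcard_pattern(soft_word)
    return [word for word in dictionary if pattern.fullmatch(word)]

# Toy vocabulary (an assumption for illustration only).
DICTIONARY = ["set", "stop", "string", "steer", "lender", "length", "riflemen"]

print(expansions("rlen", DICTIONARY))  # ['riflemen']
print(expansions("str", DICTIONARY))   # ['string', 'steer']
print(expansions("len", DICTIONARY))   # ['lender', 'length']
```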

NORMALIZE ALGORITHM PART I

STR: STRING VS STEER

COHESION A = (STRING, LENDER) + (STRING, LENGTH)

COHESION B = (STEER, LENDER) + (STEER, LENGTH)

1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS

2. SELECT EXPANSION THAT MAXIMIZES COHESION

STR → STRING
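The two steps of Part I can be sketched as follows. The probabilities in `PAIR_PROB` are invented for illustration (the implementation slide names the Google 5-gram dataset as the data source), and `FLOOR` is an assumed smoothing value for unseen pairs; neither comes from the talk.

```python
import math

# Hypothetical co-occurrence probabilities, made up for this example.
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("string", "lender")): 1e-8,
    frozenset(("steer", "length")): 1e-7,
    frozenset(("steer", "lender")): 1e-8,
}
FLOOR = 1e-12  # assumed probability for pairs never seen together

def pair_log_prob(a: str, b: str) -> float:
    return math.log(PAIR_PROB.get(frozenset((a, b)), FLOOR))

def cohesion(candidate: str, other_words: list[str]) -> float:
    """Step 1: sum the log probabilities of the candidate paired with each other word."""
    return sum(pair_log_prob(candidate, w) for w in other_words)

def best_expansion(candidates: list[str], other_words: list[str]) -> str:
    """Step 2: pick the candidate whose cohesion is highest."""
    return max(candidates, key=lambda c: cohesion(c, other_words))

# Candidates for STR, scored against the candidate expansions of LEN.
print(best_expansion(["string", "steer"], ["lender", "length"]))  # string
```

Summing logs rather than multiplying raw probabilities avoids numerical underflow when many small values are combined.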

NORMALIZE ALGORITHM PART II

STR-LEN VS ST-RLEN

STRING LENGTH VS STOP RIFLEMEN

1. FIND COHESION OVER EXPANSIONS

2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION

STRLEN → STRING LENGTH
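Part II compares whole splits rather than single soft words. The following is a rough sketch, not the authors' implementation: it scores every combination of expansions for each split and keeps the most cohesive one. `EXPANSIONS` copies the sets shown earlier in the talk, while `PAIR_PROB` and `FLOOR` are invented values.

```python
import math
from itertools import combinations, product

EXPANSIONS = {
    "str": ["steer", "string"],
    "len": ["lender", "length"],
    "st": ["set", "stop", "string"],
    "rlen": ["riflemen"],
}
# Hypothetical co-occurrence probabilities, made up for this example.
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("stop", "riflemen")): 1e-9,
}
FLOOR = 1e-12  # assumed probability for unseen pairs

def cohesion(words: tuple[str, ...]) -> float:
    """Sum of log probabilities over every pair of words in the expansion."""
    return sum(
        math.log(PAIR_PROB.get(frozenset(pair), FLOOR))
        for pair in combinations(words, 2)
    )

def normalize(splits: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Return the expansion, across all splits, with the highest cohesion."""
    scored = []
    for split in splits:
        for words in product(*(EXPANSIONS[soft] for soft in split)):
            scored.append((cohesion(words), words))
    return max(scored)[1]

print(normalize([("str", "len"), ("st", "rlen")]))  # ('string', 'length')
```

Both splits here have two parts, so their scores are directly comparable; splits with different numbers of parts contribute different numbers of pairs and would need some form of normalization that this sketch leaves out.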

ADDING CONTEXT

DIR

E(DIR) = {DIRECTION, DIRECTORY}

CONTEXT = {FORWARD, BACKWARD}

FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS

USED IN BOTH PART 1 AND PART 2
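The effect of context on the DIR example can be illustrated with the same cohesion idea. Again this is a sketch with invented numbers: `PAIR_PROB` and `FLOOR` are assumptions, and the function names are hypothetical.

```python
import math

# Hypothetical co-occurrence probabilities, made up for this example.
PAIR_PROB = {
    frozenset(("direction", "forward")): 1e-5,
    frozenset(("direction", "backward")): 1e-5,
    frozenset(("directory", "forward")): 1e-8,
    frozenset(("directory", "backward")): 1e-9,
}
FLOOR = 1e-12  # assumed probability for unseen pairs

def cohesion(candidate, other_expansions, context):
    """Score a candidate against other soft-word expansions plus context words."""
    neighbours = list(other_expansions) + list(context)
    return sum(
        math.log(PAIR_PROB.get(frozenset((candidate, w)), FLOOR))
        for w in neighbours
    )

def expand(candidates, other_expansions=(), context=()):
    return max(candidates, key=lambda c: cohesion(c, other_expansions, context))

# DIR has no sibling soft words, so the context words decide the outcome.
print(expand(["direction", "directory"], context=["forward", "backward"]))  # direction
```

For a single-soft-word identifier such as DIR there are no other expansions to pair with, so context words are the only evidence available to separate the candidates.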

NORMALIZE IMPLEMENTATION

USES GenTest TO SPLIT IDENTIFIERS

RETURNS MULTIPLE SPLITS

GOOGLE 5-GRAM DATASET

EVALUATION

Program      LoC      SLoC     Unique Ids
which-2.20   3,670    2,293    487
a2ps-4.14    62,347   38,436   4,393

Program      Selected Ids   Hard Words   Soft Words
which-2.20   487            903          1214
a2ps-4.14    211            459          618

EVALUATION

THREE GROUPS OF IDENTIFIERS

STANDARD LIBRARY CALLS

NAMES FROM STANDARD HEADER FILES / KEYWORDS

DOMAIN NAMES

Program      Filtered Ids   Reported Ids
which-2.20   152            335
a2ps-4.14    46             166

EXAMPLE EXPANSIONS

id         Top 10 Expansion   Top Expansion
nextchar   next_character     next_character
indfound   index_found_need   index_found
optarg     option_are_g       optarg
itemno     i_them_not         itemno

RESEARCH QUESTIONS

WHAT IS THE OVERALL ACCURACY OF NORMALIZE?

DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY?

CAN THE EXPANDER INFORM THE SPLITTER?

CAN THE SPLITTER INFORM THE EXPANDER?

ACCURACY ON DOMAIN IDS

SOURCE OF EXPANSION WORDS

SOURCE CODE

INTERNAL DOCUMENTATION

MANUAL

BEST VOCABULARY SOURCE?

FUTURE WORK

EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA

EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES

EXAMINING NORMALIZATION IN THE CONTEXT OF AN INFORMATION RETRIEVAL TASK

SUMMARY

IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS

DEGRADES PERFORMANCE OF IR TECHNIQUES

NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY

QUESTIONS?

Need an identifier split? GenTest Splitter available at

splitit.cs.loyola.edu