How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf ·...

127
Denys Poshyvanyk Annibale Panichella Bogdan Dit Rocco Oliveto Massimiliano Di Penta Andrea De Lucia How to Effectively Use Topic Models for Software Engineering Tasks? An Approach Based on Genetic Algorithms

Transcript of How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf ·...

Page 1: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Denys Poshyvanyk

Annibale Panichella

Bogdan Dit

Rocco Oliveto

Massimiliano Di Penta

Andrea De Lucia

How to Effectively Use Topic Models for Software Engineering Tasks? 

An Approach Based on Genetic Algorithms

Page 2: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Source Code DocumentationBug Reports Design documents

Page 3: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Information Retrieval

Techniques

Maintenance Tasks

Page 4: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Information Retrieval

Techniques

Maintenance Tasks

Vector Space ModelLatent Semantic IndexingLatent Dirichlet Allocation

Relational Topic Model

Page 5: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What is LDA?

Page 6: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Latent Dirichlet Allocation (LDA)

• Topic model that generates the distribution of latent topics from textual documents

Page 7: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Latent Dirichlet Allocation (LDA)

• Topic model that generates the distribution of latent topics from textual documents

LDA

logging

saving

export

Login

http connect

send

Page 8: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts

Page 9: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing

Remove special charactersSplit identifiersRemove common wordsStem

Page 10: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

Page 11: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDA

Term-by-DocumentMatrix

SoftwareArtifacts

Page 12: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

SoftwareArtifacts

Page 13: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Page 14: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Page 15: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Probability that document is related

to topic

Page 16: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Page 17: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Page 18: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by DocumentsCompute document to

document similarity

Compute query to document similarity

“Cluster” documents by topics

Page 19: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SoftwareArtifacts Preprocessing Term-by-Document

Matrix

LDAInput Parameters

Topic by Documents

Page 20: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Page 21: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Words to Topic

Page 22: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Words to Topic

Page 23: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Words to Topic

Generate topic “labels” (top 5 words)

Page 24: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Words to Topic

Support Software Engineering Tasks

Support Software Engineering Tasks- Traceability Link Recovery- Feature Location- Developer recommendation- Impact Analysis

Page 25: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

SoftwareArtifacts

Words to Topic

Support Software Engineering Tasks

Support Software Engineering Tasks- Traceability Link Recovery- Feature Location- Developer recommendation- Impact Analysis

Page 26: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SourceArtifacts

TargetArtifacts

Use cases, requirements, classes, etc.

Use cases, requirements, classes, etc.

Page 27: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SourceArtifacts

TargetArtifacts

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

Page 28: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SourceArtifacts

TargetArtifacts

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Topic by Documents

Page 29: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SourceArtifacts

TargetArtifacts

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Oracle Topic by Documents

CorrectCorrect links

Page 30: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

SourceArtifacts

TargetArtifacts

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

Oracle

Precision/recallAverage Precision

Topic by Documents

CorrectCorrect linksFalse

PositivesFalse

Positives

Page 31: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Let’s examine the LDA input parameters in more details

Page 32: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

SoftwareArtifacts

Page 33: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

SoftwareArtifacts

“Configuration”# of iterations# of topicsαβ

Page 34: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Preprocessing

LDAInput Parameters

Term-by-DocumentMatrix

SoftwareArtifacts

“Configuration”# of iterations# of topicsαβ

Number of Gibbs samplings of LDA

model

Page 35: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Number of topics…

Low                                          High 

Page 36: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Number of topics…

Low                                          High 

General topics

Page 37: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Number of topics…

Low                                          High 

General topics Specific topics

Page 38: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

α, influences the “smoothness” of documents to topics distribution

Low α High α

Page 39: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

α, influences the “smoothness” of documents to topics distribution

Low α High α

Page 40: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

α, influences the “smoothness” of documents to topics distribution

Low α High α

Page 41: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

α, influences the “smoothness” of documents to topics distribution

Low α High α

Page 42: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

β, influences the “smoothness” of topics to words distribution

Low β High β

Page 43: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

β, influences the “smoothness” of topics to words distribution

Low β High β

Page 44: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA parameters significantly influence the results

• Traceability Link Recovery

• 1,000 different configurations of LDA parameters– Evaluate the Average Precision on EasyClinic

Page 45: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA parameters significantly influence the results

• Traceability Link Recovery

• 1,000 different configurations of LDA parameters– Evaluate the Average Precision on EasyClinic

Page 46: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA parameters significantly influence the results

High variability in results

Page 47: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA parameters significantly influence the results

High variability in results

Few configurations produce good results

Page 48: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What kind of LDA configurations were used for software?

Page 49: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What kind of LDA configurations were used for software?

“ad-hoc” configurations

Parameters “imported” from natural language

community

Assumption: source code

has the same characteristics as natural language

Page 50: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What kind of LDA configurations were used for software?

“ad-hoc” configurations

Parameters “imported” from natural language

community

Assumption: source code

has the same characteristics as natural language

Parameters “imported” from natural language

community

11.80%

Page 51: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What kind of LDA configurations were used for software?

Assumption: source code

has the same characteristics as natural language

11.80%

[Hindle et al. @ ICSE’12]: source code

is more regular and predictable than natural language

Page 52: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What kind of LDA configurations were used for software?

Assumption: source code

has the same characteristics as natural language

11.80%

[Hindle et al. @ ICSE’12]: source code

is more regular and predictable than natural language

We need new techniques to find these configurations

Page 53: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Our contribution…LDA‐GA

Page 54: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA: automatically calibrate the input parameters of LDA using a genetic algorithm

Page 55: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA: automatically calibrate the input parameters of LDA using a genetic algorithm

Dataset LDA-GA

Page 56: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA: automatically calibrate the input parameters of LDA using a genetic algorithm

Dataset LDA-GA

LDA parameters

# of iterations # of topics αβ

LDA

Page 57: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA: automatically calibrate the input parameters of LDA using a genetic algorithm

Dataset LDA-GA

LDA parameters

# of iterations # of topics αβ

Oracle Task

ResultsTraceability link recovery

Feature locationDeveloper recommendation

etc.

LDA

Page 58: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA: automatically calibrate the input parameters of LDA using a genetic algorithm

Dataset LDA-GA

LDA parameters

# of iterations # of topics αβ

LDA

Dataset dependent

Oracle & Task independent

Oracle Task

Results

Page 59: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

How to evaluate how “good” an LDA configuration is?

Page 60: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents

Page 61: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents

Page 62: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents

Dominant Topics

Page 63: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents

Dominant Topics

Page 64: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents LDA Model

Dominant Topics

Page 65: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents LDA Model

Page 66: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents LDA Model

Page 67: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents LDA Model

Page 68: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Topic by Documents LDA Model

Cohesion (similarity): how related the documents in the

same clusters are

Separation (dissimilarity):how distinct a cluster is from other clusters

Page 69: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 70: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 71: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 72: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 73: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 74: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 75: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 76: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support
Page 77: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Higher silhouette coefficients are better

Page 78: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA Model 1

#Iterations1; #topics1; α1; β1

Page 79: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA Model 1

#Iterations1; #topics1; α1; β1

LDA Model 2

#Iterations2; #topics2; α2; β2

Page 80: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA Model 1

#Iterations1; #topics1; α1; β1

LDA Model 2

#Iterations2; #topics2; α2; β2

Page 81: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA Model 1

#Iterations1; #topics1; α1; β1

LDA Model 2

#Iterations2; #topics2; α2; β2

silhouette = 0.3 silhouette = 0.9

Clusters more cohesiveClusters well separated

Page 82: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

How to evaluate how “good” an LDA configuration is?

How to identify the “good” LDAparameter configurations?

Page 83: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

How to identify the “good” LDA parameter configurations?

for numIter in [500, …]for numTopics in [5, …]

for α in [0.01, …]for β in [0.01, …]

LDA[numIter , numTopics , α, β]

Exhaustive approach:- Discretize search

space & iterate?

Page 84: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

How to identify the “good” LDA parameter configurations?

for numIter in [500, …]for numTopics in [5, …]

for α in [0.01, …]for β in [0.01, …]

LDA[numIter , numTopics , α, β]

Too many possibilities

Use a Genetic Algorithm

Exhaustive approach:- Discretize search

space & iterate?

Page 85: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

What is a Genetic Algorithm (GA)?

• Stochastic search technique based on the process of natural evolution to identify near‐optimal solutions to search problems

Page 86: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

Page 87: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

First generation: Population of random

chromosomes

Page 88: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

First generation:Population of random

chromosomes

Individual (chromosome):Represents one possible LDA

parameter configuration

Page 89: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

First generation: Population of random

chromosomes

Individual (chromosome):Represents one possible LDA

parameter configuration

Gene:LDA parameter

Page 90: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

Page 91: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Page 92: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Fitness = silhouette coefficient

Page 93: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Fitness = silhouette coefficientFitness = silhouette coefficient

Page 94: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Page 95: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

survive

Elitism:best (n=2) configurations will survivesurvive for next generation

Elitism:best (n=2) configurations will survive for next generation

Page 96: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

survive

Elitism:best (n=2) configurations will survive

Roulette selection:Chance of chromosomes to contribute to next generation is proportional to their fitness

survive for next generation

Elitism:best (n=2) configurations will survive for next generation

Page 97: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Page 98: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Crossover

Page 99: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Crossover

Page 100: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Crossover

Mutation

Page 101: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Crossover

Mutation

Page 102: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Next generation

Page 103: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Next generation

Best model for first generation

Page 104: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Next generation

Best model for first generation

Best model for last generation

Page 105: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Choose a random population of LDA

parameters

LDA

Determine fitness of each chromosome

(individual)

Select next generation

Crossover, mutation

Next generation

Return best LDA parameters (from chromosome with

highest fitness)

# of iterations# of topicsαβ

Page 106: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Evaluation…

Dataset LDA-GA

LDA parameters

# of iterations # of topics αβ

LDA

Oracle Task

Results1. Traceability link recovery2. Feature location3. Software artifact labeling (see paper)

Page 107: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Evaluation: Traceability Link Recovery

• Recover links between use cases and code classesSystem Size # use cases # code classes # correct links

EasyClinic 20KLOC 30 47 93eTour 45KLOC 58 174 366

Page 108: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Combinatorial: for numIter in [500, …]

for numTopics in [5, …]for α in [0.01, …]

for β in [0.01, …]LDA[numIter , numTopics , α, β]

Choose LDA parameters with best average

precision using an oracle

Page 109: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Combinatorial: for numIter in [500, …]

for numTopics in [5, …]for α in [0.01, …]

for β in [0.01, …]LDA[numIter , numTopics , α, β]

Choose LDA parameters with best silhouette

coefficient

Choose LDA parameters with best silhouette

coefficient

Choose LDA parameters with best average

precision using an oracle

Page 110: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Combinatorial: 

LDA‐GA: run LDA‐GA 30 times (to account for randomness)

Baseline: [ICPC’10]

for numIter in [500, …]for numTopics in [5, …]

for α in [0.01, …]for β in [0.01, …]

LDA[numIter , numTopics , α, β]

Choose LDA parameters corresponding to median

fitness over 30 runs

Use default LDA parameters from natural

language

Choose LDA parameters with best average

precision using an oracle

Page 111: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Precision/Recall EasyClinic

Page 112: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Precision/Recall EasyClinic

Default parameters from natural language

[ICPC’10]

LDA-GA

Combinatorial(best of 1,000 configurations)

Page 113: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Precision/Recall EasyClinic

Default parameters from natural language

[ICPC’10]

LDA-GA

Combinatorial(best of 1,000 configurations)

LDA-GA = Combinatorial

LDA-GA outperforms baseline [p-value < 0.05]

Page 114: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Precision/Recall eTour

Default parameters from natural language

[ICPC’10]

LDA-GA

Combinatorial(best of 2,000 configurations)

Page 115: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Precision/Recall eTour

Default parameters from natural language

[ICPC’10]

LDA-GA

Combinatorial(best of 2,000 configurations)

Combinatorial outperforms LDA-GA [p-value < 0.05]

LDA-GA outperforms baseline

Page 116: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Evaluation: Feature location

• Identify methods related to a maintenance task (e.g., bug, feature)

System Size # features # methodsjEdit 104KLOC 150 6,413

ArgoUML 149KLOC 91 11,000

Page 117: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Combinatorial: for numIter in [500, …]

for numTopics in [5, …]for α in [0.01, …]

for β in [0.01, …]LDA[numIter , numTopics , α, β]

Choose LDA parameters with best average effectiveness

using an oracle

Page 118: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Combinatorial: 

LDA‐GA: run LDA‐GA 30 times (to account for randomness)

Baseline: [SCAM’10]

for numIter in [500, …]for numTopics in [5, …]

for α in [0.01, …]for β in [0.01, …]

LDA[numIter , numTopics , α, β]

Choose LDA parameters corresponding to median

fitness over 30 runs

Use LDA parameters from source locality heuristic

Choose LDA parameters with best average effectiveness

using an oracle

Page 119: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Effectiveness measure ArgoUML jEdit

Page 120: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Effectiveness measure ArgoUML jEdit

LDA-GA = CombinatorialLDA-GA outperforms baseline

LDA-GA = CombinatorialLDA-GA outperforms baseline [p-value < 0.05]

Page 121: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Conclusions• Showed the impact of setting the LDA parameters on the results

• We proposed LDA‐GA, a genetic based approach to automatically configure and find the near‐optimal solution for LDA parameters– Dataset dependent– Oracle & task independent

• The approach was evaluated on three maintenance tasks

Page 122: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

LDA‐GA in TraceLabGenetic algorithm settings

LDA-GA settings

Page 123: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Thank you! Questions?

http://www.distat.unimol.it/reports/LDA‐GA/

http://www.cs.wm.edu/semeru/data/tefse13/

Page 124: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

References• [Hindle et al., ICSE’12] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. T. Devanbu, “On the 

naturalness of software,” in Proc. of the 34th IEEE/ACM International Conference on Software Engineering (ICSE’12), Zurich, Switzerland, June 2‐9, 2012, pp. 837–847.

• [ICPC’10] R. Oliveto, M. Gethers, D. Poshyvanyk, and A. De Lucia, “On the equivalence of information retrieval methods for automated traceability link recovery,” in Proc of the 18th IEEE International Conference on Program Comprehension (ICPC’10), Braga, Portugal, 2010, pp. 68–71.

• [SCAM’10] S. Grant and J. R. Cordy, “Estimating the optimal number of latent concepts in source code analysis,” in Proc. of the 10th International Working Conference on Source Code Analysis and Manipulation (SCAM’10), 2010, pp. 65–74.

Page 125: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Threats to Validity

• We used datasets that have been used in other studies

• We ran GA 30 times to account for randomness

• Non‐parametric statistical test• Generalizability of results to other SE tasks

Page 126: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

GA Settings

• Implementation: GA library in R• Population size: 100• Elitism of 2 individuals• Roulette wheel selection operator• Crossover probability: 0.6• Mutation probability: 0.01• Stop criteria:

– No improvement in 10 generations– When reaching 100 generations

Page 127: How Use Topic Models for Software Engineering Tasks? An …denys/pubs/talks/ICSE'13.LDA-GA.pdf · 2014-01-13 · Matrix Topic by Documents Software Artifacts Words to Topic Support

Software Artifact Labeling