Extraction Tools and Relational Database Schemas for CVS, SVN, and Bazaar Revision Control Systems


Upload: lawrence-tucker

Post on 18-Jan-2016



Page 1: Extraction Tools and Relational Database Schemas for CVS, SVN, and Bazaar Revision Control Systems


Page 2

Software Complexity

“The software field is not a simple one and, if anything, it is getting more complex at a faster rate than we can put into order” (Boehm, 1979)

- Complex, abstract nature of software makes it difficult to research
- How can we “look” at the software development process?

Page 3

Artifacts of Software Engineering

- Natural byproducts of software development:
  - Emails
  - Bug reports
  - Source code
- Non-intrusive look at software

Page 4

SEQuOIA Architecture

- Artifact-based extraction, analysis & visualization
  - Four-stage architecture
  - Reusable components
  - Industrial strength

Page 5

- Raw Data
- Focus of this thesis: Data Extraction stage
- GOAL: Capture all data
  - Store in a database
  - Filter & refine later
- More artifacts than we can cover in one thesis

Page 6

Revision Control Systems

- Focus of this thesis
- History of file revisions
  - Who modified a file?
  - When?
  - What parts of the file changed?
- Particularly important software artifact
  - Frequently used in industry & open source projects
  - Large quantity of open source data available
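The "who" and "when" questions above can be answered from a revision log. The following is a minimal sketch, not the thesis's extractor: it parses a hand-written sample shaped like the output of `svn log --xml --verbose`, and the function name `parse_log` and the dictionary keys are illustrative assumptions. The third question (what parts of a file changed) needs diff output and is not covered here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling `svn log --xml --verbose` output.
# It is illustrative only and does not come from a real project.
SAMPLE_LOG = """<log>
  <logentry revision="2">
    <author>alice</author>
    <date>2008-01-15T10:30:00.000000Z</date>
    <paths>
      <path action="M">/trunk/src/Main.java</path>
      <path action="A">/trunk/README</path>
    </paths>
    <msg>Fix startup bug</msg>
  </logentry>
</log>
"""


def parse_log(xml_text):
    """Return one dict per revision: who committed, when, and which paths changed."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        entries.append({
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author"),
            "date": entry.findtext("date"),
            "message": entry.findtext("msg"),
            # Each change is an (action, path) pair, e.g. ("M", "/trunk/src/Main.java").
            "changes": [(p.get("action"), p.text) for p in entry.iter("path")],
        })
    return entries


entries = parse_log(SAMPLE_LOG)
```

Against a real repository, the same function could be fed the captured output of the client-side command instead of the sample string.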

Page 7

Challenges of Data Extraction

- Not suitable for on-line analysis:
  - Slow!
  - Not always available
  - Not suited for advanced queries
- Can extract and store data in a relational database
  - Must be capable of storing all collectable data from the system!
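To illustrate why a relational store suits the "advanced queries" mentioned above, here is a small sketch using Python's built-in `sqlite3` module. The two-table layout, the table and column names, and the sample rows are all assumptions made for the example; they are not the schemas developed in the thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (
    number  INTEGER PRIMARY KEY,
    author  TEXT,
    date    TEXT
);
CREATE TABLE path_change (
    revision INTEGER REFERENCES revision(number),
    action   TEXT,   -- 'A' added, 'M' modified, 'D' deleted
    path     TEXT
);
""")

# Hand-made sample rows (hypothetical data).
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", [
    (1, "alice", "2008-01-15"),
    (2, "bob", "2008-01-16"),
    (3, "alice", "2008-01-20"),
])
conn.executemany("INSERT INTO path_change VALUES (?, ?, ?)", [
    (1, "A", "/trunk/src/Main.java"),
    (1, "A", "/trunk/README"),
    (2, "M", "/trunk/src/Main.java"),
    (3, "M", "/trunk/src/Main.java"),
    (3, "A", "/trunk/src/Util.java"),
])


def changes_per_author(conn, path_prefix):
    """Count path changes under path_prefix, grouped by author."""
    rows = conn.execute(
        """
        SELECT r.author, COUNT(*)
        FROM revision AS r
        JOIN path_change AS c ON c.revision = r.number
        WHERE substr(c.path, 1, length(?)) = ?
        GROUP BY r.author
        ORDER BY r.author
        """,
        (path_prefix, path_prefix),
    )
    return rows.fetchall()


summary = changes_per_author(conn, "/trunk/src/")
```

Once the data is local, a question like this is one join; answering it from the revision control server would take a log request followed by client-side aggregation.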

Page 8

Structural Challenges

- Significant implementation differences
  - Structural
    - Unique identifiers
    - Representation of copy / move operations
  - Paradigm
    - Distributed
    - Centralized
    - Mixed
- Need separate database schemas for each revision control system
  - CVS
  - SVN
  - Bazaar

Page 9

Related Work

- Early 90’s
  - Researchers recognize revision control systems as an important data source
- 2003-present
  - Handful of tools to extract data from revision control systems
    - Nearly all store data in a relational database
    - Most are unavailable
    - None store the full set of available data
  - Not suitable for the SEQuOIA tool

Page 10

Thesis

- Create:
  - Specialized database schemas
  - Python extraction applications
- To:
  - Extract & store all data available through client-side commands
- From:
  - CVS
  - SVN
  - Bazaar
- Validate through:
  - Unit testing
  - Extract data from open source projects

Page 11

Schemas

- Specific to each revision control system
- Must be capable of storing all data from revision control system, e.g.
  - SVN properties
  - File contents
  - Diff data
- May also contain ‘helpful’ tables
  - Linkages needed to answer basic questions
    - What are all the files in each revision?
    - What files were implicitly moved when a directory moved?
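The two basic questions above can be sketched as derived data. This is an illustrative simplification, not the thesis's schema: the table names `path_change` and `revision_file` and both function names are assumptions, and only file-level add ('A') and delete ('D') actions are replayed. Real repositories also record directory deletes and copies, which this sketch ignores.

```python
import sqlite3


def build_revision_files(conn):
    """Fill revision_file(revision, path) by replaying add/delete actions in order."""
    live = set()  # paths that exist after the revision being replayed
    numbers = [row[0] for row in conn.execute(
        "SELECT DISTINCT revision FROM path_change ORDER BY revision")]
    for number in numbers:
        actions = conn.execute(
            "SELECT action, path FROM path_change WHERE revision = ?",
            (number,)).fetchall()
        for action, path in actions:
            if action == "A":
                live.add(path)
            elif action == "D":
                live.discard(path)
        conn.executemany("INSERT INTO revision_file VALUES (?, ?)",
                         [(number, path) for path in sorted(live)])


def files_at(conn, number):
    """Answer: what are all the files in this revision?"""
    rows = conn.execute(
        "SELECT path FROM revision_file WHERE revision = ? ORDER BY path", (number,))
    return [row[0] for row in rows]


def implied_file_moves(files_before, old_dir, new_dir):
    """Answer: which files moved implicitly when old_dir was moved to new_dir?"""
    old_prefix = old_dir.rstrip("/") + "/"
    new_prefix = new_dir.rstrip("/") + "/"
    return [(path, new_prefix + path[len(old_prefix):])
            for path in sorted(files_before) if path.startswith(old_prefix)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path_change (revision INTEGER, action TEXT, path TEXT);
CREATE TABLE revision_file (revision INTEGER, path TEXT);
""")
# Hand-made sample history (hypothetical data).
conn.executemany("INSERT INTO path_change VALUES (?, ?, ?)", [
    (1, "A", "/trunk/a.txt"),
    (1, "A", "/trunk/lib/x.py"),
    (2, "A", "/trunk/lib/y.py"),
    (3, "D", "/trunk/a.txt"),
])
build_revision_files(conn)
moves = implied_file_moves(files_at(conn, 3), "/trunk/lib", "/trunk/core")
```

Precomputing `revision_file` trades storage for query speed: listing the files of any revision becomes a single lookup instead of a replay of the whole history.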

Page 12

Extraction Applications

- Written in Python
- SQLObject Object Relational Manager (ORM):
  - Minimizes database-specific code
  - Increases portability, maintainability
- Configurable
  - May be too time consuming to extract everything!
  - Select what to extract
    - Core data (required)
    - Auxiliary data (optional)
      - Diff
      - File contents
      - Blame
  - Apply filters to refine:
    - Collect diff for all .java files
    - Collect blame for all files in path trunk/src/
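The two filter examples above could be expressed as a small configuration object. This is a sketch under assumptions: the class `ExtractionConfig`, its field names, and the choice of glob patterns for diffs and path prefixes for blame are hypothetical and do not describe the actual tools' configuration format.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class ExtractionConfig:
    """Hypothetical selection of auxiliary data; core data is always extracted."""
    diff_patterns: list = field(default_factory=list)    # glob patterns, e.g. "*.java"
    blame_prefixes: list = field(default_factory=list)   # path prefixes, e.g. "trunk/src/"

    def wants_diff(self, path):
        # fnmatchcase's "*" also matches "/", so "*.java" applies at any depth.
        return any(fnmatchcase(path, pattern) for pattern in self.diff_patterns)

    def wants_blame(self, path):
        return any(path.startswith(prefix) for prefix in self.blame_prefixes)


# Collect diff for all .java files and blame for all files in trunk/src/.
config = ExtractionConfig(diff_patterns=["*.java"], blame_prefixes=["trunk/src/"])

paths = ["trunk/src/Main.java", "trunk/src/build.xml", "trunk/doc/Guide.java", "trunk/README"]
diff_paths = [p for p in paths if config.wants_diff(p)]
blame_paths = [p for p in paths if config.wants_blame(p)]
```

Keeping the two filters independent lets an expensive operation such as blame be limited to a narrow path while diffs are still collected broadly.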

Page 13

Validation

- Industrial Thesis
  - “explain what will be done to assure the quality of the work”
- How do we demonstrate this?

Page 14

Unit Tests

- Demonstrate functional components work as specified
- Need controlled test data
  - Create revision control system server
  - Test against locally hosted repository
- Build repositories to test against
  - Build with Python code
  - Build by hand
    - Manipulate with command line & GUI tools
    - Save server-side directory dump
    - Load test repository for appropriate tests
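The idea of testing a component against controlled data can be sketched with Python's `unittest` module. This example deliberately substitutes an in-memory fixture for the locally hosted test repository described above, so it shows only the shape of such a test. The function `store_revisions` and the single-table layout are hypothetical stand-ins for a real extraction component.

```python
import sqlite3
import unittest


def store_revisions(conn, revisions):
    """Hypothetical component under test: write (number, author) pairs to the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revision (number INTEGER PRIMARY KEY, author TEXT)")
    conn.executemany("INSERT INTO revision VALUES (?, ?)", revisions)


class StoreRevisionsTest(unittest.TestCase):
    # Controlled test data, standing in for a hand-built test repository.
    FIXTURE = [(1, "alice"), (2, "bob"), (3, "alice")]

    def setUp(self):
        # A fresh in-memory database per test keeps the tests independent.
        self.conn = sqlite3.connect(":memory:")
        store_revisions(self.conn, self.FIXTURE)

    def tearDown(self):
        self.conn.close()

    def test_every_revision_is_stored(self):
        rows = self.conn.execute(
            "SELECT number, author FROM revision ORDER BY number").fetchall()
        self.assertEqual(rows, self.FIXTURE)

    def test_duplicate_revision_is_rejected(self):
        # The primary key prevents the same revision from being stored twice.
        with self.assertRaises(sqlite3.IntegrityError):
            store_revisions(self.conn, [(1, "mallory")])
```

Saved as a module, the test case can be run with `python -m unittest`. Because the expected contents are known exactly, any difference between the fixture and what ends up in the database points to a defect in the component.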

Page 15

Extraction from Open Source Projects

- Real-world data:
  - Data anomalies
  - Performance anomalies
  - Performance characteristics
- Project selection
  - Randomly select from FLOSSmole data
    - But most projects ‘look’ the same!
  - Filter FLOSSmole data to find ‘large’ projects
    - Large # of developers
    - Long lifespan
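The selection procedure above (filter for 'large' projects, then sample at random) can be sketched as follows. The record fields `developers` and `lifespan_days`, the thresholds, and the sample rows are assumptions made for illustration; they are not FLOSSmole's actual column names or the cut-offs used in the thesis.

```python
import random


def select_large_projects(projects, min_developers, min_lifespan_days,
                          sample_size, seed=None):
    """Keep only 'large' projects, then draw a random sample from them."""
    large = [p for p in projects
             if p["developers"] >= min_developers
             and p["lifespan_days"] >= min_lifespan_days]
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(large, min(sample_size, len(large)))


# Hand-made records (hypothetical data).
projects = [
    {"name": "tiny-a", "developers": 1, "lifespan_days": 40},
    {"name": "tiny-b", "developers": 2, "lifespan_days": 900},
    {"name": "big-a", "developers": 25, "lifespan_days": 2000},
    {"name": "big-b", "developers": 12, "lifespan_days": 1500},
    {"name": "big-c", "developers": 40, "lifespan_days": 400},
    {"name": "young", "developers": 30, "lifespan_days": 90},
]

chosen = select_large_projects(projects, min_developers=10,
                               min_lifespan_days=365, sample_size=2, seed=0)
```

Filtering before sampling addresses the point that most projects 'look' the same: a uniform sample of the unfiltered list would be dominated by small, short-lived projects.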