Extraction Tools and Relational Database Schemas for CVS, SVN, and Bazaar Revision Control Systems
Software Complexity
  “The software field is not a simple one and, if anything, it is getting more complex at a faster rate than we can put into order” (Boehm, 1979)
  Complex, abstract nature of software makes it difficult to research
  How can we “look” at the software development process?
Artifacts of Software Engineering
  Natural byproducts of software development:
    Emails
    Bug reports
    Source code
  Non-intrusive look at software
SEQuOIA Architecture
  Artifact-based extraction, analysis & visualization
  Four-stage architecture
  Reusable components
  Industrial strength
Raw Data
  Focus of this thesis
    Data Extraction stage
  GOAL: Capture all data
    Store in a database
    Filter & refine later
  More artifacts than we can cover in one thesis
Revision Control Systems
  Focus of this thesis
  History of file revisions
    Who modified a file?
    When?
    What parts of the file changed?
  Particularly important software artifact
    Frequently used in industry & open source projects
    Large quantity of open source data available
Challenges of Data Extraction
  Revision control systems are not suitable for on-line analysis:
    Slow!
    Not always available
    Not suited for advanced queries
  Can extract and store data in a relational database
    Must be capable of storing all collectable data from the system!
Structural Challenges
  Significant implementation differences
    Structural
      Unique identifiers
      Representation of copy / move operations
    Paradigm
      Distributed
      Centralized
      Mixed
  Need separate database schemas for each revision control system
    CVS
    SVN
    Bazaar
Related Work
  Early 90’s
    Researchers recognize revision control systems as an important data source
  2003-present
    Handful of tools to extract data from revision control systems
      Nearly all store data in a relational database
      Most are unavailable
      None store the full set of available data
    Not suitable for the SEQuOIA tool
Thesis
  Create:
    Specialized database schemas
    Python extraction applications
  To:
    Extract & store all data available through client-side commands
  From:
    CVS
    SVN
    Bazaar
  Validate through:
    Unit testing
    Extract data from open source projects
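As an illustration of extraction through client-side commands, the following is a minimal sketch that parses log output in the shape produced by `svn log --xml -v`. It is not the thesis's extraction code: the sample XML is hand-written, and the function name `parse_svn_log` and all revision, author, and path values are assumptions made for this example. A real run would read the command's output instead of a constant string.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of `svn log --xml -v` output.
# All values below are invented for illustration.
SAMPLE_LOG = """<log>
<logentry revision="2">
<author>alice</author>
<date>2008-01-02T10:00:00.000000Z</date>
<paths>
<path action="M" kind="file">/trunk/src/Main.java</path>
<path action="A" kind="file">/trunk/README</path>
</paths>
<msg>Add README, fix Main</msg>
</logentry>
</log>
"""


def parse_svn_log(xml_text):
    """Return one dict per log entry: revision, author, date, message, paths."""
    revisions = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        revisions.append({
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author"),
            "date": entry.findtext("date"),
            "message": entry.findtext("msg"),
            # Each changed path is kept as an (action, path) pair.
            "paths": [(p.get("action"), p.text) for p in entry.iter("path")],
        })
    return revisions


revisions = parse_svn_log(SAMPLE_LOG)
```

Each dict answers the basic questions of who modified a file and when; what parts of the file changed would come from diff data, which this sketch does not collect.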
Schemas
  Specific to each revision control system
  Must be capable of storing all data from revision control system, e.g.
    SVN properties
    File contents
    Diff data
  May also contain ‘helpful’ tables
    Linkages needed to answer basic questions
      What are all the files in each revision?
      What files were implicitly moved when a directory moved?
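The idea of a ‘helpful’ table can be sketched as follows. This is an illustrative toy schema, not one of the thesis schemas: the table names `revision`, `path_change`, and `revision_file`, the helper functions, and the sample rows are all assumptions, and the standard-library `sqlite3` module is used here only to keep the example self-contained. The helper table is derived by replaying the recorded changes so that "what are all the files in each revision?" becomes a single lookup.

```python
import sqlite3

# Toy schema (hypothetical names): core log data plus one derived helper table.
SCHEMA = """
CREATE TABLE revision (
    id      INTEGER PRIMARY KEY,
    author  TEXT,
    message TEXT
);
-- Core data: paths reported as changed in each revision.
CREATE TABLE path_change (
    revision_id INTEGER REFERENCES revision(id),
    action      TEXT,   -- 'A' added, 'M' modified, 'D' deleted
    path        TEXT
);
-- Helper table: every file present in every revision, changed or not.
CREATE TABLE revision_file (
    revision_id INTEGER REFERENCES revision(id),
    path        TEXT
);
"""


def build_revision_files(conn):
    """Replay changes in revision order and record the live file set per revision."""
    live = set()
    rev_ids = [r[0] for r in conn.execute("SELECT id FROM revision ORDER BY id")]
    for rev_id in rev_ids:
        changes = conn.execute(
            "SELECT action, path FROM path_change WHERE revision_id = ?", (rev_id,)
        ).fetchall()
        for action, path in changes:
            if action == "D":
                live.discard(path)
            else:
                live.add(path)
        conn.executemany(
            "INSERT INTO revision_file (revision_id, path) VALUES (?, ?)",
            [(rev_id, p) for p in sorted(live)],
        )


def files_in_revision(conn, rev_id):
    """Return the sorted paths of all files that exist in the given revision."""
    rows = conn.execute(
        "SELECT path FROM revision_file WHERE revision_id = ? ORDER BY path",
        (rev_id,),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, "alice", "initial import"), (2, "bob", "edit a"), (3, "alice", "remove b")],
)
conn.executemany(
    "INSERT INTO path_change VALUES (?, ?, ?)",
    [(1, "A", "a.txt"), (1, "A", "b.txt"), (2, "M", "a.txt"), (3, "D", "b.txt")],
)
build_revision_files(conn)
```

The helper table trades storage for query simplicity: without it, listing the files of a revision means replaying the whole change history at query time.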
Extraction ApplicationsExtraction Applications
Written in pythonWritten in python SQLObject Object Relation Manager (ORM):SQLObject Object Relation Manager (ORM):
Minimize database-specific codeMinimize database-specific code Increases portability, maintainabilityIncreases portability, maintainability
ConfigurableConfigurable May be too time consuming to extract everything! May be too time consuming to extract everything! Select what to extractSelect what to extract
Core data (required)Core data (required) Auxiliary data optionalAuxiliary data optional
DiffDiff File contentsFile contents BlameBlame
Apply filters to refine:Apply filters to refine: Collect Collect diffdiff for all for all .java.java files files Collect Collect blameblame for all files in path for all files in path trunk/src/trunk/src/
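The two filter examples can be sketched with glob-style patterns. This is only one possible way to express such a configuration: the `AUX_FILTERS` dictionary, the pattern syntax, and the function `aux_data_for` are assumptions made for illustration and do not describe the actual configuration format of the extraction applications.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: auxiliary data kind -> path patterns to collect it for.
AUX_FILTERS = {
    "diff": ["*.java"],          # collect diff for all .java files
    "blame": ["trunk/src/*"],    # collect blame for all files under trunk/src/
}


def aux_data_for(path, filters=AUX_FILTERS):
    """Return the sorted auxiliary data kinds to collect for a repository path."""
    return sorted(
        kind
        for kind, patterns in filters.items()
        # fnmatchcase's '*' also matches '/', so patterns apply at any depth.
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )
```

For example, `aux_data_for("trunk/src/com/Foo.java")` selects both blame and diff, while a text file outside `trunk/src/` selects nothing, so only the required core data would be collected for it.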
Validation
  Industrial Thesis
    “explain what will be done to assure the quality of the work”
  How do we demonstrate this?
Unit Tests
  Demonstrate functional components work as specified
  Need controlled test data
    Create revision control system server
    Test against locally hosted repository
  Build repositories to test against
    Build with Python code
    Build by hand
      Manipulate with command line & GUI tools
    Save server-side directory dump
    Load test repository for appropriate tests
Extraction from Open Source Projects
  Real-world data:
    Data anomalies
    Performance anomalies
    Performance characteristics
  Project selection
    Randomly select from FLOSSmole data
      But most projects ‘look’ the same!
    Filter FLOSSmole data to find ‘large’ projects
      Large # of developers
      Long lifespan
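The selection procedure (filter for ‘large’ projects, then sample at random) might look like the sketch below. The record fields `name`, `developers`, and `lifespan_years`, the thresholds, and the sample records are all hypothetical; they do not reflect the actual layout of FLOSSmole data or the cut-offs used in the thesis.

```python
import random


def select_projects(projects, min_developers, min_years, sample_size, seed=0):
    """Keep only 'large' projects, then draw a reproducible random sample."""
    large = [
        p for p in projects
        if p["developers"] >= min_developers and p["lifespan_years"] >= min_years
    ]
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    return rng.sample(large, min(sample_size, len(large)))


# Invented records for illustration only.
PROJECTS = [
    {"name": "tiny-tool", "developers": 1, "lifespan_years": 1},
    {"name": "big-ide", "developers": 40, "lifespan_years": 8},
    {"name": "old-lib", "developers": 2, "lifespan_years": 12},
    {"name": "busy-server", "developers": 25, "lifespan_years": 6},
]

chosen = select_projects(PROJECTS, min_developers=10, min_years=5, sample_size=5)
```

Filtering before sampling addresses the concern that most projects ‘look’ the same: a purely random draw would be dominated by small, short-lived projects.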