analysis of source code repositories tools for large-scale ... · tools for large-scale collection...

25

Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Upload: others

Post on 22-May-2020

11 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Page 2: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Alexander Bezzubov source{d}

➔ committer & PMC @ apache zeppelin

➔ engineer @source{d}

➔ startup in Madrid

➔ builds the open-source components that enable large-scale code analysis and machine learning on source code

Intro

Page 3: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

motivation & vision

MOTIVATION: WHY COLLECTING SOURCE CODE

VISION:

➔ Academia: material for research in IR/ML/PL communities

➔ Industry: fuel for building data-driven products (i.e for sourcing candidates for hiring)

➔ OSS collection pipeline

➔ Use it to build public datasets industry and academia

➔ Use Git as “source of truth”, the most popular VCS

➔ A crawler (find URLs, git clone them), Distributed storage (FS, DB), Parallel processing framework

custom, in Golang standard, Apache HDFS + Postgres custom, library for Apache Spark

Page 4: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Tech Stack

Page 5: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 6: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 7: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

infrastructure

➔ Dedicated cluster (cloud becomes prohibitively expensive for storing ~100sTb)

➔ CoreOS provisioned on bare-menta \w Terraform

➔ Booting and OS configuration Matchbox and Ignition

➔ K8s deployed on top of that

More details at talk at CfgMgmtCamp

http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html

http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html

Page 8: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 9: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

collection

➔ Rovers: search for Git repository URLs

➔ Borges: fetching repository \w “git pull”

➔ Git storage format & protocol implementation

➔ Optimize for on-disk size: forks that share history, saved together

go-git to talk Git Last year had a talk at FOSDEM

https://archive.fosdem.org/2017/schedule/event/go_git/

https://archive.fosdem.org/2017/schedule/event/go_git/

Page 10: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

GIT LIBRARY FOR GO

• need to clone and analyze tens of millions of repositories with our core language Go

• be able to do so in memory, and by using custom filesystem implementations

• easy to use and stable API for the Go community

• used in production by companies, e.g.: keybase.io

motivation

go-git A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO

GO-GIT IN ACTION

example

• the most complete git library for any language after libgit2 and jgit

• highly extensible by design

• idiomatic API for plumbing and porcelain commands

• 2+ years of continuous development

• used by a significant number of open source projects

PURE GO SOURCE CODE

features

example mimicking `git clone` using go-git:

usage

• https://github.com/src-d/go-git

• go-git presentation at FOSDEM 2017

• go-git presentation at Git Merge 2017

• compatibility table of git vs. go-git

• comparing git trees in go

resourcesYOUR NEXT STEPS

TRY IT YOURSELF

# installation$ go get -u gopkg.in/src-d/go-git.v4/...

• list of more go-git usage examples

// Clone the repo to the given directory

url := "https://github.com/src-d/go-git",_, err := git.PlainClone( "/tmp/foo", false, &git.CloneOptions{ URL: url, Progress: os.Stdout, },)

CheckIfError(err)

output:

Counting objects: 4924, done.Compressing objects: 100% (1333/1333), done.Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533

https://keybase.io/blog/encrypted-git-for-everyone

https://godoc.org/gopkg.in/src-d/go-git.v4

https://github.com/src-d/go-git

https://archive.fosdem.org/2017/schedule/event/go_git/

https://www.youtube.com/watch?v=_gyLZVjekbo&feature=youtu.be&t=1332

https://github.com/src-d/go-git/blob/master/COMPATIBILITY.md

https://blog.sourced.tech/post/difftree/

https://github.com/src-d/go-git/tree/master/_examples

Page 11: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CODE COLLECTION AT SCALE

• collection and storage of repositories at large scale

• automated process

• optimal usage of storage

• optimal to keep repositories up-to-date with the origin

motivation

rovers & borges LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE

KEY CONCEPT

• set up and run rovers

• set up borges

• run borges producer

• run borges consumer

architecture

• distributed system similar to a search engine

• src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories

• src-d/borges producer reads URL list, schedules fetching

• borges consumer fetches and pushes repo to storage

• borges packer also available as a standalone command, transforming repository urls into siva files

• stores using src-d/śiva repository storage file format

• optimized for storage and keeping repos up-to-date

SEEK, FETCH, STORE

architecture

• rooted repositories are standard git repositories that store all objects from all repositories that share a common history, identified by same initial commit:

• a rooted repository is saved in a single śiva file

• updates stored in concatenated siva files: no need to rewriting the whole repository file

• distributed-file-system backed, supports GCS & HDFS

usage

• https://github.com/src-d/rovers

• https://github.com/src-d/borges

• https://github.com/src-d/go-siva

• śiva: Why We Created Yet Another Archive Format

resourcesYOUR NEXT STEPS

SETUP & RUN

https://github.com/src-d/rovers#installation

https://github.com/src-d/borges#setting-up-borges

https://github.com/src-d/borges#producer

https://github.com/src-d/borges#consumer

https://github.com/src-d/rovers

https://github.com/src-d/borges

https://blog.sourced.tech/post/siva/

https://github.com/src-d/rovers

https://github.com/src-d/borges

https://github.com/src-d/go-siva

https://blog.sourced.tech/post/siva/

https://blog.sourced.tech/post/siva/

Page 12: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 13: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

storage

➔ Metadata: PostgreSQL

➔ Built small type-safe ORM for Go<->Postgres

https://github.com/src-d/go-kallax

➔ Data: Apache Hadoop HDFS

➔ Custom (seekable, appendable) archive format: Siva 1 RootedRepository <-> 1 Siva file

https://github.com/src-d/go-kallax

Page 14: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

SMART REPO STORAGE

• store a git repository in a single file

• updates possible without rewriting the whole file

• friendly to distributed file systems

• seekable to allow random access to any file position

motivation

śiva SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT

SIVA FILE BLOCK SCHEMA

architecture

CHARACTERISTICS

• src-d/go-siva is an archiving format similar to tar or zip

• allows constant-time random file access

• allows seekable read access to the contained files

• allows file concatenation given the block-based design

• command-line tool + implementations in Go and Java

architecture usage

• https://github.com/src-d/go-siva

• śiva: Why We Created Yet Another Archive Format

resourcesYOUR NEXT STEPSAPPENDING FILES

# pack into siva file$ siva pack example.siva qux

# append into siva file$ siva pack --append example.siva bar

# list siva file contents$ siva list example.sivaSep 20 13:04 4 B qux -rw-r--r-- Sep 20 13:07 4 B bar -rw-r--r--

https://github.com/src-d/go-siva

https://github.com/src-d/go-siva

https://blog.sourced.tech/post/siva/

https://blog.sourced.tech/post/siva/

Page 15: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Core OS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 16: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

processing

Apache Spark➔ For batch processing, SparkSQL

Engine

➔ Library, \w custom DataSource implementation GitDataSource

➔ Read repositories from Siva archives in HDFS, exposes though DataFrame

➔ API for accessing refs/commits/files/blobs

➔ Talks to external services though gRPC for parsing/lexing, and other analysis

Page 17: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

YOUR NEXT STEPS

UNIFIED SCALABLE PIPELINE

• easy-to-use pipeline for git repository analysis

• integrated with standard tools for large scale data analysis

• avoid custom code in operations across millions of repos

motivation

engine UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK

APACHE SPARK DATAFRAME

architecture

• listing and retrieval of git repositories

• Apache Spark datasource on top of git repositories

• iterators over any git object, references

• code exploration and querying using XPath expressions

• language identification and source code parsing

• feature extraction for machine learning at scale

PREPARATION

architecture

usage sample

• https://github.com/src-d/engine

• Early example jupyter notebook: https://github.com/src-d/spark-api/blob/master/examples/notebooks/Example.ipynb

resources

EngineAPI(spark, 'siva', '/path/to/siva-files').repositories.references.head_ref.files.classify_languages().extract_uasts().query_uast('//*[@roleImport and @roleDeclaration]', 'imports').filter("lang = 'java'").select('imports', 'path', 'repository_id').write.parquet("hdfs://...")

• extends Apache SparkSQL

• git repositories stored as siva files or standard repositories in HDFS

• metadata caching for faster lookups over all the dataset.

• fetches repositories in batches and on demand

• available APIs for Spark and PySpark

• can run either locally or in a distributed cluster

https://github.com/src-d/engine

https://github.com/src-d/spark-api/blob/master/examples/notebooks/Example.ipynb

https://github.com/src-d/spark-api/blob/master/examples/notebooks/Example.ipynb

Page 18: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 19: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

analysis

Enry➔ Programming language identification

➔ Re-write of github/linguist in Golang, ~370 langs

Project Babelfish

➔ Distributed parser infrastructure for source code analysis

➔ Unified interface though gRPC to native parsers in containers: src -> uAST

Talk in Source Code Analysis devRoom Room: UD2.119, Sunday, 12:40 https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_code_analysis/

https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_code_analysis/

https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_code_analysis/

Page 20: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

• enry speed improvement over linguist when applied to linguist/samples folder file samples

usable in Go as a native library, in Java as shared library and as a CLI tool.

LANG DETECTION AT SCALE

• need to detect programming languages of every file in a git repository

• initially used github/linguist, but needed more performance for large scale applications

• keep compatibility with the original linguist project

motivation

enry A FASTER FILE PROGRAMMING LANGUAGE DETECTOR

benchmarks

COMPATIBLE AND FLEXIBLE

• linguist as source of information on language detection

• ignores binary and vendored files

• command line tool mimics the original linguist one

• can be used in Go (native library) or Java (shared library)

architecture

usage

• https://github.com/src-d/enry

• enry: detecting languages

• benchmark methodology and results

resourcesYOUR NEXT STEPS

• src-d/enry is at least 4x faster than linguist

• 5x (larger repos) to 20x faster (smaller repos)

GO FASTER

• additional info on benchmarking enry

$ enry /path/to/src-d/go-git98.28% Go0.69% Shell0.34% Makefile0.34% Markdown0.34% Text

https://github.com/github/linguist/tree/master/samples

https://github.com/github/linguist

https://github.com/src-d/enry

https://blog.sourced.tech/post/enry/

https://github.com/src-d/enry/tree/master/benchmarks

https://github.com/src-d/enry

https://github.com/src-d/enry#benchmarks

Page 21: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

UNIVERSAL CODE ANALYSIS

• was born as a solution for massive code analysis

• parsing single files in any programming language

• analyze all source code from all repositories in the world

• analyze many languages using a shared structure/format

motivation

babelfish A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING

CONTAINER-BASED

architecture

• AST-based diff'ing. Understanding changes made to code with finer-grained granularity.

• extract features for Machine Learning on Source Code.

• statistics of language features

• detecting similar coding patterns across languages

POWERFUL OPPORTUNITIES

use cases

• language drivers as the main building blocks

• parsing service via one driver per language

• language drivers can be written in any language and are packaged as standard Docker containers

• containers are executed by the babelfish server in a specific runtime built on-top of libcontainer.

usage

• https://github.com/bblfsh

• Babelfish documentation

• announcing Babelfish

• Babelfish presentation

• join the Babelfish community

resourcesYOUR NEXT STEPSUNIVERSAL AST

architecture

• UAST is a universal (normalized and annotated) form of Abstract Syntax Tree (AST)

• language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc.

• can be easily ported to many languages using gogo/protobuf

TRY BABELFISH ONLINE

• or run babelfish server & dashboard locally:

$ docker run --privileged -d -p \ 9432:9432 --name bblfsh \ bblfsh/server

$ docker run -p 8080:80 --link \ bblfsh bblfsh/dashboard \ --bblfsh-addr bblfsh:9432

https://github.com/opencontainers/runc/tree/master/libcontainer

https://github.com/bblfsh

https://doc.bblf.sh/

https://blog.sourced.tech/post/announcing_babelfish/

https://docs.google.com/a/sourced.tech/presentation/d/1X3cWYq8CkI0HKDrkCq7s2hrMTneFlaAcKqFQplAN1vA/edit?usp=drivesdk

https://doc.bblf.sh/community.html

https://doc.bblf.sh/uast/specification.html

https://godoc.org/github.com/bblfsh/sdk/uast#Role

https://github.com/gogo/protobuf

http://dashboard.bblf.sh

Page 22: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

CoreOS

infrastructure

Rovers

collection

storage

processing

analysis

K8s

go-git

HDFS

śiva

Apache Spark

source{d} Engine

Bblfsh

Enry

tech stack

Borders

Page 23: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

Further directions

Page 24: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

further directions

INFRASTRUCTURE➔ Persistent storage in k8s on bare-metal cluster

COLLECTION

STORAGE

PROCESSING

ANALYSIS

➔ Explore SEDA architecture, to dynamically saturate throughput

➔ Better splittable Git object storage format (\w delta-encoding, etc)

➔ Distributed Indexes to speed up common Apache Spark queries

➔ AST-diff, cross-language abstractions on top of ASTs

Page 25: analysis of source code repositories Tools for large-scale ... · Tools for large-scale collection & analysis of source code repositories OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

thank you.

Dataverse and Related Projects - Harvard University · • Dataverse is an open-source software for building data repositories • Harvard Dataverse is a generic data repositories,

Frontera-Open Source Large Scale Web Crawling Framework

Source to Output Repositories (StORe) final report, 2008

Graph-Based Source Code Analysis of JavaScript Repositories · 2020. 2. 5. · Graph-Based Source Code Analysis of JavaScript Repositories Master’s Thesis Author: Dániel Stein

Open source software for building open access repositories

A Scalable Open-Source Pipeline for Large-Scale Root Phenotyping … · LARGE-SCALE BIOLOGY ARTICLE A Scalable Open-Source Pipeline for Large-Scale Root Phenotyping of ArabidopsisW

Open source software - COnnecting REpositories · Open source software for building open access repositories Imma Subirats Coll knowledge and information management ofﬁcer FAO of

Wilcox - Open Source Repositories and the Future of Fedora

Open Source Adoption and Promotion - SCALE · 18 Open Source Promotion and Adoption: Current State * SCALE Conference March 26, 2011 #3 Service-oriented architecture {open source

Topic Recommendation for Software Repositories using Multi … · 2020. 11. 2. · Software Repositories GitHub 1 Introduction Open-source software (OSS) communities provide a wide

MULTI-SOURCE MULTI-SCALE HIERARCHICAL CONDITIONAL …

Implementing Large Scale Digital Asset Repositories with Adobe Experience Manager (AEM) & Scene7

TDNetGen: An open-source, parametrizable, large-scale

Scale Test Report for Very Large Scale Document Repositories

Planning Very Large Scale Document Repositories with High Availability in SharePoint 2013

Graph-Based Source Code Analysis of JavaScript Repositories

Cloudifying Source Code Repositories: How much does it cost?

Open source @ scale McAffer Open... · Open source @ scale Jeff McAffer Director of Open Source Programs Office, Microsoft @jeffmcaffer. Why open source? Aspiring to world class is

House of Cards: Code Smells in Open-source C# Repositories

Comparing High Scale Multi-source Algorithms

VENUS MAPPING AT SMALL SCALE: SOURCE DATA …

Model-based Analysis of Large Scale Software Repositories

ABCD Open Source Software for managing ETD repositories

Data Collection from Open Source Software Repositories · Data Collection from Open Source Software Repositories G ORAN M AUŠA, T IHANA G ALINAC G RBAC SEIP . L. ABORATORY. F. ACULTY

HAProxy scale out using open source

Open source vs. COMMERCIAL SOLUTIONS FOR DIGITAL REPOSITORIES

Source: the Directory of Open Access Repositories http ...openscience.ens.fr/OPEN_ACCESS_MODELS/GREEN_OPEN... · Source: the Directory of Open Access Repositories • Advanced Technology

Frontera: Large-Scale Open Source Web Crawling Framework · Frontera: Large-Scale Open Source Web ... Apache Nutch instead of ... Frontera-Open Source Large Scale Web Crawling Framework

Audio and Video Repositories at Scale - Indiana University’s Media Digitization and Preservation Initiative

How to source from small scale agricultural producers

Yum Repo Server 2.0 Scale Out RPM Repositories

Open source project management at scale

Digital Repositories for Learning -- Vancouver, BC -- August 11, 2009 Moen 1 Open Source Software for Learning Object Repositories A Study Commissioned

Source-to-Output Repositories Aspects of repositories use: employing an ecology-based approach to report on user requirements in chemistry and the biosciences

Presentation - SCALE/MELCOR non-LWR Source Term