2014 10-14: github plus foss == 1 million spdx

36
Large-scale license transparency using open data, open standards and F/OSS + => 1 million SPDX http://triplecheck.net http://searchcode.com

Upload: nuno-brito

Post on 24-Jun-2015

533 views

Category:

Technology


0 download

DESCRIPTION

SPDX is an open format for describing software licenses, contents and ownership. It is a simple text document with great benefits for software governance. But have you ever seen one? Despite being an open standard, there aren't many available to public. Using only Linux, GitHub and F/OSS tools, Nuno and Ben were fuelled to prove that SPDX is also applicable to everyday projects. As result, the first large-scale SPDX Internet archive came to exist. Join this presentation to learn how over 1 million SPDX documents were created using open data in large-scale repositories and how easy it is to create one. From now forward you'll be able to express the licenses in your code automatically and create licensing transparency by yourself.

TRANSCRIPT

Page 1: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Large-scale license transparency using open data, open standards and F/OSS

+ => 1 million SPDX

http://triplecheck.net http://searchcode.com

Page 2: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Speaker

Slide #2

Nuno Brito

Free/open source contributor since 2005 Last 12 months wrote 100k F/OSS lines of code SPDX contributor, co-founder of TripleCheck

Around the web http://nunobrito.eu

Page 3: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Transparency

Slide #3

Take some source code as example

Who developed the code?Which licenses are applicable?Was the code copied from somewhere else?

Page 4: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Size

Slide #4

A problem of scale

Open licenses? > 300 types to choose> 5 million F/OSS projects

> 100 million source code files

Page 5: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Practice

Slide #5

Applying licenses

Burden on developer (do correctly, do enough) Expressed differently (difficult to understand) Scaling obstacles (scarce automation)

Transparency?

Page 6: 2014 10-14: GitHub plus FOSS == 1 million SPDX

What do?

Slide #6

Ideally, we'd have tooling that is..

a) Reachableb) Cooperativec) Free

Choose two. (sad reality)

Page 7: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Choose three

Slide #7

Choose building blocks based on:

a) Open standardsb) Open datac) Reachable tools

Learn, write, improve.

Share.

Page 8: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Standards

Slide #8

SPDX: Open standard for software licensing

Standardizes license description Defines Id for license terms http://spdx.org

Pro: Good docs, straightforward, getting better Cons: Slow adoption, scarce tooling

Page 9: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Open data

Slide #9

GitHub: Targeting open data repositories

API suited for intensive access Social coding Largest open source code collection

Pro: Reachable, diverse Cons: Repositories processed one-by-one

Page 10: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Tooling

Slide #10

Custom-built tools for software licenses

Large-scale repository data-mining Find applicable licenses inside content Share millions of SPDX documents

Pro: Learn by doing, modularized, single language Cons: Built from scratch, needs consolidation

Page 11: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 1

Slide #11

Desktop tool/engine to discover licenses

SPDX format as storage medium Identify copyright and 18 license types Java, released in Feb 2014. EUPL

http://spdx.org/tools/community/triplecheck-reporter

Page 12: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Desktop

Slide #12

Page 13: 2014 10-14: GitHub plus FOSS == 1 million SPDX

File detail

Slide #13

Page 14: 2014 10-14: GitHub plus FOSS == 1 million SPDX

SPDX file

Slide #14

Page 15: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Customize

Slide #15

Page 16: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Details

Slide #16

Underneath the hood

147 file extensions, 18 license types LOC, hashes (SHA1, MD5, SHA256, SSDEEP) Command line supported (Jenkins, cron) Fast, 40k files/minute (Pentium IV)

Page 17: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 2

Slide #17

Discovering repositories with gitFinder

Create a list of projects online to use as components. Get basic licensing information from each project.

Write text file with each github user (~7 million) For each user, find repositories not forked (~10M) Split each repository according to language (197) For each list of language/reps, download code

Page 18: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Performance

Slide #18

~70k repositories/day

Single machine (i7, 8Gb RAM, CentOS) 9 parallel threads Resume/recover supported Released in Jun. 2014

https://github.com/triplecheck/gitfinder

Page 19: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Output

Slide #19

Page 20: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Storage?

Slide #20https://what-if.xkcd.com/29/ (CC BY-NC 2.5)

Page 21: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Storage

Slide #21

BigZip, +100 million files on a single download

Flat-file, zip compression (per entry) Fast, simple, portable. Indexed search

https://github.com/triplecheck/big

Page 22: 2014 10-14: GitHub plus FOSS == 1 million SPDX

How it looks

Slide #22

Page 23: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 3

Slide #23

SPDX search engine

One-click SPDX creation from open data Visualize license and copyright data Visit at http://searchcode.com/spdx

Page 24: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Example

Slide #24

Using the original URL..

https://github.com/iuly/europa_kernel/

=>

https://spdxhub.com/iuly/europa_kernel/

Page 25: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Example

Slide #25

Page 26: 2014 10-14: GitHub plus FOSS == 1 million SPDX

SPDX-1M

Slide #26

“Do It Yourself” kit. Generate 1 million SPDX

https://github.com/triplecheck/diy 1.2 million open source projects “Arduino” for s/w licenses detection

9Gb worth of SPDX? Grab:http://triplecheck.net/public/storage/spdx.big

Page 27: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Screenshots

Slide #27

Page 28: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Next step?

Slide #28

F2F – pinpointing non-original code

Decompose code into blocks Tokenize/anonymize data Find code matches across knowledge base

ETA in Dec. 2014https://github.com/triplecheck/f2f

Page 29: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Preview

Slide #29

Page 30: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Conclusion

Slide #30

What is now available for everyone

Desktop tooling / detection engine Extraction of open data in scale Search engine for SPDX

Page 31: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Questions?

Slide #31

http://spdx.orghttp://searchcode.com/spdxhttp://github.com/triplecheck

Interesting stuff? Let us know: @nn81 @boyte #linuxcon

http://xkcd.com/1118/

Page 32: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Backup slides

Slide #32

Page 33: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Engine

Slide #33

Page 34: 2014 10-14: GitHub plus FOSS == 1 million SPDX

License DB

Slide #34

Page 35: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Components

Slide #35

Page 36: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Exporting

Slide #36