OSSPolice - Identifying Open-Source License Violation and 1-day Security Risk at Large Scale
Ruian Duan, Ashish Bijlani, Meng Xu, Taesoo Kim, Wenke Lee
ACM CCS 2017
Background
• Open Source Software (OSS) is gaining popularity, e.g. GitHub reported 20M users and 57M repos
• Mobile app market grows fast with over 2M apps on Play Store
• Developers reuse OSS as is for lots of benefits
• Legal risks and security risks arise
Risks in OSS use
• OSS licenses have constraints (e.g. GNU GPL requires derivative works to open source)
• 1-day vulnerabilities in stale OSS versions are exploited by hackers
News headlines shown on the slide:
• "For now, GNU GPL is an enforceable contract, says US federal judge!"
• "Artifex Slaps Palm with PDF Reader Copyright Suit"
• "Equifax blames open-source software for its record-breaking security breach"
• "Community Health Systems Breach Possible due to Heartbleed Vulnerability"
Goal
• Design a tool, OSSPolice, to analyze Android apps for open-source license violation and 1-day security risk by detecting reuse of OSS and their versions at large scale
• Requirements
  • Accurate detection for hundreds of thousands of OSS
  • Accurate version pinpointing
  • Efficient resource usage
  • Fast search to support vetting a large number of Android apps
Overview and challenges
• Feature selection
  • Source vs binary: automatically building source code is hard, due to dependencies, various build configs etc.
• Compare App against OSS
  • Fused app binaries: multiple OSS can be linked or compiled into a single file
  • Partial builds and internal code clones: not all OSS features are built into libraries and OSS reuses other OSS
• Identify OSS versions
  • Cross-match of unique version features: fused app binaries and internal code clones can confuse the provenance of unique features
Source vs binary
• C/C++ OSS are built into stripped native shared libraries (so files)
• Java OSS are built into obfuscated Dalvik executables (dex files)
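[Figure: C/C++ source code vs. shared library vs. stripped shared library]
• Source code: Foo.c: void foo(){w="hello"…}; Bar.c: static bar(){w="world"}
• Shared library sections: .text, .dynsym, .rodata, .symtab, .debug_info
• Stripped shared library sections: .text, .dynsym, .rodata
[Figure: Java source code vs. Dalvik bytecode vs. obfuscated Dalvik bytecode]
• Source code: package edu.gatech; class Foo{bar(){println("helloworld")};}
• Dalvik bytecode: .class edu/gatech/Foo; .method bar; const-string v1, "HelloWorld"; invoke-virtual {v0,v1}, println
• Obfuscated Dalvik bytecode: .class a; .method a; const-string v1, "HelloWorld"; invoke-virtual {v0,v1}, println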
Feature selection
• C/C++ OSS vs so files
  • String literal
    • Clang-based lexer for OSS and .rodata for libraries
  • Exported function
    • Clang-based parser for OSS and .dynsym for libraries
• Java OSS vs dex files
  • String constant
  • Normalized class
    • Captures interaction with framework
  • Function centroid
    • Captures intra-procedural control flow
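The string-literal feature for native libraries can be illustrated with a small sketch. It assumes the bytes of a .rodata section have already been read out of the library (the Implementation slide names Pyelftools for that step) and simply collects printable ASCII runs, much like the Unix `strings` tool. The function name `extract_strings`, the `min_len` cutoff, and the sample bytes are hypothetical and not taken from OSSPolice.

```python
import re

def extract_strings(blob, min_len=4):
    """Return the set of printable ASCII runs of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(blob)}

# Hypothetical bytes standing in for the contents of a .rodata section.
rodata = b"\x00hello world\x00\x01\x02png_read failed\x00ab\x00"
features = extract_strings(rodata)
```

The resulting set can then be compared with the string literals that a lexer extracts from OSS sources.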
Fused app binaries
• An app uses multiple OSS
  • OSS ⊆ APP
  • APP ⊈ OSS
• Iterating over N OSS has O(N) time complexity
• Flag all OSS being used at the same time
  • Index OSS and their versions!
[Figure: app edu.gatech.example fusing MuPDF, OpenCV, OpenSSL, OkHttp, MoPub, Log4j]
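Because an app fuses several OSS, the share of matched features is more meaningful when measured against each OSS than against the app. The sketch below illustrates that point and the O(N) cost of comparing an app against every OSS in turn. All names (`oss_ratio`, `detect_naive`), the threshold, and the toy feature sets are assumptions made for illustration.

```python
def oss_ratio(app_features, oss_features):
    """Share of one OSS's features that are present in the app."""
    if not oss_features:
        return 0.0
    return len(app_features & oss_features) / len(oss_features)

def detect_naive(app_features, catalog, threshold=0.8):
    """Compare the app against every OSS in turn: O(N) comparisons."""
    return sorted(name for name, feats in catalog.items()
                  if oss_ratio(app_features, feats) >= threshold)

# Hypothetical feature sets.
catalog = {
    "MuPDF": {"pdf_open", "pdf_close"},
    "LibPNG": {"png_read", "png_write"},
    "OpenSSL": {"ssl_connect", "ssl_read"},
}
app = {"pdf_open", "pdf_close", "png_read", "png_write", "app_main"}
detected = detect_naive(app, catalog)
```

Here MuPDF covers only 2 of the app's 5 features, yet all of MuPDF's own features are found in the app, which is why the ratio is taken over the OSS side.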
Flat indexing and matching
• Indexing: Maps features to OSS
• Matching: Look up feature -> OSS mapping to identify OSS reuse
• Flat indexing blows up the table to 90 GB after indexing 7K OSS
  • Indexing multiple versions of OSS further adds to the problem
  • Given N OSS with F features and V versions, O(NFV) space complexity
[Figure: feature1, feature2, feature3 mapped to MuPDF and OpenCV, matched against edu.gatech.example]
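A flat index of this kind can be sketched as an inverted map from each feature to every (OSS, version) pair that contains it. The sketch is illustrative only: the function names, the 0.6 threshold, and the toy catalog (including the OpenCV version label) are assumptions. It also shows where the O(NFV) growth comes from, since every feature of every version gets its own entry.

```python
from collections import defaultdict

def build_flat_index(catalog):
    """Map each feature to the set of (oss, version) pairs containing it."""
    index = defaultdict(set)
    for key, features in catalog.items():
        for feature in features:
            index[feature].add(key)
    return index

def match(app_features, index, catalog, threshold=0.6):
    """Look up each app feature and keep pairs whose matched share is high."""
    hits = defaultdict(int)
    for feature in app_features:
        for key in index.get(feature, ()):
            hits[key] += 1
    ratios = {key: n / len(catalog[key]) for key, n in hits.items()}
    return {key: r for key, r in ratios.items() if r >= threshold}

# Hypothetical catalog: (oss, version) -> features.
catalog = {
    ("MuPDF", "1.5"): {"feature1", "feature2"},
    ("MuPDF", "1.6"): {"feature1", "feature2", "feature3"},
    ("OpenCV", "x"): {"feature3", "feature4"},
}
index = build_flat_index(catalog)
entries = sum(len(keys) for keys in index.values())  # one entry per feature per version
result = match({"feature1", "feature2", "feature3"}, index, catalog)
```

Note that both MuPDF versions match equally well here, which hints at why flat matching alone cannot pinpoint versions.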
Partial builds and internal code clones
[Figure: source hierarchy at repo, dir, and file level. Repos shown: LibJPEG, LibPNG, MuPDF, OpenCV. Directories shown: source, thirdparty, 3rdparty, modules/core, pdf, fitz, test, src. Files shown: test-dev.cpp, pdf-lex.c, opengl.cpp, test-io.cpp, jpeglib.h, pngtest.c, png.c, …]
• Internal code clones confuse third-party code with core code and require a high match ratio to filter
• Partial builds (e.g. examples, tests) cause the match ratio to be low
Hierarchical indexing and matching
• Hierarchical Indexing
  • Records source hierarchy to track internal clones
  • Uses Simhash algorithm to generate ids for non-leaf nodes for deduplication
  • Records unique features across versions via separate lists
• Hierarchical Matching
  • NormScore (TF-IDF based) to promote unique parts when computing matching ratio of a node
  • Allow partial builds by skipping nodes with low ratio
  • Drop internal code clones by skipping nodes likely to be third-party
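The Simhash step can be sketched as follows. This is a generic 64-bit Simhash, not OSSPolice's own code: the slides do not say how a node's children are tokenized or which base hash is used, so treating child identifiers as tokens and using BLAKE2b are assumptions, and the file names are only examples.

```python
import hashlib

BITS = 64

def hash64(token):
    """64-bit hash of a token (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(tokens):
    """Bitwise majority vote over the hashes of all tokens."""
    weights = [0] * BITS
    for token in tokens:
        h = hash64(token)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two ids."""
    return bin(a ^ b).count("1")

# A directory id derived from its children; order does not matter.
dir_id = simhash(["png.c", "pngtest.c", "jpeglib.h"])
same_id = simhash(["jpeglib.h", "png.c", "pngtest.c"])
```

Two directories with the same children get the same id and can be stored once, while directories that differ in a few children tend to get ids with a small Hamming distance.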
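[Figure: hierarchical index. Features (feature1, feature2, feature3) map to files (file1, file2, file3), files map to directories (dir 1 to dir 5), and directories map to repos (MuPDF, OpenCV, LibPNG), matched against edu.gatech.example]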
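The matching rules above can be sketched with an IDF-style weighting. The exact NormScore formula is not given on the slides, so the weight `log(1 + n / df)` is an assumption, as are the function names, the 0.5 threshold, the feature names, and the way third-party directories are passed in as a ready-made set.

```python
import math

def idf_weights(nodes):
    """nodes: name -> set of features. Rarer features get larger weights."""
    n = len(nodes)
    df = {}
    for features in nodes.values():
        for f in features:
            df[f] = df.get(f, 0) + 1
    return {f: math.log(1 + n / count) for f, count in df.items()}

def weighted_ratio(app_features, node_features, weights):
    """Weighted share of a node's features that appear in the app."""
    total = sum(weights[f] for f in node_features)
    if total == 0:
        return 0.0
    matched = sum(weights[f] for f in node_features if f in app_features)
    return matched / total

def match_repo(app_features, dirs, weights, third_party=(), min_ratio=0.5):
    """Return the directories of one repo that count as matched."""
    matched = {}
    for name, features in dirs.items():
        if name in third_party:      # likely an internal clone of another OSS
            continue
        ratio = weighted_ratio(app_features, features, weights)
        if ratio >= min_ratio:       # low-ratio nodes (tests, examples) are skipped
            matched[name] = ratio
    return matched

# Hypothetical directory contents for one repo.
dirs = {
    "source/pdf": {"pdf_lex", "pdf_open", "common_log"},
    "source/fitz": {"fz_new_context", "common_log"},
    "test": {"test_main"},
    "thirdparty/libpng": {"png_read"},
}
weights = idf_weights(dirs)
app = {"pdf_lex", "pdf_open", "fz_new_context", "png_read"}
result = match_repo(app, dirs, weights, third_party={"thirdparty/libpng"})
```

In this toy case source/fitz matches one of its two features, but the matched feature is the rarer one, so its weighted ratio rises above the unweighted 0.5.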
Cross-match of unique version features
[Figure: features 1.5.0, 1.6.0, 1.2.46, foo_string, and int bar_func() map to MuPDF V1.5, MuPDF V1.6, LibPNG V1.2.46, and LibPNG V1.6.0; edu.gatech.example is shown with MuPDF V1.6 and LibPNG V1.2.46]
Collocation-based filtering
• Leverage collocation information in the indexing table and binaries
• Use NormScore to assign different weights to features
[Figure: in the index, MuPDF V1.6 has pdf.c with 1.6.0 and int pdf_read(), and LibPNG V1.6.0 has png.c with 1.6.0 and int png_read(); edu.gatech.example is shown with MuPDF V1.6 and LibPNG V1.2.46]
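The collocation idea can be sketched with the example from the figure: the string 1.6.0 is unique to both MuPDF V1.6 and LibPNG V1.6.0, so it only counts as evidence for a version when a feature from the same source file also appears in the same binary. The data layout, the function name `pinpoint_versions`, and the binary names are assumptions; the slides do not describe the actual granularity OSSPolice uses.

```python
def pinpoint_versions(binaries, version_index):
    """Keep a version only if a unique feature and a companion share a binary."""
    evidence = {}
    for (oss, version), entries in version_index.items():
        for unique_feature, companions in entries:
            for features in binaries.values():
                if unique_feature in features and companions & features:
                    evidence.setdefault(oss, set()).add(version)
    return evidence

# (oss, version) -> list of (version-unique feature, features from the same file)
version_index = {
    ("MuPDF", "1.6"): [("1.6.0", {"int pdf_read()"})],
    ("LibPNG", "1.6.0"): [("1.6.0", {"int png_read()"})],
    ("LibPNG", "1.2.46"): [("1.2.46", {"int png_read()"})],
}
# Hypothetical app binaries and the features extracted from each.
binaries = {
    "libmupdf.so": {"1.6.0", "int pdf_read()"},
    "libpng.so": {"1.2.46", "int png_read()"},
}
versions = pinpoint_versions(binaries, version_index)
```

Without the companion check, the string 1.6.0 found in the MuPDF binary would also be credited to LibPNG V1.6.0.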
Implementation
• Data Collection
  • Scrapy for crawling of OSS repos
  • PlayDrone for crawling Android apps
• Feature Extraction
  • Clang-based lexer and parser for C/C++ source
  • Pyelftools for native binaries
  • Soot-based parser for Java bytecode and Dex bytecode
• OSS Detection
  • Redis key-value cluster for storing and querying indexing results
  • Celery job scheduler for distributing work to multiple servers
Evaluation
• F-Droid Apps
  • 4,469 apps, 579 with native libraries
  • 295 C/C++ OSS uses, 7,055 Java OSS uses
• BAT: internal code clones
• LibScout: partial builds (code removal)
[Figure: C/C++ OSS Evaluation Results, OSSPolice vs. BAT: Precision (%), Recall (%), Version Precision (%)]
[Figure: Java OSS Evaluation Results, OSSPolice vs. LibScout: Precision (%), Recall (%), Version Precision (%)]
Annotations shown in the charts: 55 matches, 478 matches, 295 matches
Measurement Dataset
• C/C++ OSS from GitHub
  • 3,119 popular repos and 60,450 OSS versions
  • 29% repos are GPL/AGPL
  • 11% repos are vulnerable with 5,611 severe CVEs (CVSS ≥ 4.0)
• Java OSS from Maven and JCenter
  • 4,777 popular artifacts, 77,308 artifact versions
  • 2.3% artifacts are GPL/AGPL
  • 1.7% artifacts are vulnerable with 452 severe CVE ids
• Android Apps from Google Play
  • 1.6M apps, 515,812 with native libraries
Performance and Scalability
• Indexing
  • 60,450 C/C++ and 77,308 Java OSS versions
  • Time cost is 1000s (C/C++) vs. 40s (Java) on average
  • Memory grows sublinearly to 30 GB (C/C++) and 9 GB (Java)
• Matching
  • Sampled 10,000 Google Play apps
  • 80% of dex and so files finish within 100s and 200s, respectively
[Figure: memory usage (GB) vs. number of indexed repos (thousands); series: C/C++ Memory Usage, Java Memory Usage]
Popular libraries
• Long-tailed distribution of OSS uses
[Figure: Top 10 detected Java OSS excluding Android and Google OSS; categories: Utils, Network, Social, Image, Codec]
[Figure: Top 10 detected C/C++ OSS; categories: Codec, Game, Font, Network, Audio, Viewer]
Legal Risks
• More than 40K potential GPL violators
  • More violators using C/C++ than Java and encoding libraries dominate
[Figure: Top 5 offended Java OSS and Top 5 offended C/C++ OSS. Labels shown: MuPDF, FFmpeg, PJSIP, VLC and X264, BZRTP; categories: Codec, Utils, Compiler, Codec, Communication]
Legal Risks
• Why violating GPL/AGPL?
  • MuPDF and iTextPDF are used due to lack of free alternatives
• OSS developers' responses
  • MuPDF got new customers :)
  • FFmpeg and VideoLAN have interest, but FFmpeg cannot enforce :)
  • PJSIP not interested due to NDA, iText did not reply :(
• Awareness of OSS licensing terms
  • None of the app developers provided source code yet :(
Security Risks
• More than 100K apps using vulnerable OSS versions
  • More apps using vulnerable C/C++ OSS than Java
[Figure: Top 6 C/C++ and 4 Java vulnerable OSS]
1,244 LibPNG and 4,919 OpenSSL uses are not detected by App Security Improvement Program (ASIP)
Security Risks
• Which versions of OSS do new app developers choose?
  • Both vulnerable and patched OSS are being used
• When do developers update OSS versions?
  • ASIP mitigates vulnerable OSS usage, but it still remains a problem
[Figure: Timeline of OSS usage for the top 10K apps, 300K app versions. Panels: MoPub, OpenSSL, OkHttp, FFmpeg; x-axis: Date (2013-05-12 to 2016-08-24); series: # Vuln. Usage, # Patched Usage; markers: ASIP Notification, ASIP Deadline]
Discussion
• Checking license compliance requires manual efforts
• Obfuscation and optimization
  • String encryption in dex files
  • Function hiding in so files
• Version pinpointing
  • Not all versions can be uniquely identified
• More programming languages (e.g. JS, Python) and platforms (e.g. iOS)
Conclusion
• OSSPolice: an accurate and scalable tool to identify license violations and 1-day security risks
  • Hierarchical indexing and matching scheme
  • Collocation-based unique feature filtering
• A large scale measurement
  • 1.6M free Google Play Store apps
  • 40K cases of potential GPL/AGPL violations and 100K apps using vulnerable OSS
• Interesting insights
  • App developers violate GPL/AGPL due to lack of free alternatives
  • App developers use vulnerable OSS versions despite efforts from Google