The Billion Object Platform (BOP): A Real-time, Big Data, Spatio-Temporal Exploration Platform
TRANSCRIPT
The Billion Object Platform (BOP): A Real-time, Big Data,
Spatio-Temporal Exploration Platform
Harvard ABCD-GIS
Benjamin Lewis, Merce Crosas, David Smiley, Devika Kakkar, Ariel Nunez
Outline
• Introduction
• Architecture
• Harvesting / Archiving
• Sentiment Enrichment
• Apache Kafka
• Solr for Geo-enrichment
• Solr & Time Sharding
• BOP Web-Service
• Client UI
• Deployment / Operations
• Docker and Kontena
BOP Requirements Summary
• Most recent ~billion geo-tweets
• Real-time search (<5 sec latency)
• Sub-second queries
  – Including heatmaps!
• On the cheap: ~6 commodity servers
Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.
BOP as an Example of a New Kind of Dataset Available in Dataverse
Streaming Data: Harvest and Archive
Initial focus on Geo-tweets (could be any streaming dataset)
• 1-2% of tweets have GPS coordinates from the user's device, ranging from 1 to 6 million per day
• The CGA has been harvesting geo-tweets since 2012 and has an informal archive of about 8 billion objects
• Researcher Ryan Qi Wang also harvesting during this period. His tweets were loaded first. CGA tweets will be merged later.
• Collaborators:
  • Harvard Dataverse Team
  • Boston Area Research Initiative
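Logical High-Level Architecture
[Diagram of the BOP. Component labels: Harvesting, Enrichment, Kafka (archive), Solr, Web Service, Browser UI. Connection labels: "Data flows via Apache Kafka" and "HTTP". Platform labels: Docker, Kontena, OpenStack. Hosting: Mass Open Cloud.]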
Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
  – For storage; time partitioning
• Lots of benefits yet serious limitations
Real-Time Harvesting
[Flow diagram: Connect to Twitter's Streaming API → Stream tweets using predefined users and coordinates extent → If the tweet is geotagged → Kafka topic]
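The harvesting filter can be sketched in a few lines. This is only an illustration, not the BOP harvester: it assumes each tweet is a dict whose `coordinates` field is a GeoJSON point in `[lon, lat]` order, and the names `tweet_point`, `keep_tweet`, `harvest` and the `publish` callback (standing in for a Kafka producer) are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed extent format: (west, south, east, north) in degrees.
BBox = Tuple[float, float, float, float]


def tweet_point(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (lon, lat) if the tweet carries an exact GeoJSON point, else None."""
    geo = tweet.get("coordinates")
    if not isinstance(geo, dict) or geo.get("type") != "Point":
        return None
    coords = geo.get("coordinates")
    if not isinstance(coords, (list, tuple)) or len(coords) != 2:
        return None
    lon, lat = coords
    return float(lon), float(lat)


def keep_tweet(tweet: dict, extent: BBox) -> bool:
    """True when the tweet is geotagged and falls inside the extent."""
    point = tweet_point(tweet)
    if point is None:
        return False
    lon, lat = point
    west, south, east, north = extent
    return west <= lon <= east and south <= lat <= north


def harvest(stream: Iterable[dict], extent: BBox,
            publish: Callable[[dict], None]) -> int:
    """Forward geotagged tweets to `publish`; return how many were kept."""
    kept = 0
    for tweet in stream:
        if keep_tweet(tweet, extent):
            publish(tweet)  # hypothetical stand-in for sending to a Kafka topic
            kept += 1
    return kept
```

In a test, `publish` can simply be `some_list.append`, which keeps the filter logic independent of any messaging client.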
Enrichment
[Flow diagram: Kafka topic → Enrich → Kafka topic. Enrichment steps: Twitter Sentiment Classifier; Geo: query Solr via spatial point query and attach related metadata to tweet. Geo source: Solr with regional polygons & metadata.]
Sentiment Analysis
• Classifier: Support Vector Machine (SVM) with Linear Kernel
• Source code in Python
• Uses scikit-learn, numpy, scipy, NLTK
• Two classes of sentiment: Positive (1), Negative (0)
• Training Corpus: Sentiment140, Polarity dataset v2.0, University of Michigan
• Preprocessing: Lowercase, URLs, @user, #tags, trimming, repeating characters, emoticons
• Stemming: Porter stemmer
• Precision, Recall, F1 score: 0.82 (82%)
• Processing speed: 20 ms/tweet (no emoticon), 5 ms/tweet (emoticon)
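The preprocessing steps listed above might look roughly like the following. This is a sketch under assumptions, not the project's code: the placeholder tokens `URL` and `USER`, the choice to keep hashtag text while dropping the `#`, and the rule of collapsing runs of three or more identical characters to two are all illustrative. Emoticon handling is left out because the slides do not say how it is done.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
SPACE_RE = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Normalize a tweet before stemming and classification."""
    text = text.lower()
    text = URL_RE.sub("URL", text)        # replace links with a placeholder
    text = USER_RE.sub("USER", text)      # replace @mentions with a placeholder
    text = HASHTAG_RE.sub(r"\1", text)    # keep the tag word, drop the '#'
    text = REPEAT_RE.sub(r"\1\1", text)   # "sooooo" becomes "soo"
    return SPACE_RE.sub(" ", text).strip()  # trim and collapse whitespace
```

Normalizing before stemming keeps the vocabulary small, which matters for a linear model trained on short, noisy text.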
Sentiment Analysis
Phase 1: Training
• Train the classifier
• Save as pickle
Phase 2: Prediction
• Load the classifier
• For each tweet: Parse → Preprocess → Stem → Predict
Solr for Geo Enrichment: "Reverse Geocoding"
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
  – Gazetteer lookup (point-in-polygon)

| Data Set | Features | Raw size | Index time | Index size |
|---|---|---|---|---|
| Admin2 | 46,311 | 824 MB | 510 min | 892 MB |
| US States | 74,002 | 747 MB | 4.9 min | 840 MB |
| Massachusetts Census Blocks | 154,621 | 152 MB | 5.9 min | 507 MB |
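To make the gazetteer lookup concrete, here is a minimal point-in-polygon sketch using the classic ray-casting test. It is not how Solr answers the query (Solr uses a spatial index rather than a linear scan); the gazetteer layout, a list of `(name, ring)` pairs with rings given as `(lon, lat)` vertices, is an assumption for illustration, and holes and multi-polygons are ignored.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_ring(lon: float, lat: float, ring: Sequence[Point]) -> bool:
    """Ray casting: count how many edges a ray heading east from the point crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # The edge straddles the point's latitude only if its ends lie on opposite sides.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def reverse_geocode(lon: float, lat: float,
                    gazetteer: List[Tuple[str, Sequence[Point]]]) -> Optional[str]:
    """Return the name of the first region containing the point, or None."""
    for name, ring in gazetteer:
        if point_in_ring(lon, lat, ring):
            return name
    return None
```

A linear scan like this costs one polygon test per region per tweet, which is why an indexed lookup is the practical choice at the feature counts in the table above.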
Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
  – precisionModel="floating_single"
  – autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
  – fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
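As a sketch of how the query above could be assembled per tweet, the helper below only builds the request parameters; it does not talk to Solr. The function names are hypothetical, and the defaults `WKT` and `GEOID10` are simply the field names shown on the slide. Note that WKT points are written in `lon lat` order.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float,
                            shape_field: str = "WKT",
                            id_field: str = "GEOID10") -> dict:
    """Build Solr parameters for an Intersects(POINT(...)) lookup."""
    return {
        # {!field ...} selects the field query parser; cache=false skips the filter cache.
        "q": f"{{!field cache=false f={shape_field}}}Intersects(POINT({lon} {lat}))",
        # Return the region id from docValues instead of a stored field.
        "fl": f"block:field({id_field})",
    }


def to_query_string(params: dict) -> str:
    """URL-encode the parameters, for the case where Solr is reached over HTTP."""
    return urlencode(params)
```

The slide recommends embedding Solr in-process, in which case the same two parameters would be passed to the embedded instance rather than URL-encoded.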
Apache Solr
• Search / analytics server, based on Lucene
• Custom add-ons:
  – Time sharded routing (index + query)
  – LatLonPointSpatialField – in Solr 6.5
    • Faster/leaner search & sort for point data
  – HeatmapSpatialField – in Solr 6.6 TBD
    • Faster/leaner heatmaps at scale
Time "Sharding"
Solr has no built-in time-based sharding.
A custom Solr "URP" (UpdateRequestProcessor) was developed to route tweets to the right by-month shard. It automatically creates and deletes shards.
A custom Solr "SearchHandler" was developed to decide which subset of shards to search, based on custom parameters sent by the web-service.
Generally useful for others. Needs more work before contribution to Solr itself.
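The two routing decisions described here can be sketched as pure functions: one picks the by-month shard for a tweet at index time, the other lists the shards a time-filtered query needs to touch. This is an illustration only, not the custom Solr components; the shard naming scheme `tweets_YYYY_MM` and the function names are assumptions, and naive datetimes are treated as UTC.

```python
from datetime import datetime, timezone
from typing import List


def _utc(ts: datetime) -> datetime:
    """Treat naive datetimes as UTC; convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def shard_for(ts: datetime, prefix: str = "tweets") -> str:
    """Index-time routing: name of the by-month shard that owns this timestamp."""
    ts = _utc(ts)
    return f"{prefix}_{ts.year:04d}_{ts.month:02d}"


def shards_for_range(start: datetime, end: datetime,
                     prefix: str = "tweets") -> List[str]:
    """Query-time routing: every monthly shard overlapping [start, end]."""
    start, end = _utc(start), _utc(end)
    if end < start:
        return []
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month == 13:  # roll over into January of the next year
            year, month = year + 1, 1
    return names
```

Restricting a query to the shards that overlap its time filter is what lets recent-data searches skip most of the index.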
The BOP Web-Service
• HTTP/REST API
  – Keyword search
  – Faceting
    • Heatmaps
  – CSV export
• Why not Solr direct?
  – Define a supported API
  – Ease of use for clients
  – Security
Tech:
• Swagger
• Dropwizard
• Kotlin lang (on JVM)
Client UI
• Browser side UI with no server component
• It uses the following technologies:
  – AngularJS
  – OpenLayers 3
  – npm (dependencies, script minification, development)
UI adapts to laptop
UI adapts to tablets
UI adapts to phones
Temporal filtering
Temporal faceting (histogram)
Spatial filtering
Spatial faceting (heatmap)
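Conceptually, a heatmap facet is a grid of counts over the map extent. The sketch below shows that idea with a naive pass over the points; it is not Solr's implementation, which works from the spatial index. The function name, the `(west, south, east, north)` extent format, and the convention that row 0 is the northernmost row are assumptions.

```python
from typing import Iterable, List, Tuple


def heatmap(points: Iterable[Tuple[float, float]],
            extent: Tuple[float, float, float, float],
            columns: int, rows: int) -> List[List[int]]:
    """Count (lon, lat) points per grid cell; row 0 is the northernmost row."""
    west, south, east, north = extent
    cell_w = (east - west) / columns
    cell_h = (north - south) / rows
    grid = [[0] * columns for _ in range(rows)]
    for lon, lat in points:
        if not (west <= lon <= east and south <= lat <= north):
            continue  # point lies outside the requested extent
        # Clamp so points exactly on the east or south edge fall in the last cell.
        col = min(int((lon - west) / cell_w), columns - 1)
        row = min(int((north - lat) / cell_h), rows - 1)
        grid[row][col] += 1
    return grid
```

The client only needs the grid and the extent to draw the heatmap, so the response size depends on the grid resolution rather than on the number of matching tweets.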
Text faceting (tag cloud)
Nearby tweets
Deployment / Operations
• Mass Open Cloud "MOC"
  – OpenStack based cloud (mimics Amazon EC2)
• CoreOS
• Kontena & Docker
• Admin/Ops tools:
  – Kafka Manager (Yahoo!)
  – Solr's admin UI
Stats:
• 12 nodes (machines)
  • 5 to Solr
  • 3 to Kafka
  • 3 to enrichment, …
• 217 GB RAM
• 3500 GB disk
• 17 services (software pieces)
  • 133 containers
Docker
• Easy to find/try/use software
  – No installation
  – Simplified configuration (env variables)
  – Common logging
  – Isolated
• Ideal for:
  – Continuous Int. servers
  – Trying new software
  – Production advantages too
• but "new"
Docker in Production
• We use "Kontena"
• Common logging, machine/proc stats, security
  – VPN to secure network; access everything as local
• No longer need to care about:
  – Ansible, Chef, Puppet, etc.
  – Security at network or proxy; not service specific
• Challenges: state & big-data