The Library (Big) Data Scientist
IFLA/ALA webinar: “Big Data: new roles and opportunities for new librarians”
June 15th 2016
IFLA Big Data Special Interest Group (SIG)
Wouter Klapwijk, Stellenbosch University, SIG convenor
IFLA Big Data SIG
• Proposed at WLIC 2014, Lyon
• Petitioned for at WLIC 2015, Cape Town
• Endorsed by the IFLA Professional Committee in December 2015
• SIG sponsor: IT Section
• Objectives:
1. Provide a focus point for developing ideas regarding Big Data as it affects libraries
2. Provide a platform within IFLA to assess and develop the avenues of response from IFLA to this developing area
Deconstructing “Big Data” and “data science”
This talk is based on a number of everyday questions:
1. What does “data science” mean?
   • Is it only happening in Tech places like Facebook and Google?
2. What are Data Scientists?
   • Can Librarians also be Data Scientists?
3. Is data science the science of Big Data?
   • What is the relationship between Big Data and data science?
4. Exactly what is “Big Data” anyway?
   • Just how big is Big? Or is Big relative? Is library data Big?
Data science
• A set of fundamental principles that guide the extraction of knowledge from data
• The “civil engineering of data”: turning data into data products
Goal of data science:
• To improve decision-making, for the betterment of organizations and society at large
Relation to other “engineering” concepts
“data mining”
• helps accomplish data science goals via technologies that incorporate data science principles
• but… its techniques are much more extensive than the set of principles comprising data science
“data warehousing”
• a facilitating technology for “data mining”
• but… not always included as part of “data mining”
Relation to other computing concepts
“data processing”
• is more general than data science
• there is processing involved in all aspects of computing
“Big Data technologies”
• are often used for data processing in support of data mining techniques
• … and other data science activities
Science or Craft?
• The term data science has existed for over 30 years – it is a field unto itself
• Its foundation rests on century-old practices of Statistics and Mathematics and, since the mid-20th century, also Computer Science
• It is not just a rebranding of Statistics and Machine Learning in the context of the Tech industry
• Much of the field's development is happening in Industry, and not in Academia
Examples of data science products
Domain        Example
Internet      Recommendation systems (Amazon = books; Facebook = friends)
Finance       Credit ratings
Education     Personalized learning and assessment
Government    Policies based on data
Practicing data science
What do Data Scientists do?
The Data Scientist
Two aspects to consider:
1. Understand what they DO in business
2. Understand which SKILLS they must possess
What do they DO?
1. They ask questions
   • Probe, being curious
2. They (try to) solve problems
   • Analytical thinking, making new discoveries
3. They cultivate (new) soft skills
   • Communicating and visualizing data
How much of the above do you, as a librarian, do?
The library professional’s genes?
Is it in our pedigree to continuously ask questions?
Do we have the taste and mindset for analytical thinking?
Do we only do ad hoc analysis, or do we prefer an ongoing conversation with data?
Is there enough of a Business Analyst or Social Scientist in us?
Which SKILLS do they need?
Data Scientist
• Domain-specific skills: understand the business, e.g. librarianship
• Soft skills: communication
• Hard skills: linear algebra, Statistics, Artificial Intelligence, Machine Learning
Data Scientists are team members
• Statistician
• Mathematician
• Data programmer
• Social scientist
• Systems administrator
• ?
Different skills are embedded across multi-disciplinary team members
The Data Scientist team profile
It is important to understand data science even if you never intend to do it yourself
[Chart: team skill profile across Statistics, Mathematics, Computer Science and Domain expertise; vertical axis from 0 to 14]
Practicing data science
What does the craft of data science look like?
3 disciplinary areas
SOURCES
• DATA
ANALYTICS
• COLLECT
• CLEAN
• INTEGRATE
• PROCESS
VISUALIZATION
• COMMUNICATE
Each disciplinary area requires different skills
• Systems Administrator
• Data Programmer
• App designer
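The COLLECT, CLEAN, INTEGRATE and PROCESS steps listed under ANALYTICS can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the record layouts, the field names (`patron_id`, `category`, `item_type`) and the sample values are invented for the example and do not come from any real library system.

```python
from collections import Counter


def collect():
    """COLLECT: gather raw records from two hypothetical sources."""
    patrons = [
        {"patron_id": " P001 ", "category": "Student"},
        {"patron_id": "p002", "category": "staff"},
        {"patron_id": "", "category": "student"},  # unusable: no ID
    ]
    loans = [
        {"patron_id": "P001", "item_type": "Book"},
        {"patron_id": "p001", "item_type": "DVD"},
        {"patron_id": "P002 ", "item_type": "Book"},
        {"patron_id": "P999", "item_type": "Journal"},  # unknown patron
    ]
    return patrons, loans


def clean(records, key="patron_id"):
    """CLEAN: trim and lowercase every field, drop rows with an empty ID."""
    cleaned = []
    for rec in records:
        rec = {k: v.strip().lower() for k, v in rec.items()}
        if rec[key]:
            cleaned.append(rec)
    return cleaned


def integrate(patrons, loans):
    """INTEGRATE: attach the patron's category to each loan with a known patron."""
    category_by_id = {p["patron_id"]: p["category"] for p in patrons}
    return [
        {**loan, "category": category_by_id[loan["patron_id"]]}
        for loan in loans
        if loan["patron_id"] in category_by_id
    ]


def process(joined):
    """PROCESS: count loans per patron category."""
    return Counter(row["category"] for row in joined)


patrons, loans = collect()
loans_per_category = process(integrate(clean(patrons), clean(loans)))
print(dict(loans_per_category))
```

In practice each step would read from real systems (a library management system export, a patron database) and the cleaning rules would depend on the data at hand; the point of the sketch is only the order of the steps.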
SOURCES
Database (1960-)
• First integrated data store (Bachman), 1963
• Relational data model (Codd), 1970
• SQL (Boyce & Chamberlin), 1970+
Data Warehouse (1975-)
• First commercial RDBMS (Oracle), 1979
• DB2 (IBM), 1983
• First KDD workshop, 1989
• First KDD data mining conference (Fayyad, Piatetsky-Shapiro), 1995
Big Data (2005-)
• NoSQL (Evans), 2009
SOURCES
SMALL DATA
• Databases
• Data Warehouses
• Quantitative and qualitative
• Mostly structured and indexical
• Metadata
• “Long tail data”
BIG DATA
• Mostly unstructured (80%)
• Various sources
• Needs to be related and combined
• Social
• A/V
• Logs
Incomplete Data Taxonomy: some data are neither just big nor just small
SOURCES
Small data
• The term denotes the opposite of Big Data
• Data usually housed in databases and data warehouses
• Usually structured, qualitative and indexical in nature
• Examples: Library data, Research Data (RDM)
• Research data = primary data
SOURCES
Big data
• Data sets that are too large for traditional data processing and storage systems (3 V’s, 4 V’s, 5 V’s)
• Classified into 3 classes of “datafication”:
1. Directed data (e.g. surveillance data)
2. Automated data (e.g. device-generated data)
3. Volunteered data (e.g. social networks data)
ANALYTICS
Database (1960-)
• First integrated data store (Bachman), 1963
• Relational data model (Codd), 1970
• SQL (Boyce & Chamberlin), 1970+
Data Warehouse (1975-)
• First commercial RDBMS (Oracle), 1979
• DB2 (IBM), 1983
• First KDD workshop, 1989
• First KDD data mining conference (Fayyad, Piatetsky-Shapiro), 1995
Big Data (2005-)
• NoSQL (Evans), 2009
Business Intelligence (BI), DATA DELUGE, Data Science
ANALYTICS
There are 4 broad classes of analytics (often used in combination):
1. Data mining and pattern recognition
   • AI – Machine Learning – Data Mining
2. Data visualization and visual analytics
   • App development
3. Statistical analysis
   • Statistical techniques and principles (regression, etc.)
4. Prediction, simulation, and optimization
   • Algorithms
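As one illustration of statistical analysis feeding into prediction, a straight line can be fitted to a short series with ordinary least squares and then extended one step ahead. The sketch below uses only the Python standard library; the monthly loan counts are made-up numbers chosen for the example, not real library data, and a linear trend is assumed purely for simplicity.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly loan counts for months 1..6
months = [1, 2, 3, 4, 5, 6]
loans = [120, 135, 150, 160, 180, 195]

slope, intercept = fit_line(months, loans)
forecast_month_7 = intercept + slope * 7

print(f"Estimated growth: {slope:.1f} loans per month")
print(f"Naive forecast for month 7: {forecast_month_7:.0f} loans")
```

A real analysis would also check how well the line fits (residuals, seasonality) before trusting a forecast; tools such as R or Python's scientific libraries provide this out of the box.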
Practicing data science
In Libraries
1. Data at Scale
Value and insight can be extracted from small data by scaling them up into larger data sets, for reuse through digital data infrastructures
2. Analyzing Exhaust data
Exhaust data = produced as a by-product of the main function of a device or system
Most exhaust data is transient in nature – it is never examined and simply discarded!
Example: log of a self-checkout unit
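To make the self-checkout example concrete, the sketch below counts successful checkouts per hour from a few log lines. The log format shown here is purely hypothetical (timestamp, event name, `item=` and `status=` fields); a real self-checkout unit will write its own format, and the parsing would need to be adapted accordingly.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines, invented for illustration only.
SAMPLE_LOG = """\
2016-06-15 09:05:12 CHECKOUT item=B1001 status=OK
2016-06-15 09:47:30 CHECKOUT item=B1002 status=FAILED
2016-06-15 10:02:03 CHECKOUT item=D2001 status=OK
2016-06-15 10:15:44 HEARTBEAT
2016-06-15 10:59:59 CHECKOUT item=B1003 status=OK
"""


def checkouts_per_hour(log_text):
    """Count successful CHECKOUT events per hour of the day."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        # Skip anything that is not a complete CHECKOUT record
        if len(parts) < 5 or parts[2] != "CHECKOUT":
            continue
        # Skip failed transactions
        if parts[4] != "status=OK":
            continue
        timestamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        counts[timestamp.hour] += 1
    return counts


hourly = checkouts_per_hour(SAMPLE_LOG)
for hour in sorted(hourly):
    print(f"{hour:02d}:00  {hourly[hour]} successful checkouts")
```

Even a simple count like this turns otherwise discarded log data into something that could inform staffing or opening hours.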
Example of analyzing Exhaust data
Structured and Unstructured data
• VISITS
• PATRONS
• LOANS
• LOCATIONS
• DIGITIZED BOOKS
• Books
• DVDs
• Journals
Actionable Insights
• Better forecasts for future library planning
• Better usage of systems and resources
• Productivity gain with better decision-making
Examples of Analysis and Visualization
• Library analytics toolkit – Harvard University: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit
• Text analytics – Google Books Ngram Viewer: https://books.google.com/ngrams
• Open Source implementation – Bookworm: http://bookworm.culturomics.org
In summary
The fundamental principle of data science is that data, and the capability to extract useful knowledge from it, should be regarded as a key strategic asset.
Libraries must learn to start thinking data-analytically. Do we only use gut and intuition, or also data and rigor, in our decision-making?
In summary
You can apply the same principles, tools and techniques to small data as you would to big data
“…the tools of data science are as appropriate for gigabyte as they are for petabyte scale data sets…”
(https://datascience.berkeley.edu/about/what-is-data-science/)
In summary
Challenges for librarians:
• There is a shortage of Big Data talent
• The Big Data SIG is attempting to understand and frame Big Data problems
Opportunities for librarians:
• Grow your data analytical skills
• Attend online courses: Khan Academy, Coursera, Software Carpentry, digital books
• There are free software tools: R (SQL Server 2016 includes R), Python, app visualization tools
Thank you