Lancaster UCREL Summer School 2017 - Big Data and NLP

UCREL Summer School | Big Data NLP
Presented by: Daniel Kershaw
Date: 27/06/2017

About

Daniel Kershaw
Recommender System
Senior Data Scientist

@danjamker

www.danjamker.com

Outline

• Part 1 – 30 minutes
  • Big Data (what is it?)
  • MapReduce
  • Spark
  • Document similarity
• Part 2 – 1 hour
  • Downloading Zeppelin on Docker
  • Read document set, extract data
  • Tokenize
  • Implement document similarity
  • Cosine similarity between documents

First

Set up Docker: sudo docker pull epahomov/docker-zeppelin

Download Zeppelin Notebook: https://www.dropbox.com/s/161hpz02cafblsg/SDOA.json?dl=0

Part 1 - Big Data and NLP

Presented by: Daniel Kershaw
Date: 20th June 2017

“640K ought to be enough for anyone”
Bill Gates, Microsoft, 1981

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days”
Eric Schmidt, Google, 2010

How much data?

Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year

Google Big Data Trend

What is Big Data

“Too big to fit in an Excel spreadsheet”
Professor Steven Weber, UC Berkeley School of Information

“Big data means data that cannot fit easily into a standard relational database”
Hal Varian, Chief Economist, Google

“The term ‘Big Data’ applies to information that can't be processed or analysed using traditional processes or tools”
Professor Steven Weber, UC Berkeley School of Information

The Big V’s

• Volume
• Velocity
• Variety
• Exhaustive
• Veracity
• Relational & Indexical
• Relational
• Flexible

Examples of Big / Large Data (NLP)

• Wikipedia
• Hansard
• Enron Email Corpus
• Reddit Data Release
• Twitter Data Set

• ScienceDirect Corpus
• Mendeley User Catalogs
• Engineering Village
• User interaction logs
• Funding data
• EVISE

Scaling up Computation

• Servers
  • CPUs (Xeon)
  • RAM (32 GB)
  • Disks (2 x 1 TB)
• Rack
  • 40-80 servers
  • Networked together
  • UPS (power supply)

Google Data Center image

Programming at Scale

• How do we split across nodes?
  • Network and data locality
• How do we deal with failures?
  • 1 server fails every 3 years => 10k nodes would be about 10 failures a day
• How do we deal with slow machines?
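
The failure estimate on this slide can be checked with a little arithmetic. The sketch below is a minimal illustration that assumes failures are independent and spread evenly over time; the object and function names are hypothetical and not taken from the talk.

```scala
object FailureEstimate {
  // Expected failures per day for a cluster, assuming each node fails
  // on average once every `yearsBetweenFailures` years (assumed uniform rate).
  def failuresPerDay(nodes: Int, yearsBetweenFailures: Double): Double =
    nodes / (yearsBetweenFailures * 365.0)

  def main(args: Array[String]): Unit =
    // 10,000 nodes, each failing about once every 3 years
    println(failuresPerDay(10000, 3.0))
}
```

With 10,000 nodes and one failure per server every 3 years this comes to roughly 9 failures a day, in line with the slide's "about 10".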

Hadoop

Google MapReduce published 2004
Google File System published 2003

Map Reduce

Mapper, Reducer

Map Reduce - Mapper

• Takes a series of <key, value> tuples
• Processes each tuple
• Outputs 0 or more <key, value> tuples

Map Reduce - Reducer

• Called once for each unique <key, [value]>
• Iterates through each value
• Outputs 0 or more results as <key, value>

Example Code – Word Count
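
The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal word-count sketch in plain Scala that follows the mapper and reducer contracts described earlier. It runs locally on ordinary collections rather than on Hadoop, and the names `WordCount`, `mapper`, `reducer` and `run` are illustrative assumptions, not the code from the talk.

```scala
object WordCount {
  // Mapper: takes one <key, value> pair (line offset, line text)
  // and emits zero or more <word, 1> pairs.
  def mapper(offset: Long, line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq.map(word => (word, 1))

  // Reducer: called once per unique key with all of that key's values.
  def reducer(word: String, counts: Seq[Int]): Seq[(String, Int)] =
    Seq((word, counts.sum))

  // Local stand-in for the framework: map, shuffle (group by key), reduce.
  def run(lines: Seq[String]): Map[String, Int] = {
    val mapped = lines.zipWithIndex.flatMap { case (line, i) => mapper(i.toLong, line) }
    val shuffled = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }
    shuffled.toSeq.flatMap { case (word, counts) => reducer(word, counts) }.toMap
  }

  def main(args: Array[String]): Unit =
    run(Seq("big data and nlp", "big data with spark")).toSeq.sortBy(_._1).foreach(println)
}
```

In a real cluster the grouping step is the shuffle performed by the framework between the map and reduce phases; here `groupBy` plays that role.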

MapReduce - Overview

Issues with Hadoop - Complexity

• Applications need more than one step
  • Google pipeline was 22 steps
  • Analytic queries, e.g. k-means: 2-5 steps
  • Iterative queries, e.g. PageRank: 10-20 steps
• Problems with performance and ease of development

Issues with Hadoop - Usability

• Multiple map and reduce classes
• A lot of boilerplate code
• Easy to combine incorrectly

Issues with Hadoop - Performance

• One pass at a time
• Must write to HDFS between jobs
• Expensive to reuse data
• Hand-optimize code to combine steps

Big Data Processing

Spark

Spark Model

• Resilient distributed datasets (RDD)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • Count, reduce, collect, save, …

Spark vs Hadoop – Data Sharing (diagram panels: Spark, Hadoop)

SparkML

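import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val train_data: RDD[Vector] = ??? // placeholder: RDD of Vector
// KMeans.train also requires maxIterations; 20 is an assumed value, the slide shows only k
val model = KMeans.train(train_data, k = 10, maxIterations = 20)

// evaluate the model
val test_data: RDD[Vector] = ??? // placeholder: RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)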

Spark Dataframes

• Interact with data like a table
• Inbuilt functions for:
  • Tokenization
  • Stop-word removal
  • TF-IDF transformation

Name  Age  Gender  Abstract

Title  abstract  keywords  ASJC
Title  abstract  keywords  ASJC  Title_tok

Part 2 – Document Similarity
Technical Workshop

Presented by: Daniel Kershaw
Date: 29th June 2017

Outline

• Download Apache Zeppelin
• Download datasets
• Read datasets
• Tokenize and remove stopwords
• Read word vectors

Install Apache Zeppelin

• Pull Docker image
  • docker pull epahomov/docker-zeppelin
• Run Docker image
  • docker run -d -p 8080:8080 -p 7077:7077 -p 4040:4040 epahomov/docker-zeppelin
• Go to
  • localhost:8080

Document Embedding Similarity

Word represented as dense vector

Apple     [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]

Document represented as sum (mean) of dense vectors

  Apple     [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Mac       [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
+ Computer  [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]
= Document  [0.5, 0.6, 0.3, 0.1, 0.6, 0.5, 0.5, 0.9, 0.9, 0.3, 0.5, 0.4, 0.4, 0.5, 0.5]

Download Spark Dependencies

Download Sample Science Direct Corpus

Science Direct Open Access Corpus

• Contains all content seen on the SD front end
• Available on GitHub
• Extract PII (document ID)
• Extract abstract
• Use Elsevier open-source XML parser
• Extract fields with XPath & XQuery

Read Documents

Extract Title and Document Abstract

Tokenize and Remove Stop words
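
The notebook cell for this step is not in the transcript. The sketch below shows the idea in plain Scala: lower-case the text, split it into tokens, and drop stop words. The `SimpleTokenizer` object and its tiny stop-word list are assumptions made for illustration. In Spark itself, the `Tokenizer` and `StopWordsRemover` transformers in `org.apache.spark.ml.feature` cover the same ground, though whether the workshop used them is not stated.

```scala
object SimpleTokenizer {
  // Small illustrative stop-word list (assumed); a real list would be much longer.
  val stopWords: Set[String] = Set("a", "an", "and", "the", "of", "is", "in", "to", "this")

  // Lower-case, split on anything that is not a letter, drop empties and stop words.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(t => t.nonEmpty && !stopWords.contains(t))

  def main(args: Array[String]): Unit =
    println(tokenize("This is the Apple Mac computer"))
}
```

Splitting on non-letters is a deliberate simplification: it discards numbers and breaks hyphenated terms, which may or may not suit scientific abstracts.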

Download Word Vectors

Load Word Vectors

word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]

Explode the tokens

DocID  Tokens
1      [apple, computer, mac]
2      [apple, computer, mac]
3      [apple, computer, mac]
4      [apple, computer, mac]
5      [apple, computer, mac]

DocID  Tokens
1      apple
1      computer
1      mac
2      apple
2      computer
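
Exploding turns one row per document into one row per (document, token) pair. A minimal sketch on plain Scala collections, with hypothetical names, assuming each document is a pair of an ID and a token list:

```scala
object ExplodeTokens {
  // Input: one row per document, (docId, tokens).
  // Output: one row per token, (docId, token), in the original order.
  def explode(docs: Seq[(Int, Seq[String])]): Seq[(Int, String)] =
    docs.flatMap { case (docId, tokens) => tokens.map(token => (docId, token)) }

  def main(args: Array[String]): Unit =
    explode(Seq((1, Seq("apple", "computer", "mac")), (2, Seq("apple", "computer"))))
      .foreach(println)
}
```

On a Spark DataFrame the `explode` function in `org.apache.spark.sql.functions` performs the same reshaping on an array column.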

Join on words

DocID  word
1      apple
1      computer
1      mac
2      apple
2      computer

word      vector
apple     [0.2, 0.4, 0.8]
computer  [0.2, 0.4, 0.8]
mac       [0.2, 0.4, 0.8]
Google    [0.2, 0.4, 0.8]
this      [0.2, 0.4, 0.8]

DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]
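
The join attaches a vector to every (document, word) row. The sketch below models the word-vector table as a `Map` and performs an inner join, so words without a vector are dropped. That choice, and the names used, are assumptions for illustration rather than the notebook's actual code.

```scala
object JoinOnWords {
  // Inner join of (docId, word) rows with a word -> vector lookup table.
  // Rows whose word has no vector are silently dropped.
  def join(
      rows: Seq[(Int, String)],
      vectors: Map[String, Vector[Double]]
  ): Seq[(Int, String, Vector[Double])] =
    rows.flatMap { case (docId, word) => vectors.get(word).map(v => (docId, word, v)) }

  def main(args: Array[String]): Unit = {
    val vectors = Map("apple" -> Vector(0.2, 0.4, 0.8), "computer" -> Vector(0.2, 0.4, 0.8))
    join(Seq((1, "apple"), (1, "computer"), (1, "unseen")), vectors).foreach(println)
  }
}
```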

Group by document ID, mean the vectors

DocID  word      vector
1      apple     [0.2, 0.4, 0.8]
1      computer  [0.2, 0.4, 0.8]

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]
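
Grouping the joined rows by document ID and averaging the word vectors element by element gives one vector per document. A minimal sketch in plain Scala, assuming all vectors have the same length; the names are hypothetical:

```scala
object DocumentVectors {
  // Element-wise mean of equal-length vectors.
  // transpose throws if the vectors differ in length.
  def mean(vectors: Seq[Vector[Double]]): Vector[Double] = {
    val n = vectors.size.toDouble
    vectors.transpose.map(_.sum / n).toVector
  }

  // Group (docId, wordVector) rows by docId and average each group.
  def embed(rows: Seq[(Int, Vector[Double])]): Map[Int, Vector[Double]] =
    rows.groupBy(_._1).map { case (docId, group) => (docId, mean(group.map(_._2))) }

  def main(args: Array[String]): Unit =
    embed(Seq((1, Vector(0.0, 2.0)), (1, Vector(2.0, 4.0)), (2, Vector(1.0, 1.0))))
      .foreach(println)
}
```

Using the mean rather than the sum keeps long and short documents on the same scale, although cosine similarity itself is unaffected by that scaling.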

Join word vectors to document

UCREL Summer School | 55

Join word vectors to document

UCREL Summer School | 56

Join word vectors to document

UCREL Summer School | 57

Join word vectors to document

UCREL Summer School | 58

Join word vectors to document

UCREL Summer School | 59

Join word vectors to document

Identify similar documents

• Cartesian join of documents
• Compute cosine similarity between each pair of documents

Join to self: the document-vector table is joined with itself

DocID  vector
1      [0.2, 0.4, 0.8]
2      [0.2, 0.4, 0.8]
3      [0.2, 0.4, 0.8]
4      [0.2, 0.4, 0.8]
5      [0.2, 0.4, 0.8]

     1    2    3
1    0.4  0.6  0.6
2    0.5  0.4  0.7
3    0.6  0.1  0.3
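
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below pairs every document with every other document, as in the Cartesian self-join, and scores each pair. It uses plain Scala collections with hypothetical names, and returning 0.0 for a zero-length vector is an assumed convention.

```scala
object CosineSimilarity {
  // cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector has zero length.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Cartesian (self) join: every document paired with every document.
  def allPairs(docs: Map[Int, Vector[Double]]): Seq[(Int, Int, Double)] =
    for {
      (idA, vecA) <- docs.toSeq
      (idB, vecB) <- docs.toSeq
    } yield (idA, idB, cosine(vecA, vecB))

  def main(args: Array[String]): Unit =
    allPairs(Map(1 -> Vector(1.0, 0.0), 2 -> Vector(0.0, 1.0), 3 -> Vector(1.0, 1.0)))
      .foreach(println)
}
```

The full Cartesian join grows with the square of the number of documents. Because cosine similarity is symmetric, keeping only pairs with `idA < idB` would roughly halve the work.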

Thank you

Any questions?