demystifying data science - vvtesh.co.in · demystifying data science venkatesh vinayakarao...


Post on 20-May-2020


Page 1: Demystifying Data Science - vvtesh.co.in · Demystifying Data Science Venkatesh Vinayakarao venkateshv@cmi.ac.in ... Big Data is Ubiquitous •Facebook Statistics •1.5 billion people

Venkatesh Vinayakarao (Vv)

Demystifying

Data Science

Venkatesh Vinayakarao, venkateshv@cmi.ac.in

http://vvtesh.co.in

SSN School of Advanced Career Education

The world's most valuable resource is no longer oil, but data. – Economist Report, 2017.


What Comes Next?

byte

kilobyte

megabyte

gigabyte

??

???

????

?????


Sizes

Name        Size
Byte        8 bits
Kilobyte    1024 bytes
Megabyte    1024 kilobytes
Gigabyte    1024 megabytes
Terabyte    1024 gigabytes
Petabyte    1024 terabytes
Exabyte     1024 petabytes
Zettabyte   1024 exabytes
Yottabyte   1024 zettabytes
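
The ladder of sizes above can be turned into a small conversion helper. The following is a minimal sketch, assuming the slide's factor of 1024 between consecutive units; the names `UNITS`, `bytes_in`, and `humanize` are hypothetical and chosen only for illustration.

```python
# Unit names in the order listed on the slide; each unit is assumed
# to be 1024 times the previous one, as in the table.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def bytes_in(unit: str) -> int:
    """Return how many bytes make up one `unit`."""
    return 1024 ** UNITS.index(unit)


def humanize(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}s"
        value /= 1024


print(bytes_in("terabyte"))       # 1024 ** 4 bytes
print(humanize(5 * 1024 ** 3))    # a count expressed in gigabytes
```

Each step up the table multiplies by 1024, so a yottabyte is 1024 ** 8 bytes under this convention.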


Big Data is Ubiquitous

• Facebook Statistics

  • 1.5 billion people are active on Facebook daily!

  • Every minute, 510,000 comments are posted and 293,000 statuses are updated!

  • More than 300 million photos are uploaded per day!

  • In total, more than 2.5 trillion posts!

Source: Forbes


Source: https://www.visualcapitalist.com/big-data-keeps-getting-bigger/


And, It is Growing!


Data Growth

Mankind’s quest to digitize the world!

Size of the global datasphere*: 33 ZB (2018) → 175 ZB (2025)

*Source: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
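
To get a feel for how fast this is, one can work out the yearly growth rate implied by the two figures above. This is only a rough sketch: it assumes steady compound growth between 2018 and 2025, which the source does not state, and `annual_growth_rate` is a hypothetical helper name.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# 33 ZB in 2018 and 175 ZB in 2025, as quoted on the slide.
rate = annual_growth_rate(33, 175, 2025 - 2018)
print(f"{rate:.1%}")  # implied yearly growth under the steady-growth assumption
```

Under that assumption the datasphere grows by roughly a quarter every year.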


Solitary Confinement is Cruel


World needs data scientists!


Data Science


Loads of structured and unstructured data are available.

We need scientifically sound methods to capture, maintain, process, communicate, and analyze data.


Modern Text Processing: Vector Space Model


Which Document to Retrieve?

[Diagram: Indexed Content (d1: “SSN Chennai”, d2: “XYZ Delhi”); Query = “SSN”; Retrieval Model {VSM, LDA, BM25, …}; Results = ??]


Vectors

• Geometric entity which has magnitude and direction

[Figure: a vector A plotted on x and y axes, with tick marks at 1 on each axis]


Sentences are vectors

• “SSN Chennai” as a vector

[Figure: the vector “SSN Chennai” plotted on axes labeled SSN and Chennai, with tick marks at 1 on each axis]


Sentences are vectors

• “The SSN Chennai” is a 3-dimensional vector

[Figure: the vector “The SSN Chennai” plotted in 3D; the visible axis labels are SSN and Chennai, with tick marks at 1]


Sentences are vectors

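• In this 3D space, the “The SSN” vector lies in the plane spanned by the x (The) and z (SSN) axes.

[Figure: the vector “The SSN” plotted in 3D; the visible axis labels are SSN and Chennai, with tick marks at 1]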


Comparing Sentences

• We can compare sentences using the angle between vectors

[Figure: the vectors “The SSN” and “The Chennai” plotted in 3D; the visible axis labels are SSN and Chennai, with tick marks at 1]


Angle between two vectors

• What is the angle between The and SSN vectors?

• What is the angle between SSN and Chennai vectors?

• What is the angle between The SSN and The SSN vectors?
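
These three questions can be checked numerically. The sketch below assumes each word is its own axis, with the axis order (The, SSN, Chennai) picked for illustration; `angle_degrees` is a hypothetical helper that recovers the angle from the dot product.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Assumed axis order: (The, SSN, Chennai).
the = (1, 0, 0)
ssn = (0, 1, 0)
chennai = (0, 0, 1)
the_ssn = (1, 1, 0)

print(angle_degrees(the, ssn))          # perpendicular axes: about 90 degrees
print(angle_degrees(ssn, chennai))      # perpendicular axes: about 90 degrees
print(angle_degrees(the_ssn, the_ssn))  # same vector: about 0 degrees, up to rounding
```

Different words sit on perpendicular axes, so they are 90 degrees apart, while a sentence compared with itself gives an angle of 0.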


Mathematical Notation

• We represent vectors as follows:

  • Vector = (dimension1, dimension2, dimension3, …)

• First, define the dimensions

• Next, put “1” if the word is present in the sentence, else “0”

• Example

In our example, vector = (The, SSN, Chennai)

So,

The Chennai = (1,0,1)

The SSN = (1,1,0)
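
The recipe above can be sketched in a few lines. This is an illustrative sketch only: it assumes sentences are split on whitespace and that words must match a dimension exactly (including case); `to_binary_vector` is a hypothetical name.

```python
def to_binary_vector(sentence, dimensions):
    """Put 1 for each dimension whose word occurs in the sentence, else 0."""
    words = set(sentence.split())  # assumption: simple whitespace tokenization
    return tuple(1 if dim in words else 0 for dim in dimensions)


dimensions = ("The", "SSN", "Chennai")

print(to_binary_vector("The Chennai", dimensions))  # (1, 0, 1)
print(to_binary_vector("The SSN", dimensions))      # (1, 1, 0)
```

Note that word order is lost: “SSN The” would map to the same vector as “The SSN”.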


Converting from “0 – 90” to “0 – 1”

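• For convenience, we convert angles in the range 0°–90° to values in the range 0–1.

  • When the vectors are the same (angle 0°), we want to output 1.

  • When the vectors are perpendicular (angle 90°), we want to output 0.

[Figure: a document vector d_j and a query vector q]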


A Way to Calculate cosθ

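• $\cos\theta = \dfrac{x \cdot y}{|x|\,|y|}$

• Here, $x \cdot y$ is the “dot product” of the x and y vectors.

• So, the similarity between “The SSN” = (1,1,0) and “SSN Chennai” = (0,1,1) is

$\dfrac{1 \cdot 0 + 1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2+1^2+0^2}\,\sqrt{0^2+1^2+1^2}} = \dfrac{1}{\sqrt{2}\,\sqrt{2}} = 0.5$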
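
The same calculation can be written as a short function. This is a minimal sketch in plain Python, assuming the dimension order (The, SSN, Chennai) from the earlier example; `cosine_similarity` is an illustrative name, and the function does not handle all-zero vectors (it would divide by zero).

```python
import math


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| |y|) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    length_x = math.sqrt(sum(a * a for a in x))
    length_y = math.sqrt(sum(b * b for b in y))
    return dot / (length_x * length_y)


the_ssn = (1, 1, 0)      # "The SSN"
ssn_chennai = (0, 1, 1)  # "SSN Chennai"

print(round(cosine_similarity(the_ssn, ssn_chennai), 2))  # 0.5
```

The two sentences share one word out of two each, which is why the score lands halfway between 0 and 1.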


Which Document to Retrieve?

[Diagram: Indexed Content (d1: “SSN Chennai”, d2: “XYZ Delhi”); Query = “SSN”; Retrieval Model {VSM, LDA, BM25, …}; Results = ??]


Example

Let query q = “SSN”.

Let document, d1 = “SSN Chennai” and d2 = “XYZ Delhi”.

In our VSM, q = (1,0,0,0), d1= (1,1,0,0) and d2 = (0,0,1,1)

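similarity(d1, q) = $\dfrac{d_1 \cdot q}{|d_1|\,|q|} = \dfrac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2+1^2}\,\sqrt{1^2}} = \dfrac{1}{\sqrt{2}} \approx 0.71$

similarity(d2, q) = $\dfrac{d_2 \cdot q}{|d_2|\,|q|} = \dfrac{0 \cdot 1 + 0 \cdot 0 + 1 \cdot 0 + 1 \cdot 0}{\sqrt{1^2+1^2}\,\sqrt{1^2}} = 0$

      SSN   Chennai   XYZ   Delhi
q      1       0       0      0
d1     1       1       0      0
d2     0       0       1      1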
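
Putting the pieces together, the worked example can be sketched as a tiny retrieval routine. This is an illustration under stated assumptions, not a full retrieval model: it uses whitespace tokenization and binary vectors over the vocabulary (SSN, Chennai, XYZ, Delhi), and the names `to_binary_vector`, `cosine_similarity`, and `rank` are hypothetical.

```python
import math


def to_binary_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def rank(query, documents, vocabulary):
    """Score every document against the query, best match first."""
    q = to_binary_vector(query, vocabulary)
    scores = [(name, cosine_similarity(to_binary_vector(text, vocabulary), q))
              for name, text in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


vocabulary = ("SSN", "Chennai", "XYZ", "Delhi")
documents = {"d1": "SSN Chennai", "d2": "XYZ Delhi"}

ranked = rank("SSN", documents, vocabulary)
for name, score in ranked:
    print(name, round(score, 2))  # d1 scores about 0.71, d2 scores 0
```

Since d1 has the higher score, it is the document a vector space model would return first for the query “SSN”.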


Which Document to Retrieve?

[Diagram: Indexed Content (d1: “SSN Chennai”, d2: “XYZ Delhi”); Query = “SSN”; Retrieval Model {VSM, LDA, BM25, …}; Results = ??]


Summary

• Data is Ubiquitous

  • and it is growing too!

• Modern Text Processing

  • Vector Space Model

• Remember

  • Data processing goes beyond common sense… we need techniques and tools.

• Products are good to learn. Principles are even more important. Don’t ignore them.


Memories
