What is "data"?

Upload: clement-levallois

Post on 26-May-2015

Category:

Education


DESCRIPTION

Slides from the course on big data by C. Levallois, EMLYON Business School, for business students. See the online video that accompanies these slides. -> Basic definition of data and the related concepts you need to characterize a dataset.

TRANSCRIPT

Page 1: What is "data"?

MK99 – Big Data 1

Big data & cross-platform analytics

MOOC lectures

Pr. Clement Levallois

Page 2: What is "data"?

Note
• Some terms in the slides are highlighted with a box.
• These terms are part of your quiz assignment for the week, which you will find on the online platform.
• They are often technical terms. It is vital that you know their meaning, as they form the basic vocabulary of data science.

Page 3: What is "data"?

What you will learn here:
• The definition of data.
• The many ways to speak about data.

Page 4: What is "data"?

What is data?
• Definition:
– Originally, “data” is the plural of “datum”, a Latin word.
– A “datum” is a single fact, a single entity, a single point of matter.
– Individual datums are most often called “data points”.
– Data is a collection of data points.
   • We also speak of datasets instead of data (so a dataset is a collection of data points).
– Today, “data” is used in either the singular or the plural form.
   -> “My data is…” is common, but we sometimes still hear “My data are…”
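As a minimal sketch of the vocabulary above, a dataset can be pictured as a simple list whose items are the data points. The grades below are made-up values chosen only for illustration.

```python
# Each value is one data point (a "datum"); the list as a whole is the dataset.
# The grades are made-up illustrative values.
grades = [12.5, 15.0, 9.0, 17.5]

first_data_point = grades[0]   # a single datum
dataset_size = len(grades)     # number of data points in the dataset

print(f"First data point: {first_data_point}")
print(f"The dataset holds {dataset_size} data points")
```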

Page 5: What is "data"?

Examples!
• A date
• A color
• A grade
• An address
• A price
• A number of friends
• A longitude
• An index of poverty
• An item in a catalogue
• A sound frequency
• A list of favorite movies
• A movie
• A number of clicks on a web page
• A duration
• A book
• An author of a book
• A vote at an election
• A still image
• A measurement of CO2
• A response to a consumer survey
• A purchase receipt
• A curriculum vitae
• Your blood pressure

Page 6: What is "data"?

Data or Metadata?
• Metadata: data that describes other data.
• Example:
– The bibliographical reference describing a book.
– Key takeaway: data without metadata can be worthless -> What would you do with a pile of 10,000 books without any indication of their titles, authors, or dates of publication?
– The distinction between data and metadata is not always relevant -> In the alumni network dataset, what is data and what is metadata?

[Figure: labels “The metadata” and “The data”]
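To make the book example concrete, here is a small sketch in which the text of a book is the data and its bibliographical reference is the metadata. The record layout and the `describe` helper are assumptions made for this illustration, not something taken from the course.

```python
# A hypothetical book record: the text is the data,
# the bibliographical reference is the metadata that describes it.
book = {
    "data": "Call me Ishmael. Some years ago...",
    "metadata": {
        "title": "Moby-Dick",
        "author": "Herman Melville",
        "year": 1851,
    },
}


def describe(record):
    """Build a one-line reference using the metadata only."""
    meta = record["metadata"]
    return f'{meta["author"]}, {meta["title"]} ({meta["year"]})'


print(describe(book))
```

Without the "metadata" part, the record would be a block of text with no way to tell which book it comes from, which is the point of the 10,000 books question.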

Page 7: What is "data"?

Data: how to talk about it
• Example of a data point -> “Four more years. http://t.co/bAJE6Vom”

• This textual data is in digital form (as opposed to analog), because it is stored in bits on a computer, not handwritten on a piece of paper.
• The tweet is textual (as opposed to numerical; in programming, text can also be called a string) -> this is the type (or format) of the data.
• The tweet appears as plain text. “Plain text” is one sort of format for text; other formats are JSON, XML or CSV -> this is the format of the data.
• The text of the tweet is encoded in UTF-8 -> this is the encoding of the data.
• The tweet is part of a list of tweets I collected -> this is the data structure.
• The tweet is stored in a Word file on my laptop -> this is the format of the data.

Notice the ambiguity in the terminology!
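The terms on this slide can be tried out on the tweet itself. The sketch below is only an illustration: the second tweet in the list is a made-up placeholder, and JSON is picked as one of the formats the slide mentions.

```python
import json

tweet = "Four more years. http://t.co/bAJE6Vom"

# Type: the tweet is text, which Python calls a string (str).
print(type(tweet).__name__)

# Encoding: the same text turned into bytes using UTF-8.
encoded = tweet.encode("utf-8")
print(len(tweet), "characters ->", len(encoded), "bytes")

# Data structure: the tweet as one item in a list of tweets.
# The second item is a made-up placeholder.
tweets = [tweet, "Another tweet (placeholder)"]

# Format: the same list written out as JSON text instead of plain text.
as_json = json.dumps(tweets)
print(as_json)
```

The content of the tweet never changes here; only the way we describe it (type, encoding, structure, format) does.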

Page 8: What is "data"?

Data stored in tables: vocabulary
• Rows, or lines: each represents a data point.
• Columns: each represents an attribute of the data.
• Header: the names of the attributes.
• A value (can be empty).
• A spreadsheet, or a table: this is still the most common way to represent a dataset.
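The table vocabulary can be illustrated with a tiny CSV table read with Python's standard `csv` module. The names, cities and years below are made up for the example.

```python
import csv
import io

# A tiny made-up table in CSV form: the first line is the header.
raw = "name,city,graduation_year\nAlice,Lyon,2012\nBob,,2014\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)   # the names of the attributes (columns)
rows = list(reader)     # each row is one data point

print("Header:", header)
for row in rows:
    # Pair each value with the name of its column.
    print(dict(zip(header, row)))

# A value can be empty: Bob has no city recorded.
empty_cells = sum(1 for row in rows for value in row if value == "")
print("Empty values:", empty_cells)
```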

Page 9: What is "data"?

Data and size
• The size of data gives an idea of what can be done with it and the challenges it might pose.
• The size of a dataset can be expressed as a number of data points.
– Data points are often called lines because we store them as lines in a spreadsheet.
• Or the size can be expressed in terms of the storage space the data takes on a computer drive (see next slide).
– A dataset with 23,000 lines and 16 columns takes ~ 2.6 MB when saved as an Excel file.
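The two ways of expressing size can be compared on a small generated dataset. This is a sketch under assumptions: the column names and values are invented, and the data is written as CSV text, so the byte count would differ from that of an Excel file holding the same table.

```python
import csv
import io

# Generate a made-up dataset: 1,000 lines and 3 columns, written as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "price", "clicks"])   # header
for i in range(1000):
    writer.writerow([i, 9.99, i % 50])

text = buffer.getvalue()

# Size as a number of data points (the header line is not a data point).
n_rows = len(text.splitlines()) - 1

# Size as storage space: the number of bytes once the text is encoded in UTF-8.
n_bytes = len(text.encode("utf-8"))

print(f"{n_rows} data points, {n_bytes} bytes")
```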

Page 10: What is "data"?

Bytes!
• 1 bit: can store a yes / no value.
• 8 bits = 1 byte (or octet): can store a single letter.
• ~ 1,000 bytes = 1 kilobyte (KB): can store a paragraph.
• ~ 1 million bytes = 1 megabyte (MB): can store a low-res picture.
• ~ 1 billion bytes = 1 gigabyte (GB): can store a movie.
• ~ 1 trillion bytes = 1 terabyte (TB): can store 1,000 movies. Size of commercial hard drives in 2014.
• ~ 1,000 trillion bytes = 1 petabyte (PB): 20 PB = Google Maps in 2013.

[Slide annotation: “Most firms today”]
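The scale above can be turned into a small helper that expresses a byte count in the most readable unit. The function name `human_size` is hypothetical, and the sketch assumes the rounded decimal steps of 1,000 used on the slide, with units written as KB, MB, GB, TB and PB.

```python
def human_size(n_bytes):
    """Express a byte count using steps of 1,000, as on the slide."""
    units = ["bytes", "KB", "MB", "GB", "TB", "PB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the number is below 1,000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


# A single unaccented letter takes 1 byte (8 bits) in UTF-8.
letter_bytes = len("a".encode("utf-8"))

print(letter_bytes, "byte =", letter_bytes * 8, "bits")
print(human_size(2_600_000))   # the Excel file from the previous slide
print(human_size(10**12))      # one trillion bytes
```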

Page 11: What is "data"?

Much more…
• Do the readings for Week 1.
• Watch the video on big data, also in Week 1.
• Start following #bigdata and #dataanalytics on Twitter.