-
Dinesh Priyankara |
Senior Architect Specialist, Virtusa(Pvt) Ltd
http://dinesql.blogspot.com/
Processing Unstructured Data
-
Dinesh Priyankara | @dinesh_priya
Senior Architect Specialist, Virtusa(Pvt) Ltd.
Microsoft Most Valuable Professional since 2006, Data Platform (SQL Server)
Consultant, Trainer, Speaker
MSc in IT, MCSE, MCDBA
http://dinesql.blogspot.com
-
Agenda
01 | Understanding unstructured data
02 | Introduction to Hadoop and MapReduce
03 | The Microsoft way
04 | Processing unstructured data with Integration Services
05 | Processing unstructured data with Azure Cloud Services
06 | Demo
07 | Q & A
-
01 | Understanding unstructured data
Dinesh Priyankara | Senior Architect Specialist, Virtusa(Pvt) Ltd.
-
Structured Data
Structured data resides in fixed fields within a record or file.
Relational databases and spreadsheets hold structured data.
Always integrated with a schema (model).
The schema defines the structure of the data with data types such as string, integer, date, etc.
The schema defines how data is stored, accessed and processed.
Easy maintainability and data management.
Based on the schema-on-write method.
Managed with the well-known Structured Query Language (SQL).
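
Schema-on-write can be sketched with Python's built-in sqlite3 module. This is only an illustration: the customer table, its columns and the try_insert helper are hypothetical names, not part of the talk. The schema is declared first, and any row that does not fit it is rejected at write time.

```python
import sqlite3


def create_customer_table(conn):
    """Declare the schema before any data is written (schema-on-write)."""
    conn.execute(
        """
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT    NOT NULL,
            age  INTEGER NOT NULL CHECK (typeof(age) = 'integer')
        )
        """
    )


def try_insert(conn, name, age):
    """Return True if the row fits the schema, False if the write is rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (name, age) VALUES (?, ?)", (name, age)
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
create_customer_table(conn)

accepted = try_insert(conn, "Alice", 34)             # fits the schema
rejected_type = try_insert(conn, "Bob", "unknown")   # age is not an integer
rejected_null = try_insert(conn, None, 20)           # name is required

rows = conn.execute("SELECT name, age FROM customer ORDER BY id").fetchall()
print(rows)
```

Only the first row reaches the table; the other two never get stored because they violate the declared structure.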
-
Semi-Structured Data
Semi-structured data does not follow a standard model defined with a schema.
Structure is imposed in the form of tags or markers.
Elements can have different sets of attributes even though they belong to one class.
Examples: XML, HTML, JSON, etc.
Considered self-describing data.
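
A small sketch of what "self-describing" means in practice, using Python's json module. The two customer records and their field names are made up for illustration. Both records belong to the same class, yet each carries its own set of attributes, and the keys themselves act as the markers that describe the data.

```python
import json

# Hypothetical records: same class ("customer"), different attributes.
raw = """
[
  {"type": "customer", "name": "Alice", "email": "alice@example.com"},
  {"type": "customer", "name": "Bob", "phone": "555-0100", "tags": ["vip"]}
]
"""

records = json.loads(raw)

# Each record describes itself through its own keys.
attribute_sets = [sorted(record.keys()) for record in records]

# Attributes shared by every record of the class.
common = set(records[0]).intersection(*(set(r) for r in records[1:]))

for attrs in attribute_sets:
    print(attrs)
print(sorted(common))
```

No schema had to be declared before reading the records; the structure was discovered from the tags in the data.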
-
Unstructured Data: The Definition
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner ~ wikipedia
Unstructured data represents any data that does not have a recognizable structure. It is unorganized and raw and can be non-textual or textual ~ techopedia
~ web
-
Unstructured Data
Unstructured data does not reside in a field or record.
No standard model; does not follow a schema.
No specific definition for storing, accessing and processing.
Can be seen as Word documents, audio files, videos, photos, etc.
Might follow a structure internally, but there is no schema, tags, or markers describing the fields of data.
Difficult to process using traditional computer modules.
Has many irregularities and ambiguities.
http://www.informationweek.com/it-life/cartoon-unstructured-data-fatigue/a/d-id/1316534
-
Why is it important?
80%-90% of data is unstructured, and it keeps growing.
Previously unidentified or ignored.
Hidden business insight
Provides holistic view of the business
Provides competitive advantages
Reveals social trends for improving customer satisfaction
Saves time and money
2.5 quintillion bytes of data per day
175 million tweets per day
1.49 billion monthly active FB users
-
Common Unstructured and Semi-structured Data
Sentiment data, mainly from social networks, online reviews and customer support interactions
Clickstream data
Sensor or machine data
Server log data
-
Ways of accessing Unstructured Data
Mainly two methods:
Impose a structure on unstructured data.
Based on the schema-on-read method.
Transform unstructured data into a structured schema.
A permanent structure makes it accessible by traditional computer modules.
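
The two methods can be sketched in Python on a few server log lines. Everything here is an assumption made for illustration: the log layout, the regular expression, the request_log table and the function names. The first function imposes a structure only while reading and leaves the raw text untouched; the second stores the parsed rows in a permanent table that ordinary SQL can query.

```python
import re
import sqlite3

# Hypothetical raw log text in a simplified web-server layout.
RAW_LOG = """\
192.168.1.10 [2015-08-01 10:15:32] GET /index.html 200
192.168.1.11 [2015-08-01 10:15:40] GET /missing.html 404
this line does not match the expected layout
192.168.1.10 [2015-08-01 10:16:02] POST /login 200
"""

# The "schema" applied at read time: named groups become fields.
LINE_PATTERN = re.compile(
    r"(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "
    r"(?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})$"
)


def read_with_schema(raw_text):
    """Method 1 (schema-on-read): yield structured rows, skip lines that do not fit."""
    for line in raw_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            row = match.groupdict()
            row["status"] = int(row["status"])
            yield row


def load_into_table(conn, rows):
    """Method 2: transform the rows into a permanent structured schema."""
    conn.execute(
        "CREATE TABLE request_log "
        "(ip TEXT, timestamp TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO request_log VALUES (:ip, :timestamp, :method, :path, :status)",
        rows,
    )
    conn.commit()


parsed = list(read_with_schema(RAW_LOG))

conn = sqlite3.connect(":memory:")
load_into_table(conn, parsed)
errors = conn.execute(
    "SELECT path FROM request_log WHERE status >= 400"
).fetchall()
print(len(parsed), errors)
```

With schema-on-read the pattern can be changed later without rewriting the stored data, while the table in the second method fixes the structure once so that existing SQL-based tools can use it directly.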