big data and internet thinking - sjtuwuct/bdit/slides/lec8.pdfanalytical technology, resources, and...

Big Data and Internet Thinking Chentao Wu Associate Professor Dept. of Computer Science and Engineering [email protected]


Post on 31-Mar-2020


TRANSCRIPT

Page 1: Big Data and Internet Thinking - SJTUwuct/bdit/slides/lec8.pdfanalytical technology, resources, and skills. Big Data Analytical Capabilities •Continuing increases in processing capacity

Big Data and Internet Thinking

Chentao Wu
Associate Professor

Dept. of Computer Science and Engineering
[email protected]

Page 2

Download lectures

• ftp://public.sjtu.edu.cn

•User: wuct

•Password: wuct123456

•http://www.cs.sjtu.edu.cn/~wuct/bdit/

Page 3

Schedule

• lec1: Introduction on big data, cloud computing & IoT

• lec2: Parallel processing framework (e.g., MapReduce)

• lec3: Advanced parallel processing techniques (e.g., YARN, Spark)

• lec4: Cloud & Fog/Edge Computing

• lec5: Data reliability & data consistency

• lec6: Distributed file system & object-based storage

• lec7: Metadata management & NoSQL Database

• lec8: Big Data Analytics

Page 4

Collaborators

Page 5

Contents

1. Big Data Analytics

Page 6

Big Data Challenges

Page 7

It’s not just about the data…

Big Data + Big Data Analytics

• Big Data: refers to the DATA only
• Big Data Analytics: methods of using Big Data to generate insight

1. Machine Learning/Deep Learning: Leveraging a computer’s ability to learn without being explicitly programmed to solve business problems
2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them
3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features
4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics
5. Analyzing Data @ Scale: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data
6. Creating a Lake Streaming Consumer Behavior Data: Mining social data in real time to understand when and where consumers are making choices

• It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and ‘Big Data Analytics’.

Page 8

It’s also about what, how, and why you use it

• Big Data Analytics – the process of harnessing Big Data to yield actionable insights – is a combination of five key elements:

• Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.

• Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.

• Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.

• Technology: To store, manage, and use Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.

• Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.

Page 9

Big Data Analytical Capabilities

• Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data.

(Figure: capabilities arranged along two axes, Traditional to Emerging and Structured to Unstructured.)

• A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
• Sentiment Analysis: Extract consumer reactions based on social media behavior
• Complex Event Processing: Combine data sources to recognize events
• Predictive Modeling: Use data to forecast or infer behavior
• Regression: Discover relationships between variables
• Time Series Analysis: Discover relationships over time
• Classification: Organize data points into known categories
• Simulation Modeling: Experiment with a system virtually
• Spatial Analysis: Extract geographic or topological information
• Cluster Analysis: Discover meaningful groupings of data points
• Signal Analysis: Distinguish between noise and meaningful information
• Visualization: Use visual representations of data to find and communicate info
• Network Analysis: Discover meaningful nodes and relationships on networks
• Optimization: Improve a process or function based on criteria
• Deep QA: Find answers to human questions using artificial intelligence
• Natural Language Processing: Extract meaning from human speech or writing

Page 10

Forward-Looking vs. Rear-View Analytics

• Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future.

(Figure axes: Increasing Sophistication of Data & Analytics, from Rear-view to Forward-looking; Increasing Business Value.)

• Descriptive Analytics: What happened? Describe, summarize and analyze historical data
• Diagnostic Analytics: Why did it happen? Identify causes of trends and outcomes
    • Observed behavior or events
    • Non-traditional data sources such as social listening and web crawling
    • Statistical and regression analysis
    • Dynamic visualization
• Predictive Analytics: What could happen? Predict future outcomes based on the past
• Prescriptive Analytics: What should be done? Recommend ‘right’ or optimal actions or decisions
• Continuous Analytics: How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously

Data sources and techniques listed on the slide:

• Observed behavior or events
• Non-traditional data sources such as social listening and web crawling
• Forward-looking view of current and future value
• Sentiment Scoring
• Graph analysis and Natural Language Processing to identify hidden relationships and themes
• Dual objective models
• Behavioral economics
• Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
• Rapid evaluation of multiple ‘what-if’ scenarios
• Optimization decisions and actions
• Monitor results on a continuous basis
• Dynamically adjust strategies based on changing environment and improved predictions
• Agent-based and dynamic simulation models, time-series analysis

Page 11

Examples of Big Data Analytics in Action

• Market leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.

Columns: Company | Business Need | Big Data Analytics | Impact (company names are not present in the transcript)

1. Business need: Greater tailoring of credit card offers to fit customer needs
   Data and analytics: Statistical model based on public credit and demographic data to target customized products to customers
   Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics

2. Business need: Data-enabled engine prognostics, monitoring, maintenance and repair
   Data and analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
   Impact: Over 70% of annual revenue from the aircraft engine division attributable to this service

3. Business need: Search-to-purchase conversion by anticipating intent of a shopper’s search and delivering relevant results
   Data and analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
   Impact: Increases the likelihood that a customer will complete their purchase by 10-15%, translating to millions of dollars in revenue

4. Business need: Transformation from subscription streaming service to original content producer
   Data and analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
   Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013

5. Business need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
   Data and analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
   Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years

Page 12

Big Data Analytics in Development

• Big Data Analytics is making an equally impressive impact on Development interventions – allowing decision-makers to reach and serve previously neglected populations.

Columns: Company | Business Need | Big Data Analytics | Impact (organization names are not present in the transcript)

1. Business need: More transparent, reliable, and low-cost method to track inflation in Argentina
   Data and analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
   Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.

2. Business need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
   Data and analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
   Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior

3. Business need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
   Data and analytics: The city combines data from 30 city agencies – including weather, satellite, video, GPS, historic rainfall, and topographic survey data – in a central Operations Center
   Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis

4. Business need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
   Data and analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ credit worthiness, and incubate new mobile businesses with greater predictors of success
   Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information

Page 13

Big Data Landscape

Page 14

Creating a Big Data Organization
Step 1: Be Yourself

• Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.

(Figure labels: Strategic, Tactical, Operational; Value enablement, Value enhancement.)

Day to day operations
• Struggle to move from narrow focus on reactive operations to more proactive, comprehensive management of daily operations
• High value for digitization of operational processes across program units
• Often already proficient in traditional business intelligence

Enabling strategy and improving performance
• Use analytics to reduce political divergence and drive consensus
• Real-time analytics to enable quick responses to events
• Use data to develop personalized services
• Need for more objective and higher quality data

Delivering future value
• Data-driven decision-making in real time
• Use analytics to develop new programs/opportunities
• Relies heavily on data supplied by others
• Often struggles to move away from exclusively intuitive decision-making

Page 15

Creating a Big Data Organization
Step 2: Secure People & Skills

• The competencies required of “data scientists” within an analytics organization or project converge from multiple skill domains.

• Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography

• Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights

• Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights to effectively determine and communicate actionable insights

• Organization-specific Information Knowledge: Organization-specific knowledge about data assets – including enterprise “metadata” – their location and appropriate business context for use in advanced analytics

Page 16

Creating a Big Data Organization
Step 3: Let objectives dictate structure, not vice versa

• How analytics efforts or organizations are structured – whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared – can influence efficiency and impact.

(Figure: three models, Distributed, Federated, and Centralized Analytics, each drawn with ETL, Data Warehouse, BI Applications, Metadata Repository, and Data Mart components at the LOCAL level and/or in a CENTRAL Analytics Competency Center.)

Distributed Analytics
• Objectives: Adopt previously proven practices; highly focused analytics support
• Data Warehouses, Marts, etc.: Deployed locally
• Analytics Tools: Managed locally
• Analytics Staff/Competencies: Placed within individual units

Federated Analytics
• Objectives: Subject area-specific innovations; repeatable models
• Data Warehouses, Marts, etc.: Deployed locally; some data and models shared across groups
• Analytics Tools: Managed locally, but connected to group framework
• Analytics Staff/Competencies: Placed within individual units; skills tailored to specific region or subject matter

Centralized Analytics
• Objectives: Governance; aligning analytics to organization-wide strategy
• Data Warehouses, Marts, etc.: Deployed and managed centrally
• Analytics Tools: Controlled centrally, with units having access to shared resources
• Analytics Staff/Competencies: Placed within central analytics team, available as needed to support individual units

Page 17

The ‘Hub-Spoke’ operating model often serves as a well-synchronized, connected system

(Figure: Sample Hub-Spoke Interaction Model)

1. Central Decision Hub
2. Competency Center (‘Standards’ / ‘Standardization’)
3. Centers of Excellence (Regional)
4. Local ‘Spokes’

Other figure labels: Global Business Strategy; Local Business Operations; Local Adoption of Practices

Page 18

Creating a Big Data Organization
Step 4: Invest in Appropriate Infrastructure

• Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support.

Columns: Objective | Considerations | Impact

Analytics Capabilities: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
• Analysis Type: Dictates performance needs along with data structures and processing architecture
• Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
• Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

Data Variety: Define the data set that will be used for the analysis, including its sources, size, and structure
• Size: Size of data sets introduces the need for scalable infrastructure and performance
• Structure: Variability of source data models and data set structure requires data model flexibility
• Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

Application: Define the timeliness and frequency of the analysis results for reporting and downstream systems
• Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
• Speed: The timeliness of the analysis will impact the need for scalability and performance
• Interfaces: Inbound and outbound interfaces are defined by the use of data and required flexibility

Page 19

Contents

2. Architecture Design

Page 20

Emerging Infrastructure Options

• To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity and performance needs

• Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware
• NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures
• Cloud Computing: Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization

Traditional challenges being addressed…

• Scalability issues
• Big Data set information extraction and queries require large volumes of processing cycles that can quickly scale
• Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data
• Need to combine and link multiple data sources

Page 21

Building an Analytics Organization: Critical Components

Distributed Processing Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

Introduction to Hadoop

• Hadoop is based on work done by Google in the early 2000s (a combination of the Google File System (GFS) and MapReduce)

• Useful for analyzing copious amounts of complex data across multiple data sources

• Distributes data as it is initially stored in the system

• Applications are written in high-level code

• Computation happens where data is stored, whenever possible

• Data is replicated multiple times on the system for increased availability and reliability

Faster and Lower Cost Analysis

Linear Scalability

Greater flexibility

Emerging Infrastructure – Computing/Storage Options
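The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for the example, and each string in `blocks` merely stands in for a chunk of data stored on a different node.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of input text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(blocks):
    # In a real cluster, map_phase would run on the node that already
    # holds each block ("computation happens where data is stored").
    intermediate = []
    for block in blocks:
        intermediate.extend(map_phase(block))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


# Hypothetical input: two blocks of text.
blocks = ["big data big insight", "data drives insight"]
counts = word_count(blocks)
```

Because each block is mapped independently, adding nodes lets more blocks be processed in parallel, which is the intuition behind the linear scalability claim on the slide.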

Page 22

Building an Analytics Organization: Critical Components

Emerging Infrastructure – Storage Options

NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures

NoSQL - Storage Types (in order of increasing data complexity)

• Key-Value Store. Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
• Columnar Store. Pros: Scalability & Flexibility. Cons: Complexity
• Document Store. Pros: Easy to Use. Cons: Scalability
• Graph Store. Pros: Graph Joins. Cons: Flexibility

(Figure label: Solution Examples)

Page 23

Building an Analytics Organization: Critical Components

Emerging Infrastructure – System Options

Cloud Computing The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy as cloud computing can transform your entire organization — people, processes, and systems

Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”

Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs.

The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.

Page 24

Relational Reference Architecture

Page 25

Extended Relational Reference Architecture

Page 26

Non-Relational Reference Architecture

Page 27

Data Discovery: Non-Relational Architecture

Page 28

Business Reporting: Hybrid Architecture

Page 29

Contents

3. Big Data Algorithms

Page 30

Key components of Mahout in Hadoop (1)

Page 31

Key components of Mahout in Hadoop (2)

Page 32

Key Components of Spark MLlib

Page 33

Spark ML Basic Statistics

◼ Correlation: Calculating the correlation between two series of data is a common operation in statistics
➢ Pearson’s Correlation
➢ Spearman’s Correlation

Page 34

Example of Popular Similarity Measurements

◆ Pearson Correlation Similarity
◆ Euclidean Distance Similarity
◆ Cosine Measure Similarity
◆ Spearman Correlation Similarity
◆ Tanimoto Coefficient Similarity (Jaccard coefficient)
◆ Log-Likelihood Similarity

Page 35

Pearson Correlation Similarity

Data:

Missing Data

Page 36

On Pearson Similarity

Three problems with the Pearson Similarity:

1. It does not take into account the number of items on which two users’ preferences overlap (e.g., two users who overlap on only 2 items can score 1, while users who overlap on more items may score lower).

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if all the preference values in either series are identical.

Adding Weighting. Passing WEIGHTED as the 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
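The three problems above can be reproduced with a small plain-Python sketch. The helper names (`pearson`, `user_similarity`) and the rating dictionaries are hypothetical, chosen only for illustration; returning `None` for an undefined correlation is likewise an assumption of this sketch, not the behavior of any particular library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None when undefined."""
    n = len(xs)
    if n < 2:
        return None  # problem 2: a single shared item gives no correlation
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0.0:
        return None  # problem 3: all values in one series are identical
    return sum(a * b for a, b in zip(dx, dy)) / denom


def user_similarity(prefs_a, prefs_b):
    """Correlate two users over the items both have rated (dict: item -> rating)."""
    shared = sorted(set(prefs_a) & set(prefs_b))
    return pearson([prefs_a[i] for i in shared], [prefs_b[i] for i in shared])


# Problem 1: with only two shared items the result is always +1 or -1.
two_items = user_similarity({"a": 1.0, "b": 4.0, "c": 2.0}, {"a": 2.0, "b": 5.0})
# Problem 2: one shared item.
one_item = user_similarity({"a": 3.0, "b": 4.0}, {"a": 5.0, "c": 1.0})
# Problem 3: the first user gave every shared item the same rating.
constant = user_similarity({"a": 3.0, "b": 3.0, "c": 3.0},
                           {"a": 1.0, "b": 2.0, "c": 5.0})
```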

Page 37

Spearman Correlation Similarity

Example for ties

Pearson value on the relative ranks
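As a minimal sketch of "Pearson on the relative ranks", the code below ranks each series and then applies the Pearson formula to the ranks. Giving tied values the average of their positions is one common convention and is an assumption here, since the slide's tie example did not survive extraction; the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of each series."""
    return pearson(average_ranks(xs), average_ranks(ys))


tied_ranks = average_ranks([10, 20, 20, 30])
monotone = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

Because only the ordering matters, any strictly increasing relationship (here, squaring) yields a Spearman value of 1 even though the raw values are not linearly related.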

Page 38

Basic Spark Data Format

Data: 1.0, 0.0, 3.0

// straightforward (dense): every value is listed

// sparse: vector size, indices of the non-zero entries, and the non-zero values

// sparse: vector size, sequence of (index, value) pairs for the non-zero entries
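The three representations described by the comments can be mimicked in plain Python. This is only a sketch of the idea: the helper names and the dict-based layout are assumptions for illustration. In PySpark the corresponding constructors would be `Vectors.dense(1.0, 0.0, 3.0)`, `Vectors.sparse(3, [0, 2], [1.0, 3.0])`, and `Vectors.sparse(3, [(0, 1.0), (2, 3.0)])` from `pyspark.ml.linalg`, which presumably match the code that originally accompanied the comments.

```python
def sparse_from_arrays(size, indices, values):
    """Sparse vector from (size, indices of non-zeros, non-zero values)."""
    return {"size": size, "entries": dict(zip(indices, values))}


def sparse_from_pairs(size, pairs):
    """Sparse vector from (size, sequence of (index, value) pairs)."""
    return {"size": size, "entries": dict(pairs)}


def to_dense(sparse):
    """Expand a sparse vector into a full list, filling gaps with 0.0."""
    return [sparse["entries"].get(i, 0.0) for i in range(sparse["size"])]


dense = [1.0, 0.0, 3.0]                           # every value listed
sv1 = sparse_from_arrays(3, [0, 2], [1.0, 3.0])   # indices + values
sv2 = sparse_from_pairs(3, [(0, 1.0), (2, 3.0)])  # (index, value) pairs
```

The sparse forms store only two of the three entries; the saving becomes significant for high-dimensional vectors that are mostly zero.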

Page 39

Correlation Example in Spark

1.0, 0.0, 0.0, -2.0
4.0, 5.0, 0.0, 3.0
6.0, 7.0, 0.0, 8.0
9.0, 0.0, 0.0, 1.0
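To show what a column-wise correlation over this data produces, here is a plain-Python sketch that treats each row as one observation and correlates every pair of columns. It is not the Spark API; the `pearson` helper and the choice to return NaN for a constant column are assumptions of this example.

```python
import math

# The four rows from the slide; each row is one observation with four features.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]


def pearson(xs, ys):
    """Pearson correlation; NaN when either series has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                      * sum((y - mean_y) ** 2 for y in ys))
    return cov / denom if denom else float("nan")


columns = list(zip(*rows))  # transpose: one tuple per feature
matrix = [[pearson(a, b) for b in columns] for a in columns]
```

The third column is all zeros, so every correlation involving it is undefined; this is the same degenerate case as a user who gives identical ratings to every item.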

Page 40

Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )
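The formula above can be written directly in Python, with d taken as the Euclidean distance over the items both users have rated. The function name, the dict-based input, and the choice to return 0.0 when the users share no items are assumptions made for this sketch.

```python
import math


def euclidean_similarity(prefs_a, prefs_b):
    """Similarity = 1 / (1 + d), with d the distance over co-rated items."""
    shared = set(prefs_a) & set(prefs_b)
    if not shared:
        return 0.0  # assumption: no common items means no evidence of similarity
    d = math.sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + d)


# Hypothetical ratings: the users differ by 3 on item "a" and by 4 on item "b".
sim = euclidean_similarity({"a": 1.0, "b": 5.0}, {"a": 4.0, "b": 1.0, "c": 2.0})
```

The mapping turns a distance in [0, ∞) into a similarity in (0, 1]: identical ratings give 1, and the value shrinks as the users move apart.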

Page 41

Cosine Similarity

Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
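The equivalence can be checked numerically. The sketch below implements cosine similarity and Pearson correlation independently, then compares Pearson on the raw data with cosine on mean-centered data; the sample vectors and function names are made up for illustration.

```python
import math


def cosine(xs, ys):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)


def pearson(xs, ys):
    """Pearson correlation via covariance and variances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def center(xs):
    """Subtract the mean so the series has mean 0."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


xs = [1.0, 2.0, 3.0, 5.0]
ys = [2.0, 1.0, 4.0, 6.0]
raw_cosine = cosine(xs, ys)                       # differs from Pearson
centered_cosine = cosine(center(xs), center(ys))  # matches Pearson
```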

Page 42

Caching User Similarity

Spearman Correlation Similarity is time-consuming to compute. Caching helps: remember each user-user similarity that was previously computed.
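A minimal sketch of the caching idea, using Python's `functools.lru_cache`. The preference data, the `slow_similarity` stand-in (a simple overlap ratio rather than Spearman), and the call counter are all hypothetical and exist only to make the effect of the cache visible.

```python
from functools import lru_cache

# Hypothetical preference data: user -> {item: rating}.
PREFS = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "d": 1.0},
}
computations = {"count": 0}


def slow_similarity(user_a, user_b):
    """Stand-in for an expensive measure: share of items rated by both users."""
    computations["count"] += 1
    items_a, items_b = set(PREFS[user_a]), set(PREFS[user_b])
    return len(items_a & items_b) / len(items_a | items_b)


@lru_cache(maxsize=None)
def _cached_pair(pair):
    return slow_similarity(*pair)


def cached_similarity(user_a, user_b):
    # Sort the pair so (a, b) and (b, a) hit the same cache entry.
    return _cached_pair(tuple(sorted((user_a, user_b))))


first = cached_similarity("u1", "u2")
second = cached_similarity("u2", "u1")  # served from the cache
```

A cache like this must be cleared (here, `_cached_pair.cache_clear()`) whenever the underlying preferences change, otherwise stale similarities are returned.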

Page 43

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values
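Since the preference values are discarded, the coefficient depends only on which items each user has expressed a preference for: the size of the intersection divided by the size of the union. A short sketch, with a hypothetical function name and sample data:

```python
def tanimoto(prefs_a, prefs_b):
    """Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B| over rated items."""
    a, b = set(prefs_a), set(prefs_b)  # keeps the item keys, drops the ratings
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The rating values are ignored; only the item sets {a, b, c} and {b, c, d} matter.
sim = tanimoto({"a": 5.0, "b": 1.0, "c": 3.0}, {"b": 4.0, "c": 2.0, "d": 5.0})
```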

Page 44

Log-Likelihood Similarity

Assess how unlikely it is that the overlap between the two users is just due to chance.
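One common way to quantify "unlikely to be due to chance" is the log-likelihood ratio (G statistic) of a 2x2 table counting items rated by both users, by only one of them, or by neither. The sketch below uses that formulation; mapping the ratio into [0, 1) with 1 - 1/(1 + LLR) is an assumption of this example, as are the function names and the `num_items` parameter (the total number of items in the catalog).

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G statistic of a 2x2 contingency table (0 when rows and columns are independent)."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g = 0.0
    for k, r, c in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:  # an empty cell contributes nothing
            g += k * math.log(k * total / (rows[r] * cols[c]))
    return 2.0 * g


def log_likelihood_similarity(items_a, items_b, num_items):
    """Map the ratio into [0, 1): larger means the overlap is less likely chance."""
    a, b = set(items_a), set(items_b)
    both = len(a & b)
    llr = log_likelihood_ratio(both, len(a) - both, len(b) - both,
                               num_items - len(a | b))
    return 1.0 - 1.0 / (1.0 + llr)


# Hypothetical catalog of 100 items.
strong = log_likelihood_similarity({1, 2, 3}, {1, 2, 3}, 100)  # full overlap
weak = log_likelihood_similarity({1, 2, 3}, {3, 4, 5}, 100)    # one shared item
```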

Page 45

Performance Measurements

Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.

• Spearman: 0.8
• Tanimoto: 0.82
• Log-Likelihood: 0.73
• Euclidean: 0.75
• Pearson (weighted): 0.77
• Pearson: 0.89

Page 46

Spark ML Basic Statistics

• Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ2) tests for independence.

• ChiSquareTest conducts Pearson’s independence test for every feature against the label.

Page 47

Chi-Square Tests (1)

Page 48

Chi-Square Tests (2)

Page 49

Chi-Square Tests (3)

We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.
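The test of independence behind that conclusion can be sketched in plain Python. The counts below are hypothetical (the slide's malaria table did not survive extraction), the function name is illustrative, and the decision uses the standard 5% critical values of the chi-squared distribution rather than a computed p-value.

```python
def chi_square_statistic(table):
    """Pearson's chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are three locations, columns are three malaria types.
observed = [
    [30, 10, 10],
    [10, 30, 10],
    [10, 10, 30],
]
stat, dof = chi_square_statistic(observed)

# Critical values of the chi-squared distribution at the 5% significance level.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}
reject_null = stat > CRITICAL_0_05[dof]
```

With these made-up counts the statistic is far above the critical value for 4 degrees of freedom, so the null hypothesis of independence would be rejected, mirroring the conclusion stated on the slide.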

Chi-Square Tests in Spark

Spark ML Clustering

Example: Clustering

Feature Space

Clustering

Clustering – on feature plane

Clustering example

Steps on Clustering

Making Initial Cluster Centers

K-means Clustering
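A minimal plain-Python sketch of k-means (Lloyd's algorithm) may help here: it alternates between assigning each point to its nearest center and moving each center to the mean of its members. The function name, the sample points, and the initial centers are assumptions for illustration, not the lecture's data.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate the assignment step and the update step."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assignment

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, assignment = kmeans(points, centers=[(1, 1), (8, 8)])
```

The result depends on the initial centers, which is why choosing them gets its own step in the clustering workflow.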

HelloWorld Clustering Scenario Result

Testing different distance measures

Manhattan and Cosine distances

Tanimoto distance and weighted distance
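For reference, the distance measures named on these slides can be sketched in plain Python as follows. The function names are chosen for this example, and the vectors are assumed to be non-zero and of equal length.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    """One minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Cosine distance ignores vector length, so `(1, 2)` and `(2, 4)` are at distance zero, while the Tanimoto distance between the same pair is 1/3 because it also accounts for their different magnitudes.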

Results Comparison

Sample Code of K-Means Clustering in Spark

Vectorization Example

0: Weight
1: Color
2: Size

Canopy Clustering (estimate the number of clusters)

• Tell the algorithm what size of clusters to look for. It will find the number of clusters that have approximately that size.
• The algorithm uses two distance thresholds.
• This method prevents all points close to an already existing canopy from being the center of a new canopy.
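The two-threshold idea can be sketched as below, assuming a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`. The names, the Euclidean distance, and the sample points are assumptions for this example.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_clustering(points, t1, t2):
    """Group points into canopies using thresholds t1 (loose) > t2 (tight)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point seeds a canopy
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)           # loosely close: joins this canopy
            if d >= t2:
                still_candidates.append(p)  # not tightly close: may seed a later canopy
        remaining = still_candidates
        canopies.append((center, members))
    return canopies

# Hypothetical points; the number of canopies suggests a value of k for k-means.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
canopies = canopy_clustering(points, t1=3.0, t2=2.0)
```

Points within `t2` of a center are removed from the candidate list, which is what keeps them from becoming the center of a new canopy.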

Other Clustering Algorithms

Hierarchical clustering

Different Clustering Algorithms

https://github.com/HewlettPackard/cacti

Spark ML Classification

Spark ML Classification

Classification - definition

Classification example: using SVM to recognize a Toyota Camry

Classification example: using SVM to recognize a Toyota Camry

When to use Big Data System for Classification?

The advantage of using Big Data System for Classification

How does a classification system work?

Key Terminology for Classification

Input and Output of a classification model

Four types of values for predictor variables

Sample data that illustrates all four values

Supervised vs. Unsupervised Learning

Work flow in a typical classification project

Classification Example – Color-Fill

Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. Target variable is “color-fill” label.
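Using the x position as the single predictor variable can be sketched as a one-feature threshold classifier (a decision stump). The label names `"filled"` and `"unfilled"`, the function name, and the sample data are hypothetical; the sketch assumes at least two distinct x values.

```python
def fit_threshold(xs, labels):
    """Pick the x threshold and label orientation with the most correct predictions."""
    best = None
    candidates = sorted(set(xs))
    # Try the midpoint between each pair of neighbouring x values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        for left_label, right_label in (("filled", "unfilled"),
                                        ("unfilled", "filled")):
            correct = sum(
                (left_label if x < threshold else right_label) == y
                for x, y in zip(xs, labels)
            )
            if best is None or correct > best[0]:
                best = (correct, threshold, left_label, right_label)
    return best[1:]

# Hypothetical training data: x position of each shape and its color-fill label.
xs = [1, 2, 3, 7, 8, 9]
labels = ["filled", "filled", "filled", "unfilled", "unfilled", "unfilled"]
threshold, left_label, right_label = fit_threshold(xs, labels)
```

Shape never enters the model, which mirrors the observation that it is irrelevant to the target variable.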

Classification Example – Color-Fill (another feature)

Fundamental classification algorithm

Examples of fundamental classification algorithms:

• Naive Bayesian
• Complementary Naive Bayesian
• Stochastic Gradient Descent (SGD)
• Random Forest
• Support Vector Machines

Choose algorithm

Support Vector Machine (SVM)

maximize boundary distances; remembering “support vectors”

nonlinear kernels
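The margin-maximizing idea can be sketched with a linear SVM trained by sub-gradient descent on the hinge loss. This is a plain-Python illustration under assumed hyperparameters and toy data centered on the origin; it does not cover nonlinear kernels and is not the Spark implementation.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.01, reg=0.01):
    """Minimise hinge loss plus an L2 penalty. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the boundary away.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Safely classified: only shrink w, which widens the margin.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data.
samples = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(samples, labels)
```

Only points with a margin below 1 trigger a boundary update, which loosely corresponds to the "support vectors" the model has to remember.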

Example SVM code in Spark

Contents

4 Tools Support

Data Mining, Text Mining, and Natural Language Processing

Data Mining: Extraction of implicit, previously unknown, and potentially useful information from data.

Text Mining: Analysis of large quantities of natural language text and detection of lexical or linguistic usage patterns to extract probably useful information.

Natural Language Processing: NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.

NLP Tools

OpenNLP
Description: A machine learning based toolkit for the processing of natural language text.
Analysis types:
• Tokenization
• Sentence segmentation
• Part-of-speech tagging
• Named entity extraction
• Chunking, parsing
• Coreference resolution

GATE
Description: A Java suite of tools that can perform natural language processing tasks for multiple languages.
Analysis types:
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Sentence splitter

NLTK
Description: A suite of libraries and programs for symbolic and statistical natural language processing in Python.
Analysis types:
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Word categorization
• Text classification

Stanford NLP
Description: Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs.
Analysis types:
• Tokenization
• Part-of-speech tagging
• Named entity recognition
• Parsing
• Classification
• Segmentation
• Coreference resolution

LingPipe
Description: A tool kit for processing text using computational linguistics.
Analysis types:
• Sentiment analysis
• Entity recognition
• Clustering
• Topic classification
• Part-of-speech tagging
• Sentence detection
• Disambiguation

MontyLingua
Description: A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java.
Analysis types:
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Word categorization
• Text generation
• Stemming
• Phrase chunking

Rosetta Linguistic Platform
Description: A suite of linguistic analysis components that integrate into applications for mining unstructured data.
Analysis types:
• Language identification
• Name, places, and key concept extraction
• Name matching
• Name translation

Text Mining/Analytics Tools

RapidMiner
Description: An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
Analysis types:
• Document classification
• Sentiment analysis
• Topic tracking
• Data mining
• Traditional analytics

SAS Text Miner
Description: A suite of text processing and analysis tools.
Analysis types:
• Text parsing
• Filtering
• Feature extraction
• Topic clustering

VisualText
Description: Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers.
Analysis types:
• Information extraction
• Summarization
• Categorization
• Data mining
• Document filtering
• Natural language search

SAS Sentiment Analysis
Description: Commercial tool that is dedicated to customer sentiment analysis.
Analysis types:
• Customer sentiment monitoring
• Sentiment discovery

Textifier
Description: Tool for sorting large amounts of unstructured text with the Public Comment Analysis Toolkit (PCAT).
Analysis types:
• Topic modeling
• Information retrieval
• Document analysis
• Social media analysis

InfiniteInsight
Description: System for automatically preparing and transforming unstructured text attributes into a structured representation.
Analysis types:
• Term frequency
• Term frequency-inverse document frequency
• Root word coding
• Synonym identification
• Customization of stop words
• Stemming rules
• Concepts merging

Clustify
Description: Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization.
Analysis types:
• Document clustering
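Several of the analysis types listed for these tools (term frequency, inverse document frequency, stop words) can be illustrated with a small plain-Python sketch. The stop-word list, function names, and sample documents are assumptions for this example and do not reflect how any listed tool is implemented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # hypothetical stop-word list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus.
docs = ["big data analytics", "big data storage", "text mining"]
weights = tf_idf(docs)
```

Terms that appear in fewer documents, such as "analytics" here, receive a higher weight than terms shared across documents, such as "big".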

Text Mining/Analytics Tools Cont.

Attensity Analyze
Description: Customer analytics applications that help analyze high volumes of customer conversations across multiple channels.
Analysis types:
• Unstructured communication analysis
• Sentiment analysis
• Consumer profiling

ReVerb
Description: A program that automatically identifies and extracts binary relationships from English sentences.
Analysis types:
• Information extraction
• Topic identification
• Topic linking

Open Text Summarizer
Description: Open source tool for summarizing texts.
Analysis types:
• Document summarization

Open Calais
Description: Web based API that is used to analyze content and extract topics or information.
Analysis types:
• Attribute/feature extraction
• Fact identification

KnowledgeSearch
Description: Family of techniques and tools for searching and organizing large data collections.
Analysis types:
• Semantic analysis

KH Coder
Description: Free software for quantitative content analysis or text mining.
Analysis types:
• Text parsing
• Document search
• Network analysis

Image Analytics Overview

Overview

• The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis

• Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information

Advantages

• Provides a method to structure, organize, and search information that is stored within images

• Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content

Image Analytics Tools

Tool | Overview | Image Processing | Computer Vision | Machine Learning

OpenCV | Open source library of computer vision functions that is accessible via C, Java, and Python | X X X

PAXit Image Analysis | Integrated image analysis platform that provides basic feature identification functions | X X

ImageJ | Java based image processing platform that can be accessed via an API and expanded with custom plugins | X

PIL | Python image processing library | X

PyBrain | A modular machine learning library for Python | X

Audio Analytics Overview

Overview

• The process of capturing audio and analyzing its features so as to extract the content and context of an event

• Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques

Advantages

• Provides a method for identifying events or common patterns within sound bites

• Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context

Audio Analytics Tools

Tool | Overview | Audio Processing | Information Retrieval

Clam | A C++ library that provides varying levels of audio processing and information retrieval capabilities | X X

CallMiner | A tool that is capable of translating calls to a more structured text data set and combining it with other communication forms | X

Nuance | Logs calls and structures audio for text based search and retrieval | X

yaafe | Audio feature extraction toolkit with wrappers for several languages | X

PRAAT | Multiple platform audio analysis toolkit | X

Social Network → Applications (1)

Collaboration Analysis
Analysis: Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures
Objectives:
• Identify team structures that are not effective
• Identify informal organizational structures
• Identify individuals/roles or groups that are influential to collaborative work environments

Content/Knowledge Management
Analysis: Evaluate how knowledge or content is diffused and accessed within an organization
Objectives:
• Improve content and knowledge distribution
• Identify content bottlenecks, open communication flows, and establish channels
• Explore impact of new communication methods

Community Mining
Analysis: Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks
Objectives:
• Improved structures for key organizational functions
• Improved information flows
• Identify potential bottlenecks for organizational functions
• Identify cultural patterns to build other communities

Organization Development
Analysis: Explore formal and informal organization structures and how individuals work with one another to improve the design of the organization
Objectives:
• Improve hierarchy and structure of organization to better align with the informal practices
• Identify team members that are effective leaders and would impact the organization if promoted

Social Network → Applications (2)

Disaster Recovery Planning
Analysis: Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans
Objectives:
• Identify communication improvements to disaster recovery teams
• Identify weak links among functional groups to improve collaboration during recovery plan execution

Data/Information Dissemination
Analysis: Assess how data points or information sets originate or are distributed across the enterprise to their intended targets
Objectives:
• Identify overlapping information sets and bottlenecks for information dissemination
• Assess how organization structures or information architecture impact the flow of information to its targets

Fraud Detection / Prevention
Analysis: Assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity
Objectives:
• Identify network agents that collaborate with known fraudulent agents
• Identify activities that align with known fraudulent behavior

Process Discovery / Improvement
Analysis: Analyze the organization structure and communication patterns to uncover process improvements or identify new processes
Objectives:
• Identify process improvements through discovery of hidden process steps, communication flows, and actors
• Discover undocumented or informal processes that are hidden within frequent collaboration and communication paths

Supply Chain Analysis
Analysis: Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies
Objectives:
• Identify communication gaps that could impact dependent process or operations
• Identify strategic relationships to optimize the supply network
• Identify supply nodes that create inefficiencies

Social Network → Applications (3)

Novelty/Sentiment Diffusion Analysis
Analysis: Observe how a specific topic, news article, or sentiment diffuses through a consumer network
Objectives:
• Assess how target consumers/market will react to a piece of news or campaign
• Evaluate how long news, data, or sentiment will be retained within a system and how far it will spread

Market Influencer Identification
Analysis: Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities
Objectives:
• Identify individuals or groups that influence markets and adoption
• Identify untapped markets
• Identify market segments as targets for ad campaigns to improve product/service adoption

Consumer Segmentation
Analysis: Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics
Objectives:
• Improve product or service offerings based on attributes that connect the consumer market
• Develop strategies to target new or existing consumers based on identified segmentation characteristics

Product or Brand Diffusion Analysis
Analysis: Analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse
Objectives:
• Identify segments or individuals that will be likely early adopters
• Identify incentives or campaigns that will improve product/service adoption

Recommendation Systems
Analysis: Analyze consumer network connections and common features among consumers to develop recommendations
Objectives:
• Identify new feature sets for products and services
• Assess new markets for selling similar or new products
• Target consumers with specific products or services
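One simple way to approach influencer identification in a network is degree centrality: the fraction of other nodes a node is directly connected to. The sketch below is a plain-Python illustration with a hypothetical edge list; it assumes an undirected graph without self-loops and is only one of many possible centrality measures.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical social network: each pair is a connection between two people.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
```

Here `most_connected` is the node with the most direct connections, a first approximation of an influential individual in the network.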

Social Network → Tools (1)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

SNAP | A general purpose network analysis and graph mining library for C++ | X X

Statnet | A package for R that provides capabilities for social network statistical analysis | X

libSNA, graphTool, networkX | Python libraries for network analysis and manipulation | X X

JUNG | Java package for network analysis and modeling | X X X

NodeXL | Excel plug-in that provides an easy to use and interactive interface to explore and visualize networks | X X

Social Network → Tools (2)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

GEPHI | Interactive open source platform for network analysis and visualization | X X X

Ucinet | Commercial social network analysis tool with separate visualization component | X X

Graphviz | Open source graph visualization package | X

NetMiner | Proprietary package that provides the ability to develop and implement custom algorithms | X X X

kxen SNA | Network analysis package that provides predictive analytics and customer MDM integration | X X X

ProM | Open source package for mining business process networks | X X X

Cytoscape | Open source tool for network modeling and analysis. Can connect to external data sources | X X X

Network Workbench | Large-Scale Network Analysis, Modeling and Visualization Toolkit for Biomedical, Social Science and Physics Research | X X X

Contents

5 Deep QA/Mind/Brain Systems

DeepQA/Mind/Brain

What is DeepQA?

• DeepQA forms the core of Watson, the open domain question analysis and answering system

• The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms

• DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture

What is the target problem set?

• Understanding the meaning and context of human language

• Searching and retrieving information from a large library of unstructured information

• Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set

DeepQA Infrastructure Technology

Data Management and Search

• Unstructured Information Architecture: UIMA
• SQL Server: MySQL, Apache Derby
• Java Natural Language Toolkit: OpenNLP, Stanford NLP
• Map/Reduce: Apache Hadoop
• Commonsense Knowledgebase: OpenCYC, Open Mind Common Sense
• Triple Store: Apache Jena, OpenAnzo
• Text Search: Lucene, Open FTS

DeepQA Infrastructure Technology

Platform and Administration

• Web Server: Apache
• Virtualization Host: VMware, Xen
• Distributed File System: Apache Hadoop, OpenAFS
• File Management/Archival: rsync
• OS: Fedora
• Cloud Management: Extreme Cloud Administration, Open Nebula

Business Applications

Knowledge Discovery
Overview: Search internal and external unstructured/structured information assets to uncover previously unknown knowledge
Objectives:
• Identify information about a subject through deep analysis of internal and external information sources
• Answer questions about a business problem or trend that may be difficult to analyze within traditional data sources

E-Discovery
Overview: Search documents and communications to uncover relevant information associated with a specific topic
Objectives:
• Identify business topics and trends within communication and documents
• Search for non-compliance activities within internal and external data sources

Contract Evaluations
Overview: Search through single or multiple contracts to answer specific questions about the nature of the contract
Objectives:
• Identify key facts or issues that comprise a contract or sets of contracts
• Identify contracts or legal documents that contain similar entities or features

Relationship Management
Overview: Provide the ability to interact with consumers providing precise responses to technical and open domain questions
Objectives:
• Provide a platform for automatically answering consumer questions about products or services
• Reduce reliance on call centers and improve interaction with consumers

Consumer Discovery
Overview: Search consumer communications, social media, and sales information to identify opportunities and demographics
Objectives:
• Identify background information about consumers
• Identify consumer qualities that create risks or represent opportunities

Technical Troubleshooting
Overview: Find answers to technical and process problems through
Objectives:
• Utilize unstructured data and communications to identify solutions or root causes to system and process problems

Areas for Further Research

Infrastructure/Tools and Search Technologies/Concepts

Tools

• Hadoop Map/Reduce: The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).
• OpenNLP: A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA.
• OpenCYC: An open common sense reasoning platform. Need to better understand the tool's role as well as how it fits within the other technologies.
• UIMA: An architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities.
• Lucene: A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.

Search

• Text Search Scoring: Algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.
• Triple Store Search: Triple stores maintain data in a subject-predicate-object structure and are used for turning around quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.
• Commonsense Reasoning: Research is required to understand the branch of AI, technologies and role within DeepQA.
• Document/Information Retrieval: Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
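The subject-predicate-object structure mentioned for triple stores can be illustrated with a tiny in-memory sketch. The `match` function and the predicate names are hypothetical; a real triple store such as those listed earlier would add indexing and a query language.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Facts stated elsewhere in these slides, encoded with made-up predicate names.
triples = [
    ("DeepQA", "core_of", "Watson"),
    ("DeepQA", "uses", "UIMA"),
    ("DeepQA", "uses", "MapReduce"),
]

uses = match(triples, predicate="uses")
```

Leaving a position unset turns it into a wildcard, so a quick fact lookup such as "what does DeepQA use?" becomes a single pattern match.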

Areas for Further Research

Machine Learning and Natural Language Processing

Machine Learning

• MetaLearners: Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results
• Question Classification: Identify techniques and models that can be employed to analyze and classify questions
• Search Ranking Models: Research which models are available for ranking search results based on the various search and recall techniques that are employed for a question

NLP

• Logical Form Analysis: Research how SNA is used to discover logical relationships within text and produce an understanding about the information within the text
• Semantic Structure Analysis: Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search
• Relationship Analysis: Research techniques and tools for uncovering temporal, geospatial and spatial relationships within a knowledge set
• Feature Extraction: Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search
• Phrase Analysis: Identify algorithms and tools that can be applied to extract key phrases from text based on a search context

Thank you!