SMART Seminar Series: "From Big Data to Smart Data"

Jie (Jack) Yang | April 2016

Upload: smart-infrastructure-facility

Post on 10-Jan-2017


Page 1: SMART Seminar Series: "From Big Data to Smart data"

From Big Data to Smart Data

Jie (Jack) Yang | April 2016

Page 2

Outline

— What is Big Data?
— Challenge of Big Data processing
— Smart Learning framework
— Applications
— Conclusions

Page 3

Big Data definition

— No single standard definition
— 5-V (volume, velocity, variety, veracity, value) information assets that require innovative techniques, algorithms, and analytics to enable decision making and process automation

Page 4

1 – Scale (Volume)

— 12+ TB of tweet data every day
— 25+ TB of log data every day
— ? TB of data every day
— 2+ billion people on the Web by end of 2011
— 30 billion RFID tags today (1.3 billion in 2005)
— 4.6 billion camera phones worldwide
— 100s of millions of GPS-enabled devices sold annually
— 76 million smart meters in 2009, 200 million by 2014

Page 5

2 – Speed (Velocity)

The ability to manage, analyse, summarise, visualise, and discover knowledge from the collected data in a timely and scalable manner

— Social media and networks (millions of active users)
— Mobile devices (tracking objects all the time)
— Infrastructure sensors and/or instruments (measuring all kinds of data)

Page 6

3 – Complexity (Variety)

Various formats, types and structures:

— Text
— Numerical
— Multi-dimensional arrays
— Images, audio, video, sequences
— Time series
— Graph (network)
— Streaming data
— etc.

Page 7

4 – Uncertainty (Veracity)

Page 8

5 – Benefit (Value)

Value ($, time, performance)

Page 9

A simple example

Beer & Diaper (Woolworths in Illawarra)

“A number of convenience store clerks noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed.”
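The beer-and-diaper story is usually told as an instance of association rule mining over receipts. The slide does not say how the receipts were mined, so the following is only a minimal sketch of the standard support, confidence, and lift measures for item pairs. The receipts, the function name `pair_rules`, and the `min_support` threshold are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within one receipt
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of receipts containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]        # P(y | x)
            lift = confidence / (item_counts[y] / n)   # P(y | x) / P(y)
            rules.append((x, y, support, confidence, lift))
    return rules


# Hypothetical receipts, invented for illustration only.
receipts = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "chips"],
]

for x, y, support, confidence, lift in pair_rules(receipts):
    print(f"{x} -> {y}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is the kind of signal the anecdote describes.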

Page 10

Challenge of Big Data processing

Hardware

— Choose machines
— System failure

16 cores, 32 GB RAM, AUD $6,000+

Page 11

Challenge of Big Data processing

Software

— Different data sources
— Really slow
— Memory issue (out of memory, for 52 million records)
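One common way around an out-of-memory failure of this kind is to stream records instead of loading them all at once. The sketch below assumes the records arrive as CSV text, and the column names and the helper `count_by_column` are hypothetical. It is not taken from the talk.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count the values of one column while reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):   # DictReader iterates lazily over the input
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a large file; with a real file, pass the open file object.
sample = io.StringIO("user_id,activity\n1,read\n2,write\n1,read\n")
print(count_by_column(sample, "activity"))
```

Memory use then depends on the number of distinct values being counted, not on the number of records.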

Page 12

Smart Learning Framework

— Data harvesting
— Data partners
— Data mining
— Data storage
— Data streaming
— Data visualisation

Page 13

Smart Learning Framework

Hardware

— Money-wise
— Tolerance to hardware failure

16 cores, 32 GB RAM, AUD $6,000+
4 cores, 8 GB RAM, AUD $600+
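The two machine specifications can be compared with simple arithmetic. The sketch below treats the quoted prices as lower bounds (both carry a "+"), and the idea of matching the large machine's capacity with four small ones is an illustrative assumption, not something the slide states.

```python
def cost_per_core(price_aud, cores):
    """Price divided by core count, in AUD."""
    return price_aud / cores


def cluster(machine, count):
    """Total cores, RAM and price for `count` identical machines."""
    return {key: value * count for key, value in machine.items()}


big = {"cores": 16, "ram_gb": 32, "price_aud": 6000}   # figures from the slide
small = {"cores": 4, "ram_gb": 8, "price_aud": 600}    # figures from the slide

print(cost_per_core(big["price_aud"], big["cores"]))      # AUD per core, large machine
print(cost_per_core(small["price_aud"], small["cores"]))  # AUD per core, small machine
print(cluster(small, 4))                                   # four small machines together
```

On these figures, four small machines match the large machine's 16 cores and 32 GB of RAM for less than half the price, and a single hardware failure removes only a quarter of the capacity.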

Page 14

Data harvesting

Main features

— Collection across different platforms and formats
  • APIs
  • Web crawling
— 1 master and 6 workers
  • Distributing–working–waiting–reactivating process
— Data volume (per day)
  • 20K+ user activity records
  • 25K+ records from social platforms
  • 200K+ tweets around AU and EU
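The master/worker cycle above can be sketched with a shared task queue. This is only an illustration using threads in one process. The talk does not describe the actual implementation, and `fetch`, `worker`, and `harvest` are hypothetical names, with `fetch` standing in for a real API call or crawl.

```python
import queue
import threading

NUM_WORKERS = 6  # the slide reports 1 master and 6 workers


def fetch(source):
    """Placeholder for an API call or web crawl; returns a fake record."""
    return {"source": source, "status": "ok"}


def worker(tasks, results):
    while True:
        source = tasks.get()         # waiting: blocks until the master hands out work
        if source is None:           # sentinel: no more work, so the worker stops
            tasks.task_done()
            break
        results.put(fetch(source))   # working
        tasks.task_done()            # loop back and wait to be reactivated


def harvest(sources):
    """Master: distribute sources to workers and collect their records."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(NUM_WORKERS)
    ]
    for thread in threads:
        thread.start()
    for source in sources:           # distributing
        tasks.put(source)
    tasks.join()                     # wait until every distributed task is done
    for _ in threads:
        tasks.put(None)              # one stop signal per worker
    for thread in threads:
        thread.join()
    return [results.get() for _ in range(results.qsize())]


records = harvest(["api-a", "api-b", "crawl-c"])
print(len(records))
```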

Page 15

Data storage

Main features

— Save data in different formats
  • Plain TXT / CSV
  • (No)SQL
— Query across all sources
— Fast response

SELECT *
FROM (SELECT * FROM /web/logs/CSV) t0
JOIN (SELECT country, count(*) FROM mysql.web.users GROUP BY country) t1
JOIN (SELECT timestamp FROM s3.root.clicks.json WHERE user_id = 'jdoe') t2

Page 16

Data streaming

Main features

— Preprocessing (filtering, cleansing, feature extraction)
— Event simulation
— Saving to DBs
— Running ML jobs on the fly
  • Receiver throughput = 3 kb/sec
  • Consumer throughput = 2 kb/sec
  • Consumer latency = 0.23 sec
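The preprocessing step (filtering, cleansing, feature extraction) can be illustrated with a generator that handles one record at a time. The record format and the chosen features are assumptions made for this sketch, and the streaming engine used in the talk is not named in the slides.

```python
def preprocess(stream):
    """Filter, cleanse, and extract simple features from raw text records."""
    for raw in stream:
        text = raw.strip()
        if not text:                              # filtering: drop empty records
            continue
        text = " ".join(text.lower().split())     # cleansing: lower-case, collapse whitespace
        yield {                                   # feature extraction
            "text": text,
            "n_words": len(text.split()),
            "n_chars": len(text),
        }


# Hypothetical incoming records.
incoming = ["  Hello   World ", "", "Big DATA"]
for record in preprocess(incoming):
    print(record)
```

Because the generator yields records as they arrive, the same function works on a finite list or on an unbounded source.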

Page 17

Data mining

Main features (35 online training jobs per day)

— Supervised (with a human assisting in classification) and unsupervised machine learning techniques for classification, clustering and prediction
— Geospatial analysis: K-pop clusters in geographical regions
— Network analysis to understand social connections between consumers and producers
— Other analysis, including:
  • More sophisticated number crunching of comments, such as time series analysis to examine trends
  • Natural language processing techniques to assist with sentiment analysis
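As an illustration of the unsupervised clustering mentioned above, here is a plain k-means sketch on two-dimensional points. The slides do not name the clustering algorithm actually used, so k-means is a stand-in. The points are invented, and treating coordinates as planar is a simplification that would not hold for real latitude/longitude data over large areas.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on (x, y) tuples; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical locations forming two obvious groups.
locations = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, clusters = kmeans(locations, k=2)
print(centroids)
```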

Page 18

Application 1

Student behaviour analysis (OLPC, until Feb 2016):

— 153+ schools
— 20K+ active laptops
— 4.2M+ activity records

[Figures: Most popular apps (per school); App usage (per school)]

Page 19

Application 2

Car parking

Page 20

Application 2

Car parking

— Every 2 minutes
— 604,800 records (May to Oct 2015)
— Temporal and spatial features
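Temporal features for readings taken every 2 minutes could be derived from each timestamp along these lines. The slides do not list the features actually used, so the feature names below are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def temporal_features(ts):
    """Derive simple time-based features from one reading's timestamp."""
    return {
        "hour": ts.hour,
        "minute_of_day": ts.hour * 60 + ts.minute,
        "day_of_week": ts.weekday(),        # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,
    }


# Three hypothetical readings, 2 minutes apart, starting in May 2015.
start = datetime(2015, 5, 1, 8, 0)
for i in range(3):
    print(temporal_features(start + timedelta(minutes=2 * i)))
```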

Page 21

Application 2

[Figure: Average classification accuracy (%) as a function of the size of the selected samples]

[Figure: Average computational time (seconds)]

Page 22

Application 3

Social media analysis

— 70K+ films
— 228K+ users (2M+ friendships)
— 1M+ reviews
— 13 features

Page 23

Application 3

— User profile vs film preference
— User profile vs topics

Page 24

Application 3

— Network analysis
— Opinion leadership

4K nodes + 7K edges
76 nodes + 253 edges
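One simple proxy for opinion leadership in a friendship network is degree centrality, the share of other users a given user is directly connected to. The slides do not say which measure was used, so this is an assumed stand-in, and the edge list is invented.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:               # ignore self-loops
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical friendships between four users.
friendships = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
scores = degree_centrality(friendships)
print(max(scores, key=scores.get))   # the best-connected user in this toy graph
```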

Page 25

Publications

Jie Yang, Jun Ma. A structure optimization algorithm of neural networks for large-scale data sets. Fuzz-IEEE, 2014.
Jie Yang, Jun Ma. A Sparsity-Based Training Algorithm for Least Squares SVM. IEEE SSCI, 2014.
Jie Yang, Jun Ma. A big-data processing framework for uncertainties in Transportation data. Fuzz-IEEE, 2015.
Jie Yang, Jun Ma, Sarah K. Howard. A Structure Optimization Algorithm of Neural Networks for Pattern Learning from Educational Data. Springer Studies in Computational Intelligence ANN Modelling, 2015.
Jie Yang, Jun Ma. A hybrid gene expression programming algorithm based on orthogonal design. International Journal of Computational Intelligence Systems, 2015.
Jie Yang, Brian Yecies. Mining Chinese Social Media UGC: A SmartLearning Framework For Analyzing Douban Movie Reviews. Journal of Big Data, 2016.
Jie Yang, Jun Ma. A structure optimization framework for feed-forward neural networks using sparse representation. Knowledge-Based Systems, 2016.
Jie Yang, Jun Ma, Sarah K. Howard. Exploring Technology Integration in Education using Fuzzy Representation and Feature Selection. Fuzz-IEEE, 2016.
Brian Yecies, Jie Yang, Matthew Berryman, Kai Soh. Marketing Bait: Using SMART Data to Identify E-guanxi Among China’s ‘Internet Aborigines’. Film Marketing in a Global Era, 2015.
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh. Korean Female Writer-Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives. Women Screenwriters: An International Guide, 2015.
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh. Korean Female Writer–Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives. Participations: International Journal of Audience Research, 2016.
Sarah K. Howard, Jun Ma, Jie Yang, Kate Thompson. The use of data mining to explore factors of technology integration in learning and teaching. EARLI, 2015.
Sarah K. Howard, Ellie Rennie, Jun Ma, Jie Yang. Big Data, Big Theory: Moving Beyond New Empiricism to Generate Powerful Explanations. The New Data “Revolution” in Sociology, 2016.
Jun Ma, Jie Yang, Rohan W. Denagamage, Murad Safadi. A Conceptual Model for Clustering Local Government Areas using Complex Fuzzy Sets. Fuzz-IEEE, 2016.

Page 26

Projects and grants

— OLPC (ARC-Linkage)
— NSW-DER
— CAAR
— China-South Korean Foundation
— Healthcare (PubMed, SEER)
— Tourism business project (UTS)
— MTR

Page 27

Conclusions

— Big Data processing:
  • Data collection, streaming data, data storage, and machine learning
  • Open source libraries
— Other domains:
  • Public transportation
  • Business intelligence
  • Health care

Page 28

Thank you