data mining with weka putting it all together

43
TEAM MEMBERS: HAO ZHOU, YUJIA TIAN, PENGFEI GENG, YANQING ZHOU, YA LIU, KEXIN LIU DIRECTOR: PROF. ELKE A. RUNDENSTEINER PROF. IAN H. WITTEN Data Mining with Weka Putting it all together

Upload: nishi

Post on 23-Feb-2016

45 views

Category:

Documents


0 download

DESCRIPTION

Data Mining with Weka Putting it all together. Team members: Hao Zhou, Yujia Tian , Pengfei ‎ Geng , Yanqing Zhou, Ya Liu‎, Kexin Liu Director: Prof. Elke A.  Rundensteiner Prof. Ian H. Witten. Outline. 5.1 The data mining process By Hao 5.2 Pitfalls and pratfalls - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Mining with Weka Putting  it all together

TEAM MEMBERS: HAO ZHOU, YUJIA TIAN, PENGFEI GENG, YANQING ZHOU, YA LIU, KEXIN LIU

DIRECTOR:PROF. ELKE A. RUNDENSTEINERPROF. IAN H. WITTEN

Data Mining with WekaPutting it all together

Page 2: Data Mining with Weka Putting  it all together

Outline

5.1 The data mining process By Hao

5.2 Pitfalls and pratfalls By Yujia, Pengfei

5.3 Data mining and ethics By Yanqing

5.4 Summary By Ya, Kexin

Page 3: Data Mining with Weka Putting  it all together

5.1 The data mining processBy Hao

Page 4: Data Mining with Weka Putting  it all together

5.1 The data mining process

Feel Lucky:- Weka is not everything I need

to talk about in my part (Know how rather than why to use Weka)

Maybe Not so Lucky:- Talking about Weka is time-

consuming. =)

Page 5: Data Mining with Weka Putting  it all together

From Weka to real life

When we use weka for MOOC, we never care about the dataset, as it has been already collected.

Page 6: Data Mining with Weka Putting  it all together

Procedures in real life

Why we do data mining in real life?- for course projects (This is my current

situation)- for solving real life problem - for fun- for …

Now, we have specified our “question[1]”, then what we do is to gather the data[2] we need.

Page 7: Data Mining with Weka Putting  it all together

Real life projectThis summer vacation, I worked as volunteer

programmer(no payment) for a start-up, whose objective is to provide article recommendations for developers[1].

In this case, we must keep our database, which will index all the up-to-date articles we gather from the whole Internet(mongoDB).

We use many ways to gather articles, and I just focused on one of them – Get articles links from influencers’ tweets through APIs.

Page 8: Data Mining with Weka Putting  it all together

Procedures in real lifeDo all the links I gathered work? - Never, even I wish they did

1. Due to algorithm issue, some links I got are in bad format.

2. Even links are correct, I cannot get articles from all links, as some of them are not links for articles.

[3. More problems after getting articles from links]

-- We must do some clean up[3], after we gathered our data, to better use it.

Page 9: Data Mining with Weka Putting  it all together

Procedures in real life

OK, assume that now we have all the [raw] data(articles here) we need.

The most important jobs comes – one of them is how to rank articles for different keywords [how to define keywords collection]. (It is more about mathematics issue than computer science here, and I did not participate in this part)

-- Define new features

Page 10: Data Mining with Weka Putting  it all together

Procedures in real life

After new features defined, the last step is to generate a web app, so that users can enjoy “our” work.

Now the last step of this project is still under construction, which means “we” still need more time to “deploy the result”.

We will go to section 5.2 now -->

Page 11: Data Mining with Weka Putting  it all together

5.2 Pitfalls and pratfallsBy Yujia, Pengfei

Page 12: Data Mining with Weka Putting  it all together

5.2 Pitfalls and pratfalls

Pitfall: A hidden or unsuspected danger or difficulty

Pratfall: A stupid and humiliating action

Tricky parts and how to deal with them

Page 13: Data Mining with Weka Putting  it all together

Be skeptical

In data mining, it’s very easy to cheat whether consciously or unconsciously

For reliable tests, use a completely fresh sample of data that has never been seen before

Page 14: Data Mining with Weka Putting  it all together

Overfitting has many faces

Don’t test on the training set (of course!)

Data that has been used for development (in any way) is tainted

Leave some evaluation data aside for the very end

Key: always test on completely fresh data.

Page 15: Data Mining with Weka Putting  it all together

Missing values“Missing” means what … Unknown? Unrecorded? Irrelevant?Missing valuesOmit instances where the attribute value is missing?

or Treat “missing” as a separate possible value?Is there significance in the fact that a value is

missing?Most learning algorithms deal with missing values– but they may make different assumptions about them

Page 16: Data Mining with Weka Putting  it all together

An Example

OneR and J48 deal with missing values in different ways

Load weather‐nominal.arffOneR gets 43%, J48 gets 50% (using 10‐fold

cross‐validation)Change the outlook value to unknown on the

first four no instancesOneR gets 93%, J48 still gets 50%Look at OneR’s rules: it uses “?” as a fourth

value for outlook

Page 17: Data Mining with Weka Putting  it all together

An Example

Page 18: Data Mining with Weka Putting  it all together

5.2 Pitfalls and pratfallsPart 2 By Pengfei

Page 19: Data Mining with Weka Putting  it all together

 No “universal” best algorithm, No free lunch

2‐class problem with 100 binary attributes

Say you know a million instances, and their classes (training set)

You don’t know the classes of 99.9999…% of the data set

How could you possibly figure them out

Example

Page 20: Data Mining with Weka Putting  it all together

 No “universal” best algorithm, No free lunch

In order to generalize, every learner must embody some knowledge or assumptions beyond the data it’s given

Delete less useful attributesFind better filterData mining is an experimental science

Page 21: Data Mining with Weka Putting  it all together

5.3 Data mining and ethicsBy Yanqing

Page 22: Data Mining with Weka Putting  it all together

5.3 Data mining and ethics

Information privacy lawsAnonymization The purpose of data miningCorrelation causation

Source: www.mum.eduSource: www.ediscoveryreadingroom.com

Source: www.zerohedge.com

Source: www.johnmyleswhite.com

Page 23: Data Mining with Weka Putting  it all together

Information privacy lawsIn Europe

Purpose; Keep secret; Accurately update; Provider can review; Deleted asap; Un-transmittable (if less protection) No sensitive data (sexual orientation, religion )

In US Not highly legislated or regulated Computer Security, Privacy and Criminal

Law But hard to be anonymous...

Be aware ethical issues and laws AOL (2006)

650,000 users (3days in public web) at least $5,000 for the identifiable person

Source: www.livingwithgod.org

Source: blog.brainhost.com

Page 24: Data Mining with Weka Putting  it all together

Anonymization

It is much harder than you think. Story: MA release medical records (mid‐1990s) No name, address, social security number Re-identification technique

Public records: City, Birthday, gender: 50% of US can be identify One more attribute – zipcode: 85% identification

Netflix Use movie rating system to identify people 99% 6 movies 70% 2 movies

Source: eofdreams.com

www.resteasypestcontrol.com

Page 25: Data Mining with Weka Putting  it all together

The purpose of data miningThe purpose of data mining is to discriminate …

who gets the loan who gets the special offer

Certain kinds of discrimination are unethical, and illegal racial, sexual, religious, …

But it depends on the context sexual discrimination is usually illegal … except for doctors, who are expected to take gender into account

… and information that appears innocuous may not be ZIP code correlates with race membership of certain organizations correlates with gender

Page 26: Data Mining with Weka Putting  it all together

Correlation and Causation

Correlation does not imply causation As icecream sales increase, so does the rate of

drownings. Therefore icecream consumption causes drowning???

Data mining reveals correlation, not causation but really, we want to predict the effects of our actions

Source: commons.wikimedia.org Source: www.thevisualeverything.comSource: www.michaelnielsen.org

Page 27: Data Mining with Weka Putting  it all together

5.3 Summary

Privacy of personal informationAnonymization is harder than you thinkReidentification from supposedly anonymized

dataData mining and discriminationCorrelation does not imply causation

Page 28: Data Mining with Weka Putting  it all together

5.4 Summary By Ya, Kexin

Page 29: Data Mining with Weka Putting  it all together

5.4 SUMMARY

There’s no magic in data mining– Instead, a huge array of alternative techniques

There’s no single universal “best

method” – It’s an experimental science!– What works best on your problem?

Page 30: Data Mining with Weka Putting  it all together

5.4 SUMMARY

Produce comprehensive models

When attributes contribute equally and independently to the decision

Simply stores the training data without processing it

Calculate a linear decision boundary

Avoids overfitting, even with large numbers of attributes

Determines the baseline performance

Page 31: Data Mining with Weka Putting  it all together

5.4 SUMMARY

Weka makes it easy – ... maybe too easy?

There are many pitfalls– You need to understand what you’re doing!

filters Attribute selection

Data visualization classifiers clusters

Page 32: Data Mining with Weka Putting  it all together

5.4 SUMMARY

Focus on evaluation ... and significance– Different algorithms differ in performance – but is it significant?

Page 33: Data Mining with Weka Putting  it all together
Page 34: Data Mining with Weka Putting  it all together

Advanced Datamining with Weka

Some missing parts in the lecturesFiltered ClassifierCost-sensitive evaluation and classificationAttribute selectionClustering Association rulesText classificationWeka Experimenter

Page 35: Data Mining with Weka Putting  it all together

Filtered Classifier

Filter the training data, not testing data

Why do we need Filtered Classifier?

Page 36: Data Mining with Weka Putting  it all together

Cost-sensitive evaluation and classification

Costs of different decisions and different kinds of errors

Costs in datamining Misclassification Costs Test Costs Costs of Cases Computation Costs

Page 37: Data Mining with Weka Putting  it all together

Attribute Selection

Uesful parts of Attribute SelectionSelect relevant attributesRemove irrelevant attributesReasons for Attribute SelctionSimpler modelMore Transparent and easier to understandShorter Training timeKnowing which attribute is important

Page 38: Data Mining with Weka Putting  it all together

Clustering

Cluster the instances according to their attribute values

Clustering method: k-means k-means++

Page 39: Data Mining with Weka Putting  it all together

Experimenter

Page 40: Data Mining with Weka Putting  it all together

Experimenter

Page 41: Data Mining with Weka Putting  it all together

Experimenter

Page 42: Data Mining with Weka Putting  it all together

Acknowledgement

Thanks Prof. Ian H. Witten and his Weka MOOC direction.

Page 43: Data Mining with Weka Putting  it all together

Thank you for your kind attention