romi dm 01 introduction 1juli2011

Data Mining Romi Satria Wahono [email protected] http://romisatriawahono.net 0878-804804-85

Upload: nora-asteria

Post on 25-Sep-2015


DESCRIPTION

JAVA - data mining

TRANSCRIPT



Romi Satria Wahono
- SD Sompok Semarang (1987)
- SMPN 8 Semarang (1990)
- SMA Taruna Nusantara, Magelang (1993)
- Bachelor's, master's and doctoral studies (doctorate on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)
- Research interests: software engineering, intelligent systems
- Founder and coordinator of IlmuKomputer.Com
- Researcher at LIPI (2004-2009)
- Founder and CEO of PT Brainmatics Cipta Informatika

Learning Methods
- Lecture
- Discussion
- Case study
- Practice

Textbooks

Course Outline
- Introduction to Data Mining
- Input - Concepts, Instances and Attributes
- Output - Knowledge Representation
- Methods and Algorithms
- Evaluation and Validation
- Data Mining Research

Introduction to Data Mining

Contents
- What is Data Mining
- Main Tasks of Data Mining
- Data Mining Standard Process
- Data Mining Applications
- Data Mining and Ethics

What is Data Mining

Why Data Mining?
- Society produces huge amounts of data
  - Sources: business, science, medicine, economics, geography, environment, sports, and more
- Potentially valuable resource
- Raw data is useless: techniques are needed to automatically extract information (recognize patterns) from it
  - Data: recorded facts
  - Information: patterns underlying the data
- Knowledge Discovery in Databases (KDD)

Definition of Data Mining
- Extracting implicit, previously unknown, potentially useful information from data (Witten, 2011)
- The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques (Gartner Group)
- The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner (Hand et al., 2001)
- Activities that include collecting and using historical data to find regularities, patterns and relationships in large data sets (Santosa, 2007)
- An interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases (Cabena et al., 1998)

Data Mining at the Intersection of Disciplines
- Statistics:
  - More theory-based
  - Focuses on hypothesis testing
- Machine learning:
  - More heuristic
  - Focuses on improving the performance of a learning technique
- Data mining:
  - Combines theory and heuristics
  - Focuses on the whole process of discovering knowledge and patterns
  - Includes data cleaning, learning, and visualization of the results

Data Mining Tools
- WEKA
- RapidMiner
- Clementine
- Matlab
- R

Learning Methods (1)
- Unsupervised learning:
  - The data mining algorithm searches for patterns and structure among all the variables
  - No target variable is identified as such
  - Clustering is an unsupervised learning method
- Supervised learning:
  - Most data mining methods (classification and prediction) are supervised methods
  - The algorithm is given many examples where the value of the target variable is provided
  - From these examples the algorithm learns which values of the target variable are associated with which values of the predictor variables

Learning Methods (2)
- Association rule mining is another data mining method; it may be supervised or unsupervised
- In market basket analysis, for example, one may simply be interested in which items are purchased together, in which case no target variable is identified
- The problem is that so many items are for sale that searching for all possible associations is a daunting task, because of the resulting combinatorial explosion
- The Apriori algorithm attacks this problem cleverly

Main Tasks of Data Mining
- Description
- Estimation
- Prediction
- Classification
- Clustering
- Association

Description
- Researchers and analysts are simply trying to find ways to describe patterns and trends lying within data
- A data mining model should describe clear patterns that are amenable to intuitive interpretation and explanation
- Some data mining methods are more suited than others to transparent interpretation
  - Decision trees provide an intuitive and human-friendly explanation of their results
  - Neural networks are comparatively opaque to nonspecialists, due to the nonlinearity and complexity of the model
- High-quality description can often be accomplished by exploratory data analysis, a graphical method of exploring data in search of patterns and trends

Description Techniques
- Graphical description
  - Dot plot
  - Histogram
- Measures of location
  - Mean (average)
  - Median (middle value)
  - Mode (most frequent value)
  - Quartiles (values at each quarter of the data)
  - Percentiles
- Measures of spread
  - Range
  - Variance and standard deviation
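The location and spread measures listed under Description Techniques can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the `describe` helper and the `ages` sample are hypothetical and do not come from the course material.

```python
import statistics

def describe(values):
    """Return location and spread summaries for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }

# Hypothetical sample: ages of ten customers.
ages = [23, 25, 25, 29, 31, 34, 35, 41, 47, 60]
summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas; `pvariance` and `pstdev` give the population versions.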

Estimation
- Estimation is similar to classification except that the target variable is numerical rather than categorical
- Models are built using complete records, which provide the value of the target variable as well as the predictors
- Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors

Estimation Techniques
- The field of statistical analysis supplies several venerable and widely used estimation methods
- These include point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression
- Neural networks may also be used for estimation

Estimation - Examples
- Estimating the amount of money a randomly chosen family of four will spend on back-to-school shopping this fall
- Estimating the percentage decrease in rotary movement sustained by a National Football League running back with a knee injury
- Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs
- Estimating the grade-point average (GPA) of a graduate student, based on that student's undergraduate GPA

Regression estimates lie on the regression line
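Simple linear regression, one of the estimation techniques named above, can be sketched in a few lines. This is a minimal ordinary-least-squares fit, assuming a single predictor; the function names and the GPA figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def estimate(intercept, slope, x):
    """Estimate the target for a new observation; the result lies on the fitted line."""
    return intercept + slope * x

# Hypothetical complete records: undergraduate GPA (predictor), graduate GPA (target).
undergrad = [2.1, 2.6, 3.0, 3.4, 3.9]
graduate = [2.7, 2.8, 3.3, 3.4, 3.8]

b0, b1 = fit_line(undergrad, graduate)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("estimate for undergraduate GPA 3.2:", round(estimate(b0, b1, 3.2), 2))
```

Every estimate produced this way falls on the regression line, which is why individual observed values generally differ from their estimates.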

Estimating CPU Performance
- Example: 209 different computer configurations

Linear regression function

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
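Applying the regression function is plain arithmetic, as the sketch below shows. The coefficients are the ones quoted on the slide; the helper name `predict_prp` is hypothetical, and the example configuration uses the attribute values of configuration 1 from the slide's data table.

```python
# Coefficients of the linear regression function for PRP, as quoted on the slide.
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,   # cycle time (ns)
    "MMIN": 0.0153,   # minimum main memory (Kb)
    "MMAX": 0.0056,   # maximum main memory (Kb)
    "CACH": 0.6410,   # cache (Kb)
    "CHMIN": -0.2700, # minimum channels
    "CHMAX": 1.480,   # maximum channels
}

def predict_prp(config):
    """Estimate performance (PRP) as the intercept plus the weighted attribute values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * config[name] for name in COEFFICIENTS)

# Attribute values of configuration 1 in the data table.
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(example), 1))
```

For this configuration the function gives roughly 337, while the table records a PRP of 198, so the estimate for an individual machine can be far from its measured value.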

       Cycle time (ns)   Main memory (Kb)     Cache (Kb)   Channels          Performance
       MYCT              MMIN      MMAX       CACH         CHMIN    CHMAX    PRP
1      125               256       6000       256          16       128      198
2      29                8000      32000      32           8        32       269
208    480               512       8000       32           0        0        67
209    480               1000      4000       0            0        0        45

Prediction
- Prediction is similar to classification and estimation, except that for prediction, the results lie in the future

Prediction Techniques
- Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction
- Statistical methods: point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression
- Data mining methods: neural network, decision tree, and k-nearest neighbor

Prediction - Examples
- Predicting the price of a stock three months into the future
- Predicting the percentage increase in traffic deaths next year if the speed limit is increased
- Predicting the winner of this fall's baseball World Series, based on a comparison of team statistics
- Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company

Predicting the price of a stock

Classification
- In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories:
  - high income
  - middle income
  - low income
- The data mining model examines a large set of records, each record containing information on the target variable as well as a set of input or predictor variables

Classification Techniques
- Neural network
- Decision tree
- k-nearest neighbor
- Naive Bayes

Classification - Examples
- Determining whether a particular credit card transaction is fraudulent
- Placing a new student into a particular track with regard to special needs
- Assessing whether a mortgage application is a good or bad credit risk
- Diagnosing whether a particular disease is present
- Identifying whether or not certain financial or personal behavior indicates a possible terrorist threat

The Contact Lenses Data

A Complete and Correct Rule Set

A Decision Tree for This Problem

The Weather Problem
- Example: Conditions for playing a certain game

Rules:
- If outlook = sunny and humidity = high then play = no
- If outlook = rainy and windy = true then play = no
- If outlook = overcast then play = yes
- If humidity = normal then play = yes
- If none of the above then play = yes
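The rules above only classify correctly when they are checked in the order given, with the first matching rule deciding the outcome. A minimal sketch of that reading follows; the function name `play` and its parameter names are illustrative.

```python
def play(outlook, humidity, windy):
    """Apply the weather rules in order; the first rule that matches decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

print(play("sunny", "high", False))
print(play("rainy", "normal", True))
print(play("overcast", "high", True))
```

Taken out of order, the rule "if humidity = normal then play = yes" would wrongly classify a rainy, windy day with normal humidity, which is why the ordering matters.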

Weather Data with Mixed Attributes
- Example: Some attributes have numeric values

Rules:
- If outlook = sunny and humidity > 83 then play = no
- If outlook = rainy and windy = true then play = no
- If outlook = overcast then play = yes
- If humidity < 85 then play = yes
- If none of the above then play = yes

Classifying Iris Flowers

A Complete and Correct Rule Set

Data from Labor Negotiations

Attribute                         Type                         1      2      3      40
Duration                          (Number of years)            1      2      3      2
Wage increase first year          Percentage                   2%     4%     4.3%   4.5
Wage increase second year         Percentage                   ?      5%     4.4%   4.0
Wage increase third year          Percentage                   ?      ?      ?      ?
Cost of living adjustment         {none,tcf,tc}                none   tcf    ?      none
Working hours per week            (Number of hours)            28     35     38     40
Pension                           {none,ret-allw,empl-cntr}    none   ?      ?      ?
Standby pay                       Percentage                   ?      13%    ?      ?
Shift-work supplement             Percentage                   ?      5%     4%     4
Education allowance               {yes,no}                     yes    ?      ?      ?
Statutory holidays                (Number of days)             11     15     12     12
Vacation                          {below-avg,avg,gen}          avg    gen    gen    avg
Long-term disability assistance   {yes,no}                     no     ?      ?      yes
Dental plan contribution          {none,half,full}             none   ?      full   full
Bereavement assistance            {yes,no}                     no     ?      ?      yes
Health plan contribution          {none,half,full}             none   ?      full   half
Acceptability of contract         {good,bad}                   bad    good   good   good

Decision Trees for the Labor Data

Clustering
- Clustering refers to the grouping of records, observations, or cases into classes of similar objects
- A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters
- Clustering differs from classification in that there is no target variable for clustering (unsupervised learning)
- The clustering task does not try to classify, estimate, or predict the value of a target variable
- Clustering is often performed as a preliminary step in a data mining process, with the resulting clusters being used as further inputs into a different technique downstream, such as neural networks

Clustering Techniques
- Hierarchical clustering
- K-means clustering
- Self-Organizing Map (SOM)

Clustering - Examples
- Target marketing of a niche product for a small-capitalization business that does not have a large marketing budget
- For accounting auditing purposes, to segment financial behavior into benign and suspicious categories
- As a dimension-reduction tool when the data set has hundreds of attributes
- For gene expression clustering, where very large quantities of genes may exhibit similar behavior

Clustering the Lifestyle Types
- Claritas, Inc. provides a demographic profile of each of the geographic areas in the country, as defined by zip code. One of the clustering mechanisms they use is the PRIZM segmentation system, which describes every U.S. zip code area in terms of distinct lifestyle types. Just go to the company's Web site, enter a particular zip code, and you are shown the most common PRIZM clusters for that zip code.
- What do these clusters mean? For illustration, let's look up the clusters for zip code 90210, Beverly Hills, California.
- The resulting clusters for zip code 90210 are:
  - Cluster 01: Blue Blood Estates
  - Cluster 10: Bohemian Mix
  - Cluster 02: Winner's Circle
  - Cluster 07: Money and Brains
  - Cluster 08: Young Literati

Association
- The association task for data mining is the job of finding which attributes go together
- Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes
- Association rules are of the form "If antecedent, then consequent", together with a measure of the support and confidence associated with the rule
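Support and confidence can be made concrete by counting over a list of transactions. The sketch below is an illustration under stated assumptions: `rule_metrics` is a hypothetical helper, and the baskets are invented (1000 baskets, 200 containing diapers, 50 of those also containing beer, with "milk" as a placeholder item for the rest). It reports both the share of baskets containing the antecedent and the share containing antecedent and consequent together, since texts differ on which of the two they call "support".

```python
def rule_metrics(transactions, antecedent, consequent):
    """Count-based measures for the rule 'if antecedent then consequent'."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    return {
        "antecedent_support": n_a / n,              # baskets containing the antecedent
        "rule_support": n_ab / n,                   # baskets containing both sides
        "confidence": n_ab / n_a if n_a else 0.0,   # both sides, given the antecedent
    }

# Hypothetical baskets: 50 with diapers and beer, 150 with diapers only, 800 with neither.
baskets = [{"diapers", "beer"}] * 50 + [{"diapers"}] * 150 + [{"milk"}] * 800

metrics = rule_metrics(baskets, {"diapers"}, {"beer"})
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```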

Association
- For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night:
  - 200 bought diapers
  - Of those 200 who bought diapers, 50 bought beer
- Thus, the association rule would be "If buy diapers, then buy beer", with a support of 200/1000 = 20% and a confidence of 50/200 = 25%
- Note: the 20% here is the share of customers who bought diapers (the antecedent). Under the more common definition, the support of a rule counts customers who bought both items: 50/1000 = 5%

Association Techniques
- Apriori algorithm
- FP-Growth algorithm
- GRI algorithm

Association - Examples
- Investigating the proportion of subscribers to a company's cell phone plan that respond positively to an offer of a service upgrade
- Predicting degradation in telecommunications networks
- Finding out which items in a supermarket are purchased together and which items are never purchased together
- Determining the proportion of cases in which a new drug will exhibit dangerous side effects

Exercise (Classification)
- Train on the election data (datakpu-training.xls) using the C4.5 algorithm
- Test on datakpu-testing.xls
- Measure the performance using:
  - Confusion matrix (accuracy)
  - ROC curve (AUC)
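The two measures named in the classification exercise can be computed by hand on a small example. This is a plain-Python sketch, not the RapidMiner or WEKA procedure: the labels, predictions and scores are invented, and the AUC is computed as the share of positive/negative pairs that the scores rank correctly (ties count as half).

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of records whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def auc(actual, scores, positive):
    """Area under the ROC curve via pairwise ranking of positives against negatives."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test records: actual class, predicted class, and score for class "yes".
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]

print(dict(confusion_matrix(actual, predicted)))
print("accuracy:", round(accuracy(actual, predicted), 3))
print("AUC:", round(auc(actual, scores, "yes"), 3))
```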

Exercise (Estimation)
- Train on the CPU data (cpu.arff) using linear regression
- Test using X-Validation (cross-validation)
- Measure the performance using:
  - RMSE
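RMSE and cross-validation, both used in the estimation exercise, can be sketched without any tool. The code below is a simplified illustration: the fold assignment (every k-th record) and the stand-in model (predict the training mean) are assumptions chosen to keep the example short, and the per-fold RMSE values are simply averaged.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th record starting at i."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def cross_validated_rmse(values, k):
    """Average RMSE over k folds for a baseline that predicts the training mean."""
    fold_errors = []
    for train, test in k_fold_indices(len(values), k):
        mean = sum(values[i] for i in train) / len(train)
        held_out = [values[i] for i in test]
        fold_errors.append(rmse(held_out, [mean] * len(held_out)))
    return sum(fold_errors) / k

# Hypothetical target values.
targets = [198, 269, 220, 172, 132, 318, 367, 489, 636, 1144]
print(round(cross_validated_rmse(targets, k=5), 1))
```

Swapping the mean baseline for a fitted regression model inside the loop gives the cross-validated error of that model instead.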

Exercise (Time Series Prediction)
- Train on the stock price data (hargasaham-training.xls) using a neural network
- Test with the test data (hargasaham-testing.xls)
- Measure the performance using:
  - Prediction accuracy
  - RMSE

Assignment
- Try all the data sets in the case study folder with various data mining methods. If a data set has no test data, use X-Validation (cross-validation)
- Study and try everything in rapidminer-movietutorial
- Write a report on all the experiments from tasks 1 and 2, including screenshots, and send it by email to [email protected] with the subject: [datamining1-udinus] name-NIM
- Deadline: 5 August 2011

Data Mining Standard Process

Data Mining Standard Process (CRISP-DM)
- A cross-industry standard was clearly required that is industry-neutral, tool-neutral, and application-neutral
- The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in 1996 (Chapman, 2000)
- CRISP-DM provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit

CRISP-DM

1. Business Understanding Phase
- Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole
- Translate these goals and restrictions into the formulation of a data mining problem definition
- Prepare a preliminary strategy for achieving these objectives

2. Data Understanding Phase
- Collect the data
- Use exploratory data analysis to familiarize yourself with the data and discover initial insights
- Evaluate the quality of the data
- If desired, select interesting subsets that may contain actionable patterns

3. Data Preparation Phase
- Prepare from the initial raw data the final data set that is to be used for all subsequent phases. This phase is very labor-intensive
- Select the cases and variables you want to analyze and that are appropriate for your analysis
- Perform transformations on certain variables, if needed
- Clean the raw data so that it is ready for the modeling tools

4. Modeling Phase
- Select and apply appropriate modeling techniques
- Calibrate model settings to optimize results
- Remember that often, several different techniques may be used for the same data mining problem
- If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique

5. Evaluation Phase
- Evaluate the one or more models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field
- Determine whether the model in fact achieves the objectives set for it in the first phase
- Establish whether some important facet of the business or research problem has not been accounted for sufficiently
- Come to a decision regarding use of the data mining results

6. Deployment Phase
- Make use of the models created: model creation does not signify the completion of a project
- Example of a simple deployment: generate a report
- Example of a more complex deployment: implement a parallel data mining process in another department
- For businesses, the customer often carries out the deployment based on your model

Exercise
- Study and understand Case Studies 1-5 in Chapter 1 of Larose (2005)

- Study and understand how CRISP-DM is applied in the thesis by Firmansyah (2011) on using the C4.5 algorithm to determine creditworthiness

Data Mining Applications

Fielded Applications
- Processing loan applications
- Screening images for oil slicks
- Electricity supply forecasting
- Diagnosis of machine faults
- Marketing and sales
- Separating crude oil and natural gas
- Reducing banding in rotogravure printing
- Finding appropriate technicians for telephone faults
- Scientific applications: biology, astronomy, chemistry
- Automatic selection of TV programs
- Monitoring intensive care patients

Processing Loan Applications (American Express)
- Given: questionnaire with financial and personal information
- Question: should money be lent?
- Simple statistical method covers 90% of cases
- Borderline cases referred to loan officers
- But: 50% of accepted borderline cases defaulted!
- Solution: reject all borderline cases?
  - No! Borderline cases are the most active customers

Enter Machine Learning
- 1000 training examples of borderline cases
- 20 attributes:
  - age
  - years with current employer
  - years at current address
  - years with the bank
  - other credit cards possessed, and others
- Learned rules: correct on 70% of cases
  - Human experts: only 50%
- Rules could be used to explain decisions to customers

Screening Images
- Given: radar satellite images of coastal waters
- Problem: detect oil slicks in those images
- Oil slicks appear as dark regions with changing size and shape
- Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
- Expensive process requiring highly trained personnel

Enter Machine Learning
- Extract dark regions from the normalized image
- Attributes:
  - size of region
  - shape, area
  - intensity
  - sharpness and jaggedness of boundaries
  - proximity of other regions
  - info about background
- Constraints:
  - Few training examples: oil slicks are rare!
  - Unbalanced data: most dark regions aren't slicks
  - Regions from the same image form a batch
  - Requirement: adjustable false-alarm rate
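The slides do not say how the adjustable false-alarm rate was achieved. One common way to get such a knob, shown here purely as an assumption, is to let the classifier output a score per dark region and pick the score threshold from the scores of known non-slick regions. The function name and the scores below are hypothetical.

```python
def threshold_for_false_alarm_rate(negative_scores, max_rate):
    """Return a threshold t such that flagging scores > t raises false alarms
    on at most max_rate of the known non-slick regions."""
    ordered = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ordered))  # number of false alarms tolerated
    if allowed >= len(ordered):
        return float("-inf")                # every region may be flagged
    # Only scores strictly above this value are flagged, so at most `allowed`
    # negatives can exceed it.
    return ordered[allowed]

# Hypothetical classifier scores for ten dark regions known not to be slicks.
non_slick_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

threshold = threshold_for_false_alarm_rate(non_slick_scores, max_rate=0.2)
false_alarms = [s for s in non_slick_scores if s > threshold]
print("threshold:", threshold, "false alarms:", len(false_alarms))
```

Lowering `max_rate` raises the threshold, which trades missed slicks for fewer false alarms.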

Load Forecasting
- Electricity supply companies need forecasts of future demand for power
- Forecasts of min/max load for each hour yield significant savings
- Given: manually constructed load model that assumes normal climatic conditions
- Problem: adjust for weather conditions
- Static model consists of:
  - base load for the year
  - load periodicity over the year
  - effect of holidays

Enter Machine Learning
- Prediction corrected using the most similar days
- Attributes:
  - temperature
  - humidity
  - wind speed
  - cloud cover readings
  - plus the difference between actual load and predicted load
- Average difference among the three most similar days added to the static model
- Linear regression coefficients form the attribute weights in the similarity function
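The correction step described above can be sketched as a weighted nearest-neighbor lookup. This is an illustration under assumptions, not the deployed system: the weighted Euclidean distance, the weights, and all figures are invented, and only two of the weather attributes are used to keep it short.

```python
def similar_day_correction(today, history, weights, k=3):
    """Average (actual - predicted) load over the k past days most similar to today.

    history: list of (attributes, actual_load, static_prediction) tuples.
    weights: attribute weights for the similarity function (hypothetical values here;
             the slide says they come from linear regression coefficients).
    """
    def distance(a, b):
        return sum(weights[name] * (a[name] - b[name]) ** 2 for name in weights) ** 0.5

    nearest = sorted(history, key=lambda day: distance(today, day[0]))[:k]
    return sum(actual - predicted for _, actual, predicted in nearest) / len(nearest)

# Hypothetical past days: weather attributes, actual load, static-model prediction.
history = [
    ({"temperature": 31, "humidity": 72}, 1050, 1000),
    ({"temperature": 29, "humidity": 69}, 1030, 1000),
    ({"temperature": 30, "humidity": 71}, 1040, 1000),
    ({"temperature": 10, "humidity": 40}, 900, 1000),
]
weights = {"temperature": 1.0, "humidity": 0.5}
today = {"temperature": 30, "humidity": 70}

static_forecast = 1000
correction = similar_day_correction(today, history, weights)
print("corrected forecast:", static_forecast + correction)
```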

Diagnosis of Machine Faults
- Diagnosis: classical domain of expert systems
- Given: Fourier analysis of vibrations measured at various points of a device's mounting
- Question: which fault is present?
- Preventative maintenance of electromechanical motors and generators
- Information very noisy
- So far: diagnosis by expert/hand-crafted rules

Enter Machine Learning
- Available: 600 faults with expert's diagnosis
- ~300 unsatisfactory, rest used for training
- Attributes augmented by intermediate concepts that embodied causal domain knowledge
- Expert not satisfied with initial rules because they did not relate to his domain knowledge
- Further background knowledge resulted in more complex rules that were satisfactory
- Learned rules outperformed hand-crafted ones

Marketing and Sales I
- Companies precisely record massive amounts of marketing and sales data
- Applications:
  - Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
  - Special offers: identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)

Marketing and Sales II
- Market basket analysis
  - Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
- Historical analysis of purchasing patterns
- Identifying prospective customers
  - Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)

Data Mining and Ethics

Data Mining and Ethics I
- Ethical issues arise in practical applications
- Anonymizing data is difficult
  - 85% of Americans can be identified from just zip code, birth date and sex
- Data mining often used to discriminate
  - E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
- Ethical situation depends on application
  - E.g. same information ok in medical application
- Attributes may contain problematic information
  - E.g. area code may correlate with race

Data Mining and Ethics II
- Important questions:
  - Who is permitted access to the data?
  - For what purpose was the data collected?
  - What kind of conclusions can be legitimately drawn from it?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient!
- Are resources put to good use?

References
- Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
- Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
- Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Elsevier, 2006
- Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
- Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007