data mining - lecture 1

Upload: ankush-jindal

Post on 10-Jan-2016


DESCRIPTION

Data mining at IIT Mandi

TRANSCRIPT

  • CS 660: Data Mining For Decision Making

    Lecture 1 (Week 1)

    Varun Dutt

    School of Computing and Electrical Engineering

    School of Humanities and Social Sciences

    Indian Institute of Technology Mandi, India

    Scaling the Heights

  • Course Instructor

    Prof. Varun Dutt

    School of Computing and Electrical Engineering

    School of Humanities and Social Sciences

    Indian Institute of Technology, Mandi

    PWD Rest House 2nd Floor, Mandi - 175 001, H.P., India

    Phone: +91-1905-267041

    Email: [email protected]

    Office Hours: Only with a prior appointment


  • A Little About Me! In the office

    Qualifications

    M.S. degrees in Software Engineering, Engineering and Public Policy, and Rational Simulation (cognitive modeling) from Carnegie Mellon University

    Ph.D. in Engineering and Public Policy from Carnegie Mellon University

    Post-doctoral fellowship from Carnegie Mellon University

    Since 2012 at Indian Institute of Technology, Mandi, India

    Research interests

    Artificial intelligence and cognitive modeling, Human-Computer Interaction, Environmental decision making, Judgment and Decision Making

    Professional Experience

    Served as a Software Engineer in Tata Consultancy Services (TCS) and in MothersonSumi INfotech and Designs Ltd.

    Serves as Knowledge Editor of a financial daily, Financial Chronicle

    Serves as a Lead Author of Chapter 2 of the UN IPCC's AR5 (WG III) report

  • A Little About Me! At home

    ABBA Fan

    Married to Dr. Rajeshwari Dutt with a cute little daughter

    Get no sleep!

    Do a lot of writing and have a back problem

    I have a TA to help!


  • Teaching Assistants

    - Sanjay Rathee, Ph.D. student, SCEE, IIT Mandi. Email: [email protected] (Has been working on parallelizing the A-priori algorithm recently.)

    - Akash Porwal, Ph.D. student, SCEE, IIT Mandi. Email: [email protected] (Has recently joined and is working on electrical problems concerning solar photovoltaics.)

  • What about you folks?

    Please introduce yourselves


  • Announcements

    Syllabus

    Your Grade:

    30% Final exam

    20% Surprise Quizzes

    10% Class Participation

    20% Class Assignments

    20% Class Project


  • Course Logistics

    - Please don't copy or plagiarize! Being an AI researcher, I know how to catch it

    - If found, consequences will be catastrophic!

    - If you did copy, then please cite the sources as (author, date). E.g., (Dutt, 2012)

  • An Example (Witten, Frank, & Hall, 2011)


  • Data Mining: What is it? (Witten, Frank, & Hall, 2011)

    Data mining is defined as the process of discovering structural patterns in data.

    The process must be automatic or (more usually) semiautomatic.

    The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.

    The data is invariably present in substantial quantities.

  • Example


  • Structural Description (Pattern) in Data

    If tear production rate = reduced then recommendation = none

    Otherwise, if age = young and astigmatic = no then recommendation = soft
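
The two rules above can be read as an ordered sequence of tests. Here is a minimal Python sketch of that reading; the function name `recommend_lens` and the string encodings of the attribute values are illustrative assumptions, and the slide gives no rule for the remaining cases, so the sketch returns `None` for them.

```python
from typing import Optional


def recommend_lens(tear_production_rate: str, age: str, astigmatic: str) -> Optional[str]:
    """Apply the two rules from the slide, in order (hypothetical helper)."""
    # Rule 1: reduced tear production means no lenses are recommended.
    if tear_production_rate == "reduced":
        return "none"
    # Rule 2: only reached when rule 1 did not fire ("Otherwise, ...").
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide shows only these two rules; other cases are left undecided here.
    return None
```

For example, `recommend_lens("normal", "young", "no")` falls through the first rule and is decided by the second.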

  • Weather Dataset

    In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether to play or not.

  • Structural Description (Pattern) in Data (also called a Decision List)

    A set of rules learned from this information might look like this:

    If outlook = sunny and humidity = high then play = no

    If outlook = rainy and windy = true then play = no

    If outlook = overcast then play = yes

    If humidity = normal then play = yes

    If none of the above then play = yes

  • Decision List

    These rules are meant to be interpreted in order: the first one; then, if it doesn't apply, the second; and so on. A set of rules that are intended to be interpreted in sequence is called a decision list.

    Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule "if humidity = normal then play = yes" gets one of the examples wrong (check which one).
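
One way to check both claims is to run the rules over the data. The sketch below is written under two assumptions: the table on the slide is the standard 14-row nominal weather dataset from Witten, Frank, & Hall (2011), reproduced here because the table itself did not survive in this transcript, and the names `WEATHER`, `classify`, `decision_list_errors`, and `rule4_errors` are illustrative.

```python
# Assumed data: the standard nominal weather dataset (Witten, Frank, & Hall, 2011).
# Columns: outlook, temperature, humidity, windy, play
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook: str, humidity: str, windy: bool) -> str:
    """Decision list: the first rule whose condition holds decides the outcome."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above"


# Rows the full decision list gets wrong (rules applied in order).
decision_list_errors = [
    row for row in WEATHER if classify(row[0], row[2], row[3]) != row[4]
]

# Rows the fourth rule gets wrong when applied on its own, out of context.
rule4_errors = [
    row for row in WEATHER if row[2] == "normal" and row[4] != "yes"
]

print(len(decision_list_errors), len(rule4_errors))
```

Printing `rule4_errors` shows which example the isolated rule misclassifies; in the ordered list an earlier rule catches that example before the humidity rule is reached.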

  • Weather Dataset: Two of the attributes (temperature and humidity) have numeric values

  • Structural Description (Classification Rules)

    For this example, the rules must use inequalities involving these attributes rather than simple equality tests as in the former case.

    This is called a numeric-attribute problem; in this case, a mixed-attribute problem, because not all attributes are numeric.

    Now the first rule given earlier might take the form:

    If outlook = sunny and humidity > 83 then play = no
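
As a small illustration, the numeric version of the first rule replaces the equality test on humidity with a threshold comparison. The function name `first_rule_numeric` and the `threshold` parameter are assumptions made for this sketch; only the value 83 comes from the slide.

```python
from typing import Optional


def first_rule_numeric(outlook: str, humidity: float, threshold: float = 83) -> Optional[str]:
    """Numeric form of the first rule: an inequality instead of an equality test."""
    # Mixed-attribute rule: outlook is nominal, humidity is numeric.
    if outlook == "sunny" and humidity > threshold:
        return "no"
    # The rule does not apply; later rules in the list would decide.
    return None
```

Note that the comparison is strict, so a sunny day with humidity exactly 83 is not covered by this rule.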

  • Association Rules


  • Preparing Input Data for Data Mining

    Data Cleaning (also called data scrubbing or data cleansing) is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. It is a time-consuming activity, often done in a semi-automated manner.

    Missing Values: Missing values are frequently indicated by out-of-range entries. Example: a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0. For nominal attributes, missing values may be indicated by blanks or dashes.

    Inaccurate Values: The same item may be recorded as "Pepsi" in one place and "Pepsi-Cola" in another. Typographical errors are another source. Example: a supermarket seller uses her own card to give discounts to customers who forgot theirs.
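
The cleaning steps described above can be sketched in a few lines of plain Python. Everything specific here is hypothetical: the `ALIASES` table, the helper names, and the sample records are made up for illustration, and real cleaning usually needs manual review (for instance, two identical rows may be legitimate separate purchases rather than duplicates).

```python
from typing import List, Optional, Tuple

# Hypothetical alias table: maps variant spellings to one canonical name.
ALIASES = {"pepsi": "Pepsi", "pepsi-cola": "Pepsi"}

Row = Tuple[Optional[str], Optional[float]]


def clean_quantity(value: float) -> Optional[float]:
    """Treat a negative entry in a normally positive field as missing."""
    return None if value < 0 else value


def clean_product(name: str) -> Optional[str]:
    """Treat blanks and dashes as missing; map known variants to one name."""
    stripped = name.strip()
    if stripped in ("", "-"):
        return None
    return ALIASES.get(stripped.lower(), stripped)


def clean_records(records: List[Tuple[str, float]]) -> List[Row]:
    """Normalize each record, then drop exact duplicates (keeping first seen)."""
    seen = set()
    cleaned: List[Row] = []
    for product, quantity in records:
        row = (clean_product(product), clean_quantity(quantity))
        if row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned


# Made-up sample records: (product, quantity)
raw = [("Pepsi", 2), ("Pepsi-Cola", 2), ("-", 1), ("Chips", -1)]
cleaned = clean_records(raw)
print(cleaned)
```

Missing values are represented as `None` so that later steps can tell "unknown" apart from a real value such as 0.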

  • Applications of Data Mining in the Real World

    Web mining: Prestige of a web page based upon how many pages link to it (PageRank)

    Decisions involving judgments (banks use data mining to accept or reject loan applications)

    Screening images (detecting oil slicks at sea using satellite data)

    Load forecasting in the electricity industry

    Diagnosing faults in machines in industry

    Marketing and Sales (pharmaceutical industry patient journeys; market-basket analysis (Pepsi and diapers on Thursdays); discount or loyalty cards to collect data)

  • Activities

    Read Witten, Frank, and Hall, 2011: Chapter 1 (up to page 15, before CPU performance; pages 21-29, 51-52, 58-60):

    http://www.cse.hcmut.edu.vn/~chauvtn/data_mining/Texts/[7]%20Data%20Mining%20-%20Practical%20Machine%20Learning%20Tools%20and%20Techniques%20(3rd%20Ed).pdf

    Read Singhal, 2011

  • Thank you!


    Comments and Questions most welcome!