Seminar Data & Web Mining

Christian Groß, Timo Philipp

Agenda

Application types

Introduction to Applications

Evaluation

19.07.2007 Applications 2

Overview

Crime enforcement

Crime Link Explorer

Money laundering

Peer Group Analysis

Open Source Software Development

Adolescent cigarette smoking


Crime Link Explorer (1)

Software developed by the University of Arizona

Goal: enable crime investigators to conduct effective and efficient link analysis automatically

Link analysis:

Identification, analysis and visualization of associations between entities:

persons

locations

criminal incidents


Crime Link Explorer (2)

Three techniques included:

Concept space approach

Shortest path algorithm

Heuristic approach


Data source for link analysis

Data sources:

crime incident reports (German: “Anzeige”)

Uniform Crime Reports (UCR), established 1930

surveillance logs

telephone records

financial transactions

Link:

Two entities are linked if they appear in the same document, log or telephone record

Problems

Link analysis costs much time and human effort

Information overload

Information buried in large volume of raw data

High branching factors

The number of direct links an entity has

Determining the importance of links

Relies heavily on domain knowledge

System design

GUI for visualizing the association paths found

Dijkstra's shortest-path algorithm is used to find strong associations between entities

Associations are identified and extracted from the dataset using the concept space approach

Heuristics that capture domain knowledge are used to identify criminal associations

Co-occurrence weights


Concept space

Network consisting of domain-specific concepts (nodes)

Weighted co-occurrence relationships (links)

Example: COPLINK

Concepts (nodes):

Persons

Organizations

Locations

Vehicles

Link: two concepts are linked if they appear in the same criminal incident

Co-occurrence weights (1)

[Figure: example incident reports A and B, listing Persons A, B, C and Locations A, B]

Co-occurrence weights (2)

[Figure: Person A and Person B connected by a weighted link]

The co-occurrence weight is computed from the frequency with which two persons appear together in the same incident report

Con: the computed weights offer only minor assistance in uncovering investigative leads
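The frequency-based weight described here can be sketched as follows. This is only an illustration: the report contents are hypothetical, and dividing the pair count by the number of reports is an assumed normalisation, not the weighting formula used by the concept space approach itself.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(reports):
    """Count how often each unordered pair of entities shares a report."""
    counts = Counter()
    for entities in reports.values():
        # sorted(set(...)) gives each pair one canonical (a, b) ordering
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_weights(reports):
    """Normalise pair counts by the number of reports (assumed scheme)."""
    counts = cooccurrence_counts(reports)
    total = len(reports)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical incident reports, not taken from the evaluated data set
reports = {
    "Incident report A": ["Person A", "Person B"],
    "Incident report B": ["Person A", "Person B", "Person C"],
}
weights = cooccurrence_weights(reports)
```

In this toy data Person A and Person B share both reports, so their link gets the highest weight, while links to Person C are weaker.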

Heuristic approach

Three criteria:

Relationship between crime type and person roles

Shared addresses or telephone numbers

Repeated co-occurrence in incident reports

1 – 100 % scale indicating the strength of associations

Heuristic approach: crime type / person role (1)

Construction of a matrix for each crime type:

Homicide, Robbery, Auto Theft, Sexual Assault, …

Each matrix contains a strength estimate for each role combination: victim <-> witness, witness <-> suspect, suspect <-> victim, …

Heuristic approach: crime type / person role (2)

Estimated number of incidents, out of every 100 of a given crime type, in which an association occurs for a given role combination

Table for crime type homicide:

Homicide   Victim   Witness   Suspect   Arrestee   Other
Victim     …        …         98        …          …
Witness    …        …         …         …          …
Suspect    98       …         …         …          …
Arrestee   …        …         …         …          …
Other      …        …         …         …          …

The heuristic score could be improved by including statistical analysis

Heuristic approach: shared address / phone

Important indicator for associations

But: phone numbers are often erroneous

They contribute only 5 % to the final weight

Addresses are more accurate than phone numbers

They contribute 10 % to the final weight

Heuristic approach: co-occurrence

Same idea as the concept space approach

But: co-occurrence weights are estimated from an empirically derived probability distribution

Co-occurrence count   Association probability (%)
1                     1
2                     45
3                     98
≥ 4                   100
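The mapping from co-occurrence count to association probability is a simple lookup. A minimal sketch using the four values from the slide; returning 0 for pairs that never co-occur is an assumption, since the slide does not cover that case.

```python
# Values from the slide: co-occurrence count -> association probability (%)
ASSOCIATION_PROBABILITY = {1: 1, 2: 45, 3: 98}


def association_probability(count):
    """Return the association probability (%) for a co-occurrence count."""
    if count < 1:
        return 0  # assumption: no co-occurrence means no evidence
    # Counts of 4 or more are capped at 100 %
    return ASSOCIATION_PROBABILITY.get(count, 100)
```

For example, `association_probability(2)` gives 45 and any count of 4 or more gives 100.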

Heuristic approach: final heuristic weight

P1 = crime type / person role score

P2 = shared phone score

P3 = shared address score

P4 = association probability based on co-occurrence counts

$w_{\text{final}} = \max\left(0.85\,P_1 + 0.05\,P_2 + 0.10\,P_3,\; P_4\right)$
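A small sketch of the final weight, reading the formula as the maximum of a weighted sum of P1 to P3 and the co-occurrence probability P4. The coefficients 0.85, 0.05 and 0.10 are matched to the 5 % phone and 10 % address contributions stated earlier; the input scores in the example are hypothetical.

```python
def final_weight(p1, p2, p3, p4):
    """Combine the four heuristic scores (each on a 0-100 scale).

    p1: crime type / person role score
    p2: shared phone score
    p3: shared address score
    p4: association probability from co-occurrence counts
    """
    weighted_sum = 0.85 * p1 + 0.05 * p2 + 0.10 * p3
    return max(weighted_sum, p4)


# Hypothetical pair: role score 98, no shared phone, shared address,
# two co-occurrences (45 %)
example = final_weight(98, 0, 100, 45)
```

Taking the maximum means that strong repeated co-occurrence alone is enough to produce a high final weight, even when the other three scores are low.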

Association Path

[Figure: association graph of Persons A to F, with links weighted w1 to w5]

Association path search

A logarithmic transformation is applied to the weights w_i

A modified Dijkstra algorithm is used to find the strongest association path between two or more persons
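One plausible reading of these two steps: if weights lie in (0, 1] and the strength of a path is the product of its weights, then using -log(w) as the edge cost turns the strongest path into a shortest path that plain Dijkstra can find. The sketch below assumes exactly that; the slides do not spell out the modification, and the graph and its weights are hypothetical.

```python
import heapq
import math


def undirected(edges):
    """Build an adjacency dict from (a, b, weight) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def strongest_path(graph, source, target):
    """Return (path, strength); weights must lie in (0, 1]."""
    dist = {source: 0.0}  # accumulated -log(weight) cost
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            cost = d - math.log(w)  # log transform: product -> sum
            if cost < dist.get(nbr, math.inf):
                dist[nbr] = cost
                prev[nbr] = node
                heapq.heappush(heap, (cost, nbr))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])


# Hypothetical association weights between persons A to D
graph = undirected([
    ("A", "B", 0.9), ("B", "D", 0.8),
    ("A", "C", 0.5), ("C", "D", 0.99),
    ("A", "D", 0.6),
])
path, strength = strongest_path(graph, "A", "D")
```

Here the two-hop path over B (0.9 × 0.8 = 0.72) beats the direct link A to D (0.6), which is the kind of indirect association the path search is meant to surface.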


System evaluation

Data set:

239,780 incident reports

229,938 persons involved

Age, gender, race, address, phone number

10 crime analysts

Heuristic approach more accurate than concept space approach

Heuristic approach uses domain knowledge

Reduced time and effort needed for link analysis


FAIS

The Financial Crimes Enforcement Network AI System

maintained by FinCEN (U.S.)

Aim: detection of money laundering


FAIS – Load and prepare data

Money transaction over $10,000 into/out of the US

A Currency Transaction Report (CTR) is filled in

The information is entered into the DB (U.S. Customs Services Data Centre)

Information collection by the load program

FAIS – Data analysis

Transaction DB (U.S. Customs Services Data Centre) -> consolidated data in the FAIS DB (Sybase): subjects and accounts, linked with transactions

Suspicion rating program with 336 analysis rules -> data with rating

The rules yield individual positive/negative pieces of evidence; a Bayesian transform combines them into a single rating

Link analysis

Data-driven mode: apply filters

User-directed mode: create SQL query

NEXPERT:

GUI for investigating how a result was obtained

Allows what-if statements

Heuristic knowledge for text fields
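The slides do not say how the Bayesian transform is defined. As a rough illustration only, the sketch below assumes each rule contributes an independent likelihood ratio (above 1 for positive evidence, below 1 for negative evidence) and combines them with a prior in odds form. The prior and the ratios are hypothetical values, not FAIS parameters.

```python
def combine_evidence(prior, likelihood_ratios):
    """Combine independent pieces of evidence into one probability.

    prior: prior probability of suspicious activity, in (0, 1)
    likelihood_ratios: one ratio per fired rule; > 1 supports suspicion,
        < 1 counts against it (independence is an assumption)
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical subject: two rules fire in favour, one against
rating = combine_evidence(prior=0.01, likelihood_ratios=[20.0, 5.0, 0.5])
```

The appeal of this form is that positive and negative evidence from many rules collapse into one comparable number per subject or account.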


FAIS – Data Analysis (cont.)


FAIS – Use and payoff

Introduced in 1993

Beginning of 1995:

20 million transactions

3,000 subjects detected

2.5 million accounts

Beginning of 1997 (see Strategy Plan of FinCEN 1997 - 2002):

39 million reports (Bank Secrecy Act filings, including CTRs)

3,500 new subjects revealed

5,000 bank accounts of Colombian/Mexican money launderers detected

Feedback received:

50 % known hits

50 % hits with similar behaviour

90 % of leads are correct

Consequences

Revised form


OSS development phenomenon

OSS := Open Source Software

Hypothesis: Open Source Software development can be modeled as a self-organizing, collaborative network

Collaborative network:

Variation of a social network

Edge between two nodes if they are part of the same collaboration

Linchpins connect disparate groups into larger clusters

Motivation:

Better understanding of how the OSS community works

IT planners can better assess the risk of using OSS

OSS development (1)

Recent studies showed:

OSS development produces better software with fewer bugs

Most developers work for enjoyment and the pride of being part of a successful OSS project

They do not work for monetary return

Collaborate from around the world

Developers rarely meet face-to-face

Developers are self-organized


OSS development (2)

The OSS movement is an example of a decentralized, self-organizing process

No central control or planning

Threatens traditional proprietary software business

Open Questions:

Intellectual property rights

Role of the government concerning OSS

Software licensing


Power law networks

Collaborative networks often show a power-law distribution

Examples for power law distributions:

City size distribution

Word ranking in languages and writing

Internet
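A quantity follows a power law when its frequency falls off as count ≈ C · x^(-γ), which appears as a straight line on a log-log plot. The sketch below estimates γ with an ordinary least-squares fit on the log-log values. This is a rough, commonly used check rather than the method of the study, and the data is synthetic.

```python
import math


def fit_power_law(frequencies):
    """Fit count = C * x**(-gamma) by least squares in log-log space.

    frequencies: dict mapping a value x (e.g. node degree or cluster
    size) to how often it occurs. Returns (gamma, C).
    """
    points = [(math.log(x), math.log(y))
              for x, y in frequencies.items() if x > 0 and y > 0]
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    return -slope, math.exp(mean_y - slope * mean_x)


# Synthetic distribution that follows count = 1000 * x**-2 exactly
synthetic = {x: 1000.0 * x ** -2 for x in range(1, 11)}
gamma, constant = fit_power_law(synthetic)
```

On real degree or cluster-size data the points scatter around the line, so the fitted exponent should be treated as an estimate.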


Data collection and analysis

A web crawler collected data from SourceForge (mailing lists, project sites, forums) from January 2001 to March 2002:

Project number

Developer ID

SourceForge:

Number of projects: 39,000 (2002), 152,000 (2007)

Number of developers: 33,000 (2002)

Number of registered users: 1,600,000 (2007)

Modeling approach

Modeling the OSS community as a collaborative social network

Hypothesis:

The OSS Movement displays power law relationships in its structure

Cluster size

Degree of nodes


Graph modeling

Developer network: node = developer, edge = both developers work on the same project

Project network: node = project, edge = the same developer works on both projects

[Figure: example graph linking developers Dev[14], Dev[53], Dev[75] and projects eMule, GIMP, Azureus]
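The two network views can be derived from a single developer-to-project membership table. A minimal sketch: the node names are borrowed from the figure labels, but which developer works on which project is an assumption made up for the example.

```python
from itertools import combinations


def developer_network(membership):
    """Edges between developers who work on the same project."""
    edges = set()
    for developers in membership.values():
        for a, b in combinations(sorted(developers), 2):
            edges.add((a, b))
    return edges


def project_network(membership):
    """Edges between projects that share at least one developer."""
    edges = set()
    for p, q in combinations(sorted(membership), 2):
        if membership[p] & membership[q]:
            edges.add((p, q))
    return edges


# Hypothetical memberships: project -> set of developers
membership = {
    "eMule": {"Dev[14]", "Dev[53]"},
    "GIMP": {"Dev[53]", "Dev[75]"},
    "Azureus": {"Dev[75]"},
}
dev_edges = developer_network(membership)
proj_edges = project_network(membership)
```

In this toy data Dev[53] and Dev[75] are the developers who tie the three projects together.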

Results

Both figures show that the two modeled networks satisfy the power-law property

Clustering Analysis (1)

[Figure: network clusters with linchpins marked]

Clustering Analysis (2)

[Figure: number of clusters plotted against cluster size]
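The cluster-size distribution in such a plot can be computed from connected components. The sketch below also flags candidate linchpins as nodes whose removal splits a cluster; equating linchpins with such cut nodes is an assumption made for illustration, and the graph is hypothetical.

```python
from collections import Counter


def clusters(nodes, edges):
    """Return the connected components as a list of node sets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    component.add(nbr)
                    stack.append(nbr)
        components.append(component)
    return components


def cluster_size_distribution(nodes, edges):
    """Map cluster size -> number of clusters of that size."""
    return Counter(len(c) for c in clusters(nodes, edges))


def linchpins(nodes, edges):
    """Nodes whose removal increases the number of clusters."""
    baseline = len(clusters(nodes, edges))
    found = []
    for node in nodes:
        rest = [n for n in nodes if n != node]
        kept = [(a, b) for a, b in edges if node not in (a, b)]
        if len(clusters(rest, kept)) > baseline:
            found.append(node)
    return found


# Hypothetical network: two triangles joined by the edge c-d, plus g
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
sizes = cluster_size_distribution(nodes, edges)
bridging_nodes = linchpins(nodes, edges)
```

The removal test recomputes components once per node, which is fine for a small example but too slow for a network of tens of thousands of developers.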

Conclusions (1)

The OSS developer network fits the power-law relationship

The OSS developer network is not a random network

The graph displays preferential attachment of new nodes

Initial success of an OSS project leads more developers to join the project

Important role of linchpins

Attractors for other developers

Facilitate the diffusion of ideas and technology between clusters

Conclusions (2)

Long-term study needed because of the high fluctuation rate of nodes

Further research should be done on the OSS network

Additional graph-theoretic properties could be computed (clustering coefficients, network diameter, etc.)

Deeper understanding of how nodes join and leave

Role of SourceForge?


Adolescent cigarette smoking

Social network theory and analysis are applied to examine whether adolescents in different peer group positions differ in the prevalence of current smoking.

Research project on 1,092 ninth graders at 5 schools:

Each chose 3 best friends (best friend first)

Aim: classify each adolescent as

Clique member

Clique liaison

Isolate

Additional information provided


Building Link Graph


Clique members:

Belong to a group of at least 3 members

At least 50 % of their links are within their group

Connected by some path lying entirely within the clique

Clique liaisons:

At least 2 links with clique members or other liaisons

Not in a clique

Isolates:

Few or no links to others

Weight of an arc = 1 if the friendship is not reciprocated, otherwise 2
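The arc weighting rule can be sketched from the friendship nominations: an arc gets weight 2 when both adolescents name each other and weight 1 when only one does. The names and nominations below are hypothetical, and the sketch covers only the weighting step, not the classification into clique members, liaisons and isolates.

```python
def friendship_arcs(nominations):
    """Map each nominated pair to its arc weight (1 or 2).

    nominations: dict mapping a person to the list of friends named.
    """
    weights = {}
    for person, friends in nominations.items():
        for friend in friends:
            if friend == person:
                continue  # ignore self-nominations
            pair = frozenset((person, friend))
            reciprocated = person in nominations.get(friend, [])
            weights[pair] = 2 if reciprocated else 1
    return weights


# Hypothetical nominations (each adolescent may name up to 3 friends)
nominations = {
    "Ann": ["Ben", "Cy"],
    "Ben": ["Ann"],
    "Cy": ["Dee"],
    "Dee": [],
}
arcs = friendship_arcs(nominations)
```

Ann and Ben name each other, so their arc has weight 2; the other two arcs are one-sided and get weight 1.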

Test data


Cigarette smoking

Defined by self-report (current smoker and 1+ packs of cigarettes) and carbon monoxide content in alveolar breath samples

Result

Smokers:

tend more often to be white than black (significant at 2 schools)

come from families in which the mother has less education

Additional analysis

Significance in interaction at 4 schools

School E: significant in interactions between social position and the variables gender and mother's education?!

Including non-surveyed subjects leads to 5 schools with a significant relationship between social position and current smoking (not shown)

Underestimation of relationship

Additional Analysis (cont.)


Possibility remains that isolates are integrated into peer groups outside the school social network.

Friend smoking behaviour

Isolates have more smoking friends than clique members or liaisons (1.5 to 4 times as many)

Isolates have fewer friends than other subjects

Attribute “friend smoking” added to the graph (average over the 3 friends, smoking/non-smoking) -> not significant -> friend smoking is strongly related to subject smoking

Friend smoking is not a proxy for peer group social position

Isolates tend to be smokers

Possible explanations:

1. Social isolation causes smoking

2. Smoking causes social isolation

3. No relationship between smoking and isolation (both caused by the same factors)

4. Isolates are members of cliques outside the school environment

Regardless of the explanation, smoking is not a peer group phenomenon!

Similarities / Differences between Applications


Evaluation

Link analysis offers great potential for crime investigation

Reduce time and human effort

Domain knowledge could improve link analysis

More accurate results with domain knowledge based link analysis

Peer Group Analysis is a helpful tool for social network analysis
