Seminar Data & Web Mining - Knowledge Engineering …
Overview Crime enforcement
Crime Link Explorer
Money laundering
Peer Group Analysis
Open Source Software Development
Adolescent cigarette smoking
19.07.2007 Applications 3
Crime Link Explorer (1)
Software developed by the University of Arizona
Goal: enable crime investigators to conduct effective and efficient link analysis automatically
Link Analysis
Identification, analysis and visualization of associations between entities:
persons
locations
criminal incidents
Crime Link Explorer (2)
Three techniques included:
Concept space approach
Shortest path algorithm
Heuristic approach
Data source for Link Analysis
Data sources:
crime incident report (German: “Anzeige”)
Uniform Crime Report (UCR) established 1930
surveillance logs
telephone records
financial transactions
Link: two entities are linked if they appear in the same document, log or telephone record
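The link definition above can be sketched in a few lines. This is only an illustration: the record ids and entity names are hypothetical, and the slides do not say how the original system stores its records.

```python
from itertools import combinations

def extract_links(records):
    """Return the set of unordered entity pairs that share at least one record."""
    links = set()
    for entities in records.values():
        # sorted() gives every pair a canonical order, so (a, b) and (b, a) coincide
        for a, b in combinations(sorted(set(entities)), 2):
            links.add((a, b))
    return links

# Hypothetical records: record id -> entities mentioned in that record
records = {
    "incident-1": ["Person A", "Person B", "Location A"],
    "phone-log-7": ["Person B", "Person C"],
}
links = extract_links(records)
```

Person A and Person C never share a record here, so no direct link is created between them; they are only connected indirectly through Person B.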
Problems
Costs much time and human effort
Information overload
Information buried in large volume of raw data
High branching factors
The number of direct links an entity has
Determining the importance of links
Relies heavily on domain knowledge
System design
GUI for visualizing the association paths found
Dijkstra's shortest-path algorithm is used to find strong associations between entities
Associations are identified and extracted from the dataset using the concept space approach
Heuristics that capture domain knowledge are used to identify criminal associations
Concept Space
Network consisting of domain-specific concepts (nodes)
Weighted co-occurrence relationships (links)
Example: COPLINK
Concepts (nodes):
Persons
Organizations
Locations
Vehicles
Link: two concepts are linked if they appear in the same criminal incident
Co-occurrence weights (1)
[Figure: example incident reports A and B listing the persons (A, B, C) and locations (A, B) involved]
Co-occurrence weights (2)
[Figure: Person A and Person B joined by a link labelled "weight"]
The co-occurrence weight is computed from the frequency with which two persons appear together in the same incident report
Con: the computed weights are of only minor help in uncovering investigative leads
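A frequency-based co-occurrence weight could look roughly like the sketch below. It is a deliberate simplification: the slides do not give the weighting formula of the concept space approach, so dividing by the number of reports is an assumption, and the two reports are made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(reports):
    """Count, for every pair of persons, the reports in which both appear."""
    counts = Counter()
    for persons in reports:
        for pair in combinations(sorted(set(persons)), 2):
            counts[pair] += 1
    return counts

# Hypothetical incident reports, each given as the list of persons it mentions
reports = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
]
counts = cooccurrence_counts(reports)

# Assumed normalisation: share of all reports in which the pair appears together
weights = {pair: n / len(reports) for pair, n in counts.items()}
```

Person A and Person B appear together in both reports and get the highest weight; the pairs involving Person C appear together only once.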
Heuristic approach
Three criteria:
Relationship between crime type and person roles
Shared addresses or telephone numbers
Repeated co-occurrence in incident reports
A 1–100 % scale indicates the strength of associations
Heuristic approach: Crime type / person role (1)
Construction of a matrix for each crime type:
Homicide, Robbery, Auto Theft, Sexual Assault, …
Each matrix contains a strength estimate for each role combination:
victim <-> witness, witness <-> suspect, suspect <-> victim, …
Heuristic approach: Crime type / person role (2)
Table for crime type homicide:
            Victim  Witness  Suspect  Arrestee  Other
Victim      …       …        98       …         …
Witness     …       …        …        …         …
Suspect     98      …        …        …         …
Arrestee    …       …        …        …         …
Other       …       …        …        …         …
Each entry estimates, out of every 100 incidents of that crime type, how often an association exists for the given role combination
The heuristic score could be improved by including statistical analysis
Heuristic approach: Shared address / phone
Important indicator for associations
But: phone numbers are often erroneous, so they contribute only 5 % to the final weight
Addresses are more accurate than phone numbers and contribute 10 % to the final weight
Heuristic approach: Co-occurrence
Co-occurrence count    Association probability (%)
1                      1
2                      45
3                      98
≥ 4                    100
Same idea as the concept space approach
But: co-occurrence weights are estimated from an empirically derived probability distribution
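The mapping from co-occurrence count to association probability is a plain lookup. The sketch below uses the four values from the slide; returning 0 for a pair that never co-occurs is an assumption, since the slide starts at a count of 1.

```python
def association_probability(count):
    """Association probability in percent for a given co-occurrence count."""
    if count <= 0:
        return 0  # assumption: the slide does not cover pairs that never co-occur
    table = {1: 1, 2: 45, 3: 98}  # values taken from the slide
    return table.get(count, 100)  # four or more co-occurrences -> 100 %
```

The steep jump from 1 % to 45 % between one and two co-occurrences shows why this differs from a purely frequency-proportional weight.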
Heuristic approach: Final heuristic weight
P1 = crime type / person role score
P2 = shared phone score
P3 = shared address score
P4 = association probability based on co-occurrence counts
$w_{final} = \max(0.85 \cdot P_1 + 0.05 \cdot P_2 + 0.10 \cdot P_3,\; P_4)$
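A minimal sketch of the final weight, reading the slide's formula as the maximum of two candidates: the weighted sum of P1, P2 and P3 (85 %, 5 % and 10 %, matching the phone and address shares stated earlier) and the co-occurrence probability P4. That reading, and the assumption that all four scores share the 0 to 100 scale, are interpretations of the slides rather than confirmed details of the system.

```python
def final_weight(p1, p2, p3, p4):
    """Combine the four heuristic scores (each assumed to be on a 0-100 scale).

    p1: crime type / person role score
    p2: shared phone score
    p3: shared address score
    p4: association probability from co-occurrence counts
    """
    weighted_sum = 0.85 * p1 + 0.05 * p2 + 0.10 * p3
    return max(weighted_sum, p4)
```

With this reading, a pair that co-occurs four or more times (P4 = 100) always gets the maximum weight, whatever the role-based score says.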
Association Path
[Figure: graph of Persons A to F connected by links with weights w1 to w5]
Association path search
A logarithmic transformation is applied to the weights wi
A modified Dijkstra algorithm finds the strongest association path between two or more persons
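One common way to combine a logarithmic transformation with Dijkstra is sketched below. It assumes weights scaled to (0, 1] and defines path strength as the product of the link weights, so that minimising the sum of -log(w) maximises the product. The slides do not spell out the actual modification, so this is an illustration, not the system's implementation; the graph is made up.

```python
import heapq
import math

def strongest_path(graph, source, target):
    """Return (path, strength) for the path with the largest weight product.

    graph: dict mapping node -> dict of neighbour -> weight in (0, 1].
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            cost = d - math.log(w)  # -log(w) >= 0 because w <= 1
            if cost < dist.get(nbr, math.inf):
                dist[nbr] = cost
                prev[nbr] = node
                heapq.heappush(heap, (cost, nbr))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])

# Hypothetical undirected association graph
edges = [("A", "B", 0.9), ("B", "D", 0.9), ("A", "C", 0.98),
         ("C", "D", 0.5), ("A", "D", 0.6)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

path, strength = strongest_path(graph, "A", "D")
```

In this example the indirect path A, B, D (0.9 × 0.9 = 0.81) is stronger than the direct link A, D (0.6), which is exactly the kind of association a plain edge lookup would miss.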
System Evaluation
Data set:
239,780 incident reports
229,938 persons involved
Age, gender, race, address, phone number
10 crime analysts
Heuristic approach more accurate than concept space approach
Heuristic approach uses domain knowledge
Reduced time and effort needed for link analysis
Overview Crime enforcement
Link Exploration
Money laundering
Peer Group Analysis
Open Source Software Development
Adolescent cigarette smoking
FAIS
The Financial Crimes Enforcement Network AI System
maintained by FinCEN (U.S.)
Aim: detection of money laundering
FAIS – Load and prepare data
Money transaction over 10,000 into/out of the US
Currency Transaction Report (CTR) is filled in
Information is entered into the DB (U.S. Customs Services Data Centre)
Information collection by the load program
FAIS – Data Analysis
Transaction DB (U.S. Customs Services Data Centre) -> FAIS DB (Sybase)
Consolidated data:
- Subjects, accounts (linked with transactions)
Suspicious rating program
Analysis rules (336)
Rules result in individual positive/negative evidence; a Bayesian transform combines it into a single rating
Data with rating
Link Analysis
NEXPERT:
- GUI for investigating how a result was obtained
- Allows what-if statements
- Heuristic knowledge for text fields
Data Driven Mode: apply filters
User Directed Mode: create SQL query
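The slide only states that the rules produce individual positive and negative pieces of evidence that a Bayesian transform turns into a single rating. One textbook way to do this is an odds update with one likelihood ratio per fired rule, sketched below. The function, the prior and the ratios are all hypothetical; the actual FAIS computation is not described in the slides.

```python
def combine_evidence(prior, likelihood_ratios):
    """Combine independent pieces of evidence into one posterior probability.

    prior: prior probability of suspicious activity, strictly between 0 and 1.
    likelihood_ratios: one ratio per fired rule; > 1 is positive evidence,
        < 1 is negative evidence (assumes the rules are independent).
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical example: two rules supporting suspicion, one rule against it
rating = combine_evidence(0.5, [3.0, 2.0, 0.5])
```

Positive and negative evidence can cancel out: a ratio of 2.0 followed by a ratio of 0.5 leaves the rating where it started.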
FAIS – Use and Payoff
Introduced in 1993
Beginning of 1995:
20 million transactions
3,000 subjects detected
2.5 million accounts
Beginning of 1997 (see Strategy Plan of FinCEN 1997 - 2002):
39 million (Bank Secrecy Act, including CTR)
3,500 new subjects revealed
5,000 bank accounts of Colombian/Mexican money launderers detected
Feedback received:
50% known hits
50% hits with similar behaviour
90% of leads are correct
Overview Crime enforcement
Link Exploration
Money laundering
Peer Group Analysis
Open Source Software Development
Adolescent cigarette smoking
OSS development phenomenon
OSS := Open Source Software
Hypothesis: Open Source Software development can be modeled as a self-organizing, collaborative network
Collaborative network
Variation of a social network
Edge between two nodes if they are part of the same collaboration
Linchpins connect disparate groups into a larger cluster
Motivation: Better understanding of how the OSS community works
IT planners are able to better calculate the risk of OSS usage
OSS development (1)
Recent studies showed:
OSS development produces better software with fewer bugs
Most developers work for enjoyment and the pride of being part of a successful OSS project.
They do not work for monetary return
Collaborate from around the world
Developers rarely meet face-to-face
Developers are self-organized
OSS development (2)
The OSS movement is an example of a decentralized, self-organizing process.
No central control or planning
Threatens traditional proprietary software business
Open Questions:
Intellectual property rights
Role of the government concerning OSS
Software licensing
Power Law Networks
Collaborative networks often show a power law distribution
Examples for power law distributions:
City size distribution
Word ranking in languages and writing
Internet
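A quick, rough check for a power law is to fit a straight line to the degree distribution on log-log axes; the slope then estimates the negative exponent. The sketch below does this with plain least squares. The slides do not say how the power law was tested, the degree sample is synthetic, and log-log regression is only a coarse diagnostic.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Map each degree k to the fraction of nodes that have this degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in counts.items()}

def loglog_slope(distribution):
    """Least-squares slope of log P(k) against log k, using k > 0 only."""
    points = [(math.log(k), math.log(p)) for k, p in distribution.items() if k > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic degrees whose counts fall off as k^-2: 64, 16, 4, 1 nodes
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(degree_distribution(degrees))
```

For this synthetic sample the fitted slope is -2, that is, P(k) is proportional to k^-2; real data is noisier, especially in the tail.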
Example:
Data Collection and Analysis
A web crawler collected data from SourceForge (mailing lists, project sites, forums) from Jan 2001 to March 2002
Project number
Developer id
SourceForge
Number of projects: 39,000 (2002), 152,000 (2007)
Number of developers: 33,000 (2002)
Number of registered users: 1,600,000 (2007)
Modeling approach
Modeling the OSS community as a collaborative social network
Hypothesis:
The OSS Movement displays power law relationships in its structure
Cluster size
Degree of nodes
Graph modeling
Developer network: node = developer, edge = both developers work on the same project
Project network: node = project, edge = the same developer works on both projects
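Both graph models can be derived from the same list of (project number, developer id) pairs. The sketch below builds the two edge sets; the ids are hypothetical and do not come from the SourceForge data.

```python
from collections import defaultdict
from itertools import combinations

def build_networks(memberships):
    """Build developer and project edge sets from (project, developer) pairs."""
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    for project, dev in memberships:
        devs_by_project[project].add(dev)
        projects_by_dev[dev].add(project)
    # Developer network: two developers are linked if they share a project
    developer_edges = {
        pair
        for devs in devs_by_project.values()
        for pair in combinations(sorted(devs), 2)
    }
    # Project network: two projects are linked if they share a developer
    project_edges = {
        pair
        for projects in projects_by_dev.values()
        for pair in combinations(sorted(projects), 2)
    }
    return developer_edges, project_edges

# Hypothetical memberships: (project number, developer id)
memberships = [("p1", "d1"), ("p1", "d2"), ("p2", "d2"), ("p3", "d3")]
developer_edges, project_edges = build_networks(memberships)
```

Developer d2 works on both p1 and p2 and is therefore the only reason those two projects are connected, which is the role the slides describe for linchpins.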
[Figure: example graph with developers Dev[14], Dev[53], Dev[75] and projects eMule, GIMP, Azureus]
Results
Both figures show that the two modeled networks satisfy the power law property
Conclusions (1)
The OSS developer network fits the power law relationship
The OSS developer network is not a random network
The graph displays preferential attachment of new nodes: after the initial success of an OSS project, more developers join the project
Important role of linchpins
Attractors for other developers
Facilitate the diffusion of ideas and technology between clusters
Conclusions (2)
A long-term study is needed because of the high fluctuation rate of nodes
Further research should be done on the OSS network
Additional graph-theoretic properties could be computed (clustering coefficients, network diameter, etc.)
Deeper understanding of how nodes join and leave
Role of SourceForge?
Overview Crime enforcement
Link Exploration
Money laundering
Peer Group Analysis
Open Source Software Development
Adolescent cigarette smoking
Adolescent Cigarette Smoking
Social network theory and analysis applied to examine whether adolescents differ in prevalence of current smoking depending on their peer group social position.
Research project on 1092 ninth graders at 5 schools:
Each chose 3 best friends (ordered with the best friend first)
Aim: classify each adolescent as
Clique member
Clique liaison
Isolate
Additional information provided
Building Link Graph
Clique members:
- Belong to a group of at least 3
- 50+% of their links are within their group
- Connected by some path lying entirely within the clique
Clique liaisons:
- 2+ links with clique members / other liaisons
- Not in a clique
Isolates:
- Few/no links to others
[Figure: link graph with liaisons and isolates marked]
Weight of arcs = 1 if the friendship is not reciprocated, otherwise 2
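The arc weighting can be sketched from the friendship nominations alone. The names are hypothetical, and storing each friendship as one unordered pair is an assumption about how the link graph was represented.

```python
def arc_weights(nominations):
    """Weight each friendship link: 2 if both name each other, otherwise 1.

    nominations: dict mapping each person to the list of friends they named.
    """
    weights = {}
    for person, friends in nominations.items():
        for friend in friends:
            pair = tuple(sorted((person, friend)))
            reciprocated = person in nominations.get(friend, [])
            weights[pair] = 2 if reciprocated else 1
    return weights

# Hypothetical nominations (the study asked for up to 3 best friends each)
nominations = {
    "Ann": ["Ben", "Cat"],
    "Ben": ["Ann"],
    "Cat": ["Dan"],
    "Dan": [],
}
weights = arc_weights(nominations)
```

Ann and Ben name each other, so their link gets weight 2; the other nominations are one-sided and get weight 1.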
Cigarette Smoking
Defined by self-report (current smoker and 1+ packs of cigarettes) and carbon monoxide content in alveolar breath samples.
Result
Smokers
tend more often to be white than black (significant at 2 schools)
come from families whose mothers have less education
Additional Analysis
Significance in interaction at 4 schools
School E: significant in interactions between social position and the variables gender and mother's education?!?
Including non-surveyed subjects leads to 5 schools with a significant relationship between social position and current smoking (not shown)
Additional Analysis (cont.)
Underestimation of the relationship
The possibility remains that isolates are integrated into peer groups outside the school social network.
Friend Smoking Behaviour
Isolates have more smoking friends than clique members/liaisons (1.5 – 4 times as many); isolates have fewer friends than other subjects.
Add attribute “friend smoking” to graph (average over the 3 friends, smoking/non-smoking) -> not significant -> friend smoking is strongly related to subject smoking.
Friend smoking is not a proxy for peer group social position.
Isolates tend to be smokers
Possible explanations:
1. Social isolation causes smoking
2. Smoking causes social isolation
3. No relationship between smoking & isolation (both caused by the same factors)
4. Isolates are members of cliques from outside the school environment
Regardless of the explanation, smoking is not a peer group phenomenon!
Evaluation
Link analysis offers great potential for crime investigation
Reduce time and human effort
Domain knowledge could improve link analysis
More accurate results with domain knowledge based link analysis
Peer Group Analysis is a helpful tool for social network analysis