Matching in Information Systems ISD3 Lecture 11

Upload: joseph-brown

Post on 31-Dec-2015

Matching in Information Systems

ISD3 Lecture 11

Contents

• Matching exercises

• Integrity and Fidelity
  – Fidelity as a matching problem – between the world and its representation in the system
• Stateful-stateless interaction
  – Co-evolution of user-machine fitness

Fuzzy Matching in the Telephone Directory

• UWE telephone directory
  – Only fuzzy matching is partial matching on initial string
    • ‘wall’ finds ‘wallace’, ‘wallis’, ‘walls’, …
  – Easy to do in SQL
    • .. where surname like 'reqsurname%'
  – Substring matching anywhere is slower
    • .. where surname like '%reqsurname%'
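
The two LIKE patterns above can be tried out with a small sketch. It uses Python's built-in sqlite3 module and an in-memory table; the table name, column names, sample rows and the helper names `find_by_prefix` and `find_by_substring` are assumptions made for illustration, not the real UWE directory. A leading wildcard generally prevents the database from using an ordinary index on surname, which is one likely reason substring matching is slower.

```python
import sqlite3

# Hypothetical in-memory directory; schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (surname TEXT, firstname TEXT, extno TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("wallace", "chris", "1001"),
        ("wallis", "pat", "1002"),
        ("walls", "sam", "1003"),
        ("cornwall", "alex", "1004"),
    ],
)


def find_by_prefix(req_surname):
    """Partial match on the initial string: surname LIKE 'req%'."""
    rows = conn.execute(
        "SELECT surname FROM person WHERE surname LIKE ? ORDER BY surname",
        (req_surname + "%",),
    )
    return [row[0] for row in rows]


def find_by_substring(req_surname):
    """Match anywhere in the string: surname LIKE '%req%'."""
    rows = conn.execute(
        "SELECT surname FROM person WHERE surname LIKE ? ORDER BY surname",
        ("%" + req_surname + "%",),
    )
    return [row[0] for row in rows]


print(find_by_prefix("wall"))
print(find_by_substring("wall"))
```

With these sample rows, the prefix search finds the three surnames starting with 'wall', while the substring search also picks up 'cornwall'.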

Telephone Schema

• Facilities (‘help desk’, ‘reception’ etc.) forced to fit Person schema
• Lack of inclusion in schema creates searching problems:
  – Helpdesk
  – Help desk
  – CSM help desk
• No support for categories of facility to control vocabulary
  – A Naming and Classification problem
• Need for generalisation:

[Diagram: class Person (Surname : str, Firstname : str, ExtNo : str); generalisation Contact, with Person and Facility]

Dating Problem

[Schema diagram: dates]

Person
  id:INTEGER, name:VARCHAR, age:INTEGER, gender:INTEGER, s1:INTEGER, s2:INTEGER, s3:INTEGER, s4:INTEGER

Preference
  minage:INTEGER, maxage:INTEGER, gender:INTEGER, s1:INTEGER, s2:INTEGER, s3:INTEGER, s4:INTEGER

pair

weights
  weightid:INTEGER, wtage:INTEGER, wtgen:INTEGER, wts1:INTEGER, wts2:INTEGER, wts3:INTEGER, wts4:INTEGER

Distance (fitness) function

• Distance(P1, P2) = Distance(P1, P2-Pref) + Distance(P2, P1-Pref)
• Individual differences:
  – agediff = (P1.age < P2-Pref.min or P1.age > P2-Pref.max) ? 1000 : 1 - abs(P1.age / ((P2-Pref.min + P2-Pref.max) / 2))
  – gendiff = P1.gen == P2-Pref.gen ? 0 : 1000
  – s1diff = abs(P1.s1 - P2-Pref.s1)
  – s2diff = abs(P1.s2 - P2-Pref.s2)
• Combined weighted differences
  – Euclidean distance
  – sqrt(wtage*agediff^2 + wtgen*gendiff^2 + wts1*s1diff^2 + wts2*s2diff^2 + …)
• Problems
  – Age is a ratio scale (40 is twice as old as 20)
  – Preference scales are not: rating a scenario a 6 does not imply it is twice as good as a rating of 3
  – Preference scales are ordinal
  – Age and Gen are go/no-go: simulated by a very high value for a mismatch
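
A minimal sketch of the distance function, assuming the Person, Preference and weights attributes from the schema slide. The class layout, the helper name `one_way` and the sample values are illustrative assumptions. The gender term uses the high value for a mismatch, as the go/no-go note describes. The age term is kept as written on the slide; its sign disappears once it is squared.

```python
import math
from dataclasses import dataclass

MISMATCH = 1000  # very high value that simulates a go/no-go failure


@dataclass
class Preference:
    minage: int
    maxage: int
    gender: int
    s: tuple  # preferred scenario ratings s1..s4


@dataclass
class Person:
    age: int
    gender: int
    s: tuple  # own scenario ratings s1..s4
    pref: Preference


@dataclass
class Weights:
    wtage: float = 1.0
    wtgen: float = 1.0
    wts: tuple = (1.0, 1.0, 1.0, 1.0)


def one_way(p, pref, w):
    """Weighted Euclidean distance of person p from someone else's preference."""
    if p.age < pref.minage or p.age > pref.maxage:
        agediff = MISMATCH
    else:
        midpoint = (pref.minage + pref.maxage) / 2
        agediff = 1 - abs(p.age / midpoint)
    gendiff = 0 if p.gender == pref.gender else MISMATCH
    total = w.wtage * agediff ** 2 + w.wtgen * gendiff ** 2
    for wt, own, wanted in zip(w.wts, p.s, pref.s):
        total += wt * abs(own - wanted) ** 2
    return math.sqrt(total)


def distance(p1, p2, w):
    """Distance(P1, P2) = Distance(P1, P2-Pref) + Distance(P2, P1-Pref)."""
    return one_way(p1, p2.pref, w) + one_way(p2, p1.pref, w)


# Hypothetical sample people: a and b suit each other, a and c do not.
w = Weights()
a = Person(30, 0, (3, 3, 3, 3), Preference(25, 35, 1, (3, 3, 3, 3)))
b = Person(30, 1, (3, 3, 3, 3), Preference(25, 35, 0, (3, 3, 3, 3)))
c = Person(30, 0, (3, 3, 3, 3), Preference(25, 35, 1, (3, 3, 3, 3)))
print(distance(a, b, w), distance(a, c, w))
```

A smaller distance means a better fit; any pair that fails a go/no-go test ends up at least 1000 apart whatever the scenario ratings are.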

Integrity

• Data in a database should agree with the rules in the schema
  – Checks on values
  – Referential integrity
  – Primary key
• A weak schema allows erroneous data
  – E.g. invalid manager relationships in the Emp-Dept example
  – Need for extended business rules in middle tier of application
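
The three kinds of schema rule can be illustrated with a small sketch using Python's sqlite3 module. The emp and dept tables, their columns and the helper name `rejected` are assumptions loosely modelled on an Emp-Dept example, not the schema used in the lecture. Note that SQLite only enforces foreign keys when the pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE emp (
        empno  INTEGER PRIMARY KEY,                      -- primary key
        ename  TEXT NOT NULL,
        sal    INTEGER CHECK (sal >= 0),                 -- check on values
        deptno INTEGER NOT NULL REFERENCES dept(deptno)  -- referential integrity
    )
    """
)
conn.execute("INSERT INTO dept VALUES (10, 'ACCOUNTING')")
conn.execute("INSERT INTO emp VALUES (1, 'KING', 5000, 10)")


def rejected(sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


print(rejected("INSERT INTO emp VALUES (2, 'SMITH', -1, 10)"))   # negative salary
print(rejected("INSERT INTO emp VALUES (3, 'JONES', 100, 99)"))  # no such department
print(rejected("INSERT INTO emp VALUES (1, 'CLARK', 100, 10)"))  # duplicate key
```

Rules that span several rows, such as which manager relationships are valid, are harder to state as column constraints, which is where middle-tier business rules come in.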

Fidelity

• HiFi: “exactitude in reproduction”
• A database as an image of its Domain of Discourse (Real World)
• Loss of fidelity when:
  – Two records in database but only one person in the RW
  – Address data does not correspond to an existing address in the RW
  – Address in database does not correspond to the current address of its owner

• But fidelity only has to be ‘good enough’ for its purpose

• Veracity means roughly the same – ‘truthful’

Data Quality

• Poor data quality results from loss of integrity and lack of fidelity.

• “Current data quality problems cost US businesses more than $600 billion per year” (report by the Data Warehousing Institute, 2002)

• Gartner Research estimates that through 2005 more than 50% of business intelligence and CRM deployments will suffer limited acceptance if not outright failure due to lack of attention to data quality issues.

• Direct costs of poor quality information estimated at between 10% and 20% of revenue

Information systems / computer systems

• Computer system quality depends only on ensuring the system doesn’t fall over when presented with bad data

• Information Systems quality depends on ensuring the system delivers information of high quality

• Information System includes procedures and guidance to users to meet this need.

Problem analysis

• Analyse chain of cause and effect of poor quality

• Systems approach:
  – Information system:
    • Data flow model analysed for points where errors can be injected
  – Organisation:
    • Attitudes and ethos

Data Flow in the Information System

• Information source

• Information gathering

• Information collation

• Information storage

• Information retrieval

Data source problems

• Data has only a limited lifetime of fidelity since world is in constant flux

• Length of lifetime depends on
  – Volatility of the data source, e.g. address for a young out-of-work person versus address of a retired person

• Need to re-validate data on a cycle dependent on the lifetime
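
One way to picture a re-validation cycle tied to the lifetime of the data is sketched below. The volatility categories, the lifetimes in days and the function name `due_for_revalidation` are purely illustrative assumptions; real values would have to come from measuring how quickly each kind of data goes stale.

```python
from datetime import date, timedelta

# Hypothetical fidelity lifetimes per kind of data source (assumed values).
LIFETIME_DAYS = {"high_volatility": 90, "low_volatility": 730}


def due_for_revalidation(last_validated, volatility, today):
    """True when a record has outlived the assumed lifetime for its source."""
    return today - last_validated > timedelta(days=LIFETIME_DAYS[volatility])


checked = date(2024, 1, 1)
today = date(2024, 6, 1)
print(due_for_revalidation(checked, "high_volatility", today))
print(due_for_revalidation(checked, "low_volatility", today))
```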

Data capture

• Data gathering procedures a major source of error.

• Integrity and Fidelity can be in conflict
  – If telephone number is mandatory, an operator in a hurry will enter any old number to get the record accepted

• Data quality depends on training and guidance given to operators

Collation

• Matching of new applicants with existing applicants is poor, so duplicates are generated.

• Postcodes accepted even if not matching Post Office database
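
A rough sketch of how new applicants might be matched against existing ones before a record is created. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the threshold of 0.85, the normalisation step and the function names are assumptions, and a real system would compare more than the name.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity between 0 and 1 after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def likely_duplicates(new_name, existing_names, threshold=0.85):
    """Existing names that are close enough to warrant a manual check."""
    return [n for n in existing_names if similarity(new_name, n) >= threshold]


existing = ["John Smith", "Jane Doe"]
print(likely_duplicates("Jon  Smith", existing))
```

A low threshold produces false alarms and a high one lets duplicates through, so the value is a matter of what is ‘good enough’ for the purpose.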

Storage

• Database integrity failures or loss of backup data, or reload with duplicates (auto number primary key)

Improvement Process

• Based on learning cycle
  – Shewhart cycle – Plan-Do-Check-Act – Deming cycle
  – Six Sigma – Define-measure-analyse-improve-control
  – Kolb learning cycle – act – reflect – theorise – plan

Improvement/ Learning Cycle

• Measure and observe the current process

• Analyse / develop theory of causes of problem

• Plan changes based on the theory

• Put plan into effect

• Measure / observe the resultant improvement ….

Stateless/ Stateful Interaction

• Stateless
  – Person interacts with machine
  – Machine response depends only on the request (and the state of data sources..)
  – Each interaction is independent of previous interactions with the same person
  – Machine has no memory of previous interactions
  – Person presumably does have memory of previous interactions!
• Stateful
  – Machine has memory of previous interactions
  – Response to a request depends not only on the current request but on previous interactions
  – Support for ‘long-running transactions’ such as placing an order, booking a holiday, buying the best house insurance

Example stateless/stateful interactions

• Person – organisation
  – I enter my local supermarket
  – I enter my local pub
• Person – organisation
  – I make a purchase from my local supermarket with a loyalty card
  – I go to my local pub for a drink
• Person – website
  – I click on a link to the UWE website
  – I click on a link to a site and I’m prompted to accept a cookie

Stateful interaction

• Advantages
  – Interaction is not one sided – I remember how the system has behaved, it remembers something about me and how I’ve behaved
  – Interaction is more like talking to another person
  – Machine can make better decisions about a suitable response
• State can be a problem too
  – Stateful behaviour can be hard to understand.
  – Bad memories – ‘let’s just start all over again’
  – Modal dialogue problem
    • Application puts up a modal dialogue box which must be responded to before anything else happens.
    • Dialogue box gets hidden behind other windows.

The evolving person-machine system

[Diagram: Machine and User, with labels ‘user’ and ‘machine’]

Machine-side state mechanisms

• A state mechanism has to deal with
  – What to store about the interaction
    • How much information about the user to retain
    • Issues: explicit / implicit, transaction log, Data Protection Act
  – How to store the state for the duration of the interaction
    • Length of interaction ranges from a site visit to ‘forever’
    • Issues: what to store, security, reliability, access by other applications
  – Matching a user to a stored state – the ‘identity’ problem
    • How is a user identified
    • Issues: can id be spoofed, is id secure, can identity be mistaken..

Storing the state

• Hidden fields in form
  – Server can send data to the user in a hidden field, which will then be returned when the user resubmits
• Session variable
  – Server can store data keyed by a session variable – session id can be sent back in hidden field
• Cookies
  – Server sends the user a cookie to store data which is sent back when the user next visits the site
• Database
  – State is stored in a database keyed by some user characteristic
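
The session-variable option can be sketched without any web framework. Here a request is just a dictionary, the server keeps state in a dictionary keyed by session id, and the id is handed back so the client can return it (for example in a hidden field or cookie). The names `SESSIONS` and `handle` and the ‘visits’ counter are assumptions for illustration only.

```python
import secrets

SESSIONS = {}  # server-side state, keyed by session id


def handle(request):
    """Handle one request; 'session_id' is present only on return visits."""
    sid = request.get("session_id")
    if sid not in SESSIONS:
        # Unknown or missing id: start a fresh session with a hard-to-guess id.
        sid = secrets.token_hex(16)
        SESSIONS[sid] = {"visits": 0}
    SESSIONS[sid]["visits"] += 1
    # The id travels back to the client; the state itself stays on the server.
    return {"session_id": sid, "visits": SESSIONS[sid]["visits"]}


first = handle({})
second = handle({"session_id": first["session_id"]})
stranger = handle({})
print(second["visits"], stranger["visits"])
```

Only the identifier leaves the server in this sketch, which is why the ‘identity’ problem matters: anyone who can present the id is treated as the same user.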

Identifying the user

• IP address of client machine

• Session id

• User id – login id, National Insurance number, passport number …
  – Cahoot internet bank problem last week

• Address

• Mobile phone number

• Biometric data – finger print, iris pattern..

What to keep

• State must grow and change as the system learns more about you.

• State of interaction includes:
  – Current attributes of user: name, company..
  – History of every interaction allows unanticipated questions to be asked – cf data mining
  – Derived / deduced attributes – total expenditure, most recent address
• For data protection reasons, must not retain any more information than necessary??
• State can be defined using an ER model even if not stored in a database

Explicit / Implicit distinction

• Explicit
  – Facts held as data in the database
    • the person’s name and address
• Implicit
  – The implicit assumptions about the user which are built into the system:
    • The user’s language, ethnicity, location, capabilities
• Implicit -> Explicit
  – Surfacing assumptions
  – Representing assumptions explicitly
    • multi-language responses

User’s model of the machine

• Users need to develop their model of the machine to be able to use it effectively

• Part of the machine’s task is to help the user develop an appropriate model of itself.

• Users have an implicit model of the machine – preconceptions about how to use it.

• What does a person’s model of the machine look like and how does it develop?

Strategies to help the user

• Reduce the need for the user to have an extensive machine model

• Provide guidance
• Design the interaction to work in the way a user would naturally expect:
  – Donald Norman’s idea of affordance
    • The door handle example
• Use natural language
• Follow / establish standards

SMS Currency converter

• Exercise last year to design an SMS currency converter.

• More difficult interaction design than a web page converter:
  – No list of currencies to select from
  – Message length limits explanations
• More interesting
  – Input is limited natural language
  – User is mobile

Currency converter – stateful interaction

• Stateful interaction
  – Request: Cur 100 GBP USD
    • Machine stores from and to codes as state, identified by originating mobile number
  – Request: Cur 200
    • Machine identifies the request as originating from the same user, no from or to code supplied, so default to stored values
  – Request: Cur 100 GBP EUR
    • From and to codes set, so update state
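
The three requests above can be traced with a small sketch. State is a dictionary keyed by the originating mobile number; the function name `handle_sms`, the error messages and the sample number are assumptions, and the sketch only resolves which currencies a request refers to, it does not look up any exchange rate.

```python
STATE = {}  # mobile number -> (from_code, to_code)


def handle_sms(mobile, text):
    """Resolve a request such as 'Cur 100 GBP USD' or 'Cur 200'.

    Returns (amount, from_code, to_code), falling back on the codes
    stored for this mobile number when none are supplied.
    """
    parts = text.split()
    if len(parts) < 2 or parts[0].lower() != "cur":
        raise ValueError("not a currency request")
    amount = float(parts[1])
    codes = [p.upper() for p in parts[2:]]
    if len(codes) == 2:
        STATE[mobile] = (codes[0], codes[1])  # from and to codes set, so update state
    elif codes:
        raise ValueError("expected both a from and a to code, or neither")
    if mobile not in STATE:
        raise ValueError("no stored currencies for this number")
    from_code, to_code = STATE[mobile]
    return amount, from_code, to_code


number = "07700900001"  # illustrative number only
print(handle_sms(number, "Cur 100 GBP USD"))
print(handle_sms(number, "Cur 200"))
print(handle_sms(number, "Cur 100 GBP EUR"))
```

A first request of the form ‘Cur 200’ from an unknown number has nothing to default to, which the sketch treats as an error that a help message could answer.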

Currency Converter – message format

• Natural interaction
  – Allow multiple and surrounding spaces
  – Allow all sensible ordering of codes
    • 100 gbp usd
    • Gbp 100 usd
    • Gbp usd (assume 1 unit)
  – Allow noise words
    • Convert 100 usd into eur
  – Allow synonyms
    • Convert 100 pounds into euros (assume GBP)
  – Allow mistypes?
    • 100 GPB ERU
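
A minimal sketch of a tolerant parser covering the message formats listed above. The code list, synonym table and noise-word set are small illustrative assumptions, and mistyped codes are guessed with `difflib.get_close_matches` from the standard library, which is only one possible approach.

```python
from difflib import get_close_matches

CODES = ["GBP", "USD", "EUR"]                # small illustrative subset
SYNONYMS = {"pounds": "GBP", "pound": "GBP", "euros": "EUR", "euro": "EUR"}
NOISE = {"convert", "into", "to", "in", "from"}


def parse(message):
    """Return (amount, from_code, to_code); the amount defaults to 1 unit."""
    amount = 1.0
    codes = []
    for token in message.lower().split():    # split() copes with extra spaces
        if token in NOISE:
            continue
        if token.replace(".", "", 1).isdigit():
            amount = float(token)
        elif token in SYNONYMS:
            codes.append(SYNONYMS[token])
        else:
            # Exact codes match themselves; near misses are guessed.
            guess = get_close_matches(token.upper(), CODES, n=1, cutoff=0.6)
            if not guess:
                raise ValueError(f"unrecognised word: {token}")
            codes.append(guess[0])
    if len(codes) != 2:
        raise ValueError("expected a from and a to currency")
    return amount, codes[0], codes[1]


for msg in ["100 gbp usd", "Gbp 100 usd", "Gbp usd",
            "Convert 100 usd into eur", "Convert 100 pounds into euros",
            "100 GPB ERU"]:
    print(msg, "->", parse(msg))
```

Guessing at mistypes is risky with a full code list, since a valid but unlisted code could be silently ‘corrected’ to a different currency, which may be why the slide ends that point with a question mark.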

Currency converter - help

• Helpful feedback
  – If request not understood, give helpful response
    • Format of request
    • Codes for common currencies
    • Reference to source of codes
  – Support country to currency code query (perhaps by another service to get basic country data?)
  – Should help be stateful – not the same response each time, but one which depends on what has already been sent (but how long ago?)