Privacy-oriented Data Mining by Proof Checking
Stan Matwin (joint work with Amy Felty)
SITE
University of Ottawa, Canada
UBC 29/7/03 2
The TAMALE Group
• 4 profs
• Some 30 graduate students
• Areas: machine learning, data mining, text mining, NLP, data warehousing
• Research in
– Inductive Logic Programming
– Text mining
– Learning in the presence of knowledge
– Applications of ML/DM (e.g. in SE: tools for maintenance personnel)
• Why did I get into this research?
• what is already being done… and why it’s not enough
• the main idea
• its operation
• discussion – ‘correctness’
• prototype - Coq and CIC
• example
• some technical challenges
• acceptance?
Some useful concepts...
• opting out vs opting in
• Use Limitation Principle: data should be used only for the explicit purpose for which it has been collected
…and existing technical proposals
On the web: P3P (Platform for Privacy Preferences)
• W3C standard
• XML specifications - on websites and in browsers - of what can be collected and for what purpose - ULP?
• Handles cookies
• Data exchange protocol more than privacy protocol: no provisions for opting out after an initial opt-in
• the ULP part is in natural language (NL)… not verifiable
Agrawal’s data perturbation transformations
• data is perturbed by random distortion: x_i → x_i + r, with r uniform or Gaussian
• a procedure to reconstruct a PAC-estimation of the original distribution (but not the values)
• a procedure to build an accurate decision tree on the perturbed distribution
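The random-distortion step on this slide can be illustrated with a minimal Python sketch. It only shows that an aggregate (here the mean) survives the perturbation; it does not implement the distribution-reconstruction or decision-tree procedures. The function name `perturb`, the `spread` parameter, and the age data are assumptions made for illustration.

```python
import random
from statistics import mean

def perturb(values, spread, seed=None):
    """Return values with independent uniform noise r in [-spread, spread] added."""
    rng = random.Random(seed)
    return [x + rng.uniform(-spread, spread) for x in values]

# Hypothetical data: 10,000 ages generated for illustration only.
data_rng = random.Random(0)
ages = [data_rng.randint(20, 65) for _ in range(10_000)]
noisy_ages = perturb(ages, spread=10.0, seed=1)

# Individual values are distorted, but the zero-mean noise averages out.
print(round(mean(ages), 2), round(mean(noisy_ages), 2))
```

A wider `spread` hides individual values better but makes aggregate estimates noisier for a given sample size.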
Agrawal’s transformations cont’d
• proposes a measure to quantify privacy – estimate intervals and their size
• lately extended to non-numerical attributes, and to association rules
• does not address the ULP
• how do we know it is applied?
the main idea: towards a verifiable ULP
• User sets permissions: what can and cannot be done with her data
• Any claim that a piece of software respects these permissions is stated as a theorem about the software and justified by a proof
• Verifying the claim then amounts to checking that proof against the software
Who are the players?
• User C
• Data miner Org
• Data mining software developer Dev
• Independent verifier Veri
• …BUT no one owns the data D
D: database scheme
A: given set of database and data mining operations
S: source code for A
B: binary code of S
PC(D,A): C’s permissions
T(PC,S): theorem that S respects PC
R(PC,S): proof of T(PC,S)
Discussion - properties
• It can be proven that C’s permissions are respected (or not): PC is in fact a verifiable ULP
• PC can be negative (out) or positive (in)
• proof construction needs to be done only once for a given PC, D and A
• Scheme is robust against cheating by Dev or Org
Acceptance issues
• No Org will give Veri access to S
• Too much overhead to check R(PC,S) for each task, and each user
• Too cumbersome for C
• Based on all Orgs buying in
Acceptance1: Veri’s operation - access
• Veri needs
– PC from C
– R(PC,S) from Dev
– S from Dev
– B from Org
• Veri could check R(PC,S) at Dev’s site
• Veri needs to verify that S (belonging normally to Dev) corresponds to B that Org runs.
Acceptance2: overhead
• Veri runs proof checking on a control basis
• Org’s execution overhead?
Issues
• Naming the fields: XML or disclosure
• restricted class of theorems for a given P – automating proof techniques for this class
Acceptance3: C’s perspective
• Building PCs must be easy for C, based on D and processing schema; initially a closed set?
• permissions could be encoded on a credit card, smart card, in the electronic wallet
• or in the CA; they can then be dynamically modified and revoked
« Political » aspects: who is Veri?
• generally trusted
– « consumer association »?
– « Ralph Nader »?
– « Transparency International »?
• IT expert at the level of instrumenting and running the proof checker – connection to Open Software Foundation?
• theorem proving can be cast as « better testing »
how to make Orgs buy in?
• A first Org needs to volunteer
• a Green Data Mining logo will be granted and administered (verified) by Veri
• other Orgs will have an incentive to join
Future work
• Build the tools
• expand the prototype
• extend from Weka to commercial data mining packages
• Integrate with P3P?
• find a willing Org
Link between S and B
• compilation not an option
• watermarking solution: B is watermarked by a slightly modified compiler with MD5(tar(S)), a 128-bit digest
• marks are inserted by a trusted makefile-and-compiler in locations in B given by Veri and unknown to Org
Link…
• Veri, given access to S, can verify that B corresponds to S
• An attack by I requires hacking the compiler
• An attack by Org requires knowing the locations of watermarks
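The link between S and B can be sketched in Python as a toy model. This is not the modified compiler the slides describe: the binary is modelled as a plain byte string with spare bytes that can be overwritten, and the names `source_digest`, `embed`, `matches`, the file name `join.v`, and the offsets are all assumptions made for illustration.

```python
import hashlib
import io
import tarfile

def source_digest(files):
    """MD5 of a deterministic in-memory tar archive of {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0  # fixed timestamp keeps the archive reproducible
            tar.addfile(info, io.BytesIO(files[name]))
    return hashlib.md5(buf.getvalue()).digest()  # 16 bytes = 128 bits

def embed(binary, digest, locations):
    """Overwrite one byte of the (toy) binary per secret location with a digest byte."""
    marked = bytearray(binary)
    for offset, byte in zip(locations, digest):
        marked[offset] = byte
    return bytes(marked)

def matches(binary, files, locations):
    """Veri's check: do the marks in the binary equal MD5(tar(S))?"""
    digest = source_digest(files)
    return all(binary[o] == b for o, b in zip(locations, digest))

# Hypothetical source tree and 16 secret offsets known only to Veri.
source = {"join.v": b"Fixpoint Join ..."}
tampered = {"join.v": b"Fixpoint Join' ..."}
locations = list(range(100, 100 + 16 * 7, 7))
binary = embed(bytes(512), source_digest(source), locations)

print(matches(binary, source, locations), matches(binary, tampered, locations))
```

Because Org does not know `locations`, it cannot re-mark a binary built from different source; in this toy model that secrecy is the only protection.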
Example
C restricts her Employee data from participating in a join with her Payroll data
Record Payroll : Set :=
mkPay{PID : nat; JoinInd : bool; Position : string; Salary: nat}.
Record Employee : Set :=
mkEmp{Name : string; EID : nat; …}.
Record Combined : Set :=
mkComb{CID : nat; CName : string; Csalary: nat; …}.
Fixpoint Join [Ps: list Payroll]: (list Employee) → (list Combined) :=
  [Es: list Employee] Cases Ps of
    nil ⇒ (nil Combined)
  | (cons p ps) ⇒ (app (check_JoinInd_and_find_employee_record p Es) (Join ps Es))
  end.
(check_JoinInd_and_find_employee_record p Es):
if a record is found in Es whose EID matches p’s PID and p’s JoinInd permits the join, a list of length 1 containing the joined record is returned; otherwise the empty list is returned
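For readers unfamiliar with Coq, the Join function and its helper can be mirrored in Python roughly as follows. This is a sketch under assumptions: the fields elided with "…" in the records are simply left out, the way a Combined record is filled from the two inputs is a guess, and the sample names and salaries are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Payroll:
    pid: int
    join_ind: bool
    position: str
    salary: int

@dataclass(frozen=True)
class Employee:
    name: str
    eid: int

@dataclass(frozen=True)
class Combined:
    cid: int
    cname: str
    csalary: int

def check_join_ind_and_find_employee_record(p, es):
    """Return [Combined] if p permits joining and some EID equals p's PID, else []."""
    if not p.join_ind:
        return []
    for e in es:
        if e.eid == p.pid:
            # Assumed field mapping; the slides do not spell it out.
            return [Combined(cid=p.pid, cname=e.name, csalary=p.salary)]
    return []

def join(ps, es):
    """Structural recursion on ps, one payroll record at a time."""
    if not ps:
        return []
    p, *rest = ps
    return check_join_ind_and_find_employee_record(p, es) + join(rest, es)

# Hypothetical data: the second person has opted out of the join.
payroll = [Payroll(1, True, "analyst", 50000), Payroll(2, False, "manager", 70000)]
employees = [Employee("Ann", 1), Employee("Bob", 2)]
result = join(payroll, employees)
print(result)
```

The permission check sits inside the helper, so every record passes through it before it can reach the output.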
Definition Pc : ((list Payroll) → (list Employee) → (list Combined)) → Prop := [S]
  ∀ Ps: list Payroll. ∀ Es: list Employee. (UniqueJoinInd Ps) →
  ∀ P: Payroll. (In P Ps) → ((JoinInd P) = false →
  ¬ ∃ C: Combined. ((In C (S Ps Es)) ∧ ((CID C) = (PID P))))
• PC(S) is written as (Pc Join): Coq expands the definition of Pc and provides the theorem
• a request to Coq’s proof-checking operator then checks this proof, i.e. it checks that the user permissions are encoded into the given Join program
• Whole proof: 300 lines of Coq code; proof checking: 1 sec on a 600MHz machine
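To make the content of the permission concrete, the property can also be written as an executable check in Python. This is only a test on one input, whereas the Coq proof covers all inputs. The names `respects_pc`, `guarded_join` and `naive_join` are hypothetical, and the sample data uses distinct PIDs on the assumption that this is enough for the UniqueJoinInd hypothesis, which the slides do not define.

```python
def respects_pc(join_fn, payroll, employees):
    """True if no output record has a CID equal to the PID of an opted-out payroll record."""
    combined = join_fn(payroll, employees)
    opted_out = {p["PID"] for p in payroll if not p["JoinInd"]}
    return all(c["CID"] not in opted_out for c in combined)

def guarded_join(payroll, employees):
    """Join that consults JoinInd before combining records."""
    names = {e["EID"]: e["Name"] for e in employees}
    return [{"CID": p["PID"], "CName": names[p["PID"]], "CSalary": p["Salary"]}
            for p in payroll if p["JoinInd"] and p["PID"] in names]

def naive_join(payroll, employees):
    """Join that ignores JoinInd, i.e. software that does not respect the permission."""
    names = {e["EID"]: e["Name"] for e in employees}
    return [{"CID": p["PID"], "CName": names[p["PID"]], "CSalary": p["Salary"]}
            for p in payroll if p["PID"] in names]

# Hypothetical data: PID 2 has JoinInd set to False.
payroll = [{"PID": 1, "JoinInd": True, "Salary": 50000},
           {"PID": 2, "JoinInd": False, "Salary": 70000}]
employees = [{"EID": 1, "Name": "Ann"}, {"EID": 2, "Name": "Bob"}]

print(respects_pc(guarded_join, payroll, employees),
      respects_pc(naive_join, payroll, employees))
```

A passing check says nothing about other inputs, which is the gap the checked proof is meant to close.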