
BIBE’05, April 19, 2023

Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets

Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal


Overall Goal

- Informatics tools for biological data integration, driven by:
  - Data explosion
    - Data size & number of data sources
  - New analysis tools
  - Autonomous resources
    - Heterogeneous data representation & various interfaces
    - Frequent updates
- Common situations:
  - Flat-file datasets
  - Ad-hoc sharing of data


Current Approaches

- Manually written wrappers
  - Problems: $O(N^2)$ wrappers needed; $O(N)$ wrappers must change for a single update
- Mediator-based integration systems
  - Problems: need a common intermediate format; unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)


Our Approach

- Automatically generate wrappers
  - Transform data in files of arbitrary formats
  - No domain- or format-specific heuristics
  - Layout information provided by users
- Help biologists write layout descriptors using data mining techniques


Our Approach: Challenges

- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use?


Wrapper Generation System Overview

[Figure: wrapper generation system. Components shown: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset]


Key Open Questions

- How hard is it to write layout descriptors?
- Given a flat file, how hard is it to learn its layout?
- Can we make the process semi-automatic?


Learning Layout of a Flat-File

- In general: intractable
- Try to learn the layout, then have a domain expert verify it
- Key issue: what delimiters are being used?


Finding Delimiters

- Difficult problem
- Some knowledge from a domain expert is required (semi-automatic)
- Naïve approaches
  - Frequency counting: counts frequently occurring single tokens (words separated by spaces)
  - Sequence mining: counts frequently occurring sequences of tokens
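The two naïve approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tokens are split on whitespace (so the newline token used later in the Pfam example is not modeled), sequences are contiguous and never cross line boundaries, and the sample records and function names are hypothetical.

```python
from collections import Counter

def tokenize(line):
    """Split a line into whitespace-separated tokens."""
    return line.split()

def token_frequencies(lines):
    """Frequency counting: how often each single token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

def sequence_frequencies(lines, length):
    """Simplified sequence mining: count contiguous token sequences of one length."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts

# Hypothetical records, loosely styled after a tagged flat file.
sample = [
    "#=GF ID example_one",
    "#=GF AC PF00001",
    "#=GF ID example_two",
    "#=GF AC PF00002",
]
single = token_frequencies(sample)
pairs = sequence_frequencies(sample, 2)
```

Here `#=GF` is the most frequent single token, while the pair counts show which tokens tend to follow it.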


Frequency Counting

- Problems
  - Some tokens that appear very frequently are not delimiters
  - A delimiter could be a sequence of tokens rather than a single token
- Possible solution
  - Use the frequencies of a token sequence and of all its subsequences to decide which sequences are possible delimiters


Sequence Mining Example

- For any sequence of tokens s, f(s) represents the frequency of s
- Let's say A, B, C are tokens
- Case 1: f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10
  - Information about AB, BC, CA is already embedded in ABC
  - ABC is a possible delimiter but AB, BC, CA are not
- Case 2: f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10
  - BC and CA occur less frequently than AB
  - ABC cannot be a delimiter
  - AB is a possible delimiter
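One way to read the two cases is as a pair of rules over sequence frequencies. The sketch below encodes that reading; the rules themselves are an assumption, not taken verbatim from the slides. A sequence is rejected if a contiguous subsequence occurs strictly more often, and a subsequence is rejected if a longer sequence containing it occurs equally often. CA is left out because it is not a contiguous subsequence of ABC.

```python
def is_contiguous_subsequence(short, long):
    """True if `short` is strictly shorter than `long` and occurs inside it as a run."""
    if len(short) >= len(long):
        return False
    return any(long[k:k + len(short)] == short
               for k in range(len(long) - len(short) + 1))

def possible_delimiters(freq):
    """Apply the two assumed rules to a dict mapping token tuples to frequencies."""
    candidates = set()
    for seq, count in freq.items():
        # Assumed rule 1: a more frequent subsequence means seq is too long.
        dominated = any(is_contiguous_subsequence(other, seq) and other_count > count
                        for other, other_count in freq.items())
        # Assumed rule 2: an equally frequent longer sequence already covers seq.
        subsumed = any(is_contiguous_subsequence(seq, other) and other_count == count
                       for other, other_count in freq.items())
        if not dominated and not subsumed:
            candidates.add(seq)
    return candidates

case1 = {("A", "B", "C"): 10, ("A", "B"): 10, ("B", "C"): 10}
case2 = {("A", "B", "C"): 10, ("A", "B"): 20, ("B", "C"): 10}
result1 = possible_delimiters(case1)
result2 = possible_delimiters(case2)
```

Under these rules Case 1 keeps only ABC and Case 2 keeps only AB, matching the conclusions on the slide.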


Limitations of Sequence Mining

- Does not work very well if token frequencies are distributed in a skewed manner
- An example where it does not work (Pfam dataset):
  - \n, #=GF, AC are tokens with f(\n,#=GF) >> f(#=GF,AC) and f(\n,#=GF) >> f(\n,#=GF,AC)
  - \n #=GF is concluded to be a possible delimiter
  - In reality, \n #=GF AC is a delimiter


Can we do better?

- Biological datasets are written for humans to read
- It is very unlikely that delimiters will be scattered all around, in different places in a line
- The position of the possible delimiters might provide useful information
- A combination of positional and frequency information might be a better choice


Positional Weight

$$p\_ratio(i,j) = \frac{\mathrm{tot\_seq}_i^j}{\mathrm{tot\_unique\_seq}_i^j}$$

- Let $P$ be the set of positions in a line where a token can appear
- For each position $i \in P$, $\mathrm{tot\_seq}_i^j$ is the total number of token sequences of length $j$ starting at position $i$
- For each position $i \in P$, $\mathrm{tot\_unique\_seq}_i^j$ is the total number of unique token sequences of length $j$ starting at position $i$
- For any tuple $(i,j)$, $p\_ratio(i,j)$ is defined as shown above
- $p\_ratio(i,j)$ can be log-normalized to get the positional weight $p\_wt(i,j)$, with the property $p\_wt(i,j) \in (0,1)$
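A minimal sketch of the positional ratio, reading p_ratio(i,j) as the total number of length-j sequences starting at position i divided by the number of distinct ones. Several details are assumptions: positions are 0-based token indices, tokens are whitespace-separated, and the log normalization shown is just one formula that lands strictly inside (0,1), since the slides do not give the exact one.

```python
import math
from collections import defaultdict

def positional_ratios(lines, max_len):
    """Return {(i, j): tot_seq / tot_unique_seq} for start position i and length j."""
    total = defaultdict(int)
    unique = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens)):
            for j in range(1, max_len + 1):
                if i + j > len(tokens):
                    break
                total[(i, j)] += 1
                unique[(i, j)].add(tuple(tokens[i:i + j]))
    return {key: total[key] / len(unique[key]) for key in total}

def positional_weights(ratios):
    """Log-normalize ratios into (0, 1); this particular formula is an assumption."""
    top = max(ratios.values())
    return {key: math.log(1 + r) / math.log(2 + top) for key, r in ratios.items()}

# Hypothetical lines: the first token repeats, the second never does.
lines = ["ID alpha", "AC P001", "ID beta", "AC P002"]
ratios = positional_ratios(lines, 2)
weights = positional_weights(ratios)
```

Position 0 sees only two distinct tokens across four lines, so it gets a higher ratio and weight than position 1, where every token is different.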


Delimiter score (d_score)

- The frequency weight of a token sequence $s_i^j$ of length $j$ starting at position $i$, written $f\_wt(s_i^j)$, is obtained by log-normalizing its frequency $f(s_i^j)$
- Obviously, $f\_wt(s_i^j) \in (0,1)$
- The positional and frequency weights can now be combined to get d_score as follows:

$$d\_score(s_i^j) = \alpha \cdot p\_wt(i,j) + (1-\alpha) \cdot f\_wt(s_i^j), \qquad \alpha \in (0,1)$$

- Thus d_score has the following two properties:
  - $d\_score(s_i^j) \in (0,1)$
  - $d\_score(s_i^j) > d\_score(s_k^j)$ implies $s_i^j$ is more likely to be a delimiter than $s_k^j$
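The combination is a weighted average of two numbers in (0,1), which is why the result also stays in (0,1). A small sketch follows; the `log_normalize` formula, the default `alpha`, and the sample values are all assumptions chosen for illustration.

```python
import math

def log_normalize(value, upper):
    """Map a count in [1, upper] into (0, 1); this formula is an assumption."""
    return math.log(1 + value) / math.log(2 + upper)

def d_score(p_wt, f_wt, alpha=0.5):
    """Weighted average of positional and frequency weights, with alpha in (0, 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * p_wt + (1 - alpha) * f_wt

# Hypothetical values: a sequence seen 40 times, where the top frequency is 100.
f_wt = log_normalize(40, upper=100)
score = d_score(p_wt=0.8, f_wt=f_wt, alpha=0.6)
```

A larger `alpha` gives more say to position, a smaller one to raw frequency.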


Finding delimiters using d_score

- Since the delimiter sequence length is not known in advance, an iterative algorithm is used to get a superset $S$ of potential delimiters, where:
  - At any iteration $i$, $c_i$ is the cut-off value, determined by observing a substantial difference in the sorted d_score values
  - The set of all token sequences with d_score above $c_i$ is called $N_i$
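The cut-off step can be sketched as follows. Three things here are assumptions rather than details from the slides: "substantial difference" is taken to mean the largest gap between consecutive sorted scores, each iteration is taken to handle one sequence length, and S is taken to be the union of the selected sets. The scores are hypothetical.

```python
def cutoff_by_largest_gap(scores):
    """Return (cutoff, selected): split the sorted scores at their largest gap."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two scored sequences")
    gaps = [ranked[k][1] - ranked[k + 1][1] for k in range(len(ranked) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__)
    cutoff = (ranked[split][1] + ranked[split + 1][1]) / 2
    selected = {seq for seq, value in ranked if value > cutoff}
    return cutoff, selected

def potential_delimiters(scores_by_iteration):
    """Union of the selected sets over all iterations (assumed construction of S)."""
    superset = set()
    for scores in scores_by_iteration:
        _, selected = cutoff_by_largest_gap(scores)
        superset |= selected
    return superset

# Hypothetical d_score values for sequences of length 1 and length 2.
len1 = {("ID",): 0.91, ("AC",): 0.88, ("alpha",): 0.30, ("beta",): 0.28}
len2 = {("#=GF", "ID"): 0.85, ("#=GF", "AC"): 0.83, ("ID", "alpha"): 0.20}
cutoff1, selected1 = cutoff_by_largest_gap(len1)
S = potential_delimiters([len1, len2])
```

In practice the cut-off would probably be inspected by a person, since a clear gap is not guaranteed to exist.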


Generating layout descriptor

- Once the delimiters are identified, an NFA can be built by scanning the whole database, with the delimiters as the states of the NFA
- This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
- The following figure shows an NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C D being a repeating delimiter sequence

[Figure: NFA over delimiters A, B, C, D, E]
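A rough sketch of how such an automaton could be collected from data: each record is reduced to the order in which known delimiters appear, and every observed "X followed by Y" pair becomes a transition. The record contents, the `START`/`END` markers, and the function name are hypothetical; the actual construction used by the authors is not shown in the slides.

```python
from collections import defaultdict

START, END = "<start>", "<end>"

def build_transitions(records):
    """Map each state to the set of states observed immediately after it."""
    transitions = defaultdict(set)
    for record in records:
        path = [START, *record, END]
        for current, following in zip(path, path[1:]):
            transitions[current].add(following)
    return dict(transitions)

# Hypothetical delimiter orderings: B is skipped in one record, C D repeats in it.
records = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "D", "C", "D", "E"],
]
nfa = build_transitions(records)
```

The optional delimiter shows up as A having both B and C as successors, and the repetition shows up as the edge from D back to C.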


Realistic Situation

- Identifying the complete list of correct delimiters is difficult
- Most likely we will end up with an incomplete list of delimiters
- Delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed


Identifying Optional Delimiters

- Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
  - Build an NFA based on the given incomplete information
  - Perform clustering to identify possible crucial delimiters
  - Perform contrast analysis


Crucial delimiter

- A delimiter is considered crucial if missing delimiters will appear immediately following it
- The goal is to create two clusters:
  - One containing delimiters that are not crucial
  - The other containing crucial delimiters


Identifying crucial delimiters: A few definitions

- Succ(X): set of delimiters that can immediately follow X
- Dist_App: number of groups of occurrences of X, based on the number of text lines between X and the immediately next delimiter
- Info_Tuple($n_X^i$, $f_X^i$, $t_X^i$): information for each Dist_App
- Info_Tuple_List $L_X$: for any X, the list of all possible Info_Tuples
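A possible way to build the Info_Tuple list for one delimiter X is sketched below. The meaning of the three fields is inferred from the later examples, so treat it as an assumption: n is the number of text lines up to the next known delimiter, f is how many occurrences of X share that distance, and t is the text of those occurrences appended together. The input format and names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class InfoTuple(NamedTuple):
    n: int  # lines between X and the next known delimiter (assumed meaning)
    f: int  # occurrences of X at that distance (assumed meaning)
    t: str  # text of those occurrences, appended (assumed meaning)

def info_tuple_list(occurrences):
    """Group (line_count, text) pairs for one delimiter by line_count."""
    groups = defaultdict(list)
    for line_count, text in occurrences:
        groups[line_count].append(text)
    return [InfoTuple(n, len(texts), "\n".join(texts))
            for n, texts in sorted(groups.items(), reverse=True)]

# Hypothetical occurrences of X: two are 3 lines long, three are 1 line long.
occurrences = [(3, "x1"), (1, "x2"), (3, "x3"), (1, "x4"), (1, "x5")]
L_X = info_tuple_list(occurrences)
dist_app = len(L_X)  # number of distinct groups
```

With this reading, Dist_App is simply the length of the list.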


Metric for clustering

$$r_X^f = \frac{f_X^{max}}{f_X^{total}}$$

- $r_X^f$ is likely to be low if an optional delimiter appears immediately after X, and high otherwise
- Choose a suitable cut-off value $r_c$ and assign delimiters to different groups as follows:
  - If $r_X^f < r_c$, assign X to the group containing possible crucial delimiters
  - Else assign X to the group containing non-crucial delimiters
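The clustering rule is easy to sketch once the ratio is read as the largest group frequency of X divided by the sum of all its group frequencies. That reading, the cut-off value of 0.9, and the sample frequencies are assumptions for illustration.

```python
def frequency_ratio(frequencies):
    """r = largest group frequency / total frequency (assumed reading of the metric)."""
    return max(frequencies) / sum(frequencies)

def cluster_delimiters(group_frequencies, r_c):
    """Split delimiters into (possibly crucial, non-crucial) using cut-off r_c."""
    crucial, non_crucial = [], []
    for delimiter, freqs in group_frequencies.items():
        if frequency_ratio(freqs) < r_c:
            crucial.append(delimiter)
        else:
            non_crucial.append(delimiter)
    return crucial, non_crucial

# Hypothetical data: X is followed by varying amounts of text, Y always by the same.
group_frequencies = {"X": [20, 12, 10], "Y": [42]}
crucial, non_crucial = cluster_delimiters(group_frequencies, r_c=0.9)
```

X has its occurrences spread over three groups, so its ratio is low and it is flagged as possibly crucial; Y has a single group and a ratio of 1.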


Observations and Facts

- Missing optional delimiters can appear immediately after crucial delimiters ONLY
  - Non-crucial delimiters can be pruned away
- Consider two Info_Tuples $(n_X^1, f_X^1, t_X^1)$ and $(n_X^2, f_X^2, t_X^2)$ in $L_X$
- If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not the second one:
  - $n_X^1 > n_X^2$
  - The missing delimiter will appear in $t_X^1$ but not in $t_X^2$


A hypothetical example illustrating Contrast Analysis

- Suppose X is a crucial delimiter with two Info_Tuples, $L_1$ and $L_2$, as follows:
  - $L_1 = (50, 20, l_1.txt)$
  - $L_2 = (20, 12, l_2.txt)$
- Sequence mining on $l_1.txt$ and $l_2.txt$ yields two sets of frequently occurring sequences, $S_1$ and $S_2$, as follows:
  - $S_1 = \{ f_1, f_5, f_6, f_8, f_{13}, f_{21} \}$
  - $S_2 = \{ f_1, f_4, f_6, f_7, f_8, f_{10}, f_{13}, f_{21} \}$
- Since $f_5 \in S_1$ but $f_5 \notin S_2$, $f_5$ is a possible missing delimiter
- $f_5$ is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter


Contrast Analysis

- For any $i, j$, if $n_X^i > n_X^j$, look for frequently occurring sequences in $t_X^i$ and $t_X^j$; call them $fs_X^i$ and $fs_X^j$ respectively
- If there exists a frequent sequence $fs$ such that $fs \in fs_X^i$ but $fs \notin fs_X^j$, then $fs$ is quite likely to be a possible delimiter
- If $fs$ has a fairly high d_score or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
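At its core, contrast analysis is a set difference followed by a filter. The sketch below reuses the sets from the hypothetical example with two Info_Tuples; the d_score values, the threshold, and the function names are assumptions added for illustration.

```python
def contrast(frequent_i, frequent_j):
    """Sequences frequent in the text of tuple i but not in the text of tuple j."""
    return set(frequent_i) - set(frequent_j)

def new_delimiters(frequent_i, frequent_j, d_scores, threshold, expert_approved=()):
    """Keep contrasting sequences with a high d_score or an expert's approval."""
    found = set()
    for seq in contrast(frequent_i, frequent_j):
        if d_scores.get(seq, 0.0) >= threshold or seq in expert_approved:
            found.add(seq)
    return found

S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}

candidates = contrast(S1, S2)
accepted = new_delimiters(S1, S2, d_scores={"f5": 0.8}, threshold=0.7)
rejected = new_delimiters(S1, S2, d_scores={"f5": 0.2}, threshold=0.7)
approved = new_delimiters(S1, S2, d_scores={"f5": 0.2}, threshold=0.7,
                          expert_approved={"f5"})
```

Only f5 survives the difference, and it is kept either because of its score or because an expert vouches for it.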


Generalized Contrast Analysis

- In case of more than two Info_Tuples, identify the mean of all $n_X^i$ values:

$$n_X^{mean} = \frac{\sum_{i=1}^{l} n_X^i \, f_X^i}{f_X^{total}}$$

- Form a group by appending text from all Info_Tuples where $n_X^i \ge n_X^{mean}$
- Form another group by appending text from all Info_Tuples where $n_X^j < n_X^{mean}$
- Perform contrast analysis among all such possible groups


Another example illustrating Generalized Contrast Analysis

- Suppose X is a crucial delimiter with three Info_Tuples, $L_1$, $L_2$, $L_3$, as follows:
  - $L_1 = (50, 20, l_1.txt)$
  - $L_2 = (20, 12, l_2.txt)$
  - $L_3 = (15, 10, l_3.txt)$
- Mean number of lines:

$$n_X^{mean} = \frac{(50 \times 20) + (20 \times 12) + (15 \times 10)}{20 + 12 + 10} \approx 33.09$$

- Append $l_2.txt$ and $l_3.txt$; call the result $t_2.txt$
- Sequence mining on $l_1.txt$ and $t_2.txt$ yields two sets of frequently occurring sequences, $S_1$ and $S_2$, as follows:
  - $S_1 = \{ f_1, f_5, f_6, f_8, f_{13}, f_{21} \}$
  - $S_2 = \{ f_1, f_4, f_6, f_7, f_8, f_{10}, f_{13}, f_{21} \}$
- Since $f_5 \in S_1$ but $f_5 \notin S_2$, $f_5$ is a possible missing delimiter
- $f_5$ is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
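The grouping step from this example can be checked with a short sketch. The mean is the frequency-weighted average of the line counts; how a tuple exactly at the mean is assigned is an assumption here (it goes to the upper group), and the function names are hypothetical.

```python
def weighted_mean_lines(info_tuples):
    """Frequency-weighted mean of n over (n, f, t) tuples."""
    total = sum(f for _, f, _ in info_tuples)
    return sum(n * f for n, f, _ in info_tuples) / total

def split_by_mean(info_tuples):
    """Return (mean, texts with n >= mean, texts with n < mean)."""
    mean = weighted_mean_lines(info_tuples)
    above = [t for n, _, t in info_tuples if n >= mean]
    below = [t for n, _, t in info_tuples if n < mean]
    return mean, above, below

tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
mean, above, below = split_by_mean(tuples)
```

The mean works out to 1390 / 42, so l1.txt forms one group on its own and l2.txt and l3.txt are appended into the other, as in the example.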


Overall Algorithms


Results: Optional delimiters

% Pruning=


Results: Non-optional Missing delimiters

- Even though it was designed for finding optional delimiters, our algorithm works, in some cases, for missing non-optional delimiters too
- If a missing non-optional delimiter appears in exactly the same location in each record, then our algorithm fails
- If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, then our algorithm works


Summary

- Semi-automatic tool for learning the layout of a flat-file dataset
  - Mechanism for identifying missing optional delimiters
- Automatic tool for wrapper generation
  - Once the layout descriptor is known
- Can ease integration of new/updated sources


Questions?
