8/3/2019 Noise Removal Efficient Web Data Mining
4/22/2012
Eliminating Noisy Information in Web Pages for Data Mining
Presented by
Jyothi B, S1 SE, TKMIT
Guided by
Ms. Revathy
Noisy information
In web sites, noise consists of blocks such as copyright notices, privacy notices, and advertisements.
Web noise is of two types:
Global noise
Local noise
Web mining
Web mining is the application of data mining techniques to discover patterns from the web.
Web usage mining is the process of extracting useful information from server logs.
Web content mining is the process of discovering useful information from text, image, audio, or video data on the web.
Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.
Sample web site
[Figure: screenshot of a sample web site.]
Style tree
This technique proposes a tree structure, called a style tree, to capture the common presentation style and actual contents of the pages in a given site.
A site style tree (SST) can be built for a web site.
Web page cleaning is then done using a cleaning technique based on the SST.
Cleaning Technique
Web page cleaning is a kind of preprocessing.
It is based on the observation that most web pages are automatically generated.
Parts of a page whose layouts and actual contents also appear in other pages of the site are more likely to be noise.
Parts of a page whose layouts or actual contents are quite different from those of other pages are usually the main contents of the page.
Entropy-based information measure
Proposes an information-based measure to determine which parts of the style tree indicate noise and which parts contain the main contents of the pages in the web site.
The proposed importance measure formula is entropy based.
Experimental results show an increase in the accuracy of web mining when the proposed web page cleaning method is used.
Data structure: Site Style Tree
Web sites are commonly represented by DOM trees.
Disadvantage of DOM
A DOM tree alone is insufficient.
It is hard to study the overall presentation style and content of a set of HTML pages, and to clean them, based on individual DOM trees.
DOM Trees and Style Tree
[Figure: DOM trees of two pages, d1 and d2. Each has a BODY with TABLE, IMG, P, BR, and A nodes; nodes carry display attributes such as width=800, bgcolor=white, and bgcolor=red.]
DOM Trees and Style Tree
The style tree is a compressed representation of the two DOM trees. It shows which parts of the DOM trees are common and which parts are different.
[Figure: style tree merging the two DOM trees under a single Root/BODY, with the shared TABLE and IMG nodes and their attributes.]
Site Style Tree
A style node S represents a layout or presentation style. It has two components, denoted (Es, n), where Es is a sequence of element nodes and n is the number of pages that have this particular style at this node level.
An element node E has three components, denoted (TAG, Attr, Ss), where:
TAG is the tag name;
Attr is the set of display attributes of TAG;
Ss is the set of style nodes below E.
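The two node types above can be sketched as simple data classes. This is a hypothetical Python sketch, not the paper's implementation; the field names follow the (Es, n) and (TAG, Attr, Ss) components defined on this slide:

```python
from dataclasses import dataclass, field

@dataclass
class ElementNode:
    """An element node E = (TAG, Attr, Ss)."""
    tag: str                                     # TAG: the tag name, e.g. "TABLE"
    attr: dict                                   # Attr: display attributes of TAG
    styles: list = field(default_factory=list)   # Ss: style nodes below E

@dataclass
class StyleNode:
    """A style node S = (Es, n)."""
    elements: list = field(default_factory=list) # Es: sequence of element nodes
    page_count: int = 0                          # n: pages with this style at this level

# Build a tiny SST fragment: a BODY whose single style is shared by 5 pages.
table = ElementNode("TABLE", {"bgcolor": "white", "width": "800"})
body = ElementNode("BODY", {}, [StyleNode([table], page_count=5)])
```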
Importance measure
An entropy-based importance measure is used to determine the noisy elements in the style tree (ST).
It is based on the following assumptions:
1. The more presentation styles an element node has, the more important it is, and vice versa.
2. The more diverse the actual contents of an element node are, the more important the element node is, and vice versa.
The importance of an element node is obtained by combining its presentation importance and its content importance.
Importance Measure
[Figure: an example style tree (Root/BODY with Table, Img, Tr, A, P, and Text nodes) used to illustrate the importance measure.]
Importance measure formula
The composite importance measure of a node combines the importance of the element node itself with that of its descendants.
For an internal node it is based on the node's presentation styles and the importance of its descendants:

CompImp(E) = (1 - γ) · NodeImp(E) + γ · Σ_{i=1..l} p_i · CompImp(S_i)

where γ is the attenuating factor, set to 0.9, l is the number of style nodes in E.Ss, and p_i is the probability that a web page uses the i-th style node in E.Ss.

NodeImp(E) = 1, if m = 1
NodeImp(E) = -Σ_{i=1..l} p_i · log_m p_i, if m > 1

where m is the number of pages containing E.

CompImp(S_i) = ( Σ_{j=1..k} CompImp(E_j) ) / k

where E_j is the j-th element node in S_i.Es and k is the number of element nodes in S_i.Es.
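The internal-node formulas can be sketched in Python. This is a minimal illustration, assuming the style probabilities p_i and the children's composite importances are already known; the function names are mine, not the paper's:

```python
import math

GAMMA = 0.9  # attenuating factor from the slide

def node_imp(style_probs, m):
    """NodeImp(E): entropy of style usage with log base m (m = pages containing E)."""
    if m == 1:
        return 1.0
    return -sum(p * math.log(p, m) for p in style_probs if p > 0)

def comp_imp_internal(style_probs, child_comp_imps, m, gamma=GAMMA):
    """CompImp(E) = (1 - gamma) * NodeImp(E) + gamma * sum_i p_i * CompImp(S_i)."""
    weighted = sum(p * c for p, c in zip(style_probs, child_comp_imps))
    return (1 - gamma) * node_imp(style_probs, m) + gamma * weighted

# Example: E appears in m = 4 pages, with two style nodes used by 2 pages each,
# so p = [0.5, 0.5]; NodeImp = -2 * 0.5 * log_4(0.5) = 0.5.
imp = comp_imp_internal([0.5, 0.5], [1.0, 0.0], m=4)
```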
Importance measure formula (leaf nodes)
Leaf nodes differ from internal nodes: the composite importance of a leaf node is based on the information in its actual contents (the text with no tags):

CompImp(E) = 1, if m = 1
CompImp(E) = 1 - ( Σ_{i=1..l} H(a_i) ) / l, if m > 1

where a_i is an actual feature of the content in E, l is the number of features, and H(a_i) is the information entropy of a_i within the context of E:

H(a_i) = -Σ_{j=1..m} p_ij · log_m p_ij
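The leaf-node formula can be sketched the same way. A minimal illustration, assuming each feature a_i is summarized by its probability distribution p_ij over the m pages (the function names are mine):

```python
import math

def feature_entropy(probs, m):
    """H(a_i) = -sum_j p_ij * log_m(p_ij): entropy of one content feature over m pages."""
    return -sum(p * math.log(p, m) for p in probs if p > 0)

def comp_imp_leaf(feature_prob_lists, m):
    """CompImp for a leaf node: 1 if m == 1, else 1 - average feature entropy."""
    if m == 1:
        return 1.0
    l = len(feature_prob_lists)
    return 1.0 - sum(feature_entropy(ps, m) for ps in feature_prob_lists) / l

# Example with m = 2 pages:
# feature a1 is identical in both pages -> probs [1.0], H = 0 (no diversity)
# feature a2 is split evenly           -> probs [0.5, 0.5], H = 1 (maximal diversity)
leaf_imp = comp_imp_leaf([[1.0], [0.5, 0.5]], m=2)
```

A leaf whose content never varies across pages has entropy 0 per feature and so the highest importance; fully diverse content drives the entropy up and the importance down, matching assumption 2 of the importance measure.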
Overall algorithm
1. Randomly crawl k pages from the given web site S
2. Initialize a null SST with virtual root E
3. For each page W in the k pages do
4.     BuildPST(W);
5.     BuildSST(E, Ew);
6. End for
7. CalcCompImp(E);
8. MarkNoise(E);
9. MarkMeaningful(E);
10. For each target web page P do
11.     Ep = BuildPST(P);
12.     MapSST(E, Ep);
13. End for
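The two-phase shape of the algorithm (build the SST from a sample, then clean each target page) can be sketched as a driver loop. This is a deliberately toy version with stand-in stubs: here a "style tree" is just the set of tags on a page, and noise is any tag that appears on nearly every sampled page. BuildPST, BuildSST, etc. in the pseudocode are far richer than these stubs:

```python
def build_pst(page_tags):
    """Stub PST builder: a 'tree' here is just the set of tags in the page."""
    return set(page_tags)

def merge_into_sst(sst, pst):
    """Stub SST merge: count in how many pages each tag (style) occurs."""
    for tag in pst:
        sst[tag] = sst.get(tag, 0) + 1

def build_sst(sample_pages):
    """Steps 1-6: crawl k pages and fold each page's PST into the SST."""
    sst = {}
    for page in sample_pages:
        merge_into_sst(sst, build_pst(page))
    return sst

def mark_noise(sst, k, threshold=0.8):
    """Steps 7-8 stand-in: tags present in >= threshold of pages count as noise."""
    return {tag for tag, n in sst.items() if n / k >= threshold}

pages = [["body", "nav", "p"], ["body", "nav", "ul"], ["body", "nav", "h1"]]
sst = build_sst(pages)
noisy = mark_noise(sst, k=3)  # 'body' and 'nav' appear on every page
```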
Algorithm - MarkNoise
Noisy: for an element node E in the SST, if all of its descendants and E itself have composite importance less than a specified threshold t, then we say element node E is noisy.

Input: E, the root element node of a SST.
Return: TRUE if E and all of its descendants are noisy, else FALSE.

MarkNoise(E)
1. For each S in E.Ss do
2.     For each e in S.Es do
3.         If (MarkNoise(e) == FALSE) then
4.             return FALSE
5.         End if
6.     End for
7. End for
8. If (E.CompImp < t) then mark E as noisy and return TRUE
9. Else return FALSE
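The recursion can be sketched directly in Python. A minimal illustration over plain dicts, assuming CompImp has already been computed for every node; an element node is a dict {"comp_imp": float, "styles": [[child, ...], ...]}, and the threshold value is illustrative:

```python
THRESHOLD = 0.3  # the threshold t; this value is illustrative, not from the paper

def mark_noise(e, t=THRESHOLD):
    """Return True iff e and all of its descendants have CompImp below t (noisy)."""
    for style in e["styles"]:              # 1. for each S in E.Ss
        for child in style:                # 2. for each e in S.Es
            if not mark_noise(child, t):   # 3-4. any meaningful descendant -> FALSE
                return False
    if e["comp_imp"] < t:                  # 8. E itself must also be unimportant
        e["noisy"] = True
        return True
    return False

leaf = {"comp_imp": 0.1, "styles": []}
root = {"comp_imp": 0.2, "styles": [[leaf]]}
all_noisy = mark_noise(root)  # both nodes fall below t, so the subtree is noisy
```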
Algorithm - Definitions
Maximal noisy element node: if a noisy element node E in the SST is not a descendant of any other noisy element node, we call E a maximal noisy element node.
Meaningful: if an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.
Maximal meaningful element node: if a meaningful element node E is not a descendant of any other meaningful element node, we say E is a maximal meaningful element node.
Algorithm - Definitions
[Figure: the example style tree (Root/BODY with Table, Img, Tr, A, P, and Text nodes), annotated to illustrate the definitions above.]
A simplified SST
[Figure: a simplified SST with Root → Body → {Table, Img, Table, Table}, and Tr, Tr, Text nodes below.]
Algorithm - MapSST
MapSST uses the simplified SST, compares it with the page style tree, and extracts the actual contents.

Input: E, the root element node of the simplified SST.
Input: Ep, the root element node of the page style tree.
Return: the main content of the page after cleaning.

MapSST(E, Ep)
1. If E is noisy then
2.     Delete Ep as noise
Algorithm - MapSST (continued)
3.     Return NULL
4. End if
5. If E is meaningful then
6.     Ep is meaningful
7.     Return the content under Ep
8. Else
9.     returnContent = NULL
10.    S2 is the style node in Ep.Ss
11.    If (there exists S1 in E.Ss such that S2 matches S1) then
12.        e1i is the ith element node in sequence S1.Es
13.        e2i is the ith element node in sequence S2.Es
14.        For each pair (e1i, e2i) do
15.            returnContent += MapSST(e1i, e2i)
16.        End for
17.        Return returnContent
18.    Else Ep is possibly meaningful
19.        Return the content under Ep
20.    End if
21. End if
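The MapSST walk can be sketched as a parallel recursion over two toy trees. This is a hypothetical simplification: SST nodes are dicts with "noisy"/"meaningful" flags and "styles", page nodes carry "content" and "children", and "match" is reduced to comparing child counts (the paper's matching is much more precise):

```python
def map_sst(e, ep):
    """Walk the simplified SST (e) and a page style tree (ep) in parallel."""
    if e.get("noisy"):                        # 1-3: noisy in the SST -> delete Ep
        return None
    if e.get("meaningful"):                   # 5-7: meaningful -> keep Ep's content
        return ep["content"]
    result = ""                               # 9: returnContent = NULL
    for s1 in e["styles"]:                    # 10-11: find a style node matching Ep's
        if len(s1) == len(ep["children"]):    # toy "match": same number of children
            for e1, e2 in zip(s1, ep["children"]):   # 12-16: recurse pairwise
                piece = map_sst(e1, e2)
                if piece:
                    result += piece
            return result                     # 17
    return ep["content"]                      # 18-19: no match -> possibly meaningful

nav = {"noisy": True}
body = {"meaningful": True}
sst_root = {"styles": [[nav, body]]}
page = {"content": "", "children": [{"content": "ads"}, {"content": "article"}]}
main = map_sst(sst_root, page)  # drops the noisy region, keeps the meaningful one
```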
Overall algorithm - revisited
1. Randomly crawl k pages from the given web site S
2. Initialize a null SST with virtual root E
3. For each page W in the k pages do
4.     BuildPST(W);
5.     BuildSST(E, Ew);
6. End for
7. CalcCompImp(E);
8. MarkNoise(E);
9. MarkMeaningful(E);
10. For each target web page P do
11.     Ep = BuildPST(P);
12.     MapSST(E, Ep);
13. End for
Execution Time
The time taken to build the SST is always below 20 seconds.
Computing the composite importance finishes within 2 seconds.
The final step of cleaning each page takes less than 0.1 seconds.
Advantages
The proposed system is faster than the existing one.
It is very efficient in handling noise.
It removes around 99% of the unwanted content.
Disadvantages
It faces difficulty in handling scripts inside the body.
An unformatted web page structure causes exceptions.
It focuses only on HTML pages.
Conclusions
Proposes a technique to clean web pages for web mining.
Introduces a data structure, the SST, to capture layouts and presentation styles.
Proposes an information-based measure to evaluate the importance of element nodes in the SST, so as to detect noise.
Results show that the proposed technique is highly effective.