
A. Laganà et al. (Eds.): ICCSA 2004, LNCS 3046, pp. 772–782, 2004. © Springer-Verlag Berlin Heidelberg 2004

Annotation Repositioning Methods in the XML Documents: Context-Based Approach¹

Won-Sung Sohn¹, Myeong-Cheol Ko², Hak-Keun Kim¹, Soon-Bum Lim³, and Yoon-Chul Choy¹

¹ Department of Computer Science, Yonsei University, Shinchon-dong, Seodaemun-ku, 120-749, Seoul, Korea
{sohnws, ycchoy}@rainbow.yonsei.ac.kr
² Department of Computer Science, Konkuk University, Danwol-dong Chungju-si, Chungbuk, 380-701, Korea
[email protected]
³ Department of Multimedia Science, Sookmyung Women's University, 140-742, Seoul, Korea
[email protected]

Abstract. This paper presents context-based repositioning methods for annotations in the XML document. In the proposed methods, the XML-based original document and the annotation information are represented as logical structure trees, and candidate anchors are produced in the process of creating matching relations between the trees. To select an appropriate candidate anchor among many candidates, repositioning rules are presented by stages, based on the textual data and label information of the anchor nodes of the logical structure trees. As a result, annotations in the structured document are robustly positioned even after various modifications of contexts in the document.

1 Introduction

The use of annotation techniques in the electronic document environment has expanded rapidly because annotations are more advantageous in electronic documents than in paper documents[1],[2],[3],[4]. Annotations in an electronic document are usually produced as fine-grained external links and saved, inside or outside the system, separately from the original document[2],[5],[6].

Therefore, they become orphan data when the contents of the target document are deleted or modified[1],[2],[6],[7]. To solve this problem, repositioning of annotation anchors (anchoring) should remain possible even after the original document is modified[1],[6]. The repositioning process is considered the most important function of an annotation system[3].

1 This work was supported by the Post-doctoral Fellowship Program of Korea Science & Engineering Foundation (KOSEF).

Contact Author


The annotation repositioning process involves the relations between annotations and sub-resources, including contents and spans presented as external links. Therefore, the contexts between the annotated information and the document should be considered when repositioning an annotation's anchor[2],[6]. In fact, most robust annotation repositioning methods primarily consider contexts such as unique IDs, substrings, surrounding texts, and keywords, as shown in Table 1.

However, related works on annotation repositioning detect only moves of texts in the original document[4],[8],[9],[10]. Therefore, if texts are modified, all the related annotations are usually orphaned[1],[6].

A robust anchoring method is particularly needed in the XML-based annotation environment[5], because XML is the original document format in the programs that use annotations most frequently, such as Cyber-Class, e-Learning, and e-Book[13]. However, most of the related works deal only with text documents, and the works that do consider structured documents[6],[7] determine modifications of the document by comparing only the paths between annotations and the original document. Therefore, repositioning annotations is difficult when structures of the original document have been deleted, moved, or modified.

Table 1. Comparison of annotation repositioning methods.

| Approach Types | Repositioning Methods | Characteristics | Limitations |
|---|---|---|---|
| Unique Context [4],[8] | Whether the document was edited is determined by the existence of the unique context (id, substring). | Well applicable to various systems. | Annotations are usually orphaned or deleted if detection of the unique substring fails. |
| Redundant Context [9],[10] | Whether the document was edited is determined by comparing the anchor texts with the original document; final positions are selected by comparing surrounding texts. | Applied in most annotation systems that provide anchor repositioning. | Does not consider the selection of candidate anchors that are created in the process of repositioning on modified texts. |
| Keyword Anchoring [1],[2] | Extracts unique keywords from anchor texts and detects anchor positions based on the keywords. | Reflects the cognitive feature for anchor detection and provides a robust anchoring interface that uses confidence scores of anchors. | Handles only general text documents, and it cannot operate without keyword-finding processes even in a huge-sized web environment. |
| Tree Walks [6],[7] | Checks for a unique identifier between the original document and the anchors, and then selects proper nodes through tree walks if updates are found. | Attempts repositioning even after documents were modified, and provides an interface through which new anchors are reattached if anchoring fails. | The tree walk method performs only path matching operations between HTML documents and annotations. |

This paper presents context-based annotation repositioning methods for the XML-based annotation system. The proposed repositioning methods represent the XML original document and the annotation information as logical structure trees and create matching relations between the trees. Candidate anchors are created in the process, and repositioning rules are presented by stages for selecting an appropriate anchor among the candidates. The proposed repositioning rules create and merge candidate anchors based on the label and textual contexts of the anchor nodes of the logical structure trees.

In that way, annotations are robustly repositioned even after contexts are deleted, moved, or modified in the structured document.

2 Context-Based Repositioning of Annotations

This paper proposes annotation processing methods for robust repositioning in the XML document. In the proposed methods, logical structure trees are created for the target document (XML) and the annotations, and annotations are robustly repositioned in the modified original document, as shown in Figure 1.

[Figure 1 diagram labels: Old Version (XML); New Version (XML); Annotation; Logical Structure Tree; Matching and Repositioning]

Fig. 1. Overall processing of the proposed annotation repositioning methods.

To find out how much the target document was modified, the proposed methods traverse the logical structure trees and determine whether the created annotation information and the document structures differ from each other. According to the degree of the modifications, matching relations[12] are created between the nodes of the trees, and proper candidate anchors are created by stages in the process. To obtain results that can be properly applied in the annotation environment, this work considers various changes of paths in the logical structure trees and of anchor texts. The proposed repositioning methods are divided into robust repositioning in a document whose structures remain the same and robust repositioning between documents with different structure information. The details of the proposed methods are as follows.


2.1 Annotation Repositioning between the Non-changed Structures

The proposed methods follow different repositioning processes depending on whether the structures of the original document were modified. To decide this, the methods first traverse the logical structure tree of the original document and examine whether each annotation anchor's path, offset, and anchor text still exist in it. If an annotation's path information remains the same in the original document, only the anchor texts need to be checked for modifications. In this work, the longest common subsequence rate (LCSR)[13], which is based on the longest common subsequence (LCS)[12],[13], is used to determine whether an annotation's anchor text differs from the corresponding text node of the original document (see footnote 2 for the notation):

$$\mathrm{LCSR}(x, y) = \frac{2 \times |lcs(x, y)|}{|x| + |y|} \tag{1}$$
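Equation (1) can be computed directly with the textbook dynamic-programming algorithm for the longest common subsequence. The following is a minimal sketch, not the authors' implementation: the function names `lcs_length` and `lcsr` are illustrative, texts are compared character by character, and the value returned for two empty strings is an assumption, since the paper does not define that case.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous row of the DP table
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(x: str, y: str) -> float:
    """Longest common subsequence rate: 2 * |lcs(x, y)| / (|x| + |y|)."""
    if not x and not y:
        return 1.0  # assumption: two empty texts are treated as identical
    return 2 * lcs_length(x, y) / (len(x) + len(y))
```

The rate is 1 only when the two texts are identical and 0 when they share no characters, which is what lets the later rules compare it against a threshold.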

If the structures between the annotation paths and the original document were not changed, the following repositioning rules 1 and 2 are applied to extract proper anchors. The details are as follows.

Repositioning Rule 1: Suppose that the annotation information T1 contains the annotations' anchor text nodes T1 = {xi}, 1 ≤ i ≤ s, and that the annotations' target document T2 contains the text nodes T2 = {yi}, 1 ≤ i ≤ α. Each text node includes textual data xi = {T1strin}, 1 ≤ n ≤ l, and yi = {T2strin}, 1 ≤ n ≤ β, and each textual datum includes characters T1strik = {aik}, 1 n k, and T2strik = {biγ}, 1 n 0. Create matching relations [xio, yiq],…,[xip, yir] if all the nodes of the annotation information T1 exist in T2 with the same labels and order, and the anchor text nodes xio,…,xip of T1 and the text nodes yiq,…,yir of T2 have the same parent node labels.

If repositioning rule 1 is satisfied, it can be assumed that the original document's paths were not changed. In that case, one-to-one matching relations between the anchor texts can be created. For instance, as shown in Figure 2, if the original document's structures were not changed, one-to-one matching relations between the paths of the annotations and the original document can be created as [25, 25] (matching 1), [27, 27] (matching 2), [28, 28] (matching 3), and [36, 36] (matching 4).
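The path check of rule 1 can be illustrated with Python's standard `xml.etree.ElementTree`. This is a hedged sketch under simplifying assumptions that are not in the paper: a path is modeled as a tuple of (label, sibling position) pairs from the root, only the leading text of each element is treated as a text node, and the names `text_nodes` and `rule_1_matchings` are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_nodes(root):
    """Yield (path, text) for every element that has non-blank leading text.

    A path is a tuple of (label, sibling position) pairs from the root
    down to the element (an assumed encoding of label and order).
    """
    def walk(elem, path):
        if elem.text and elem.text.strip():
            yield path, elem.text.strip()
        for pos, child in enumerate(elem):
            yield from walk(child, path + ((child.tag, pos),))

    yield from walk(root, ((root.tag, 0),))


def rule_1_matchings(anchors, new_root):
    """Create one-to-one matchings when every anchor path still exists.

    anchors: list of (path, anchor_text) recorded on the old version.
    Returns a list of (path, anchor_text, new_text), or None if any
    anchor path is missing, i.e. the structure was changed.
    """
    new_index = dict(text_nodes(new_root))
    matchings = []
    for path, anchor_text in anchors:
        if path not in new_index:
            return None
        matchings.append((path, anchor_text, new_index[path]))
    return matchings


old = ET.fromstring(
    "<book><chapter><title>Intro</title><p>Annotations help readers.</p></chapter></book>"
)
new = ET.fromstring(
    "<book><chapter><title>Intro</title><p>Annotations help many readers.</p></chapter></book>"
)
anchors = list(text_nodes(old))
matchings = rule_1_matchings(anchors, new)  # both paths survive the text edit
```

Here the text of `p` changed but its path did not, so rule 1 still yields one-to-one matchings and the text comparison is left to rule 2.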

If matching relations were created by repositioning rule 1, the similarity rates between anchor texts should be measured, and annotations are repositioned according to the results. The details are explained in repositioning rule 2.

Repositioning Rule 2: If the LCSR between the nodes under the matchings created by rule 1 is 1, the original document's text nodes yiq,…,yir are designated as anchoring boundaries. If the LCSR is between a threshold value and 1, the text nodes yiq,…,yir are regarded as candidate anchors. If the LCSR is below the threshold value, which means that anchor information of the old version was deleted or that structures were modified, repositioning rules 3-6 are applied. If no text node in T2 satisfies rules 3-6, yiq is appointed as a candidate anchor.

2 |lcs(x,y)|, |x|, and |y| denote the length of the LCS between texts x and y, the length of x, and the length of y, respectively.

After repositioning rules 1 and 2 are applied, if the anchor texts are found to be unchanged or to have been updated without changes in paths, either the old boundaries are reused or new candidate anchor boundaries are extracted. For instance, since the LCSR of the matching [25, 25] (matching 1) in Figure 3 is 1, node [25] is anchored right away. The LCSRs of the matchings [27, 27] (matching 2) and [28, 28] (matching 3) are between the threshold value and 1; therefore, nodes [27] and [28] of T2 are appointed as candidate anchor boundaries. On the other hand, the similarity rate of the matching [36, 36] (matching 4) is below the threshold value, so a new repositioning rule should be applied.
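The three outcomes of rule 2 amount to a classification of each rule 1 matching by its LCSR. The sketch below is illustrative only: the paper does not state the threshold, so the default of 0.6 is an assumed placeholder, the treatment of a rate exactly equal to the threshold as a candidate is also an assumption, and the names `Outcome` and `apply_rule_2` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    ANCHOR = "anchor boundary"          # LCSR is exactly 1
    CANDIDATE = "candidate anchor"      # threshold <= LCSR < 1
    RETRY = "apply the later rules"     # LCSR below the threshold


def lcsr(x: str, y: str) -> float:
    """Equation (1): 2 * |lcs(x, y)| / (|x| + |y|), via dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    total = len(x) + len(y)
    return 2 * prev[-1] / total if total else 1.0


def apply_rule_2(anchor_text: str, node_text: str, threshold: float = 0.6) -> Outcome:
    """Classify one rule 1 matching; the threshold value is an assumption."""
    rate = lcsr(anchor_text, node_text)
    if rate == 1.0:
        return Outcome.ANCHOR
    if rate >= threshold:
        return Outcome.CANDIDATE
    return Outcome.RETRY
```

With these assumptions, an unchanged anchor text is anchored immediately, a lightly edited one becomes a candidate, and an unrelated one falls through to the rules for changed structures.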

Fig. 2. When relations between annotation paths and the original document were not changed.

Fig. 3. An example of repositioning between anchor nodes of the same path.


2.2 Annotation Repositioning between the Changed Structures

If there were modifications in the paths and anchor texts between the original document and the annotations, similarities between the structures should be considered. Moreover, because of the special features of annotations, the candidate anchors selected from this information include one-to-one, one-to-many, many-to-many, and many-to-one anchors on anchor texts and paths. Accordingly, to select proper candidate anchors, this work considers path matching methods by stages together with candidate anchor merging and linking methods. The details are as follows.

Repositioning Rule 3: Suppose that the path information of the annotation information T1 does not coincide with the node labels and sibling orders of T2, the annotations' target document. Compute the LCSR between the annotation anchor text nodes xio,…,xip of T1 and all the text nodes of T2, select the text nodes of T2 whose LCSR is above a threshold value, and accordingly create matchings [xio, yiq],…,[xio, yir],…,[xip, yiu],…,[xip, yiv].

If the original document's structures and the contents of its text nodes were modified, repositioning rule 3 should be applied first, to create matching relations whose LCSR is above a certain value. For instance, as shown in Figure 4, for node 25, the matchings [25, 70] (matching 1) and [25, 58] (matching 2) can be created, and matchings 3, 4, 5, 6, and 7 can be created according to the same rule.
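Rule 3 is an all-pairs comparison between anchor texts and document text nodes. A minimal sketch follows, assuming text nodes are given as dictionaries from node id to text; the node ids echo the style of Figure 4, but the texts, the threshold of 0.6, and the function name `rule_3_matchings` are made up for illustration.

```python
def lcsr(x: str, y: str) -> float:
    """Equation (1): 2 * |lcs(x, y)| / (|x| + |y|), via dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    total = len(x) + len(y)
    return 2 * prev[-1] / total if total else 1.0


def rule_3_matchings(anchor_texts, document_texts, threshold=0.6):
    """Return (anchor_id, node_id, rate) for every pair at or above the threshold.

    One anchor may match several text nodes, and several anchors may
    match the same node, so the result can be many-to-many.
    """
    matchings = []
    for anchor_id, anchor_text in anchor_texts.items():
        for node_id, node_text in document_texts.items():
            rate = lcsr(anchor_text, node_text)
            if rate >= threshold:  # assumption: the threshold itself qualifies
                matchings.append((anchor_id, node_id, rate))
    return matchings


# Hypothetical data: one anchor text and three text nodes of the new version.
anchors = {25: "context based repositioning"}
document = {
    58: "context-based repositioning",
    70: "context based positioning",
    90: "ACM Press, NY",
}
candidates = rule_3_matchings(anchors, document)
```

Anchor 25 matches both nodes 58 and 70 here, which is the one-to-many situation that rules 4 and 5 are meant to narrow down.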

Fig. 4. Creation of candidate anchors based on the text LCSR.

Matchings created between text nodes can relate many anchor nodes to many text nodes, as shown in Figure 4. In that case, the proposed methods create new, semantically related matchings by comparing the LCSR between node labels, rather than selecting one matching relation merely on the basis of the LCSR between text nodes.

Repositioning Rule 4: If there exist many matchings in which the LCSR between the anchor nodes and the original document is above a certain value, compare the label LCSR between the paths (see footnote 3) of the matched text nodes, and then create matchings [xio, yiq],…,[xio, yir],…,[xip, yiu],…,[xip, yiv] whose label similarity rates are above a certain value.

According to repositioning rule 4, among the many matchings based on text similarity rates, those whose label similarity rates are above a certain value are appointed as new candidate anchors. For instance, as shown in Figure 5, in the matchings drawn with black lines, [25, 58] (matching 2), [27, 60] (matching 3), [27, 52] (matching 4), [27, 51] (matching 6), and [28, 83] (matching 7), the label LCSR between paths is above a certain value; therefore, they are appointed as new matchings. Likewise, candidate anchors that satisfy both rules 3 and 4 are regarded as having higher priority, and this priority is used later for anchor recommendation in the repositioning interface.
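The label comparison of rule 4 can reuse the LCS idea on sequences of node labels instead of characters. The sketch below assumes that a path is a list of labels from the parent of a text node up to the root, following footnote 3; the label threshold of 0.5, the example paths, and the names `label_lcsr` and `rule_4_filter` are assumptions made for illustration.

```python
def lcs_length(a, b) -> int:
    """LCS length for any two sequences (strings or lists of labels)."""
    prev = [0] * (len(b) + 1)
    for ea in a:
        curr = [0]
        for j, eb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ea == eb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def label_lcsr(path_a, path_b) -> float:
    """LCSR of equation (1) applied to two label paths."""
    total = len(path_a) + len(path_b)
    return 2 * lcs_length(path_a, path_b) / total if total else 1.0


def rule_4_filter(matchings, anchor_paths, node_paths, label_threshold=0.5):
    """Keep the text-based matchings whose paths are also similar."""
    kept = []
    for anchor_id, node_id in matchings:
        rate = label_lcsr(anchor_paths[anchor_id], node_paths[node_id])
        if rate >= label_threshold:  # assumption: the threshold itself qualifies
            kept.append((anchor_id, node_id, rate))
    return kept


# Hypothetical paths, listed from the parent label up to the root label.
anchor_paths = {27: ["p", "section", "chapter", "book"]}
node_paths = {
    51: ["p", "section", "chapter", "book"],
    60: ["p", "chapter", "book"],
    90: ["cell", "row", "table", "appendix", "book"],
}
kept = rule_4_filter([(27, 51), (27, 60), (27, 90)], anchor_paths, node_paths)
```

In this toy case node 90 has similar text by construction but sits under a table in an appendix, so its path shares only the root label with the anchor's path and the matching is dropped.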

Fig. 5. Creation of candidate anchors based on the label LCSR.

If some of the matchings created by rule 4 have the same label similarity rate, the possibility of merging them should be considered, using the similarity rate information and the adjacency between labels, to reduce the possible number of matchings. Details are explained in rule 5.

Repositioning Rule 5: If, after rule 4 is applied to the text nodes yiq,…,yir of the original document, there exist many nodes whose label similarity rates between paths are above a certain value, determine whether the nodes can be combined with each other. For that purpose, merge sibling text nodes that have the same parent node, or text nodes that are consecutive in order, and regard them as candidate anchor boundaries.

Among the matchings created by rule 4, those that satisfy merging rule 5 should be merged as anchor boundaries and selected as new candidate anchors. For instance, as shown in Figure 6, the matchings [27, 51] and [27, 52] are in a sibling relation with the same parent node, and the text nodes of the matchings [27, 60] and [28, 83] are consecutive in order in the original document. Therefore, they are merged as anchor boundaries and used later as references.

Fig. 6. Creation of candidate anchors by merging nodes

3 Path(x) means the sequence of nodes from the parent node of node x up to the root node.
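The merging step of rule 5 can be sketched as grouping candidate text nodes in document order. This is an illustrative simplification, not the authors' procedure: each candidate is assumed to carry a parent id and its position among the text nodes of the document, neighbours are merged when they share a parent or have consecutive positions, and the `Candidate` class, the field names, and the parent and order values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node_id: int    # id of the matched text node in the new version
    parent_id: int  # id of its parent node
    order: int      # position among text nodes in document order


def rule_5_merge(candidates):
    """Group candidates into merged anchor boundaries.

    Walking the candidates in document order, a candidate joins the
    current group if it has the same parent as the previous candidate
    (sibling text nodes) or directly follows it (consecutive order).
    """
    groups = []
    for cand in sorted(candidates, key=lambda c: c.order):
        if groups:
            last = groups[-1][-1]
            if cand.parent_id == last.parent_id or cand.order == last.order + 1:
                groups[-1].append(cand)
                continue
        groups.append([cand])
    return groups


# Node ids follow the example in the text; parents and orders are invented.
cands = [
    Candidate(node_id=51, parent_id=50, order=10),
    Candidate(node_id=52, parent_id=50, order=11),
    Candidate(node_id=60, parent_id=59, order=20),
    Candidate(node_id=83, parent_id=82, order=21),
    Candidate(node_id=70, parent_id=69, order=40),
]
boundaries = [[c.node_id for c in group] for group in rule_5_merge(cands)]
```

Under these assumed values the five candidates collapse into three boundaries, which reduces the number of choices a user would have to review.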

3 Implementation Result

In this section, the results of implementing the annotation system, including the proposed methods and interface, are examined. The system uses the XML-based eBook standard[11] and operates in the Windows XP and Windows CE environments. In this work, the annotation system was operated on Windows CE.

Figure 7(A) shows that robust anchoring is provided by applying the proposed methods and interface to target documents. Figure 7(B) shows the primitive anchoring of annotations 1, 2, and 5 of Figure 7(A): they are displayed as annotations in rectangle form, and the common anchor texts extracted by the LCSR are highlighted to give users criteria for anchor selection. For annotation 3 in Figure 7(A), the contents of the anchor text were severely modified, although the anchor structure remains the same in Figure 7(B).

4 Experimental Evaluation

To evaluate the efficiency of the proposed methods, empirical user tests were conducted. The experiment deployed three prototypes: one that applied the proposed system, one that applied the Robust Location method[6], which uses structure information, and one that applied the WebVise[9] method, which uses surrounding texts and anchors' offset information instead of structure information.

Fig. 7. Original document where annotations were inserted (A) and anchoring results (B).

Users created 50 annotations in the XML-based e-Book document using the three prototypes. After the original document had been changed, each prototype repositioned 20 annotations whose structures and anchors had been changed, 20 annotations in which only the anchor texts had been changed, and 10 annotations in which nothing had been changed. Users were then asked to fill out question sheets that rated each prototype's repositioning accuracy on a scale from 1 (lowest accuracy) to 10 (highest accuracy).

A single-factor ANOVA (analysis of variance) was used to analyze performance by subjective accuracy. Figure 8 shows the subjective ratings of repositioning accuracy for each method. A significant main effect of the method was found among the three repositioning results (F(2,57) = 8.98, p < 0.05).
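For readers unfamiliar with the notation, F(2,57) reports the between-group and within-group degrees of freedom of the single-factor ANOVA: with k = 3 methods, k - 1 = 2, and N - k = 57 implies N = 60 ratings in total. The sketch below shows how such an F statistic is computed; the function name `one_way_anova` is illustrative and the ratings in the example are toy numbers, not data from the experiment.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA.

    groups: list of lists of ratings, one list per method.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual ratings around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy ratings for three hypothetical methods (not the paper's data).
f_value, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value means that the differences between the mean ratings of the methods are large relative to the spread of ratings within each method.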

[Figure 8 plot: mean subjective accuracy rate (axis from 1 to 10) for Proposed System, Robust Location, and WebVise]

Fig. 8. Subjective evaluation of the accuracy by applying each method in Experiment 1 (1 = lowest accuracy, 10 = highest accuracy).

The results showed that the method proposed in this paper obtained the highest accuracy rate in the structured document. In particular, the proposed staged candidate anchor creation method influenced the users' evaluation of accuracy more effectively than Robust Location's tree walk method, which uses structure information. On the other hand, the method that used only surrounding texts and disregarded structure information showed lower satisfaction rates than the proposed method. It seems that repositioning results in the structured document are affected by the use of meaningful element relations, such as path information, as well as by the similarity rates of textual data.

5 Conclusions and Future Works

This paper proposed robust repositioning methods and systems for annotations in the XML document. In the proposed methods, the XML original document and the annotation information were represented as logical structure trees, and matching relations were created between the trees. Repositioning rules were presented by stages for selecting an appropriate anchor among many candidates.

The proposed repositioning rules used the similarity rate of the anchor text information to detect the first anchoring point, and the similarity rate of paths to create candidate anchors and extract meaningful information from the structure information. The boundary of candidate anchors was effectively determined by the merging rules. Finally, the created candidate anchors were either recommended and selected, or orphaned, through user interaction.

In that way, annotations in the structured document were robustly repositioned even after contexts of the document were deleted, moved, or modified. This work can be applied in annotation systems that use structured documents such as XML/SGML as well as on the web (HTML), and it can be effectively applied to online text editing, eBook, Cyber-Class, and so on. Future work will apply the methods to an annotation interface that includes semantic information creation and robust anchoring operations in the XML-based semantic web environment.

References

1. Brush, A. J. B. & Bargeron, D. (2001). Robustly Anchoring Annotations Using Keywords. Technical Report MSR-TR-2001-107, Microsoft Research.

2. Brush, A. J. B., Bargeron, D., Gupta, A. & Cadiz, JJ. (2001). Robust Annotation Positioning in Digital Documents. Proceedings of CHI '01. Seattle, March 31, ACM Press, NY, 285-292.

3. Cadiz, J., Gupta, A. & Grudin, J. (2000). Using Web Annotations for Asynchronous Collaboration Around Documents. Proceedings of CSCW '00. Philadelphia, ACM Press, NY, 309-318.

4. Ovsiannikov, I. A., Arbib, M. A. & McNeill, T. H. (1999). Annotation Technology. International Journal of Human-Computer Studies. 50 (4), 329-362.

5. Davis, H. C. (2000). Referential Integrity of Links in Open Hypermedia Systems. Proceedings of ACM Hypertext '98. Pittsburgh, ACM Press, NY, 207-216.

6. Phelps, T. A. & Wilensky, R. (2000a). Robust Intra-document Locations. Proceedings of the 9th WWW Conference. Amsterdam.

7. Phelps, T. A. & Wilensky, R. (2000b). Multivalent Documents. Communications of the ACM. 43 (6), 83-90.

8. Roscheisen, M. & Winograd, T. (1995). Shared Web Annotations as a Platform for Third-party Value-added Information Providers. Technical Report, CSDTR/DLTR, Stanford University.

9. Grønbæk, K., Sloth, L. & Ørbæk, P. (1999). Webvise: Browser and Proxy Support for Open Hypermedia Structuring Mechanisms on the WWW. Proceedings of the Eighth World Wide Web Conference. Toronto, Canada.

10. Yee, K.-P. (1997). The CritLink Mediator. http://crit.org/critlink.html.

11. Sohn, W. S., et al. (2002). Standardization of eBook documents in the Korean Industry. Computer Standards & Interfaces. 24 (1), 45-60.

12. Chang, G. J. S., Patel, G., Relihan, L. & Wang, J. T. J. (1997). A Graphical Environment for Change Detection in Structured Documents. Proceedings of Twenty-First Annual Int'l Computer Software and Applications Conference (COMPSAC'97). Los Alamitos, CA, 536-541.

13. Lee, K. H., Choy, Y. C. & Koh, K. (2001). Change Detection of Structured Documents using Path-Matching Algorithms. Journal of KISS (Korean). 28 (4).