The Role of Automated Categorization in E-Government Information Retrieval
Tanja Svarre & Marianne Lykke, Aalborg University, DK
ISKO conference, 8th of July, 2013.
Agenda
• Background of the study
• Theoretical framework
• Research methods
• Results
• Summary and closing remarks
Background to the search test
• Initiated and partially cofinanced by the Danish National IT and Telecom Agency
• Purpose: To investigate how automatic assignment of metadata can contribute to the intention of increased efficiency and effectiveness in (Danish) e-government
Building on indexing/categorization:
• Early Cranfield tests
Categorization is helpful:
• when the query is vague, broad, general, or ambiguous
• when result rankings are deficient (Käki, 2005)
• in supporting exploratory searches
• in understanding large search sets (Kules & Shneiderman, 2004; 2005)
Research methods
• Case study in the Danish Tax Authorities
• Search test:
  • Controlled lab test
  • Comparison test
  • Professional users
  • Domain-specific search tasks
  • Pre-test questionnaire
  • Log data
  • Post-search interview
Data: Search test
• System characteristics:
  • Prototype of the corporate intranet
  • www.skat.dk content and internal information
• 2 search systems:
  • Free text indexing (SYSTEM A)
  • Categorization (SYSTEM B)
• 32 test persons
• 3 controlled and 1 natural search task per session, 2 tasks per system
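To make the contrast between the two systems concrete, the following is a minimal sketch of the same ranked result list shown flat (as a free text system would) and grouped by an assigned category (as a categorization system would). The documents, the category names, and the function `group_by_category` are hypothetical illustrations; they are not taken from the prototype used in the study.

```python
from collections import defaultdict

# Hypothetical ranked search results: (title, assigned category).
# These are illustrative values, not data from the study.
results = [
    ("VAT registration for new companies", "VAT"),
    ("Deduction for commuting expenses", "Personal tax"),
    ("VAT on cross-border services", "VAT"),
    ("Tax card changes during the year", "Personal tax"),
    ("Excise duty on imported goods", "Customs and duties"),
]


def group_by_category(hits):
    """Group ranked hits by category, keeping the original rank order."""
    groups = defaultdict(list)
    for title, category in hits:
        groups[category].append(title)
    return dict(groups)


# Flat view: one ranked list of titles.
flat = [title for title, _ in results]

# Categorized view: the same hits, split into groups the user can pick from.
grouped = group_by_category(results)

for category, titles in grouped.items():
    print(f"{category} ({len(titles)})")
    for title in titles:
        print(f"  {title}")
```

In the grouped view a searcher can narrow a broad result set by picking one category instead of rewriting the query.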
Search test: General findings
Variables                                              System A   System B
Sessions (N)                                           64         64
Queries (N)                                            229        335
Number of terms in queries (averages)                  2.25       2.43
Search filter 'document type' applied (percentages)    43.2       31.6
Number of sessions with reformulations (percentages)   65.6       82.8
Number of reformulations in sessions (averages)        2.58       4.23
Query success (percentages)                            30.6       21.5
Session success (percentages)                          89.1       84.4
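As a check on how the averages and percentages above relate to the raw counts, here is a small sketch that recomputes them. The session and query totals come from this slide; the counts of successful sessions and queries come from the task-level success table. Treating the reformulations per session as queries per session minus one is an assumption that happens to match the reported figures, not something the slides state.

```python
# Counts copied from the slides (successes from the task-level totals).
systems = {
    "A": {"sessions": 64, "queries": 229, "ok_sessions": 57, "ok_queries": 70},
    "B": {"sessions": 64, "queries": 335, "ok_sessions": 54, "ok_queries": 72},
}


def derived(counts):
    """Recompute the summary figures for one system from its raw counts."""
    queries_per_session = counts["queries"] / counts["sessions"]
    return {
        "queries_per_session": round(queries_per_session, 2),
        # Assumption: the first query of a session is not a reformulation.
        "reformulations_per_session": round(queries_per_session - 1, 2),
        "query_success_pct": round(100 * counts["ok_queries"] / counts["queries"], 1),
        "session_success_pct": round(100 * counts["ok_sessions"] / counts["sessions"], 1),
    }


summary = {name: derived(counts) for name, counts in systems.items()}

for name, figures in summary.items():
    print(name, figures)
```

System B has more queries per session and a lower share of successful queries, while the share of successful sessions stays close to that of System A.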
Success at task level
Task    System   Session succeeded, n (%)   Query succeeded, n (%)
Sim1    SysA     15 (93.8)                  18 (58.1)
Sim1    SysB     16 (100.0)                 23 (33.3)
Sim2    SysA     15 (93.8)                  17 (30.4)
Sim2    SysB     9 (56.3)                   11 (9.7)
Sim3    SysA     16 (100.0)                 20 (27.8)
Sim3    SysB     16 (100.0)                 22 (25.6)
NWT     SysA     11 (68.8)                  15 (21.4)
NWT     SysB     13 (81.3)                  16 (23.9)
Total   SysA     57 (89.1)                  70 (30.6)
Total   SysB     54 (84.4)                  72 (21.5)
• At task level the success of the two systems differs
Task level results
Queries per session (averages; n = sessions)
           Sim1          Sim2          Sim3          NWT           Total
System A   1.94 (n=16)   3.50 (n=16)   4.50 (n=16)   4.38 (n=16)   3.58 (n=64)
System B   4.31 (n=16)   7.06 (n=16)   5.38 (n=16)   4.19 (n=16)   5.23 (n=64)
Total      3.13 (n=32)   5.28 (n=32)   4.94 (n=32)   4.28 (n=32)   4.41 (n=128)

Terms per query (averages; n = queries)
           Sim1           Sim2           Sim3           NWT            Total
System A   2.32 (n=31)    2.39 (n=56)    2.42 (n=72)    1.94 (n=70)    2.25 (N=229)
System B   2.54 (n=69)    2.88 (n=113)   1.79 (n=86)    2.39 (n=67)    2.43 (N=335)
Total      2.47 (n=100)   2.72 (n=169)   2.08 (n=158)   2.16 (n=137)   2.36 (N=564)
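The totals in the table of query-term averages (the one ending in N=229, N=335 and N=564) can be reproduced as query-weighted means of the per-task averages. The sketch below does this for both systems, using only figures from the slide; the function name `weighted_mean` is an illustrative choice.

```python
# (average terms per query, number of queries) for Sim1, Sim2, Sim3, NWT,
# copied from the slide.
terms = {
    "A": [(2.32, 31), (2.39, 56), (2.42, 72), (1.94, 70)],
    "B": [(2.54, 69), (2.88, 113), (1.79, 86), (2.39, 67)],
}


def weighted_mean(cells):
    """Return the query-weighted mean and the total number of queries."""
    total_n = sum(n for _, n in cells)
    mean = sum(avg * n for avg, n in cells) / total_n
    return mean, total_n


overall = {system: weighted_mean(cells) for system, cells in terms.items()}

for system, (mean, total_n) in overall.items():
    print(f"System {system}: {mean:.2f} terms per query (N={total_n})")
```

Weighting by the number of queries matters here because the tasks generated very different numbers of queries, especially in System B.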
Reformulations (total), n (%)
Type of reformulation     SysA          SysB
No reformulations         69 (30.1)     62 (18.5)
Category                  -             114 (34.0)
Query terms               97 (42.4)     47 (14.0)
Document type             28 (12.2)     8 (2.4)
Search operators          8 (3.5)       5 (1.5)
>1 types simultaneously   27 (11.8)     99 (29.6)
Total                     229 (100)     335 (100)
System B (cat.) omissions
                             Number of sessions in system B   Number of successful sessions system B
System B                     26 (40.6)                        22 (40.7)
Combined system B sessions   38 (59.4)                        32 (59.3)
Total                        64 (100.0)                       54 (100.0)
System B (cat.) omissions
• Highly relevant documents are discovered before a category has been selected
• Relevant documents are located while waiting for B (cat.) to categorize search results
• Categorization is not relevant when few documents are retrieved
Summary
• Categorization is useful:
  • When employees do not possess extensive knowledge about the task at hand
  • In offering new perspectives on the composition of a query
  • In understanding facets of queries
• When task knowledge is present, categorization is used to support the assumptions of a correct search
Summary
• Categorization is omitted when:
  • Search results are limited
  • Relevant documents are ranked at the top of the results
National IT & Telecom Agency: Findings
• The participants start out with free text indexing and supplement it with categorization when necessary
• The indexing methods compared are complementary
• To meet the variety of information needs, several indexing methods should be represented simultaneously