Australian Document Computing Conference Dec 3 2011 Information Retrieval in Large Organisations Simon Kravis Copyright 2010 Fujitsu Limited


Australian Document Computing Conference Dec 3 2011

Information Retrieval in Large Organisations

Simon Kravis


Copyright 2010 Fujitsu Limited


FUJITSU CONFIDENTIAL

Large Organisations

• Can’t rely on personal contacts to obtain information
• Have difficulty in storing and retrieving information
• Often use multiple systems for storing information
  – Paper Files
  – Shared Filesystems
  – Document Management Systems
    • Intranets (SharePoint)
    • Specialised Systems (eg TRIM, Documentum, Alfresco)
• Are only interested in Internet style search to meet legal challenges


Paper files

• Well understood
• Easy to manage
• Can be stored over hundreds of years
• Expensive to store and search
• Most documents now ‘born digital’


Electronic Documents

• Cheap to create, exchange and store in the short term
• Price of powerful applications is poor management


Filesystems

• Files are building blocks of
  – Operating Systems
  – Applications
• Desktop applications commonly store electronic documents as files
• Hardware costs of storage have become very low
• Difficult to model statistically
  – many attributes follow power laws (files/folder, file size, subfolders, file types)


Why shared filesystems?

• Cheap & simple
• Access to documents from different computers
• Support collaborative work


Shared Filesystem Organisation

• Multiple volumes, often based on organisational structure
• Tree structure of folders and files
• User and Group areas
• Permissions based on user ID and group membership
• Higher levels of folder trees usually controlled by administrators


Are shared filesystems unstructured?

• Folder tree represents a high degree of structure created by users
  – Local but not global consistency
  – Users structure folder trees to facilitate their own work
  – Structures are usually highly efficient information stores
• Small survey of users in an IT service company in 2005 showed that only 1 user out of 12 had spent more than 15 mins/day looking for files on share drives over past week


Filesystem volume growth & effect of quotas


[Chart: Finance sector file server growth over 2 years. X axis: date, 01-Mar-05 to 01-Mar-07. Y axis: Vol (GBytes), 0 to 16,000.]

• 3000 users, 90 volumes
• Basically linear with small acceleration
• Linear component = 190 GBytes/month
• 600 MBytes/month/user
• Growth acceleration = 7 GBytes/month²

[Chart: Transport organisation file server growth over 4 years. X axis: date, 11/2003 to 11/2007. Y axis: User and Group Vol (MBytes), 0 to 16000. Series: Before Quotas, After Quotas.]

• 22,000 users, 328 user and group volumes
• Quadratic fit to cleaned data before quotas
• Linear component = 160 GBytes/month
• 7 MBytes/month/user
• Growth acceleration = 0.07 GBytes/month²
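The quadratic growth model quoted on this slide can be written as V(t) = V0 + b·t + ½·a·t², with b the linear component and a the growth acceleration. The sketch below is illustrative only: the function names, the zero starting volume, and the factor of ½ (treating the quoted acceleration as a second derivative rather than the raw quadratic coefficient) are assumptions, not details from the talk.

```python
def projected_volume_gb(v0_gb, linear_gb_per_month, accel_gb_per_month2, months):
    """Quadratic growth: V(t) = V0 + b*t + 0.5*a*t^2 (the 0.5 factor is an assumption)."""
    return v0_gb + linear_gb_per_month * months + 0.5 * accel_gb_per_month2 * months ** 2


def per_user_mb_per_month(linear_gb_per_month, users):
    """Linear growth component per user, taking 1 GByte = 1000 MBytes."""
    return linear_gb_per_month * 1000.0 / users


# Transport organisation figures from the slide: 160 GBytes/month, 22,000 users.
transport_rate = per_user_mb_per_month(160, 22_000)

# Finance figures from the slide: 190 GBytes/month, acceleration 7 GBytes/month^2.
# The starting volume of 0 is a placeholder, so this is growth over 24 months only.
finance_growth_24m = projected_volume_gb(0.0, 190, 7, 24)
```

The transport result is consistent with the 7 MBytes/month/user on the slide. Applying the same division to the finance figures (190 GBytes/month over 3000 users) gives roughly 63 MBytes/month/user rather than the 600 shown, so that figure may have been derived differently or mis-transcribed.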


Volume and count profiles (Financial Services)


[Chart: Volume Profile for 11 TBytes of Data. Y axis: 0% to 60%. Series: Vol, Count. File types: < 2%, xls, mdb, prj, doc, TXT, zip, pst, nsf, pro, csv, DBF, pdf.]

[Chart: Count Profile for 21 Million Files. Y axis: 0% to 60%. Series: Vol, Count. File types: < 2%, doc, xls, (blank), txt, lnk, htm, pdf, prj, gif, CSV, jpg, A, png.]
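A volume and count profile of this kind could be computed along the following lines. This is a minimal sketch, not the tool used for the talk: `type_profile`, its parameters, and the input format of `(extension, size)` pairs are hypothetical, and the input is assumed to be non-empty with non-zero total size. Types whose share on the ranking measure falls below the threshold are pooled under a "< 2%" label, as on the charts.

```python
from collections import defaultdict


def type_profile(files, by="volume", threshold=0.02):
    """Return {label: (volume_share, count_share)} per file extension.

    files: iterable of (extension, size_in_bytes) pairs.
    by: "volume" or "count", the measure used to decide which types are pooled.
    """
    counts = defaultdict(int)
    volumes = defaultdict(int)
    for ext, size in files:
        key = ext if ext else "(blank)"  # files with no extension
        counts[key] += 1
        volumes[key] += size

    total_count = sum(counts.values())
    total_volume = sum(volumes.values())
    pooled_label = f"< {threshold:.0%}"

    profile = defaultdict(lambda: [0.0, 0.0])
    for key in counts:
        vol_share = volumes[key] / total_volume
        count_share = counts[key] / total_count
        ranked_share = vol_share if by == "volume" else count_share
        label = key if ranked_share >= threshold else pooled_label
        profile[label][0] += vol_share
        profile[label][1] += count_share
    return {label: tuple(shares) for label, shares in profile.items()}


sample = [("xls", 600), ("doc", 390), ("", 10)]
by_volume = type_profile(sample, by="volume")
by_count = type_profile(sample, by="count")
```

Ranking by volume and by count generally pools different types, which is why the two charts list different extensions.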


File Size and Count Profile


Size range covers 5 orders of magnitude

50% of volume used by 3% of files

[Charts: Cum % Size Histogram (all files) and % Size Histogram. Y axis: %, 0 to 100. X axis: file size in logarithmic bins from 46.4 KBytes to 4.4 GBytes. Series: Count, Vol.]
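The bin labels on this chart (46.4, 100.0, 215.4, 464.2 KBytes and so on) are spaced at three bins per decade. The sketch below reproduces those edges and shows one way to measure how concentrated volume is in the largest files, as in the "50% of volume used by 3% of files" observation. The function name and the sample sizes are hypothetical, chosen only for illustration.

```python
# Logarithmic bin edges at three per decade, in KBytes: 10^(5/3) .. 10^(9/3).
bin_edges_kb = [round(10 ** (k / 3), 1) for k in range(5, 10)]


def file_fraction_for_volume_share(sizes, volume_share=0.5):
    """Smallest fraction of files (largest first) whose sizes sum to volume_share of the total."""
    ordered = sorted(sizes, reverse=True)
    target = volume_share * sum(ordered)
    running = 0
    for n, size in enumerate(ordered, start=1):
        running += size
        if running >= target:
            return n / len(ordered)
    return 1.0


# Made-up example: 97 small files and 3 large ones.
sizes = [1] * 97 + [40, 30, 30]
fraction = file_fraction_for_volume_share(sizes)
```

With these made-up sizes the three largest files hold just over half the volume, so `fraction` is 0.03.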


Why filesystems are like poorly sorted soil


Most of volume taken up by large particles


Duplication by count and volume


Volume and count spectra are usually different – volume savings from de-duplication seldom exceed 20%
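The difference between duplication by count and by volume can be illustrated with exact-duplicate detection by content hash. This is a sketch under stated assumptions: `duplication_savings` is a hypothetical helper, SHA-256 is one reasonable choice of hash rather than the one used in the talk, and file contents are held in memory for brevity where a real scan would stream them from disk.

```python
import hashlib
from collections import defaultdict


def duplication_savings(files):
    """files: iterable of (name, content_bytes).

    Returns (count_saving, volume_saving): the fractions of files and of bytes
    that could be removed by keeping one copy of each distinct content.
    """
    groups = defaultdict(list)  # content hash -> list of file sizes
    for _name, content in files:
        groups[hashlib.sha256(content).hexdigest()].append(len(content))

    total_count = sum(len(sizes) for sizes in groups.values())
    total_volume = sum(sum(sizes) for sizes in groups.values())
    redundant_count = sum(len(sizes) - 1 for sizes in groups.values())
    redundant_volume = sum(sum(sizes[1:]) for sizes in groups.values())
    return redundant_count / total_count, redundant_volume / total_volume


example = [("a.doc", b"x" * 10), ("copy of a.doc", b"x" * 10), ("big.mdb", b"y" * 80)]
count_saving, volume_saving = duplication_savings(example)
```

In this toy example a third of the files are duplicates but removing them frees only a tenth of the volume, because the duplicated file is small and the large file is unique.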


File Use Profiles – 6500 accesses to 3.5 million files over 21 days by 145 users


• 2 accesses per user per day

• About 3 read accesses for every modification

• Files on share drives not frequently shared between users

• Files accessed many times by many users are applications

[Chart: file use profile. Axes: Users, Files, Accesses; one axis is logarithmic, 1 to 10000.]
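The headline rate follows directly from the slide title: 6500 accesses by 145 users over 21 days is a little over 2 accesses per user per day. The sketch below shows how the other figures on this slide could be derived from an access log. The `(user, path, operation)` record format and the function name are hypothetical; the talk does not describe how its log was structured.

```python
from collections import defaultdict

# From the slide title: 6500 accesses, 145 users, 21 days.
headline_rate = 6500 / (145 * 21)


def access_profile(events, users, days):
    """events: iterable of (user, path, operation), operation being 'read' or 'write'."""
    events = list(events)
    users_per_file = defaultdict(set)
    reads = writes = 0
    for user, path, operation in events:
        users_per_file[path].add(user)
        if operation == "write":
            writes += 1
        else:
            reads += 1
    shared = sum(1 for who in users_per_file.values() if len(who) > 1)
    return {
        "accesses_per_user_per_day": len(events) / (users * days),
        "reads_per_write": reads / writes if writes else None,
        "files_touched": len(users_per_file),
        "files_with_multiple_users": shared,
    }


log = [
    ("u1", "a.doc", "read"),
    ("u1", "a.doc", "read"),
    ("u2", "a.doc", "read"),
    ("u1", "b.xls", "write"),
]
profile = access_profile(log, users=2, days=1)
```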


Text Documents in Large Organisations

• Mainly created by desktop applications (Office)
• Usually comprise 15-20% of file count, 10-15% of volume
• Collections used by different parts of the organisation
• Small collections often very intensively used
  – Collateral for service companies


Duplication in 12,00 text documents from software development project


[Charts: Exact; Near (Document Vector Comparison)]

Similar cluster spectra for 40,000 text documents from Govt. Department
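Near-duplicate detection by document vector comparison can be sketched as follows. The talk does not say how its vectors were built or matched, so everything here is an assumption for illustration: bag-of-words term counts, cosine similarity, a 0.9 threshold, and a greedy pass in which each document joins the first cluster whose seed it matches.

```python
import math
import re
from collections import Counter


def document_vector(text):
    """Bag-of-words term-frequency vector over lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def near_duplicate_clusters(texts, threshold=0.9):
    """Greedy clustering: compare each document with the first member of each cluster."""
    vectors = [document_vector(t) for t in texts]
    clusters = []  # lists of document indices; the first index is the cluster seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vectors[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "design spec version one draft",
    "design spec version one draft final",
    "meeting minutes for tuesday",
]
clusters = near_duplicate_clusters(docs)
```

The first two texts differ by one word and fall in the same cluster; the third shares no terms with them and forms its own.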


Evaluating Measures of Near-Duplication


• Very large parameter space to test
  – Document vector generation, matching algorithm, matching level
• False positives detected by sampling cluster
• Very difficult to detect false negative clustering
  – Do documents with similar names have similar content?
  – Trigram matching – very compute-intensive
• Most clusters are versions of documents
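Trigram matching can be sketched as a Jaccard similarity over sets of overlapping three-character substrings. The slide does not say whether trigrams were taken from file names or from document content, nor which similarity measure was used, so the functions below are a hypothetical illustration applied to file names.

```python
def trigrams(text):
    """Set of overlapping 3-character substrings of the lower-cased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (0.0 when both are empty)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical file names: two versions of one document, and an unrelated file.
same_family = trigram_similarity("design_v1.doc", "design_v2.doc")
unrelated = trigram_similarity("abc", "xyz")
```

Comparing every pair in a collection of n items takes on the order of n² comparisons, which is one reason this kind of matching becomes compute-intensive on large collections.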


Example of correct clustering


10 versions of the same file, all in same folder


Example of incorrect clustering


RfA Diagram2.rtf
UI navigation diagrams 010210.RTF

Same 3 words – different pictures


Information Retrieval by Search for Internal Collections

• Few or no hyperlinks
• Composite documents are common
• Documents frequently have implicit content
• High level of near duplication
• Search terms are often commonly occurring words or phrases
-> Poor search results when compared to Internet search
• Users prefer to ask people or browse


Is tagging the answer?

Sparse access means that common tags don’t emerge


[Chart: file use profile, as on the File Use Profiles slide. Axes: Users, Files, Accesses.]


What might help?

• Automated tagging
  – Training sets
  – Synonym groups
  – Learning required to adapt to rapidly changing vocabulary
• Extraction of document headings & captions
  – “Find a good paragraph on reporting capability”
• Clustering of similar documents
  – “Find the most recent version of this document” is a very common requirement
• Using a document management system with version control
  – Presence of a capability doesn’t mean it will be used
  – Cluster spectra of documents in DMS very similar to filesystem for software development docs
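Once similar documents have been clustered, the "most recent version" request reduces to picking one member of a cluster. The sketch below assumes each cluster member carries a modification timestamp and that the latest timestamp identifies the latest version; both are assumptions for illustration, and copying files can reset timestamps in practice. The function name and the paths are hypothetical.

```python
from datetime import datetime


def most_recent_version(cluster):
    """cluster: iterable of (path, modified) pairs for documents judged to be versions of one another."""
    return max(cluster, key=lambda item: item[1])[0]


versions = [
    ("specs/design v1.doc", datetime(2010, 2, 1)),
    ("specs/design v3.doc", datetime(2010, 6, 15)),
    ("specs/design v2.doc", datetime(2010, 4, 9)),
]
latest = most_recent_version(versions)
```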
