
Upload: leo-wilcox

Post on 26-Dec-2015


CSE3180 Semester 1 2005 / Nonstructured 1

Week 9

A Few Concepts and Approaches in Unstructured Data

Data Management

These notes will address the imbalance between information sourced from structured data storage systems, mainly databases (as in relational databases),

and information which is necessary for informed management decision-making processes but is based on non-structured data (also known as unstructured data)

UnStructured Information

• Much effort and expense has been directed at the design, capture, processing, storage and retrieval of data in a structured form.

• There are a number of database management systems (DBMS) which can store many trillions of bytes in centralised, distributed and client/server based systems

• There are extensive backup, recovery and restart procedures in place to ensure persistence and continuity of data

UnStructured Information

The ratio of unstructured to structured information in many organisations is approximately 9 to 1

It is easy to conclude that the most important component which drives much of the decision making in key business processes is badly neglected.

Familiar examples are the World Wide Web,

Corporate intranets

email

online discussion groups

UnStructured Information

Why is there this bias ?

Perhaps because structured information management has been synonymous with information systems design ?

The technology for unstructured data management is powerful, pervasive and well understood

UnStructured Information

Businesses are becoming more interconnected

Each new connection is made by, and relies on, exchange of Information

One aspect of these ‘connections’ is that of the unpredictability of the nodes and the nodal content

This impacts on the structured information model

UnStructured Information

Structured information systems depend on there being a collection of facts - which make up records

Data storage was ‘expensive’ in terms of access times, processing cycles and the physical media itself

‘Critical data elements’ were (and are) used to minimise data storage requirements, and also to improve processing times

Reduction of the number of bytes allocated to these critical data elements was a bonus, for example a 2 digit year, or a 5 digit salary expression

UnStructured Information

The end product of this ‘distillation’ is structured information

Its basis is stored in a predefined record form

It is dependent on the skills of analysts and management to anticipate precisely which data elements must be stored

This reliance on a predefined record form which includes some data but excludes other data is now seen as a key limitation of structured information sources

UnStructured Information

The term ‘unstructured’ is invariably associated with ‘documents’.

They are a medium which we understand and use

There are other forms of unstructured data

audio

voice

images

graphical objects

These are forms of electronic documents

UnStructured Information

These forms are ‘unstructured’ because their exact content and organisation are unpredictable

Unstructured information is any information type made up of content which does not fit a predefined, descriptive model or arrangement

It would be possible to impose (or superimpose) a structure on a document to make document selection possible, but there would be a cost

UnStructured Information

A document’s content could be ‘distilled’ - which is a process of eliminating or summarising a body of information to its ‘essential’ components.

The danger in this is that the ‘content’ of a document may alter from use to use (processing to audit for instance) and from user to user (credit check authorisation to inventory levels control for instance)

UnStructured Information

A document may need to be tracked using

author name

title

filing date

a short abstract of the content (sound familiar ?)

The effect of this is that the ‘content’ of the document can now be accessed only through these 4 ‘keys’

Question : How could a lengthy research task be categorised by value or content using this technique ?

UnStructured Information

Document management systems support full text search and retrieval of the total document content. (There are also some nice structured document indexing techniques)

Structured data systems will probably always be necessary to track and manage specific facts about key business transactions which associate with or drive other business transactions ( a classic case is the Automated Teller Machine System).

They must have limitations. Structured information systems are based on the premise that it is possible to predict the context in which business information is useful/necessary

UnStructured Information

That is another way of saying

the who

the when

the where

the why

and the how - realistic in today’s environment ?

So what is different with today’s decisions ?

The difference is the nature and kind of decisions made

UnStructured Information

Today’s business is driven by increasing rates of change

We have shifted from an industrial economy to a knowledge driven economy

The need for more information to support the decision-making processes, and the dynamic nature of the business environment, mean that the support from structured information systems is starting to be inadequate

However there is a possibility of ‘information overload’ - one of the accepted criteria of ‘knowledge work’ is a competency to manage the increasing amounts of information

UnStructured Information

The ‘industrial’ economy did have a high degree of predictability

Companies produced and sold a fairly narrow set of products or services

Competition existed. Business operated in ‘static’ markets

Change was relatively slow - change was recognised and there was time to readjust

UnStructured Information

In the current environment, the ‘products’ are now our ideas

Ideas are driven by information

The term ‘Globalisation’ is appropriate

We think and we change or modify our plans - and this has led to the loss of predictability

In the ‘knowledge environment’, business success depends on the ability of knowledge workers to sift through all of the available unstructured resources and to make decisions - and faster than the competitors

The measurement of success is in degrees of innovation

UnStructured Information

So, what are the ‘sources of information’ ?

Corporate document bases

The Internet

The Extranets

Information subscription services

Dialog with Customers, suppliers, competitors

The 2 major problems of decision making are

1. Volume of information

2. The speed at which decisions need to be made

UnStructured Information

Information retrievals have moved away from the filtering of unstructured data into a structured environment.

The emerging models accommodate the capture of resources (and access) which leads to a dynamic and unfiltered information repository which consists of joined but separated sources.

Web sites are sought and searched via the Internet, and possibly a corporate document repository is included in the search.

UnStructured Information

Some of the information found will be transferred to a more structured repository - possibly a competitive analysis database.

A search in this way provides a set of search results but does not change the information sources.

The Importance of Tools:

Users don’t normally have the time or the skills to ‘get on top of’ a varied and changing set of tools (what is your experience with the various search engines you have used on the Internet ?)

UnStructured Information

What is needed is a retrieval tool (hardware and software) which understands how to work with different repositories

This leads to the ‘repository management system’ being able to recognise the form and requirements of different search tools (or, if you like, software which recognises and communicates exactly and completely with other software)

Another aspect is the nature of a search - this has changed from simple words or phrase retrievals to ‘context’ and dynamic analysis and categorisation

UnStructured Information

One of the ‘motivations’ for adopting electronic document management is the volume of information.

This is so for document imaging

There is an offset - how to retrieve - quickly and completely

You are familiar with ‘indexes’ such as Title, Name, Number as being the entry point to a limited set of possibilities

Today’s users want to be able to search entire documents and to match them to other fully searchable documents based on user-defined relevance

UnStructured Information

Fortunately, existing systems support many different file types in their native application format - images, word-processing systems, desktop publishing systems, spreadsheets, CAD files ….

2 methods of access

- user defined indexes in a full functioned database (3rd party relational databases)

- a fully integrated full-text query engine

What about ???

• Security

There are highly sophisticated security schemes (also mentioned in Portals)

• Revision Tracking and Control

Essential for accurate and up to date information (what revision number of MS-Word are you using ?)

• Document Check-in/Check-out

Recognition and tagging of all documents - none can be ignored or ‘go missing’

• Usage audit trails

These features are standard features

UnStructured Information

Compound document managers

These are products which provide the facilities mentioned in the previous overheads, but they have an interesting extension

They treat documents as a collection of ‘pointers’ to various and different collections of information - this ensures that ‘new’ data or information will always be part of a search. (you recall the hypertext links - and have you used the links which are attached to my Web page ?)

Other features which ensure document integrity include roll-back, recovery, and audit trails

UnStructured Information

There is a single focal point around which unstructured information management practice and technology have converged

No surprise - it is the Internet

This facility has enabled the widespread distribution of unstructured information from a very large base of resources

The World Wide Web is a hyper-linked, unstructured information repository with millions (?) of documents

UnStructured Information

The Web has made it possible for specialised content to be published, as well as to advertise products

Specialised on-line information subscription services exist which provide industry-specific information (for a fee)

Stock market information services and sites

Monash University has on-line examination results services (and enrolments ?)

UnStructured Information

Internet technology is said to be ubiquitous - that is it can reach anywhere - that is where there are suitable devices of course

All current document management systems provide some degree of support for Web technology

Capabilities range from

basic document viewing using dynamic HTML to make the document ‘visible’ regardless of its source format

to

advanced document access and distribution

UnStructured Information

For instance, a Computer Aided Design drawing (perhaps the Burnley tunnel?) can be viewed in a user’s browser without requiring that the remote users have CAD software installed on their machines

Or perhaps the Scoresby by-pass (or is it now the proposed Mitcham - Frankston freeway ?)

Or the Mullum Mullum creek and its environment

Or perhaps Coastal Area developments

UnStructured Information

Some advanced search engines: (form flexibility)

Verity Inc Information Server

PC Docs/Fulcrum SearchServer

Inktomi Corp Search Engine

Externalisation - knowledge management

Autonomy Inc

Semio Corp

These use a semantic or lexical analysis engine to extract meaning from the information repository

UnStructured Information

A traditional query would look like

‘interest rates AND stock prices’ and you would receive documents which included the 2 search words

A query expressed as ‘I am interested in the effects of interest rate changes on stock prices’ does not have specific key words nor phrases.

Such an expression would have its content meaning derived from the information base. The result would be a categorisation of topics contained in the information base.

These would then be further analysed by the user over a varying number of criteria
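The ‘traditional’ query on this overhead can be sketched in a few lines of Python. This is only an illustration: the sample documents and the function names `build_index` and `and_query` are assumptions, and the phrases ‘interest rates’ and ‘stock prices’ are treated as four separate words rather than as true phrases.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?'\"")].add(doc_id)
    return index

def and_query(index, terms):
    """Return ids of documents that contain every term (Boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample documents, for illustration only.
docs = {
    1: "Interest rates rose and stock prices fell.",
    2: "Stock prices were steady this week.",
    3: "The central bank left interest rates unchanged.",
}
index = build_index(docs)
hits = and_query(index, ["interest", "rates", "stock", "prices"])
```

Only document 1 contains all four words, so it is the only hit. A document which discusses the same topic in different words would be missed, which is the weakness the natural-language style of query is meant to address.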

A few thoughts on ‘Detail’

Retailers :

Retailing is a competitive business. Success depends on the Retailer knowing the customers.

There are 3 factors at play

1. The changing likes and dislikes of consumers.

New product preferences are the result of an aging population, changes in family structure, flexible lifestyles

and to no small degree, enterprise bargaining and working conditions.

A successful retailer must be aware and adapt to these factors

A few thoughts on ‘Detail’

2. The uniqueness of each customer.

Individual needs can only be accommodated by knowing individual requirements.

‘Loyalty’ programmes are based on this premise

3. The importance of managing inventory levels, controlling markdowns, maintaining margins.

Static inventory results in greater interest expense. This acts as a barrier to reinvestment of stock which is moving. Unmoving stock will (in many cases) require markdowns to liquidate the stock, which negatively impacts margins

A few thoughts on ‘Detail’

This latter point depends on the level of success with the first 2 factors (changes, uniqueness).

Challenges to retail businesses are:

Knowing who the customer is.

Track and capture purchase history. Identify future needs by storing data at a meaningful level of detail

As an example, a back to school marketing program for customers who purchased school materials last year.

A few thoughts on ‘Detail’

Understand how the customer wishes to interact.

How will information be sought about a possible purchase ? Personal visit, phone, Internet, email ..?

How will the sale be finalised - personal, phone, Internet, credit card, cash , account entry ?

What is the customer’s preferred method of interaction ?

Perhaps a ‘personal’ profile should be developed

Be able to track and evaluate the strength of the relationship with each customer

Purchase history plus other contacts - e.g. warranty service, new or updated products

A few thoughts on ‘Detail’

Know what is ‘enough’ detail.

Costs are associated with keeping details. Summary information may however result in incorrect projections and decisions.

Historical data (such as for air conditioners as opposed to toothpaste)

Analysis of cyclical data for air conditioners should assist in reducing risks of commitments and distribution of such items

The revenue possibilities should outweigh the expense of maintaining air conditioner data

The question is - how to decide what data to store

A few thoughts on ‘Detail’

Incorporate the knowledge gained of customer needs into ‘business intelligence’

This knowledge should be used to

– analyse past performance

– get insight into current trends

– blend this information into the business plan

– develop systems which accurately reflect the customers’ needs

– develop systems which interact and support the profit model

UnStructured Data Management

The following overheads address this situation and suggest some ways in which unstructured text can in fact be managed, and thus become part of the decision making processes

Stage 1:

1. Not surprisingly, the first stage is the creation of ‘intelligence’ - a form of data dictionary.

A more formal term is ‘meta data’ or data about data

This concept is not new as structured databases use metadata - and so do the newer Microsoft Operating Systems

The big trick is to capture representative meta data

UnStructured Data Management

Metadata about content serves a number of functions

especially the essentials of text such as

– main topics

– author or authors

– language

– publication

– revision dates

Notice the similarity to ‘library’ systems, including Voyager ?

UnStructured Data Management

This metadata improves the precision and quality of full text and keyword search - it allows users to specify additional document attributes

It also supports and extends

– classifying and routing content

– deleting expired texts

– determining any additional processing (e.g. translation based on knowledge of the source and user)

UnStructured Data Management

There are two aspects of metadata

– extraction from text

– storage

Extraction deals with detection of ‘keys’ such as

– author’s name

– the main thrust of the text (or thrusts)

Storage addresses the support of a number of access and retrieval methods

UnStructured Data Management

Extraction of metadata is rarely manual - there are currently some interesting software tools such as MetaMaker, Insight Software’s Categoriser

They are probably not as effective from the aspect of quality as a person - but they do handle large masses of text quickly

(what are the risks involved, and what is the minimum level acceptable ?)

UnStructured Data Management

As you might expect, storage is much more ‘automated capable’.

1. It can be managed separately from the documents in a relational database - as are the data dictionary tables

This is a fairly normal approach in document warehousing where close integration with other database applications such as data warehousing, is a requirement
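As a rough sketch of this first storage option, the metadata can live in an ordinary relational table which points at the documents rather than containing them. The table name, column names and sample rows below are assumptions made for illustration; an in-memory SQLite database stands in for whatever relational DBMS an organisation actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE doc_metadata (
           doc_id     INTEGER PRIMARY KEY,
           location   TEXT NOT NULL,   -- where the document itself lives
           author     TEXT,
           main_topic TEXT,
           language   TEXT,
           revised    TEXT             -- ISO date of the last revision
       )"""
)
# Hypothetical rows: the documents stay in the file system,
# only their metadata is held in the database.
rows = [
    (1, "file:///docs/budget-forecast.doc", "A. Analyst", "budget", "en", "2005-04-01"),
    (2, "file:///docs/retail-trends.doc", "B. Buyer", "retail", "en", "2005-03-15"),
]
conn.executemany("INSERT INTO doc_metadata VALUES (?, ?, ?, ?, ?, ?)", rows)

def find_by_topic(topic):
    """Return (doc_id, location) pairs for documents on the given topic."""
    cur = conn.execute(
        "SELECT doc_id, location FROM doc_metadata WHERE main_topic = ? ORDER BY doc_id",
        (topic,),
    )
    return cur.fetchall()

matches = find_by_topic("retail")
```

Because the metadata is just another table, it can be joined with other application tables (a data warehouse, for instance) using normal SQL.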

UnStructured Data Management

2. Metadata can be stored within the document - the XML-based standard Resource Description Framework (RDF) is a good form. It is not tied to a particular metadata standard
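A small sketch of what such an RDF description might look like when generated from Python. The choice of Dublin Core element names (`title`, `creator`, `date`) is an assumption, one possible vocabulary among many, and the resource URI and values are invented for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core: an assumed vocabulary
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

def describe(resource_uri, title, creator, date):
    """Build an RDF/XML description of one document and return it as text."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_uri}
    )
    for tag, value in (("title", title), ("creator", creator), ("date", date)):
        ET.SubElement(desc, f"{{{DC_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical document details.
rdf_xml = describe(
    "http://example.org/reports/q1", "Quarterly Report", "J. Citizen", "2005-05-01"
)
```

The resulting XML fragment could be embedded in, or shipped alongside, the document it describes. Swapping Dublin Core for another vocabulary only changes the namespace and tag names, which is the sense in which RDF is not tied to one metadata standard.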

Metadata can also manage access control

– copyright agreements to manage the distribution (or not) of content (remember hassles with Napster ? - not exactly ‘documents’ in the traditional sense)

– access control can exist at many logical points, such as document, journal, magazine, or publisher levels

UnStructured Data Management

Metadata can also be directed at quality control and document ranking

A document in the ‘Financial Review’ for instance might be awarded a higher ranking (authority) than say an article available at a lesser-known Web site

and … A final report to a CEO should have more weight than draft memos - but the memos might rank higher in a search which is based only on frequency of word occurrences.
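The CEO report example can be made concrete with a toy ranking function. The authority weights, document types and texts below are assumptions chosen only to show the effect; real systems combine many more signals.

```python
def term_frequency(text, terms):
    """Count how often the query terms occur in the text."""
    words = text.lower().split()
    return sum(words.count(t.lower()) for t in terms)

def rank(documents, terms, authority):
    """Sort by term frequency scaled by a per-type authority weight."""
    def score(doc):
        return term_frequency(doc["text"], terms) * authority.get(doc["type"], 1.0)
    return sorted(documents, key=score, reverse=True)

# Hypothetical documents and weights.
documents = [
    {"id": "memo-draft", "type": "draft memo", "text": "merger merger merger timing"},
    {"id": "ceo-report", "type": "final report", "text": "merger recommended after review"},
]
authority = {"final report": 5.0, "draft memo": 1.0}

by_frequency = rank(documents, ["merger"], authority={})
by_authority = rank(documents, ["merger"], authority=authority)
```

With no weights the draft memo ranks first, because it repeats the word three times. Once the document-type metadata is taken into account the final report moves to the top.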

UnStructured Data Management

Metadata can be used to generate automatic summaries and clustering data - but as with derived values in SQL statements execution, it might be more appropriate not to store the summaries or clustered information, but to re-generate these on an as-required basis. (also saving storage and retrieval costs).

UnStructured Data Management

Stage 2 :

This deals with ‘user focus’ also known as ‘profiles’

This is an effective key to improving precision and recall of information retrieval

Profiles are maps of user interests (as is a bit map). They use the same representation schemes as the metadata describing the contents of documents

UnStructured Data Management

Some research work by Tsvi Kuflik and Peretz Shoval identifies the following profiles

1. User-created : They are the easiest to implement but require the user to create, and maintain, them

2. System-generated : These analyse word frequencies in relevant documents to identify patterns which indicate appropriate texts

3. System-Plus User-generated profiles : These are autogenerated profiles which are then modified by a user

4. Neural-net profiles are ‘trained’ using selected or suggested texts provided by a user. They provide ranking in relevance for other texts

UnStructured Data Management

5. Stereotype models : These are interests shared by a large number or groups of users, which provide the basis for building specialised or individualised profiles

6. Rule-based filtering : These profiles use explicit if-then rules to classify or categorise content

A word of caution - Each of the techniques has its own advantages and disadvantages (manual update by users when interests alter, or possibly adaptation to these changes - possibly the Euro is an example here).

However, on the plus side, the profile produced will provide a long-term resource for filtering, minimising ambiguity, and document gathering.
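Two of the profile types listed above, system-generated (word frequency) profiles and rule-based filtering, are simple enough to sketch. Everything here is illustrative: the stop-word list, the profile size, the overlap threshold and the rules themselves are assumptions, not part of the research cited.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "on", "for", "is"}

def words(text):
    """Lowercase words of a text with common stop words removed."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_profile(relevant_texts, size=3):
    """System-generated profile: the most frequent words in relevant texts."""
    counts = Counter(w for text in relevant_texts for w in words(text))
    return {w for w, _ in counts.most_common(size)}

def matches_profile(profile, text, min_overlap=2):
    """A text is of interest if it shares enough words with the profile."""
    return len(profile & set(words(text))) >= min_overlap

def rule_based_category(text):
    """Rule-based filtering: explicit if-then rules, checked in order."""
    found = set(words(text))
    if {"euro", "currency"} & found:
        return "finance"
    if "inventory" in found:
        return "retail"
    return "uncategorised"

# Texts the user has already marked as relevant (hypothetical).
relevant = [
    "euro exchange rates and the euro zone",
    "exchange rates in the euro area",
]
profile = build_profile(relevant)
```

The profile adapts automatically if the set of relevant texts changes, whereas the rules stay fixed until someone edits them. That is the maintenance trade-off noted in the word of caution above.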

UnStructured Data Management

Stage 3: Content Access Control :

The Web offers a ‘free flowing information model’.

Portals, on the other hand, require controlled access as users are more willing and likely to share information when they know it is distributed within the bounds of well-defined security business rules. (Monash University is considering increasing security over material available via Courseware Web pages).

UnStructured Data Management

There are 3 main access control areas

– Open-access information

– Licence restricted information

– Privileged information

UnStructured Data Management

Open-access information is freely available to all portal users

News items, press releases, product catalogues, appointments … are in this category

Licence-restricted information is self explanatory. This information is defined by agreements with content providers (e.g. Dun & Bradstreet). It could be the digital library of a professional organisation. (would Monash research papers qualify here ?)

Access control generally uses user authentication and/or IP address verification (as in e-commerce perhaps ?)

UnStructured Data Management

Privileged Information is granted on a ‘need to know’ basis - but with a different slant than ‘news’.

An example here could be co-operative research (e.g. DNA) or the complicated case where for instance solicitors working on a negotiation with a client will require access to related documents - but others in the same firm or office working on the same type of negotiation should not be able to access the same documents.
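The three access control areas can be expressed as one small decision function. This is a sketch under stated assumptions: the class and field names are invented, authentication is reduced to a flag, and IP address verification is left out.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    authenticated: bool = False
    licences: set = field(default_factory=set)

@dataclass
class Document:
    title: str
    level: str                 # "open", "licensed" or "privileged"
    licence: str = ""          # licence needed for "licensed" content
    need_to_know: set = field(default_factory=set)  # names allowed for "privileged"

def can_access(user, doc):
    """Apply the open / licence-restricted / privileged rules in turn."""
    if doc.level == "open":
        return True
    if not user.authenticated:
        return False
    if doc.level == "licensed":
        return doc.licence in user.licences
    if doc.level == "privileged":
        return user.name in doc.need_to_know
    return False

# Hypothetical users and documents.
guest = User("guest")
alice = User("alice", authenticated=True, licences={"credit-reports"})
bob = User("bob", authenticated=True)

news = Document("Press release", "open")
credit = Document("Credit report", "licensed", licence="credit-reports")
deal = Document("Negotiation file", "privileged", need_to_know={"alice"})
```

Here `bob` is authenticated and works in the same firm as `alice`, but he is not on the need-to-know list for the negotiation file, so `can_access(bob, deal)` is false. This mirrors the solicitors example.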

UnStructured Data Management

Stage 4: Rich Search Support

Efforts to find correlations between terms, and to improve user query execution, have produced a number of techniques for representing documents.

However, there is still a ‘precision and recall barrier’ at about the 65 to 70% level because of the dependence on what are essentially statistical techniques rather than linguistic comprehension.
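For reference, precision is the fraction of retrieved documents which are relevant, and recall is the fraction of relevant documents which were retrieved. A minimal calculation, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: four documents returned, three actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).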

UnStructured Data Management

It is now generally accepted that a better approach combines

– keyword searching

– clustering

– visualisation

Keyword searches extend the user query, generally using a thesaurus : this uses synonyms, soundex and the extension of singular terms to include plural terms

for example : ‘Stocks’ could become ‘Stocks and Shares’

England could extend to English

Bank could include Banks
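Query extension of this kind might be sketched as follows. The thesaurus entries are hypothetical, the plural rule is deliberately naive (it simply appends ‘s’), and soundex matching is omitted.

```python
# Hypothetical thesaurus entries, keyed by lowercase term.
THESAURUS = {"stocks": ["shares"], "england": ["english"]}

def expand_term(term):
    """Return the term together with its synonyms and a naive plural."""
    term = term.lower()
    variants = {term}
    variants.update(THESAURUS.get(term, []))
    # Naive singular -> plural extension; real systems use proper stemming.
    if not term.endswith("s"):
        variants.add(term + "s")
    return variants

def expand_query(terms):
    """Expand every term in the query and pool the results."""
    expanded = set()
    for term in terms:
        expanded |= expand_term(term)
    return expanded

bank_terms = expand_query(["Bank"])
stock_terms = expand_query(["Stocks"])
```

Every added variant widens the search, which improves recall but also increases the number of hits the user has to work through.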

UnStructured Data Management

A danger with this, as you have probably found in your use of the Monash Library facilities, is the high number of ‘hits’.

Clustering : Hierarchical clustering builds tree structures where

– the root of the tree contains all documents

– internal nodes contain groups of similar documents

– the size of the groups decreases as the user moves from the root node to the leaf node (these would contain single documents).

You have probably met the term ‘drilling’.
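The shape of such a tree, and what ‘drilling’ does to the size of the group, can be shown with a hand-built example. No clustering algorithm is run here: the tree, its labels and the `drill` helper are assumptions made purely to illustrate the structure.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def documents(self):
        """All documents at or below this node (leaves hold single documents)."""
        if not self.children:
            return [self.label]
        return [d for child in self.children for d in child.documents()]

def drill(node, path):
    """Follow a sequence of child labels down from the given node."""
    for label in path:
        node = next(c for c in node.children if c.label == label)
    return node

# Hypothetical cluster tree: the root holds everything, leaves are documents.
root = Node("all", [
    Node("finance", [
        Node("markets", [Node("doc1"), Node("doc2")]),
        Node("doc3"),
    ]),
    Node("retail", [Node("doc4")]),
])

sizes = [
    len(root.documents()),
    len(drill(root, ["finance"]).documents()),
    len(drill(root, ["finance", "markets"]).documents()),
]
```

Drilling from the root to ‘finance’ to ‘markets’ shrinks the group from four documents to three to two.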

UnStructured Data Management

The Scatter/Gather algorithm produces a result set clustered into a small, fixed number of groups.

Users select the most appropriate group or groups, and the documents in the selected groups are clustered again into the same number of groups.

Users can keep drilling down in this way: at each step the documents in the most appropriate cluster are regrouped into a number of semantically related clusters.
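The scatter, select, gather, re-scatter loop can be imitated with a toy clustering step. This is not the clustering procedure used by Scatter/Gather itself: as a stand-in, the first k documents act as seeds and every document joins the seed it shares most words with (Jaccard similarity). The documents and function names are assumptions.

```python
def word_set(text):
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard similarity between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def scatter(docs, k):
    """Toy clustering: group docs around the first k documents as seeds."""
    ids = sorted(docs)
    seeds = ids[:k]
    clusters = {seed: [] for seed in seeds}
    for doc_id in ids:
        best = max(
            seeds,
            key=lambda s: similarity(word_set(docs[doc_id]), word_set(docs[s])),
        )
        clusters[best].append(doc_id)
    return list(clusters.values())

def gather(docs, clusters, chosen):
    """Merge the clusters the user selected into a smaller collection."""
    return {doc_id: docs[doc_id] for i in chosen for doc_id in clusters[i]}

# Hypothetical result set.
docs = {
    "a": "interest rates rise",
    "b": "retail sales fall",
    "c": "interest rates fall sharply",
    "d": "retail sales rise",
    "e": "interest rates steady",
}
first_pass = scatter(docs, k=2)               # scatter into two groups
selected = gather(docs, first_pass, [0])      # user picks the first group
second_pass = scatter(selected, k=2)          # re-scatter the selection
```

The first pass separates the interest-rate documents from the retail ones. After the user picks the interest-rate group, the second pass splits that smaller set into two groups again, so the number of groups stays fixed while each group gets more specific.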

UnStructured Data Management

Coarse and Fine Granularity

Coarse granularity navigation software normally displays trees, where each node (or leaf) represents a document labelled with a text. (visualisation)

A user can then access a ‘group’ of such nodes (hyperlinked documents) and drill in on items of interest. (The alternative would be to click each page and examine its contents)

UnStructured Data Management

Fine Granularity:

When a user needs to locate specific information, such as information about sales of palm-held devices, and needs to distinguish key marketing terms among a number of plans, then it is necessary to focus quickly on those terms and on their relationship with other terms

(have you read about the latest mobile phone plans ?)

UnStructured Data Management

This probably all sounds very advanced.

It also sounds as if high speed communications are necessary - and they are.

However there is a penalty, and this is the large repository of relevant content necessary - which must by its nature be current or up to date.

This is best handled by automatic and automated tools

UnStructured Data Management

Stage 5: Keeping the Content Current

Harvesters, Trawlers and File Retrieval systems gather documents for content repository inclusion

This software is driven by metadata about the sites to search and about the directories or document management systems to scan for relevant information.

Metadata and Indexing details may be all that is required. The documents can be retrieved from source on an as-required basis

UnStructured Data Management

There is the aspect of ‘outdated’ data

This is capably handled by metadata about document types and sources

Predictions about the Australian Federal Budget are obsolete when the Budget is formally released - so, delete the predictions? Do they have any further use ?
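One way to picture this is a check against expiry metadata. The field names (`expires`, `superseded_by`), the catalogue entries and the date used are all illustrative assumptions. The sketch only flags obsolete documents rather than deleting them, since the question of whether old predictions have any further use is left open above.

```python
from datetime import date

def is_obsolete(meta, today):
    """A document is obsolete once its expiry date passes or it is superseded."""
    expires = meta.get("expires")
    if expires is not None and today >= expires:
        return True
    return meta.get("superseded_by") is not None

def current_documents(catalogue, today):
    """Names of documents which are still current on the given day."""
    return sorted(d for d, meta in catalogue.items() if not is_obsolete(meta, today))

# Hypothetical catalogue: the prediction expires on an illustrative release date.
catalogue = {
    "budget-prediction": {"type": "prediction", "expires": date(2005, 5, 10)},
    "budget-papers": {"type": "official release", "expires": None},
}

before = current_documents(catalogue, date(2005, 5, 9))
after = current_documents(catalogue, date(2005, 5, 11))
```

Before the expiry date both documents are current. Afterwards only the official release remains in the active set, while the prediction is still held in the catalogue for anyone who wants to compare forecasts with the outcome.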

UnStructured Data Management

And a final thought :-

The logging of

– when documents arrived,

– where they came from,

– their source (author ?)

– other ‘information’ such as ‘superseded by’ or ‘expiry date’

will be handled by content management processes