cse3180 semester 1 2005 / nonstructured 1 week 9 a few concepts and approaches in unstructured data

CSE3180 Semester 1 2005 / Nonstructured 1

Week 9

A Few Concepts and Approaches in Unstructured Data


Data Management

These notes will address the imbalance between Information sourced from structured data storage systems, mainly databases (as in Relational Databases)

and Information which is necessary for informed management decision making processes based on non structured data (also known as UnStructured data)


UnStructured Information

• Much effort and expense has been directed at the design, capture, processing, storage and retrieval of data in a structured form.

• There are a number of database management systems (DBMS) which can store many trillions of bytes in centralised, distributed and client/server based systems

• There are extensive backup, recovery and restart procedures in place to ensure persistence and continuity of data



The ratio of unstructured to structured information in many organisations is approximately 9 to 1

It is easy to conclude that the most important component driving much of the decision making in key business processes is badly neglected.

Familiar examples are the World Wide Web,

Corporate intranets

email

on line discussion groups



Why is there this bias ?

Perhaps because structured information management has been synonymous with information systems design ?

The technology for unstructured data management is powerful, pervasive and well understood



Businesses are becoming more interconnected

Each new connection is made by, and relies on, exchange of Information

One aspect of these ‘connections’ is that of the unpredictability of the nodes and the nodal content

This impacts on the structured information model



Structured information systems depend on there being a collection of facts - which make up records

Data storage was ‘expensive’ in terms of access times, processing cycles and the physical media itself

‘Critical data elements’ were (and are) used to minimise data storage requirements, and also to improve processing times

Reduction of the number of bytes allocated to these critical data elements was a bonus, for example a 2 digit year, or a 5 digit salary expression



The end product of this ‘distillation’ is structured information

Its basis is stored in a predefined record form

It is dependent on the skills of analysts and management to anticipate precisely which data elements must be stored

This reliance on a predefined record form which includes some data but excludes other data is now seen as a key limitation of structured information sources



The term ‘unstructured’ is invariably associated with ‘documents’.

They are a medium which we understand and use

There are other forms of unstructured data

audio

voice

images

graphical objects

These are forms of electronic documents



These forms are ‘unstructured’ because their exact content and organisation are unpredictable

Unstructured information is any information type made up of content which does not fit a predefined, descriptive model or arrangement

It would be possible to impose (or superimpose) a structure on a document to make document selection possible, but there would be a cost



A document’s content could be ‘distilled’ - which is a process of eliminating or summarising a body of information to its ‘essential’ components.

The danger in this is that the ‘content’ of a document may alter from use to use (processing to audit for instance) and from user to user (credit check authorisation to inventory levels control for instance)



A document may need to be tracked using

author name

title

filing date

a short abstract of the content (sound familiar ?)

The effect of this is that the ‘content’ of the document can now be accessed only through these 4 ‘keys’

Question : How could a lengthy research task be categorised by value or content using this technique ?



Document management systems support full text search and retrieval of the total document content. (There are also some nice structured document indexing techniques)
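To contrast full text retrieval with access through four fixed 'keys', here is a minimal sketch of an inverted index in Python. The sample documents and the names `build_index` and `search` are assumptions made for illustration; they are not taken from any particular document management product.

```python
from collections import defaultdict

# Hypothetical sample documents: id -> full text
documents = {
    1: "Quarterly audit of credit check authorisation procedures",
    2: "Inventory levels control report for the warehouse",
    3: "Audit trail for inventory adjustments",
}

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of the documents containing the word."""
    return sorted(index.get(word.lower(), set()))

index = build_index(documents)
print(search(index, "audit"))      # documents found through their content
print(search(index, "inventory"))
```

Every word of the content becomes an entry point, so a document can be found by a term that nobody anticipated when it was filed.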

Structured data systems will probably always be necessary to track and manage specific facts about key business transactions which associate with or drive other business transactions ( a classic case is the Automated Teller Machine System).

They must have limitations. Structured information systems are based on the premise that it is possible to predict the context in which business information is useful or necessary



That is another way of saying

the who

the when

the where

the why

and the how - realistic in today’s environment ?

So what is different with today’s decisions ?

The difference is the nature and kind of decisions made



Today’s business is driven by increasing rates of change

We have shifted from an industrial economy to a knowledge driven economy

The need for more information to support decision-making processes, together with the dynamic nature of the business environment, means that the support from structured information systems is starting to be inadequate

However there is a possibility of ‘information overload’ - one of the accepted criteria of ‘knowledge work’ is a competency to manage the increasing amounts of information



The ‘industrial’ economy did have a high degree of predictability

Companies produced and sold a fairly narrow set of products or services

Competition existed. Business operated in ‘static’ markets

Change was relatively slow - change was recognised and there was time to readjust



In the current environment, the ‘products’ are now our ideas

Ideas are driven by information

The term ‘Globalisation’ is appropriate

We think and we change or modify our plans - and this has led to the loss of predictability

In the ‘knowledge environment’, business success depends on the ability of knowledge workers to sift through all of the available unstructured resources and to make decisions faster than the competitors

The measurement of success is in degrees of innovation



So, what are the ‘sources of information’ ?

Corporate document bases

The Internet

The Extranets

Information subscription services

Dialog with Customers, suppliers, competitors

The 2 major problems of decision making are

1. Volume of information

2. The speed at which decisions need to be made



Information retrievals have moved away from the filtering of unstructured data into a structured environment.

The emerging models accommodate the capture of resources (and access) which leads to a dynamic and unfiltered information repository which consists of joined but separated sources.

Web sites are sought and searched via the Internet, and possibly a corporate document repository is included in the search.



Some of the information found will be transferred to a more structured repository - possibly a competitive analysis database.

A search in this way provides a set of search results but does not change the information sources.

The Importance of Tools:

Users don’t normally have the time or the skills to ‘get on top of’ a varied and changing set of tools (what is your experience with the various search engines you have used on the Internet ?)



What is needed is a retrieval tool (hardware and software) which understands how to work with different repositories

This leads to the ‘repository management system’ being able to recognise the form and requirements of different search tools or, if you like, software which recognises and communicates exactly and completely with other software

Another aspect is the nature of a search - this has changed from simple words or phrase retrievals to ‘context’ and dynamic analysis and categorisation



One of the ‘motivations’ for adopting electronic document management is the volume of information.

This is so for document imaging

There is an offset - how to retrieve - quickly and completely

You are familiar with ‘indexes’ such as Title, Name, Number as being the entry point to a limited set of possibilities

Today’s users want to be able to search entire documents and to match them to other fully searchable documents based on user-defined relevance



Fortunately, existing systems support many different file types in the native application - images, word-processing systems, desktop publishing systems, spreadsheets, CAD files ….

2 methods of access

- user defined indexes in a full functioned database (3rd party relational databases)

- a fully integrated full-text query engine


What about ???

• Security

There are highly sophisticated security schemes (also mentioned in Portals)

• Revision Tracking and Control

Essential for accurate and up to date information (what revision number of MS-Word are you using ?)

• Document Check-in/Check-out

Recognition and tagging of all documents - none can be ignored or ‘go missing’

• Usage audit trails

These features are standard features



Compound document managers

These are products which provide the facilities mentioned in the previous overheads, but they have an interesting extension

They treat documents as a collection of ‘pointers’ to various and different collections of information - this ensures that ‘new’ data or information will always be part of a search. (you recall the hypertext links - and have you used the links which are attached to my Web page ?)

Other features which ensure document integrity include roll-back, recovery, and audit trails



There is a single focal point around which unstructured information management practice and technology have converged

No surprise - it is the Internet

This facility has enabled the widespread distribution of unstructured information from a very large base of resources

The World Wide Web is a hyper-linked, unstructured information repository with millions (?) of documents



The Web has made it possible for specialised content to be published, as well as to advertise products

Specialised on-line information subscription services exist which provide industry-specific information (for a fee)

Stock market information services and sites

Monash University has on-line examination results services (and enrolments ?)



Internet technology is said to be ubiquitous - that is it can reach anywhere - that is where there are suitable devices of course

All current document management systems provide some degree of support for Web technology

Capabilities range from

basic document viewing using dynamic HTML to make the document ‘visible’ regardless of its source format

to

advanced document access and distribution



For instance, a Computer Aided Design drawing (perhaps the Burnley tunnel?) can be viewed in a user’s browser without requiring that remote users have CAD software installed on their machines

Or perhaps the Scoresby by-pass (or is it now the proposed Mitcham - Frankston freeway ?)

Or the Mullum Mullum creek and its environment

Or perhaps Coastal Area developments



Some advanced search engines: (form flexibility)

Verity Inc Information Server

PC Docs/Fulcrum SearchServer

Inktomi Corp Search Engine

Externalisation - knowledge management

Autonomy Inc

Semio Corp

These use a semantic or lexical analysis engine to extract meaning from the information repository



A traditional query would look like

‘interest rates AND stock prices’ and you would receive documents which included the 2 search words

A query expressed as ‘I am interested in the effects of interest rate changes on stock prices’ does not have specific key words nor phrases.

Such an expression would have its content meaning derived from the information base. The result would be a categorisation of topics contained in the information base.

These would then be further analysed by the user over a varying number of criteria
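The traditional query form can be sketched in a few lines of Python. This only illustrates the Boolean AND style of retrieval; it makes no attempt at the semantic analysis of the natural-language form. The sample information base and the function names are assumptions.

```python
# Hypothetical information base: document id -> text
documents = {
    "a": "interest rates rose and stock prices fell",
    "b": "stock prices were steady this week",
    "c": "the bank left interest rates unchanged",
}

def matches_all(text, phrases):
    """True when every phrase occurs in the text (a Boolean AND)."""
    lowered = text.lower()
    return all(phrase.lower() in lowered for phrase in phrases)

def boolean_and(docs, phrases):
    """Return the ids of the documents that contain all of the phrases."""
    return [doc_id for doc_id, text in docs.items() if matches_all(text, phrases)]

print(boolean_and(documents, ["interest rates", "stock prices"]))
```

Only document "a" contains both phrases. A document that discussed the same subject in different words would be missed, which is the gap the semantic engines aim to close.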


A few thoughts on ‘Detail’

Retailers :

Retailing is a competitive business. Success depends on the Retailer knowing the customers.

There are 3 factors at play

1. The changing likes and dislikes of consumers.

New product preferences are the result of an aging population, changes in family structure, flexible lifestyles

and to no small degree, enterprise bargaining and working conditions.

A successful retailer must be aware of and adapt to these factors



2. The uniqueness of each customer.

Individual needs can only be accommodated by knowing individual requirements.

‘Loyalty’ programmes are based on this premise

3. The importance of managing inventory levels, controlling markdowns, maintaining margins.

Static inventory results in greater interest expense. This acts as a barrier to reinvestment in stock which is moving. Unmoving stock will (in many cases) require markdowns to liquidate the stock, which negatively impacts margins



This latter point depends on the level of success with the first 2 factors (changes, uniqueness).

Challenges to retail businesses are:

Knowing who the customer is.

Track and capture purchase history. Identify future needs by storing data at a meaningful level of detail

As an example, a back to school marketing program for customers who purchased school materials last year.



Understand how the customer wishes to interact.

How will information be sought about a possible purchase ? Personal visit, phone, Internet, email ..?

How will the sale be finalised - personal, phone, Internet, credit card, cash , account entry ?

What is the customer’s preferred method of interaction ?

Perhaps a ‘personal’ profile should be developed

Be able to track and evaluate the strength of the relationship with each customer

Purchase history plus other contacts - e.g. warranty service, new or updated products



Know what is ‘enough’ detail.

Costs are associated with keeping details. Summary information may however result in incorrect projections and decisions.

Historical data (such as for air conditioners as opposed to toothpaste)

Analysis of cyclical data for air conditioners should assist in reducing risks of commitments and distribution of such items

The revenue possibilities should outweigh the expense of maintaining air conditioner data

The question is - how to decide what data to store



Incorporate the knowledge gained of customer needs into ‘business intelligence’

This knowledge should be used to

– analyse past performance

– get insight into current trends

– blend this information into the business plan

– develop systems which accurately reflect the customers’ needs

– develop systems which interact and support the profit model


UnStructured Data Management

The following overheads address this situation and suggest some ways in which unstructured text can in fact be managed, and thus become part of the decision making processes

Stage 1:

1. Not surprisingly, the first stage is the creation of ‘intelligence’ - a form of data dictionary.

A more formal term is ‘meta data’ or data about data

This concept is not new as structured databases use metadata - and so do the newer Microsoft Operating Systems

The big trick is to capture representative meta data



Metadata about content serves a number of functions

especially the essentials of text such as

– main topics

– author or authors

– language

– publication

– revision dates

Notice the similarity to ‘library’ systems, including Voyager ?
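A content metadata record of this kind could be sketched as a small Python data class. The class name, the field names and the sample values are assumptions chosen to mirror the list above; they do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DocumentMetadata:
    """Content metadata for one text; the field names are illustrative."""
    main_topics: list
    authors: list
    language: str
    publication: str
    revision_dates: list = field(default_factory=list)

    def latest_revision(self):
        """Most recent revision date, or None when none is recorded."""
        return max(self.revision_dates, default=None)

# Hypothetical record
record = DocumentMetadata(
    main_topics=["interest rates", "stock prices"],
    authors=["A. Author"],
    language="en",
    publication="Hypothetical Review",
    revision_dates=[date(2005, 3, 1), date(2005, 4, 12)],
)
print(record.latest_revision())
```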



This metadata improves the precision and quality of full text and keyword search - it allows users to specify additional document attributes

It also supports and extends

– classifying and routing content

– deleting expired texts

– determining any additional processing (e.g. translation based on ‘knowledge’ of the source and user)



There are two aspects of metadata

– extraction from text

– storage

Extraction deals with detection of ‘keys’ such as

– author’s name

– the main thrust of the text (or thrusts)

Storage addresses the support of a number of access and retrieval methods



Extraction of metadata is rarely manual - there are currently some interesting software tools such as MetaMaker, Insight Software’s Categoriser

They are probably not as effective from the aspect of quality as a person - but they do handle large masses of text quickly

(what are the risks involved, and what is the minimum level acceptable ?)



As you might expect, storage is much more ‘automated capable’.

1. It can be managed separately from the documents in a relational database - as are the data dictionary tables

This is a fairly normal approach in document warehousing where close integration with other database applications such as data warehousing, is a requirement
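As a rough sketch of this first option, the metadata can be held in a relational table that records where each document lives, while the documents themselves stay outside the database. The example uses Python's built-in `sqlite3` module; the table name, column names and sample rows are assumptions.

```python
import sqlite3

# In-memory database standing in for the metadata store
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document_metadata (
        doc_id   INTEGER PRIMARY KEY,
        title    TEXT,
        author   TEXT,
        language TEXT,
        location TEXT  -- where the document itself is held
    )
    """
)
conn.executemany(
    "INSERT INTO document_metadata VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Budget forecast", "J. Smith", "en", "/docs/budget.doc"),
        (2, "Tunnel drawings", "K. Lee", "en", "/cad/tunnel.dwg"),
        (3, "Budget review", "J. Smith", "en", "/docs/review.doc"),
    ],
)

def find_by_author(connection, author):
    """Return (doc_id, location) pairs for one author, lowest id first."""
    cursor = connection.execute(
        "SELECT doc_id, location FROM document_metadata "
        "WHERE author = ? ORDER BY doc_id",
        (author,),
    )
    return cursor.fetchall()

print(find_by_author(conn, "J. Smith"))
```

Because the metadata sits in ordinary tables, it can be joined with other database applications in the usual way.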



2. Metadata can be stored within the document - the XML-based standard Resource Description Framework (RDF) is a good form. It is not tied to a particular metadata standard

Metadata can also manage access control

– copyright agreements to manage the distribution (or not) of content (remember hassles with Napster ? - not exactly ‘documents’ in the traditional sense)

– access control can exist at many logical points such as document, journal, magazine, or publisher levels



Metadata can also be directed at quality control and document ranking

A document in the ‘Financial Review’ for instance might be awarded a higher ranking (authority) than say an article available at a lesser known Web site

and … A final report to a CEO should have more weight than draft memos - but the memos might rank higher in a search which is based only on frequency of word occurrences.
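The effect of an authority weighting can be sketched as follows. The scoring rule (word frequency multiplied by an authority value held as metadata) and the weights themselves are assumptions made for illustration, not a description of how any product ranks documents.

```python
def term_frequency(text, term):
    """Count how many times the term occurs as a word in the text."""
    return text.lower().split().count(term.lower())

def rank(docs, term, use_authority=True):
    """Order document names by score, highest first.

    docs is a list of dicts with 'name', 'text' and 'authority' keys.
    """
    def score(doc):
        weight = doc["authority"] if use_authority else 1.0
        return term_frequency(doc["text"], term) * weight
    return [doc["name"] for doc in sorted(docs, key=score, reverse=True)]

# Hypothetical documents; the authority values are invented
docs = [
    {"name": "draft memo", "text": "budget budget budget notes", "authority": 1.0},
    {"name": "final report", "text": "budget approved", "authority": 5.0},
]

print(rank(docs, "budget", use_authority=False))  # frequency only
print(rank(docs, "budget"))                       # frequency times authority
```

On word frequency alone the draft memo comes first; once the authority metadata is applied the final report moves to the top.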



Metadata can be used to generate automatic summaries and clustering data - but as with derived values in SQL statements execution, it might be more appropriate not to store the summaries or clustered information, but to re-generate these on an as-required basis. (also saving storage and retrieval costs).



Stage 2 :

This deals with ‘user focus’ also known as ‘profiles’

This is an effective key to improving precision and recall of information retrieval

Profiles are maps of user interests (as is a bit map). They use the same representation schemes as the metadata describing the contents of documents



Some research work by Tsvi Kuflik and Peretz Shoval identifies the following profiles

1. User-created : They are the easiest to implement but require the user to create, and maintain, them

2. System-generated : These analyse word frequencies in relevant documents to identify patterns which indicate appropriate texts

3. System-Plus User-generated profiles : These are autogenerated profiles which are then modified by a user

4. Neural-net profiles are ‘trained’ using selected or suggested texts provided by a user. They provide ranking in relevance for other texts



5. Stereotype models : These are interests shared by a large number or groups of users, which provide the basis for building specialised or individualised profiles

6. Rule-based filtering : These profiles use explicit if-then rules to classify or categorise content
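A rule-based profile might be sketched as a list of if-then rules applied to each text. The rules and category names below are invented for illustration.

```python
# Each rule pairs a condition on the text with a category.
# All of these rules are hypothetical.
RULES = [
    (lambda text: "budget" in text, "finance"),
    (lambda text: "warranty" in text, "customer service"),
    (lambda text: "euro" in text, "currency"),
]

def classify(text):
    """Return every category whose rule fires, in rule order."""
    lowered = text.lower()
    categories = [category for condition, category in RULES if condition(lowered)]
    return categories or ["uncategorised"]

print(classify("Federal Budget and the Euro"))
print(classify("Lunch menu"))
```

The rules are easy to read and audit, but somebody has to keep them up to date as interests change.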

A word of caution - Each of the techniques has its own advantages and disadvantages (manual update by users when interests alter, or possibly adaptation to these changes - possibly the Euro is an example here).

However, on the plus side, the profile produced will provide a long-term resource for filtering, minimising ambiguity, and document gathering.
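A system-generated profile based on word frequencies could look roughly like the sketch below. It builds the profile from texts the user found relevant and scores new texts by cosine similarity. The stop-word list, the sample texts and the choice of cosine similarity are assumptions; the notes do not prescribe a particular measure.

```python
import math
from collections import Counter

# A very small, hypothetical stop-word list
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "on"}

def word_counts(text):
    """Count the words in a text, ignoring stop words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def build_profile(relevant_texts):
    """Sum the word counts of texts the user judged relevant."""
    profile = Counter()
    for text in relevant_texts:
        profile.update(word_counts(text))
    return profile

def similarity(profile, text):
    """Cosine similarity between the profile and the text's word counts."""
    counts = word_counts(text)
    dot = sum(profile[w] * counts[w] for w in counts)
    norm = math.sqrt(sum(v * v for v in profile.values())) * math.sqrt(
        sum(v * v for v in counts.values())
    )
    return dot / norm if norm else 0.0

profile = build_profile(["interest rates and the euro", "euro exchange rates"])
print(similarity(profile, "euro rates forecast"))
print(similarity(profile, "football results"))
```

Texts that share the profile's frequent words score higher, so the same profile can be reused for filtering over a long period.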



Stage 3: Content Access Control :

The Web offers a ‘free flowing information model’.

Portals, on the other hand, require controlled access as users are more willing and likely to share information when they know it is distributed within the bounds of well-defined security business rules. (Monash University is considering increasing security over material available via Courseware Web pages).



There are 3 main access control areas

– Open-access information

– Licence restricted information

– Privileged information



Open-access information is freely available to all portal users

News items, press releases, product catalogues, appointments … are in this category

Licence-restricted information is self explanatory. This information is defined by agreements with content providers (e.g. Dun & Bradstreet). It could be the digital library of a professional organisation. (would Monash research papers qualify here ?)

Access control generally uses user authentication and/or IP address verification (as in e-commerce perhaps ?)



Privileged Information is granted on a ‘need to know’ basis - but with a different slant than ‘news’.

An example here could be co-operative research (e.g. DNA) or the complicated case where for instance solicitors working on a negotiation with a client will require access to related documents - but others in the same firm or office working on the same type of negotiation should not be able to access the same documents.
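The three access control areas can be sketched as a single check. The dictionary layout, the user names and the licence name are assumptions made for the example.

```python
from enum import Enum

class Access(Enum):
    OPEN = "open-access"
    LICENCE_RESTRICTED = "licence-restricted"
    PRIVILEGED = "privileged"

def may_read(user, document):
    """Decide whether a portal user may read a document."""
    level = document["access"]
    if level is Access.OPEN:
        return True
    if level is Access.LICENCE_RESTRICTED:
        # The user must be authenticated and covered by the licence
        return user["authenticated"] and document["licence"] in user["licences"]
    # Privileged: only users named on the need-to-know list
    return user["authenticated"] and user["name"] in document["need_to_know"]

# Hypothetical users and documents
alice = {"name": "alice", "authenticated": True, "licences": {"digital-library"}}
bob = {"name": "bob", "authenticated": True, "licences": set()}

press_release = {"access": Access.OPEN}
library_paper = {"access": Access.LICENCE_RESTRICTED, "licence": "digital-library"}
negotiation_file = {"access": Access.PRIVILEGED, "need_to_know": {"alice"}}

print(may_read(bob, press_release))
print(may_read(bob, negotiation_file))
```

In the solicitor example, alice is working on the negotiation and can read its file. Bob is in the same firm and authenticated, but he is not on the need-to-know list and is refused.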



Stage 4: Rich Search Support

Finding correlations between terms, improvements in user query execution, have produced a number of techniques for representing documents.

However, there is still a ‘precision and recall barrier’ at about the 65 to 70% level because of the dependence on what are essentially statistical techniques rather than linguistic comprehension.
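For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch of the calculation, with invented document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).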



It is now generally accepted that a better approach combines

– keyword searching

– clustering

– visualisation

Keyword searches extend the user query, generally using a thesaurus : this uses synonyms, soundex and the extension of singular terms to include plural terms

for example : ‘Stocks’ could become ‘Stocks and Shares’

England could extend to English

Bank could include Banks
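The three expansions above can be sketched with two small lookup tables. The tables are hand-written assumptions standing in for a real thesaurus; soundex matching is left out.

```python
# Hypothetical thesaurus tables
SYNONYMS = {"stocks": ["shares"]}
RELATED_FORMS = {"england": ["english"], "bank": ["banks"]}

def expand(term):
    """Return the term followed by its synonyms and related forms."""
    key = term.lower()
    return [key] + SYNONYMS.get(key, []) + RELATED_FORMS.get(key, [])

def expand_query(terms):
    """Expand every term of a query into its list of alternatives."""
    return {term: expand(term) for term in terms}

print(expand_query(["Stocks", "England", "Bank"]))
```

Each extra alternative widens the search, which is also why the number of hits climbs so quickly.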



A danger with this, as you have probably found in your use of the Monash Library facilities, is the high number of ‘hits’.

Clustering : Hierarchical clustering builds tree structures where

– the root of the tree contains all documents

– internal nodes contain groups of similar documents

– the size of the groups decreases as the user moves from the root node to the leaf nodes (these would contain single documents)

You have probably met the term ‘drilling’.
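The shape of such a tree, and what ‘drilling’ means on it, can be sketched as below. The grouping here is written by hand; the sketch does not compute similarity or perform the clustering itself. The labels and file names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterNode:
    """A node in a cluster tree; a node without children is one document."""
    label: str
    children: list = field(default_factory=list)

    def documents(self):
        """All document labels at or below this node."""
        if not self.children:
            return [self.label]
        found = []
        for child in self.children:
            found.extend(child.documents())
        return found

    def drill(self, label):
        """Move one level down to the named child group."""
        for child in self.children:
            if child.label == label:
                return child
        raise KeyError(label)

# Hypothetical tree: the root holds everything, the leaves single documents
root = ClusterNode("all documents", [
    ClusterNode("finance", [ClusterNode("budget.doc"), ClusterNode("rates.doc")]),
    ClusterNode("roads", [ClusterNode("tunnel.dwg")]),
])

print(root.documents())
print(root.drill("finance").documents())
```

Each drill step returns a smaller group, until a leaf holding a single document is reached.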



The Scatter/Gather algorithm produces a result set clustered into a small, fixed number of groups.

Users select the most appropriate group or groups and the referred documents are clustered into the same number of groups.

Users can keep drilling down: each time they drill into the most appropriate cluster, its elements are regrouped into a number of semantically related clusters.



Coarse and Fine Granularity

Coarse granularity navigation software normally displays trees, where each node (or leaf) represents a document labelled with a text. (visualisation)

A user can then access a ‘group’ of such nodes (hyperlinked documents) and drill in on items of interest. (The alternative would be to click each page and examine its contents)



Fine Granularity:

When a user needs to locate specific information, such as information about sales of palm-held devices, and needs to distinguish key marketing terms among a number of plans, then it is necessary to focus quickly on those terms and their relationships with other terms

(have you read about the latest mobile phone plans ?)



This probably all sounds very advanced.

It also sounds as if high speed communications are necessary - and they are.

However there is a penalty, and this is the large repository of relevant content necessary - which must by its nature be current or up to date.

This is best handled by automatic and automated tools



Stage 5: Keeping the Content Current

Harvesters, Trawlers and File Retrieval systems gather documents for content repository inclusion

This software is driven by metadata about the sites to search and the directories or document management systems to scan for relevant information.

Metadata and Indexing details may be all that is required. The documents can be retrieved from source on an as-required basis



There is the aspect of ‘outdated’ data

This is capably handled by metadata about document types and sources

Predictions about the Australian Federal Budget are obsolete when the Budget is formally released - so, delete the predictions? Do they have any further use ?



And a final thought :-

The logging of

– when documents arrived

– where they came from

– their source (author ?)

– other ‘information’ such as ‘superseded by’ or ‘expiry date’

will be handled by content management processes
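A log entry of this kind, and a check for whether a document is still current, might be sketched as follows. The field names, titles and dates are invented; whether an outdated document is deleted or only flagged is left open, as in the question about the Budget predictions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class LogEntry:
    """One logged document; the field names are illustrative."""
    title: str
    arrived: date
    origin: str
    author: str
    expiry_date: Optional[date] = None
    superseded_by: Optional[str] = None

def is_current(entry, today):
    """A document is current if it is neither superseded nor expired."""
    if entry.superseded_by is not None:
        return False
    return entry.expiry_date is None or today <= entry.expiry_date

# Hypothetical entries
prediction = LogEntry(
    title="Budget predictions",
    arrived=date(2005, 4, 1),
    origin="news site",
    author="A. Analyst",
    expiry_date=date(2005, 5, 10),
)
old_plan = LogEntry(
    title="Marketing plan v1",
    arrived=date(2005, 1, 15),
    origin="intranet",
    author="M. Manager",
    superseded_by="Marketing plan v2",
)

print(is_current(prediction, date(2005, 5, 1)))
print(is_current(prediction, date(2005, 6, 1)))
print(is_current(old_plan, date(2005, 2, 1)))
```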
