CSE3180 Semester 1 2005 / Nonstructured 1
Week 9
A Few Concepts and Approaches in Unstructured Data
Data Management
These notes will address the imbalance between Information sourced from structured data storage systems, mainly databases (as in Relational Databases)
and Information which is necessary for informed management decision making processes based on non structured data (also known as UnStructured data)
UnStructured Information
• Much effort and expense has been directed at the design, capture, processing, storage and retrieval of data in a structured form.
• There are a number of database management systems (DBMS) which can store many trillions of bytes in centralised, distributed and client/server based systems
• There are extensive backup, recovery and restart procedures in place to ensure persistence and continuity of data
UnStructured Information
The ratio of unstructured to structured information in many organisations is approximately 9 to 1
It is easy to conclude that the most important component which drives much of the decision making in key business processes is badly neglected.
Familiar examples are the World Wide Web,
Corporate intranets
on line discussion groups
UnStructured Information
Why is there this bias ?
Perhaps because structured information management has been synonymous with information systems design ?
The technology for unstructured data management is powerful, pervasive and well understood
UnStructured Information
Businesses are becoming more interconnected
Each new connection is made by, and relies on, exchange of Information
One aspect of these ‘connections’ is that of the unpredictability of the nodes and the nodal content
This impacts on the structured information model
UnStructured Information
Structured information systems depend on there being a collection of facts - which make up records
Data storage was ‘expensive’ in terms of access times, processing cycles and the physical media itself
‘Critical data elements’ were (and are) used to minimise data storage requirements, and also to improve processing times
Reduction of the number of bytes allocated to these critical data elements was a bonus, for example a 2 digit year, or a 5 digit salary expression
UnStructured Information
The end product of this ‘distillation’ is structured information
Its basis is stored in a predefined record form
It is dependent on the skills of analysts and management to anticipate precisely which data elements must be stored
This reliance on a predefined record form which includes some data but excludes other data is now seen as a key limitation of structured information sources
UnStructured Information
The term ‘unstructured’ is invariably associated with ‘documents’.
They are a medium which we understand and use
There are other forms of unstructured data
audio
voice
images
graphical objects
These are forms of electronic documents
UnStructured Information
These forms are ‘unstructured’ because their exact content and organisation are unpredictable
Unstructured information is any information type made up of content which does not fit a predefined, descriptive model or arrangement
It would be possible to impose (or superimpose) a structure on a document to make document selection possible, but there would be a cost
UnStructured Information
A document’s content could be ‘distilled’ - which is a process of eliminating or summarising a body of information to its ‘essential’ components.
The danger in this is that the ‘content’ of a document may alter from use to use (processing to audit for instance) and from user to user (credit check authorisation to inventory levels control for instance)
UnStructured Information
A document may need to be tracked using
author name
title
filing date
a short abstract of the content (sound familiar ?)
The effect of this is that the ‘content’ of the document can now be accessed only through these 4 ‘keys’
Question : How could a lengthy research task be categorised by value or content using this technique ?
UnStructured Information
Document management systems support full text search and retrieval of the total document content. (There are also some nice structured document indexing techniques)
Structured data systems will probably always be necessary to track and manage specific facts about key business transactions which associate with or drive other business transactions ( a classic case is the Automated Teller Machine System).
They must have limitations. Structured information systems are based on the premise that it is possible to predict the context in which business information is useful/necessary
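As a rough illustration of the difference between access through a few predefined ‘keys’ and full text retrieval, the sketch below builds a tiny inverted index in Python. The sample documents, the field names and the helper functions `build_index` and `search` are invented for this example and do not describe any particular document management product.

```python
import re
from collections import defaultdict

# Hypothetical documents: a few descriptive 'keys' plus the full body text.
documents = {
    1: {"author": "Smith", "title": "Credit policy",
        "abstract": "Rules for credit checks",
        "body": "Customers with overdue accounts need a manual credit check."},
    2: {"author": "Jones", "title": "Stock report",
        "abstract": "Monthly inventory summary",
        "body": "Inventory levels fell after overdue deliveries from suppliers."},
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs, fields):
    """Map each word to the set of document ids whose chosen fields contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field in fields:
            for word in tokenize(doc[field]):
                index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of the documents indexed under the word."""
    return sorted(index.get(word.lower(), set()))

# Index over the descriptive keys only, and over the keys plus the body.
key_index = build_index(documents, ["author", "title", "abstract"])
full_index = build_index(documents, ["author", "title", "abstract", "body"])

print(search(key_index, "overdue"))   # the word occurs only in the body text
print(search(full_index, "overdue"))
```

A word which appears only in the body of a document is invisible to the key-based index but is found by the full text index.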
UnStructured Information
That is another way of saying
the who
the when
the where
the why
and the how - realistic in today’s environment ?
So what is different with today’s decisions ?
The difference is the nature and kind of decisions made
UnStructured Information
Today’s business is driven by increasing rates of change
We have shifted from an industrial economy to a knowledge driven economy
The need for more information to support the decision-making processes, and the dynamic nature of the business environment, mean that the support from structured information systems is starting to be inadequate
However there is a possibility of ‘information overload’ - one of the accepted criteria of ‘knowledge work’ is a competency to manage the increasing amounts of information
UnStructured Information
The ‘industrial’ economy did have a high degree of predictability
Companies produced and sold a fairly narrow set of products or services
Competition existed. Business operated in ‘static’ markets
Change was relatively slow - change was recognised and there was time to readjust
UnStructured Information
In the current environment, the ‘products’ are now our ideas
Ideas are driven by information
The term ‘Globalisation’ is appropriate
We think and we change or modify our plans - and this has led to the loss of predictability
In the ‘knowledge environment’, business success depends on the ability of knowledge workers to sift through all of the available unstructured resources and to make decisions - and faster than the competitors
The measurement of success is in degrees of innovation
UnStructured Information
So, what are the ‘sources of information’ ?
Corporate document bases
The Internet
The Extranets
Information subscription services
Dialog with Customers, suppliers, competitors
The 2 major problems of decision making are
1. Volume of information
2. The speed at which decisions need to be made
UnStructured Information
Information retrievals have moved away from the filtering of unstructured data into a structured environment.
The emerging models accommodate the capture of resources (and access) which leads to a dynamic and unfiltered information repository which consists of joined but separated sources.
Web sites are sought and searched via the Internet, and possibly a corporate document repository is included in the search.
UnStructured Information
Some of the information found will be transferred to a more structured repository - possibly a competitive analysis database.
A search in this way provides a set of search results but does not change the information sources.
The Importance of Tools:
Users don’t normally have the time nor the skills to ‘get on top of’ a varied and changing set of tools (what is your experience with the various search engines you have used on the Internet ?)
UnStructured Information
What is needed is a retrieval tool (hardware and software) which understands how to work with different repositories
This leads to the ‘repository management system’ being able to recognise the form and requirements of different search tools or, if you like, software which recognises and communicates exactly and completely with other software
Another aspect is the nature of a search - this has changed from simple words or phrase retrievals to ‘context’ and dynamic analysis and categorisation
UnStructured Information
One of the ‘motivations’ for adopting electronic document management is the volume of information.
This is so for document imaging
There is an offset - how to retrieve - quickly and completely
You are familiar with ‘indexes’ such as Title, Name, Number as being the entry point to a limited set of possibilities
Today’s users want to be able to search entire documents and to match them to other fully searchable documents based on user-defined relevance
UnStructured Information
Fortunately, existing systems support many different file types in the native application - images, word-processing systems, desktop publishing systems, spreadsheets, CAD files ….
2 methods of access
- user defined indexes in a full functioned database (3rd party relational databases)
- a fully integrated full-text query engine
What about ???
• Security
There are highly sophisticated security schemes (also mentioned in Portals)
• Revision Tracking and Control
Essential for accurate and up to date information (what revision number of MS-Word are you using ?)
• Document Check-in/Check-out
Recognition and tagging of all documents - none can be ignored or ‘go missing’
• Usage audit trails
These features are standard features
UnStructured Information
Compound document managers
These are products which provide the facilities mentioned in the previous overheads, but they have an interesting extension
They treat documents as a collection of ‘pointers’ to various and different collections of information - this ensures that ‘new’ data or information will always be part of a search. (you recall the hypertext links - and have you used the links which are attached to my Web page ?)
Other features which ensure document integrity include roll-back, recovery, and audit trails
UnStructured Information
There is a single focal point around which unstructured information management practice and technology have converged
No surprise - it is the Internet
This facility has enabled the widespread distribution of unstructured information from a very large base of resources
The World Wide Web is a hyper-linked, unstructured information repository with millions (?) of documents
UnStructured Information
The Web has made it possible for specialised content to be published, as well as to advertise products
Specialised on-line information subscription services exist which provide industry-specific information (for a fee)
Stock market information services and sites
Monash University has on-line examination results services (and enrolments ?)
UnStructured Information
Internet technology is said to be ubiquitous - that is it can reach anywhere - that is where there are suitable devices of course
All current document management systems provide some degree of support for Web technology
Capabilities range from
basic document viewing using dynamic HTML to make the document ‘visible’ regardless of its source format
to
advanced document access and distribution
UnStructured Information
For instance, a Computer Aided Design drawing (perhaps the Burnley tunnel?) can be viewed in a user’s browser without requiring that the remote users have CAD software installed on their machines
Or perhaps the Scoresby by-pass (or is it now the proposed Mitcham - Frankston freeway ?)
Or the Mullum Mullum creek and its environment
Or perhaps Coastal Area developments
UnStructured Information
Some advanced search engines: (form flexibility)
Verity Inc Information Server
PC Docs/Fulcrum SearchServer
Inktomi Corp Search Engine
Externalisation - knowledge management
Autonomy Inc
Semio Corp
These use a semantic or lexical analysis engine to extract meaning from the information repository
UnStructured Information
A traditional query would look like
‘interest rates AND stock prices’ and you would receive documents which included the 2 search words
A query expressed as ‘I am interested in the effects of interest rate changes on stock prices’ does not have specific key words nor phrases.
Such an expression would have its content meaning derived from the information base. The result would be a categorisation of topics contained in the information base.
These would then be further analysed by the user over a varying number of criteria
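The traditional query can be sketched in a few lines of Python. This is only a minimal illustration of Boolean AND matching over phrases; the three sample texts and the helper names `contains_phrase` and `boolean_and` are invented for the example.

```python
import re

# Hypothetical mini document base.
docs = {
    "d1": "Interest rates rose and stock prices fell sharply.",
    "d2": "Stock prices were steady this week.",
    "d3": "The bank left interest rates unchanged.",
}

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def boolean_and(documents, phrases):
    """Return ids of documents which contain every one of the phrases."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if all(contains_phrase(text, p) for p in phrases))

print(boolean_and(docs, ["interest rates", "stock prices"]))

# The natural language request, treated as one literal phrase, matches nothing.
sentence = "I am interested in the effects of interest rate changes on stock prices"
print(boolean_and(docs, [sentence]))
```

Literal matching returns only the document containing both phrases, and returns nothing for the sentence form of the request, which is why that form needs some analysis of meaning rather than of exact words.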
A few thoughts on ‘Detail’
Retailers :
Retailing is a competitive business. Success depends on the Retailer knowing the customers.
There are 3 factors at play
1. The changing likes and dislikes of consumers.
New product preferences are the result of an aging population, changes in family structure, flexible lifestyles and, to no small degree, enterprise bargaining and working conditions.
A successful retailer must be aware of and adapt to these factors
A few thoughts on ‘Detail’
2. The uniqueness of each customer.
Individual needs can only be accommodated by knowing individual requirements.
‘Loyalty’ programmes are based on this premise
3. The importance of managing inventory levels, controlling markdowns, maintaining margins.
Static inventory results in greater interest expense. This acts as a barrier to reinvestment in stock which is moving. Unmoving stock will (in many cases) require markdowns to liquidate the stock, which negatively impacts margins
A few thoughts on ‘Detail’
This latter point depends on the level of success with the first 2 factors (changes, uniqueness).
Challenges to retail businesses are:
Knowing who the customer is.
Track and capture purchase history. Identify future needs by storing data at a meaningful level of detail
As an example, a back to school marketing program for customers who purchased school materials last year.
A few thoughts on ‘Detail’
Understand how the customer wishes to interact.
How will information be sought about a possible purchase ? Personal visit, phone, Internet, email ..?
How will the sale be finalised - personal, phone, Internet, credit card, cash , account entry ?
What is the customer’s preferred method of interaction ?
Perhaps a ‘personal’ profile should be developed
Be able to track and evaluate the strength of the relationship with each customer
Purchase history plus other contacts - e.g. warranty service, new or updated products
A few thoughts on ‘Detail’
Know what is ‘enough’ detail.
Costs are associated with keeping details. Summary information may however result in incorrect projections and decisions.
Historical data (such as for air conditioners as opposed to toothpaste)
Analysis of cyclical data for air conditioners should assist in reducing risks of commitments and distribution of such items
The revenue possibilities should outweigh the expense of maintaining air conditioner data
The question is - how to decide what data to store
A few thoughts on ‘Detail’
Incorporate the knowledge gained of customer needs into ‘business intelligence’
This knowledge should be used to
– analyse past performance
– get insight into current trends
– blend this information into the business plan
– develop systems which accurately reflect the customers’ needs
– develop systems which interact and support the profit model
UnStructured Data Management
The following overheads address this situation and suggest some ways in which unstructured text can in fact be managed, and thus become part of the decision making processes
Stage 1:
1. Not surprisingly, the first stage is the creation of ‘intelligence’ - a form of data dictionary.
A more formal term is ‘meta data’ or data about data
This concept is not new as structured databases use metadata - and so do the newer Microsoft Operating Systems
The big trick is to capture representative meta data
UnStructured Data Management
Metadata about content serves a number of functions
especially the essentials of text such as
– main topics
– author or authors
– language
– publication
– revision dates
Notice the similarity to ‘library’ systems, including Voyager ?
UnStructured Data Management
This metadata improves the precision and quality of full text and keyword search - it allows users to specify additional document attributes
It also supports and extends
– classifying and routing content
– deleting expired texts
– determining any additional processing (e.g. translation based on knowledge of the source and user)
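A small Python sketch may make the role of these attributes clearer. It is a minimal illustration only: the `DocMeta` record, its fields and the sample library are assumptions made for the example, and `needs_translation` is a hypothetical routing check.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DocMeta:
    """Hypothetical metadata record kept for each document."""
    doc_id: str
    author: str
    language: str
    revised: date
    text: str

LIBRARY = [
    DocMeta("a", "Lee", "en", date(2005, 5, 10), "Federal budget forecast and tax changes"),
    DocMeta("b", "Lee", "fr", date(2005, 5, 11), "Budget federal: resume"),
    DocMeta("c", "Ng", "en", date(2004, 1, 3), "Retail budget planning for stores"),
]

def search(docs, keyword, **attributes):
    """Keyword match on the text, narrowed by any metadata attributes supplied."""
    hits = []
    for d in docs:
        if keyword.lower() not in d.text.lower():
            continue
        if all(getattr(d, name) == value for name, value in attributes.items()):
            hits.append(d.doc_id)
    return hits

def needs_translation(doc, user_language):
    """Routing decision based on knowledge of the source and of the user."""
    return doc.language != user_language

print(search(LIBRARY, "budget"))
print(search(LIBRARY, "budget", language="en", author="Lee"))
```

The keyword alone returns every document mentioning it; adding the language and author attributes narrows the result without changing the keyword.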
UnStructured Data Management
There are two aspects of metadata
– extraction from text
– storage
Extraction deals with detection of ‘keys’ such as
– author’s name
– the main thrust of the text (or thrusts)
Storage addresses the support of a number of access and retrieval methods
UnStructured Data Management
Extraction of metadata is rarely manual - there are currently some interesting software tools such as MetaMaker, Insight Software’s Categoriser
They are probably not as effective from the aspect of quality as a person - but they do handle large masses of text quickly
(what are the risks involved, and what is the minimum level acceptable ?)
UnStructured Data Management
As you might expect, storage is much more ‘automated capable’.
1. It can be managed separately from the documents in a relational database - as are the data dictionary tables
This is a fairly normal approach in document warehousing where close integration with other database applications such as data warehousing, is a requirement
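Storing metadata apart from the documents can be sketched with Python's built-in `sqlite3` module. The table name `doc_metadata`, its columns and the sample rows are assumptions made for this example; a real document warehouse would have its own schema.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE doc_metadata (
        doc_id     INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        author     TEXT,
        language   TEXT,
        revised_on TEXT,          -- ISO date string, so text comparison sorts correctly
        location   TEXT NOT NULL  -- where the document itself is kept
    )
""")
conn.executemany(
    "INSERT INTO doc_metadata VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Competitor pricing review", "Lee", "en", "2005-03-14", "/docs/pricing.doc"),
        (2, "Supplier contract", "Ng", "en", "2004-11-02", "/docs/contract.pdf"),
    ],
)

# Ordinary SQL selects documents by their metadata; only the location is returned.
rows = conn.execute(
    "SELECT doc_id, location FROM doc_metadata WHERE author = ? AND revised_on >= ?",
    ("Lee", "2005-01-01"),
).fetchall()
print(rows)
```

Because the metadata sits in ordinary tables, it can be joined with other database applications in the same way as any other relational data.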
UnStructured Data Management
2. Metadata can be stored within the document - the XML-based standard Resource Description Framework (RDF) is a good form. It is not tied to a particular metadata standard
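As a hedged illustration, the Python sketch below writes and reads back a small RDF/XML description using the standard library. Since RDF is not tied to one metadata standard, the Dublin Core element names used here (`title`, `creator`, `language`, `date`) are simply one possible choice, and the document URL and values are invented.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core, chosen for this example
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

def describe(about, **properties):
    """Build an rdf:RDF element holding one description of the resource `about`."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": about})
    for name, value in properties.items():
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return root

# Hypothetical document and values.
record = describe("http://example.org/reports/42",
                  title="Quarterly sales review",
                  creator="A. Lee",
                  language="en",
                  date="2005-05-01")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Reading the metadata back out of the serialised form.
parsed = ET.fromstring(xml_text)
title = parsed.find(f"{{{RDF_NS}}}Description/{{{DC_NS}}}title").text
print(title)
```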
Metadata can also manage access control
– copyright agreement to manage the distribution (or not) of content (remember hassles with Napster ? - not exactly ‘documents’ in the traditional sense)
– access-control can exist at many logical points such as document, journal, magazine, or publisher levels
UnStructured Data Management
Metadata can also be directed at quality control and document ranking
A document in the ‘Financial Review’ for instance might be awarded a higher ranking (authority) than say an article available at a lesser known Web site
and … A final report to a CEO should have more weight than draft memos - but the memos might rank higher in a search which is based only on frequency of word occurrences.
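The effect of an authority weighting can be shown with a toy ranking in Python. The two sample texts, the `AUTHORITY` weights and the scoring rule (word count multiplied by weight) are all assumptions chosen to make the point, not a recommended formula.

```python
def term_frequency(text, term):
    """Count how often the term occurs as a whole word in the text."""
    return text.lower().split().count(term.lower())

# Hypothetical documents with a 'type' attribute taken from their metadata.
docs = [
    {"id": "draft-memo", "type": "draft memo",
     "text": "merger merger merger notes on the merger"},
    {"id": "ceo-report", "type": "final report",
     "text": "final recommendation on the merger"},
]

# Assumed authority weights per document type.
AUTHORITY = {"final report": 5.0, "draft memo": 1.0}

def rank(documents, term, use_authority):
    """Order document ids by word frequency, optionally scaled by authority."""
    def score(doc):
        tf = term_frequency(doc["text"], term)
        weight = AUTHORITY.get(doc["type"], 1.0) if use_authority else 1.0
        return tf * weight
    return [d["id"] for d in sorted(documents, key=score, reverse=True)]

print(rank(docs, "merger", use_authority=False))
print(rank(docs, "merger", use_authority=True))
```

On word frequency alone the draft memo comes first; once the metadata-based weight is applied the final report moves to the top.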
UnStructured Data Management
Metadata can be used to generate automatic summaries and clustering data - but as with derived values in SQL statements execution, it might be more appropriate not to store the summaries or clustered information, but to re-generate these on an as-required basis. (also saving storage and retrieval costs).
UnStructured Data Management
Stage 2 :
This deals with ‘user focus’ also known as ‘profiles’
This is an effective key to improving precision and recall of information retrieval
Profiles are maps of user interests (as is a bit map). They use the same representation schemes as the metadata describing the contents of documents
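Because a profile and a document's metadata share one representation, they can be compared directly. The Python sketch below assumes that both are held as term-to-weight maps and uses cosine similarity as the comparison; both of these choices, and the sample terms, are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical user profile and document metadata in the same form.
profile = {"interest": 1.0, "rates": 1.0, "stocks": 0.5}
doc_metadata = {
    "market-update": {"interest": 2.0, "rates": 2.0, "stocks": 1.0},
    "recipe": {"flour": 1.0, "sugar": 1.0},
}

scores = {name: cosine(profile, terms) for name, terms in doc_metadata.items()}
for name, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(value, 2))
```

A document whose terms match the profile scores close to 1, while a document sharing no terms with the profile scores 0 and can be filtered out.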
UnStructured Data Management
Some research work by Tsvi Kuflik and Peretz Shoval identifies the following profiles
1. User-created : They are the easiest to implement but require the user to create, and maintain, them
2. System-generated : These analyse word frequencies in relevant documents to identify patterns which indicate appropriate texts
3. System-Plus User-generated profiles : These are autogenerated profiles which are then modified by a user
4. Neural-net profiles are ‘trained’ using selected or suggested texts provided by a user. They provide ranking in relevance for other texts
UnStructured Data Management
5. Stereotype models : These are interests shared by a large number or groups of users, which provide the basis for building specialised or individualised profiles
6. Rule-based filtering : These profiles use explicit if-then rules to classify or categorise content
A word of caution - Each of the techniques has its own advantages and disadvantages (manual update by users when interests alter, or possibly adaptation to these changes - possibly the Euro is an example here).
However, on the plus side, the profile produced will provide a long-term resource for filtering, minimising ambiguity, and document gathering.
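Of the profile types listed, rule-based filtering is the simplest to sketch. The Python below is a minimal illustration; the three rules and their category names are invented, and a real profile would hold many more rules and would need maintenance as interests change.

```python
# Each rule is an explicit if-then pair: (category, test over the lower-cased text).
RULES = [
    ("finance", lambda t: "interest rate" in t or "stock price" in t),
    ("currency", lambda t: "euro" in t.split()),
    ("retail", lambda t: "inventory" in t and "markdown" in t),
]

def classify(text):
    """Return every category whose rule fires for the text."""
    lowered = text.lower()
    return [category for category, test in RULES if test(lowered)]

print(classify("The Euro rose as interest rates fell"))
print(classify("European holidays"))
```

The second call shows one of the trade-offs: the currency rule matches the whole word only, so ‘European’ is not mistaken for ‘Euro’, but any change in vocabulary needs a manual update of the rules.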
UnStructured Data Management
Stage 3: Content Access Control :
The Web offers a ‘free flowing information model’.
Portals, on the other hand, require controlled access as users are more willing and likely to share information when they know it is distributed within the bounds of well-defined security business rules. (Monash University is considering increasing security over material available via Courseware Web pages).
UnStructured Data Management
There are 3 main access control areas
– Open-access information
– Licence restricted information
– Privileged information
UnStructured Data Management
Open-access information is freely available to all portal users
News items, press releases, product catalogues, appointments … are in this category
Licence-restricted information is self explanatory. This information is defined by agreements with content providers (e.g. Dun & Bradstreet). It could be the digital library of a professional organisation. (would Monash research papers qualify here ?)
Access control generally uses user authentication and/or IP address verification (as in e-commerce perhaps ?)
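The three access control areas can be sketched as a single check in Python, using the standard `ipaddress` module for the IP address verification. The network range, the category labels and the `need_to_know` flag are assumptions for this example; the policy shown (authentication or a licensed address is enough for licence-restricted material) is only one possible reading of ‘and/or’.

```python
import ipaddress

# Assumed licensed network; 192.0.2.0/24 is a range reserved for documentation.
LICENSED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def can_access(category, authenticated, client_ip, need_to_know=False):
    """Decide access for one request to a document of the given category."""
    if category == "open":
        return True
    if category == "licensed":
        addr = ipaddress.ip_address(client_ip)
        on_licensed_network = any(addr in net for net in LICENSED_NETWORKS)
        return authenticated or on_licensed_network
    if category == "privileged":
        # Authentication alone is not enough; the user must be named for this material.
        return authenticated and need_to_know
    raise ValueError(f"unknown category: {category}")

print(can_access("open", False, "203.0.113.5"))
print(can_access("licensed", False, "192.0.2.17"))
print(can_access("privileged", True, "192.0.2.17"))
```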
UnStructured Data Management
Privileged Information is granted on a ‘need to know’ basis - but with a different slant than ‘news’.
An example here could be co-operative research (e.g. DNA) or the complicated case where for instance solicitors working on a negotiation with a client will require access to related documents - but others in the same firm or office working on the same type of negotiation should not be able to access the same documents.
UnStructured Data Management
Stage 4: Rich Search Support
Work on finding correlations between terms, and on improving user query execution, has produced a number of techniques for representing documents.
However, there is still a ‘precision and recall barrier’ which is about the 65 to 70% level because of the dependence on what are essentially statistical techniques rather than linguistic comprehension.
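For reference, precision is the fraction of retrieved documents which are relevant, and recall is the fraction of relevant documents which were retrieved. A small Python sketch with invented document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).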
UnStructured Data Management
It is now generally accepted that a better approach combines
– keyword searching
– clustering
– visualisation
Keyword searches extend the user query, generally using a thesaurus : this uses synonyms, soundex and the extension of singular terms to include plural terms
for example : ‘Stocks’ could become ‘Stocks and Shares’
England could extend to English
Bank could include Banks
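Query extension of this kind can be sketched in Python. The one-entry `THESAURUS` and the naive add-an-s plural rule are assumptions for the example, and soundex matching is left out to keep it short.

```python
# Assumed thesaurus: each term maps to related terms to add to the query.
THESAURUS = {"stocks": ["shares"]}

def plural_forms(term):
    """Naive rule: a term not ending in 's' also gets an 's' form."""
    if term.endswith("s"):
        return [term]
    return [term, term + "s"]

def expand(query_terms):
    """Extend each query term with its plural form and thesaurus entries."""
    expanded = []
    for term in query_terms:
        t = term.lower()
        for candidate in plural_forms(t) + THESAURUS.get(t, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand(["Stocks"]))
print(expand(["Bank"]))
```

Each added term widens the search, which improves recall but is also the source of the larger number of ‘hits’.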
UnStructured Data Management
A danger with this, as you have probably found in your use of the Monash Library facilities, is the high number of ‘hits’.
Clustering : Hierarchical clustering builds tree structures where
– the root of the tree contains all documents
– internal nodes contain groups of similar documents
– the size of the groups decreases as the user moves from the root node to the leaf node (these would contain single documents)
You have probably met the term ‘drilling’.
UnStructured Data Management
The Scatter/Gather algorithm produces a result set clustered into a small, fixed number of groups.
Users select the most appropriate group or groups and the referred documents are clustered into the same number of groups.
Users can drill down into the most appropriate cluster, and its elements are in turn grouped into a number of semantically related clusters.
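The scatter and gather steps can be imitated in a few lines of Python. This is a heavily simplified sketch and not the published algorithm: the seed documents are passed in by hand to keep the result predictable, similarity is plain word overlap (Jaccard), and the six sample texts are invented.

```python
def words(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Word overlap between two sets: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def scatter(docs, k, seeds):
    """Assign every document to the most similar of k seed documents."""
    groups = {seed: [] for seed in seeds[:k]}
    for doc_id, text in docs.items():
        best = max(groups, key=lambda s: jaccard(words(text), words(docs[s])))
        groups[best].append(doc_id)
    return groups

def gather(docs, groups, chosen):
    """Collect the documents of the groups the user selected."""
    return {doc_id: docs[doc_id] for seed in chosen for doc_id in groups[seed]}

docs = {
    "d1": "interest rates rise",
    "d2": "interest rates fall again",
    "d3": "stock prices rise",
    "d4": "stock prices and bond prices",
    "d5": "football results",
    "d6": "football cup results",
}

first = scatter(docs, 2, ["d1", "d5"])   # scatter into two groups
subset = gather(docs, first, ["d1"])     # user picks the finance group
second = scatter(subset, 2, ["d1", "d3"])  # re-scatter into the same number of groups
print(first)
print(second)
```

The first pass separates finance from sport; drilling into the finance group and clustering again separates interest rates from stock prices.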
UnStructured Data Management
Coarse and Fine Granularity
Coarse granularity navigation software normally displays trees, where each node (or leaf) represents a document labelled with a text. (visualisation)
A user can then access a ‘group’ of such nodes (hyperlinked documents) and drill in on items of interest. (The alternative would be to click each page and examine its contents)
UnStructured Data Management
Fine Granularity:
When a user needs to locate specific information, such as information about sales of palm held devices, and needs to distinguish key marketing terms among a number of plans, then it is necessary to focus quickly on those terms and associate their relationship with other terms
(have you read about the latest mobile phone plans ?)
UnStructured Data Management
This probably all sounds very advanced.
It also sounds as if high speed communications are necessary - and they are.
However there is a penalty, and this is the large repository of relevant content necessary - which must by its nature be current or up to date.
This is best handled by automatic and automated tools
UnStructured Data Management
Stage 5: Keeping the Content Current
Harvesters, Trawlers and File Retrieval systems gather documents for content repository inclusion
This software is driven by metadata about the sites to search and the directories or document management systems to scan for relevant information.
Metadata and Indexing details may be all that is required. The documents can be retrieved from source on an as-required basis
UnStructured Data Management
There is the aspect of ‘outdated’ data
This is capably handled by metadata about document types and sources
Predictions about the Australian Federal Budget are obsolete when the Budget is formally released - so, delete the predictions? Do they have any further use ?
UnStructured Data Management
And a final thought :-
The logging of
– when documents arrived,
– where they came from,
– their source (author ?)
– other ‘information’ such as ‘superseded by’ or ‘expiry date’
will be handled by content management processes