Knowledge discovery in large textual databases
DESCRIPTION
Text Processing & Knowledge Discovery System for Digest Preparation and Decision Making
TRANSCRIPT
Knowledge discovery in large text databases using the MST algorithm
Doctor V. Romanov, student E. Pantileeva
Plekhanov Russian Academy of Economics
Data Mining 2005
25 – 27 May 2005
Skiathos
DISCOVERING THE HIDDEN PROBLEM SITUATION STRUCTURE FROM A DOCUMENT SET
Manager (problem situation)
Document collection
Maximum Spanning Tree
Attribute Names and Values table
relevant data collection
word dictionary with the word frequencies
pairs of words from the dictionary and pair frequencies
maximum spanning tree forming and interpretation
Information system supporting decision making, with an embedded adaptation function
User (problem situation)
KNOWLEDGE BASE
Document text processing & loading
OUTPUT DATA FOR QUERIES PROCESSED
METADATA:
THESAURUS
THEMATIC CLASSIFIER
SUBJECT CLASSIFIERS
NAVIGATOR
TEXT DATABASE
Query processing
DATABASE RECORDS
STATISTICAL DATA
ONE TERM FREQUENCIES COUNTING & NORMALISATION
ATTRIBUTES NAMES & VALUES DETECTION
TWO TERMS FREQUENCIES COUNTING & COVARIANCE MATRIX FORMATION
MAXIMUM SPANNING TREE DEVELOPMENT & EXPLICATION AS NAVIGATOR
FORMAL CONTEXT TABLE FORMING & FILLING
FORMAL CONCEPTS & RULES DISCOVERING
TWO TERMS FREQUENCIES COUNTING & COVARIANCE MATRIX FORMATION
Three Steps of MST Construction
Maximum Spanning Tree
ONE TERM FREQUENCIES COUNTING & NORMALISATION
MAXIMUM SPANNING TREE DEVELOPMENT
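The two counting steps named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how frequencies are normalised, so relative frequency over the whole collection is assumed here, pair frequencies are assumed to be document-level co-occurrence counts, and the toy collection `docs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(docs):
    """Count every term over the collection and normalise by the total count.

    Assumption: 'normalisation' means relative frequency; the slides do not
    specify the scheme.
    """
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def pair_frequencies(docs):
    """For each unordered pair of terms, count the documents containing both."""
    pairs = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy collection: each document is a list of lemmas.
docs = [
    ["reconstruction", "blast", "furnace", "furnace"],
    ["reconstruction", "steel"],
    ["steel", "furnace"],
]
tf = term_frequencies(docs)
pf = pair_frequencies(docs)
for term, freq in sorted(tf.items()):
    print(term, freq)
```

Pairs are stored with their terms in alphabetical order, so each unordered pair has exactly one key.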
The Construction of Maximum Spanning Tree
The Term Connectedness Graphs
The Pairs of Terms Frequencies Matrix
Maximum Spanning Tree
Maximum Spanning Tree Matrix
The Construction of Maximum Spanning Tree
A dynamic picture is formed as the maximum spanning tree of the graph that represents the covariance matrix for pairs of words/lemmas or concepts. The MST serves as a dynamically changing thesaurus or semantic net for query navigation.
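A small sketch of this construction, under stated assumptions: term occurrence is encoded as a binary document-term matrix, the covariance is the population covariance of the occurrence columns, and the tree is built with Kruskal's algorithm taking the heaviest edges first. The slides name none of these choices, and the matrix `occ` is hypothetical.

```python
def covariance_matrix(occ):
    """occ[k][i] is 1 if term i occurs in document k, else 0.

    Assumption: population covariance (division by the number of documents).
    """
    n_docs, n_terms = len(occ), len(occ[0])
    mean = [sum(row[i] for row in occ) / n_docs for i in range(n_terms)]
    return [
        [sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in occ) / n_docs
         for j in range(n_terms)]
        for i in range(n_terms)
    ]


def maximum_spanning_tree(weights):
    """Kruskal's algorithm on a complete graph, heaviest edges first."""
    n = len(weights)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


# Hypothetical occurrence data: rows are documents, columns are terms.
terms = ["reconstruction", "furnace", "steel"]
occ = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
cov = covariance_matrix(occ)
tree = maximum_spanning_tree(cov)
for i, j, w in tree:
    print(terms[i], "-", terms[j], w)
```

Because the tree is recomputed from the current covariance matrix, it changes whenever the document collection changes, which matches the "dynamically changing thesaurus" role described on the slide.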
Above each word there is a sign of one of two kinds: leaf or branch. A word marked with the leaf sign has no connections further down the tree. Words marked with the branch sign permit further navigation along the route.
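The leaf and branch signs can be derived mechanically once the tree is rooted at the word where navigation starts. The sketch below assumes the tree is given as undirected edges between words. The function name `mark_nodes` and the edge list are hypothetical: the words are taken from the example slide, but the connections between them are invented for illustration.

```python
from collections import defaultdict


def mark_nodes(edges, root):
    """Root an undirected tree at `root` and label each word 'branch' or 'leaf'."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    children, marks = {}, {}
    stack, seen = [root], {root}
    while stack:
        node = stack.pop()
        kids = sorted(neighbours[node] - seen)  # neighbours not yet visited
        seen.update(kids)
        children[node] = kids
        marks[node] = "branch" if kids else "leaf"
        stack.extend(kids)
    return children, marks


# Hypothetical edges; the real tree structure is not given in the text.
edges = [
    ("reconstruction", "blast furnaces"),
    ("reconstruction", "cold rolling shop"),
    ("blast furnaces", "Lipetsk city"),
]
children, marks = mark_nodes(edges, "reconstruction")
for word, mark in marks.items():
    print(word, mark, children[word])
```

A word marked "branch" offers its children as the next navigation step; a word marked "leaf" ends the route.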
The user begins a retrieval session by browsing the dictionary with word frequencies and choosing a word to include in the query.
An example of an MST for the thematic class “Production reconstruction”
metallurgical plants
quarters (of year)
Ferrous metallurgy department
blast furnaces
I
Lipetsk city
works
distribution
Main planning department
putting into operation
Chemical machinery construction department
production
Devices industry department
Building (process)
Electrical engineering department
devices
basic funds of industry
Main supplying department
equipment (subject)
Magnitogorsk city
cold rolling shop
group of metallurgy enterprises
ensuring
objects
properties apportionment
steel
Mounting & building department
reconstruction
fulfilment
elaboration
Heavy engineering works department
Lipetsk steel
half-year
utilization
scrap metal
Main management board department
enterprises
THE SEMANTICS OF THE PROBLEM SITUATION-1
DATABASE DOMAINS: RG – REGIONS, SA – STATE AGENCIES, EN – ENTERPRISES, KM – KIND OF MAKE, SP – SUPPLIER, OB – OBJECT, DT – DATE, KR – KIND OF RESOURCE, EX – EXECUTOR, PR – PURPOSE, RN – RECIPIENT…
THE SEMANTICS OF THE PROBLEM SITUATION-2
Relations:
"EQUIPMENT_SUPPLY" ES("KIND_OF_MAKE"/KM, "SUPPLIER"/SP, "OBJECT"/OB, "ENTERPRISE"/EN, "DATE"/DT),
"RESOURCE_APPORTION" RA("KIND_OF_RESOURCE"/KR, "EXECUTOR"/EX, "PURPOSE"/PR, "RECIPIENT"/RN, "DATE"/DT),
"RECONSTRUCTION" RC("OBJECT"/OB, "ENTERPRISE"/EN, "REGION"/RG, "STATE_AGENCY"/SA, "DATE"/DT, "PURPOSE"/PR),
THE SEMANTICS OF THE PROBLEM SITUATION-3
QUESTIONS: WHO IS SUPPLYING EQUIPMENT FOR (OBJECT, REGION, KIND_OF_EQUIPMENT)? etc.
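A question of this kind can be read as a join over the ES and RC relations. The sketch below is one possible reading, not the system's actual query mechanism: it assumes ES and RC are linked through their shared OBJECT and ENTERPRISE attributes, and it restricts the question to object and region because a KIND_OF_EQUIPMENT attribute does not appear in ES. All record values, the class names, and the function `suppliers_for` are hypothetical.

```python
from typing import NamedTuple


class EquipmentSupply(NamedTuple):  # ES(KM, SP, OB, EN, DT)
    kind_of_make: str
    supplier: str
    obj: str
    enterprise: str
    date: str


class Reconstruction(NamedTuple):  # RC(OB, EN, RG, SA, DT, PR)
    obj: str
    enterprise: str
    region: str
    state_agency: str
    date: str
    purpose: str


def suppliers_for(es, rc, object_name, region):
    """Who supplies equipment for a given object in a given region?

    Assumption: ES and RC records refer to the same site when both their
    OBJECT and ENTERPRISE values coincide.
    """
    sites = {(r.obj, r.enterprise) for r in rc
             if r.obj == object_name and r.region == region}
    return sorted({e.supplier for e in es if (e.obj, e.enterprise) in sites})


# Hypothetical records, for illustration only.
es = [
    EquipmentSupply("make A", "Supplier A", "cold rolling shop", "Enterprise M", "date 1"),
    EquipmentSupply("make B", "Supplier B", "blast furnace", "Enterprise L", "date 2"),
]
rc = [
    Reconstruction("blast furnace", "Enterprise L", "Lipetsk", "Agency 1", "date 2", "purpose 1"),
    Reconstruction("cold rolling shop", "Enterprise M", "Magnitogorsk", "Agency 1", "date 1", "purpose 2"),
]
print(suppliers_for(es, rc, "blast furnace", "Lipetsk"))
```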
RULES: "STATEMENT_#111_EXECUTION"
SE("DATE_1"/DT, "DATE_2"/DT, "PRODUCTION_VOLUME_1"/VL, "PRODUCTION_VOLUME_2"/VL, "OBJECT"/OB, "ENTERPRISE"/EN) :-
RA("KIND_OF_RESOURCE"/KR, "EXECUTOR"/EX, "PURPOSE"/PR, "RECIPIENT"/RN, "DATE"/DT),
MN("KIND_OF_EQUIPMENT"/KE, "OBJECT"/OB, "ENTERPRISE"/EN, "DATE"/DT, "PURPOSE"/PR),...
CONTEXT FORMING FOR THE PROBLEM SITUATION
Documents
table
Context
1. Recognizing attribute names and values in documents.
2. Filling the “documents-attributes” table.
Let the result of lexical and categorical analysis be a set of terms d_i, extracted via the mapping value – domain – term.
For the texts q_k ∈ Q we can compose a matrix M, called the context, whose elements m_ki indicate whether term d_i occurs in document q_k.
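The context matrix defined above can be sketched in a few lines. The sketch assumes each document has already been reduced to the set of terms extracted from it. The function name `build_context` and the two sample documents are hypothetical; the term names are those used in the deck's "Reconstruction" example.

```python
def build_context(documents, terms):
    """Return M with M[k][i] = 1 if terms[i] occurs in documents[k], else 0."""
    return [[1 if term in doc else 0 for term in terms] for doc in documents]


terms = [
    "main management board",
    "building",
    "cold rolling shop",
    "reconstruction",
    "blast furnace",
]
# Hypothetical documents, each given as the set of terms extracted from it.
documents = [
    {"main management board", "reconstruction", "blast furnace"},
    {"main management board", "building", "cold rolling shop"},
]
M = build_context(documents, terms)
for row in M:
    print(row)
```

Rows correspond to documents q_k and columns to terms d_i, in the same index order as m_ki.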
STAGES OF DB SCHEMA EXTRACTION FROM A SET OF TEXTS
Context
Formal Concept Analysis
Concepts
The problem situation database schema
The problem situation description in the database
Document
Tokenization, morphological analysis
Semantic analysis
Syntactic analysis
User interface forms for data entry
Database records
Database loading
Concepts of the situation “Reconstruction”
f1:=main management board, f2:=building, f3:=cold rolling shop, f4:=reconstruction, f5:=blast furnace
Context: objects O1, O2, O3, O4, O5 (rows) against attributes f1, f2, f3, f4, f5 (columns)
C1:=(main management board)
C2:=(main management board, reconstruction, blast furnace)
C3:=(main management board, cold rolling shop)
C4:=(main management board, building, cold rolling shop)
C5:=(main management board, cold rolling shop, reconstruction, blast furnace)
C6:=(main management board, building, reconstruction, blast furnace)
C7:=(main management board, building), marked with × for objects O1, O2, O4
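Formal concepts of this kind can be enumerated directly from a small context. The sketch below uses naive closure over all object subsets, which is adequate for five objects but is not an efficient algorithm and is not claimed to be the one the authors used. The cross table `context` is hypothetical: the original table cannot be read from the slides, so the rows were chosen to be consistent with the intents C1 to C7 and with the marks for C7.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs of a small context.

    `context` maps each object to its set of attributes. Every subset of
    objects is closed: its common attributes give the intent, and all objects
    carrying that intent give the extent.
    """
    objects = sorted(context)
    all_attrs = set().union(*context.values())

    def common_attrs(objs):
        result = set(all_attrs)
        for o in objs:
            result &= context[o]
        return result

    def having(attrs):
        return {o for o in objects if attrs <= context[o]}

    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            intent = common_attrs(subset)
            concepts.add((frozenset(having(intent)), frozenset(intent)))
    return concepts


# Hypothetical context (assumption, see the lead-in).
context = {
    "O1": {"f1", "f2", "f3"},
    "O2": {"f1", "f2", "f4", "f5"},
    "O3": {"f1", "f3", "f4", "f5"},
    "O4": {"f1", "f2"},
    "O5": {"f1", "f3"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1]))):
    print(sorted(intent), sorted(extent))
```

For this context the enumeration yields the seven intents listed above plus the bottom concept, whose intent is all five attributes and whose extent is empty.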
The Hasse diagram of the concept lattice
C1:=(main management board)
C2:=(main management board, reconstruction, blast furnace)
C3:=(main management board, cold rolling shop)
C4:=(main management board, building, cold rolling shop)
C5:=(main management board, cold rolling shop, reconstruction, blast furnace)
C6:=(main management board, building, reconstruction, blast furnace)
C7:=(main management board, building)
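The edges of the Hasse diagram follow from the intents alone: two concepts are joined when one intent strictly contains the other and no third intent lies between them. The sketch below computes these covering pairs for the seven concepts, using the attribute codes f1 to f5 from the deck's "Reconstruction" example; the function name `hasse_edges` is hypothetical.

```python
# Intents of C1..C7, with f1 = main management board, f2 = building,
# f3 = cold rolling shop, f4 = reconstruction, f5 = blast furnace.
intents = {
    "C1": {"f1"},
    "C2": {"f1", "f4", "f5"},
    "C3": {"f1", "f3"},
    "C4": {"f1", "f2", "f3"},
    "C5": {"f1", "f3", "f4", "f5"},
    "C6": {"f1", "f2", "f4", "f5"},
    "C7": {"f1", "f2"},
}


def hasse_edges(intents):
    """Return pairs (a, b) where intent a is strictly inside intent b
    and no other intent lies strictly between them."""
    edges = []
    for a, ia in intents.items():
        for b, ib in intents.items():
            if ia < ib and not any(ia < ic < ib for ic in intents.values()):
                edges.append((a, b))
    return sorted(edges)


for general, specific in hasse_edges(intents):
    print(general, "->", specific)
```

Each pair runs from the more general concept (smaller intent) to the more specific one, so C1 is the most general node and C4, C5, C6 are the most specific of the seven.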
User (problem situation)
Document collection
Digest of situation
Official structure | Object name | Actions | Region | Action's effect | Time
Main Management board | Blast furnace | reconstruction | Lipetsk city | Scrap metal utilization improvement | IV quarter of xxx-year
Main Management board | Cold rolling shop | building | Magnitogorsk city | Steel production increment | I half of xxx-year
Mounting & building department | Blast furnace | Putting into operation | Lipetsk city | Equipment mounted | II quarter of xxx-year