textual data analysis
TRANSCRIPT
Textual Data Analysislex icometric approach, explorat ive methods
and appl icat ions
Pasquale Pavone
PREDICT Insights on ICT R&D Workshop
Techno-Economic Segments (TES) Analysis:
Methodological Aspects & Policy Perspectives
CAPP – Centro di
Analisi delle Politiche Pubbliche
Automatic processing for text analysis, in a perspective of qualitative andquantitative analysis of their contents, properties and characteristics,enables us to not read a text but different representations of theinformation contained in it.
These tecniques are the natural context for the application of Text Mining tools
Text Mining is a multidisciplinary research field that combines with equalimportance, instruments from Computational Linguistic, InformationRetrieval and Statistics, with the purpose of extracting information ofinterest from a collection of documents (Corpus)
Usually a Text Mining (TM) strategy includes the following steps:
Pre-processing, which consists in the procurement of digital sources, tokenization of essential analysis units and building
the document warehouse.
Lexical processing, which consists in the attribution of linguistic and probabilistic meta-data to the indentification of
the lexical units of analysis (can include a process of ETL).
Information Extraction:
-Information retrieval-Automatic Document Categorization
Taltac2 Software - consists of a series of instruments which allow for the study of any kind
of linguistic data - collected in a Corpus - by employing the techniques of "textual statistics".
The automatic processing, based on a lexicometric approach, allows us to find certain
constants in the text, a sort of DNA of the Corpus
The Role of Meta Data in the Automatic Analysis
meta-data generated during the analysis, consists in annotation of lexical units and
categorization of textual units
Lexical Automatic Processing: object of study is the Vocabulary
Annotations are made on Vocabulary DB (types table)
elementary lexical unit is the lexeme (graphical forms) (type = atom of meaning)
Textual Automatic Processing: object of study is the corpusLike a collection of texts to categorise on Documents DB
textual unit is the document (or fragment of text)
Vocabulary DBAnnotation Meta Data- Linguistic:
- Grammatical Tagging- Semantic Tagging (by dictionaries -glossary)
- Numerical Statistic:- Peculiar (dev. from a model)- Specific (dev. from eq.categ.)- Relevant (TFIDF on Vocabulary)
Recognition of Multiwords- Analysis of repeted segments- Lexical-Textual Algoritm
Matrix Words x Categories
Textual DBCategorization – Meta Data- Regular Expression- TFIDF (textual query - dictionaries)
Selection of Documents:- Concordances Analysis- Co-Occurence Analysis
Matrix Documents x Words
In Taltac2 the unstructured textual information is structured in two main DB
Multidimensional Analysis, is an explorative method and consists in a reproduction of dimensions
(factors) through which we can simplify, synthesise and represent the Studied Phenomenon.
It is an overall analysis that intends a Corpus as a system and puts in evidence the relations exisisting in
the whole system, based on the euclidean logic.
The Correspondece Analysis is the factorial technique of reference for studying the matrix of textual data oflarge dimensions.This Factor Analysis produces the best simultaneous representation of row profiles vs. column profiles in eachfactorial plan
This particular tecnique allows us to represent the information in terms of similarities among the
elements of each group of rows and columns
By Cluster Analysis, on a factor analysis (on a matrix «Documents x Words»), we are able to group
documents based on their similarity in terms of words. The dictionaries characteristic of each
cluster represent the topic (or topics) of each one.
• Which are the criteria (among all of those we possess) to identify and select the lexical units of analysis? (word – multiword – entities)
• Which is the definition of textual units of analysis?
It Depends each time on the Corpus under analysisFrom its main characteristics and the objectives we want to reach.
• A Corpus could be a Collection of: web page texts – newspaper articles –political speeches - scientific papers (title, Abstract, Full paper) – Post or Comments in Social Networks – open ended questions…
• Common language vs specialistic-technical language
• Sectorial Homogeneity vs Dishomogeneity of the language of the variousdocuments
• Dimensional Homogeneity vs Dishomogeneity of the documents• if homogeneous, they can be either very short or very long
What Strategy to use?
Applications - Technical Specialistics Corpora
- Corpus Italian Gastronomy
- Corpus Italian Aerospace Market
- Corpus History of Philosophy
289 doc 641.797 occ.
3.285 doc 682.971 occ.
7.172 doc 54.636 occ.
Corpus Italian GastronomyThe corpus is composed of 3.285 reviews of restaurants present in 2 Italian guides
Reviews are short description of the restaurants in terms of food, characteristics, ospitality.
Vocabulary consists of 32.467 Types for a total of 682.971 occurrences
Objectives:
- Identification of terminology
- Identification of the restaurants typology
Starting points:
- No initial Hypothesis
- No previous knowledge of the gastronomic sector
We focused on identifying nominal idioms, collocations and complexe lexeme to obtain a dictionary
of non ambiuguos semantical terms
A Collocation is a sequence of two or more words characterized by a strong reciprocal link (Sinclair, 1991)
Complex Lexeme, particularly existent in a technical specialistic Corpus, represent an important part of the
terminology of the corresponding sector. Although they are not nominal idioms, they represent a technical
specialistic expression (De Mauro 1999-2003).
1) Terminology Identification
<N+A> <N+LEMMA(di)+Toponimi>pasta fresca olive di Gaeta
<A+N> <N+LEMMA(di)+N+LEMMA(di)+N>ampia sala insalata di frutti di mare
<N+N> <<N+A>+LEMMA(di)+N>pesce spada azienda agricola di famiglia
<N+LEMMA(di)+N> <N+LEMMA(di)+<N+A>>crema di patate dolci di fattura casalinga
<N+LEMMA(a)+N> <N+LEMMA(a)+N+LEMMA(di)+N>risotto allo zafferano spaghetti al nero di seppia
Definition of the syntactical structures of most common collocations
The added value of the structure <N of N> is given by the preposition “of”, which introduces the second nouns as
a property of the first (Rouget, 2000). Each structure is considered as a Regular Expression
The meta-query composed of all Regular Expressions identifies in the Corpus all thesequences. By Lexicalization of the entities with more than 5 occurrences we obtain2.892 new types in the Vocabulary, for a total of 37.979 occurrences.
carta dei vini 1262 frittura di paranza 54 […] […]selezione di formaggi 156 parmigiana di melanzane 52 filetto di triglia 10verdure di stagione 141 tagliata di manzo 51 gallinella di mare 10fiori di zucca 137 […] […] marmellata di cipolle 10frutti di mare 135 tortino di cioccolato 37 vellutata di patate 10frutti di bosco 135 tagliata di tonno 31 crema di fave 9piatti della tradizione 127 crosta di pane 31 carpaccio di baccalà 8gnocchi di patate 105 verdure dell'orto 30 torte di verdure 7filetto di maiale 93 carpaccio di tonno 24 asparagi di fiume 7
[…] […] tortino di melanzane 18 millefoglie di patate 7mozzarella di bufala 75 crostata di ricotta 18 crosta di erbe 6dolci della casa 72 tonno di coniglio 17 millefoglie di melanzane 5filetto di manzo 71 crema di ricotta 15 asparagi di mare 5petto d’anatra 65 sella di coniglio 15 caviale di melanzane 5prodotti del territorio 57 tortino di alici 13 marmellata di pomodori 5crema di patate 55 fragoline di bosco 13 spaghetti di patate 5
Obtaining a semantic disambiguation of ambiguos words:
crema (261 variants) ingles / di patate / catalana / di limone / di broccoli / di ortiche;
tortino (146 variants) di cioccolato / di alici / di castagne / di arancia / di cipolle / di papavero / dicaffè / di ananas / di cicoria;
pasta (141 variants) fresca / ripiena / all’uovo / di salame / di pane / fritta / di mandorle / kataifi /matta / molle / rustica / al fumo / brisée / allo scoglio;
mousse (104 variants) di cioccolato / di baccalà / di fichi / al limone / di trota / di gorgonzola / difegato / di cavolo / di albicocche / di torrone / di fegato / al tabacco.
ragù (191 variants) bianco / di carne / di pesce / di lepre / napoletano / toscano / di verdure / ditonno / di lumache / di gamberi / di cavallo / di broccoli / di seppie / di suino / di triglie
carpaccio (88 variants) di tonno / di manzo / di cervo / di bufala / di orata / di gamberi / dicarciofi / di noci / di melone
Distribution of the Lexicon on the combination first 5 factors
Factor Plan f1f2 - Distribution of Lexicon
Factor Plan f1f2 - Distribution of Documents
Factor Analysis - selection of all nouns (words and multiwords)and adjectives with minimum 5 occurrences. 8.911 selectedwords, in wich 2.892 are multiwords
Matrix Documents × Selected Terms (3285×8911)
2) Restaurants Typology
2) Cluster Analysis – Dendrograms
Cluster Analysis based on 10 factorsCluster Analysis based on 2 factors
• Cluster 1 (1.220 documents) – Gourmet Kitchen – bonus, menù degustazione, chef, piccola pasticceria, appetizer, foie gras, elegante, equilibrio, professionale, tortino caldo, spuma, perfetto, astice, zenzero, creatività, tempura, crema di patate, tartare di tonno, verdurine, tonno scottato, cotto a bassa temperatura, maialino, wasabi, tè verde;
• Cluster 2 (444 documents) – Mediterranean Flavours – mare, cozze, calamari, pesce spada, frutti di mare, gamberetti, frittura di paranza, alici marinate, crostacei, insalata di polpo, guazzetto, insalata di mare, sorbetto al limone, melanzane, orziadas, seadas, broccoli arriminati, lumachine di mare, caponata di melanzane, pistacchi, foglia di limone, tortino di melenzane, tagliata di tonno;
• Cluster 3 (646 documents) – Traditions and Territory of Central Southern Italy – fagioli, salsiccia, maiale, ravioli di ricotta, orecchiette, baccalà, pecorino, coniglio alla cacciatora, funghi porcini, abbacchio, coda alla vaccinara, melanzane,gelatina di maiale, fettuccine, trippa alla romana, mozzarelle, abbacchio a scottadito, caciocavallo podolico, peperoncino, porchetta, malloreddus, peperoni secchi, cime di rapa, torta di ricotta;
• Cluster 4 (975 documents) – Traditions and Territory of Norder Italy – polenta, selvaggina, burro, speck, canederli, agnolotti, plin, brasato, strudel di mele, bagna caoda, amarone, vitello tonnato, tortelli, fonduta, sclopìt, baccalà alla vicentina, stracotto d’asino, culatello di zibello, pesce d’acqua dolce, agnello sambucano, mostarde, trota affumicata, cotechino, casoncelli, gorgonzola, bagòss, blecs, silene.
■■
2) Cl. A. 2 factors – Thematic Dictionaries
Cl 1 – Gourmet Kitchen – bonus, menù degustazione, chef, piccola pasticceria, foie gras, elegante, equilibrio, professionale, spuma;
Cl 2 – Japanese Food – sushi, sashimi, sakè, maki, tempura, tè verde, pesce crudo;
Cl 3 – Seafood – pesce, frittura di paranza, gamberi, insalata di mare, insalata di polpo, zuppa di pesce;
Cl 4 – Sicilian Cuisine – pesce spada, cannoli, ricotta fresca, cassata siciliana, tonno a cipollata, cuscus di pesce, polpette di sarde;
Cl 5 – Roman Cuisine – coda alla vaccinara, amatriciana, trippa alla romana, carciofi alla romana, spaghetti alla carbonara;
Cl 6 – Sardinian Cuisine – fregola, casizolu, malloreddus, maialetto, pane carasau, gattuccio, culurgiones;
Cl 7 – Earth Kitchen – agnello, cacioricotta, peperoni cruschi, salsiccia, maiale, fagioli, ceci, capretto, caciocavallo, funghi porcini;
Cl 8 – Cuisine of Trentino – canederli, maso, speck, schlutzer, gulash, gasthof, strudel di mele, cervo, krapfen, asparagi di fiume;
Cl 9 – Piedmont Cuisine – plin, bonet, tajarin, vitello tonnato, bagna caoda, Barolo, fonduta, torta di nocciole, tonno di coniglio;
Cl 10 – Tuscan Cuisine – ribollita, Vin Santo, pappa al pomodoro, cinta senese, cantuccini, cavolo nero, tordelli, bistecca alla fiorentina;
Cl 11 – Cuisine of Veneto – polenta, soppressa, frico, baccalà alla vicentina, morlacco, sclopit, trota, orzo, gulash, radicchio di Treviso;
Cl 12 – Cuisine of Emilia-Romagna – tortellini, tortelli di zucca, tagliatelle al ragù, Lambrusco, aceto balsamico, prosciutto di Parma;
2) Cl. A. 10 factors – Thematic Dictionaries
Factorial Plan f1-f2 – Distribution Documents – Convex Hulls - 4 Clusters
Factorial Plan f1-f2 – Distribution Documents – 10 factors -12 Clusters
Corpus Aerospace MarketThe corpus is composed of 289 web pages of Italian firms of the sector
Vocabulary is composed of 33.791 Types for a total of 641.797 Token
Objectives
- Market Exploration
- Positioning of the Firms on the market
Starting points:
- No initial Hypothesis
- No previous knowledge of the aerospace sector
After Lexical analysis we select 9.224 units of analysis (nouns, adjectives) in which 3.443 are multiwords
Factor Analysis + Cluster Analysis on Matrix Documents × Selected Terms (289×9.224)
Cluster 1- Information Tecnology – Engineering Servicy
and Consultance
management, services, airport, data, security, web,
monitoring, clients, information, ict, cloud, logistics,
government, images, remote sensing, solutions, maps,
mapping, supporting, international, earth observation,
satellite data, telecommunications
Cluster 2 – System Solution - Military Defence
helicopter, combat, system, solution, communications,
esm, network, radar, air defence, radio, sensors,, tactical,
airborne, sonar, weapon, warfare, aircraft, surveillance,
Cluster 3 – electronic apparatus for telecomunications
system, design, technology, applications, development,
test, antenna, boards, single board, power, slots, rugged,
embedded, shelter, analog, pointing, electronic, tester.
Cluster 4 - Design, modelling, analysis of manufacturing
process
design, analysis, test, components, engineering, phase,
process, techniques, fluid, inspection, geological, defects,
civil engineering
Cluster 5 – Production of Mechanical and metallic
Components
production, components, process, parts, manufacturing,
aeronautical, mechanical, automatic, steels, aluminium,
machinery, drilling, engine, flexibility, special processes,
composite materials.
FP f1-f2 – Distribution Documents – Convex hull – 6 Clusters
Corpus – History of PhilosophyIl corpus is composed of 9.465 Titles of papers of the Journal Philosophical Review, publishing
from 1892 to 2015 (125 years)
Dimension of title is very short. Minimum Dimension is 1 word – Maximum Dimension 33 words
(871 titles – composed of 1 or 2 words; 4.800 titles composed of 3-5 words)
Vocabulary: 7.172 types for a total of 54.636 occurrences
Objective: To Trace the history of the Discipline
Starting points: No initial Hypothesis – No previous knowledge of the Philosophical Discipline
After Lexical analysis we select 1.184 units of analysis (nouns, adjectives) in which 114are multiwords
Factor Analysis + Cluster Analysis on Matrix Fragments × Selected Terms (125×1.184)
Each Fragments consists in all the titles published over one year
FP f1f2 – Convex Hulls of the 4 temporal clusters
Cluster 4 (1892-1914)
Cluster 3 (1915-1946)
Cluster 2 (1947-1982)
Cluster 1 (1983-2015)
Do not need- initial Hypothesis- previous knowledge of the sector in analysis
Need- Homogeneity
- Sector Language- Textual Units Dimension
Purpose of the analysis:
- It is NOT to model the Data based on a previous Hypothesis
- It is To Explore the Data to represent the Information contained in it
In Conclusion Explorative Analysis: