textual data analysis

Textual Data Analysislex icometric approach, explorat ive methods

and appl icat ions

Pasquale Pavone

PREDICT Insights on ICT R&D Workshop

Techno-Economic Segments (TES) Analysis:

Methodological Aspects & Policy Perspectives

CAPP – Centro di

Analisi delle Politiche Pubbliche

Automatic processing for text analysis, in a perspective of qualitative andquantitative analysis of their contents, properties and characteristics,enables us to not read a text but different representations of theinformation contained in it.

These tecniques are the natural context for the application of Text Mining tools

Text Mining is a multidisciplinary research field that combines with equalimportance, instruments from Computational Linguistic, InformationRetrieval and Statistics, with the purpose of extracting information ofinterest from a collection of documents (Corpus)

Usually a Text Mining (TM) strategy includes the following steps:

Pre-processing, which consists in the procurement of digital sources, tokenization of essential analysis units and building

the document warehouse.

Lexical processing, which consists in the attribution of linguistic and probabilistic meta-data to the indentification of

the lexical units of analysis (can include a process of ETL).

Information Extraction:

-Information retrieval-Automatic Document Categorization

Taltac2 Software - consists of a series of instruments which allow for the study of any kind

of linguistic data - collected in a Corpus - by employing the techniques of "textual statistics".

The automatic processing, based on a lexicometric approach, allows us to find certain

constants in the text, a sort of DNA of the Corpus

The Role of Meta Data in the Automatic Analysis

meta-data generated during the analysis, consists in annotation of lexical units and

categorization of textual units

Lexical Automatic Processing: object of study is the Vocabulary

Annotations are made on Vocabulary DB (types table)

elementary lexical unit is the lexeme (graphical forms) (type = atom of meaning)

Textual Automatic Processing: object of study is the corpusLike a collection of texts to categorise on Documents DB

textual unit is the document (or fragment of text)

Vocabulary DBAnnotation Meta Data- Linguistic:

- Grammatical Tagging- Semantic Tagging (by dictionaries -glossary)

- Numerical Statistic:- Peculiar (dev. from a model)- Specific (dev. from eq.categ.)- Relevant (TFIDF on Vocabulary)

Recognition of Multiwords- Analysis of repeted segments- Lexical-Textual Algoritm

Matrix Words x Categories

Textual DBCategorization – Meta Data- Regular Expression- TFIDF (textual query - dictionaries)

Selection of Documents:- Concordances Analysis- Co-Occurence Analysis

Matrix Documents x Words

In Taltac2 the unstructured textual information is structured in two main DB

Multidimensional Analysis, is an explorative method and consists in a reproduction of dimensions

(factors) through which we can simplify, synthesise and represent the Studied Phenomenon.

It is an overall analysis that intends a Corpus as a system and puts in evidence the relations exisisting in

the whole system, based on the euclidean logic.

The Correspondece Analysis is the factorial technique of reference for studying the matrix of textual data oflarge dimensions.This Factor Analysis produces the best simultaneous representation of row profiles vs. column profiles in eachfactorial plan

This particular tecnique allows us to represent the information in terms of similarities among the

elements of each group of rows and columns

By Cluster Analysis, on a factor analysis (on a matrix «Documents x Words»), we are able to group

documents based on their similarity in terms of words. The dictionaries characteristic of each

cluster represent the topic (or topics) of each one.

• Which are the criteria (among all of those we possess) to identify and select the lexical units of analysis? (word – multiword – entities)

• Which is the definition of textual units of analysis?

It Depends each time on the Corpus under analysisFrom its main characteristics and the objectives we want to reach.

• A Corpus could be a Collection of: web page texts – newspaper articles –political speeches - scientific papers (title, Abstract, Full paper) – Post or Comments in Social Networks – open ended questions…

• Common language vs specialistic-technical language

• Sectorial Homogeneity vs Dishomogeneity of the language of the variousdocuments

• Dimensional Homogeneity vs Dishomogeneity of the documents• if homogeneous, they can be either very short or very long

What Strategy to use?

Applications - Technical Specialistics Corpora

- Corpus Italian Gastronomy

- Corpus Italian Aerospace Market

- Corpus History of Philosophy

289 doc 641.797 occ.

3.285 doc 682.971 occ.

7.172 doc 54.636 occ.

Corpus Italian GastronomyThe corpus is composed of 3.285 reviews of restaurants present in 2 Italian guides

Reviews are short description of the restaurants in terms of food, characteristics, ospitality.

Vocabulary consists of 32.467 Types for a total of 682.971 occurrences

Objectives:

- Identification of terminology

- Identification of the restaurants typology

Starting points:

- No initial Hypothesis

- No previous knowledge of the gastronomic sector

We focused on identifying nominal idioms, collocations and complexe lexeme to obtain a dictionary

of non ambiuguos semantical terms

A Collocation is a sequence of two or more words characterized by a strong reciprocal link (Sinclair, 1991)

Complex Lexeme, particularly existent in a technical specialistic Corpus, represent an important part of the

terminology of the corresponding sector. Although they are not nominal idioms, they represent a technical

specialistic expression (De Mauro 1999-2003).

1) Terminology Identification

<N+A> <N+LEMMA(di)+Toponimi>pasta fresca olive di Gaeta

<A+N> <N+LEMMA(di)+N+LEMMA(di)+N>ampia sala insalata di frutti di mare

<N+N> <<N+A>+LEMMA(di)+N>pesce spada azienda agricola di famiglia

<N+LEMMA(di)+N> <N+LEMMA(di)+<N+A>>crema di patate dolci di fattura casalinga

<N+LEMMA(a)+N> <N+LEMMA(a)+N+LEMMA(di)+N>risotto allo zafferano spaghetti al nero di seppia

Definition of the syntactical structures of most common collocations

The added value of the structure <N of N> is given by the preposition “of”, which introduces the second nouns as

a property of the first (Rouget, 2000). Each structure is considered as a Regular Expression

The meta-query composed of all Regular Expressions identifies in the Corpus all thesequences. By Lexicalization of the entities with more than 5 occurrences we obtain2.892 new types in the Vocabulary, for a total of 37.979 occurrences.

carta dei vini 1262 frittura di paranza 54 […] […]selezione di formaggi 156 parmigiana di melanzane 52 filetto di triglia 10verdure di stagione 141 tagliata di manzo 51 gallinella di mare 10fiori di zucca 137 […] […] marmellata di cipolle 10frutti di mare 135 tortino di cioccolato 37 vellutata di patate 10frutti di bosco 135 tagliata di tonno 31 crema di fave 9piatti della tradizione 127 crosta di pane 31 carpaccio di baccalà 8gnocchi di patate 105 verdure dell'orto 30 torte di verdure 7filetto di maiale 93 carpaccio di tonno 24 asparagi di fiume 7

[…] […] tortino di melanzane 18 millefoglie di patate 7mozzarella di bufala 75 crostata di ricotta 18 crosta di erbe 6dolci della casa 72 tonno di coniglio 17 millefoglie di melanzane 5filetto di manzo 71 crema di ricotta 15 asparagi di mare 5petto d’anatra 65 sella di coniglio 15 caviale di melanzane 5prodotti del territorio 57 tortino di alici 13 marmellata di pomodori 5crema di patate 55 fragoline di bosco 13 spaghetti di patate 5

Obtaining a semantic disambiguation of ambiguos words:

crema (261 variants) ingles / di patate / catalana / di limone / di broccoli / di ortiche;

tortino (146 variants) di cioccolato / di alici / di castagne / di arancia / di cipolle / di papavero / dicaffè / di ananas / di cicoria;

pasta (141 variants) fresca / ripiena / all’uovo / di salame / di pane / fritta / di mandorle / kataifi /matta / molle / rustica / al fumo / brisée / allo scoglio;

mousse (104 variants) di cioccolato / di baccalà / di fichi / al limone / di trota / di gorgonzola / difegato / di cavolo / di albicocche / di torrone / di fegato / al tabacco.

ragù (191 variants) bianco / di carne / di pesce / di lepre / napoletano / toscano / di verdure / ditonno / di lumache / di gamberi / di cavallo / di broccoli / di seppie / di suino / di triglie

carpaccio (88 variants) di tonno / di manzo / di cervo / di bufala / di orata / di gamberi / dicarciofi / di noci / di melone

Distribution of the Lexicon on the combination first 5 factors

Factor Plan f1f2 - Distribution of Lexicon

Factor Plan f1f2 - Distribution of Documents

Factor Analysis - selection of all nouns (words and multiwords)and adjectives with minimum 5 occurrences. 8.911 selectedwords, in wich 2.892 are multiwords

Matrix Documents × Selected Terms (3285×8911)

2) Restaurants Typology

2) Cluster Analysis – Dendrograms

Cluster Analysis based on 10 factorsCluster Analysis based on 2 factors

• Cluster 1 (1.220 documents) – Gourmet Kitchen – bonus, menù degustazione, chef, piccola pasticceria, appetizer, foie gras, elegante, equilibrio, professionale, tortino caldo, spuma, perfetto, astice, zenzero, creatività, tempura, crema di patate, tartare di tonno, verdurine, tonno scottato, cotto a bassa temperatura, maialino, wasabi, tè verde;

• Cluster 2 (444 documents) – Mediterranean Flavours – mare, cozze, calamari, pesce spada, frutti di mare, gamberetti, frittura di paranza, alici marinate, crostacei, insalata di polpo, guazzetto, insalata di mare, sorbetto al limone, melanzane, orziadas, seadas, broccoli arriminati, lumachine di mare, caponata di melanzane, pistacchi, foglia di limone, tortino di melenzane, tagliata di tonno;

• Cluster 3 (646 documents) – Traditions and Territory of Central Southern Italy – fagioli, salsiccia, maiale, ravioli di ricotta, orecchiette, baccalà, pecorino, coniglio alla cacciatora, funghi porcini, abbacchio, coda alla vaccinara, melanzane,gelatina di maiale, fettuccine, trippa alla romana, mozzarelle, abbacchio a scottadito, caciocavallo podolico, peperoncino, porchetta, malloreddus, peperoni secchi, cime di rapa, torta di ricotta;

• Cluster 4 (975 documents) – Traditions and Territory of Norder Italy – polenta, selvaggina, burro, speck, canederli, agnolotti, plin, brasato, strudel di mele, bagna caoda, amarone, vitello tonnato, tortelli, fonduta, sclopìt, baccalà alla vicentina, stracotto d’asino, culatello di zibello, pesce d’acqua dolce, agnello sambucano, mostarde, trota affumicata, cotechino, casoncelli, gorgonzola, bagòss, blecs, silene.

■■

2) Cl. A. 2 factors – Thematic Dictionaries

Cl 1 – Gourmet Kitchen – bonus, menù degustazione, chef, piccola pasticceria, foie gras, elegante, equilibrio, professionale, spuma;

Cl 2 – Japanese Food – sushi, sashimi, sakè, maki, tempura, tè verde, pesce crudo;

Cl 3 – Seafood – pesce, frittura di paranza, gamberi, insalata di mare, insalata di polpo, zuppa di pesce;

Cl 4 – Sicilian Cuisine – pesce spada, cannoli, ricotta fresca, cassata siciliana, tonno a cipollata, cuscus di pesce, polpette di sarde;

Cl 5 – Roman Cuisine – coda alla vaccinara, amatriciana, trippa alla romana, carciofi alla romana, spaghetti alla carbonara;

Cl 6 – Sardinian Cuisine – fregola, casizolu, malloreddus, maialetto, pane carasau, gattuccio, culurgiones;

Cl 7 – Earth Kitchen – agnello, cacioricotta, peperoni cruschi, salsiccia, maiale, fagioli, ceci, capretto, caciocavallo, funghi porcini;

Cl 8 – Cuisine of Trentino – canederli, maso, speck, schlutzer, gulash, gasthof, strudel di mele, cervo, krapfen, asparagi di fiume;

Cl 9 – Piedmont Cuisine – plin, bonet, tajarin, vitello tonnato, bagna caoda, Barolo, fonduta, torta di nocciole, tonno di coniglio;

Cl 10 – Tuscan Cuisine – ribollita, Vin Santo, pappa al pomodoro, cinta senese, cantuccini, cavolo nero, tordelli, bistecca alla fiorentina;

Cl 11 – Cuisine of Veneto – polenta, soppressa, frico, baccalà alla vicentina, morlacco, sclopit, trota, orzo, gulash, radicchio di Treviso;

Cl 12 – Cuisine of Emilia-Romagna – tortellini, tortelli di zucca, tagliatelle al ragù, Lambrusco, aceto balsamico, prosciutto di Parma;

2) Cl. A. 10 factors – Thematic Dictionaries

Factorial Plan f1-f2 – Distribution Documents – Convex Hulls - 4 Clusters

Factorial Plan f1-f2 – Distribution Documents – 10 factors -12 Clusters

Corpus Aerospace MarketThe corpus is composed of 289 web pages of Italian firms of the sector

Vocabulary is composed of 33.791 Types for a total of 641.797 Token

Objectives

- Market Exploration

- Positioning of the Firms on the market

Starting points:

- No initial Hypothesis

- No previous knowledge of the aerospace sector

After Lexical analysis we select 9.224 units of analysis (nouns, adjectives) in which 3.443 are multiwords

Factor Analysis + Cluster Analysis on Matrix Documents × Selected Terms (289×9.224)

Cluster 1- Information Tecnology – Engineering Servicy

and Consultance

management, services, airport, data, security, web,

monitoring, clients, information, ict, cloud, logistics,

government, images, remote sensing, solutions, maps,

mapping, supporting, international, earth observation,

satellite data, telecommunications

Cluster 2 – System Solution - Military Defence

helicopter, combat, system, solution, communications,

esm, network, radar, air defence, radio, sensors,, tactical,

airborne, sonar, weapon, warfare, aircraft, surveillance,

Cluster 3 – electronic apparatus for telecomunications

system, design, technology, applications, development,

test, antenna, boards, single board, power, slots, rugged,

embedded, shelter, analog, pointing, electronic, tester.

Cluster 4 - Design, modelling, analysis of manufacturing

process

design, analysis, test, components, engineering, phase,

process, techniques, fluid, inspection, geological, defects,

civil engineering

Cluster 5 – Production of Mechanical and metallic

Components

production, components, process, parts, manufacturing,

aeronautical, mechanical, automatic, steels, aluminium,

machinery, drilling, engine, flexibility, special processes,

composite materials.

FP f1-f2 – Distribution Documents – Convex hull – 6 Clusters

Corpus – History of PhilosophyIl corpus is composed of 9.465 Titles of papers of the Journal Philosophical Review, publishing

from 1892 to 2015 (125 years)

Dimension of title is very short. Minimum Dimension is 1 word – Maximum Dimension 33 words

(871 titles – composed of 1 or 2 words; 4.800 titles composed of 3-5 words)

Vocabulary: 7.172 types for a total of 54.636 occurrences

Objective: To Trace the history of the Discipline

Starting points: No initial Hypothesis – No previous knowledge of the Philosophical Discipline

After Lexical analysis we select 1.184 units of analysis (nouns, adjectives) in which 114are multiwords

Factor Analysis + Cluster Analysis on Matrix Fragments × Selected Terms (125×1.184)

Each Fragments consists in all the titles published over one year

FP f1f2 – Convex Hulls of the 4 temporal clusters

Cluster 4 (1892-1914)

Cluster 3 (1915-1946)

Cluster 2 (1947-1982)

Cluster 1 (1983-2015)

Do not need- initial Hypothesis- previous knowledge of the sector in analysis

Need- Homogeneity

- Sector Language- Textual Units Dimension

Purpose of the analysis:

- It is NOT to model the Data based on a previous Hypothesis

- It is To Explore the Data to represent the Information contained in it

In Conclusion Explorative Analysis: