

EFFECTS ON CURVE CLUSTERING OF DIFFERENT TRANSFORMATIONS OF CHRONOLOGICAL TEXTUAL DATA

Matilde Trevisani¹ and Arjuna Tuzzi²

¹ DEAMS, Università di Trieste (e-mail: [email protected])
² Dipartimento di FISPPA, Università di Padova (e-mail: [email protected])

ABSTRACT: Chronological corpora are collections of texts ordered in time. In bag-of-words approaches, the data are typically the frequencies of individual words in sets of texts grouped into equally spaced time points.

In our work the temporal course of a word's occurrences is viewed as a proxy of the word's life cycle: recognizing temporal shapes and clustering words with similar life cycles are the basic objectives. However, the strong asymmetry of the frequency spectrum typical of textual data has to be taken into account when defining the specific purpose of clustering and, hence, any further processing of the data.

By adopting a functional data approach and distance-based curve clustering, we examine the effect of selected data transformations on the generation of word groups.

KEYWORDS: chronological corpora, data transformation, curve clustering, spline smoothing, textual data.

1 Introduction

In the framework of chronological textual data, corpora are sets of texts ordered in time. In bag-of-words approaches, the data are typically the frequencies of each word in sets of texts grouped into equally spaced time points.

We interpret the course of a word's occurrences over time as a proxy of the word's diffusion, i.e. of its history. In our approach, the occurrences through time are viewed as a realization of an underlying continuous function representing the temporal development of a word. We are interested in shaping each word history and in clustering words that share a similar life cycle.

In studying these functions, data normalization is essential, especially in light of the clustering objectives. Indeed, textual data are characterized by a strong asymmetry. A corpus, no matter how large, falls in the so-called large number of rare events (LNRE) zone, a direct consequence of which is sparsity. Moreover, the size of the time-point subcorpora may vary greatly over time.


The objective of this work is to examine how different normalizations affect the generation of word groups. If we consider corpus data organized as a word × time-point contingency table, some form of column normalization should be regarded as a preliminary step that adjusts for the uneven document size across time. Yet, to capture the synchrony of word histories effectively, we also need to normalize the data by row. By applying spline smoothing and distance-based curve clustering, we examine the effect of selected data transformations on the generation of word groups.

2 Corpus data transformation

In our research plan, we envision several column normalizations coupled with row normalizations of the word × time-point table of corpus data (see Table 1).

Table 1. Normalization plan

Normalized by column
  c1   Dim1 (# titles)             (corpus logic)
  c2   Dim2 (# tokens)             (corpus logic)
  c3   maximum column frequency    ("table" logic)
  c4   column sum (under √)        ("table" logic)

Normalized by row (r), with double-normalization label (d)
  Strong asymmetry
    r1   row sum                          d1 (χ²)
    r2   z-score by row                   d2
    r3   maximum row frequency            d3
    r4   nonlinear transformation (p1)    d4
    r5   nonlinear transformation (p2)    d5
  Excess of 0's
    r6   dynamic density (d)              d6
    r7   rate growth (g)                  d7

The table gives the complete picture. Here we show one example that highlights, through a comparison, some of the main points outlined above. In particular, we compare the c2 column normalization, which computes relative frequencies by dividing the number of occurrences by Dim2 (the total number of tokens in the documents belonging to the same time point), with the d1 double normalization, obtained by dividing by both the total word frequency (r1) and the total column frequency (c4; possibly under a square root, in keeping with the rationale of correspondence analysis (CA), whence a χ²-like approach).
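The two normalizations compared here can be sketched in a few lines of Python. This is only an illustrative reading of the definitions above, not the authors' code: the function names, the toy counts, and the `tokens_per_period` values (standing in for Dim2) are hypothetical, and d1 is taken to mean division by the row total and by the square root of the table's column total. Rows and columns with zero totals are assumed not to occur.

```python
from math import sqrt

def c2_normalize(counts, tokens_per_period):
    """Column normalization (c2): occurrences divided by Dim2,
    the total number of tokens at the same time point."""
    return [[n / tokens_per_period[j] for j, n in enumerate(row)]
            for row in counts]

def d1_normalize(counts):
    """Double normalization (d1): occurrences divided by the row total (r1)
    and by the square root of the column total (c4).
    Assumes every row and every column has a non-zero total."""
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return [[n / (row_sums[i] * sqrt(col_sums[j])) for j, n in enumerate(row)]
            for i, row in enumerate(counts)]

# Hypothetical toy table: 2 words (rows) x 3 time points (columns).
counts = [[2, 6, 8],
          [2, 3, 8]]
tokens_per_period = [40, 60, 80]  # assumed Dim2 values, for illustration only

c2 = c2_normalize(counts, tokens_per_period)
d1 = d1_normalize(counts)
print(c2[0])  # relative frequencies of the first word
print(d1[0])  # doubly normalized profile of the first word
```

Under c2 a frequent word keeps a high-level curve, whereas under d1 each row is rescaled by its own total, so words of very different popularity become comparable in shape.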

3 An example in a nutshell

Example data are taken from the corpus of titles of papers published in the journals of the American Statistical Association (volumes from 1888 to 2012) and consist of the frequencies of a final selection of 900 keywords over 107 years (volumes from 1888 to 1921 were biennial) (Trevisani & Tuzzi, 2014).

To represent functional data as smooth functions, we adopted the basis function approach with B-splines as the basis system. Estimation followed the roughness penalty approach: smoothing was selected by varying the form of the roughness penalty and, over suitable ranges of values, the spline order m and the smoothing parameter λ. The generalized cross-validation criterion led to m = 5 with a roughness penalty $\mathrm{PEN}_2(x) = \int [D^2 x(s)]^2 \, ds$ and $\lambda = 10^{3}$ (df = 7.7) for c2-normalized data, and to m = 3 with a roughness penalty $\mathrm{PEN}_1(x) = \int [D x(s)]^2 \, ds$ and $\lambda = 10^{1.75}$ (df = 7.4) for d1 (χ²) normalized data.

Curves are then partitioned by the K-means algorithm on the basis of the Euclidean distance between trajectories. After pooling several clustering quality criteria (49; see the clusterCrit and kml R packages) and discarding the top-rated solutions (partitions with 2 or 3 clusters), the partitions with 5, 9, 22 and 6 groups for c2-normalized data and with 6, 4 and 19 groups for d1-normalized data were tied.

Some effects of the chosen data normalization on the clustering results stand out even in a clustering with a low number of groups. When data are only column normalized, word "popularity" plays a dominant role: clusters are primarily determined by high-level curves (high-frequency words), leaving the majority of low-frequency words in one or more fuzzy groups. In the example of five-group clustering on c2-normalized data (Figure 1), three clusters, which account for only about 10% of all words, look interesting; the rest consists of a singleton (statist, the most frequent word, which forms group E) and an undifferentiated residual group (the massive group A). Conversely, a double normalization allows for a more balanced partitioning in which the shape and the level of the curves play comparable roles. In the example of six-group clustering on d1-normalized data (Figure 2), some patterns that already appeared with column-normalized data are confirmed and better structured by larger groups: the long time span is clearly divided by clusters of words with similar life cycles (e.g., compare groups C, D and B in the two clusterings).
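As a rough illustration of distance-based curve clustering, the sketch below runs a plain Lloyd-style K-means on curves evaluated on a common time grid, using the Euclidean distance between trajectories. It is not the kml implementation referred to above: the function name `kmeans_curves`, the naive initialization from the first k curves, and the toy curves are assumptions made for this example only, and a real analysis would use several random starts.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def kmeans_curves(curves, k, n_iter=100):
    """Lloyd-style K-means on curves sampled on a common grid.
    Initialization is deliberately naive: the first k curves are the
    starting centers (an assumption of this sketch)."""
    centers = [list(c) for c in curves[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each curve goes to its nearest center.
        new_labels = [min(range(k), key=lambda g: dist(c, centers[g]))
                      for c in curves]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the pointwise mean of its members.
        for g in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == g]
            if members:  # keep the previous center if a cluster empties
                centers[g] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers

# Hypothetical smoothed curves on a 3-point grid: declining vs. rising words.
curves = [
    [0.9, 0.5, 0.1],
    [0.1, 0.5, 0.9],
    [1.0, 0.6, 0.0],
    [0.0, 0.4, 1.0],
    [0.8, 0.5, 0.2],
]
labels, centers = kmeans_curves(curves, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Because the distance is computed on curve values, curves at very different levels are far apart even when their shapes agree, which is consistent with the dominance of high-frequency words observed on column-normalized data.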

References

TREVISANI, M., & TUZZI, A. 2014. A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal. Quality and Quantity, 1–20.

[Figure 1: plot panels not reproduced. Overview panel: the five clusters, with sizes A 803 (89.2%), B 75 (8%), C 14 (1.6%), D 7 (1%), E 1 (0.1%). Detail panels: Cluster C (14 words: popul, problem, measur, studi, censu, note, correl, product, econom, price, index, report, industri, vital statist); Cluster D (7 words: model, test, data, distribut, analysi, sampl, method); Cluster B (75 words, including regress, function, gener, base, probabl, bayesian, ...). Horizontal axis: volume years from 1888/89 to 2011; vertical axis: keyword normalized frequency.]

Figure 1. Clustering on column-normalized data: all five groups and some clusters.

[Figure 2: plot panels not reproduced. Overview panel: the six clusters, with sizes A 260 (28.9%), B 197 (22%), C 177 (19.7%), D 124 (14%), E 101 (11.2%), F 41 (5%). Detail panels: Cluster F (41 words, including censu, report, vital statist, american, massachusett, year, ...); Cluster C (177 words, including statist, popul, problem, measur, studi, note, ...); Cluster D (124 words, including sampl, varianc, tabl, ratio, case, binomi, ...); Cluster B (197 words, including model, data, analysi, regress, function, base, ...); Cluster E (101 words, including bayesian, nonparametr, likelihood, semiparametr, spatial, risk, ...). Horizontal axis: volume years from 1888/89 to 2011; vertical axis: keyword normalized frequency.]

Figure 2. Clustering on doubly normalized data: all six groups and some clusters.