Graphical and smoothing techniques for sequences

Raffaella Piccarreta

Dept. of Decision Sciences and Dondena Centre for Social Dynamics

Bocconi University, Milan - Italy

Sequence analysis

Aim: describe/explore life courses using suitable graphical tools.

Graphical tools are essential for getting an initial impression of the most relevant tendencies in the data, without disregarding possible rare trajectories.

They can support the analyst in choosing the set of tracked activities and the calendar.

They can help to identify groups of cases with typical patterns and to explore whether these patterns are related to one or more classification variables.

Graphical tools

•Our proposal depends upon a dissimilarity criterion (thus, we obtain informative graphs)

•In selecting the dissimilarity criterion, different choices can be made by the analyst

•We consider the availability of different dissimilarity criteria as an opportunity to ‘choose how to represent data’ (in a sense which will be clarified later), depending on the aim of the analysis and on the features of the data

In our approach, dissimilarities between sequences must be evaluated in order to identify sequences which are ‘similar’, i.e. which share similar structural/salient features.

Graphical tools: Sequence plots

Sequence/index plots (Scherer 2001). Cases on the horizontal axis, time on the vertical one. Each case is associated with a set of stacked bars, with colours and lengths depending upon the states and their durations.

To avoid the confusion caused by a random order of the cases on the horizontal axis, a suitable arrangement of sequences is needed. One possibility is to order sequences according to the duration of particular activities or to the age at entry into a given activity (univariate criterion). We propose instead to order sequences according to the first Multidimensional Scaling (MDS) factor (multivariate / data driven).

Multidimensional Scaling

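\[
\Delta =
\begin{pmatrix}
0 & \delta_{12} & \cdots & \delta_{1n} \\
\delta_{21} & 0 & \cdots & \delta_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{n1} & \delta_{n2} & \cdots & 0
\end{pmatrix}
\]

[Figure: cases 1 and 2 represented as points in the plane of the MDS factors F1 and F2, with their distance reproducing the dissimilarity $\delta_{12}$]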

Conditional on a dissimilarity criterion, factors are estimated which can be considered as responsible for the observed dissimilarities.

MDS represents dissimilarities as distances between points of a low-dimensional space.


The MDS solution is usually put in the so-called principal axis orientation, i.e. the MDS factors have a decreasing order of importance. Conditional on the extraction method, the first factor is the one that best explains the dissimilarities, the second factor is the next most important, and so on.

The first MDS factor is the one that best explains the chosen dissimilarities: it is usually ‘based’ upon a combination of durations/ages at entry and it provides the best dissimilarity-based ordering criterion (multivariate / data driven).
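As a rough illustration of the ordering step, the sketch below computes scores on the first factor of a classical (metric) MDS from a dissimilarity matrix, using double centring and power iteration. This is only one of the MDS variants mentioned in these slides, not the Bayesian solution used in the analysis; the function name `first_mds_factor` and the use of power iteration are assumptions made for the example, and the method presumes that the leading eigenvalue of the centred matrix is positive.

```python
import math

def first_mds_factor(d, iters=500):
    """Scores on the first classical-MDS factor (hypothetical helper).

    d is a symmetric dissimilarity matrix given as a list of lists.
    """
    n = len(d)
    # Double-centre the squared dissimilarities: B = -1/2 * J D^2 J
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    # Power iteration for the leading eigenvector of B
    v = [float(i + 1) for i in range(n)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        lam = norm  # converges to the leading eigenvalue (in absolute value)
    # Factor scores = eigenvector scaled by the square root of the eigenvalue
    return [x * math.sqrt(lam) for x in v]

def mds_order(d):
    """Case indices sorted by their score on the first factor."""
    scores = first_mds_factor(d)
    return sorted(range(len(d)), key=lambda i: scores[i])
```

The sign of the factor is arbitrary, so the ordering returned by `mds_order` may appear reversed; this does not affect the reading of the plot.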

The data

• PSIN Data - Panel Study of Social Integration in the Netherlands

• 6 waves (1987, 1989, 1991, 1995, 1999/2000 and 2005/2006)

• Reference: Liefbroer AC and Kalmijn M. (1997) PSIN Codebook

• Focus: females’ work and school / union and family formation careers. We consider only women observed over the age span 15-34 (complete data, N = 326 / years of birth ’61 and ’65).

The data

For each woman we build, on a monthly time scale (229 months), one sequence-type representation of each career. The following activities are tracked over the considered period:

School/Work:

S1/S2/S3 Lower/Medium/Higher secondary education

WP Part-time work

WF Full-time work

U None of the previous states (unemployment)

Family:

N0/N1/N2/N3 No cohabitation / Children

U0/U1/U2/U3 Cohabitation with partner / Children

M0/M1/M2/M3 Marriage / Children

The choice of the dissimilarity criterion

A number of dissimilarity criteria are available.

There are no results proving that one criterion is better than the others. Each one has appealing characteristics, from a theoretical point of view.

We are not interested in comparing the alternative measures or in determining which is the ‘best’. We are only concerned with selecting a criterion that ensures a reasonable ordering, where what counts as ‘reasonable’ depends on the aim of the analysis.

Many criteria have been introduced in the literature to quantify the dissimilarity between sequences:

OMA: quantifies the effort needed to transform one sequence into another using three basic operations: insertion, deletion, substitution. A cost has to be assigned to each operation, and how to do so is debated.

Substitution costs: usually inversely related to the transition frequencies (otherwise set on the basis of a-priori knowledge).

Indel costs: different proposals exist. A common choice is at least half the max substitution cost, so that a substitution is preferred to two indels. More recently, lower values have been suggested (0.1 times the max substitution cost).
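To make the role of the two kinds of cost concrete, here is a minimal sketch of an optimal matching distance computed by dynamic programming. The function name `om_distance` and the representation of substitution costs as a dictionary keyed by pairs of states are assumptions made for the illustration, not part of the original slides.

```python
def om_distance(a, b, sub_cost, indel):
    """Optimal matching distance between two state sequences (sketch).

    sub_cost maps (state_x, state_y) to a substitution cost (assumed to
    contain both orders of each pair); indel is the cost of one
    insertion or deletion.
    """
    n, m = len(a), len(b)
    # dist[i][j] = minimal cost of transforming a[:i] into b[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel
    for j in range(1, m + 1):
        dist[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            s = 0.0 if same else sub_cost[(a[i - 1], b[j - 1])]
            dist[i][j] = min(dist[i - 1][j] + indel,      # deletion
                             dist[i][j - 1] + indel,      # insertion
                             dist[i - 1][j - 1] + s)      # substitution / match
    return dist[n][m]
```

With a substitution cost of 2, the sequences S S WF WF and S WF WF WF are at distance 2 when indel = 1 (half the max substitution cost), but at distance 0.4 when indel = 0.2 (0.1 times the max): with cheap indels the algorithm prefers to shift the sequence rather than substitute.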

The choice of the dissimilarity criterion

Many proposals to quantify the dissimilarity between sequences:

Lesnard: dynamic Hamming distance. It is based only upon substitution costs, which are related to the frequencies of the transitions from one state to another between two consecutive periods (and thus vary across time).

Halpin: in OMA the cost of an operation does not depend upon the length of the modified spell; the costs of the operations should be weighted accordingly. Deleting an element from a long episode should cost less than deleting the same element from a short episode.

The choice of the dissimilarity criterion

Elzinga: the focus is on the states sequence, i.e. the collection of visited states. The evaluation of similarity is based upon the number and/or the frequency of substructures common to two states sequences:

•length of the longest common prefix (the first pattern of states, including the first visited state)

•length of the longest common sub-sequence (collection of states appearing in each sequence and in the same order)

•number of common sub-sequences

•number of matched sub-sequences (counting how often each sub-sequence embedded in a states sequence can be matched with the same sub-sequence embedded in the other)

These measures can all be extended to account for durations. One possibility is to refer to the minimal shared time, i.e. the units of time spent in the common sub-sequences (or prefixes).
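The first two of the quantities listed above can be sketched as follows. The helper names (`distinct_states`, `lcp_length`, `lcs_length`) are hypothetical, and the code only computes the similarity counts on the sequences of visited states; it does not reproduce the duration-weighted variants or the conversion into a dissimilarity.

```python
def distinct_states(seq):
    """Collapse repeated states: S S WP WF WF -> S WP WF."""
    out = []
    for s in seq:
        if not out or out[-1] != s:
            out.append(s)
    return out

def lcp_length(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lcs_length(a, b):
    """Length of the longest common sub-sequence (same order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, the monthly sequences S S WP WF WF and S WF WF U reduce to the states sequences S WP WF and S WF U, which share a prefix of length 1 (S) and a longest common sub-sequence of length 2 (S WF).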

Dissimilarities

We obtained different dissimilarity matrices, using:

•OMA: substitution costs based on transition frequencies and different indel costs. Results were similar: only OMA05, with indel = 0.5 × (max substitution cost), will be presented

•Lesnard’s dynamic Hamming distance. Results similar to OMA, not shown.

•Halpin’s OMAH

•Elzinga’s criteria:

Length of the longest common prefix, LCP

Number of matching sub-sequences, NMS
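The slides do not state the exact formula linking substitution costs to transition frequencies. The sketch below assumes one widely used convention, cost(a, b) = 2 − p(b | a) − p(a | b), where p(b | a) is the estimated rate of transitions from a to b; both this formula and the function names are assumptions for illustration. The indel cost is then set to half the maximum substitution cost, as in OMA05.

```python
from collections import Counter

def transition_rates(sequences):
    """Estimate p(b | a): share of transitions leaving a that end in b."""
    pair, out = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair[(a, b)] += 1
            out[a] += 1
    return {ab: c / out[ab[0]] for ab, c in pair.items()}

def substitution_costs(sequences):
    """Assumed convention: cost(a, b) = 2 - p(b | a) - p(a | b)."""
    states = sorted({s for seq in sequences for s in seq})
    p = transition_rates(sequences)
    return {(a, b): 2.0 - p.get((a, b), 0.0) - p.get((b, a), 0.0)
            for a in states for b in states if a != b}

def oma05_indel(costs):
    """Indel cost equal to half the maximum substitution cost."""
    return 0.5 * max(costs.values())
```

Frequent transitions thus yield cheap substitutions, while pairs of states that never follow each other receive the maximum cost of 2.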


MDS was applied to each dissimilarity matrix:

•Classic/Metric MDS

•Non-Metric MDS

•Bayesian MDS

Based on standard criteria (Stress, i.e. the normalized sum of squared differences between the observed and the reproduced dissimilarities), the Bayesian MDS solution was retained.

Work trajectories – MDS sequence plots

OMA05 (A) and OMAH (B) provide similar (if not identical) ordering: full time workers are opposed to the unemployed.

Blocks of trajectories dominated by part-time work are scattered along the horizontal axis.

Work trajectories – MDS sequence plots

NMS (C) is more focused on school and is a bit more confused than OMA05 and OMAH. LCP (D) does not properly describe work careers after school.

These results reflect the analytical properties of the criteria combined with the features of these careers. They do not hold in general: for other sets of sequences, focusing on the initial states or on the combination of experienced states can provide a suitable ordering.

Work trajectories – MDS sequence plots

For all the criteria, the appearance of the plots is influenced by the short non-employment or part-time work spells characterizing some trajectories.

The presence of noisy sequences can make the visualisation complicated especially when the sample size increases and over-plotting becomes a more serious problem.

Graphical tools: Sequence plots

[Figure panels: Sequence plots; MDS sequence plots]

Even when sequences are reasonably ordered, over-plotting can be a problem because of the limited visual field available. As the sample size increases, the bars may become too thin, making it difficult to visualize individual trajectories.

•In some situations individual variability (or complexity), due for example to short and irrelevant spells in a state (e.g. short non-employment spells between two jobs in a work history), can mask the most salient features of the trajectories.

•The sequences deviating from the others can be hard to visualise and/or to identify (even when cluster analysis is used).

Graphical tools: Smoothed MDS sequence plots

We introduce a criterion to smooth sequences, which reduces individual noise and unveils the ‘structural’ features of life courses.

In our smoothed MDS sequence plots the smoothed sequences are plotted, ordered according to the first MDS factor.

We propose criteria to measure the quality of the smoothing for each specific sequence, and use this information to identify outlier sequences, which may be under-represented in the plots.

Graphical tools: Smoothed MDS sequence plots

•For each sequence, $s_i$, we focus on its neighbourhood, $N_i$, i.e. the set of sequences closest to $s_i$.

•The original sequence $s_i$ is replaced by a summary of the cases in $N_i$, the smoothed sequence, $\mu_i$.

•The distinction between what has to be considered as ‘structure’ and what as individual, negligible noise depends upon the chosen dissimilarity criterion, $\delta(\cdot,\cdot)$, which consequently plays a crucial role in the definition of the $N_i$’s and of the smoothed sequences $\mu_i$’s.

The smoothed sequences

For given $N_i$ and $\delta(\cdot,\cdot)$, we suggest smoothing a sequence $s_i$ using the medoid of the cases in $N_i$, that is, the sequence having the minimum (total) distance from all the others:

\[
\mu_i = \arg\min_{s \in N_i} \sum_{h:\, s_h \in N_i} \delta(s_h, s)
\]

Being the most centrally located case, the medoid is a good local representative of the cases in $N_i$, and it can be obtained even when only dissimilarities are available.
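A minimal sketch of the medoid computation, assuming the dissimilarities are stored in a square matrix `d` (list of lists) and the neighbourhood is given as a list of case indices; the function name `medoid` is introduced here only for illustration.

```python
def medoid(d, members):
    """Index of the case in `members` with the smallest total
    dissimilarity to the other members (ties: first one found)."""
    return min(members, key=lambda h: sum(d[h][l] for l in members))
```

Since only the entries of `d` are used, the same function works for any of the dissimilarity criteria discussed earlier.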

The neighbourhoods

For given $\delta(\cdot,\cdot)$, possible proposals to choose $N_i$ are:

•The set of the k nearest neighbours of $s_i$.

•For a fixed radius, r, the set of sequences that are closer than r to $s_i$.

•A combination of the two criteria: k is chosen, the maximal distance $r_i$ between $s_i$ and its k nearest neighbours is determined, and the sequences no farther than $r_i$ from $s_i$ are included in $N_i$.
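The three definitions of neighbourhood listed above can be sketched as follows, again assuming a square dissimilarity matrix `d`. The function names are hypothetical, and the choice to exclude the case itself from its own neighbourhood is an assumption of this sketch, not something stated in the slides.

```python
def knn(d, i, k):
    """The k nearest neighbours of case i (case i itself excluded)."""
    others = sorted((j for j in range(len(d)) if j != i),
                    key=lambda j: d[i][j])
    return others[:k]

def within_radius(d, i, r):
    """All cases strictly closer than r to case i."""
    return [j for j in range(len(d)) if j != i and d[i][j] < r]

def knn_radius(d, i, k):
    """Combined criterion: take the distance r_i to the k-th nearest
    neighbour, then keep every case no farther than r_i (so that cases
    tied with the k-th neighbour are also included)."""
    r_i = max(d[i][j] for j in knn(d, i, k))
    return [j for j in range(len(d)) if j != i and d[i][j] <= r_i]
```

The combined criterion differs from plain k nearest neighbours only when several cases lie at the same distance as the k-th neighbour, which is common with sequence data where identical trajectories occur.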

A relevant issue concerns the selection of k and/or r.


A leave-one-out cross-validation procedure is used to choose k or r.

To choose k, the medoid $\mu_{-i}(k)$ is obtained without considering $s_i$.

The leave-one-out cross-validation error is the sum of the dissimilarities between each original sequence and the corresponding medoid:

\[
CV(k) = \sum_i \delta\big(s_i, \mu_{-i}(k)\big)
\]

The ‘best’ value of k according to this criterion is the one minimizing CV.

A similar reasoning can be applied to select r.
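The cross-validation criterion for k can be sketched as below. The function names are hypothetical; the sketch assumes a square dissimilarity matrix `d` and takes, for each case, the medoid of its k nearest neighbours computed without the case itself.

```python
def loo_medoid(d, i, k):
    """Medoid of the k nearest neighbours of case i, excluding case i."""
    nbrs = sorted((j for j in range(len(d)) if j != i),
                  key=lambda j: d[i][j])[:k]
    return min(nbrs, key=lambda h: sum(d[h][l] for l in nbrs))

def cv_error(d, k):
    """Sum over cases of the dissimilarity between each case and its
    leave-one-out medoid."""
    return sum(d[i][loo_medoid(d, i, k)] for i in range(len(d)))

def best_k(d, candidates):
    """Candidate k with the smallest cross-validation error
    (ties: the first candidate in the list)."""
    return min(candidates, key=lambda k: cv_error(d, k))
```

The same scheme applies to the radius: replace the k nearest neighbours by the cases closer than r and minimize over a grid of candidate radii.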

The described approaches select the same k (resp. r) for all the cases. If the criteria are combined, the radius is different from case to case.


A more flexible procedure combines the nearest neighbours and the radius approaches, and allows both k and r to vary across cases:

•For a given $s_i$, the leave-one-out cross-validation procedure is first applied to select the number of nearest neighbours, $k^*_i$.

•The maximal distance between $s_i$ and its $k^*_i$ nearest neighbours, $r^*_i$, is determined.

•$N^*_i$ is selected as the set of cases no farther than $r^*_i$ from $s_i$.

Therefore, both the number of neighbours and the radius are specific to each case.
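A possible reading of the case-specific procedure is sketched below. The slides do not detail how the leave-one-out criterion is evaluated for a single case; here it is assumed to be the dissimilarity between the case and the medoid of its k nearest neighbours, minimized over a list of candidate values. The function name, the tie-breaking rule and the exclusion of the case from its own neighbourhood are likewise assumptions.

```python
def adaptive_neighbourhood(d, i, candidates):
    """Case-specific k*, r* and neighbourhood for case i (sketch)."""
    n = len(d)
    ranked = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])

    def loo_error(k):
        # Dissimilarity between case i and the medoid of its k nearest
        # neighbours, computed without case i
        nbrs = ranked[:k]
        m = min(nbrs, key=lambda h: sum(d[h][l] for l in nbrs))
        return d[i][m]

    k_star = min(candidates, key=loo_error)      # ties: first candidate
    r_star = d[i][ranked[k_star - 1]]            # distance to k*-th neighbour
    members = [j for j in range(n) if j != i and d[i][j] <= r_star]
    return k_star, r_star, members
```

Inspecting `k_star` and `r_star` across cases gives a rough picture of how dense each case's neighbourhood is.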

The quality of the smoothing

The performance of the alternative smoothing methods can be evaluated using the prediction error, i.e. the sum of the dissimilarities between the original and the smoothed sequences.

To reason in relative rather than in absolute terms, we refer to the prediction error corresponding to the general medoid, $\mu$, associated with the whole sample:

\[
\mu = \arg\min_{s} \sum_i \delta(s_i, s)
\]

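The resulting quality criterion is:

\[
R^2 = 1 - \frac{\sum_i \delta(s_i, \mu_i)}{\sum_i \delta(s_i, \mu)}
\]

where $\mu_i$ is the smoothed sequence (local medoid) of $s_i$ and $\mu$ is the general medoid of the whole sample. It measures the relative decrease in the prediction error when passing from the general medoid to the specific ones.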
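The criterion can be computed directly from the dissimilarity matrix once each case has been assigned its smoothed sequence. In the sketch below, `smoothed[i]` is assumed to hold the index of the medoid that replaces case i; the function name `r_squared` is introduced for illustration.

```python
def r_squared(d, smoothed):
    """Relative reduction in prediction error obtained by using the
    local medoids instead of the general medoid of the whole sample."""
    n = len(d)
    # General medoid: case with the smallest total dissimilarity to all cases
    general = min(range(n), key=lambda h: sum(d[h][l] for l in range(n)))
    local_err = sum(d[i][smoothed[i]] for i in range(n))
    general_err = sum(d[i][general] for i in range(n))
    return 1.0 - local_err / general_err
```

Values close to 1 indicate that the local medoids reproduce the sequences much better than a single overall representative; a value of 0 is obtained when every case is smoothed by the general medoid itself.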

The quality of the smoothing

Adopting another approach, note that the original dissimilarity between the i-th and the h-th sequence, $\delta(s_i, s_h)$, is approximated by the dissimilarity between the two smoothed sequences, $\delta(\mu_i, \mu_h)$.

The sum of the squared differences $[\delta(s_i, s_h) - \delta(\mu_i, \mu_h)]^2$ can also be used to evaluate the goodness of fit.

This is a generalisation of the stress, usually used to evaluate the quality of an MDS solution. Following a procedure which is rather common in MDS, here too we consider a measure normalized by the sum of the squared original dissimilarities:

\[
S^2 = \frac{\sum_i \sum_{h \neq i} \left[\delta(s_i, s_h) - \delta(\mu_i, \mu_h)\right]^2}{\sum_i \sum_{h \neq i} \delta(s_i, s_h)^2}
\]
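A sketch of the stress-type statistic, under the same conventions as before: `d` is the dissimilarity matrix and `smoothed[i]` is assumed to be the index of the medoid replacing case i. The sums run over all ordered pairs of distinct cases.

```python
def s_squared(d, smoothed):
    """Normalized sum of squared differences between the original
    dissimilarities and those between the smoothed sequences."""
    n = len(d)
    num = 0.0
    den = 0.0
    for i in range(n):
        for h in range(n):
            if h != i:
                num += (d[i][h] - d[smoothed[i]][smoothed[h]]) ** 2
                den += d[i][h] ** 2
    return num / den
```

Unlike the prediction-error criterion, this statistic looks at how well the whole structure of pairwise dissimilarities is preserved after smoothing; smaller values are better, and 0 means that the smoothed sequences reproduce all original dissimilarities exactly.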


In our smoothed MDS sequence plots the smoothed sequences are plotted, ordered according to their score on the first MDS factor.

The dissimilarity measure $\delta(\cdot,\cdot)$ consequently plays a double role.

•First, it determines the ordering criterion.

•Second, it is used to determine both the neighbours and the medoids in the smoothing procedure.

Smoothing PSIN data

Turning back to the data, we focus on the OMA05 and OMAH criteria (MDS factors extracted using the Bayesian approach).

In the smoothing procedure, different definitions of neighbourhoods were considered (with the cross-validation procedure always used to select the parameters):

•Nearest Neighbours (k)

•Radius (r)

•Combination of k and r, with r varying across cases

•Combination of k and r, both varying across cases

Based on the R2 and S2 criteria introduced before, the last definition was selected.

Work trajectories: Smoothed MDS sequence plots

Smoothed MDS sequence plots: A) OMA05 B) OMAH

1) For each sequence its neighbourhood is determined.

2) The original sequences are replaced by the neighbourhoods’ medoids.

3) Medoids are ordered according to their score on the MDS factor.

The ordering of cases can differ from that in the original MDS plots. Here similar medoids are plotted close to one another, improving the representation of the sequences.

Because of the double role (ordering / smoothing) played by the dissimilarity, one can also combine criteria depending on the specific aim of the visualisation.

(A) OMA05; (B) OMAH; (C) Combination of OMA05 (ordering) and OMAH (smoothing).

Visualisation is improved, individual noise is reduced, and the main patterns are more evident. Note: the definition of ‘noise’ depends on the chosen dissimilarity criterion.


We now focus on the quality of the approximation.

R2(OMAH) = 0.723 [R2(OMA05) = 0.847; R2(NMS) = 0.697]

The smoothed sequences approximate the original ones considerably better than the general medoid does. Note that a low R2 can also be observed when the general medoid already provides a satisfactory smoothing for the sequences.

Stress-based statistic: S2(OMAH) = 0.289 [S2(OMA05) = 0.117; S2(NMS) = 0.116]

Some authors suggest interpreting the MDS stress informally and indicate 0.1 as the maximum acceptable value. In MDS the dissimilarities are reproduced by factors which are free to vary, whilst here the approximation is constrained to medoids, so somewhat higher values are to be expected. Hence this approximation can again be considered satisfactory.

In the following we refer to OMAH to illustrate how outliers can be identified.

Misrepresented work trajectories: Smoothed MDS sequence plots

An interesting characteristic of our tools is that, for each sequence, the dissimilarity between the original and the smoothed sequence, $\delta(s_i, \mu_i)$, can be used to determine which sequences are badly approximated.

To do this, we obtain the percentiles of the $\delta(s_i, \mu_i)$’s and flag as critical the cases with a prediction error higher than the 80th percentile.

The critical sequences are compared with the smoothed ones to verify whether some ‘structural’ characteristics were masked for some trajectories.
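The flagging rule can be sketched as follows, given the list of per-case prediction errors (the dissimilarity between each original sequence and its smoothed version). The nearest-rank definition of the percentile and the function name `flag_critical` are assumptions of this sketch; other percentile conventions would give slightly different thresholds.

```python
def flag_critical(errors, pct=80):
    """Indices of the cases whose prediction error exceeds the
    pct-th percentile (nearest-rank definition)."""
    ranked = sorted(errors)
    # Nearest rank: smallest rank covering at least pct% of the cases
    rank = max(1, (pct * len(ranked) + 99) // 100)
    threshold = ranked[rank - 1]
    return [i for i, e in enumerate(errors) if e > threshold]
```

Calling the function with increasing values of `pct` isolates progressively more extreme sequences, which can then be plotted separately.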

Family trajectories: Smoothed MDS sequence plots

Now consider the family formation patterns. Smoothed sequence plots are presented, obtained using OMAH (A) and NMS (B). In (C) a combination of NMS (ordering) and OMAH (smoothing) is reported.

Here the use of these plots is not strictly necessary: the original MDS sequence plots already provide a satisfactory representation of the sequences, and over-plotting is not a serious issue.

Nonetheless, this example highlights the usefulness of these plots in extracting outliers.

Misrepresented family trajectories: Smoothed MDS sequence plots

It is also possible to analyze, for each case, the number of neighbours or the radius of the neighbourhood, to distinguish sequences having dense neighbourhoods of similar cases from those which are instead more isolated.

The accompanying plots report the poorly smoothed sequences: those of women who had children alone or during a cohabitation, or who experienced a relatively short period of cohabitation before marrying.

The possibility of analyzing critical careers is particularly important, beyond the usefulness of graphical tools per se.

Smoothed MDS sequence plots

•Provide a visualisation of sequences focused on the most salient features.

•Make it possible to identify poorly represented sequences, i.e. trajectories with very peculiar characteristics.

It is possible to analyze in detail the entire ‘tail’ of the ‘extreme’ or deviating sequences. Separate plots can be considered for sequences characterized by increasingly severe approximation errors. Thus, ideally, one might consider a plot for prediction errors between the 60th and the 70th percentile, between the 70th and the 80th percentile, and so on.

•Can be used to smooth and to simplify the visual representation of subgroups of cases determined, for example, using cluster analysis or classification trees.

In this case, it is also possible to inspect in detail the level of cohesion of the groups, and to monitor, for each case, the variation in the quality of the smoothing when passing from the entire dataset to the subgroups.

Using smoothed MDS sequence plots

Cluster analysis was applied to the work trajectories (OMAH dissimilarity). Ward’s algorithm was used and 6 clusters were selected.

Using smoothed MDS sequence plots

Clusters can be analyzed using the same approach described before: sequences ‘extraneous’ to the others in a cluster can be identified.

Note that when the whole dataset is considered, the neighbours and the smoothed sequence of each sequence can be determined unconditionally. Instead, when focusing only on cases placed in the same cluster, a constrained smoothed sequence is determined. The level of overlap between clusters can therefore be evaluated by looking at the trajectories that remain relatively unexplained.

Using smoothed MDS sequence plots

Note that the R2 for clusters will generally be lower than that for the whole sample (here, 0.72). Indeed, the smoothing procedure pays off when the general medoid does not effectively describe the cases in the cluster. Thus, relatively homogeneous clusters can be expected to be characterized by lower R2 values. For hierarchically nested sub-samples it is therefore possible to evaluate the internal homogeneity achieved by focusing on the decrease of the R2.

Using smoothed MDS sequence plots

The same analysis can be conducted to compare groups of cases induced by the levels of a (categorical) covariate. In the accompanying plots, the year of birth (1961 or 1965) is taken into account.

The approach can also be implemented in the context of classification trees for sequences.

Thank you

Using smoothed MDS sequence plots

We refer to the data in McVicar and Anyadike-Danes (2002, JRSS Series A), collected on N = 712 young people from Northern Ireland. Monthly activity information is available for a period of 6 years (T = 72 months) following the completion of compulsory education.

S School

FE Further education

HE Higher education

T Training

E Employment

JL Joblessness

The dissimilarity matrix was built using OMA, with substitution costs inversely related to the transition frequencies and indel cost equal to 1.

Using smoothed MDS sequence plots

The accompanying figure shows the MDS sequence plots (bottom panel) and the smoothed MDS sequence plots (upper panel).

Clusters of sequences were obtained using Ward’s algorithm. 6 clusters were selected.

Clusters

Clusters can be analyzed using the same approach described before. For example, sequences ‘extraneous’ to the others in a cluster can be identified.

Clusters

Also note that when the whole dataset is considered, the neighbours and the resulting smoothed sequence of each sequence can be determined unconditionally. Instead, when focusing only on cases placed in the same cluster, a constrained smoothed sequence is determined. The level of overlap between clusters can therefore be evaluated by looking at the trajectories that remain relatively unexplained.

Clusters

Also, note that the R2 characterizing clusters is generally lower than that characterizing the whole sample. This is reasonable, since the smoothing procedure pays off when the general medoid does not effectively describe the cases in the cluster. Thus, relatively homogeneous clusters can be expected to be characterized by lower values of the R2. For hierarchically nested sub-samples it is therefore possible to evaluate the internal homogeneity achieved by focusing on the decrease of the R2.

Tree
