Conversational voice patterns in adult English speakers with ASD

Introduction

Individuals with Autism Spectrum Disorder (ASD) often display a distinctive quality of speech, described as monotone or sing-songy [1]. These patterns are robust indicators of social communication deficits (Paul et al., 2005) and contribute to reaching a diagnosis of ASD. In recent work [1-4], we showed that machine learning algorithms can be trained to discriminate autistic from non-autistic speakers with 80-86% accuracy from monological speech recordings in British English, American English and Danish. Here we extend those findings by characterizing the acoustic properties of ASD speech in a naturalistic dialogical setting and comparing them to the previous results. The current results refer to British English only.

This work was supported by an Interacting Minds Center seed grant.

The Autism Research Group

Fusaroli, R., Lambrechts, A., Yarrow, K., Maras, K., Gaigg, S.B.

Methods

The framework of a previous naturalistic experiment [5] provided us with audio recordings of 17 ASD and 17 matched Typically Developing (TD) adults. Participants took part in a live event scenario in which they performed first-aid manipulations on a manikin following a script. Afterwards, participants were interviewed about what they recalled of the event. The first part of the interview was a monological report of the events by the interviewee, followed by a dialogical Q&A session.

Figure 1: the left panel represents the first aid situation; the right panel the free recall setting.

Participants: Autism Diagnostic Observation Schedule, ADOS [6]:

•  Total: Range: 5-17; Mean: 9.6 (SD: 3.2)
•  Com: Range: 0-6; Mean: 2.8 (SD: 1.7)
•  RSI: Range: 3-12; Mean: 6.8 (SD: 2.5)

Autism Quotient (AQ):

•  ASD: Range: 21-45; Mean: 32.9 (SD: 6.8)
•  TD: Range: 4-28; Mean: 16.5 (SD: 6.3)

Materials: We automatically separated the vocal productions of the interviewer and the interviewee using a custom-made speaker diarization system. We discarded all utterances shorter than 3 seconds, which yielded 1547 total utterances:

•  34 monological utterances (17 ASD and 17 TD)
•  416 produced by interviewees with ASD and 334 by their interviewer
•  438 produced by control interviewees and 277 by their interviewer
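The utterance-selection step described above can be sketched as follows. The representation of diarized segments as (speaker, start, end) tuples is a hypothetical simplification; the actual output format of the custom diarization system is not specified in the poster.

```python
# Minimal sketch of the utterance-filtering step: keep only utterances
# of at least 3 seconds. Segment tuples are (speaker, start_s, end_s),
# a hypothetical representation of the diarization output.

def filter_utterances(segments, min_duration=3.0):
    """Drop utterances shorter than min_duration seconds."""
    return [(spk, start, end) for spk, start, end in segments
            if end - start >= min_duration]

segments = [
    ("interviewee", 0.0, 5.2),   # kept (5.2 s)
    ("interviewer", 5.2, 6.1),   # dropped (0.9 s)
    ("interviewee", 6.1, 10.4),  # kept (4.3 s)
]
kept = filter_utterances(segments)
```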

From each utterance we extracted regularly sampled (every 10 ms) time series of: 1) voicing/pause behavior; 2) pitch; 3) intensity; 4) vowel onsets; 5) several measures of voice quality (formants, mel cepstral coefficients, harmonics-to-noise ratio, creakiness, clarity, breathiness, etc.).
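As an illustration of one of these time series, a framewise intensity track sampled every 10 ms can be computed as a windowed root-mean-square. This is a generic sketch, not the poster's actual extraction pipeline; the window length (25 ms) and the synthetic test tone are assumptions for the example.

```python
import numpy as np

def framewise_intensity(signal, sr, hop_s=0.010, win_s=0.025):
    """Root-mean-square intensity sampled every hop_s seconds
    over windows of win_s seconds."""
    hop = int(sr * hop_s)
    win = int(sr * win_s)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        frames.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(frames)

# 1 second of a synthetic 220 Hz tone at 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
intensity = framewise_intensity(tone, sr)  # one value per 10 ms
```

The same sliding-window scheme, with different per-frame statistics, yields the pitch, voicing and voice-quality series.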

For all measures we calculated descriptive statistics (mean, standard deviation) and non-linear measures of change over time, in particular recurrence quantification analysis (RQA, [9]) and the Teager-Kaiser Energy Operator (TKEO, [10]).

Analysis

A 5-fold subject-level cross-validated ElasticNet was used for feature selection [11]. Diagnosis was predicted using a 5-fold subject-level cross-validated linear discriminant function, and balanced accuracy was estimated using Variational Bayesian mixed-effects inference [12]. AQ, ADOS total scores and individual factor scores were predicted using a 5-fold cross-validated multiple linear regression. Both analyses were iterated 100 times to test the stability of the results.
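The two non-linear measures named above can be sketched in a few lines. The TKEO is the standard three-point operator; the recurrence function shows only the simplest RQA output (recurrence rate on a one-dimensional series), whereas full RQA as in [9] works on an embedded phase space and also yields determinism, laminarity, etc.

```python
import numpy as np

def tkeo(x):
    """Teager-Kaiser Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].
    Tracks instantaneous amplitude/frequency energy of a signal."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def recurrence_rate(x, radius):
    """Simplest RQA measure: the fraction of pairs of time points whose
    values fall within `radius` of each other (off-diagonal density
    of the recurrence matrix)."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    rec = dist < radius
    n = len(x)
    # exclude the trivial self-recurrences on the main diagonal
    return (rec.sum() - n) / (n * n - n)
```

Summary statistics of these quantities (e.g. the standard deviation of the TKEO of intensity, reported below as "shimmer variability") are what enter the classifiers.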

Results: Diagnosis

1. Predicting diagnosis from monological speech (analogous to [4]): Mean intensity and shimmer variability (standard deviation of the TKEO of intensity) yield a balanced accuracy of 87.9% (CIs: 77%-96.2%), sensitivity of 89% (CIs: 87.8%-90.2%), and specificity of 86.8% (CIs: 85.6%-88%). Other highly informative features are the mel cepstral coefficients.

2. Predicting diagnosis from conversational speech (interviewee): Median and coefficient of variation of intensity, average syllable duration, variability of breathiness (TKEO of the Parabolic Spectral Parameter), 3rd formant variability (TKEO) and 4th formant skewness yield a balanced accuracy of 74.6% (CIs: 65%-82.6%), sensitivity of 78.9% (CIs: 68.5%-87.3%), and specificity of 74.5% (CIs: 70%-79.4%).

3. Predicting diagnosis from conversational speech (interviewer): The percentage of voiced speech to silence, variations in clarity and pitch range yield a balanced accuracy of 63% (CIs: 57.1%-68.5%), sensitivity of 58% (CIs: 57.8%-58.1%), and specificity of 69.4% (CIs: 69.3%-69.5%).
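For reference, the metrics reported above relate as follows: balanced accuracy is the mean of sensitivity and specificity, which makes it robust to unequal group sizes. This sketch shows the plain point estimates; the poster's credible intervals come from the Variational Bayesian mixed-effects procedure of [12], which is not reproduced here.

```python
def diagnostic_metrics(y_true, y_pred, positive="ASD"):
    """Sensitivity, specificity and balanced accuracy from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)   # proportion of ASD correctly detected
    specificity = tn / (tn + fp)   # proportion of TD correctly rejected
    return sensitivity, specificity, (sensitivity + specificity) / 2
```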

Conclusions

The distinctiveness of autistic speech is quantifiable in ways that make it possible to largely reconstruct key aspects of the diagnosis from the voice alone. Monological speech seems to provide better information, either due to its greater length or to the absence of compensatory conversational dynamics. Crucially, key aspects of the diagnosis can also be reconstructed, albeit to a lesser degree, from an interlocutor's conversational behavior. Future work will explore the role and diagnostic power of conversational dynamics (e.g. backchannelling, turn-taking, etc.).

References

[1] Fusaroli, R., et al. (under review). Is voice a biomarker for ASD?
[2] Fusaroli, R., et al. (2013). Non-Linear Analyses of Speech and Prosody in Asperger's Syndrome. IMFAR 2013.
[3] Fusaroli, R., et al. (2015). The Temporal Structure of the Autistic Voice. IMFAR 2015.
[4] Fusaroli, R., et al. (2015). Voice patterns in adult English speakers with ASD. IMFAR 2015.
[5] Maras, K. L., Memon, A., Lambrechts, A., & Bowler, D. M. (2013). Recall of a live and personally experienced eyewitness event by adults with autism spectrum disorder. JADD, 43(8),
[6] Lord, C., et al. (1989). Autism diagnostic observation schedule: A standardized observation of communicative and social behavior. JADD, 19(2), 185-212.
[9] Marwan, N., et al. (2007). Recurrence plots for the analysis of complex systems. Physics Reports, 438, 237-329.
[10] Tsanas, A., et al. (2011). Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's. J R Soc Interface, 8(59), 842-855.
[11] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
[12] Brodersen, K. H., et al. (2013). Variational Bayesian mixed-effects inference for classification studies. NeuroImage, 76C, 345-361.

Results: Clinical Features

1. Predicting clinical features from monological speech (analogous to [4]): Mel cepstral coefficient range, pitch jitter and speech rate regularity over time were able to predict 34.6% of the variance in AQ. ADOS scores could only be predicted in the absence of cross-validation at the regression level.

2. Predicting clinical features from conversational speech (interviewee): Syllable duration (B=13.5), ratio of speech to silence and temporal dynamics of the fourth formant (TKEO and range) were able to predict 30.3% of the variance in AQ. ADOS scores could only be predicted in the absence of cross-validation at the regression level.

3. Predicting clinical features from conversational speech (interviewer): Pitch range, syllable duration, number of pauses, mean and temporal dynamics of intensity, and more subtle voice qualities (range of the 4th and 1st formants, breathiness) were able to predict 29.8% of the variance in AQ. ADOS scores could only be predicted in the absence of cross-validation at the regression level.

Figure 2 – Example of a spectrogram in an ASD participant: (a) Raw signal (b) Spectrum (frequency as a function of time; blue lines show the maximum intensity at which f0 is measured) (c) Speech-pause pattern (speech in blue)
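The "variance explained" figures for AQ above come from cross-validated multiple linear regression: the model is fit on training folds and R² is computed on held-out predictions, which is why ADOS scores that fit in-sample can fail to be predicted under cross-validation. A minimal sketch of this scheme, using plain least squares and random folds (the actual pipeline used subject-level folds and ElasticNet feature selection [11]):

```python
import numpy as np

def cv_r2(X, y, n_folds=5, seed=1):
    """Out-of-sample variance explained by a multiple linear regression,
    estimated with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    preds = np.empty(len(y), dtype=float)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # fit ordinary least squares with an intercept on the training folds
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # predict the held-out fold
        preds[test] = np.column_stack([np.ones(len(test)), X[test]]) @ beta
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot
```

When the features carry no out-of-sample signal, this statistic drops to zero or below even if the in-sample fit looks good, mirroring the ADOS result.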