AUDITORY COGNITIVE SCIENCE LAB

 

 


 

 

CURRENT FUNDING

 

National Institutes of Health R01: Formation and tuning of complex auditory categories

 

PI: Dr. Andrew J. Lotto

Co-PI: Dr. Lori L. Holt

$1,597,659

12/01/05-11/30/10

 

To respond adaptively in a variety of situations, organisms form and utilize perceptual categories or functional equivalence classes. Categories can be induced from distributions of sensory input experienced across varied but similar contexts. However, the usefulness of perceptual categories based on a large corpus of experiences may be limited when the distributional characteristics of a particular setting differ drastically from the norm. In these cases, reliance on stable long-term categories may be inefficient or maladaptive. In order to perform optimally or adaptively, organisms must dynamically “tune” categories to the regularities of the current environment. For example, an animal may distinguish friend and foe by the acoustic patterns of calls. The category distinction may be established across exposure to the sounds in many listening conditions. However, if the animal finds itself in an acoustic environment that deviates from the norm (e.g., a lot of low-frequency noise) then the categorization decision should be shifted (e.g., by increasing the low-frequency energy needed to elicit a response or by giving greater “weighting” to high-frequency differences).

 

Another example of the need for category tuning comes from speech perception. Phonetic identification of a speech sound based on its acoustic properties can be considered an example of perceptual categorization. Infants and second-language (L2) learners form phonetic categories from distributions of experienced speech sounds (Jusczyk, 1997; Kuhl, 1993; Lotto, 2000). However, the acoustics of speech are notoriously variable across speakers. Some of this variability is the result of anatomical and physiological differences in the instrument of speech production, such as the larger (and differently-proportioned) vocal tracts of male vs. female speakers or perturbed articulatory patterns resulting from stroke or dysarthria. Other variability is the result of linguistic experience such as foreign accent and dialect, or idiosyncratic patterns of speech. The result of all this variability is that phonetic categories and decision bounds founded on experience across a variety of talkers may produce mis-categorization in application to any particular talker. Categories must be tuned dynamically to the speech of the current talker either by changing the representation of the individual sounds or influencing the relevant phonetic category space. Within the field of speech perception, the accommodation of talker-specific characteristics is referred to as “talker normalization” (Johnson & Mullennix, 1997).

 

The problem of talker-specific acoustics has been a focus of speech perception research since the beginning of the field (Potter & Steinberg, 1950). Much of the investigation of talker normalization has been concentrated on compensating for anatomical differences among speakers, such as gender differences. Although differences in vocal tracts present a substantial challenge to pattern-recognition approaches to speech perception, the variability arising from these differences is relatively constrained. As a result, some success has been attained by extracting less-variable ratios of frequency components (Fujisaki & Kawashima, 1968; Syrdal & Gopal, 1986; Traunmuller, 1981) or re-scaling speech based on vocal-tract length (Nordstrom & Lindblom, 1975). A less constrained source of variability is that arising from dialectal (speech patterns of a community) and idiolectal (speech patterns of an individual) differences between talkers. For example, when producing the same phonetic segments, a non-native English speaker with an accent may use a different range of values across an acoustic dimension than a native speaker. Speakers may even systematically violate correlations among dimensions ordinarily present in native speech. Despite vast dialectal differences, it is a commonplace anecdotal experience that non-native speakers become more intelligible the more experience one has with their speech. How much tolerance do listeners have for perturbations from established auditory categories? What mechanisms are responsible for the adaptive tuning of category responses?

 

In a previous NIH-funded project, the PIs investigated the formation of auditory categories defined by distributions of novel complex sounds. The goals of that project were to design methods for studying auditory categorization and to provide insights into categorization of ecologically-valid stimuli such as speech sounds. One of the conclusions arising from the studies was that brief exposure to distributions of sounds can radically shift categorization of subsequent stimuli. Listeners identify stimuli not just on the basis of the long-term regularities of the training input but also on the basis of short-term or “local” regularities. We refer to this as “tuning” the category. Such tuning indicates that listeners adapt categories dynamically.
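
As a purely illustrative sketch of what “tuning” a category to local regularities could look like, the Python function below places a decision boundary on a single acoustic dimension midway between two category means, where each mean blends long-term experience with recently heard exemplars. The function name, the blending rule, and the local_weight parameter are assumptions made for this example; they are not the model or the method used in the project.

```python
from statistics import mean


def tuned_boundary(long_term_a, long_term_b, local_a, local_b, local_weight=0.5):
    """Return a decision boundary on one acoustic dimension (arbitrary units).

    Hypothetical toy rule: each category mean is a weighted blend of the
    long-term mean and the mean of recent ("local") exemplars, and the
    boundary sits midway between the two blended means.
    """
    def blend(long_term, local):
        # With no recent exemplars, fall back on long-term experience alone.
        if not local:
            return mean(long_term)
        return (1 - local_weight) * mean(long_term) + local_weight * mean(local)

    return (blend(long_term_a, local_a) + blend(long_term_b, local_b)) / 2
```

With long-term exemplars centered on 1000 and 2000, the boundary sits at 1500; after brief exposure to exemplars shifted upward on the same dimension, the boundary moves upward as well, which is the qualitative pattern described above.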

 

The proposed project uses the stimulus sets and methods developed in the previous project to examine how auditory categories become tuned to local distributional information. The results of this investigation will have clear implications for models of talker normalization. We have designed a multi-level empirical approach that establishes a clear link between the basic science questions (i.e., How are auditory categories adaptively tuned?) and applications of the results (i.e., How is the intelligibility of foreign accent or articulatory idiosyncrasies altered by experience with the speaker?). The stimulus sets range from more ecologically-valid but less controlled (natural and altered speech) to more controlled but ecologically less-valid (shaped bursts of noise) with some stimulus sets in the middle of the range (hybrid non-speech/speech). The non-speech and hybrid stimulus sets provide the control necessary to probe basic perceptual/cognitive processes of category formation and tuning, whereas the speech stimuli allow a direct test of the relevance of these basic findings to a “real-world” perceptual problem. We have used each of these approaches with success in our past research. The present project is designed to accomplish three specific aims: 1) to test the extent of phonetic category tuning for talker differences arising from dialectal and idiolectal differences, 2) to test the limits of category response tuning as a function of the spectral-temporal make-up of the local acoustic context, and 3) to test the limits of category response tuning based on the distributional statistics of the local context.

 


 

National Science Foundation: Collaborative Research: Learning complex auditory categories

 

PI: Andrew J. Lotto

$149,982

04/01/08-12/31/10

 

Speech sounds are complex signals that vary across a large number of temporal and spectral (energy by frequency) characteristics.  Some of this acoustic variance is directly related to the intended message of the speaker.  Other variance is extra-linguistic, resulting from factors such as the particular structure of the speaker’s vocal tract.  A language learner must parse the input variance to discriminate those contrasts that carry information and to generalize across variation within a contrast that is due to speaker characteristics, coarticulation, articulatory undershoot, etc. Complicating this task is the fact that the language learner must do this in a language-appropriate manner. Languages utilize some subset of over 800 phonemes, and this subset can range from 11 to 141 phonemes (Maddieson, 1984). As a result of this diversity, variance that is extra-linguistic in one language community may be pivotal for discovering the intended message of a speaker in another language environment.

 

Whereas infants show some evidence of native-language-appropriate parsing of the speech variance before their first birthday (Kuhl, 1983; Werker & Tees, 1984; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992), it is well-known that adult second-language (L2) learners have significant trouble producing and perceiving some non-native contrasts.  For example, Japanese speakers have well-known problems with the English /l/-/r/ contrast (Goto, 1971; Miyawaki et al., 1975; Strange & Jenkins, 1978).  For English listeners, the primary cue for this distinction in syllable-initial position is the onset frequency of the third formant (F3, O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957).  /l/ is associated with a higher F3 onset frequency.  There is now good evidence that Japanese speakers fail to use the informative F3 feature in producing or perceiving this distinction (Yamada & Tohkura, 1992; Gordon, Keyes, & Yung, 2001; Iverson et al., 2003).  Instead, they tend to rely on F2 variance, which is informative for a similar Japanese contrast but is not reliably related to the English contrast.  Thus, Japanese speakers weight the acoustic features in a non-optimal manner for identification of the English contrast.  Extensive perceptual and/or production training has led to limited improvements in the perceptual identification of difficult non-native contrasts.  In particular, it has been challenging to demonstrate learning that robustly generalizes past the particular stimuli used in training (Logan, Lively, & Pisoni, 1991; McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002).

 

The familiar /l/-/r/ example illuminates some of the complexities of phonemic acquisition that are often overlooked.  Because of the tradition of distinctive feature phonology in speech perception research (Jakobson & Halle, 1971), there is sometimes a tendency to think of speech sound categorization as a process of grouping the appropriate acoustic attributes into features that differentiate one phoneme from others.  However, there are an immense number of possible acoustic attributes that could be informative.  The first task for categorization is defining the attributes that vary in a structured way.  Imagine a listener presented with a 300-ms wide-band noise burst that belongs to a novel sound category.  The amplitude of energy in any frequency × time region could be a defining feature (e.g., high amplitude from 2000-2200 Hz in the 200-250 ms time slice), as could any change in energy across two time slices, or overall duration or intensity or any combination of these attributes.  As the listener receives more exemplars from the category, the perceptual system must determine what aspects of the noise burst covary with the category and what variance is irrelevant.  This is obviously an extreme case but the number of possible cues to a speech category (or any other complex auditory category) is daunting.  For example, the contrast between voiced and voiceless stops (e.g., /b/ vs. /p/) is signaled in part by duration of aspiration noise, duration of cutback of energy in the first formant (F1) region, fundamental frequency of the following vowel, F1-onset frequency, etc. (Summerfield & Haggard, 1977; Lisker, 1986).

 

To add to the complexity, there are few acoustic features that are necessary or sufficient for defining speech categories (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman, 1996).  Once informative features are extracted from the signal, they must be “weighted” to come to a perceptual decision, especially if there is a conflict amongst the cues (Jusczyk, 1993; Fox, Flege, & Munro, 1995; Lotto, Kluender, & Holt, 1997; Nearey, 1997; Nittrouer, 2002).  In the case of the English /l/-/r/ contrast, the starting frequency of F3 should carry most of the weight but Japanese listeners appear to heavily weight F2 starting frequency (Yamada & Tohkura, 1992).  This weighting strategy is clearly non-optimal as F2 is not varied contrastively for syllable-initial liquids by English speakers.  Non-optimal weighting strategies have also been demonstrated for native-language perception by children (Parnell & Amerman, 1978; Morrongiello, Robson, Best, & Clifton, 1984; Nittrouer, Crowther, & Miller, 1998; Nittrouer, Miller, Crowther, & Manhart, 2000).
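
The idea of cue weighting can be pictured with a minimal Python sketch in which two cues are combined by a weighted sum and passed through a logistic function. Everything specific here is an assumption made for illustration: the function name, the logistic form, the reference values f2_ref_hz and f3_ref_hz (placeholders, not measured category boundaries), and the sign chosen for the F2 term. The only property taken from the text is that a lower F3 onset favors /r/, since /l/ is associated with a higher F3 onset.

```python
import math


def prob_r(f2_onset_hz, f3_onset_hz, w_f2, w_f3,
           f2_ref_hz=1200.0, f3_ref_hz=2400.0):
    """Probability of an /r/ response from a weighted sum of two cues.

    Hypothetical toy model: weights are per Hz and should be small
    (around 0.01) so the exponent stays in a safe numeric range.
    The reference values are illustrative placeholders only.
    """
    # Lower F3 onset counts as evidence for /r/; the F2 sign is arbitrary here.
    evidence = (w_f3 * (f3_ref_hz - f3_onset_hz)
                + w_f2 * (f2_ref_hz - f2_onset_hz))
    return 1.0 / (1.0 + math.exp(-evidence))


# A stimulus that differs from the reference only in F3 onset.
f3_weighted = prob_r(1200.0, 2000.0, w_f2=0.0, w_f3=0.01)
f2_weighted = prob_r(1200.0, 2000.0, w_f2=0.01, w_f3=0.0)
```

In this toy setting, a listener who puts all the weight on F3 responds /r/ with high probability, whereas a listener who puts all the weight on F2 stays at chance because the stimulus carries no F2 information. That is the sense in which a weighting strategy can be non-optimal for a given contrast.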

 

The goal of the proposed project is to investigate the processes and variables involved in the dual tasks of discovering and weighting informative acoustic features for categorization of complex stimuli such as speech sounds.  In the past several years, the PIs have developed a multi-pronged research program to investigate the categorization of complex sounds.  The empirical methodologies include: 1) training non-human animals to identify speech sounds (e.g., Kluender, Lotto, Holt, & Bloedel, 1998; Holt, Lotto, & Kluender, 2001); 2) measuring the productions and identification abilities of human adults learning a second language (e.g., Kim, Kluender, Lotto, & Reed, 1994; Kim & Lotto, 2002); 3) computational modeling of animal and human categorization data (Kluender et al., 1998); and 4) training adults on well-controlled non-speech categories (e.g., Lotto, 2000a; Lotto & Holt, 2001; Mirman, Holt, & McClelland, 2002; Holt, Lotto, & Diehl, in press).  (The development of this latter work has been funded by a previous NSF grant and is described in more detail below.)  The use of multiple methods provides more than mere converging evidence.  The approach provides a spectrum of stimulus control and ecological validity.  L2 studies provide an ecologically-valid and practical learning situation, but there is a lack of control over listeners’ experience with the speech sounds. Animal models permit us complete control over the experience of our participants and allow us to examine the effects of general auditory and learning processes on representations of the sounds.  Non-speech training studies provide ultimate control over the distribution of experienced input with human listeners, but the stimuli must necessarily be somewhat different from natural speech sounds.  These last two approaches provide known input distributions and fine-grained sampling of perceptual behavior, which creates an excellent testing ground for computational models of categorization.  We have found that these strictly-controlled methodologies produce novel predictions that can be tested in the more ecologically-valid paradigms.

 

Most of our previous efforts have been concentrated on the establishment of boundaries and decision criteria across maximally informative features of the stimuli (e.g., the first and second formants in a vowel identification task or center frequency of a band of noise).  Results of these experiments have provided insights into how operating characteristics of the auditory system interact with information coming from experienced stimuli to form perceptual categories.  In many cases, we found that our listeners were not categorizing stimuli in an optimal manner even for simple tasks.  It has become clear that the tasks of detecting informative features in complex acoustic signals and weighting them correctly are not easily accomplished and can serve as a bottleneck for categorization performance.  

 


 

Mayo Clinic Arizona Research Committee CR5: High density electroencephalographic (EEG) event-related bandpower as a biomarker for disordered speech perception

 

PI: John Caviness

Co-PI: Julie M. Liss

Collaborator: Andrew J. Lotto

$94,481

02/08-02/10

 

This collaborative project between UA (Dr. Lotto’s Auditory Cognitive Science Lab), ASU (Dr. Julie Liss’ Motor Speech Disorders Lab), and the Mayo Clinic in Scottsdale (Dr. John Caviness) seeks to merge three strong independent lines of research with emerging state-of-the-art functional brain mapping technology.  Our ultimate goal is to define the temporal-spatial cortical activation associated with the perception of intelligible speech.  The topic of the neural correlates of speech processing is of broad interest, with both contemporary clinical and basic science implications. The recent surge of papers on the topic includes results of functional neuroimaging studies that define the anatomical substrates of speech processing.  However, the temporal sequence of activation of these structures, an aspect critical to processing degraded speech like dysarthria, remains to be established.  We are in a unique position to tackle this question because of our constellation of expertise in perceptual processing of normal and disordered speech and state-of-the-art electroencephalography (EEG) mapping.  This project will incorporate behavioral data on speech perception and production with neurophysiologic data to develop and test a model of how listeners map from auditory to semantic representations. The novel EEG modeling techniques along with work on speech perception at the phonemic and sentence level will be the basis for a unique research program with theoretical and clinical impact.

 

The term “dysarthria” refers to a class of motor speech disorders resulting from damage to the central and/or peripheral nervous systems. Findings from Dr. Liss’ lab have shown that different forms of dysarthria present specific perceptual processing challenges to listeners. That is, the nature of the speech deficit (i.e., whether the speech is produced too fast or slow, with imprecise articulation, without normal pitch and loudness contours, etc.) affects the effectiveness of the perceptual strategies listeners use to decipher what is being said. This finding has substantial clinical implications, particularly for the development of efficacious treatments.  But beyond this, our behavioral data yield theoretical predictions that bear directly on emerging models of the neural correlates of speech perception and speech intelligibility. Current models have delineated the cortical structures involved in speech processing.  However, testing predictions about the perception of disordered speech requires a paradigm that allows for the specification of the sequence and time-course over which critical structures are activated. In the present proposal, we will use the paradigm of event-related spectral desynchronization-synchronization (ERD/ERS) as measured with EEG to map the location and time-course of neural activation associated with the auditory processing of dysarthric speech. The EEG ERD/ERS is an established correlate of increased brain activation or idling, depending on the direction of the spectral change.  Application of the ERD/ERS measure to speech intelligibility work is novel.  The paradigm of ERD/ERS can be applied to our intelligibility experiments with relatively inexpensive additions to the equipment already operational in the clinical movement disorders neurophysiology laboratory at Mayo Clinic-Arizona (MCA).  Results will form the pilot data for an extensive investigation of the relationship between speech disorder characteristics and the associated neurocognitive processing.
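
For readers unfamiliar with the measure, ERD/ERS is commonly quantified as the percentage change in band power during an event-related interval relative to a reference (baseline) interval. The Python sketch below shows that common convention only; the function name and arguments are assumptions made for illustration, and the text does not state which quantification the project itself uses.

```python
def erd_ers_percent(reference_power, active_power):
    """Percentage band-power change relative to a reference interval.

    Common convention (assumed here, not taken from the project):
    negative values indicate a power decrease (ERD),
    positive values indicate a power increase (ERS).
    """
    if reference_power <= 0:
        raise ValueError("reference_power must be positive")
    return (active_power - reference_power) / reference_power * 100.0
```

Under this convention, a drop in band power from 10 to 5 (arbitrary units) gives -50, a desynchronization, and a rise from 10 to 15 gives +50, a synchronization. This sign is the “direction of the spectral change” on which the activation-versus-idling interpretation depends.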

 

Our long-term goal is to establish speech pattern (dysarthria) specific biomarkers of intelligibility.  Our plans for an extramural application to support this work require a convincing demonstration of the technical feasibility of our paradigm, as well as a compelling case for the interpretability of the data relative to different forms and severities of dysarthria.  Goal 1, Technical.  Demonstrate the ability of high resolution ERD to map the time course of cortical activation in the perception of different types and severities of dysarthric speech.  Goal 2, Interpretive.  Determine the task(s) which yield the most systematic and interpretable pattern of results relative to dysarthria type and severity (intelligibility).  Tasks include a) passive listening for meaning, b) lexical decision, and c) phrase repetition.

 

Current models of the neural correlates of speech processing provide a framework for an array of expected findings. In broad terms, we expect that the sequence and time-course of cortical activation will differ between the perception of dysarthric speech and intact speech with an earlier and stronger recruitment of the dorsal activation pathway (inferior parietal and frontal systems), reflecting increased cognitive load.  We expect dysarthria-specific patterns will emerge, which reflect the nature of the acoustic components of the degraded signal.  For example, phrase-length speech associated with reductions in pitch variation (especially Parkinsonian dysarthria) will elicit greater activation of right hemisphere cortical structures than those types of dysarthria with pitch variation intact.  Finally, we expect task-specific differences in the time-course and sequence of cortical activation, with differences among tasks to distinguish among dysarthria patterns and severities. 

 

This work is novel and significant.  Although the project aims at one class of disordered speech (the dysarthrias), the principles discovered herein would apply more generally to all forms of speech disorders.  As such, the study holds promise for explaining the perceptual basis of intelligibility deficits across communication disorders, and for pointing toward more efficient treatment paradigms.  The work also serves as an ecologically valid test case for the perception of degraded speech, in general, which will be of interest to a broad range of scientists in neuropsychology, cognitive psychology, and neurolinguistics. 

 


 

National Institutes of Health-NIDCD R01: Optimizing amplification for infants and young children

 

PI: Patricia G. Stelmachowicz

Co-Investigator: Andrew J. Lotto

12/01/04-11/30/09

 

Recent studies have shown that children with hearing loss who are identified through universal newborn hearing screening programs are not as delayed in speech and language development as children who are identified at later ages. It appears, however, that even with early identification and intervention (including amplification), these children are still delayed relative to children with normal hearing. In the current proposal, it is hypothesized that these persistent delays are the result of reduced auditory access and limited auditory experiences. Specifically, one consequence of congenital hearing loss is limited auditory access to speech. Reduced auditory experience in infancy may compromise auditory perceptual foundations upon which later language stages are constructed. It is critical to determine the constellation of auditory factors that support early learning and the experiences that facilitate continued language development throughout childhood. The overall goal of this project is to explore ways in which to enhance auditory access and auditory experiences in young children with hearing loss. Current hearing instruments and other assistive listening devices appear to be incapable of fully compensating for the perceptual degradation of hearing loss. In addition, the negative influence of factors such as distance, noise, and reverberation is magnified for children with hearing loss, thus reducing the number and quality of auditory experiences. Two areas associated with reduced auditory access for children with hearing loss will be investigated in the studies described in this proposal. First, the influence of selected forms of advanced signal processing on speech perception, speech production, novel-word learning, and ease of listening will be explored. Second, experiments will be conducted to determine whether the quality and quantity of auditory experiences can be enhanced for the purpose of accelerating auditory skill development and adaptation to new signal-processing algorithms. In combination, these studies potentially could result in the development of alternative intervention strategies leading to more successful speech and language outcomes for children with hearing loss.

 

 

The University of Arizona

Tucson