
AUDITORY COGNITIVE NEUROSCIENCE EXPERIENCE LAB

 

 


 

 

CURRENT FUNDING


 

National Institutes of Health R01: Formation and tuning of complex auditory categories

 

Award No. 5 R01 DC004674

PI: Dr. Andrew J. Lotto

Co-PI: Dr. Lori L. Holt

$1,597,659

12/01/05-11/30/10

 

To respond adaptively in a variety of situations, organisms form and utilize perceptual categories or functional equivalence classes. Categories can be induced from distributions of sensory input experienced across varied but similar contexts. However, the usefulness of perceptual categories based on a large corpus of experiences may be limited when the distributional characteristics of a particular setting differ drastically from the norm. In these cases, reliance on stable long-term categories may be inefficient or maladaptive. In order to perform optimally or adaptively, organisms must dynamically “tune” categories to the regularities of the current environment. For example, an animal may distinguish friend and foe by the acoustic patterns of calls. The category distinction may be established across exposure to the sounds in many listening conditions. However, if the animal finds itself in an acoustic environment that deviates from the norm (e.g., one with a lot of low-frequency noise), then the categorization decision should be shifted (e.g., by increasing the low-frequency energy needed to elicit a response or by giving greater “weighting” to high-frequency differences).

 

Another example of the need for category tuning comes from speech perception. Phonetic identification of a speech sound based on its acoustic properties can be considered an example of perceptual categorization. Infants and second-language (L2) learners form phonetic categories from distributions of experienced speech sounds (Jusczyk, 1997; Kuhl, 1993; Lotto, 2000). However, the acoustics of speech are notoriously variable across speakers. Some of this variability is the result of anatomical and physiological differences in the instrument of speech production, such as the larger (and differently-proportioned) vocal tracts of male vs. female speakers or perturbed articulatory patterns resulting from stroke or dysarthria. Other variability is the result of linguistic experience such as foreign accent and dialect, or idiosyncratic patterns of speech. The result of all this variability is that phonetic categories and decision bounds founded on experience across a variety of talkers may produce mis-categorization in application to any particular talker. Categories must be tuned dynamically to the speech of the current talker either by changing the representation of the individual sounds or influencing the relevant phonetic category space. Within the field of speech perception, the accommodation of talker-specific characteristics is referred to as “talker normalization” (Johnson & Mullennix, 1997).

 

The problem of talker-specific acoustics has been a focus of speech perception research since the beginning of the field (Potter & Steinberg, 1950). Much of the investigation of talker normalization has been concentrated on compensating for anatomical differences among speakers, such as gender differences. Although differences in vocal tracts present a substantial challenge to pattern-recognition approaches to speech perception, the variability arising from these differences is relatively constrained. As a result, some success has been attained by extracting less-variable ratios of frequency components (Fujisaki & Kawashima, 1968; Syrdal & Gopal, 1986; Traunmuller, 1981) or re-scaling speech based on vocal-tract length (Nordstrom & Lindblom, 1975). A less constrained source of variability is that arising from dialectal (speech patterns of a community) and idiolectal (speech patterns of an individual) differences between talkers. For example, when producing the same phonetic segments, a non-native English speaker with an accent may use a different range of values across an acoustic dimension than a native speaker. Speakers may even systematically violate correlations among dimensions ordinarily present in native speech. Despite vast dialectal differences, it is a commonplace anecdotal experience that non-native speakers become more intelligible the more experience one has with their speech. How much tolerance do listeners have for perturbations from established auditory categories? What mechanisms are responsible for the adaptive tuning of category responses?
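The appeal of ratio-based and length-based normalization can be seen in an idealized case. The sketch below treats the vocal tract as a uniform tube closed at one end, whose resonances fall at odd multiples of c/(4L). The tube model, the speed of sound, and the two tract lengths are simplifying assumptions chosen for illustration; they are not taken from the cited studies.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed round value, for illustration only


def tube_formants(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


def ratios_to_f1(formants):
    """Express each formant as a ratio to the first formant."""
    return [f / formants[0] for f in formants]


def rescale_to_length(formants, from_cm, to_cm):
    """Map formants from one tract length onto another by uniform scaling."""
    return [f * from_cm / to_cm for f in formants]


long_tract = tube_formants(17.5)   # hypothetical longer vocal tract
short_tract = tube_formants(14.5)  # hypothetical shorter vocal tract

long_ratios = ratios_to_f1(long_tract)
short_ratios = ratios_to_f1(short_tract)
short_as_long = rescale_to_length(short_tract, from_cm=14.5, to_cm=17.5)
```

Under these assumptions the absolute formant values differ between the two tubes, while the ratios to F1 are the same and a single scale factor maps one set of formants onto the other. Real vocal tracts are not uniform tubes, so these invariances hold only approximately.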

 

In a previous NIH-funded project, the PIs investigated the formation of auditory categories defined by distributions of novel complex sounds. The goals of that project were to design methods for studying auditory categorization and to provide insights into categorization of ecologically-valid stimuli such as speech sounds. One of the conclusions arising from the studies was that brief exposure to distributions of sounds can radically shift categorization of subsequent stimuli. Listeners identify stimuli not just on the basis of the long-term regularities of the training input but also on the basis of short-term or “local” regularities. We refer to this as “tuning” the category. Such tuning indicates that listeners adapt categories dynamically.

 

The proposed project uses the stimulus sets and methods developed in the previous project to examine how auditory categories become tuned to local distributional information. The results of this investigation will have clear implications for models of talker normalization. We have designed a multi-level empirical approach, which establishes a clear link between the basic science questions (i.e., How are auditory categories adaptively tuned?) and applications of the results (i.e., How is the intelligibility of foreign-accented speech or of speech with articulatory idiosyncrasies altered by experience with the speaker?). The stimulus sets range from more ecologically-valid but less controlled (natural and altered speech) to more controlled but ecologically less-valid (shaped bursts of noise) with some stimulus sets in the middle of the range (hybrid non-speech/speech). The non-speech and hybrid stimulus sets provide the control necessary to probe basic perceptual/cognitive processes of category formation and tuning, whereas the speech stimuli allow a direct test of the relevance of these basic findings to a “real-world” perceptual problem. We have used each of these approaches with success in our past research. The present project is designed to accomplish three specific aims: 1) to test the extent of phonetic category tuning for talker differences arising from dialectal and idiolectal differences, 2) to test the limits of category response tuning as a function of the spectral-temporal make-up of the local acoustic context, and 3) to test the limits of category response tuning based on the distributional statistics of the local context.

 


 


National Science Foundation: Collaborative Research: Learning complex auditory categories

 

Award No. 0746019

PI: Andrew J. Lotto

$149,980

04/01/08-09/30/11

 

Speech sounds are complex signals that vary across a large number of temporal and spectral (energy by frequency) characteristics.  Some of this acoustic variance is directly related to the intended message of the speaker.  Other variance is extra-linguistic resulting from factors such as the particular structure of the speaker’s vocal tract.  A language learner must parse the input variance to discriminate those contrasts that carry information and to generalize across variation within a contrast that is due to speaker characteristics, coarticulation, articulatory undershoot, etc. Complicating this task is the fact that the language learner must do this in a language-appropriate manner. Languages utilize some subset of over 800 phonemes, and this subset can range from 11 to 141 phonemes (Maddieson, 1984). As a result of this diversity, variance that is extra-linguistic in one language community may be pivotal for discovering the intended message of a speaker in another language environment.

 

Whereas infants show some evidence of native-language-appropriate parsing of the speech variance before their first birthday (Kuhl, 1983; Werker & Tees, 1984; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992), it is well-known that adult second-language (L2) learners have significant trouble producing and perceiving some non-native contrasts.  For example, Japanese speakers have well-known problems with the English /l/-/r/ contrast (Goto, 1971; Miyawaki et al., 1975; Strange & Jenkins, 1978).  For English listeners, the primary cue for this distinction in syllable-initial position is the onset frequency of the third formant (F3, O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957).  /l/ is associated with a higher F3 onset frequency.  There is now good evidence that Japanese speakers fail to use the informative F3 feature in producing or perceiving this distinction (Yamada & Tohkura, 1992; Gordon, Keyes, & Yung, 2001; Iverson et al., 2003).  Instead, they tend to rely on F2 variance, which is informative for a similar Japanese contrast but is not reliably related to the English contrast.  Thus, Japanese speakers weight the acoustic features in a non-optimal manner for identification of the English contrast.  Extensive perceptual and/or production training has led to limited improvements in the perceptual identification of difficult non-native contrasts.  In particular, it has been challenging to demonstrate learning that robustly generalizes past the particular stimuli used in training (Logan, Lively, & Pisoni, 1991; McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002).

 

The familiar /l/-/r/ example illuminates some of the complexities of phonemic acquisition that are often overlooked.  Because of the tradition of distinctive feature phonology in speech perception research (Jakobson & Halle, 1971), there is sometimes a tendency to think of speech sound categorization as a process of grouping the appropriate acoustic attributes into features that differentiate one phoneme from others.  However, there are an immense number of possible acoustic attributes that could be informative.  The first task for categorization is defining the attributes that vary in a structured way.  Imagine a listener presented with a 300-ms wide-band noise burst that belongs to a novel sound category.  The amplitude of energy in any frequency x time region could be a defining feature (e.g., high amplitude from 2000-2200 Hz in the 200-250 ms time slice), as could any change in energy across two time slices, or overall duration or intensity or any combination of these attributes.  As the listener receives more exemplars from the category, the perceptual system must determine what aspects of the noise burst covary with the category and what variance is irrelevant.  This is obviously an extreme case but the number of possible cues to a speech category (or any other complex auditory category) is daunting.  For example, the contrast between voiced and voiceless stop (e.g., /b/ vs. /p/) is signaled in part by duration of aspiration noise, duration of cutback of energy in the first formant (F1) region, fundamental frequency of the following vowel, F1-onset frequency, etc. (Summerfield & Haggard, 1977; Lisker, 1986).

 

To add to the complexity, there are few acoustic features that are necessary or sufficient for defining speech categories (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman, 1996).  Once informative features are extracted from the signal, they must be “weighted” to come to a perceptual decision, especially if there is a conflict amongst the cues (Jusczyk, 1993; Fox, Flege, & Munro, 1995; Lotto, Kluender, & Holt, 1997; Nearey, 1997; Nittrouer, 2002).  In the case of the English /l/-/r/ contrast, the starting frequency of F3 should carry most of the weight but Japanese listeners appear to heavily weight F2 starting frequency (Yamada & Tohkura, 1992).  This weighting strategy is clearly non-optimal as F2 is not varied contrastively for syllable-initial liquids by English speakers.  Non-optimal weighting strategies have also been demonstrated for native-language perception by children (Parnell & Amerman, 1978; Morrongiello, Robson, Best, & Clifton, 1984; Nittrouer, Crowther, & Miller, 1998; Nittrouer, Miller, Crowther, & Manhart, 2000).
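The idea of cue weighting can be made concrete with a small sketch. The logistic decision rule, the weight values, and the normalized cue scale below are illustrative assumptions and are not the model used in the project; they only show how the same conflicting token can be labeled differently depending on which cue carries the weight.

```python
import math


def prob_l(f2, f3, w_f2, w_f3, bias=0.0):
    """Probability of an /l/ response from a weighted sum of two cues.

    Cues are assumed to be normalized to the range -1..+1, with positive
    values pointing toward /l/ (e.g., a higher F3 onset frequency).
    """
    z = w_f2 * f2 + w_f3 * f3 + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical listeners that differ only in how they weight the two cues.
F3_WEIGHTED = {"w_f2": 0.2, "w_f3": 4.0}  # relies mostly on F3 onset
F2_WEIGHTED = {"w_f2": 4.0, "w_f3": 0.2}  # relies mostly on F2 onset

# A conflicting token: F3 points toward /l/ while F2 points the other way.
token = {"f2": -1.0, "f3": 1.0}

p_f3_listener = prob_l(**token, **F3_WEIGHTED)
p_f2_listener = prob_l(**token, **F2_WEIGHTED)
print(f"F3-weighted listener: P(/l/) = {p_f3_listener:.3f}")
print(f"F2-weighted listener: P(/l/) = {p_f2_listener:.3f}")
```

With these made-up weights, the F3-weighted listener labels the token /l/ with high probability, while the F2-weighted listener mostly labels the same token /r/, even though both receive identical acoustic input.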

 

The goal of the proposed project is to investigate the processes and variables involved in the dual tasks of discovering and weighting informative acoustic features for categorization of complex stimuli such as speech sounds.  In the past several years, the PIs have developed a multi-prong research program to investigate the categorization of complex sounds.  The empirical methodologies include: 1) training non-human animals to identify speech sounds (e.g., Kluender, Lotto, Holt, & Bloedel, 1998; Holt, Lotto, & Kluender, 2001); 2) measuring the productions and identification abilities of human adults learning a second language (e.g., Kim, Kluender, Lotto, & Reed, 1994; Kim & Lotto, 2002); 3) computational modeling of animal and human categorization data (Kluender et al., 1998); and 4) training adults on well-controlled non-speech categories (e.g., Lotto, 2000a; Lotto & Holt, 2001; Mirman, Holt, & McClelland, 2002; Holt, Lotto, & Diehl, in press).  (The development of this latter work has been funded by a previous NSF grant and is described in more detail below.)  The use of multiple methods provides more than mere converging evidence.  The approach provides a spectrum of stimulus control and ecological validity.  L2 studies provide an ecologically-valid and practical learning situation, but there is a lack of control over listeners’ experience with the speech sounds. Animal models permit us complete control over the experience of our participants and allow us to examine the effects of general auditory and learning processes on representations of the sounds.  Non-speech training studies provide ultimate control over the distribution of experienced input with human listeners, but the stimuli must necessarily be somewhat different from natural speech sounds.  These last two approaches provide known input distributions and fine-grained sampling of perceptual behavior, which creates an excellent testing ground for computational models of categorization.  
We have found that these strictly-controlled methodologies produce novel predictions that can be tested in the more ecologically-valid paradigms. 

 

Most of our previous efforts have been concentrated on the establishment of boundaries and decision criteria across maximally informative features of the stimuli (e.g., the first and second formants in a vowel identification task or center frequency of a band of noise).  Results of these experiments have provided insights into how operating characteristics of the auditory system interact with information coming from experienced stimuli to form perceptual categories.  In many cases, we found that our listeners were not categorizing stimuli in an optimal manner even for simple tasks.  It has become clear that the tasks of detecting informative features in complex acoustic signals and weighting them correctly are not easily accomplished and can serve as a bottleneck for categorization performance.  

 


 


National Institutes of Health R01: Perception of Dysarthric Speech

 

PI: Andrew J. Lotto

Co-PI: Julie M. Liss

$1,857,511

07/01/10-06/30/15

 

Our research program has been designed to develop a model of intelligibility deficits associated with the dysarthrias, with application to degraded speech in general. In particular, we have investigated those perturbations of the speech signal that are most deleterious to the listeners’ accurate comprehension of the intended message. Our work has produced converging evidence that listeners rely heavily on prosodic information to aid in the comprehension of degraded speech. Key to this finding is that listeners commonly apply the cognitive-perceptual strategy of relying on prosodic information to identify word boundaries when encountering a speech signal that contains impoverished segmental information. That is, when there is sufficient uncertainty about the identity of phonemes in connected speech, listeners shift their attention to prosodic cues to make predictions about where words begin and end (lexical segmentation). Phonemic ambiguities are then resolved within these word-delimited frames.

 

When listeners are faced with degraded speech that has reduced or abnormal prosodic variation, as in the case of the dysarthrias, their ability to use this information to facilitate lexical segmentation is challenged. Failure to properly segment the signal is difficult to overcome and typically leads to radical changes in the perceived message. Perhaps the most crucial observation from our work is that the different forms of impaired prosody (for example, across dysarthria subtypes) result in different patterns of perceptual outcomes (lexical boundary errors). In this way, not all intelligibility deficits are created equal. The differences in perceptual error patterns resulting from speech produced by two equally unintelligible speakers are predictable and provide information about both the underlying motor deficit and the perceptual representations and strategies of the listener.

 

In order to investigate these relationships between signal disruptions and intelligibility, we have performed extensive acoustic analyses on speech produced by a variety of speakers with dysarthria. We have also examined the perceptual error patterns obtained from normal-hearing listeners presented with dysarthric speech and with non-dysarthric speech that has been digitally altered to match some of the acoustic characteristics that we have measured in dysarthric speech. This multi-pronged approach is necessary to achieve our ultimate goal: a comprehensive model of intelligibility deficits that ties specific acoustic measures to particular motor deficits of the speaker (or degradations of the signal in the environment) and to particular representations and challenges for the listener.

 

As part of this effort we have developed measures of speech rhythm that can be automatically computed and appear to predict both the dysarthria subtype of the speaker and the percent-correct accuracy and word segmentation errors of the listener. The present proposal develops the most promising results of this work in a structured set of experiments with theoretical import and the potential for immediate clinical impact. Specifically, we will expand on preliminary findings that metrics of speech rhythmicity are remarkably predictive of listener performance. Indeed, we have found this to hold for segmental duration metrics in the temporal domain and for long-term modulations of amplitude envelopes within frequency bands in the spectral domain. Because of the ways in which dependent variables in these two domains map onto clinically and theoretically meaningful percepts, there is strong indication for their development as outcome measures.
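To give a sense of what an automatically computed segmental duration metric might look like, here is a minimal sketch. The text above does not name the specific metrics, so the two functions below (a coefficient of variation of interval durations and a normalized pairwise variability index) are stand-ins chosen as assumptions, and the duration values are invented.

```python
from statistics import mean, pstdev


def duration_cv(durations):
    """Coefficient of variation of interval durations, as a percentage."""
    return 100.0 * pstdev(durations) / mean(durations)


def pairwise_variability(durations):
    """Normalized pairwise variability index over successive intervals.

    Each adjacent pair contributes |a - b| divided by the pair's mean
    duration; the result is the average of those terms times 100.
    """
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * mean(terms)


# Hypothetical vocalic interval durations in milliseconds.
even_timing = [100, 100, 100, 100]        # no durational contrast
alternating_timing = [60, 140, 60, 140]   # strong long/short alternation

for label, durs in [("even", even_timing), ("alternating", alternating_timing)]:
    print(label, round(duration_cv(durs), 1), round(pairwise_variability(durs), 1))
```

Both measures are zero for the evenly timed sequence and grow as successive intervals contrast more strongly, which is the kind of property that lets a duration metric separate speakers with reduced durational contrast from those with typical rhythm.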

 

 


 


National Institutes of Health R01: Auditory and cognitive factors in speech perception and category learning

 

Award No. 5 R01 DC000427

PI: Randy Diehl

Collaborator: Andrew J. Lotto

$77,666 (subcontract)

09/01/09-06/30/11

 

The work to be performed at the University of Arizona under the subcontract to the University of Texas at Austin will be concerned mainly with the first two specific aims listed in the research plan for the grant entitled “Auditory and cognitive factors in speech perception and category learning” (R. Diehl, PI). Perceptual learning experiments will be conducted in which participants will be asked to label non-speech sounds (sampled from overlapping distributions) as members of arbitrary categories A or B. The accuracy of performance on these categorization tasks will be compared with ideal observer models and theoretical predictions as described in the original Research Plan. Stimulus creation, programming of experimental presentation software, collection of data, and analysis of accuracy and reaction time will be conducted at the University of Arizona. Data will be collected in the research laboratory of Dr. Andrew Lotto in the Department of Speech & Hearing Sciences under the management of Ms. Sarah Sullivan, M.A. General design of the experiments and comparisons of (de-identified) averaged data with specific model predictions will occur in collaboration with the PI at the University of Texas.

 

Below is the abstract submitted to the University of Arizona IRB for the portions of the project that will be conducted at Arizona under the subcontract:

 

A number of auditory tasks, including speech perception, require listeners to categorize stimuli on the basis of one or more features of the input. In many cases, especially speech, there is no one-to-one mapping between values along continuous features and discrete categories (e.g., phonemes). How then do perceptual systems categorize stimuli under uncertainty? One possible solution is to use probabilistic information from experienced stimulus distributions to optimize accuracy. We propose that perceivers incorporate distributional knowledge about the acoustic environment with the information provided by the signal in order to make optimal (i.e., maximal accuracy) categorical decisions. Statistical approaches such as this are widely used in vision research but are rarely applied to auditory or speech perception. The goal of this study is to develop a framework that will provide testable hypotheses about the nature of statistical (distributional) learning in auditory perception, in general, and speech perception, specifically. For this study, speech stimuli were intentionally avoided in order to simplify experimental designs and increase experimenter control. Numerous studies indicate that non-speech sounds can be perceived in a speech-like manner, and, in fact, Saffran and colleagues (1999) have demonstrated that listeners can learn the statistical properties of non-linguistic stimuli. The project is designed to investigate perceivers’ sensitivity to probabilistic distributional information. Recent research by several investigators indicates that infant and adult humans are sensitive to auditory statistical information (e.g., Saffran, Aslin, & Newport, 1996). However, this research has lacked a strong theoretical framework. 
The present study will independently manipulate statistical information such as distribution characteristics (e.g., shape, mean, and variance), stimulus features (e.g., center frequency), feedback (e.g., whether or not feedback is provided), and the number of dimensions included in the experiment to examine their individual effects on categorization. The stimuli will consist of noise bursts with varying acoustical characteristics. Participants will be asked to listen to the stimuli and identify which category each complex non-speech stimulus belongs to based on its sound characteristics. These responses will be collected via a computer game in which participants navigate through three-dimensional space and respond to animated characters corresponding to the different sound category distributions. Results from these studies will potentially answer numerous theoretical questions, such as whether humans are sensitive to auditory statistical information, whether participants’ behavior changes with training, and how closely participants’ responses match those of an ideal observer defined by the optimal decision strategy.
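The ideal-observer comparison described above can be sketched for the simplest case. The example assumes two categories whose stimuli are drawn from overlapping Gaussian distributions with equal variance along a single dimension; the means, standard deviation, and prior probabilities are hypothetical and are not the values used in the experiments.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def optimal_criterion(mu_a, mu_b, sigma, p_a=0.5):
    """Accuracy-maximizing boundary for equal-variance Gaussians (mu_a < mu_b).

    Stimuli below the boundary are labeled A, stimuli above it are labeled B.
    """
    p_b = 1.0 - p_a
    return (mu_a + mu_b) / 2.0 + sigma ** 2 * math.log(p_a / p_b) / (mu_b - mu_a)


def accuracy(criterion, mu_a, mu_b, sigma, p_a=0.5):
    """Expected proportion correct for an observer using the given boundary."""
    p_b = 1.0 - p_a
    hit_a = normal_cdf(criterion, mu_a, sigma)        # A stimuli labeled A
    hit_b = 1.0 - normal_cdf(criterion, mu_b, sigma)  # B stimuli labeled B
    return p_a * hit_a + p_b * hit_b


# Hypothetical center-frequency distributions (Hz) for categories A and B.
MU_A, MU_B, SIGMA = 1000.0, 1400.0, 200.0

best = optimal_criterion(MU_A, MU_B, SIGMA)
best_accuracy = accuracy(best, MU_A, MU_B, SIGMA)
biased = optimal_criterion(MU_A, MU_B, SIGMA, p_a=0.75)
print(f"equal priors: boundary {best:.0f} Hz, accuracy {best_accuracy:.3f}")
print(f"A three times as likely: boundary {biased:.0f} Hz")
```

Because the distributions overlap, even the ideal observer falls short of perfect accuracy, and the optimal boundary moves toward the less frequent category when the priors are unequal. A participant's observed boundary and accuracy can then be compared against these benchmarks.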

 

 


 


National Institutes of Health F31: High-frequency energy in speech and voice

 

PI: Brian Monson

Co-Sponsor: Andrew J. Lotto

$77,828

01/01/10-12/31/11

 

For years the major focus in speech acoustics has been on the frequency range below approximately 5 kHz. Human speech and the human voice generate acoustical energy up to 20 kHz. Evidence is accruing that high-frequency energy (energy above 5 kHz) in speech and voice contributes to percepts of quality, localization, and intelligibility.

The proposed research is intended to be an initial step in the long-range goal of characterizing high-frequency energy in speech, with particular regard for its perceptual role, its potential for modification during speech production, and its generation mechanism.

 

In this study, a database of high-fidelity recordings of singers and talkers will be used for both a broad acoustical analysis and general characterization of high-frequency energy, as well as specific characterization of phoneme category, speech intensity level, and mode of production by their high-frequency energy content. Directionality of radiation of high-frequency energy from the mouth will also be examined. The recordings will be used for perceptual experiments wherein listeners will be asked to discriminate between speech and voice samples that differ only in high-frequency energy content. Listeners will also be subjected to intelligibility-in-noise tasks with samples that have been modified only in high-frequency content. The combination of these experiments will reveal (1) the ability of human listeners to detect high-frequency energy modification, and (2) the phonetic value of high-frequency energy in speech.
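One simple way to quantify high-frequency energy content in a recording is the fraction of spectral energy at or above a cutoff. The sketch below uses the 5 kHz boundary mentioned above; the synthetic two-tone signal, the sampling rate, and the plain discrete Fourier transform are assumptions made for illustration and do not describe the project's actual analysis pipeline.

```python
import cmath
import math


def high_frequency_fraction(samples, sample_rate, cutoff_hz=5000.0):
    """Fraction of spectral energy at or above cutoff_hz (DC excluded).

    Uses a direct DFT over the positive-frequency bins, which is slow but
    dependency-free and adequate for short excerpts.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total


# Synthetic 10 ms test signal: a 1 kHz tone plus a weaker 8 kHz tone.
SAMPLE_RATE = 44100
N_SAMPLES = 441
signal = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 8000 * i / SAMPLE_RATE)
          for i in range(N_SAMPLES)]

fraction = high_frequency_fraction(signal, SAMPLE_RATE)
print(f"energy at or above 5 kHz: {fraction:.1%}")
```

The 8 kHz component has half the amplitude of the 1 kHz component, so it carries one quarter of that component's energy and one fifth of the total. Setting the high-frequency bins to zero before resynthesis would be one way to produce stimuli that differ only in high-frequency content.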

 

The relevance of this project to public health lies in its efforts to elucidate the effect on human communicative behavior when high-frequency energy in speech is lost or altered, which may be incurred by factors such as hearing loss, noisy environmental conditions, telephony, audio data compression (such as mp3 compression), electronic sound reinforcement, or sound recording and playback. Previous research has already shown that high-frequency energy affects speech intelligibility, word-learning in normal-hearing and hearing-impaired children, speech localization, and qualitative percepts of speech and voice (e.g., 'naturalness'). Thus, this project will provide particularly valuable insight regarding the need for representation of the high-frequency range in augmentative hearing devices, including hearing aids, cochlear implants, and auditory brainstem implants; the results of this project may also impact the evaluation and management of speech, voice, and language disorders, as well as the development of training techniques for the enhancement of speech and voice.

 

 

 

The University of Arizona

Tucson