CURRENT FUNDING

National Institutes of Health R01: Formation and tuning of complex auditory categories
PI: Dr. Andrew J. Lotto
Co-PI: Dr. Lori L. Holt
$1,597,659
12/01/05-11/30/10
To respond adaptively in a variety of
situations, organisms form and utilize perceptual categories or
functional equivalence classes. Categories can be induced from
distributions of sensory input experienced across varied but similar
contexts. However, the usefulness of perceptual categories based on a
large corpus of experiences may be limited when the distributional
characteristics of a particular setting differ drastically from the
norm. In these cases, reliance on stable long-term categories may be
inefficient or maladaptive. In order to perform optimally or
adaptively, organisms must dynamically “tune” categories to the
regularities of the current environment. For example, an animal may
distinguish friend and foe by the acoustic patterns of calls. The
category distinction may be established across exposure to the sounds
in many listening conditions. However, if the animal finds itself in an
acoustic environment that deviates from the norm (e.g., a lot of
low-frequency noise) then the categorization decision should be shifted
(e.g., increase the low-frequency energy needed to elicit a response or
greater “weighting” of high-frequency differences).
Another example of the need for category tuning comes from
speech perception. Phonetic identification of a speech sound based on
its acoustic properties can be considered an example of perceptual
categorization. Infants and second-language (L2) learners form phonetic
categories from distributions of experienced speech sounds (Jusczyk, 1997; Kuhl, 1993; Lotto, 2000). However, the acoustics of speech are
notoriously variable across speakers. Some of this variability is the
result of anatomical and physiological differences in the instrument of
speech production, such as the larger (and differently-proportioned)
vocal tracts of male vs. female speakers or perturbed articulatory
patterns resulting from stroke or dysarthria. Other variability is the
result of linguistic experience such as foreign accent and dialect, or
idiosyncratic patterns of speech. The result of all this variability is
that phonetic categories and decision bounds founded on experience
across a variety of talkers may produce mis-categorization in
application to any particular talker. Categories must be tuned
dynamically to the speech of the current talker either by changing the
representation of the individual sounds or influencing the relevant
phonetic category space. Within the field of speech perception, the
accommodation of talker-specific characteristics is referred to as
“talker normalization” (Johnson & Mullennix, 1997).
The problem of talker-specific acoustics has been a focus of
speech perception research since the beginning of the field (Potter & Steinberg, 1950). Much of the investigation of talker
normalization has been concentrated on compensating for anatomical
differences among speakers, such as gender differences. Although
differences in vocal tracts present a substantial challenge to
pattern-recognition approaches to speech perception, the variability
arising from these differences is relatively constrained. As a result,
some success has been attained by extracting less-variable ratios of frequency
components (Fujisaki & Kawashima, 1968; Syrdal
& Gopal, 1986; Traunmuller, 1981) or re-scaling speech based on vocal-tract
length (Nordstrom & Lindblom, 1975). A less constrained source of
variability is that arising from dialectal (speech patterns of a
community) and idiolectal (speech patterns of an individual)
differences between talkers. For example, when producing the same
phonetic segments, a non-native English speaker with an accent may use
a different range of values across an acoustic dimension than a native
speaker. Speakers may even systematically violate correlations among
dimensions ordinarily present in native speech. Despite vast
dialectal differences, it is a commonplace anecdotal experience that
non-native speakers become more intelligible the more experience one
has with their speech. How much tolerance do listeners have for
perturbations from established auditory categories? What mechanisms are
responsible for the adaptive tuning of category responses?
In a previous NIH-funded project, the PIs investigated the
formation of auditory categories defined by distributions of novel
complex sounds. The goals of that project were to design methods for
studying auditory categorization and to provide insights into categorization
of ecologically-valid stimuli such as speech sounds. One of the
conclusions arising from the studies was that brief exposure to
distributions of sounds can radically shift categorization of
subsequent stimuli. Listeners identify stimuli not just on the basis of
the long-term regularities of the training input but also on the basis
of short-term or “local” regularities. We refer to this as “tuning” the
category. Such tuning indicates that listeners adapt categories
dynamically.
The proposed project uses the stimulus sets and methods developed in the previous
project to examine how auditory categories become tuned to local
distributional information. The results of this investigation will have
clear implications for models of talker normalization. We have designed
a multi-level empirical approach that establishes a clear link between
the basic science questions (i.e., How are auditory categories
adaptively tuned?) and applications of the results (i.e., How is the
intelligibility of speech with a foreign accent or articulatory idiosyncrasies
altered by experience with the speaker?). The stimulus sets range from
more ecologically-valid but less controlled (natural and altered
speech) to more controlled but ecologically less-valid (shaped bursts
of noise) with some stimulus sets in the middle of the range (hybrid
non-speech/speech). The non-speech and hybrid stimulus sets provide the
control necessary to probe basic perceptual/cognitive processes of
category formation and tuning, whereas the speech stimuli allow a
direct test of the relevance of these basic findings to a “real-world”
perceptual problem. We have used each of these approaches with success
in our past research. The present project is designed to accomplish
three specific aims: 1) to test the extent of phonetic category tuning
for talker differences arising from dialectal and idiolectal
variation, 2) to test the limits of category response tuning as a
function of the spectral-temporal make-up of the local acoustic
context, and 3) to test the limits of category response tuning based on the
distributional statistics of the local context.

National Science Foundation: Collaborative Research:
Learning complex auditory categories
PI: Andrew J. Lotto
$149,982
04/01/08-12/31/10
Speech sounds are complex signals that vary across a
large number of temporal and spectral (energy by frequency)
characteristics. Some of this
acoustic variance is directly related to the intended message of the speaker. Other variance is extra-linguistic,
resulting from factors such as the particular structure of the speaker’s
vocal tract. A language learner must
parse the input variance to discriminate those contrasts that carry
information and to generalize across variation within a contrast that is
due to speaker characteristics, coarticulation, articulatory undershoot,
etc. Complicating this task is the fact that the language learner must do
this in a language-appropriate manner. Each language uses some subset of
over 800 phonemes, and the size of this subset can range from 11 to 141 phonemes (Maddieson,
1984). As
a result of this diversity, variance that is extra-linguistic in one
language community may be pivotal for discovering the intended message of a
speaker in another language environment.
Whereas infants show some evidence of
native-language-appropriate parsing of the speech variance before their
first birthday (Kuhl,
1983; Werker & Tees, 1984; Kuhl, Williams, Lacerda, Stevens, &
Lindblom, 1992), it is well known that
adult second-language (L2) learners have significant trouble producing and
perceiving some non-native contrasts.
For example, Japanese speakers have well-known problems with the
English /l/-/r/ contrast (Goto,
1971; Miyawaki et al., 1975; Strange & Jenkins, 1978). For English listeners, the primary cue
for this distinction in syllable-initial position is the onset frequency of
the third formant (F3;
O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957). /l/ is associated with a higher F3 onset
frequency. There is now good
evidence that Japanese speakers fail to use the informative F3 feature in
producing or perceiving this distinction (Yamada
& Tohkura, 1992; Gordon, Keyes, & Yung, 2001; Iverson et al., 2003). Instead, they tend to rely on F2
variance, which is informative for a similar Japanese contrast but is not
reliably related to the English contrast.
Thus, Japanese speakers weight the acoustic features in a
non-optimal manner for identification of the English contrast. Extensive perceptual and/or production
training has led to limited improvements in the perceptual identification
of difficult non-native contrasts.
In particular, it has been challenging to demonstrate learning that
robustly generalizes past the particular stimuli used in training (Logan,
Lively, & Pisoni, 1991; McCandliss, Fiez, Protopapas, Conway, &
McClelland, 2002).
The familiar /l/-/r/ example illuminates some of the
complexities of phonemic acquisition that are often overlooked. Because of the tradition of distinctive
feature phonology in speech perception research (Jakobson
& Halle, 1971),
there is sometimes a tendency to think of speech sound categorization as a
process of grouping the appropriate acoustic attributes into features that
differentiate one phoneme from others.
However, there are an immense number of possible acoustic attributes
that could be informative. The first
task for categorization is defining the attributes that vary in a
structured way. Imagine a listener
presented with a 300-ms wide-band noise burst that belongs to a novel sound
category. The amplitude of energy in
any frequency x time region could be a defining feature (e.g., high
amplitude from 2000-2200 Hz in the 200-250 ms time slice), as could any
change in energy across two time slices, or overall duration or intensity
or any combination of these attributes.
As the listener receives more exemplars from the category, the
perceptual system must determine what aspects of the noise burst covary
with the category and what variance is irrelevant. This is obviously an extreme case, but the
number of possible cues to a speech category (or any other complex auditory
category) is daunting. For example,
the contrast between voiced and voiceless stops (e.g., /b/ vs. /p/) is
signaled in part by duration of aspiration noise, duration of cutback of
energy in the first formant (F1) region, fundamental frequency of the
following vowel, F1-onset frequency, etc. (Summerfield
& Haggard, 1977; Lisker, 1986).
To add to the complexity, there are few acoustic
features that are necessary or sufficient for defining speech categories (Liberman,
Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman, 1996). Once informative features are extracted
from the signal, they must be “weighted” to come to a perceptual decision,
especially if there is a conflict amongst the cues (Jusczyk,
1993; Fox, Flege, & Munro, 1995; Lotto, Kluender, & Holt, 1997;
Nearey, 1997; Nittrouer, 2002). In the case of the English /l/-/r/
contrast, the starting frequency of F3 should carry most of the weight but
Japanese listeners appear to heavily weight F2 starting frequency (Yamada
& Tohkura, 1992). This weighting strategy is clearly
non-optimal as F2 is not varied contrastively for syllable-initial liquids
by English speakers. Non-optimal
weighting strategies have also been demonstrated for native-language
perception by children (Parnell
& Amerman, 1978; Morrongiello, Robson, Best, & Clifton, 1984;
Nittrouer, Crowther, & Miller, 1998; Nittrouer, Miller, Crowther, &
Manhart, 2000).
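The notion of cue weighting discussed above can be illustrated with a toy simulation. The sketch below is not the PIs' analysis: it assumes a simple logistic listener, a hypothetical 7 x 7 grid of F3 and F2 onset values (the Hz values are placeholders), and made-up internal weights. It estimates each simulated listener's relative reliance on F3 versus F2 from identification responses, with a higher F3 onset coded as favoring /l/.

```python
import math
import random


def zscore(values):
    """Standardize a cue so that fitted coefficients are comparable across cues."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_listener(f3_z, f2_z, w_f3, w_f2, rng):
    """Draw responses (1 = /l/, 0 = /r/) from a hypothetical logistic listener."""
    return [1 if rng.random() < sigmoid(w_f3 * a + w_f2 * b) else 0
            for a, b in zip(f3_z, f2_z)]


def fit_cue_weights(f3_z, f2_z, responses, steps=1000, rate=0.5):
    """Fit a logistic model by batch gradient descent; return normalized |weights|."""
    w3 = w2 = bias = 0.0
    n = len(responses)
    for _ in range(steps):
        g3 = g2 = gb = 0.0
        for a, b, y in zip(f3_z, f2_z, responses):
            err = sigmoid(w3 * a + w2 * b + bias) - y
            g3 += err * a
            g2 += err * b
            gb += err
        w3 -= rate * g3 / n
        w2 -= rate * g2 / n
        bias -= rate * gb / n
    total = abs(w3) + abs(w2)
    return {"F3": abs(w3) / total, "F2": abs(w2) / total}


# Hypothetical orthogonal stimulus grid, 10 repetitions per token.
f3_steps = [1600 + 200 * i for i in range(7)]  # Hz, illustrative values only
f2_steps = [900 + 100 * i for i in range(7)]   # Hz, illustrative values only
grid = [(f3, f2) for f3 in f3_steps for f2 in f2_steps] * 10
f3_z = zscore([f3 for f3, _ in grid])
f2_z = zscore([f2 for _, f2 in grid])

rng = random.Random(0)
# Assumed internal weights: one listener relies mostly on F3, the other on F2.
f3_listener = fit_cue_weights(f3_z, f2_z, simulate_listener(f3_z, f2_z, 3.0, 0.3, rng))
f2_listener = fit_cue_weights(f3_z, f2_z, simulate_listener(f3_z, f2_z, 0.3, 3.0, rng))
print("F3-reliant listener:", f3_listener)
print("F2-reliant listener:", f2_listener)
```

Under these assumptions, the first listener's normalized F3 weight should dominate and the second listener's F2 weight should dominate, which is the kind of difference in weighting strategy described for English and Japanese listeners.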
The goal of the proposed project is to investigate the
processes and variables involved in the dual tasks of discovering and
weighting informative acoustic features for categorization of complex
stimuli such as speech sounds. In
the past several years, the PIs have developed a multi-pronged research
program to investigate the categorization of complex sounds. The empirical methodologies include: 1)
training non-human animals to identify speech sounds (e.g.,
Kluender, Lotto, Holt, & Bloedel, 1998; Holt, Lotto, & Kluender,
2001); 2)
measuring the productions and identification abilities of human adults
learning a second language (e.g.,
Kim, Kluender, Lotto, & Reed, 1994; Kim & Lotto, 2002); 3)
computational modeling of animal and human categorization data (Kluender
et al., 1998); and 4) training adults on
well-controlled non-speech categories (e.g.,
Lotto, 2000a; Lotto & Holt, 2001; Mirman, Holt, & McClelland, 2002;
Holt, Lotto, & Diehl, in press). (The development of this latter work has
been funded by a previous NSF grant and is described in more detail
below.) The use of multiple methods
provides more than mere converging evidence. The approach provides a spectrum of
stimulus control and ecological validity.
L2 studies provide an ecologically-valid and practical learning situation,
but there is a lack of control over listeners’ experience with the speech
sounds. Animal models permit us complete control over the experience of our
participants and allow us to examine the effects of general auditory and
learning processes on representations of the sounds. Non-speech training studies provide
ultimate control over the distribution of experienced input with human
listeners, but the stimuli must necessarily be somewhat different from
natural speech sounds. These last
two approaches provide known input distributions and fine-grained sampling
of perceptual behavior, which creates an excellent testing ground for
computational models of categorization.
We have found that these strictly controlled methodologies produce
novel predictions that can be tested in the more ecologically-valid
paradigms.
Most of our previous efforts have been concentrated on
the establishment of boundaries and decision criteria across maximally
informative features of the stimuli (e.g., the first and second formants in
a vowel identification task or center frequency of a band of noise). Results of these experiments have
provided insights into how operating characteristics of the auditory system
interact with information coming from experienced stimuli to form
perceptual categories. In many
cases, we found that our listeners were not categorizing stimuli in an
optimal manner even for simple tasks.
It has become clear that the tasks of detecting informative features
in complex acoustic signals and weighting them correctly are not easily
accomplished and can serve as a bottleneck for categorization performance.

Mayo Clinic Arizona
Research Committee CR5: High density electroencephalographic (EEG)
event-related bandpower as a biomarker for disordered speech perception
PI: John Caviness
Co-PI: Julie M. Liss
Collaborator: Andrew J. Lotto
$94,481
02/08-02/10
This collaborative project
between UA (Dr. Lotto’s Auditory Cognitive Science Lab), ASU (Dr. Julie Liss’
Motor Speech Disorders Lab), and the Mayo Clinic in Scottsdale (Dr. John
Caviness) seeks to merge three strong independent lines of research with
emerging state-of-the-art functional brain mapping technology. Our ultimate goal is to define the temporal-spatial
cortical activation associated with the perception of intelligible
speech. The topic of the neural
correlates of speech processing is of broad interest, with both
contemporary clinical and basic science implications. The recent surge of
papers on the topic includes results of functional neuroimaging studies
that define the anatomical substrates of speech processing. However, the temporal sequence of
activation of these structures, an aspect critical to processing degraded
speech like dysarthria, remains to be established. We are in a unique position to tackle
this question because of our constellation of expertise in perceptual
processing of normal and disordered speech and state-of-the-art
electroencephalography (EEG) mapping.
This project will incorporate behavioral data on speech perception
and production with neurophysiologic data to develop and test a model of
how listeners map from auditory to semantic representations. The novel EEG
modeling techniques along with work on speech perception at the phonemic
and sentence level will be the basis for a unique research program with
theoretical and clinical impact.
The term “dysarthria” refers
to a class of motor speech disorders resulting from damage to the central
and/or peripheral nervous systems. Findings from Dr. Liss’ lab have shown
that different forms of dysarthria present specific perceptual processing
challenges to listeners. That is, the nature of the speech deficit (i.e.,
whether the speech is produced too fast or slow, with imprecise
articulation, without normal pitch and loudness contours, etc.) affects the
effectiveness of the perceptual strategies listeners use to decipher what
is being said. This finding has substantial clinical implications,
particularly for the development of efficacious treatments. But beyond this, our behavioral data
yield theoretical predictions that bear directly on emerging models of the
neural correlates of speech perception and speech intelligibility. Current
models have delineated the cortical structures involved in speech
processing. However, testing
predictions about the perception of disordered speech requires a paradigm
that allows for the specification of the sequence and time-course over
which critical structures are activated. In the present proposal, we will
use the paradigm of event-related spectral
desynchronization-synchronization (ERD/ERS) as measured with EEG to map the
location and time-course of neural activation associated with the auditory
processing of dysarthric speech. The EEG ERD/ERS is an established
correlate of increased brain activation or idling, depending on the
direction of the spectral change.
Application of the ERD/ERS measure to speech intelligibility work is
novel. The paradigm of ERD/ERS can
be applied to our intelligibility experiments with relatively inexpensive
additions to the equipment already operational in the clinical movement
disorders neurophysiology laboratory at Mayo Clinic-Arizona (MCA). Results will form the pilot data for an
extensive investigation of the relationship between speech disorder
characteristics and the associated neurocognitive processing.
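As a rough illustration of the ERD/ERS measure, the sketch below uses the commonly cited percentage definition, in which band power in a test interval is expressed relative to a pre-stimulus reference interval. The function name, the window convention, and the toy data are assumptions for illustration; the project's actual analysis pipeline is not specified here.

```python
def erd_percent(trials, ref_window, test_window):
    """Percentage band-power change in test_window relative to ref_window.

    trials: list of equal-length lists of band-pass-filtered EEG samples.
    Windows are (start, stop) sample indices, with stop exclusive.
    Negative values indicate ERD (power decrease); positive values indicate ERS.
    """
    n_samples = len(trials[0])
    # Square each sample to get instantaneous power, then average over trials.
    power = [sum(trial[i] ** 2 for trial in trials) / len(trials)
             for i in range(n_samples)]
    ref = sum(power[ref_window[0]:ref_window[1]]) / (ref_window[1] - ref_window[0])
    act = sum(power[test_window[0]:test_window[1]]) / (test_window[1] - test_window[0])
    return 100.0 * (act - ref) / ref


# Toy data: amplitude drops from 2 to 1 after the (hypothetical) event at sample 4.
toy_trials = [[2, -2, 2, -2, 1, -1, 1, -1]] * 3
change = erd_percent(toy_trials, ref_window=(0, 4), test_window=(4, 8))
print(change)  # -75.0, i.e., a 75% power decrease (ERD)
```

Halving the amplitude quarters the power, so the toy example yields a change of -75%.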
Our long-term goal is to establish speech pattern
(dysarthria) specific biomarkers of intelligibility. Our plans for an extramural application
to support this work require a convincing demonstration of the technical
feasibility of our paradigm, as well as a compelling case for the
interpretability of the data relative to different forms and severities of
dysarthria. Goal 1, Technical. Demonstrate the ability of high
resolution ERD to map the time course of cortical activation in the
perception of different types and severities of dysarthric speech. Goal 2, Interpretive. Determine the task(s) which yield the
most systematic and interpretable pattern of results relative to dysarthria
type and severity (intelligibility).
Tasks include a) passive listening for meaning, b) lexical decision,
and c) phrase repetition.
Current models of the neural correlates of speech
processing provide a framework for an array of expected findings. In broad
terms, we expect that the sequence and time-course of cortical activation
will differ between the perception of dysarthric speech and intact speech
with an earlier and stronger recruitment of the dorsal activation pathway
(inferior parietal and frontal systems), reflecting increased cognitive
load. We expect dysarthria-specific
patterns will emerge, which reflect the nature of the acoustic components
of the degraded signal. For example,
phrase-length speech associated with reductions in pitch variation
(especially Parkinsonian dysarthria) will elicit greater activation of
right hemisphere cortical structures than those types of dysarthria with
pitch variation intact. Finally, we expect
task-specific differences in the time-course and sequence of cortical
activation, with differences among tasks helping to distinguish among dysarthria
patterns and severities.
This work is novel and significant. Although the project aims at one class of
disordered speech (the dysarthrias), the principles discovered herein would
apply more generally to all forms of speech disorders. As such, the study holds promise for
explaining the perceptual basis of intelligibility deficits across
communication disorders, and for pointing toward more efficient treatment
paradigms. The work also serves as
an ecologically valid test case for the perception of degraded speech, in
general, which will be of interest to a broad range of scientists in
neuropsychology, cognitive psychology, and neurolinguistics.

National Institutes of Health-NIDCD R01: Optimizing
amplification for infants and young children
PI: Patricia G. Stelmachowicz
Co-Investigator: Andrew J. Lotto
12/01/04-11/30/09
Recent studies have shown
that children with hearing loss who are identified through universal
newborn hearing screening programs are not as delayed in speech and
language development as children who are identified at later ages. It
appears, however, that even with early identification and intervention
(including amplification), these children are still delayed relative to
children with normal hearing. In the current proposal, it is hypothesized
that these persistent delays are the result of reduced auditory access and
limited auditory experiences. Specifically, one consequence of congenital
hearing loss is limited auditory access to speech. Reduced auditory
experience in infancy may compromise auditory perceptual foundations upon
which later language stages are constructed. It is critical to determine
the constellation of auditory factors that support early learning and the
experiences that facilitate continued language development throughout
childhood. The overall goal of this project is to explore ways in which to
enhance auditory access and auditory experiences in young children with
hearing loss. Current hearing instruments and other assistive listening
devices appear to be incapable of fully compensating for the perceptual
degradation of hearing loss. In addition, the negative influence of factors
such as distance, noise, and reverberation is magnified for children with
hearing loss, thus reducing the number and quality of auditory experiences.
Two areas associated with reduced auditory access for children with hearing
loss will be investigated in the studies described in this proposal. First,
the influence of selected forms of advanced signal processing on speech
perception, speech production, novel-word learning, and ease of listening
will be explored. Second, experiments will be conducted to determine
whether the quality and quantity of auditory experiences can be enhanced
for the purpose of accelerating auditory skill development and adaptation
to new signal-processing algorithms. In combination, these studies
potentially could result in the development of alternative intervention
strategies leading to more successful speech and language outcomes for
children with hearing loss.