CURRENT FUNDING

National Institutes of Health R01: Formation and tuning of complex auditory categories
Award No. 5 R01 DC004674
PI: Dr. Andrew J. Lotto
Co-PI: Dr. Lori L. Holt
$1,597,659
12/01/05-11/30/10
To respond
adaptively in a variety of situations, organisms form and utilize
perceptual categories or functional equivalence classes. Categories can
be induced from distributions of sensory input experienced across
varied but similar contexts. However, the usefulness of perceptual
categories based on a large corpus of experiences may be limited when
the distributional characteristics of a particular setting differ
drastically from the norm. In these cases, reliance on stable long-term
categories may be inefficient or maladaptive. In order to perform
optimally or adaptively, organisms must dynamically “tune” categories
to the regularities of the current environment. For example, an animal
may distinguish friend and foe by the acoustic patterns of calls. The
category distinction may be established across exposure to the sounds
in many listening conditions. However, if the animal finds itself in an
acoustic environment that deviates from the norm (e.g., a lot of
low-frequency noise), then the categorization decision should be shifted
(e.g., by increasing the low-frequency energy needed to elicit a response or
by giving greater “weighting” to high-frequency differences).
Another example of the need for category
tuning comes from speech perception. Phonetic identification of a
speech sound based on its acoustic properties can be considered an
example of perceptual categorization. Infants and second-language (L2)
learners form phonetic categories from distributions of experienced
speech sounds (Jusczyk, 1997; Kuhl, 1993; Lotto, 2000). However, the acoustics of speech are
notoriously variable across speakers. Some of this variability is the
result of anatomical and physiological differences in the instrument of
speech production, such as the larger (and differently-proportioned)
vocal tracts of male vs. female speakers or perturbed articulatory patterns resulting from stroke or dysarthria. Other variability is the result of
linguistic experience such as foreign accent and dialect, or
idiosyncratic patterns of speech. The result of all this variability is
that phonetic categories and decision bounds founded on experience
across a variety of talkers may produce mis-categorization
in application to any particular talker. Categories must be tuned
dynamically to the speech of the current talker either by changing the
representation of the individual sounds or influencing the relevant
phonetic category space. Within the field of speech perception, the
accommodation of talker-specific characteristics is referred to as
“talker normalization” (Johnson & Mullennix,
1997).
The problem of talker-specific acoustics
has been a focus of speech perception research since the beginning of
the field (Potter & Steinberg, 1950). Much of the investigation of talker
normalization has been concentrated on compensating for anatomical
differences among speakers, such as gender differences. Although
differences in vocal tracts present a substantial challenge to
pattern-recognition approaches to speech perception, the variability
arising from these differences is relatively constrained. As a result,
some success has been attained by extracting less-variable ratios of
frequency components (Fujisaki & Kawashima, 1968; Syrdal & Gopal, 1986;
Traunmuller, 1981) or re-scaling speech based on vocal-tract
length (Nordstrom & Lindblom,
1975). A less constrained source of
variability is that arising from dialectal (speech patterns of a
community) and idiolectal (speech patterns of
an individual) differences between talkers. For example, when producing
the same phonetic segments, a non-native English speaker with an accent
may use a different range of values across an acoustic dimension than a
native speaker. Speakers may even systematically violate correlations
among dimensions ordinarily present in native speech. Despite vast
dialectal differences, it is a commonplace anecdotal experience that
non-native speakers become more intelligible the more experience one
has with their speech. How much tolerance do listeners have for
perturbations from established auditory categories? What mechanisms are
responsible for the adaptive tuning of category responses?
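The ratio-based approach to anatomical variability mentioned above can be illustrated with a minimal sketch. It assumes, as a simplification, that a change in vocal-tract length scales every formant by the same factor; the formant values and the 20% scale factor are hypothetical and are not data from the project or from the cited studies.

```python
# Hypothetical formant frequencies (Hz) for one vowel; illustrative only.
reference_formants = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0}


def scale_formants(formants, factor):
    """Scale every formant by the same factor (a crude stand-in for a
    shorter or longer vocal tract)."""
    return {name: hz * factor for name, hz in formants.items()}


def formant_ratios(formants):
    """Return ratios of adjacent formants; a uniform scale factor cancels."""
    return (formants["F2"] / formants["F1"], formants["F3"] / formants["F2"])


# Assumed 20% upward shift of all formants for a shorter vocal tract.
shorter_tract = scale_formants(reference_formants, 1.2)

ratios_reference = formant_ratios(reference_formants)
ratios_shorter = formant_ratios(shorter_tract)
```

Under this assumption the absolute formant frequencies differ between the two talkers, but the ratios are identical. Dialectal and idiolectal differences need not follow any such uniform scaling, which is why they are harder to normalize in this way.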
In a previous NIH-funded project, the PIs
investigated the formation of auditory categories defined by
distributions of novel complex sounds. The goals of that project were
to design methods for studying auditory categorization and to provide
insights into categorization of ecologically-valid stimuli such as
speech sounds. One of the conclusions arising from the studies was that
brief exposure to distributions of sounds can radically shift
categorization of subsequent stimuli. Listeners identify stimuli not
just on the basis of the long-term regularities of the training input
but also on the basis of short-term or “local” regularities. We refer
to this as “tuning” the category. Such tuning indicates that listeners
adapt categories dynamically.
The proposed project uses the stimulus sets and methods developed in the
previous project to examine how auditory categories become tuned to
local distributional information. The results of this investigation
will have clear implications for models of talker normalization. We
have designed a multi-level empirical approach that provides a clear link
between the basic science questions (e.g., How are auditory categories
adaptively tuned?) and applications of the results (e.g., How is the
intelligibility of foreign-accented speech or speech with articulatory
idiosyncrasies altered by experience with the speaker?). The stimulus
sets range from more ecologically-valid but less controlled (natural
and altered speech) to more controlled but ecologically less-valid
(shaped bursts of noise) with some stimulus sets in the middle of the
range (hybrid non-speech/speech). The non-speech and hybrid stimulus
sets provide the control necessary to probe basic perceptual/cognitive
processes of category formation and tuning, whereas the speech stimuli
allow a direct test of the relevance of these basic findings to a
“real-world” perceptual problem. We have used each of these approaches
with success in our past research. The present project is designed to
accomplish three specific aims: 1) to test the extent of phonetic
category tuning for talker variability arising from dialectal and idiolectal differences; 2) to test the limits of
category response tuning as a function of the spectral-temporal make-up
of the local acoustic context; and 3) to test the limits of category
response tuning based on the distributional statistics of the local
context.

National Science Foundation:
Collaborative Research: Learning complex auditory categories
Award No. 0746019
PI: Andrew J. Lotto
$149,980
04/01/08-09/30/11
Speech sounds are complex signals that
vary across a large number of temporal and spectral (energy by frequency)
characteristics. Some of this
acoustic variance is directly related to the intended message of the
speaker. Other variance is
extra-linguistic, resulting from factors such as the particular structure of
the speaker’s vocal tract. A
language learner must parse the input variance to discriminate those
contrasts that carry information and to generalize across variation within
a contrast that is due to speaker characteristics, coarticulation,
articulatory undershoot, etc. Complicating this
task is the fact that the language learner must do this in a
language-appropriate manner. Languages utilize some subset of over 800
phonemes, and this subset can range from 11 to 141 phonemes (Maddieson, 1984).
As a result of this diversity, variance that is extra-linguistic in one
language community may be pivotal for discovering the intended message of a
speaker in another language environment.
Whereas infants show some evidence of
native-language-appropriate parsing of the speech variance before their
first birthday (Kuhl, 1983; Werker &
Tees, 1984; Kuhl, Williams, Lacerda,
Stevens, & Lindblom, 1992),
it is well-known that adult second-language (L2) learners have significant
trouble producing and perceiving some non-native contrasts. For example, Japanese speakers have
well-known problems with the English /l/-/r/ contrast (Goto, 1971; Miyawaki et al.,
1975; Strange & Jenkins, 1978). For English listeners, the primary cue
for this distinction in syllable-initial position is the onset frequency of
the third formant (F3,
O'Connor, Gerstman, Liberman,
Delattre, & Cooper, 1957). /l/ is associated with a higher F3 onset
frequency. There is now good
evidence that Japanese speakers fail to use the informative F3 feature in
producing or perceiving this distinction (Yamada
& Tohkura, 1992; Gordon, Keyes, & Yung,
2001; Iverson et al., 2003). Instead, they tend to rely on F2
variance, which is informative for a similar Japanese contrast but is not
reliably related to the English contrast.
Thus, Japanese speakers weight the acoustic features in a
non-optimal manner for identification of the English contrast. Extensive perceptual and/or production
training has led to limited improvements in the perceptual identification
of difficult non-native contrasts.
In particular, it has been challenging to demonstrate learning that
robustly generalizes past the particular stimuli used in training (Logan,
Lively, & Pisoni, 1991; McCandliss,
Fiez, Protopapas,
Conway, & McClelland, 2002).
The familiar /l/-/r/ example
illuminates some of the complexities of phonemic acquisition that are often
overlooked. Because of the tradition
of distinctive feature phonology in speech perception research (Jakobson & Halle, 1971),
there is sometimes a tendency to think of speech sound categorization as a
process of grouping the appropriate acoustic attributes into features that
differentiate one phoneme from others.
However, there are an immense number of possible acoustic attributes
that could be informative. The first
task for categorization is defining the attributes that vary in a
structured way. Imagine a listener
presented with a 300-ms wide-band noise burst that belongs to a novel sound
category. The amplitude of energy in
any frequency x time region could be a defining feature (e.g., high
amplitude from 2000-2200 Hz in the 200-250 ms time slice), as could any
change in energy across two time slices, or overall duration or intensity
or any combination of these attributes.
As the listener receives more exemplars from the category, the
perceptual system must determine what aspects of the noise burst covary with the category and what variance is
irrelevant. This is obviously an
extreme case but the number of possible cues to a speech category (or any other
complex auditory category) is daunting.
For example, the contrast between voiced and voiceless stops (e.g.,
/b/ vs. /p/) is signaled in part by duration of aspiration noise, duration
of cutback of energy in the first formant (F1) region, fundamental frequency
of the following vowel, F1-onset frequency, etc. (Summerfield
& Haggard, 1977; Lisker, 1986).
To add to the complexity, there are few
acoustic features that are necessary or sufficient for defining speech
categories (Liberman, Cooper, Shankweiler,
& Studdert-Kennedy, 1967; Liberman,
1996). Once informative features are extracted
from the signal, they must be “weighted” to come to a perceptual decision,
especially if there is a conflict amongst the cues (Jusczyk, 1993; Fox, Flege,
& Munro, 1995; Lotto, Kluender, & Holt,
1997; Nearey, 1997; Nittrouer,
2002). In the case of the English /l/-/r/
contrast, the starting frequency of F3 should carry most of the weight but
Japanese listeners appear to heavily weight F2 starting frequency (Yamada
& Tohkura, 1992). This weighting strategy is clearly
non-optimal as F2 is not varied contrastively for syllable-initial liquids
by English speakers. Non-optimal
weighting strategies have also been demonstrated for native-language
perception by children (Parnell
& Amerman, 1978; Morrongiello,
Robson, Best, & Clifton, 1984; Nittrouer, Crowther, & Miller, 1998; Nittrouer,
Miller, Crowther, & Manhart,
2000).
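The idea of cue weighting discussed above can be made concrete with a small sketch. It simulates hypothetical /l/ and /r/ tokens in which F3 onset separates the categories and F2 onset does not, then fits a logistic regression and normalizes the coefficients into relative cue weights. The simulated values, the function names, and the choice of logistic regression are assumptions made for illustration; this is not presented as the analysis used in the studies cited.

```python
import math
import random


def simulate_tokens(n_per_category, seed=0):
    """Generate hypothetical /l/ (label 1) and /r/ (label 0) tokens.

    Cue values are in arbitrary standardized units. F3 onset differs
    between the categories; F2 onset has the same distribution in both.
    """
    rng = random.Random(seed)
    tokens = []
    for label, f3_mean in ((1, 1.0), (0, -1.0)):
        for _ in range(n_per_category):
            f3 = rng.gauss(f3_mean, 1.0)
            f2 = rng.gauss(0.0, 1.0)
            tokens.append((f3, f2, label))
    return tokens


def fit_cue_weights(tokens, learning_rate=0.1, n_iterations=500):
    """Fit logistic-regression coefficients by batch gradient descent."""
    w_f3 = w_f2 = bias = 0.0
    n = len(tokens)
    for _ in range(n_iterations):
        grad_f3 = grad_f2 = grad_bias = 0.0
        for f3, f2, label in tokens:
            p = 1.0 / (1.0 + math.exp(-(w_f3 * f3 + w_f2 * f2 + bias)))
            error = p - label
            grad_f3 += error * f3
            grad_f2 += error * f2
            grad_bias += error
        w_f3 -= learning_rate * grad_f3 / n
        w_f2 -= learning_rate * grad_f2 / n
        bias -= learning_rate * grad_bias / n
    return w_f3, w_f2, bias


tokens = simulate_tokens(200)
w_f3, w_f2, bias = fit_cue_weights(tokens)

# Relative cue weights: each coefficient's share of the total magnitude.
total = abs(w_f3) + abs(w_f2)
relative_weights = (abs(w_f3) / total, abs(w_f2) / total)
print(f"F3 weight: {relative_weights[0]:.2f}, F2 weight: {relative_weights[1]:.2f}")
```

A learner whose weights match these input statistics relies mostly on F3. A listener who instead places most of the weight on F2 would be weighting the cues non-optimally for this simulated contrast, which is the pattern described above for Japanese listeners.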
The goal of the proposed project is to
investigate the processes and variables involved in the dual tasks of
discovering and weighting informative acoustic features for categorization
of complex stimuli such as speech sounds.
In the past several years, the PIs have developed a multi-pronged
research program to investigate the categorization of complex sounds. The empirical methodologies include: 1)
training non-human animals to identify speech sounds (e.g.,
Kluender, Lotto, Holt, & Bloedel,
1998; Holt, Lotto, & Kluender, 2001);
2) measuring the productions and identification abilities of human adults
learning a second language (e.g.,
Kim, Kluender, Lotto, & Reed, 1994; Kim &
Lotto, 2002);
3) computational modeling of animal and human categorization data (Kluender et al., 1998);
and 4) training adults on well-controlled non-speech categories (e.g.,
Lotto, 2000a; Lotto & Holt, 2001; Mirman,
Holt, & McClelland, 2002; Holt, Lotto, & Diehl, in press). (The development of this latter work has
been funded by a previous NSF grant and is described in more detail
below.) The use of multiple methods
provides more than mere converging evidence. The approach provides a spectrum of
stimulus control and ecological validity.
L2 studies provide an ecologically-valid and practical learning
situation, but there is a lack of control over listeners’ experience with
the speech sounds. Animal models permit us complete control over the
experience of our participants and allow us to examine the effects of
general auditory and learning processes on representations of the
sounds. Non-speech training studies
provide ultimate control over the distribution of experienced input with human
listeners, but the stimuli must necessarily be somewhat different from
natural speech sounds. These last
two approaches provide known input distributions and fine-grained sampling
of perceptual behavior, which creates an excellent testing ground for computational
models of categorization. We have
found that these strictly-controlled methodologies produce novel
predictions that can be tested in the more ecologically-valid paradigms.
Most of our previous efforts have been
concentrated on the establishment of boundaries and decision criteria
across maximally informative features of the stimuli (e.g., the first and
second formants in a vowel identification task or center frequency of a
band of noise). Results of these
experiments have provided insights into how operating characteristics of
the auditory system interact with information coming from experienced
stimuli to form perceptual categories.
In many cases, we found that our listeners were not categorizing stimuli
in an optimal manner even for simple tasks.
It has become clear that the tasks of detecting informative features
in complex acoustic signals and weighting them correctly are not easily
accomplished and can serve as a bottleneck for categorization performance.

National Institutes of Health R01: Perception of Dysarthric Speech
PI: Andrew J. Lotto
Co-PI: Julie M. Liss
$1,857,511
07/01/10-06/30/15
Our research program has been designed to
develop a model of intelligibility deficits associated with the dysarthrias, with application to degraded speech in
general. In particular, we have investigated those perturbations of the
speech signal that are most deleterious to the listeners’ accurate
comprehension of the intended message. Our work has produced converging
evidence that listeners rely heavily on prosodic information to aid in the
comprehension of degraded speech. Key to this finding is that listeners
commonly apply the cognitive-perceptual strategy of relying on prosodic
information to identify word boundaries when encountering a speech signal
that contains impoverished segmental information. That is, when there is
sufficient uncertainty about the identity of phonemes in connected speech,
listeners shift their attention to prosodic cues to make predictions about
where words begin and end (lexical segmentation). Phonemic ambiguities are
then resolved within these word-delimited frames.
When listeners are faced with degraded
speech that has reduced or abnormal prosodic variation—as in the case of
the dysarthrias—their ability to use this
information to facilitate lexical segmentation is challenged. Failure to
properly segment the signal is difficult to overcome and typically leads to
radical changes in the perceived message. Perhaps the most crucial
observation from our work is that the different forms of impaired prosody
(for example, across dysarthria subtypes) result
in different patterns of perceptual outcomes (lexical boundary errors). In
this way, not all intelligibility deficits are created equal. The
differences in perceptual error patterns resulting from speech produced by
two equally unintelligible speakers are predictable and provide information
both about the underlying motor deficit and the perceptual representations
and strategies of the listener.
In order to investigate these
relationships of signal disruptions and intelligibility, we have performed
extensive acoustic analyses on speech produced by a variety of speakers
with dysarthria, and have examined the perceptual
error patterns obtained from normal-hearing listeners presented with dysarthric speech and non-dysarthric
speech that has been digitally altered to match some of the acoustic
characteristics that we have measured in dysarthric
speech. This multi-pronged approach is necessary to achieve our ultimate
goal: a comprehensive model of intelligibility deficits that ties specific
acoustic measures to particular motor deficits of the speaker (or
degradations of the signal in the environment) and to particular
representations and challenges for the listener.
As part of this effort we have
developed measures of speech rhythm that can be automatically computed and
appear to predict both the dysarthria subtype of the
speaker and the percent-correct accuracy and word segmentation errors of the
listener. The present proposal develops the most promising results of this
work in a structured set of experiments with theoretical import and the
potential for immediate clinical impact. Specifically, we will expand on
preliminary findings that metrics of speech rhythmicity
are remarkably predictive of listener performance. Indeed, we have found
this to hold for segmental duration metrics in the temporal domain and for
long-term modulations of amplitude envelopes within frequency bands in the
spectral domain. Because of the ways in which dependent variables in these
two domains map onto clinically and theoretically meaningful percepts,
there is strong indication for their development as outcome measures.
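As an illustration of what an automatically computable duration-based rhythm metric can look like, the sketch below computes two measures that are widely used in the speech rhythm literature: the percentage of utterance duration that is vocalic (%V) and the normalized pairwise variability index (nPVI). The abstract does not name the specific metrics used in this project, so the choice of these two is an assumption, and the interval durations are hypothetical.

```python
def percent_vocalic(vowel_durations, consonant_durations):
    """%V: share of total duration occupied by vocalic intervals."""
    vocalic = sum(vowel_durations)
    return 100.0 * vocalic / (vocalic + sum(consonant_durations))


def normalized_pvi(durations):
    """nPVI: mean difference between successive interval durations,
    each normalized by the mean of the pair, multiplied by 100."""
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    terms = [
        abs(a - b) / ((a + b) / 2.0)
        for a, b in zip(durations, durations[1:])
    ]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical interval durations in milliseconds (not measured data).
vowels_ms = [80, 160, 80, 160]
consonants_ms = [70, 90, 60, 100]

print(f"%V = {percent_vocalic(vowels_ms, consonants_ms):.1f}")
print(f"vocalic nPVI = {normalized_pvi(vowels_ms):.1f}")
```

Both functions need only segment boundaries as input, so they can be computed automatically once an utterance has been segmented into vocalic and consonantal intervals.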

National Institutes of Health R01: Auditory and cognitive factors in speech perception and category learning
Award No. 5 R01 DC000427
PI: Randy Diehl
Collaborator: Andrew J. Lotto
$77,666 (subcontract)
09/01/09-06/30/11
The work to be performed at the
University of Arizona under the subcontract to the University of Texas at Austin
will be concerned mainly with the first two specific aims listed in the
research plan for the grant entitled “Auditory and cognitive factors in
speech perception and category learning” (R. Diehl, PI). Perceptual
learning experiments will be conducted in which participants will be asked
to label non-speech sounds (sampled from overlapping distributions) as
members of arbitrary categories A or B. The accuracy of performance on
these categorization tasks will be compared with ideal observer models and
theoretical predictions as described in the original Research Plan.
Stimulus creation, programming of experimental presentation software,
collection of data, and analysis of accuracy and reaction time will be
conducted at the University of Arizona. Data will be collected in the
research laboratory of Dr. Andrew Lotto in the Department of Speech &
Hearing Sciences under the management of Ms. Sarah Sullivan, M.A. General
design of the experiments and comparisons of (de-identified) averaged data
with specific model predictions will occur in collaboration with the PI at
the University of Texas.
Below is the abstract submitted to the
University of Arizona IRB for the portions of the project that will be
conducted at Arizona under the subcontract:
A number of auditory tasks,
including speech perception, require listeners to categorize stimuli on the
basis of one or more features of the input. In many cases, especially
speech, there is no one-to-one mapping between values along continuous
features and discrete categories (e.g., phonemes). How then do perceptual
systems categorize stimuli under uncertainty? One possible solution is to
use probabilistic information from experienced stimulus distributions to
optimize accuracy. We propose that perceivers incorporate distributional
knowledge about the acoustic environment with the information provided by
the signal in order to make optimal (i.e., maximal accuracy) categorical
decisions. Statistical approaches such as this are widely used in vision
research but are rarely applied to auditory or speech perception. The goal
of this study is to develop a framework that will provide testable
hypotheses about the nature of statistical (distributional) learning in
auditory perception, in general, and speech perception, specifically. For
this study, speech stimuli were intentionally avoided in order to simplify
experimental designs and increase experimenter control. Numerous studies
indicate that non-speech sounds can be perceived in a speech-like manner,
and, in fact, Saffran and colleagues (1999) have
demonstrated that listeners can learn the statistical properties of
non-linguistic stimuli. The project is designed to investigate perceivers’
sensitivity to probabilistic distributional information. Recent research by
several investigators indicates that infant and adult humans are sensitive
to auditory statistical information (e.g., Saffran,
Aslin, & Newport, 1996). However, this
research has lacked a strong theoretical framework. The present study will
independently manipulate statistical information such as distribution
characteristics (e.g., shape, mean, and variance), stimuli features (e.g.,
center frequency), feedback (e.g., whether or not feedback is provided),
and the number of dimensions included in the experiment to examine their
individual effects on categorization. The stimuli will consist of
noise bursts with varying acoustical characteristics. Participants will be
asked to listen to the stimuli and identify what category the complex
non-speech stimuli belong to based on their sound characteristics. These
responses will be collected via a computer game in which participants
navigate through three-dimensional space and respond to animated characters
corresponding to the different sound category distributions. Results from
these studies will potentially answer numerous theoretical questions, such
as whether humans are sensitive to auditory statistical information,
whether participants’ behavior changes with training, and how closely
participants’ responses match those of an ideal observer defined by the
optimal decision strategy.
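The ideal observer against which participants are compared can be sketched for the simplest case: two categories, A and B, defined by overlapping Gaussian distributions along a single dimension such as center frequency, with equal variances and equal prior probabilities. The distribution parameters and function names below are hypothetical and chosen only for illustration; the actual models are those described in the Research Plan.

```python
import math


def normal_pdf(x, mean, sd):
    """Probability density of a Gaussian distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative probability of a Gaussian distribution up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ideal_label(x, mean_a, mean_b, sd, prior_a=0.5):
    """Choose the category with the higher posterior probability."""
    posterior_a = prior_a * normal_pdf(x, mean_a, sd)
    posterior_b = (1.0 - prior_a) * normal_pdf(x, mean_b, sd)
    return "A" if posterior_a >= posterior_b else "B"


def ideal_boundary_and_accuracy(mean_a, mean_b, sd):
    """Optimal boundary and maximum expected accuracy, assuming equal
    priors, equal variances, and mean_a < mean_b."""
    boundary = (mean_a + mean_b) / 2.0
    correct_a = normal_cdf(boundary, mean_a, sd)
    correct_b = 1.0 - normal_cdf(boundary, mean_b, sd)
    return boundary, 0.5 * (correct_a + correct_b)


# Hypothetical category distributions over center frequency (Hz).
boundary, accuracy = ideal_boundary_and_accuracy(mean_a=900.0, mean_b=1100.0, sd=100.0)
print(f"optimal boundary: {boundary:.0f} Hz, maximum accuracy: {accuracy:.3f}")
```

Because the distributions overlap, even the ideal observer cannot reach perfect accuracy. Its expected accuracy is the ceiling against which a participant's percent correct can be compared.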

National Institutes of Health F31: High-frequency energy in speech and voice
PI: Brian Monson
Co-Sponsor: Andrew J. Lotto
$77,828
01/01/10-12/31/11
For years the major focus in speech
acoustics has been on the frequency range below approximately 5 kHz. Human
speech and the human voice generate acoustical energy up to 20 kHz.
Evidence is accruing that high-frequency energy (energy above 5 kHz) in
speech and voice contributes to percepts of quality, localization, and
intelligibility.
The proposed research is intended to be
an initial step in the long-range goal of characterizing high-frequency
energy in speech, with particular regard for its perceptual role, its
potential for modification during speech production, and its generation
mechanism.
In this study, a database of
high-fidelity recordings of singers and talkers will be used for a broad acoustical analysis and general
characterization of high-frequency energy, as well as for specific
characterization of phoneme category, speech intensity level, and mode of
production by their high-frequency energy content. Directionality of
radiation of high-frequency energy from the mouth will also be examined.
The recordings will be used for perceptual experiments wherein listeners
will be asked to discriminate between speech and voice samples that differ
only in high-frequency energy content. Listeners will also be subjected to
intelligibility-in-noise tasks with samples that have been modified only in
high-frequency content. The combination of these experiments will reveal
(1) the ability of human listeners to detect high-frequency energy
modification, and (2) the phonetic value of high-frequency energy in
speech.
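One simple way to quantify high-frequency energy as defined above (energy above 5 kHz) is the fraction of a signal's spectral energy that lies above the cutoff. The sketch below computes this with a direct discrete Fourier transform on a synthetic two-tone signal. The test signal, sample rate, and function name are hypothetical, and the measure is an illustrative assumption rather than the analysis planned for the recordings.

```python
import cmath
import math


def energy_fraction_above(signal, sample_rate, cutoff_hz):
    """Fraction of spectral energy in positive-frequency DFT bins above
    cutoff_hz. Uses a direct DFT, so it suits short signals only."""
    n = len(signal)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(
            signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n > cutoff_hz:
            high += power
    if total == 0.0:
        raise ValueError("signal has no energy outside the DC bin")
    return high / total


# Synthetic 10-ms signal: a 1 kHz tone plus a weaker 8 kHz tone.
sample_rate = 44100
n_samples = 441
signal = [
    math.sin(2 * math.pi * 1000 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 8000 * t / sample_rate)
    for t in range(n_samples)
]

fraction = energy_fraction_above(signal, sample_rate, cutoff_hz=5000)
print(f"fraction of energy above 5 kHz: {fraction:.3f}")
```

For this test signal the 8 kHz component has half the amplitude of the 1 kHz component, so it carries one fifth of the total energy. Removing or attenuating the bins above the cutoff is one way to create stimuli that differ only in high-frequency content.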
The relevance of this project to public
health lies in its efforts to elucidate the effect on human communicative
behavior when high-frequency energy in speech is lost or altered, which
may be incurred by factors such as hearing loss, noisy environmental
conditions, telephony, audio data compression (such as mp3 compression),
electronic sound reinforcement, or sound recording and playback. Previous
research has already shown that high-frequency energy affects speech
intelligibility, word-learning in normal-hearing and hearing-impaired
children, speech localization, and qualitative percepts of speech and voice
(e.g., 'naturalness'). Thus, this project will provide particularly valuable
insight regarding the need for representation of the high-frequency range
in augmentative hearing devices, including hearing aids, cochlear implants,
and auditory brainstem implants; the results of this project may also
impact the evaluation and management of speech, voice, and language
disorders, as well as the development of training techniques for the
enhancement of speech and voice.