Psychophysics of Spectral Contrast

There are a number of context effects in speech perception in which the identification of one phoneme is changed due to the identity of neighboring phonemes. It has been proposed that these identity shifts compensate for the spectral deformations caused by coarticulation. For example, listen to the following files.

To hear a synthesized version of /al/ followed by a consonant-vowel (CV) syllable, click here: 
(Was the second syllable a /da/ or a /ga/?)

To hear a synthesized version of /ar/ followed by a consonant-vowel (CV) syllable, click here: 
(Was the second syllable a /da/ or a /ga/?)

In both of these previous cases the second syllable was identical, yet most people hear the CV as /ga/ when preceded by /al/ and as /da/ when preceded by /ar/. The syllable sounds like this in isolation: 

This perceptual effect may counteract the acoustic consequences of coarticulation: a syllable is acoustically more /da/-like when produced following /al/ and more /ga/-like when produced following /ar/. Thus, this context effect is likely important for veridical speech communication.

Whereas these effects have traditionally been offered as evidence for the existence of specialized modules for speech perception, recent data suggest that they result from more general auditory processes. Similar shifts in syllable "labeling" have been demonstrated in pre-lingual infants (Fowler et al., 1990) and in birds (Japanese quail, Coturnix japonica) trained to label syllables by pecking a key (Lotto et al., 1997).

 

Figure 1.

Japanese quail (Coturnix japonica) in an operant chamber. Quail trained to peck to /da/ or /ga/ syllables showed a shift in peck rates dependent on the preceding syllable. More "/ga/ responses" were obtained when CVs were preceded by /al/ and more "/da/ responses" were obtained when CVs were preceded by /ar/. This shift in response is similar to what is observed in human responses to these syllables.

 

Experiment 1: Non-Speech Analogue
One of the first questions raised by this context effect is whether or not it is specific to speech stimuli.

To answer this question, we presented listeners with CVs (/da/-/ga/) preceded by non-speech stimuli that contained some of the purportedly important spectral properties of the speech contexts (/al/-/ar/) but that did not sound like speech. Each non-speech sound was the sum of two sine-wave tones matched in frequency to the offsets of the second and third formants (F2 and F3) of the speech stimuli (for /al/: 956 and 2700 Hz; for /ar/: 1517 and 1600 Hz). These stimuli preceded the speech CVs with a 50-msec interstimulus gap. Listeners identified the CVs.
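The sine-pair contexts described above can be sketched in a few lines. This is a minimal illustration, not the synthesis procedure actually used: the tone frequencies and the 50-msec gap come from the text, while the tone duration, sample rate, equal tone amplitudes, and the function names `sine_pair` and `with_gap` are assumptions made for the example. The CV target itself is not synthesized here.

```python
import math

SAMPLE_RATE = 20000  # Hz; assumed, not stated in the text


def sine_pair(f_low_hz, f_high_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return samples for the sum of two equal-amplitude sine tones."""
    n = int(round(duration_s * sample_rate))
    return [
        0.5 * math.sin(2 * math.pi * f_low_hz * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * f_high_hz * i / sample_rate)
        for i in range(n)
    ]


def with_gap(context, target, gap_s=0.050, sample_rate=SAMPLE_RATE):
    """Concatenate a context, a silent gap, and a target sound."""
    silence = [0.0] * int(round(gap_s * sample_rate))
    return context + silence + target


# Tone frequencies are the F2/F3 offsets given in the text.
# The 250-msec duration is an assumption for illustration only.
AL_ANALOGUE = sine_pair(956, 2700, duration_s=0.25)
AR_ANALOGUE = sine_pair(1517, 1600, duration_s=0.25)
```

A trial would then be built as `with_gap(AL_ANALOGUE, cv_samples)`, where `cv_samples` stands for whatever synthesized CV is being tested.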

To listen to the tones modeled on /al/ followed by a CV, click here: 

To listen to the tones modeled on /ar/ followed by a CV click here: 

The CV is identical in both of these examples.

Results
Figure 2 shows the mean percentage of /ga/ responses for CVs following the speech /al/ and /ar/ contexts and for CVs following the non-speech /al/ and /ar/ sine-pair analogues. The identification shift obtained for the non-speech contexts is statistically indistinguishable from the shift obtained for the speech contexts. This suggests that the spectral properties of the context, and not its identity, determine the context effect. Taken together with the data from avian subjects, these results point to a general auditory mechanism underlying this important speech context effect.

 

Figure 2.

Mean percentage of /ga/ responses to CVs preceded by synthesized speech (/al/ or /ar/) or by non-speech analogues (sine waves placed at the F2 and F3 offset frequencies of /al/ or /ar/). The shift in CV identifications does not differ statistically between the two types of context.

 

Proposed General Explanations
The perceptual context effect may be described in terms of spectral contrast. That is, following a syllable with a high-frequency F3 offset (/al/), an ambiguous syllable is labeled as if it had a low-frequency F3 onset (/ga/). Following a syllable with a low-frequency F3 offset (/ar/), an ambiguous syllable is labeled as if it had a high-frequency F3 onset (/da/). Two mechanisms have been proposed as underlying this spectral contrast: adaptation of auditory nerve fibers and auditory enhancement (Holt & Kluender, 2000; see also Delgutte, 1996).

Both of these possible explanations suggest that the context effect occurs at a rather peripheral level in the auditory system (auditory enhancement appears to be due in part to interactions in the cochlear nucleus). To evaluate these purported mechanisms, two experiments were run, designed to 1) describe, roughly, the time course of the context effect; and 2) determine whether the effects are strictly monaural.

Experiment 2: Temporal Contiguity
A 10-step series of consonant-vowel (CV) syllables was synthesized varying in F3-onset frequency (1800-2700 Hz). These syllables varied perceptually from a good /da/ to a good /ga/. These CVs were preceded by synthesized versions of /al/ (F3 offset=2700 Hz) or /ar/ (F3 offset=1600 Hz). The duration of the silent gap between these syllables was varied from 25 to 400 msec. Participants were asked to identify the second syllable as /da/ or /ga/ by pressing a button on a response box. Pseudo-spectrograms of the stimuli are displayed here.
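As a rough illustration of the stimulus set, the series and its crossing with the two contexts can be enumerated as follows. The endpoints (1800 and 2700 Hz), the number of steps, and the context F3 offsets come from the text. Equal spacing between steps is an assumption, and the names `f3_onset_series` and `stimulus_pairs` are hypothetical. The individual gap durations between 25 and 400 msec are not listed in the text, so they are left out.

```python
def f3_onset_series(low_hz=1800.0, high_hz=2700.0, steps=10):
    """F3-onset frequencies for the CV series (equal spacing is assumed)."""
    step_hz = (high_hz - low_hz) / (steps - 1)
    return [low_hz + k * step_hz for k in range(steps)]


# F3 offset of each context syllable, as given in the text.
CONTEXT_F3_OFFSET_HZ = {"al": 2700.0, "ar": 1600.0}


def stimulus_pairs():
    """Cross each context with each CV; one such set exists per gap duration."""
    return [
        (context, f3_onset)
        for context in CONTEXT_F3_OFFSET_HZ
        for f3_onset in f3_onset_series()
    ]
```

Under the equal-spacing assumption, adjacent members of the series differ by 100 Hz, and each gap duration is tested with 20 context-plus-CV pairs.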

Results
Figure 3 displays identification boundaries (from probit analysis; a higher boundary means more /ga/ responses) for each context (/al/ vs. /ar/) at each silent gap duration (25-400 msec). The size of the context effect decreases monotonically as gap duration increases. The boundary shift is significant (p < .05) for each duration up to and including 275 msec.
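For readers unfamiliar with identification boundaries, the sketch below shows one simple way to estimate a 50% crossover from identification data. It is not the analysis used for Figure 3: it fits an ordinary least-squares line to probit-transformed response proportions, which only approximates a full maximum-likelihood probit analysis. The function name `probit_boundary`, the clipping value, and the response proportions in the example are all hypothetical.

```python
from statistics import NormalDist


def probit_boundary(steps, prop_ga, clip=0.01):
    """Estimate the series step at which /ga/ responses cross 50%.

    Proportions are clipped away from 0 and 1 (the probit transform is
    undefined there), converted to z-scores, and fitted with a straight
    line. The boundary is where the fitted line crosses z = 0.
    """
    unit_normal = NormalDist()
    z = [unit_normal.inv_cdf(min(max(p, clip), 1 - clip)) for p in prop_ga]
    n = len(steps)
    mean_x = sum(steps) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(steps, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return -intercept / slope


# Hypothetical proportions of /ga/ responses at each of 10 series steps.
STEPS = list(range(1, 11))
AFTER_AL = [0.98, 0.96, 0.93, 0.85, 0.72, 0.55, 0.34, 0.18, 0.08, 0.03]
AFTER_AR = [0.95, 0.90, 0.78, 0.60, 0.40, 0.22, 0.10, 0.05, 0.03, 0.02]

# The context effect is the difference between the two boundaries.
shift = probit_boundary(STEPS, AFTER_AL) - probit_boundary(STEPS, AFTER_AR)
```

With these made-up data the boundary after /al/ lies at a higher step than the boundary after /ar/, so `shift` is positive. Computing `shift` separately at each gap duration would trace out the kind of time course plotted in Figure 3.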

The fact that the context effect is maintained for gaps up to 275 msec long has implications for determining the mechanism underlying the effect. This duration appears to be too long for adaptation at the level of auditory nerve fibers (ANFs) to play an appreciable role. Viemeister & Bacon (1982) found no auditory enhancement in their masking study beyond about 100 msec of silent gap. (However, the time course of auditory enhancement does vary with the particulars of the stimuli and tasks.) The context effect studied here is still quite strong with a 100-msec gap between syllables. These data suggest that the mechanisms responsible for this effect may not be peripheral.
 

 

Figure 3.

Identification boundary (probit) values for CVs preceded by /al/ or /ar/ with varying durations of silent gap (with s.e. bars). The t-tests are significant for all comparisons up to and including the 275-msec gap.

 

Experiment 3: Dichotic Presentation
Adaptation of auditory nerve fibers (ANFs) and auditory enhancement are monaural effects. For example, Summerfield & Assmann (1989) failed to find effects of a precursor stimulus when it was presented to the contralateral ear. One way to examine the plausibility of auditory enhancement as a mechanism for coarticulation compensation is to present the context to the ear contralateral to the one receiving the target syllable.

Here, we use a different speech context from the one used in the preceding experiments.

The target CV varies in the onset frequency of the second formant (F2). Perceptually, it varies from /ba/ to /da/. It is preceded by the vowel /i/ or /u/. We have shown previously that these contexts shift identification: following /i/ (high F2), listeners report more /ba/ (low F2 onset), and following /u/ (low F2), more /da/ (high F2 onset). In Experiment 3, one group is presented the context and target CV binaurally, and a second group receives the target CV monaurally, with the preceding context presented to the contralateral ear. The ear receiving the context varied randomly between trials.
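The two presentation conditions can be sketched as a pair of two-channel signals. This is an illustrative construction only: it assumes that binaural presentation means the same context-gap-target sequence in both ears, and the function names `diotic_trial` and `dichotic_trial` are hypothetical. Sounds are represented as plain lists of samples.

```python
import random


def diotic_trial(context, target, gap_samples):
    """Binaural condition: the same sequence goes to both ears (left, right)."""
    sequence = context + [0.0] * gap_samples + target
    return list(sequence), list(sequence)


def dichotic_trial(context, target, gap_samples, rng=random):
    """Dichotic condition: context in one ear, target in the other.

    The ear receiving the context is chosen at random on each call.
    Returns (left, right, context_ear).
    """
    gap = [0.0] * gap_samples
    # Each channel is silent while the other one carries its sound.
    context_channel = context + gap + [0.0] * len(target)
    target_channel = [0.0] * len(context) + gap + target
    context_ear = rng.choice(["left", "right"])
    if context_ear == "left":
        return context_channel, target_channel, context_ear
    return target_channel, context_channel, context_ear
```

Passing a seeded generator, for example `rng=random.Random(1)`, makes the sequence of context ears reproducible across sessions.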

 

Figure 4.

Diagram of presentation conditions for Experiment 3.

 

Results
The data demonstrate that the context effect is maintained under dichotic presentation. Figure 5 displays the identification functions for each presentation condition (binaural vs. dichotic) and each context (/i/ vs. /u/). The size of the context effect does not change when the context arrives at a different ear. As with the temporal range evidenced in Experiment 2, these data suggest that the mechanism underlying this context effect is not peripheral. The maintenance of the effect in the dichotic condition also makes it less plausible that auditory enhancement is responsible for the effect, as auditory enhancement is usually a monaural effect.

 

Figure 5.

Identification functions for CVs preceded by /i/ or /u/ presented in the same ear or contralaterally. The size of the shift (context effect) does not differ between dichotic and binaural presentation.

 

Conclusions
There is a class of context effects that have been referred to as "perceptual compensation for coarticulation". They may be important for maintaining invariant phonemic perception despite varying acoustic input. The three experiments described here lead to the following conclusions concerning the mechanisms responsible for these context effects:

1) Effects of context can occur with a gap of at least 275 msec between syllables. This suggests that the effects are not (completely) due to adaptation in the auditory nerve.

2) Shifts in identification occur even when the context is presented to the contralateral ear. This is evidence against auditory enhancement (a monaural effect) as a plausible mechanism for the context effects.

3) A similar identification shift can be induced by non-speech analogues with some spectral similarity to the speech contexts. These data suggest that the context effect is general in nature and does not require that the context be perceived as speech. The results are also consistent with a general spectral contrast account of the effects.


Bibliography

Delgutte, B. (1996). Auditory neural processing of speech. In W. J. Hardcastle & J. Laver (Eds.), The Handbook of Phonetic Sciences, pp. 507-538. Oxford: Blackwell.

Fowler, C.A., Best, C.T., & McRoberts, G.W. (1990). Young infants' perception of liquid coarticulatory influences on following stop consonants. Perception & Psychophysics, 48, 559-570.

Holt, L.L., & Kluender, K.R. (2000). General auditory processes contribute to perceptual accommodation of coarticulation. Phonetica, 57, 170-180.

Lotto, A.J., Kluender, K.R., & Holt, L.L. (1997). Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). Journal of the Acoustical Society of America, 102, 1134-1140.

Summerfield, Q., & Assmann, P.F. (1989). Auditory enhancement and the perception of concurrent vowels. Perception & Psychophysics, 45, 529-536.

Viemeister, N.F., & Bacon, S.P. (1982). Forward masking by enhanced components in harmonic complexes. Journal of the Acoustical Society of America, 71, 1502-1507.