A model of concurrent-vowel identification without segregation predicts perceptual errors
In a complex auditory environment, listeners with normal hearing can identify and attend to individual components of that environment, an ability famously termed the cocktail-party phenomenon. A multitude of cues can facilitate auditory stream segregation (e.g., pitch differences, dynamics, onset/offset asynchronies, and differences in speech spectral characteristics). Particular attention has been paid to the positive effect that differences in fundamental frequency (F0) between two concurrently presented vowels (steady-state harmonic complexes) have on their identification (for a review, see Micheyl and Oxenham, 2010).
Existing computer models predict with some success the improvement in concurrent-vowel identification observed with increasing F0 differences (Meddis and Hewitt, 1992). However, these models are poor at predicting listener confusions (Chintanpalli and Heinz, 2013).
We present a model of concurrent-vowel identification that incorporates a naïve Bayes classifier. The model directly predicts the probability that each possible combination of two vowels gave rise to the integrated internal representation of the presented concurrent-vowel pair. This contrasts with previous models, which were deterministic and assumed a segregation process that separated out individual vowel representations on the basis of F0 differences before comparing each with templates of individual vowels. Our model can also incorporate a pitch estimation process, but this is used only to constrain the concurrent-vowel pair categories considered during classification, and it yields marginal benefits. The new 'synthesis'-based model was tested with both temporal (autocorrelation-based) and spectral (rate-based) internal representations.
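As a rough illustration of this 'synthesis' approach, the sketch below scores an observed internal representation against stored templates for every candidate vowel pair and returns normalised posterior probabilities over pairs, rather than segregating the mixture first. The vowel labels, the single-channel autocorrelation summary, the isotropic Gaussian likelihood, and the uniform prior are all illustrative assumptions; the abstract does not specify the model's actual front end or likelihood.

    import numpy as np
    from itertools import combinations_with_replacement

    # Hypothetical vowel inventory; placeholder labels only.
    VOWELS = ["a", "e", "i", "o", "u"]

    def summary_autocorrelation(signal, max_lag):
        """Toy temporal internal representation: the normalised
        autocorrelation of a simulated auditory-nerve response (one
        channel for brevity; a full model would sum across channels)."""
        ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:][:max_lag]
        return ac / ac[0]

    def classify_pair(observation, templates, sigma=1.0, candidates=None):
        """Naive-Bayes scoring of an integrated representation against
        concurrent-vowel-pair templates.

        `templates` maps each (unordered) vowel pair to its mean internal
        representation. `candidates`, if given, restricts the pairs that
        are scored, e.g. to pairs consistent with a prior pitch estimate.
        """
        if candidates is None:
            candidates = list(combinations_with_replacement(VOWELS, 2))

        # Log-likelihood under each pair template: independent Gaussian
        # channels reduce to a scaled negative squared distance (shared
        # constants cancel after normalisation).
        logs = np.array([
            -np.sum((observation - templates[pair]) ** 2) / (2.0 * sigma ** 2)
            for pair in candidates
        ])

        # Normalise to posterior probabilities (log-sum-exp trick).
        logs -= logs.max()
        post = np.exp(logs)
        post /= post.sum()
        return dict(zip(candidates, post))

Under this sketch, the model's identification response is the most probable pair, and predicted confusion matrices follow from the full posterior distribution over pairs.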
The new model predicted listener confusions with high accuracy (R > 0.85). When there was no F0 difference between the vowels, the spectral model performed slightly better than the temporal model. However, only with temporal processing did the model qualitatively replicate the positive effect that F0 differences have on human concurrent-vowel identification. Overall, our model comes much closer to predicting human performance than previous models, and it hints at a process that seeks to infer which concurrent-vowel pair most probably produced a given internal representation, rather than one that segregates the representation and recognises the individual vowels separately.