LIA-EURECOM RT`09 SPEAKER DIARIZATION SYSTEM: ENHANCEMENTS IN SPEAKER MODELLING AND CLUSTER PURIFICATION
Simon Bozonnet 1, Nicholas W. D. Evans 1 and Corinne Fredouille 2
1 EURECOM – 2229 Routes des Crêtes, Valbonne - Sophia Antipolis, France
2 LIA - 339 chemin des Meinajaries - Agroparc BP 1228 - 84911 Avignon Cedex 9, France
email: {bozonnet,evans}@eurecom.fr
INTRODUCTION
• top-down approach to speaker diarization:
  • competitive performance
  • computationally efficient
  • weaknesses in model initialisation and cluster impurities
• improvement in speaker modelling
• new work in post purification
• improved stability across five standard NIST RT datasets

E-HMM DIARIZATION SYSTEM
We use the E-HMM system presented in [1], [2], composed of 3 steps + preprocessing. We added a cluster purification step based on [3] in step 2.
1. Speech activity detection (SAD): performed via alignment to an HMM with speech/non-speech models.
2. Segmentation & Clustering: the E-HMM is initialised with a root model trained with all the speech data. 16-component GMM speaker models are iteratively added to the E-HMM with EM training using the longest available segments. Clusters are purified with a method similar to that in [3] and [4] by discarding 45% of the worst fitting data. A resegmentation is then applied but now using MAP adaptation of a world model trained on external data.
3. Normalization & Resegmentation: segmental normalisation is applied to each segment to attenuate channel effects before a final resegmentation is performed.

[Figure: system pipeline. Noise reduction (Wiener), Delay and Sum Beamforming, Speech Activity Detection (SAD), Segmentation & Clustering, Cluster purification, ReSegmentation, Normalization, New ReSegmentation]

[Figure: E-HMM speaker addition. Stage 1: adding speaker L0 (process initialization). Stage 2: adding speaker L1. Process, steps 1 & 2: the best subset is used to learn the L1 model, a new HMM is built; according to the subset selected, this indexing is obtained. Process, step 3, models adaptation (adaptation + Viterbi): no gain observed, the adaptation of the L1 model is stopped.]

ENHANCEMENTS IN SPEAKER MODELLING
• models trained by EM instead of MAP
• 16-component GMM speaker models (cf. 128 before)
• 6 seconds minimum of data to tune a model (cf. 3 seconds before)

CLUSTER PURIFICATION
1. Initial data and models
2. Keep 55% of data which best fits the model
3. Retune the model
4. Realign the data

[Figure: Model purity for the NIST RT`09 dataset (MDM conditions) before and after purification]
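The four purification steps (start from the current data and models, keep the 55% of data that best fits each model, retune the model, realign the data) can be sketched as follows. This is a minimal illustration, not the system's implementation: a single diagonal Gaussian per cluster stands in for the 16-component GMM speaker models, frame-wise maximum-likelihood reassignment stands in for Viterbi realignment, and the names `fit_gaussian`, `log_likelihood` and `purify` are hypothetical.

```python
import numpy as np


def fit_gaussian(x):
    """Fit a diagonal Gaussian (assumed stand-in for a GMM speaker model)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0) + 1e-6  # small floor keeps the variance positive
    return mean, var


def log_likelihood(x, model):
    """Per-frame log-likelihood of x (n_frames, n_dims) under a diagonal Gaussian."""
    mean, var = model
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=1)


def purify(features, labels, keep_fraction=0.55, n_iter=1):
    """Purify clusters by retraining each model on its best-fitting data.

    features: (n_frames, n_dims) array; labels: integer cluster id per frame.
    keep_fraction=0.55 mirrors "keep 55% of data" (discard the worst 45%).
    """
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        ids = np.unique(labels)
        models = {}
        for k in ids:
            x = features[labels == k]
            model = fit_gaussian(x)                 # step 1: initial data and model
            scores = log_likelihood(x, model)
            n_keep = max(1, int(np.ceil(keep_fraction * len(x))))
            best = np.argsort(scores)[-n_keep:]     # step 2: keep best-fitting data
            models[k] = fit_gaussian(x[best])       # step 3: retune the model
        # step 4: realign every frame to the most likely retuned model
        ll = np.stack([log_likelihood(features, models[k]) for k in ids], axis=1)
        labels = ids[np.argmax(ll, axis=1)]
    return labels
```

Calling `purify(features, labels)` with an `(n_frames, n_dims)` feature array and one integer cluster label per frame returns the realigned labels; frames that fit their original cluster poorly no longer influence the retuned model and can move to a better-matching cluster.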
[Figure: E-HMM speaker addition (stop criterion and Stage 3). Stage 2, process step 4, stop criterion (best one speaker indexing vs. best 2 speakers indexing): a gain is observed, a new speaker will be added. Stage 3: adding speaker L2. Process, steps 1 & 2: the best subset is used to learn the L2 model, a new HMM is built; according to the subset selected, this indexing is obtained. Process, step 3, models adaptation (adaptation + Viterbi): no gain observed, the adaptation of the L2 model is stopped. Process, step 4, stop criterion (best 2 speakers indexing vs. best 3 speakers indexing): a gain is not observed, we return the best 2 speakers indexing.]

RESULTS
• 3 datasets:
  • dev. set: RT`04, RT`05, RT`06
  • validation set: RT`07
  • evaluation set: RT`09
• 3 systems:
  • RT`07 based on [1]
  • RT`09 based on [2]
  • RT`09 and purification

[Figure: MDM condition, results illustrated with/without scoring overlapping speech.]
[Figure: Same except for the SDM condition.]

CONCLUSIONS
• Improvements in stability and overall performance:
  • Enhancements to speaker modelling: consistent improvement in performance, 35% relative improvement in DER for MDM, 28% SDM
  • Cluster purification: 14% relative improvement in DER for MDM, 19% SDM
  • Complementary gains: 44% MDM, 28% SDM
• Minimal increase in computational load from purification:
  • RT`09 system achieved a speed factor of 1.5
  • Purification increases speed factor to 1.6
  • Still among the most computationally efficient systems submitted to RT`09 evaluation
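The relative improvements in DER quoted in the conclusions compare a new error rate against a baseline error rate. The sketch below shows that arithmetic, assuming the usual definition (baseline minus new, divided by baseline). The function names and the DER values in the comments are hypothetical and are not taken from the poster.

```python
def relative_improvement(baseline_der, new_der):
    """Relative improvement in percent, assuming (baseline - new) / baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline DER must be positive")
    return 100.0 * (baseline_der - new_der) / baseline_der


def compound(first_pct, second_pct):
    """Overall relative improvement if two gains are applied one after the other.

    Assumption: the second gain is measured relative to the result of the first.
    """
    return 100.0 * (1.0 - (1.0 - first_pct / 100.0) * (1.0 - second_pct / 100.0))


# Hypothetical example: a DER falling from 20.0% to 13.0% is a relative
# improvement of about 35%.
example = relative_improvement(20.0, 13.0)

# Compounding 35% and 14% (the MDM figures quoted for speaker modelling and
# purification) gives roughly 44%.
combined = compound(35.0, 14.0)
```

Under the compounding assumption, the two MDM gains combine to roughly the 44% quoted for MDM. The poster does not state how the combined figures were computed, so this is only an arithmetic illustration and is not claimed to hold for the other conditions.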
REFERENCES
[1] C. Fredouille and N. Evans, “The LIA RT`07 speaker diarization system,” in Lecture Notes in Computer Science - Multimodal Technologies for Perception of Humans, Stiefelhagen, Bowers, Fiscus, Ed. 2008, vol. 4625/2008, pp. 520–532, Springer.
[2] C. Fredouille, S. Bozonnet and N. Evans, “The LIA-EURECOM RT`09 Speaker Diarization System,” in RT`09, NIST Rich Transcription Workshop, May 28-29, 2009, Melbourne, Florida, USA, 2009.
[3] Nguyen et al., “The IIR-NTU Speaker Diarization Systems for RT`2009,” in RT`09, NIST Rich Transcription Workshop, May 28-29, 2009, Melbourne, Florida, USA, 2009.
[4] H. Sun, T. L. Nwe, B. Ma and H. Li, “Speaker diarization for meeting room audio,” in Proc. Interspeech, September 2009.