LIA-EURECOM RT`09 SPEAKER DIARIZATION SYSTEM: ENHANCEMENTS IN SPEAKER MODELLING AND CLUSTER PURIFICATION

Simon Bozonnet 1, Nicholas W. D. Evans 1 and Corinne Fredouille 2
1 EURECOM - 2229 Routes des Crêtes, Valbonne - Sophia Antipolis, France
2 LIA - 339 chemin des Meinajaries - Agroparc BP 1228 - 84911 Avignon Cedex 9, France

email: {bozonnet,evans}@eurecom.fr

INTRODUCTION
• top-down approach to speaker diarization:
  • competitive performance
  • computationally efficient
  • weaknesses in model initialisation and cluster impurities
• improvement in speaker modelling
• new work in post-purification
• improved stability across five standard NIST RT datasets

E-HMM DIARIZATION SYSTEM
We use the E-HMM system presented in [1], [2], composed of 3 steps + preprocessing. We added a cluster purification step based on [3] in step 2.

[Figure: system pipeline. Noise reduction (Wiener), Delay and Sum Beamforming, Speech Activity Detection (SAD), Segmentation & Clustering, Cluster purification, ReSegmentation, Normalization, New ReSegmentation]

1. Speech activity detection (SAD): performed via alignment to an HMM with speech/non-speech models.
2. Segmentation & Clustering: the E-HMM is initialised with a root model trained with all the speech data. 16-component GMM speaker models are iteratively added to the E-HMM with EM training using the longest available segments. Clusters are purified with a method similar to that in [3] and [4] by discarding 45% of the worst fitting data. A resegmentation is then applied but now using MAP adaptation of a world model trained on external data.
3. Normalization & Resegmentation: segmental normalisation is applied to each segment to attenuate channel effects before a final resegmentation is performed.

[Figure: iterative addition of speakers to the E-HMM.
Process initialization. Stage 1: adding speaker L0.
Stage 2: adding speaker L1.
Process: steps 1 & 2. The best subset is used to learn L1 model, a new HMM is built. According to the subset selected, this indexing is obtained.
Process: step 3, Models Adaptation. Adaptation + Viterbi is repeated. No gain observed, the adaptation of the L1 model is stopped.]

ENHANCEMENTS IN SPEAKER MODELLING
• models trained by EM instead of MAP
• 16-component GMM speaker models (cf. 128 before)
• 6 seconds minimum of data to tune a model (cf. 3 seconds before)

CLUSTER PURIFICATION
1. Initial data and models
2. Keep 55% of data which best fits the model
3. Retune the model
4. Realign the data
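The data-selection step of cluster purification (keep the 55% of a cluster's data that best fits its model, discard the remaining 45%) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_best_fitting`, the use of per-frame log-likelihood scores as the fit measure, and the frame-level granularity are all assumptions made for the example. Retuning the model and realigning the data would follow on the selected subset.

```python
def select_best_fitting(frame_scores, keep_fraction=0.55):
    """Return the sorted indices of the best-fitting frames of one cluster.

    frame_scores: hypothetical per-frame scores of the cluster's data under
    its current model (e.g. log-likelihoods); higher means a better fit.
    keep_fraction: share of the data to keep (0.55 on the poster).
    """
    n = len(frame_scores)
    n_keep = int(round(keep_fraction * n))
    # Rank frame indices from best to worst fit.
    ranked = sorted(range(n), key=lambda i: frame_scores[i], reverse=True)
    # Keep the top share and restore temporal order.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    scores = [-3.2, -1.0, -7.5, -0.4, -2.1, -9.9, -0.8, -4.4, -1.7, -6.0,
              -0.2, -5.1, -2.9, -8.3, -1.2, -3.8, -0.6, -7.0, -2.4, -4.9]
    kept = select_best_fitting(scores)
    print(len(kept), "of", len(scores), "frames kept")
```

The kept indices would then be used to retune the cluster model before the data is realigned against all models.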

[Figure: Model purity for the NIST RT`09 dataset (MDM conditions) before and after purification]
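The poster does not state how model purity is computed. One common definition, assumed here purely for illustration, is the fraction of a cluster's frames that belong to its dominant reference speaker. The function name `cluster_purity` and the label format are hypothetical.

```python
from collections import Counter


def cluster_purity(reference_labels):
    """Fraction of a cluster's frames owned by its dominant reference speaker.

    reference_labels: hypothetical list with the ground-truth speaker label
    of every frame assigned to one cluster.
    """
    if not reference_labels:
        raise ValueError("cluster is empty")
    counts = Counter(reference_labels)
    # Size of the largest single-speaker group divided by the cluster size.
    return counts.most_common(1)[0][1] / len(reference_labels)


if __name__ == "__main__":
    print(cluster_purity(["spk1", "spk1", "spk1", "spk2"]))
```

Under this definition a value of 1.0 means the cluster contains data from a single speaker only.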

[Figure, continued: iterative addition of speakers to the E-HMM.
Process: step 4, Stop criterion. Best one speaker indexing is compared with best 2 speakers indexing. A gain is observed, a new speaker will be added.
Stage 3: adding speaker L2.
Process: steps 1 & 2. The best subset is used to learn L2 model, a new HMM is built. According to the subset selected, this indexing is obtained.
Process: step 3, Models Adaptation. Adaptation + Viterbi is repeated. No gain observed, the adaptation of the L2 model is stopped.
Process: step 4, Stop criterion. Best 2 speakers indexing is compared with best 3 speakers indexing. A gain is not observed, we return the best 2 speakers indexing.]

RESULTS
• 3 datasets
  • dev. set: RT`04, RT`05, RT`06
  • validation set: RT`07
  • evaluation set: RT`09
• 3 systems
  • RT`07 based on [1]
  • RT`09 based on [2]
  • RT`09 and purification

MDM condition, results illustrated with/without scoring overlapping speech.
Same except for the SDM condition.

CONCLUSIONS
• Improvements in stability and overall performance:
  • Enhancements to speaker modelling: consistent improvement in performance: 35% relative improvement in DER for MDM, 28% SDM
  • Cluster purification: 14% relative improvement in DER for MDM, 19% SDM
  • Complementary gains: 44% MDM, 28% SDM
• Minimal increase in computational load from purification
  • RT`09 system achieved a speed factor of 1.5
  • Purification increases speed factor to 1.6
• Still among the most computationally efficient systems submitted to RT`09 evaluation
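The relative improvements in DER quoted in the conclusions can be read with the short sketch below. It assumes, without confirmation from the poster, that each gain is measured relative to the system it was added to, so that successive gains compose multiplicatively. Under that assumption the MDM figures (35% then 14%) compose to roughly 44%, in line with the reported complementary gain; the SDM figures do not compose this way, so the reading remains an assumption. The DER values in the usage lines are placeholders, not results from the poster.

```python
def relative_improvement(der_before, der_after):
    """Relative reduction in diarization error rate (DER)."""
    return (der_before - der_after) / der_before


def combined_gain(*gains):
    """Compose successive relative gains, each taken against the previous system."""
    remaining = 1.0
    for g in gains:
        remaining *= 1.0 - g
    return 1.0 - remaining


if __name__ == "__main__":
    # Placeholder DER values chosen only to illustrate the formula.
    print(round(relative_improvement(20.0, 13.0), 3))
    # MDM gains from the conclusions: speaker modelling, then purification.
    print(round(combined_gain(0.35, 0.14), 3))
```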

REFERENCES
[1] C. Fredouille and N. Evans, “The LIA RT`07 speaker diarization system,” in Lecture Notes in Computer Science - Multimodal Technologies for Perception of Humans, Stiefelhagen, Bowers, Fiscus, Ed. 2008, vol. 4625/2008, pp. 520-532, Springer.
[2] C. Fredouille, S. Bozonnet and N. Evans, “The LIA-EURECOM RT`09 Speaker Diarization System,” in RT`09, NIST Rich Transcription Workshop, May 28-29, 2009, Melbourne, Florida, USA, 2009.
[3] Nguyen et al., “The IIR-NTU Speaker Diarization Systems for RT`2009,” in RT`09, NIST Rich Transcription Workshop, May 28-29, 2009, Melbourne, Florida, USA, 2009.
[4] H. Sun, T. L. Nwe, B. Ma and H. Li, “Speaker diarization for meeting room audio,” in Proc. Interspeech, September 2009.