System Output Combination For Improved Speaker Diarization

Simon Bozonnet1, Nicholas Evans1, Xavier Anguera2, Oriol Vinyals3,4, Gerald Friedland4 and Corinne Fredouille5

1EURECOM, Sophia Antipolis, France
2Telefonica Research, Barcelona, Spain
3EECS Department, University of California at Berkeley, USA
4International Computer Science Institute, Berkeley, USA
5LIA, University of Avignon, France

System combination, or fusion, is a popular, successful and sometimes straightforward means of improving performance in many fields of statistical pattern classification, including speech and speaker recognition. In the literature, however, there is little work that aims to combine the outputs of multiple speaker diarization systems.

Speaker Diarization = “Who spoke and when?”
• Identify the speakers within an audio stream
• Unsupervised process
• Two main approaches: bottom-up, top-down


This paper reports our first attempt to combine the outputs of two state-of-the-art systems:
• ICSI’s bottom-up system
• LIA-EURECOM’s top-down system
We show that a cluster matching procedure reliably identifies corresponding speaker clusters in the two system outputs and that, when they are used in a new realignment and resegmentation stage, the combination leads to relative DER improvements of 13% and 7% on independent development and evaluation datasets, respectively.

REFERENCES

[1] G. Friedland, O. Vinyals, Y. Huang, and C. Muller, “Prosodic and other long-term features for speaker diarization,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 17, no. 5, pp. 985–993, July 2009.

• Different ways to combine systems: at the feature, score, or decision level
• We propose an output-level combination of ICSI’s system [1] and LIA-EURECOM’s system [2] into a combined system

[Figure: HMM with one state per speaker (Speaker 1, Speaker 2). States = GMM = one speaker; transitions = speaker turns.]

Bottom-up (BU) or Agglomerative Hierarchical Clustering (AHC) [1]: the audio stream is over-segmented, before clusters are iteratively merged.

Top-down (TD) or Divisive Hierarchical Clustering (DHC) [2]: a single general model is tuned, before it is iteratively split.

Number of detected segments more accurate [fig. 2]

Number of detected speakers more accurate [fig. 1]

• ICSI’s bottom-up system
• LIA-EURECOM’s top-down system

While one system provides more accurate estimates of the number of speakers, the other gives a more accurate estimate of the size of the segments.

Optimal Combination [figure: outputs of Sys. A and Sys. B]

Main challenges:
• The number of speakers detected may differ from one system to another
• The outputs are not standardized in terms of labeling: there is no natural correspondence between system output labels
• Different segmentation outputs are generally not time-synchronized, since different Speech Activity Detection (SAD) algorithms are used
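The labeling and synchronization issues listed above can be made concrete with a minimal sketch. It assumes each system output is a list of (start, end, label) segments in seconds; the 10 ms frame step, the function names `to_frames` and `cooccurrence`, and the toy segments are illustrative assumptions and are not taken from either system.

```python
from collections import Counter

def to_frames(segments, n_frames, step=0.01):
    """Rasterise (start_s, end_s, label) segments onto a common frame grid.

    Frames not covered by any segment stay None (treated as non-speech).
    The 10 ms step is an assumption made for this sketch.
    """
    frames = [None] * n_frames
    for start, end, label in segments:
        first = int(round(start / step))
        last = min(int(round(end / step)), n_frames)
        for i in range(first, last):
            frames[i] = label
    return frames

def cooccurrence(frames_a, frames_b):
    """Count frames per (label_a, label_b) pair where both systems detect speech."""
    return Counter(
        (a, b) for a, b in zip(frames_a, frames_b)
        if a is not None and b is not None
    )

# Toy outputs: different label sets and slightly different speech boundaries.
sys_a = [(0.0, 1.0, "A"), (1.0, 2.0, "B")]
sys_b = [(0.0, 0.9, "2"), (1.1, 2.0, "1")]
counts = cooccurrence(to_frames(sys_a, 200), to_frames(sys_b, 200))
```

In this toy case the co-occurrence counts suggest that "A" corresponds to "2" and "B" to "1", even though the label names carry no such information and the boundaries do not line up exactly.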

[Figure: fusion example. One output has 4 clusters (A, B, C, D) and the other uses labels 1, 2, 3; intersecting them gives 7 virtual clusters (A1, A2, B1, B2, B3, C3, D1), which are passed to the FUSION step.]

An artificial experiment where virtual clusters are merged according to the ground truth:
• Segment boundaries are kept
• Lower bound on likely performance

Practical Combination

Cluster Matching: Each cluster Ti contained in the TD output is associated with a cluster Bi in the BU output if:
• Ti and Bi share a sufficient proportion of frames
• Bi is the closest to Ti according to the Information Change Rate (ICR) [3] among all clusters in the BU output

Exploit the merits of both systems in a combined approach.

For each pair (Ti, Bi) that satisfies both conditions:
- the matched cluster pair is accepted
- frames on which Ti and Bi do not match are discarded
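The matching rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ICR itself is not implemented and is replaced by a caller-supplied `distance` function (a hypothetical interface), the 0.5 overlap threshold is an assumed value, and the "proportion of shared frames" is assumed to be measured relative to the size of Ti.

```python
from collections import Counter

def match_clusters(td, bu, distance, min_overlap=0.5):
    """Associate each top-down cluster with at most one bottom-up cluster.

    td, bu: equal-length per-frame label lists (None = non-speech).
    distance: callable (td_label, bu_label) -> float, smaller means closer.
              Stand-in for the ICR; hypothetical interface.
    min_overlap: assumed threshold on shared frames, relative to the TD cluster.
    """
    td_sizes = Counter(t for t in td if t is not None)
    bu_labels = sorted({b for b in bu if b is not None})
    shared = Counter(
        (t, b) for t, b in zip(td, bu) if t is not None and b is not None
    )
    matches = {}
    if not bu_labels:
        return matches
    for t, size in td_sizes.items():
        # Closest BU cluster among all BU clusters, according to `distance`.
        closest = min(bu_labels, key=lambda b: distance(t, b))
        # Accept the pair only if it also shares enough frames.
        if shared[(t, closest)] / size >= min_overlap:
            matches[t] = closest
    return matches

def matched_frames(td, bu, matches):
    """Keep frames on which a matched pair agrees; all other frames become None."""
    return [t if t in matches and matches[t] == b else None
            for t, b in zip(td, bu)]

# Toy example with made-up distances.
td = ["T1"] * 6 + ["T2"] * 4
bu = ["B1"] * 5 + ["B2"] * 5
toy_dist = {("T1", "B1"): 0.1, ("T1", "B2"): 0.9,
            ("T2", "B1"): 0.8, ("T2", "B2"): 0.2}
matches = match_clusters(td, bu, lambda t, b: toy_dist[(t, b)])
kept = matched_frames(td, bu, matches)
```

Here the sixth frame is labeled T1 by the top-down output but B2 by the bottom-up output, so it is discarded while both pairs are accepted.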

Unmatched Clusters:
• Unmatched clusters from the TD output are introduced using only the best α% of their data (ranked by likelihood)
• α = 20%, determined by cross-validation
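A minimal sketch of the α% selection, assuming the data of an unmatched cluster is available as per-frame log-likelihood scores. The frame-level granularity, the scoring model and the function name `best_fraction` are assumptions for illustration; the poster only states that the best 20% of the data is kept.

```python
def best_fraction(frame_scores, alpha=0.20):
    """Return the indices of the top `alpha` fraction of frames by score.

    frame_scores: list of (frame_index, log_likelihood) pairs for one
                  unmatched cluster (assumed representation).
    At least one frame is kept for a non-empty cluster.
    """
    if not frame_scores:
        return []
    keep = max(1, int(len(frame_scores) * alpha))
    ranked = sorted(frame_scores, key=lambda fs: fs[1], reverse=True)
    return sorted(idx for idx, _ in ranked[:keep])

# Toy scores for a 10-frame cluster; with alpha = 0.20 two frames are kept.
scores = [(0, -5.0), (1, -1.0), (2, -3.0), (3, -0.5), (4, -4.0),
          (5, -2.0), (6, -6.0), (7, -7.0), (8, -8.0), (9, -9.0)]
selected = best_fraction(scores)
```

Keeping only the highest-scoring data is one way to seed a model for an unmatched cluster with the frames most likely to belong to it.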

Fig. 3. Speaker diarization performance in DER, including overlapping speech, for the RT’07 dataset (SDM).

Fig. 1. Average number of speakers and average error for the RT’07 and RT’09 datasets. Last column: with/without the inclusion of the NIST_20080307-0955 show.

Fig. 2. Average number of segments and average segment length in seconds


[2] S. Bozonnet, N. W. D. Evans, and C. Fredouille, “The LIA-EURECOM RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification,” in Proc. ICASSP, March 2010.

[3] K. J. Han, S. Kim, and S. S. Narayanan, “Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1590–1601, November 2008.

Both approaches are HMM-based:

EXPERIMENTAL WORK

[Figure labels: top-down clustering; optimum number of clusters]

{bozonnet,evans}@eurecom.fr

DIARIZATION SYSTEMS AND CHARACTERISTICS

INTRODUCTION

ABSTRACT


Final Resegmentation:
• A 128-component GMM is trained for each cluster by MAP adaptation
• Several iterations of realignment and adaptation are performed
• Speakers with less than 8 seconds of assigned speech are removed
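The last step, removing speakers with too little speech, can be sketched as below. The (start, end, label) segment representation and the function name `prune_short_speakers` are assumptions. The sketch simply drops the segments of removed speakers; how the actual system treats those frames afterwards is not stated on the poster.

```python
from collections import defaultdict

def prune_short_speakers(segments, min_speech=8.0):
    """Drop every speaker whose total assigned speech is below `min_speech` seconds.

    segments: list of (start_s, end_s, label) tuples (assumed representation).
    Segments of removed speakers are dropped, not reassigned (assumption).
    """
    totals = defaultdict(float)
    for start, end, label in segments:
        totals[label] += end - start
    return [seg for seg in segments if totals[seg[2]] >= min_speech]

# Toy output: A has 10 s in total, B has 4 s, C has 2 s.
segs = [(0.0, 5.0, "A"), (5.0, 9.0, "B"), (9.0, 14.0, "A"), (14.0, 16.0, "C")]
pruned = prune_short_speakers(segs)
```

The threshold is applied to the total duration per speaker, so a speaker with several short segments survives as long as they add up to at least 8 seconds.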

CONCLUSIONS

Fig. 4. Speaker diarization performance in DER, including overlapping speech, for the RT’09 dataset (SDM).

• First attempt to combine two state-of-the-art speaker diarization systems at the output level
• 13% and 7% relative DER improvement on the standard NIST RT’07 and RT’09 databases
• Future work: improve cluster matching in a probabilistic manner
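As a reading aid for the figures quoted above, the standard diarization error rate and a relative improvement can be written as follows. The symbols are introduced here for illustration only and do not appear on the poster, which also does not state which single-system result serves as the baseline.

```latex
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{spk}}}{T_{\mathrm{ref}}},
\qquad
\Delta_{\mathrm{rel}} = \frac{\mathrm{DER}_{\mathrm{base}} - \mathrm{DER}_{\mathrm{comb}}}{\mathrm{DER}_{\mathrm{base}}}
```

Here T_miss, T_FA and T_spk denote the durations of missed speech, false-alarm speech and speech attributed to the wrong speaker, T_ref is the total reference speech time, and DER_base and DER_comb are the error rates of the baseline and combined systems.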