
Deliverable 3.1

Multi-channel Acoustic Echo Cancellation, Acoustic Source Localization, and Beamforming Algorithms for Distant-Talking ASR and Surveillance

Authors: Lutz Marquardt, Edwin Mabande, Alessio Brutti, Walter Kellermann

Affiliations: FAU, FBK-irst

Date: 30-Apr 2008

Document Type: R

Status/Version: 1.0

Dissemination Level: PU

FP6 IST-034624 http://dicit.itc.it


Project Reference: FP6 IST-034624

Project Acronym: DICIT

Project Full Title: Distant-talking Interfaces for Control of Interactive TV

Dissemination Level: PU

Contractual Date of Delivery: March 2008

Actual Date of Delivery: Preliminary version: 11-January-2008; Final version: 30-April-2008

Document Number: DICIT_D3.1_20080430_PU

Type: Deliverable

Status & Version: 1.0

Number of Pages: 4+33

WP Contributing to the Deliverable: WP3 (WP responsible: Walter Kellermann – FAU)

WP Task responsible: Lutz Marquardt (FAU)

Authors (Affiliation): Lutz Marquardt, Edwin Mabande and Walter Kellermann (FAU), Alessio Brutti (FBK-irst)

Other Contributors:

Reviewer:

EC Project Officers: Anne Bajart (till January 31st 2007), Erwin Valentini (from February 1st till October 31st 2007), Pierre Paul Sondag (from November 1st 2007)

Keywords: multi-channel acoustic echo cancellation, acoustic source localization, beamforming, multi-microphone devices, distant-talking speech recognition devices, voice-operated devices, Interactive TV, anti-intrusion, surveillance.

Abstract: The purpose of this document is to describe the acoustic pre-processing algorithms to be integrated into the first DICIT prototype. These algorithms will be used for the acquisition, extraction and enhancement of the desired speech signals which will be fed into the speech recognizer.

© DICIT Consortium


Contents

1. Introduction
2. Beamforming
   2.1 Array Geometry
   2.2 Data-independent Beamforming Designs for DICIT
      2.2.1 DSB with Dolph-Chebyshev Window Weighting
      2.2.2 FSB based on a Dolph-Chebyshev Design
      2.2.3 Least-Squares Frequency-Invariant Beamformer
   2.3 Beamforming Module Structure for DICIT
3. Multi-channel Acoustic Echo Cancellation (MC-AEC)
   3.1 Generalized Frequency Domain Adaptive Filtering (GFDAF)
   3.2 Channel Decorrelation for MC-AEC
4. Source Localization (SLoc)
   4.1 SLoc in DICIT
      4.1.1 Array Design
      4.1.2 Application Scenario
   4.2 Adopted SLoc Approach
      4.2.1 Global Coherence Field
      4.2.2 Sub-optimal Least Squares
      4.2.3 Tracking
      4.2.4 Experimental Results
      4.2.5 Multiple Sources
      4.2.6 Loudspeakers as Additional Sources
      4.2.7 Real-time Implementation
5. Multi-channel Acoustic Processing Subsystem
   5.1 FBK Hardware Setup
   5.2 FAU Hardware Setup
Bibliography


List of Figures

Figure 1: A linear uniformly-spaced microphone array
Figure 2: Frequency-dependent and frequency-independent beampatterns
Figure 3: Harmonically Nested Array
Figure 4: Nested sub-array structure
Figure 5: Beampattern and WNG for DSB design
Figure 6: Beampattern and WNG for FSB-DC
Figure 7: Beampattern and WNG for LS-FIB
Figure 8: Signal flow block diagram
Figure 9: MC-AEC in Human-Machine-Interface System
Figure 10: MC-AEC misalignment convergence comparison for NLMS and FDAF [8]
Figure 11: Phase modulation amplitude as a function of frequency subband [9]
Figure 12: Stereo-decorrelation employing frequency-dependent phase modulation [9]
Figure 13: Subjective audio quality for pre-processing methods [9]
Figure 14: Convergence comparison of pre-processing methods for stereo AEC [9]
Figure 15: Loci of points that satisfy a given TDOA at two microphones
Figure 16: TDOA given two microphones and a source in far field position
Figure 17: Effect of noisy time delay estimations in a double microphone pair set-up
Figure 18: SLoc module block diagram
Figure 19: Localization performance in terms of angular RMSE for different thresholds
Figure 20: GCF acoustic map in presence of two sources
Figure 21: Multiple speaker localization
Figure 22: GCF map of Figure 20 after the de-emphasis process
Figure 23: Configuration of the first DICIT prototype
Figure 24: Block structure of Multi-channel Acoustic Processing
Figure 25: PC 1 audio acquisition chain


1. Introduction

The general objective of WP3 is to find the most effective multi-channel pre-processing for the DICIT system in order to acquire, extract and enhance the signals uttered by the desired speakers. This pre-processing should help to maximize speech recognition performance in the noisy and reverberant DICIT scenarios, which require an abandonment of close-talk microphones. The four main research challenges for reaching this goal are reflected by tasks T3.1 "Multi-channel Acoustic Echo Cancellation (MC-AEC)", T3.2 "Adaptive Beamforming for Dereverberation and Noise Reduction", T3.3 "Source Localization Algorithms for supporting Beamforming" and T3.4 "Blind Source Separation for Noisy and Reverberant Environments". While Blind Source Separation (BSS), as the fourth field of research, will be investigated for the second prototype only, the first three research topics were addressed during the first year of the project in order to meet the requirements for the first prototype. This document describes the respective work conducted in connection with tasks T3.1 to T3.3. For task T3.2, fixed beamformers with steering capabilities are described, as these are relevant for the first prototype.

First, Section 2 reports on beamforming, which was investigated with respect to conceiving adequate solutions that allow for a straightforward implementation together with MC-AEC. Regarding Acoustic Echo Cancellation (AEC), Section 3 describes the work on MC-AEC, and in particular on two-channel AEC, as to its employment and adaptation to the DICIT scenarios. The usage of a new pre-processing scheme for channel decorrelation, an important feature for increasing the performance of an existing algorithm, is described in Section 3.2. Source Localization (SLoc), which provides the beamformer with steering information and also allows tracking of the movements of the active speakers, is described in Section 4. It is indispensable as long as no BSS algorithms are foreseen for desired signal acquisition. Besides the preparation of these DICIT-tailored mechanisms for the localization of a single active source, a novel approach to handle the localization of the active source in the presence of the stereo loudspeaker outputs is also described. Finally, the integration of the respective modules within the "Multi-channel Acoustic Processing Subsystem" is reported in Section 5.


2. Beamforming

Array signal processing makes use of an array, which consists of a group of transducers, to extract information from an environment. Microphone arrays have been successfully applied to spatial filtering problems where a desired acoustic source, usually obstructed by noise or interferers, needs to be extracted from an observed wavefield. The paradigm used for this task is called beamforming. By definition, a beamformer is a processor that is used in conjunction with an array of microphones in order to produce a versatile form of spatial filtering [1]. The beamformer exploits the spatial distribution of desired sources and interferers in order to attenuate the latter. The multiple microphone signals are jointly processed to form a beam, i.e. a region of high acoustic sensitivity, which is steered towards the desired source. For narrowband signals, the classical Delay-and-Sum Beamformer (DSB) [1] may be used. In the case of broadband signal acquisition, the goal of beamforming is to obtain a frequency-independent beam. This is necessary in order to avoid also extracting low-pass versions of the noise or interferers from the observed wavefield, since the width of the main beam is directly related to the length of the array and the wavelength of the signal [2]. This goal may be accomplished by implementing a Filter-and-Sum Beamformer (FSB), by utilizing nested arrays, or by a combination of the two [3].


Figure 1: A linear uniformly-spaced microphone array

Figure 1 depicts a linear, uniformly-spaced microphone array with a Filter-and-Sum Beamformer (FSB). In the following we consider an array that consists of $N = 2M+1$ microphones with a uniform spacing $d$. The source signal is captured by the microphones and digitized by


analog-to-digital converters. The digitized signals are then fed into the beamforming filters before being combined to produce the output. The FSB response for a frequency ω and angle ν relative to the array axis is given by [4]

$$B(\omega,\nu) = \sum_{n=-M}^{M} W_n(\omega)\, e^{j\omega\tau_n(\nu)},$$

where $W_n(\omega) = \sum_{k=0}^{L-1} w_n(k)\, e^{-j\omega k}$ and $\tau_n(\nu) = n d \cos(\nu) f_s / c$. Note that $w_n(k)$ are the filter coefficients of FIR filters of length $L$, $f_s$ is the sampling frequency, and $c$ is the speed of sound in air. The squared magnitude of the beamformer response, expressed in dB, is known as the beampattern of the beamformer. The beampattern describes the beamformer's ability to capture acoustic energy as a function of the angle of arrival of the plane wave. It is defined as [4]

$$P(\omega,\nu) = 20 \log_{10} \left| B(\omega,\nu) \right|.$$

The directivity for a linear array, which is the ratio of the beampattern in the desired direction to its average over all directions, is given by [2]

$$D(\omega) = \frac{\left| B(\omega,\nu_o) \right|^2}{\frac{1}{2}\int_{0}^{\pi} \left| B(\omega,\nu) \right|^2 \sin(\nu)\, d\nu},$$

where $\nu_o$ is the steering direction. Figure 2 depicts the beampatterns obtained from a 5-element linear uniformly-spaced array by utilizing a narrowband beamforming design (DSB) [2] and a broadband beamforming design [5], respectively. It can clearly be seen that the beamwidth of the main beam of the DSB design varies with frequency. This leads to a marked reduction in directivity as the frequency decreases. In contrast, the main beam of the broadband beamforming design is approximately frequency-independent, and therefore the variation in directivity with frequency is limited.
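To make these definitions concrete, the following is a minimal Python sketch that evaluates the FSB response $B(\omega,\nu)$, the beampattern in dB, and the directivity $D(\omega)$ on a frequency-angle grid. The array parameters (5 microphones, 4 cm spacing, pure-delay uniform weights) are illustrative values for a delay-and-sum example, not the DICIT configuration.

```python
import numpy as np

def fsb_response(w, d, fs, c, freqs, angles):
    """FSB response B(omega, nu) of a uniform linear array.

    w      : (N, L) FIR taps, one row per microphone n = -M..M
    d      : inter-microphone spacing (m); fs: sampling rate (Hz); c: speed of sound
    freqs  : frequencies (Hz); angles: angles nu (rad) relative to the array axis
    """
    N, L = w.shape
    M = (N - 1) // 2
    n = np.arange(-M, M + 1)
    omega = 2 * np.pi * np.asarray(freqs) / fs               # normalized angular freq.
    W = w @ np.exp(-1j * np.outer(np.arange(L), omega))      # W_n(omega), shape (N, F)
    tau = np.outer(n, np.cos(angles)) * d * fs / c           # tau_n(nu) in samples
    phase = np.exp(1j * omega[None, :, None] * tau[:, None, :])  # (N, F, A)
    return np.einsum('nf,nfa->fa', W, phase)                 # B(omega, nu)

# Delay-and-sum example (illustrative values only)
N, L, d, fs, c = 5, 64, 0.04, 16000.0, 343.0
w = np.zeros((N, L)); w[:, L // 2] = 1.0 / N         # pure delays, uniform weights
freqs = np.linspace(200, 5000, 49)
angles = np.linspace(0.0, np.pi, 181)                # index 90 = broadside (90 deg)
B = fsb_response(w, d, fs, c, freqs, angles)
P = 20 * np.log10(np.abs(B) + 1e-12)                 # beampattern in dB
dnu = angles[1] - angles[0]
den = 0.5 * np.sum(np.abs(B) ** 2 * np.sin(angles), axis=1) * dnu
D = np.abs(B[:, 90]) ** 2 / den                      # directivity D(omega)
```

Plotting `P` over frequency and angle reproduces the kind of frequency-dependent main beam shown for the DSB in Figure 2.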



Figure 2: Frequency-dependent and frequency-independent beampatterns

The White Noise Gain (WNG) quantifies a beamformer's ability to suppress spatially white noise as a function of frequency. It is given by

$$\mathrm{WNG}(\omega) = \frac{\left| \mathbf{w}_F^{T} \mathbf{d} \right|^2}{\mathbf{w}_F^{H} \mathbf{w}_F},$$

where $\mathbf{d} = [e^{j\omega\tau_{-M}(\nu_o)}, \ldots, e^{j\omega\tau_{M}(\nu_o)}]^T$ and $\mathbf{w}_F = [W_{-M}(\omega), \ldots, W_{M}(\omega)]^T$ denote the so-called steering vector and the vector of frequency responses of the beamforming filters, respectively. Note that a small WNG at a particular frequency corresponds to a low ability to suppress spatially white noise, resulting in an amplification of the noise at that frequency. Important errors, such as amplitude and phase errors in the microphone channels and microphone position errors, are nearly uncorrelated from sensor to sensor and affect the beamformer in a manner similar to spatially white noise [6]. Hence the WNG is a good measure of the beamformer's robustness to errors.
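A minimal sketch of the WNG computation, reusing the conventions of the previous sketch (FIR taps per microphone, steering delays $\tau_n(\nu_o)$); all parameter values are again illustrative.

```python
import numpy as np

def white_noise_gain(w, d, fs, c, freqs, nu0):
    """WNG(omega) = |w_F^T d|^2 / (w_F^H w_F) for a uniform linear array.

    w: (N, L) FIR taps per microphone; nu0: steering direction (rad).
    """
    N, L = w.shape
    M = (N - 1) // 2
    n = np.arange(-M, M + 1)
    omega = 2 * np.pi * np.asarray(freqs) / fs
    W = w @ np.exp(-1j * np.outer(np.arange(L), omega))    # w_F per frequency, (N, F)
    tau0 = n * d * np.cos(nu0) * fs / c                    # steering delays (samples)
    dvec = np.exp(1j * omega[None, :] * tau0[:, None])     # steering vector d, (N, F)
    num = np.abs(np.sum(W * dvec, axis=0)) ** 2            # |w_F^T d|^2
    return num / np.sum(np.abs(W) ** 2, axis=0)            # / (w_F^H w_F)
```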


2.1 Array Geometry

The array can take on a variety of different geometries depending on the application of interest. As mentioned previously, a simple method of approximating a frequency-independent beampattern, and thus covering broadband signals, is to implement the array as a series of sub-arrays which are themselves linear arrays with uniform spacing. The nested array structure chosen for the DICIT scenario is depicted in Figure 3. It consists of four sub-arrays, three of which consist of five microphones each and one of which consists of seven microphones. The microphone spacings for the four sub-arrays are 0.04 m, 0.08 m, 0.16 m and 0.32 m, respectively. The array also includes two additional microphones which are mounted 32 cm directly above the left- and right-most microphones in the nested array. These will not be utilized for beamforming here, since only linear arrays are considered.

Figure 3: Harmonically Nested Array

Each sub-array operates in a different frequency range, obtained by applying appropriate bandpass filtering to the sub-array outputs. The overall array output is obtained by combining the outputs of the bandlimited sub-arrays as depicted in Figure 4. For a general sub-array broadband beamformer, the beamforming filters are applied to the microphone signals before applying the bandpass filters. For the DICIT prototype, bandpass filters were chosen that cover the frequency bands 100…900 Hz, 901…1800 Hz, 1801…3600 Hz and 3601…8000 Hz. The sampling frequency is 48 kHz. The bandpass filters are FIR filters of length L = 256 and were designed according to the frequency sampling-based finite impulse response filter design [7].
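A sketch of this band-splitting stage under the stated band edges, using SciPy's frequency-sampling FIR design (`signal.firwin2`) as a stand-in for the design of [7]; the transition bandwidth is a made-up value.

```python
import numpy as np
from scipy import signal

fs = 48000.0
L_bp = 256                                # bandpass FIR length, as in the text
bands = [(100, 900), (901, 1800), (1801, 3600), (3601, 8000)]  # Hz, per sub-array

def bandpass_fir(lo, hi, numtaps, fs, trans=50.0):
    """Frequency-sampling-based FIR bandpass (trans = assumed transition width, Hz)."""
    freq = [0, max(lo - trans, 1.0), lo, hi, hi + trans, fs / 2]
    gain = [0, 0, 1, 1, 0, 0]
    return signal.firwin2(numtaps, freq, gain, fs=fs)

filters = [bandpass_fir(lo, hi, L_bp, fs) for lo, hi in bands]

def combine_subarrays(subarray_outputs):
    """Sum the band-limited sub-array outputs into the overall array output.

    subarray_outputs: list of 1-D arrays, one beamformed signal per sub-array.
    """
    return sum(signal.lfilter(h, 1.0, y)
               for h, y in zip(filters, subarray_outputs))
```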


Figure 4: Nested sub-array structure


2.2 Data-independent Beamforming Designs for DICIT

The design choice will be made between two fixed non-superdirective beamforming designs and one fixed superdirective beamformer design. Simulation results (e.g. beampatterns, WNG, etc.) for each of these design configurations, as applied to the DICIT scenario, are shown in the following subsections.

2.2.1 DSB with Dolph-Chebyshev Window Weighting

For the DSB with Dolph-Chebyshev Window Weighting (DSB-DC) [2] design, the microphone signals are first weighted by applying a Dolph-Chebyshev window before being processed by a DSB. The DSB design is based on the idea that the desired output contribution of each of the array microphones will be the same, except that each one will be delayed by a different amount. Therefore, if the output of each of the sensors is delayed and weighted appropriately, the signal originating from a desired spatial region will be reinforced, while noise and interfering signals from other spatial regions will generally be attenuated. This is the most robust beamforming design considered for the DICIT project. The major disadvantage of the DSB-DC is that it produces a frequency-dependent beampattern, since it is a narrowband beamforming design.

Figure 5 depicts the beampattern and WNG for a DSB-DC design utilizing the whole nested array. It is clear that by using the nested array a relatively frequency-invariant beampattern is obtained, and this leads to improved spatial selectivity. Note that the sidelobes appearing at about 7 kHz are due to spatial aliasing. The WNG figure shows that this design is very robust to errors. The WNG at lower frequencies is higher than at the higher frequencies. This is due to the fact that the sub-array covering the lower frequencies consists of seven microphones while the other three sub-arrays consist of five microphones each. The higher the number of microphones used in the sub-array, the higher the WNG for the DSB-DC design.
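As a sketch of the DSB-DC principle, the amplitude weights can be obtained from SciPy's Dolph-Chebyshev window; the 30 dB sidelobe attenuation is an assumed value, and the steering delays are rounded to integer samples for simplicity.

```python
import numpy as np
from scipy.signal.windows import chebwin

def dsb_dc_weights(N, sidelobe_db=30.0):
    """Dolph-Chebyshev amplitude weights for a delay-and-sum beamformer."""
    w = chebwin(N, at=sidelobe_db)
    return w / np.sum(w)                   # unit gain in the look direction

def dsb_dc_output(x, delays, weights):
    """Delay (rounded to integer samples for simplicity), weight and sum.

    x: (N, T) microphone signals; delays: per-channel steering delays in samples.
    """
    y = np.zeros(x.shape[1])
    for xn, dn, wn in zip(x, delays, weights):
        y += wn * np.roll(xn, -int(round(dn)))
    return y
```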


Figure 5: Beampattern and WNG for DSB design

2.2.2 FSB based on a Dolph-Chebyshev Design

For a FSB based on a Dolph-Chebyshev design (FSB-DC), the FIR filters are obtained by applying Dolph-Chebyshev windows with a predefined frequency-independent peak-to-zero distance of the beampattern [4] at a set of discrete frequencies. These frequency-dependent Dolph-Chebyshev windows are then fed into the Fourier approximation filter design to determine the FIR filters. A FSB-DC is designed where the first null is frequency-independent for frequencies greater than a given lower limit. This lower limit is determined by the microphone spacing [4]. For frequencies below this limit, a simple DSB is designed. This beamforming design is less robust than the DSB-DC but guarantees a frequency-independent beampattern above the given lower limit. It suffers from the same problems as


the DSB-DC below this limit. Figure 6 depicts the beampattern and WNG obtained for a FSB-DC design using the nested array with each sub-array consisting of 5 microphones. In comparison to the beampattern in Figure 5, the main beam is narrower and shows an improved frequency-invariance but it also shows that the attenuation of off-axis signals is lower. The WNG is very similar to that of the DSB-DC design and this design is therefore also robust.

Figure 6: Beampattern and WNG for FSB-DC

2.2.3 Least-Squares Frequency-Invariant Beamformer

As a novel and very general beamformer design method, the Least-Squares Frequency-Invariant Beamformer (LS-FIB) design [5] uses a linear basis which optimally approximates desired spatio-spectral array characteristics in the least-squares sense and inherently leads to superdirective beamformers for low frequencies, if the aperture is small relative to the wavelengths [5].


Figure 7 depicts the beampattern and WNG obtained for a LS-FIB design using the nested array with each sub-array consisting of 5 microphones. In comparison to the beampatterns of the previous designs, the main lobe is narrowest and compares favorably with the FSB-DC in terms of frequency-invariance. The major advantage of this design is that there is good spatial selectivity at the very low frequencies. The WNG is very small at very low frequencies. This means that this design is very sensitive to errors.

Figure 7: Beampattern and WNG for LS-FIB

Due to its superdirective nature, the LS-FIB design gives the best spatial selectivity when the number of sensors is small. It has no restrictions on sensor positioning but is very sensitive to random errors and thus small random errors lead to a significant loss in spatial selectivity. This becomes more significant as the number of microphones increases. The sensitivity of the design may be reduced by adjusting some design parameters but this leads to a loss in spatial


selectivity. The use of matched microphones with low self-noise and well-calibrated arrays is strongly recommended when using this design.

2.3 Beamforming Module Structure for DICIT

In the first prototype for the DICIT project, the adaptive beamforming module will be made up of two units, namely the Steering Unit (SU) and the Fixed Beamforming (FBF) unit, as depicted in Figure 8. The SU consists of a set of fractional delay filters which facilitate the steering of the beam to the desired look direction in order to track movements of the source. The desired look direction will be supplied by the source localization module. In the FBF unit the FSB-DC will be utilized due to its relatively frequency-invariant main beam and its robustness to errors.

Figure 8: Signal flow block diagram
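A minimal sketch of the SU's fractional-delay steering, using windowed-sinc interpolation filters; the filter length, window and normalization are illustrative choices, not the DICIT implementation.

```python
import numpy as np
from scipy.signal import lfilter

def fractional_delay_fir(delay, numtaps=33):
    """Windowed-sinc fractional-delay FIR (delay in samples, may be non-integer)."""
    n = np.arange(numtaps)
    center = (numtaps - 1) / 2.0
    h = np.sinc(n - center - delay)          # shifted sinc kernel
    h *= np.hamming(numtaps)                 # window to limit truncation ripple
    return h / np.sum(h)                     # normalize DC gain

def steer_array(x, d, fs, c, look_dir_rad):
    """Apply per-channel fractional delays so the look direction is time-aligned.

    x: (N, T) microphone signals of a uniform linear array with spacing d.
    """
    N = x.shape[0]
    M = (N - 1) // 2
    n = np.arange(-M, M + 1)
    tau = n * d * np.cos(look_dir_rad) * fs / c      # relative delays (samples)
    tau -= tau.min()                                  # make all delays causal
    return np.stack([lfilter(fractional_delay_fir(t), 1.0, ch)
                     for t, ch in zip(tau, x)])
```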

3. Multi-channel Acoustic Echo Cancellation (MC-AEC)

Acoustic echoes appear due to the coupling between loudspeakers and microphones, i.e. due to the lack of acoustical barriers: apart from the speech uttered by the near-end speaker v(k) and a noise signal, the microphones in the receiving room also acquire the far-end signal that is played back via the loudspeakers. The term "Acoustic Echo Cancellation" denotes any signal processing technique that aims at reducing the reverberated loudspeaker signals within a microphone signal y(k). In the DICIT scenario the loudspeaker signals constitute the DICIT system audio output, which consists of the TV audio signals and the Text-to-Speech output driven by the system's dialogue manager. In this case, rather than disturbing communication between humans, acoustic echoes impair machine-based speech recognition.


Thus AEC is a crucial means to improve the recognition rate of an Automatic Speech Recognizer (ASR), providing the ASR with the echo-compensated signal e(k) that should mainly contain the utterance of the desired speaker v(k). Figure 9 depicts the employment of AEC in a Human-Machine-Interface System as is implemented in the DICIT project.

Figure 9: MC-AEC in Human-Machine-Interface System

The relation between the original loudspeaker signals and their contribution to y(k) is established by the time-variant impulse responses of the Loudspeaker-Enclosure-Microphone (LEM) system h1(k)…hP(k), with P being the number of channels; the time-variance is due to continuous changes of the acoustic environment, e.g. caused by temperature changes, door openings or user movements. An Acoustic Echo Canceller (AEC) as depicted in Figure 9 models these impulse responses by means of digital filters ĥ1(k)…ĥP(k). The echo replicas ŷ1(k)…ŷP(k), computed via convolution of the AEC filter responses with the known loudspeaker signals x1(k)…xP(k), are then subtracted from the microphone signal y(k), leading to the desired echo reduction. As to the design of ĥ1(k)…ĥP(k), adaptive filters are an adequate means to track the temporal variations of the LEM system. Among the different filter structures that have been studied, finite impulse response (FIR) filters are usually chosen, on the one hand for reasons of simplicity, but also because they guarantee stability during adaptation, which an infinite impulse response (IIR) structure does not. However, the employment of FIR filters necessarily implies a certain error due to the approximation of infinite LEM impulse responses by finite models. The related system mismatch (tail effect) is considered as part of the noise contribution. In the following, the Generalized Frequency Domain Adaptive Filtering (GFDAF) concept is outlined in Section 3.1 as an adequate algorithm for realizing MC-AEC in DICIT, and Stereo-AEC in particular for the first prototype [8]. Section 3.2 describes a new channel decorrelation approach according to [9].


3.1 Generalized Frequency Domain Adaptive Filtering (GFDAF)

As described in [8], the low-complexity algorithms used in conventional single-channel AEC, such as the Normalized Least Mean Square (NLMS) algorithm, do not achieve sufficient convergence when used for MC-AEC. This is due to the fact that these algorithms do not take the cross-correlations between the different channels into account. Consequently, not only is the convergence rate slowed down, but the solution for the adaptive filters may even diverge.

Figure 10: MC-AEC misalignment convergence comparison for NLMS and FDAF [8]

The effect of taking the cross-correlations into account is depicted in Figure 10, which shows the misalignment convergence for FDAF and the basic NLMS in the multi-channel cases P=2 (respective lowest curve) to P=5 (respective uppermost curve) [8]. In the case of time-invariant environments and stationary, highly correlated signals, the optimal choice for the AEC adaptive algorithm in the time domain is the Recursive Least Squares (RLS) algorithm. However, its high computational complexity and its sensitivity to nonstationary signal statistics discourage its use in real-time applications. In the case of MC-AEC, the correlation matrix is worse conditioned than in single-channel AEC scenarios (i.e. the input channels are not only highly auto-correlated but also cross-correlated), which implies that the inversion of the autocorrelation matrix of the input channels becomes numerically highly sensitive, and the recursive computation of the inverse autocorrelation matrix for the RLS using the matrix inversion lemma leads almost surely to numerical instabilities. In conclusion, with respect to MC-AEC the use of both simple NLMS-like algorithms and RLS-based algorithms is ruled out, due to poor efficiency and to high computational complexity and numerical sensitivity, respectively. As an alternative solution, algorithms for frequency-domain adaptive filtering offer fast convergence combined with acceptable computational complexity. These types of algorithms


are based on a realization of all filtering operations as fast convolutions in the DFT domain and the resulting applicability of the Fast Fourier Transform (FFT). The algorithm employed for the DICIT scenario is based on the Generalized Frequency Domain Adaptive Filtering (GFDAF) paradigm presented in [8]. Exploiting the computational efficiency of the FFT, it factors in the cross-correlations among the different channels of the input signal and thereby enables a faster convergence of the filters. This faster filter adaptation manifests itself in faster echo suppression. In DICIT this is especially important because user movements have to be expected in the Interactive TV scenario, which in turn imply rapid changes of the impulse responses of the LEM system that has to be identified by the adaptive filters. The chosen algorithm thus constitutes an appropriate solution to the addressed problem. A preceding channel decorrelation, described in the following Section 3.2, allows a further speed-up of the filter convergence. For the first prototype a Stereo-AEC algorithm has been implemented, which is intended to be extended to a 5.1-version for the second prototype.
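For illustration, the following sketch shows one overlap-save block update of a single-channel frequency-domain adaptive filter (a constrained, FLMS-style step). It conveys the fast-convolution principle only; the actual GFDAF of [8] additionally accounts for the cross-correlations between multiple input channels. Step size, block handling and regularization are illustrative values.

```python
import numpy as np

def fdaf_step(H, x_block, y_block, x_prev, mu=0.5, eps=1e-6):
    """One overlap-save step of a single-channel frequency-domain adaptive filter.

    H       : (2B,) complex frequency-domain filter estimate (block length B)
    x_block : (B,) newest loudspeaker samples; x_prev: (B,) previous block
    y_block : (B,) microphone samples aligned with x_block
    Returns the updated H and the echo-compensated block e.
    """
    B = len(x_block)
    X = np.fft.fft(np.concatenate([x_prev, x_block]))      # (2B,) input spectrum
    y_hat = np.fft.ifft(X * H).real[B:]                    # valid half (overlap-save)
    e = y_block - y_hat                                    # error = echo-compensated
    E = np.fft.fft(np.concatenate([np.zeros(B), e]))
    S = np.abs(X) ** 2 + eps                               # per-bin power normalization
    grad = np.fft.ifft(np.conj(X) * E / S).real
    grad[B:] = 0.0                                         # gradient constraint
    H = H + 2 * mu * np.fft.fft(grad)
    return H, e
```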

3.2 Channel Decorrelation for MC-AEC

As already mentioned above, the channels of the reference input signal are usually very similar and therefore not only highly auto-correlated but often also strongly cross-correlated. Without decorrelating the different channels prior to playback and echo path estimation, these strong cross-correlations lead to an ambiguity with respect to the solution that minimizes the underlying error criterion. The algorithm might therefore converge to a solution which minimizes the block error signal, and thus leads to an echo reduction, but without modelling the "correct" impulse responses of the LEM system. Consequently, a change of the acoustical environment might result in a total system mismatch and thus in a breakdown of the AEC performance until the filters have converged to a new solution.

Nonetheless, this is not the only requirement that has to be met by the employed channel decorrelation technique. In addition, the subjective audio quality of the TV output must not be impaired by the decorrelation, i.e. the introduced signal manipulations must not cause audible artifacts; this is especially important in a multimedia application like the DICIT interactive TV scenario. Furthermore, with respect to the real-time application and costs, simplicity is a crucial issue: the channel pre-processing should not require an excessive amount of computational resources, in order to minimize the total computational expense and consequently the equipment cost.

Summarizing, the decorrelation of the loudspeaker signals is thus decisive for robust AEC and a fast convergence of the filters. Compared to the nonlinear pre-processing method that has been applied so far, the phase modulation-based approach according to [9], described in the following, enhances the convergence behavior of the adaptive filters without impairing the subjective audio quality, i.e. without destroying the spatial (stereo) image of the reproduced sound. Please note that the first running real-time implementation of this scheme has been developed for DICIT.


The decorrelation strategy to be employed in DICIT is based on psychoacoustics and the realization that applying a time-varying phase modulation to a monophonic audio signal is a relatively simple method which does not damage the perceptual sound quality of the signal. However, applying a phase modulation to a stereo pair might degrade the perceived sound image. Therefore, in order to obtain a maximum decorrelation of the signals without altering the stereo image, the interaural phase differences introduced by the time-varying phase modulation must not surpass the threshold of perception; at the same time, to achieve the highest possible degree of channel decorrelation, the introduced signal manipulations should be chosen as close as possible to the limit given by this threshold. Moreover, phase differences are not perceived equally by the human hearing in different frequency ranges [9]: as depicted in Figure 11, the sensitivity decreases gradually with increasing frequency until it vanishes for frequencies above 4 kHz. Therefore, a frequency-selective approach based on psychoacoustics appears appropriate to deliver the best possible channel decorrelation for the DICIT scenario.

Figure 11: Phase modulation amplitude as a function of frequency subband [9]

In practice, the scheme is implemented by employing pairs of analysis and synthesis filterbanks and applying a phase modulation to the transform-domain coefficients after analysis. The complete phase modulation block diagram, for a stereophonic application, is depicted in Figure 12.

Figure 12: Stereo-decorrelation employing frequency-dependent phase modulation [9]

As for the filterbank design, a “Modulated Complex Lapped Transform” (MCLT) is employed, which was first introduced in [10]. A window length chosen as L=128 results in 64


MCLT-domain subbands. This complex-valued modulation allows for perfect reconstruction of the signal after overlapping the output blocks by 50%. Using a complex-valued filter bank allows the easy manipulation of the phase in the individual frequency subbands. The phase modulation is performed by a multiplication or division of the MCLT-domain coefficients with $e^{j\varphi(t,s)}$, where $\varphi(t,s)$ is the time-varying phase shift. This phase shift is composed of a modulation function multiplied by a subband-dependent scaling factor. As explained in [9], the chosen modulation function must be smooth and the modulation frequency must be low, to ensure that no perceptible frequency shift is introduced (a frequency modulation, with a frequency shift proportional to the derivative of the phase modulation function, is introduced as a consequence of the phase modulation). Nevertheless, the modulation frequency has to be chosen carefully, because the decorrelation introduced by an extremely low modulation frequency will not lead to a sufficient enhancement of the echo cancellation performance. For reasons of simplicity, a sine wave with a relatively low modulation frequency was chosen. The time-varying phase shift is given by

$$\varphi(t,s) = a(s)\, \sin(2\pi f_{m,\mathrm{stereo}}\, t), \quad \text{with } f_{m,\mathrm{stereo}} = 0.75.$$

The subband-dependent scaling factors a(s) were designed and optimized in a listening test procedure by the Fraunhofer Institute for Integrated Circuits (IIS, Erlangen, Germany), Audio Group [9]. The modulation scale factors for the first 12 subbands are chosen according to the curve depicted in Figure 11. Note that the amplitude of the phase modulation introduced in the first coefficients is very low but increases with increasing subband number and becomes maximal within the seventh subband; this corresponds to a frequency of 2.5 kHz (given M=64).

Besides its simplicity, the chosen decorrelation scheme is also attractive because it can easily be extended to applications with more than two playback channels, such as the foreseen second DICIT prototype enabling 5.1-playback. In this case the channels are grouped into channel pairs, with each pair being treated like a separate stereo signal. Every pair will be pre-processed by employing the described modulation scheme, but with different modulation frequencies for the phase modulation. To ensure an orthogonal modulation across all the modulators, the modulation frequencies chosen must be non-commensurate.

It was mentioned before that the design of the algorithmic parameters was justified by listening tests. Figure 13 depicts the results of a listening test according to the MUSHRA standard ("Multi Stimulus test with Hidden Reference and Anchor") that was conducted by the Fraunhofer Institute for Integrated Circuits (IIS, Erlangen, Germany) [9]. The outcome is based on the assessment of ten subjects, nine of them experienced listeners. It is evident that the phase modulation scheme (indicated in green) clearly outperforms all other investigated decorrelation methods, including the nonlinear pre-processing, which is represented by the blue bars. Being the only method whose corresponding signal quality was
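A rough sketch of the frequency-selective phase modulation, using an STFT analysis/synthesis pair as a stand-in for the MCLT filterbank; the scale-factor ramp a(s) is a made-up placeholder, not the psychoacoustically optimized values of [9], and reconstruction is only approximate under phase modification.

```python
import numpy as np
from scipy.signal import stft, istft

def decorrelate_stereo(x_left, x_right, fs, f_mod=0.75, nperseg=128):
    """Frequency-selective phase-modulation decorrelation (STFT stand-in for MCLT).

    f_mod is the modulation frequency (assumed to be in Hz); the scale
    factors a(s) below are a placeholder monotone ramp.
    """
    f, t, XL = stft(x_left, fs=fs, nperseg=nperseg)
    _, _, XR = stft(x_right, fs=fs, nperseg=nperseg)
    S = XL.shape[0]                                     # number of subbands
    a = np.clip(np.arange(S) / 7.0, 0.0, 1.0) * 0.5     # ramp peaking near subband 7
    phi = a[:, None] * np.sin(2 * np.pi * f_mod * t)[None, :]
    XL *= np.exp(1j * phi)                              # 'multiplication' channel
    XR *= np.exp(-1j * phi)                             # 'division' channel
    _, yl = istft(XL, fs=fs, nperseg=nperseg)
    _, yr = istft(XR, fs=fs, nperseg=nperseg)
    return yl, yr
```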


consistently rated "excellent", except for the signal "glock" (for which it nevertheless delivered the best result, "good"), the phase modulation is the only approach that practically preserves subjective audio quality relative to the state of the art.

Figure 13: Subjective audio quality for pre-processing methods [9]

Note that this result already reflects the quality of the pre-processing scheme employed in a 5.1-scenario. Since the processing of more than two channels is based on pair-wise pre-processing, as noted before, the results illustrated in Figure 13 are valid not only for 5.1 but also for the stereo case.

Figure 14: Convergence comparison of pre-processing methods for stereo AEC [9]

Concluding, Figure 14, taken from [9], illustrates the acceleration of the convergence of the coefficient error norm achieved by the various pre-processing methods. Considering the fact that the phase modulation approach practically does not impair the subjective quality of the signals, a significant improvement of the convergence behavior is observable, suggesting this approach to be an adequate processing scheme for the DICIT prototype.


4. Source Localization (SLoc)

From a general point of view, the source localization (SLoc) problem consists in deriving the position of one or more emitting sources given the acoustic measurements provided by a set of sensors. This research field has been widely investigated over the years and a wide range of different approaches has been proposed in the literature [14]. The most common technique relies on the estimation of the time difference that characterizes the different propagation paths between a source and two sensors. This difference is referred to as the Time Difference of Arrival (TDOA). For a single microphone pair, the locus of points which pertain to a given TDOA is one sheet of a hyperboloid of two sheets, as depicted in Figure 15.

Figure 15: Loci of points that satisfy a given TDOA at two microphones m1 and m2

When a point-like source is in a far-field position, the wave fronts can be assumed to be planar and the hyperboloid can be approximated by a cone. If we restrict the analysis to a plane, each TDOA corresponds to a Direction of Arrival (DOA), as illustrated in Figure 16.

Figure 16: TDOA given two microphones and a source in far field position. ν identifies the direction of arrival.

When a set of TDOA estimations is available from a set of microphone pairs, one can derive the position of the source as the point in space that best fits all the measurements. From this point of view, distributed microphone networks, similar to those exploited in the CHIL project, guarantee a more uniform coverage of the space than compact arrays and permit the design of very accurate localization algorithms. The most critical issue for a SLoc algorithm is reverberation, which is generated by the reflections that characterize the acoustic wave propagation in an enclosure. Reverberation is


critical because a virtual sound source is generated wherever a reflection occurs [15]. Although reflections considerably weaken the signal, and hence the real source is always predominant over the virtual ones, in some cases constructive interference between reflected patterns may compromise the TDOA estimation. Environmental noise and the presence of coherent noise sources make the SLoc task even more difficult.

4.1 SLoc in DICIT

Within the DICIT project, the goal of the SLoc module is to provide the beamformer with accurate information about the position of the active speaker. The localization information is also made available to other modules, for instance the Smart Speech Filtering (SSF) module, which may exploit the desired speaker's position as an extra feature in an attempt to improve its own performance. Besides the typical issues related to the SLoc problem, the goals and constraints of the DICIT project, for instance the application scenario and the sensor setup, introduce particular issues which influence the algorithm design. The next sections briefly describe these problems. Finally, the DICIT scenario calls for the localization of several sources if several users are taken into account. For this situation there is an established method based on Blind Source Separation [11], [12], for which a real-time demonstrator is already available; it will be evaluated for the DICIT scenario in Task 3.4 and described in Deliverable D3.2. In Task 3.3, covered by this deliverable, a new coherence-based localization algorithm for multiple sources will be described that provides a beamformer with accurate information about the position of the desired speaker.

4.1.1 Array Design

In recent years, the SLoc community has shown a growing interest in distributed microphone networks and circular arrays. Unfortunately, the characteristics of the DICIT application do not allow this kind of sensor deployment, and hence a linear nested array is adopted, as depicted in Figure 4. The nested array makes it possible to exploit different sub-arrays in order to meet the requirements of each technology in terms of inter-microphone distance. For SLoc purposes we exploit the 7 microphones at 32 cm distance, plus the two vertical ones, which permit estimation of the vertical position of the speaker. When using TDOA estimation for localization, the distance between sensors is a crucial aspect: a large inter-microphone distance guarantees a higher resolution, but at the cost of a reduced performance in terms of TDOA estimation, as the coherence of the desired signal between the microphones decreases. An inter-microphone distance of 32 cm is a reasonable trade-off for the first prototype. It is worth underlining that the adoption of a compact array, where microphones are deployed along a single horizontal line, renders the estimation of distance difficult [16] and makes the estimation of elevation impossible. Figure 17 illustrates how the microphone arrangement influences the localization accuracy. Let us assume that two microphone pairs are deployed on the same wall, and let us consider a set of positions distributed on a 7×7 grid in a room. From


each of the TDOAs corresponding to those positions, a set of 1000 noisy TDOAs is derived by adding white Gaussian noise. The source positions are then computed from the noisy TDOAs through simple triangulation, i.e. crossing DOAs, resulting in 1000 points distributed around the original position. Obviously the shapes of the estimate distributions depend on the original position, and the noisy TDOA measurements introduce a higher uncertainty along the radial direction relative to the microphone pairs' centers.

Figure 17: Effect of noisy time delay estimations in a double microphone pair set-up. Microphone pairs are identified by the two pairs of black points on the left.
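The experiment behind Figure 17 can be reproduced with a sketch along these lines (pair geometry, noise level and source position are assumed values): perturb the true TDOAs, convert each to a far-field DOA ray from the pair's center, and cross the rays.

```python
import numpy as np

c = 343.0
# Two vertical microphone pairs on the x = 0 wall (assumed geometry, metres)
pairs = [((0.0, 1.00), (0.0, 1.32)), ((0.0, 3.00), (0.0, 3.32))]

def tdoa(src, m1, m2):
    """True TDOA (seconds) of a source at src for microphones m1, m2."""
    return (np.hypot(src[0] - m1[0], src[1] - m1[1])
            - np.hypot(src[0] - m2[0], src[1] - m2[1])) / c

def doa_ray(m1, m2, td):
    """Far-field DOA ray (origin, direction) from one pair and a TDOA."""
    m1, m2 = np.asarray(m1), np.asarray(m2)
    d = np.linalg.norm(m2 - m1)
    cos_nu = np.clip(c * td / d, -1.0, 1.0)
    sin_nu = np.sqrt(1.0 - cos_nu ** 2)
    axis = (m2 - m1) / d
    normal = np.array([axis[1], -axis[0]])      # points into the room (x > 0)
    return (m1 + m2) / 2, cos_nu * axis + sin_nu * normal

def cross_rays(p1, d1, p2, d2):
    """Intersect two 2-D rays (least squares if nearly parallel)."""
    t = np.linalg.lstsq(np.column_stack([d1, -d2]), p2 - p1, rcond=None)[0]
    return p1 + t[0] * d1

src = np.array([2.5, 2.0])                      # assumed true source position
rng = np.random.default_rng(0)
est = []
for _ in range(1000):
    rays = [doa_ray(m1, m2, tdoa(src, m1, m2) + rng.normal(0.0, 20e-6))
            for (m1, m2) in pairs]              # 20 us TDOA noise (assumed)
    est.append(cross_rays(*rays[0], *rays[1]))
est = np.asarray(est)                           # cloud scattered around src
```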

As a final remark, a compact array clearly requires that users speak towards the sensors, i.e. frontally; otherwise localization cannot be carried out due to the lack of direct propagation paths. However, inviting the user to look at the television he or she is trying to control is a reasonable constraint.

4.1.2 Application Scenario

Aside from the above-mentioned array design issues, the SLoc problem in DICIT is further complicated by the application scenario. The behavior of naïve users was monitored in a series of WOZ experiments. In particular, it was observed that users tend to pronounce very short utterances that correspond to single commands for the system. As a consequence, silence is predominant, and the overall length of the speech segments amounts to only about 15-20% of the interaction with the system. It was also observed that users change their positions while being silent between two consecutive commands. Finally, it is worth mentioning that the final prototype is expected to work in the presence of a home-theatre audio system where some loudspeakers will be located in a frontal position with respect to the microphone array. As a consequence, loudspeaker signals must be dealt with correctly in order to ensure proper system operation.


4.2 Adopted SLoc Approach

Given the project goals and specific issues, in this section we present the solution adopted in DICIT. Figure 18 presents the block diagram of the SLoc module.

Figure 18: SLoc module block diagram.

As already mentioned, the most common approach to the SLoc problem is based on TDOA estimations at a set of microphone pairs. A very efficient method to estimate the TDOA is provided by the CrossPower-Spectrum Phase analysis (CSP) [17], also known as Generalized Cross Correlation – Phase Transform (GCC-PHAT) [18], which is in general used for a single source, but shall now also form the basis for localizing several simultaneously active sources. Note that multiple simultaneously active sound sources will also be considered in Task 3.4 of the DICIT project and TDOA estimation methods based on blind source separation [11], [12], [13] will be investigated in this context. As for the CSP method for a single source, let us consider a microphone pair p and denote as xp1(k) and xp2(k) the signals acquired by the two microphones. CSP is defined as:

$$C_p(\tau) = \mathrm{DFT}^{-1}\left\{ \frac{X_{p1} \cdot X_{p2}^{*}}{\left| X_{p1} \right| \cdot \left| X_{p2} \right|} \right\}$$

where X_{p1} and X_{p2} are the DFTs of the acquired signals and τ denotes the time lag. The CSP provides a measure of the similarity between the two signals aligned according to the given time lag. It has been demonstrated that in noiseless free-field conditions the CSP presents a prominent peak in correspondence with the actual TDOA. Conversely, in highly reverberant environments reflections may give rise to spurious peaks. A large inter-microphone distance guarantees a better resolution but decreases the robustness of the approach. On the other hand, a higher sampling rate may be exploited to increase the resolution, at the cost of a heavier computational load. As indicated before, in a multi-microphone scenario such as the one envisioned in DICIT, a set of TDOA measurements, computed for a set of microphone pairs, can be combined in order to obtain a single accurate estimate of the source position. In general the computed result is the


point in space that best fits the set of measurements. Depending on the characteristics of the particular problem, several different approaches can be found in the literature. In our implementation we chose a steered beamformer-like approach, based on the CSP, which performs a full search over the space of possible source positions. Let us assume that a set of P microphones is available, and let us define a grid Σ of points s that uniformly covers the spatial area of interest. The adopted approach computes a function, defined over the above-introduced grid, that represents a measure of the plausibility that an active source is present at s. The resulting function, which can be evaluated in either three or two dimensions, is also referred to as an "acoustic map", as it gives a representation of the acoustic activity in an enclosure. The speaker position estimate is obtained by maximizing this map. The plausibility can be evaluated in several ways; in the next sections we present the two methods we have implemented in DICIT so far. This kind of approach is easy to implement and represents a straightforward way to exploit the redundancy provided by a microphone array. It guarantees a high level of flexibility, which allows us to quickly modify the implementation according to both the experimental results and the behaviour of the first prototype. It can easily be downscaled in order to meet potential computational power limitations. Moreover, this approach does not make any assumptions about the characteristics of the scenario and is therefore suitable for deployment as a prototype in a real environment. Finally, as we will see later, this method is well suited to a multi-speaker context and allows the TV loudspeakers to be handled efficiently.
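A minimal sketch of the CSP/GCC-PHAT computation for one microphone pair, using an FFT-based cross-correlation with PHAT weighting; the maximum lag is an assumed parameter.

```python
import numpy as np

def csp(x1, x2, max_lag):
    """CrossPower-Spectrum Phase (GCC-PHAT) for one microphone pair.

    Returns the lags (in samples) and the CSP function C_p(tau);
    the TDOA estimate is the lag at which C_p peaks.
    """
    n = len(x1) + len(x2)                     # zero-pad against circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])   # lags -max..+max
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc

# Usage: lags, cc = csp(xp1, xp2, max_lag=64); tdoa_hat = lags[np.argmax(cc)]
```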

4.2.1 Global Coherence Field

The first approach that we have implemented is based on the Global Coherence Field (GCF) theory [19]. For each point s on the grid Σ, the GCF function is defined as follows:

$$\mathrm{GCF}(s) = \frac{1}{P} \sum_{p=0}^{P-1} C_p(\delta_p(s))$$

where δ_p(s) is the geometrically determined TDOA at microphone pair p if the source is located at s. As mentioned above, the source position can be estimated as the point that maximizes the GCF. In our implementation δ_p(s) is rounded to the closest integer delay (in samples). In a distributed microphone network scenario, the GCF can be extended to the so-called Oriented Global Coherence Field (OGCF) [20], which is capable of also estimating the orientation of the source. However, the adopted linear array provides a limited angular coverage for a directional ('oriented') source, and as a consequence the potential of the OGCF can only be partially utilized.
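A sketch of the GCF map computation under these conventions, reusing the csp() helper from the previous sketch; positions and grid are assumed to be 2-D.

```python
import numpy as np

def gcf_map(csp_funcs, grid, mic_pairs, fs, c=343.0):
    """Global Coherence Field over a grid of candidate source points.

    csp_funcs : list of (lags, cc) per pair, as returned by csp() above
    grid      : (S, 2) candidate points s
    mic_pairs : list of ((x, y), (x, y)) microphone positions per pair
    """
    gcf = np.zeros(len(grid))
    for (lags, cc), (m1, m2) in zip(csp_funcs, mic_pairs):
        d1 = np.linalg.norm(grid - np.asarray(m1), axis=1)
        d2 = np.linalg.norm(grid - np.asarray(m2), axis=1)
        # geometric TDOA delta_p(s), rounded to the closest integer lag
        delta = np.rint((d1 - d2) / c * fs).astype(int)
        gcf += cc[np.clip(delta - lags[0], 0, len(lags) - 1)]
    gcf /= len(mic_pairs)
    return gcf, grid[np.argmax(gcf)]          # acoustic map and estimated position
```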

4.2.2 Sub-optimal Least Squares
As an alternative approach, a sub-optimal Least Squares (LS) method has been implemented. The solution is sub-optimal because the search for the point that minimizes the LS criterion is restricted to a sampled version of the space of source coordinates, i.e. the grid Σ. The resulting acoustic map is computed in the following way [21]:

\[
\mathrm{LS}(s) = -\frac{1}{P} \sum_{p=0}^{P-1} \left(\tau_p - \delta_p(s)\right)^2
\]

where τp is the time lag that maximizes Cp(τ). Again, the source position estimate is the point that maximizes the objective function. This approach is less critical from a computational-load point of view, since it does not require repeated access to the full CSP vectors. However, it is intrinsically weaker than the GCF because the decision on τp is taken before the single contributions are combined. Nevertheless, in a linear array scenario, where the speaker can be assumed to be facing the microphones, this approach delivers satisfactory performance, in line with that obtained with the GCF method. Moreover, this solution allows the TDOA estimates to be refined through interpolation, which is not feasible in practice on the whole CSP functions as required in a GCF implementation.
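A corresponding sketch for the LS map, under the same assumptions as the GCF example (each τp is extracted beforehand as the lag maximizing Cp(τ)):

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def ls_map(tau, mic_pairs, grid):
    """Sub-optimal LS acoustic map over the grid (illustrative helper).

    tau[p] : TDOA in seconds that maximizes the CSP of pair p
    """
    P = len(mic_pairs)
    ls = np.zeros(len(grid))
    for p, (a, b) in enumerate(mic_pairs):
        delta = (np.linalg.norm(grid - a, axis=1)
                 - np.linalg.norm(grid - b, axis=1)) / C_SOUND
        ls -= (tau[p] - delta) ** 2   # squared TDOA mismatch
    return ls / P                     # maximized by the source position
```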

4.2.3 Tracking
In order to guarantee smooth localization outputs, single-frame localization estimates are processed by a threshold-and-hold filter. In a real implementation, some localization estimates are less reliable due to the spectral content of the corresponding speech frames as well as to the level of noise. Such estimates are characterized by a low peak in the acoustic map and are randomly spread over the search area. The idea is to filter out those localizations whose corresponding acoustic-map peaks are below a given threshold, as well as isolated outliers that are too distant from the current tracking area. When a frame is skipped, the filter keeps the previous localization estimate. The post-processing works in the following three steps:

1. If the acoustic map peak is below the threshold, skip the frame; otherwise go to the next step.

2. If the current localization is "close" to previous localizations, accept it as a good localization; otherwise go to the next step.

3. If a sufficient number of localization estimates concentrates in a particular area, accept the current localization as a good localization.

This simple and computationally light post-processing guarantees sufficient accuracy, as reported in the next section. Since it was observed that users prefer to move while not speaking, the implementation is designed to react quickly when the user moves to a different position.
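A minimal sketch of the threshold-and-hold filter described above (the class and parameter names are assumptions, not the DICIT code):

```python
import numpy as np

class ThresholdAndHold:
    """Sketch of the three-step post-processing filter."""

    def __init__(self, peak_thr, dist_thr, n_confirm):
        self.peak_thr = peak_thr    # minimum acoustic-map peak (step 1)
        self.dist_thr = dist_thr    # radius of the tracking area (step 2)
        self.n_confirm = n_confirm  # candidates needed to accept a jump (step 3)
        self.last = None            # held estimate
        self.cand = []              # candidates outside the tracking area

    def update(self, pos, peak):
        # 1. low map peak: skip the frame, keep the previous estimate
        if peak < self.peak_thr:
            return self.last
        # 2. close to the current tracking area: accept immediately
        if self.last is None or np.linalg.norm(pos - self.last) < self.dist_thr:
            self.last, self.cand = pos, []
            return self.last
        # 3. enough estimates concentrated in a new area: accept the jump
        self.cand = [c for c in self.cand
                     if np.linalg.norm(pos - c) < self.dist_thr] + [pos]
        if len(self.cand) >= self.n_confirm:
            self.last, self.cand = pos, []
        return self.last
```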

It is worth mentioning that an activity investigating a particle filtering approach in the DICIT scenario has recently been started.

4.2.4 Experimental Results
A series of localization experiments was run on the acoustic WOZ data collection (please refer to D6.2 for further details on the data collection and for a description of the annotation and labelling process). It is worth remarking that the addressed scenario is very challenging from a localization point of view, given characteristics such as very short spoken sentences and very long pauses. The evaluation metrics are derived from those adopted in previous evaluations [22]. The basic metric for evaluating SLoc methods is the Euclidean distance between the coordinates delivered by the localization system and the reference coordinates. An error is classified as fine if it is lower than 50 cm; otherwise it is classified as gross. A fine error corresponds to a correct but noisy source position estimate, whereas a gross error occurs when the tracking algorithm delivers a faulty localization. The distinction between fine and gross errors gives a better understanding of the real performance of a localization system, since a few large errors, due to particular environmental conditions, may considerably affect the overall figures. Given the above metrics, the evaluation of SLoc algorithms is carried out in terms of:

• Localization rate: percentage of fine errors with respect to all the localization outputs;
• RMSE: overall root mean square error (mm);
• Fine RMSE: root mean square error computed only on fine errors (mm);
• Angular RMSE: root mean square angular error with respect to the center of the array;
• Bias: per-coordinate average error (mm);
• Deletion rate: percentage of frames for which a localization is not available (due to post-processing) and the previous value is kept.

The localization algorithm executes a 2D search for a single source and does not perform any speech activity detection. Given the 2D position estimate, the speaker height is derived by exploiting the vertical TDOA (i.e. the TDOA computed at a vertical microphone pair) with the highest CSP value. Since the automatic extraction of speaker coordinates from the video recordings and the manual transcriptions are not completely reliable, we first compare our algorithms under more controlled conditions, where spatial and temporal labelings are easier to extract and verify. The evaluation is hence restricted to the very beginning of each WOZ session, when users are asked to read some phonetically rich sentences while sitting in front of the array. Figure 19 shows localization performance in terms of angular RMSE when different thresholds are applied, resulting in different deletion rates.
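For illustration, the following sketch (a hypothetical helper, coordinates in mm) computes the localization rate, RMSE, fine RMSE and bias from paired estimates and references, using the 50 cm fine/gross threshold defined above:

```python
import numpy as np

def sloc_metrics(est, ref, fine_thr=500.0):
    """Evaluation metrics sketch; est and ref are (N, 3) arrays in mm."""
    err = np.linalg.norm(est - ref, axis=1)
    fine = err < fine_thr                      # fine/gross classification
    return {
        "loc_rate": float(fine.mean()),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "fine_rmse": float(np.sqrt(np.mean(err[fine] ** 2))) if fine.any() else None,
        "bias": (est - ref).mean(axis=0),      # per-coordinate average error
    }
```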

Figure 19: Localization performance in terms of angular RMSE when different thresholds, corresponding to different deletion rates, are applied (GCF vs. LS curves; x-axis: deletion rate, y-axis: RMS error in degrees).

As expected, there is no significant difference between the performance of the two methods; it is hence not possible to definitively assess the relative effectiveness of the two approaches on the basis of these evaluation results. It is worth underlining that LS seems to perform slightly better than GCF as soon as a few localizations are discarded; the explanation lies in the fact that LS does not require rounding of the time delays. Table 1 reports the evaluation results on the whole WOZ data collection in terms of the measures introduced above. The reported results are obtained by setting the post-processing threshold to the value that delivers the best overall performance, and are measured on speech frames only, using a manual segmentation based on the transcriptions as speech activity detector.

Method | Loc Rate | Fine RMS (mm) | RMS (mm) | Angular RMS (deg) | Bias (mm)       | Deletion rate
GCF    | 92%      | 238           | 347      | 7.8               | (-15, -180, -7) | 13.5%
LS     | 93%      | 223           | 315      | 7.4               | (-7, -166, -5)  | 10%

Table 1: Evaluation results on the WOZ data collection.

As mentioned above, references and transcriptions are prone to errors, and hence these results should be read as tendencies rather than absolute localization performance. As a general result, it is worth mentioning that, due to the limited vertical coverage, the speaker elevation is estimated less accurately; this is acceptable, as elevation is not critical for the overall system performance. The bias results also confirm that the uncertainty is higher along the y-axis, which is orthogonal to the array. Concluding, although this analysis allowed us to understand the criticalities of the problem and gave us an idea of the potential of our algorithms, we believe that the behaviour of the overall DICIT prototype is the best metric for evaluating each front-end component.

Frontal/Non-Frontal Speaker Detection
As a side effect of the localization process, one can obtain clues that help to understand whether the talker is facing the system, and hence whether he is talking to it or to somebody else.

The peak of the acoustic map, or of the CSP, turned out to be a feature related to the speaker's head orientation. However, some experimental results showed that this information alone is not sufficient, and alternative features should be identified and applied in combination. For instance, the SSF module could integrate the above-mentioned feature with any other information available at that level. Further research activities on this topic will be conducted during the next months.

4.2.5 Multiple Sources
The presented approaches have been widely adopted to tackle the SLoc problem when limited to a single source. When two or more sources are simultaneously active, it is reasonable to assume that the acoustic map presents two or more peaks in correspondence with the sources. However, simply searching for two local maxima may fail in the given context: in the presence of two speakers, depending on the spectral content of the involved signals, the main peak jumps from one source to the other, while the second peak may be considerably lower than the main one and may be overtaken by spurious peaks. Figure 20 shows an example of the GCF function when two sources are simultaneously active. In this case the two sources, denoted by circles, are to the left and right of the nested microphone array, which is placed in the lower part of the picture. It can be observed that most of the coherence concentrates around the speaker on the left, while the peak on the right is quite smooth.

Figure 20: GCF acoustic map in the presence of two sources. Dark colors mean small GCF values, while bright colors identify large values.

In order to deal with this problem we devised an approach that attempts to de-emphasize the main peak in order to make the detection of the second one easier. Our proposed method works as follows [23]:

1. search for the main peak;
2. modify each single CSP by lowering those values at the time lags that generated the main peak;
3. compute a new map and search for the maximum.

If we denote by s1 the position of the main peak, the de-emphasized version of each single CSP contribution is obtained as follows:

\[
C'_p(\tau) = C_p(\tau)\,\phi\!\left(\tau, \delta_p(s_1)\right)
\]

A suitable definition of the de-emphasis function φ is the following:

\[
\phi(\tau,\mu) = 1 - \alpha \left[\frac{1}{2b}\, e^{-\frac{|\tau-\mu|}{b}}\right]
\]

The parameter b determines the spatial selectivity of the de-emphasis function.
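The following sketch applies the de-emphasis to a single CSP vector; it uses the reconstructed Laplacian-shaped kernel above (the exact kernel shape in the source is uncertain, and the helper name, α and b values are illustrative):

```python
import numpy as np

def deemphasize(csp_vec, lag_s1, alpha, b):
    """Lower the CSP values around the main peak's lag (illustrative).

    lag_s1 : delta_p(s1) in samples, relative to the centered zero lag
    """
    lags = np.arange(len(csp_vec)) - len(csp_vec) // 2  # zero lag centered
    phi = 1.0 - alpha * np.exp(-np.abs(lags - lag_s1) / b) / (2.0 * b)
    return csp_vec * phi

# the second source is then found by recomputing the acoustic map
# from the modified CSP vectors and searching for its maximum
```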

Figure 21: Multiple speaker localization. The de-emphasis process is highlighted in the dotted box.

Figure 21 graphically shows how the de-emphasis process is applied in order to identify the peak associated with the second speaker. Figure 22 shows the same GCF map as in Figure 20 after the CSP de-emphasis process has been applied.

Figure 22: GCF map of Figure 20 after the de-emphasis process.

Unfortunately, the de-emphasis process not only highlights the second source but also raises the relative level of the reverberation-induced background in the map. In general, however, the method is very effective in enhancing the peak related to the second sound source position. This approach can be combined with a spatio-temporal clustering that monitors the spatial and temporal persistence of source positions.

It is worth underlining that the same method can be applied to both GCF and LS localization approaches presented in this document.

4.2.6 Loudspeakers as Additional Sources
In the DICIT project the SLoc module is expected to work even in the presence of the surrounding TV outputs. As far as the first prototype is concerned, the TV output is reproduced by two loudspeakers located next to the TV and the array, as depicted in Figure 23.

Figure 23: Configuration of the first DICIT prototype.

This configuration may turn out to be more or less critical for SLoc depending on the reverberation level of the room. In particular, in a highly reverberant room, the small coherence contribution of the loudspeakers tends to disappear in the overall reverberation and does not affect the localization performance. Conversely, in a less reverberant room, such as the one where the first prototype will be running, even a small coherence contribution is evident. As a consequence, the TV outputs must be taken into account in order to guarantee an accurate estimation of the speaker position. Each loudspeaker can be handled as a further speaker whose position is known, and can be deleted by adopting the same approach that is exploited to tackle the multiple-speaker scenario. Notice that, even in a reverberant environment, a human talker facing the microphones always prevails over the loudspeakers and is not weakened by the deletion process, even in the presence of high TV output levels. It is worth remarking that all the above statements are valid for the given configuration and there is no claim of generality. In particular, when a 5.1 TV output system is adopted, as envisioned for the final prototype, reverberation will no longer help, as some loudspeakers will be frontal to the array.

4.2.7 Real-time Implementation
A real-time implementation of the source localization module is available, based on a multi-threaded architecture. The SLoc module runs in its own thread and reads input data from a ring buffer. The implementation allows for on-line switching between the two map-computation approaches (GCF and LS) and for on-line parameter tuning. The current implementation correctly handles the presence of the two loudspeakers by applying the multiple-source localization approach. Localization of multiple users is not yet implemented in real time, as it is not foreseen for the first prototype.

5. Multi-channel Acoustic Processing Subsystem
This section deals with implementation issues concerning the Multi-channel Acoustic Processing Subsystem (MAPS). The audio processing of the MAPS is obtained by combining different software modules: Beamforming (BF), Source Localization (SLoc), Preprocessing (PreProc), Two-channel Acoustic Echo Cancellation (2C-AEC), and Smart Speech Filtering (SSF) including Speech Activity Detection (SAD). Within the MAPS a main program is executed that takes care of the audio input/output and of the data processing. The various processing modules are organized as libraries that are exchanged between the partners. After an initial phase of setup and configuration, the main loop of the program is composed of the acquisition section and three modules that run sequentially; parallel processing will be investigated where possible, using threads and a multi-core CPU. Figure 24 shows the software structure of the MAPS. Each acquired input data frame is made available in a ring buffer accessible to the modules. In order to ensure that the system's response time is not affected, the BF module does not wait for the SLoc output but uses the previous result in case the localization module runs slower than real time (see the sketch after Figure 24).

Figure 24: Block structure of Multi-channel Acoustic Processing
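The non-blocking interaction between acquisition, SLoc and BF can be sketched as follows; the stub functions stand in for the actual modules, and this is only an illustration of the threading scheme, not the MAPS code:

```python
import queue
import threading
import numpy as np

def acquire_frame():            # stub for the audio acquisition call
    return np.zeros(512)

def localize(frame):            # stub for the SLoc map search
    return np.array([0.0, 1.0])

def beamform(frame, pos):       # stub for the BF module
    return frame

frames = queue.Queue(maxsize=8)  # stands in for the ring buffer
latest = {"pos": None}           # most recent SLoc estimate

def sloc_worker():
    # SLoc runs in its own thread and may be slower than real time
    while True:
        latest["pos"] = localize(frames.get())

threading.Thread(target=sloc_worker, daemon=True).start()

for _ in range(100):             # main real-time loop (bounded here)
    frame = acquire_frame()
    if not frames.full():
        frames.put_nowait(frame)            # never block on the SLoc thread
    out = beamform(frame, latest["pos"])    # reuse the previous SLoc result
```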

5.1 FBK Hardware Setup
PC 1 hosts the Multi-channel Acoustic Processing Subsystem and the audio interface. The PC must be a powerful machine equipped with a multi-core CPU, suitable for parallelizing the execution of the different software modules. The chosen hardware configuration is the following:

• Intel Quad Core Xeon E5320

• 4GB RAM

• 2 x 320GB SATA HD

An important part of the PC setup is the audio recording hardware. The digital board is an RME HDSP 9652, connected via ADAT to three RME Octamic D acquisition boards. An external MOTU 896HD output board, connected via ADAT, is required to play the de-correlated TV signals. Figure 25 shows the hardware configuration of PC 1. The number of channels to be recorded is 15 (from the nested array) + 2 (TV left and right channels) + 1 (synthesis). Later in the project, the TV audio channels could be 5 + 1.

Figure 25: PC 1 audio acquisition chain

The RME HDSP 9652 is a digital acquisition board that offers 3 ADAT optical I/Os, ADAT-sync in, SPDIF I/O, and word clock I/O. It is a PCI board supported by the ALSA drivers. The RME Octamic D is an acquisition board with 8 balanced XLR mic/line inputs; each channel provides switches for 48 V phantom power, a low-cut filter and phase reversal, and the amplification gain can be set between 10 and 60 dB. Its ADC module adds 8 channels of pristine 192 kHz audio at 24-bit precision, available as a double ADAT output (S/MUX, up to 96 kHz) and simultaneously, via DB-25 connectors, as 4 AES outputs (up to 192 kHz). The ADC can be clocked internally (master) or externally via word clock and AES sync. The MOTU 896HD is an acquisition board with 8 microphone preamplifiers, pristine 192 kHz analog I/O, 8 channels of ADAT digital I/O, and stereo AES/EBU, all at 24-bit precision. The external hardware includes the following items:

‐ The microphone array, composed of 15 low-cost electret microphones in a nested configuration, connected to the three Octamic boards.
‐ The LCD television, a Sharp Aquos 46”.
‐ A modified version of a commercial STB made available by Fracarro, connected to a satellite dish to receive DVB-S free-to-air digital programs. The STB is based on the ST5105 platform.
‐ The infrared remote control “Planet Alias 1”, a programmable universal model, programmed by Fracarro as a digital satellite receiver remote compatible with the STB system.
‐ A 5.1 surround system composed of Genelec 8030A loudspeakers plus a Genelec 7050B subwoofer. For the first prototype only two stereo channels are used.

5.2 FAU Hardware Setup
The chosen hardware configuration for PC 1 at FAU is the following:

• Intel Pentium D (Dual Core) 3.2GHz

• 3GB RAM

• Seagate Barracuda 7200.9 250GB 8MB SATA II HD

The audio data acquisition hardware is composed of an RME Hammerfall Digi 9652 multi-channel soundcard, whose two ADAT-in ports and one ADAT-out port are connected via three TOSLINK cables to a Creamware A-16 TDAT analog audio interface. This AD/DA converter is optically synchronized by the soundcard and is connected to the DICIT microphone array via two Tascam MA-8 preamplifiers and an FAU-developed “LNT audio” board. The Creamware audio interface is also connected to a Sony amplifier that drives the stereo loudspeakers. In the current setup the stereo sound is delivered by a combination of a Grundig DVD player and a Harman/Kardon AVR 3000 serving as decoder.

Very similar to the FBK setup, the external hardware used at FAU includes the following items in addition to the hardware mentioned above:

‐ The microphone array, composed of 15 low-cost electret microphones (Panasonic WM60-AT) in a nested configuration.

‐ A Teufel 5.1 audio surround system, consisting of M400 (front/center), M500 (surround), and M5000 (subwoofer) units. For the first prototype only two stereo channels are used.

Bibliography

[1] B.D. Van Veen and K.M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, April 1988.
[2] H.L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory, Wiley-Interscience, 2002.
[3] W. Kellermann, “A self-steering digital microphone array,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Toronto, Canada, pp. 3581-3584, May 1991.
[4] W. Herbordt, Sound Capture for Human/Machine Interfaces, Springer-Verlag, Berlin/Heidelberg, 2005.
[5] L.C. Parra, “Steerable frequency-invariant beamforming for arbitrary arrays,” Journal of the Acoustical Society of America, pp. 3839-3847, June 2006.
[6] H. Cox, R.M. Zeskind, and T. Kooij, “Practical supergain,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 3, pp. 393-398, June 1986.
[7] L.B. Jackson, Digital Filters and Signal Processing, Third Edition, Kluwer Academic Publishers, Boston, 1996, pp. 301-307.
[8] H. Buchner, J. Benesty, and W. Kellermann, “Generalized multichannel frequency-domain adaptive filtering: efficient realization and application to hands-free speech communication,” Signal Processing (Elsevier), vol. 85, pp. 549-570, 2005.

[9] J. Herre, H. Buchner, and W. Kellermann, “Acoustic echo cancellation for surround sound using perceptually motivated convergence enhancement,” IEEE Transactions on Speech and Audio Processing, 2007.
[10] H.S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, MA, USA, 1992.
[11] H. Buchner, R. Aichner, J. Stenglein, H. Teutsch, and W. Kellermann, “Simultaneous localization of multiple sound sources using blind adaptive MIMO filtering,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005.
[12] A. Lombard, H. Buchner, and W. Kellermann, “Multidimensional localization of multiple sound sources using blind adaptive MIMO system identification,” in Proc. IEEE Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems (MFI), Heidelberg, Germany, September 2006.

[13] H. Buchner, R. Aichner, and W. Kellermann, “TRINICON-based blind system identification with application to multiple-source localization and separation,” in Blind Speech Separation, S. Makino, T.-W. Lee, and H. Sawada, Eds., Springer-Verlag, Berlin, 2007.
[14] M. Brandstein and D. Ward (Eds.), Microphone Arrays, Springer, 2001.
[15] H. Kuttruff, Room Acoustics, Elsevier Applied Science, 1991.
[16] M. Brandstein, A Framework for Speech Source Localization Using Sensor Arrays, PhD Thesis, Brown University, May 1995.
[17] M. Omologo and P. Svaizer, “Use of the crosspower-spectrum phase in acoustic event localization,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 288-292, May 1997.
[18] C. Knapp and G.C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 320-327, 1976.
[19] R. De Mori, Spoken Dialogues with Computers, Chapter 2, Academic Press, 1998.
[20] A. Brutti, M. Omologo, and P. Svaizer, “Oriented Global Coherence Field for the estimation of the head orientation in smart rooms equipped with distributed microphone arrays,” in Proc. Interspeech 2005.
[21] D. Rabinkin et al., “A DSP implementation of source location using microphone arrays,” in Proceedings of the 131st Meeting of the Acoustical Society of America, 1996, pp. 88-99.
[22] M. Omologo, A. Brutti, P. Svaizer, and L. Cristoforetti, “Speaker localization in CHIL lectures: evaluation criteria and results,” in MLMI 2005: Revised Selected Papers, S. Renals and S. Bengio, Eds., Springer, Berlin/Heidelberg, pp. 476-487, 2006.
[23] A. Brutti, M. Omologo, and P. Svaizer, “Localization of multiple speakers based on a two step acoustic map analysis,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2008.
