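Echo Perception

Echo is always present, both in real acoustic environments and in voice calls. We only perceive it, however, when both of the following conditions hold:

The echo path delay is greater than 50 ms
The echo signal is strong enough to be heard 👂

Tips: an echo path delay below 30 ms is hard to notice; echo only becomes perceptible once the delay exceeds 50 ms.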

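Echo Classification

Based on how the echo arises in a communication system, it can be divided into acoustic echo and line echo. The corresponding cancellation techniques are Acoustic Echo Cancellation (AEC) and Line Echo Cancellation (LEC).

Acoustic echo arises in hands-free or conferencing setups, where sound from the loudspeaker is repeatedly fed back into the microphone.
Line echo is caused by the two-wire/four-wire hybrid coupling in the physical circuit: because the circuit is not perfectly matched, part of the signal is reflected back and forms an echo.

Although echo cancellation is a very complex technique, the processing can be described simply:

The audio conferencing system in room A receives the sound from room B
The sound is sampled; this sample is called the echo cancellation reference
The sound is then sent both to the loudspeaker in room A and to the acoustic echo canceller
The sound from room B is picked up by the microphone in room A together with the sound of room A
The picked-up sound is sent to the acoustic echo canceller, compared against the original sample, and the sound from room B is removed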

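The WebRTC AEC algorithm is a partitioned block frequency domain adaptive filter (PBFDAF). For details, see Paez Borrallo J M and Otero M G. Two points need attention when using this AEC algorithm:

The delay must be small. By default the filter is split into 12 blocks of 64 points each; at a sampling rate of 8000 Hz each block is 8 ms, so the filter covers 12 * 8 ms = 96 ms of data. Delays beyond this length cannot be handled.
The delay jitter must be small. By default the algorithm re-estimates the position of the reference data (the block where the filter energy is largest) once every 10 blocks, so with large jitter the reference position is found inaccurately and the echo is not cancelled.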
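
The 96 ms figure follows directly from the partition count, the block size and the sampling rate. The following is a minimal sketch of that arithmetic; the function names are hypothetical helpers for illustration and are not part of the WebRTC API, and 12 and 64 are simply the defaults quoted above.

```cpp
#include <cassert>

// Echo-path length in milliseconds that a partitioned filter can cover.
// Hypothetical helper, not a WebRTC function.
constexpr double FilterCoverageMs(int num_partitions, int partition_len,
                                  int sample_rate_hz) {
    return 1000.0 * num_partitions * partition_len / sample_rate_hz;
}

// Returns true if an echo with the given delay still falls inside the
// window covered by the filter (assumed model: delay must be < coverage).
inline bool DelayIsCoverable(double delay_ms, int num_partitions,
                             int partition_len, int sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return delay_ms >= 0.0 &&
           delay_ms < FilterCoverageMs(num_partitions, partition_len,
                                       sample_rate_hz);
}
```

With the defaults, `FilterCoverageMs(12, 64, 8000)` evaluates to 96 ms, so a 60 ms delay fits inside the window while a 120 ms delay does not. Under this simple model the same 12 x 64 filter would cover only 48 ms at 16000 Hz.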

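Acoustic Echo Classification

Depending on the propagation path, acoustic echo can be divided into direct echo and indirect echo.

1) Direct echo

The echo obtained when the near-end microphone B directly captures the speech played by the near-end loudspeaker B. Direct echo is not affected by the environment; it depends heavily on the distance between the loudspeaker and the microphone and on their positions, so direct echo is a linear signal. Direct echo easily causes howling in audio conferences. Methods for handling it fall into two broad classes: feed-forward suppression and feedback cancellation.

2) Indirect echo

After the near-end loudspeaker B plays the speech, the signal is reflected off complex and changing wall surfaces before the near-end microphone B picks it up. The level of indirect echo depends on the room environment, the placement of objects, the wall absorption coefficient and similar factors, so indirect echo is a nonlinear signal.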

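Core Modules (Components)

The WebRTC echo cancellation algorithms (aec, aecm) mainly consist of the following modules:

Echo delay estimation
NLMS (normalized least mean squares adaptive algorithm)
NLP (nonlinear processing)
CNG (comfort noise generation)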

Echo Delay Estimation

![[Pasted image 20240309175252.png]]

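Much of this figure can be ignored; focus on the three items T0, T1 and T2.

T0 is the time sound takes to travel from the loudspeaker to the microphone. It can be ignored: the microphone and loudspeaker are usually not far apart, and at a speed of sound of 340 m/s this time generally does not exceed 1 ms.
T1 is the time from when far-end audio is passed to the echo canceller's far-end interface (WebRtcAec_BufferFarend) until it is actually played out. Received audio is usually passed to this interface at the same moment the upper layer hands it to the loudspeaker, so T1 can be understood as the time from when the audio is placed in the playout queue until it is played.
T2 is the time from when sound is captured by the microphone until it is passed to the near-end processing function (WebRtcAec_Process). Since captured audio is sent to echo cancellation right away, T2 can be understood as the time from when the microphone captures the sound until your code receives the audio PCM data.
delay = T0 + T1 + T2, which in practice is T1 + T2.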
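
As a rough sketch of how the delay could be assembled from timestamps, assuming the application can observe the four instants named below (the struct, its field names and both functions are hypothetical; how these timestamps are actually obtained depends on the platform audio API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical timestamps, all in milliseconds on the same clock.
struct DelayTimestampsMs {
    int64_t farend_buffered;    // far-end frame handed to the playout queue
    int64_t farend_played;      // the same frame leaves the loudspeaker
    int64_t nearend_captured;   // microphone captures a frame
    int64_t nearend_processed;  // that frame reaches the AEC processing call
};

// T0: acoustic travel time from loudspeaker to microphone at 340 m/s.
inline double AcousticDelayMs(double distance_m) {
    return distance_m / 340.0 * 1000.0;
}

// delay = T1 + T2 (T0 is neglected, as in the text).
inline int EstimateStreamDelayMs(const DelayTimestampsMs& t) {
    const int64_t t1 = t.farend_played - t.farend_buffered;       // render side
    const int64_t t2 = t.nearend_processed - t.nearend_captured;  // capture side
    assert(t1 >= 0 && t2 >= 0);
    return static_cast<int>(t1 + t2);
}
```

For example, a render latency of 40 ms plus a capture latency of 20 ms yields a delay of 60 ms, which is the kind of value that would then be reported to the echo canceller.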

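In general, once a suitable delay can be found for a device, echo cancellation on that device is hardly more difficult than noise suppression or gain control. The iPhone, for example, has a fixed delay of 60 ms. It also depends on where the code runs: inside a chip the delay is short and easy to fix, whereas in application-layer software the total time is not deterministic.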

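NLMS (Normalized Least Mean Squares Adaptive Algorithm)

LMS, NLMS, AP and RLS are all classic adaptive filtering algorithms; only the NLMS algorithm used in WebRTC is briefly introduced here.
Let the far-end signal be x(n), the near-end signal be d(n) and the filter coefficient vector be W(n). The error signal is then e(n) = d(n) - W'(n) * x(n), where ' denotes transpose. NLMS updates the filter coefficients with a variable step size u = u0 / (gamma + x'(n) * x(n)), where u0 is the step-size factor and gamma is a stabilizing factor. The coefficient update equation is W(n+1) = W(n) + u * e(n) * x(n). NLMS is slightly more complex than the conventional LMS algorithm but converges noticeably faster. LMS/NLMS perform worse than the AP and RLS algorithms.
WebRTC uses the partitioned block frequency domain adaptive filter (PBFDAF) algorithm, which is also a commonly used adaptive filtering algorithm.
For more on adaptive filtering, see Simon Haykin's "Adaptive Filter Theory".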
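
To make the update equations concrete, here is a minimal time-domain NLMS sketch. It is an illustration of the formulas above only: WebRTC itself works block-wise in the frequency domain (PBFDAF), and the class name `NlmsFilter` and its constructor parameters are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample-by-sample NLMS filter (illustrative, not WebRTC code).
class NlmsFilter {
public:
    NlmsFilter(std::size_t taps, float u0, float gamma)
        : w_(taps, 0.0f), x_(taps, 0.0f), u0_(u0), gamma_(gamma) {
        assert(taps > 0);
    }

    // Takes one far-end sample x(n) and one near-end sample d(n),
    // returns the error e(n) = d(n) - W'(n) * x(n).
    float Process(float far_sample, float near_sample) {
        // Shift the far-end history; x_[0] holds the newest sample.
        for (std::size_t i = x_.size() - 1; i > 0; --i) x_[i] = x_[i - 1];
        x_[0] = far_sample;

        float estimate = 0.0f;  // W'(n) * x(n)
        float energy = 0.0f;    // x'(n) * x(n)
        for (std::size_t i = 0; i < x_.size(); ++i) {
            estimate += w_[i] * x_[i];
            energy += x_[i] * x_[i];
        }
        const float e = near_sample - estimate;
        const float u = u0_ / (gamma_ + energy);  // normalized step size
        for (std::size_t i = 0; i < x_.size(); ++i) {
            w_[i] += u * e * x_[i];               // W(n+1) = W(n) + u*e(n)*x(n)
        }
        return e;
    }

    const std::vector<float>& weights() const { return w_; }

private:
    std::vector<float> w_;  // filter coefficients W(n)
    std::vector<float> x_;  // most recent far-end samples x(n)
    float u0_;
    float gamma_;
};
```

Feeding the filter a far-end signal together with a near-end signal that is a short FIR-filtered copy of it should drive the returned error toward zero while the leading weights approach the echo-path coefficients.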

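NLP (Nonlinear Processing)

WebRTC uses a Wiener filter. Only the transfer function is given here: let the power spectrum of the estimated speech signal be Ps(w) and the power spectrum of the noise signal be Pn(w); the transfer function of the filter is then H(w) = Ps(w) / (Ps(w) + Pn(w)).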
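
The transfer function can be evaluated per frequency bin. A minimal sketch, assuming the two power spectra have already been estimated elsewhere (the function name `WienerGain` is hypothetical, and returning a gain of zero for an empty bin is a choice made for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-bin Wiener gain H(w) = Ps(w) / (Ps(w) + Pn(w)).
// ps: estimated speech power spectrum, pn: noise power spectrum.
inline std::vector<float> WienerGain(const std::vector<float>& ps,
                                     const std::vector<float>& pn) {
    assert(ps.size() == pn.size());
    std::vector<float> h(ps.size());
    for (std::size_t i = 0; i < ps.size(); ++i) {
        const float denom = ps[i] + pn[i];
        // Guard against division by zero when a bin carries no power.
        h[i] = denom > 0.0f ? ps[i] / denom : 0.0f;
    }
    return h;
}
```

The gain lies between 0 and 1: bins dominated by speech are passed almost unchanged, while bins dominated by noise are attenuated.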

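CNG (Comfort Noise Generation)

The comfort noise generator used in WebRTC is fairly simple: it first generates a matrix of random noise uniformly distributed on [0, 1], then modulates the noise amplitude with the square root of the noise power spectrum.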
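
The description above can be sketched as follows. This is a simplified illustration of the stated idea only, not WebRTC's actual generator; the function name `ComfortNoiseMagnitudes` and the use of `std::mt19937` as the random source are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// For each frequency bin, draws a uniform random value and scales it by
// the square root of the estimated noise power in that bin.
inline std::vector<float> ComfortNoiseMagnitudes(
        const std::vector<float>& noise_power, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform01(0.0f, 1.0f);
    std::vector<float> out(noise_power.size());
    for (std::size_t i = 0; i < noise_power.size(); ++i) {
        assert(noise_power[i] >= 0.0f);
        out[i] = uniform01(rng) * std::sqrt(noise_power[i]);
    }
    return out;
}
```

Each output magnitude is bounded by the square root of the noise power in its bin, so the generated noise follows the spectral shape of the background noise estimate.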

Android AEC

Below is a wrapper around WebRTC for Android. Note the use of apm_->set_stream_delay_ms(delay_ms_);.


#include "webrtc_apm.h"
//#include "webrtc/common_types.h"
#include "webrtc/modules/audio_processing/include/audio_processing.h"
#include "webrtc/modules/include/module_common_types.h"
#include "webrtc/api/audio/audio_frame.h"
#include "YuvConvert.h"
 
using namespace webrtc;
//using namespace cbase;
 
WebrtcAPM::WebrtcAPM(int process_smp, int reverse_smp)
: apm_(nullptr)
, far_frame_(new AudioFrame)
, near_frame_(new AudioFrame)
, ref_cnt_(0)
, sample_rate_(8000)
, samples_per_channel_(8000/100)
, channels_(1)
, frame_size_10ms_(8000/100*sizeof(int16_t))
, delay_ms_(60)
, process_sample_rate_(process_smp)//44100
, reverse_sample_rate_(reverse_smp)//48000
{
    audio_send_ = new unsigned char[kMaxDataSizeSamples_];
    audio_reverse_ = new unsigned char[kMaxDataSizeSamples_];
 
#if defined(__APPLE__)
    delay_ms_ = 60;
#endif
 
#if defined(VAD_TEST)
        //create webrtc vad
        ty_vad_create(8000, 1);
        ty_set_vad_level(vad_level::LOW);
        ty_vad_set_recordfile("/sdcard/vadfile.pcm");
#endif
}
 
WebrtcAPM::~WebrtcAPM()
{
    if(far_frame_)
    {
        delete far_frame_;
        far_frame_ = NULL ;
    }
 
    if (near_frame_) {
        delete near_frame_ ;
        near_frame_ = NULL ;
    }
 
    if (audio_send_) {
        delete[] audio_send_;
    }
 
    if (audio_reverse_) {
        delete[] audio_reverse_;
    }
 
#if defined(VAD_TEST)
        //destroy webrtc vad
        ty_vad_destory();
#endif
}
 
void WebrtcAPM::set_sample_rate(int sample_rate)
{
    sample_rate_ = sample_rate;
    samples_per_channel_ = sample_rate_ / 100;
}
 
int WebrtcAPM::frame_size()
{
    return frame_size_10ms_;
}
 
void WebrtcAPM::set_ace_delay(int delay)
{
    LOGI("set aec delay to %d ms \n", delay);
    delay_ms_ = delay;
}
 
void WebrtcAPM::set_reverse_stream(int reverse_sample_rate)
{
    std::lock_guard<std::mutex> guard(mutex_);
 
    reverse_sample_rate_ = reverse_sample_rate;
    if (resampleReverse) {
        delete resampleReverse;
        resampleReverse = NULL ;
    }
 
    resampleReverse = new webrtc::Resampler(reverse_sample_rate_,sample_rate_,channels_);
    int result = resampleReverse->Reset(reverse_sample_rate_,sample_rate_,channels_);
    if (result != 0) {
        LOGE("reset resampleReverse fail,%d!\n", result);
    }
}
 
int WebrtcAPM::init()
{
    std::lock_guard<std::mutex> guard(mutex_);
 
    resampleReverse = new webrtc::Resampler(reverse_sample_rate_,sample_rate_,channels_);
    int result = resampleReverse->Reset(reverse_sample_rate_,sample_rate_,channels_);
    if (result != 0) {
        LOGE("reset resampleReverse fail,%d!\n", result);
    }
 
    resampleIn = new webrtc::Resampler(sample_rate_,process_sample_rate_,channels_);
    result = resampleIn->Reset(sample_rate_,process_sample_rate_,channels_);
    if (result != 0) {
        LOGE("reset resampleIn fail,%d!\n", result);
    }
 
    initAPM();
 
    LOGI("initAPM success!");
 
    return 0;//crash without this
}
 
void WebrtcAPM::uninit()
{
    std::lock_guard<std::mutex> guard(mutex_);
 
    deInitAPM();
    safe_delete(resampleIn);
    safe_delete(resampleReverse);
}
 
void WebrtcAPM::reset_apm(){
    std::lock_guard<std::mutex> guard(mutex_);
    deInitAPM();
    initAPM();
}
 
int WebrtcAPM::initAPM(){
    apm_ = AudioProcessingBuilder().Create();
    if (apm_ == nullptr) {
        LOGE("AudioProcessing create failed");
        return -1;
    }
 
    AudioProcessing::Config config;
    config.echo_canceller.enabled = true;
    config.echo_canceller.mobile_mode = true;
 
    config.gain_controller1.enabled = true;
    config.gain_controller1.mode = AudioProcessing::Config::GainController1::kAdaptiveDigital;
    config.gain_controller1.analog_level_minimum = 0;
    config.gain_controller1.analog_level_maximum = 255;
 
    config.noise_suppression.enabled = true;
    config.noise_suppression.level = AudioProcessing::Config::NoiseSuppression::Level::kModerate;
 
    config.gain_controller2.enabled = true;
    config.high_pass_filter.enabled = true;
    config.voice_detection.enabled = true;
 
    apm_->ApplyConfig(config);
    LOGI("AudioProcessing initialize success \n");
 
    far_frame_->sample_rate_hz_ = sample_rate_;
    far_frame_->samples_per_channel_ = samples_per_channel_;
    far_frame_->num_channels_ = channels_;
 
    near_frame_->sample_rate_hz_ = sample_rate_;
    near_frame_->samples_per_channel_ = samples_per_channel_;
    near_frame_->num_channels_ = channels_;
 
    frame_size_10ms_ = samples_per_channel_ * channels_ * sizeof(int16_t);
 
    //LOGI("AudioProcessing initialize success end\n");
    return 0;
 
}
 
void WebrtcAPM::deInitAPM(){
    //std::lock_guard<std::mutex> guard(mutex_);
    //if (--ref_cnt_ == 0) {
        LOGI("destroy WebrtcAPM \n");
        safe_delete(apm_);
    //}
}
 
//8000->44100
void WebrtcAPM::process_stream(uint8_t *buffer,int bufferLength, uint8_t *bufferOut, int* pOutLen, bool bUseAEC)
{
    if (bUseAEC) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (apm_) {
            int frame_count = bufferLength / frame_size_10ms_;
        
            for (int i = 0; i < frame_count; i++) {
                 apm_->set_stream_delay_ms(delay_ms_);
                 // webrtc apm process 10ms datas every time
                 memcpy((void*)near_frame_->data(), buffer + i*frame_size_10ms_, frame_size_10ms_);
                 int res = apm_->ProcessStream(near_frame_->data(),
                                        StreamConfig(near_frame_->sample_rate_hz_, near_frame_->num_channels_),
                                        StreamConfig(near_frame_->sample_rate_hz_, near_frame_->num_channels_),
                                        (int16_t * const)near_frame_->data());
                 if (res != 0) {
                     LOGE("ProcessStream failed, ret %d \n",res);
                 }
 
#if  defined(VAD_TEST)                        
            bool ret = ty_vad_process(buffer + i*frame_size_10ms_, frame_size_10ms_);
#endif
 
                 memcpy(buffer + i*frame_size_10ms_, near_frame_->data(), frame_size_10ms_);
            }
        }
    }
 
    if (resampleIn && apm_) {
        //resample
        size_t outlen = 0 ;
        int result = resampleIn->Push((int16_t*)buffer, bufferLength/sizeof(int16_t), (int16_t*)bufferOut, kMaxDataSizeSamples_/sizeof(int16_t), outlen);
        if (result != 0) {
            LOGE("resampleIn error, result = %d, outlen = %zu\n", result, outlen);
        }
        *pOutLen = outlen;
    }
}
 
//48000->8000
void WebrtcAPM::process_reverse_10ms_stream(uint8_t *bufferIn, int bufferLength, uint8_t *bufferOut, int* pOutLen, bool bUseAEC)
{
    size_t outlen = 0 ;
    if (resampleReverse && apm_) {
        //resample
        int result = resampleReverse->Push((int16_t*)bufferIn, bufferLength/sizeof(int16_t), (int16_t*)audio_reverse_, kMaxDataSizeSamples_/sizeof(int16_t), outlen);
        if (result != 0) {
            LOGE("resampleReverse error, result = %d, outlen = %zu\n", result, outlen);
        }
    }
    else {
        memcpy(audio_reverse_, bufferIn, bufferLength);
        outlen = bufferLength;
    }
 
    *pOutLen = outlen;
 
    if (!bUseAEC){
        //copy data and return
        memcpy(bufferOut, audio_reverse_, frame_size_10ms_);
        return;
    }
 
    std::lock_guard<std::mutex> guard(mutex_);
    if (apm_) {
        memcpy((void*)far_frame_->data(), audio_reverse_, frame_size_10ms_);
        int res = apm_->ProcessReverseStream(far_frame_->data(),
                                        StreamConfig(far_frame_->sample_rate_hz_, far_frame_->num_channels_),
                                        StreamConfig(far_frame_->sample_rate_hz_, far_frame_->num_channels_),
                                        (int16_t * const)far_frame_->data());
        if (res != 0) {
            LOGE("ProcessReverseStream failed, ret %d \n",res);
        }
        memcpy(bufferOut, audio_reverse_, frame_size_10ms_);//far_frame_->data()
    }
}

Example


#include <cstdio>
#include <webrtc/modules/audio_processing/aec/echo_cancellation.h>

using namespace webrtc;

#define NN 160

int webrtcAecTest1()
{
    char far_frame_c[NN * 2];
    char near_frame_c[NN * 2];

    short far_frame_s[NN];
    short near_frame_s[NN];
    short out_frame_s[NN];

    void *aecmInst = NULL;
    FILE *fp_far = fopen("speaker.pcm", "rb");
    FILE *fp_near = fopen("micin.pcm", "rb");
    FILE *fp_out = fopen("out1.pcm", "wb");

    float far_frame_f[NN];
    float near_frame_f[NN];
    float out_frame_f[NN];

    do
    {
        if (!fp_far || !fp_near || !fp_out)
        {
            printf("WebRtcAecTest open file err \n");
            break;
        }

        aecmInst = WebRtcAec_Create();
        WebRtcAec_Init(aecmInst, 8000, 8000);

        AecConfig config;
        config.nlpMode = kAecNlpConservative;
        WebRtcAec_set_config(aecmInst, config);

        while (1)
        {
            if (NN == fread(far_frame_c, sizeof(char) * 2, NN, fp_far))
            {
                //1
                for (int i = 0; i < NN; ++i)
                {
                    far_frame_s[i] = (far_frame_c[i * 2 + 1] << 8) | (far_frame_c[i * 2] & 0xFF);// combine two chars into one short (little-endian)
                    far_frame_f[i] = far_frame_s[i];// the interface requires float samples
                }
                WebRtcAec_BufferFarend(aecmInst, far_frame_f, NN);// buffer the reference (far-end) audio

                //2
                fread(near_frame_c, sizeof(char) * 2, NN, fp_near);
                for (int i = 0; i < NN; ++i)
                {
                    near_frame_s[i] = (near_frame_c[i * 2 + 1] << 8) | (near_frame_c[i * 2] & 0xFF);
                    near_frame_f[i] = near_frame_s[i];
                }

                float* const p = near_frame_f;
                const float* const* nearend = &p;

                float* const q = out_frame_f;
                float* const* out = &q;

                //3
                WebRtcAec_Process(aecmInst, nearend, 1, out, NN, 40, 0);// echo cancellation
                for (int i = 0; i < NN; ++i)
                {
                    out_frame_s[i] = out_frame_f[i];
                }
                fwrite(out_frame_s, sizeof(short), NN, fp_out);
            }
            else
            {
                break;
            }
        }
    } while (0);

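    // Only close or free what was actually opened or created.
    if (fp_far) fclose(fp_far);
    if (fp_near) fclose(fp_near);
    if (fp_out) fclose(fp_out);
    if (aecmInst) WebRtcAec_Free(aecmInst);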
    return 0;
}

Test Files

Microphone input audio, including echo: 宜居小镇/Document/WebRTC/AEC/micin.pcm
Echo reference audio: 宜居小镇/Document/WebRTC/AEC/spacker.pcm

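Application Scenarios

Taking live streaming as an example, there are two situations in which echo cancellation may be needed:

Scenario A: the broadcaster has microphone input and also plays sound through loudspeakers, so the loudspeaker echo must be filtered out;
Scenario B: the broadcaster and some viewers are in the same room, and the viewers play the stream through loudspeakers;

Scenario A

WebRTC's AEC module can be used for echo cancellation. On Windows, you need to work out the interval between audio being output to the loudspeaker and the microphone capturing it. In practice, the broadcaster usually does not play audio out loud, so echo cancellation is of limited use here.

Scenario B

It is hard to compute the delay from each client to the capture side. Doing so requires some form of time synchronization such as NTP, which is complex and technically difficult, and the result is not very noticeable. In practice the broadcaster should be in a quiet environment, so this scenario rarely applies.

Taking both scenarios together, adding echo cancellation to a live streaming application has narrow applicability, and the benefit does not justify the technical difficulty.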

Extras

Android AEC Control

Two ways of controlling whether AEC is enabled currently circulate online; which one applies depends on the WebRTC version.

WebRtcAudioUtils.setWebRtcBasedAcousticEchoCanceler(true);


MediaConstraints mediaConstraints = new MediaConstraints();
mediaConstraints.mandatory.add(new MediaConstraints.KeyValuePair("echoCancellation", "true"));
mediaConstraints.mandatory.add(new MediaConstraints.KeyValuePair("googEchoCancellation", "true"));
mediaConstraints.mandatory.add(new MediaConstraints.KeyValuePair("googEchoCancellation2", "true"));