D²Net: A Denoising and Dereverberation Network Based on Two-branch Encoder and Dual-path Transformer

Liusong Wang, Wenbing Wei, Yadong Chen, Ying Hu

Abstract: Simultaneous denoising and dereverberation of single-channel speech mixtures in complex acoustic environments is a challenging task. In this paper, we propose a denoising and dereverberation network named D²Net, in which a two-branch encoder (TBE) extracts features of different granularity and selectively fuses them. In addition, we design a global-local dual-path transformer (GLDPT), which introduces local dense synthesizer attention (LDSA) into the dual-path transformer to improve the perception of local information. We evaluate D²Net and conduct ablation studies on the VoiceBank+DEMAND and WHAMR! datasets. We also select three types of data from the WHAMR! dataset to verify the ability of D²Net on three tasks: denoising only, dereverberation only, and simultaneous denoising and dereverberation. Experimental results show that the proposed model outperforms the comparison models on all three tasks while keeping a small number of network parameters.

D²Net Architecture: