
Voiceprint recognition method based on SE-DR-Res2Block

LI Ping, GAO Qingyuan, XIA Yu, ZHANG Xiaoyong, CAO Yi

Citation: LI Ping, GAO Qingyuan, XIA Yu, ZHANG Xiaoyong, CAO Yi. Voiceprint recognition method based on SE-DR-Res2Block[J]. Chinese Journal of Engineering, 2023, 45(11): 1962-1969. doi: 10.13374/j.issn2095-9389.2022.09.19.001


doi: 10.13374/j.issn2095-9389.2022.09.19.001
Funding: National Natural Science Foundation of China (51375209); Six Talent Peaks Project of Jiangsu Province (ZBZZ-012); Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX18_0630, KYCX18_1846); Programme of Introducing Talents of Discipline to Universities (B18027)
Corresponding author: E-mail: caoyi@jiangnan.edu.cn

  • CLC number: TN912.34

Abstract: To address the insufficient feature representation and weak generalization of the conventional Res2Net model in voiceprint recognition, a feature extraction module combining dense and residual connections, SE-DR-Res2Block (squeeze-and-excitation Res2Block with dense and residual connections), is proposed. First, the ECAPA-TDNN (emphasized channel attention, propagation and aggregation in time delay neural network) architecture built on the conventional Res2Block is introduced, along with dense connectivity and its working principle. Then, to extract features more efficiently, dense connections are adopted to mine features more fully, and residual and dense connections are combined on the basis of the SE-Block (squeeze-and-excitation block) to form the more efficient feature extraction module SE-DR-Res2Block. In a finer-grained manner, the module obtains combinations of different growth rates and multiple receptive fields, thereby producing multiscale feature representations and maximizing feature reuse, so that information from different layers is extracted effectively and feature information at more scales is captured. Finally, to verify the module's effectiveness, voiceprint recognition experiments were conducted on the Voxceleb1 and SITW (speakers in the wild) datasets using SE-Res2Block (squeeze-and-excitation Res2Block), FULL-SE-Res2Block (fully connected squeeze-and-excitation Res2Block), SE-DR-Res2Block, and FULL-SE-DR-Res2Block (fully connected squeeze-and-excitation Res2Block with dense and residual connections) in different network models. The results show that the ECAPA-TDNN model with SE-DR-Res2Block achieves the best equal error rates of 2.24% and 3.65%, respectively, which verifies the module's feature representation ability; its results on different test sets also confirm good generalization.
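As a concrete illustration of the module just described, the following is a minimal PyTorch sketch of an SE-DR-Res2Block-style layer, written only from the description above: Res2-style channel splits wired with both dense connections (each branch also sees all earlier branch outputs) and a residual skip, followed by squeeze-and-excitation. The channel width, scale s, kernel size, and exact branch wiring are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: pool over time, bottleneck, per-channel gates (cf. Fig. 4)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (batch, channels, time)
        gates = self.fc(x.mean(dim=2))             # squeeze: global average over time
        return x * gates.unsqueeze(2)              # excite: rescale each channel


class SEDRRes2Block(nn.Module):
    """Res2-style split with dense connections between branches and a residual skip."""

    def __init__(self, channels: int = 512, scale: int = 4, kernel: int = 3, dilation: int = 1):
        super().__init__()
        assert channels % scale == 0, "channels must split evenly into `scale` groups"
        self.scale = scale
        width = channels // scale
        pad = dilation * (kernel - 1) // 2
        # Branch i convolves its own split concatenated with the outputs of all
        # earlier branches (dense connections), so its input width grows with i.
        self.convs = nn.ModuleList(
            nn.Conv1d(width * (i + 2), width, kernel, padding=pad, dilation=dilation)
            for i in range(scale - 1)
        )
        self.act = nn.ReLU(inplace=True)
        self.se = SEBlock(channels)

    def forward(self, x):                          # x: (batch, channels, time)
        splits = torch.chunk(x, self.scale, dim=1)
        outs = [splits[0]]                         # first split passes through unchanged
        for i, conv in enumerate(self.convs):
            dense_in = torch.cat([splits[i + 1], *outs], dim=1)  # dense connection
            outs.append(self.act(conv(dense_in)))
        y = self.se(torch.cat(outs, dim=1))        # channel recalibration over all branches
        return x + y                               # residual connection around the block


# Smoke test on a dummy 80-frame utterance with 512 channels.
block = SEDRRes2Block(channels=512, scale=4)
print(block(torch.randn(2, 512, 80)).shape)        # torch.Size([2, 512, 80])
```

In the full ECAPA-TDNN, such a block would additionally sit between two 1×1 convolutions and use increasing dilations across layers; those details are omitted here.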


Figure 1. Schematic diagram of the structure of SE-Res2Block (T is 5 and s is 4 in the example)

Figure 2. Schematic diagram of the dense connection structure

Figure 3. Schematic diagram of the structure of SE-DR-Res2Block (T is 5 and s is 4 in the example)

Figure 4. SE-Block structure diagram

Figure 5. Detection error tradeoff curves: (a) 0–15 s; (b) 15–25 s; (c) 25–40 s
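For readers reproducing the metrics in Tables 1–4 below, the sketch that follows shows one common way to compute the equal error rate (EER) and minimum detection cost (minDCF) from raw verification scores. It uses NumPy only; the score arrays are random placeholders rather than the paper's data, and minDCF normalization conventions vary between evaluations, so treat this as an assumption-laden illustration.

```python
import numpy as np


def eer(tgt: np.ndarray, non: np.ndarray) -> float:
    """Equal error rate: sweep thresholds over the observed scores and return
    the point where false-rejection and false-acceptance rates cross."""
    thresholds = np.sort(np.concatenate([tgt, non]))
    frr = np.array([(tgt < t).mean() for t in thresholds])    # misses
    far = np.array([(non >= t).mean() for t in thresholds])   # false alarms
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2


def min_dcf(tgt: np.ndarray, non: np.ndarray, p_target: float = 0.01,
            c_miss: float = 1.0, c_fa: float = 1.0) -> float:
    """Minimum detection cost: the smallest weighted error over all thresholds."""
    thresholds = np.sort(np.concatenate([tgt, non]))
    dcf = [c_miss * p_target * (tgt < t).mean()
           + c_fa * (1 - p_target) * (non >= t).mean() for t in thresholds]
    return min(dcf)


rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 2000)   # same-speaker trial scores (placeholder)
non = rng.normal(0.0, 1.0, 2000)   # different-speaker trial scores (placeholder)
print(f"EER = {eer(tgt, non):.2%}, minDCF(p=0.01) = {min_dcf(tgt, non):.3f}")
```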

Table 1. Performance comparison of the Voxceleb1 test set under different Res2Net-50 systems

System                | Parameters/10^6 | EER/% | minDCF (p = 0.1) | minDCF (p = 0.01) | minDCF (p = 0.001)
x-vector[27]          | —               | 4.19  | 0.212            | 0.391             | 0.512
Res2Net-50            | 24.50           | 3.73  | 0.205            | 0.381             | 0.485
FULL-Res2Net-50       | 24.50           | 3.91  | 0.210            | 0.385             | 0.481
SE-DR-Res2Net-50      | 35.19           | 3.51  | 0.197            | 0.374             | 0.473
FULL-SE-DR-Res2Net-50 | 35.19           | 3.70  | 0.206            | 0.379             | 0.478

Table 2. Performance comparison of the Voxceleb1 test set under different ECAPA-TDNN systems

System               | Parameters/10^6 | EER/% | minDCF (p = 0.1) | minDCF (p = 0.01) | minDCF (p = 0.001)
x-vector[27]         | —               | 4.19  | 0.212            | 0.391             | 0.512
Res2Block            | 14.73           | 2.49  | 0.132            | 0.269             | 0.312
FULL-Res2Block       | 14.73           | 2.51  | 0.145            | 0.296             | 0.372
SE-DR-Res2Block      | 16.71           | 2.24  | 0.120            | 0.245             | 0.300
FULL-SE-DR-Res2Block | 16.71           | 2.37  | 0.137            | 0.280             | 0.364

Table 3. Performance comparison of the SITW test set under different ECAPA-TDNN systems

System               | Parameters/10^6 | EER/% | minDCF (p = 0.1) | minDCF (p = 0.01) | minDCF (p = 0.001)
x-vector             | —               | 6.92  | 0.357            | 0.567             | 0.832
Res2Block            | 14.73           | 3.91  | 0.179            | 0.348             | 0.551
FULL-Res2Block       | 14.73           | 4.0   | 0.179            | 0.349             | 0.529
SE-DR-Res2Block      | 16.71           | 3.65  | 0.175            | 0.334             | 0.509
FULL-SE-DR-Res2Block | 16.71           | 3.83  | 0.179            | 0.337             | 0.513

Table 4. EER (%) at different SITW utterance durations

System               | <15 s | 15–25 s | 25–40 s
x-vector             | 7.52  | 7.21    | 6.65
Res2Block            | 4.56  | 4.21    | 3.43
FULL-Res2Block       | 4.75  | 4.30    | 3.52
SE-DR-Res2Block      | 4.06  | 3.82    | 3.30
FULL-SE-DR-Res2Block | 4.12  | 4.07    | 3.42

Table 5. Performance comparison under different models

System                   | Testing set | EER/%
Attentive statistics[14] | Voxceleb1   | 3.85
ResNet-34-SE[17]         | Voxceleb1   | 3.52
LSTF-opt+S-CCA[18]       | Voxceleb1   | 2.92
Ours                     | Voxceleb1   | 2.24
MTAF[18]                 | SITW        | 5.11
LSTF-opt+S-CCA[18]       | SITW        | 4.21
Ours                     | SITW        | 3.65
References
    [1] Zheng F, Li L T, Zhang H, et al. Overview of voiceprint recognition technology and applications. J Inf Secur Res, 2016, 2(1): 44
    [2] Hayashi V T, Ruggiero W V. Hands-free authentication for virtual assistants with trusted IoT device and machine learning. Sensors, 2022, 22(4): 1325 doi: 10.3390/s22041325
    [3] Faundez-Zanuy M, Lucena-Molina J J, Hagmueller M. Speech watermarking: An approach for the forensic analysis of digital telephonic recordings[J/OL]. arXiv preprint (2022-03-12) [2022-09-19]. https://arxiv.org/abs/2203.02275
    [4] Garain A, Ray B, Giampaolo F, et al. GRaNN: Feature selection with golden ratio-aided neural network for emotion, gender and speaker identification from voice signals. Neural Comput Appl, 2022, 34(17): 14463 doi: 10.1007/s00521-022-07261-x
    [5] Waghmare K, Gawali B. Speaker recognition for forensic application: A review. J Pos Sch Psychol, 2022, 6(3): 984
    [6] Mittal A, Dua M. Automatic speaker verification systems and spoof detection techniques: Review and analysis. Int J Speech Technol, 2022, 25: 105 doi: 10.1007/s10772-021-09876-2
    [7] Burget L, Matejka P, Schwarz P, et al. Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Trans Audio Speech Lang Process, 2007, 15(7): 1979 doi: 10.1109/TASL.2007.902499
    [8] Bao H J, Zheng F. Combined GMM-UBM and SVM speaker identification system. J Tsinghua Univ Sci Technol, 2008(S1): 693
    [9] Kenny P, Stafylakis T, Ouellet P, et al. JFA-based front ends for speaker recognition // 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, 2014: 1705
    [10] Cumani S, Plchot O, Laface P. On the use of i-vector posterior distributions in probabilistic linear discriminant analysis. IEEE/ACM Trans Audio Speech Lang Process, 2014, 22(4): 846 doi: 10.1109/TASLP.2014.2308473
    [11] Variani E, Lei X, McDermott E, et al. Deep neural networks for small footprint text-dependent speaker verification // 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, 2014: 4052
    [12] Snyder D, Ghahremani P, Povey D, et al. Deep neural network-based speaker embeddings for end-to-end speaker verification // 2016 IEEE Spoken Language Technology Workshop. San Diego, 2016: 165
    [13] Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts // Sixteenth Annual Conference of the International Speech Communication Association. Dresden, 2015: 3214
    [14] Okabe K, Koshinaka T, Shinoda K. Attentive statistics pooling for deep speaker embedding // Interspeech. Hyderabad, 2018: 2252
    [15] Jiang Y H, Song Y, McLoughlin I, et al. An effective deep embedding learning architecture for speaker verification // Interspeech. Graz, 2019: 4040
    [16] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 4700
    [17] Zhou J F, Jiang T, Li Z, et al. Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function // Interspeech. Graz, 2019: 2883
    [18] Li Z, Zhao M, Li L, et al. Multi-feature learning with canonical correlation analysis constraint for text-independent speaker verification // 2021 IEEE Spoken Language Technology Workshop. Shenzhen, 2021: 330
    [19] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification // Interspeech. Shanghai, 2020: 3830
    [20] Gao S H, Cheng M M, Zhao K, et al. Res2Net: A new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell, 2019, 43(2): 652
    [21] Hu J, Shen L, Sun G. Squeeze-and-excitation networks // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 7132
    [22] Nagrani A, Chung J S, Zisserman A. VoxCeleb: A large-scale speaker identification dataset[J/OL]. arXiv preprint (2018-05-30) [2022-09-19]. https://arxiv.org/abs/1706.08612
    [23] McLaren M, Ferrer L, Castan D, et al. The speakers in the wild (SITW) speaker recognition database // Interspeech. San Francisco, 2016: 818
    [24] Guo Z C, Yang Z, Ge Z R, et al. An endpoint detection method based on speech graph signal processing. J Signal Process, 2022, 38(4): 788 doi: 10.16798/j.issn.1003-0530.2022.04.013
    [25] Zheng Y, Jiang Y X. Speaker clustering algorithm based on feature fusion. J Northeast Univ Nat Sci, 2021, 42(7): 952
    [26] Deng J K, Guo J, Xue N N, et al. ArcFace: Additive angular margin loss for deep face recognition // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 4690
    [27] Chen Z G, Li P, Xiao R Q, et al. A multiscale feature extraction method for text-independent speaker recognition. J Electron Inf Technol, 2021, 43(11): 3266 doi: 10.11999/JEIT200917
Publication history
  • Received: 2022-09-19
  • Published online: 2023-03-06
  • Issue date: 2023-11-01
