Named entity recognition of Chinese electronic medical records based on multifeature embedding and attention mechanism
Abstract: Medical records are an essential part of residents' health care records and hold all the information about a patient's clinical treatment. They were traditionally written by doctors on paper. With the development of information technology, electronic medical records, which are easier to store and manage, have gradually replaced paper ones. Intelligent auxiliary diagnosis, patient portrait construction, and disease prediction based on medical reports have become research hotspots in intelligent medical care. To fully uncover the hidden relationships between symptoms and diseases in the documents stored in electronic medical records, an efficient named entity recognition algorithm is essential. Although several studies have addressed named entity recognition, relatively little research covers information extraction from Chinese electronic medical records. Documents in Chinese electronic medical records contain a large number of nested named entities and short sentences, and the logic linking the sentences is weak, which makes the syntactic structure complex. To recognize the medical entities effectively, a named entity recognition method based on multifeature embedding and an attention mechanism is proposed. Features at three granularities (characters, words, and glyphs) are fused in the input representation layer, and an attention mechanism is introduced into the hidden layer of the bidirectional long short-term memory (BiLSTM) network so that the model focuses on the characters related to medical entities. Finally, the optimal labels for the five types of entities in Chinese electronic medical records (diseases, body parts, symptoms, drugs, and operations) are obtained. In experiments on open-source and self-built diabetes datasets of Chinese electronic medical records, the precision, recall, and F1 score of the proposed algorithm all exceed 97%, which shows that it can identify the various entities in Chinese electronic medical records more effectively.
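The two ideas in the abstract, fusing character, word, and glyph features and then attending over the sequence, can be illustrated with a minimal sketch. The abstract does not state how the features are fused or which attention form is used, so concatenation and scaled dot-product self-attention are assumptions here. The vector sizes, the random vectors, and the example sentence are hypothetical, and the fused inputs stand in for the BiLSTM hidden states that the real model would produce.

```python
import math
import random


def fuse_features(char_vec, word_vec, glyph_vec):
    """Fuse the three per-character feature vectors by concatenation (assumed fusion)."""
    return char_vec + word_vec + glyph_vec


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(hidden_states):
    """Scaled dot-product self-attention over a sequence of vectors (assumed form).

    Returns one context vector per position and the attention weights.
    """
    dim = len(hidden_states[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in hidden_states:
        row = softmax([dot(query, key) / scale for key in hidden_states])
        context = [
            sum(w * state[d] for w, state in zip(row, hidden_states))
            for d in range(dim)
        ]
        contexts.append(context)
        weights.append(row)
    return contexts, weights


rng = random.Random(0)


def rand_vec(size):
    """Hypothetical stand-in for a learned embedding lookup."""
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


# Hypothetical six-character sentence: "the patient's blood glucose is elevated".
sentence = "患者血糖升高"

# One fused vector per character: 4 char dims + 3 word dims + 2 glyph dims.
inputs = [fuse_features(rand_vec(4), rand_vec(3), rand_vec(2)) for _ in sentence]

# In the described model these vectors would first pass through a BiLSTM;
# here they are used directly in place of its hidden states.
contexts, weights = attend(inputs)
```

Each row of `weights` sums to 1 and indicates how strongly one character attends to every other character, which is the quantity an attention layer would learn to concentrate on entity-related characters.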
Table 1. Types of named entities
Entity class    Identifiers               Definition
Diseases        B-diseases, I-diseases    Terms for various diseases
Symptom         B-symptom, I-symptom      Abnormal physical manifestations
Body            B-body, I-body            Various parts of the human body
Drug            B-drug, I-drug            Names of various medicines
Test            B-test, I-test            Various physical examinations
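The B-/I- identifiers in Table 1 follow the usual begin/inside tagging convention, where the first character of an entity gets a `B-` tag and the remaining characters get `I-` tags. The sketch below shows how character spans could be turned into such tags. The `O` tag for non-entity characters, the helper name `spans_to_bio`, and the example sentence are assumptions that do not come from the source, and the sketch handles only non-overlapping spans, so it does not represent nested entities.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into one tag per character.

    `end` is exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(text)  # "O" marks characters outside any entity (assumed)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical sentence: "the patient felt dizzy after taking metformin".
text = "患者服用二甲双胍后头晕"
spans = [(4, 8, "drug"), (9, 11, "symptom")]  # 二甲双胍 (metformin), 头晕 (dizziness)

tags = spans_to_bio(text, spans)
for char, tag in zip(text, tags):
    print(char, tag)
```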
Table 2. Distribution of medical entities in the training and test sets
Entity type    Training set    Test set
Diseases       856             382
Symptom        3845            1526
Body           563             214
Drug           657             289
Test           3426            1647
Total          9347            4058
Table 3. NER performance with different feature embeddings
Model                        P/%      R/%      F1/%
Font embedding-BiLSTM-CRF    79.51    80.35    79.72
Char embedding-BiLSTM-CRF    88.61    87.43    87.96
Word embedding-BiLSTM-CRF    85.82    86.87    86.32
CW embedding-BiLSTM-CRF      86.58    87.23    87.62
CWF embedding-BiLSTM-CRF     96.24    97.25    96.94
Table 4. Performance of NER with attention
Model                            P/%      R/%      F1/%
Font embedding-BiLSTM-Att-CRF    92.46    93.12    92.68
Char embedding-BiLSTM-Att-CRF    93.41    93.56    93.49
Word embedding-BiLSTM-Att-CRF    96.36    96.18    96.21
CW embedding-BiLSTM-Att-CRF      96.52    96.18    96.45
CWF embedding-BiLSTM-Att-CRF     97.21    97.83    97.54
Table 5. Comparison of the performance of different NER models
Model                   P/%      R/%      F1/%     Loading time/s    Testing time/s
Transformer             85.46    86.32    85.68    4.33              12.6
BiGRU-CRF               85.87    86.23    86.14    2.95              9.4
BiLSTM-CRF              88.61    87.43    95.16    3.21              9.81
Attention-BiLSTM-CRF    94.52    96.18    96.45    3.56              10.56
Transformer-CRF         95.32    94.62    94.14    5.32              13.57
MFBAC                   97.21    97.83    97.54    4.34              11.68
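The P, R, and F1 columns in the tables are precision, recall, and their harmonic mean. The source does not say whether they are computed per entity or per character, so the sketch below assumes entity-level exact matching, in which a predicted entity counts as correct only if its boundaries and label both match the gold annotation. The function names and the toy tag sequences are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans encoded by a BIO tag sequence."""
    entities = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            entities.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 under exact span-and-label matching."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the drug entity is found exactly, the symptom entity is cut short.
gold = ["O", "B-drug", "I-drug", "O", "B-symptom", "I-symptom"]
pred = ["O", "B-drug", "I-drug", "O", "B-symptom", "O"]

p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which gives a quick sanity check when reading the rows of a results table.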