Named entity recognition of Chinese electronic medical records based on multifeature embedding and attention mechanism

GONG Dun-wei; ZHANG Yong-kai; GUO Yi-nan; WANG Bin; FAN Kuan-lu; HUO Yan

doi:10.13374/j.issn2095-9389.2021.01.12.006

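GONG Dun-wei^{1, 2}, ZHANG Yong-kai^{1, 2}, GUO Yi-nan^{1, 2}, WANG Bin^{1, 2}, FAN Kuan-lu^{3}, HUO Yan^{4}

1. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
2. Intelligent Medical Center, Institute of Artificial Intelligence, China University of Mining and Technology, Xuzhou 221116, China
3. Department of Endocrinology, the Second Affiliated Hospital of Xuzhou Medical University, Xuzhou 221000, China
4. Department of Endocrinology, Affiliated Hospital of China University of Mining and Technology, Xuzhou 221116, China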

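Funding: National Natural Science Foundation of China (61973305, 61773384); Fundamental Research Funds for the Central Universities, China University of Mining and Technology (2020ZDPY0302)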

Corresponding author: E-mail: nanfly@126.com

CLC number: TP391.1
Publication history
- Received: 2021-01-12
- Published online: 2021-03-02
- Issue date: 2021-09-18

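Abstract

The text of Chinese electronic medical records contains many nested entities, has complex sentence syntax, and tends toward short sentences. To recognize its medical entities effectively, a named entity recognition algorithm that fuses multifeature embedding with an attention mechanism is proposed. The input representation layer fuses features at three granularities (character, word, and glyph), and an attention mechanism is introduced into the hidden layer of the bidirectional long short-term memory network so that the algorithm attends more closely to characters related to medical entities when capturing features. The algorithm finally produces the optimal labeling of five entity types in Chinese electronic medical records: diseases, body parts, symptoms, drugs, and operations. In experiments on open-source and self-built diabetes datasets, the precision, recall, and F1 value of the proposed algorithm all exceed 97%, indicating that it recognizes the various entities in Chinese electronic medical records more effectively.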
Abstract: Medical records are an essential part of residents' health care records and hold all the information about a patient's clinical treatment. They were traditionally written by doctors on paper. With the development of information technology, electronic medical records, which are easier to store and manage, have gradually replaced paper records. Intelligent auxiliary diagnosis, patient portrait construction, and disease prediction based on medical reports have become research hotspots in intelligent medical care. An efficient named entity recognition algorithm is the key to discovering the hidden relationships between symptoms and diseases in the documents stored in electronic medical records. Although several studies have addressed named entity recognition, relatively little research covers information extraction from Chinese electronic medical records. Documents in Chinese electronic medical records contain a large number of nested named entities and short sentences. Moreover, the logic linking the sentences is weak, which makes the syntactic structure complex. To recognize medical entities effectively, a novel named entity recognition method based on multifeature embedding and an attention mechanism was proposed. Three types of features derived from characters, words, and glyphs were embedded in the input representation layer, and an attention mechanism was introduced into the hidden layer of the bidirectional long short-term memory network to make the model focus on the characters related to medical entities. Finally, the optimal labels for the five types of entities in Chinese electronic medical records (diseases, body parts, symptoms, drugs, and operations) were obtained. In experiments on open and self-built Chinese electronic medical record datasets, the precision, recall, and F1 value of the proposed algorithm all exceeded 97%, which shows that the proposed algorithm can effectively identify various entities in Chinese electronic medical records.
Keywords: Chinese / electronic medical records / named entity recognition / multifeature embedding / attention mechanism
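
The fusion and attention steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the three feature vectors are fused by concatenation, assumes scaled dot-product self-attention, and applies the attention directly to the fused vectors in place of BiLSTM hidden states. The function names and all numbers are hypothetical.

```python
import math


def fuse(char_vec, word_vec, glyph_vec):
    """Concatenate the three per-character feature vectors (assumed fusion rule)."""
    return char_vec + word_vec + glyph_vec


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(hidden):
    """Re-weight every position by its scaled dot-product similarity to all positions."""
    scale = math.sqrt(len(hidden[0]))
    contexts = []
    for h_i in hidden:
        weights = softmax([dot(h_i, h_j) / scale for h_j in hidden])
        context = [
            sum(w * h_j[k] for w, h_j in zip(weights, hidden))
            for k in range(len(h_i))
        ]
        contexts.append(context)
    return contexts


# Toy features for a three-character sentence (made-up values).
chars = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
words = [[0.4], [0.4], [0.9]]
glyphs = [[0.7, 0.1], [0.2, 0.2], [0.6, 0.3]]

fused = [fuse(c, w, g) for c, w, g in zip(chars, words, glyphs)]
attended = self_attention(fused)  # stands in for attention over BiLSTM states
```

Each row of `attended` is a weighted average of all fused vectors, so a character that resembles entity-related characters draws more of its context from them. In a full tagger these vectors would feed the labeling layer.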

Figure 1. MFBAC framework

Figure 2. Comparison of the F1 values of different NER models

Table 1. Types of named entities

Entity class	Identifiers	Definition
Diseases	B-diseases I-diseases	Terms of various diseases
Symptom	B-symptom I-symptom	Abnormal physical manifestations
Body	B-body I-body	Various parts of the human body
Drug	B-drug I-drug	The names of various medicines
Test	B-test I-test	Various physical examinations
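
To show how the B-/I- identifiers in Table 1 map back to entities, the following sketch groups a tagged character sequence into typed spans. It assumes the usual convention that characters outside any entity carry an `O` tag, which Table 1 does not list. The example sentence and the `extract_entities` helper are hypothetical.

```python
def extract_entities(chars, tags):
    """Group B-/I- tags into (entity_type, text) pairs.

    An I- tag that does not continue an open entity of the same type is
    treated like 'O' (assumed policy; it closes any open entity).
    """
    entities = []
    current_type, current_chars = None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            if current_type is not None:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type is not None:
        entities.append((current_type, "".join(current_chars)))
    return entities


# Made-up sentence: "糖尿病患者口渴" (a diabetes patient is thirsty).
sentence = list("糖尿病患者口渴")
labels = ["B-diseases", "I-diseases", "I-diseases", "O", "O",
          "B-symptom", "I-symptom"]
found = extract_entities(sentence, labels)
# found == [("diseases", "糖尿病"), ("symptom", "口渴")]
```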

Table 2. Distribution of medical entities in the training and test sets

Entity type	Training data	Test data
Diseases	856	382
Symptom	3845	1526
Body	563	214
Drug	657	289
Test	3426	1647
Total	9347	4058

Table 3. NER performance with different feature embeddings (CW: character and word embeddings; CWF: character, word, and font (glyph) embeddings)

Model	P/%	R/%	F1/%
Font embedding-BiLSTM-CRF	79.51	80.35	79.72
Char embedding-BiLSTM-CRF	88.61	87.43	87.96
Word embedding-BiLSTM-CRF	85.82	86.87	86.32
CW embedding-BiLSTM-CRF	86.58	87.23	87.62
CWF embedding-BiLSTM-CRF	96.24	97.25	96.94
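
For readers unfamiliar with the P, R, and F1 columns, the sketch below computes entity-level precision, recall, and F1 from gold and predicted spans. It assumes exact matching on (start, end, type) and a single pooled score. This page does not state the matching criterion or averaging scheme the authors used, so the tabulated values need not follow this formula exactly. The spans are made up.

```python
def prf1(gold, predicted):
    """Return (precision, recall, F1) for two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical spans: (start, end, entity_type).
gold_spans = [(0, 3, "diseases"), (5, 7, "symptom"), (9, 11, "drug")]
pred_spans = [(0, 3, "diseases"), (5, 7, "symptom"),
              (8, 11, "drug"), (12, 14, "test")]

p, r, f1 = prf1(gold_spans, pred_spans)
# Two of four predictions are correct, and two of three gold spans are found.
```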

Table 4. NER performance of different feature embeddings with the attention mechanism

Model	P/%	R/%	F1/%
Font embedding-BiLSTM-Att-CRF	92.46	93.12	92.68
Char embedding-BiLSTM-Att-CRF	93.41	93.56	93.49
Word embedding-BiLSTM-Att-CRF	96.36	96.18	96.21
CW embedding-BiLSTM-Att-CRF	96.52	96.18	96.45
CWF embedding-BiLSTM-Att-CRF	97.21	97.83	97.54

Table 5. Performance comparison of different NER models

Model	P/%	R/%	F1/%	Loading time/s	Testing time/s
Transformer	85.46	86.32	85.68	4.33	12.6
BiGRU-CRF	85.87	86.23	86.14	2.95	9.4
BiLSTM-CRF	88.61	87.43	95.16	3.21	9.81
Attention-BiLSTM-CRF	94.52	96.18	96.45	3.56	10.56
Transformer-CRF	95.32	94.62	94.14	5.32	13.57
MFBAC	97.21	97.83	97.54	4.34	11.68
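
Several of the models compared above end in a CRF layer. As general background, and not as the authors' code, a CRF tagger typically selects the highest-scoring tag sequence with Viterbi decoding. The sketch below assumes per-position emission scores and pairwise transition scores are already available. All scores, the tag set, and the `viterbi` helper are hypothetical.

```python
def viterbi(emissions, transitions, tags):
    """Return the best tag path.

    emissions: list of {tag: score}, one dict per position.
    transitions: {(previous_tag, current_tag): score}; missing pairs score 0.0.
    """
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tag_set = ["O", "B-drug", "I-drug"]
# Made-up scores: discourage O -> I-drug, encourage B-drug -> I-drug.
trans = {("O", "I-drug"): -10.0, ("B-drug", "I-drug"): 1.0}
emit = [
    {"O": 0.1, "B-drug": 2.0, "I-drug": 0.5},
    {"O": 0.2, "B-drug": 0.1, "I-drug": 1.5},
    {"O": 1.0, "B-drug": 0.0, "I-drug": 0.3},
]
decoded = viterbi(emit, trans, tag_set)
# decoded == ["B-drug", "I-drug", "O"]
```

The transition scores are what let a CRF layer rule out invalid sequences such as an `I-` tag directly after `O`, which a per-character classifier cannot enforce on its own.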
