Resampling algorithm for imbalanced data based on their neighbor relationship

李睿峰; 李文海; 孫艷麗; 吳陽勇

doi:10.13374/j.issn2095-9389.2020.04.05.002

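Naval Aviation University, Yantai 264001, China

Funding: Supported by the intra-military research project "Research on Key Testing Technologies for New-Generation Avionics Equipment" (4172122113R)

Corresponding author: E-mail: dongzhi1110@foxmail.com

CLC number: TP206.1

Publication history
- Received: 2020-04-05
- Published: 2021-06-25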

Abstract: The classification of imbalanced data has become an important research issue in many data-intensive applications, where the minority samples usually carry important information that plays a key role in data analysis. At present, machine learning and data mining address data set imbalance in two ways: improving the classification algorithm and reconstructing the data set. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training data set without modifying the classification algorithm and has been widely used. Because artificially adding or removing samples inevitably introduces noise or loses original data information, which reduces classification accuracy, well-designed oversampling and undersampling algorithms are the core of the resampling method. To improve the classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbor relationship of the sample space is proposed. The method first evaluates the safe level of each minority sample according to its spatial neighbor relationship and, guided by that safe level, oversamples the minority class with the synthetic minority oversampling technique (SMOTE). It then calculates the local density of the majority samples from their spatial neighbor relationship and undersamples the majority class in densely populated regions. Together, these two steps balance the data set and control its size to prevent overfitting, so that classification performance is balanced across the two classes. The training and test sets were generated by 5 × 10-fold cross validation. After the training set was resampled, a kernel extreme learning machine (KELM) was trained as the classifier and verified on the test set. Experimental results on UCI imbalanced data sets and on measured circuit fault diagnosis data show that the proposed method is, on the whole, superior to the other resampling algorithms compared.

Keywords:
- imbalanced data
- neighbor relationship
- resampling
- local density
- classification
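
The two resampling steps described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' RBNR implementation: the abstract gives no formulas, so the safe level is assumed to be the number of minority samples among the k nearest neighbours (as in Safe-level-SMOTE), the local density is assumed to be the inverse of the mean distance to the k nearest majority neighbours, and all function names and the toy data are hypothetical.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def safe_levels(points, labels, k, minority=1):
    """Assumed safe level: count of minority samples among the k neighbours."""
    return {
        i: sum(labels[j] == minority for j in k_nearest(i, points, k))
        for i in range(len(points))
        if labels[i] == minority
    }


def oversample(points, labels, n_new, k=3, minority=1, rng=None):
    """SMOTE-style interpolation seeded only from samples with safe level > 0."""
    rng = rng or random.Random(0)
    levels = safe_levels(points, labels, k, minority)
    seeds = [i for i, level in levels.items() if level > 0]  # skip isolated samples
    synthetic = []
    while seeds and len(synthetic) < n_new:
        i = rng.choice(seeds)
        mates = [j for j in k_nearest(i, points, k) if labels[j] == minority]
        j = rng.choice(mates)  # non-empty because the safe level of i is > 0
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(points[i], points[j])))
    return synthetic


def undersample(points, labels, n_drop, k=3, majority=0):
    """Return indices kept after dropping the n_drop densest majority samples."""
    maj = [i for i in range(len(points)) if labels[i] == majority]
    maj_points = [points[i] for i in maj]

    def density(pos):
        # Assumed density: inverse mean distance to the k nearest majority samples
        nbrs = k_nearest(pos, maj_points, k)
        if not nbrs:
            return 0.0
        mean_d = sum(math.dist(maj_points[pos], maj_points[j]) for j in nbrs) / len(nbrs)
        return 1.0 / (mean_d + 1e-12)

    ranked = sorted(range(len(maj)), key=density, reverse=True)
    dropped = {maj[pos] for pos in ranked[:n_drop]}
    return [i for i in range(len(points)) if i not in dropped]


# Toy data: a minority cluster, one isolated minority point,
# a sparse majority group and a dense majority group.
points = [(0, 0), (0, 1), (1, 0), (10, 10),
          (10, 11), (11, 10), (9, 10), (10, 9),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
levels = safe_levels(points, labels, k=2)
new_minority = oversample(points, labels, n_new=5, k=2)
kept = undersample(points, labels, n_drop=2, k=2)
```

In this toy run the isolated minority point at (10, 10) has safe level 0 and is never used as a seed, and the two removed majority samples come from the dense group around (5, 5). Dropping the densest samples in a single pass is a simplification; recomputing densities after each removal is an equally plausible reading of the abstract.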

Figure 1. Flowchart of the RBNR algorithm

Figure 2. Series voltage regulator circuit

Figure 3. Testing environment

Figure 4. Parameter analysis of BMS: (a) analysis of R_C; (b) analysis of F-value; (c) analysis of G-mean

Figure 5. Bar graph of result comparison: (a) comparison of R_C; (b) comparison of F-value; (c) comparison of G-mean

Table 1. Confusion matrix

Category	Classified as minority	Classified as majority
Minority	TP	FN
Majority	FP	TN
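
The F-value and G-mean reported later can be computed from the four cells of the confusion matrix above. The sketch below uses the standard definitions with the minority class as the positive class and assumes an F-value with β = 1; the page does not state the exact formulas, and it does not define R_C, so that measure is left out. The function name and the counts in the example are hypothetical.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """F-value (beta = 1 assumed) and G-mean, minority class as positive."""
    recall = tp / (tp + fn)        # accuracy on the minority class
    specificity = tn / (tn + fp)   # accuracy on the majority class
    precision = tp / (tp + fp)
    f_value = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean


# Hypothetical counts, for illustration only
f_value, g_mean = imbalance_metrics(tp=40, fn=10, fp=20, tn=180)
```

Both measures drop sharply when the minority class is poorly recognized, which is why they are preferred over plain accuracy on imbalanced data. The function does not guard against empty rows or columns of the matrix, which would cause a division by zero.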

Table 2. Selected UCI data sets

Data set	Dimension	Minority/majority	Imbalance ratio
CTG	21	176/1655	1:9.403
Diabetes	8	268/500	1:1.866
Glass	9	42/172	1:4.095
Wine	13	48/130	1:2.708
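
The imbalance ratios in Table 2 follow directly from the class counts, as the short check below shows. The names `datasets` and `imbalance_ratio` are illustrative only.

```python
# (minority, majority) sample counts taken from Table 2
datasets = {
    "CTG": (176, 1655),
    "Diabetes": (268, 500),
    "Glass": (42, 172),
    "Wine": (48, 130),
}


def imbalance_ratio(minority, majority):
    """Express the class proportion as 1:x with three decimals."""
    return f"1:{majority / minority:.3f}"


ratios = {name: imbalance_ratio(*counts) for name, counts in datasets.items()}
```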

Table 3. Measured circuit data (partial)

ID	V_{1_max}/V	V_{1_min}/V	V₂/V	V₃/V	V₄/V	V₅/V	V₆/V	V₇/V	V₈/V	Attribute
1	-7.730	-6.360	-6.923	-6.928	-6.281	-2.811	-2.981	-5.579	-0.140	normal
2	-7.794	-6.337	-6.953	-6.955	-6.297	-2.781	-2.969	-5.603	-0.134	normal
……
188	-7.706	-6.344	-6.943	-6.945	-6.271	-2.812	-3.020	-5.613	-0.148	normal
189	-7.760	-6.622	-7.106	-7.089	-6.533	-2.656	-2.456	-4.548	-0.133	faulty
……
233	-7.792	-6.597	-7.078	-7.049	-6.503	-2.670	-2.544	-4.726	-0.113	faulty

Table 4. Performance comparison in terms of R_C, F-value, and G-mean

Data set	Algorithm	R_C (mean)	R_C (std)	F-value (mean)	F-value (std)	G-mean (mean)	G-mean (std)	C	σ
CTG	SMOTE	1	0	0.9714	0.0782	0.9976	0.0045	0.1	4.9849
CTG	RU-SMOTE	1	0	0.9849	0.0389	0.9984	0.0034	1	4.9056
CTG	BMS	0.9983	0.0118	0.9825	0.0342	0.9972	0.0068	1	5.0038
CTG	RBNR	1	0	0.9870	0.0382	0.9988	0.0030	1	5.0123
Diabetes	SMOTE	0.6966	0.0852	0.6515	0.0694	0.7318	0.0486	1	2.7590
Diabetes	RU-SMOTE	0.5775	0.1121	0.6330	0.0830	0.7079	0.0670	1	3.3938
Diabetes	BMS	0.6656	0.1102	0.6595	0.0801	0.7357	0.0652	0.1	3.0312
Diabetes	RBNR	0.7871	0.0895	0.6832	0.0624	0.7554	0.0497	0.1	3.0156
Glass	SMOTE	0.8985	0.1529	0.8902	0.1125	0.9319	0.0865	10	1.2357
Glass	RU-SMOTE	0.8523	0.1934	0.8608	0.1266	0.8915	0.1558	10	1.2156
Glass	BMS	0.8656	0.2157	0.8909	0.1371	0.9062	0.1670	10	3.3978
Glass	RBNR	0.9086	0.1295	0.9062	0.0996	0.9416	0.0693	1	1.4562
Wine	SMOTE	1	0	0.9818	0.0513	0.9949	0.0152	10	3.9758
Wine	RU-SMOTE	1	0	0.9770	0.0507	0.9914	0.0181	10	3.6135
Wine	BMS	0.9971	0.0202	0.9600	0.0827	0.9874	0.0230	100	4.0360
Wine	RBNR	1	0	0.9789	0.0454	0.9919	0.0146	10	3.7833
Regulator	SMOTE	0.9272	0.1303	0.8496	0.1067	0.9314	0.0715	1000	1.5781
Regulator	RU-SMOTE	0.9320	0.2114	0.8304	0.1118	0.8999	0.1931	10	4.7342
Regulator	BMS	0.8685	0.1930	0.8731	0.1007	0.9025	0.1526	0.01	3.6821
Regulator	RBNR	0.9075	0.1248	0.8947	0.1043	0.9361	0.0699	10	4.6943
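
The C and σ columns in Table 4 are presumably the regularization coefficient and Gaussian kernel width of the kernel extreme learning machine used as the classifier. A minimal sketch of the standard KELM formulation is given below, with output weights β = (I/C + K)⁻¹ t. The kernel width convention exp(-‖u − v‖² / (2σ²)), the ±1 target coding, the function names, and the toy data are all assumptions, not details taken from the paper.

```python
import math


def rbf(u, v, sigma):
    """Gaussian kernel; the width convention is an assumption."""
    return math.exp(-math.dist(u, v) ** 2 / (2 * sigma ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def kelm_fit(xs, ts, c, sigma):
    """Output weights beta = (I / C + K)^-1 t for targets t in {-1, +1}."""
    n = len(xs)
    k = [[rbf(xs[i], xs[j], sigma) + (1.0 / c if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(k, ts)


def kelm_predict(x, xs, beta, sigma):
    """Sign of the kernel expansion sum_i beta_i * K(x, x_i)."""
    score = sum(b * rbf(x, xi, sigma) for b, xi in zip(beta, xs))
    return 1 if score >= 0 else -1


# Toy data: +1 for the minority class, -1 for the majority class
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_t = [1, 1, 1, -1, -1, -1]
beta = kelm_fit(train_x, train_t, c=10, sigma=1.0)
near_minority = kelm_predict((0.5, 0.5), train_x, beta, sigma=1.0)
near_majority = kelm_predict((5.5, 5.5), train_x, beta, sigma=1.0)
```

A larger C weakens the regularization term I/C, and a larger σ widens the kernel, so both would normally be tuned per data set, which is consistent with the differing values listed in the table.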
