Abstract: The classification of imbalanced data has become an important research issue in many data-intensive applications, because the minority samples in such applications usually carry information that is essential to data analysis. Machine learning and data mining currently address data set imbalance in two ways: improving the classification algorithm and reconstructing the data set. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training set without modifying the classification algorithm and has been widely used. However, artificially adding or removing samples inevitably introduces noise or discards original information, which reduces classification accuracy, so well-designed oversampling and undersampling algorithms are the core of any resampling method.

To improve the classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbor relationship of the sample space is proposed. The method first evaluates the safety level of each minority sample according to its spatial neighbor relationship and, guided by that safety level, oversamples the minority class with the synthetic minority oversampling technique (SMOTE). It then computes the local density of each majority sample from its spatial neighbor relationship and undersamples the majority class in densely populated regions. Together, these two steps balance the data set while controlling its size to prevent overfitting, so that the classification of the two classes is balanced. Training and test sets were generated by 5 × 10-fold cross-validation; after the training set was resampled, a kernel extreme learning machine (KELM) was trained as the classifier and verified on the test set. Experimental results on UCI imbalanced data sets and measured circuit fault diagnosis data show that the proposed method is, on the whole, superior to the other resampling algorithms compared.
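The two resampling steps described in the abstract can be illustrated with a short Python sketch. It is not the authors' algorithm: the abstract gives no formulas, so the sketch assumes that the safety level of a minority sample is the number of minority samples among its k nearest neighbors, that SMOTE seeds are drawn with probability proportional to that level, and that the local density of a majority sample is the inverse of its mean distance to its k nearest majority neighbors. All function names, the value of k, and the toy data are hypothetical.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance matrix between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def safety_levels(X, y, minority=1, k=5):
    """Assumed definition: count of minority samples among the k nearest
    neighbors (searched in the whole data set) of each minority sample."""
    idx_min = np.flatnonzero(y == minority)
    dist = pairwise_dist(X[idx_min], X)
    dist[np.arange(len(idx_min)), idx_min] = np.inf  # ignore the sample itself
    nn = np.argsort(dist, axis=1)[:, :k]
    return idx_min, (y[nn] == minority).sum(axis=1)


def guided_smote(X, y, n_new, minority=1, k=5, rng=None):
    """SMOTE interpolation with seeds drawn in proportion to the assumed
    safety level, so isolated (level 0) samples are never used as seeds."""
    rng = np.random.default_rng(rng)
    idx_min, level = safety_levels(X, y, minority, k)
    X_min = X[idx_min]
    if level.sum() == 0:  # no safe seed available: generate nothing
        return np.empty((0, X.shape[1]))
    dist = pairwise_dist(X_min, X_min)
    np.fill_diagonal(dist, np.inf)
    nn_min = np.argsort(dist, axis=1)[:, :min(k, len(X_min) - 1)]
    seeds = rng.choice(len(X_min), size=n_new, p=level / level.sum())
    mates = nn_min[seeds, rng.integers(nn_min.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[seeds] + gap * (X_min[mates] - X_min[seeds])


def density_undersample(X, y, n_remove, majority=0, k=5):
    """Drop the n_remove majority samples with the highest assumed local
    density. Densities are computed once, a simplification of this sketch."""
    idx_maj = np.flatnonzero(y == majority)
    dist = pairwise_dist(X[idx_maj], X[idx_maj])
    np.fill_diagonal(dist, np.inf)
    mean_knn = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    density = 1.0 / (mean_knn + 1e-12)
    drop = idx_maj[np.argsort(density)[::-1][:n_remove]]
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]


# Toy usage: 90 majority samples (label 0) and 10 minority samples (label 1).
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, (90, 2)), data_rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
X_new = guided_smote(X, y, n_new=20, rng=0)
X_kept, y_kept = density_undersample(X, y, n_remove=30)
X_res = np.vstack([X_kept, X_new])
y_res = np.concatenate([y_kept, np.ones(len(X_new), dtype=int)])
```

In a cross-validation setting such as the one described above, resampling of this kind would be applied to each training fold only, so that the test fold keeps its original class distribution.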
Key words:
- imbalanced data
- neighbor relationship
- resample
- local density
- classification

Table 1. Confusion matrix

| Category | Classified as minority | Classified as majority |
|---|---|---|
| Minority | TP | FN |
| Majority | FP | TN |

Table 2. Selected UCI data sets

| Data set | Dimension | Minority/majority | Imbalance ratio |
|---|---|---|---|
| CTG | 21 | 176/1655 | 1:9.403 |
| Diabetes | 8 | 268/500 | 1:1.866 |
| Glass | 9 | 42/172 | 1:4.095 |
| Wine | 13 | 48/130 | 1:2.708 |

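Table 3. Measured circuit data (partial)

| ID | V1_max/V | V1_min/V | V2/V | V3/V | V4/V | V5/V | V6/V | V7/V | V8/V | Attribute |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -7.730 | -6.360 | -6.923 | -6.928 | -6.281 | -2.811 | -2.981 | -5.579 | -0.140 | normal |
| 2 | -7.794 | -6.337 | -6.953 | -6.955 | -6.297 | -2.781 | -2.969 | -5.603 | -0.134 | normal |
| …… | | | | | | | | | | |
| 188 | -7.706 | -6.344 | -6.943 | -6.945 | -6.271 | -2.812 | -3.020 | -5.613 | -0.148 | normal |
| 189 | -7.760 | -6.622 | -7.106 | -7.089 | -6.533 | -2.656 | -2.456 | -4.548 | -0.133 | faulty |
| …… | | | | | | | | | | |
| 233 | -7.792 | -6.597 | -7.078 | -7.049 | -6.503 | -2.670 | -2.544 | -4.726 | -0.113 | faulty |
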
Table 4. Performance comparison in terms of F-value and G-mean

| Data set | Algorithm | RC mean | RC std | F-value mean | F-value std | G-mean mean | G-mean std | C | σ |
|---|---|---|---|---|---|---|---|---|---|
| CTG | SMOTE | 1 | 0 | 0.9714 | 0.0782 | 0.9976 | 0.0045 | 0.1 | 4.9849 |
| CTG | RU-SMOTE | 1 | 0 | 0.9849 | 0.0389 | 0.9984 | 0.0034 | 1 | 4.9056 |
| CTG | BMS | 0.9983 | 0.0118 | 0.9825 | 0.0342 | 0.9972 | 0.0068 | 1 | 5.0038 |
| CTG | RBNR | 1 | 0 | 0.9870 | 0.0382 | 0.9988 | 0.0030 | 1 | 5.0123 |
| Diabetes | SMOTE | 0.6966 | 0.0852 | 0.6515 | 0.0694 | 0.7318 | 0.0486 | 1 | 2.7590 |
| Diabetes | RU-SMOTE | 0.5775 | 0.1121 | 0.6330 | 0.0830 | 0.7079 | 0.0670 | 1 | 3.3938 |
| Diabetes | BMS | 0.6656 | 0.1102 | 0.6595 | 0.0801 | 0.7357 | 0.0652 | 0.1 | 3.0312 |
| Diabetes | RBNR | 0.7871 | 0.0895 | 0.6832 | 0.0624 | 0.7554 | 0.0497 | 0.1 | 3.0156 |
| Glass | SMOTE | 0.8985 | 0.1529 | 0.8902 | 0.1125 | 0.9319 | 0.0865 | 10 | 1.2357 |
| Glass | RU-SMOTE | 0.8523 | 0.1934 | 0.8608 | 0.1266 | 0.8915 | 0.1558 | 10 | 1.2156 |
| Glass | BMS | 0.8656 | 0.2157 | 0.8909 | 0.1371 | 0.9062 | 0.1670 | 10 | 3.3978 |
| Glass | RBNR | 0.9086 | 0.1295 | 0.9062 | 0.0996 | 0.9416 | 0.0693 | 1 | 1.4562 |
| Wine | SMOTE | 1 | 0 | 0.9818 | 0.0513 | 0.9949 | 0.0152 | 10 | 3.9758 |
| Wine | RU-SMOTE | 1 | 0 | 0.9770 | 0.0507 | 0.9914 | 0.0181 | 10 | 3.6135 |
| Wine | BMS | 0.9971 | 0.0202 | 0.9600 | 0.0827 | 0.9874 | 0.0230 | 100 | 4.0360 |
| Wine | RBNR | 1 | 0 | 0.9789 | 0.0454 | 0.9919 | 0.0146 | 10 | 3.7833 |
| Regulator | SMOTE | 0.9272 | 0.1303 | 0.8496 | 0.1067 | 0.9314 | 0.0715 | 1000 | 1.5781 |
| Regulator | RU-SMOTE | 0.9320 | 0.2114 | 0.8304 | 0.1118 | 0.8999 | 0.1931 | 10 | 4.7342 |
| Regulator | BMS | 0.8685 | 0.1930 | 0.8731 | 0.1007 | 0.9025 | 0.1526 | 0.01 | 3.6821 |
| Regulator | RBNR | 0.9075 | 0.1248 | 0.8947 | 0.1043 | 0.9361 | 0.0699 | 10 | 4.6943 |

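The scores reported in Table 4 are derived from the confusion-matrix counts of Table 1. The page does not reproduce the formulas, so the sketch below uses the standard definitions and should be read as an assumption: RC is taken to mean recall on the minority class, the F-value is the F-measure with a hypothetical weight `beta` (1 by default), and the G-mean is the geometric mean of the recall on the two classes. The function name `imbalance_metrics` is illustrative only.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Recall, F-value and G-mean from confusion-matrix counts, with the
    minority class treated as the positive class (standard definitions)."""
    recall = tp / (tp + fn) if tp + fn else 0.0          # minority recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0     # majority recall
    denom = beta ** 2 * precision + recall
    f_value = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "f_value": f_value, "g_mean": g_mean}


# Toy counts: 10 minority and 90 majority test samples.
scores = imbalance_metrics(tp=8, fn=2, fp=4, tn=86)
print(scores)
```

Unlike overall accuracy, both scores collapse when the minority class is ignored: a classifier that labels every sample as majority has TP = 0, so its recall, F-value, and G-mean are all zero even though its accuracy on the toy counts above would be 90%.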