Imbalanced data ensemble classification based on cluster-based under-sampling algorithm
Abstract: Most traditional classification algorithms assume that the data set is balanced and aim to maximize overall classification accuracy. Real data sets, however, are often imbalanced, so these algorithms tend to produce a high classification error rate on minority class samples. Existing methods for improving classification on imbalanced data fall into two main groups. The first works at the data level, using over-sampling to increase the number of minority class samples or under-sampling to reduce the number of majority class samples. The second works at the algorithm level. Building on existing cluster-based under-sampling and ensemble learning methods, this paper combines the two to classify imbalanced data. First, cluster-based under-sampling is applied in the data processing stage to form a balanced data set. The AdaBoost ensemble algorithm is then trained on the new data set. During the ensemble process, weights are introduced to distinguish the contributions of minority class and majority class data to the ensemble error rate, so that the algorithm pays more attention to the minority class and improves its classification accuracy.
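The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm is used or how the class weights are defined. The sketch assumes plain k-means with one cluster per minority sample, keeps the real majority sample closest to each centroid, and uses a hypothetical `minority_cost` factor to make minority-class mistakes count more in the error rate. All function and parameter names are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Keep n_keep majority samples: the one nearest to each centroid."""
    centroids = kmeans(majority, n_keep, seed=seed)
    remaining = list(majority)
    kept = []
    for c in centroids:
        nearest = min(remaining, key=lambda p: sq_dist(p, c))
        remaining.remove(nearest)  # never pick the same sample twice
        kept.append(nearest)
    return kept


def class_weighted_error(y_true, y_pred, sample_w,
                         minority_label=1, minority_cost=2.0):
    """Error rate in which minority-class samples count minority_cost times.

    minority_cost is a hypothetical knob, not a value from the paper.
    """
    wrong = total = 0.0
    for y, p, w in zip(y_true, y_pred, sample_w):
        cost = minority_cost if y == minority_label else 1.0
        total += cost * w
        if y != p:
            wrong += cost * w
    return wrong / total
```

In a boosting loop, an error computed this way would replace the usual weighted error when setting each weak learner's vote, so a learner that misclassifies minority samples is penalized more heavily than one that misclassifies the same number of majority samples.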
-
Key words:
- imbalanced data
- under-sampling
- classification
- ensemble learning
