Abstract: To address the "uniform effect" produced by the classical K–means algorithm when clustering imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. CABON first performs an initial clustering of the data objects and uses a defined undetermined-cluster set to collect the data objects whose cluster membership in the initial result needs further verification. A dynamic adjustment mechanism for this set is then given: following the nearest-neighbor idea, the data objects in the undetermined-cluster set are reassigned, in order from the edge of the set to its center, to the cluster of their nearest neighbor, which yields the final clustering result and avoids the influence of the "uniform effect". The algorithm is compared with K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The results show that CABON effectively mitigates the "uniform effect" produced by K–means on imbalanced data, and its clustering quality is clearly superior to that of K–means, MC_IK, and CVCN.
Keywords:
- K–means
- uniform effect
- undetermined-cluster set
- nearest neighbor
- clustering algorithm for imbalanced data based on nearest neighbor (CABON)
Abstract: Clustering is an important task in the field of data mining. Most clustering algorithms deal effectively with balanced datasets, but their processing ability is weak for imbalanced ones. For example, K–means, a classical partition-based clustering algorithm, tends to produce a "uniform effect" when dealing with imbalanced datasets: it often produces clusters that are relatively uniform in size, with small clusters "swallowing" part of the data objects belonging to large clusters, so that the number and density of data objects in different clusters tend toward the same values. To solve the "uniform effect" problem generated by the classical K–means algorithm on imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. First, an initial clustering of the data objects is performed to obtain the undetermined-cluster set, defined as the set of data objects whose cluster membership must be checked further. Then, from the edge to the center of this set, the nearest-neighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors, while the set itself is dynamically adjusted; this yields the final clustering result and prevents the "uniform effect" from influencing it. The clustering results of the proposed algorithm are compared with those of K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results reveal that the CABON algorithm effectively reduces the "uniform effect" generated by the K–means algorithm on imbalanced data, and its clustering result is superior to those of the K–means, MC_IK, and CVCN algorithms.
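Since the abstract describes the procedure only at a high level, the following minimal Python sketch illustrates the overall flow. It is not the paper's exact algorithm: the margin rule used to construct the undetermined-cluster set, the greedy edge-to-center processing order, and the names `cabon_sketch` and `margin` are assumptions standing in for the paper's precise definitions.

```python
# A minimal sketch of the CABON workflow described above, NOT the paper's
# exact algorithm: the margin rule for building the undetermined-cluster
# set and the greedy processing order are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def cabon_sketch(X, k, margin=0.2):
    # Step 1: initial clustering with classical K-means.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])

    # Step 2: build the undetermined-cluster set -- here, objects whose two
    # nearest centers are nearly equidistant (a hypothetical criterion).
    d = np.sort(pairwise_distances(X, centers), axis=1)
    undetermined = list(np.where(d[:, 1] - d[:, 0] < margin * d[:, 0])[0])
    determined = np.setdiff1d(np.arange(len(X)), undetermined)

    # Step 3: working from the edge of the set inward, give each object the
    # label of its nearest already-determined neighbor; the set shrinks
    # dynamically as objects are resolved.
    while undetermined:
        dist = pairwise_distances(X[undetermined], X[determined])
        i, j = np.unravel_index(dist.argmin(), dist.shape)
        obj = undetermined.pop(i)
        labels[obj] = labels[determined[j]]
        determined = np.append(determined, obj)
    return labels
```

The design point the sketch tries to capture is that boundary objects are not fixed by the initial K–means assignment but re-decided one at a time by their nearest resolved neighbor, which is what lets small clusters keep their objects instead of being evened out.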
Table 1. Parameter information of the datasets

| No. | Dataset | Source | Class distribution | Dimension | Classes | Instances |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Flame | Synthetic | 147:93 | 2 | 2 | 240 |
| 2 | Aggregation | Synthetic | 272:170:127:105:45:35:34 | 2 | 7 | 788 |
| 3 | Jain | Synthetic | 276:97 | 2 | 2 | 373 |
| 4 | DS1 | Synthetic | 1000:200 | 2 | 2 | 1200 |
| 5 | DS2 | Synthetic | 2823:529 | 2 | 2 | 3352 |
| 6 | DS3 | Synthetic | 1000:500:400:200:200 | 2 | 5 | 2300 |
| 7 | Wine | UCI | 59:71:48 | 13 | 3 | 178 |
| 8 | Newthyroid | UCI | 150:35:30 | 5 | 3 | 215 |
| 9 | Ionosphere | UCI | 225:126 | 34 | 2 | 351 |
| 10 | Heart | UCI | 150:120 | 13 | 2 | 270 |
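The synthetic sets are low-dimensional point clouds with skewed class ratios. As a rough illustration only (the paper does not specify its generator), a dataset shaped like DS1, with its 1000:200 ratio, could be simulated with scikit-learn's `make_blobs`; the centers and spreads below are invented for the sketch.

```python
# Hypothetical recreation of an imbalanced 2-D dataset with DS1's
# 1000:200 class ratio; centers and spreads are assumed, not the paper's.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=[1000, 200],     # skewed cluster sizes, as in DS1
    centers=[[0, 0], [5, 5]],  # two separated 2-D centers (assumed)
    cluster_std=[1.5, 0.5],    # majority cluster spreads wider (assumed)
    random_state=0,
)
```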
Table 2. Clustering results of different algorithms on the synthetic datasets (Accuracy)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.985 | 0.8292 | 0.8242 | 0.8583 |
| Aggregation | 0.9505 | 0.9156 | 0.9364 | 0.9010 |
| Jain | 0.8801 | 0.7819 | 0.7943 | 0.8729 |
| DS1 | 0.9817 | 0.8628 | 0.8724 | 0.866 |
| DS2 | 0.9852 | 0.8422 | 0.8640 | 0.8765 |
| DS3 | 0.9917 | 0.9032 | 0.9212 | 0.6687 |
Table 3. Clustering results of different algorithms on the synthetic datasets (F-measure)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.9748 | 0.8313 | 0.8264 | 0.859 |
| Aggregation | 0.9148 | 0.8498 | 0.8504 | 0.8858 |
| Jain | 0.862 | 0.7953 | 0.8069 | 0.876 |
| DS1 | 0.982 | 0.8768 | 0.8864 | 0.9074 |
| DS2 | 0.9851 | 0.8574 | 0.8785 | 0.8889 |
| DS3 | 0.9918 | 0.9061 | 0.9247 | 0.6371 |
Table 4. Clustering results of different algorithms on the synthetic datasets (NMI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.842 | 0.3543 | 0.3461 | 0.407 |
| Aggregation | 0.8635 | 0.8134 | 0.8271 | 0.8129 |
| Jain | 0.3561 | 0.3332 | 0.3661 | 0.4272 |
| DS1 | 0.816 | 0.4696 | 0.4025 | 0.3105 |
| DS2 | 0.5301 | 0.3430 | 0.3850 | 0.4092 |
| DS3 | 0.9651 | 0.8089 | 0.9123 | 0.5014 |
Table 5. Clustering results of different algorithms on the synthetic datasets (RI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.945 | 0.7155 | 0.7091 | 0.7558 |
| Aggregation | 0.9473 | 0.9192 | 0.9259 | 0.9301 |
| Jain | 0.7832 | 0.6581 | 0.6724 | 0.7795 |
| DS1 | 0.9641 | 0.7641 | 0.7776 | 0.7693 |
| DS2 | 0.9426 | 0.7292 | 0.7649 | 0.7834 |
| DS3 | 0.9913 | 0.9065 | 0.9278 | 0.7383 |
Table 6. Clustering results of different algorithms on the UCI datasets (Accuracy)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7542 | 0.6972 | 0.7032 | 0.7331 |
| Newthyroid | 0.8884 | 0.8233 | 0.8512 | 0.8279 |
| Ionosphere | 0.7322 | 0.7123 | 0.698 | 0.7193 |
| Heart | 0.6037 | 0.5889 | 0.6011 | 0.6235 |
Table 7. Clustering results of different algorithms on the UCI datasets (F-measure)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7559 | 0.6912 | 0.7148 | 0.7265 |
| Newthyroid | 0.8835 | 0.8175 | 0.8507 | 0.8251 |
| Ionosphere | 0.7317 | 0.7177 | 0.7021 | 0.6906 |
| Heart | 0.6102 | 0.5853 | 0.5907 | 0.6305 |
Table 8. Clustering results of different algorithms on the UCI datasets (NMI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.4201 | 0.4197 | 0.4298 | 0.3947 |
| Newthyroid | 0.5308 | 0.3918 | 0.4249 | 0.4067 |
| Ionosphere | 0.1421 | 0.1312 | 0.1008 | 0.1268 |
| Heart | 0.0213 | 0.0176 | 0.0222 | 0.0688 |
Table 9. Clustering results of different algorithms on the UCI datasets (RI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7223 | 0.7106 | 0.7128 | 0.7183 |
| Newthyroid | 0.8199 | 0.7402 | 0.7867 | 0.7501 |
| Ionosphere | 0.6067 | 0.5889 | 0.5772 | 0.5963 |
| Heart | 0.5806 | 0.5144 | 0.5182 | 0.5330 |
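The tables do not spell out how the four indices are computed. The sketch below shows one common convention, assumed here rather than taken from the paper: clusters are matched to classes with the Hungarian algorithm before Accuracy and the (macro-averaged) F-measure are taken, while NMI and RI are computed directly on the two label partitions; `match_clusters` and the toy arrays are illustrative.

```python
# One plausible way to compute the four reported indices from true labels
# y and cluster labels y_pred (both coded 0..k-1); the Hungarian matching
# and macro-averaged F-measure are assumed conventions, not the paper's.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (confusion_matrix, f1_score,
                             normalized_mutual_info_score, rand_score)

def match_clusters(y, y_pred):
    # Map each cluster to the class that maximizes total agreement.
    cm = confusion_matrix(y, y_pred)
    rows, cols = linear_sum_assignment(-cm)  # maximize matched counts
    mapping = dict(zip(cols, rows))
    return np.array([mapping[c] for c in y_pred])

# Toy example: two imbalanced classes, one boundary point mislabeled.
y      = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

y_map = match_clusters(y, y_pred)
print("Accuracy ", (y_map == y).mean())
print("F-measure", f1_score(y, y_map, average="macro"))
print("NMI      ", normalized_mutual_info_score(y, y_pred))
print("RI       ", rand_score(y, y_pred))  # requires scikit-learn >= 0.24
```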