Abstract: To address the "uniform effect" produced by the classical K–means algorithm when clustering imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. CABON first performs an initial clustering of the data objects and uses a defined undetermined-cluster set to collect the data objects whose cluster membership in the initial result needs further verification. A dynamic adjustment mechanism for this set is then given: following the nearest-neighbor idea, the data objects in the undetermined-cluster set are reassigned, in order from the edge of the set to its center, to the cluster of their nearest neighbor, which yields the final clustering result and avoids the influence of the "uniform effect". The algorithm is compared with K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The results show that CABON effectively mitigates the "uniform effect" produced by K–means on imbalanced data, and its clustering quality is clearly superior to that of K–means, MC_IK, and CVCN.
Keywords:
- K–means
- uniform effect
- undetermined-cluster set
- nearest neighbor
- clustering algorithm for imbalanced data based on nearest neighbor (CABON)
Abstract: Clustering is an important task in the field of data mining. Most clustering algorithms deal effectively with balanced datasets, but their processing ability is weak for imbalanced ones. For example, K–means, a classical partition-based clustering algorithm, tends to produce a "uniform effect" when dealing with imbalanced datasets: it often produces clusters that are relatively uniform in size, with small clusters "swallowing" part of the data objects belonging to large clusters, so that the number and density of data objects in different clusters tend toward the same values. To solve the "uniform effect" problem generated by the classical K–means algorithm on imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. First, an initial clustering of the data objects is performed to obtain the undetermined-cluster set, defined as the set of data objects whose cluster membership must be checked further. Then, from the edge to the center of this set, the nearest-neighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors, while the set itself is dynamically adjusted; this yields the final clustering result and prevents the "uniform effect" from influencing it. The clustering results of the proposed algorithm are compared with those of K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results reveal that the CABON algorithm effectively reduces the "uniform effect" generated by the K–means algorithm on imbalanced data, and its clustering result is superior to those of the K–means, MC_IK, and CVCN algorithms.
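Since the abstract describes the procedure only at a high level, the following minimal Python sketch illustrates the overall flow. It is not the paper's exact algorithm: the margin rule used to construct the undetermined-cluster set, the greedy edge-to-center processing order, and the names `cabon_sketch` and `margin` are assumptions standing in for the paper's precise definitions.

```python
# A minimal sketch of the CABON workflow described above, NOT the paper's
# exact algorithm: the margin rule for building the undetermined-cluster
# set and the greedy processing order are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def cabon_sketch(X, k, margin=0.2):
    # Step 1: initial clustering with classical K-means.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])

    # Step 2: build the undetermined-cluster set -- here, objects whose two
    # nearest centers are nearly equidistant (a hypothetical criterion).
    d = np.sort(pairwise_distances(X, centers), axis=1)
    undetermined = list(np.where(d[:, 1] - d[:, 0] < margin * d[:, 0])[0])
    determined = np.setdiff1d(np.arange(len(X)), undetermined)

    # Step 3: working from the edge of the set inward, give each object the
    # label of its nearest already-determined neighbor; the set shrinks
    # dynamically as objects are resolved.
    while undetermined:
        dist = pairwise_distances(X[undetermined], X[determined])
        i, j = np.unravel_index(dist.argmin(), dist.shape)
        obj = undetermined.pop(i)
        labels[obj] = labels[determined[j]]
        determined = np.append(determined, obj)
    return labels
```

The design point the sketch tries to capture is that boundary objects are not fixed by the initial K–means assignment but re-decided one at a time by their nearest resolved neighbor, which is what lets small clusters keep their objects instead of being evened out.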
Table 1. Parameter information of the datasets

| No. | Dataset | Source | Class distribution | Dimension | Classes | Instances |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Flame | Synthetic | 147:93 | 2 | 2 | 240 |
| 2 | Aggregation | Synthetic | 272:170:127:105:45:35:34 | 2 | 7 | 788 |
| 3 | Jain | Synthetic | 276:97 | 2 | 2 | 373 |
| 4 | DS1 | Synthetic | 1000:200 | 2 | 2 | 1200 |
| 5 | DS2 | Synthetic | 2823:529 | 2 | 2 | 3352 |
| 6 | DS3 | Synthetic | 1000:500:400:200:200 | 2 | 5 | 2300 |
| 7 | Wine | UCI | 59:71:48 | 13 | 3 | 178 |
| 8 | Newthyroid | UCI | 150:35:30 | 5 | 3 | 215 |
| 9 | Ionosphere | UCI | 225:126 | 34 | 2 | 351 |
| 10 | Heart | UCI | 150:120 | 13 | 2 | 270 |
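The synthetic sets are low-dimensional point clouds with skewed class ratios. As a rough illustration only (the paper does not specify its generator), a dataset shaped like DS1, with its 1000:200 ratio, could be simulated with scikit-learn's `make_blobs`; the centers and spreads below are invented for the sketch.

```python
# Hypothetical recreation of an imbalanced 2-D dataset with DS1's
# 1000:200 class ratio; centers and spreads are assumed, not the paper's.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=[1000, 200],     # skewed cluster sizes, as in DS1
    centers=[[0, 0], [5, 5]],  # two separated 2-D centers (assumed)
    cluster_std=[1.5, 0.5],    # majority cluster spreads wider (assumed)
    random_state=0,
)
```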
Table 2. Clustering results of different algorithms on the synthetic datasets (Accuracy)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.985 | 0.8292 | 0.8242 | 0.8583 |
| Aggregation | 0.9505 | 0.9156 | 0.9364 | 0.9010 |
| Jain | 0.8801 | 0.7819 | 0.7943 | 0.8729 |
| DS1 | 0.9817 | 0.8628 | 0.8724 | 0.866 |
| DS2 | 0.9852 | 0.8422 | 0.8640 | 0.8765 |
| DS3 | 0.9917 | 0.9032 | 0.9212 | 0.6687 |
Table 3. Clustering results of different algorithms on the synthetic datasets (F-measure)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.9748 | 0.8313 | 0.8264 | 0.859 |
| Aggregation | 0.9148 | 0.8498 | 0.8504 | 0.8858 |
| Jain | 0.862 | 0.7953 | 0.8069 | 0.876 |
| DS1 | 0.982 | 0.8768 | 0.8864 | 0.9074 |
| DS2 | 0.9851 | 0.8574 | 0.8785 | 0.8889 |
| DS3 | 0.9918 | 0.9061 | 0.9247 | 0.6371 |
Table 4. Clustering results of different algorithms on the synthetic datasets (NMI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.842 | 0.3543 | 0.3461 | 0.407 |
| Aggregation | 0.8635 | 0.8134 | 0.8271 | 0.8129 |
| Jain | 0.3561 | 0.3332 | 0.3661 | 0.4272 |
| DS1 | 0.816 | 0.4696 | 0.4025 | 0.3105 |
| DS2 | 0.5301 | 0.3430 | 0.3850 | 0.4092 |
| DS3 | 0.9651 | 0.8089 | 0.9123 | 0.5014 |
Table 5. Clustering results of different algorithms on the synthetic datasets (RI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Flame | 0.945 | 0.7155 | 0.7091 | 0.7558 |
| Aggregation | 0.9473 | 0.9192 | 0.9259 | 0.9301 |
| Jain | 0.7832 | 0.6581 | 0.6724 | 0.7795 |
| DS1 | 0.9641 | 0.7641 | 0.7776 | 0.7693 |
| DS2 | 0.9426 | 0.7292 | 0.7649 | 0.7834 |
| DS3 | 0.9913 | 0.9065 | 0.9278 | 0.7383 |
Table 6. Clustering results of different algorithms on the UCI datasets (Accuracy)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7542 | 0.6972 | 0.7032 | 0.7331 |
| Newthyroid | 0.8884 | 0.8233 | 0.8512 | 0.8279 |
| Ionosphere | 0.7322 | 0.7123 | 0.698 | 0.7193 |
| Heart | 0.6037 | 0.5889 | 0.6011 | 0.6235 |
Table 7. Clustering results of different algorithms on the UCI datasets (F-measure)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7559 | 0.6912 | 0.7148 | 0.7265 |
| Newthyroid | 0.8835 | 0.8175 | 0.8507 | 0.8251 |
| Ionosphere | 0.7317 | 0.7177 | 0.7021 | 0.6906 |
| Heart | 0.6102 | 0.5853 | 0.5907 | 0.6305 |
Table 8. Clustering results of different algorithms on the UCI datasets (NMI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.4201 | 0.4197 | 0.4298 | 0.3947 |
| Newthyroid | 0.5308 | 0.3918 | 0.4249 | 0.4067 |
| Ionosphere | 0.1421 | 0.1312 | 0.1008 | 0.1268 |
| Heart | 0.0213 | 0.0176 | 0.0222 | 0.0688 |
Table 9. Clustering results of different algorithms on the UCI datasets (RI)

| Dataset | CABON | K–means | MC_IK | CVCN |
| --- | --- | --- | --- | --- |
| Wine | 0.7223 | 0.7106 | 0.7128 | 0.7183 |
| Newthyroid | 0.8199 | 0.7402 | 0.7867 | 0.7501 |
| Ionosphere | 0.6067 | 0.5889 | 0.5772 | 0.5963 |
| Heart | 0.5806 | 0.5144 | 0.5182 | 0.5330 |
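The tables do not spell out how the four indices are computed. The sketch below shows one common convention, assumed here rather than taken from the paper: clusters are matched to classes with the Hungarian algorithm before Accuracy and the (macro-averaged) F-measure are taken, while NMI and RI are computed directly on the two label partitions; `match_clusters` and the toy arrays are illustrative.

```python
# One plausible way to compute the four reported indices from true labels
# y and cluster labels y_pred (both coded 0..k-1); the Hungarian matching
# and macro-averaged F-measure are assumed conventions, not the paper's.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (confusion_matrix, f1_score,
                             normalized_mutual_info_score, rand_score)

def match_clusters(y, y_pred):
    # Map each cluster to the class that maximizes total agreement.
    cm = confusion_matrix(y, y_pred)
    rows, cols = linear_sum_assignment(-cm)  # maximize matched counts
    mapping = dict(zip(cols, rows))
    return np.array([mapping[c] for c in y_pred])

# Toy example: two imbalanced classes, one boundary point mislabeled.
y      = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

y_map = match_clusters(y, y_pred)
print("Accuracy ", (y_map == y).mean())
print("F-measure", f1_score(y, y_map, average="macro"))
print("NMI      ", normalized_mutual_info_score(y, y_pred))
print("RI       ", rand_score(y, y_pred))  # requires scikit-learn >= 0.24
```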