<span id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
<span id="fpn9h"><noframes id="fpn9h">
<th id="fpn9h"></th>
<strike id="fpn9h"><noframes id="fpn9h"><strike id="fpn9h"></strike>
<th id="fpn9h"><noframes id="fpn9h">
<span id="fpn9h"><video id="fpn9h"></video></span>
<ruby id="fpn9h"></ruby>
<strike id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
  • Indexed by Ei Compendex (the Engineering Index)
  • A Chinese core journal
  • A statistical source journal for Chinese scientific and technical papers
  • A source journal of the Chinese Science Citation Database (CSCD)


基于近鄰的不均衡數據聚類算法

武森 汪玉枝 高曉楠

武森, 汪玉枝, 高曉楠. 基于近鄰的不均衡數據聚類算法[J]. 工程科學學報, 2020, 42(9): 1209-1219. doi: 10.13374/j.issn2095-9389.2019.10.09.003
Citation: WU Sen, WANG Yu-zhi, GAO Xiao-nan. Clustering algorithm for imbalanced data based on nearest neighbor[J]. Chinese Journal of Engineering, 2020, 42(9): 1209-1219. doi: 10.13374/j.issn2095-9389.2019.10.09.003


doi: 10.13374/j.issn2095-9389.2019.10.09.003
Funding: National Natural Science Foundation of China (71971025, 71271027)
Article information
    Corresponding author:

    E-mail: gaoxiaonan0001@163.com

  • CLC number: TP311.13

Clustering algorithm for imbalanced data based on nearest neighbor

  • Abstract: To address the "uniform effect" produced by the classical K–means algorithm when clustering imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. CABON first performs an initial clustering of the data objects and uses a defined undetermined-cluster set to identify those objects whose cluster membership in the initial result still needs to be verified. A dynamic adjustment mechanism for the undetermined-cluster set is then given: following the nearest-neighbor idea, the objects in this set are reassigned, in order from the edge of the set towards its center, to the cluster of their nearest neighbor, which yields the final clustering result and prevents the uniform effect from distorting it. The algorithm is compared experimentally with K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering algorithm for non-uniform data (CVCN) on both synthetic and real data sets. The results show that CABON effectively mitigates the uniform effect produced by K–means on imbalanced data, and its clustering performance is clearly superior to that of K–means, MC_IK, and CVCN.
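As a reading aid, the following is a minimal, runnable Python sketch of the nearest-neighbor reassignment idea described in the abstract. It assumes Euclidean distance and uses scikit-learn's KMeans for the initial clustering; the distance-ratio rule used here to build the undetermined-cluster set, and the toy imbalanced data in the demo, are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def cabon_sketch(X, k, ratio=0.8, random_state=0):
    """Sketch of CABON-style reassignment (undetermined-set rule is a stand-in)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    labels = km.labels_.copy()
    centers = km.cluster_centers_

    # Distances from every object to every initial cluster centre.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)
    # Hypothetical membership rule: an object is "undetermined" when it is
    # almost as close to its second-nearest centre as to its nearest one.
    undetermined = d_sorted[:, 0] / d_sorted[:, 1] > ratio
    determined = ~undetermined
    if not determined.any():          # degenerate case: keep the K-means labels
        return labels

    # Reassign from the edge of the undetermined set inwards: repeatedly take
    # the undetermined object closest to any determined object and give it
    # that nearest determined neighbour's cluster label.
    while undetermined.any():
        U = np.where(undetermined)[0]
        D = np.where(determined)[0]
        dist = np.linalg.norm(X[U][:, None, :] - X[D][None, :, :], axis=2)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        labels[U[i]] = labels[D[j]]
        undetermined[U[i]] = False
        determined[U[i]] = True
    return labels


if __name__ == "__main__":
    # Imbalanced two-cluster toy data (1000 vs 200 objects), similar in
    # spirit to the DS1 distribution listed in Table 1.
    X, y = make_blobs(n_samples=[1000, 200], centers=[[0, 0], [4.5, 0]],
                      cluster_std=[1.3, 0.4], random_state=0)
    print(np.bincount(cabon_sketch(X, k=2)))
```

Reassigning boundary objects to the cluster of their nearest already-determined neighbor, rather than to the nearest centroid, is what lets a small cluster keep objects that K–means would otherwise hand to the large one.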

     

  • 圖  1  二維不均衡數據集的真實分布和K–means的聚類結果。(a)數據集真實分布;(b) K–means的聚類結果

    Figure  1.  Real distributions of two-dimensional unbalanced data sets and clustering results of the K–means algorithm: (a) real distribution of datasets; (b) the clustering result of the K–means algorithm

    圖  2  MC_IK算法中模糊工作集構造規則存在缺陷示意圖。(a)數據集真實分布;(b)MC_IK對數據集初次聚類結果

    Figure  2.  Schematic diagram of the defects in the construction rules of the MC_IK algorithm’s fuzzy working set: (a) real distribution of data sets; (b) initial clustering result of the MC_IK algorithm on datasets

    圖  3  類別待定集的構造過程示意圖。(a)K–means算法對數據集的初始聚類結果;(b)CABON算法構造類別待定集圖示

    Figure  3.  Schematic of the construction process of the undetermined-cluster set: (a) initial clustering result of the K-means algorithm on data sets; (b) undetermined-cluster set constructed by the CABON algorithm

    圖  4  CABON算法固定類別待定集的情況示意圖

    Figure  4.  Schematic of the case in which the CABON algorithm fixes the undetermined-cluster set

    圖  5  類別待定集中數據對象類別重新劃分過程示意圖。(a)邊界對象確定過程;(b)類別待定集調整過程

    Figure  5.  Schematic of the reclassification process of data objects in the undetermined-cluster set: (a) determination of boundary objects; (b) the adjustment process of undetermined-cluster set

    圖  6  人工數據集真實分布。(a) Flame;(b) Aggregation;(c) Jain;(d) DS1;(e) DS2;(f) DS3

    Figure  6.  True distribution of synthetic data sets: (a) Flame; (b) Aggregation; (c) Jain; (d) DS1; (e) DS2; (f) DS3

    圖  7  Flame數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  7.  Graphical representations of clustering results with different algorithms on Flame data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    圖  8  Aggregation數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  8.  Graphical representations of clustering results with different algorithms on Aggregation data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    圖  9  Jain數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  9.  Graphical representations of clustering results with different algorithms on Jain data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    圖  10  DS1數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  10.  Graphical representations of clustering results with different algorithms on DS1 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    圖  11  DS2數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  11.  Graphical representations of clustering results with different algorithms on DS2 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    圖  12  DS3數據集不同算法聚類結果圖示。(a) CABON;(b) K–means;(c) MC_IK;(d) CVCN

    Figure  12.  Graphical representations of clustering results with different algorithms on DS3 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN

    表  1  數據集參數信息

    Table  1.   Parameter information for the data set

    No  Datasets     Data sources  Distribution              Dimension  Class  Instance
    1   Flame        Synthesis     147∶93                    2          2      240
    2   Aggregation  Synthesis     272∶170∶127∶105∶45∶35∶34  2          7      788
    3   Jain         Synthesis     276∶97                    2          2      373
    4   DS1          Synthesis     1000∶200                  2          2      1200
    5   DS2          Synthesis     2823∶529                  2          2      3352
    6   DS3          Synthesis     1000∶500∶400∶200∶200      2          5      2300
    7   Wine         UCI           59∶71∶48                  13         3      178
    8   Newthyroid   UCI           150∶35∶30                 5          3      215
    9   Ionosphere   UCI           225∶126                   34         2      351
    10  Heart        UCI           150∶120                   13         2      270

    表  2  人工數據集不同算法聚類結果(Accuracy指標)

    Table  2.   Clustering results of artificial data set with different algorithms (Accuracy indicators)

    Index     Datasets     CABON   K–means  MC_IK   CVCN
    Accuracy  Flame        0.985   0.8292   0.8242  0.8583
              Aggregation  0.9505  0.9156   0.9364  0.9010
              Jain         0.8801  0.7819   0.7943  0.8729
              DS1          0.9817  0.8628   0.8724  0.866
              DS2          0.9852  0.8422   0.8640  0.8765
              DS3          0.9917  0.9032   0.9212  0.6687

    表  3  人工數據集不同算法聚類結果(F-measure指標)

    Table  3.   Clustering results of artificial data set with different algorithms (F-measure indicators)

    Index      Datasets     CABON   K–means  MC_IK   CVCN
    F-measure  Flame        0.9748  0.8313   0.8264  0.859
               Aggregation  0.9148  0.8498   0.8504  0.8858
               Jain         0.862   0.7953   0.8069  0.876
               DS1          0.982   0.8768   0.8864  0.9074
               DS2          0.9851  0.8574   0.8785  0.8889
               DS3          0.9918  0.9061   0.9247  0.6371

    表  4  人工數據集不同算法聚類結果(NMI指標)

    Table  4.   Clustering results of artificial data set with different algorithms (NMI indicators)

    Index  Datasets     CABON   K–means  MC_IK   CVCN
    NMI    Flame        0.842   0.3543   0.3461  0.407
           Aggregation  0.8635  0.8134   0.8271  0.8129
           Jain         0.3561  0.3332   0.3661  0.4272
           DS1          0.816   0.4696   0.4025  0.3105
           DS2          0.5301  0.3430   0.3850  0.4092
           DS3          0.9651  0.8089   0.9123  0.5014

    表  5  人工數據集不同算法聚類結果(RI指標)

    Table  5.   Clustering results of artificial data set with different algorithms (RI indicators)

    Index  Datasets     CABON   K–means  MC_IK   CVCN
    RI     Flame        0.945   0.7155   0.7091  0.7558
           Aggregation  0.9473  0.9192   0.9259  0.9301
           Jain         0.7832  0.6581   0.6724  0.7795
           DS1          0.9641  0.7641   0.7776  0.7693
           DS2          0.9426  0.7292   0.7649  0.7834
           DS3          0.9913  0.9065   0.9278  0.7383

    表  6  UCI數據集不同算法聚類結果(Accuracy指標)

    Table  6.   Clustering results of UCI dataset with different algorithms (Accuracy indicators)

    Index     Datasets    CABON   K–means  MC_IK   CVCN
    Accuracy  Wine        0.7542  0.6972   0.7032  0.7331
              Newthyroid  0.8884  0.8233   0.8512  0.8279
              Ionosphere  0.7322  0.7123   0.698   0.7193
              Heart       0.6037  0.5889   0.6011  0.6235

    表  7  UCI數據集不同算法聚類結果(F-measure指標)

    Table  7.   Clustering results of UCI dataset with different algorithms (F-measure indicators)

    Index      Datasets    CABON   K–means  MC_IK   CVCN
    F-measure  Wine        0.7559  0.6912   0.7148  0.7265
               Newthyroid  0.8835  0.8175   0.8507  0.8251
               Ionosphere  0.7317  0.7177   0.7021  0.6906
               Heart       0.6102  0.5853   0.5907  0.6305

    表  8  UCI數據集不同算法聚類結果(NMI指標)

    Table  8.   Clustering results of UCI dataset with different algorithms (NMI indicators)

    Index  Datasets    CABON   K–means  MC_IK   CVCN
    NMI    Wine        0.4201  0.4197   0.4298  0.3947
           Newthyroid  0.5308  0.3918   0.4249  0.4067
           Ionosphere  0.1421  0.1312   0.1008  0.1268
           Heart       0.0213  0.0176   0.0222  0.0688

    表  9  UCI數據集不同算法聚類結果(RI指標)

    Table  9.   Clustering results of UCI dataset with different algorithms (RI indicators)

    Index  Datasets    CABON   K–means  MC_IK   CVCN
    RI     Wine        0.7223  0.7106   0.7128  0.7183
           Newthyroid  0.8199  0.7402   0.7867  0.7501
           Ionosphere  0.6067  0.5889   0.5772  0.5963
           Heart       0.5806  0.5144   0.5182  0.5330
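Tables 2–9 report four external validity indices (Accuracy, F-measure, NMI, RI). Purely as an illustration of how such indices can be computed, and not as the authors' evaluation code, the sketch below uses SciPy and scikit-learn to obtain clustering accuracy with optimal cluster-to-class matching, NMI, and the Rand index; the paper's exact conventions (for example its F-measure definition) are not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, rand_score


def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching predicted clusters to true classes."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency matrix: rows = predicted clusters, columns = true classes.
    w = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-w)   # maximise the matched counts
    return w[row, col].sum() / y_true.size


y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 0, 0, 2, 2, 1])
print(clustering_accuracy(y_true, y_pred))           # label-matched accuracy
print(normalized_mutual_info_score(y_true, y_pred))  # NMI
print(rand_score(y_true, y_pred))                    # Rand index (RI)
```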
    <span id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
    <span id="fpn9h"><noframes id="fpn9h">
    <th id="fpn9h"></th>
    <strike id="fpn9h"><noframes id="fpn9h"><strike id="fpn9h"></strike>
    <th id="fpn9h"><noframes id="fpn9h">
    <span id="fpn9h"><video id="fpn9h"></video></span>
    <ruby id="fpn9h"></ruby>
    <strike id="fpn9h"><noframes id="fpn9h"><span id="fpn9h"></span>
    www.77susu.com
  • [1] Wu S, Feng X D, Zhou W J. Spectral clustering of high-dimensional data exploiting sparse representation vectors. Neurocomputing, 2014, 135: 229 doi: 10.1016/j.neucom.2013.12.027
    [2] Wilson J, Chaudhury S, Lall B. Clustering short temporal behaviour sequences for customer segmentation using LDA. Expert Syst, 2018, 35(3): e12250 doi: 10.1111/exsy.12250
    [3] Zhao L B, Shi G Y. A trajectory clustering method based on Douglas-Peucker compression and density for marine traffic pattern recognition. Ocean Eng, 2019, 172: 456 doi: 10.1016/j.oceaneng.2018.12.019
    [4] Al-Shammari A, Zhou R, Naseriparsaa M, et al. An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int J Med Inform, 2019, 126: 176 doi: 10.1016/j.ijmedinf.2019.03.016
    [5] Hu Y, Li H, Chen M. Taxi abnormal trajectory detection based on density clustering. Comput Modernization, 2019(6): 49 doi: 10.3969/j.issn.1006-2475.2019.06.008
    胡圓, 李暉, 陳梅. 基于密度聚類的出租車異常軌跡檢測. 計算機與現代化, 2019(6):49 doi: 10.3969/j.issn.1006-2475.2019.06.008
    [6] Han W H, Huang Z Z, Li S D, et al. Distribution-sensitive unbalanced data oversampling method for medical diagnosis. J Med Syst, 2019, 43(2): 39 doi: 10.1007/s10916-018-1154-8
    [7] Chen L T, Xu G H, Zhang Q, et al. Learning deep representation of imbalanced SCADA data for fault detection of wind turbines. Meas, 2019, 139: 370 doi: 10.1016/j.measurement.2019.03.029
    [8] Xiong H, Wu J J, Chen J. K–means clustering versus validation measures: A data-distribution perspective. IEEE Trans Syst Man Cybern Part B (Cybern), 2009, 39(2): 318 doi: 10.1109/TSMCB.2008.2004559
    [9] Luo Z C, Jin S, Qiu X F. Spectral clustering based oversampling: oversampling taking within class imbalance into consideration. Comput Eng Appl, 2014, 50(11): 120 doi: 10.3778/j.issn.1002-8331.1312-0148
    駱自超, 金隼, 邱雪峰. 考慮類內不平衡的譜聚類過抽樣方法. 計算機工程與應用, 2014, 50(11):120 doi: 10.3778/j.issn.1002-8331.1312-0148
    [10] Kumar N S, Rao K N, Govardhan A, et al. Undersampled K–means approach for handling imbalanced distributed data. Prog Artif Intelligence, 2014, 3(1): 29 doi: 10.1007/s13748-014-0045-6
    [11] Wu S, Liu L, Lu D. Imbalanced data ensemble classification based on cluster-based under-sampling algorithm. Chin J Eng, 2017, 39(8): 1244
    武森, 劉露, 盧丹. 基于聚類欠采樣的集成不均衡數據分類算法. 工程科學學報, 2017, 39(8):1244
    [12] Lin W C, Tsai C F, Hu Y H, et al. Clustering-based undersampling in class-imbalanced data. Inform Sci, 2017, 409-410: 17 doi: 10.1016/j.ins.2017.05.008
    [13] Liang J Y, Bai L, Dang C Y, et al. The K–means–type algorithms versus imbalanced data distributions. IEEE Trans Fuzzy Syst, 2012, 20(4): 728 doi: 10.1109/TFUZZ.2011.2182354
    [14] Qi H. Imbalanced K–means clustering method with multiple centers. J North Univ China Nat Sci, 2015, 36(4): 453
    亓慧. 多中心的非平衡K–均值聚類方法. 中北大學學報(自然科學版), 2015, 36(4):453
    [15] Yang T P, Xu K P, Chen L F. Coefficient of variation clustering algorithm for non-uniform data. J Shandong Univ Eng Sci, 2018, 48(3): 140
    楊天鵬, 徐鯤鵬, 陳黎飛. 非均勻數據的變異系數聚類算法. 山東大學學報: 工學版, 2018, 48(3):140
    [16] Liu H, Hu D M. Research on Chi-square clustering algorithm for unbalanced data. Comput Eng Software, 2019, 40(4): 7 doi: 10.3969/j.issn.1003-6970.2019.04.002
    劉歡, 胡德敏. 類不平衡數據的卡方聚類算法研究. 軟件, 2019, 40(4):7 doi: 10.3969/j.issn.1003-6970.2019.04.002
    [17] Jiang P. The Research of Multi-clusters IB Algorithm for Imbalanced Data Set [Dissertation]. Zhengzhou: Zhengzhou University, 2015
    江鵬. 面向非平衡數據集的多簇IB算法研究[學位論文]. 鄭州: 鄭州大學, 2015
    [18] Bai L. Theoretical Analysis and Effective Algorithms of Cluster Learning [Dissertation]. Taiyuan: Shanxi University, 2012
    白亮. 聚類學習的理論分析與高效算法研究[學位論文]. 太原: 山西大學, 2012
    [19] Gionis A, Mannila H, Tsaparas P. Clustering aggregation. ACM Trans Knowledge Discovery Data, 2007, 1(1): 1 doi: 10.1145/1217299.1217300
    [20] Chen M, Li L J, Wang B, et al. Effectively clustering by finding density backbone based-on kNN. Pattern Recognit, 2016, 60: 486 doi: 10.1016/j.patcog.2016.04.018
    [21] Li T, Geng H W, Su S Z. Density peaks clustering based on density adaptive distance. J Chin Comput Syst, 2017, 38(6): 1347 doi: 10.3969/j.issn.1000-1220.2017.06.032
    李濤, 葛洪偉, 蘇樹智. 基于密度自適應距離的密度峰聚類. 小型微型計算機系統, 2017, 38(6):1347 doi: 10.3969/j.issn.1000-1220.2017.06.032
    [22] Forina M. Wine Data Set [EB/OL]. UCI Machine Learning (1991-07-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Wine
    [23] Quinlan J R. Thyroid Disease Data Set [EB/OL]. UCI Machine Learning (1987-01-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease
    [24] Sigillito V G. Ionosphere Data Set [EB/OL]. UCI Machine Learning (1989-01-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Ionosphere
    [25] Dua D, Graff C. Statlog (Heart) Data Set [EB/OL]. UCI Machine Learning (1993-02-13) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29
    [26] Wu P P. Research on Initial Cluster Centers Choice Algorithm and Clustering for Imbalanced Data [Dissertation]. Taiyuan: Shanxi University, 2015
    武鵬鵬. 初始類中心選擇及在非平衡數據中的聚類研究[學位論文]. 太原: 山西大學, 2015
    [27] Fu L W, Wu S. A new internal clustering validation index for categorical data based on concentration of attribute values. Chin J Eng, 2019, 41(5): 682
    傅立偉, 武森. 基于屬性值集中度的分類數據聚類有效性內部評價指標. 工程科學學報, 2019, 41(5):682
    [28] Hussain S F, Haris M. A K–means based co-clustering (kCC) algorithm for sparse, high dimensional data. Expert Syst Appl, 2019, 118: 20 doi: 10.1016/j.eswa.2018.09.006
    [29] Yeh C C, Yang M S. Evaluation measures for cluster ensembles based on a fuzzy generalized Rand index. Appl Soft Comput, 2017, 57: 225 doi: 10.1016/j.asoc.2017.03.030
    [30] Qannari E M, Courcoux P, Faye P. Significance test of the adjusted Rand index. Application to the free sorting task. Food Qual Preference, 2014, 32: 93 doi: 10.1016/j.foodqual.2013.05.005
Publication history
  • Received:  2019-10-09
  • Published:  2020-09-20
