Differential privacy-preserving random forest algorithm and its application to steel materials
Abstract: Data-driven materials informatics is regarded as the fourth paradigm of materials research and development (R&D): it can greatly reduce R&D costs and shorten the R&D cycle. However, data-driven methods increase the risk of privacy disclosure when materials data are shared and used, exposing sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key issue in materials informatics. Mainstream privacy-protection methods currently include differential privacy, secure multi-party computation, and federated learning. The differential privacy model gives a strict definition of privacy protection and quantitative metrics for evaluating it, and the noise added under differential privacy is independent of the data scale: only a small amount of noise is needed to achieve a high level of protection, which considerably improves data usability. Because the random forest is one of the most widely used models in materials informatics, a novel differential privacy-preserving random forest algorithm (DPRF) is proposed. DPRF introduces the Laplace mechanism and the exponential mechanism of differential privacy into the decision-tree building process. First, the total privacy budget of the DPRF algorithm is set and divided equally among the decision trees. During tree building, the splitting feature at each node is selected randomly by the exponential mechanism, and noise is added to the node sample counts by the Laplace mechanism, which achieves differential privacy protection for the random forest. Experiments on steel fatigue property prediction verify the effectiveness of DPRF under both centralized and distributed data storage. Under different privacy budgets, the coefficient of determination R2 of the DPRF predictions can reach more than 0.8 for each target property after differential privacy is added, which differs little from the result of the ordinary random forest. In the distributed data storage scenario, the prediction R2 of each target property gradually increases as the privacy budget increases. A comparison of different maximum tree depths in DPRF shows that the overall prediction R2 first increases and then decreases as the maximum depth increases, and the best prediction accuracy is obtained when the maximum depth is set to 5. In summary, DPRF maintains good prediction accuracy while achieving differential privacy protection of the random forest; in particular, in a distributed environment with decentrally stored data, DPRF can balance privacy-protection strength and prediction accuracy through parameters such as the privacy budget and tree depth, which indicates broad application prospects.
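The two differential privacy primitives named above can be illustrated with a minimal NumPy sketch; the function names and example values below are illustrative only and are not taken from the paper. The Laplace mechanism adds noise with scale sensitivity/ε to a numeric query, and the exponential mechanism samples a candidate output with probability proportional to exp(εq/(2Δq)).

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Perturb a numeric query result with Laplace noise of scale sensitivity/epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng):
    """Sample one candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    u = np.asarray(utilities, dtype=float)
    # Subtract the maximum utility before exponentiating for numerical stability.
    weights = np.exp(epsilon * (u - u.max()) / (2.0 * sensitivity))
    probabilities = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probabilities)]

rng = np.random.default_rng(0)
# Hypothetical example: privatize a record count, then pick a split feature by utility score.
noisy_count = laplace_mechanism(120, sensitivity=1.0, epsilon=1.0, rng=rng)
chosen_feature = exponential_mechanism(["C", "Mn", "TT"], utilities=[0.7, 0.4, 0.9],
                                       sensitivity=1.0, epsilon=1.0, rng=rng)
```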
Table 1. Comparative analysis of differential privacy-preserving tree-model algorithms

Algorithm | Basic model | Realization mechanism | Task | Data storage
SuLQ-based ID3 | Decision tree | Laplace | Classification | Centralized
DiffP-ID3 | Decision tree | Laplace & Exponential | Classification | Centralized
DiffP-C4.5 | Decision tree | Laplace & Exponential | Classification | Centralized
DiffPRF | Random forest | Laplace & Exponential | Classification | Centralized
DiffPRFs | Random forest | Laplace & Exponential | Classification | Centralized
DPRF | Random forest | Laplace & Exponential | Regression | Centralized & distributed

Algorithm 1. The DPRF algorithm based on differential privacy protection
Input: training dataset D, feature set F, privacy budget B, number of decision trees T, maximum tree depth d, number of randomly selected features m at each split, and number of nodes N in the distributed data setting;
Output: a random forest satisfying ε-differential privacy;
Stopping condition: the number of decision trees in the forest reaches T, or the privacy budget is exhausted;
Procedure DPRF_fit (D,F,B,T,d,m)
1: Forest={};
2: Divide the total privacy budget equally among the trees, so that each decision tree receives a privacy budget $\varepsilon' = B/T$;
3: for i = 1 to T // build T trees in turn
4: Draw a bootstrap sample Dt from the dataset D (sampling with replacement) and randomly select m features from the feature set F;
5: Allocate the tree's privacy budget evenly across its levels, with each level receiving $\varepsilon'' = \dfrac{\varepsilon'}{d + 1}$;
6: ε = ε''/2; // half of the level budget is allocated to each differential privacy mechanism
7: Treei = BuildTree(Dt, m, ε, d, 0); // the tree-building process is as follows
8: if the current node satisfies the stopping condition, set it as a leaf node whose value is the mean of the target values of all samples in the leaf, perturb its sample count as |NDt| = |NDt| + Laplace(1/ε), and return the leaf node;
9: else
10: for each candidate feature f among the m selected features
11: Split the data into left and right subsets on the values of the current feature, and record the value that minimizes the mean absolute error (MAE) as the feature's split_value;
12: Split the dataset on the feature's split_value and compute the feature score $\exp\left(\dfrac{\varepsilon}{2\Delta q}\,q(D_{\mathrm{C}}, f)\right)$;
13: Sum the feature scores of the m features; any feature f is chosen as the splitting feature of the current node with probability $\dfrac{\exp\left(\dfrac{\varepsilon}{2\Delta q}q(D_{\mathrm{C}}, f)\right)}{\sum_{1}^{m}\exp\left(\dfrac{\varepsilon}{2\Delta q}q(D_{\mathrm{C}}, f)\right)}$, where $q(D_{\mathrm{C}}, f)$ is the utility function and $\Delta q$ is its sensitivity;
14: Split the data into left and right subsets according to the split_value of the selected feature f, and continue building the tree on both subsets;
15: Forest = Forest ∪ Treei;
16: end for
17: return Forest

Procedure predict(Forest, Dtest)
1: Result={};
2: for d in Dtest
3: sum_predict=0;
4: for tree in Forest
5: Traverse the current tree to a leaf node and obtain the prediction predict_value;
6: sum_predict+=predict_value;
7: res=sum_predict/length(Forest);
8: Result=Result∪res;
9: return Result

Procedure Distributed_fit(F, B, T, d, m)
1: Forest_Distributed ={};
2: Divide the total privacy budget equally among the N nodes; each node receives a privacy budget E = B/N;
3: for i = 1 to N
4: Let Di denote the local dataset of node i;
5: foresti = DPRF_fit(Di, F, E, T, d, m);
6: Forest_Distributed = Forest_Distributed ∪ foresti;
7: return Forest_Distributed

Procedure Distributed_Predict(D, Forest_Distributed)
1: Result=0;
2: for i = 1 to N
3: r = predict(Forest_Distributedi, D); // prediction of the forest trained on node i
4: Result+=r;
5: Result = Result/N;
6: return Result
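To make the pseudocode above concrete, the following Python sketch (NumPy only) implements the budget allocation ε' = B/T, ε'' = ε'/(d + 1), ε = ε''/2, exponential-mechanism split selection, and Laplace perturbation of the leaf sample counts. It is a simplified illustration rather than the authors' implementation: the class and function names are invented, each candidate feature is given a single median threshold, the utility function is the negative MAE with an assumed sensitivity Δq = 1, and X and y are assumed to be NumPy arrays.

```python
import numpy as np


class DPRegressionTree:
    """Simplified differentially private regression tree (illustrative sketch)."""

    def __init__(self, eps_level, max_depth, n_candidate_features, rng):
        self.eps = eps_level            # per-level budget, already halved (epsilon = epsilon''/2)
        self.max_depth = max_depth
        self.m = n_candidate_features
        self.rng = rng
        self.root = None

    def fit(self, X, y):
        self.root = self._build(X, y, depth=0)
        return self

    def _build(self, X, y, depth):
        # Step 8: stopping condition -> leaf with mean target and Laplace-perturbed count.
        if depth >= self.max_depth or len(y) <= 2:
            noisy_count = len(y) + self.rng.laplace(0.0, 1.0 / self.eps)
            return {"leaf": True, "value": float(np.mean(y)), "noisy_count": noisy_count}

        # Steps 10-11: pick m candidate features and one threshold (here the median) per feature.
        features = self.rng.choice(X.shape[1], size=min(self.m, X.shape[1]), replace=False)
        utilities, thresholds = [], []
        for f in features:
            thr = float(np.median(X[:, f]))
            left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
            if len(left) == 0 or len(right) == 0:
                utilities.append(-np.inf)            # degenerate split, never useful
            else:
                mae = (np.abs(left - left.mean()).sum() +
                       np.abs(right - right.mean()).sum()) / len(y)
                utilities.append(-mae)               # higher utility = lower MAE
            thresholds.append(thr)

        # Steps 12-13: exponential mechanism, P(f) proportional to exp(eps * q / (2 * dq)), dq assumed 1.
        u = np.array(utilities)
        if not np.isfinite(u).any():                 # no usable split -> fall back to a leaf
            return self._build(X, y, self.max_depth)
        weights = np.exp(self.eps * (u - u.max()) / 2.0)   # exp(-inf) = 0 drops degenerate splits
        idx = self.rng.choice(len(features), p=weights / weights.sum())
        f, thr = int(features[idx]), thresholds[idx]

        # Step 14: split on the chosen feature and recurse on both subsets.
        mask = X[:, f] <= thr
        return {"leaf": False, "feature": f, "threshold": thr,
                "left": self._build(X[mask], y[mask], depth + 1),
                "right": self._build(X[~mask], y[~mask], depth + 1)}

    def predict_one(self, x):
        node = self.root
        while not node["leaf"]:
            node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return node["value"]


class DPRandomForest:
    """DPRF-style forest: budget B split over T trees, then over d + 1 levels, then halved."""

    def __init__(self, budget_B, n_trees_T, max_depth_d, m_features, seed=0):
        eps_tree = budget_B / n_trees_T              # epsilon'  = B / T           (step 2)
        eps_level = eps_tree / (max_depth_d + 1)     # epsilon'' = epsilon'/(d + 1) (step 5)
        self.eps = eps_level / 2.0                   # epsilon   = epsilon''/2      (step 6)
        self.T, self.d, self.m = n_trees_T, max_depth_d, m_features
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.T):                      # step 3: build T trees
            idx = self.rng.integers(0, n, size=n)    # step 4: bootstrap sample with replacement
            self.trees.append(
                DPRegressionTree(self.eps, self.d, self.m, self.rng).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Average the predictions of all trees, as in the predict procedure.
        return np.array([np.mean([t.predict_one(x) for t in self.trees]) for x in X])
```

Under these assumptions, `DPRandomForest(budget_B, n_trees_T, max_depth_d, m_features).fit(X, y)` mirrors steps 1–7 of DPRF_fit, the recursive `_build` corresponds to steps 8–14, and `predict` averages the per-tree outputs as in the predict procedure. The distributed procedures would simply run `fit` on each node's local data with budget E = B/N and average the per-node predictions. Note that, as in the pseudocode, only the leaf counts are perturbed here; how the leaf values themselves are protected is one of several choices a full implementation would need to pin down.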
Table 2. Descriptor information of the NIMS steel fatigue dataset
Feature | Description | Minimum value | Maximum value | Mean value | Standard deviation
NT | Normalizing temperature | 825 | 900 | 865.6 | 17.37
QT | Hardening temperature | 825 | 865 | 846.2 | 9.86
TT | Tempering temperature | 550 | 680 | 605 | 42.4
C | Carbon content | 0.28 | 0.57 | 0.407 | 0.061
Si | Silicon content | 0.16 | 0.35 | 0.258 | 0.034
Mn | Manganese content | 0.37 | 1.3 | 0.849 | 0.294
P | Phosphorus content | 0.007 | 0.031 | 0.016 | 0.005
S | Sulfur content | 0.003 | 0.03 | 0.014 | 0.006
Ni | Nickel content | 0.01 | 2.78 | 0.548 | 0.899
Cr | Chromium content | 0.01 | 1.12 | 0.556 | 0.419
Cu | Copper content | 0.01 | 0.22 | 0.064 | 0.045
Mo | Molybdenum content | 0 | 0.24 | 0.066 | 0.089
RR | Reduction ratio | 420 | 5530 | 971.2 | 601.4
dA | Plastic inclusion | 0 | 0.13 | 0.047 | 0.032
dB | Discontinuous inclusions | 0 | 0.05 | 0.003 | 0.009
dC | Isolated inclusion | 0 | 0.04 | 0.008 | 0.01
Table 3. Predictive results of target properties with random forest and DPRF
Model (privacy budget) | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
RF | 0.9059 | 0.9282 | 0.9252 | 0.9193
DPRF (ε=0.1) | 0.6588 | 0.6469 | 0.7588 | 0.6565
DPRF (ε=0.25) | 0.6930 | 0.6906 | 0.7721 | 0.7008
DPRF (ε=0.5) | 0.7704 | 0.7605 | 0.7918 | 0.7593
DPRF (ε=1.0) | 0.8035 | 0.8105 | 0.8219 | 0.8094
DPRF (ε=3.0) | 0.8249 | 0.8270 | 0.8461 | 0.8399
DPRF (ε=10.0) | 0.8527 | 0.8462 | 0.8852 | 0.8641
Table 4. Predictive results of target properties under different privacy budgets
ε | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
0.3 | 0.6153 | 0.6030 | 0.6979 | 0.6139
0.75 | 0.6563 | 0.6748 | 0.7585 | 0.6502
1.5 | 0.7038 | 0.7448 | 0.8082 | 0.7308
2.25 | 0.7615 | 0.7773 | 0.8377 | 0.7618
3.0 | 0.7981 | 0.8025 | 0.8491 | 0.8017
9.0 | 0.8130 | 0.8380 | 0.8677 | 0.8429
Table 5. Predictive results of each target property under different tree depths
d | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
3 | 0.6027 | 0.6113 | 0.6796 | 0.6387
4 | 0.7088 | 0.7061 | 0.7951 | 0.7183
5 | 0.7961 | 0.8025 | 0.8491 | 0.8017
6 | 0.7560 | 0.7605 | 0.8568 | 0.7659
7 | 0.6920 | 0.7427 | 0.8251 | 0.7303
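As a usage illustration only, the R2 comparisons reported in Tables 3–5 could be reproduced in outline with a loop of the following form, reusing the DPRandomForest sketch above. The data loading, train/test split ratio, and hyperparameter values shown here are assumptions rather than the paper's exact experimental protocol; `r2_score` and `train_test_split` are standard scikit-learn utilities.

```python
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_dprf(X, targets, budgets=(0.1, 0.25, 0.5, 1.0, 3.0, 10.0),
                  n_trees=20, max_depth=5, m_features=4, test_size=0.2, seed=0):
    """Return {(property, epsilon): R2} for a DPRF model trained per target and budget.

    X is the (n_samples, n_features) descriptor matrix of Table 2; targets maps a
    property name (e.g. "Fatigue") to its (n_samples,) target vector.
    """
    results = {}
    for name, y in targets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        for eps in budgets:
            model = DPRandomForest(budget_B=eps, n_trees_T=n_trees, max_depth_d=max_depth,
                                   m_features=m_features, seed=seed).fit(X_tr, y_tr)
            results[(name, eps)] = r2_score(y_te, model.predict(X_te))
    return results
```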