Differential privacy-preserving random forest algorithm and its application to steel materials
Abstract: Data-driven materials informatics is regarded as the fourth paradigm of materials research and development (R&D): it can greatly reduce R&D costs and shorten the R&D cycle. However, data-driven methods increase the risk of privacy disclosure when materials data are shared and used, exposing sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key issue in materials informatics. Mainstream privacy-protection methods currently include differential privacy, secure multi-party computation, and federated learning. The differential privacy model gives a strict definition of privacy protection and quantitative metrics for evaluating it, and the noise added under differential privacy is independent of the data scale: only a small amount of noise is needed to achieve a high level of protection, which considerably improves data usability. Because the random forest is one of the most widely used models in materials informatics, a novel differential privacy-preserving random forest algorithm (DPRF) is proposed. DPRF introduces the Laplace mechanism and the exponential mechanism of differential privacy into the decision-tree building process. First, the total privacy budget of the DPRF algorithm is set and divided equally among the decision trees. During tree building, the splitting feature at each node is selected randomly by the exponential mechanism, and noise is added to the node sample counts by the Laplace mechanism, which achieves differential privacy protection for the random forest. Experiments on steel fatigue property prediction verify the effectiveness of DPRF under both centralized and distributed data storage. Under different privacy budgets, the coefficient of determination R2 of the DPRF predictions can reach more than 0.8 for each target property after differential privacy is added, which differs little from the result of the ordinary random forest. In the distributed data storage scenario, the prediction R2 of each target property gradually increases as the privacy budget increases. A comparison of different maximum tree depths in DPRF shows that the overall prediction R2 first increases and then decreases as the maximum depth increases, and the best prediction accuracy is obtained when the maximum depth is set to 5. In summary, DPRF maintains good prediction accuracy while achieving differential privacy protection of the random forest; in particular, in a distributed environment with decentrally stored data, DPRF can balance privacy-protection strength and prediction accuracy through parameters such as the privacy budget and tree depth, which indicates broad application prospects.
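The two differential privacy primitives named above can be illustrated with a minimal NumPy sketch; the function names and example values below are illustrative only and are not taken from the paper. The Laplace mechanism adds noise with scale sensitivity/ε to a numeric query, and the exponential mechanism samples a candidate output with probability proportional to exp(εq/(2Δq)).

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Perturb a numeric query result with Laplace noise of scale sensitivity/epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng):
    """Sample one candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    u = np.asarray(utilities, dtype=float)
    # Subtract the maximum utility before exponentiating for numerical stability.
    weights = np.exp(epsilon * (u - u.max()) / (2.0 * sensitivity))
    probabilities = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probabilities)]

rng = np.random.default_rng(0)
# Hypothetical example: privatize a record count, then pick a split feature by utility score.
noisy_count = laplace_mechanism(120, sensitivity=1.0, epsilon=1.0, rng=rng)
chosen_feature = exponential_mechanism(["C", "Mn", "TT"], utilities=[0.7, 0.4, 0.9],
                                       sensitivity=1.0, epsilon=1.0, rng=rng)
```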
Table 1. Comparative analysis of differential privacy-preserving tree-model algorithms

Algorithm | Basic model | Realization mechanism | Task | Data storage
SuLQ-based ID3 | Decision tree | Laplace | Classification | Centralized
DiffP-ID3 | Decision tree | Laplace & Exponential | Classification | Centralized
DiffP-C4.5 | Decision tree | Laplace & Exponential | Classification | Centralized
DiffPRF | Random forest | Laplace & Exponential | Classification | Centralized
DiffPRFs | Random forest | Laplace & Exponential | Classification | Centralized
DPRF | Random forest | Laplace & Exponential | Regression | Centralized & distributed

Algorithm 1. The DPRF algorithm based on differential privacy protection
Input: training dataset D, feature set F, privacy budget B, number of decision trees T, maximum tree depth d, number of randomly selected features m at each split, and number of nodes N in the distributed data setting;
Output: a random forest satisfying ε-differential privacy;
Stopping condition: the number of decision trees in the forest reaches T, or the privacy budget is exhausted;
Procedure DPRF_fit (D,F,B,T,d,m)
1: Forest={};
2: Divide the total privacy budget equally among the trees, so that each decision tree receives a privacy budget $\varepsilon' = B/T$;
3: for i = 1 to T // build T trees in turn
4: Draw a bootstrap sample Dt from the dataset D (sampling with replacement) and randomly select m features from the feature set F;
5: Allocate the tree's privacy budget evenly across its levels, with each level receiving $\varepsilon'' = \dfrac{\varepsilon'}{d + 1}$;
6: ε = ε''/2; // half of the level budget is allocated to each differential privacy mechanism
7: Treei = BuildTree(Dt, m, ε, d, 0); // the tree-building process is as follows
8: if the current node satisfies the stopping condition, set it as a leaf node whose value is the mean of the target values of all samples in the leaf, perturb its sample count as |NDt| = |NDt| + Laplace(1/ε), and return the leaf node;
9: else
10: for each candidate feature f among the m selected features
11: Split the data into left and right subsets on the values of the current feature, and record the value that minimizes the mean absolute error (MAE) as the feature's split_value;
12: Split the dataset on the feature's split_value and compute the feature score $\exp\left(\dfrac{\varepsilon}{2\Delta q}\,q(D_{\mathrm{C}}, f)\right)$;
13: Sum the feature scores of the m features; any feature f is chosen as the splitting feature of the current node with probability $\dfrac{\exp\left(\dfrac{\varepsilon}{2\Delta q}q(D_{\mathrm{C}}, f)\right)}{\sum_{1}^{m}\exp\left(\dfrac{\varepsilon}{2\Delta q}q(D_{\mathrm{C}}, f)\right)}$, where $q(D_{\mathrm{C}}, f)$ is the utility function and $\Delta q$ is its sensitivity;
14: Split the data into left and right subsets according to the split_value of the selected feature f, and continue building the tree on both subsets;
15: Forest = Forest ∪ Treei;
16: end for
17: return Forest

Procedure predict(Forest, Dtest)
1: Result={};
2: for d in Dtest
3: sum_predict=0;
4: for tree in Forest
5: Traverse the current tree to a leaf node and obtain the prediction predict_value;
6: sum_predict+=predict_value;
7: res=sum_predict/length(Forest);
8: Result=Result∪res;
9: return Result

Procedure Distributed_fit(F, B, T, d, m)
1: Forest_Distributed ={};
2: Divide the total privacy budget equally among the N nodes; each node receives a privacy budget E = B/N;
3: for i = 1 to N
4: Let Di denote the local dataset of node i;
5: foresti = DPRF_fit(Di, F, E, T, d, m);
6: Forest_Distributed = Forest_Distributed ∪ foresti;
7: return Forest_Distributed

Procedure Distributed_Predict(D, Forest_Distributed)
1: Result=0;
2: for i = 1 to N
3: r = predict(Forest_Distributedi, D); // prediction of the forest trained on node i
4: Result+=r;
5: Result = Result/N;
6: return Result
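To make the pseudocode above concrete, the following Python sketch (NumPy only) implements the budget allocation ε' = B/T, ε'' = ε'/(d + 1), ε = ε''/2, exponential-mechanism split selection, and Laplace perturbation of the leaf sample counts. It is a simplified illustration rather than the authors' implementation: the class and function names are invented, each candidate feature is given a single median threshold, the utility function is the negative MAE with an assumed sensitivity Δq = 1, and X and y are assumed to be NumPy arrays.

```python
import numpy as np


class DPRegressionTree:
    """Simplified differentially private regression tree (illustrative sketch)."""

    def __init__(self, eps_level, max_depth, n_candidate_features, rng):
        self.eps = eps_level            # per-level budget, already halved (epsilon = epsilon''/2)
        self.max_depth = max_depth
        self.m = n_candidate_features
        self.rng = rng
        self.root = None

    def fit(self, X, y):
        self.root = self._build(X, y, depth=0)
        return self

    def _build(self, X, y, depth):
        # Step 8: stopping condition -> leaf with mean target and Laplace-perturbed count.
        if depth >= self.max_depth or len(y) <= 2:
            noisy_count = len(y) + self.rng.laplace(0.0, 1.0 / self.eps)
            return {"leaf": True, "value": float(np.mean(y)), "noisy_count": noisy_count}

        # Steps 10-11: pick m candidate features and one threshold (here the median) per feature.
        features = self.rng.choice(X.shape[1], size=min(self.m, X.shape[1]), replace=False)
        utilities, thresholds = [], []
        for f in features:
            thr = float(np.median(X[:, f]))
            left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
            if len(left) == 0 or len(right) == 0:
                utilities.append(-np.inf)            # degenerate split, never useful
            else:
                mae = (np.abs(left - left.mean()).sum() +
                       np.abs(right - right.mean()).sum()) / len(y)
                utilities.append(-mae)               # higher utility = lower MAE
            thresholds.append(thr)

        # Steps 12-13: exponential mechanism, P(f) proportional to exp(eps * q / (2 * dq)), dq assumed 1.
        u = np.array(utilities)
        if not np.isfinite(u).any():                 # no usable split -> fall back to a leaf
            return self._build(X, y, self.max_depth)
        weights = np.exp(self.eps * (u - u.max()) / 2.0)   # exp(-inf) = 0 drops degenerate splits
        idx = self.rng.choice(len(features), p=weights / weights.sum())
        f, thr = int(features[idx]), thresholds[idx]

        # Step 14: split on the chosen feature and recurse on both subsets.
        mask = X[:, f] <= thr
        return {"leaf": False, "feature": f, "threshold": thr,
                "left": self._build(X[mask], y[mask], depth + 1),
                "right": self._build(X[~mask], y[~mask], depth + 1)}

    def predict_one(self, x):
        node = self.root
        while not node["leaf"]:
            node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return node["value"]


class DPRandomForest:
    """DPRF-style forest: budget B split over T trees, then over d + 1 levels, then halved."""

    def __init__(self, budget_B, n_trees_T, max_depth_d, m_features, seed=0):
        eps_tree = budget_B / n_trees_T              # epsilon'  = B / T           (step 2)
        eps_level = eps_tree / (max_depth_d + 1)     # epsilon'' = epsilon'/(d + 1) (step 5)
        self.eps = eps_level / 2.0                   # epsilon   = epsilon''/2      (step 6)
        self.T, self.d, self.m = n_trees_T, max_depth_d, m_features
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.T):                      # step 3: build T trees
            idx = self.rng.integers(0, n, size=n)    # step 4: bootstrap sample with replacement
            self.trees.append(
                DPRegressionTree(self.eps, self.d, self.m, self.rng).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Average the predictions of all trees, as in the predict procedure.
        return np.array([np.mean([t.predict_one(x) for t in self.trees]) for x in X])
```

Under these assumptions, `DPRandomForest(budget_B, n_trees_T, max_depth_d, m_features).fit(X, y)` mirrors steps 1–7 of DPRF_fit, the recursive `_build` corresponds to steps 8–14, and `predict` averages the per-tree outputs as in the predict procedure. The distributed procedures would simply run `fit` on each node's local data with budget E = B/N and average the per-node predictions. Note that, as in the pseudocode, only the leaf counts are perturbed here; how the leaf values themselves are protected is one of several choices a full implementation would need to pin down.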
Table 2. Descriptor information of the NIMS steel fatigue dataset
Feature | Description | Minimum value | Maximum value | Mean value | Standard deviation
NT | Normalizing temperature | 825 | 900 | 865.6 | 17.37
QT | Hardening temperature | 825 | 865 | 846.2 | 9.86
TT | Tempering temperature | 550 | 680 | 605 | 42.4
C | Carbon content | 0.28 | 0.57 | 0.407 | 0.061
Si | Silicon content | 0.16 | 0.35 | 0.258 | 0.034
Mn | Manganese content | 0.37 | 1.3 | 0.849 | 0.294
P | Phosphorus content | 0.007 | 0.031 | 0.016 | 0.005
S | Sulfur content | 0.003 | 0.03 | 0.014 | 0.006
Ni | Nickel content | 0.01 | 2.78 | 0.548 | 0.899
Cr | Chromium content | 0.01 | 1.12 | 0.556 | 0.419
Cu | Copper content | 0.01 | 0.22 | 0.064 | 0.045
Mo | Molybdenum content | 0 | 0.24 | 0.066 | 0.089
RR | Reduction ratio | 420 | 5530 | 971.2 | 601.4
dA | Plastic inclusion | 0 | 0.13 | 0.047 | 0.032
dB | Discontinuous inclusions | 0 | 0.05 | 0.003 | 0.009
dC | Isolated inclusion | 0 | 0.04 | 0.008 | 0.01
Table 3. Predictive results of target properties with random forest and DPRF
Model (privacy budget) | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
RF | 0.9059 | 0.9282 | 0.9252 | 0.9193
DPRF (ε=0.1) | 0.6588 | 0.6469 | 0.7588 | 0.6565
DPRF (ε=0.25) | 0.6930 | 0.6906 | 0.7721 | 0.7008
DPRF (ε=0.5) | 0.7704 | 0.7605 | 0.7918 | 0.7593
DPRF (ε=1.0) | 0.8035 | 0.8105 | 0.8219 | 0.8094
DPRF (ε=3.0) | 0.8249 | 0.8270 | 0.8461 | 0.8399
DPRF (ε=10.0) | 0.8527 | 0.8462 | 0.8852 | 0.8641
Table 4. Predictive results of target properties under different privacy budgets
ε | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
0.3 | 0.6153 | 0.6030 | 0.6979 | 0.6139
0.75 | 0.6563 | 0.6748 | 0.7585 | 0.6502
1.5 | 0.7038 | 0.7448 | 0.8082 | 0.7308
2.25 | 0.7615 | 0.7773 | 0.8377 | 0.7618
3.0 | 0.7981 | 0.8025 | 0.8491 | 0.8017
9.0 | 0.8130 | 0.8380 | 0.8677 | 0.8429
Table 5. Predictive results of each target property under different tree depths
d | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
3 | 0.6027 | 0.6113 | 0.6796 | 0.6387
4 | 0.7088 | 0.7061 | 0.7951 | 0.7183
5 | 0.7961 | 0.8025 | 0.8491 | 0.8017
6 | 0.7560 | 0.7605 | 0.8568 | 0.7659
7 | 0.6920 | 0.7427 | 0.8251 | 0.7303
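As a usage illustration only, the R2 comparisons reported in Tables 3–5 could be reproduced in outline with a loop of the following form, reusing the DPRandomForest sketch above. The data loading, train/test split ratio, and hyperparameter values shown here are assumptions rather than the paper's exact experimental protocol; `r2_score` and `train_test_split` are standard scikit-learn utilities.

```python
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_dprf(X, targets, budgets=(0.1, 0.25, 0.5, 1.0, 3.0, 10.0),
                  n_trees=20, max_depth=5, m_features=4, test_size=0.2, seed=0):
    """Return {(property, epsilon): R2} for a DPRF model trained per target and budget.

    X is the (n_samples, n_features) descriptor matrix of Table 2; targets maps a
    property name (e.g. "Fatigue") to its (n_samples,) target vector.
    """
    results = {}
    for name, y in targets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        for eps in budgets:
            model = DPRandomForest(budget_B=eps, n_trees_T=n_trees, max_depth_d=max_depth,
                                   m_features=m_features, seed=seed).fit(X_tr, y_tr)
            results[(name, eps)] = r2_score(y_te, model.predict(X_te))
    return results
```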