Active regression learning method for material data

ZHANG Han; QIAN Quan; WU Xing

doi:10.13374/j.issn2095-9389.2022.05.03.004


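Funding: National Key Research and Development Program of China (2022YFB3707800); Major Science and Technology Special Project of Yunnan Province (202102AB080019-3, 202002AB080001-2); Scientific Research Project of Zhejiang Laboratory (2021PE0AC02); Major Project of the Special Development Fund of the Shanghai Zhangjiang National Independent Innovation Demonstration Zone (ZJ2021-ZD-006)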

More Information

Corresponding author: E-mail: xingwu@shu.edu.cn

CLC number: TG142.71
Publication history
- Received: 2022-05-03
- Published online: 2022-09-19
- Issue date: 2023-07-25

Active regression learning method for material data

ZHANG Han¹,
QIAN Quan^{1, 2},
WU Xing^{1, 2}

1.
School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2.
Zhejiang Laboratory, Hangzhou 311100, China


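Abstract

Abstract (translated from the Chinese): Differences in production environments and measurement conditions make the material data used for machine learning noisy. Labeling material data requires domain knowledge and specialized skills, so the labeling cost is also relatively high. Together, these two factors pose a major challenge to applying machine learning in the materials field. To address this challenge, an active regression learning method is proposed that consists of an outlier detection module, a greedy sampling module, and a minimum change sampling module. Compared with other active learning methods, it integrates an outlier detection mechanism, so it selects high-quality samples while effectively excluding the influence of noisy data and avoiding sunk costs. Comparative experiments against the latest active regression learning methods were carried out on a public dataset and a non-public dataset. The results show that, with the same amount of data, the performance index of the task model trained with the proposed method is 15% higher on average than that of the other models, and that only 30%–40% of the data is needed as a training set to match or even exceed the accuracy of a task model trained on all of the data.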
Abstract: To date, artificial intelligence has been successfully applied in various fields of material science, but these applications require a large amount of high-quality data. In practice, many unlabeled data points but only a few labeled data points can be obtained directly, because data annotation requires delicate and expensive experiments whose cost in time and money cannot be ignored. Active learning selects a few high-quality samples from many unlabeled data points for labeling, so that task model performance is optimized at the lowest possible labeling cost. However, active learning methods suitable for material property regression have received little study, and general active learning methods cannot easily avoid the negative effects of noisy data, which results in sunk costs. Therefore, we propose a new active regression learning method with the following components: (1) outlier detection module: the predictions of a task model fitted on the labeled data, together with the labeled dataset itself, are used to train an auxiliary classification model that classifies outliers, and the samples in the unlabeled dataset that are most likely to be outliers are then excluded; (2) greedy sampling: an iterative procedure selects the data point that lies farthest, in the geometric space, from both the labeled dataset and the data already selected, so that sample diversity is fully considered; and (3) minimum change sampling: the unlabeled data showing the minimum change before and after the task model is trained on the labeled dataset are selected, because such data are relatively underrepresented in the feature space of the labeled dataset. We performed experiments on the concrete slump test dataset and the negative coefficient of thermal expansion dataset and compared our method with the latest active regression learning methods.
The results show that, on noisy datasets, the other methods do not necessarily improve task model performance after data are labeled in each active learning cycle, and their final performance cannot reach that of a task model trained on all of the data. With the same amount of data, the performance index of the task model trained by our method is 15% higher on average than that of the other models. Because it includes an outlier detection mechanism, our method can effectively avoid sampling outliers when selecting high-quality samples. A task model trained on only 30%–40% of the data can match or even exceed the accuracy of a task model trained on all of the data.
Keywords: active learning; material; outlier detection; regression; high-quality samples

Figure 1. Data flow diagram of active regression learning for material data

Figure 2. Performance of four algorithms on the concrete slump test dataset

Figure 3. Performance of four algorithms on the negative thermal expansion material dataset

Algorithm 1
Input: labeled dataset $ {\boldsymbol{D}}_{\mathrm{l}} $; unlabeled dataset $ {\boldsymbol{D}}_{\mathrm{u}} $; number of samples to select $ K $
Output: sampling result $ {\boldsymbol{D}}_{K} $, with $ \left|{\boldsymbol{D}}_{K}\right|=K $
Set $ {\boldsymbol{D}}_{K}=\varnothing $
For $ k=1, \dots, K $ do
    Calculate $ {d}_{nm}^{x} $ using (2), where $ \boldsymbol{R}={\boldsymbol{D}}_{\mathrm{u}}-{\boldsymbol{D}}_{K} $ and $ \boldsymbol{S}={\boldsymbol{D}}_{\mathrm{l}}\cup {\boldsymbol{D}}_{K} $
    Calculate $ {d}_{n}^{x} $ using (3)
    Choose $ x=\mathrm{arg\,max}_{x}\,{d}_{n}^{x} $
    Update $ {\boldsymbol{D}}_{K}={\boldsymbol{D}}_{K}\cup \left\{x\right\} $
End
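
The loop in Algorithm 1 can be sketched in a few lines of Python. Equations (2) and (3) are not reproduced on this page, so the sketch assumes that (2) is the Euclidean distance between feature vectors and that (3) takes, for each candidate in R, the minimum distance to any point in S. The function name `greedy_sampling` and the toy data are illustrative and do not come from the paper.

```python
import math


def greedy_sampling(labeled, unlabeled, k):
    """Pick k points from `unlabeled`, each time taking the candidate that is
    farthest from its nearest labeled or already-selected point.

    Assumptions (not confirmed by the source): distances are Euclidean and
    the per-candidate score is the minimum distance to the reference set.
    """
    if not labeled:
        raise ValueError("at least one labeled sample is required")
    if k > len(unlabeled):
        raise ValueError("k exceeds the number of unlabeled samples")

    selected = []                # D_K, initially empty
    remaining = list(unlabeled)  # R = D_u - D_K

    for _ in range(k):
        reference = list(labeled) + selected  # S = D_l united with D_K
        # Score each candidate by the distance to its nearest reference point.
        scores = [min(math.dist(x, s) for s in reference) for x in remaining]
        # Take the candidate with the largest score (first one on ties).
        best_index = max(range(len(remaining)), key=scores.__getitem__)
        selected.append(remaining.pop(best_index))

    return selected


# Toy usage with 2-D feature vectors (made-up values).
labeled = [(0.0, 0.0)]
unlabeled = [(1.0, 0.0), (5.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
print(greedy_sampling(labeled, unlabeled, 2))  # [(9.0, 0.0), (5.0, 0.0)]
```

In the toy run, the first pick is the point farthest from the single labeled sample. The second pick is the point that is far from both the labeled sample and the first pick, which is how the procedure favors diversity.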


Table 1. Ablation experimental results on the concrete slump test dataset and the negative thermal expansion material dataset

Method	$\bar{R^{2}}$ for concrete slump test dataset	$\bar{R^{2}}$ for negative coefficient of thermal expansion dataset
Without outlier detection module	0.39	-2
Without greedy sampling module	0.43	-1.42
Without minimum change sampling module	0.46	-1.83
Complete method	0.52	-0.97
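
When reading the $\bar{R^{2}}$ columns, it may help to recall that the standard coefficient of determination has no lower bound of zero: it goes negative whenever a model predicts worse than a constant predictor at the mean of the targets. The sketch below assumes the standard definition $R^{2} = 1 - SS_{res}/SS_{tot}$. This page does not state which R² variant the authors used or over which runs it was averaged, and the numbers are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Standard coefficient of determination, 1 - SS_res / SS_tot.

    Assumption: this is the textbook definition, not necessarily the exact
    metric behind Table 1.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]

print(r2_score(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect fit)
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # 0.0  (always predicts the mean)
print(r2_score(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```

Under this definition, a higher value is better even when every value in a column is negative.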

