Summary: Differences in production environments and measurement conditions make the material data used for machine learning noisy. Labeling material data requires specialist knowledge and skills, so the labeling cost is also relatively high. Together, these two factors pose a major challenge to applying machine learning in the materials field. To address this challenge, an active regression learning method is proposed that consists of an outlier detection module, a greedy sampling module, and a minimum change sampling module. Unlike other active learning methods, this method integrates an outlier detection mechanism, so it selects high-quality samples while effectively excluding the influence of noisy data, which avoids sunk costs. Comparative experiments against the latest active regression learning methods were conducted on public and non-public datasets. The results show that, for the same amount of data, the performance index of the task model trained with the proposed method is 15% higher on average than that of other models, and that only 30%–40% of the data is needed as a training set to match or even exceed the accuracy of a task model trained on all the data.
Abstract: To date, artificial intelligence has been successfully applied in various fields of material science, but these applications require a large amount of high-quality data. In practice, many unlabeled data points but few labeled data points can be obtained directly, because data annotation requires fine and expensive experiments whose cost in time and money cannot be ignored. Active learning selects a few high-quality samples from many unlabeled data points for labeling, so that task model performance is optimized at the lowest possible labeling cost. However, active learning methods suitable for material attribute regression are poorly understood, and general active learning methods cannot easily avoid the negative effects of noisy data, so labeling effort spent on noisy samples becomes a sunk cost. Therefore, we propose a new active regression learning method with the following components: (1) outlier detection module: an auxiliary classification model is trained on the labeled dataset together with the predictions that the fitted task model makes on the labeled data, so that it can classify outliers, and the unlabeled samples most likely to be outliers are then excluded; (2) greedy sampling: an iterative method selects the data point that is farthest, in the geometric space, from both the labeled dataset and the data already selected, so that sample diversity is fully considered; and (3) minimum change sampling: the unlabeled data that show the minimum change before and after the task model is trained on the labeled dataset are selected, because such data are relatively underrepresented in the feature space of the labeled dataset.
We performed experiments on the concrete slump test dataset and the negative coefficient of thermal expansion dataset and compared our method with the latest active regression learning methods. The results show that, on noisy datasets, other methods do not necessarily improve task model performance after labeling data in each active learning cycle, and their final performance cannot reach the level of the task model trained on all data. With the same amount of data, the performance index of the task model trained by our method is 15% higher on average than that of other models. Because of the added outlier detection mechanism, our method can effectively avoid sampling outliers when selecting high-quality samples. The task model trained using only 30%–40% of the data can match or even exceed the accuracy of the task model trained on all data.
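The outlier detection idea in component (1) can be illustrated with a minimal sketch. This is not the authors' implementation: the least-squares task model, the residual threshold (mean plus `z` standard deviations), the k-nearest-neighbour auxiliary classifier, and the names `flag_outliers`, `outlier_scores`, `filter_pool`, `z`, `k`, and `n_drop` are all assumptions made for illustration.

```python
import numpy as np

def flag_outliers(X_l, y_l, z=2.0):
    """Fit a stand-in task model on the labeled data and flag large residuals."""
    A = np.hstack([X_l, np.ones((len(X_l), 1))])   # features plus a bias column
    w, *_ = np.linalg.lstsq(A, y_l, rcond=None)    # least-squares task model (assumption)
    resid = np.abs(A @ w - y_l)                    # error of the model on its own training labels
    return resid > resid.mean() + z * resid.std()  # assumed threshold rule

def outlier_scores(X_l, is_outlier, X_u, k=3):
    """Auxiliary k-NN classifier: share of the k nearest labeled points flagged as outliers."""
    d = np.linalg.norm(X_u[:, None, :] - X_l[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return is_outlier[nearest].mean(axis=1)

def filter_pool(X_l, y_l, X_u, n_drop=1, k=3):
    """Return sorted indices of unlabeled samples kept after excluding
    the n_drop samples that look most like outliers."""
    scores = outlier_scores(X_l, flag_outliers(X_l, y_l), X_u, k)
    order = np.argsort(-scores, kind="stable")     # most outlier-like first
    return np.sort(order[n_drop:])
```

Calling `filter_pool(X_l, y_l, X_u, n_drop=5)` would return the indices of the pool samples that remain candidates for the sampling modules. Any regressor and any classifier could replace the two stand-ins; the sketch only shows how residuals on labeled data can supply training targets for a classifier that screens the unlabeled pool.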
Key words:
- active learning
- material
- outlier detection
- regression
- high-quality samples
Algorithm 1
Input: labeled dataset $ {\boldsymbol{D}}_{\mathrm{l}} $; unlabeled dataset $ {\boldsymbol{D}}_{\mathrm{u}} $; number of samples to select $ K $
Output: sampling result $ {\boldsymbol{D}}_{K} $, $ \left|{\boldsymbol{D}}_{K}\right|=K $
Set $ {\boldsymbol{D}}_{K}=\varnothing $
For $ k=1,\ldots,K $ do
    Calculate $ d_{nm}^{x} $ using (2), where $ \boldsymbol{R}={\boldsymbol{D}}_{\mathrm{u}}-{\boldsymbol{D}}_{K} $ and $ \boldsymbol{S}={\boldsymbol{D}}_{\mathrm{l}}\cup {\boldsymbol{D}}_{K} $
    Calculate $ {d}_{n}^{k} $ using (3)
    Choose $ x=\arg\max_{x}{d}_{n}^{x} $
    Update $ {\boldsymbol{D}}_{K}={\boldsymbol{D}}_{K}\cup \left\{x\right\} $
End
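The loop in Algorithm 1 can be sketched in Python as follows. Equations (2) and (3) are not reproduced in this section, so the sketch assumes that (2) is the Euclidean distance between a remaining unlabeled sample and a point of $\boldsymbol{S}$ and that (3) takes the minimum of those distances over $\boldsymbol{S}$, as is common in greedy sampling. The function name `greedy_sampling` and its arguments are hypothetical, and the labeled set is assumed to be non-empty.

```python
import numpy as np

def greedy_sampling(X_l, X_u, K):
    """Pick K rows of X_u; each pick is the remaining sample farthest
    from its nearest neighbour in S = D_l ∪ D_K (assumed reading of (2) and (3))."""
    # d_min[n]: distance from unlabeled sample n to the closest point of S
    d_min = np.linalg.norm(X_u[:, None, :] - X_l[None, :, :], axis=2).min(axis=1)
    available = np.ones(len(X_u), dtype=bool)      # membership in R = D_u - D_K
    selected = []
    for _ in range(min(K, len(X_u))):
        n = int(np.argmax(np.where(available, d_min, -np.inf)))
        selected.append(n)
        available[n] = False
        # the new pick joins S, so nearest-neighbour distances can only shrink
        d_min = np.minimum(d_min, np.linalg.norm(X_u - X_u[n], axis=1))
    return selected
```

Keeping a running `d_min` avoids recomputing every pairwise distance in each iteration; only the distances to the newly selected sample are needed to update it.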
Table 1. Ablation experimental results on the concrete slump test dataset and the negative thermal expansion material dataset
| Method | $\bar{ {R}^{2} }$ for concrete slump test dataset | $\bar{ {R}^{2} }$ for negative coefficient of thermal expansion dataset |
| --- | --- | --- |
| Without outlier detection module | 0.39 | −2 |
| Without greedy sampling module | 0.43 | −1.42 |
| Without minimum change sampling module | 0.46 | −1.83 |
| Complete method | 0.52 | −0.97 |