Title: A pipeline for data and knowledge extraction from material science literature to accelerate scientific discovery
Date: Wednesday, August 2, 2023
Time: 2 pm ET
Location: Love 210, J. Erskine Love Building (https://gatech.zoom.us/j/95839070234)
Pranav Shetty
Machine Learning Ph.D. Student
School of Computational Science and Engineering
Georgia Institute of Technology
Committee
Prof. Rampi Ramprasad (Advisor), School of Materials Science & Engineering, Georgia Tech
Prof. Chao Zhang (Co-advisor), School of Computational Science and Engineering, Georgia Tech
Prof. Alan Ritter, School of Interactive Computing, Georgia Tech
Prof. Roshan Joseph, School of Industrial & Systems Engineering, Georgia Tech
Prof. Seung Soon Jang, School of Materials Science & Engineering, Georgia Tech
Abstract
Scientific literature is growing at an exponential pace, which makes it difficult for scientists to search through it and make effective use of the data it contains. In this work, we develop the methods and data sets needed to extract knowledge and material property data from a corpus of 2.4 million materials science articles. We uniquely identify extracted polymer materials by training supervised clustering models that use parameterized cosine distances with hierarchical agglomerative clustering; these models achieve state-of-the-art results on a benchmark data set of polymer named entity (PNE) clusters. In addition, we build sequence labeling models that tag property information using an ontology specific to the materials domain. MaterialsBERT, a pre-trained encoder fine-tuned on the aforementioned corpus of materials science papers, serves as the encoder for the sequence labeling model and outperforms the baselines tested on data sets in the materials domain. We develop two pipelines to extract material property records from our corpus of papers: one that combines sequence labeling outputs with heuristic rules, and another that uses prompts to a large language model. The extracted data is made available to the public through the interface polymerscholar.org. A subset of the extracted data is used to train machine learning models that predict the power conversion efficiency of polymer solar cells, demonstrating an end-to-end pipeline that goes from literature-extracted data to data-driven insights. This work will reduce the time spent in both the search and discovery phases of experimental work, allowing researchers to move beyond an Edisonian trial-and-error approach.
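As a rough illustration of the clustering step mentioned in the abstract, the sketch below groups polymer names by running average-linkage agglomerative clustering over a cosine distance with per-dimension weights. It is not the model from the thesis: the embedding vectors, the weights, the distance threshold, and the function names are all hypothetical, and the weights are fixed by hand here, whereas a supervised approach would learn them from labeled clusters.

```python
import math

def weighted_cosine_distance(u, v, w):
    """Cosine distance with a non-negative weight per dimension (assumed form)."""
    num = sum(wi * a * b for wi, a, b in zip(w, u, v))
    norm_u = math.sqrt(sum(wi * a * a for wi, a in zip(w, u)))
    norm_v = math.sqrt(sum(wi * b * b for wi, b in zip(w, v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - num / (norm_u * norm_v)

def agglomerative_cluster(vectors, weights, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        dists = [weighted_cosine_distance(vectors[i], vectors[j], weights)
                 for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical toy embeddings for four polymer name mentions.
names = ["polystyrene", "PS", "poly(methyl methacrylate)", "PMMA"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
weights = [1.0, 1.0, 1.0]  # hand-set here; a supervised model would learn these

clusters = agglomerative_cluster(vectors, weights, threshold=0.2)
for cluster in clusters:
    print([names[i] for i in cluster])
```

With these toy vectors, each abbreviation ends up in the same cluster as its full name. The threshold plays the role of the cut height in the dendrogram, so it would need to be tuned alongside the weights on real data.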