An efficient machine-learning method uses chemical knowledge to create learnable grammars with production rules for generating synthesizable monomers and polymers.
Chemical engineers and materials scientists are constantly on the lookout for the next revolutionary material, chemical, or drug. The rise of machine-learning approaches is accelerating the discovery process, which could otherwise take years.
“Ideally, the goal is to train a machine-learning model on some existing chemical samples and then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all these components, you can build new molecules with optimal properties, and you also know how to synthesize them. That’s the overall vision that people in this space want to achieve.”
However, current techniques, primarily deep learning, require extensive datasets to train models, and many class-specific chemical datasets contain only a handful of example compounds, limiting their ability to generalize and to generate physical molecules that could be created in the real world.
Now, a new paper from researchers at MIT and IBM tackles this problem by using a generative graph model to produce new synthesizable molecules within the same chemical class as their training data. To do this, they treat the formation of atoms and chemical bonds as a graph and develop a graph grammar – a linguistics analogy of systems and structures for word ordering – that contains a sequence of rules for building molecules, such as monomers and polymers.
Using the grammar and the production rules inferred from the training set, the model can not only reconstruct its examples, but can also create new compounds in a systematic and data-efficient way. “We basically built a language for creating molecules,” Matusik says. “This grammar essentially is the generative model.”
Matusik’s co-authors include MIT graduate students Minghao Guo, the lead author, and Beichen Li, as well as Veronika Thost, Payel Das, and Jie Chen, research staff members at IBM Research. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they call data-efficient graph grammar (DEG), will be presented at the International Conference on Learning Representations (ICLR).
“We want to use this grammar representation for monomer and polymer generation because this grammar is explainable and expressive,” Guo says. “With only a small number of production rules, we can generate a wide variety of structures.”
A molecular structure can be thought of as a symbolic representation in a graph – atoms (nodes) joined together by chemical bonds (edges). In this method, the researchers allow the model to take a chemical structure and collapse a substructure of the molecule down to a single node; this might be two atoms connected by a bond, a short sequence of bonded atoms, or a ring of atoms.
This is done repeatedly, creating the production rules as it goes, until only a single node remains. The rules and grammar can then be applied in reverse order to recreate the training set from scratch, or combined in different ways to produce new molecules of the same chemical class.
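The bottom-up idea can be sketched in a few lines of Python. Everything below – the adjacency-dict encoding, the leaf-collapsing heuristic, and the rule format – is a hypothetical simplification for illustration only; the actual DEG method contracts richer substructures (bonded pairs, chains, rings) on real molecular graphs.

```python
def extract_rules(adjacency):
    """Contract an acyclic toy molecular graph to a single node,
    recording each contraction as a (parent, leaf) production rule.
    This is a drastically simplified stand-in for DEG's grammar learning."""
    graph = {atom: set(neighbors) for atom, neighbors in adjacency.items()}
    rules = []
    while len(graph) > 1:
        # pick a degree-1 atom (a "leaf") and fold it into its neighbor
        leaf = next(n for n, nbrs in graph.items() if len(nbrs) == 1)
        (parent,) = graph[leaf]
        rules.append((parent, leaf))   # rule: parent -> parent bonded to leaf
        graph[parent].discard(leaf)
        del graph[leaf]
    return rules

# a toy "molecule": a C1-C2-O3 chain
mol = {"C1": {"C2"}, "C2": {"C1", "O3"}, "O3": {"C2"}}
print(extract_rules(mol))  # -> [('C2', 'C1'), ('O3', 'C2')]
```

Replaying the recorded rules in reverse order – starting from the last surviving node and re-attaching each collapsed substructure – reconstructs the original graph, which is the sense in which the grammar doubles as a generative model.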
“Existing graph generation methods would produce one node or one edge sequentially at a time, but we are looking at higher-level structures and, specifically, exploiting chemistry knowledge, so that we don’t treat the individual atoms and bonds as the units. This simplifies the generation process and also makes it more data-efficient to learn,” Chen says.
In addition, the researchers constrained the technique so that the bottom-up grammar remains relatively simple and straightforward, such that it produces molecules that can actually be made.
“If we change the order in which these production rules are applied, we would get another molecule; what’s more, we can enumerate all the possibilities and generate tons of them,” Chen says. “Some of these molecules are valid and some are not, so the learning of the grammar is really about finding a minimal collection of production rules, such that the percentage of molecules that can actually be synthesized is maximized.” While the researchers concentrated on three training sets of fewer than 33 samples apiece – acrylates, chain extenders, and isocyanates – they note that the process could be applied to any chemical class.
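A toy grammar makes this concrete: different choices and orders of rule application yield different molecules, and the set of reachable molecules can be enumerated. The SMILES-like tokens and the acrylate-flavored rules below are invented for illustration and are not the learned DEG grammar.

```python
# Hypothetical production rules: nonterminals (MOL, SIDE, R) rewrite
# into lists of atom groups and/or further nonterminals.
RULES = {
    "MOL":  [["C", "C", "SIDE"]],     # vinyl backbone plus a side chain
    "SIDE": [["C(=O)O", "R"]],        # ester group plus a substituent
    "R":    [["C"], ["C", "R"]],      # alkyl chain of varying length
}

def expand(tokens, depth):
    """Yield every fully terminal expansion reachable within `depth` rule
    applications, expanding the leftmost nonterminal at each step."""
    for i, tok in enumerate(tokens):
        if tok in RULES:                         # found a nonterminal
            if depth == 0:
                return                           # budget exhausted
            for choice in RULES[tok]:
                yield from expand(tokens[:i] + choice + tokens[i + 1:], depth - 1)
            return
    yield "".join(tokens)                        # all tokens are terminal

print(sorted(expand(["MOL"], depth=6)))
# -> ['CCC(=O)OC', 'CCC(=O)OCC', 'CCC(=O)OCCC', 'CCC(=O)OCCCC']
```

Even this three-rule grammar yields a family of structures; real learned grammars trade off this expressiveness against validity, which is what the minimality objective above captures.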
To see how their method performed, the researchers tested DEG against other state-of-the-art models and techniques, looking at the percentage of chemically valid and unique molecules generated, the diversity of those molecules, the retrosynthesis success rate, and the percentage of molecules that belong to the monomer class of the training data.
“We clearly show that, for the synthesizability and membership metrics, our algorithm outperforms all existing methods by a very large margin, while it’s comparable for some other widely used metrics,” Guo says. In addition, “a surprising fact about our algorithm is that we only need about 0.15 percent of the original dataset to achieve very similar results compared to state-of-the-art approaches that train on thousands of samples. Our algorithm can specifically handle the problem of data sparsity.”
In the near future, the team plans to extend this grammar-learning process to be able to generate larger graphs, as well as to produce and identify chemicals with desired properties.
Down the road, the researchers see many applications for the DEG method, as it is adaptable beyond generating new chemical structures, the team points out. A graph is a very flexible representation, and many entities can be symbolized in this form – robots, vehicles, buildings, and electronic circuits, for example. “Essentially, our goal is to extend our grammar so that our graph representation can be widely adopted across many different domains,” Guo says. “DEG can automate the design of novel entities and structures,” Chen says.
Written by Lauren Hinkel
Source: Massachusetts Institute of Technology