Try a simple QSAR model with Chainer [Predicting blood-brain barrier permeability of compounds]

2020/3/2, 2021/8/28

QSAR (Quantitative Structure-Activity Relationship) refers to the statistical correlation between the structure of a chemical substance and its physiological activity (toxicity, ability to bind to enzymes, activity as a drug, etc.). Using correlations derived from large experimental data sets of chemicals, the performance of a compound can be predicted.

This time, we will create a simple QSAR model that "predicts the blood-brain barrier permeability of compounds" using Chainer, a deep learning framework made in Japan, and verify its performance against the test set.

Table of contents

1. Prediction target and data
2. Creating a model
- 2.1. Environment
3. Model building
- 3.1. Inference

Prediction target and data

The data is the BBBP dataset from MoleculeNet. See the article below for an overview of the data.

Visualization of chemical datasets and exploratory data analysis [Python, RDKit]

As a preliminary step to machine learning / QSAR model creation, for example, compound data consisting of a list of chemical structures and activities ...

https://insilico-notebook.com/chemdata-eda/

For the blood-brain barrier permeability of each compound, "penetration" is labeled 1 and "non-penetration" is labeled 0.

Creating a model

Environment

from rdkit import rdBase
import chainer
print('rdkit version: ',rdBase.rdkitVersion)
chainer.print_runtime_info()

rdkit version:  2019.03.4
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Chainer: 6.2.0
NumPy: 1.17.4
CuPy:
  CuPy Version: 6.2.0
  CUDA Root: /usr/local/cuda
  CUDA Build Version: 10010
  CUDA Driver Version: 10010
  CUDA Runtime Version: 10010
  cuDNN Build Version: 7500
  cuDNN Version: 7605
  NCCL Build Version: 2402
  NCCL Runtime Version: 2402
iDeep: Not Available

chainer.print_runtime_info() prints the versions of Chainer, NumPy, and CuPy in use.

Model building

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw, PandasTools, Descriptors

# Load the data
df = pd.read_csv('BBBP.csv',index_col=0)

# Generate mol objects from SMILES and add them to the data frame
PandasTools.AddMoleculeColumnToFrame(df, smilesCol = 'smiles')

# Drop rows for which no mol object could be generated
df = df.dropna()

# Compute compound descriptors from the mol objects
for i,j in Descriptors.descList:
    df[i] = df['ROMol'].map(j)
df['Ipc'] = [Descriptors.Ipc(mol, avg=True) for mol in df['ROMol']]

# Convert data types for Chainer
x = df.iloc[:,4:].values.astype('float32')
y = df['p_np'].values.astype('int32')
indices = np.array(range(x.shape[0])) # keep the row positions so they survive train_test_split

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test, indices_train, indices_test = train_test_split(x, y, indices, test_size=0.05, random_state=123)

# Standardize the explanatory variables (fit on the training set only)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train= scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(type(x_train), x_train.shape, type(y_train), y_train.shape)
print(type(x_test), x_test.shape, type(y_test), y_test.shape)

<class 'numpy.ndarray'> (1937, 200) <class 'numpy.ndarray'> (1937,)
<class 'numpy.ndarray'> (102, 200) <class 'numpy.ndarray'> (102,)
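To make the standardization step concrete, here is a minimal NumPy sketch of fitting the mean and standard deviation on the training set only and reusing them for the test set. The arrays are small made-up stand-ins (not BBBP descriptors), and `standardize_fit` / `standardize_apply` are hypothetical helper names; the handling of constant columns follows what StandardScaler does as far as I understand it.

```python
import numpy as np

def standardize_fit(x_train):
    """Return per-column mean and scale computed from the training data only."""
    mean = x_train.mean(axis=0)
    scale = x_train.std(axis=0)          # population standard deviation (ddof=0)
    scale[scale == 0.0] = 1.0            # constant columns: avoid division by zero
    return mean, scale

def standardize_apply(x, mean, scale):
    """Scale any array with statistics that were fitted elsewhere."""
    return (x - mean) / scale

# Made-up toy data: 4 training rows, 2 test rows, 3 "descriptors".
x_train_toy = np.array([[1.0, 10.0, 5.0],
                        [2.0, 20.0, 5.0],
                        [3.0, 30.0, 5.0],
                        [4.0, 40.0, 5.0]], dtype=np.float32)
x_test_toy = np.array([[2.5, 25.0, 5.0],
                       [5.0, 50.0, 6.0]], dtype=np.float32)

mean, scale = standardize_fit(x_train_toy)
x_train_std = standardize_apply(x_train_toy, mean, scale)
x_test_std = standardize_apply(x_test_toy, mean, scale)

# Descriptor calculations can produce NaN or inf for some molecules,
# so a finiteness check before training is a cheap safeguard.
print(np.isfinite(x_train_std).all(), np.isfinite(x_test_std).all())
print(x_train_std.mean(axis=0))   # close to 0 for every column
print(x_test_std)                 # test rows scaled with the training statistics
```

Fitting on the training rows only keeps information about the test set out of the preprocessing, which is why the article calls `scaler.fit(x_train)` and then only `transform` on `x_test`.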

See here for creating descriptors from mol objects.
Calculate molecular descriptors and fingerprints from SMILES and store them in a data frame [Python, RDKit]

The official tutorials on how to use Chainer are very extensive.
Getting Started with Deep Learning Chainer Tutorial

# Pair the explanatory variables with the target variable
from chainer.datasets import TupleDataset
train = TupleDataset(x_train, y_train)
test = TupleDataset(x_test, y_test)

# Prepare the iterators
from chainer.iterators import SerialIterator
train_iter = SerialIterator(train, batch_size=64, repeat=True, shuffle=True)
test_iter = SerialIterator(test, batch_size=64, shuffle=False, repeat=False)

# Build the neural network
# A three-layer multi-layer perceptron (MLP)
import chainer.links as L
import chainer.functions as F
from chainer import Chain
from chainer import optimizers, training
from chainer.training import extensions

class MLP(chainer.Chain):

    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(None, 100)
            self.fc2 = L.Linear(None, 20)
            self.fc3 = L.Linear(None, 2)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = F.relu(self.fc2(h))
        h = self.fc3(h)
        return h

# Wrap the network in a Classifier
# (it computes the objective function, softmax cross entropy by default, and returns the loss)
predictor = MLP()
net = L.Classifier(predictor)

# Choose an optimization method and create the optimizer
optimizer = optimizers.MomentumSGD(lr=0.1).setup(net)

# Pass the iterator and the optimizer to the updater
updater = training.StandardUpdater(train_iter, optimizer, device=-1)
# Write to a relative 'results' directory, the same path the plots are read from later
trainer = training.Trainer(updater, (50, 'epoch'), out='results')

from chainer.training import extensions

trainer.extend(extensions.LogReport(trigger=(5, 'epoch'), log_name='log'))
trainer.extend(extensions.snapshot(filename='snapshot_epoch-{.updater.epoch}'))
trainer.extend(extensions.dump_graph('main/loss'))
trainer.extend(extensions.Evaluator(test_iter, net, device=-1), name='val')
trainer.extend(extensions.PrintReport(['epoch', 'iteration', 'main/loss', 'main/accuracy', 'val/main/loss', 'val/main/accuracy', 'fc1/W/data/mean', 'elapsed_time']))
trainer.extend(extensions.PlotReport(['fc1/W/grad/mean'], x_key='epoch', file_name='mean.png'))
trainer.extend(extensions.PlotReport(['main/loss', 'val/main/loss'], x_key='epoch', file_name='loss.png'))
trainer.extend(extensions.PlotReport(['main/accuracy', 'val/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
trainer.extend(extensions.ParameterStatistics(net.predictor.fc1, {'mean': np.mean}, report_grads=True))

trainer.run()
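As a rough illustration of what the `MLP` and `L.Classifier` pair computes for each mini-batch, the sketch below reproduces the forward pass (two ReLU layers followed by a linear output) and the softmax cross-entropy loss in plain NumPy. The weights are random stand-ins, the batch is made up, and the helper names are hypothetical; this is not Chainer's implementation, only the same arithmetic under those assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_in, n_out):
    """Random stand-in for a linear layer: weight (n_out, n_in) and bias (n_out,)."""
    weight = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    return weight, np.zeros(n_out)

def mlp_forward(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(x @ w1.T + b1, 0.0)   # fc1 + ReLU
    h = np.maximum(h @ w2.T + b2, 0.0)   # fc2 + ReLU
    return h @ w3.T + b3                 # fc3: raw class scores (logits)

def softmax_cross_entropy(logits, labels):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-probability assigned to the true class.
    return -log_prob[np.arange(len(labels)), labels].mean()

# Same layer sizes as the article's MLP: 200 descriptors -> 100 -> 20 -> 2 classes.
params = [init_linear(200, 100), init_linear(100, 20), init_linear(20, 2)]
x_batch = rng.normal(size=(64, 200))     # made-up mini-batch of 64 compounds
y_batch = rng.integers(0, 2, size=64)    # made-up 0/1 labels

logits = mlp_forward(x_batch, params)
loss = softmax_cross_entropy(logits, y_batch)
accuracy = (logits.argmax(axis=1) == y_batch).mean()
print(logits.shape, float(loss), float(accuracy))
```

The loss and accuracy here correspond to the quantities that appear in the report as `main/loss` and `main/accuracy`, computed on untrained random weights.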

from IPython.display import Image, display
display(Image(filename='results/accuracy.png'))

The accuracy looks reasonable, but the accuracy on the test set barely improves as training progresses ...
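One way to see where the test accuracy stops improving is to scan the training log. The sketch below assumes the log has the shape that `LogReport` writes (a JSON list with one dict per report, using the same keys that are passed to `PrintReport`). The entries are made-up placeholder numbers, not results from this run, and `best_entry` is a hypothetical helper.

```python
import json

# Made-up log in the shape LogReport writes; these are NOT the article's results.
log_text = json.dumps([
    {"epoch": 5,  "main/accuracy": 0.85, "val/main/accuracy": 0.84},
    {"epoch": 10, "main/accuracy": 0.91, "val/main/accuracy": 0.87},
    {"epoch": 15, "main/accuracy": 0.95, "val/main/accuracy": 0.86},
    {"epoch": 20, "main/accuracy": 0.98, "val/main/accuracy": 0.86},
])

def best_entry(entries, key="val/main/accuracy"):
    """Return the log entry with the highest value for `key`."""
    return max(entries, key=lambda e: e[key])

entries = json.loads(log_text)
best = best_entry(entries)
print("best epoch:", best["epoch"])

# Gap between training and held-out accuracy at each report;
# a gap that keeps growing is a common sign of overfitting.
gaps = [(e["epoch"], round(e["main/accuracy"] - e["val/main/accuracy"], 2)) for e in entries]
print(gaps)
```

With a real run, the same two steps would be applied to the log file loaded with `json.load` from the trainer's output directory.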

Inference

# Run inference with the trained model
with chainer.using_config('train', False), chainer.using_config('enable_backprop', False):
    y_pred = predictor(x_test)

# Check the inference results
print('accuracy', F.accuracy(y_pred, y_test))  # accuracy variable(0.88235295)

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred.data.argmax(axis=1))

array([[21,  7],
       [ 5, 69]])

The accuracy matches the value seen in the figure. Looking at the confusion matrix, there are both false positives and false negatives, but the model does not appear to be gaining its accuracy by pushing predictions toward one class.
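The per-class picture can be read directly off the confusion matrix above. The following is a small sketch that assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that label 1 means "penetration"; the variable names are my own.

```python
import numpy as np

# Confusion matrix from the article (rows = true class, columns = predicted class).
cm = np.array([[21, 7],
               [5, 69]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)     # recall for class 1 (assumed "penetration")
specificity = tn / (tn + fp)     # recall for class 0 (assumed "non-penetration")
precision = tp / (tp + fp)
# Accuracy of always predicting the majority class, as a point of comparison.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f} "
      f"baseline={majority_baseline:.3f}")
```

On these numbers the accuracy of about 0.88 is above the roughly 0.73 that always predicting the majority class would give, while specificity (0.75) is lower than sensitivity (about 0.93), so the non-penetrating compounds are the harder class for this model.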

# Look at some of the predictions
for i in range(int(len(y_pred)/10)):
    print('No.', indices_test[i])
    print('label:', y_test[i])
    print('pred :', np.argmax(y_pred[i].array))
    # indices_test holds 0-based row positions, so look the molecule up with .iloc
    img = Draw.MolToImage(df.ROMol.iloc[indices_test[i]])
    display(img)

The image shows part of the output. Compounds that a human could also judge by eye are predicted correctly. It would be interesting to look closely at what the model got wrong.
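To look at the mistakes specifically, the misclassified test rows can be picked out by comparing predictions with labels. This is a minimal sketch using made-up stand-in arrays (the `_toy` names are hypothetical and the values are not real predictions); it only shows the indexing pattern.

```python
import numpy as np

# Made-up stand-ins for the article's variables (not real predictions).
y_test_toy = np.array([1, 0, 1, 1, 0, 1])
logits_toy = np.array([[0.2, 1.5],    # predicted 1
                       [0.9, 0.1],    # predicted 0
                       [1.2, 0.3],    # predicted 0 -> false negative
                       [0.1, 2.0],    # predicted 1
                       [0.4, 0.8],    # predicted 1 -> false positive
                       [0.0, 0.5]])   # predicted 1
indices_test_toy = np.array([40, 7, 13, 99, 2, 58])   # row positions in the full table

pred = logits_toy.argmax(axis=1)
wrong = np.flatnonzero(pred != y_test_toy)             # positions inside the test set

for i in wrong:
    kind = "false positive" if pred[i] == 1 else "false negative"
    print(f"row {indices_test_toy[i]}: label={y_test_toy[i]} pred={pred[i]} ({kind})")
```

With the article's variables, `y_pred.array.argmax(axis=1)`, `y_test`, and `indices_test` would take the place of the toy arrays, and the selected rows could then be drawn with `Draw.MolToImage` in the same way as in the loop above.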
