Visualize compound space by dimensionality reduction of fingerprints

2019/10/26

Let's visualize the compound space (chemical space) based on the chemical structure using the compound data set that is open to the public.

Motivation

Visualizing the compound space helps you understand:
・ whether the dataset contains any unusual compounds
・ what kinds of compounds it contains, and in what proportion
When building predictive models with QSAR or machine learning, it also gives a rough indication of whether your data would require extrapolation.

Method

Fingerprint calculation

By generating a fingerprint for each compound and reducing its dimensionality, the compounds can be plotted on a plane. According to the similar property principle of Johnson and Maggiora ("similar compounds have similar properties"), compounds with similar structures, and hence similar properties, should lie close to each other on the plane.

For the fingerprints used for visualization, we try the Morgan fingerprint and the RDKit fingerprint.

Reference: Fingerprints available in RDKit

Dimensionality reduction method

For dimensionality reduction, we try Principal Component Analysis (PCA) and UMAP. PCA is the most commonly used method, but because it compresses the data into a lower-dimensional space by a linear projection, it may not be well suited to binary (0/1) data such as fingerprints.
UMAP, on the other hand, is a dimensionality reduction method that takes non-linear structure into account. It can finish several times faster than t-SNE, the standard method of the same kind, so it can be used for large datasets.

For clustering, we use the k-means method and Spectral Clustering.
The k-means method clusters by repeating a series of operations: (1) set the initial cluster centers at random, (2) assign each data point to its nearest center, and (3) recompute each center as the centroid of the points assigned to it. It is known to handle non-linear data such as the two-moons and Swiss roll datasets poorly, whereas Spectral Clustering appears to cope with such data. The scikit-learn cheat sheet also lists it as an option when k-means does not work.
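The three k-means steps above can be sketched in plain Python. This is only an illustrative toy, not what scikit-learn's KMeans does internally (the function name `kmeans`, the fixed iteration count, and the toy points are assumptions made for this example):

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # (1) Pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # (2) Assign every point to its nearest centroid
        #     (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # (3) Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two well-separated groups of 2D points (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because every point is simply assigned to the closest centroid, the cluster boundaries are straight lines (hyperplanes), which is why shapes such as two moons are split badly.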

I would like to try two combinations: "PCA x k-means", which assumes normally distributed, linear data, and "UMAP x Spectral Clustering", which can handle non-linear data.

Incidentally, compounds can also be clustered based on the Tanimoto coefficient, which is often used as a measure of compound similarity (the similarity between two compounds is expressed as a value from 0 to 1, with 1 for identical structures), but I did not try that this time.
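For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in at least one. A minimal sketch in plain Python follows; the function name `tanimoto` and the 8-bit toy fingerprints are made up for illustration, and in practice RDKit's `DataStructs.TanimotoSimilarity` would be used on real fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)   # bits set in at least one
    # Returning 0.0 for two all-zero fingerprints is a convention chosen here.
    return both / either if either else 0.0


# Toy 8-bit fingerprints (not derived from real molecules).
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]

print(tanimoto(fp1, fp2))  # 3 shared bits / 5 bits in the union = 0.6
print(tanimoto(fp1, fp1))  # identical fingerprints give 1.0
```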

Reference: Dimensionality reduction and 2D plotting techniques for high-dimensional data
   Spectral Clustering (cluster analysis)

Let's visualize the compound space

Use MoleculeNet's Lipophilicity dataset. It contains approximately 4200 compounds, each with its SMILES and experimental logP (a hydrophobicity index: the octanol/water partition coefficient).

Reference: List of compound datasets

# package import
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering
import umap
from rdkit import Chem
from rdkit.Chem import AllChem
import matplotlib.pyplot as plt
%matplotlib inline

# read dataset
df = pd.read_csv('Lipophilicity.csv')
print(df.info())
df.head(5)


RangeIndex: 4200 entries, 0 to 4199
Data columns (total 3 columns):
CMPD_CHEMBLID    4200 non-null object
exp              4200 non-null float64
smiles           4200 non-null object
dtypes: float64(1), object(2)
memory usage: 98.5+ KB
None

logP dataset

 

# Get the Morgan and RDKit fingerprints
mols = [Chem.MolFromSmiles(x) for x in df.smiles]
morgan_fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2, 1024) for x in mols]
rdkit_fps = [Chem.RDKFingerprint(x, fpSize=1024) for x in mols]

# Store the fingerprints in DataFrames
df_morgan_fps = pd.DataFrame(np.array(morgan_fps))
df_rdkit_fps = pd.DataFrame(np.array(rdkit_fps))

The Morgan fingerprint encodes the substructures found within a set radius of each atom. Its algorithm is similar to that of the ECFP (Extended Connectivity Fingerprint), and a Morgan radius of 2 corresponds to ECFP4. Here we compute the fingerprint with radius = 2 and 1024 bits.
The RDKit fingerprint encodes substructures based on bond path length (the number of bonds in a path) rather than radius from an atom. It is similar to the Daylight fingerprint. By default, paths from a minimum length of 1 bond to a maximum length of 7 bonds are considered.

The fingerprints are returned as RDKit bit-vector objects, which cannot be used as they are in the subsequent processing, so convert them to a DataFrame.
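To make the target layout concrete, here is a minimal sketch of the conversion using plain Python only. It assumes each fingerprint is available as a string of '0' and '1' characters (RDKit bit vectors can be serialized this way with `ToBitString()`); the helper name `bitstrings_to_rows` and the 5-bit toy fingerprints are hypothetical:

```python
def bitstrings_to_rows(bitstrings):
    """Turn '0'/'1' strings into integer rows, one row per compound."""
    n_bits = len(bitstrings[0])
    rows = []
    for s in bitstrings:
        if len(s) != n_bits:
            raise ValueError("all fingerprints must have the same length")
        rows.append([int(ch) for ch in s])  # one column per bit position
    return rows


# Toy 5-bit fingerprints (real ones in this article have 1024 bits).
toy_fps = ["10010", "00111", "10000"]
rows = bitstrings_to_rows(toy_fps)
print(rows)
```

The resulting list of rows has one row per compound and one column per bit, which is the shape that `pd.DataFrame(rows)`, PCA, and UMAP expect.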
 

# k-means clustering
# (n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here)
kmeans = KMeans(n_clusters=8)
kmeans.fit(df_morgan_fps)

# Dimensionality reduction with PCA
pca = PCA(n_components=2)
decomp = pca.fit_transform(df_morgan_fps)
x = decomp[:, 0]
y = decomp[:, 1]

# Visualization
plt.figure(figsize=(15, 5))

# Colored by the clusters obtained with k-means
plt.subplot(1, 2, 1)
plt.scatter(x, y, c=kmeans.labels_, alpha=0.7)
plt.title("PCA: morgan_fps, cluster")
plt.colorbar()

# Colored by logP
plt.subplot(1, 2, 2)
plt.scatter(x, y, c=df.exp, alpha=0.7, cmap='spring')
plt.title("PCA: morgan_fps, logP")
plt.colorbar()

Result of visualizing the compound space with the RDKit fingerprint.

Similarly, replace df_morgan_fps with df_rdkit_fps to visualize the space with the RDKit fingerprint. My impression is that the RDKit fingerprint gives a somewhat cleaner projection.

# RDKit fingerprints visualized with "UMAP x Spectral Clustering"
sc = SpectralClustering(n_clusters=50, affinity='nearest_neighbors', n_jobs=-1)
sc.fit(df_rdkit_fps)

# Dimensionality reduction with UMAP
embedding = umap.UMAP(n_neighbors=50, n_components=2, min_dist=0.5).fit_transform(df_rdkit_fps)
x = embedding[:, 0]
y = embedding[:, 1]

fig = plt.figure(figsize=(15, 5))

# Colored by the clusters obtained with Spectral Clustering
plt.subplot(1, 2, 1)
plt.scatter(x, y, c=sc.labels_, alpha=0.7)
plt.title("rdkit_fps, SpectralCluster")
plt.colorbar()

# Colored by logP
plt.subplot(1, 2, 2)
plt.scatter(x, y, c=df.exp, alpha=0.7, cmap='spring')
plt.title("rdkit_fps, logP")
plt.colorbar()

My impression is that "UMAP x Spectral Clustering" groups similar compounds more tightly than "PCA x k-means", but either approach is usable.

As for logP, the relationship between position on the plane and the property value is not clear-cut. An autoencoder or supervised learning seems better suited to producing a two-dimensional plot that correlates with the physical properties and activity values of compounds, so I would like to try that soon.