Substructure Searching

The UniversalIsomorphismTester class in the CDK can be used for substructure searching. It allows you to determine if some structure is a substructure and what the matching substructures are. As such, this can also be used to determine if two structures are identical.

In this chapter we will see how the class returns all possible substructure matches, and we'll notice that redundancy occurs due to symmetrically equivalent matches, and how these redundant matches can be removed.

Exact Search

Before we look at isomorphism checking, it should be noted that canonical SMILES and InChI generation (see elsewhere in this book) may be an appropriate way to determine identity and use this for database lookup.

But historically, this below approach was the way to go and is included are prelude to the substructure search.

The UniversalIsomorphismTester class implements an algorithm that was originally developed for isomorphism checking. However, it can be used for substructure search too. This section will first show how the class is used to check if two classes are identical:

Isomorphism

This algorithm works by looking the how bonds are connected to each other. This is important to realize, because it explains a typical corner case for this algorithm: it cannot distinguish cyclopropane from isobutane (see Figure cyclopropane:isobutane) when they are hydrogen depleted:

UITLimitation

Fortunately, the CDK implementation has a workaround for this so that they are still considered different, based on the fact that they have different atom counts:

UITLimitation

However, for substructure searching we're less lucky, as we will see shortly.

%%% RenderCyclopropane %%% RenderIsobutane

![](images/generated/RenderCyclopropane.png) ![](images/generated/RenderIsobutane.png)

Substructures

Starting from the above code to match two structures, the step to substructure searching is made via the isSubgraph() method:

IsSubgraph

It gives this output:

IsSubgraph

Now, you may wonder why propane is a subgraph of butane, because it is indeed not. But while the variable names suggest that that is what we have been testing, we have been testing something else: this code works because of the fact that the MoleculeFactory returns hydrogen depleted graphs (see Section sec:hydrogens). Therefore, butane is a chain of four carbons, and propane is a chain of three carbons. Then, the latter is a chemical subgraph of the former.

If we now return to our previous cyclopropane-isobutane example, we can run a subgraph analysis on them too:

UITSubgraphLimitation

Here we do see the intrinsic limitation of the algorithm reflected. While it is possible to see that isobutane has more atoms then cyclobutane and therefore cannot be a substructure, that conclusion cannot be derived for cyclobutane as substructure as isobutane, visualizing that algorithmic limitation:

UITSubgraphLimitation

Matching Substructures

Substructure searching is finding in a target molecule the atoms that match the given searched substructure. With the UniversalIsomorphismTester we can do:

Overlap

However, this only returns us one match, selected as being the largest:

Overlap

There is an alternative:

Substructure

The getSubgraphAtomsMaps() methods returns a List<List<RMap>> object, where each List<RMap> represents on substructure match. When we look at the outer list, we see that the subgraph of three carbon atoms is found 4 times in butane, each with 3 atoms:

Substructure

This is caused by the symmetrical nature of the substructure. It can map twice onto the same three atoms in butane: once in the forward direction, and once in the backward direction.

SMARTS matching

A common method to find substructures in cheminformatics is the SMiles ARbitrary Target Specification (SMARTS). The CDK has a SMARTSParser class to parse SMARTS strings and a convenience tool to perform SMARTS substructure searches. This is a typical use case:

SMARTSSearching

This shows us that the SMARTS-encoded carboxylic acid substructure is found twice and which atoms in the input structure form that match:

SMARTSSearching

Unique matches

Symmetry can cause identical hits to match multiple times, in different ways. This is, for example, the case when we loosen the above substructure search to a carbon connected to two oxygens, whatever the bond order is:

SMARTSUniqueSearching

This shows the different between the getMatchingAtoms and getUniqueMatchingAtoms method:

SMARTSUniqueSearching

Fingerprints

Substructure searching is a relatively slow algorithm, and the time required to compare two molecules scales with the number of atoms in each molecule. To reduce the computation time, molecular fingerprints were invented. There are two key aspects to fingerprints that make them efficient: first, they have a fixed length so that the time to compare two molecule is independent of the size of the two structures; secondly, the fingerprint of a substructure always matches the fingerprint of any molecules that has that substructure.

In this section we will see two fingerprint types available in the CDK: a substructure based fingerprint, and a path based fingerprint. Before I will explain how these fingerprints are created, we will first look at the BitSet class that is used by the CDK to represent these fingerprints. Consider this code:

BitSetDemo

If we analyze the output, we see that all set bits are listed, and that all other bits are not:

BitSetDemo

Let us now consider a simple substructure fingerprint of length four with the following bit definitions:

bit 1: molecule contains a carbon
bit 2: molecule contains a nitrogen
bit 3: molecule contains a oxygen
bit 4: molecule contains a chlorine

Let's call this fingerprinter SimpleFingerprinter:

SimpleFingerprinter

We can then calculate the fingerprints for ethanol and benzene:

SimpleFingerprintDemo

and we get these bit sets:

SimpleFingerprintDemo

Now, we can replace the presence of a particular atom, by the presence of a substructure, such as a phenyl or a carbonyl group. We have then defined a substructure fingerprint.

The CDK has several kinds of fingerprints, including path-based fingerprints (Fingerprinter and HybridizationFingerprinter), a MACSS fingerprint (MACSSFingerprinter) [Q34160151], and the PubChem fingerprint (PubChemFingerprinter). These fingerprints have been used for various tasks, including ligand classification [Q42704791], and databases like BRENDA [Q24599948] and TIN [Q33874102].

MACCS Fingerprints

One substructure-based fingerprinter is the MACCSFingerprinter which partly implements the MACSS fingerprint specification [Q34160151]. The substructures are defined as SMARTS substructure specifications, inherited from RDKit (http://rdkit.org/). For this fingerprint it is required the implicit hydrogen counts are first set:

MACCSFingerprint

The object returned by the getBitFingerprint method is the IBitFingerprint which we can convert into a Java BitSet with the asBitSet method:

MACCSFingerprint

ECFP and FCFP Fingerprints

The CDK also has an implementation for the circular ECFP and FCFP fingerprints [Q29616639]. These are developed by Alex M. Clark at Collaborative Drug Discovery, Inc in the CircularFingerprinter [Q27902272]. It implements both in four variants: ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar as for other fingerprints, but we do have to indicate what variant we want:

ECFPFingerprint

Again we get an IBitFingerprint resulting in a BitSet of bits:

ECFPFingerprint

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Substructure Searching

Unique matches

MACCS Fingerprints

ECFP and FCFP Fingerprints

References

FilesExpand file tree

substructure.i.md

Latest commit

History

substructure.i.md

File metadata and controls

Substructure Searching

Unique matches

MACCS Fingerprints

ECFP and FCFP Fingerprints

References