The UniversalIsomorphismTester class in the CDK can be used for substructure searching. It allows you to determine if some structure is a substructure and what the matching substructures are. As such, this can also be used to determine if two structures are identical.
In this chapter we will see how the class returns all possible substructure matches, and we'll notice that redundancy occurs due to symmetrically equivalent matches, and how these redundant matches can be removed.
Before we look at isomorphism checking, it should be noted that canonical SMILES and InChI generation (see elsewhere in this book) may be an appropriate way to determine identity and use this for database lookup.
But historically, this below approach was the way to go and is included are prelude to the substructure search.
The UniversalIsomorphismTester class implements an algorithm that was originally developed for isomorphism checking. However, it can be used for substructure search too. This section will first show how the class is used to check if two classes are identical:
Isomorphism
This algorithm works by looking the how bonds are connected to each other. This is important to realize, because it explains a typical corner case for this algorithm: it cannot distinguish cyclopropane from isobutane (see Figure cyclopropane:isobutane) when they are hydrogen depleted:
UITLimitation
Fortunately, the CDK implementation has a workaround for this so that they are still considered different, based on the fact that they have different atom counts:
UITLimitation
However, for substructure searching we're less lucky, as we will see shortly.
%%% RenderCyclopropane
%%% RenderIsobutane
Starting from the above code to match two structures, the step to substructure searching
is made via the isSubgraph() method:
IsSubgraph
It gives this output:
IsSubgraph
Now, you may wonder why propane is a subgraph of butane, because it is indeed not. But while the variable names suggest that that is what we have been testing, we have been testing something else: this code works because of the fact that the MoleculeFactory returns hydrogen depleted graphs (see Section sec:hydrogens). Therefore, butane is a chain of four carbons, and propane is a chain of three carbons. Then, the latter is a chemical subgraph of the former.
If we now return to our previous cyclopropane-isobutane example, we can run a subgraph analysis on them too:
UITSubgraphLimitation
Here we do see the intrinsic limitation of the algorithm reflected. While it is possible to see that isobutane has more atoms then cyclobutane and therefore cannot be a substructure, that conclusion cannot be derived for cyclobutane as substructure as isobutane, visualizing that algorithmic limitation:
UITSubgraphLimitation
Substructure searching is finding in a target molecule the atoms that match the given searched substructure. With the UniversalIsomorphismTester we can do:
Overlap
However, this only returns us one match, selected as being the largest:
Overlap
There is an alternative:
Substructure
The getSubgraphAtomsMaps() methods returns a List<List<RMap>>
object, where each List<RMap> represents on substructure match.
When we look at the outer list, we see that the subgraph of three carbon atoms
is found 4 times in butane, each with 3 atoms:
Substructure
This is caused by the symmetrical nature of the substructure. It can map twice onto the same three atoms in butane: once in the forward direction, and once in the backward direction.
A common method to find substructures in cheminformatics is the SMiles ARbitrary Target Specification (SMARTS). The CDK has a SMARTSParser class to parse SMARTS strings and a convenience tool to perform SMARTS substructure searches. This is a typical use case:
SMARTSSearching
This shows us that the SMARTS-encoded carboxylic acid substructure is found twice and which atoms in the input structure form that match:
SMARTSSearching
Symmetry can cause identical hits to match multiple times, in different ways. This is, for example, the case when we loosen the above substructure search to a carbon connected to two oxygens, whatever the bond order is:
SMARTSUniqueSearching
This shows the different between the getMatchingAtoms and getUniqueMatchingAtoms
method:
SMARTSUniqueSearching
Substructure searching is a relatively slow algorithm, and the time required to compare two molecules scales with the number of atoms in each molecule. To reduce the computation time, molecular fingerprints were invented. There are two key aspects to fingerprints that make them efficient: first, they have a fixed length so that the time to compare two molecule is independent of the size of the two structures; secondly, the fingerprint of a substructure always matches the fingerprint of any molecules that has that substructure.
In this section we will see two fingerprint types available in the CDK: a substructure based fingerprint, and a path based fingerprint. Before I will explain how these fingerprints are created, we will first look at the BitSet class that is used by the CDK to represent these fingerprints. Consider this code:
BitSetDemo
If we analyze the output, we see that all set bits are listed, and that all other bits are not:
BitSetDemo
Let us now consider a simple substructure fingerprint of length four with the following bit definitions:
- bit 1: molecule contains a carbon
- bit 2: molecule contains a nitrogen
- bit 3: molecule contains a oxygen
- bit 4: molecule contains a chlorine
Let's call this fingerprinter SimpleFingerprinter:
SimpleFingerprinter
We can then calculate the fingerprints for ethanol and benzene:
SimpleFingerprintDemo
and we get these bit sets:
SimpleFingerprintDemo
Now, we can replace the presence of a particular atom, by the presence of a substructure, such as a phenyl or a carbonyl group. We have then defined a substructure fingerprint.
The CDK has several kinds of fingerprints, including path-based fingerprints (Fingerprinter and HybridizationFingerprinter), a MACSS fingerprint (MACSSFingerprinter) [Q34160151], and the PubChem fingerprint (PubChemFingerprinter). These fingerprints have been used for various tasks, including ligand classification [Q42704791], and databases like BRENDA [Q24599948] and TIN [Q33874102].
One substructure-based fingerprinter is the MACCSFingerprinter which partly implements the MACSS fingerprint specification [Q34160151]. The substructures are defined as SMARTS substructure specifications, inherited from RDKit (http://rdkit.org/). For this fingerprint it is required the implicit hydrogen counts are first set:
MACCSFingerprint
The object returned by the getBitFingerprint method is the IBitFingerprint
which we can convert into a Java BitSet with the asBitSet method:
MACCSFingerprint
The CDK also has an implementation for the circular ECFP and FCFP fingerprints [Q29616639]. These are developed by Alex M. Clark at Collaborative Drug Discovery, Inc in the CircularFingerprinter [Q27902272]. It implements both in four variants: ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar as for other fingerprints, but we do have to indicate what variant we want:
ECFPFingerprint
Again we get an IBitFingerprint resulting in a BitSet of bits:
ECFPFingerprint