Organize the search of molecular structures using Lucene and Chemistry Development Kit
full-text search Library Lucene provides the ability to organize search by text documents. There are also tools with which you can search "similar" chemical structures, such as OpenBabel. Sometimes you may need to combine these two types of search into a single "canvas". For example, if you want to create a system that can answer such queries: find substance in the text which has the word "amino acid", structurally similar to indole (it is expected that we will find the amino acid tryptophan). This article describes the solution to this problem based on full-text engine Lucene.
As you know, based on full-text search is the inverted index construction. The document text is split into individual words, each of which builds a list of documents that contain the word. Usually in addition to the list of documents for a given word, the index contains the position of the words in each document.
Search for similar molecules in databases can be based on the "molecular fingerprints". Often the "molecular fingerprint" is a bit string in which each bit corresponds to the presence of certain properties or of the structural fragment. The role of proximity measures uses the Tanimoto coefficient (a special case of the jaccard coefficient).
As a simple example of a molecular imprint, consider this: Suppose we want to obtain a molecular fingerprint long n. The structure is cut into a one-dimensional chain of atoms of length more than k. For each resulting chain is used hash-function that maps the chain number from 0 to n-1. This number determines the number of bits that will be set to 1 in the fingerprint.
For such manipulations with chemical structures, you can use the java library Chemistry Development Kit (hereinafter CDK). CDK is licensed under the LGPL and contains the java classes for the representation of structures, calculation of several types of "molecular imprints", and supports a variety of chemical file formats. For example, cutting the molecule into fragments can be implemented as:
the
Described "molecular fingerprint" is calculated using the class CDK org.openscience.cdk.fingerprint.Fingerprinter.
In order to use Lucene for the search structures, we reformulate the structure search on the prints in a kind of "search words". Structure as in the above-described method for constructing a fingerprint, cut into short one-dimensional chains, which are analogues of words in a text search. Thus, if we give the system a different structure as a search query, the query will also be cut on the chain, and the result will be the structures that have the same chain of atoms. The mechanism of ordering by relevancy will seek to withdraw in the first position of the results of the structures, which on one side have many similarities, on the other hand these coincidences slightly "diluted". That is, it is logical to expect that in the first place will exactly match the structure (unless, of course, it is a indexed), then similar structures differing by one atom or by a group, then even more different structures, etc.
It should be noted that used in the Lucene ranking formula by default (vector space model + TF-IDF) is different from the proximity measure, Tanimoto. You can override the Similarity class of Lucene under the formula of the coefficient of Tanimoto, but I'm not going to stop there, especially since the formula default should give a "qualitatively correct" result: structures with a large proportion of the intersections of fragments will be in the first places in the result.
This approach to indexing and searching of molecules can be implemented for Lucene by creating a special "chemical tokenizer". The tokenizer (tokenizer) is a component of Lucene that splits text into individual tokens (most often the tokens are words). I already brought up the main logic of this tokenizer. The difference will be that the tokenizer in Lucene is not necessary to obtain all the tokens in the cycle. Instead implement the method Tokenizer.incrementToken()which will output the text representation of the chains of atoms one at a time.
There are several formats for the presentation of molecules in the form of a text string. For now, focus on supporting only one format SMILES. CDK provides a parser SMILES-string (class org.openscience.cdk.smiles.SmilesParser). Custom them it is easy:
the
Logic convert the SMILES representation of molecules in the structure with the aim of further reducing it into tokens-the chain is placed in the class SmilesTokenizer.
Allow myself to omit some minor details, I can only say that the source posted on GitHub. Go directly to the example of using these classes for search structures. Suppose that the search documents contain three fields: name — the name of chemical compounds, description — text description field and smiles — information about the structure of the molecule.
Indexing of documents might look something like this (use a helper class SmilesAnalyzerthat simply creates SmilesTokenizer):
the
The search will look like this:
the
Article based on information from habrahabr.ru
As you know, based on full-text search is the inverted index construction. The document text is split into individual words, each of which builds a list of documents that contain the word. Usually in addition to the list of documents for a given word, the index contains the position of the words in each document.
Search for similar molecules in databases can be based on the "molecular fingerprints". Often the "molecular fingerprint" is a bit string in which each bit corresponds to the presence of certain properties or of the structural fragment. The role of proximity measures uses the Tanimoto coefficient (a special case of the jaccard coefficient).
As a simple example of a molecular imprint, consider this: Suppose we want to obtain a molecular fingerprint long n. The structure is cut into a one-dimensional chain of atoms of length more than k. For each resulting chain is used hash-function that maps the chain number from 0 to n-1. This number determines the number of bits that will be set to 1 in the fingerprint.
For such manipulations with chemical structures, you can use the java library Chemistry Development Kit (hereinafter CDK). CDK is licensed under the LGPL and contains the java classes for the representation of structures, calculation of several types of "molecular imprints", and supports a variety of chemical file formats. For example, cutting the molecule into fragments can be implemented as:
the
/To represent chemical structure in CDK are used to implement the interface IAtomContainer.
IAtomContainer structure;
...
for (IAtom startAtom : structure.atoms()) {
//we will be interested in the class CDK org.openscience.cdk.graph.PathTools, which contains a method getPathsOfLengthUpto chopping the structure into one-dimensional chains.
List<List<IAtom>> paths = PathTools.getPathsOfLengthUpto(structure,
startAtom,
searchDepth);
//Then do something with a chain of atoms
...
}
Described "molecular fingerprint" is calculated using the class CDK org.openscience.cdk.fingerprint.Fingerprinter.
In order to use Lucene for the search structures, we reformulate the structure search on the prints in a kind of "search words". Structure as in the above-described method for constructing a fingerprint, cut into short one-dimensional chains, which are analogues of words in a text search. Thus, if we give the system a different structure as a search query, the query will also be cut on the chain, and the result will be the structures that have the same chain of atoms. The mechanism of ordering by relevancy will seek to withdraw in the first position of the results of the structures, which on one side have many similarities, on the other hand these coincidences slightly "diluted". That is, it is logical to expect that in the first place will exactly match the structure (unless, of course, it is a indexed), then similar structures differing by one atom or by a group, then even more different structures, etc.
It should be noted that used in the Lucene ranking formula by default (vector space model + TF-IDF) is different from the proximity measure, Tanimoto. You can override the Similarity class of Lucene under the formula of the coefficient of Tanimoto, but I'm not going to stop there, especially since the formula default should give a "qualitatively correct" result: structures with a large proportion of the intersections of fragments will be in the first places in the result.
This approach to indexing and searching of molecules can be implemented for Lucene by creating a special "chemical tokenizer". The tokenizer (tokenizer) is a component of Lucene that splits text into individual tokens (most often the tokens are words). I already brought up the main logic of this tokenizer. The difference will be that the tokenizer in Lucene is not necessary to obtain all the tokens in the cycle. Instead implement the method Tokenizer.incrementToken()which will output the text representation of the chains of atoms one at a time.
There are several formats for the presentation of molecules in the form of a text string. For now, focus on supporting only one format SMILES. CDK provides a parser SMILES-string (class org.openscience.cdk.smiles.SmilesParser). Custom them it is easy:
the
SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
String smiles = "OCC(O)C(O)C(O)C(O)CO"; //Sorbitol
IAtomContainer structure = sp.parseSmiles( smiles );
//Then do something with the resulting structure
...
Logic convert the SMILES representation of molecules in the structure with the aim of further reducing it into tokens-the chain is placed in the class SmilesTokenizer.
Allow myself to omit some minor details, I can only say that the source posted on GitHub. Go directly to the example of using these classes for search structures. Suppose that the search documents contain three fields: name — the name of chemical compounds, description — text description field and smiles — information about the structure of the molecule.
Indexing of documents might look something like this (use a helper class SmilesAnalyzerthat simply creates SmilesTokenizer):
the
Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer > ();
analyzerPerField.put(SMILES_FIELD, new SmilesAnalyzer() );
PerFieldAnalyzerWrapper analyzerWrapper =
new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_40), analyzerPerField);
Document doc = new Document();
doc.add( new TextField( "name", "Acetic acid", Field.Store.YES ) );
doc.add( new TextField( "description", "Acetic acid is one of the simplest carboxylic acids. Liquid.", Field.Store.YES ) );
doc.add( new TextField( "smiles", "CC(O)=O" Field.Store.YES ) );
Directory directory = null;
IndexWriter indexWriter = null;
try {
directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzerWrapper );
indexWriter = new IndexWriter( directory, config);
indexWriter.addDocument(doc);
}
...
The search will look like this:
the
reader = IndexReader.open(directory );
IndexSearcher searcher = new IndexSearcher(reader);
String querystr = "smiles:CCC";
Query q = null;
q = new QueryParser(Version.LUCENE_40, FREE_TEXT_FIELD, getAnalyzer()).parse(querystr);
TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
//Then do something with search results in hits
...
Комментарии
Отправить комментарий