This package provides first-class cheminformatics support for the Datagrok platform. See it in action on YouTube.
The Datagrok platform provides rich exploratory data analysis capabilities, advanced data mining techniques such as multivariate analysis and stochastic proximity embedding, and out-of-the-box support for predictive modeling and scientific computations. By providing first-class cheminformatics support, this package turns Datagrok into a comprehensive platform for working with chemical and biological data.
Regarding the performance, our goal is to be able to open chemical datasets of up to 10 millions small molecules completely in the browser, and interactively perform commonly used operations such as substructure and similarity search without having to rely on a server. In order to hit these goals, we are using a couple of techniques. First of all, we are leveraging Datagrok's capability to efficiently work with relational data. For cheminformatics, we are relying on the RDKit library compiled to WebAssembly. This not only gives us the ability to execute C++ code at the native speed, but also take full advantage of the modern multicore CPUs by running computations in multiple threads.
Here are some of the features:
- Molecule sketching
- OpenChemLib
- SAR analysis
- Chembl integration
- Pubchem integration
- Works completely on the client side where possible.
- RDKit-based rendering.
- Highlighting substructures on search (in progress)
- Aligning to scaffolds (in progress)
- Rendering options (in progress)
- Fingerprint-based similarity and diversity analyses (see video)
- Efficient in-memory substructure and similarity searching
The following features are still in the core, but we plan to move them out to this package:
- Property calculators (server-side)
- 3D: coordinate calculation using RDKit, rendering using NGL Viewer
- "Sketch-to-predict": run predictive models as you sketch the molecule
Supports multiple sketchers (MarvinJS, OpenChemLib, Ketcher).
Access the recently sketched structures from the ☰ -> Recent
menu.
☰ -> Favorites
contains your favorite structures. To add current molecule
to the favorites, click on ☰ -> Favorites -> Add to Favorites
.
Out-of-the-box, you can paste SMILES, MOLBLOCK, and InChi keys into the input field, and the sketcher automatically translates it to a structure. In addition to that, you can make sketcher understand other structure notations (such as from your company's internal database of structures) by registering a function annotated in a special way. The following example provides support for Chembl. The important tags are:
--meta.role: converter
: indicates that such a function serves as a value converter--meta.inputRegexp: (CHEMBL[0-9]+)
: RegExp that is evaluated to check if this function is applicable to the user input. The captured group (in this case the whole input) is then passed to this function as a parameter.--output: string smiles { semType: Molecule }
: should return string with the semTypeMolecule
--name: chemblIdToSmiles
--meta.role: converter
--meta.inputRegexp: (CHEMBL[0-9]+)
--connection: Chembl
--input: string id = "CHEMBL1185"
--output: string smiles { semType: Molecule }
select canonical_smiles from compound_structures s
join molecule_dictionary d on s.molregno = d.molregno
where d.chembl_id = @id
A molecule query does not have to be a database query, any function will do. For instance, InChi query is implemented as a Python script.
See also: