Question: Searching and filtering mordred error/Missing values #83
Comments
The best way is probably to do |
Thanks |
Are you sure you want to assign them numerical values? If their value is undefined, then assigning them a value (0, for example) may create spurious or anomalous effects in further numerical processing. When descriptors are "missing" (not a number, undefined, infinite, etc.), either leave out the descriptor entirely, or leave out the structure(s) that do not have that descriptor value. Such determinations are done relatively easily in Excel or other spreadsheet applications. The numerical precision of the computer platform may create some issues with numerical values, but it should not result in blank or text values. For example, I sometimes see a column with a bunch of 0 values, but some with 1E-15. Those values are at the precision limit of double-precision floating-point numbers (treat them all as zero). |
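As a minimal sketch of the "treat them all as zero" step, the snippet below snaps values that are within a small absolute tolerance of zero to exactly 0.0. The function name `snap_to_zero`, the tolerance `1e-12`, and the sample column are illustrative assumptions, not anything mordred provides; a suitable tolerance depends on the scale of the descriptor.

```python
import math

def snap_to_zero(values, abs_tol=1e-12):
    """Replace values within abs_tol of zero by exactly 0.0 (abs_tol is an assumed default)."""
    return [0.0 if math.isclose(v, 0.0, abs_tol=abs_tol) else v for v in values]

# Made-up descriptor column containing floating-point rounding noise.
column = [0.0, 1e-15, -3e-16, 0.25]
cleaned = snap_to_zero(column)
print(cleaned)
```

An absolute tolerance is used because a relative tolerance is meaningless when comparing against zero.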
Most downstream ML algorithms are going to require them to have some value, and I think using such algorithms is the goal of many users of this package. Zero may not be best for the reasons you suggest (I would say that zero is "opinionated" i.e. it might be a particularly low or high value for that descriptor), so it may be better to fill them with the column median, or do some other imputation (e.g. nuclear norm). |
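For illustration, column-median imputation might look like the sketch below. It assumes the descriptor table has been exported so that failed calculations appear as text (like the messages quoted in the original question) and that one column is available as a list of cells; `to_float`, `impute_median`, and the sample values are hypothetical. With pandas, the same idea is commonly written with `pd.to_numeric(..., errors="coerce")` followed by `fillna` using the column median.

```python
import math
import statistics

def to_float(cell):
    """Return the cell as a finite float, or None if it is missing or non-numeric."""
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None

def impute_median(cells):
    """Fill missing entries of one descriptor column with the median of the observed values."""
    values = [to_float(c) for c in cells]
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no numeric values; drop it instead of imputing")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Made-up column: two numeric cells, one error message, one NaN, one numeric cell.
raw = ["1.5", "float division by zero (MDEC-14)", "2.5", "nan", "4.0"]
filled = impute_median(raw)
print(filled)
```

In a predictive-modeling setting the median would normally be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.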
Mathematically (as in regardless of anyone's opinion), inserting zero values where no value has been determined contradicts essential requirements for numerical treatments of data by regression, genetic algorithms or artificial neural networks, for example. There may be approaches to dealing with undefined values, but I've yet to see or hear of an instance where simply creating and inserting zero values could withstand a level of scrutiny to warrant publication in a relevant venue. A shorter name for such a practice is "making up data," which has been the reason that many publications have been recalled, recanted, or worse when the circumstances were uncovered. I doubt that those serious about advancing machine learning advocate making up data, even if it does offer essentially instant gratification through substantial simplification of daunting challenges.

Simply calculating all of the descriptors a program offers is absurd, so absurd results should be expected. On the other hand, parsing a chemical formula for elemental composition and then eliminating descriptors for elements not present adds meaningful data. Mordred expects the user to input meaningful data, select relevant descriptors to calculate, and finally, have at least some contextual understanding of the descriptors to interpret and apply them meaningfully.

To quote Albert Einstein, "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." Or, to paraphrase, "Everything should be made as simple as possible, but no simpler."

Conceptually, through ML a model could be "taught" the equations, definitions, evolution, and applicability of far more descriptors than any one person could master. Using that breadth and depth of knowledge, it could develop a deeper understanding of molecular descriptors (the logical, algebraic and other mathematical relationships), the spectrum of molecules, and their myriad properties and activities. With all of this in hand, ML would essentially take over new molecule design for pharma and beyond. I have no doubt that substantial pharma dollars have been and will continue to be directed toward these ends until they are achieved (I suspect they have achieved them in limited contexts, which will continue to expand, and it is all tightly guarded).

In the meantime, molecular descriptors are created and refined by computational scientists. Many (including me) are working on quantitative structure-activity relationships (QSAR) for molecules in biological circumstances, or quantitative structure-property relationships (QSPR) for molecules and physicochemical properties. The 2018 article introducing mordred describes mordred's capabilities, but is focused primarily on the superior quality and breadth of descriptors calculated by mordred versus PaDEL-Descriptor, Dragon and other software packages. This is because those (including ML) engaged in QSPR or QSAR search for and rely on quality tools and quality data, because ultimately the product is no better than its weakest component: "garbage in → garbage out."

"Zero insertion" for instances of undefined descriptor values contradicts the spirit of mordred promoted in their article, and would substantially diminish mordred's utility for QSAR/QSPR. Those bothered by non-numerical values can easily replace them with any values they like in Excel, using simple find and replace. Open-access article here: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y#Sec12

In any case, the current status of this mordred incarnation appears to be "abandoned," so this may all be moot unless someone takes up the project. The most recent mordred activity seems to be in Docker images, e.g. the XenonPy project, where they say they intend to introduce mordred 1.2.0 soon. Regards, plkx |
Yes, I conceded that zero imputation is not a good idea, much more concisely, in my previous comment. But imputation of some kind (typically not zero imputation) is almost always done on some training data in these contexts. Removing entire descriptors because a very small fraction of molecules do not have values for them reduces predictive power, even though it would be appropriate to remove them for statistical inference. These are two different goals. |
Predictive power from undefined descriptors comes from the boundaries within which they are defined and undefined. Numbers, for example, may be characterized by square roots, but "square roots" of letters in the alphabet are not defined. One might try to impute numerical values, such as rank of alphabetic order, but such an approach lacks the rigorous mathematical logic applied in creation of molecular descriptors. Square roots of letters of the alphabet are not invariant properties, since one may choose different rules for ranking letters, or even different alphabets.

The availability of equations to calculate descriptors, and their application to astronomical numbers of compounds which may be generated, for example, as SMILES in spreadsheets through simple concatenation operations, does not confer value on any of the data. There are infinitely many legitimate SMILES strings that represent molecules that could never exist in our universe because they lack chemical logic and context. In the current case, attempting to treat undefined molecular descriptors as if they are defined seems likely to cause more damage in ML predictive models through loss of context. Failure to recognize or capitalize on objective and logical contexts degrades ML outcomes.

A few key points:

Mordred calculates descriptors developed/intended/validated for predominantly covalent carbon compounds. The fewer cationic or anionic atoms in a molecule, the more relevant mordred's descriptors. This still leaves mordred able to generate descriptors for an essentially infinite number of compounds comprising only the elements hydrogen, carbon, nitrogen and oxygen.

Relatively, a very small fraction of molecules that actually exist have atoms other than hydrogen, carbon, nitrogen and oxygen. Many of those don't exist at ambient temperatures, in the presence of air, or in the presence of water (including humid air). Expansion of the set to include halogens, sulfur and phosphorus encompasses the vast majority of the molecules that exist on our planet.

The importance of organic compounds which include metal atoms probably cannot be overstated. They are also far more difficult to model, largely because of the sharp decline in the number of relevant descriptors when metal atoms are included.

Mordred explicitly specifies elements in 327 descriptors, and in 1 specifies halogens, for a total of 328 element-specific descriptors. 200 of those descriptors specify elements other than H, C, N and O. Those elements are Li, Be, B, F, Si, P, S, Cl, Ge, As, Se, Br, Sn, I and Pb.

Through routine use of mordred, I found it useful to develop a means to readily identify and delete irrelevant descriptors from mordred csv output files. This information is in the spreadsheet I have attached. It was derived from one of the supplemental files to the 2018 Mordred paper in the Journal of Cheminformatics. Besides the names and types of descriptors, it now provides easy identification of irrelevant descriptors, so they may be readily removed by index.

There are many elements missing from that list, elements that form "significant compounds" with carbon (significant as in toxic, bioactive, high-value, etc.). Missing elements of significance include (at least) Mg, Al, Ti, V, Fe, Co, Ni, Cu, Zn, Zr, Pd, Hg and Bi. Meaningful descriptors for carbon compounds with most elements are few or not available at all. There are many reasons this is so. For one, elements with electrons in d-orbitals are generally not well modeled by current theories (comprising predominantly molecular mechanics, semiempirical quantum, ab initio quantum, and density functional theories). Treatments for chemical bonds involving d-orbitals, especially covalent-type, are generally specialized within very limited parameters. Furthermore, for elements in the periodic table starting around iodine and higher atomic numbers, relativistic effects increase in significance due to the higher nuclear charge. These are non-trivial effects: gold metal is yellow, and lead-acid batteries owe much of their voltage to relativistic quantum effects.

Nonetheless, I encourage people to continue to pursue chemistry computational and information science, and wish them more successes than failures. This is how endeavors that were considered intractable 30 years ago became almost trivial now, and so may progress continue. Best Regards, |
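The idea of removing element-specific descriptors for elements that do not occur in a data set can be sketched as follows. This is only an illustration under stated assumptions: the mapping from descriptor name to element (`descriptor_elements`) is hypothetical and would in practice come from a curated table such as the spreadsheet mentioned above, and the formula parser handles only plain element symbols with optional counts.

```python
import re

# One capital letter, an optional lowercase letter, then an optional count.
ELEMENT = re.compile(r"([A-Z][a-z]?)\d*")

def elements_in_formula(formula):
    """Return the set of element symbols in a simple formula such as 'C2H5Cl'."""
    return set(ELEMENT.findall(formula))

def relevant_descriptors(descriptor_elements, formulas):
    """Keep descriptors that are element-agnostic (None) or whose element occurs in the data set."""
    present = set().union(*(elements_in_formula(f) for f in formulas))
    return [name for name, element in descriptor_elements.items()
            if element is None or element in present]

# Hypothetical mapping for illustration only.
descriptor_elements = {"MW": None, "MAXsLi": "Li", "NsCl": "Cl"}
formulas = ["C2H6O", "C2H5Cl"]
kept = relevant_descriptors(descriptor_elements, formulas)
print(kept)
```

Filtering before calculation, rather than after, also avoids producing the "empty sequence" style errors for elements that are simply absent.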
In practice, imputation of missing values is both broadly used and improves prediction in many, many cases. This is broadly accepted in the ML community. Predictive modeling challenges, including QSAR applications, have been won using imputation, producing models that outperform nonimputed variants. Perhaps you are unfamiliar with the literature and track record of the technique? |
There is an expectation of informed application of the descriptors. Applying descriptor calculations for lithium compounds where none exist is anything but informed. Perhaps what's being "won" is NOT winning the race to the bottom of least valuable predictions, since it is, after all, relative. The ML literature I am familiar with espouses the value of contextual learning, and seeks to not obliterate context in the quest for ever more "data points." Eventually, better data wins over more data. |
So, what should we do with the missing data in this case? |
@Ragingdemo for descriptors where there are no values at all, drop the column. For columns with missing values, impute the values with some algorithm, like replacing with the mean or median, etc. Also, check out my fork of this repo that is still maintained: https://github.com/JacksonBurns/mordred-community |
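A rough sketch of the first step (dropping columns that have no values at all, and reporting how much is missing in the rest) could look like this. It assumes the descriptor table is held as a dict mapping column names to lists of cells, with failures stored as text; the helper names and the sample values are made up for illustration.

```python
import math

def is_missing(cell):
    """True when a cell does not hold a finite number."""
    try:
        return not math.isfinite(float(cell))
    except (TypeError, ValueError):
        return True

def missing_fraction(cells):
    """Fraction of cells in a non-empty column that are missing."""
    return sum(is_missing(c) for c in cells) / len(cells)

def drop_empty_columns(table):
    """Keep only the columns that contain at least one numeric value."""
    return {name: cells for name, cells in table.items()
            if missing_fraction(cells) < 1.0}

# Made-up table: one column that is entirely errors, one partly missing, one complete.
table = {
    "MAXsLi": ["max() arg is an empty sequence (MAXsLi)"] * 3,
    "MDEC-14": ["0.8", "float division by zero (MDEC-14)", "1.1"],
    "MW": ["46.07", "78.11", "18.02"],
}
kept = drop_empty_columns(table)
report = {name: missing_fraction(cells) for name, cells in kept.items()}
print(report)
```

The per-column missing fraction in `report` can then guide the choice between imputing a column and discarding it.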
@JacksonBurns "import rdkit where l is the list containing smiles notations |
It's difficult to say without knowing the actual species, but some of these descriptors just aren't defined for some molecules. The exact error you get doesn't mean a whole lot unless you are familiar with how each descriptor is calculated. It is easier to simply do the imputation as you have mentioned. |
Hello All:
--
I am computing descriptors for a set of compounds using a code similar to the one shown below:
Some of the test_df cells have errors/missing values with messages like "max() arg is an empty sequence (MAXsLi)". Other datasets have similar messages, like "float division by zero (MDEC-14)". After going through the Mordred descriptor page, I understand what these mean, but I am having difficulty grepping them or assigning them a fixed numerical value for later analysis. My question: is there a way to handle missing values differently in Mordred?
Thanks very much for your time and help.
Ravi
--
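One possible way to search for such cells, assuming the table has been read back as rows of text (for example with `csv.DictReader`) and that every error cell has the form "message (DescriptorName)" as in the two examples quoted above, is to match them with a regular expression and count them per descriptor. The pattern and the function name `summarize_errors` are illustrative assumptions, and a free-text column that happens to end in " (something)" would also match.

```python
import re
from collections import Counter

# Matches cells of the assumed form "<message> (<descriptor name>)".
ERROR_CELL = re.compile(r"^(?P<message>.+) \((?P<descriptor>[^()]+)\)$")

def summarize_errors(rows):
    """Count (descriptor, message) pairs over an iterable of row dicts."""
    counts = Counter()
    for row in rows:
        for cell in row.values():
            match = ERROR_CELL.match(str(cell))
            if match:
                counts[(match["descriptor"], match["message"])] += 1
    return counts

# Made-up rows mimicking the messages quoted in the question.
rows = [
    {"MAXsLi": "max() arg is an empty sequence (MAXsLi)", "MDEC-14": "0.8"},
    {"MAXsLi": "max() arg is an empty sequence (MAXsLi)",
     "MDEC-14": "float division by zero (MDEC-14)"},
]
counts = summarize_errors(rows)
for (descriptor, message), n in counts.items():
    print(descriptor, message, n)
```

Grouping by message makes it easy to see which descriptors fail for every molecule and which fail only occasionally.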
environment
OS/distribution: Windows 10
conda or pip: conda
python version: 3.8.1
library version:
Name Version Build Channel
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
blas 1.0 mkl anaconda
bleach 3.1.0 py_0 conda-forge
boost 1.70.0 py38h79cbd7a_1 conda-forge
boost-cpp 1.70.0 h6a4c333_2 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
cairo 1.16.0 h60892f0_1002 conda-forge
certifi 2019.11.28 py38_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cycler 0.10.0 py_2 conda-forge
decorator 4.4.1 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
entrypoints 0.3 py38_1000 conda-forge
freetype 2.10.0 h563cfd7_1 conda-forge
icc_rt 2019.0.0 h0cc432a_1 anaconda
icu 64.2 he025d50_1 conda-forge
importlib_metadata 1.4.0 py38_0 conda-forge
inflect 4.0.0 py38_1 conda-forge
ipykernel 5.1.4 py38h5ca1d4c_0 conda-forge
ipymol 0.5 pypi_0 pypi
ipython 7.11.1 py38h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jaraco.itertools 5.0.0 py_0 conda-forge
jedi 0.16.0 py38_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
joblib 0.14.1 py_0 anaconda
jpeg 9c hfa6e2cd_1001 conda-forge
json5 0.8.5 py_0 conda-forge
jsonschema 3.2.0 py38_0 conda-forge
jupyter_client 5.3.4 py38_1 conda-forge
jupyter_core 4.6.1 py38_0 conda-forge
jupyterlab 1.1.4 py_0 conda-forge
jupyterlab_server 1.0.6 py_0 conda-forge
kiwisolver 1.1.0 py38he980bc4_0 conda-forge
libblas 3.8.0 8_mkl conda-forge
libcblas 3.8.0 8_mkl conda-forge
libclang 9.0.1 default_hf44288c_0 conda-forge
liblapack 3.8.0 8_mkl conda-forge
libpng 1.6.37 h7602738_0 conda-forge
libsodium 1.0.17 h2fa13f4_0 conda-forge
libtiff 4.1.0 h21b02b4_3 conda-forge
llvm-openmp 9.0.1 2 conda-forge
lz4-c 1.8.3 he025d50_1001 conda-forge
m2w64-gcc-libgfortran 5.3.0 6
m2w64-gcc-libs 5.3.0 7
m2w64-gcc-libs-core 5.3.0 7
m2w64-gmp 6.1.0 2
m2w64-libwinpthread-git 5.0.0.4634.697f757 2
markupsafe 1.1.1 py38hfa6e2cd_0 conda-forge
matplotlib 3.1.2 py38_1 conda-forge
matplotlib-base 3.1.2 py38h2981e6d_1 conda-forge
mistune 0.8.4 py38hfa6e2cd_1000 conda-forge
mkl 2019.5 281 conda-forge
mkl-service 2.3.0 py38hfa6e2cd_0 conda-forge
mordred 1.2.0 pyhe5148d4_0 mordred-descriptor
more-itertools 8.1.0 py_0 conda-forge
msys2-conda-epoch 20160418 1
nbconvert 5.6.1 py38_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
networkx 2.4 py_0 conda-forge
notebook 6.0.3 py38_0 conda-forge
numpy 1.17.5 py38hc71023c_0 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1d hfa6e2cd_0 conda-forge
pandas 0.25.3 py38he350917_0 conda-forge
pandoc 2.9.1.1 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parso 0.6.0 py_0 conda-forge
pickleshare 0.7.5 py38_1000 conda-forge
pillow 7.0.0 py38h9ea1dd6_0 conda-forge
pip 20.0.2 py38_0 conda-forge
pixman 0.38.0 hfa6e2cd_1003 conda-forge
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 3.0.2 py_0 conda-forge
py3dmol 0.8.0 py_0 rmg
pycairo 1.19.0 py38h905957f_0 conda-forge
pygments 2.5.2 py_0 conda-forge
pyparsing 2.4.6 py_0 conda-forge
pyqt 5.12.3 py38h6538335_1 conda-forge
pyqt5-sip 4.19.18 pypi_0 pypi
pyqtwebengine 5.12.1 pypi_0 pypi
pyrsistent 0.15.7 py38hfa6e2cd_0 conda-forge
python 3.8.1 he1f5543_1 conda-forge
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pywin32 225 py38hfa6e2cd_0 conda-forge
pywinpty 0.5.7 py38_0 conda-forge
pyzmq 18.1.1 py38h16f9016_0 conda-forge
qt 5.12.5 h7ef1ec2_0 conda-forge
rdkit 2019.09.3 py38h422b363_0 conda-forge
scikit-learn 0.22.1 py38h6288b17_0 anaconda
scipy 1.3.2 py38h29ff71c_0 anaconda
send2trash 1.5.0 py_0 conda-forge
setuptools 45.1.0 py38_0 conda-forge
six 1.14.0 py38_0 conda-forge
sqlite 3.30.1 hfa6e2cd_0 conda-forge
terminado 0.8.3 py38_0 conda-forge
testpath 0.4.4 py_0 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
tornado 6.0.3 py38hfa6e2cd_0 conda-forge
tqdm 4.42.0 py_0 conda-forge
traitlets 4.3.3 py38_0 conda-forge
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.34.1 py38_0 conda-forge
wincertstore 0.2 py38_1003 conda-forge
winpty 0.4.3 4 conda-forge
xz 5.2.4 h2fa13f4_1001 conda-forge
zeromq 4.3.2 h6538335_2 conda-forge
zipp 2.1.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 hd8a0e53_1 conda-forge