Question: Searching and filtering mordred error/Missing values #83

ravichas · 2020-02-09T17:10:27Z

Hello All:

--
I am computing descriptors for a set of compounds using code similar to the following:

from rdkit import Chem
from mordred import Calculator, descriptors

# create descriptor calculator with ALL descriptors
suppl = Chem.SmilesMolSupplier('Data/bzr.smi')
mols = [x for x in suppl]
calc = Calculator(descriptors, ignore_3D=True)
test_df = calc.pandas(mols)

Some of the test_df cells contain errors/missing values with messages like "max() arg is an empty sequence (MAXsLi)". Other datasets produce similar messages, such as "float division by zero (MDEC-14)". After going through the Mordred descriptor page, I understand what these mean, but I am having difficulty grepping them or assigning them a fixed numerical value for later analysis. My question: is there a way to handle missing values differently in Mordred?

Thanks very much for your time and help.

Ravi

--

environment

OS/distribution

Windows 10

conda or pip

conda

python version

3.8.1

library version


rgerkin · 2020-02-19T00:03:21Z

The best way is probably test_df.astype(float).fillna(0), which forces all the values to be floats (turning the string errors into NaNs) and then replaces the NaNs with 0.
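A minimal sketch of that idea, using a small hand-built DataFrame in place of real Mordred output. Plain strings stand in for the error cells here, which is an assumption (the real cells are Mordred error objects), and pd.to_numeric with errors="coerce" is swapped in for astype(float) because it turns anything non-numeric into NaN without raising:

```python
import pandas as pd

# Stand-in for Mordred output: plain strings play the role of error cells (assumption).
test_df = pd.DataFrame({
    "MAXsLi": ["max() arg is an empty sequence (MAXsLi)", 1.5, 2.0],
    "MDEC-14": [0.3, "float division by zero (MDEC-14)", 0.7],
    "nAtom": [12, 20, 31],
})

# Coerce every column to numeric; anything that cannot be parsed becomes NaN.
numeric_df = test_df.apply(pd.to_numeric, errors="coerce")

# Count missing values per descriptor before deciding how to handle them.
missing_counts = numeric_df.isna().sum()

# Zero-filling as suggested above; another fill strategy can be used here instead.
filled_df = numeric_df.fillna(0)
```

Inspecting missing_counts first shows how many molecules are affected per descriptor, which helps in choosing between filling, dropping the column, or dropping the rows.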

ravichas · 2020-02-24T17:16:38Z

Thanks

plkx · 2021-01-22T04:22:29Z

Are you sure you want to assign them numerical values? If their value is undefined, then assigning them a value (0, for example) may create spurious or anomalous effects in further numerical processing.

When descriptors are "missing" (not a number, undefined, infinite, etc.), either leave out the descriptor entirely, or leave out the structure(s) that do not have that descriptor value.
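For those working in pandas rather than a spreadsheet, the two options can be sketched roughly as follows. The table is made up for illustration, and it assumes the error cells have already been converted to NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor table after error cells have been converted to NaN.
desc = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [np.nan, 0.5, 0.7], "C": [4.0, np.inf, 6.0]},
    index=["mol1", "mol2", "mol3"],
)

# Treat infinities as undefined too.
desc = desc.replace([np.inf, -np.inf], np.nan)

# Option 1: leave out every descriptor that is undefined for any structure.
complete_descriptors = desc.dropna(axis="columns")

# Option 2: leave out every structure that lacks any descriptor value.
complete_structures = desc.dropna(axis="index")
```

Option 1 keeps all molecules but fewer descriptors; option 2 keeps all descriptors but fewer molecules.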

Such determinations are done relatively easily in Excel, or other spreadsheet applications.

Numerical precision of the computer platform may create some issues with numerical values, but it should not result in blank or text values. For example, I sometimes see a column with a bunch of 0 values, but some with values like 1E-15. Those values are at the precision limit of double-precision floating-point numbers (treat them all as zero).
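A minimal sketch of snapping such round-off values to zero, assuming NumPy. The 1e-12 tolerance is an arbitrary placeholder, not something Mordred defines, and should be chosen relative to the scale of the descriptor:

```python
import numpy as np

values = np.array([0.0, 1e-15, -3e-16, 0.25])

# Hypothetical tolerance; anything smaller in magnitude is treated as round-off.
tol = 1e-12

# Replace near-zero entries with exactly 0.0 and keep everything else unchanged.
cleaned = np.where(np.abs(values) < tol, 0.0, values)
```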

rgerkin · 2021-01-22T05:07:15Z

Most downstream ML algorithms are going to require them to have some value, and I think using such algorithms is the goal of many users of this package. Zero may not be best for the reasons you suggest (I would say that zero is "opinionated" i.e. it might be a particularly low or high value for that descriptor), so it may be better to fill them with the column median, or do some other imputation (e.g. nuclear norm).
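As a rough illustration of the column-median option, here is a sketch on a made-up table, assuming the missing entries are already NaN:

```python
import numpy as np
import pandas as pd

desc = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 10.0],
    "B": [0.2, 0.4, np.nan, 0.6],
})

# Column medians are computed with NaN skipped (the pandas default).
medians = desc.median()

# Fill each column's gaps with that column's own median.
imputed = desc.fillna(medians)
```

Unlike zero, the median always lies within the range of values actually observed for that descriptor.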

plkx · 2021-01-22T08:14:28Z

Mathematically (as in regardless of anyone's opinion), inserting zero values where no value has been determined contradicts essential requirements for numerical treatments of data by regression, genetic algorithms or artificial neural networks, for example. There may be approaches to dealing with undefined values, but I've yet to see or hear of an instance where simply creating and inserting zero values could withstand a level of scrutiny to warrant publication in a relevant venue. A shorter name for such a practice is "making up data," which has been the reason that many publications have been recalled, recanted, or worse when the circumstances were uncovered.

I doubt that those serious about advancing machine learning advocate making up data, even if it does offer essentially instant gratification through substantial simplification of daunting challenges. Simply calculating all of the descriptors a program offers is absurd, so absurd results should be expected.

On the other hand, parsing a chemical formula for elemental composition then eliminating descriptors for elements not present adds meaningful data. Mordred expects the user to input meaningful data, select relevant descriptors to calculate, and finally, have at least some contextual understanding of the descriptors to interpret and apply them meaningfully.
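One way to sketch the formula-parsing step in Python is shown below. The helper elements_in_formula and the descriptor_element mapping are both hypothetical, introduced only for illustration; the parser handles only simple formulas without parentheses or charges, and a real mapping would have to come from a curated table rather than from this snippet:

```python
import re

def elements_in_formula(formula):
    """Return the set of element symbols in a simple formula such as 'C6H5Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

# Hypothetical mapping from descriptor name to the element it requires.
descriptor_element = {
    "MAXsLi": "Li",
    "MAXsssP": "P",
    "MINssssN": "N",
    "MAXsCl": "Cl",
}

# Collect every element that occurs anywhere in the (made-up) dataset.
present = set()
for formula in ["C6H5Cl", "C2H7N"]:
    present |= elements_in_formula(formula)

# Keep only descriptors whose required element actually occurs.
relevant = [name for name, el in descriptor_element.items() if el in present]
```

With no lithium or phosphorus in the dataset, the Li- and P-specific descriptors are never requested, so they cannot produce undefined values.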

To quote Albert Einstein, "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." Or, to paraphrase, "Everything should be made as simple as possible, but no simpler."

Conceptually, through ML a model could be "taught" the equations, definitions, evolution, and applicability of far more descriptors than any one person could master. Using that breadth & depth of knowledge, it could develop deeper understanding of molecular descriptors (the logical, algebraic and other mathematical relationships), the spectrum of molecules and their myriad properties and activities. With all of this in hand, ML would essentially take over new molecule design for pharma and beyond. I have no doubt that substantial pharma dollars have been and will continue to be directed toward these ends until they are achieved (I suspect they have achieved them in limited contexts, which will continue to expand, and it is all tightly guarded).

In the meantime, molecular descriptors are created and refined by computational scientists. Many (including me) are working on quantitative structure-activity relationships (QSAR) for molecules in biological circumstances, or quantitative structure-property relationships (QSPR) for molecules and physicochemical properties. The 2018 article introducing mordred describes mordred's capabilities, but is focused primarily on the superior quality and breadth of descriptors calculated by mordred versus PaDEL-Descriptor, Dragon and other software packages. This is because those engaged in QSPR or QSAR (including ML practitioners) search for and rely on quality tools and quality data; ultimately, the product is no better than its weakest component, which means "garbage in → garbage out." "Zero insertion" for instances of undefined descriptor values contradicts the spirit of mordred promoted in their article, and would substantially diminish mordred's utility for QSAR/QSPR.

For those bothered by non-numerical values, they can be easily replaced with any values one likes in Excel, using simple find and replace.

Open-access article here: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y#Sec12

In any case, the current status of this mordred incarnation appears to be "abandoned," so this may all be moot unless someone takes up the project.

The most recent mordred activity seems to be in Docker images, e.g. the XenonPy project, where they claim to intend to introduce mordred 1.2.0 soon.

Regards,

plkx

rgerkin · 2021-01-22T17:25:51Z

Yes, I conceded that zero imputation is not a good idea, much more concisely, in my previous comment. But imputation of some kind (typically not zero imputation) is almost always done on some training data in these contexts. Removing entire descriptors because a very small fraction of molecules do not have values for them reduces predictive power, even though it would be appropriate to remove them for statistical inference. These are two different goals.

plkx · 2021-01-23T02:21:08Z

Predictive power from undefined descriptors comes from the boundaries within which they are defined and undefined. Numbers, for example, may be characterized by square roots, but "square roots" of letters in the alphabet are not defined. One might try to impute numerical values, such as rank of alphabetic order, but such an approach lacks the rigorous mathematical logic applied in creation of molecular descriptors. Square roots of letters of the alphabet are not invariant properties, since one may choose different rules for ranking letters, or even different alphabets.

The availability of equations to calculate descriptors, and their application to the astronomical numbers of compounds that may be generated (for example, as SMILES in spreadsheets through simple concatenation operations), does not confer value on any of the data. There are infinitely many legitimate SMILES strings that represent molecules that could never exist in our universe because they lack chemical logic and context.

In the current case, attempting to treat undefined molecular descriptors as if they are defined seems likely to cause more damage in ML predictive models through loss of context. Failure to either recognize and/or capitalize on objective and logical contexts degrades ML outcome.

A few key points:

Mordred calculates descriptors developed/intended/validated for predominantly covalent carbon compounds. The fewer cationic or anionic atoms in a molecule, the more relevant mordred's descriptors.

This still leaves mordred able to generate descriptors for an essentially infinite number of compounds comprising only the elements hydrogen, carbon, nitrogen and oxygen. Relatively, a very small fraction of molecules that actually exist have atoms other than hydrogen, carbon, nitrogen and oxygen. Many of those don't exist at ambient temperatures, in the presence of air, or in the presence of water (including humid air).

Expansion of the set to include halogens, sulfur and phosphorus encompasses the vast majority of the molecules that exist on our planet.

The importance of organic compounds which include metal atoms probably cannot be overstated. They are also far more difficult to model, largely because of the sharp decline in the number of relevant descriptors when metal atoms are included.

Mordred explicitly specifies elements in 327 descriptors, and in 1 specifies halogens, for a total of 328 element-specific descriptors. 200 of those descriptors specify elements other than H, C, N and O. Those elements are Li, Be, B, F, Si, P, S, Cl, Ge, As, Se, Br, Sn, I and Pb. Through routine use of mordred, I found it useful to develop means to readily identify and delete irrelevant descriptors from mordred csv output files. This information is in the spreadsheet I have attached. It was derived from one of the supplemental files to the 2018 Mordred paper in the Journal of Cheminformatics. Besides the names and types of descriptors, it now provides easy identification of irrelevant descriptors, so they may be readily removed by index.

There are many elements missing from that list, elements that form "significant compounds" with carbon (significant as in toxic, bioactive, high-value, etc.). Missing elements of significance include (at least) Mg, Al, Ti, V, Fe, Co, Ni, Cu, Zn, Zr, Pd, Hg and Bi.

Meaningful descriptors for carbon compounds with most elements are few or not available at all. There are many reasons this is so. For one, elements with electrons in d-orbitals are generally not well modeled by current theories (comprising predominantly molecular mechanics, semiempirical quantum, ab initio quantum, and density functional theories). Treatments for chemical bonds involving d-orbitals, especially covalent-type, are generally specialized within very limited parameters. Furthermore, for elements in the periodic table starting around iodine and higher atomic numbers, relativistic effects increase in significance due to the higher charges of the atomic nuclei. These are non-trivial effects: gold metal is yellow, and lead provides substantial electric current in batteries, due to relativistic quantum effects.

Nonetheless, I encourage people to continue to pursue computational chemistry and information science, and wish them more successes than failures. This is how endeavors that were considered intractable 30 years ago have become almost trivial now, and so may progress continue.

Best Regards,

plkx
mord_dscrptrs_addns.xlsx

rgerkin · 2021-01-23T03:51:21Z

In practice, imputation of missing values is both broadly used and improves prediction in many, many cases. This is broadly accepted in the ML community. Predictive modeling challenges, including QSAR applications, have been won using imputation, producing models that outperform nonimputed variants. Perhaps you are unfamiliar with the literature and track record of the technique?

plkx · 2021-01-23T05:28:45Z

There is an expectation of informed application of the descriptors.

Applying descriptor calculations for lithium compounds where none exist is anything but informed.

Perhaps what's being "won" is NOT winning the race to the bottom of least valuable predictions, since it is after all, relative.

The ML literature I am familiar with espouses the value of contextual learning, and seeks to not obliterate context in the quest for ever more "data points."

Eventually, better data wins over more data.

Ragingdemo · 2023-12-07T08:48:36Z

So, what should we do with the missing data in this case?
I am also facing the same issue.
Thanks for your time.

JacksonBurns · 2023-12-07T16:22:04Z

@Ragingdemo for descriptors where there are no values at all, drop the column. For columns with some missing values, impute the values with some algorithm, like replacing them with the mean or median.
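A minimal sketch of those two steps in pandas, on a made-up table and assuming the error cells have already been converted to NaN:

```python
import numpy as np
import pandas as pd

desc = pd.DataFrame({
    "MAXsLi": [np.nan, np.nan, np.nan],  # undefined for every molecule
    "MDEC-14": [0.5, np.nan, 1.5],       # undefined for one molecule
    "nAtom": [12.0, 20.0, 31.0],
})

# Drop descriptors that have no values at all.
desc = desc.dropna(axis="columns", how="all")

# Impute the remaining gaps with each column's mean (median works the same way).
desc = desc.fillna(desc.mean())
```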

Also, check out my fork of this repo that is still maintained: https://github.com/JacksonBurns/mordred-community
It has support for modern Python versions and fixes a number of small bugs.

Ragingdemo · 2024-06-10T14:50:03Z

@JacksonBurns
Thank you for the response.
I checked it and implemented the community version, which does help.
However, I have one doubt: for some chemicals I get errors such as "min() arg is an empty sequence (MINssssN)", and for some descriptors "max() arg is an empty sequence (MAXsssP)". So, should I do what you suggested and impute values using some algorithm?
I would also like to know whether I am calculating the descriptors correctly, or whether there are issues with my code, listed below:

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
import mordred
from mordred import Calculator, descriptors
import pandas as pd
mol_list=[]
for smiles in l:
    mol=Chem.MolFromSmiles(smiles)
    mol=Chem.AddHs(mol)
    mol_list.append(mol)
calc=Calculator(descriptors,ignore_3D=True)
mol=pd.DataFrame(mol_list)
all_desc=calc.pandas(mol[0])

where l is the list containing the SMILES strings.
Thank you for your time.

JacksonBurns · 2024-06-10T15:55:13Z

It's difficult to say without knowing the actual species, but some of these descriptors just aren't defined for some molecules. The exact error you get doesn't mean a whole lot unless you are familiar with how each descriptor is calculated. It is easier to simply do the imputation as you have mentioned.
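If you still want to search or tally the messages before imputing, a small sketch along these lines may help. It assumes the messages have been collected as plain strings (how you extract them from the table is left open) and that they follow the "<reason> (<descriptor>)" shape seen in the examples in this thread; parse_error and MESSAGE_RE are hypothetical names:

```python
import re
from collections import Counter

# Assumed message shape, taken from the examples in this thread: "<reason> (<descriptor>)".
MESSAGE_RE = re.compile(r"^(?P<reason>.*) \((?P<name>[^()]+)\)$")

def parse_error(message):
    """Split an error message into (reason, descriptor name), or return None."""
    match = MESSAGE_RE.match(message)
    return (match["reason"], match["name"]) if match else None

messages = [
    "max() arg is an empty sequence (MAXsLi)",
    "float division by zero (MDEC-14)",
    "min() arg is an empty sequence (MINssssN)",
    "max() arg is an empty sequence (MAXsssP)",
]

# Parse every message, then count how often each failure reason occurs.
parsed = [parse_error(m) for m in messages]
reason_counts = Counter(reason for reason, _ in parsed)
```

Grouping by reason shows at a glance which descriptors fail because a required atom type is absent (the empty-sequence messages) and which fail for arithmetic reasons.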
