diff --git a/README.md b/README.md
index d9fa13d..9a486fd 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
# ElementEmbeddings
-
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
@@ -143,21 +142,19 @@ The `composition_featuriser` function can be used to featurise the data. The com
```python
from elementembeddings.composition import composition_featuriser
-df_featurised = composition_featuriser(df, embedding="magpie", stats="mean")
+df_featurised = composition_featuriser(df, embedding="magpie", stats=["mean","sum"])
df_featurised
```
-| formula | mean_Number | mean_MendeleevNumber | mean_AtomicWeight | mean_MeltingT | ... | mean_SpaceGroupNumber |
-|---------|-------------|----------------------|--------------------|-------------------|-----|-----------------------|
-| CsPbI3 | 59.2 | 74.8 | 144.16377238 | 412.55 | ... | 129.20000000000002 |
-| Fe2O3 | 15.2 | 74.19999999999999 | 31.937640000000002 | 757.2800000000001 | ... | 98.80000000000001 |
-| NaCl | 14.0 | 48.0 | 29.221384640000004 | 271.235 | ... | 146.5 |
-| ZnS | 23.0 | 78.5 | 48.7225 | 540.52 | ... | 132.0 |
-
-(The columns of the resulting dataframe have been truncated for clarity.)
+| formula | mean_Number | mean_MendeleevNumber | mean_AtomicWeight | mean_MeltingT | mean_Column | mean_Row | mean_CovalentRadius | mean_Electronegativity | mean_NsValence | mean_NpValence | mean_NdValence | mean_NfValence | mean_NValence | mean_NsUnfilled | mean_NpUnfilled | mean_NdUnfilled | mean_NfUnfilled | mean_NUnfilled | mean_GSvolume_pa | mean_GSbandgap | mean_GSmagmom | mean_SpaceGroupNumber | sum_Number | sum_MendeleevNumber | sum_AtomicWeight | sum_MeltingT | sum_Column | sum_Row | sum_CovalentRadius | sum_Electronegativity | sum_NsValence | sum_NpValence | sum_NdValence | sum_NfValence | sum_NValence | sum_NsUnfilled | sum_NpUnfilled | sum_NdUnfilled | sum_NfUnfilled | sum_NUnfilled | sum_GSvolume_pa | sum_GSbandgap | sum_GSmagmom | sum_SpaceGroupNumber |
+|---------|-------------|----------------------|--------------------|-------------------|-------------|----------|---------------------|------------------------|----------------|----------------|--------------------|--------------------|---------------|-----------------|-----------------|-----------------|-----------------|----------------|------------------|----------------|--------------------|-----------------------|------------|---------------------|-------------------|--------------|------------|---------|--------------------|-----------------------|---------------|---------------|---------------|---------------|--------------|----------------|----------------|----------------|----------------|---------------|--------------------|---------------|--------------|----------------------|
+| CsPbI3 | 59.2 | 74.8 | 144.16377238 | 412.55 | 13.2 | 5.4 | 161.39999999999998 | 2.22 | 1.8 | 3.4 | 8.0 | 2.8000000000000003 | 16.0 | 0.2 | 1.4 | 0.0 | 0.0 | 1.6 | 54.584 | 0.6372 | 0.0 | 129.20000000000002 | 296.0 | 374.0 | 720.8188619 | 2062.75 | 66.0 | 27.0 | 807.0 | 11.100000000000001 | 9.0 | 17.0 | 40.0 | 14.0 | 80.0 | 1.0 | 7.0 | 0.0 | 0.0 | 8.0 | 272.92 | 3.186 | 0.0 | 646.0 |
+| Fe2O3 | 15.2 | 74.19999999999999 | 31.937640000000002 | 757.2800000000001 | 12.8 | 2.8 | 92.4 | 2.7960000000000003 | 2.0 | 2.4 | 2.4000000000000004 | 0.0 | 6.8 | 0.0 | 1.2 | 1.6 | 0.0 | 2.8 | 9.755 | 0.0 | 0.8442651200000001 | 98.80000000000001 | 76.0 | 371.0 | 159.6882 | 3786.4 | 64.0 | 14.0 | 462.0 | 13.98 | 10.0 | 12.0 | 12.0 | 0.0 | 34.0 | 0.0 | 6.0 | 8.0 | 0.0 | 14.0 | 48.775000000000006 | 0.0 | 4.2213256 | 494.0 |
+| NaCl | 14.0 | 48.0 | 29.221384640000004 | 271.235 | 9.0 | 3.0 | 134.0 | 2.045 | 1.5 | 2.5 | 0.0 | 0.0 | 4.0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 26.87041666665 | 1.2465 | 0.0 | 146.5 | 28.0 | 96.0 | 58.44276928000001 | 542.47 | 18.0 | 6.0 | 268.0 | 4.09 | 3.0 | 5.0 | 0.0 | 0.0 | 8.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 53.7408333333 | 2.493 | 0.0 | 293.0 |
+| ZnS | 23.0 | 78.5 | 48.7225 | 540.52 | 14.0 | 3.5 | 113.5 | 2.115 | 2.0 | 2.0 | 5.0 | 0.0 | 9.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 19.8734375 | 1.101 | 0.0 | 132.0 | 46.0 | 157.0 | 97.445 | 1081.04 | 28.0 | 7.0 | 227.0 | 4.23 | 4.0 | 4.0 | 10.0 | 0.0 | 18.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 39.746875 | 2.202 | 0.0 | 264.0 |
-The returned dataframe contains the mean-pooled features of the magpie representation for the four formulas.
+The returned dataframe contains the mean-pooled and sum-pooled features of the magpie representation for the four formulas.
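
As a rough illustration of what the two pooling modes mean, the sketch below reproduces the `mean_Number` and `sum_Number` values for NaCl and Fe2O3 from atomic numbers alone. It is plain Python rather than the package's implementation; the `pool` helper and the hard-coded `ATOMIC_NUMBER` dictionary are assumptions made for this example.

```python
# Illustrative only: hypothetical helper, not part of elementembeddings.
ATOMIC_NUMBER = {"Na": 11, "Cl": 17, "Fe": 26, "O": 8}


def pool(composition, stat):
    """Pool a per-element value over a composition given as {element: amount}."""
    # Weight each element's value by its stoichiometric amount.
    total = sum(ATOMIC_NUMBER[el] * amount for el, amount in composition.items())
    if stat == "sum":
        return float(total)
    if stat == "mean":
        return total / sum(composition.values())
    raise ValueError(f"Unsupported stat: {stat}")


nacl = {"Na": 1, "Cl": 1}
fe2o3 = {"Fe": 2, "O": 3}

print(pool(nacl, "mean"), pool(nacl, "sum"))    # 14.0 28.0
print(pool(fe2o3, "mean"), pool(fe2o3, "sum"))  # 15.2 76.0
```

These agree with the `mean_Number` and `sum_Number` columns in the table above, which suggests the pooling is weighted by the stoichiometric amounts.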
## Development notes
diff --git a/docs/about.md b/docs/about.md
index f6e4275..0450079 100644
--- a/docs/about.md
+++ b/docs/about.md
@@ -7,13 +7,19 @@
[![GitHub issues](https://img.shields.io/github/issues-raw/WMD-Group/ElementEmbeddings)](https://github.com/WMD-group/ElementEmbeddings/issues)
[![CI Status](https://github.com/WMD-group/ElementEmbeddings/actions/workflows/ci.yml/badge.svg)](https://github.com/WMD-group/ElementEmbeddings/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/WMD-group/ElementEmbeddings/branch/main/graph/badge.svg?token=OCMIM5SHL0)](https://codecov.io/gh/WMD-group/ElementEmbeddings)
+[![DOI](https://zenodo.org/badge/493285385.svg)](https://zenodo.org/badge/latestdoi/493285385)
+[![PyPI](https://img.shields.io/pypi/v/ElementEmbeddings)](https://pypi.org/project/ElementEmbeddings/)
+[![documentation](https://img.shields.io/badge/docs-mkdocs%20material-blue.svg?style=flat)](https://wmd-group.github.io/ElementEmbeddings/)
+![python version](https://img.shields.io/pypi/pyversions/elementembeddings)
-The **ElementEmbeddings** package provides high-level tools for analysing elemental
+The **Element Embeddings** package provides high-level tools for analysing elemental
embeddings data. This primarily involves visualising the correlation between
embedding schemes using different statistical measures.
-Motivation
---------
+* **Documentation:**
+* **Examples:**
+
+## Motivation
Machine learning approaches for materials informatics have become increasingly
widespread. Some of these involve the use of deep learning
@@ -22,4 +28,4 @@ rather than specified by the user of the model. While an important goal of
machine learning training is to minimise the chosen error function to make more
accurate predictions, it is also important for us material scientists to be able
to interpret these models. As such, we aim to evaluate and compare different atomic embedding
-schemes in a consistent framework.
\ No newline at end of file
+schemes in a consistent framework.
diff --git a/docs/contribution.md b/docs/contribution.md
index 0f24cb3..7d32b66 100644
--- a/docs/contribution.md
+++ b/docs/contribution.md
@@ -1,6 +1,6 @@
-## Bug reports, feature requests and questions
+# Contributing
-Please use the [Issue Tracker](https://github.com/WMD-group/ElementEmbeddings/issues) to report bugs or request features in the first instance. Contributions are always welcome.
+This is a quick guide on how to follow best practice and contribute smoothly to `ElementEmbeddings`.
## Code contributions
@@ -8,4 +8,56 @@ We are always looking for ways to make `ElementEmbeddings` better and a more use
* Code style should comply with [PEP8](https://peps.python.org/pep-0008/) where possible. [Google's house style](https://google.github.io/styleguide/pyguide.html) is also helpful, including a good model for docstrings.
* Please use comments liberally when adding nontrivial features, and take the chance to clean up other people's code while looking at it.
-* Add tests wherever possible, and use the test suite to check if you broke anything.
\ No newline at end of file
+* Add tests wherever possible, and use the test suite to check if you broke anything.
+
+## Workflow
+
+We follow the GitHub flow, using
+branches for new work and pull requests for verifying the work.
+
+The steps for a new piece of work can be summarised as follows:
+
+1. Push up or create [an issue](https://guides.github.com/features/issues).
+2. Create a branch from main, with a sensible name that relates to the issue.
+3. Do the work and commit changes to the branch. Push the branch
+ regularly to GitHub to make sure no work is accidentally lost.
+4. Write or update unit tests for the code you work on.
+5. When you are finished with the work, ensure that all of the unit
+ tests pass on your own machine.
+6. Open a pull request [on the pull request page](https://github.com/WMD-group/ElementEmbeddings/pulls).
+7. If nobody acknowledges your pull request promptly, feel free to poke one of the main developers into action.
+
+## Pull requests
+
+For a general overview of using pull requests on GitHub look [in the GitHub docs](https://help.github.com/en/articles/about-pull-requests).
+
+When creating a pull request you should:
+
+* Ensure that the title succinctly describes the changes so it is easy to read on the overview page
+* Reference the issue which the pull request is closing
+
+Recommended reading: [How to Write the Perfect Pull Request](https://github.blog/2015-01-21-how-to-write-the-perfect-pull-request/)
+
+## Dev requirements
+
+When developing locally, it is recommended to install the Python packages listed in `requirements-dev.txt`.
+
+```bash
+pip install -r requirements-dev.txt
+```
+
+This will allow you to run the tests locally with pytest as described in the main README,
+as well as run pre-commit hooks to automatically format Python files with isort and black.
+To install the pre-commit hooks (only needs to be done once):
+
+```bash
+pre-commit install
+pre-commit run --all-files # optionally run hooks on all files
+```
+
+Pre-commit hooks will check all files when you commit changes, automatically fixing any files which are not formatted correctly. Those files will need to be staged again before re-attempting the commit.
+
+## Bug reports, feature requests and questions
+
+Please use the [Issue Tracker](https://github.com/WMD-group/ElementEmbeddings/issues) to report bugs or request features in the first instance. Contributions are always welcome.
diff --git a/docs/images/magpie_cosine_sim_heatmap.png b/docs/images/magpie_cosine_sim_heatmap.png
new file mode 100644
index 0000000..90bb4eb
Binary files /dev/null and b/docs/images/magpie_cosine_sim_heatmap.png differ
diff --git a/docs/images/magpie_umap.png b/docs/images/magpie_umap.png
new file mode 100644
index 0000000..6d02152
Binary files /dev/null and b/docs/images/magpie_umap.png differ
diff --git a/docs/installation.md b/docs/installation.md
index 3bdfd9b..2be000f 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -1,18 +1,21 @@
+# Getting Started
-The latest version of the package can be installed using:
+The latest stable release can be installed via pip using:
-```
-pip install git+git://github.com/WMD-group/ElementEmbeddings.git
+```bash
+pip install ElementEmbeddings
```
## Developer's installation (optional)
+
For development work, `ElementEmbeddings` can be installed from a copy of the [source repository](https://github.com/WMD-group/ElementEmbeddings.git); this is preferred if using experimental code branches.
To clone the project from GitHub and make a local installation:
-```
+```bash
git clone https://github.com/WMD-group/ElementEmbeddings.git
cd ElementEmbeddings
pip install -e .
```
-With `-e`, pip will create links to the source folder so that the changes to the code will be reflected on the PATH.
\ No newline at end of file
+
+With `-e`, pip will create links to the source folder so that the changes to the code will be reflected on the PATH.
diff --git a/docs/reference.md b/docs/reference.md
index f8f14e7..8217abf 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -1,14 +1,39 @@
# Elemental Embeddings
-The data contained in this folder is a collection of various elemental representation/embedding schemes
+The data contained in this repository are a collection of various elemental representation/embedding schemes. We provide the literature source for each representation as well as the data source from which the files were obtained. The majority of these representations were obtained from the following repositories:
+
+* [lrcfmd/ElMD](https://github.com/lrcfmd/ElMD/tree/master)
+* [Kaaiian/CBFV](https://github.com/Kaaiian/CBFV/tree/master)
+
+## Linear representations
+
+For the linear/scalar representations, the `Embedding` class will load these representations as one-hot vectors where the vector components are ordered following the scale (i.e. the `atomic` representation is ordered by atomic numbers).
+
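To make the one-hot idea concrete, here is a minimal sketch that assumes a scale is simply an ordered list of element symbols. The `one_hot` helper and the four-element `scale` are hypothetical and only illustrate the encoding; the `Embedding` class may construct its vectors differently.

```python
def one_hot(element, scale):
    """Return a one-hot list with a 1 at the element's position on the scale."""
    vector = [0] * len(scale)
    vector[scale.index(element)] = 1
    return vector


# Hypothetical, truncated scale: the first four elements by atomic number.
scale = ["H", "He", "Li", "Be"]

print(one_hot("Li", scale))  # [0, 0, 1, 0]
```
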
+### Modified Pettifor scale
+
+The following paper describes the details of the modified Pettifor chemical scale:
+[The optimal one-dimensional periodic table: a modified Pettifor chemical scale from data mining](https://iopscience.iop.org/article/10.1088/1367-2630/18/9/093011/meta)
+
+[Data source](https://github.com/lrcfmd/ElMD/blob/master/ElMD/el_lookup/mod_petti.json)
+
+### Atomic numbers
+
+We included `atomic` as a linear representation to generate one-hot vectors corresponding to the atomic numbers.
+
+## Vector representations
+
+The following representations are all vector representations (some are local, some are distributed) and the `Embedding` class will load these representations as they are.
+
+### Magpie
-## Magpie
The following paper describes the details of the Materials Agnostic Platform for Informatics and Exploration (Magpie) framework:
[A general-purpose machine learning framework for predicting properties of inorganic materials](https://www.nature.com/articles/npjcompumats201628)
The source code for Magpie can be found
[here](https://bitbucket.org/wolverton/magpie/src/master/)
+[Data source](https://github.com/Kaaiian/CBFV/blob/master/cbfv/element_properties/magpie.csv)
+
The 22 dimensional embedding vector includes the following elemental properties:
@@ -32,30 +57,36 @@ The 22 dimensional embedding vector includes the following elemental properties:
* Space Group Number
-* `magpie_sc` is scaled version of the magpie embeddings
+* `magpie_sc` is a scaled version of the magpie embeddings. [Data source](https://github.com/lrcfmd/ElMD/blob/master/ElMD/el_lookup/magpie_sc.json)
-## mat2vec
+### mat2vec
The following paper describes the implementation of mat2vec:
[Unsupervised word embeddings capture latent knowledge from materials science literature](https://www.nature.com/articles/s41586-019-1335-8)
-## MatScholar
+[Data source](https://github.com/Kaaiian/CBFV/blob/master/cbfv/element_properties/mat2vec.csv)
+
+### MatScholar
The following paper describes the natural language processing implementation of Materials Scholar (matscholar):
[Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00470)
-## MEGnet
+[Data source](https://github.com/lrcfmd/ElMD/blob/master/ElMD/el_lookup/matscholar.json)
+
+### MEGNet
+
The following paper describes the details of the construction of the MatErials Graph Network (MEGNet):
[Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals](https://doi.org/10.1021/acs.chemmater.9b01294)
-## Modified Pettifor scale
-The following paper describes the details of the modified Pettifor chemical scale:
-[The optimal one dimensional periodic table: a modified Pettifor chemical scale from data mining](https://iopscience.iop.org/article/10.1088/1367-2630/18/9/093011/meta)
+[Data source](https://github.com/lrcfmd/ElMD/blob/master/ElMD/el_lookup/megnet16.json)
+
+### Oliynyk
-## Oliynkyk
The following paper describes the details:
[High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds](https://pubs.acs.org/doi/full/10.1021/acs.chemmater.6b02724)
+[Data source](https://github.com/Kaaiian/CBFV/blob/master/cbfv/element_properties/oliynyk.csv)
+
The 44 features of the embedding vector are formed of the following properties:
Click to see the 44 features!
@@ -106,21 +137,24 @@ The 44 features of the embedding vector are formed of the following properties:
* Cohesive_energy
-* `oliynyk_sc` is scaled version of the oliynyk embeddings
+* `oliynyk_sc` is a scaled version of the oliynyk embeddings. [Data source](https://github.com/lrcfmd/ElMD/blob/master/ElMD/el_lookup/oliynyk_sc.json)
-## Random
+### Random
This is a set of 200-dimensional vectors in which the components are randomly generated
-The 118 200-dimensional vectors in `random_200_new` was generated using the following code:
+The 118 200-dimensional vectors in `random_200_new` were generated using the following code:
```python
import numpy as np
-mu , sigma = 0 , 0.1 # mean and standard deviation s = np.random.normal(mu, sigma, 1000)
+mu, sigma = 0, 1  # mean and standard deviation
s = np.random.default_rng(seed=42).normal(mu, sigma, (118,200))
```
-## SkipAtom
+
+### SkipAtom
The following paper describes the details:
[Distributed representations of atoms and materials for machine learning](https://www.nature.com/articles/s41524-022-00729-3)
+
+[Data source](https://github.com/lantunes/skipatom/blob/main/data/skipatom_20201009_induced.csv)
diff --git a/docs/tutorials.md b/docs/tutorials.md
index 26be149..cfef658 100644
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -4,20 +4,96 @@ Here we will demonstrate how to use some of `ElementEmbeddings`'s features. For
The `Embedding` class lies at the heart of the package. It handles elemental representation data and enables analysis and visualisation.
-```py
-from elementembeddings.core import Embedding
+For simple usage, you can instantiate an Embedding object using one of the embeddings in the [data directory](src/elementembeddings/data/element_representations/README.md). For this example, let's use the magpie elemental representation.
+
+```python
+# Import the class
+>>> from elementembeddings.core import Embedding
# Load the magpie data
-magpie = Embedding.load_data('magpie')
+>>> magpie = Embedding.load_data('magpie')
+```
+
+We can access some of the properties of the `Embedding` class. For example, we can find the dimensions of the elemental representation and the list of elements for which an embedding exists.
+```python
# Print out some of the properties of the Embedding class
+>>> print(f'The magpie representation has embeddings of dimension {magpie.dim}')
+>>> print(f'The magpie representation contains these elements: \n {magpie.element_list}') # prints out all the elements considered for this representation
+>>> print(f'The magpie representation contains these features: \n {magpie.feature_labels}') # Prints out the feature labels of the chosen representation
+
+The magpie representation has embeddings of dimension 22
+The magpie representation contains these elements:
+['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk']
+The magpie representation contains these features:
+['Number', 'MendeleevNumber', 'AtomicWeight', 'MeltingT', 'Column', 'Row', 'CovalentRadius', 'Electronegativity', 'NsValence', 'NpValence', 'NdValence', 'NfValence', 'NValence', 'NsUnfilled', 'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled', 'GSvolume_pa', 'GSbandgap', 'GSmagmom', 'SpaceGroupNumber']
+```
-# Print the dimensions of the embedding
-print(f'The magpie representation has embeddings of dimension {magpie.dim} \n')
+### Plotting
-print(magpie.element_list) # prints out all the elements considered for this representation
+We can quickly generate heatmaps of distance/similarity measures between the element vectors using `heatmap_plotter`, and plot the representations in two dimensions using `dimension_plotter`, both from the plotter module. Before we do that, we will standardise the embedding using the `standardise` method of the `Embedding` class.
-The magpie representation has embeddings of dimension 21
-['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk']
+```python
+from elementembeddings.plotter import heatmap_plotter, dimension_plotter
+import matplotlib.pyplot as plt
+
+magpie.standardise(inplace=True) # Standardises the representation
+
+fig, ax = plt.subplots(1, 1, figsize=(6,6))
+heatmap_params = {"vmin": -1, "vmax": 1}
+heatmap_plotter(embedding=magpie, metric="cosine_similarity",show_axislabels=False,cmap="Blues_r",ax=ax, **heatmap_params)
+ax.set_title("Magpie cosine similarities")
+fig.tight_layout()
+fig.show()
+
+```
+
+![Magpie cosine similarity heatmap](images/magpie_cosine_sim_heatmap.png)
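
The `vmin` and `vmax` limits of -1 and 1 used for the heatmap correspond to the range of cosine similarity. Below is a minimal sketch of the metric in plain Python; the `cosine_similarity` helper is hypothetical and is not the package's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # parallel vectors, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors, 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite vectors, close to -1
```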
+
+```python
+fig, ax = plt.subplots(1, 1, figsize=(6,6))
+
+reducer_params={"n_neighbors": 30, "random_state":42}
+scatter_params = {"s":100}
+
+dimension_plotter(embedding=magpie, reducer="umap",n_components=2,ax=ax,adjusttext=True,reducer_params=reducer_params, scatter_params=scatter_params)
+ax.set_title("Magpie UMAP (n_neighbours=30)")
+ax.legend().remove()
+handles, labels = ax.get_legend_handles_labels()
+fig.legend(handles, labels, bbox_to_anchor=(1.25, 0.5), loc="center right", ncol=1)
+
+fig.tight_layout()
+fig.show()
+```
+
+![Magpie UMAP scatter plot](images/magpie_umap.png)
+
+### Compositions
+
+The package can also be used to featurise compositions. Your data could be a list of formula strings or a pandas dataframe of the following format:
+
+| formula |
+|---------|
+| CsPbI3 |
+| Fe2O3 |
+| NaCl |
+| ZnS |
+
+The `composition_featuriser` function can be used to featurise the data. The compositions can be featurised using different representation schemes and different types of pooling through the `embedding` and `stats` arguments respectively.
+
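The featurisation call that follows expects `df` to exist already. One minimal way to build it with pandas, assuming the column is named `formula` as in the table above:

```python
import pandas as pd

# Single-column dataframe of formula strings, matching the table above.
df = pd.DataFrame({"formula": ["CsPbI3", "Fe2O3", "NaCl", "ZnS"]})

print(df)
```
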
+```python
+from elementembeddings.composition import composition_featuriser
+
+df_featurised = composition_featuriser(df, embedding="magpie", stats=["mean","sum"])
+
+df_featurised
+```
+
+| formula | mean_Number | mean_MendeleevNumber | mean_AtomicWeight | mean_MeltingT | mean_Column | mean_Row | mean_CovalentRadius | mean_Electronegativity | mean_NsValence | mean_NpValence | mean_NdValence | mean_NfValence | mean_NValence | mean_NsUnfilled | mean_NpUnfilled | mean_NdUnfilled | mean_NfUnfilled | mean_NUnfilled | mean_GSvolume_pa | mean_GSbandgap | mean_GSmagmom | mean_SpaceGroupNumber | sum_Number | sum_MendeleevNumber | sum_AtomicWeight | sum_MeltingT | sum_Column | sum_Row | sum_CovalentRadius | sum_Electronegativity | sum_NsValence | sum_NpValence | sum_NdValence | sum_NfValence | sum_NValence | sum_NsUnfilled | sum_NpUnfilled | sum_NdUnfilled | sum_NfUnfilled | sum_NUnfilled | sum_GSvolume_pa | sum_GSbandgap | sum_GSmagmom | sum_SpaceGroupNumber |
+|---------|-------------|----------------------|--------------------|-------------------|-------------|----------|---------------------|------------------------|----------------|----------------|--------------------|--------------------|---------------|-----------------|-----------------|-----------------|-----------------|----------------|------------------|----------------|--------------------|-----------------------|------------|---------------------|-------------------|--------------|------------|---------|--------------------|-----------------------|---------------|---------------|---------------|---------------|--------------|----------------|----------------|----------------|----------------|---------------|--------------------|---------------|--------------|----------------------|
+| CsPbI3 | 59.2 | 74.8 | 144.16377238 | 412.55 | 13.2 | 5.4 | 161.39999999999998 | 2.22 | 1.8 | 3.4 | 8.0 | 2.8000000000000003 | 16.0 | 0.2 | 1.4 | 0.0 | 0.0 | 1.6 | 54.584 | 0.6372 | 0.0 | 129.20000000000002 | 296.0 | 374.0 | 720.8188619 | 2062.75 | 66.0 | 27.0 | 807.0 | 11.100000000000001 | 9.0 | 17.0 | 40.0 | 14.0 | 80.0 | 1.0 | 7.0 | 0.0 | 0.0 | 8.0 | 272.92 | 3.186 | 0.0 | 646.0 |
+| Fe2O3 | 15.2 | 74.19999999999999 | 31.937640000000002 | 757.2800000000001 | 12.8 | 2.8 | 92.4 | 2.7960000000000003 | 2.0 | 2.4 | 2.4000000000000004 | 0.0 | 6.8 | 0.0 | 1.2 | 1.6 | 0.0 | 2.8 | 9.755 | 0.0 | 0.8442651200000001 | 98.80000000000001 | 76.0 | 371.0 | 159.6882 | 3786.4 | 64.0 | 14.0 | 462.0 | 13.98 | 10.0 | 12.0 | 12.0 | 0.0 | 34.0 | 0.0 | 6.0 | 8.0 | 0.0 | 14.0 | 48.775000000000006 | 0.0 | 4.2213256 | 494.0 |
+| NaCl | 14.0 | 48.0 | 29.221384640000004 | 271.235 | 9.0 | 3.0 | 134.0 | 2.045 | 1.5 | 2.5 | 0.0 | 0.0 | 4.0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 26.87041666665 | 1.2465 | 0.0 | 146.5 | 28.0 | 96.0 | 58.44276928000001 | 542.47 | 18.0 | 6.0 | 268.0 | 4.09 | 3.0 | 5.0 | 0.0 | 0.0 | 8.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 53.7408333333 | 2.493 | 0.0 | 293.0 |
+| ZnS | 23.0 | 78.5 | 48.7225 | 540.52 | 14.0 | 3.5 | 113.5 | 2.115 | 2.0 | 2.0 | 5.0 | 0.0 | 9.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 19.8734375 | 1.101 | 0.0 | 132.0 | 46.0 | 157.0 | 97.445 | 1081.04 | 28.0 | 7.0 | 227.0 | 4.23 | 4.0 | 4.0 | 10.0 | 0.0 | 18.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 39.746875 | 2.202 | 0.0 | 264.0 |
-```
\ No newline at end of file
+The returned dataframe contains the mean- and sum-pooled features of the magpie representation for the four formulas.
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index a689055..5fa36f7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -26,6 +26,13 @@ theme:
name: material
custom_dir: docs/.overrides
+
+# Customisation
+extra:
+ version:
+ provider: mike
+
+
plugins:
- mkdocstrings
- search
@@ -39,11 +46,8 @@ markdown_extensions:
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
+ - attr_list
+ - md_in_html
# Configuration
-
-# Customisation
-extra:
- version:
- provider: mike
diff --git a/requirements-dev.txt b/requirements-dev.txt
index fe413b3..843add3 100644
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
@@ -11,4 +11,5 @@ pytest-cov ==4.1.0
mkdocs ==1.4.3
mkdocs-material == 9.1.17
mkdocstrings ==0.21.2
-mkdocstrings-python == 1.2.1
\ No newline at end of file
+mkdocstrings-python == 1.2.1
+mike ==1.1.2
\ No newline at end of file
diff --git a/setup.py b/setup.py
index 0c981d8..a5ee940 100644
--- a/setup.py
+++ b/setup.py
@@ -5,7 +5,7 @@
module_dir = os.path.dirname(os.path.abspath(__file__))
-VERSION = "0.2.0"
+VERSION = "0.3.0"
DESCRIPTION = "Element Embeddings"
with open(os.path.join(module_dir, "README.md"), encoding="utf-8") as f:
LONG_DESCRIPTION = f.read()
@@ -55,6 +55,7 @@
"mkdocs-material==9.1.17",
"mkdocstrings ==0.21.2",
"mkdocstrings-python == 1.2.1",
+ "mike ==1.1.2",
],
},
classifiers=[
diff --git a/src/elementembeddings/core.py b/src/elementembeddings/core.py
index 4d553f4..f21e2bc 100644
--- a/src/elementembeddings/core.py
+++ b/src/elementembeddings/core.py
@@ -14,13 +14,10 @@
import warnings
from itertools import combinations_with_replacement
from os import path
-from typing import Dict, List, Optional, Tuple, Union
+from typing import Dict, List, Optional, Union
-import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
-import seaborn as sns
-from numpy.linalg import norm
from pymatgen.core import Element
from scipy.stats import energy_distance, pearsonr, spearmanr, wasserstein_distance
from sklearn import decomposition
@@ -510,55 +507,6 @@ def create_pairs(self):
ele_pairs = combinations_with_replacement(ele_list, 2)
return ele_pairs
- def stats_correlation_df(self) -> pd.DataFrame:
- """Return a pandas.DataFrame with correlation metrics.
-
- The columns of returned dataframe are:
- [element_1, element_2, pearson_corr, euclid_dist].
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- ele_pairs = self.create_pairs()
- table = []
- for ele1, ele2 in ele_pairs:
- pearson = pearsonr(self.embeddings[ele1], self.embeddings[ele2])
- dist = norm(self.embeddings[ele1] - self.embeddings[ele2])
-
- table.append((ele1, ele2, pearson[0], dist))
- if ele1 != ele2:
- table.append((ele2, ele1, pearson[0], dist))
-
- corr_df = pd.DataFrame(
- table,
- columns=[
- "ele_1",
- "ele_2",
- "pearson_corr",
- "euclid_dist",
- ],
- )
-
- mend_1 = [(Element(ele).mendeleev_no, ele) for ele in corr_df["ele_1"]]
- mend_2 = [(Element(ele).mendeleev_no, ele) for ele in corr_df["ele_2"]]
-
- corr_df["mend_1"] = mend_1
- corr_df["mend_2"] = mend_2
-
- corr_df = corr_df[
- [
- "ele_1",
- "ele_2",
- "mend_1",
- "mend_2",
- "euclid_dist",
- "pearson_corr",
- ]
- ]
-
- return corr_df
-
def compute_correlation_metric(
self, ele1: str, ele2: str, metric: str = "pearson"
) -> float:
@@ -654,23 +602,6 @@ def compute_distance_metric(
)
raise ValueError
- def pearson_pivot_table(self) -> pd.DataFrame:
- """
- Return a pandas.DataFrame style pivot object.
-
- The index and column are the mendeleev number of the element pairs
- and the values being the pearson correlation metrics.
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- corr_df = self.correlation_df()
- pearson_pivot = corr_df.pivot_table(
- values="pearson_corr", index="mend_1", columns="mend_2"
- )
- return pearson_pivot
-
def distance_df(self, metric: str = "euclidean") -> pd.DataFrame:
"""
Return a dataframe with columns ["ele_1", "ele_2", metric].
@@ -812,58 +743,6 @@ def correlation_pivot_table(
)
return correlation_pivot
- def plot_pearson_correlation(self, figsize: Tuple[int, int] = (24, 24), **kwargs):
- """
- Plot the heatmap of the pearson correlation values.
-
- Args:
- figsize (tuple): A tuple of (width, height).
- **kwargs: Other keyword arguments to be passed to sns.heatmap
-
- Returns:
- ax (matplotlib Axes): An Axes object with the heatmap
-
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- pearson_pivot = self.pearson_pivot_table()
-
- plt.figure(figsize=figsize)
- ax = sns.heatmap(
- pearson_pivot, cmap="bwr", square=True, linecolor="k", **kwargs
- )
-
- return ax
-
- def plot_distance_correlation(
- self, metric: str = "euclidean", figsize: Tuple[int, int] = (24, 24), **kwargs
- ):
- """
- Plot the heatmap of the pairwise distance metrics.
-
- Args:
- metric (str): A valid distance metric
- figsize (tuple): A tuple of (width, height)
-
- Returns:
- ax (matplotlib.axes.Axes): An Axes object with the heatmap
-
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- distance_pivot = self.distance_pivot_table(metric=metric)
-
- plt.figure(figsize=figsize)
- ax = sns.heatmap(
- distance_pivot, cmap="bwr", square=True, linecolor="k", **kwargs
- )
-
- return ax
-
def calculate_PC(self, n_components: int = 2, standardise: bool = True, **kwargs):
"""Calculate the principal componenets (PC) of the embeddings.
@@ -946,128 +825,3 @@ def calculate_UMAP(self, n_components: int = 2, standardise: bool = True, **kwar
umap_result = umap.fit_transform(embeddings_array)
self._umap_data = umap_result
return self._umap_data
-
- def plot_PCA_2D(
- self,
- figsize: Tuple[int, int] = (16, 12),
- points_hue: str = "group",
- points_size: int = 200,
- **kwargs,
- ):
- """Plot a PCA plot of the atomic embedding.
-
- Args:
- figsize (tuple): A tuple of (width, height)
- points_size (float): The marker size
-
- Returns:
- ax (matplotlib.axes.Axes): An Axes object with the PCA plot
-
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- embeddings_array = np.array(list(self.embeddings.values()))
- element_array = np.array(self.element_list)
-
- pca = decomposition.PCA(n_components=2) # project to 2 dimensions
-
- pca.fit(embeddings_array)
- X = pca.transform(embeddings_array)
-
- pca_dim1 = X[:, 0]
- pca_dim2 = X[:, 1]
-
- # Create a dataframe to store the dimensions, labels and group info for the PCA
- pca_df = pd.DataFrame(
- {
- "pca_dim1": pca_dim1,
- "pca_dim2": pca_dim2,
- "element": element_array,
- "group": list(self.element_groups_dict.values()),
- }
- )
- fig, ax = plt.subplots(figsize=figsize)
-
- sns.scatterplot(
- x="pca_dim1",
- y="pca_dim2",
- data=pca_df,
- hue=points_hue,
- s=points_size,
- **kwargs,
- ax=ax,
- )
-
- plt.xlabel("Dimension 1")
- plt.ylabel("Dimension 2")
-
- for i in range(len(X)):
- plt.text(x=pca_dim1[i], y=pca_dim2[i], s=element_array[i])
-
- return plt
-
- def plot_tSNE(
- self,
- n_components: str = 2,
- figsize: Tuple[int, int] = (16, 12),
- points_hue: str = "group",
- points_size: int = 200,
- **kwargs,
- ):
- """Plot a t-SNE plot of the atomic embedding.
-
- Args:
- n_components (int): Number of t-SNE components to plot.
- figsize (tuple): A tuple of (width, height)
- points_size (float): The marker size
-
- Returns:
- ax (matplotlib.axes.Axes): An Axes object with the PCA plot
-
-
- """
- warnings.warn(
- "This method is deprecated and will be removed in a future release. ",
- DeprecationWarning,
- )
- embeddings_array = np.array(list(self.embeddings.values()))
- element_array = np.array(self.element_list)
-
- tsne = TSNE(n_components)
- tsne_result = tsne.fit_transform(embeddings_array)
-
- tsne_df = pd.DataFrame(
- {
- "tsne_dim1": tsne_result[:, 0],
- "tsne_dim2": tsne_result[:, 1],
- "element": element_array,
- "group": list(self.element_groups_dict.values()),
- }
- )
- # Create the t-SNE plot
- fig, ax = plt.subplots(figsize=figsize)
- sns.scatterplot(
- x="tsne_dim1",
- y="tsne_dim2",
- data=tsne_df,
- hue=points_hue,
- s=points_size,
- ax=ax,
- )
- # lim = (tsne_result.min()-5, tsne_result.max()+5)
- # ax.set_xlim(lim)
- # ax.set_ylim(lim)
- plt.xlabel("Dimension 1")
- plt.ylabel("Dimension 2")
-
- # Label the points
- for i in range(tsne_df.shape[0]):
- plt.text(
- x=tsne_df["tsne_dim1"][i],
- y=tsne_df["tsne_dim2"][i],
- s=tsne_df["element"][i],
- )
-
- return plt
diff --git a/src/elementembeddings/tests/test_core.py b/src/elementembeddings/tests/test_core.py
index 030d0ca..04b56bd 100644
--- a/src/elementembeddings/tests/test_core.py
+++ b/src/elementembeddings/tests/test_core.py
@@ -3,7 +3,6 @@
import os
import unittest
-import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
@@ -425,11 +424,6 @@ def test_distance_dataframe_functions(self):
assert isinstance(
self.test_magpie.distance_pivot_table(sortby="atomic_number"), pd.DataFrame
)
- assert isinstance(self.test_magpie.plot_distance_correlation(), plt.Axes)
- assert isinstance(
- self.test_magpie.plot_distance_correlation(metric="euclidean"), plt.Axes
- )
- assert isinstance(self.test_magpie.stats_correlation_df(), pd.DataFrame)
def test_remove_elements(self):
"""Test the remove_elements function."""