Update paper after carpentries lab review process #554

Open
wants to merge 3 commits into
base: main
114 changes: 114 additions & 0 deletions paper.bib
@@ -190,3 +190,117 @@ @article{gaviria_rojas_dollar_2022
pages = {12979--12990},
file = {Full Text PDF:/Users/carstenschnober/Zotero/storage/PJZDNZTV/Gaviria Rojas et al. - 2022 - The Dollar Street Dataset Images Representing the.pdf:application/pdf}
}


@article{huber_ms2deepscore_2021,
title = {{MS2DeepScore}: a novel deep learning similarity measure to compare tandem mass spectra},
volume = {13},
issn = {1758-2946},
shorttitle = {{MS2DeepScore}},
url = {https://doi.org/10.1186/s13321-021-00558-4},
doi = {10.1186/s13321-021-00558-4},
abstract = {Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of {\textgreater} 100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model’s prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. Furthermore, we demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. 
Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.},
number = {1},
urldate = {2025-02-11},
journal = {Journal of Cheminformatics},
author = {Huber, Florian and van der Burg, Sven and van der Hooft, Justin J. J. and Ridder, Lars},
month = oct,
year = {2021},
keywords = {Deep learning, Mass spectrometry, Metabolomics, Spectral similarity measure, Supervised machine learning},
pages = {84},
file = {Full Text PDF:/Users/svenvanderburg/Zotero/storage/Y3KAXM5F/Huber et al. - 2021 - MS2DeepScore a novel deep learning similarity mea.pdf:application/pdf;Snapshot:/Users/svenvanderburg/Zotero/storage/BIH5UWCE/s13321-021-00558-4.html:text/html},
}

@misc{van_der_burg_dollar_2024,
title = {Dollar street 10 - 64x64x3},
url = {https://zenodo.org/records/10970014},
doi = {10.5281/zenodo.10970014},
abstract = {The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

These are the preprocessing steps that were performed:

- Only take examples with one imagenet\_synonym label
- Use only examples with the 10 most frequently occurring labels
- Downscale images to 64 x 64 pixels
- Split data in train and test
- Store as numpy array

This is the label mapping:

Category: label
day bed: 0
dishrag: 1
plate: 2
running shoe: 3
soap dispenser: 4
street sign: 5
table lamp: 6
tile roof: 7
toilet seat: 8
washing machine: 9


Check out this notebook to see how the subset was created.

The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.},
urldate = {2025-02-11},
publisher = {Zenodo},
author = {van der Burg, Sven},
month = apr,
year = {2024},
keywords = {CC-BY, CIFAR-10, Deep learning, Image classification, Machine learning},
file = {Snapshot:/Users/svenvanderburg/Zotero/storage/QPJDIYXH/10970014.html:text/html},
}
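
The preprocessing steps listed in the abstract of the entry above can be sketched roughly as follows. This is a minimal illustration in plain Python and not the notebook that produced the Zenodo record: the function names `select_top_labels` and `split_train_test`, the record layout, and the toy data are all assumptions made for illustration, and the image downscaling and numpy storage steps are left out because they need an image library.

```python
from collections import Counter

# Label mapping copied from the abstract of the Zenodo record above.
LABEL_MAPPING = {
    "day bed": 0,
    "dishrag": 1,
    "plate": 2,
    "running shoe": 3,
    "soap dispenser": 4,
    "street sign": 5,
    "table lamp": 6,
    "tile roof": 7,
    "toilet seat": 8,
    "washing machine": 9,
}


def select_top_labels(records, n_labels=10):
    """Keep single-label records whose label is among the n most frequent.

    `records` is a hypothetical list of (image_id, list_of_labels) tuples.
    """
    # Only take examples with exactly one label.
    single = [(image_id, labels[0]) for image_id, labels in records if len(labels) == 1]
    # Use only examples with the n most frequently occurring labels.
    counts = Counter(label for _, label in single)
    top = {label for label, _ in counts.most_common(n_labels)}
    return [(image_id, label) for image_id, label in single if label in top]


def split_train_test(examples, test_fraction=0.2):
    """Split a list into (train, test); the last `test_fraction` becomes the test set."""
    n_test = int(len(examples) * test_fraction)
    cut = len(examples) - n_test
    return examples[:cut], examples[cut:]


# Toy data, made up for illustration (not taken from the dataset).
toy_records = [
    ("img1", ["plate"]),
    ("img2", ["plate", "dishrag"]),  # dropped: more than one label
    ("img3", ["dishrag"]),
    ("img4", ["plate"]),
    ("img5", ["dishrag"]),
    ("img6", ["tile roof"]),  # dropped with n_labels=2: least frequent label
]
selected = select_top_labels(toy_records, n_labels=2)
encoded = [(image_id, LABEL_MAPPING[label]) for image_id, label in selected]
train, test = split_train_test(encoded, test_fraction=0.25)
```

In this sketch the split is deterministic and takes the tail of the list as the test set. A real pipeline would more likely shuffle first, and the abstract does not say which approach was used.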

@misc{noauthor_cifar-10_nodate,
title = {{CIFAR}-10 and {CIFAR}-100 datasets},
url = {https://www.cs.toronto.edu/~kriz/cifar.html},
urldate = {2025-02-11},
file = {CIFAR-10 and CIFAR-100 datasets:/Users/svenvanderburg/Zotero/storage/CTXZX76B/cifar.html:text/html},
}

24 changes: 18 additions & 6 deletions paper.md
@@ -83,9 +83,9 @@ The lesson starts by explaining the basic concepts of neural networks,
and then guides learners through the different steps of a deep learning workflow.
After following this lesson,
learners will be able to prepare data for deep learning,
implement a basic deep learning model in Python with Keras,
monitor and troubleshoot the training process, and implement different layer types,
such as convolutional layers.
implement a basic deep learning model in Python with Keras,
and monitor and troubleshoot the training process.
In addition, they will be able to implement and understand different layer types, such as convolutional layers and dropout layers, and apply transfer learning.

We use datasets that carry permissive licenses and were designed for real-world use cases:

@@ -148,16 +148,20 @@ and these can even be included at the level of the lesson content.
In addition, the Carpentries Workbench prioritises accessibility of the content, for example by having clearly visible figure captions
and promoting alt-texts for pictures.

The lesson is split into a general introduction, and 3 episodes that cover 3 distinct increasingly more complex deep learning problems.
The lesson is split into a general introduction and 4 episodes that cover 3 distinct, increasingly complex deep learning problems.
Each of the deep learning problems is approached using the same 10-step deep learning workflow (https://carpentries-incubator.github.io/deep-learning-intro/1-introduction.html#deep-learning-workflow).
By going through the deep learning cycle three times with different problems, learners become increasingly confident in applying this deep learning workflow to their own projects.
We end with an outlook episode. Firstly, the outlook episode discusses a real-world application of deep learning in chemistry [@huber_ms2deepscore_2021]. In addition, it discusses bias in datasets, large language models, and good practices for organising deep learning projects. Finally, we end with ideas for next steps after finishing the lesson.

# Feedback
This course was taught 12 times over the course of 3 years, both online and in-person, by the Netherlands eScience Center
(Netherlands, https://www.esciencecenter.nl/) and Helmholz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
This course was taught 13 times over the course of 4 years, both online and in-person, by the Netherlands eScience Center
(Netherlands, https://www.esciencecenter.nl/) and Helmholtz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
Apart from the core group of contributors, the workshop was also taught at 3 independent institutes, namely:
Member

Suggested change
Apart from the core group of contributors, the workshop was also taught at 3 independent institutes, namely:
Apart from the core group of contributors, the workshop was also taught at at least 3 independent institutes, namely:

University of Wisconsin-Madison (US, https://www.wisc.edu/), University of Auckland (New Zealand, https://www.auckland.ac.nz/),
and EMBL Heidelberg (Germany, https://www.embl.org/sites/heidelberg/).

An up-to-date list of workshops using this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).
Member

Suggested change
An up-to-date list of workshops using this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).
An up-to-date list of workshops that the authors are aware of having used this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).

Member

Suggested change
An up-to-date list of workshops using this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).
An up-to-date list of workshops using this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-lab/deep-learning-intro/blob/main/workshops.md).


In general, adoption of the lesson material by the instructors not involved in the project went well.
The feedback gathered from our own and others' teachings was used to polish the lesson further.

@@ -193,6 +197,13 @@ The results from these 2 workshops are a good representation of the general feed
Table 2: Quality of the different episodes of the workshop as rated by students from 2 workshops taught at the Netherlands eScience Center.
The results from these 2 workshops are a good representation of the general feedback we get when teaching this workshop.

## Carpentries Lab review process
Prior to submitting this paper, the lesson went through a substantial review as part of becoming an official Carpentries Lab (https://carpentries-lab.org/) lesson. This led to a number of improvements to the lesson. In general, the accessibility and user-friendliness improved, for example through updated alt-texts and clearer, more beginner-friendly wording. Additionally, the instructor notes were improved and many missing explanations of important deep learning concepts were added to the lesson.

Most importantly, the reviewers pointed out that the CIFAR-10 [@noauthor_cifar-10_nodate] dataset that we initially used does not have a license. We were surprised to find out that this dataset, which is one of the most widely used datasets in the field of machine learning and deep learning, was actually unethically scraped from the internet without permission from image owners. As an alternative we now use 'Dollar street 10' [@van_der_burg_dollar_2024], a dataset that was adapted for this lesson from The Dollar Street Dataset [@gaviria_rojas_dollar_2022]. The Dollar Street Dataset is representative and contains accurate demographic information to ensure its robustness and fairness, especially for smaller subpopulations. In addition, it is a great entry to teach learners about ethical AI and bias in datasets.
Member

Suggested change
Most importantly, the reviewers pointed out that the CIFAR-10 [@noauthor_cifar-10_nodate] dataset that we initially used does not have a license. We were surprised to find out that this dataset, which is one of the most widely used datasets in the field of machine learning and deep learning, was actually unethically scraped from the internet without permission from image owners. As an alternative we now use 'Dollar street 10' [@van_der_burg_dollar_2024], a dataset that was adapted for this lesson from The Dollar Street Dataset [@gaviria_rojas_dollar_2022]. The Dollar Street Dataset is representative and contains accurate demographic information to ensure its robustness and fairness, especially for smaller subpopulations. In addition, it is a great entry to teach learners about ethical AI and bias in datasets.
Most importantly, the reviewers pointed out that the CIFAR-10 [@noauthor_cifar-10_nodate] dataset that we initially used does not have a license. We were surprised to find out that this dataset, which is one of the most widely used datasets in the field of machine learning and deep learning, was actually unethically scraped from the internet without permission from image owners. As an alternative we now use 'Dollar street 10' [@van_der_burg_dollar_2024], a dataset that was adapted for this lesson from The Dollar Street Dataset [@gaviria_rojas_dollar_2022]. The Dollar Street Dataset is representative and contains accurate demographic information to ensure its robustness and fairness, especially for smaller subpopulations. In addition, it is a great entry point to teach learners about ethical AI and bias in datasets.


You can find all details of the review process on GitHub: https://github.com/carpentries-lab/reviews/issues/25.

# Conclusion
This lesson can be taught as a stand-alone workshop to students already familiar with machine learning and Python.
It can also be taught in a broader curriculum after an introduction to Python programming (for example: @azalee_bostroem_software_2016)
@@ -208,6 +219,7 @@ Nidhi Gowdra (University of Auckland, New Zealand, https://www.auckland.ac.nz/),
Renato Alves and Lisanna Paladin (EMBL Heidelberg, Germany, https://www.embl.org/sites/heidelberg/),
that piloted this workshop at their institutes.
We thank the Carpentries for providing such a great framework for developing this lesson material.
We thank Sarah Brown, Johanna Bayer, and Mike Laverick for giving us excellent feedback on the lesson during the Carpentries Lab review process.
We thank all students enrolled in the workshops that were taught using this lesson material for providing us with feedback.

# References