forked from datalad-handbook/repro-paper-sketch
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmain.tex
295 lines (246 loc) · 11.2 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is a general template file for the LaTeX package SVJour3
% for Springer journals. Springer Heidelberg 2010/09/16
%
% Copy it to a new file with a new name and use it as the basis
% for your article. Delete % signs as needed.
%
% This template includes a few options for different layouts and
% content for various journals. Please consult a previous issue of
% your journal as needed.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% First comes an example EPS file -- just ignore it and
% proceed on the \documentclass line
% your LaTeX will extract the file if required
\begin{filecontents*}{example.eps}
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 19 19 221 221
%%CreationDate: Mon Sep 29 1997
%%Creator: programmed by hand (JK)
%%EndComments
gsave
newpath
20 20 moveto
20 220 lineto
220 220 lineto
220 20 lineto
closepath
2 setlinewidth
gsave
.4 setgray fill
grestore
stroke
grestore
\end{filecontents*}
%
\RequirePackage{fix-cm}
%
%\documentclass{svjour3} % onecolumn (standard format)
%\documentclass[smallcondensed]{svjour3} % onecolumn (ditto)
%\documentclass[smallextended]{svjour3} % onecolumn (second format)
\documentclass[smallextended,twocolumn]{svjour3} % twocolumn
%
% FIXME: ADD YOUR FAVOURITE JOURNAL NAME HERE
\journalname{Journal of Reproducible Research}
\smartqed % flush right qed marks, e.g. at end of proof
\usepackage{graphicx}
%usepackage{mathptmx} % use Times fonts if available on your TeX system
\usepackage{natbib}
\usepackage{amsmath}
\usepackage{textcomp}
\usepackage{booktabs}
\usepackage{units}
\usepackage{hyperref}
\usepackage{wrapfig}
%\usepackage{todonotes}
\usepackage{csvsimple}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
% This file contains variable definitions with results
\input{results_def.tex}
\onecolumn
% FIXME: ADD YOUR TITLE HERE
\title{Writing a reproducible paper}
% FIXME: ADD YOURSELF AS AN AUTHOR
\author{%
Alice~Allison \and Bob~McBobFace\\}
% FIXME: ADD NAME AND AFFILIATION
\institute{Alice~Allison \at
Max Planck Institute for Human Development, Institute for Open Science, Lentzallee 94, Berlin, Germany
\email{[email protected]}
%Tel.: +123-45-678910\\
%Fax: +123-45-678910\\
%\email{[email protected]} % \\
% \emph{Present address:} of F. Author % if needed
\and
Bob~McBobFace \at
Research Centre Juelich, Institute for Reproducible Research, Juelich Germany
\email{[email protected]}
}
\date{Received: date / Accepted: date}
\maketitle
% Please list all authors that played a significant role in the research
% involved in the article. Please provide full affiliation information
% (including full institutional address, ZIP code and e-mail address) for all
% authors, and identify who is/are the corresponding author(s).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{abstract}
% Abstracts should be up to 300 words and provide a succinct summary of the
% article. Although the abstract should explain why the article might be
% interesting, care should be taken not to inappropriately over-emphasize the
% importance of the work described in the article. Citations should not be used
% in the abstract, and the use of abbreviations should be minimized. If you are
% writing a Research or Systematic Review article, please structure your
% abstract into Background, Methods, Results, and Conclusions.
There are many tools that make it not only possible, but also easy to create reproducible research.
It is often not straightforward to know, however, which tools are available, how
they are to be used, and how they can interoperate with eachother, if necessary.
This paper demonstrates \textit{one way} of using \textit{one set} of tools (Git,
DataLad, \LaTeX\, and Make/Makefiles) interoperably to create a dynamically generated,
automatically reproducible research object.
In this example, DataLad is used to link code, manuscript, and data.
Within a Python script that computes results and figures, DataLad's Python API is used
to retrieve data automatically.
The \LaTeX\ manuscript does not hard-code results or tables, but embeds external
files or variables with results.
The Python script saves its results and figures into the dataset, but also outputs
results in form that they can be embedded into the manuscript as variables.
To orchestrate code execution and \LaTeX\ manuscript compiling, a Makefile is
used.
With this setup, generating a manuscript with freshly computed results is a matter
of running \texttt{make} in a cloned repository.
\keywords{%
open science \and
reproducible research \and
dynamic document generation \and
open source software
}
\end{abstract}
%\todo[inline]{Here is a todo note}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\twocolumn
\section*{Introduction}\label{intro}
Many tools, workflows and tutorials exist or are being developed to make
reproducible science "too easy not to do" \citep{the_turing_way}, e.g. \citet{van2020worcs}
or \cite{peikert_brandmaier_2019}.
The greatest hurdle to using those tools or workflows is in many cases getting to
know about them.
This small example paper is a demonstration of using the tools Git, DataLad, Make,
Python, and \LaTeX\ in conjunction to create a dynamically generated document.
This is by no means the only way to generate such a document, and certainly not the
easiest, but it a demonstration of \textit{one} way of doing it.
It also does not use all features and capabilities of DataLad, but uses it "only"
to link all relevant workflow elements together into a domain agnostic, common data
structure as the basis for decentralized research data management (RDM).
\section*{Methods}\label{methods}
We use DataLad to package all relevant aspects to create an open and transparent
digital research object into a domain agnostic, common data structure, the DataLad
dataset.
This dataset allows for decentralized research data management, a concept that is
visualized in Figure \ref{fig:RDM}.
%
\subsection*{Operation}\label{op}
Since the repository is a DataLad dataset that includes input data as a subdataset
and a script that retrieves data prior to result computation, and given that a
Makefile handles the execution of all relevant commands in the correct order, the
only necessary command to generate this manuscript after cloning it from a repository
hosting site is
%
\begin{verbatim}
make
\end{verbatim}
%
\subsection*{Analysis}
As an example, this repository performs the analysis that is demonstrated in
\href{http://handbook.datalad.org/en/latest/basics/101-130-yodaproject.html}{this section of the DataLad Handbook}
\citep{handbook}:
A K-Nearest-Neighbour Classification analysis on a standard tutorial dataset.
The dataset is the Iris dataset, a small dataset commonly used for introductory
machine-learning tutorials.
It consists of measurements of sepal and petal length and width from three
different species of iris flowers, Virginica, Setosa, and Versicolor.
\begin{figure}
\includegraphics[trim=0 0 0 0,clip,width=0.5\textwidth]{img/decentral_RDM_overview.png}
\caption{A DataLad dataset is a domain agnostic data structure that can serve as
a basis for decentralized data management.
A dataset can include and share any digital research object and its provenance,
and interfaces with a vast amount of third party storage and hosting providers.}
\label{fig:RDM}
\end{figure}
\section*{Results}\label{res}
%
Figure \ref{fig:relation} shows the pairwise relationships between the variables in the dataset.
This figure is created by a script and embedded into the manuscript during \texttt{make}.
\begin{figure}
\includegraphics[width=0.5\textwidth]{img/pairwise_relationships.png} \\
% figure caption is below the figure
\caption{Pairwise relationships in the iris dataset.}
% Give a unique label
\label{fig:relation}
\end{figure}
The overall accuracy of the prediction is \accuracy.
The complete prediction report of the classification, split by flower species
and including the weighted average and the macro average, is displayed in
Table \ref{tab:predictions}.
% create a table with custom variables
\begin{table}
% table caption is above the table
\caption{Prediction metrics embedded as LaTeX variables}
% Give a unique label
\label{tab:predictions}
% For LaTeX tables use
\begin{tabular}{|c|llll|}
\hline
& \textbf{precision} & \textbf{recall} & \textbf{f1-score} & \textbf{support} \\
\hline
Setosa & \SetosaPrecision & \SetosaRecall & \SetosaF & \SetosaSupport \\
Versicolor & \VersicolorPrecision & \VersicolorRecall & \VersicolorF & \VersicolorSupport \\
Virginica & \VirginicaPrecision & \VirginicaRecall & \VirginicaF & \VirginicaSupport \\
\hline
macro avg & \MAPrecision & \MARecall & \MAF & \MASupport \\
weighted avg & \WAPrecision & \WARecall & \WAF & \WASupport \\
\hline
\end{tabular}
\end{table}
Table \ref{tab:predictions} was created with custom variable definitions.
Its is, however, equally possible to simply embed a CSV file as a table, which has
been done with the result file \texttt{prediction\_report.csv} in Table \ref{tab:predictions2}.
% embed a table from a csv file
\begin{table}
% table caption is above the table
\caption{Prediction metrics as imported from csv file}
\label{tab:predictions2}
\begin{tabular}{cllll}
\csvautotabular{prediction_report.csv}
\end{tabular}
\end{table}
\section*{Conclusion}\label{con}
In order to achieve transparent, reproducible, and open science, a manuscript
should entail more than only a PDF, but contain all relevant elements to rerun
an analysis and reproduce a result.
There are many ways to do so, and this is one of them.
%
Note that this demonstration is vastly simplified.
A published paper as a complex real-world example can be found at \\
\href{https://github.com/psychoinformatics-de/paper-remodnav/}{github.com/psychoinformatics-de/paper-remodnav/}.
\begin{acknowledgements}
Open Science stands on the shoulder of giants: Open Source Software tools.
Cite the tools you use. :)
\end{acknowledgements}
\bibliographystyle{spbasic} % basic style, author-year citations
\bibliography{references}
% References can be listed in any standard referencing style that uses a numbering system
% (i.e. not Harvard referencing style), and should be consistent between references within
% a given article.
% Reference management systems such as Zotero provide options for exporting
% bibliographies as Bib\TeX{} files. Bib\TeX{} is a bibliographic tool that is
% used with \LaTeX{} to help organize the user's references and create a
% bibliography. This template contains an example of such a file,
% \texttt{sample.bib}, which can be replaced with your own. Use the
% \verb|\cite| command to create in-text citations, like this
% \cite{Smith:2012qr} and this \cite{Smith:2013jd}.
\clearpage
\end{document}