\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in]{geometry}
\usepackage{amsmath,amssymb}
% Use adjustwidth environment to exceed column width (see example table in text)
\usepackage{changepage}
% Use Unicode characters when possible
\usepackage[utf8x]{inputenc}
% textcomp package and marvosym package for additional characters
\usepackage{textcomp,marvosym}
% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}
\bibliographystyle{unsrt}
\usepackage{url}
% Use nameref to cite supporting information files (see Supporting Information section for more info)
\usepackage{nameref,hyperref}
% line numbers
\usepackage[right]{lineno}
% ligatures disabled
\usepackage{microtype}
\DisableLigatures[f]{encoding = *, family = * }
% color can be used to apply background shading to table cells only
\usepackage[table]{xcolor}
% array package and thick rules for tables
\usepackage{array}
% create "+" rule type for thick vertical lines
\newcolumntype{+}{!{\vrule width 2pt}}
% create \thickcline for thick horizontal lines of variable length
\newlength\savedwidth
\newcommand\thickcline[1]{%
\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\cline{#1}%
\noalign{\vskip\arrayrulewidth}%
\noalign{\global\arrayrulewidth\savedwidth}%
}
% \thickhline command for thick horizontal lines that span the table
\newcommand\thickhline{\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\hline
\noalign{\global\arrayrulewidth\savedwidth}}
% Remove comment for double spacing
%\usepackage{setspace}
%\doublespacing
% Text layout
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in
\textheight 8.75in
% Bold the 'Figure #' in the caption and separate it from the title/caption with a period
% Captions will be left justified
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,justification=raggedright,singlelinecheck=off]{caption}
\renewcommand{\figurename}{Fig}
% Use the PLoS provided BiBTeX style
\bibliographystyle{plos2015}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Leave date blank
\date{}
% Header and Footer with logo
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\setlength{\headheight}{27.023pt}
\lhead{\includegraphics[width=2.0in]{PLOS-submission.eps}}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\lfoot{\sf PLOS}
%% Include all macros below
\newcommand{\lorem}{{\bf LOREM}}
\newcommand{\ipsum}{{\bf IPSUM}}
%% END MACROS SECTION
\usepackage{fullpage}
\usepackage{multirow}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{enumitem}
\usepackage{booktabs}
\usepackage{graphicx}
\graphicspath{ {figures/} }
\setlength{\parskip}{\baselineskip}%
\setlength{\parindent}{0pt}%
\begin{document}
\vspace*{0.2in}
\linenumbers
% Title must be 250 characters or less.
\begin{flushleft}
{\Large
\textbf\newline{Women are Underrepresented in Computational Biology: an analysis of the scholarly literature in biology, computer science and computational biology}
}
\newline
% Insert author names, affiliations and corresponding author email (do not include titles, positions, or degrees).
\\
Kevin S. Bonham\textsuperscript{1*},
Melanie I. Stefan\textsuperscript{2},
\\
\bigskip
\textbf{1} Microbiology and Immunobiology, Harvard Medical School. Boston, MA, USA
\\
\textbf{2} {Centre for Integrative Physiology, Edinburgh Medical School. Biomedical Sciences, University of Edinburgh. Edinburgh, UK}
\\
\bigskip
% Use the asterisk to denote corresponding authorship and provide email address in note below.
* kevbonham\@gmail.com
\section*{Abstract}
While women are generally underrepresented in STEM fields, there are noticeable differences between fields. For instance, the gender ratio in biology is more balanced than in computer science. We were interested in how this difference is reflected in the interdisciplinary field of computational/quantitative biology. To this end, we examined the proportion of female authors in publications from the PubMed and arXiv databases. There are fewer female authors on research papers in computational biology, as compared to biology in general. This is true across authorship position, year, and journal impact factor. A comparison with arXiv shows that quantitative biology papers have a higher ratio of female authors than computer science papers, placing computational biology in between its two parent fields in terms of gender representation. Both in biology and in computational biology, a female last author increases the probability of other authors on the paper being female, pointing to a potential role of female PIs in influencing the gender balance.
\section*{Author Summary}
There are fewer women than men working in Science, Technology, Engineering and Mathematics (STEM). However, some fields within STEM are more gender-balanced than others. For instance, biology has a relatively high proportion of women, whereas there are few women in computer science. But what about computational biology? As an interdisciplinary STEM field, would its gender balance be close to one of its ``parent'' fields, or in between the two? To investigate this question, we examined authorship data from databases of scholarly publications in biology, computational biology, and computer science. We found that computational biology lies in between computer science and biology, as far as female representation goes. This is independent of other factors, e.g. year of publication. This suggests that computational biology might provide an environment that is more conducive to female participation than other areas of computer science. In both biology and computational biology, we also found that if the last author on a publication (usually the person leading the study) is a woman, then there tend to be more women in other authorship positions. This suggests that having women in leadership positions might be beneficial for overall gender balance, though our data do not allow us to uncover the underlying mechanism.
\section{Introduction}
There is ample literature on the underrepresentation of women in STEM fields and the biases contributing to it. Those biases, though often subtle, are pervasive in several ways: they are often held and perpetuated by both men and women, and they are apparent across all aspects of academic and scientific practice. Undergraduate students show bias in favor of men when rating both their peers \cite{Grunspan2016} and their professors \cite{MacNell2014}. Professors, in turn, are more likely to respond to e-mail from prospective students who are male \cite{Milkman2015}. They also show gender bias when hiring staff and deciding on a starting salary \cite{Moss-Racusin2012}.
When looking at research output in the form of publication and impact, the story is complex: Women tend to publish less than men \cite{Lariviere2013}, are underrepresented in the more prestigious first and last author positions, and publish fewer single-author papers \cite{West2013}. In mathematics, women tend to publish in lower-impact journals \cite{Mihaljevic-Brandt2016}, while in engineering, women publish in journals with higher impact factors \cite{Ghiasi2015}. In general, however, articles authored by women are cited less frequently than articles authored by men \cite{Lariviere2013,Ghiasi2015}, which might in part be due to men citing their own work more often than women do \cite{King2016}. Inferring bias in these studies is difficult, since the cause of the disparity between male and female authorship cannot be readily determined. At the same time, when stories of scientific discoveries are told, gender biases are readily identified: Work by female scientists is more likely to be attributed to a male colleague \cite{Rossiter1993}, and biographies of successful female scientists perpetuate gender stereotypes \cite{Fara2013}. Finally, the way in which evidence for gender bias is received is in itself biased: Male scientists are less likely to accept studies that point to the existence of gender bias than are their female colleagues \cite{Handley2015}.
Although gender imbalance seems to be universal across all aspects of the scientific enterprise, there are also more nuanced effects. In particular, not all disciplines are equally affected. For instance, in the biosciences over half of PhD recipients are now women, while in computer science, it is less than 20\% \cite{NSF2015}. This raises an intriguing question, namely how do the effects of gender persist in interdisciplinary fields where the parent fields are discordant for female representation?
To address this question, we investigated the gender balance in computational biology, a relatively young field at the disciplinary intersection between biology and computer science, and how it compares to other areas of biology. We examined authorship on papers from PubMed published between 1997 and 2014 and compared computational biology to biology in general. We found that in computational biology, there is a smaller proportion of female authors overall, and a lower proportion of female authors in first and last authorship positions than in all biological fields combined. This is true across all years, though the gender gap has been narrowing, both in computational biology and in biology overall. A comparison to computer science papers shows that computational biology stands between biology and computer science in terms of gender equality.
\section*{Results and Discussion}
In order to determine whether the gender of authors differs between computational biology and biology as a whole, we used data from PubMed, a database of biology and biomedical publications administered by the US National Library of Medicine. PubMed uses Medical Subject Heading (MeSH) terms to classify individual papers by subject. The MeSH term ``Computational Biology'' is a subset of ``Biology'' and was introduced in 1997, so we restricted our analysis to primary articles published from 1997 onward (see S1 Fig A-B, Materials and Methods).
To determine the gender of authors, we used the web service Gender-API.com, which curates a database of first names and associated genders from government records as well as social media profiles. Gender-API searches provide information on the likely gender as well as confidence in the estimate based on the number of times a name appears in the database. We used bootstrap analysis to estimate the probability ($P_{female}$) that an author in a particular dataset is female as well as a 95\% confidence interval (see Materials and Methods).
We validated this method by comparing it to a set of 2155 known author:gender pairs from the biomedical literature provided by Filardo et al.\ \cite{Filardo2016}. Filardo and colleagues manually determined the genders of the first authors for over 3000 papers by searching for authors’ photographs on institutional web pages or social media profiles like LinkedIn. We compared the results obtained from our method of computational inference of gender for a subset of these data (see Materials and Methods) to the known gender composition of this author set. Inferring author gender using Gender-API data suggested that $P_{female} = 0.373 \pm 0.023$ (S1 Fig C, black bar). Because the actual gender of each of these authors is known, we could also calculate the actual $P_{female}$. Using the same bootstrap method on actual gender (known female authors were assigned $P_{female} = 1$, known male authors were assigned $P_{female} = 0$), we determined that the real $P_{female} = 0.360 \pm 0.018$ (S1 Fig C, white bar).
Unfortunately, 43\% of the names used to query Gender-API did not have associated gender information. These names, representing 26.6\% of authors, were therefore excluded from our analysis. In order to ensure that this was not systematically skewing our results, we also determined the $P_{female}$ in Filardo et al.'s known gender dataset excluding those authors with names that were not associated with a Gender-API record, giving $P_{female} = 0.381 \pm 0.027$ (S1 Fig C, white bar). Together, these results suggest that our method of automatically assigning gender using Gender-API gives comparable results to human-validated gender assignment, and that excluding names without clear gender information does not lead us to underestimate the proportion of women in our dataset.
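Written out as intervals, and assuming that each $\pm$ value is the half-width of a 95\% confidence interval, the three estimates above are:
\begin{align*}
\text{Gender-API inference:} &\quad 0.373 \pm 0.023 \;\Rightarrow\; [0.350,\ 0.396] \\
\text{known gender, all authors:} &\quad 0.360 \pm 0.018 \;\Rightarrow\; [0.342,\ 0.378] \\
\text{known gender, names with a Gender-API record:} &\quad 0.381 \pm 0.027 \;\Rightarrow\; [0.354,\ 0.408]
\end{align*}
All three intervals share the range $[0.354,\ 0.378]$, which is the sense in which the estimates are comparable.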
We began our investigation of the gender make-up in biology and computational biology publications by analyzing the gender representation in primary publications from 1997 to 2014. Consistent with previous publications, women were substantially less likely to be in senior author positions than first author positions in publications labeled with the Biology (bio) MeSH term (last author, $P_{female} = 0.245 \pm 0.002$; first author, $P_{female} = 0.376 \pm 0.003$; Fig 1A, Table 1). We observed the same trend in papers labeled with the Computational Biology (comp) MeSH term, though the $P_{female}$ at every author position was roughly 4--6 percentage points lower. An analysis of publications by year suggests that the gender gaps in both biology and computational biology are narrowing, but by less than 1 percentage point per year (for bio, change in $P_{female} = 0.0035 \pm 0.0005$ per year; for comp, change in $P_{female} = 0.0049 \pm 0.0008$ per year). However, the discrepancy between biology and computational biology has been consistent over time (Fig 1B).
\begin{figure}[!h]
\caption{
A: Mean probability that an author in a given position is female for primary articles indexed in PubMed with the MeSH term Biology (black) or Computational Biology (grey). The bio dataset is inclusive of papers in the comp dataset. Error bars represent 95\% confidence intervals. B: Mean probability that an author is female for publications in a given year. Error bars represent 95\% confidence intervals. C: Mean probability that the first (F), second (S), penultimate (P) or other (O) author is female for publications where the last author is male ($P_{female} < 0.2$) or female ($P_{female} > 0.8$). Papers where the gender of the last author was uncertain or could not be determined were excluded. Error bars represent 95\% confidence intervals.}
\label{fig1}
\end{figure}
\begin{table}[]
\centering
\caption{Proportion of Female Authors}
\label{Table 1}
\begin{tabular}{llccc}
\toprule
& & & \multicolumn{2}{c}{95\% CI} \\
\cmidrule(r){4-5}
Dataset & Position & Mean & lower & upper \\
\midrule
bio & first & 0.376 & 0.373 & 0.378 \\
& second & 0.379 & 0.376 & 0.381 \\
& other & 0.368 & 0.367 & 0.370 \\
& penultimate & 0.279 & 0.277 & 0.282 \\
& last & 0.245 & 0.243 & 0.247 \\
comp & first & 0.316 & 0.312 & 0.320 \\
& second & 0.322 & 0.317 & 0.327 \\
& other & 0.331 & 0.328 & 0.333 \\
& penultimate & 0.236 & 0.231 & 0.241 \\
& last & 0.207 & 0.203 & 0.211 \\
\bottomrule
\end{tabular}
\end{table}
One possible explanation for the difference in male and female authorship position might be a difference in role models or mentors. If true, we would expect studies with a female principal investigator to be more likely to attract female collaborators. Conventionally in biology, the last author on a publication is the principal investigator on the project. Therefore, we looked at two subsets of our data: publications with a female last author ($P_{female} > 0.8$) and those with a male last author ($P_{female} < 0.2$). We found that women were substantially more likely to be authors at every other position if the paper had a female last author than if the last author was male (Fig 1C, Table 2). It is possible that female trainees are more likely to pursue computational biology if they have a mentor who is also female. Since women are less likely to be senior authors, this might reduce the proportion of women overall. However, we cannot determine if the effect we observe is instead due to a tendency for women who pursue computational biology to select female mentors.
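The split by last-author gender can be sketched in Python as follows. This is only an illustration of the thresholds given above: the function names \texttt{last\_author\_group} and \texttt{split\_by\_last\_author}, the toy bylines, and the choice to skip single-author papers (which have no separate last author) are assumptions made for this sketch.

\begin{verbatim}
```python
def last_author_group(p_female, low=0.2, high=0.8):
    """Classify a paper by its last author's P_female.

    Returns "female", "male", or None when the paper is excluded.
    """
    if p_female is None:
        return None  # gender could not be determined
    if p_female > high:
        return "female"
    if p_female < low:
        return "male"
    return None  # uncertain gender, excluded from this analysis


def split_by_last_author(papers):
    """Group bylines by the inferred gender of the last author.

    papers: list of bylines, each a list of P_female values
    (or None) in author order.
    """
    groups = {"female": [], "male": []}
    for byline in papers:
        if len(byline) < 2:
            continue  # assumption: single-author papers are skipped
        group = last_author_group(byline[-1])
        if group is not None:
            groups[group].append(byline)
    return groups


# Toy bylines, made up for illustration only.
papers = [[0.9, 0.1, 0.95], [0.3, 0.05], [0.7, 0.5], [0.99]]
groups = split_by_last_author(papers)
```
\end{verbatim}

With these toy bylines, one paper falls in each group and the remaining two are excluded, one for an uncertain last author and one for having a single author.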
\begin{table}[]
\centering
\caption{Proportion of Female Authors with Female PI}
\label{Table 2}
\begin{tabular}{llcccccc}
\toprule
& & \multicolumn{3}{c}{Male Last Author} & \multicolumn{3}{c}{Female Last Author} \\
\cmidrule(r){3-8}
& & & \multicolumn{2}{c}{95\% CI} & & \multicolumn{2}{c}{95\% CI} \\
\cmidrule(r){4-5}\cmidrule(r){7-8}
Dataset & Position & Mean & lower & upper & Mean & lower & upper \\
\midrule
bio & first & 0.362 & 0.359 & 0.365 & 0.478 & 0.472 & 0.484 \\
& second & 0.359 & 0.357 & 0.362 & 0.460 & 0.454 & 0.466 \\
& other & 0.355 & 0.353 & 0.357 & 0.425 & 0.421 & 0.428 \\
& penultimate & 0.259 & 0.256 & 0.263 & 0.336 & 0.330 & 0.343 \\
comp & first & 0.305 & 0.300 & 0.311 & 0.390 & 0.378 & 0.402 \\
& second & 0.306 & 0.300 & 0.312 & 0.379 & 0.366 & 0.392 \\
& other & 0.321 & 0.318 & 0.324 & 0.368 & 0.361 & 0.376 \\
& penultimate & 0.223 & 0.218 & 0.229 & 0.263 & 0.249 & 0.277 \\
\bottomrule
\end{tabular}
\end{table}
Though MeSH terms enable sorting a large number of papers regardless of where they are published, the assignment of these terms is a manual process and may not be comprehensive for all publications. As another way to qualitatively examine gender differences in publishing, we examined different journals, since some journals specialize in computational papers, while others are more general. We looked at the 123 journals that had at least 1000 authors in our bio dataset, and determined $P_{female}$ for each journal separately (Fig 2A). Of these journals, 21 (14\%) have titles indicative of computational biology or bioinformatics, and these journals have substantially lower representation of female authors. The 3 journals with the lowest female representation and 6 out of the bottom 10 are all journals focused on studies using computational methods. Only 4 computational biology/bioinformatics journals are above the median of female representation.
\begin{figure}[!h]
\caption{
A: Mean probability that an author is female for every journal that had at least 1000 authors in our dataset. Grey bars represent journals that have the words “Bioinformatics,” “Computational,” “Computer,” “System(s),” or “omic(s)” in their title. Vertical line represents the median for female author representation. See also S1 Table. B: Mean probability that an author is female for articles in the “Bio” dataset (black dot) or in the “Comp” dataset (open square) for each journal that had at least 1000 authors plotted against the journals’ 2014 impact factor. Journals that had computational biology articles are included in both datasets. An ordinary least squares regression was performed for each dataset. Bio: $m = -0.00264$, $P_{Z>|z|} = 0.0022$. Comp: $m = -0.00079$, $P_{Z>|z|} = 0.568$.}
\label{fig2}
\end{figure}
Women may be less likely to publish in high-impact journals, so we considered the possibility that the differences in author gender that we observe result from differences in impact factor between the journals that publish biology papers and those that publish computational biology papers. We compared the $P_{female}$ of authors in each journal with that journal’s 2014 impact factor (Fig 2B). There is a small but significant negative correlation (slope $m = -0.00264$, $P_{Z > |z|} = 0.0022$) between impact factor and $P_{female}$ for the biology dataset. This is in contrast to previous studies from engineering that have found that women tend to publish in higher-impact journals \cite{Ghiasi2015}. It is, however, consistent with a previous study from mathematics \cite{Mihaljevic-Brandt2016}. By contrast, there is no significant correlation ($P_{Z > |z|} = 0.568$) between impact factor and $P_{female}$ in computational biology publications. Further, for journals that have articles labeled with the computational biology MeSH term, the $P_{female}$ for those articles is the same or lower than that for all biology publications in the same journal.
We also examined whether computational biology or biology articles tend to be published in journals with higher impact factors. Bootstrap analysis of authors in each dataset suggests that computational biology publications tend to be published in journals with a higher impact factor ($\bar{IF} = 7.25 \pm 0.04$) than publications in biology as a whole ($\bar{IF} = 6.5 \pm 0.02$). However, given the magnitude of the correlation between IF and $P_{female}$, this difference is unlikely to explain the differences in $P_{female}$ observed between our computational biology and biology datasets. Taken together, these data suggest that the authors of computational biology papers are less likely to be women than the authors of biology papers generally.
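As a rough check of this argument, assuming that the slope fitted for the biology dataset applies linearly across the observed range of impact factors, the difference in mean impact factor would shift $P_{female}$ by only about 0.2 percentage points:
\begin{equation*}
\Delta P_{female} \approx m \times \Delta \bar{IF} = -0.00264 \times (7.25 - 6.5) \approx -0.002 .
\end{equation*}
This is more than an order of magnitude smaller than the gap of roughly 4--6 percentage points between the two datasets (Table 1).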
We turned next to an investigation of biological fields relative to computer science. Since PubMed does not index computer science publications, we cannot compare the computational biology dataset to computer science research papers directly. Instead, we investigated the gender balance of authors of manuscripts submitted to arXiv, a preprint repository for academic papers used frequently by quantitative fields like mathematics and physics. These preprint records cannot be compared to peer-reviewed publications indexed on PubMed, but a ``quantitative biology'' (qb) section was added to arXiv in 2003. Quantitative biology is not necessarily equivalent to computational biology, and analysis of arXiv-qb papers that have been published and indexed on PubMed suggests that only a fraction of them are labeled with the ``computational biology'' MeSH term. However, this does allow us to make an apples-to-apples comparison between a field of biology and computer science. There are relatively few preprints prior to 2007, so we compared preprints in ``quantitative biology'' to those in ``computer science'' from 2007-2016.
Women were more likely to be authors in quantitative biology manuscripts than in computer science manuscripts in first, second, and middle author positions (Fig 3A, Table 3). We found no significant difference in the frequency of female authors in the last or penultimate author positions in these two datasets, though the conventions for determining author order are not necessarily the same in computer science as in biology. Nevertheless, women had higher representation in quantitative biology than in computer science for all years except 2009 (Fig 3B). Interestingly, there is a slight but significant ($0.0052$ per year, $P_{Z > |z|} < 0.005$) increase in the proportion of female authors over time in quantitative biology, while there is no significant increase in female representation in computer science preprints.
\begin{figure}[!h]
\caption{
A: Mean probability that an author in a given position is female for all preprints in the arXiv quantitative biology (black) or computer science (grey) categories between 2007 and 2014. Error bars represent 95\% confidence intervals. B: Mean probability of authors being female in arXiv preprints in a given year. Error bars represent 95\% confidence intervals. Slopes were determined using ordinary least squares regression. The slope for q-bio is slightly positive ($p < 0.05$), but the slope for cs is not.}
\label{fig3}
\end{figure}
\begin{table}[]
\centering
\caption{Proportion of Female Authors in arXiv}
\label{Table 3}
\begin{tabular}{llccc}
\toprule
& & & \multicolumn{2}{c}{95\% CI} \\
\cmidrule(r){4-5}
Dataset & Position & Mean & lower & upper \\
\midrule
arxivbio & first & 0.184 & 0.178 & 0.190 \\
& second & 0.210 & 0.200 & 0.219 \\
& other & 0.265 & 0.253 & 0.276 \\
& penultimate & 0.196 & 0.183 & 0.209 \\
& last & 0.148 & 0.141 & 0.155 \\
arxivcs & first & 0.157 & 0.155 & 0.160 \\
& second & 0.175 & 0.172 & 0.179 \\
& other & 0.188 & 0.182 & 0.195 \\
& penultimate & 0.175 & 0.170 & 0.181 \\
& last & 0.155 & 0.153 & 0.158 \\
\bottomrule
\end{tabular}
\end{table}
Taken together, our results suggest that computational biology lies between biology in general and computer science when it comes to gender representation in publications. This is perhaps not surprising given the interdisciplinary nature of computational biology. Compared to biology in general, computational biology papers have fewer female authors, and this is consistent across all authorship positions. Importantly, this difference is not due to a difference in impact factor between computational biology and general biology papers.
Articles with a female last author tend to have more female authors in other positions and this is true for both biology in general and computational biology. Since the last author position is most often occupied by the principal investigator of the study, this suggests that having a woman as principal investigator has a positive influence on the participation of women. This resonates with findings by Macaluso et al., who studied the nature of authorship contribution by gender in PLoS publications \cite{Macaluso2016}. They found that if the corresponding author of a paper was female, then there was also a greater proportion of women across almost all authorship roles (data analysis, experimental design, performing experiments, and writing the paper). In contrast, if the corresponding author was male, then men dominated all authorship roles except for performing experiments, which remained female-dominated. The reasons for this are difficult to ascertain. It could be the case that female PIs tend to work in more female-dominated sub-fields and therefore naturally have more female co-authors. It is also possible that female PIs are more likely to recognise contributions by female staff members, or that they are more likely to attract female co-workers and collaborators. Our publication data cannot differentiate among these (and other) explanations, but they point to the important role that women in senior positions may play as role models for trainees.
Since biology attracts more women than computer science, we suspect that many women initially decide to study biology and later become interested in computational biology. If this is the case, understanding which factors influence the choice of field will provide useful insight when designing interventions to help narrow the gender gap in computer science and computational biology.
\section*{Materials and Methods}
\subsection*{Datasets}
\subsubsection*{Biology publications 1997-2014 (bio)}
This dataset \cite{Bonham2016} contains all English language publications under the MeSH term "Biology" published between 1997 and 2014, excluding many non-primary sources. This set contains 204,767 records. Downloaded 12 February, 2016. Search term: ("Biology"[Mesh]) NOT (Review[ptyp] OR Comment[ptyp] OR Editorial[ptyp] OR Letter[ptyp] OR Case Reports[ptyp] OR News[ptyp] OR "Biography" [Publication Type]) AND ("1997/01/01"[PDAT] : "2014/12/31"[PDAT]) AND english[language]
\subsubsection*{Computational biology publications 1997-2014 (comp)}
Same as above \cite{Bonham2016}, except using the MeSH term ``Computational Biology''. Only uses papers where this is a major term. The date range was selected because this MeSH term was introduced in 1997. This dataset is a subset of the ``bio'' dataset (all of the papers in this dataset are contained within ``bio'') and contains 43,198 records. Downloaded 12 February, 2016. Search term: ("Computational Biology"[Majr]) NOT (Review[ptyp] OR Comment[ptyp] OR Editorial[ptyp] OR Letter[ptyp] OR Case Reports[ptyp] OR News[ptyp] OR "Biography" [Publication Type]) AND ("1997/01/01"[PDAT] : "2014/12/31"[PDAT]) AND english[language]
\subsubsection*{Medical Papers}
Subset of author and gender data from Filardo et al.\ \cite{Filardo2016}. This dataset did not contain author first names or unique publication identifiers. We searched PubMed for the title, author and publication date, and were able to identify 2155 of 3153 publications to analyze. Publications with no matching search results or with multiple matching search results were excluded.
\subsubsection*{arXiv Quantitative Biology (q-bio)}
This dataset \cite{Bonham2016a} contains all preprints with the label “q-bio” from 2003 (when the section was introduced) to 2014. This set contains 41,637 records and was downloaded on 10 June, 2016.
\subsubsection*{arXiv CS (cs)}
This dataset \cite{Bonham2016a} contains all preprints with the label “cs” from 2003 to 2014, and contains 188,617 records. Downloaded on 10 June, 2016. There are 1412 preprints that are found in both the q-bio and cs datasets (3.4\% of q-bio and 0.75\% of cs).
\subsection*{Gender Inference}
Genders were determined using Gender-API (http://gender-api.com), which compares first names to a database compiled from government sources as well as from crawling social media profiles and returns a gender probability and a measure of confidence based on the number of times the name appears in the database. The API was queried with the 74,760 unique first names in the dataset (24 May, 2016).
Mean gender probabilities were determined using bootstrap analysis. Briefly, for each dataset, authors were randomly sampled with replacement to generate a new dataset of the same size. The mean $P_{female}$ for each sample was determined excluding names for which no gender information was available (approximately 26.6\% of authors). The reported $P_{female}$ represents the mean of means for 1000 samples. Error bars in figures represent 95\% confidence intervals. Code and further explanation can be found on GitHub \cite{Bonham2016b}.
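The bootstrap procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the analysis: the function name \texttt{bootstrap\_p\_female}, the toy data, and the use of a percentile interval for the 95\% confidence interval are assumptions made for this sketch.

\begin{verbatim}
```python
import random
from statistics import mean


def bootstrap_p_female(probs, n_samples=1000, seed=0):
    """Bootstrap the mean P_female of one dataset.

    probs: one entry per author, a float in [0, 1], or None when
    no gender information is available for the name.
    Returns (mean_of_means, ci_lower, ci_upper).
    """
    rng = random.Random(seed)
    n = len(probs)
    means = []
    for _ in range(n_samples):
        # Resample authors with replacement, keeping the dataset size.
        sample = rng.choices(probs, k=n)
        # Exclude authors whose name has no gender information.
        known = [p for p in sample if p is not None]
        if known:
            means.append(mean(known))
    means.sort()
    # Percentile interval: an assumption, since the text does not say
    # how the interval was derived from the bootstrap samples.
    lower = means[int(0.025 * len(means))]
    upper = means[int(0.975 * len(means)) - 1]
    return mean(means), lower, upper


# Toy data, made up: 36 female, 64 male and 30 unknown names.
toy = [1.0] * 36 + [0.0] * 64 + [None] * 30
estimate, lo, hi = bootstrap_p_female(toy)
```
\end{verbatim}

On the toy data the estimate lands close to 0.36, the proportion of women among the names with known gender, which illustrates that excluding unknown names inside each resample leaves the estimate centered on the known-gender proportion.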
Author positions were assigned based on the number of total authors. In papers with 5 or more authors, all authors besides first, second, last and penultimate were designated “other.” Papers with 3 authors were assigned only first, second and last, papers with two authors were assigned only first and last, and single-author papers were assigned only first.
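The position rule can be written as a short Python function. This is a sketch under stated assumptions: the name \texttt{assign\_positions} is hypothetical, and the four-author case, which is not spelled out above, is assumed here to be first, second, penultimate and last.

\begin{verbatim}
```python
def assign_positions(n_authors):
    """Return a position label for each author, in byline order."""
    if n_authors < 1:
        raise ValueError("a paper needs at least one author")
    if n_authors == 1:
        return ["first"]
    if n_authors == 2:
        return ["first", "last"]
    if n_authors == 3:
        return ["first", "second", "last"]
    # Four authors: assumed first, second, penultimate, last
    # (this case is not stated explicitly in the text).
    # Five or more: everyone in between is labeled "other".
    middle = ["other"] * (n_authors - 4)
    return ["first", "second"] + middle + ["penultimate", "last"]
```
\end{verbatim}

For example, a six-author paper yields first, second, two ``other'' positions, penultimate and last.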
\subsection*{Regression Analysis}
We used ordinary least squares regression analyses on IF and $P_{female}$ using the GLM.jl package for the Julia programming language. Correlations were considered significant if $P_{Z > |z|} < 0.05$.
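For readers who do not use Julia, the regression and significance test can be sketched in plain Python. This is not the GLM.jl code used for the analysis: the function name \texttt{ols\_slope\_test} and the toy values are made up, and the two-sided normal (z-based) p-value is an assumption suggested by the $P_{Z > |z|}$ notation.

\begin{verbatim}
```python
import math


def ols_slope_test(x, y):
    """Fit y = a + m*x by ordinary least squares.

    Returns (m, a, p), where p is the two-sided normal p-value
    for the null hypothesis that the slope m is zero.
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired points")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    a = y_bar - m * x_bar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - (a + m * xi)) ** 2 for xi, yi in zip(x, y))
    se_m = math.sqrt(rss / (n - 2) / sxx)
    if se_m == 0.0:
        return m, a, (0.0 if m != 0.0 else 1.0)  # perfect fit
    z = m / se_m
    p = math.erfc(abs(z) / math.sqrt(2))
    return m, a, p


# Toy values, made up: impact factor against P_female per journal.
impact = [1, 2, 3, 4, 5]
p_fem = [0.40, 0.39, 0.37, 0.36, 0.34]
slope, intercept, p_value = ols_slope_test(impact, p_fem)
```
\end{verbatim}

Under the criterion stated above, the toy slope would count as significant because its p-value is below 0.05.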
\begin{thebibliography}{10}
\bibitem{Grunspan2016}
Grunspan DZ, Eddy SL, Brownell SE, Wiggins BL, Crowe AJ, Goodreau SM.
\newblock {{M}ales Under-Estimate Academic Performance of Their Female Peers in Undergraduate Biology Classrooms.}
\newblock PLoS ONE. 2016 11:e0148405.
\bibitem{MacNell2014}
MacNell L, Driscoll A, Hunt AN.
\newblock {{W}hat’s in a Name: Exposing Gender Bias in Student Ratings of Teaching.}
\newblock Innovative Higher Education. 2014 40:291–303.
\bibitem{Milkman2015}
Milkman KL, Akinola M, Chugh D.
\newblock {{W}hat happens before? {A} field experiment exploring how pay and representation differentially shape bias on the pathway into organizations.}
\newblock J Appl Psychol. 2015 100:1678–1712.
\bibitem{Moss-Racusin2012}
Moss-Racusin CA, Dovidio JF, Brescoll VL, Graham MJ, Handelsman J.
\newblock {{S}cience faculty’s subtle gender biases favor male students.}
\newblock Proc Natl Acad Sci U S A. 2012 109:16474–16479.
\bibitem{Lariviere2013}
Larivi{\`e}re V, Ni C, Gingras Y, Cronin B, Sugimoto CR.
\newblock {{B}ibliometrics: global gender disparities in science.}
\newblock Nature. 2013 504:211–213.
\bibitem{West2013}
West JD, Jacquet J, King MM, Correll SJ, Bergstrom CT.
\newblock {{T}he role of gender in scholarly authorship.}
\newblock PLoS ONE. 2013 8:e66212.
\bibitem{Mihaljevic-Brandt2016}
Mihaljevi{\'c}-Brandt H, Santamar{\'i}a L, Tullney M.
\newblock {{T}he Effect of Gender in the Publication Patterns in Mathematics.}
\newblock PLoS ONE. 2016 11(10):e0165367.
\bibitem{King2016}
King MM, Bergstrom CT, Correll SJ, Jacquet J, West JD.
\newblock {{M}en set their own cites high: Gender and self-citation across fields and over time.}
\newblock 2016. Available: \url{http://arxiv.org/abs/1607.00376v1}
\bibitem{Rossiter1993}
Rossiter MW.
\newblock {{T}he {M}atthew {M}atilda Effect in Science.}
\newblock Soc Stud Sci. 1993 23:325–341.
\bibitem{Fara2013}
Fara P.
\newblock {{W}omen in science: Weird sisters?}
\newblock Nature. 2013 495:43–44.
\bibitem{Handley2015}
Handley IM, Brown ER, Moss-Racusin CA, Smith JL.
\newblock {{Q}uality of evidence revealing subtle gender biases in science is in the eye of the beholder.}
\newblock Proc Natl Acad Sci U S A. 2015 112:13201–13206.
\bibitem{NSF2015}
National Science Foundation, National Center for Science and Engineering Statistics.
\newblock {{W}omen, Minorities, and Persons with Disabilities in Science and Engineering: 2015.}
\newblock [Internet]. 2015. Available: \url{http://www.nsf.gov/statistics/wmpd/}.
\bibitem{Filardo2016}
Filardo G, da Graca B, Sass DM, Pollock BD, Smith EB, Martinez MA-M.
\newblock {{T}rends and comparison of female first authorship in high impact medical journals: observational study (1994–2014).}
\newblock BMJ. 2016 352:i847.
\bibitem{Ghiasi2015}
Ghiasi G, Larivi{\`e}re V, Sugimoto CR.
\newblock {{O}n the Compliance of Women Engineers with a Gendered Scientific System.}
\newblock PLoS ONE. 2015 10(12):e0145931.
\bibitem{Macaluso2016}
Macaluso B, Larivi{\`e}re V, Sugimoto T, Sugimoto CR.
\newblock {{I}s Science Built on the Shoulders of Women? A Study of Gender Differences in Contributorship.}
\newblock Academic Medicine. 2016 91(8):1136–1142.
\bibitem{Bonham2016}
Bonham KS, Stefan MI.
\newblock {{B}iology and Computational Biology Papers in Pubmed, 1997-2014}.
\newblock [Internet]. Zenodo; 2016. \url{doi:10.5281/zenodo.58990}
\bibitem{Bonham2016a}
Bonham KS, Stefan MI.
\newblock {{P}reprints from arXiv.org in cs and q-bio}.
\newblock [Internet]. Zenodo; 2016. \url{doi:10.5281/zenodo.60088}
\bibitem{Bonham2016b}
Bonham KS, Stefan MI.
\newblock {{gender-comp-bio: Pre-publication release}}.
\newblock [Internet]. Zenodo; 2016. \url{doi:10.5281/zenodo.60090}
\end{thebibliography}
\subsection*{Supporting Information}
\begin{figure}[!h]
\caption*{
\textbf{S1 Fig.} A: Number of primary publications per year indexed under the “Biology” MeSH term. B: Number of primary publications per year indexed with “Computational Biology” as a major MeSH term. C: Comparison of computational gender inference (black) with known genders (white) for the dataset from Filardo et al. \cite{Filardo2016}. Grey represents the known proportion of female authors when excluding names for which the gender could not be computationally inferred. Error bars represent 95\% confidence intervals.}
\label{S1 Fig}
\end{figure}
\begin{figure}[!h]
\caption*{
\textbf{S2 Fig.} A: Mean probability that an author in a given position is female for primary articles indexed in PubMed with the MeSH term Biology (black), Computational Biology (gray), or Biology \textit{but not} Computational Biology (white). Error bars represent 95\% confidence intervals. B: Mean probability that an author is female for articles in the “Bio” dataset (black), in the “Comp” dataset (white), or in Bio \textit{but not} Comp (gray), for each journal that had at least 1000 authors, plotted against the journal’s 2014 impact factor. Excluding computational publications from the biology dataset does not substantially alter the correlation between impact factor and $P_{female}$.}
\label{S2 Fig}
\end{figure}
\begin{figure}[!h]
\caption*{
\textbf{S1 Table.} $P_{female}$ for each journal with at least 1000 authors in the Bio dataset. Journals identified as primarily computational are shaded grey.}
\label{S1 Table}
\end{figure}
\begin{figure}[!h]
\caption*{
\textbf{sample\_name\_data.json}: A subset of 1000 name:gender pairs, downloaded from Gender-API (gender-api.com). Permission to share these data was granted by Markus Perl. For additional information, e-mail contact@gender-api.com.}
\label{sample_name_data}
\end{figure}
\section*{Acknowledgements}
The authors would like to thank Markus Perl for the free use of Gender-API (contact@gender-api.com); Casper Strømgren for the free use of genderize.io (info@genderize.io); Giovanni Filardo for sharing data \cite{Filardo2016}; Johanna Gutlerner, Marshall Thomas, Diane Lam, and other members of the Curriculum Fellows Program (CFP) at Harvard Medical School (HMS) for helpful feedback and discussions; and the HMS CFP and Educational Laboratory for resources and mentorship.
\end{document}