diff --git a/README.md b/README.md index a54e3bf..b4c2ff9 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Analysis of science journalism reveals gender and regional disparities in coverage +# Analysis of science journalism reveals disparities in coverage across predicted gender and ethnic identities @@ -28,7 +28,7 @@ We then used the extracted names to predict gender and name origin of the cited In order to appropriately quantify the level of difference, we must identify a suitable reference set for comparison. We chose first and last authors within primary research articles in _Nature_ and a subset of _Springer Nature_ articles in the same time period as our comparator. -In our analysis, we found a skew towards male quotation in _Nature_ science journalism-related articles. +In our analysis, we found a skew towards quoting men in _Nature_ science journalism-related articles. However, quotation is trending toward equal representation at a faster rate than first and last authorship in academic publishing. Interestingly, we found that the gender disparity in _Nature_ quotes was column-dependent, with the "Career Features" column reaching gender parity. Our name origin analysis found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin. diff --git a/content/01.abstract.md b/content/01.abstract.md index fd58f9d..cff268b 100644 --- a/content/01.abstract.md +++ b/content/01.abstract.md @@ -9,9 +9,9 @@ We extracted cited authors' names and those of quoted speakers. While citations and quotations within a piece do not reflect the entire information-gathering process, they can provide insight into the demographics of visible sources. We then predicted gender and name origin of the cited authors and speakers. 
We compared articles with a comparator set made up of first and last authors within primary research articles in Nature and a subset of Springer Nature articles in the same time period. -In our analysis, we found a skew toward male quotation in Nature science journalism. +In our analysis, we found a skew toward quoting men in Nature science journalism. However, quotation is trending toward equal representation at a faster rate than authorship rates in academic publishing. -Gender disparity in Nature quotes was column-dependent. +Gender disparity in Nature quotes was dependent on the article type. We found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin in both in extracted quotes and journal citations but dampened in citations. diff --git a/content/02.introduction.md b/content/02.introduction.md index 0965d4b..a7da817 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -5,8 +5,8 @@ However, it is important to identify the ways in which its coverage may skew tow Coverage of science shapes who is considered a scientist and field expert by both peers and the public. This indication of legitimacy can either help recognize people who are typically overlooked due to systemic biases or intensify biases. 
Journalistic biases in general-interest, online and printed news have been observed by journalists themselves [@https://www.theguardian.com/world/2016/may/25/enduring-whiteness-of-american-journalism; @https://medium.com/ladybits-on-medium/i-analyzed-a-year-of-my-reporting-for-gender-bias-and-this-is-what-i-found-a16c31e1cdf; @https://www.theatlantic.com/technology/archive/2016/02/gender-diversity-journalism/463023; @https://www.theatlantic.com/science/archive/2018/02/i-spent-two-years-trying-to-fix-the-gender-imbalance-in-my-stories/552404], as well as by independent researchers [@doi:10.1177/0003122415596999; @doi:10.1080/1461670X.2013.834149; @doi:10.1177/0163443711418272; @doi:10.1371/journal.pone.0148434; @https://www.poynter.org/reporting-editing/2013/lack-of-female-sources-in-new-york-times-stories-spotlights-need-for-change; @https://whomakesthenews.org/gmmp-2015-reports; @doi:10.1371/journal.pbio.2004956]. -Researchers found a gap between male and female subjects or sources, with independent studies finding that between 17-40% of total subjects were female across multiple general-interest printed news outlets between 1985 and 2015 [@doi:10.1177/0003122415596999; @doi:10.1080/1461670X.2013.834149; @https://whomakesthenews.org/gmmp-2015-reports]. -One study found 27-35% of total subjects in international science and health related news were female between 1995 and 2015, and 46% in print, radio, and television in the United States in 2015 [@https://whomakesthenews.org/gmmp-2015-reports]. +Researchers found a gap between men and women subjects or sources, with independent studies finding that between 17-40% of total subjects were women across multiple general-interest printed news outlets between 1985 and 2015 [@doi:10.1177/0003122415596999; @doi:10.1080/1461670X.2013.834149; @https://whomakesthenews.org/gmmp-2015-reports]. 
+One study found 27-35% of total subjects in international science and health related news were women between 1995 and 2015, and 46% in print, radio, and television in the United States in 2015 [@https://whomakesthenews.org/gmmp-2015-reports]. While gender disparities in news coverage have been extensively researched, our research is different because it focuses on science journalism and comparing it against the demographics of actively publishing scientists. Additionally, our work focuses on research into disparities with respect to name origins, a focus which is currently lacking in the literature. @@ -19,7 +19,7 @@ This is similar to other studies that have quantified gender or racial dispariti In researching a story, a journalist will typically interview multiple sources for their opinion, potentially asking for additional sources, thus allowing individual unconscious biases at any point along the interview chain to skew scientific coverage broadly. In addition, the repeated selection of a small set of field experts or the approach a journalist takes in establishing a new source may intensify existing biases [@https://www.theopennotebook.com/2016/08/23/including-diverse-voices-in-science-stories; @https://www.theatlantic.com/technology/archive/2016/02/gender-diversity-journalism/463023; @https://www.theatlantic.com/science/archive/2018/02/i-spent-two-years-trying-to-fix-the-gender-imbalance-in-my-stories/552404]. While disparities in representation may go unnoticed in a single article, analyzing a large corpus of articles can identify and quantify these disparities and help guide institutional and individual self-reflection. 
-In the same vein as previous media studies [@doi:10.1177/0003122415596999; @doi:10.1080/1461670X.2013.834149; @doi:10.1177/0163443711418272; @doi:10.1371/journal.pone.0148434; @https://www.poynter.org/reporting-editing/2013/lack-of-female-sources-in-new-york-times-stories-spotlights-need-for-change; @https://whomakesthenews.org/gmmp-2015-reports], we sought to quantify gender and regional differences of journalism beyond the existing demographic differences in the scientific field. +In the same vein as previous media studies [@doi:10.1177/0003122415596999; @doi:10.1080/1461670X.2013.834149; @doi:10.1177/0163443711418272; @doi:10.1371/journal.pone.0148434; @https://www.poynter.org/reporting-editing/2013/lack-of-female-sources-in-new-york-times-stories-spotlights-need-for-change; @https://whomakesthenews.org/gmmp-2015-reports], we sought to quantify differences in representation across predicted gender and name origin beyond the existing demographic differences in the scientific field. Our study focused solely on science journalism, specifically content published by _Nature_. Since _Nature_ also publishes primary research articles, we used these data to determine the demographics of the expected set of possible sources. This is not a perfect comparator since journalists will not cover every research article presented in the journal. @@ -30,10 +30,10 @@ In our analysis, we identified quoted and cited people by analyzing the content Through our analysis of 22,001 news articles, we were able to identify >88,000 quotes and >15,000 citations with sufficient speaker or author information. We also identified first and last authors of >10,000 _Nature_ papers. -We then identified possible gender or regional differences using the extracted names. +We then identified possible differences in predicted gender or name origin using the extracted names. The extracted names were used to generate three data-types: quoted, mentioned, and cited people. 
-We used computational methods to predict gender and identified a trend towards quotes from people predicted male in news articles when compared to both the general population and predicted male authorship in papers. -Within the period that we examined, the proportion of predicted male attributed quotes in news articles went from initially higher to currently lower than the proportion of male first and last authors in _Nature_ papers. +We used computational methods to predict gender and identified a trend towards quotes from people predicted to be men in news articles when compared to both the general population and authors predicted to be men in papers. +Within the period that we examined, the proportion of quotes predicted to be attributed to men in news articles went from initially higher to currently lower than the proportion of first and last authors predicted to be men in _Nature_ papers. Furthermore, we found that the quote difference was dependent on article type; the “Career Feature” column achieved gender parity in quoted speakers. We also used computational methods to predict name origins of quoted, mentioned, and cited people. diff --git a/content/03.methods.md b/content/03.methods.md index efbc997..c522d4e 100644 --- a/content/03.methods.md +++ b/content/03.methods.md @@ -5,6 +5,8 @@ #### Text Scraping We scraped all text and metadata from _Nature_ using the web-crawling framework Scrapy [@https://scrapy.org] (version 2.4.1). +Scrapy is a tool that uses user-defined rules to follow hyperlinks on web pages and return the information contained on each webpage. +We used Scrapy to identify all the web pages containing news articles and extract their text. We created four independent scrapy web spiders to process the news text, news citations, journalist names, and paper metadata. News articles were defined as all articles from 2005 to 2020 that were designated as "News", "News Feature", "Career Feature", "Technology Feature", and "Toolbox". 
Using the spider “target_year_crawl.py”, we scraped the title and main text from all news articles. @@ -34,8 +36,10 @@ Since citations are hyperlinked in-line, we did not extract any citation informa After the news articles were scraped and processed, the text was processed using the coreNLP pipeline [@doi:10.3115/v1/P14-5010] (version 4.2.0). The main purpose for using coreNLP was to identify named entities related to mentioned and quoted speakers. -The full set of annotaters were: tokenize, ssplit, pos, lemma,ner, parse, coref, quote. -We used the "statistical" algorithm to perform coreference resolution. +We used the standard set of annotators: tokenize, ssplit, pos, lemma, ner, parse, coref, and additionally the quote annotator. +These annotators respectively perform text tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, division of sentences into constituent phrases, co-reference resolution, and identification of quoted entities. +We used the "statistical" algorithm to perform coreference resolution for speed. +Each of these aspects is required in order to identify the names of quoted or mentioned speakers and identify any of their associated pronouns. All results were output to json format for further downstream processing. @@ -44,7 +48,9 @@ All results were output to json format for further downstream processing. _Springer Nature_ was chosen over other publishers and search engines for multiple reasons: 1) ease of use; 2) it is a large publisher, second only to Elsevier; 3) it covers diverse subjects, in contrast to PubMed, which focuses on the biomedical and life sciences literature; 4) its API has a large daily query limit (5000/day); and 5) it provided more author affiliation information than found in Elsevier. We generated a comparative background set for supplemental analysis with the _Springer Nature_ API by obtaining author information for papers cited in news articles. 
We selected a subset of papers to generate the _Springer Nature_ background set. -These papers were the first 200 English language "Journal" papers returned by the _Springer Nature_ API for each month, resulting in 2400 papers per year for 2005 through 2020. +These papers were the first 200 English language "Journal" papers returned by the _Springer Nature_ API for each month, resulting in 2400 papers per year from 2005 through 2020. +These papers are the first 200 papers published each month by a _Springer Nature_ journal, which may not be a completely random sample, but which we believe to be reasonably representative. +Furthermore, the _Springer Nature_ articles are only being used as an additional comparator to the complete set of all _Nature_ research papers used in our analyses. To obtain the author information for the cited papers, we queried the _Springer Nature_ API using the scraped DOI. For both API query types, the author names, positions, and affiliations for each publication were stored and are available in "all_author_country.tsv" and "all_author_fullname.tsv". @@ -54,7 +60,7 @@ For both API query types, the author names, positions, and affiliations for each We first pre-filter articles that have more than 25 quotes, which results in the removal of 2.69% (433/16,080) of the total articles. This was done to ensure no single article is over-represented and to avoid spuriously identified quotes due to unusual article formatting. To identify the gender of a quoted or mentioned person, we first attempt to identify the person’s full name. -Even though genderizeR only uses the first name to make the gender prediction, identifying the full name gives us greater confidence that we are using the first name. +Even though genderizeR, the computational method used to predict the name's gender, only uses the first name to make the gender prediction, identifying the full name gives us greater confidence that we correctly identified the first name. 
To identify the full name, we take the predicted speaker by coreNLP. Unfortunately, this is not always the full name and is only either the first or last name, with the full name occuring somewhere else in the article. In order to get the full name for all names that coreNLP is unable to identify, we match the coreNLP-identified partial name to the longest matching name within the same article. @@ -72,6 +78,7 @@ A summary of processed gender predictions of quotes at each point of processing #### Name Formatting for Gender Prediction of Authors Because we separate first and last authors, we only considered papers with more than one author. +Roughly 7% of all papers were estimated to be single-author papers and were removed from this analysis: 1113/15013 for cited Springer articles, 2899/42155 for random Springer articles, and 955/12459 for Nature research articles. As for quotes, we needed to extract the first name of the authors. We cast names to lowercase and processed them using the R package humaniformat [@https://cran.r-project.org/web/packages/humaniformat/index.html]. humaniformat is a rule based program that uses character markers to identify if names are reversed (Lastname, Firstname), find middle names and titles. @@ -101,7 +108,7 @@ The quote extraction and attribution annotator from the coreNLP pipeline was emp In some cases, coreNLP could not identify an associated speaker’s name but instead assigned a gendered pronoun. In these instances, we used the gender of the pronoun for the analysis. The R package genderizeR [@doi:10.32614/rj-2016-002], a wrapper for the genderize.io API [@https://genderize.io], predicted the gender of authors and speakers. -We predicted a name as male using the first name with a minimum cutoff of 50%. +We predicted a name as indicating a man if the first name was predicted by genderizeR to come from a man with a probability of at least 50%. 
To reduce the number of queries made to genderize.io, a previously cached gender prediction from [@doi:10.1101/2020.04.14.927251] was also used and can be found in the file "genderize.tsv". All first name predictions from this analysis are in the file "genderize_update.tsv". To estimate the gender gap for the quote gender analyses, we used the proportion of total quotes, not quoted speakers. @@ -109,9 +116,9 @@ We used the proportion of quotes to measure speaker participation instead of onl The specific formulas for a single year are shown in equations @eq:quote and @eq:first-author. We did not consider any names where no prediction could be made or quotes where neither speaker nor gendered pronoun was associated. -$$\textrm{Prop. Male Quotes} = \frac{|\textrm{Male Speaker Quotes}|} {|\textrm{Male or Female Speaker Quotes}|}$${#eq:quote} +$$\textrm{Prop. Quotes from Men} = \frac{|\textrm{Speaker Quotes from Men}|} {|\textrm{Speaker Quotes from Men or Women}|}$${#eq:quote} -$$\textrm{Prop. Male First Authors} = \frac{|\textrm{Male First Authors}|} {|\textrm{Male or Female First Authors}|}$${#eq:first-author} +$$\textrm{Prop. First Author Men} = \frac{|\textrm{First Authors Men}|} {|\textrm{First Author Men or Women}|}$${#eq:first-author} ### Name Origin Analysis @@ -126,16 +133,17 @@ NamePrism chose to exclude the United States, Australia, and Canada from their c This choice was justified by NamePrism in stating that these countries had a high level of immigration. The treemap of country groupings defined in the NamePrism manuscript are found in figure 5 of the publication [@doi:10.1145/3132847.3133008]. -After running the pre-trained Wiki-2019LSTM model, we select the highest probability origin for each name as the resultant assignment. -Similar to the gender analyses, quote proportions were again directly compared against publication rates. 
+After running the pre-trained Wiki-2019LSTM model, we used the predicted probability of each origin for each name instead of a hard assignment to a single class. +Hard assignment was not used because it has been shown to reproduce biases due to the underreporting of Black and overprediction of White individuals [@doi:10.1371/journal.pone.0264270]. +Similar to the gender analyses, quote proportions were again directly compared against publication rates, except using the probability of assignment instead of the count of hard assignments. For citations, quotes, and mentions, we calculated the proportion for a given year for each name origin. -This is shown in @eq:cite-origin to, for example, calculate the citation rate for last authors with a Greek name origin for a single year. +This is shown in Eq. @eq:cite-origin to, for example, calculate the citation rate for last authors with a Greek name origin for a single year. -$$\textrm{Prop. Greek Last Author Cited} = \frac{| \textrm{Cited Last Authors w/Greek Name}|} {| \textrm{Cited Last Authors w/any Name}|}$${#eq:cite-origin} +$$\textrm{Prop. Greek Last Author Cited} = \frac{\Sigma \textrm{(Probability Greek Name for each Cited Last Author)}} {|\textrm{Cited Last Authors w/any Name}|}$${#eq:cite-origin} -$$\textrm{Prop. Greek Quotes} = \frac{| \textrm{Quotes w/Greek Named Speaker}|} {| \textrm{Quotes w/any Named Speaker}|}$${#eq:quote-origin} +$$\textrm{Prop. Greek Quotes} = \frac{\Sigma \textrm{(Probability Greek Name for each Quoted Speaker)}} {| \textrm{Quotes w/any Named Speaker}|}$${#eq:quote-origin} -$$\textrm{Prop. Greek Names Mentioned} = \frac{| \textrm{Unique Greek Names Mentioned}|} {| \textrm{Unique Names w/any Origin Mentioned}|}$${#eq:quote-origin} +$$\textrm{Prop. 
Greek Names Mentioned} = \frac{\Sigma \textrm{(Probability Greek Mentioned Name)}} {| \textrm{Unique Names w/any Origin Mentioned}|}$${#eq:quote-origin} When comparing _Nature_ articles against the _Springer Nature_ set of first or last authors, we again find the same patterns in quoted speakers with East Asian, Celtic/English, and Arabic/Turkish/Persian predicted name origins as we did in the previous citation analysis (Figure {@fig:suppfig4}d, green and purple lines). -In addition, we find an under-enrichment of predicted Hispanic, South Asian, and Hebrew name origins when comparing against the predicted name origin rate of first and last authors in our _Springer Nature_ set. +In addition, we find an under-representation of predicted Hispanic, South Asian, and Hebrew name origins when comparing against the predicted name origin rate of first and last authors in our _Springer Nature_ set. -#### News mention rates are over-enriched for predicted Celtic-English and under-enriched for East Asian name origins. +#### News mention rates are over-represented for predicted Celtic-English and under-represented for East Asian name origins. Since many journalists use additional sources that are not directly quoted, we also analyzed likely paraphrased speakers, e.g. a case in which the person was a source and mentioned in the story but not directly quoted. To do this, we identified all unique names that appeared in an article, which we term _mentions_. -We found the same pattern of over-enrichment for predicted Celtic/English name origins and under-enrichment for East Asian name origins when comparing against both _Nature_ and _Springer Nature_ first and last authorships (Figure {@fig:fig3}e, Figure {@fig:suppfig3}d,e, Figure {@fig:suppfig4}e,f, Table {@tbl:tableFCNature}, Table {@tbl:tableFCSpringer}). 
+We found the same pattern of over-representation for predicted Celtic/English name origins and under-representation for East Asian name origins when comparing against both _Nature_ and _Springer Nature_ first and last authorships (Figure {@fig:fig3}e, Figure {@fig:suppfig3}d,e, Figure {@fig:suppfig4}e,f, Table {@tbl:tableFCNature}, Table {@tbl:tableFCSpringer}). Similar to the quote analysis, we selected a subset of mentions from people that were also cited in the news article. We again found that the disparity was greatly reduced (Figure {@fig:suppfig_quote_cite}c,d). diff --git a/content/05.discussion.md b/content/05.discussion.md index a85c30f..dc9def6 100644 --- a/content/05.discussion.md +++ b/content/05.discussion.md @@ -1,29 +1,48 @@ ## Discussion -Science journalism is the critical conduit between the academic and public spheres, and consequently shapes the public's view of science and scientists. +Science journalism is the critical conduit between the academic and public spheres and consequently shapes the public's view of science and scientists. However, as observed in other forms of recognition in science, biases may shift coverage away from the known demographics within science [@doi:10.1101/2020.04.14.927251, @doi:10.1016/j.cell.2022.01.004]. Ideally, scientific journalism is representative of academic papers. -Though it would be best for news coverage to promote equitable representation, at a minimum quotes and citations would ideally match the regional and gender demographics of scientific academia. +Though it would be best for news coverage to promote equitable representation, at a minimum, quotes and citations would ideally match the predicted name origin and gender demographics of scientific academia. To examine this last point, we analyzed 22,001 news articles published in _Nature_, to identify quoted, mentioned, and cited people. 
We then compared this to the authorship statistics from _Nature_'s papers and a subset of _Springer Nature_'s English language papers. -We first looked at possible gender differences in quotes and found a large, but decreasing, gender gap when compared to the broader population in all but one article type. -Additionally, this result was consistent in articles written by both predicted female and male journalists. + +We first looked at possible gender differences in quotes and found a large, but decreasing, gender gap when compared to the general population in all but one article type. +Additionally, this result was consistent in articles written by journalists predicted to be women or men. We found that one column, "Career Feature", has an equal number of quotes from both genders, showing that gender parity is possible in science journalism. -This finding, coupled with the near equal number of article written by male and female predicted journalists, argues for more diversity in topical coverage. -Including more content that is not primarily focused on recent publications, but all topics surrounding the practice of science, may help to rapidly achieve gender parity in journalistic recognition. -However, we do recognize that different journalistic columns have different purposes or may represent different demographics and be inherently more difficult to reach parity. - - -To further our analysis of possible coverage disparities, we looked to differences in predicted name origins of quoted and cited authors across all the processed news articles. + + To further our analysis of possible coverage disparities, we looked at differences in predicted name origins of quoted and cited authors across all the processed news articles. +Our use of name origins is a proxy for how an individual's name might suggest ethnicity to a journalist or an individual's scientific peers. 
+We do not intend to assign an identity to an individual but to generate a broad metric to measure possible bias during primary source gathering. Our findings provide additional support for previous studies that identified under-citation [@doi:10.1101/2020.10.12.336230] and under-recognition [@doi:10.1101/2020.04.14.927251, @doi:10.1016/j.cell.2022.01.004] of East Asian people. Interestingly, we found under-citation of people with predicted East Asian name origins to be much less pronounced than under-quotation. We do not believe that the under-quotation is driven by paraphrasing sources, which may occur more frequently with non-native English speakers. -We also found that the disparity observed in quotes and mentions was almost eliminated when only considering people that were additionally cited within the same article. +We also found that the disparity observed in quotes and mentions was almost eliminated when only considering people who were additionally cited within the same article. This suggests that the source of the disparity may lie in the search for additional expert opinions. Either way, the clear disparity of predicted East Asian researcher quotes and mentions argues for including a broader set of voices when seeking opinions beyond the academic papers being covered in the article. @@ -32,7 +51,19 @@ While we were not directly able to examine the regions journalists lived in, thi When considering quotes from people with a predicted East Asian name origin, we found that journalists who themselves have a predicted East Asian name origin include a higher proportion of these quotes than journalists with European or Celtic/English predicted names. When considering only people who were both quoted and cited, the effect of the predicted name origins of journalists was substantially dampened. 
We are unable to identify if this is a geographic bias of the reporters in this analysis, since we do not know the location of the journalist at the time of writing the article. -However, having reporters explicitly focused on specific regional sources to better cover international opinions in science can help ameliorate this disparity. +As a proxy for measuring the possible geographical bias of a journalist, we attempted to identify if there was any geographical bias of cited authors. +To do this, we identified the affiliation of each cited author and identified their affiliated country. +Unfortunately, we were unable to robustly extract a large enough number of cited authors from different countries to make any conclusive statements. +Expanding our work to other science journalism outlets could help identify possible ways in which geographic region, gender, and perceived ethnicity interact and affect the scientific visibility of specific groups. +While we are unable to determine whether journalists have a specific geographical bias, having reporters explicitly focused on specific regional sources will broaden coverage of international opinions in science. + + +In our analysis, we also find that there are more first authors with predicted East Asian name origin than last authors. +This is in contrast to predicted Celtic/English and European name origins. +Furthermore, we see that the number of first authors with predicted East Asian name origins is increasing at a much faster rate than quotes are increasing. +If this mismatched rate of representation continues, this could lead to an increasingly large erasure of early career scientists with East Asian name origins. +As noted before, focusing on increasing engagement with early career scientists can help to reduce the growing disparity of public visibility of scientists with East Asian name origins. 
+ - + Through our comprehensive analysis, we were able to quantify how recognized persons in news journalism vary by name origin and gender, then compare it to scientific publishing background rates. -While we found a significant gender disparity, the rate of female representation in scientific news is increasing and outpacing first and last authorships on scientific papers. -Furthermore, we identified a significant depletion of quotes from scientists with a predicted East Asian name origin when compared to paper authorship, and a significant but smaller depletion of cited authors with a predicted East Asian name origin in news content. +While we found a significant gender disparity in comparison to the general population, the rate of female representation in scientific news is increasing and outpacing first and last authorships on scientific papers. +Furthermore, we identified a significant reduction of quotes from scientists with a predicted East Asian name origin when compared to paper authorship and a significant but smaller reduction of cited authors with a predicted East Asian name origin in news content. + +Computational tools enabled us to automatically analyze thousands of articles to identify existing disparities by gender and name origin, but these tools are not without limitations. +Our tools are unable to identify non-binary people and rely on gender predictors that are known to have region-specific biases, with the largest decrease in performance on names of Asian origin [@doi:10.7717/peerj-cs.156;@doi:10.5195/jmla.2021.1252]. +Furthermore, name origin is only a proxy for externally perceived racial or ethnic origins of a source or author and is not as accurate as self-identified race or ethnicity. +Self-identification better captures the lived experience of an individual that computational estimates from a name can not capture. +This is highlighted in our inability to distinguish between Black and White people from the US by their names. 
+As the collection of demographic data by publication outlets grows, we believe we can get a more fine-grained and accurate analysis of disparities in scientific journalism. + Previous anecdotal studies from journalists have shown that awareness of their bias can help them to reduce it [@https://medium.com/ladybits-on-medium/i-analyzed-a-year-of-my-reporting-for-gender-bias-and-this-is-what-i-found-a16c31e1cdf; @https://www.theatlantic.com/technology/archive/2016/02/gender-diversity-journalism/463023; @https://www.theatlantic.com/science/archive/2018/02/i-spent-two-years-trying-to-fix-the-gender-imbalance-in-my-stories/552404]. Once a bias is identified an individual can seek resources to help them find and retain diverse sources, such as utilizing international expert databases like gage [@https://gage.500womenscientists.org] and SheSource [@https://www.womensmediacenter.com/shesource]. Additional tips for journalists to achieve and maintain a diverse source pool is described by Christina Selby in the Open Notebook [@https://www.theopennotebook.com/2016/08/23/including-diverse-voices-in-science-stories]. @@ -77,6 +116,12 @@ For example, a person may not be mentioned or quoted in the article because of l A more accurate reflection of journalists' sources would be a self-maintained record of people they interview. Our work examines disparities with respect to recognition within articles, which can be measured by mentions, quotes, or citations of people. +Furthermore, the news articles presented on "www.nature.com" are intended for a very specific readership that may not be reflective of more broad scientific news outlets. +In a separate analysis, we took a cursory look into a comparison with _The Guardian_ and found very similar disparities in gender and name origin. +However, it is not clear which publications should be used as a comparator for science-related articles in _The Guardian_, and difficult to compare relative rates of representation. 
+While other science news outlets may not have a direct comparator, it would be useful to compare multiple science news outlets against one another.
+Our existing pipeline could be easily applied to other science news outlets to identify whether a consistent pattern of disparity exists regardless of the intended readership.
+
 Another major limitation of our study, is that we only used articles published by _Nature_ or _Springer Nature_ as a comparator.
 Not all papers are interesting to the general public and likely to be covered by journalists.
 In this work, we assume that the demographics of scientists publishing work that is likely to be covered by journalists matches the demographics of all scientists publishing articles in _Nature_, _Springer Nature_ or other publishers.
diff --git a/content/07.supplement.md b/content/07.supplement.md
index 1020271..97fa49a 100644
--- a/content/07.supplement.md
+++ b/content/07.supplement.md
@@ -2,59 +2,57 @@
 ![
 **Benchmark Data **
-Panel A, depicts the performance of gender prediction for pipeline-identified quoted speakers.
-Panel B is a histogram of the number of articles that were falsely identified to mention a country by our processing pipeline.
-Panels C shows the estimated versus true frequency of country mentions within our benchmark dataset. The red line denotes the x = y line.
+The performance of gender prediction for pipeline-identified quoted speakers.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/supp_fig1_tmp/supp_fig1.png "Supplementary Figure 1"){#fig:suppfig1 tag="Supplemental 1" width=6in}
 ![
-**Predicted male speakers are overrepresented in news quotes regardless of predicted journalist gender**
-Panel A depicts two trend lines: Yellow: Proportion of _Nature_ news articles written by a predicted female journalist; Blue: Proportion of _Nature_ news articles written by a predicted male journalist.
-We observe a moderate gender difference in the number of articles written by male and female journalists.
-Panel B depicts two trend lines: Yellow: Proportion of predicted male quotes in an article written by a predicted female journalist; Blue: Proportion of predicted male quotesin an article written by a predicted male journalist.
+**Speakers predicted to be men are overrepresented in news quotes regardless of predicted journalist gender**
+Panel a depicts two trend lines: Yellow: Proportion of _Nature_ news articles written by a journalist predicted to be a woman; Blue: Proportion of _Nature_ news articles written by a journalist predicted to be a man.
+We observe a moderate gender difference in the number of articles written by men and women journalists.
+Panel b depicts two trend lines: Yellow: Proportion of quotes predicted to be from men in an article written by a journalist predicted to be a woman; Blue: Proportion of quotes predicted to be from men in an article written by a journalist predicted to be a man.
 In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/supp_journalist_contingency_tab_tmp/supp_fig.png "Supplementary Figure 2"){#fig:suppfig_j_gender tag="Supplemental 2" width=6in}
 ![
-**Predicted male speakers are overrepresented in news quotes when compared against _Springer Nature_ authorship**
-Panel A depicts three trend lines: Purple: Proportion of _Nature_ quotes for an estimated male speaker; Light Grey: Proportion of _The Guardian_ quotes for an estimated male speaker; Yellow: Proportion of first author articles from an estimated male author in _Springer Nature_; Dark Mustard: Proportion of last author articles from an estimated male author in _Springer Nature_.
-We observe a larger gender difference between first and last authors in _Springer Nature_ articles, however the proportion of predicted male speakers is less than observed in _Nature_ research articles.
-Panel B depicts the proportion of male quotes broken down by article type.
+**Speakers predicted to be men are overrepresented in news quotes when compared against _Springer Nature_ authorship**
+Panel a depicts four trend lines: Purple: Proportion of _Nature_ quotes for a speaker estimated to be a man; Light Grey: Proportion of _The Guardian_ quotes for a speaker estimated to be a man; Yellow: Proportion of first author articles from an author estimated to be a man in _Springer Nature_; Dark Mustard: Proportion of last author articles from an author estimated to be a man in _Springer Nature_.
+We observe a larger gender difference between first and last authors in _Springer Nature_ articles; however, the proportion of speakers estimated to be men is lower than that observed in _Nature_ research articles.
+Panel b depicts the proportion of quotes from predicted men broken down by article type.
 In all plots the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/fig2_tmp/fig2_supp.png "Supplementary Figure 3"){#fig:suppfig2 tag="Supplemental 3" width=6in}
 ![
 **Predicted Celtic/English, and European name origins are the highest cited, quoted, and mentioned**
-Panel A, depicts the number of quotes, mentions, citations, or research articles considered in the name origin analysis.
-Panels B-G depicts the proportion of a name origin in a given dataset, citations in articles written by journalists or writers, quoted speakers or mentions.
+Panel a depicts the number of quotes, mentions, citations, or research articles considered in the name origin analysis.
+Panels b-g depict the proportion of a name origin in a given dataset, citations in articles written by journalists or writers, quoted speakers or mentions.
 In all plots the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/fig3_tmp/fig3_supp.png "Supplementary Figure 4"){#fig:suppfig3 tag="Supplemental 4" width=6in}
 ![
 **Distribution of name origins _Nature_ and _Springer Nature_ articles**
-Panels A-D depicts the predicted name origins of first and last authors in our background sets.
-Panel A and B show the predicted name origins of _Nature_ first and last authors, respectively.
-Panel C and D show the predicted name origins of _Springer Nature_ first and last authors, respectively.
+Panels a-d depict the predicted name origins of first and last authors in our background sets.
+Panels a and b show the predicted name origins of _Nature_ first and last authors, respectively.
+Panels c and d show the predicted name origins of _Springer Nature_ first and last authors, respectively.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/fig3_tmp/fig3_supp3.png "Supplementary Figure 5"){#fig:supfig_nameorigin_bg tag="Supplemental 5" width=6in}
 ![
 **Over-representation of predicted Celtic/English and under-representation of East Asian name origins is also found in comparison to _Nature_ and _Springer Nature_ articles**
-Panels A-F depicts ten plots, each for a possible name origin comparison against a background set.
-Panel A, C, and E compare the citation (a), quote (c), or mention (e) rate against _Nature_ first and last author name origins.
-Panel B, D, and F compare the citation (a), quote (c), or mention (e) rate against _Springer Nature_ first and last author name origins.
-Panels A and B additionally partition the citation rates by journalist-written articles and scientist-written articles, each further divided into first or last author position.
-For C-F, only journalist written articles are considered.
+Panels a-f depict ten plots, each for a possible name origin comparison against a background set.
+Panels a, c, and e compare the citation (a), quote (c), or mention (e) rate against _Nature_ first and last author name origins.
+Panels b, d, and f compare the citation (b), quote (d), or mention (f) rate against _Springer Nature_ first and last author name origins.
+Panels a and b additionally partition the citation rates by journalist-written articles and scientist-written articles, each further divided into first or last author position.
+For panels c-f, only journalist-written articles are considered.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/fig3_tmp/fig3_supp2.png "Supplementary Figure 6"){#fig:suppfig4 tag="Supplemental 6" width=6in}
 ![
 **Over-representation of predicted Celtic/English and under-representation of East Asian quotes and mentions are reduced when additionally considering citation**
-Panels A-D depicts twelve plots, each for a possible name origin comparison against a background set.
-Panels A and B compare name origin proportions of quotes from people that were also cited in the same article.
-Panels C and D compare name origin proportions from mentions of people that were also cited in the same article.
+Panels a-d depict twelve plots, each for a possible name origin comparison against a background set.
+Panels a and b compare name origin proportions of quotes from people that were also cited in the same article.
+Panels c and d compare name origin proportions from mentions of people that were also cited in the same article.
 In all plots the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
 ](https://github.com/nrosed/nature_news_disparities/raw/main/figure_notebooks/manuscript_figs/supp_country_specific_analysis_tmp/supp_fig.png "Supplementary Figure 7"){#fig:suppfig_quote_cite tag="Supplemental 7" width=6in}
@@ -107,6 +105,21 @@ Table: Breakdown of all Springer Nature papers at major processing steps {#tbl:
 Table: Breakdown of all Nature papers at major processing steps {#tbl:table4}
+| | Women| Men| Proportion Men|
+|:-------------|------:|-----:|----------:|
+|African | 270| 1554| 0.8519737|
+|ArabTurkPers | 346| 1765| 0.8360966|
+|CelticEnglish | 6399| 33329| 0.8389297|
+|EastAsian | 1090| 4438| 0.8028220|
+|European | 4788| 22844| 0.8267226|
+|Greek | 73| 445| 0.8590734|
+|Hebrew | 213| 1303| 0.8594987|
+|Hispanic | 760| 2450| 0.7632399|
+|Nordic | 593| 2397| 0.8016722|
+|SouthAsian | 465| 2019| 0.8128019|
+Table: Quoted speaker gender by name origin {#tbl:tableGenderNameOrigin}
+
+
 | |CelticEnglish |EastAsian |European |
 |:------------------------------------------|:-----------------|:-----------------|:-----------------|
 |citation_journalist_first vs. nature_first |1.37 (0.93, 1.82) |0.68 (0.44, 0.91) |1.01 (0.77, 1.28) |