Week 6: Data citations #431
Replies: 7 comments 4 replies
-
As a researcher, data has always been a huge topic of discussion. I cannot recall the number of times I've said things like 'I wish I had access to so-and-so's data' or 'I wish I knew if this data existed' or 'why can't I get my data out because this paper seems like it will never be accepted!' As suggested by the authors, data is the cornerstone of growth and discovery. Yet, having access to data or publishing your own is not so straightforward and comes with many inherent concerns for data owners, as highlighted by the authors in their assertion that having citable data could possibly address these concerns:
While attribution would be more attractive to data owners, giving them credit for their efforts, until there is broad acceptance of the practice and an expectation or mandate by journals to cite datasets, the risk of being scooped still deters many authors from sharing data until they have had a chance to publish their findings. Further, it is still unclear whether career evaluation would give merit to data publication vs. paper publication, as the interpretation and ingenuity of discovery in a paper is deemed the most meritorious act. This is why many granting agencies that require data publication allow the author a certain number of years to publish said data, giving them time to make the first discoveries from it.

I do acknowledge, though, that there is huge benefit to making FAIR data. With the development and growth of AI and machine learning, having more data means building better models. Better models mean accelerated science. To this end, the growing adoption of centralized data repositories for various data types is an attractive way of making data FAIR. For example, part of my doctoral thesis included the development of a low-temperature thermochronology dataset. Once I had published that work, I entered my dataset into a repository known as Geochron. This database uses a standardized template to capture the data and metadata for age-dating datasets for wide use by the dating community. This standardized system facilitates FAIR principles.

From a user's perspective, I love that many of these data repositories make the data citation available right on the data page or when you download the data, so that you do not have to guess how to cite it. Ease of access to the citation makes authors more receptive to attribution.
-
Firstly, reading this article makes me hopeful for the future of data accessibility. The government's efforts to ensure public data access surely set the tone for all affiliated and government-funded data creators to emulate FAIR data principles. That said, with the diversity of agencies, organizations, companies, individuals, governments, etc., who are creating, manipulating, and storing data in diverse ways, it makes me wonder whether even a standard set of principles will suffice to create a truly transparent system of data sharing. I wonder if the dispersed state of earth data makes certain barriers impossible to overcome even with absolute buy-in to FAIR principles and practices from all stakeholders (which is unlikely). Thus, my mind goes to the potential for a central repository, agency, or coalition dedicated to the aggregation of publicly funded data. This would be an organization whose sole purpose would be to facilitate the organization and dissemination of data collected by scientists and researchers across agencies, universities, and organizations. There are certainly barriers here too, but I think this approach, while initially costly, has the potential to supercharge and enforce FAIR principles and to make research more efficient and affordable in the long run.
-
It is important to cite data correctly for obvious reasons, such as accelerating the growth of high-quality, reproducible, and transferable knowledge. More specifically, what stood out to me is how this could illuminate connections researchers were not previously aware of, which can benefit individual researchers by ensuring they are credited for their research contributions. One thing I can do to improve my data management practices is to follow the FAIR principles. This is always very important and something that is good to get in the habit of from the start (which is probably why we learned about it on day one). Following FAIR not only makes data more discoverable and easier to reuse but also emphasizes that metadata should be machine-readable, enabling automated attribution. These rules align well with what we learned in week one, though the article only briefly touches on them. I wish they had gone into more depth on this, because it is a direct counterpoint to rule 5; if people are actually going to remember it, I think it needed a bit more emphasis, and maybe even some examples of data that would fall under this category.
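To make the "machine-readable metadata enabling automated attribution" point concrete, here is a minimal sketch. All field names, values, and the DOI are invented placeholders (loosely modeled on the kind of fields repositories capture), not taken from any real record or repository API:

```python
# Hypothetical sketch: a minimal machine-readable metadata record and a
# helper that assembles a human-readable data citation from it.
# Every value below is an invented placeholder.

metadata = {
    "creators": ["Doe, J.", "Smith, A."],
    "title": "Example geoscience dataset",
    "publisher": "Example Data Repository",
    "year": 2022,
    "doi": "10.0000/example.12345",  # placeholder identifier
}

def format_citation(meta):
    """Assemble an 'Authors (Year). Title. Publisher. DOI URL' citation."""
    authors = "; ".join(meta["creators"])
    return (f"{authors} ({meta['year']}). {meta['title']}. "
            f"{meta['publisher']}. https://doi.org/{meta['doi']}")

print(format_citation(metadata))
```

Because the record is structured rather than free text, a repository (or a script like this one) can generate the citation automatically, which is exactly what makes attribution cheap for data users.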
-
Regarding the sections of the article about metadata: It's been my experience in the GIS world that although metadata is frequently discussed and encouraged, it is rarely actually created, updated, or otherwise maintained. The exception would be large public datasets maintained by state and federal agencies, and not universally in those cases either. I'm definitely guilty of this; at my firm, we very rarely create official metadata for the numerous datasets we generate for every project, unless there is a specific instruction to do so. That being said, although my team does not create official metadata, we do have a detailed data management system and naming conventions in place, which can provide "meta metadata", indicating information like the theme/content of the data and the date it was produced. We typically find this is more than sufficient for our needs and the needs of our clients. A lot of the data my team and I generate is not the sort of data that would be published for research. It is typically at a local project scale and is more or less disposable after the project is completed. That brings up the question: At what threshold of "disposability" or significance should we consider generating meaningful metadata and citation documentation for the data we are generating? I've been an outsider to the academic and research world for a long time, and while I do recognize the importance of the issues around citation pointed out in the article, I have a hard time coming to grips with when and how they should be applied.
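The "meta metadata" idea above can be sketched in code: if file names follow a convention, basic metadata is recoverable from the name alone. The `project_theme_YYYY-MM-DD.ext` pattern here is an invented example, not any firm's actual convention:

```python
# Hypothetical sketch: recovering lightweight metadata from a file name
# that follows an invented naming convention (project_theme_YYYY-MM-DD.ext).

import re
from datetime import date

NAME_PATTERN = re.compile(
    r"(?P<project>[A-Za-z0-9]+)_(?P<theme>[a-z]+)_(?P<date>\d{4}-\d{2}-\d{2})\.\w+$"
)

def parse_dataset_name(filename):
    """Return project, theme, and production date encoded in the name."""
    m = NAME_PATTERN.match(filename)
    if m is None:
        return None  # name does not follow the convention
    y, mo, d = (int(x) for x in m.group("date").split("-"))
    return {"project": m.group("project"),
            "theme": m.group("theme"),
            "produced": date(y, mo, d)}

info = parse_dataset_name("RiverStudy_wetlands_2023-05-01.shp")
```

The trade-off is visible here too: a naming convention carries theme and date, but nothing about lineage, methods, or accuracy, which is what formal metadata would add.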
-
At the most basic level, it is important to correctly cite data so that the source (author, entity, institution, etc.) gets recognition and credit for their contribution to your work. Also, in providing a correct citation, you are giving others a map to the data repository or an avenue to access or request the data. This is important for reproducibility and for the inclusion of datasets in other works. Often the biggest issue is finding the data … or knowing that it even exists.
-
Focusing on metadata, I am currently dealing with an extreme lack of it from the last few people who had my job. I am creating a new version of the crustal magnetic field map, which is composed of thousands and thousands of magnetic surveys. Not only am I working with data going back to the 1940s without proper documentation, but I have no idea what the previous people did to create this map: there is no clean list of which data they used and which they didn't, or of the exact methods they used when merging datasets. There are published papers describing what they did, but they are pretty bare, and when it comes to reproducing their work, it's really not enough. I think that when writing code or working with data, you should construct it as though you are creating a guide for the next person. This helps you better understand what you did and why you did it, and lays the groundwork for future progress. And if you ever have to go back and look at your own work, do you really want to waste your time figuring out what you did? I think this is also a way of crediting your own work. Thankfully, as I work on this next map, I can hopefully follow the principles in the article and make sure the next person who makes the map isn't as upset with me as I am with the previous people, haha.
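The "guide for the next person" could be as simple as a provenance log written alongside the map: a record of which surveys were used or skipped, and how each was merged. This is a minimal sketch; the survey names, merge-method labels, and notes are invented placeholders, not the actual compilation workflow:

```python
# Hypothetical sketch of a provenance log for a compiled map: one entry
# per input survey, recording whether it was used, how it was merged,
# and why. All survey IDs and method labels are invented examples.

import json

provenance = {"product": "crustal_magnetic_map_v2", "steps": []}

def log_step(survey_id, used, method, note=""):
    """Append one decision (use/skip a survey, and how it was handled)."""
    provenance["steps"].append(
        {"survey": survey_id, "used": used, "method": method, "note": note}
    )

log_step("aeromag_1948_regionA", used=True, method="level-shift merge",
         note="pre-IGRF data; corrected to a common datum")
log_step("shipborne_1972_trackline", used=False, method="n/a",
         note="navigation too poor to co-register")

# Serializing the log and shipping it with the map is what makes the
# compilation reproducible for whoever inherits it next.
record = json.dumps(provenance, indent=2)
```

Even this much would answer the exact questions raised above: what went in, what was left out, and what was done to each piece.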
-
What can you do in your role to improve data management practices? You don't have to be a data manager to do so! I think the biggest thing I can do to improve my data management practices is better time management: prioritizing data organization and documentation from the beginning to the end (or the end of my part) of a project. As a project evolves, expands, and goes in new directions, I find it becomes overwhelming to adapt my file and data management to the new scope and needs of the project, and I often end up with files scattered throughout a confusing network of folders. I'm especially worried about this now that I'm switching from managing files entirely with a GUI file manager to using the terminal, and now that I have code that depends on files being in a certain place to run. If anyone has suggestions / resources related to flexible file organization, I would love to see them!
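One common way to keep code from breaking when files move is to resolve every path relative to a single project root rather than hard-coding locations. A minimal sketch, with invented folder names (`data/raw`, `data/processed`), follows:

```python
# Hypothetical sketch: build all paths from one project root so that
# relocating the project means changing a single line. Folder names
# below are invented examples of a raw/processed layout.

from pathlib import Path

# One place to change if the project is relocated. Falls back to the
# current working directory when run interactively.
PROJECT_ROOT = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"

def dataset_path(name, stage="raw"):
    """Build a path like <root>/data/raw/<name> without hard-coding it."""
    stage_dir = {"raw": RAW_DIR, "processed": PROCESSED_DIR}[stage]
    return stage_dir / name

p = dataset_path("surveys.csv", stage="processed")
```

The benefit is that scripts no longer care where the project lives; only the root is location-dependent, which makes the folder layout itself flexible as the project grows.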
-
Reading: Ten simple rules for getting and giving credit for data [Wood-Charlson, E. M., Crockett, Z., Erdmann, C., Arkin, A. P., & Robinson, C. B. (2022). Ten simple rules for getting and giving credit for data. PLOS Computational Biology, 18(9), e1010476. https://doi.org/10.1371/journal.pcbi.1010476]
Some questions to get you started: