Challenges in data curation #127

Closed
aimeehan1 opened this issue Jul 13, 2022 · 5 comments


@aimeehan1
Contributor

Comments from discussion 2022-07-13
Errors.

  • Observed increase in reporting errors. Examples: ECDC report (Argentina, Australia), Spain (computer reporting issue), Belgium (cases don't sum to total https://epidemio.wiv-isp.be/ID/Documents/Monkeypox/MPX_Update_12072022_FR.pdf), U.S. CDC (Illinois reporting errors), etc. Sometimes errors are acknowledged, other times data is changed without notice.

  • Pattern of inconsistencies among global/regional reports (e.g. WHO, PAHO, ECDC), and between those reports and country-level MOH reporting. Currently, curators identify a change in cases from these global/regional report updates and then look for secondary .gov (national/local) sources to verify it. If we cannot find a secondary source, we default to the global/regional report numbers. Examples: Mexico (PAHO reporting 27 cases, could not verify through MOH site, defaulted to PAHO number), Malta (WHO reporting 9 cases, could not verify through MOH site, defaulted to WHO number), etc. Reminder to curators that it's important to look for secondary sources of information.
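
A minimal sketch of the two checks described above, for illustration only: whether a report's components sum to its stated total (the Belgium case), and the fallback to the global/regional number when no national source can be found. The names `Report`, `Resolution`, `components_sum_to_total`, and `resolve_count` are hypothetical and are not part of the curation pipeline. The discussion does not say which figure wins when a national source disagrees with the regional report, so the sketch flags that case for a curator.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    source: str  # e.g. "PAHO", "WHO", or a national MOH site
    cases: int   # case count stated by that source


@dataclass(frozen=True)
class Resolution:
    cases: Optional[int]  # None when a curator has to decide
    status: str           # "verified", "unverified", or "conflict"


def components_sum_to_total(components: dict[str, int], total: int) -> bool:
    """Return True when the per-category counts add up to the stated total."""
    return sum(components.values()) == total


def resolve_count(regional: Report, national: Optional[Report]) -> Resolution:
    """Apply the fallback rule: use the regional number if nothing verifies it."""
    if national is None:
        # No secondary source found: default to the global/regional report.
        return Resolution(regional.cases, "unverified")
    if national.cases == regional.cases:
        return Resolution(regional.cases, "verified")
    # Assumption: disagreement is left for manual review, not auto-resolved.
    return Resolution(None, "conflict")
```

For the Mexico example, `resolve_count(Report("PAHO", 27), None)` returns the PAHO count with status `"unverified"`.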

Changes in reporting formats.

  • Change in cumulative case calculations. Some countries now include probable counts in totals (confirmed + probable = total). Examples: Belgium, Australia.

  • Standard reporting format no longer supports tracking of confirmed and/or suspected cases. Example: Brazil’s heat map, which displayed suspected case counts, changed to aggregate numbers, so we now track only confirmed cases.

  • "Active" versus "recovered/inactive" case status (inactive cases no longer have the clinical symptoms of monkeypox; they have recovered from acute illness). Examples: Italy and Andalusia have reported active case totals, but we are tracking cumulative totals. Reminder to curators to check cumulative counts (active + inactive). Due to limited metadata, we are not currently able to update individual case status to "recovered/inactive." https://www.rtvsol.es/noticias/andalucia/salud-y-familias-informa-de-que-actualmente-en-andalucia-hay-193-casos-activos-de-viruela-del-mono
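
The arithmetic behind the two count conventions above can be sketched as follows. This is illustrative only: the function names are hypothetical, and it assumes a source publishes the probable (or inactive) count separately so that the tracked figure can be recovered.

```python
def confirmed_from_total(total: int, probable: int) -> int:
    """Recover confirmed cases when a source reports confirmed + probable = total."""
    if probable < 0 or probable > total:
        raise ValueError("probable count must be between 0 and the total")
    return total - probable


def cumulative_from_status(active: int, inactive: int) -> int:
    """Cumulative cases when a source splits counts into active and inactive."""
    if active < 0 or inactive < 0:
        raise ValueError("counts must be non-negative")
    return active + inactive
```

If a source publishes only the active total (for example the 193 active cases in the Andalusia link), the cumulative figure cannot be derived from that number alone; the inactive count has to come from another report.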

@aimeehan1
Contributor Author

aimeehan1 commented Jul 26, 2022

On 2022-07-26, G.h observed a discrepancy in U.S. confirmed case data between two pages on the CDC's website:

U.S. Map & Case Count page at 3,487
Global Map page at 3,846
Both pages reporting data as of 2022-07-25.

https://www.cdc.gov/poxvirus/monkeypox/response/2022/us-map.html
https://www.cdc.gov/poxvirus/monkeypox/response/2022/world-map.html
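
As a rough illustration, a comparison like this can be automated by checking counts only when both pages state the same "as of" date. The `PageCount` type and `count_gap` function are hypothetical names; the figures are the ones quoted above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PageCount:
    page: str    # label of the web page
    cases: int   # confirmed cases shown on that page
    as_of: date  # "data as of" date shown on that page


def count_gap(a: PageCount, b: PageCount) -> Optional[int]:
    """Absolute difference in cases, or None if the as-of dates differ."""
    if a.as_of != b.as_of:
        return None  # not comparable: the pages describe different days
    return abs(a.cases - b.cases)


us_map = PageCount("U.S. Map & Case Count", 3487, date(2022, 7, 25))
global_map = PageCount("Global Map", 3846, date(2022, 7, 25))
print(count_gap(us_map, global_map))  # 359
```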

[Images: U.S. Map, Global Map]

@lisphilar

Thank you for providing the dataset!

Sorry for jumping in, but I tried to create a pandas.DataFrame with cumulative number of confirmed/recovered/fatal cases using your linelist data.
https://gist.github.com/lisphilar/23d23f8692f70f2663a6c4890758a7ab

I assumed the following. Is my understanding correct?

  • Status = confirmed or suspected: confirmed cases
  • Status = discarded or omit_error: lines which should be removed from subsequent analysis
  • Outcome = numpy.nan: active or "Status = omit_error" cases
  • Outcome = Recovered: recovered cases
    • recovery date is recorded in Date_last_modified column
    • only five recovered cases at this time; this may be because of the limitation @aimeehan1 mentioned: "Due to limited metadata, we are not currently able to update individual case status to recovered/inactive."
  • Outcome = Deaths: fatal cases
  • "active" means "confirmed but not recovered/fatal" cases
  • "inactive" means "recovered or fatal" cases
  • minimum date of ["Date_onset", "Date_confirmation", "Date_hospitalisation", "Date_isolation", "Date_death", "Date_last_modified"] can be regarded as the infection date (This may require discussion for aggregation.)
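
The assumptions listed above can be sketched in pandas roughly as follows. This is not the code from the linked gist and the maintainers had not confirmed these assumptions in this thread. The column names and the `Status`/`Outcome` values are taken from the list above; the function name `cumulative_counts` is hypothetical, and dating fatal cases by `Date_death` is an extra assumption.

```python
import pandas as pd

DATE_COLS = [
    "Date_onset", "Date_confirmation", "Date_hospitalisation",
    "Date_isolation", "Date_death", "Date_last_modified",
]


def cumulative_counts(linelist: pd.DataFrame) -> pd.DataFrame:
    """Cumulative Confirmed/Recovered/Fatal/Active counts indexed by date."""
    # Drop lines marked discarded or omit_error.
    df = linelist[linelist["Status"].isin(["confirmed", "suspected"])].copy()
    for col in DATE_COLS:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # Earliest available date stands in for the infection date.
    df["Date"] = df[DATE_COLS].min(axis=1)
    df = df.dropna(subset=["Date"])

    daily = pd.concat(
        {
            "Confirmed": df["Date"].value_counts(),
            # Recovery date is assumed to be recorded in Date_last_modified.
            "Recovered": df.loc[
                df["Outcome"] == "Recovered", "Date_last_modified"
            ].value_counts(),
            # Assumption: fatal cases are dated by Date_death.
            "Fatal": df.loc[df["Outcome"] == "Deaths", "Date_death"].value_counts(),
        },
        axis=1,
    )
    cumulative = daily.fillna(0).astype(int).sort_index().cumsum()
    # "Active" means confirmed but not yet recovered or fatal.
    cumulative["Active"] = (
        cumulative["Confirmed"] - cumulative["Recovered"] - cumulative["Fatal"]
    )
    return cumulative
```

Rows whose date columns are all empty are dropped, and fatal cases without a `Date_death` are not counted as fatal, so the result is only as complete as the dates in the line list.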

Is it possible to provide recovered/fatal data as well as confirmed?
Total population and cumulative numbers of confirmed/recovered/fatal cases are very useful for data analysis. I developed a Python library for COVID-19 data, named CovsirPhy, and analysed such data with mathematical models.

@jim-sheldon
Collaborator

jim-sheldon commented Sep 1, 2022

@lisphilar You're welcome!

Please do not apologize for jumping in; we made our work open source because we want your input!

Would you kindly open a new issue in this repository for the problem you described? This allows us to keep all "epics", features, and bugfixes discrete.

I would also refer you to our data dictionary, which might help answer some of your questions.

@lisphilar

@jim-sheldon Thank you for your quick response!

I have just created four issues (#177, #178, #179, #180) and look forward to discussing them with you and your team there.

@abhidg
Contributor

abhidg commented Sep 26, 2022

Line list is discontinued as of 2022-09-22

@abhidg abhidg closed this as completed Sep 26, 2022