Predicting the incidence of COVID-19 using data mining

Affiliations.

  • 1 Department of Computer Engineering, School of Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran.
  • 2 School of Medicine, Shahroud University of Medical Sciences, Shahroud, Iran. [email protected].
  • PMID: 34098928
  • PMCID: PMC8182740
  • DOI: 10.1186/s12889-021-11058-3

Background: The high prevalence of COVID-19 has made it a new pandemic. Predicting both its prevalence and incidence throughout the world is crucial to help health professionals make key decisions. In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease.

Methods: The COVID-19 datasets provided by Johns Hopkins University contain information on COVID-19 cases in different geographic regions since January 22, 2020 and are updated daily. Data from 252 such regions were analyzed as of March 29, 2020, comprising 17,136 records and 4 variables: latitude, longitude, date, and case records. To design the incidence pattern for each geographic region, information on the region and its neighboring areas gathered over the preceding 2 weeks was used. A model was then developed to predict the incidence rate for the coming 2 weeks via a Least-Square Boosting Classification algorithm.

Results: The model was presented for three groups based on the incidence rate: less than 200, between 200 and 1000, and above 1000. The mean absolute errors of the model for these groups were 4.71, 8.54, and 6.13%, respectively. Comparing the forecast results with the actual values over the period in question also showed that the proposed model predicted the number of globally confirmed COVID-19 cases with a very high accuracy of 98.45%.

Conclusion: Using data from different geographical regions within a country and discovering the pattern of prevalence in a region and its neighboring areas, our boosting-based model was able to accurately predict the incidence of COVID-19 within a two-week period.

Keywords: COVID-19; Data mining; Predicting; Prevalence.


Data mining and machine learning techniques for coronavirus (COVID-19) pandemic: A review study


Alaan Ghazi, Muthana Alisawi, Layth Hammood, Sirwan Saber Abdullah, Aras Al-Dawoodi, Abbas Hussein Ali, Ashraf Nabeel Almallah, Nidhal Mohsin Hazzaa, Yousif Mohammed Wahab, Asmaa Yaseen Nawaf; Data mining and machine learning techniques for coronavirus (COVID-19) pandemic: A review study. AIP Conf. Proc. 29 September 2023; 2839 (1): 040010. https://doi.org/10.1063/5.0167882


This systematic study examines data mining and machine learning (ML) methods in depth. Growing interest in AI development has already helped address several medical issues, but the danger this virus presents to global public health means existing applications remain inadequate. According to this systematic study, data mining and ML methods may be used to automatically identify and diagnose COVID-19. Our goals are to gain a comprehensive understanding of this dangerous virus, overcome the constraints of data mining and ML methods, and make this technology available to the medical community. Three databases indexed in Scopus and Clarivate (IEEE Xplore, Web of Science, and ScienceDirect) were used in our research. Around 1,305 papers published between 2020 and 2022 were retrieved from these databases, after which precise exclusion criteria and a selection technique were applied. MERS-CoV, SARS-CoV, and SARS-CoV-2 (including the recent Omicron variant) are all members of the coronavirus family, and this research examines the most recent state-of-the-art procedures for each.


Analyzing COVID-19 Dataset through Data Mining Tool “Orange”


Mining Medical Data to Improve COVID-19 Treatment

Study Reveals Medications Associated With Lower Odds of Severe Infection

By Brandon Levy

Tuesday, June 28, 2022


By analyzing healthcare data on hundreds of thousands of people, IRP researchers have found clues that certain existing medications might be useful for combating COVID-19.

Many researchers studying COVID-19 have spent the past two years poring over test tubes and isolated cells. However, large troves of data about people’s interactions with the healthcare system can also be a rich source of useful insights. Using one such database, IRP researchers found that older adults taking certain medications were less likely to catch COVID or experience severe repercussions from the virus. 1

The vast majority of Americans ages 65 and older are enrolled in the federal government health insurance program known as Medicare. As a result, the U.S. government accrues mountains of health-related information on people in this age group, which is ‘de-identified’ so that it cannot be connected to specific individuals. What’s more, over 80 percent of the people in the U.S. who have died from COVID-19 were 65 or older, making Medicare data particularly useful for researchers who want to understand the factors that influence COVID-related health risks.

Those curious researchers include IRP senior investigator Clement McDonald, M.D. , and IRP staff scientist Kin Wah Fung, M.D. , who recently mined Medicare data to see whether older adults taking certain medications were more or less likely to be diagnosed with COVID, be hospitalized with the virus, or die from it. The dataset they analyzed included information on more than 374,000 older adults who received a COVID-19 diagnosis between April 1 and December 31, 2020, including which medications each person was taking during that period and when they were prescribed those drugs.

“COVID doesn’t happen in a vacuum,” says Dr. Fung, the new study’s first author. “Particularly in elderly patients, they may suffer from a host of chronic conditions and they may be taking various drugs because of those conditions. Either those conditions themselves or the drugs they’re taking may affect the susceptibility of a patient to COVID.”


Rather than give a drug to volunteers to determine its effects, as would be done in a clinical trial, the IRP study took advantage of the fact that many older adults already take certain medications on a regular basis.

Nearly three-quarters of the people included in the data who were diagnosed with COVID-19 during that timespan were taking at least one of the drugs Dr. Fung and Dr. McDonald were interested in. They specifically focused on medications that either influence biological pathways that are also affected by the novel coronavirus that causes COVID-19 or had been previously studied for their effects on the illness.

“In some of the prior studies, the findings conflicted with each other, so there was no definitive conclusion as to whether these drugs were beneficial or harmful,” Dr. Fung says. “Also, because we have a huge dataset, the number of patients in our study is at least several times more than in most of the published studies.”

“The special strength of the Medicare database, as opposed to some of the commercial healthcare databases and others, is you’ve got practically everybody,” adds Dr. McDonald, the study’s senior author.

After analyzing the Medicare data with the help of IRP biostatistician Seo Hyon Baik, Ph.D., the researchers found that several types of drugs were associated with lower odds of being diagnosed with COVID-19 or experiencing a severe infection. These promising drugs included statins taken to reduce blood cholesterol levels; medications like warfarin that reduce the risk of blood clots to help prevent heart attacks and strokes; and angiotensin-converting enzyme (ACE) inhibitors and angiotensin receptor blockers (ARBs), which treat high blood pressure by reducing the effects of a chemical in the body called angiotensin II. The finding about ACE inhibitors and ARBs was particularly intriguing because, early in the pandemic, it was thought that those drugs might worsen COVID-19 infection by increasing the abundance of the ACE2 receptor, which angiotensin II binds to in order to exert its effects on blood vessels.


This diagram shows how the novel coronavirus that causes COVID-19 uses the ACE2 receptor to infect cells.

“The ACE2 receptor is a doorway through which COVID enters the cell,” Dr. McDonald explains. “There was some thought that if you had more of it, then more virus could get into cells, but that wasn’t the way it turned out.”

“Because of all the complex interactions between the drugs, the virus, and the cell, it’s very difficult to know what exactly will happen in a patient,” adds Dr. Fung. “Until you have a big study like ours or you have a really well-designed clinical trial, it’s really difficult to answer what the net outcome will be when you tinker with the angiotensin pathway. That’s how science is: often there are multiple pathways and there are many possible ways that the effects of a drug or virus can change the pathway, so you go back to the roots — you document the outcomes and that is the ground truth that will show whether your hypothesis is actually true or not.”

The IRP researchers also looked at the effects of hydroxychloroquine, which has long been used to treat malaria, rheumatoid arthritis, and chronic autoimmune conditions. Using hydroxychloroquine to treat COVID-19 has been controversial ever since the U.S. Food and Drug Administration authorized it for that purpose on an emergency basis in March 2020 and then revoked that authorization three months later. At first glance, the IRP study suggested that hydroxychloroquine actually increased the odds of being diagnosed with COVID-19 by more than 60 percent, but when the IRP team adjusted its analysis to exclude people who only began taking the drug in 2020 — possibly because they thought it would help ward off COVID-19 rather than due to medical need — it turned out that hydroxychloroquine had no effects on COVID-19 diagnosis, hospitalization, or death.

“We know now that hydroxychloroquine does not protect you, but if these people had a false sense of security because they thought they were taking a drug that would protect them from COVID, they might have been less likely to practice measures like masking, and so they may have been more likely to catch COVID,” Dr. Fung says. “If you’re not careful in designing your study, you can come to false conclusions like hydroxychloroquine being associated with increased risk of catching COVID.”
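The new-user adjustment described above can be illustrated with a short sketch. This is a hedged, hypothetical example: the column names and dates are invented, not the study's actual schema, and the real analysis involved far more covariates.

```python
# Illustration of excluding "new users": drop patients whose first
# hydroxychloroquine prescription falls in 2020, since they may have
# started the drug hoping to ward off COVID-19 rather than for a
# pre-existing condition. Column names are hypothetical.
import pandas as pd

rx = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "drug": ["hydroxychloroquine"] * 4,
    "first_rx_date": pd.to_datetime(
        ["2018-05-01", "2018-05-01", "2020-04-15", "2019-11-20"]),
})

# Established users: first prescription predates the pandemic year.
first_rx = rx.groupby("patient_id")["first_rx_date"].min()
established_users = first_rx[first_rx < "2020-01-01"].index
cohort = rx[rx["patient_id"].isin(established_users)]
print(sorted(cohort["patient_id"].unique().tolist()))  # [1, 3]
```

Patient 2, who started the drug in April 2020, is excluded; with this restriction in place, the study found no effect of hydroxychloroquine on COVID-19 diagnosis, hospitalization, or death.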


The study was led by Dr. Kin Wah Fung (left) and Dr. Clement McDonald (right).

Ultimately, the information gained from the study and similar endeavors will provide important leads for researchers looking to leverage existing medications as potential COVID-19 treatments. In addition, such studies demonstrate the value of the federal government’s efforts to collect a wide variety of anonymized medical information about American citizens.

“We are more and more talking about a ‘learning’ healthcare system,” Dr. Fung says. “Even during routine healthcare, we collect data, and that data can actually feed back into knowledge discovery and hypothesis generation. This is one mechanism through which the huge enclave of Medicare claims data can be used in research. It’s a very powerful and useful dataset.”


References:

[1] Fung KW, Baik SH, Baye F, Zheng Z, Huser V, McDonald CJ. Effect of common maintenance drugs on the risk and severity of COVID-19 in elderly patients. PLoS One. 2022 Apr 18;17(4):e0266922. doi: 10.1371/journal.pone.0266922.



Data mining tools combat COVID-19 misinformation and identify symptoms

Computer scientists use Google Trends and a government dataset to track symptoms and sift through misinformation


UC Riverside computer scientists are developing tools to help track and monitor COVID-19 symptoms and to sift through misinformation about the disease on social media.


Using Google Trends data, a group led by Vagelis Papalexakis, an associate professor in the Marlan and Rosemary Bourns College of Engineering, and Jia Chen, an assistant professor of teaching, developed an algorithm that identified three symptoms unique to COVID-19 compared to the flu: ageusia (loss of the tongue's taste function), shortness of breath, and anosmia (loss of smell). The algorithm was developed in collaboration with two graduate students, Md Imrul Kaish and Md Jakir Hossain, at the University of Texas Rio Grande Valley.

“Much of the work using Google Trends for flu has focused on forecasting the flu season,” Papalexakis said. “We, on the other hand, used it to see if we could find a needle in a haystack: symptoms unique to COVID-19 among all the flu-like symptoms people search for.”

The researchers located symptoms on Google Trends for 2019 and 2020 and used a technique they called nonnegative discriminative analysis, or DNA, to extract terms that were unique to one dataset relative to the other. 


“We assumed that symptom searches in 2019 would lead to influenza or other respiratory ailments, while searches for the same symptoms in 2020 could be either,” Chen said. “Using DNA, we were able to find the difference between the two datasets. This happened to be terms clinicians have already identified as unique to COVID-19, showing that our approach works.”
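A much-simplified stand-in for the idea above: score each search term by how much of its interest volume is unique to the 2020 data relative to 2019. This is only a sketch under invented data; the researchers' actual method is a factorization-based nonnegative discriminative analysis, not this simple ratio.

```python
# Toy illustration: find terms whose search interest grew sharply in
# 2020 relative to 2019. Search volumes below are hypothetical, not
# real Google Trends values.
import numpy as np

terms = ["fever", "cough", "loss of smell", "loss of taste",
         "shortness of breath"]
trends_2019 = np.array([80.0, 70.0, 2.0, 1.0, 5.0])   # hypothetical volumes
trends_2020 = np.array([85.0, 75.0, 60.0, 55.0, 40.0])

# Relative increase, clipped at zero so terms that declined score 0.
eps = 1e-9
score = np.maximum(trends_2020 - trends_2019, 0) / (trends_2019 + eps)

# The highest-scoring terms are the candidate COVID-19-specific symptoms.
for term, s in sorted(zip(terms, score), key=lambda p: -p[1])[:3]:
    print(term, round(float(s), 1))
```

With these toy numbers, the taste- and smell-loss terms dominate while common flu terms like "fever" barely move, mirroring the clinician-confirmed result reported above.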

Papalexakis and Chen expect their work will help epidemiologists and other public health experts track and monitor COVID-19 using Google Trends as a proxy for hospital data.

“Google trends data is very noisy, but hospital data is not publicly available. People might search for symptoms because they are experiencing them or because they have heard of them and want to know more,” Papalexakis said. “Searches reflect interest in symptoms better than people actively experiencing them, but given the lack of other data, we think this tool could help researchers understand symptoms better.”

Chen said that the algorithm is simple and easy to implement as part of a potential tool that can help scientists researching other diseases learn about potential symptoms. 

The paper, “COVID-19 or Flu? Discriminative Knowledge Discovery of COVID-19 Symptoms from Google Trends Data,” was presented at epiDAMIK 2021, a workshop on data mining for advancing epidemiological knowledge. The workshop was organized as part of the largest annual data science conference, the Association for Computing Machinery’s, or ACM, Special Interest Group on Knowledge Discovery and Data Mining. The paper is available here .

Papalexakis and UC Riverside doctoral student William Shiao are also developing a tool that not only identifies COVID-19 misinformation but shows why the information is flagged as false in relation to a database of scientific articles about research on coronaviruses.

Papalexakis and Shiao used 90,000 articles from the COVID-19 Open Research Dataset Challenge (CORD-19) prepared by the White House and a coalition of research groups, and collected 20,000 articles "in the wild" with misinformation about the novel coronavirus. Using a similarity matrix-based embedding method they called KI2TE, the articles were linked to a set of reference documents and interpreted. The documents used for reference were a set of academic papers on coronavirus research included in the CORD-19 dataset.
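The linking step can be sketched with the simplest possible analogue: embed a suspect article and the reference papers in a shared TF-IDF space and rank references by cosine similarity. This is a hedged toy example, not the KI2TE method itself, which uses a similarity-matrix-based embedding; all documents below are invented.

```python
# Toy sketch: link a suspect article to its most relevant reference
# document via TF-IDF cosine similarity. Texts are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

references = [
    "spike protein binds the ACE2 receptor enabling cell entry",
    "hydroxychloroquine shows no benefit in randomized trials",
    "mask wearing reduces droplet transmission of respiratory viruses",
]
suspect = "miracle drug hydroxychloroquine cures covid in one day"

# Fit one vocabulary over all documents so vectors share a space.
vec = TfidfVectorizer().fit(references + [suspect])
sims = cosine_similarity(vec.transform([suspect]),
                         vec.transform(references))[0]
best = sims.argmax()  # index of the most relevant reference document
print(best)
```

The retrieved reference is what lets the system not just flag a story as false but point the reader to the scientific source that contradicts it.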

When tested on articles that had been labeled by humans as false or identified by Google Fact Check as false, their method not only correctly identified the false stories but also pointed to the scientific sources that corroborated the system’s decision.

“We are not interested in censoring what people see. We want to go beyond hiding something altogether or simply showing a warning label,” Papalexakis said. “We want to also show them sources to educate them.”

Although the tool developed by Papalexakis and Shiao is a prototype under active research development, it could eventually be incorporated into a smartphone app or into social media platforms like Facebook. 

The results of this research were presented at the “Knowledge Graphs for Online Discourse Analysis” workshop organized as part of the ACM Web Conference, and the paper, “KI2TE: Knowledge-Infused InterpreTable Embeddings for COVID-19 Misinformation Detection,” is available here .


COVID-19 Roundup: Dashboards, Datasets, Data Mining & More


As the COVID-19 pandemic sweeps the globe, big data and AI have emerged as crucial tools for everything from diagnosis and epidemiology to therapeutic and vaccine development. Here, we collect the latest news in how big data is fighting back against COVID-19.


Researchers at Sandia National Laboratories have assembled a combination of data mining, machine learning algorithms, and compression-based analytics to help the research community comb through the tens of thousands of articles published about COVID-19 since the pandemic began. In their initial test, they filtered 29,000 studies down to 87 relevant papers within ten minutes. To read more, click here .

Researchers call for better data collection to support COVID-19 behavioral research

Researchers writing in the American Journal of Preventive Medicine have issued an interdisciplinary call to action to strengthen the behavioral science response to COVID-19 through more funding and better data collection. To read more, click here .

PerkinElmer launches open-access COVID-19 data dashboards for therapeutics research

PerkinElmer has announced that it has launched two online, open-access COVID-19 data dashboards to help accelerate the development of COVID-19 antivirals and vaccines. The first dashboard, the PerkinElmer COVID-19 Drug Compound Dashboard, helps scientists narrow down drug compounds; the second dashboard, the PerkinElmer COVID-19 Clinical Trial Dashboard, helps them sort through clinical trials. To read more, click here .


Developers at the Georgia Tech Research Institute have helped bring COVID-19 data dashboards to the Centers for Disease Control and Prevention (CDC). The first dashboard, which is now publicly available , tracks the relationship between human mobility and COVID-19 transmission area-by-area. To read more, click here .

DARPA taps Duality Technologies to develop machine learning for COVID-19 research

Privacy technology provider Duality Technologies has been tapped by the Defense Advanced Research Projects Agency (DARPA) to develop a privacy-preserving machine learning tool that can be applied to researching genomic susceptibility to severe COVID-19 symptoms. “This contract award supports Duality’s commitment to promoting responsible, privacy-preserving use of big data for the public good and for national security,” said Dr. Kurt Rohloff, Co-founder and CTO, Duality Technologies. To read more, click here .

University of California Health creates centralized dataset to accelerate COVID-19 research

University of California Health has developed a unified, secure dataset for use in COVID-19 research using electronic health records from across its academic health system. Containing more than 460 million data points, the dataset is accessible to researchers across the California university system, helping researchers and doctors compare treatment options from previous cases to better treat future patients. To read more, click here .

Report finds that COVID-19 stay-at-home orders drove changes in smartphone behavior

Strategy Analytics research has found that the app usage patterns of smartphone users in the United States significantly changed when those users were faced with COVID-19-driven stay-at-home orders. App use increased by more than six hours per month, predominantly due to growth in use of social networking apps and mobile browsing. To learn more, click here .

Do you know about big data applications for COVID-19 that should be featured on this list? If so, send us an email at  [email protected] . We look forward to hearing from you.



February 12, 2024


Using citizens' data securely in research: COVID-19 data donation projects show how it can be done

by Viktoria Bosak, Dresden University of Technology


Smartphones, smartwatches and associated apps are constantly improving their ability to record and store personal health data. The initial proposal for the EU law for a European Health Data Space in 2022 would allow depersonalized health and wellness data to be shared without explicit consent in the future. There has been understandable pushback against it—not just from data protection officers.

In an article published in npj Digital Medicine, Professor Stephen Gilbert, EKFZ for Digital Health, and Professor Dirk Brockmann, Center Synergy of Systems, discuss how medical data from citizens could be used for research in the future while respecting personal rights.

Their proposed solutions are based on experiences with data donation projects during the COVID-19 pandemic, which signpost a participative, standardized, scalable and consent-based approach to data sharing.

An increasing number of people are using wellness and health apps that measure, interpret and store a variety of parameters such as activity, metabolites, electrical signals, blood pressure and oxygenation. This data is not just of personal interest; it is also of great importance for medical research. The analysis of such citizen-gathered health data in conjunction with clinical data could help to improve understanding of diseases, their development and early diagnosis. It is also an important basis for research, especially for optimizing predictions based on deep learning and other artificial intelligence methods.

During the COVID-19 pandemic, several data donation projects were initiated in Germany, the U.K. and the U.S. These projects showed that citizens were willing to participate and share their data. The prerequisite was that they were able to decide for themselves when to share which data and were given the opportunity for consent withdrawal and to stop participation at any time.

"It is not ethically acceptable or politically sustainable to harvest more and more personal data from citizens by default with every new smart product—especially without asking for their consent first," says Gilbert, co-author of the article. As a solution the researchers propose the use of a trusted and secure externally provided consent platform.

This would allow users to understand with whom, where and for what purpose they are sharing their health data. Active engagement further increases the likelihood that data will be shared over a longer period of time. Researchers' positive and instructive experiences with voluntary data donation during the COVID-19 pandemic should now be used to find long-term solutions.

"In the future, the use of personal health data for research will only work if all participants are fully aware and consenting and can also withdraw their decision at any time. Our work has shown that citizens understand the benefits they can bring to society through volunteering their health and wellness data," says Brockmann, Director of the Center Synergy of Systems at TU Dresden.



COVID-19 Data Mining Project

liuyal/Project_COVID19

Project COVID-19: A Data Mining Study on COVID-19 Pandemic Growth & Related Social Media Dynamics

Introduction

Coronavirus disease 2019, or COVID-19, is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. First identified in December 2019 in Wuhan, China, the COVID-19 virus has resulted in nearly 15 million confirmed cases globally (as of the writing of this report) [2]. Amid the COVID-19 crisis, social media usage on platforms such as Facebook, WhatsApp, and Twitter has surged significantly [3]. The general population relies heavily on social media platforms to gather the latest information regarding the pandemic, resulting in an unprecedented amount of content and information.

An interesting data mining topic for the COVID-19 pandemic is the relationship between COVID-19-related trending topics and sentiments on the social media platform Twitter and the number of reported confirmed cases for a given country over a period of time. Topic modeling and tweet sentiment classification could be useful measures of the general attitude expressed toward COVID-19 by a given population. Since the pandemic originated in Asia and, within roughly three months, spread rapidly across North America, bringing the global number of confirmed cases to more than 10 million [2], it would be very interesting to follow the change in daily trending topics of interest and tweet sentiments due to the influx of confirmed cases as COVID-19 begins to spread in a particular country or region.

Measuring the relationship between the growth of the pandemic and social media topics and sentiments can help determine how a general population expresses its opinions, concerns, and general awareness throughout a global event. Such sentiment information and topic modeling can be used by government bodies around the world to gauge the general population’s attitude toward the pandemic as it grows, and to inform how to issue proper procedures and implement restrictions in times of crisis for future global events or pandemics.

Architecture & Pipeline


The data mining architecture follows the basic Knowledge Discovery and Data Mining (KDDM) process, where the entire workflow is split into five phases: starting with data collection to obtain the raw datasets using the data collector, then data preprocessing using the data formatter to prepare the datasets for transformations such as tokenization. Data mining methods such as sentiment classifiers and topic modeling are then performed on the transformed dataset, and the results are visualized for knowledge comprehension.
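The five-phase flow can be sketched in a few lines of Python. This is a toy, lexicon-based sentiment scorer for illustration only; the project's actual classifier, lexicons, and dataset schema are not specified here, so all names and word lists below are invented for the example.

```python
# Minimal sketch of the five KDDM phases described above.
import re
from collections import Counter

# Tiny illustrative lexicons (hypothetical; not the project's real resources).
NEGATIVE = {"fear", "death", "crisis", "lockdown", "sick"}
POSITIVE = {"recover", "hope", "vaccine", "safe", "better"}

def tokenize(text):
    """Phase 3 (transformation): lowercase and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(tokens):
    """Phase 4 (mining): crude polarity score from the fixed lexicons."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Phases 1-2 (collection + formatting): a stand-in for collected tweets.
tweets = [
    "Lockdown extended again, so much fear and crisis",
    "Vaccine news gives me hope, things will get better",
    "Cases reported today",
]

# Phase 5 (visualization/comprehension): summarize label counts.
labels = Counter(sentiment(tokenize(t)) for t in tweets)
print(labels)
```

In the real pipeline, the lexicon scorer would be replaced by a trained classifier and the tweet list by the collected Twitter data, but the phase boundaries stay the same.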


The first spike of negative-sentiment tweets occurs near the end of February, when COVID-19 growth started to manifest rapidly in areas outside of China. The second spike runs from mid-May onwards and can be correlated with the explosive growth within the United States. As the number of confirmed cases increased from around 1.5 million to 4.5 million in the span of two months, so did the frequency of negative-sentiment tweets. The day-to-day increase in the number of negative-sentiment tweets can be observed as the severity and number of confirmed cases of the COVID-19 pandemic increased.
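The kind of co-movement described above can be quantified with a Pearson correlation between the two daily series. The numbers below are made up for the sketch; the project's real series come from the JHU CSSE case data and the Twitter datasets listed under "Dataset Links".

```python
# Pearson correlation between daily new confirmed cases and daily
# counts of negative-sentiment tweets (both series hypothetical).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

daily_new_cases = [1200, 1500, 2100, 2600, 3400, 4100]  # hypothetical
negative_tweets = [300, 360, 500, 640, 800, 950]        # hypothetical

r = pearson(daily_new_cases, negative_tweets)
print(f"Pearson r = {r:.3f}")  # close to +1 for these co-rising series
```

A correlation near +1 would be consistent with negative sentiment tracking case growth, though it does not by itself establish any causal link.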

The full report can be found HERE

  • Input Twitter API credentials to twitter.token ( GUIDE )
  • Run the master script with Python 3.6+: python 0_project_covid19.py
References

  • [1] M. Clinic, “Coronavirus disease 2019 (COVID-19),” Mayo Clinic, 16-Jun-2020. [Online]. Available: https://www.mayoclinic.org/diseases-conditions/coronavirus/symptoms-causes/syc-20479963 . [Accessed: 25-Jun-2020].
  • [2] L. Gardner, “Mapping COVID-19,” JHU CSSE. [Online]. Available: https://systems.jhu.edu/research/public-health/ncov/ . [Accessed: 25-Jun-2020].
  • [3] R. Holmes, “Is COVID-19 Social Media's Levelling Up Moment?” Forbes, 24-Apr-2020. [Online]. Available: https://www.forbes.com/sites/ryanholmes/2020/04/24/is-covid-19-social-medias-levelling-up-moment/#93725e96c606 . [Accessed: 25-Jun-2020].
  • [4] E. Chen, K. Lerman, and E. Ferrara, “Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set,” JMIR Public Health and Surveillance. [Online]. Available: https://publichealth.jmir.org/2020/2/e19273/ . [Accessed: 02-Jul-2020].

Dataset Links

  • Johns Hopkins University (CSSE) COVID-19 Data Repository
  • Daily COVID-19 Twitter ID Data Repository
  • T4SA Twitter Training Data


UConn Today


February 15, 2024 | Mac Murray

Pandemic Journaling Project Archive Opens for Research

A repository of data detailing the personal experiences of more than 1,800 people living during the COVID-19 pandemic is available to researchers for the first time

A globe in a museum display case, with red pins marking locations.

A globe signalling Pandemic Journaling Project participants' 55 countries of origin was on display at PJP's Picturing the Pandemic exhibition (https://picturingthepandemic.org/), which opened at the Hartford Public Library on Oct. 27, 2022. From Hartford, the exhibition continued on to Providence, Rhode Island; Heidelberg, Germany; Mexico City, Mexico; and Toronto, Canada. (Sydney Herdle/UConn Photo)

Today, data from the groundbreaking Pandemic Journaling Project (PJP), headed by anthropologists Sarah S. Willen (UConn) and Katherine A. Mason (Brown University), are being made available to researchers from all disciplines. Researchers can now apply to work with the data via their forever virtual home at Syracuse University’s Qualitative Data Repository (QDR), thanks to the team’s long-term collaboration with QDR Associate Director Sebastian Karcher.


“Each journal, and each person’s story, matters on its own terms,” says Willen, who also co-directs the Research Program on Global Health and Human Rights at the Gladstein Family Human Rights Institute. “Taken together, this collection of materials offers a unique and extraordinary window onto how the pandemic challenged us and changed us, not just as individuals but also as members of broader communities.”     

Willen is excited about both the short-term and long-term possibilities for the data. Already, the PJP research team has begun using PJP data to explore a wide variety of questions, ranging from the impact of the pandemic on different groups’ mental health, to students’ experiences during COVID-19 around the globe, to the human rights dimensions of the project itself, which the team describes as a form of “archival activism.”

But the project’s full significance as a pandemic time capsule may not become clear until twenty, forty, or even a hundred years from now, Willen says – especially after the COVID-19 pandemic has faded from living memory.  

“Our hope is that one day people with no personal memory of the pandemic will dig into the archive and find voices of real people – either by reading their journal entries , or literally hearing their voices – and that those encounters will illuminate the incredible fear, pain, and loss – and sometimes solidarity, joy, and relief – people around the world found themselves feeling,” Willen says.  

In addition to recording journal entries, PJP participants also reported on how their physical and mental health, as well as their level of trust in government and social institutions, changed throughout the pandemic. The data collection began extraordinarily early – on May 29, 2020, less than 80 days after the WHO declared the pandemic.  


“We knew from the very beginning that we wanted to share these data widely and archive them for the future,” says Mason. “We were so lucky to be able to connect with Sebastian and his [QDR] team early in the process so that we could make sure that we were able to do so as ethically and effectively as possible.”   

The resulting collection is unique in its breadth and potential for use and re-use by researchers in anthropology, sociology, history, and public health, among other fields.   

“There are several large quantitative surveys in the social sciences that are broadly used,” says Karcher, “but qualitative datasets that are large enough to be analyzed from so many different angles are very, very rare. The PJP data are going to be a treasure for social scientists for years to come, and we at QDR couldn’t be more excited to be the permanent home for them.”   

To protect participants, access to the full PJP data on QDR requires prior approval; requests can be submitted directly from the dataset’s page on QDR. However, a significant subset of more than 2,000 entries is already available for searching and browsing on the Featured Entries page of the PJP website. There, anyone can spend time with the multiplicity of recorded stories.

The team believes the firsthand, qualitative nature of the data can play a crucial role in conveying the true human costs of the pandemic to future generations.  

“It’s one thing to read newspaper accounts or policy documents from a particular moment in time, but it’s something very different to hear people talk about the extraordinary loneliness that they’re experiencing because they haven’t used their voice in 36 hours,” Willen says. “Or about the incredible sense of loss you feel when milestones arrive –birthdays, the birth of a new baby, a funeral – and you can’t be together with loved ones. Without first-person accounts about moments like these, it’s extremely difficult to understand or communicate the experience of living through a massive global crisis.”  

Learn more about the Pandemic Journaling Project in the Spring 2024 issue of UConn Magazine , out February 20.


  • Open access
  • Published: 14 February 2024

Harmonizing government responses to the COVID-19 pandemic

  • Cindy Cheng   ORCID: orcid.org/0000-0002-7345-827X 1 ,
  • Luca Messerschmidt   ORCID: orcid.org/0000-0001-8069-381X 1 ,
  • Isaac Bravo 1 ,
  • Marco Waldbauer 1 ,
  • Rohan Bhavikatti   ORCID: orcid.org/0009-0004-9295-7551 2 ,
  • Caress Schenk   ORCID: orcid.org/0000-0001-5738-1906 3 ,
  • Vanja Grujic 4 ,
  • Tim Model   ORCID: orcid.org/0000-0003-1021-825X 5 ,
  • Robert Kubinec 6 &
  • Joan Barceló   ORCID: orcid.org/0000-0001-6965-4342 6  

Scientific Data, volume 11, Article number: 204 (2024)


  • Public health
  • Research data

Public health and safety measures (PHSM) made in response to the COVID-19 pandemic have been singular, rapid, and profuse compared to the content, speed, and volume of normal policy-making. Not only can they have a profound effect on the spread of the disease, but they may also have multitudinous secondary effects, in both the social and natural worlds. Unfortunately, despite the best efforts by numerous research groups, existing data on COVID-19 PHSM only partially captures their full geographical scale and policy scope for any significant duration of time. This paper introduces our effort to harmonize data from the eight largest such efforts for policies made before September 21, 2021 into the taxonomy developed by the CoronaNet Research Project in order to respond to the need for comprehensive, high quality COVID-19 data. In doing so, we present a comprehensive comparative analysis of existing data from different COVID-19 PHSM datasets, introduce our novel methodology for harmonizing COVID-19 PHSM data, and provide a clear-eyed assessment of the pros and cons of our efforts.

Introduction

From lockdowns to travel bans, government responses to the SARS CoV-2 virus radically affect virtually every dimension of society, from how governments govern to how businesses operate and how citizens live their lives. Comprehensive, high quality and timely access to COVID-19 public health and social measures (PHSM) data is thus crucial for understanding not only what these responses are, but the scale, scope and duration of their effect on various policy areas. These include e.g. the economy, the environment and society, to say nothing of their ostensible function of reducing the spread of the virus itself. Though dozens of research groups have documented COVID-19 PHSM, these individual data tracking efforts provide only an incomplete portrait of COVID-19 PHSM, with many having stopped their efforts entirely, often due to funding constraints 1 .

This article shows how data harmonization, the process of making comparable and compatible conceptually similar data, can create a more comprehensive dataset of COVID-19 PHSM relative to any single dataset alone. Specifically, this article introduces a novel, rigorous methodology for harmonizing COVID-19 PHSM data from December 31, 2019 until September 21, 2021 for the 8 largest existing PHSM tracking efforts:

ACAPS COVID-19 Government Measures (ACAPS) 2

COVID Analysis and Mapping of Policies (COVID AMP) 3 , 4

Canadian Dataset of COVID-19 Interventions (CIHI) 5

CoronaNet Research Project COVID-19 Government Response Event Dataset (CoronaNet) 6 , 7

Johns Hopkins Health Intervention Tracking for COVID-19 (HIT-COVID) 8 , 9

Oxford COVID-19 Government Response Tracker (OxCGRT) 10

World Health Organization EURO (WHO EURO) dataset on COVID-19 policies (retrieved from the WHO Public Health and Safety Measures (WHO PHSM) 11 )

US Center for Disease Control (WHO CDC) dataset on COVID-19 policies (retrieved from the WHO Public Health and Safety Measures (WHO PHSM) 11 )

With the help of 500+ research assistants around the world, these datasets are harmonized into the CoronaNet taxonomy to provide a fuller picture of government responses to the pandemic. This amounts to harmonizing around 150,000 observations external to the CoronaNet dataset, which at the time of writing itself has more than 180,000 observations. Not only can harmonizing these different datasets provide a more accurate and complete basis for understanding the drivers and effects of this pandemic, it can also ensure that the data collected by trackers that have stopped their work are not lost and that the original sources underlying this data are preserved to aid future research.

The value of this work is multifaceted. In terms of its value for aiding understanding of COVID-19 PHSM, our discussion of the various challenges faced in harmonizing these 8 datasets not only illuminates difficulties in the data harmonization process but also provides rich detail as to the relative strengths and weaknesses among different datasets. Such a discussion has thus far been missing in the literature and can help researchers adjudicate which dataset best meets their research needs. Initiating this discussion also makes transparent the difficulties in collecting PHSM data more generally, which, to our knowledge, have not been documented in such extensive and comparative detail before. Meanwhile, with respect to its value to COVID-19 research, our ongoing data harmonization efforts represent an enormous improvement over what was previously available from any one dataset alone. While any improvement to PHSM data coverage across time or government jurisdictions can provide a more robust and accurate basis for advancing research on the drivers and effects of the pandemic, harmonizing data from the 8 largest such existing datasets ensures that our resulting dataset will be able to provide the most comprehensive and detailed information about COVID-19 PHSM possible. Finally, from a methodological perspective, our PHSM harmonization effort can serve as a guide for the harmonization of other datasets, especially in the social sciences. That is, in contrast to virtually all other harmonization efforts we could identify 12 , ours is largely implemented manually, providing us with an unusually intimate knowledge of data harmonization at the level of the individual observation.

In the results section, we provide readers with an overview of our data harmonization efforts from a variety of different angles. We begin from a forward-looking perspective by giving readers insight into the challenges we faced in harmonizing incomplete and dirty data, which often suffered from missing sources, in order to provide context for the methodological steps we employed in pursuing data harmonization. Interested readers can find a fuller accounting of our methodology in the methodology section. The results section ends with a backward-looking perspective by providing an assessment of different aspects of our harmonization efforts, including, e.g., the likely gains and limitations of our efforts. Overall, our data harmonization efforts significantly outperform the only other effort we are aware of to harmonize COVID-19 PHSM data, the WHO PHSM effort 11 (see Supplementary Information, section 3). We conclude with a discussion of how data harmonization illuminates complexities in the data generation process.

Choosing which datasets to harmonize is one of the most significant decisions a researcher must make when harmonizing data because the particularities of any given dataset can have substantial downstream effects on how they can be made to fit together. Given this, we start this section by outlining our rationale for choosing to harmonize the particular datasets that we do. Having established this basis, we then dive into the challenges of harmonizing complex, dirty, and incomplete COVID-19 PHSM data. We then briefly show how our methodology can address these challenges (see our methodology section for more information). We end this section by providing an assessment of our efforts to harmonize COVID-19 PHSM data, the criteria for which are based on separate guidance we developed on the topic of data harmonization more generally 12 .

Which datasets to harmonize?

In adjudicating which COVID-19 PHSM datasets to harmonize, we weighed the potential benefits of data harmonization among a number of different dimensions, including the:

Geographical coverage

Temporal coverage

Volume of data collected

Relative similarity of policy taxonomies to the CoronaNet taxonomy

Relative capacity of external dataset partners for collaboration

As can be seen in Supplementary Table 1, we identified more than 20 datasets which could potentially be harmonized into the CoronaNet taxonomy. We ultimately chose to harmonize datasets that (i) aspired to world-wide geographic coverage, with (ii) at least ten thousand observations in each dataset, and that were (iii) based on original coding of sources (as opposed to recoding of existing sources). Datasets that fit these criteria were ACAPS, COVID AMP, HIT-COVID, OxCGRT, and the Complexity Science Hub COVID-19 Control Strategies List (CCCSL) 13 , 14 (note, we started but did not finish harmonizing CCCSL data; see below for more information).

One clear exception to these criteria was the inclusion of the CIHI dataset, which focuses on Canadian policies and had fewer than ten thousand policies. We decided to include the CIHI dataset for consideration because (i) it already formed a substantial part of subnational data collection for other data collection efforts, including the OxCGRT dataset, and (ii) we had substantial cooperation with, and access to, researchers with expertise in both the CoronaNet and CIHI taxonomies. Similarly, though the WHO EURO dataset aims for a regional rather than a world-wide focus, given that (i) it is part of the WHO PHSM dataset, which we compare our efforts to in Section 3 of our Supplementary Information, and (ii) CoronaNet is partially supported by the EU Commission for its EU data collection, we decided to include it for harmonization. Because the WHO CDC dataset follows the same taxonomy as the WHO EURO dataset and also contains a substantial number of policies, it was also included for harmonization.

September 2021 was chosen as the cutoff date given our available resources and because most data tracking efforts had stopped or significantly slowed their data collection by this date except for OxCGRT, CIHI and WHO EURO (OxCGRT stopped in 2023 while the latter two stopped in 2022). Should more resources become available we will expand our efforts to harmonize records for these datasets beyond this date.

Challenges of Data harmonization

Harmonizing data is rarely straightforward, but harmonizing COVID-19 PHSM data was particularly challenging because standards which researchers usually abide by before releasing data were not observed due to the emergency nature of the pandemic. For one, to the extent that researchers generate event-based datasets, these normally concern past events, not ongoing ones. Indeed, a given event must run its course in order for researchers to both i) conceptualize the event being captured into a structured and logically organized taxonomy and ii) estimate the amount of work needed to build a dataset based on that taxonomy. For another, because dirty data can significantly bias subsequent research findings, researchers often err on the side of caution by spending substantial additional time rigorously cleaning and validating the data before release. Researchers also have personal incentives to delay releasing data, given that i) they generally wish to be the first to conduct analyses on data that they themselves have collected and ii) unclean datasets can significantly damage professional reputations. Meanwhile, to promote replicability and transparency about the data generating process, documentation of original sources and coding decisions is often extensive. Due to the pandemic, however, PHSM data exceptionally were:

Collected based on taxonomies that were developed inferentially from research group to research group while the COVID-19 pandemic was still ongoing.

Released without extensive cleaning.

Inconsistently preserved with regards to original raw sources.

Lacking regular documentation of changes to taxonomies or data collection methods.

There were a number of research-based reasons to prioritize speed over rigor. Not only did launching data collection during rather than after the pandemic help jump-start early COVID-19 research; in many cases it was critical to document these policies in as close to real time as possible because primary sources on the pandemic can and have disappeared from the Internet over time. Meanwhile, though many COVID-19 trackers surely would have continued to improve their data quality, many unfortunately stopped their efforts for lack of funding support. Our efforts to harmonize this external data into the CoronaNet dataset thus not only ensure that their substantial contributions can live on, but also improve them insofar as any errors in the data or discrepancies between datasets have a higher chance of being identified and resolved before being harmonized. This job is made more difficult, however, because many trackers did not have rigorous guidelines for preserving raw sources. In what follows, we expand upon how each of these pandemic-related challenges has affected our data harmonization efforts and subsequent methodological decisions.

The challenge of harmonizing different taxonomies

Different conceptualizations of what ultimately ‘counts’ as PHSM data lie at the root of different taxonomic approaches to collecting such data. While one benefit of independently developing taxonomies is that it encourages greater flexibility and adaptability in conceptualizing COVID-19 PHSM while simultaneously validating common themes that independently appear across taxonomies, it also makes reconciling the differences among them more challenging. A particular challenge for our data harmonization efforts is that the CoronaNet taxonomy on the whole captures more policy dimensions than other datasets do. While this means that our data harmonization efforts will yield much more fine-grained information on a given policy than would be available in its original form, mapping a simpler taxonomy into a more complex one is also a much more challenging task than the reverse. In what follows, we discuss the challenges we faced when mapping taxonomies for COVID-19 policy types in particular, as well as for other important dimensions of COVID-19 policies.

There are at least four broad issues to consider when mapping the substance of different COVID-19 policies: (i) when taxonomies use the same or similar language to describe a policy but ultimately conceptualize it differently; (ii) when taxonomies have the same or similar conceptual understandings of a given event but differ in how they record the data structurally; (iii) when taxonomies have similar but ultimately different conceptual understandings of a given event; and (iv) when taxonomies capture and conceptualize different events. We elaborate with examples for each of these issues in what follows:

Different datasets can often use similar language to describe conceptually different phenomena. An example of why it is important to be sensitive to these semantic details can be seen with the term ‘restrictions on internal movement.’ While all datasets that use this terminology understand it to entail policies that restrict movement, some have different understandings of the term ‘internal.’ For instance, because OxCGRT generally codes policies from the perspective of the country (note that OxCGRT documents subnational data for a select number of countries: the United States, Canada and China. Although it also collects subnational data for Brazil, that data appears to be coded at the level of the country), its ‘C7_Restrictions on internal movement’ indicator captures any restriction of movement within a country. Meanwhile, because CoronaNet codes policies from the perspective of the initiating government, its ‘Internal Border Restrictions’ policy type captures policies that restrict movement within the jurisdiction of a given initiating government, while policies that restrict movement outside a given jurisdiction are coded as ‘External Border Restrictions.’ As such, if the state of California restricts its citizens from leaving the state, this would be captured in OxCGRT’s ‘C7_Restrictions on internal movement’ indicator but would be coded as an ‘External Border Restriction’, not an ‘Internal Border Restriction’, using the CoronaNet taxonomy. Parsing out these differences can only be automated to a limited extent, especially if the taxonomies being mapped simply do not make the same distinctions.
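The perspective difference described above can be made concrete with a small sketch. The function names and the simplified two-way classification are illustrative assumptions, not either project's actual coding rules:

```python
def classify_coronanet(movement_within_initiator: bool) -> str:
    """Simplified CoronaNet perspective: a restriction on movement *within*
    the initiating government's own jurisdiction is an 'Internal Border
    Restriction'; one that restricts movement beyond that jurisdiction is an
    'External Border Restriction'."""
    return ("Internal Border Restrictions" if movement_within_initiator
            else "External Border Restrictions")


def classify_oxcgrt(movement_within_country: bool) -> str:
    """Simplified OxCGRT perspective: any restriction of movement within the
    *country* falls under 'C7_Restrictions on internal movement'."""
    return ("C7_Restrictions on internal movement" if movement_within_country
            else "C8_International travel controls")


# California bans residents from leaving the state: the movement stays within
# the country (OxCGRT: internal) but crosses the initiator's own jurisdiction
# (CoronaNet: external), so the two taxonomies code the same event differently.
oxcgrt_code = classify_oxcgrt(movement_within_country=True)
coronanet_code = classify_coronanet(movement_within_initiator=False)
```

The point of the sketch is that the disagreement is systematic, not random: it depends on an attribute (the initiator's jurisdiction) that only one taxonomy records, which is why such mappings resist full automation.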

Meanwhile, an example of how different datasets implemented different taxonomic structures to capture a similar conceptual understanding of a policy is how they captured policies related to older adults. OxCGRT organized its taxonomy by creating an ordinal variable, its “H8_Protection of elderly people” index. Specifically, this index records “policies for protecting elderly people (as defined locally) in Long Term Care Facilities (LTCFs) and/or the community and home setting” on an ordinal scale (it takes on a value of 0 if no measures are in place, 1 if restrictions are recommended, 2 if some restrictions are implemented and 3 if extensive restrictions are implemented; see the OxCGRT codebook 10 for further details). In contrast, the CoronaNet and COVID AMP taxonomies document policies toward older adults not in their policy type variables but in a separate variable that records the demographic targets of a given policy (in CoronaNet, this is the ‘target_who_gen’ variable, while in COVID AMP this is the ‘policy_subtarget’ variable). Both datasets record whether a policy is targeted toward ‘People in nursing homes/long term care facilities.’ CoronaNet additionally makes it possible to document whether a policy is targeted toward ‘People of a certain age’ (where the ages are captured separately in a text entry) or ‘People with certain health conditions’ (where the health conditions are captured separately in a text entry), while COVID AMP additionally makes it possible to document whether a policy is targeted toward ‘Older adults/individuals with underlying medical conditions.’ When mapping different taxonomies, these differences in taxonomic structure must additionally be taken into account.
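One way to see the structural difference is as a crosswalk. The H8 scale values below follow the OxCGRT codebook summary given above, and the target strings follow the categories named in the text; the mapping between them is an illustrative assumption rather than an official correspondence:

```python
# OxCGRT encodes older-adult protection as one ordinal variable; the tuple
# values here (compliance level, scope) are a hypothetical gloss of each step.
H8_SCALE = {
    0: None,                                  # no measures in place
    1: ("recommended", "some restrictions"),
    2: ("mandatory", "some restrictions"),
    3: ("mandatory", "extensive restrictions"),
}

# CoronaNet ('target_who_gen') and COVID AMP ('policy_subtarget') instead
# record demographic targets; these are the categories named in the text.
ELDERLY_TARGETS = {
    "People in nursing homes/long term care facilities",   # both datasets
    "People of a certain age",                             # CoronaNet
    "Older adults/individuals with underlying medical conditions",  # COVID AMP
}


def counts_toward_h8(target_who_gen: str) -> bool:
    """Would a target-coded policy plausibly fall under an H8-style index?"""
    return target_who_gen in ELDERLY_TARGETS
```

Even this toy crosswalk is lossy in both directions: the ordinal side cannot recover *which* demographic target was meant, and the target side carries no built-in ordering of stringency.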

Furthermore, taxonomies may capture similar yet conceptually quite distinct events, which makes one-to-one matching between datasets difficult, if not impossible. For instance, the CIHI taxonomy’s ‘Travel-restrictions’ policy type does not make any distinction between restrictions made within or outside of a given government’s borders. Meanwhile, to revisit the example of policies related to older adults, by developing a ‘nursing homes’ category, the HIT-COVID taxonomy targets not older adults per se, but the institutional settings in which they are likely to be most vulnerable. The WHO PHSM dataset meanwhile generalizes this idea in its policy category of ‘Measures taken to reduce spread of COVID-19 in settings where populations reside in groups or are restrained or limited in movement or autonomy (e.g., some longer-term health care settings, seniors’ residences, shelters, prisons).’ This taxonomy implicitly suggests that it may be prudent to investigate the effects of policies not only on older adults but on all those with limited mobility, at the expense of easily extractable information on older adults in particular. Cases such as these are perhaps the most difficult to resolve, as it is impossible to directly map distinctions that one taxonomy makes into other taxonomies where no such distinctions are made.

Finally, while all datasets generically sought to capture policies governments made in response to COVID-19, different datasets focused on different policy areas. For instance, virtually all external datasets have separate policy categories to capture economic or financial policies (e.g. government support of small businesses) while such policies are not systematically captured in the CoronaNet taxonomy. In these cases, such policies are thus simply not mappable.

The fact that different projects undertook such a variety of approaches to capturing PHSM policies also underscores the idea that there is no one correct taxonomy for capturing such policies; each has its own pros and cons. For instance, aggregating all policies toward older adults in one indicator, as OxCGRT does, facilitates research on how the pandemic affects older adults but makes it difficult to compare the effect of the pandemic on other vulnerable populations. Meanwhile, though the CoronaNet and COVID AMP approach allows more flexibility in what kinds of policies toward older adults can be captured, it lacks the cohesiveness the OxCGRT indicator for older adults enjoys. With regards to data harmonization, meanwhile, the sheer variety of approaches substantially increases the challenge of transforming PHSM data to adhere to one taxonomy.

Indeed, despite a strong partnership with CCCSL, we opted not to harmonize data from the CCCSL dataset because of these taxonomic challenges. We found CCCSL’s structure and semantics to be so different from CoronaNet’s that we estimated we would ultimately be able to use less than half of CCCSL’s observations. To illustrate by example, an observation with the CCCSL id of 4547 notes in its description that ‘Ski holiday returns should take special care.’ Such an observation would not be considered a policy in the CoronaNet taxonomy because it does not provide specific enough information about what is meant by ‘special care’, and the link to the original source of this observation is dead. While many observations do contain high quality information and descriptions, a substantial number contain no or only very minimal descriptive information. Combined with the difficulty in accessing original sources, we decided the relative effort required to consistently map the remaining observations into the CoronaNet taxonomy would be too high, especially considering that we are also harmonizing similar data from 7 other datasets.

So far we have only discussed the challenge of mapping taxonomies specific to policy types. However, all datasets also capture additional contextual information that is important for understanding, analyzing and comparing government COVID-19 policies. In Table 1 below, we show the variety of approaches different datasets undertook to capture some of the most important of these dimensions, including: the data structure (Structure), whether a given dataset captures end dates (End Dates?), has a protocol for capturing and linking updates of a policy to its original policy (Updates?), has a standardized method for documenting policies occurring at the provincial ISO-2 level (Location standardized at ISO-2 level?), captures information about the geographic target of a policy (Geog. Target?), or captures information about the demographic target of a given policy (Demog. Target?).

As Table 1 shows, while most external datasets are formatted as event datasets, which facilitates comparability across them, OxCGRT data is available only in panel format, which presents unique challenges. To facilitate data harmonization, the OxCGRT data must be reformatted into an event format (see the Supplementary Information, Section 2, to access the taxonomy map). The panel structure also has knock-on effects on how other policy dimensions are captured, which we discuss in greater detail later in this section.
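A minimal sketch of what such a panel-to-event reformatting can look like, assuming a per-region series of (date, ordinal value) observations. This is an illustration of the general technique, not the project's actual conversion code:

```python
from datetime import date


def panel_to_events(panel):
    """Convert a panel series of (date, ordinal_value) pairs into event
    records. A new event opens whenever the ordinal value changes; the
    previous event is closed on the day of the change. The final event is
    left open-ended (end=None)."""
    events = []
    current = None  # (value, start_date) of the open event
    for day, value in panel:
        if current is None:
            current = (value, day)
        elif value != current[0]:
            events.append({"value": current[0], "start": current[1], "end": day})
            current = (value, day)
    if current is not None:
        events.append({"value": current[0], "start": current[1], "end": None})
    return events


panel = [(date(2020, 3, 1), 0), (date(2020, 3, 15), 2),
         (date(2020, 4, 1), 2), (date(2020, 5, 10), 3)]
events = panel_to_events(panel)
# Three events: value 0 (Mar 1 to Mar 15), value 2 (Mar 15 to May 10),
# and value 3 (open-ended).
```

Note that each event here carries only the ordinal value, which foreshadows the information-loss problem discussed later: the underlying policies that produced a given index value are not recoverable from the panel alone.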

Datasets also differ with regards to how they capture the timing of a policy. Although knowing the duration of a policy is crucial for understanding its subsequent impact, if any, neither ACAPS nor HIT-COVID systematically captured information about policy end dates. Though CIHI did make this data available through its textual descriptions, it was not available as an individual field and had to be separately extracted. When harmonizing data from these datasets, then, additional work must be done to provide information on end dates.
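Extracting end dates from textual descriptions, as was necessary for the CIHI data, might look roughly like the following. The regular expression and the phrasing it matches are illustrative assumptions, not CIHI's actual description format:

```python
import re

# Hypothetical pattern for phrases like "until April 30, 2021" or
# "extended to May 1 2021"; real descriptions need far more variants.
END_DATE_RE = re.compile(
    r"(?:until|through|extended to)\s+(\w+ \d{1,2},? \d{4})", re.IGNORECASE)


def extract_end_date(description):
    """Return the first end-date-like phrase in a description, else None."""
    m = END_DATE_RE.search(description or "")
    return m.group(1) if m else None


extract_end_date("Masks mandatory in indoor spaces until April 30, 2021.")
# -> "April 30, 2021"
```

In practice such extraction only narrows the work: matches still need manual verification, and descriptions with no date-like phrase fall back to research against the original source.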

Relatedly, datasets have also taken inconsistent approaches to capturing policy updates, if at all. Taxonomies that capture such updates are arguably better equipped to capture the messiness and uncertainty of the COVID-19 policy making process (e.g. policy makers often lengthen or shorten the timing of a given policy in response to changing COVID-19 conditions). ACAPS and CIHI, however, do not separately capture and link policy updates to the original policy. Meanwhile, OxCGRT’s inability to capture information on how policies may be linked is largely due to its panel dataset structure. In contrast, though both the CoronaNet and COVID AMP taxonomies have rules for linking policies together, these differ across datasets. Specifically, CoronaNet links policies together if there are any changes to the original policy’s time duration (e.g. extended or reduced over time), quantitative amount (e.g. number of quarantine days), directionality of the policy (e.g. whether a policy targets outbound or inbound travel), travel mechanisms (e.g. whether a policy targets air or land travel), compliance (e.g. whether a policy is recommended or mandatory), or enforcer (e.g. which ministry is responsible for implementation). COVID AMP, meanwhile, has separate fields to document (i) whether an original policy was extended over time (see the ‘prior row id’ field in the COVID AMP dataset) or (ii) whether a given policy implemented at the local level has a relationship with a higher level of government (see the ‘Parent policy number’ field in the COVID AMP dataset).
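The CoronaNet linking rules described above can be sketched as a simple field comparison. The field names below are hypothetical stand-ins for the dataset's actual columns:

```python
# Fields whose change, per the CoronaNet rules described in the text, links
# an update back to its original policy (illustrative names).
LINKABLE_FIELDS = {"end_date", "quantity", "directionality",
                   "travel_mechanism", "compliance", "enforcer"}


def is_update_of(original: dict, candidate: dict) -> bool:
    """A candidate record is treated as an update of the original if both
    describe the same policy type in the same jurisdiction and differ in at
    least one linkable field. A simplified sketch of the linking logic."""
    if (original["policy_type"] != candidate["policy_type"]
            or original["jurisdiction"] != candidate["jurisdiction"]):
        return False
    changed = {k for k in LINKABLE_FIELDS
               if original.get(k) != candidate.get(k)}
    return bool(changed)
```

An identical record is deliberately not an update under this rule (nothing changed), which mirrors the idea that links exist to track modifications to a policy rather than duplicates of it.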

While all datasets use a standardized taxonomy for documenting country level information about where a policy originated from, some datasets did not use a standardized taxonomy for capturing this information at the subnational ISO-2 level, in particular ACAPS and the WHO. Even when the taxonomy was standardized within a given dataset, different datasets used slightly different taxonomies at both the country and subnational levels, which also necessitates further reconciliation and standardization across datasets.

Of all the datasets processed for data harmonization, only the CoronaNet and COVID AMP datasets capture information on both the particular geographic (e.g. country, province, city) and demographic targets (e.g. general population, asylum seekers) of a given policy. To the extent that other datasets also capture this information, it is either very broad or not standardized enough. For instance, though the various indicators in the OxCGRT data capture whether a policy overall applies to the general population or a targeted population, no further information about the specific targets is provided. Meanwhile, the WHO PHSM dataset does have a separate field which documents demographic targets but these entries are not standardized resulting in more than 5900 unique categories, many of which have typos (see  Supplementary Information , Section 3, for more). It is thus impossible to use them for analysis without substantial additional cleaning and harmonization.
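Standardizing such free-text target fields requires substantial cleaning. One plausible, purely illustrative approach fuzzy-matches each raw entry against a canonical list; the canonical categories and cutoff below are assumptions for the sake of the sketch:

```python
import difflib

# Hypothetical canonical demographic-target categories (lowercase).
CANONICAL_TARGETS = [
    "general population",
    "health care workers",
    "asylum seekers",
    "people in nursing homes/long term care facilities",
]


def normalize_target(raw, cutoff=0.8):
    """Map a free-text (possibly misspelled) demographic target onto a
    canonical category via difflib's similarity ratio; return None when no
    candidate is close enough, leaving the entry for manual review."""
    matches = difflib.get_close_matches(raw.strip().lower(),
                                        CANONICAL_TARGETS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


normalize_target("Genral population")  # -> "general population"
```

A pipeline like this can collapse typo-driven variants (the paper notes more than 5900 unique entries in the WHO PHSM target field), but the unmatched remainder still needs human adjudication, since fuzzy matching cannot distinguish a typo from a genuinely new category.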

All in all, harmonizing different datasets can be challenging when considering only two taxonomies, much less eight. This is true not only with regards to taxonomies specific to the substance of COVID-19 policies themselves but also with respect to additional policy dimensions like policy timing and targets.

The challenge of harmonizing dirty data

Dirty data refers to data that is miscoded with reference to a given taxonomy. In our investigation of the cleanliness of the different datasets, we distinguish between policies that are (i) inaccurately coded relative to a given taxonomy and (ii) incomplete or missing. Harmonizing dirty data would be challenging even if taxonomies across datasets were the same; these problems are only compounded when taxonomies differ. Unfortunately, because of the pandemic emergency, all datasets considered here suffer from problems with dirty data.

For instance, although within the ACAPS taxonomy all policies related to curfews should theoretically be coded as ‘Movement Restrictions’ and ‘Curfew’ in the ‘category’ and ‘measure’ fields respectively, text analysis of the descriptions accompanying these observations suggests that curfew policies were mistakenly coded into at least 8 other policy categories. For example, policies relevant to curfews which likely should have been coded as ‘Movement Restrictions – Curfew’ were also found coded under other categories like ‘Lockdown – Partial Lockdown’, ‘Movement restrictions – Surveillance and monitoring’, and ‘Movement Restrictions – Domestic travel restrictions.’ Although admittedly the aforementioned categories are conceptually quite close to the concept of curfews, curfew measures were also found coded under categories that are arguably farther afield, like ‘State of Emergency’, ‘Movement Restrictions – Border closures’, ‘Public health measures – Isolation and quarantine policies’, ‘Governance and socio-economic measures – Emergency administrative structures activated or established’ and ‘Governance and socio-economic measures – Military deployment.’
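The kind of text scan described above can be sketched in a few lines. The record layout mirrors ACAPS's 'category' and 'measure' columns as described in the text, while the keyword list and function name are illustrative:

```python
def flag_possible_curfews(records, keywords=("curfew",)):
    """Flag observations whose free-text description mentions curfew-related
    keywords but whose coded category/measure pair is not
    ('Movement Restrictions', 'Curfew'). A minimal sketch of a
    miscoding scan, not the actual analysis pipeline."""
    flagged = []
    for rec in records:
        desc = (rec.get("description") or "").lower()
        if any(k in desc for k in keywords):
            if (rec.get("category"), rec.get("measure")) != (
                    "Movement Restrictions", "Curfew"):
                flagged.append(rec)
    return flagged


records = [
    {"category": "Movement Restrictions", "measure": "Curfew",
     "description": "Nationwide curfew from 8pm."},
    {"category": "Lockdown", "measure": "Partial Lockdown",
     "description": "Night-time curfew in the capital."},
]
suspect = flag_possible_curfews(records)  # only the miscoded second record
```

Such a scan over-flags by design: a description can mention curfews while legitimately describing a broader measure, so flagged records are candidates for review rather than confirmed errors.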

Data can also be dirty for other important policy dimensions, e.g. the start date of a given policy. Many governments maintain websites where they note the most current policies without detailed information as to when a policy started or will end. To draw an example from a government source for Latvian policies (see https://web.archive.org/web/20210621102402/https://covid19.gov.lv/en/support-society/how-behave-safely/covid-19-control-measures ), the date the information on policies was updated was June 21, 2021, but the policies themselves were not necessarily implemented on that day. Further research would be needed to triangulate the start date of a given policy listed on this website. Similarly, in some cases, coders simply recorded the date they accessed the policy as the start date, as opposed to the true start date. That these types of issues were found across all datasets is no surprise given the unusual circumstances under which such data were collected and released. Nevertheless, they can pose immense challenges; blindly harmonizing such data risks compounding the original errors.

While it is difficult to quantify the relative cleanliness of different datasets (and thus, how much of an issue it poses to data harmonization), we provide some sense of their relative data quality with regards to their textual descriptions in Table 2 below. Good textual descriptions of a given policy are crucial for helping users understand what policies a given dataset is actually documenting and organizing. Table 2 shows how informative these descriptions are by counting, per dataset, the average number of characters per description (Description Length (Average)), how many descriptions have fewer than 50 characters (Descriptions with less than 50 characters (Total)) and how many observations have no description at all (Missing Descriptions (Total)).

Generally, descriptions with fewer than 50 characters contain only limited information about a given policy. The following examples, one from each dataset, show that these shorter descriptions often omit the dates, places and enforcers of policies, and sometimes even the nature of the policy itself: “Albania banned all flights to and from the UK.” (CoronaNet); “Blida extended until at least 19. april 2020” (ACAPS); “Lockdown extended. Lockdown extended” (WHO CDC); “The state of emergency in WA has been extended.” (COVID AMP); “Delay of international flights have been extended” (EURO); “Extends school closures until March 16” (HIT-COVID); “orders extended until April 30” (OxCGRT). The table shows that textual descriptions from the ACAPS dataset have on average the fewest characters, with more than two thousand descriptions of fewer than 50 characters and more than 100 observations with no description at all. While OxCGRT has the third highest average description length, it also has the largest number of descriptions with fewer than 50 characters. Meanwhile, HIT-COVID has the largest number of policies without any description, at more than 1600.
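The three quantities reported in Table 2 can be computed straightforwardly; the sketch below uses two of the short example descriptions quoted above plus a missing value:

```python
def description_stats(descriptions):
    """Compute the Table 2 quantities for a list of description strings
    (None/empty meaning missing): average length over non-missing
    descriptions, count of descriptions under 50 characters, and count of
    missing descriptions."""
    present = [d for d in descriptions if d]
    missing = len(descriptions) - len(present)
    short = sum(1 for d in present if len(d) < 50)
    avg = sum(len(d) for d in present) / len(present) if present else 0.0
    return {"avg_length": avg, "under_50": short, "missing": missing}


stats = description_stats([
    "Lockdown extended. Lockdown extended",              # 36 characters
    None,                                                # missing
    "Albania banned all flights to and from the UK.",    # 46 characters
])
# -> {'avg_length': 41.0, 'under_50': 2, 'missing': 1}
```

One design choice worth flagging: missing descriptions are excluded from the average rather than counted as zero-length, so the average and the missing count stay independent measures, as they appear to be in Table 2.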

With regards to the content of the descriptions, only the CoronaNet and CIHI databases appear to standardize what should be included in this textual description (see the ‘Description Standardized?’ column in Table 2). Coders for CoronaNet are instructed to include the following information in their textual descriptions: (i) the name of the country from which a policy originates, (ii) the date the policy is supposed to take effect, (iii) information about the ‘type’ of policy, (iv) if applicable, the country or region that a policy is targeted toward, (v) if applicable, the type of people or resources a policy is targeted toward and (vi) if applicable, when a policy is slated to end. The CIHI descriptions take a regularized format in which the government initiating the policy is clearly specified, the policy type is described and the end date of a given policy is recorded if applicable. With regards to the other datasets, we were unable to find any documentation suggesting that text descriptions should follow a standardized format, nor were we able to find evidence of any by reading through a sample of the text descriptions themselves.

For each dataset, we randomly selected one description whose length matched the average description length for that dataset to illustrate what kind of information could be gleaned from them, shown in the ‘Example of Average Description’ column in Table 2. These descriptions suggest that while the CoronaNet and CIHI descriptions include information about the date the policy is enacted and the policy initiator, this information is not always reliably available in descriptions from other datasets. While this information is generally also captured in separate variable fields, having detailed textual descriptions is important for helping to adjudicate whether the subsequent coding of these separate policy dimensions is accurate.

While it would be useful to have a similar quality assessment for other variables of each dataset, as far as we know, only the CoronaNet dataset provides an empirical assessment of the quality of its data. CoronaNet implements a multiple validation scheme in which it samples 10% of its raw sources for three independent coders to separately code. If 2 out of 3 coders document a policy in the same way, the coding is considered valid. Though data validation is still ongoing, preliminary data from Table 3 suggests that inter-coder reliability is around 80% for how its policy type variable is coded, a level generally accepted to be indicative of high inter-coder reliability 15, 16, 17. An exception to the generally high validity of the policy type variable is the relatively poor inter-coder reliability for the ‘Health Testing’ and ‘Health Monitoring’ categories. This is likely related to changes in the CoronaNet taxonomy, which, while important for adapting to the changing policy-making environment, also increase the dirtiness of the data. A full accounting of taxonomy changes can be found in the CoronaNet Data Availability Sheet: https://docs.google.com/spreadsheets/d/1FJqssZZqjQcA-jZhRnC_Av9rlii3abG8r7utBeuzTEQ/edit#gid=1284601862 .
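The 2-out-of-3 rule and the resulting reliability rate can be sketched as follows; the sample codings are invented for illustration:

```python
from collections import Counter

def majority_valid(codings):
    """Apply the 2-out-of-3 rule: a sampled policy's coding is considered
    valid if at least two of the three independent coders agree. Returns
    (is_valid, majority_coding)."""
    coding, count = Counter(codings).most_common(1)[0]
    return count >= 2, coding


def intercoder_reliability(samples):
    """Share of sampled policies on which at least two coders agreed."""
    agreed = sum(1 for codings in samples if majority_valid(codings)[0])
    return agreed / len(samples)


samples = [
    ("Lockdown", "Lockdown", "Curfew"),                               # valid
    ("Health Testing", "Health Monitoring", "Health Resources"),      # no majority
]
rate = intercoder_reliability(samples)  # 0.5
```

The second sample illustrates exactly the failure mode the text describes: conceptually adjacent categories such as 'Health Testing' and 'Health Monitoring' split the coders, so no 2-of-3 majority forms.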

Other external datasets have likely faced similar issues, which subsequently affect their data quality, although we were unable to locate public documentation of such changes. Note, however, that if there were any taxonomy changes for OxCGRT or HIT-COVID, they are likely recoverable from their Git repository histories. The closest comparable information that other datasets provide on data quality concerns their cleaning procedures. More information on the steps other datasets took to ensure data quality can be found in their respective documentation (see: CoronaNet 6 ; ACAPS 2 , OxCGRT 10 , HIT-COVID 8 and the WHO PHSM 18 ; no documentation on data quality procedures was found for CIHI). Given that a number of external trackers stopped their data collection efforts, as well as the relatively high level of data quality of the CoronaNet data for the dimensions we have information on, we can cautiously infer that harmonizing external data into the CoronaNet dataset will help improve the quality of the subsequently harmonized data.

Data completeness is also an important factor in a dataset’s overall quality. The more complete a dataset is, the more accurate subsequent analyses based on it can be. All datasets harmonized here are by definition incomplete given that they made their datasets publicly available while their data collection efforts were ongoing. This issue is compounded by the fact that many datasets have had to stop or substantially slow their data collection efforts, particularly ACAPS, HIT-COVID, WHO CDC and COVID AMP. Because policies often continue past the lifetime of the group collecting the data itself, issues of data incompleteness only grow over time for datasets that stop collecting data. While a full assessment of the completeness of each dataset is not possible (one would need a perfectly complete dataset in order to judge the completeness of other datasets), in Table 4 below we provide some sense of each dataset’s relative completeness by assessing how many policies lack end dates in both the raw data and the harmonized data, as well as the average start and end dates of policies and the last submission date of a given policy.

The first column, ‘Missing End Dates (Total Raw)’, shows that on an absolute basis, CoronaNet has by far the greatest number of missing end dates. However, this is largely a function of the large size of its dataset (see Table 6 for information on the size of different datasets). When we turn to the second column, ‘Missing End Dates (% Raw)’, which assesses the percentage of end dates missing from the data before it is manually assessed for data harmonization, the table shows that, after ACAPS and HIT-COVID (which, as previously mentioned, do not collect information on end dates at all), CIHI has the highest percentage of missing end dates while the OxCGRT data has the lowest. However, this column should be contrasted with the subsequent one, ‘Missing End Dates (% Harmonized)’ (note that the ‘raw’ version of the data presented here corresponds to ‘Step 3’ of the data and the ‘harmonized’ version corresponds to ‘Step 5’; see Table 6 for a precise breakdown of this data and the methodology section for more information on how the data was processed during these steps), which shows the prevalence of missing end dates for policies that have been assessed for harmonization. Here we can see that both ACAPS and HIT-COVID have improved in terms of the prevalence of missing data; this suggests that while they themselves did not systematically gather information on end dates, it was possible for research assistants to recover this information from the underlying raw sources during the harmonization process.

The difference between what was reported in the raw data and what was assessed during the harmonization process is particularly drastic in the case of the OxCGRT data, and because it suggests that the OxCGRT data is of substantially poorer quality than it appears, it deserves some additional attention. In particular, though OxCGRT was originally assessed as having missing end dates for around 3% of the data based on the raw data, this percentage explodes to nearly 40% during the harmonization process. A likely explanation for this discrepancy is that the OxCGRT data is collected as an ordinal index in panel form. To take border policies as an example, the OxCGRT index for border closures takes a value of 3 if borders are closed to some countries and a value of 4 if they are closed to all countries. If country X (i) only bans travel from country A in March 2020, (ii) then only bans travel from country B (lifting the ban for country A) in April 2020, and finally (iii) bans travel from all countries in May 2020, it will take on a value of 3 according to the OxCGRT index in March and April of 2020 and a value of 4 in May of 2020. However, it is not necessarily the case that the OxCGRT data will accurately record the end date of the travel ban against country A since, for the purposes of its index, the same value of 3 is maintained throughout March and April. Note that generally, all of OxCGRT’s ordinal indexes follow a similar logic insofar as they document whether a restriction applied to all, none or some, but do not provide further specifics when the restriction only applies to some. Such lapses in documentation likely explain the comparatively high number of missing end dates for the OxCGRT data.
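The information loss in the country X example can be reproduced as a toy computation: the ordinal index (3 = some countries banned, 4 = all) is identical in March and April, so the end of the ban on country A leaves no trace in the index series. The data structures here are invented for illustration:

```python
# Underlying (hypothetical) per-month travel bans for country X.
underlying = {
    "2020-03": {"banned": {"A"}},
    "2020-04": {"banned": {"B"}},   # ban on A lifted; invisible in the index
    "2020-05": {"banned": "all"},
}


def ordinal_index(banned):
    """Collapse per-country bans into an OxCGRT-style ordinal value:
    4 = closed to all countries, 3 = closed to some, 0 = no bans."""
    if banned == "all":
        return 4
    return 3 if banned else 0


index_series = {month: ordinal_index(v["banned"]) for month, v in underlying.items()}
# index_series == {"2020-03": 3, "2020-04": 3, "2020-05": 4}
```

Reconstructing event-level end dates from `index_series` alone is impossible here: the March-to-April transition, where the ban on A ended, produces no change in the index.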

Overall, when considering the numerous dimensions of the missing end dates problem, our analysis suggests that COVID AMP and CoronaNet have the highest data quality in terms of having the lowest percentage of policies with missing end dates, though COVID AMP’s relatively better performance is likely a function of the smaller size of its dataset. All datasets, however, are evaluated as having a more severe problem with missing end dates when they are assessed for harmonization than in their raw form, with the exception of ACAPS and HIT-COVID, which did not systematically collect data on end dates. This discrepancy underscores the challenge in harmonizing dirty data. Note that there is no value given for CoronaNet for the percentage of missing end dates in the harmonized data because the harmonized and raw versions of the dataset are identical for CoronaNet.

Meanwhile, based on the raw version of the data, though the average start and end dates for all datasets center around the last half of 2020, the earliest average start dates are found in ACAPS and HIT-COVID, the latest in OxCGRT, CIHI and WHO EURO, with CoronaNet, WHO CDC and COVID AMP in the middle of the pack. Meanwhile, WHO CDC and WHO EURO have the earliest average end dates, while OxCGRT, CIHI and CoronaNet have the latest. The last submission date (relative to September 2021, when the data was last retrieved for all datasets except CoronaNet) shows when datasets stopped or slowed their data collection efforts. Overall then, this table suggests that data harmonization of these 7 other datasets into the CoronaNet dataset may substantially raise the data completeness of the CoronaNet dataset.

As outlined above, all datasets considered in this paper suffer to varying degrees from problems of miscoded, missing or incomplete data. However, though dirty data substantially raises the complexity and challenge of accurate data harmonization, the harmonization process can also improve the quality of such data, which we discuss in more detail later on.

The challenge of harmonizing data with missing information on original sources

Given the challenges of harmonizing both (i) data coded from multiple different taxonomies and (ii) dirty data, it is essential to have access to the original raw source of a given policy in order to harmonize the data accurately. Original sources are necessary to substantiate the content or nuances of a given policy and to resolve any confusion or disagreement about a given coding decision.

In Table 5, we illustrate differences among the datasets in terms of how they make source data available (Source Data) and how many observations have no source data attached to them (Missing Links (Total)). The table also shows, relative to the external data that has already been assessed for harmonization, the percentage of observations found to be based on sources with dead links for which corroborating information could not be found after a good-faith effort (Unrecoverable links (Percent of total harmonized)), as well as the percentage of observations found to be based on dead links for which corroborating information was subsequently recovered (Recovered Links (Percent of total harmonized)).

We find that while all datasets provide the URL links used to code a given policy, only CoronaNet, COVID AMP, and HIT-COVID also provide links to static PDFs of the raw sources, which ensures this information remains available in the future. Note, however, that COVID AMP has 150+ observations with only a URL link and no PDF link attached, and early observations entered into the HIT-COVID dataset likewise have only URL links with no accompanying PDFs. With regard to observations missing a URL or PDF link to their raw source, the WHO EURO and OxCGRT datasets have the largest number of missing links, while this is not an issue for the CoronaNet and CIHI datasets. Based on the external data harmonized thus far, around 10.2% of the external data rests on dead links for which corroborating information could not be recovered. This was a particular problem for the WHO EURO and WHO CDC datasets, though not an issue for the CIHI or COVID AMP datasets. Meanwhile, around 4.7% of the external data assessed for harmonization to date was based on dead links for which corroborating information could subsequently be recovered. Because these data points are recoded using the CoronaNet methodology, PDFs of the recovered links were also uploaded, ensuring they will be preserved for future records. Observations coded by the WHO EURO database were found to be particularly recoverable. Note that we do not assess unrecoverable or recovered links for CoronaNet because the CoronaNet methodology ensures that PDFs are always saved (the data is collected via a survey, and uploading a PDF is mandatory for a policy response to be considered valid).
All told, at least 17% of the external data (3% with no links at all, 10.2% based on links with unrecoverable information, and 4.7% based on links with recoverable information) has some issue with its original sources, which only increases the challenge of smoothly harmonizing information from different datasets.

COVID-19 PHSM Data Harmonization methodology overview

The challenges posed by harmonizing multiple complex taxonomies of dirty data based on inconsistently preserved original sources led us to conclude that ultimately, only manual harmonization would allow us to harmonize data from different PHSM trackers in a way that ensures high data quality and validity. Given the sheer number of policies in the external dataset, however, we sought where possible to support these manual efforts with automated tools to harmonize data across 8 different datasets into the taxonomy for capturing COVID-19 PHSM developed by the CoronaNet Research Project. To that end, we followed the methodology laid out in Fig.  1 . That is, after we evaluated the set of COVID-19 PHSM data to harmonize, we made taxonomy maps between the different external data and the CoronaNet taxonomy (Step 1), performed basic data cleaning as well as trimming of policies irrelevant to the CoronaNet taxonomy from the external dataset (Step 2), and automatically deduplicated a portion of the external data (Step 3). After having piloted manual harmonization for a sample of the data (Step 4), we are currently manually harmonizing the remaining external data into the CoronaNet dataset (Step 5).

figure 1

PHSM Data Harmonization Process. This figure visualizes the volume of data processed across the different steps of our harmonization process for a given PHSM dataset. During Step 1, the taxonomy for each dataset is mapped into the CoronaNet taxonomy and represents the absolute amount of data that is possible to harmonize. During Step 2, basic cleaning and subsetting of each dataset is performed in order to remove observations that are clearly not mappable into the CoronaNet taxonomy. During Step 3, an algorithm is employed to remove as many duplicate observations as possible. Step 4, not pictured, refers to our pilot harmonization efforts for select countries and datasets. Step 5 refers to our ongoing efforts to manually harmonize the remaining data. Please see the methodology section for more details.

Table  6 provides a numerical breakdown of how different data have been processed along these steps and an overview of our harmonization efforts to date. It shows that after preprocessing and standardizing the raw data using automated methods (Steps 1 through 3), there remain 150,052 policies from the 7 datasets external to CoronaNet to harmonize manually (Step 5). While Step 5 is still ongoing, more than 50% of the external policies have been assessed for whether they overlap with the CoronaNet data, and around 36% have been assessed for harmonization into the CoronaNet dataset. Policies recoded into the CoronaNet dataset that were originally found in these external datasets can be identified in our publicly available dataset (see https://www.coronanet-project.org ) using the ‘collab’ and ‘collab_id’ fields, which refer to the external dataset source and the original unique ID, respectively. For a fuller accounting of our data harmonization methodology, please see the methodology section of this article.
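To illustrate, the ‘collab’ and ‘collab_id’ fields can be used to separate recoded external policies from original CoronaNet entries. The field names come from the article; the mock rows and ID formats below are illustrative assumptions, not the actual release contents.

```python
import pandas as pd

# Hypothetical mock of a few rows of the public CoronaNet release.
# 'collab' names the source tracker for harmonized policies and is
# missing for policies originally coded by CoronaNet.
coronanet = pd.DataFrame({
    "policy_id": [1, 2, 3, 4],
    "collab": [None, "OxCGRT", None, "HIT-COVID"],
    "collab_id": [None, "C1_2020-04-01_DEU", None, "hit_00042"],
})

# Policies harmonized from external datasets have a non-missing 'collab'.
harmonized = coronanet[coronanet["collab"].notna()]

# Count recoded policies per external source tracker.
counts = harmonized.groupby("collab")["collab_id"].count()
```

Filtering on `collab` in this way lets a user reproduce, on the real release, per-tracker tallies of how many policies were recoded from each external dataset.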

Assessing the Value of Harmonizing COVID-19 PHSM Data

Given the apparent intricacy of harmonizing data, to say nothing of harmonizing complex, unclean and incomplete data, it is easy to miss the forest for the trees. As such, in this section, we take a step back in order to provide an assessment of the overall value of harmonizing COVID-19 PHSM data. To do so, in the following we draw from guidance we developed in a separate commentary 12 to explore the benefits, costs, limitations, necessary resources and alternatives to data harmonization of PHSM data.

What can be gained from data harmonization?

To our knowledge, no individual effort to document PHSM has been able to do so for all countries. Indeed, though at the time of writing there are 145k+ observations unique to the CoronaNet dataset (of the 180k+ observations available in total, which includes harmonized data), we identified 150,052 observations across the 7 datasets external to CoronaNet combined, for data available through September 2021. According to our efforts so far, around 83% of the external data does not overlap with the CoronaNet dataset, and of these, around 45% can be recoded, suggesting there are potentially 55k additional observations to recode.

Data harmonization would thus lead to a dataset that is more complete and consistently coded across time and space than is currently available. Indeed, Fig.  2 shows that while most datasets have fair coverage of PHSM until the summer of 2020, with data from CoronaNet being especially rich, data after this time is limited, especially for trackers that stopped data collection (e.g. HIT-COVID, ACAPS). OxCGRT, in comparison to other datasets, has been able to document more policies for later months.

figure 2

Number of policies per date recorded by 8 different COVID-19 PHSM tracking efforts.

Meanwhile Fig.  3 illustrates differences in the number of policies captured across continents. Clearly, all trackers have asymmetrically focused on countries in Europe and North America.

figure 3

Number of policies per date, grouped by region, recorded by 8 different COVID-19 PHSM tracking efforts.

While data harmonization cannot compensate for this relative unevenness in data coverage, it can significantly improve coverage of non-European and non-North American countries on an absolute level.

Moreover, as Fig.  4 shows, most external datasets either focus on gathering national-level data for countries around the world or subnational data for a more limited number of countries, but rarely both.

figure 4

Number of policies per date, grouped by the initiating level of government, recorded by 8 different COVID-19 PHSM tracking efforts.

As such, data harmonization efforts will substantially improve the availability of PHSM data initiated at the national level and to some degree, the provincial level as well.

Overall, data harmonization greatly advances the completeness of PHSM data on a number of dimensions, including time, space, and administrative levels. Moreover, our data harmonization methodology also allows each policy in the external dataset to be evaluated independently, which can improve the quality of the PHSM data overall. This is all the more valuable given that, while PHSM data has generally been made publicly accessible in close to real time because of the emergency nature of the pandemic, research groups have not been able to guarantee data cleanliness (see the ‘Challenges of Data Harmonization’ section). Progress on these dimensions greatly improves the research community’s ability to conduct analyses of the COVID-19 pandemic, yielding results with greater external validity and generalizability (e.g. in cross-national analyses) as well as results with greater internal validity and fewer potential confounders (e.g. in subnational analyses).

What can be lost from data harmonization?

The main loss when harmonizing PHSM data into the CoronaNet taxonomy concerns measures that CoronaNet does not capture, for which the benefits of its relatively fine-grained taxonomy are moot. The most prominent of these are economic measures, such as business subsidies or rental support. For measures where there is conceptual overlap between the CoronaNet taxonomy and other taxonomies, the fact that the data were harmonized to the CoronaNet taxonomy, which is by far the most detailed of the 8 datasets, minimizes the extent to which information is lost in the harmonization process.

Meanwhile, the benefits of data harmonization aside, there can be real scientific value when different researchers approach similar research topics with different research designs 19 . In support of this, we further make taxonomy maps between the CoronaNet taxonomy and the taxonomy of each respective dataset publicly available through our  Supplementary Information , Section 2. These maps can not only help users better understand how to use different datasets, but can also provide robustness checks of COVID-19 related research and bolster the transparency and replicability of our data harmonization efforts.

What are the limits of data harmonization?

While we believe that our efforts to harmonize data across 8 different datasets will provide the most complete picture possible of COVID-19 PHSM, they will still fall short of a dataset that reflects all COVID-19 PHSM ever implemented. Though it is inherently impossible to assess how much data will still be missing after data harmonization is finalized (a complete dataset would need to exist to make this assessment, and none does), we offer some insights as to where and why data may be incomplete. Specifically, our complete, harmonized dataset will still (i) lack information on subnational policy making for a number of countries as well as for governments with low state capacity and (ii) be unable to ensure complete data cleanliness.

Our review of projects gathering COVID-19 policies suggests that most projects focus on national level policies, limiting what data harmonization can achieve. Table  S1 in the Supplementary Information shows the coverage of data on subnational policy making for all datasets that we know to be in existence, using data available at the time of writing. Most datasets aside from CoronaNet do not collect subnational data and to the extent that they do, they overwhelmingly focus on the United States. Meanwhile, though the CoronaNet data does capture subnational data for some countries, given the volume of policies generated and limited resources, we are only able to capture this data for reduced time periods. However, available evidence suggests that subnational policy-making has taken place in many other countries beyond the ones listed in Table  S1 in the Supplementary Information. Data from both the Varieties of Democracy Pandemic Backsliding Project (PanDem) 20 as well as CoronaNet’s internal surveys suggest that there was subnational policy making in anywhere from 30 to 90 countries at any given point in time, as visualized in Figs.  5 and 6 . Note that the CoronaNet internal surveys followed the same coding scheme as PanDem’s [subvar] variable; at the time of writing, CoronaNet’s internal assessment covers 115 countries for 6 quarters while PanDem’s data covers 144 countries for 5 quarters, with 91 countries covered in common across both.

figure 5

Extent of policies made at the subnational level by quarter, from CoronaNet Research Project internal assessment data.

figure 6

Extent of policies made at the subnational level by quarter, from the Varieties of Democracy Pandemic Backsliding Project (PanDem).

Meanwhile, we also identify how issues of low state capacity can make it difficult to document COVID-19 policies at all. Some problems that CoronaNet researchers have reported include:

No announcement of policies in any official government sources: In the absence of any official government sources about a policy, research assistants must rely on media reports which can often have conflicting information about the nature or timing of a given policy. It is also not uncommon for governments to announce policies on social media without providing further information in the form of official government sources.

Policies being communicated in mediums other than the Internet: In places with low internet connectivity, governments have been known to make policy announcements in non-digital forms used most prevalently by the local population e.g. radio, local news bulletins, town criers.

NGOs and/or IOs implementing policies that are normally under the purview of governments: When governments lack the capacity to respond to the COVID-19 pandemic, NGOs or IOs have been known to step in. While it is possible to capture these policies, policy trackers to date have largely focused on documenting government-initiated policies.

In short, large-scale data collection efforts for PHSM data have been predicated on: (i) the capacity to capture PHSM policies made at all administrative levels, (ii) the availability, access, and durability of web-based documentation on PHSM policies, and (iii) the assumption that governments are the primary policy responders to the COVID-19 pandemic. However, these conditions are not always present in low state capacity states. While the enormous undertaking described here will greatly advance our collective knowledge of COVID-19 PHSM policies, much more funding and support is needed to document all PHSM.

Finally, as we elaborate more fully in the ‘Challenges of Data Harmonization’ section, PHSM data is unusually challenging to harmonize because the emergency nature of the pandemic gave rise to multiple complex taxonomies and corresponding datasets that have had varying levels of quality, completeness, and underlying source material.

While we employ some automated processes to harmonize taxonomies and deduplicate data, our methodology is overwhelmingly reliant on the analog process of recoding external data based on the original sources found in the external data, rather than relying directly on the observations available in the external data itself. In doing so, we can ensure that whatever errors may have been made in the automated taxonomy harmonization process, which itself was adjusted to account for systemic errors in the external data (see the ‘Step 1. Making Automated Taxonomy Maps’ section below), can be rectified manually later. We have also rigorously tested our automated deduplication strategies to ensure that we are biased towards keeping duplicates to be removed manually later rather than mistakenly removing observations that are not duplicates (see the ‘Step 3. Automated Deduplication’ section below). However, despite our best efforts, we nevertheless cannot guarantee that the likely 55k+ observations that we will eventually recode from the external datasets into the CoronaNet taxonomy will be completely free of error.

What cooperative resources are available for harmonizing data?

External data partners were either co-hosts or participants in the two conferences hosted by CoronaNet: the PHSM Data Coverage Conference (February and March 2021) and the PHSM Research Outcomes Conference (September 21, 2021). During both conferences, though especially the first, trackers discussed common challenges and solutions in their data collection efforts, especially with regard to taxonomy and organization. Both the planning of the conferences and the conferences themselves helped increase mutual understanding and collegiality among trackers 1 . For more information, please see https://covid19-conference.org or the shared statement written by conference participants outlining a framework for cooperation and collaboration (PHSM 2021).

Meanwhile, bilateral exchanges also played an important role in identifying and overcoming specific challenges in mapping and harmonizing data for a given dataset. For instance, our ability to harmonize the CIHI dataset was contingent on close cooperation with the CIHI team. Aside from explicit coordination on the COVID-19 vaccines taxonomy, three volunteer researchers for CoronaNet were contracted to work on the CIHI database. This shared expertise greatly facilitated our ability to build a taxonomy map between CIHI and CoronaNet and to pilot our harmonization efforts.

Similarly, researchers from both CoronaNet and HIT-COVID were involved in building the HIT-COVID taxonomy map, which greatly facilitated the mapping process. They were also involved in piloting the data harmonization process, which also increased the speed at which it could be done. The fact that HIT-COVID and CoronaNet built their taxonomies for COVID-19 vaccine policies with mutual feedback from the other also facilitated the mapping of this particular policy type.

Meanwhile, ACAPS, COVID AMP, and OxCGRT generously made themselves available to clarify confusion or misunderstandings about their respective taxonomies, which helped make the mappings more accurate. However, despite repeated inquiries to the WHO PHSM dataset to initiate such cooperation, we found them to be unresponsive, which made the taxonomy mapping exercise with the WHO PHSM dataset comparatively difficult. Overall, we found that greater communication and cooperation between the leaders of different datasets was an important intangible in facilitating the data harmonization process.

What are alternatives to data harmonization?

While in this paper we concentrate on presenting our rationale and methodology for qualitatively harmonizing PHSM data, in Kubinec et al. (2021) we introduce a Bayesian item response model to create policy intensity scores of 6 different policy areas (general social distancing, business restrictions, school restrictions, mask usage, health monitoring and health resources) which combines data from both CoronaNet and OxCGRT 21 . As this latter paper shows, researchers should be cognizant that while statistical harmonization can be an effective form of data harmonization, the resulting indices or measures may sometimes need to be interpreted or used differently than the underlying raw data. For example, our policy intensity scores for mask wearing can be interpreted as the amount of time, resources and effort that a given policy-maker has devoted to the issue of mask restrictions in a given country compared to that of other countries. This is different from what the underlying raw data measures: whether a given mask restriction is in place or not. Researchers choosing to engage in statistical harmonization should thus provide a thorough accounting of the underlying concept that they seek to measure and a corresponding justification of why their statistical method provides a good operationalization of it.

This article presents our efforts to harmonize COVID-19 PHSM data for the 8 largest existing datasets into a coherent, unified one, based on the taxonomy developed by the CoronaNet Research Project. To do so, we provide a thorough accounting of the various challenges we faced in harmonizing such data as well as the methodology we used to address these challenges. Along the way, we open a window into understanding the strengths and weaknesses of different COVID-19 PHSM datasets and create a new path for future researchers interested in harmonizing data more generally to follow.

We also show that there are substantial gains to harmonizing PHSM data across 8 different datasets, particularly in terms of the temporal, spatial, and administrative coverage of PHSM data. While some conceptual diversity is always lost when harmonizing data, we argue that by harmonizing PHSM data to the CoronaNet taxonomy, this issue is minimized due to the CoronaNet taxonomy’s comparative richness. Data harmonization of these 8 datasets will still fall short of a complete PHSM dataset, especially for countries with a great deal of subnational policy making or low state capacity, but this effort nevertheless provides the fullest picture yet of COVID-19 government policy making. Moreover, it substantially improves upon the existing WHO PHSM effort to harmonize data in terms of both scale and quality (see  Supplementary Information , Section 3). More resources would allow us to complete data harmonization more quickly, which, given the ongoing nature of the COVID-19 pandemic, would be welcome. However, even if data harmonization is completed only after the pandemic is overcome, it will still present a tremendous historical resource for generations of researchers.

Our experience in data harmonization has underscored for us that the production of data may be understood not only as a mere reflection of reality, but a framing or even creation of reality. That is, by producing certain measures and not others, data can frame certain aspects of the world as more or less deserving of attention. Meanwhile creating a measure in the first place can bring forth concepts that previously did not exist in the public consciousness 22 . Harmonizing data cannot escape these dynamics and in fact invites greater scrutiny of them as it adds another layer of negotiation and complexity in terms of determining what is worthy of being measured and how to measure it. Undergirding all of this are social processes that produce data in the first place and which can have important influence on what data ultimately is or is not harmonized 23 . Though in a number of fields, researchers have developed novel platforms that aim to help facilitate data harmonization 24 , 25 , ultimately effective data harmonization requires researchers to identify clear goals for their harmonization process, a high level of attention to detail in designing a rigorous plan to carry out, and a robust working culture to ultimately successfully implement it. We hope our experience with PHSM data harmonization can provide a roadmap for researchers embarking on similar journeys for their own research.

Methodology

COVID-19 PHSM harmonization methodology.

In this section, we provide greater detail on the 5-step methodology we employed to semi-manually harmonize data from 7 PHSM datasets into the CoronaNet taxonomy for policies implemented by governments before September 21, 2021. We start by outlining each of these steps before expanding on each in a separate subsection. A note to the reader: unless explicitly noted, any subsequent analysis or description of the external data refers to data recorded by September 21, 2021.

Step 1: Create taxonomy maps for each external dataset and CoronaNet, which we make publicly available (see  Supplementary Information , Section 2). Based on these maps, we then mapped data available for each external dataset into the CoronaNet taxonomy

Step 2: Perform basic cleaning and subsetting of external data to only observations clearly relevant to existing CoronaNet data collection efforts.

Step 3: Remove a portion of duplicated policies using customized automated algorithms with respect to:

Duplication within each respective external dataset

Duplication across the different external datasets

Step 4: Pilot our data harmonization efforts for a select few countries (over the summer of 2021)

Step 5: Release the resulting curated external data to our community of volunteer research assistants to:

Manually assess the overlap between PHSM data found in CoronaNet with that found in the ACAPS, COVID AMP, CIHI, HIT-COVID, OxCGRT, the WHO EURO and WHO CDC datasets respectively and;

Manually recode data found in the external datasets that were not already in the CoronaNet dataset into the CoronaNet taxonomy.

Step 1. Making Automated Taxonomy Maps

Given the variety and complexity of approaches that different groups have taken to document PHSM policies, asking research assistants to not only become experts in one taxonomy but multiple taxonomies would have been unfeasible. Instead, we created maps between the CoronaNet taxonomy and other datasets so that all datasets could be understood in the CoronaNet taxonomy for the following principal fields:

Policy timing

The start date of the policy

When available, the end date of policy

Policy initiator

The country from which a policy is initiated

When available, the ISO-2 level region from which a policy is initiated

Policy Type

Broad policy type

When possible, the policy subtype

Sources/URLs

When available, links to original PDFs

Textual description

When possible, other fields, such as the geographic and demographic targets, are also matched. As outlined in the ‘Challenges of Data Harmonization’ section, because of conceptual and organizational differences across taxonomies, one-to-one mappings were not always possible, especially with regard to the substance of COVID-19 policies. In such cases, one-to-two or one-to-three mappings were suggested. For the COVID AMP and WHO PHSM mappings (relevant for the WHO EURO and WHO CDC datasets), we also employed machine learning models to predict the most likely CoronaNet policy type for an observation based on the textual description of the policy. Both because one-to-one mappings based on the taxonomies themselves were often not possible and because of issues with dirty data, in some cases the mappings were adjusted so that they were based not only on the formal taxonomy but also on whether certain keywords were used in the dataset. For example, though policies originally coded in the WHO taxonomy of ‘Social and physical distancing measures (Category) - Domestic Travel (Sub-Category) - Closing internal land borders (Measure)’ might reasonably map onto CoronaNet’s ‘Internal Border Restriction’ policy type, when the word ‘quarantine’ appears in the text description of such policies, we reclassify them in the taxonomy map as a ‘Quarantine’ policy instead. As such, these taxonomy mappings are not always based strictly on how different policy types theoretically should map onto each other, but attempt to account for mistakes and miscodings in the external data to create the best mapping possible between the existing data and the CoronaNet dataset.
In this first automated step, our aim was to ensure that most mappings were correct; we did not take pains to verify every single mapping because, as we explain later on, each observation is ultimately assessed and evaluated for harmonization by human coders, who are better equipped to make these fine-grained and nuanced judgements.
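The keyword-adjusted mapping described above can be sketched as a lookup with an override rule. Only the WHO-to-CoronaNet quarantine example comes from the text; the function signature, field names, and base map structure are illustrative assumptions.

```python
# Minimal sketch of a keyword-adjusted taxonomy map. The base map keys
# (Category, Sub-Category, Measure triples) and the override rule mirror
# the example in the text; everything else is hypothetical.
base_map = {
    ("Social and physical distancing measures", "Domestic Travel",
     "Closing internal land borders"): "Internal Border Restriction",
}

def map_policy_type(category, subcategory, measure, description):
    """Map a WHO-coded policy into the CoronaNet taxonomy, adjusting
    for keywords in the textual description that signal a different
    policy type in practice."""
    mapped = base_map.get((category, subcategory, measure))
    # Keyword override: border-closure policies whose descriptions
    # mention quarantine are reclassified as Quarantine policies.
    if mapped == "Internal Border Restriction" and \
            "quarantine" in description.lower():
        return "Quarantine"
    return mapped
```

In practice, each external dataset would carry its own base map and its own set of keyword overrides, tuned to that dataset's observed miscodings.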

As part of this mapping exercise, in order to keep track of the original dataset that each observation came from, we also ensured that each record was associated with its own unique identifier (unique_id). In some cases, reformatting the data also affected how the unique_id assigned by the original dataset was formatted, though we ensured that our transformation method still allows others to trace a given observation back to the original dataset. For example, in HIT-COVID, border restrictions for people leaving or entering a country are coded as separate observations. However, in CoronaNet, if a policy restricts both entry and exit to or from the same countr(ies) on the same date, it is coded as one observation. In this case, the HIT-COVID data is collapsed into one observation and the unique identifiers are also collapsed, such that two or more of the original identifiers become one when mapped to the CoronaNet taxonomy. In the case of OxCGRT, no unique identifiers are provided in the original dataset, so we create them using a combination of the policy indicator, date, country, and, where applicable, province.
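The two identifier operations described above (building composite IDs for OxCGRT and collapsing IDs for merged HIT-COVID records) can be sketched as follows. The delimiter choices and field order are assumptions for illustration; the underlying logic follows the text.

```python
def oxcgrt_unique_id(indicator, date, country, province=None):
    """OxCGRT provides no unique identifiers, so construct one from the
    fields that jointly identify an observation: policy indicator, date,
    country, and (where applicable) province."""
    parts = [indicator, date, country]
    if province:
        parts.append(province)
    return "_".join(parts)

def collapse_ids(unique_ids):
    """When several external rows (e.g. separate HIT-COVID entry and
    exit border restrictions) become one CoronaNet observation, join all
    original IDs so the record stays traceable to its source rows."""
    return ";".join(sorted(unique_ids))
```

Sorting before joining in `collapse_ids` makes the collapsed identifier deterministic regardless of the order in which the source rows were encountered.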

Please see the  Supplementary Information , Section 2, for more information about how to access the specific taxonomy mappings we created between CoronaNet and other datasets.

Step 2. Basic cleaning and subsetting of external data

With the help of the taxonomy maps, we were able to roughly transform the external datasets into the CoronaNet taxonomy. Before moving forward with manual data harmonization, we first implemented some basic cleaning and subsetting of the data. Because, as discussed in the ‘Challenges of harmonizing different taxonomies’ subsection above, most datasets do not use a consistent reference for identifying policies originating at the ISO-2 provincial level, we wrote code to clean these text strings as much as possible. Given the sheer number of observations needing such cleaning, we could not ensure full standardization of these text strings. However, we took pains to ensure that the 430+ provinces for which CoronaNet systematically collects subnational data were consistently documented in the external data. Specifically, these are the subnational provinces of the following countries: Brazil, China, Canada, France, Germany, India, Italy, Japan, Nigeria, Russia, Spain, Switzerland, and the United States.

Next, we subset the external data to exclude regions for which CoronaNet is not currently collecting data. In particular, we excluded from our harmonization efforts observations from the COVID AMP dataset documented at the county or tribal level in the United States, as well as observations for Greenland, the United States Virgin Islands, and Guam. In addition, we subset the external dataset to exclude policy types that CoronaNet does not currently collect, in particular economic or financial measures taken in response to the pandemic.
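The subsetting step can be sketched as a set of exclusion filters. The column names and mock rows are illustrative assumptions; the exclusions themselves (US county/tribal records, Greenland, US Virgin Islands, Guam, and economic measures) come from the text.

```python
import pandas as pd

# Hypothetical mock of a few external-dataset rows after taxonomy mapping.
external = pd.DataFrame({
    "country": ["United States", "Greenland", "Germany", "United States"],
    "admin_level": ["county", "national", "national", "state"],
    "policy_type": ["Lockdown", "Quarantine", "Business subsidy", "Mask wearing"],
})

# Exclusion sets drawn from the text; the economic policy-type labels
# are assumed examples.
excluded_regions = {"Greenland", "United States Virgin Islands", "Guam"}
excluded_levels = {"county", "tribal"}
excluded_types = {"Business subsidy", "Rental support"}  # economic measures

# Keep only rows within CoronaNet's collection scope.
kept = external[
    ~external["country"].isin(excluded_regions)
    & ~external["admin_level"].isin(excluded_levels)
    & ~external["policy_type"].isin(excluded_types)
]
```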

Step 3. Automated Deduplication

After making taxonomy maps from each external dataset to the CoronaNet taxonomy and conducting some basic cleaning of the data, we also took steps to deduplicate the data using automated methods to the extent possible. Deduplication was assessed along three criteria: (i) duplicates within each external dataset, (ii) duplicates across the external datasets, and (iii) duplicates between the CoronaNet and external datasets. We outline the steps we took to assess the level of duplication along each of these criteria and, when possible, to remove duplicates accordingly. All in all, we took a conservative approach in our automated deduplication efforts: we preferred to leave many potential duplicates in the dataset rather than mistakenly remove policies that were not duplicates.

Step 3a. Deduplication within External Datasets

Given the sheer amount of data collected and the coordination needed to collect it, it is not surprising that there is some duplication within datasets. Duplicates can occur for a number of reasons, including (i) structural differences between taxonomies, (ii) the lack of one-to-one matching between taxonomies (e.g. a policy coded as several policies in one taxonomy may be coded as only one policy in the CoronaNet taxonomy), and (iii) coder error.

We first needed to deal specifically with duplication that occurs as a result of OxCGRT’s method of collecting data to fit a panel data structure. In particular, OxCGRT coders are generally instructed to provide an assessment of whether a policy was in place or not for each given day that they are either recording the policy or for which they have evidence for a policy being in place or not. For instance, if a coder finds that the same policy has been in place over several weeks, the same textual description may be copied and pasted into the notes section of the dataset for each day that the coder happened to review the status of policy-making for that indicator, even if the ordinal indicator itself does not change. When initially extracting and reshaping the OxCGRT data into an event dataset format, each textual description is initially retained, even though it may not contain new information. To deal with this, we built a custom function to identify policies that repeated the exact same description, keeping the ‘latest’ instance of the policy description and removing earlier ones (see the OxCGRT-CoronaNet taxonomy map available through the  Supplementary Information , Section 2, for more detail).
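A hypothetical re-implementation of this keep-the-latest logic in pandas, on a toy OxCGRT-style extract (the column names here are illustrative, not the actual OxCGRT schema):

```python
import pandas as pd

# Toy OxCGRT-style extract: the same description repeated across review dates.
ox = pd.DataFrame({
    "country": ["France", "France", "France", "France"],
    "indicator": ["C1", "C1", "C1", "C2"],
    "description": ["Schools closed nationwide"] * 3 + ["Workplaces closed"],
    "review_date": pd.to_datetime(
        ["2020-03-16", "2020-03-23", "2020-03-30", "2020-03-17"]),
})

# Within each country/indicator, keep only the latest instance of an
# exactly repeated description, dropping earlier copies.
deduped = (
    ox.sort_values("review_date")
      .drop_duplicates(subset=["country", "indicator", "description"], keep="last")
      .reset_index(drop=True)
)
```

The three repeated school-closure rows collapse to the single latest one, while the distinct workplace-closure row is untouched.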

We also needed to implement a custom procedure to deal with a related practice, unique to OxCGRT’s methodology, of documenting ‘no change’ in a policy indicator. Specifically, when an OxCGRT coder does not identify any change in a policy indicator, it is customary for the coder to note something to the effect of ‘no change’ in the textual description for that particular day. This information can be extremely valuable if one desires to know the status of a given indicator in the ‘present’, as it allows researchers to distinguish whether there was truly no change in government policy making or whether there was simply no one actively documenting government policy making for a given region and indicator. As the present becomes the past, however, this information becomes less useful. For instance, while the value of knowing that there was ‘no change’ in a given indicator ‘today’ is quite high, knowing that there was ‘no change’ for a given indicator in e.g. March 2020 is not very informative, especially if there was subsequently a flurry of policy-making activity for that indicator. Given that we initially retained each textual description from the OxCGRT data when transforming it from a panel to an event dataset format, our initial efforts created an OxCGRT event dataset filled with observations that documented variations of the sentiment ‘no change.’ Because the CoronaNet taxonomy does not document when there are no policy changes, we sought, to the extent possible, to remove such observations from the OxCGRT dataset. The difficulty in doing so was compounded by the fact that (i) there appears to be no standard language that OxCGRT coders follow in communicating that a policy had no change and (ii) not infrequently, a textual description will start by noting that there has been no change to a policy, but will then provide a long and detailed description of the policy.
In these cases, it is unclear whether there actually was no change to a policy and the coder is simply noting what the policy was or if there was no change to the policy that could be captured by the OxCGRT taxonomy, but there were actually some changes made by the government and the coder is documenting them qualitatively in the text. To deal with the former issue, we looked through hundreds of OxCGRT policies to try to identify as many phrases that conveyed the sentiment ‘no change’ as possible. To deal with the latter issue, we did not remove observations over a certain character limit even when they noted that there was ‘no change’ in case there actually was a substantive change that could be captured in the CoronaNet taxonomy. These choices were consistent with our general conservative approach towards automated deduplication.
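This heuristic can be sketched as follows; the list of ‘no change’ phrases and the character-length threshold below are illustrative assumptions, not the actual values used:

```python
import pandas as pd

# A non-exhaustive, hypothetical list of phrases conveying 'no change'.
NO_CHANGE_PHRASES = ["no change", "no changes", "no new policy", "same as before"]
MAX_CHARS = 120  # longer descriptions are kept even if they mention 'no change'

def is_pure_no_change(description: str) -> bool:
    """True if a description merely signals 'no change' and is short enough
    that it is unlikely to contain a substantive policy update."""
    text = description.lower()
    return len(text) <= MAX_CHARS and any(p in text for p in NO_CHANGE_PHRASES)

ox = pd.DataFrame({"description": [
    "No change.",
    "No change, but the government extended the mask mandate to all indoor "
    "public spaces including shops, museums and public transport until May 1.",
    "Stay-at-home order issued for the region.",
]})
kept = ox[~ox["description"].map(is_pure_no_change)].reset_index(drop=True)
```

Only the short, uninformative ‘No change.’ entry is removed; the long description is retained despite opening with ‘No change’, reflecting the conservative length guard described above.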

Following this specialized deduplication for the OxCGRT dataset, we then sought to identify duplicates within each dataset more generally. We experimented with identifying policies that had identical values for a variety of different variables and ultimately found the following set of variables to accurately identify a large number of duplicates:

description: records the textual description used to describe each observation. Note, for the purposes of deduplication, the descriptions were stripped of punctuation and special characters and converted to lowercase in order to decrease the likelihood that stray superfluous symbols would prevent the identification of duplicates. During the manual harmonization stage, the original descriptions are used.

country: records the country that a policy originates from, where the list of countries are standardized.

province: records the province that a policy originates from, where the list of provinces is semi-standardized (see the ‘Step 2. Basic cleaning and subsetting of external data’ subsection for more information).

link: records the URL link used as the raw source of information for a given policy.

Theoretically, we believed that the likelihood of identifying true duplicates with the above variable fields is quite high, given that descriptions are written in free form and that URL links can act as fairly robust unique identifiers. With this set of variables, we identified 6955 policies that were duplicated. (Note that we excluded from this procedure policies that had the words ‘Extension’ or ‘extend’ in their descriptions. As part of our investigation, we found that it was common for coders to copy and paste the same description with this word every time a policy was extended in time, and as such we would have inaccurately removed many policies had we not excluded such observations from our deduplication efforts.) To assess our efforts, we sampled 100 groups of policies that were found to be duplicates (equivalent to 393 total observations) and, through manual investigation, found that 99 of these groupings were indeed duplicates, for an accuracy of 99%. We further manually checked groups of policies identified as having a particularly high number of duplicates (7 or more, the maximum being 19) and found that our criterion accurately identified these groups of policies as having duplicates. Because this automated deduplication method proved to be quite accurate, we subsequently used this criterion to remove likely duplicates within each dataset. We show the distribution of policies we found to be duplicates according to this criterion in Table  7 .
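A minimal sketch of this within-dataset deduplication on toy data; the actual code differs, but the logic of normalizing descriptions, exempting extensions, and dropping duplicates on the four fields is the same in spirit:

```python
import re
import pandas as pd

def normalize(text: str) -> str:
    # Strip punctuation/special characters and lowercase, as described above.
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

df = pd.DataFrame({
    "description": ["Schools closed!", "schools closed", "Borders closed (extend)"],
    "country": ["Italy", "Italy", "Italy"],
    "province": ["Lombardy", "Lombardy", ""],
    "link": ["http://a.it", "http://a.it", "http://b.it"],
})

df["desc_norm"] = df["description"].map(normalize)
# Observations mentioning extensions are exempt from deduplication.
exempt = df["desc_norm"].str.contains("extension|extend")
dupes = df[~exempt].duplicated(
    subset=["desc_norm", "country", "province", "link"], keep="first")
deduped = pd.concat([df[~exempt][~dupes], df[exempt]]).sort_index()
```

Here the two school-closure rows differ only in punctuation and case, so one is dropped, while the extension row is left alone even though its link is unique anyway.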

As can be seen, we identified a particularly high number of duplicates within the OxCGRT dataset. This is consistent with our knowledge that duplication is a particular problem with OxCGRT data because of their methodology for data collection as well as what we know to be a conservative approach in our custom method of deduplicating OxCGRT data.

Step 3b. Deduplication across External Datasets

The data was also evaluated for duplicates across datasets. Data duplication across datasets happens because different policy trackers have only coordinated their work in collecting PHSM data to a limited extent. As such, the same policy may be independently documented by coders in different datasets. While this is desirable from the point of view of data validation, it is a hindrance from the point of view of data harmonization.

As a first step in deduplicating data across datasets, we were able to remove a number of observations that were by definition duplicates. Specifically, since the OxCGRT subnational data for Canada is based in large part on the data collected by CIHI, we removed OxCGRT data for Canada from our dataset and instead chose to prioritize the more fine-grained version of the data documented by the CIHI dataset. Note that the full WHO PHSM dataset actually includes data from ACAPS, HIT-COVID and OxCGRT. These observations were removed from the dataset as well, following a similar logic. First, it seemed likely that a direct translation from e.g. the ACAPS to the CoronaNet taxonomy would lead to fewer errors than using the version of the data that first translates ACAPS to the WHO PHSM taxonomy and then to the CoronaNet taxonomy. Second, it allows us to evaluate the full ACAPS, HIT-COVID and OxCGRT datasets; whereas in the WHO PHSM dataset, the ACAPS, HIT-COVID and OxCGRT data has already been deduplicated according to the WHO PHSM methodology.

Following this, we then experimented with identifying duplicates across datasets more generally. In addition to exploring which set of variables most reliably identified groups of true duplicates (as we did for identifying duplicates within datasets), when deduplicating across datasets we further had to decide from which dataset observations should be retained when duplicates were found. With regard to the former, we found that identifying duplicates based on the following variables yielded the most accurate results:

type: records the broad policy area of a given COVID-19 policy. E.g. a policy related to schools will be coded as ‘Closure and Regulation of Schools’ type.

type_sub_cat: The specific policy area of a given COVID-19 policy. This is hierarchically determined such that only certain type and type_sub_cat combinations can go together. E.g. a policy related to primary schools will have a type_sub_cat of ‘Primary Schools’ and will by definition have a policy type of ‘Closure and Regulation of Schools.’

province: records the province that a policy originates from, where the list of provinces is semi-standardized (see the ‘Step 2. Basic cleaning and subsetting of external data’ subsection for more information).

target_who_what: if applicable, records the citizenship (citizen or non-citizen) or travel status (traveller or resident) which a given policy is targeted toward.

date_start: records the start date of a given policy.

Note, we considered other variables but found that they were not adequate because they were not broadly collected across the different external datasets. E.g. enforcer is only collected by CIHI; target_country and target_province are only collected by COVID AMP; target_direction and institution_status are only collected by HIT-COVID; type_mass_gathering is only collected by WHO EURO and WHO CDC; and date_announced is only collected by COVID AMP and CIHI.

Meanwhile, with regard to the issue of which observation we should ultimately retain when duplicates were identified, we developed a protocol for prioritizing datasets based on both our qualitative experience working with and transforming each dataset during the taxonomy mapping exercise in Step 1 and the quantitative assessment of the data quality of each dataset outlined in the ‘Challenges of Data Harmonization’ section. When there was only one duplicate identified for a given observation, we chose to retain information from the dataset that had the greatest number of characters in its textual description of that observation. When more than one duplicate was identified per grouping, however, we developed the following protocol for prioritizing which observation to retain:

Priority 1: For Canadian data, CIHI is prioritized first because this dataset specializes in collecting Canadian data.

Priority 2: COVID AMP is prioritized second for all data except for Canadian data, based on both our qualitative and quantitative assessment of COVID AMP data quality. Based on our experience creating the taxonomy map between COVID AMP and CoronaNet, we found that COVID AMP’s taxonomy was very similar to the CoronaNet taxonomy, mitigating the challenge of taxonomy mapping and potential attendant errors. In terms of our quantitative assessment of COVID AMP data quality, we found it to be relatively high quality insofar as there are very few missing links and relatively high quality textual descriptions. Note, however, that the version of the COVID AMP data we harmonized, available in 2021, only collects data for 64 sovereign countries (while 95 are available in its dataset, these include policies for United States Native American tribes).

Priority 3: WHO CDC and WHO EURO are prioritized third for all data except for Canadian data. These data were prioritized together because they have already been mutually assessed for deduplication within the WHO PHSM dataset. In terms of data quality, the WHO CDC data appears to have higher quality descriptions than OxCGRT, ACAPS and HIT-COVID based on the average length of the description and the number of descriptions with fewer than 50 characters, while the WHO EURO data appears to have higher quality descriptions than ACAPS and HIT-COVID based on average description length, and than ACAPS, HIT-COVID and OxCGRT based on the number of descriptions with fewer than 50 characters. Both datasets furthermore have fewer missing end dates than ACAPS, HIT-COVID and OxCGRT.

Priority 4: OxCGRT is prioritized fourth for all data except for Canadian data because OxCGRT has some information on end dates and based on our qualitative assessment, has more informative descriptions of policies than HIT-COVID and ACAPS. This is supported quantitatively as well given that OxCGRT descriptions are on average longer and have less missingness than HIT-COVID and ACAPS descriptions.

Priority 5: HIT-COVID is prioritized fifth for all data except for Canadian data because compared to the ACAPS taxonomy, the HIT-COVID taxonomy is relatively similar to the CoronaNet taxonomy and it is relatively rich in subnational data. It was prioritized after the other datasets in part because it has no information on end dates.

Priority 6: ACAPS is prioritized sixth for all data except for Canadian data because, compared to the other datasets, its textual descriptions are of poorer quality and it has no information on end dates.
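The retention protocol above can be sketched as follows. The priority table and grouping are simplified assumptions for illustration (e.g. the special handling of Canadian data via CIHI is folded into a single ranking, and duplicate groups are assumed to be pre-identified):

```python
import pandas as pd

# Simplified priority ranking (lower = retained first); illustrative only.
PRIORITY = {"CIHI": 1, "COVID_AMP": 2, "WHO_CDC": 3, "WHO_EURO": 3,
            "OxCGRT": 4, "HIT-COVID": 5, "ACAPS": 6}

def pick_observation(group: pd.DataFrame) -> pd.Series:
    if len(group) == 2:
        # One duplicate: keep the observation with the longer description.
        return group.loc[group["description"].str.len().idxmax()]
    # More than one duplicate: keep the highest-priority dataset.
    return group.loc[group["dataset"].map(PRIORITY).idxmin()]

dupes = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 2],
    "dataset": ["OxCGRT", "ACAPS", "ACAPS", "COVID_AMP", "HIT-COVID"],
    "description": ["Schools closed until further notice", "Schools closed",
                    "Masks required", "Masks required in shops", "Masks"],
})
kept = pd.concat(
    [pick_observation(g).to_frame().T for _, g in dupes.groupby("group_id")],
    ignore_index=True,
)
```

In the toy data, the two-member group keeps the longer OxCGRT description, while the three-member group keeps the COVID AMP observation by priority.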

Using the above methodology, we identified 5989 duplicate observations. The distribution of policies identified as duplicates is shown in Table  8 . Here we can see that observations from OxCGRT and ACAPS were discarded most often given these criteria. We then sampled 100 groups of observations identified as duplicates by this algorithm, for a total of 425 observations, and found 74.5% to be true duplicates, meaning that around 1500 of the observations discarded as duplicates in this process were likely unique. Given that we identified around 180k observations to harmonize to begin with, and that most discarded policies were from datasets that we had previously found to have a higher likelihood of duplication (OxCGRT) or to be of comparatively lower quality (ACAPS), we made the judgement call that it was acceptable to discard this small percentage of observations without threatening the rigor of the data harmonization exercise writ large. Moreover, discarding these policies from consideration for manual harmonization at this point does not preclude doing so at a later stage, should resources allow for reassessing the value of harmonizing these policies.

Step 3c. Deduplication between CoronaNet and External Datasets

Lastly, we also evaluated the extent to which there were duplicates between the CoronaNet dataset and the external datasets. Such duplication can occur for the same reason that there is duplication across datasets: there has not been coordination between CoronaNet and these other datasets in terms of collecting policies and as such it is quite possible that there are duplicates across these datasets.

As with our attempts to identify duplicates within and across the external datasets, we also experimented with different sets of variables that could accurately identify true duplicates between the CoronaNet and external datasets. However, we were ultimately not able to find a combination that yielded sufficiently high accuracy. Our best attempt used the following variables to identify true duplicates:

init_country_level

Based on this criterion, we sampled 100 groups of policies found to be duplicates (equivalent to 764 observations) but found that only 14 were true duplicates, for an accuracy of 14%. Subsequent efforts with other sets of variables did not improve on this percentage. As such, we were unable to automate deduplication between the external and CoronaNet datasets and limited our automated deduplication efforts to deduplication within and across external datasets.
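The sampling-based validation used throughout Step 3 can be sketched as follows; the group identifiers and the manual-review outcome below are purely illustrative:

```python
import pandas as pd

# Toy set of candidate-duplicate groups produced by an automated criterion.
candidates = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 3, 3],
    "unique_id": list("abcdefg"),
})

# Draw a reproducible sample of groups for manual review.
sampled_groups = (candidates["group_id"].drop_duplicates()
                  .sample(n=2, random_state=42).tolist())
to_review = candidates[candidates["group_id"].isin(sampled_groups)]

# Suppose manual review found one sampled group to be a true duplicate set.
manual_labels = dict(zip(sampled_groups, [True, False]))  # illustrative labels
accuracy = sum(manual_labels.values()) / len(manual_labels)
```

The same scaffold, scaled to 100 sampled groups, yields the accuracy percentages reported above.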

As a last step, we adjusted the dataset at this stage for the sample of policies that we manually inspected for duplication in Steps 3a and 3b. In other words, we recovered the policies that the algorithm falsely identified as duplicates and added them back to the dataset to be evaluated for manual harmonization. In so doing, we additionally identified observations in this sample (around 50) that would not be considered policies in CoronaNet and removed them from consideration for manual harmonization.

Step 4. Piloting of Manual harmonization Efforts

Steps 1 through 3 yielded an external dataset for which automated taxonomy mappings provided a rough first translation of the external data to the CoronaNet taxonomy and automated deduplication was able to remove the most obvious instances of duplicates within the external dataset.

As the challenges of harmonizing different, unclean datasets with inconsistently preserved raw sources revealed themselves, it became clear that the bulk of the work in data harmonization would need to be manual. While automated methods were able to reduce the size of the external dataset from around 180k to around 150k records, this still represents a tremendous number of policies to harmonize. To meet this challenge, the CoronaNet Research Project has been fortunate to be able to recruit hundreds of volunteers from around the world to help us complete this task.

Before rolling out these efforts to the entire project, however, we first piloted data harmonization for a subset of each external dataset in order to i) validate the accuracy of the automated taxonomy mappings in Step 1 and ii) learn about potential difficulties and pitfalls, as well as useful strategies for data harmonization, so as to provide better guidance to future volunteers.

Table  9 describes the scope of our pilot harmonization efforts. The ‘assessment time frame’ refers to the actual time frame spent piloting the data harmonization efforts (as opposed to when the policies themselves were implemented). Part of the reason for these staggered time frames is that each taxonomy map itself took around 3-4 weeks to create; once a taxonomy map was created, it was immediately piloted for a given geographical scope. The choice to pilot certain countries and regions depended both on the availability of data for a given region in a given dataset and on CoronaNet’s own prioritization of harmonizing European countries first, given its partial funding from an EU Horizon 2020 grant. While relatively more assessments were done for taxonomies piloted earlier, fewer policies were assessed later on, in part because i) taxonomy maps improved given the experience of building the earlier ones and ii) assessors became more efficient given the experience of assessing earlier taxonomies. The order in which taxonomies from certain datasets were mapped, as opposed to others, was largely a function of how much capacity for cooperation the partner dataset was able to provide in building a given taxonomy map.

As can be seen in Table  9 , we initially sought to also include CCCSL in our pilot harmonization efforts. Unlike the other taxonomy maps, the taxonomy map in this case was spearheaded by CCCSL partners. However, as part of the pilot assessment exercise, we found that the CoronaNet and CCCSL taxonomies were both too complex to create high-accuracy maps between them. As previously discussed, given that CCCSL also had only around 11k observations, relatively few compared to other trackers with aspirations to track policies world-wide, as well as inconsistently preserved sources and unstandardized descriptions, we decided to deprioritize harmonizing CCCSL data.

In piloting this data harmonization process more generally, research assistants reported that vague or incomplete descriptions and missing or dead links increased the difficulty of the work. It was not uncommon to encounter duplicate policies or external policies that needed to be broken down into smaller pieces in order to translate properly into the CoronaNet taxonomy. The pilot harmonization process also produced a pool of strategies and tips that future research assistants could draw on in their own efforts. Some strategies include (i) reading through the descriptions of all observations for a given country or region first in order to catch potential errors in the dataset, (ii) using the Wayback Machine to recover dead links, and (iii) being aware that national level data from the OxCGRT dataset may include information about subnational policies because of the particulars of their methodology. Ultimately, these experiences helped us finalize the procedure we developed to manually harmonize the data, which we describe in the following section.

Step 5: Manual harmonization of data

After having piloted our manual data harmonization efforts for each external dataset separately, we then finalized our plans for manual harmonization of the full combined external dataset in two main steps. First, each observation is assessed for whether it is already documented within the CoronaNet dataset or not. This information is saved internally under the column name ‘overlap_assessment.’ Second, observations that are not currently in CoronaNet are recoded using the CoronaNet taxonomy and harmonized into the CoronaNet dataset. This information is saved internally under the column name ‘harmonize_assessment.’ We elaborate on each of these steps below.

In order to allow coders to manually assess the external data according to these criteria, we wrote the external data into Google Sheets, which we refer to here as the ‘Data Harmonization Sheets’ (referred to internally as ‘Data Integration Sheets’). We grouped each sheet by country or subnational region and added conditional formatting to help facilitate coders’ assessments. A note here on language: at the beginning of our harmonization process, we inaccurately referred to our efforts as ‘data integration’ instead of ‘data harmonization.’ To reduce confusion for the reader, we use the term ‘harmonization’ instead of ‘integration’ in what follows. We report this discrepancy here for the sake of transparency.

By using Google Sheets, we were able to provide an editable, centralized place for numerous different people to assess the external data. In addition to the ‘overlap_assessment’ and ‘harmonize_assessment’ columns as well as columns to record which human coder made a given assessment, these sheets also provide information about the:

Unique identifier for a given external observation (unique_id)

Dataset that it belongs to (dataset)

Textual description of the observation (description)

Timing of the policy (date_start; date_end)

Likely policy type: type and type_sub_cat provide the direct mapping, while type_alt and type_alt_2 provide the machine learning prediction of the policy type, when available

Demographic targets of a policy when available (target_who_what, target_who_gen)

Geographic information about the policy initiator (country, province, city, init_other)

Geographic target of the policy (target_country, target_province, target_city, target_other)

Compliance of the policy (compliance)

Types of travel the policy affected if applicable (travel_mechanism), and

Raw source of the policy either in terms of the original URL (link) or a PDF of the source (pdf_link)

We summarize each of the steps below before then providing an example of how the Data Harmonization Sheets are used following this methodology. Though manual harmonization of the data is still ongoing, we close the section by providing an assessment of our progress to date and a discussion of tools and resources we have developed to support this process.

Step 5a. Manual assessment of overlap between external and CoronaNet data

For each observation in the external dataset, a human coder evaluates whether the observation has previously been captured in the CoronaNet dataset or not. This evaluation is stored in the column ‘overlap_assessment’ in the Data Harmonization Sheets and can take on the values of ‘Yes’, ‘No’, or ‘NA.’ The meaning of each of these values is as follows:

‘ Yes ’: this means that the external observation had already been independently captured in the CoronaNet dataset. In this case, the research assistant copies and pastes the corresponding CoronaNet unique identifier, which is stored in its record_id variable, into the matched_record_id column in the Data Harmonization Sheet.

‘ No ’: this means that the external observation has not been previously captured in the CoronaNet dataset. In this case, the human coder should move onto the second step of manually harmonizing the data.

‘ NA ’: this means that no one has yet been able to make an assessment of whether a given observation is or is not already in the CoronaNet dataset.

Step 5b. Manual harmonization of data

If a given observation is found to be in the external dataset but not in the CoronaNet dataset, the human coder moves on to the second step of harmonizing external data into the CoronaNet taxonomy. To do so, they are instructed to treat the external observation just as they would any other potential source of information about a COVID-19 policy. In particular, they are asked to first go to the raw source of information using either the URL or PDF links (if available) provided for a given policy. Put differently, they are asked to recode the data based on the raw source of information provided in the Data Harmonization Sheets rather than from the textual description of the observation provided by the external data.

Once they have read through the raw information source, they can then either recode the information into the CoronaNet taxonomy using the normal procedure for documenting policies at CoronaNet (that is, they can recode this information into a Qualtrics survey customized for this purpose; see the methodology section in 6 for more information) or they can provide another assessment of the external data. In the ‘harmonize_assessment’ column, they can make one of the following seven assessments:

‘Harmonized’: this means that the coder has recoded the observation into the CoronaNet taxonomy.

‘Harmonized with additional original research’: this means that the coder had to do some additional research before coding the observation into the CoronaNet taxonomy. This could be for any number of reasons. E.g. the information from the URL or PDF links in the external dataset may be unclear or require additional context/knowledge to code well.

‘Harmonized with additional work to find a new link’: means that the original link for the policy is dead but that the research assistant was able to find a new link that corroborates the information described in the ‘description’ column.

‘Harmonized with additional original research AND with additional work to find a new link’: means the research assistant fulfilled the criteria under both ‘Harmonized with additional original research’ and ‘Harmonized with additional work to find a new link.’ See above for more information.

‘Duplicated policy’: means that there were multiple external policies that were duplicates of each other. In this case, the research assistant is asked to only harmonize one of them and to mark the other ones as being duplicates.

‘Not a relevant policy’: means that after having taken a closer look at the link, the observation is not one that would be coded in the CoronaNet taxonomy.

‘Link dead, no other link found’: means that the original link for the policy as noted in the CoronaNet Data Harmonization Sheet is dead and the research assistant was unable to either i) use the Wayback Machine to recover the original data or ii) find another link to corroborate this information. In this case, the research assistant is instructed not to recode this policy.

Table  10 provides a visual example of this data harmonization exercise for three policies in Hungary. The first policy was found to not have been in the CoronaNet dataset. As such, the coder marked the overlap_assessment as ‘No’ (unique_id: OXCGRT_Hungary_20210728_mask). After looking through the URL or PDF link (not shown in Table  10 but available for each observation in the harmonization sheet), the coder then subsequently assessed the policy as being an irrelevant policy to the CoronaNet dataset and thus ‘Not a relevant policy’ was chosen in the harmonize_assessment column.

Meanwhile, the second observation (unique_id: EURO_730824_1) was found to have already been coded in the CoronaNet dataset; as such the coder marked the overlap_assessment as being ‘Yes’ and copied and pasted the corresponding record in the CoronaNet dataset, R_3NXmQbf9TrzN3XU into the matched_record_id column.

Finally, at the time of writing, the third policy has not been assessed for harmonization yet (unique_id: OXCGRT_Hungary_20200311_school). As such, both the overlap_assessment and harmonize_assessment columns take the value of NA.

Step 5 of manually harmonizing the data is still ongoing. However, based on the close to 80k observations that we have assessed so far, we have found that on average 83% of policies in the external dataset were not previously in the CoronaNet dataset. Table  11 provides a breakdown of the overlap assessment by dataset. Overall, HIT-COVID and CoronaNet have the greatest overlap at 34%, while OxCGRT and CoronaNet have the least at 11%.

Meanwhile, Table  12 shows the breakdown to date of the harmonization assessment. Recall that these assessments are only done for policies found not to currently be in the CoronaNet dataset, or in other words, for the 83% of the external data assessed to have an overlap_assessment of ‘No’ to date. The harmonization assessments show that around 45% of the data not currently in the CoronaNet dataset is subsequently recoded into the CoronaNet taxonomy and dataset, with around 9% requiring either additional research or work to find a new link before this is possible. Meanwhile, 25% of the observations are assessed to be duplicates, 21% are not relevant policies, and 10% do not have a recoverable link and thus cannot be substantiated and subsequently recoded.
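Summary figures of this kind can be computed directly from the assessment columns. The following sketch uses toy data with the column names described above (the values are illustrative, not the real assessments):

```python
import pandas as pd

# Toy assessment data illustrating how breakdowns like those in
# Tables 11 and 12 can be computed from the Data Harmonization Sheets.
sheets = pd.DataFrame({
    "dataset": ["OxCGRT", "OxCGRT", "HIT-COVID", "HIT-COVID", "CIHI"],
    "overlap_assessment": ["No", "No", "Yes", "No", "No"],
    "harmonize_assessment": ["Harmonized", "Not a relevant policy", None,
                             "Duplicated policy", "Harmonized"],
})

# Share of external observations not previously in CoronaNet (cf. the 83% figure).
share_new = (sheets["overlap_assessment"] == "No").mean()

# Per-dataset overlap breakdown, as in Table 11.
overlap_by_dataset = pd.crosstab(sheets["dataset"], sheets["overlap_assessment"])

# Harmonization outcomes among the new observations, as in Table 12.
outcomes = (sheets.loc[sheets["overlap_assessment"] == "No", "harmonize_assessment"]
                  .value_counts(normalize=True))
```

On the toy data, 80% of observations are new and half of those are harmonized outright.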

There is, however, substantial variation for each assessment across the different datasets. Overall, it appears that data from CIHI is often harmonized without the need for substantial extra work and with relatively few issues with duplicated policies, dead links or irrelevant policies. This may in part be due to the fact that CIHI data focuses on subnational Canadian data and is relatively high quality given that it is collected not by volunteers but by paid contractors. With regard to duplication, HIT-COVID and OxCGRT appear to have similarly high amounts of duplication, while there is comparatively little duplication for WHO EURO and, as previously mentioned, CIHI. The fact that the rate of duplicate data for OxCGRT is relatively in line with those found for other datasets also suggests that we did not go overboard with our custom OxCGRT deduplication efforts in Step 3a. Dead links appear to be a particular problem for WHO EURO sources, while irrelevant policies appear to be particularly common in OxCGRT data. This is likely largely due to previously mentioned differences in OxCGRT and CoronaNet methodology; while OxCGRT documents policies that have ‘no change,’ CoronaNet does not (see the ‘Step 3a. Deduplication within External Datasets’ section for more details).

We conclude by noting that since the last step in harmonizing the different taxonomies into the CoronaNet taxonomy is manual and requires a substantial labor force, we have made significant investments in training research assistants and providing supportive resources for them to minimize the possibility of systematic coding errors. These include:

Regular workshops for managers and research assistants about data harmonization. These are mandatory for new research assistants, who receive this training along with the original training that we developed to onboard them into the project 6 .

The design and distribution of reference materials for research assistants, such as manuals, spreadsheets, presentations, infographics and videos.

Monitoring and rectification of inconsistencies identified in both the overlap assessment and harmonization assessment stages of the harmonization process, by both managers and automated code. If an error is found in the data harmonization process, it is noted and communicated as feedback to research assistants to rectify before the entry is accepted as a valid harmonized entry.

Open communication channels for research assistants to receive asynchronous feedback on questions they may have on the data harmonization process through Slack.

While our harmonization efforts are still ongoing, we hope that the methodology we have outlined here can prove useful to others seeking to harmonize similar data or to evaluate the work of others.

Data availability

Users interested in which observations in the CoronaNet dataset were harmonized from external datasets can reference either the ‘collab’ or ‘collab_id’ columns in the raw event dataset made publicly available on the CoronaNet github repository here: https://github.com/CoronaNetDataScience/corona_tscs/tree/master/data/CoronaNet . The ‘collab’ variable notes which external dataset, if any, an observation was harmonized from, and the ‘collab_id’ variable documents the original unique ID which matches the corresponding observation in the original data.
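A minimal sketch of this lookup, using the ‘collab’ and ‘collab_id’ columns described above; the rows are illustrative, not real records from the dataset:

```python
# Illustrative CoronaNet rows: a non-empty 'collab' marks an observation
# harmonized from an external dataset, and 'collab_id' gives its original ID.
rows = [
    {"record_id": "R1", "collab": None,        "collab_id": None},
    {"record_id": "R2", "collab": "OxCGRT",    "collab_id": "ox_123"},
    {"record_id": "R3", "collab": "HIT-COVID", "collab_id": "hit_456"},
]

harmonized = [r for r in rows if r["collab"] is not None]

# Group original IDs by their source dataset.
by_source = {}
for r in harmonized:
    by_source.setdefault(r["collab"], []).append(r["collab_id"])
print(by_source)
```

The ‘collab_id’ values can then be matched back against the unique IDs in the corresponding original datasets.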

We further provide documentation of our methodology and data in the earlier steps of the data harmonization process in an OpenICPSR COVID-19 Data Repository entitled “CoronaNet COVID-19 Policy Responses: Taxonomy Maps and Data for Data Harmonization.” 26 Interested users can access the taxonomy maps and the input and output data of these taxonomy maps there. Information on the taxonomy maps is also available on our project website ( https://www.coronanet-project.org/external_data_harmonization.html ), and information on the data inputs and outputs can also be found in our project git repository ( https://github.com/CoronaNetDataScience/corona_tscs/tree/master/data/collaboration and https://github.com/CoronaNetDataScience/corona_tscs/tree/master/RCode/collaboration ).

Code availability

Please refer to the  Supplementary Information , Section 2, to access the taxonomy maps made for Step 1 of the data harmonization process. Access to these taxonomy maps is also provided through our website: https://www.coronanet-project.org/external_data_harmonization.html . Meanwhile, we make the code and data for replicating Steps 2 to 3 available in the following folders in the CoronaNet public git repository: https://github.com/CoronaNetDataScience/corona_tscs/tree/master/RCode/collaboration and https://github.com/CoronaNetDataScience/corona_tscs/tree/master/data/collaboration .

Cheng, C. et al . Capturing the covid-19 crisis through public health and social measures data science. Scientific Data 9 , 1–9 (2022).

ACAPS. ACAPS government measures dataset README, version 1.1. Available at: https://www.acaps.org/en/thematics/all-topics/covid-19 (2020).

Katz, R. et al . Open data for covid-19 policy analysis and mapping. Scientific Data 10 , 491 (2023).

Katz, R. et al . Covid analysis and mapping of policies dataset. Zenodo, https://doi.org/10.5281/zenodo.8087600 (2023).

Canadian Institute for Health Information. Canadian data set of covid-19 interventions. Available at: https://www.cihi.ca/en/canadian-data-set-of-covid-19-interventions (2021).

Cheng, C., Barceló, J., Hartnett, A. S., Kubinec, R. & Messerschmidt, L. Covid-19 government response event dataset (coronanet v. 1.0). Nature human behaviour 4 , 756–768 (2020).

Cheng, C. et al . Covid-19 government response event dataset (coronanet (v1.1)). Zenodo . https://doi.org/10.5281/zenodo.5201766 (2023).

Zheng, Q. et al . Hit-covid, a global database tracking public health interventions to covid-19. Scientific data 7 , 1–8 (2020).

Zheng, Q. et al . Health intevention tracking for covid-19 (hit-covid) data. Figshare https://doi.org/10.6084/m9.figshare.12724058.v1 (2021).

Hale, T. et al . A global panel database of pandemic policies (oxford covid-19 government response tracker). Nature Human Behaviour 5 , 529–538 (2021).

World Health Organization & London School of Hygiene and Tropical Medicine. Public health and safety measures. Available at: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/phsm (2022).

Cheng, C. et al . A general primer for harmonizing data. Preprint available at OSF: https://osf.io/baf2j (2023).

Desvars-Larrive, A. et al . A structured open dataset of government interventions in response to covid-19. Scientific data 7 , 285 (2020).

Desvars-Larrive, A. et al . Complexity science hub covid-19 control strategies list (cccsl). Zenodo https://doi.org/10.5281/zenodo.4573102 (2020).

O’Connor, C. & Joffe, H. Intercoder reliability in qualitative research: debates and practical guidelines. International journal of qualitative methods 19 , 1609406919899220 (2020).

Miles, M. B. & Huberman, A. M. Qualitative data analysis: An expanded sourcebook (Sage, 1994).

Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33 , 159–174 (1977).

WHO. Global dataset of public health and social measures data harmonization, processing flow, and data dictionaries for stage 1 and stage 2 databases. https://cdn.who.int/media/docs/default-source/documents/phsm/phsm—taxonomy_95529eca-9133-42e5-8549-daff3b208e97.zip?sfvrsn=7b98572e_16 (2020).

Cohen, J. A. et al . Leveraging real-world data to investigate multiple sclerosis disease behavior, prognosis, and treatment. Multiple sclerosis journal 26 , 23–37 (2020).

Edgell, A. et al . Pandemic backsliding: Democracy during covid-19 (pandem), version 6, https://www.v-dem.net/pandem.html (2020).

Kubinec, R. et al . Statistically validated indices for covid-19 public health policies. Preprint at SocArXiv: https://osf.io/preprints/socarxiv/rn9xk/ (2021).

Desrosières, A. Measurement and its uses: Harmonization and quality in social statistics. International Statistical Review 68 , 173–187 (2000).

Owino, B. Harmonising data systems for cash transfer programming in emergencies in Somalia. Journal of International Humanitarian Action 5 , 1–16 (2020).

Parmesan, S., Scaiella, U., Barbera, M. & Tarasova, T. Dandelion: from raw data to datagems for developers. In ISWC (Developers Workshop) , 1–6 (2014).

Chen, T., Abadi, A. J., Lê Cao, K.-A. & Tyagi, S. multiomics: A user-friendly multi-omics data harmonisation r pipeline. F1000Research 10 , 538 (2021).

Cheng, C. et al . Coronanet covid-19 policy responses: Taxonomy maps and data for data harmonization. ICPSR https://doi.org/10.3886/E195081V2 (2023).

Acknowledgements

We deeply thank the very large number of research assistants who have been working on harmonizing PHSM data. Of the more than 500 individuals who have helped harmonize PHSM data to date, the following deserve special acknowledgement for their work in onboarding and training research assistants in the process of data harmonization: Vanessa Zwisele, Augusto Teixeira, Constanza Schönfeld, Muneeba Rizvi, and Clara Fochler. Augusto Teixeira also deserves special recognition for his role in monitoring and providing feedback on data harmonization errors to research assistants, and Muneeba Rizvi for consistently providing weekly progress updates on our manual harmonization efforts. We also recognize Audrey Firrone and Natalia Filkina-Spreizer for their help in creating the taxonomy map between CIHI and CoronaNet. We are moreover grateful to the following individuals for their work in helping us pilot our data harmonization efforts in the summer of 2021: Joseph Shim, Natalia Filkina-Spreizer, Audrey Firrone, Silvia Biagioli, Mayuiri A., Maanya Cheekati, Laura Eckoff, Fiona Valad, Katelyn Thomas, Humza Q, Amy Nguyen, Rawaf al Rawaf, Sella Devita, Paula Ganga, Tim Bishop, Jaimi Plater, Rose Rasty, Natalie Ellis, Maryam Al Hammadi, Shreeya Mhade, Shrajit Jain, Kyle Oliver, Shaila Sarathy, Alisher Shariyazdanov, Emma Baker, Jurgen Kadriaj, Celine Heng and Augusto Teixeira. We especially would like to thank the research assistants who have tirelessly worked to harmonize this data. We are deeply appreciative of their contribution to this work.
The following research assistants have done at least 250 overlap assessments or 200 harmonization assessments: Aden Littlewood, Aidana Kadyrbayeva, Aina Kabuldinova, Aisulu Kossanova, Aleksiina Kallunki, Alexandr Lopukhov, Aliya Akbayeva, Altynay Askerova, Alua Akhmolda, Amina Abylkassymova, Amina Yensegenova, Aneliya Kassymova, Anke Horn, Anuar Zeken, Arailym Kanatkyzy, Ares Caufape, Assel Aimurzinova, Assemay Toganbay, Assyl Kenbayeva, Audrey Firrone, Audrey Tey, Augusto Teixeira, Ayaulym Saduakas, Ayaulym Tokenova, Ayazhan Abekova, Balaussa Amir, Bekzat Kabenov, Bibinur Salykova, Clara Fochler, Damelya Amanova, Dana Sultanbekova, Daniyarbek Bakhytzhan, Diana Karabekova, Dilnaz Medenova, Ella Goeckner-Wald, Emmanuel Nkanda Mbonke, Ethan Fulton, Fiona Valade, Giovanna Monteiro de Sá, Gulnaz Korganbay, Gulzhan Tulbayeva, Hritik Arora, Ida Steineck Nilsson, Ingkar Bekzhan, Kamila Bissenkulova, Kamila Isataeva, Kanat Kenzhetayev, Komila Nassyrova, Konstanze Schönfeld, Larissa Yugay, Madina Kinat, Madina Tanbayeva, Maksat Akimzhanov, Matthew Cottrell, Melody Bechler, Meruyert Rakhimova, Milana Kapezova, Minar Mekekyzy, Mukhtabar Sabyrova, Muslimbek Temirlan, Nargiz Turtayeva, Natalia Filkina-Spreizer, Nazerke Baimukan, Nazerke Kanatbekova, Nazym Malikova, Nidia Michinge, Nurbibi Altayeva, Nursaya Alpysbes, Rauan Alguatov, Sagynysh Zhukina, Sai Teha Muramalla, Sandugash Orynbay, Sanzhar Abdulin, Silvia Biagioli, Silvia Paolucci, Sukhrab Turdiyev, Talitha Gower, Tomiris Kuandyk, Waldemar Hartmann, Wissam Gaith, Yeligay Segizbay, Yelnaz Ramazanova, Youyang Zhang, Zhansaya Nurzhaubayeva, Zhibek Orazbayeva, Aigerim Aubakirova, Aruzhan Balgynbayeva, Assem Bazarbek, Axcel Jasso, Kamila Baitleuova, Nurizat Azhibekova, Symbat Kusherova, Yerdaulet Kumisbek. 
Meanwhile the following research assistants have done at least 100 overlap assessments or 50 harmonization assessments on the dataset (and were not acknowledged in the above): Aigerim Kenes, Aina Kabuldinova, Ainur Adeshova, Aisulu Kossanova, Aizada Omen, Aizhana Zhumagul, Akbota Omirkhanova, Akerke Akhmetova, Alexei Roudnitski, Ali Kahraman, Aliya Ibrayeva, Arailym Kalkeshova, Arailym Myrzagaliyeva, Aray Bolatova, Ares Caufape, Aruzhan Abdygali, Aruzhan Kaparova, Aruzhan Mardissadyk, Aruzhan Zeinilova, Assem Sansyzbay, Assemgul Alpysbayeva, Avirat Desai, Ayazhan Abekova, Balaussa Amir, Bekmukhanbet Abibulla, Bekzat Kabenov, Bekzat Sabyrov, Botagoz Amangeldiyeva, Celine Heng, Damira Nygmetova, Danagul Bakhytova, Darya Gorbacheva, Diana Namazbayeva, Diana Staroverova, Dilnaz Medenova, Dilyara Begalykyzy, Dinara Aben, Dulat Minas, Ezgi Caki, Fariza Shermakhanova, Fiona Zhong, Galiya Galymzhankyzy, Guang Shi, Gulnaz Korganbay, Gulzhaukhar Kenzhe, Jellen Olivares-Blanco, Jia Wei Liu, Jianing Li, Jose Muriel, Jurgen Kadriaj, Kader Saygili, Kamila Isataeva, Kelly Dang, Khadiya Nessipbay, Kudaibergen Aldakayev, Lailim Azetova, Laura Duisebayeva, Liliya Mukhamejanova, Madina Altenova, Mari Pulido, Mario Hernandez, Meruyert Ospan, Moza Alhuraiz, Nargiza Kozhanova, Nicole Lawrence, Nicole Mattson, Nur Hazwani Binti Shariff, Nurbibi Altayeva, Paloma Laye, Rawaf al Rawaf, Sabina Mavletova, Sabit Kyrykbay, Samat Sayakov, Samet Berk Oksuz, Shugyla Akhmetova, Shynggyskhan Yeshev, Sonia Mago, Stian Jenssen, Suyuan Wang, Symbat Maksatkyzy, Symbat Mombayeva, Tafadzwa Wandayi, Tasia Wagner, Tom Wiederkehr, Tomiris Nabiyeva, Vicky Liu, Weicong Hu, Xiatian Ye, Yaqi Ren, Zarina Irgaliyeva, Zarina Mukusheva, Zhadyra Yeshentay, Zhanel Jamenkova, Zhaniya Abdrakhman, Zhanna Bakytbek, Zhuldyz Ramazanova, Zi Yu Zhang. 
We would also like to thank Svanhildur Thorvaldsdottir for her helpful guidance and comments during the initial development of this paper and Hannah Löffler for her comments and edits of the paper. We are further grateful for Hans-Peter Nowak’s help in making our harmonization efforts accessible on our website. Finally, we gratefully acknowledge the support that we have received from the Chair of International Relations (Tim Büthe), Hochschule für Politik at the Technical University of Munich (TUM) for this work. Moreover, our work would not be possible without the support and generosity that other COVID-19 tracking projects have provided. We especially would like to recognize Amélie Desvars-Larrive and Michael Gruber (CCCSL), Sophia Zweig, Alex John Zapf, Kyle Oliver, and Hanmeng Xu (HIT-COVID), Erin Pichora and Christina Cately (CIHI), Alex Howe, Steve Penson, Angeliki Niki and Claudia Manili (ACAPS), Alaina Case and Ellie Graeden (COVID AMP) and Yuxi Zhang, Anna Petherick, Thomas Hale and Toby Phillips (OxCGRT). We are grateful for the generous funding support that we have received from the Chair of International Relations (Tim Büthe), Hochschule für Politik at the Technical University of Munich (TUM), New York University Abu Dhabi (Covid-19 Facilitator Research Grant), Leibniz Research Alliance Group “Crises in a Globalised World”, and “CoronaNet in Eurasia: Leveraging the Comparative Moment in COVID-19 Research” from the National Council for Eurasian and East European Research, Grant number 832-06g. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016233 for EU data collection. Further, LM received support from the German Academic Scholarship Foundation.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and affiliations.

Hochschule für Politik, Technical University of Munich, Richard-Wagner Str. 1, Munich, 80333, Bavaria, Germany

Cindy Cheng, Luca Messerschmidt, Isaac Bravo & Marco Waldbauer

Independent Researcher, Sydney, Australia

Rohan Bhavikatti

School of Humanities and Social Sciences, Nazarbayev University, Kabanbay Batyr Ave., 53, Astana, 010000, Kazakhstan

Caress Schenk

Faculty of Law, University of Pernambuco, Praça Adolfo Cirne, Recife, 50050-060, Brazil

Vanja Grujic

iSpot, 15831 NE 8th Str #100, Bellevue, 98008, Washington, USA

Division of Social Science, New York University Abu Dhabi, Social Science Building (A5), Abu Dhabi, 129188, United Arab Emirates

Robert Kubinec & Joan Barceló

Contributions

Authors’ contributions to the paper are as follows: Conceptualization (C.C., L.M.), Methodology (C.C., I.B., M.W., R.B.), Software (C.C.), Validation (C.C.), Investigation (I.B., M.W., R.B.), Resources (C.C., L.M., C.S., V.G., R.K., J.B.), Data Curation (C.C.), Writing - Original Draft (C.C.), Writing - Review & Editing (C.C., L.M., I.B., M.W., R.B., T.M.), Visualisation (C.C., L.M., I.B., R.B.), Supervision (C.C., L.M., C.S., V.G., T.M., R.K., J.B.), Project Administration (C.C., L.M., I.B., M.W., R.B., V.G., R.K., J.B.), Funding (C.C., L.M., C.S., R.K., J.B.).

Corresponding author

Correspondence to Cindy Cheng .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Cheng, C., Messerschmidt, L., Bravo, I. et al. Harmonizing government responses to the COVID-19 pandemic. Sci Data 11 , 204 (2024). https://doi.org/10.1038/s41597-023-02881-x

Download citation

Received : 31 May 2023

Accepted : 27 December 2023

Published : 14 February 2024

DOI : https://doi.org/10.1038/s41597-023-02881-x

Philos Trans A Math Phys Eng Sci

Data science approaches to confronting the COVID-19 pandemic: a narrative review

Qingpeng Zhang

1 School of Data Science, City University of Hong Kong, Hong Kong

2 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

Joseph T. Wu

3 WHO Collaborating Centre for Infectious Disease Epidemiology and Control, School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong

Zhidong Cao

4 The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China

5 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, People’s Republic of China

Daniel Dajun Zeng

Associated data.

This article has no additional data.

During the COVID-19 pandemic, more than ever, data science has become a powerful weapon in combating an infectious disease epidemic, and arguably any future infectious disease epidemic. Computer scientists, data scientists, physicists and mathematicians have joined public health professionals and virologists to confront the largest pandemic in a century by capitalizing on the large-scale ‘big data’ generated during it. In this paper, we review newly developed data science approaches to confronting COVID-19, including the estimation of epidemiological parameters, digital contact tracing, diagnosis, policy-making, resource allocation, risk assessment, mental health surveillance, social media analytics, drug repurposing and drug development. We compare these new approaches with conventional epidemiological studies, discuss lessons learned from the COVID-19 pandemic, and highlight opportunities and challenges for data science approaches to confronting future infectious disease epidemics.

This article is part of the theme issue ‘Data science approaches to infectious disease surveillance’.

1. Introduction

The use of data science methodologies in medicine and public health has been enabled by the wide availability of big data on human mobility, contact tracing, medical imaging, virology, drug screening, bioinformatics, electronic health records and the scientific literature, along with ever-growing computing power [ 1 – 4 ]. With these advances, the enthusiasm of researchers and practitioners, and the urgent need for data-driven insights during the ongoing coronavirus disease 2019 (COVID-19) pandemic [ 5 ], data science has played a greater role than ever in understanding and combating the pandemic.

COVID-19, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [ 6 ], has swept the globe and claimed over 3.4 million lives as of 19 May 2021. Because of its enormous impact on global health and economies, the COVID-19 pandemic highlights a critical need for timely and accurate data sources, both individualized and population-wide, to inform data-driven insights into disease surveillance and control. Compared with responses to previous epidemics such as SARS, Ebola, HIV and MERS, the COVID-19 pandemic has attracted overwhelming attention not only from medical and public health professionals but also from experts in data and computational science fields that were more peripheral in previous epidemics [ 7 , 8 ].

The COVID-19 pandemic presents a platform as well as a rich data source for mathematicians, physicists and engineers to contribute to disease understanding from data-driven and computational perspectives. Some of these data were unavailable in previous epidemics, while other data were available but their potential had not been fully unleashed. The public health systems established by many countries’ Centres for Disease Control (CDCs), including those proven effective in the past, were easily outflanked by the SARS-CoV-2 virus due to its very high transmissibility and ever-increasing global human mobility. Within only a few weeks of the virus being reported, it was apparent that conventional public health practices had failed to contain it. Looking back, there were notable deficiencies in the public health systems [ 7 , 8 ], including (a) the slow response to highly contagious viruses, particularly when symptoms resembled those of seasonal influenza and other mild infectious diseases; (b) the lack of reliable data at critical points (such as the early outbreak and mutant strains); (c) slow and disorganized data collection; (d) policy decision-making based on political expediency rather than scientific evidence; (e) slow and incomplete manual contact tracing; (f) the conflict between the effectiveness of contact tracing and the invasion of privacy; and (g) difficulty in identifying effective drugs to treat COVID-19 patients.

Many of these deficiencies can be addressed by creatively mining big data related to people’s behaviours and opinions, the biological structure of drugs, human interactomes and the constantly mutating virus. The threat of the pandemic mobilized the whole scientific community to combat COVID-19, leading to many successful and innovative applications. These applications required the capabilities not of experts in a single field but of collaborations between people with diverse professional backgrounds. A difficult year has passed, yet it was also a remarkable year for the rise of interdisciplinary data-driven research on emerging infectious diseases. It is therefore important to summarize the progress made so far, and to lay out a blueprint for an emerging field that uses data science and advanced computational models to confront future infectious diseases.

In this article, we briefly summarize the important progress made during the COVID-19 pandemic. There have been over 400 000 coronavirus-related publications in 2020 alone [ 9 ]. The list of papers we review here (see table 1 ) is by no means complete, nor is it meant to be. Instead, we selected a set of typical and representative publications and discuss how these approaches shed light on how data science will be an indispensable tool in the ongoing war against COVID-19 and future epidemics. The selection process was as follows. First, we used the keyword combination (‘COVID-19’ OR ‘2019-nCov’) AND (‘data science’ OR ‘artificial intelligence’) to retrieve all related papers published between 1 January 2020 and 31 May 2021 from Web of Science by Clarivate Analytics. Second, we used the same keyword combination to retrieve additional conference papers from DBLP (a computer science bibliographic database). Third, we ranked the retrieved papers by number of citations and the impact factor of the journals. Fourth, we manually added a small number of papers that we agreed were representative but not in the highly cited list. Fifth, the authors and five PhD students manually selected the papers to review, prioritizing representative papers published in top-tier journals.
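The ranking step above can be sketched as a simple sort by citation count, breaking ties by journal impact factor. The records are illustrative; the actual data came from Web of Science and DBLP:

```python
# Hypothetical retrieved papers with citation counts and journal impact factors.
papers = [
    {"title": "A", "citations": 120, "impact_factor": 9.1},
    {"title": "B", "citations": 450, "impact_factor": 4.3},
    {"title": "C", "citations": 120, "impact_factor": 12.8},
]

# Rank primarily by citations, then by impact factor, highest first.
ranked = sorted(
    papers,
    key=lambda p: (p["citations"], p["impact_factor"]),
    reverse=True,
)
print([p["title"] for p in ranked])
```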

Data-driven COVID-19 publications that we reviewed.

In this article, we first reviewed the publications that used novel data sources/modalities and methods to address a broad spectrum of problems in disease control. Then, we performed bibliographic analysis to highlight the knowledge flow between these publications and the publications cited by/citing them. We conclude the paper with discussions of lessons we have learned so far in leveraging novel data and data science approaches to confront COVID-19 and other emerging infectious diseases.

2.  Modelling human mobility

SARS-CoV-2 is contagious between humans in close contact [ 6 ]. There is overwhelming evidence that SARS-CoV-2, like other SARS-like coronaviruses, found its way into a human host through an intermediate host in nature. Human contact has since become the main transmission medium [ 81 , 82 ]. As a result, the progression of the epidemic is heavily dependent on human mobility, both locally and internationally, which makes the analysis of human mobility data essential to disease surveillance and policy evaluation. Luckily, we now have access to rich human mobility data, including population-based census and survey data representing the general travel tendencies of people, as well as individualized mobility data derived from mobile phones, digital transactions and social media.

In the early days of the epidemic in Wuhan City, China, the rapid outbreak led to severe under-reporting [ 83 ]: on the one hand, many asymptomatic but infected people and people with mild symptoms did not realize that they were infected until they had recovered; on the other hand, many symptomatic people could not be admitted to hospital due to limited healthcare resources. As a result, the early epidemiological data did not fully represent all patients: early reports usually assumed a short serial interval period because they were based on data from severely ill patients who were admitted to hospital, while missing those who were not hospitalized. Similar situations appear to have occurred in other places around the world. Consequently, a number of studies used human movement data to estimate the epidemiological parameters, such as the basic reproduction number R 0 , because people travelling out of Wuhan were closely monitored and well described in January and February 2020 [ 10 – 12 ]. Similar migration data were also used to reconstruct the full transmission dynamics of COVID-19 in Wuhan [ 13 ].
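To make the parameter-estimation idea concrete, here is a back-of-envelope sketch, not the specific method of the cited studies: estimate the epidemic growth rate r from case counts during the exponential phase, then convert it to R 0 via the SIR-model relation R 0 ≈ 1 + r · Tg, where Tg is the mean generation interval. All numbers below are illustrative assumptions:

```python
import math

# Illustrative daily case counts during an exponential growth phase.
cases = [40, 55, 76, 105, 145, 200]
days = len(cases) - 1

# Per-day exponential growth rate from first and last counts.
r = (math.log(cases[-1]) - math.log(cases[0])) / days

Tg = 7.0          # assumed mean generation interval in days
R0 = 1 + r * Tg   # SIR-model approximation during early exponential growth
print(round(r, 3), round(R0, 2))
```

In practice, the cited studies fit far richer transmission models to travel and case data, but the same logic of inferring unobserved parameters from observed growth underlies them.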

The success of using human mobility data to estimate the epidemiological parameters of the disease translates to other tasks. Travel restriction has been a popular control measure around the world in response to the spread of SARS-CoV-2. For example, Gatto et al. used nationwide census mobility fluxes to quantify the effect of local non-pharmaceutical interventions (NPIs) and support the spatio-temporal planning of emergency measures in Italy [ 14 ]. However, a number of studies concluded that travel restriction might not be the most effective approach to containing the virus. Lai et al. and Kraemer et al. used open-source anonymized human movement data (Baidu migration data, https://qianxi.baidu.com/ , derived from Baidu users) to evaluate the effect of NPIs in containing the COVID-19 epidemic in China. They found that early detection and timely isolation of infected patients was more effective than travel restrictions and contact reductions [ 15 , 16 ].

A number of companies provide individual or aggregated mobile phone-derived mobility data. In a representative study using aggregated mobile phone user data (provided by SafeGraph , https://www.safegraph.com/ ), Chang et al. developed dynamic mobility networks to simulate the COVID-19 outbreak in 10 major metropolitan areas in the USA [ 17 ]. Not only did the model predict that superspreader points of interest would account for a majority of infections, but the work also revealed risk inequities suffered by disadvantaged groups: for instance, they faced a higher risk of infection because they could not reduce their mobility as sharply. Liu et al. reported similar findings from a retrospective analysis of anonymized daily mobile phone location data in China [ 19 ]. Two studies using commercial data ( SafeGraph , Pei et al. [ 18 ]; Teralytics , Badr et al. [ 20 ]) reported that social distancing played a central role in mitigating COVID-19 transmission in the USA.

In examining the effect of NPIs in a city or smaller country, agent-based models are useful because of their flexibility and high granularity in modelling travel patterns. To model the travel tendencies in a city well, census and demographic data are required, especially when individualized mobility data are absent. For example, Koo et al. used national census data to build an agent-based model of COVID-19 transmission in Singapore [ 21 ]. Similarly, Aleta et al. used mobile phone, census and demographic data to build an agent-based model of COVID-19 transmission in Boston [ 22 ]. A recent study took a more aggressive approach, in which Zhou et al. constructed an agent-based model with 7.55 million agents, one for each citizen of Hong Kong [ 23 ]. The authors collected open government data including demographics, public facilities and functional buildings, transportation systems and travel patterns (based on census data), and also incorporated the real-time human mobility patterns provided by Google’s Community Mobility Report ( https://www.google.com/covid19/mobility/ ). The entire city of Hong Kong was split into 4905 grid cells of 500 m × 500 m (see figure 1 for an illustration). This very detailed model was used to identify high-value grid cells for targeted interventions with low disruption to the whole city.
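The gridding step in such a model amounts to assigning each agent, by its planar coordinates in metres, to a 500 m × 500 m cell. A minimal sketch with illustrative coordinates (not the actual implementation of Zhou et al.):

```python
CELL = 500  # cell side length in metres

def grid_cell(x: float, y: float) -> tuple:
    """Return the (column, row) index of the 500 m cell containing (x, y)."""
    return (int(x // CELL), int(y // CELL))

# Hypothetical agent positions in a local planar coordinate system (metres).
agents = [(120.0, 80.0), (510.0, 499.0), (501.0, 501.0)]
cells = [grid_cell(x, y) for x, y in agents]
print(cells)  # [(0, 0), (1, 0), (1, 1)]
```

Counting agents per cell then yields the spatial population distribution on which targeted, low-disruption interventions can be planned.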

Figure 1.

Geographical distribution of the 7.55 million agents and facilities in Hong Kong. Layer 1 represents the distribution of schools. Layer 2 represents the population distribution. Layer 3 represents the locations of entertainment sites. Credit: Zhou et al. [ 23 ]. (Online version in colour.)

Human mobility data are useful in informing responsive and adjustable NPIs, which can help maintain economic productivity. Leung et al. used digital transactions for transport to enable real-time and accurate nowcasts and forecasts of COVID-19 epidemics in Hong Kong [ 24 ]. Successful application of such real-time predictions has the potential to maximize economic productivity. Yang et al. proposed a simple optimization scheme that considers both the reduction in infections and the social disruption in New York City, and concluded that tight social distancing measures in public places were key to protecting the elderly, who are most vulnerable to severe disease or death [ 25 ]. In a study in Italy, Bonaccorsi et al. modelled mobility restrictions as a shock to the economy by harnessing a near-real-time Italian mobility dataset provided by Facebook. These researchers found that mobility contraction was stronger in municipalities with greater inequality and lower income per capita , and they subsequently called for fiscal measures targeting poverty and inequality mitigation [ 26 ].

On a global scale, Chinazzi et al. proposed a metapopulation disease transmission model that considered both air transportation and ground mobility across 3200 sub-populations in 200 countries and regions. They suggested that early detection, hand washing, self-isolation and household quarantine were more effective than travel restrictions at containing the virus [ 27 ]. Gilbert et al. used global air travel data to estimate the risk of COVID-19 importation per African country, as well as the preparedness of each country [ 28 ].

In the face of a global pandemic, coordination between countries/regions is clearly key to reducing cross-border transmission. Ruktanonchai et al. examined the coordinated relaxation of NPIs across Europe by estimating human movements among European countries using mobile phone data. They found that coordination of on–off NPIs is indeed important for containing the outbreak across Europe [ 29 ].

3.  Manual and digital contact tracing

Contact tracing is an indispensable method for identifying and isolating at-risk people, in an attempt to reduce infections in the community. During the COVID-19 pandemic, most public health practice has still relied on conventional manual contact tracing. Although such data are rarely made publicly available for research due to privacy concerns, there have been good empirical and modelling studies using them. Bi et al. analysed a complete dataset of 391 cases and 1286 of their close contacts in Shenzhen City, China (provided by Shenzhen CDC), during 14 January 2020–12 February 2020, and demonstrated that contact tracing significantly reduced the reproduction number and thus prevented a localized outbreak [ 30 ]. Zhang et al. analysed survey data for Wuhan City and Shanghai City, as well as detailed contact tracing data in Hunan Province (provided by Hunan CDC), and constructed a transmission model to evaluate the impact of NPIs on transmission [ 31 ]. They concluded that the NPIs implemented in these places had successfully controlled the COVID-19 outbreak.

Conventional manual contact tracing has major challenges, such as recall bias and time delays. The wide adoption of smartphones makes novel digital contact tracing techniques a promising supplement to, if not replacement of, manual contact tracing [ 32 , 33 ]. This is particularly relevant to SARS-CoV-2, which is highly infectious. Ferretti et al. used a mathematical model to explore the feasibility of controlling the epidemic using conventional manual contact tracing by questionnaires versus digital contact tracing, and concluded that manual contact tracing is not feasible. Thus, digital contact tracing is potentially more effective in stopping the epidemic, given the high proportion of people using smartphones [ 34 ].

In developed countries/regions, there appear to be no technical obstacles to effective digital contact tracing, because current smartphones are mostly equipped with GPS and Bluetooth [ 84 ]. Both Google and Apple have implemented frameworks in smartphones to assist in contact tracing and exposure notifications ( figure 2 ). Since COVID-19 is likely to become endemic, digital contact tracing may eventually become a common public health practice. However, the wide implementation of digital contact tracing has not been particularly successful except in a few countries in East Asia [ 85 ]. There are many controversial issues, including privacy concerns, accuracy, connection to health authorities, and other cultural and political factors [ 85 , 86 ]. In many lower- and middle-income countries/regions, where citizens are less technologically savvy, manual contact tracing still plays the dominant role in containing the epidemic.
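The privacy-preserving idea behind such Bluetooth-based frameworks can be illustrated with a toy Python sketch. This is not the real Google/Apple protocol (which uses AES-based rolling identifiers and a richer key schedule); here HMAC stands in for the key derivation, and all names are illustrative:

```python
import hashlib
import hmac
import os

def rolling_ids(daily_key: bytes, intervals: int = 144):
    """Derive short-lived broadcast identifiers from a daily key."""
    return [hmac.new(daily_key, i.to_bytes(2, "big"), hashlib.sha256).digest()[:16]
            for i in range(intervals)]

# Alice and Bob each generate a daily key and broadcast rolling IDs.
alice_key, bob_key = os.urandom(16), os.urandom(16)

# Bob's phone records identifiers it heard nearby (here: two of Alice's).
heard_by_bob = set(rolling_ids(alice_key)[10:12])

# Alice tests positive and uploads only her daily key; Bob's phone
# re-derives her identifiers locally and checks for a match --
# no location data or identities are ever shared.
matches = heard_by_bob & set(rolling_ids(alice_key))
print("exposure detected:", bool(matches))
```

The design choice that matters is that matching happens on the user's device against published keys, which is what distinguishes these decentralized frameworks from centralized systems such as Health Code.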

Figure 2. Four typical digital contact tracing apps: ( a ) Apple’s Exposure Notification function (Bluetooth-based); ( b ) TraceTogether system in Singapore (Bluetooth-based); ( c ) Health Code system in Mainland China (mandatory manual input); ( d ) LeaveHomeSafe system in Hong Kong (voluntary manual input). (Online version in colour.)

Since late 2020, Singapore has mandated the use of a digital contact tracing app, TraceTogether . In mainland China, different cities/provinces have produced their own Health Code systems, and these isolated systems are now merging into a nationwide Health Code system. In Hong Kong, a conservative contact tracing app, LeaveHomeSafe , has been made available by the government. LeaveHomeSafe does not have access to users’ private data. There is no registration requirement, and it sends exposure notifications only to users (not to public health authorities). Its use is voluntary, and people can always choose to manually leave their contact information (which usually nobody verifies) when entering premises (such as restaurants) that require it ( figure 2 ). Given Ferretti et al. ’s simulation research [ 34 ], the efficacy of such a voluntary digital contact tracing system in reducing transmission is limited by the low proportion of trustworthy data.

How to motivate people to use digital contact tracing is an important public health challenge. Munzert et al. combined an online panel survey and mobile tracking data to measure usage of the official contact tracing app in Germany, and found that people with different demographic backgrounds exhibited different usage of the app [ 35 ]. These researchers also showed that video messages were not effective in motivating uptake, while small monetary incentives may strongly increase uptake.

Even if vaccines become widely available, their development may not keep pace with virus mutations. Thus, contact tracing remains a critical tool in stopping the epidemic. To unleash the potential of digital technology to improve contact tracing accuracy, advances are required in both technology and public health research. On the one hand, more advanced technologies are needed to dispel people’s doubts about data privacy; on the other hand, how to motivate and incentivize people to adopt new technologies (including other interventions and vaccinations) might be the most important question.

4.  Empirical evaluation of government responses

Governments and authorities around the world responded to the COVID-19 pandemic with a range of NPIs. These policy measures, and the compliance with them, provide a rich dataset of lessons and experiences that are invaluable for future decision-making. A number of studies have quantified the extent of government action, as well as the compliance with policy measures. A typical example is the Oxford Covid-19 Government Response Tracker ( OxCGRT , https://www.bsg.ox.ac.uk/research/research-projects/covid-19-government-response-tracker ), which has collected systematic information on the policy measures of more than 180 countries since 1 January 2020. More specifically, OxCGRT records these policies on ordinal scales to reflect the extent of government action, and policy indices are created based on the scores [ 38 ]. Similarly, Porcher published Response2covid19 ( https://response2covid19.org/ ), a dataset of governments’ responses to the COVID-19 pandemic [ 36 ]. Another global dataset, Citizenship, Migration and Mobility in a Pandemic ( CMMP , https://www.cmm-pandemic.com/ ), was introduced by Piccoli et al. [ 37 ]. Quantifying the effect of various NPIs is another important problem. Hsiang et al. compiled data on 1700 local, regional and national NPIs deployed in six countries, and applied reduced-form econometric methods to empirically measure the effect of these NPIs on flattening the epidemic curve [ 39 ]. Dehning et al. analysed the data in Germany using a Bayesian inference model and emphasized that relaxation of NPIs should be undertaken warily, because the NPIs deployed at the time had barely contained the outbreak [ 40 ]. However, little research has compared the implementation and uptake of NPIs across different countries. Objective and data-driven evaluation of the actual NPIs deployed around the world is crucial for decision-makers confronting future infectious disease epidemics.
Moreover, with the growing accessibility of vaccines, another important question arises: how to effectively and efficiently allocate vaccines locally and globally. This question had not been well addressed at the time of this review, and the authors would like to call for data-driven research on this crucial topic.
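To illustrate how trackers such as OxCGRT turn ordinal policy records into an index, here is a deliberately simplified sketch: each indicator is rescaled to 0–100 and the results are averaged. (The published OxCGRT formula additionally adjusts for geographic-scope flags; the indicator names and values below are illustrative, not the tracker's exact codebook.)

```python
# Each policy indicator is an ordinal score on its own scale;
# the maximum values below are assumptions for illustration.
INDICATOR_MAX = {"school_closing": 3, "workplace_closing": 3,
                 "cancel_events": 2, "stay_at_home": 3, "travel_controls": 4}

def stringency_index(scores: dict) -> float:
    """Rescale each indicator to 0-100 and average into a single index."""
    subs = [100 * scores[k] / INDICATOR_MAX[k] for k in INDICATOR_MAX]
    return sum(subs) / len(subs)

# One hypothetical country-day of policy records.
day = {"school_closing": 3, "workplace_closing": 2, "cancel_events": 2,
       "stay_at_home": 1, "travel_controls": 4}
print(round(stringency_index(day), 1))  # -> 80.0
```

Such an index lets heterogeneous policies be compared across countries and over time on a single scale.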

5.  Assessing the economic, trade and supply chain impact

Travel restrictions and NPIs have dramatically affected global supply chains and trade. Guan et al. adopted the latest economic disaster modelling to examine the supply chain effects of a set of NPI scenarios. They found that supply chain losses depended on the number of countries imposing travel restrictions, while a longer containment that actually controls the epidemic could impose smaller losses [ 41 ]. This study built the global supply chain network using the Global Trade Analysis Project (GTAP) database [ 42 ], which is subject to a subscription fee. Maliszewska et al. also used GTAP data and previous episodes of global epidemics to simulate the impact of the COVID-19 pandemic on gross domestic product and trade, and drew similar conclusions [ 43 ]. More recently, Ye et al. developed an integrated network model to investigate the contagion patterns of personal protective equipment (PPE) shortages on a global trade network harvested from a World Customs Organization report, and found that PPE export restrictions exacerbated shortages and caused the shortage contagion to travel faster than the disease contagion [ 44 ]. Malliet et al. used a computable general equilibrium model to assess the impacts of French NPIs on environmental and energy policies at macroeconomic and sectoral levels, and found that lockdown measures decreased economic output but generated positive environmental impact by reducing CO2 emissions [ 45 ]. In two other studies, Çakmaklı et al. and Andersen et al. quantified the macroeconomic effects of COVID-19 on consumers and economies by harnessing data provided by the Central Bank of the Republic of Turkey [ 46 ] and a major bank in Denmark [ 47 ], respectively.

6.  Mining patient data and drug repurposing

Mining patient data can generate enormous amounts of valuable information, ranging from aggregated statistics on a daily or weekly basis to detailed electronic health records (EHRs). Analysing time series of case counts has always been a focus of epidemic modelling. Xu et al. collected and curated individual-level patient data from official reports in China, and published them for public use [ 48 ]. This dataset has enabled dozens of downstream epidemiological studies. In another study, Bednarski et al. explored how to use reinforcement learning and deep learning models to derive the near-optimal redistribution of medical equipment to support public health emergencies [ 49 ].

How to prioritize testing for COVID-19 is important because testing resources are usually limited. To this end, Zoabi et al. developed a machine learning model to predict COVID-19 diagnoses based on testing data provided by the Israeli Ministry of Health [ 50 ]. In another study, Callahan et al. addressed the same problem by developing a machine learning model on screening data [ 51 ]. For patients admitted to hospital, the major challenge is to prioritize those with severe disease and a high risk of death. The ability to derive an accurate individual-level risk score from EHRs is crucial for effective resource allocation and distribution, and for prioritizing vaccination programmes. Estiri et al. trained age-stratified generalized linear models with component-wise gradient boosting to predict patients’ risk of death from COVID-19 using health records collected before infection [ 52 ]. In a population-based study from Hong Kong, Zhou et al. developed a simple risk score for predicting severe COVID-19 disease using clinical and laboratory variables [ 53 ].
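As a hedged illustration of this kind of risk modelling (not the features, data or exact method of the cited studies), the sketch below trains a gradient boosting classifier on synthetic "clinical" variables in which risk is constructed to rise with age and CRP, then reports held-out discrimination:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for clinical/laboratory variables:
# age, CRP, LDH, lymphocyte count (purely illustrative distributions).
n = 2000
X = np.column_stack([rng.uniform(20, 90, n),    # age (years)
                     rng.gamma(2.0, 20.0, n),   # CRP (mg/L)
                     rng.normal(250, 60, n),    # LDH (U/L)
                     rng.normal(1.2, 0.4, n)])  # lymphocytes (10^9/L)

# Ground-truth risk is made to depend on age and CRP only.
logit = 0.06 * X[:, 0] + 0.02 * X[:, 1] - 6
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"held-out AUC: {auc:.2f}")
```

The per-patient probabilities from `predict_proba` are the "risk score" that triage and resource-allocation decisions would rank on.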

Machine learning has been recognized as effective in predicting the risk of a range of patient outcomes. It is particularly useful for COVID-19 because the diagnosis usually involves both structured data and medical imaging data. Shamout et al. developed deep neural network models to predict deterioration risk by learning from chest X-ray (CXR) images and routine clinical variables [ 54 ]. Wang et al. proposed a deep learning-based AI system for COVID-19 diagnostic and prognostic analysis of computed tomography images, and validated the model on a Chinese dataset of 5372 patients [ 55 ]. Oh et al. proposed a patch-based convolutional neural network method for COVID-19 diagnosis that analyses potential imaging biomarkers in CXR radiographs [ 56 ]. Deep learning and more general machine learning techniques continue to be applied successfully to COVID-19 diagnosis, prognosis and patient stratification. Please refer to the latest review of these techniques [ 87 ].

Owing to people’s isolation during the COVID-19 pandemic, mental health has emerged as another focal issue [ 88 – 90 ]. Surveys and suicide records could provide a good data source if they were collected during the time period of the pandemic. For example, Holman et al. examined mental health issues during the COVID-19 pandemic by sampling US citizens across three 10-day periods, and identified a number of factors associated with acute stress and depressive symptoms [ 57 ]. However, due to the difficulty in obtaining reliable data, data science and machine learning approaches that accurately detect mental health issues during the ongoing COVID-19 pandemic remain under-researched. There are a few successful studies, which are mostly based on Internet and social media data, rather than individual patients’ records.

Because of the speed of onset and the size of impact of COVID-19, repurposing existing drugs is currently an efficient way of ensuring that effective treatments are available. Early in the pandemic, Gordon et al. showed that a protein interaction map of SARS-CoV-2 could identify targets for drug repurposing [ 58 ]. In the search for drug candidates in the sea of biological data, with a focus on protein–protein interactions (PPIs), network science and machine learning have the advantage of being able to model the high-dimensional biological and pharmaceutical data associated with different drugs. Sadegh et al. developed an online interactive platform named CoVex ( https://exbio.wzw.tum.de/covex/ ) for COVID-19 drug or target identification by integrating virus–human protein interactions, human PPIs and drug–target interactions [ 59 ].

In a representative study, Gysi et al. adopted a set of machine learning, network diffusion and network proximity models to prioritize 6340 drugs that might treat COVID-19 [ 60 ]. These authors constructed the human interactome, with 18 505 proteins and 327 924 protein interactions, by harvesting 21 public databases that compile experimentally derived PPI data. The authors found that no single model consistently outperformed the others across all datasets, and thus a multimodal approach was used to perform model fusion for the best prediction performance. A similar study was carried out by Zhou et al. [ 61 ], in which high-value proteins and drug combinations were derived by a network-based algorithm. Yan et al. proposed a knowledge graph approach to prioritize drug candidates against SARS-CoV-2 [ 62 ]. This study integrated 14 biological databases of drugs, genes, proteins, viruses, diseases, symptoms and their linkages, and developed a network-based algorithm to extract hidden linkages connecting drugs and COVID-19 from the constructed knowledge graph. See figure 3 for a description of the knowledge graph and the identified motifs-of-interest. Pham et al. proposed a deep learning method, namely DeepCE , to model substructure–gene and gene–gene associations for predicting the differential gene expression profile perturbed by de novo chemicals, and demonstrated that DeepCE outperformed state-of-the-art methods and could be applied to COVID-19 drug repurposing with clinical evidence [ 63 ]. Zhou et al. provided a useful review and helpful illustrations of these machine learning and AI techniques for COVID-19 drug repurposing [ 91 ]. Knowledge graphs do not have to be constructed manually from existing biological datasets alone, as machine learning and natural language processing (NLP) techniques are appropriate tools to automatically construct knowledge graphs from the scientific literature [ 65 ].
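The network proximity idea can be reduced to a few lines: rank each candidate drug by the average shortest-path distance from its targets to the disease-associated proteins in a PPI network. The toy graph, protein and drug names below are invented; the cited studies work on the full human interactome:

```python
import networkx as nx

# A tiny made-up PPI network; node names are placeholders, not real proteins.
G = nx.Graph([("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5"),
              ("p2", "p6"), ("p6", "p7")])
disease_proteins = {"p2", "p3"}                 # hypothetical disease module
drugs = {"drugA": {"p1", "p6"},                 # targets near the module
         "drugB": {"p5", "p7"}}                 # targets farther away

def proximity(targets):
    """Average distance from each drug target to its nearest disease protein."""
    return sum(min(nx.shortest_path_length(G, t, d) for d in disease_proteins)
               for t in targets) / len(targets)

ranking = sorted(drugs, key=lambda d: proximity(drugs[d]))
print(ranking)  # smaller distance = closer to the disease module
```

Drugs whose targets sit inside or adjacent to the disease module rank first, which is the core heuristic behind network-based repurposing screens.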

Figure 3. Motifs-of-interest for drug repurposing in a knowledge graph. A knowledge graph is a multi-relational graph composed of entities and relations. Each entity represents a specific protein, gene, drug, virus, disease or symptom, and each relation represents a known linkage between two entities. A motif is a connected subgraph representing a fundamental building block of the knowledge graph. Motifs-of-interest are defined based on their importance to the drug repurposing task, and motif-clique discovery algorithms are used to extract them. Credit: Yan et al. /Wiley [ 62 ]. (Online version in colour.)

7.  Mining scientific literature

The COVID-19 pandemic has led to a huge corpus of coronavirus-related publications across disciplines. There were over 400 000 publications about COVID-19 and SARS-CoV-2 in 2020, and the number is ever-growing. Mining this huge set of scientific articles can facilitate knowledge discovery, enable novel expert systems, identify research trends and guide research policy.

There are a number of open-source datasets of COVID-19 scientific literature. TREC-COVID ( https://ir.nist.gov/covidSubmit/ ) is a set of information retrieval test collections jointly organized by the Allen Institute for Artificial Intelligence (AI2), the National Institute of Standards and Technology, the National Library of Medicine (NLM), Oregon Health & Science University, and the University of Texas Health Science Center at Houston [ 71 ]. TREC-COVID provides a list of papers contributed by challenge participants ( https://ir.nist.gov/covidSubmit/bib.html ), but the list appears incomplete. AI2, in collaboration with the Chan Zuckerberg Initiative, Georgetown University, Microsoft, IBM, NLM and the White House of the USA, also created the COVID-19 Open Research Dataset Challenge ( CORD-19 , https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge ) through Kaggle [ 64 ]. Note that there were over 30 000 COVID-19-related data challenges on Kaggle as of 15 May 2021 ( https://www.kaggle.com/search?q=covid-19 ). The MIT Operations Research Center also maintains a service, COVID Analytics ( https://www.covidanalytics.io ), which provides a dataset of COVID-19-related papers together with a visualization tool for users to derive their own insights from the data. COVID Analytics has had great impact not only on disease surveillance but also on vaccine development: developers of the Johnson & Johnson COVID-19 vaccine worked with MIT researchers, who applied machine learning to COVID Analytics data and other real-world data to help guide the company’s research into a potential vaccine, for example by identifying key locations for trial sites ( https://news.mit.edu/2021/behind-covid-19-vaccine-development-0518 ).

Esteva et al. created a semantic search engine, CO-Search ( http://einstein.ai/covid ), which is able to handle complex queries over the COVID-19-related literature [ 9 ]. CO-Search has a multi-stage framework, with a hybrid semantic–keyword retriever based on the popular BERT language model, and a re-ranker that further sorts the retrieved documents by relevance. The authors demonstrated the strong performance of CO-Search on the TREC-COVID dataset. Su et al. developed a real-time question answering (QA) and document summarization system, namely CAiRE-COVID ( https://demo.caire.ust.hk/covid/ ) [ 72 ], which is able to answer high-priority questions with question-related information (see figure 4 for an example). Similar to CAiRE-COVID, there are a number of COVID-19-specific QA systems [ 66 – 68 ] and search engines [ 70 ]. Machine learning and NLP methods have also been used to construct knowledge graphs by analysing the coronavirus-related literature. More specifically, Chen et al. combined the CORD-19 dataset [ 64 ] and the PubMed dataset [ 73 ] to identify COVID-19-related experts and bio-entities [ 69 ]. Another example is the COVID-KG framework, which can extract fine-grained multimedia knowledge elements from scientific literature [ 65 ]. The resulting knowledge is available at http://blender.cs.illinois.edu/covid19/ .
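The keyword stage of such a hybrid retriever can be sketched with TF-IDF and cosine similarity (CO-Search pairs a stage like this with BERT-based semantic embeddings and a re-ranker; the mini-corpus below is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy three-document "literature" corpus for illustration only.
corpus = [
    "asymptomatic transmission of SARS-CoV-2 in households",
    "supply chain disruption caused by export restrictions",
    "Bluetooth-based digital contact tracing on smartphones",
]
vec = TfidfVectorizer().fit(corpus)
doc_matrix = vec.transform(corpus)

def retrieve(query, k=1):
    """Return the top-k documents by TF-IDF cosine similarity to the query."""
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

print(retrieve("contact tracing apps"))
```

A semantic stage would replace the sparse TF-IDF vectors with dense embeddings, letting synonymous queries match documents that share no keywords.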

Figure 4. An example of the answers and summary provided by CAiRE-COVID. Screenshot taken by searching ‘What do we know about asymptomatic transmission of COVID-19?’ on CAiRE-COVID [ 72 ]. (Online version in colour.)

8.  Social media analytics and Web mining

The World Wide Web and social media have become important channels for laypeople to retrieve health-related information. There is strong evidence that users’ online behaviours are associated with their health conditions and thus could be used to estimate the spread of infectious diseases [ 92 , 93 ]. The Web and social media data could therefore inform more timely responses, since traditional manual reporting systems have significant lag times. In an empirical study, Bento et al. examined people’s information-seeking behaviours in response to the first confirmed COVID-19 case in each state of the USA, and found that searches for certain terms were strongly influenced by the timing of the first confirmed case in a state [ 74 ]. In a correlation analysis, Effenberger et al. found that Internet searches (Google Trends) were correlated with the number of COVID-19 cases across European countries, with a typical time lag of 11.5 days, indicating that Internet searches were possibly predictive of actual cases within that time period in Europe [ 75 ]. Li et al. performed a comprehensive study using both Internet searches and social media data to predict the COVID-19 incidence in China [ 76 ]. The authors used both Google Trends and the Baidu Index to characterize the popularity of COVID-19-related terms in Internet searches, and the Sina Weibo Index to characterize the corresponding interest on social media. The results showed that all three sets of data were correlated with the actual COVID-19 cases in China. Notably, the Baidu Index and Sina Weibo Index could predict the outbreak over a week earlier, possibly because Google is not a mainstream search engine in China.
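The lead–lag analysis behind such findings can be sketched as follows: shift the search-interest series forward by candidate lags and keep the lag that maximizes the correlation with reported cases. Both series below are synthetic Gaussian waves, constructed so that searches lead cases by exactly ten days:

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(120)

# Synthetic epidemic waves: search interest peaks at day 40,
# reported cases at day 50, plus a little noise.
searches = np.exp(-0.5 * ((days - 40) / 10) ** 2) + 0.05 * rng.random(120)
cases = np.exp(-0.5 * ((days - 50) / 10) ** 2) + 0.05 * rng.random(120)

def best_lag(lead, target, max_lag=21):
    """Lag (in days) at which `lead` best correlates with `target`."""
    corrs = {lag: np.corrcoef(lead[:-lag], target[lag:])[0, 1]
             for lag in range(1, max_lag + 1)}
    return max(corrs, key=corrs.get)

print(f"searches lead cases by ~{best_lag(searches, cases)} days")
```

Applied to real Google Trends or Baidu Index series, the recovered lag indicates how far ahead search interest can nowcast reported cases.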

In addition to disease surveillance, the Web and social media have also become a battlefield of truth, rumours, misinformation and even disinformation [ 80 ]. Li et al. analysed the social media discussions on Sina Weibo and found that specific linguistic and social network features could predict the amount of reposting of different types of information [ 77 ]. However, an ever-present question is whether the online information is of good quality. To answer this question early in the outbreak (as of 6 February 2020), Cuan-Baltazar et al. manually screened COVID-19-related websites by searching relevant terms on Google, and found that the quality and readability of the retrieved information was mostly poor, highlighting the risk of the Internet as a public source of health information [ 78 ]. Roozenbeek et al. examined predictors of misinformed beliefs about COVID-19 and SARS-CoV-2, using samples from the United Kingdom, Ireland, the USA, Spain and Mexico, and identified a consistently high proportion of misinformed beliefs in all five countries [ 79 ]. Such susceptibility to misinformation was found to make people less likely to comply with NPIs or to seek COVID-19 vaccines, suggesting that interventions are required to help the public gain trust in science.

Ye et al. built a mathematical model, which indicates that the media and opinion leaders should provide true and quality information to the public so that people are willing to comply with public health guidance to protect themselves and the whole population [ 94 ]. To achieve this, more rigorous research on mis- and disinformation about COVID-19 is much-needed, especially while facing the rise of populism and anti-scientism worldwide [ 95 , 96 ].

9.  Discussion

We performed a bibliographic analysis of the papers reviewed above. Figure 5 visualizes the knowledge transfer from the disciplines of the papers cited by the papers we reviewed ( cited-papers ) to the disciplines of papers citing the papers we reviewed ( citing-papers ). The disciplines were determined by the Web of Science (WoS), and one paper may have multiple disciplines. The cited- and citing-papers were also retrieved from WoS. Multidisciplinary Sciences is clearly the dominant discipline for both groups of papers. For a better understanding, we further present bar charts of these papers’ disciplines, excluding Multidisciplinary Sciences , in figure 6 . We found that 6 of the 20 most frequent disciplines of the cited papers were not in medicine, biology or public health; for citing papers, half were not. Most of these fields are computational sciences. These bibliographic results suggest that COVID-19 research is highly multidisciplinary and that there is strong evidence of knowledge transfer between disciplines.

Figure 5. Knowledge transfer from the disciplines of the papers cited by the papers we reviewed (bottom) to the disciplines of papers citing the papers we reviewed (top). The size of the arrows represents frequency. (Online version in colour.)

Figure 6. The count of the top 20 disciplines (excluding Multidisciplinary Sciences ) of ( a ) the papers cited by the papers we reviewed, and ( b ) the papers citing the papers we reviewed. The orange bars represent disciplines other than medicine, biology and public health. (Online version in colour.)

The impact of the COVID-19 pandemic on human society and the scientific community is unprecedented. Winning the war against the COVID-19 pandemic requires innovative collaborations between scientists from many disciplines. Data scientists have already shown that, by joining with medicine and public health scholars, they can identify, analyse and model traditional and novel data generated by, or associated with, the pandemic to produce rich understandings. The innovative use of these data has led to many important applications that cannot be adequately covered by a single article. In this paper, we selected a set of publications that represent data science studies in modelling human mobility, developing digital contact tracing techniques, evaluating government responses, assessing the economic impact, mining patient data, drug repurposing, mining scientific literature, and social media analytics and Web mining. There are a number of topics that are not covered in detail because of insufficient publications, such as vaccine prioritization [ 97 , 98 ] and vaccine hesitancy [ 99 ], screening chatbots [ 100 ], crowdsourcing, and the emerging folk science. As the pandemic, and research into it, progresses, more knowledge will become available on these topics.

This rich literature of data science approaches to combating the COVID-19 pandemic has provided valuable knowledge, experience and, more importantly, toolkits that we may use to improve disease surveillance and refine NPIs for COVID-19. The excitement that lies ahead for scientists in all disciplines is the use of these approaches to prevent the outbreak of future infectious diseases. This capability will depend not only on methodological advances in AI and machine learning, but also on the identification of more data, the linkage across datasets, and the balance between individuals’ privacy and the population’s well-being. Research policy-makers should recognize the urgent need for multidisciplinary COVID-19 research and foster novel collaborative research by thematic prioritization of funding and by organizing working groups and conferences of researchers from different domains. It is important that the public’s trust in science is secured, so that when the world faces another emerging infectious disease in the future, reactions will be timely, effective and underpinned by credible data-driven NPIs with which people comply.

Data accessibility

Authors' contributions

Q.Z. wrote the first draft of the paper. J.G., J.T.W., Z.C. and D.D.Z. provided critical feedback and helped shape the paper. All authors revised the paper.

Competing interests

We declare we have no competing interests.

This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 11218221, C7154-20GF, C7151-20GF and C1143-20GF).

We've detected unusual activity from your computer network

To continue, please click the box below to let us know you're not a robot.

Why did this happen?

Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review our Terms of Service and Cookie Policy .

For inquiries related to this message please contact our support team and provide the reference ID below.

IMAGES

  1. Artificial Intelligence Equipped Supercomputer Mining for COVID-19

    data mining project on covid 19

  2. Role of Data Mining During the COVID-19 Outbreak

    data mining project on covid 19

  3. The role of data in a world reshaped by COVID-19

    data mining project on covid 19

  4. Business Impact of COVID-19 Wave 6

    data mining project on covid 19

  5. COVID-19 (Coronavirus) Data Hub

    data mining project on covid 19

  6. How has the mining industry responded to COVID-19?

    data mining project on covid 19

VIDEO

  1. Early Disease Detection through Machine Learning

  2. Introduction class of Data Mining (Covid time)

  3. Advanced Data Mining Project Milestone 1

  4. #coronavirus #datascience Italy recovering, UK and USA in trouble!

  5. CSC-126 : COVID-19 QUARANTINE SYSTEM GROUP PROJECT

  6. COVID-19-Predicting Outbreak in India with Machine Learning

COMMENTS

  1. Predicting the incidence of COVID-19 using data mining

    Using data from different geographical regions within a country and discovering the pattern of prevalence in a region and its neighboring areas, our boosting-based model was able to accurately predict the incidence of COVID-19 within a two-week period. Peer Review reports Background

  2. Predicting the incidence of COVID-19 using data mining

    Using data from different geographical regions within a country and discovering the pattern of prevalence in a region and its neighboring areas, our boosting-based model was able to accurately predict the incidence of COVID-19 within a two-week period. Supplementary Information

  3. Predicting Mortality of COVID-19 Patients based on Data Mining

    The present study aimed at predicting mortality in patients with COVID-19 based on data mining techniques. To do this study, the mortality factors of COVID-19 patients were first identified based on different studies. These factors were confirmed by specialist physicians.

  4. Forecast and prediction of COVID-19 using machine learning

    2.1. Incubation period of COVID-19. The incubation period is the time between when someone catches the virus and when symptoms start to appear. As reported by the WHO, this virus has an incubation period of 2-14 days in the human body [4,6].According to the Centers for Disease Control and Prevention (CDC), mild symptoms of the virus start appearing within 5 days and become worse afterward [].

  5. Data based model for predicting COVID-19 morbidity and ...

    There is an ongoing need for scientific analysis to help governments and public health authorities make decisions regarding the COVID-19 pandemic. This article presents a methodology based on data ...

  6. Machine learning-based prediction of COVID-19 diagnosis based on

    Effective screening of SARS-CoV-2 enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine...

  7. Association mining based approach to analyze COVID-19 response ...

    This paper introduces a novel data mining-based approach to understand the effects of different non-pharmaceutical interventions in containing the COVID-19 infection rate.

  8. PDF Predicting the incidence of COVID-19 using data mining

    Predicting the incidence of COVID-19 using data mining. Fatemeh Ahouz and Amin Golabpour. Abstract. Background: The high prevalence of COVID-19 has made it a new pandemic. Predicting both its prevalence and incidence throughout the world is crucial to help health professionals make key decisions.

  9. Predictive Data Mining Models for Novel Coronavirus (COVID-19 ...

    The results of the present study show that the model developed with the decision tree data mining algorithm is the most efficient at predicting the possibility of recovery of infected patients from the COVID-19 pandemic, with an overall accuracy of 99.85%, the best among the models developed with the other algorithms inc...

  10. Predicting the incidence of COVID-19 using data mining

    In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease. The COVID-19 datasets provided by Johns Hopkins University, contain information on COVID-19 cases in different geographic regions since January 22, 2020 and are updated daily. Data from 252 such regions were analyzed as of March 29 ...

  11. Data mining and machine learning techniques for coronavirus (COVID-19

    1. A. S. Albahri et al., "Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): a systematic review," vol. 44, no. 7, pp. 1

  12. Prediction of Covid-19 patients states using Data mining techniques

    This paper discusses supervised learning on the COVID-19 Corona Virus India dataset in particular, which contains 3,799 patients and is used to classify COVID-19 patient data into two classes: recovered and deceased.

  13. Analyzing COVID-19 Dataset through Data Mining Tool "Orange"

    Abstract: Data mining is understood as a process used to extract meaningful information from larger collections of raw data. It involves evaluating patterns in large data sets using one or more applications. Data mining has applications in various fields, for instance science and research.

  14. Using data mining as a weapon in the fight against Covid-19

    As confirmed coronavirus cases reach more than 3.2 million globally, those fighting the battle against the pandemic have been inspired to implement innovative methods to help predict the spread of the outbreak. The past few months have seen a number of governments and organizations involved in the Covid-19 response across the globe adopting data mining techniques and spatial analysis mapping ...

  15. Mining Medical Data to Improve COVID-19 Treatment

    Tuesday, June 28, 2022. By analyzing healthcare data on hundreds of thousands of people, IRP researchers have found clues that certain existing medications might be useful for combating COVID-19. Many researchers studying COVID-19 have spent the past two years poring over test tubes and isolated cells. However, large troves of data about people ...


  17. Data mining tools combat COVID-19 misinformation and identify symptoms

    Computer scientists use Google Trends and a government dataset to track symptoms and sift through misinformation. UC Riverside computer scientists are developing tools to help track and monitor COVID-19 symptoms and to sift through misinformation about the disease on social media. Using Google Trends data, a group led by Vagelis Papalexakis, an ...

  18. COVID-19 Roundup: Dashboards, Datasets, Data Mining & More

    Sandia National Laboratories enables desktop data mining for COVID-19 research. Researchers at Sandia National Laboratories have assembled a combination of data mining, machine learning algorithms, and compression-based analytics to help the research community comb through the tens of thousands of articles published about COVID-19 since the pandemic began.

  19. Using citizens' data securely in research: COVID-19 data donation

    During the COVID-19 pandemic, several data donation projects were initiated in Germany, the U.K. and the U.S. These projects showed that citizens were willing to participate and share their data.

  20. GitHub

    GitHub - johnmatzakos/predict-covid-19: A Data Mining project that aims to forecast future Covid-19 cases in Greece using time series data. The repository contains Data, Logs and Utilities folders alongside DataMining.py, DataPreprocessing.py, DataVisualization.py, FeatureEngineering.py, InitialScript.py and Main.py.
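    The core idea of such a project can be sketched in a few lines. This is not the repository's actual code; the function name and the naive linear-extrapolation method are illustrative assumptions only:

    ```python
    # Naive time-series forecast: fit a least-squares line to recent
    # daily case counts, then extrapolate that trend forward.
    def forecast_cases(daily_cases, days_ahead):
        """Extrapolate a linear trend fitted to `daily_cases`."""
        n = len(daily_cases)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(daily_cases) / n
        # Least-squares slope and intercept.
        num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_cases))
        den = sum((x - mean_x) ** 2 for x in xs)
        slope = num / den
        intercept = mean_y - slope * mean_x
        # Forecast the next `days_ahead` points, floored at zero.
        return [max(0.0, intercept + slope * (n + i)) for i in range(days_ahead)]

    # Example: a steadily rising series forecast three days ahead.
    print(forecast_cases([10, 12, 14, 16, 18], 3))  # -> [20.0, 22.0, 24.0]
    ```

    Real projects (including the boosting model in the paper above) use richer features and models, but the fit-then-extrapolate loop is the same shape.
    
    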

  21. PDF MI COVID response

    COVID-19 Cases Among Staff and Residents in Long Term Care Facilities. Note: The data are from weekly reporting by facilities with bed occupancy of at least 13 beds. Source: Data is now provided through NHSN, data prior to May 19 was from Michigan EM Resource ... • Nowcast estimates project that JN.1 (93.1%, 95% P.I. 91.5-94.4%) is the most ...

  22. GitHub

    An interesting data mining topic to focus on for the COVID-19 pandemic is to determine the relationship between COVID-19 related trending topics and sentiments on the social media platform Twitter, with the number of reported confirmed cases for a given country over a period of time.
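    A first pass at the relationship described above is a simple correlation between the two daily series. A minimal sketch, with hypothetical data (the series values are invented for illustration):

    ```python
    from math import sqrt

    def pearson(xs, ys):
        """Pearson correlation between two equal-length numeric series."""
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Hypothetical daily series: tweet volume on a COVID topic vs. confirmed cases.
    tweets = [120, 150, 200, 260, 300, 410, 520]
    cases = [10, 14, 22, 30, 41, 55, 70]
    print(round(pearson(tweets, cases), 3))  # close to 1 for these rising series
    ```

    Correlation alone says nothing about lead or lag; a fuller analysis would shift one series against the other and add sentiment features.
    
    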

  23. Data mining can play a critical role in COVID-19 linked mental health

    Data mining can exert greater influence on COVID-19 linked mental health studies if we continue to underline its importance. For example, sentiment analysis can be implemented to evaluate Internet users' overall sentiment status and thereby to monitor public health concerns ( Singh et al., 2020 ).

  24. Pandemic Journaling Project Archive Opens for Research

    A repository of data detailing the personal experiences of more than 1,800 people living during the COVID-19 pandemic is available to researchers for the first time. A globe signalling Pandemic Journaling Project participants' 55 countries of origin was on display at PJP's Picturing the Pandemic exhibition (https://picturingthepandemic.org ...


  26. Harmonizing government responses to the COVID-19 pandemic

    Public health and safety measures (PHSM) made in response to the COVID-19 pandemic have been singular, rapid, and profuse compared to the content, speed, and volume of normal policy-making. Not ...


  28. Data science approaches to confronting the COVID-19 pandemic: a

    In this paper, we review the newly born data science approaches to confronting COVID-19, including the estimation of epidemiological parameters, digital contact tracing, diagnosis, policy-making, resource allocation, risk assessment, mental health surveillance, social media analytics, drug repurposing and drug development.

