Estimating active cases of COVID-19: The unknown matters

This post is not about physics but, in a way, it is related to the elusive dark matter:

“Dark matter is material that cannot be seen directly. We know that dark matter exists because of the effect it has on objects that we can observe directly.”

Likewise, this text is about inferring the larger, unobserved number of active COVID-19 cases from the official numbers of detected cases and from mortality data. For that, we rely on data from the ECDC (European Centre for Disease Prevention and Control), which provides an open time series for most of the world.

Depending on the phase of the epidemic, and on each country's resources and policies, we can observe wide variation in the coverage of COVID-19 detection. Since late March 2020, the Centre for Mathematical Modelling of Infectious Diseases has been modelling the rate of under-reporting for several countries. The idea is that, even when we cannot directly observe 100% of the cases, we can observe their effect on the mortality rate. Their modelling took as reference the expected mortality, as documented in an early paper in The Lancet by Verity et al., to see how countries deviate from that baseline. The idea is simple: if the baseline COVID-19 mortality is 1% and a country is reporting 2%, this probably means that it is only catching 50% of the cases (there are confounding factors, such as variations in demographics and health systems, but for simplicity consider those factors as having a smaller effect).

In this case, with 50% coverage, a correction factor of 2 needs to be applied to the official case count. In our work in Coronasurveys, we used this approach to infer the cumulative number of cases in each country. We have provided such estimates since April, and in June we were able to confirm that, at least for Spain, those predictions very closely matched a large-scale serology survey (now also published in The Lancet).
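The arithmetic behind the correction can be sketched in a few lines of Python. This is only an illustration of the simplified reasoning above, not the full under-reporting model, which involves more than a plain ratio. The function names are hypothetical, and capping coverage at 100% is an assumption made here.

```python
def reporting_coverage(baseline_cfr: float, observed_cfr: float) -> float:
    """Estimated fraction of true cases that are being detected.

    Assumption for this sketch: coverage is the plain ratio of the baseline
    mortality to the reported mortality, capped at 1.0 (100% coverage).
    """
    if baseline_cfr <= 0 or observed_cfr <= 0:
        raise ValueError("mortality ratios must be positive")
    return min(1.0, baseline_cfr / observed_cfr)


def correction_factor(baseline_cfr: float, observed_cfr: float) -> float:
    """Multiplier to apply to the official case count."""
    return 1.0 / reporting_coverage(baseline_cfr, observed_cfr)


# The example from the text: baseline mortality 1%, reported mortality 2%.
print(reporting_coverage(0.01, 0.02))  # 0.5
print(correction_factor(0.01, 0.02))   # 2.0
```

With these numbers, 1,000 officially reported cases would correspond to an estimated 2,000 actual cases.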

The cumulative number of cases is a relevant proxy for the number of people who have had contact with the disease, and possibly for herd immunity thresholds. However, on a day-to-day basis, what is more relevant is the risk of transmission. This risk depends on the region, the time and the prevalence of active cases.

In order to count active cases, we need to know how long transmission can occur. There are many reports of cases that stayed PCR positive for weeks or months, but any virologist will tell you that “viral RNA detection by PCR does not equate to infectiousness or viable virus.” This is stated in a study from the National Center for Infectious Diseases in Singapore, which determined, following infectivity studies, that “virus could not be isolated or cultured after day 11 of illness”. With this result in mind, we consider an infectivity window of 12 days.

Our aim is to estimate the true number of active cases in a given country. The approach is to use the correction factor (estimated from the rate of under-reporting for each given country and time) to correct the number of cases detected and reported each day. Each day's corrected number is added to a running tally of cases and subtracted again 12 days later, when those cases leave the infectivity window. The result is an estimate of the number of active cases.
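The add-then-subtract bookkeeping described above amounts to a sliding window over the corrected daily counts. The following is a minimal sketch under stated assumptions: the function name and the input lists are hypothetical, and the correction factors are taken as already estimated for each day.

```python
from collections import deque

INFECTIVITY_WINDOW_DAYS = 12  # the window considered in the text


def estimate_active_cases(daily_reported, correction_factors,
                          window=INFECTIVITY_WINDOW_DAYS):
    """Return one estimate of active cases per day.

    daily_reported: officially reported new cases for each day.
    correction_factors: under-reporting correction for each day.
    """
    if len(daily_reported) != len(correction_factors):
        raise ValueError("both series must cover the same days")
    active = 0.0
    recent = deque()  # corrected counts still inside the window
    series = []
    for reported, factor in zip(daily_reported, correction_factors):
        corrected = reported * factor
        recent.append(corrected)
        active += corrected              # added on the day it is reported
        if len(recent) > window:
            active -= recent.popleft()   # subtracted `window` days later
        series.append(active)
    return series


# Hypothetical input: 10 reported cases a day for 15 days, 50% coverage.
estimates = estimate_active_cases([10] * 15, [2.0] * 15)
print(estimates[0], estimates[11], estimates[14])  # 20.0 240.0 240.0
```

With a constant inflow, the estimate grows for 12 days and then levels off, since each new day's cases replace those that leave the window.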

However, the cases that are officially detected each day are also likely to be quarantined and hence removed from the transmission chains. Thus, a more interesting indicator is the number of unknown active cases, which results from subtracting the detected cases from the estimated total of active cases.
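One way to sketch the undetected count is to keep, for each day, only the part of the corrected figure that was not officially reported, and then sum it over the infectivity window. The names below are hypothetical, and the sketch assumes that every detected case is isolated on the day it is reported, which is a simplification.

```python
def undetected_active_cases(daily_reported, correction_factors, window=12):
    """Estimated active cases that were never officially detected, per day."""
    if len(daily_reported) != len(correction_factors):
        raise ValueError("both series must cover the same days")
    # Corrected cases minus detected cases, day by day.
    undetected_daily = [
        reported * (factor - 1.0)
        for reported, factor in zip(daily_reported, correction_factors)
    ]
    # Sum over the days that are still inside the infectivity window.
    return [
        sum(undetected_daily[max(0, day - window + 1): day + 1])
        for day in range(len(undetected_daily))
    ]


def as_percent_of_population(count, population):
    """Express a case count as a percentage of the total population."""
    return 100.0 * count / population


# Hypothetical input: 100 reported cases a day, 50% coverage, 2-day window.
print(undetected_active_cases([100, 100, 100], [2.0, 2.0, 2.0], window=2))
# [100.0, 200.0, 200.0]
```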

Let us look at some examples, starting with Spain, where our method matches reality most closely. We observe a strong peak of activity, with more than 2.5% active cases, that quickly decayed to a baseline. Today, July 14th, we estimate 0.07% of undetected active cases among the total population, meaning that among 10,000 Spaniards one is now likely to find 7 persons actively transmitting the disease.
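The conversion from a percentage to "persons per 10,000" or "1 in N" is simple arithmetic, sketched here with hypothetical helper names and Spain's figure from the paragraph above.

```python
def per_ten_thousand(percent: float) -> float:
    """Persons per 10,000 inhabitants for a prevalence given in percent."""
    return percent * 100.0


def one_in_n(percent: float) -> int:
    """Express a prevalence in percent as '1 in N' (N rounded)."""
    if percent <= 0:
        raise ValueError("prevalence must be positive")
    return round(100.0 / percent)


spain_percent = 0.07  # undetected active cases, from the text
print(round(per_ten_thousand(spain_percent)))  # 7
print(one_in_n(spain_percent))                 # 1429
```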

In Portugal we see a much smaller fraction at the peak of transmission, followed by a steady number of active cases, mostly in the Lisbon area. However, the estimate of undetected active cases, 0.08%, is not much different from Spain's, and so the risks appear to be similar.

For France, we also observe a small peak followed by stable activity. Here we estimate 0.11% of undetected active cases.

Sweden does not show a clear peak of activity, and its activity is now decreasing. We currently estimate 0.35% of undetected active cases.

The US has a large population, and activity has been increasing after an initial peak that was concentrated mainly in the New York area. We now estimate 0.58% of undetected active cases, meaning that roughly 1 in 170 persons might be transmitting.

Brazil, also a large country, is apparently at a plateau, with a steady activity of 0.65% of undetected active cases.

More countries are available on the Coronasurveys site, as we automate the tracking for countries with available data. There are important caveats: regions that track COVID-19 mortality poorly will give poor estimates, as will countries with much younger demographics than those supporting the Wuhan and Spanish baselines for death-to-case ratios. Some countries, like the US, have recently been identifying more cases in younger populations; this translates into lower reported mortality and might lead to under-estimation of undetected active cases. The proportion of asymptomatic cases, while probably relevant for transmission, does not change the relation among countries. A multiplier of 1.5 (derived from the ratio of asymptomatic cases in the Spanish serology survey) can be used to correct the percentages above to account for active asymptomatic cases.
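Applying the asymptomatic multiplier to the country figures quoted earlier is a one-line scaling, sketched below. The percentages are the July 14th estimates from the text. The dictionary and function names are hypothetical, and rounding to three decimals is only a presentation choice.

```python
ASYMPTOMATIC_MULTIPLIER = 1.5  # from the Spanish serology, as stated in the text

# Undetected active cases as a percentage of the population (from the text).
undetected_active_percent = {
    "Spain": 0.07,
    "Portugal": 0.08,
    "France": 0.11,
    "Sweden": 0.35,
    "US": 0.58,
    "Brazil": 0.65,
}


def with_asymptomatic(percentages, multiplier=ASYMPTOMATIC_MULTIPLIER):
    """Scale each percentage to also account for asymptomatic active cases."""
    return {
        country: round(percent * multiplier, 3)
        for country, percent in percentages.items()
    }


for country, percent in with_asymptomatic(undetected_active_percent).items():
    print(f"{country}: {percent}%")
```

Since every country is scaled by the same factor, the ranking among countries is unchanged, as noted above.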

Even under the limitations stated above, we hope to provide valuable estimates of what would otherwise be only dark matter. The cases that are transmitting without being known are probably those that matter most. To give an overall perspective, we maintain a worldwide risk map based on the percentage of undetected active cases.

Short bio of authors

This article was written by INESC TEC researcher Carlos Baquero in July 2020 and results from an international collaboration led by the IMDEA Networks Institute in Madrid.

Carlos Baquero is an Associate Professor with Habilitation in the Informatics Department, Universidade do Minho, and an Area Manager at the High Assurance Software Laboratory (HASLab) within INESC TEC. His research interests are in data management, causality tracking and distributed data aggregation. During the COVID-19 pandemic, he teamed with other researchers in network theory, complex systems and statistics to study new approaches to data aggregation and estimation in order to track the pandemic evolution.

INESC TEC is a private non-profit research institution, dedicated to scientific research and technological development.
