Validation of a clinical and genetic test for COVID-19 severity
The current COVID-19 pandemic is a continuing threat to public health and the global economy. While COVID-19 can be a mild disease in many individuals, with cough and fever the most commonly reported symptoms, 10–15% will develop severe COVID-19 requiring hospitalisation and 5% will require intensive care (1).
Globally, public health responses have been aimed at limiting new cases by preventing community transmission through mask wearing, social distancing, curtailing non-essential services and broad travel restrictions. The economic and social impacts of these interventions have been devastating, with foundational damage to local economies (2) and unprecedented increases in mental health diagnoses being reported (3). As the protracted strain of the pandemic increases pressure to re-open economies, there is an urgent need for tests to predict an individual’s risk of severe COVID-19. In the community, a risk prediction test could enable workplaces to confidently manage employees who are at increased risk of severe disease and should work from home or avoid client-facing roles. In the healthcare setting, a risk prediction test could inform patient triage when hospital resources are limited and be useful in prioritising pathology tests and vaccination. On a personal level, knowledge of individual risk can empower individuals to make informed choices about day-to-day activities.
Early in the pandemic, epidemiological analyses recognized that sex and increasing age are risk factors for severe COVID-19 and that common medical comorbidities contribute to individual risk (4). However, these risk factors are frequently considered independently, without accurate knowledge of the magnitude of their effects on risk. The effect of human genetic variation on COVID-19 severity has been examined by the COVID-19 Host Genetics Initiative, which has now released several meta-analyses of the available genome-wide association studies of COVID-19 severity (5,6).
We have previously developed a combined clinical and genetic risk model (7) based upon early data from the UK Biobank (8) and the COVID-19 Host Genetics Initiative meta-analysis of hospitalized vs non-hospitalized COVID-19 cases (which was at that time almost exclusively UK Biobank samples) (6,9). Our prototype model appeared to perform well but was based on a small sample (1,018 cases and 564 controls) from the first wave of the pandemic. We decided not to attempt validation in that dataset because of our concern about the representativeness of the data: the SARS-CoV-2 testing data were ascertained early in the pandemic, when limited testing availability in the United Kingdom meant that mild and asymptomatic cases were not identified. In the interim, the UK Biobank has released further data from participants confirmed to be infected with SARS-CoV-2. This latest data release (2,205 cases and 5,416 controls) has a larger proportion of non-hospitalized people, providing more confidence that the controls are representative of the non-hospitalized population. We used all of the available data and randomly divided it into a 70% training dataset and a 30% validation dataset (ensuring that the datasets were balanced for case and control status) to build and validate a new clinical and genetic risk model. Given the uncertainty around the published summary statistics, we incorporated SNPs in the new model without relying on those statistics and without assumptions as to the identity of the risk allele. We included the SNPs as individual risk factors and estimated the per-allele odds ratio (OR) for each. By doing so, we were able to identify the subset of SNPs (and clinical risk factors) that were informative for predicting risk.
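The split and the per-allele estimation described above can be illustrated with a minimal sketch. It is not the code used for the model: the data are simulated, the variable names (`dosage`, `hypertension`) and effect sizes are hypothetical, and plain unpenalised logistic regression is assumed as the fitting method. Only the total sample size (2,205 + 5,416) and the 70/30 proportion are taken from the text.

```python
import numpy as np

def stratified_split(y, train_frac=0.7, seed=0):
    """Return (train_idx, valid_idx), preserving the case/control ratio in both parts."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, valid_parts = [], []
    for label in np.unique(y):
        # Shuffle the indices of this class, then cut at the training fraction.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_frac * idx.size))
        train_parts.append(idx[:n_train])
        valid_parts.append(idx[n_train:])
    return np.sort(np.concatenate(train_parts)), np.sort(np.concatenate(valid_parts))

def fit_logistic(X, y, n_iter=25):
    """Unpenalised logistic regression by Newton-Raphson; intercept is coefficient 0."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data for illustration only (hypothetical effect sizes).
rng = np.random.default_rng(42)
n = 2205 + 5416
dosage = rng.binomial(2, 0.3, size=n)        # allele count 0/1/2 for one SNP
hypertension = rng.binomial(1, 0.3, size=n)  # one binary clinical risk factor
true_logit = -1.2 + 0.25 * dosage + 0.5 * hypertension
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

train_idx, valid_idx = stratified_split(y, train_frac=0.7)
X = np.column_stack([dosage, hypertension])
coef = fit_logistic(X[train_idx], y[train_idx])

# With the SNP coded as an allele count, exp(coefficient) is the per-allele OR.
# No risk allele is assumed in advance: an OR below 1 simply means the counted
# allele is protective.
per_allele_or = np.exp(coef[1])
print(f"per-allele OR: {per_allele_or:.2f}")
```

Coding each SNP as an allele count and letting the sign of the coefficient fall where it may is what removes the need to specify the risk allele beforehand.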
The clinical variables included in the model are consistent with large-scale epidemiological studies (12) and include the following phenotypes: gender, ethnicity, body mass index, cancer – haematological, cancer – non-haematological, cerebrovascular disease, diabetes, hypertension, kidney disease, and respiratory disease (excluding asthma). Our new model retained seven SNPs (rs112641600, rs10755709, rs118072448, rs7027911, rs71481792, rs112317747, and rs2034831). Interestingly, none of the retained SNPs was in the 3p21.31 locus identified by others (10,11); the associations at that locus were explained by the respiratory disease variable.
The discrimination of the model, as measured by the area under the receiver operating characteristic curve (AUC), was 0.752. The model was well calibrated, with no evidence of overall overestimation or underestimation of risk (α=−0.08; 95% CI=−0.21, 0.05; P=0.3). There was also no evidence of under- or overdispersion (β=0.90; 95% CI=0.80, 1.00; P=0.06).
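For readers unfamiliar with these measures, the sketch below shows one common way to compute them. The text does not state how α and β were estimated, so the definitions here are assumptions: α is the intercept of a logistic regression of the outcome with the log-odds of predicted risk as an offset (ideal value 0), and β is the slope of a logistic regression of the outcome on that log-odds (ideal value 1). All function names are hypothetical and the data are simulated.

```python
import numpy as np

def auc_rank(y, score):
    """AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    y = np.asarray(y)
    score = np.asarray(score, dtype=float)
    order = np.argsort(score, kind="mergesort")
    sorted_scores = score[order]
    ranks = np.empty(len(score))
    i = 0
    while i < len(score):
        j = i
        while j + 1 < len(score) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int((y == 1).sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logistic(X, y, offset=None, n_iter=25):
    """Newton-Raphson logistic regression with an optional fixed offset."""
    X = np.asarray(X, dtype=float)
    offset = np.zeros(len(y)) if offset is None else np.asarray(offset, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

def calibration(y, risk):
    """Return (alpha, beta): calibration intercept and calibration slope."""
    lp = np.log(risk / (1.0 - risk))  # log-odds of the predicted risk
    ones = np.ones((len(y), 1))
    alpha = fit_logistic(ones, y, offset=lp)[0]           # slope fixed at 1
    beta = fit_logistic(np.column_stack([ones, lp]), y)[1]
    return alpha, beta

# Simulated, perfectly calibrated predictions (illustration only).
rng = np.random.default_rng(0)
lp_true = rng.normal(-1.0, 1.0, size=5000)
risk = 1.0 / (1.0 + np.exp(-lp_true))
y = rng.binomial(1, risk)

alpha, beta = calibration(y, risk)
print(f"AUC={auc_rank(y, risk):.3f}  alpha={alpha:.2f}  beta={beta:.2f}")
```

Under these definitions, α below 0 indicates that risk is overestimated on average and β below 1 indicates that predictions are too extreme, which is why the reported confidence intervals are read against 0 and 1 respectively.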
While we were able to build the model using separate training and validation datasets, both were drawn from the UK Biobank, so it is important that we further validate the model in an independent dataset. This is the purpose of the present application to Lifelines.