Approximating missing epidemiological data for cervical cancer through Footprinting: A case study in India

Irene Man
Damien Georges
Maxime Bonjour
Iacopo Baussano

Curated by eLife

eLife assessment

This work presents a framework for estimating missing data on cervical cancer epidemiology. If properly validated, it could help determine missing data in regions where data are scarce. The work will be of broad interest to researchers and policymakers evaluating cervical cancer prevention measures.

Abstract

Local cervical cancer epidemiological data essential to project the context-specific impact of cervical cancer preventive measures are often missing. We developed a framework, hereafter named Footprinting, to approximate missing data on sexual behaviour, human papillomavirus (HPV) prevalence, or cervical cancer incidence, and applied it to an Indian case study. With our framework, we (1) identified clusters of Indian states with similar cervical cancer incidence patterns, (2) classified states without incidence data to the identified clusters based on similarity in sexual behaviour, (3) approximated missing cervical cancer incidence and HPV prevalence data based on available data within each cluster. Two main patterns of cervical cancer incidence, characterized by high and low incidence, were identified. Based on the patterns in the sexual behaviour data, all Indian states with missing data on cervical cancer incidence were classified to the low-incidence cluster. Finally, missing data on cervical cancer incidence and HPV prevalence were approximated based on the mean of the available data within each cluster. With the Footprinting framework, we approximated missing cervical cancer epidemiological data and made context-specific impact projections for cervical cancer preventive measures, to assist public health decisions on cervical cancer prevention in India and other countries.
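The third step described above, approximating missing values by the mean of the available data within each cluster, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `impute_by_cluster_mean`, the unit names, and all numbers are hypothetical and are not data from the study.

```python
from statistics import mean


def impute_by_cluster_mean(values, clusters):
    """Fill missing values (None) with the mean of observed values in the same cluster.

    values:   dict mapping unit name -> observed value, or None if missing
    clusters: dict mapping unit name -> cluster label
    """
    # Collect the observed values per cluster.
    observed = {}
    for unit, value in values.items():
        if value is not None:
            observed.setdefault(clusters[unit], []).append(value)

    # Mean of the available data within each cluster.
    cluster_means = {label: mean(vals) for label, vals in observed.items()}

    imputed = {}
    for unit, value in values.items():
        if value is not None:
            imputed[unit] = value
        else:
            # Remains None if the cluster has no observed data at all.
            imputed[unit] = cluster_means.get(clusters[unit])
    return imputed


# Hypothetical illustration only: these are not data from the study.
incidence = {"A": 20.0, "B": 24.0, "C": None, "D": 10.0, "E": 12.0, "F": None}
cluster_of = {"A": "high", "B": "high", "C": "high",
              "D": "low", "E": "low", "F": "low"}
filled = impute_by_cluster_mean(incidence, cluster_of)
```

In this toy input, unit C receives the mean of A and B, and unit F receives the mean of D and E; observed values are left unchanged.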

Version published to 10.7554/elife.81752 on eLife
May 25, 2023
eLife
Feb 15, 2023
Author Response

Reviewer #1 (Public Review):

This work provides a new general framework for estimating missing data on cervical cancer epidemiology, including sexual behavior, HPV prevalence, and cervical cancer incidence. These data are useful to determine impact projections of cervical cancer prevention. The authors suggest a three-step approach: 1) a clustering method applied on registries with an intermediate level of data availability to cluster cervical cancer incidence based on a Poisson-regression-based CEM algorithm, 2) a classification method applied on registries with a low level of data availability to classify cervical cancer incidence based on a Random Forest, 3) a projection method applied on missing data based on the mean of available data. The authors use India as a case study to implement this new methodology. Results indicate that two patterns of cervical cancer incidence are identified in India (high and low incidence), classifying all Indian states with missing data to a low incidence. From this classification, missing data is approximated using the mean of the available data within each cluster.
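To make the clustering step more concrete, the sketch below shows a simplified classification EM (CEM) loop with a Poisson likelihood. It is an assumption-laden illustration rather than the authors' algorithm: each cluster is described only by age-specific rates (a saturated Poisson model with age as a categorical term and person-years as the exposure), mixing proportions are assumed equal, and the names `poisson_loglik` and `cem_poisson` as well as the input numbers are hypothetical.

```python
import math


def poisson_loglik(counts, person_years, rates):
    """Poisson log-likelihood of one registry under a cluster's age-specific rates.

    The log(cases!) term is dropped because it does not depend on the cluster.
    """
    total = 0.0
    for cases, py, rate in zip(counts, person_years, rates):
        expected = max(rate * py, 1e-12)  # floor avoids log(0)
        total += cases * math.log(expected) - expected
    return total


def cem_poisson(counts, person_years, n_clusters=2, max_iter=100):
    """Simplified CEM: alternate rate estimation and hard cluster assignment."""
    n_units = len(counts)
    n_ages = len(counts[0])

    # Deterministic start: rank registries by crude rate, cut into equal groups.
    crude = [sum(c) / sum(p) for c, p in zip(counts, person_years)]
    order = sorted(range(n_units), key=lambda i: crude[i])
    labels = [0] * n_units
    for rank, i in enumerate(order):
        labels[i] = rank * n_clusters // n_units

    rates = [[0.0] * n_ages for _ in range(n_clusters)]
    for _ in range(max_iter):
        # M-step: pooled age-specific rate per cluster (Poisson maximum likelihood).
        for k in range(n_clusters):
            members = [i for i in range(n_units) if labels[i] == k]
            if not members:
                continue  # an empty cluster keeps its previous rates
            for a in range(n_ages):
                cases = sum(counts[i][a] for i in members)
                py = sum(person_years[i][a] for i in members)
                rates[k][a] = cases / py

        # C-step: assign each registry to its most likely cluster.
        new_labels = [
            max(range(n_clusters),
                key=lambda k: poisson_loglik(counts[i], person_years[i], rates[k]))
            for i in range(n_units)
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, rates


# Hypothetical counts for 4 registries and 3 age groups (not data from the study).
counts = [[5, 20, 40], [6, 22, 38], [2, 8, 15], [1, 9, 14]]
person_years = [[100000, 100000, 100000] for _ in counts]
labels, rates = cem_poisson(counts, person_years, n_clusters=2)
```

With these toy inputs the first two registries end up in one cluster and the last two in the other. A fuller version would also estimate mixing proportions and compare the fit across different numbers of clusters.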

A strength of this approach is that this methodology can be applied to regions with missing data, although a minimum set of information is needed. This makes it possible to have individual data for each unit in the region.

One of the weaknesses of this methodology is the need for a minimum set of epidemiological data to enable impact projections. The authors mention that, when cervical cancer epidemiological data are not available, general indicators (e.g., human development index, geography) can be used, but the projections will probably be less realistic. As observed with other techniques, countries with fewer resources have less data available and cannot benefit from these types of techniques to have more adequate guidelines.

Imputation of missing data is always a challenging issue. The technique proposed in this manuscript is an interesting new approach to missing data imputation that could be applied with a minimum set of available data. However, we must focus on obtaining reliable data from each region of the world to help local health authorities implement better preventive measures for the local population.

We thank the reviewer for the considerate comments and suggestions and have tried to incorporate them as much as possible in the revised manuscript.

As the reviewer has pointed out, the applicability of the proposed methodology depends on the available data. In our opinion, this is a general challenge for approximating missing data, rather than a weakness particular to our methodology. In fact, we believe that our framework is flexible enough to address missing data in many situations. To clarify this point, we have included the following sentences in the Discussion (lines 363-376, page 18): “It is important to note that, in general, the applicability of the proposed framework depends on the actual amount of data available. However, in our opinion, it is a general challenge for approximating missing data, rather than a weakness particular to our methodology. By allowing possible adaptations, we believe that our framework is sufficiently flexible to address missing data in many situations.”

Finally, we fully agree with the reviewer that we should continue our effort to collect more data for countries where these are not available. The proposed framework should be considered as a solution to the situation in which collection of additional data is not or not yet possible.

Reviewer #2 (Public Review):

The burden of cervical cancer worldwide is well recognized. While prevention strategies, including vaccination against human papillomavirus (HPV), cervical cancer screening, and pre-cancer treatment, can reduce the burden of cervical cancer, access to these measures is still limited, especially in low- and middle-income countries. Since the impact of prevention strategies is heavily dependent on the disease's burden on a particular population, we need to know the latter to assess the impact of these context-specific prevention strategies.

However, epidemiological data on cervical cancer are not always available for all geographical areas. This paper uses India as a case study to propose a framework called "Footprinting" to comprehensively evaluate the burden of cervical cancer. The authors applied a three-step analytical strategy to impute cervical cancer epidemiological data in states where this information was unavailable using data from cervical cancer incidence, HPV prevalence, and sexual behaviour from other regions. The findings suggest two patterns of cervical cancer incidence, high and low, in different parts of India; all Indian states with missing data were classified as low incidence.

The proposed analytical strategy presents an important solution for imputing data from geographic areas of a country where data are missing.

We thank the reviewer for the considerate comments and suggestions and have tried to incorporate them as much as possible in the revised manuscript.

One conceptual limitation of this work is the lack of explanation or evidence that sexual behaviour can be used to approximate cervical cancer and/or HPV rates.

A similar comment was raised by Reviewer #1. It is well established that sexual contact is the only transmission route of carcinogenic HPV infection, and hence necessary for the occurrence of cervical cancer [ref #26 Vaccarella 2006, Muñoz 1992 Int J Cancer 52, 743-749].

We have included sexual behaviour variables that have previously been shown to be risk factors for HPV infection and cervical cancer, e.g., age of sexual debut and number of sexual partners [ref #26 Vaccarella 2006, ref #27 Schulte-Frohlinde 2021]. Furthermore, we used variables that are commonly available so that the analyses can be easily applied to other settings.

As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance in the India case study shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.

To clarify these points we have included the following paragraph in the Discussion (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available so that the analyses can be easily applied to other settings, e.g., age of sexual debut and number of sexual partners [26, 27]. As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

Also, full information on the three main indicators is only available in two states. This is used to impute the values for the other states.

Indeed, HPV prevalence data were only available for two states. While we acknowledge that this affects the certainty of the imputed HPV prevalence, we considered the imputed results to be satisfactory based on their good agreement with the cervical cancer incidence data found in the validation step (lines 286-23, page 14). We verified that the ratio of HPV prevalence between the high- and low-incidence clusters (1.7-fold) was very similar to the ratio of age-standardized cervical cancer incidence (1.9-fold).

Furthermore, we note that previous modelling works on India relied on even less data, namely one source of HPV prevalence and cervical cancer incidence data [ref #29 Brisson 2020, Diaz 2008 Br J Cancer].

Moreover, the available data used in this study also present some limitations; for example, cervical cancer incidence data were from 2012 to 2016, while sex behaviour data were from 2006. This large gap is likely to have a significant cohort effect, especially given changes in sexual norms in Western countries over the last few decades, which may have gradually influenced other countries, especially in this age of the internet and social media.

In our opinion, for the purpose of modelling the natural history of cervical cancer, it is not necessarily more appropriate to use the most recent sexual behaviour data. Arguably, as sexual behaviour is the “exposure” for the “outcome” cervical cancer, calibration of HPV transmission and cervical cancer models is best done with sexual behaviour and cervical cancer data from the same cohorts, hence with sexual behaviour data from an earlier period than the cervical cancer data.

In addition, if changes in sexual behaviour occur across the whole country, they should not affect the clustering much.

Finally, due to delays in reporting, the cervical cancer incidence data from the period 2012-2016 are the most recent available at the time of writing. Regarding sexual behaviour data, no edition of the NACO report later than that of 2006 has been published so far.

Finally, it would be interesting to validate this methodology to confirm its utility.

We agree that it would be very interesting to validate the proposed methodology in other regions. Unfortunately, this was beyond the scope of this work. Currently, we are working on a project in which we try to apply Footprinting to a collection of low- and middle-income countries.

The proposed framework's strength is difficult to evaluate because the steps and justification for the model variables were not clearly presented, nor were the models validated.

We acknowledge that the framework could be more clearly presented and have added additional explanation in the following places to do so:

Concerning the framework steps, in Methods (lines 144-163, pages 7-8): “For convenience of explanation, we assumed earlier that data availability occurs hierarchically. However, the framework can also be applied with less stringent data requirements. First, the source of Footprint data need not necessarily cover all geographical units. It is still possible to train a classifier in the classification step with Footprint data available for only a part of the clustered geographical units. Second, if none of the key cervical cancer epidemiological data (sexual behavior, HPV prevalence, and cervical cancer incidence data) have large enough coverage to serve as Footprint data, alternative indicators of similarity, such as human development index and geographical distance, could also be used as substitutes. However, the resulting classification performance might be suboptimal, as we expect these indicators to correlate less well with cervical cancer risk. Third, for the projection step, the data on cervical cancer incidence, sexual behavior, and HPV prevalence needed for calibration of projection models need not necessarily belong to the same geographical unit. Calibration can be performed as long as the three types of data are available within each cluster.

With these less stringent data requirements, the proposed framework should be sufficiently flexible to be applied to many situations. However, one should still be cautious in applying the framework when there are few data. This means that, in some cases, we might need to exclude from the analysis some geographical units with too little data or redefine bigger geographical units if the data are not granular enough. Furthermore, we should assess the goodness-of-fit of the obtained clustering, performance of classification, correlation of data within different clusters, and calibration fits to ensure the validity of the final impact projections.”
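One common way to assess the classification performance mentioned above when only a few geographical units are available is leave-one-out cross-validation. The sketch below is illustrative only: to stay self-contained it uses a nearest-centroid rule as a stand-in classifier, which is not the Random Forest used in the study, and the feature values, labels, and function names are hypothetical. Features are used unscaled for brevity; in practice they would likely need standardising.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (NOT a Random Forest): return the cluster whose
    mean feature vector is closest to x in squared Euclidean distance."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, v in enumerate(features):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1

    def dist(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(sums), key=dist)


def leave_one_out_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Hold out each unit in turn, train on the rest, and score the prediction."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)


# Hypothetical Footprint features per unit (e.g. two sexual behaviour indicators)
# and hypothetical cluster labels; not data from the study.
xs = [[0.10, 17.0], [0.12, 17.5], [0.11, 16.8],
      [0.03, 20.0], [0.02, 21.0], [0.04, 20.5]]
ys = ["high", "high", "high", "low", "low", "low"]
accuracy = leave_one_out_accuracy(xs, ys)
```

Any classifier with the same `predict(train_x, train_y, x)` shape, including a wrapper around a Random Forest, could be passed in place of the stand-in.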

Concerning selection of model variables (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available (e.g., age of sexual debut and number of sexual partners) so that the analyses can be easily applied to other settings [26, 27]. In the India case study, the good classification performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

Based on the authors' interpretation of the framework findings, this framework may help extrapolate data from one country to another. I'm curious as to whether this framework could be applied across states and countries.

We thank the reviewer for this comment. Currently, we are working on a multi-year project in which we try to apply the framework to all low- and middle-income countries.
eLife
Dec 4, 2022



Version published to 10.1101/2022.06.28.22276994 on medRxiv
Jun 28, 2022
