Recommendations: Training Data

Collecting robust and representative data is critical to any successful deployment of Cider. Here are several recommendations:

  • Number of observations that should be collected in a ground truth poverty survey: Returns to additional training data for poverty prediction have been documented in past research (Blumenstock et al., 2015). In general, accuracy increases with additional training data, but with diminishing marginal returns to data collection.
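
One practical way to gauge these returns on existing data is to plot a learning curve: cross-validated accuracy as a function of training-set size. The sketch below uses scikit-learn for this check; the file name, column names, and model choice are placeholder assumptions, not part of CIDER.

```python
# Sketch: empirical learning curve to gauge returns to additional survey observations.
# Assumes a merged dataset of CDR-derived features and a ground-truth outcome column
# "consumption"; all file and column names here are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

df = pd.read_csv("survey_with_cdr_features.csv")  # hypothetical merged dataset
feature_cols = [c for c in df.columns if c != "consumption"]
X, y = df[feature_cols].values, df["consumption"].values

# Evaluate cross-validated R^2 at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="r2",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5d} training households -> mean out-of-fold R^2 = {score:.3f}")
# Diminishing gains in out-of-fold R^2 indicate when additional surveying adds little.
```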

  • Survey collection - phone vs. in-person: While it is generally accepted that field surveys collect higher-quality data than phone surveys, the structural and economic constraints of each targeting project will dictate whether a phone or in-person survey is more appropriate. Note that more complex poverty outcomes like income and consumption can typically only be collected in a high-quality field survey; phone surveys usually focus on asset indices or other proxies for consumption or income, such as a proxy-means test.

  • Poverty outcomes collection: When possible, multiple poverty measures should be collected for validation purposes. Past studies predicting wealth from mobile phone metadata have trained machine learning models on a variety of poverty outcomes, including asset indices, consumption, proxy-means tests (PMTs), and poverty probability indices (PPIs). Consumption is the accepted gold standard for measuring poverty in development economics, so it should be collected where possible (though eliciting consumption typically requires a multi-hour consumption module in a field survey). Recent work has shown that well-calibrated PMTs and PPIs are preferable to asset indices when an up-to-date PMT or PPI exists for the country in question, or when it is possible to custom-calibrate a PMT on recently collected consumption data.
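
For reference, a common way to build an asset index when consumption is unavailable is to take the first principal component of household asset indicators. The sketch below shows that approach with scikit-learn; the file and column names are placeholders, and this is not necessarily how CIDER constructs its indices.

```python
# Sketch: asset index as the first principal component of household asset indicators.
# File and column names are placeholders; this is not necessarily CIDER's own method.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

survey = pd.read_csv("survey.csv")  # hypothetical survey file
asset_cols = ["owns_radio", "owns_tv", "owns_fridge", "owns_motorbike", "improved_roof"]

# Standardize the indicators, then take the first principal component as the index.
scaled = StandardScaler().fit_transform(survey[asset_cols])
survey["asset_index"] = PCA(n_components=1).fit_transform(scaled).ravel()

print(survey["asset_index"].describe())
```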

  • Number of components in a PMT or asset index: Past work on wealth prediction and targeting has used indices or PMTs with between 8 and 30 components. Index accuracy typically increases with the number of underlying components, but with diminishing marginal returns beyond a point. Moreover, the more components a survey collects, the more fatigue or boredom on the part of the respondent, which may decrease data quality. Some contexts have standardized, tested asset indices or PMTs. If no standardized index is available but past survey data from the country in question is, one option is to use forward selection of features for a PMT (implemented in CIDER’s survey module) and look for an “elbow” in the accuracy curve where returns to additional features start to decrease. In choosing the number of features in the PMT, the total survey duration (including, for example, demographic information for bias checks or other poverty measures for validation) should also be considered to prevent respondent fatigue.
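
A minimal sketch of this forward-selection-with-elbow procedure is shown below, using scikit-learn directly rather than CIDER’s survey module; the file name, outcome column, and model choice are placeholder assumptions.

```python
# Sketch: forward selection of PMT components, tracking cross-validated accuracy
# as features are added, to look for an "elbow" where returns diminish.
# Assumes the candidate columns are numeric; names here are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

survey = pd.read_csv("survey.csv")                       # hypothetical survey file
candidates = [c for c in survey.columns if c != "log_consumption"]
y = survey["log_consumption"]

selected, scores = [], []
for _ in range(min(30, len(candidates))):
    # Greedily add the feature that most improves cross-validated R^2.
    best_feat, best_score = None, -float("inf")
    for feat in candidates:
        cols = selected + [feat]
        score = cross_val_score(LinearRegression(), survey[cols], y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_feat, best_score = feat, score
    selected.append(best_feat)
    candidates.remove(best_feat)
    scores.append(best_score)
    print(f"{len(selected):>2d} features: cross-validated R^2 = {best_score:.3f}")
# Inspect `scores` (or plot them) for the point where additional components add little.
```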

  • Timeline and frequency for survey and phone metadata collection (wealth prediction recalibration): Recent research from Aiken et al. (2021) suggests that the quality of phone-based poverty prediction degrades when phone data or survey data are out of date. In that application of phone-based targeting in rural Togo, the authors find that out-of-date phone or survey data reduce targeting accuracy by 4-5% and precision by 10-14%. Three concerns are particularly relevant to data collection and model recalibration: (1) turnover of SIM cards on the mobile phone network, (2) the transitory nature of poverty, and (3) applying a machine learning model calibrated on “old” survey data to contemporary CDR (risking “model drift”).

  1. One concern with using “old” CDR for a contemporary aid program is the turnover of SIMs on the mobile phone network. If the CDR used for targeting is even six months out of date, a sizable share of subscribers will be unable to register because they are not associated with underlying CDR from which to produce a wealth prediction. For example, in Togo, Aiken et al. (2021) estimate that around 3% of mobile network subscribers leave the network in a given month, so around 18% of subscribers would have switched SIM cards after six months (see the short calculation after this list). CDR should be taken from as close to the program start date as possible to maximize program coverage.

  2. People move in and out of poverty with some frequency, though the timescale for this movement is typically years. To capture poverty status contemporaneous with the program period, survey data should be gathered as close to the program start date as possible. Initial results suggest that it is not essential for survey data to overlap with the time period of the CDR, but both should be collected as close to the program as possible.

  3. The two concerns above suggest that CDR should be “refreshed” frequently and survey data somewhat less frequently. There is also a specific concern, known as “model drift,” with training a model on “old” CDR and survey data and using it to predict on CDR collected later: the relationship between phone behavior and poverty shifts over time (for example, the relationship over the holidays may not be the same as in other months). To combat this effect, even if the survey data is not “refreshed,” the model should be refreshed (that is, retrained) regularly.
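
The short calculation below reproduces the SIM-turnover arithmetic referenced in point 1, using the roughly 3% monthly churn rate cited for Togo; the compounded variant is an additional assumption shown for comparison.

```python
# Sketch: back-of-the-envelope SIM turnover, using the ~3% monthly churn rate
# cited for Togo in Aiken et al. (2021).
monthly_churn = 0.03
months = 6

linear = monthly_churn * months                 # simple approximation: ~18%
compounded = 1 - (1 - monthly_churn) ** months  # assumes independent monthly exits: ~16.7%

print(f"Linear approximation: {linear:.1%} of subscribers gone after {months} months")
print(f"Compounded estimate:  {compounded:.1%} of subscribers gone after {months} months")
```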

  • Representativity checks of training data: Bias audits should be grounded in the cultural context in which the program is conducted, either through qualitative interviews with stakeholders (especially potential beneficiaries) or based on existing legal frameworks. If possible, qualitative or legal work should be conducted to assess which subgroups are vulnerable and should therefore be covered by fairness checks. Qualitative work should consist of interviews or focus groups with program stakeholders, including local government, NGOs, and, most importantly, potential program beneficiaries. Alternatively, fairness classes could be based on the legal context in question, checking fairness for subgroups protected by law. For example, in the US, race, religion, nationality, sex, age, disability status, and veteran status are all protected classes.
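
A simple starting point for a representativity check is to compare subgroup shares in the ground-truth survey against reference population shares (for example, from a census). The sketch below illustrates the comparison; the column name, subgroups, and census figures are placeholder assumptions.

```python
# Sketch: representativity check comparing subgroup shares in the ground-truth survey
# to reference population shares (e.g., from a census).
# Column names and reference figures below are placeholders.
import pandas as pd

survey = pd.read_csv("survey.csv")  # hypothetical survey file with a "region" column

# Hypothetical census shares for each subgroup of interest.
census_shares = {"north": 0.30, "central": 0.25, "south": 0.45}

survey_shares = survey["region"].value_counts(normalize=True)
for group, expected in census_shares.items():
    observed = survey_shares.get(group, 0.0)
    print(f"{group:>8s}: survey {observed:.1%} vs. census {expected:.1%} "
          f"(difference {observed - expected:+.1%})")
# Large gaps flag subgroups that are under- or over-represented in the training data.
```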