As noted earlier, previous studies have demonstrated that proximity to PAs has a net positive impact on wellbeing defined in socioeconomic terms. Those findings, however, rest on narrow conceptual models and strict assumptions that are unlikely to apply across diverse cultural and ecological contexts. Recent studies have emphasized the need to understand social benefits and wellbeing more holistically, including measures of objective wellbeing (socioeconomic indicators), subjective wellbeing (self-defined and comparative indicators), environmental wellbeing (ecological indicators), and governance (governing, management, and equity and justice indicators)19. Other research demonstrates how these components can be operationalized empirically at the country level, but to date we are not aware of any empirical studies of PA effectiveness that incorporate them in systematic and scalable ways, particularly at sub-national or spatially explicit levels24.
We begin to address that gap by constructing a multidimensional composite index that includes both objective and subjective measures of household wellbeing as a response variable. We then construct empirical models to examine the impact of predictor variables on the composite index, including local and national governance factors, environmental changes and shocks, and a variety of physical or geographic factors. Importantly, because we understand that many social-ecological relationships are highly context-dependent, we do not impose strict linear assumptions on the relationships between predictor and response variables. We employ random forest regression, a machine learning technique25, to assess the importance of variables in influencing movement and distribution of observed wellbeing outcomes across our sample. The use of these methods is gaining traction in the conservation community given their ability to accommodate potentially correlated variables and nonlinear relationships. Recent studies have applied random forest regression and classification to problems such as deforestation42, soil quality26, tourism and recreation27, and erosion prediction28. To our knowledge, these techniques have not yet been applied to studies of protected area effectiveness or of the social impacts of protected areas. As such, the use of random forest regression in this study is novel compared to the classical linear models and more recent Bayesian models that have become standard in the conservation literature.
Constructing the response variable
To construct an expanded indicator for wellbeing, we build on an earlier study24 to create a composite index of objective and subjective wellbeing. Previous studies in the conservation literature use Demographic and Health Survey (DHS) data to measure objective wellbeing, using individual economic, health, and education indicators as proxies for wellbeing18. Each of those could be considered a key component of objective wellbeing, but when considered in concert they present a more comprehensive picture of objective wellbeing. As discussed earlier, subjective wellbeing refers to individually defined indicators of wellbeing and comparative assessments of actual performance against expected outcomes. These are typically culturally or locally nuanced, making it difficult to construct indicators that are effective across contexts. Previous work describes these nuanced facets of wellbeing14,17. In previous research, we circumvented that difficulty by constructing indices from individuals’ self-assessments of their life satisfaction, which asks respondents to rank their overall satisfaction with their life and living situation, and of their life evaluation, which asks participants to compare their life now to the past, the future, and others24. Such historical, present, and comparative self-assessments have elsewhere been shown to be important factors affecting wellbeing, albeit in the context of sustainably peaceful societies46. Using a similar rationale, we understand subjective wellbeing to be nuanced and multidimensional and to include aspects of comparative wellbeing regarding historical, future, and distributional components. We also understand that, as with objective wellbeing, no single indicator is an effective proxy for this construct; rather, a mosaic of related constructs provides a richer and more comprehensive view of subjective wellbeing.
We, therefore, opted to construct composite indices of objective and subjective wellbeing using variables like those in previous studies, and then combine them into an overall wellbeing composite. The input parameters for our composite indices of objective, subjective, and overall wellbeing are described in Table 4.
Unfortunately, few surveys or existing datasets include questions relevant to both objective and subjective indicators that can also be readily converted into the spatially explicit formats needed to explore the impact of proximity to PAs. This prevented us from using the same data that other studies have employed. The Afrobarometer survey is an exception: it includes questions relevant to both objective and subjective wellbeing and is administered using similar protocols across a wide range of countries over similar periods. It is a public opinion survey that is conducted annually and covers topics including personal security, education, infrastructure, and living conditions. It collects information on a variety of factors relevant to objective and subjective wellbeing, albeit with some differences in enumeration protocols and variations in questions and responses. The Afrobarometer data are available as cleaned and geocoded data29. In this study, we utilize data from Round 6, which was implemented in 36 countries in 2014–2015. These data were provided under academic license for this study.
In selecting variables to include in the objective and subjective wellbeing composites, we could not assume that values of the variables of interest were missing at random; therefore, the analysis was restricted to indicators with less than 5% missing values. To capture objective wellbeing, we utilized the following: the respondent’s employment status; food security in the household; access to health care; and a composite measure of asset-based wealth. To capture subjective wellbeing, we utilized the following: household wellbeing in comparison to other community members; household wellbeing compared to previous wellbeing; and how households viewed their current living situation. Given the likely multidimensionality in any measure of wellbeing, a composite index is useful to capture a range of indicators30. Our variables of interest were not correlated and could not be assumed to be substitutes. Previous literature has documented the potential constraints of assuming the components of poverty indices are compensatory31. Therefore, rather than factor analysis, we confirmed that our composite measures were reliable using Cronbach’s alpha and created indices for objective wellbeing and subjective wellbeing using a generalized non-compensatory method, the Mazziotta-Pareto Index32.
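The two steps named above, a reliability check and a non-compensatory composite, can be sketched as follows. This is a minimal illustration in Python, not the code used in this study: it assumes the textbook form of Cronbach's alpha and the standard (non-adjusted) Mazziotta-Pareto Index, in which each indicator is standardized to mean 100 and standard deviation 10 and each household's mean is penalized by the imbalance among its indicators. The function names, the use of population standard deviations, and the assumption that all indicators have positive polarity and non-zero variance are ours.

```python
from statistics import mean, pstdev, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of k item scores (one row per respondent)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def mazziotta_pareto(rows, penalty=-1):
    """One MPI score per row.

    penalty=-1 subtracts the imbalance term, the usual choice when higher
    values mean higher wellbeing (an assumption here). Every indicator is
    assumed to vary across rows, otherwise standardization divides by zero.
    """
    k = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(k)]
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    scores = []
    for r in rows:
        # Standardize each indicator to mean 100, standard deviation 10.
        z = [100 + 10 * (r[j] - mus[j]) / sds[j] for j in range(k)]
        m = mean(z)
        s = pstdev(z)
        # Penalize the mean by the spread of the unit's own indicators.
        scores.append(m + penalty * s * (s / m))
    return scores
```

Because of the penalty term, a household with uneven scores across indicators ranks below a household with the same average but balanced scores, which is what makes the index non-compensatory.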
Compiling predictor variables
As noted above, much of the contemporary work assessing the effectiveness of PAs has employed traditional development indicators that have been shown to correlate highly with wellbeing outcomes. These have typically included elements of income, health, and education. Other studies and frameworks such as the Protected Area Management Effectiveness toolkit33 focus on exploring the relationship between local or national-level governance and PA outcomes, while others have explored the impact of geophysical changes and climate-induced shocks. The evidence base and rationale for the inclusion of a variety of predictor variables have been thoroughly discussed in these and other studies and are by now commonplace in the conservation literature18,20,21. However, the empirical studies that operationalize those variables work on somewhat different and typically overly simple conceptual models that are based on narrow sets of generalized assumptions around causality, influence, and importance of some variables over others. For instance, the analysis in18 is based entirely on a model that assumes PA benefit to surrounding communities depends on tourism revenue from the PA, and thereby omits other mechanisms through which societies might benefit. Such mechanisms could logically include the safe and reliable production or delivery of a range of ecosystem services that are not necessarily monetary in nature or monetizable. The reliance on such simplistic causal models presents challenges for designing a conceptual model that works across contexts and for selecting a suite of indicators that work across geographic, political, and cultural boundaries. Moreover, the pragmatic constraints of data that measure such indicators present real barriers to the development of precise and refined indicators. We, therefore, balance conceptual clarity, empirical justification, and pragmatism in selecting variables.
For the present study, we are primarily interested in understanding the importance of proximity to PAs and size of the nearest PA relative to other variables, and as such include a measure of distance to the nearest IUCN PA as well as the geographic size of the nearest PA using data from the World Database on PAs34. In addition, we understand PAs as social-ecological systems that are nested in wider social, political, economic, and environmental systems, each with its own sets of dynamic feedback processes tied to local conditions and context35. We know from previous empirical studies cited above that various categories of indicators influence wellbeing outcomes (socio-economic, political/governance, environmental, geographic, and stochastic shocks) for households in the study geographies. We therefore include measures for each category to assess their influence on wellbeing outcomes. The data used to measure predictor variables are described in Table 5. To assign values of each variable to a specific household observation, we employed the following order of operations. Socio-economic data and governance indicators are sourced primarily from the Afrobarometer data, and are thus already associated with a particular household. We recoded each variable such that answers like ‘not reported’ and ‘unanswered’ were changed to ‘missing’. As with construction of the response variable, we only included predictor variables in our study that had no more than 5% of ‘missing’ observations. We assume that the relationship between wellbeing outcomes and environmental factors is nuanced and complex, and that environmental and social factors operate on different timescales. We also assume from a large body of theoretical literature, much of which is described in8, that social impacts of protected areas and other conservation strategies result from changes in the natural world.
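The recoding and screening rule described above can be illustrated with a short Python sketch. It is illustrative only and is not the code used in this study: the set of labels treated as missing, the record layout (one dictionary per household), and the function names are assumptions made for the example.

```python
# Labels treated as missing; an assumed list based on the examples in the text.
MISSING_LABELS = {"not reported", "unanswered"}


def recode_missing(value):
    """Map non-response labels to None and leave other values unchanged."""
    if isinstance(value, str) and value.strip().lower() in MISSING_LABELS:
        return None
    return value


def screen_variables(records, max_missing=0.05):
    """Recode every record, then keep variables with at most 5% missing.

    records: list of dicts, one per household, sharing the same keys.
    Returns the recoded records and the names of the retained variables.
    """
    cleaned = [{k: recode_missing(v) for k, v in rec.items()} for rec in records]
    n = len(cleaned)
    keep = []
    for name in cleaned[0]:
        share = sum(1 for rec in cleaned if rec.get(name) is None) / n
        if share <= max_missing:
            keep.append(name)
    return cleaned, keep
```

The threshold is applied as "no more than 5%", matching the wording for predictor variables; a strict inequality would be needed to mirror the "less than 5%" rule stated for the response variable.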
As such, we observed environmental factors over extended time periods that include the 2015 reference year but extend before and/or after. For the study, variables expected to influence movement in wellbeing outcomes included control variables for the income group of the country, the occurrence of environmental shocks including drought, floods, and extreme temperature, the micro-economic situation of the household (either according to the respondent or observed by the enumerator), the presence of village facilities (water, electricity, sewage, and cell service), the distance to the nearest PA, the size of the nearest PA, and relevant indicators at the household level including the respondent’s perspective on the direction of the country, household facility access (water, sewage, electricity), the household head’s educational attainment, perceived security, physical security, freedom of speech, voting freedom, and representative government. While management and governance of the PA are expected to influence social outcomes13, we were not able to include information on the governance of the PA due to a lack of granular data in the World Database on Protected Areas52 and large numbers of missing values for IUCN PA type in the dataset.
We assume that some aspects of landcover and landcover change should influence movement in observed wellbeing outcomes, so we constructed metrics that measure change over time. Due to data limitations preceding the survey year, we utilized a period of 2015–2019, assuming that land cover changes in this period are representative of longer-term trends (aside from stochastic shocks). We controlled for factors expected to influence the heterogeneity in these groups including the standard deviation of net primary productivity and NDVI from 2015 to 2019, and the change in the following spatial variables between 2015 and 2019: crop cover, urbanization, nighttime illumination, and tree cover. We also assume that connectivity is important and include distances to the nearest road and buildings. Additionally, following37, we assume exposure to anthropogenic ecological threats could be an important factor affecting wellbeing and include a measure of this. Environmental measures were sourced from a variety of remote sensing data and preprocessed geospatial data described in Table 5 and processed using Google Earth Engine36. For variables that involved averaging or taking standard deviation over time, the period was 2015–2019. To assign values of each variable to a household observation, we assigned the value of the pixel at which the household is located, given its spatial coordinates. In the event that multiple pixels overlapped a household’s coordinates, we averaged across those values. Given the various ranges of the ordinal variables included, we utilized min–max techniques to normalize the variables included in our models. Two exceptions to that normalization are noted in Table 5.
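As a minimal sketch of the last two steps above, the following Python functions average the pixel values that overlap a household's coordinates and rescale a variable with min–max normalization. They are illustrative and are not the Google Earth Engine workflow used in this study; the function names and the convention of mapping a constant variable to zero are assumptions.

```python
def household_value(pixel_values):
    """Average the values of all pixels overlapping a household's coordinates."""
    return sum(pixel_values) / len(pixel_values)


def min_max(values):
    """Rescale values to the [0, 1] range.

    A variable with no variation is mapped to 0.0 throughout
    (an assumed convention, chosen to avoid dividing by zero).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max([2, 4, 6])` returns `[0.0, 0.5, 1.0]`, so variables measured on different ordinal ranges become directly comparable.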
Constructing random forest regression models under a quasi-experimental design
As discussed earlier, there is reason to believe that random forest machine learning models39 may be useful for examining the importance of factors in driving variation in observed wellbeing outcomes compared to the more traditional linear approaches commonly employed in the conservation literature45. This estimation technique relies on an ensemble learning approach that uses multiple decision trees to predict the outcome variable from the predictor variables. Each decision tree is trained on a subset of the original data and predicts the outcome as a separate model. The predictions of the ensemble of trees are then averaged to create the regression or prediction algorithm, which is then applied to the entire dataset. This approach overcomes some of the limitations of classical linear models by relaxing the imposition of directionality and instead learning from the extant patterns in the dataset to identify the relative importance of each variable in driving movement in the response variable. Using a high number of simulation runs, the approach also minimizes the potential for decision trees to split based on unimportant regressors, thereby providing added confidence in variable importance scores40. This approach is limited in that its outcomes are not generalizable, as the model outputs cannot be extrapolated outside the existing data. However, given the contextual specificity of social-ecological relationships of households to PAs and natural resources around the world, we view this model as appropriate for illuminating the relative importance of variables, thereby enabling future site-based studies to unpack those relationships in detail.
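The ensemble logic described above can be made concrete with a deliberately small random forest regressor written in plain Python. This is a teaching sketch under our own assumptions, not the 'ranger' implementation used in this study: every name and default (`mtry`, `min_leaf`, `max_depth`, the number of trees) is illustrative, each tree is grown on a bootstrap sample, and a random subset of predictors is considered at each split.

```python
import random


def _avg(values):
    return sum(values) / len(values)


def _sse(values):
    m = _avg(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, mtry, min_leaf=5, max_depth=6, depth=0):
    """Grow one regression tree; a leaf is the mean outcome of its node."""
    if len(y) < 2 * min_leaf or depth >= max_depth or len(set(y)) == 1:
        return _avg(y)
    best = None
    # Only a random subset of predictors is considered at each split.
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return _avg(y)
    _, j, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, mtry, min_leaf, max_depth, depth + 1)

    return (j, t, grow(left), grow(right))


def predict_tree(node, row):
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node


def fit_forest(X, y, n_trees=25, mtry=2, seed=0):
    """Fit each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, mtry))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees in the ensemble."""
    return _avg([predict_tree(tree, row) for tree in forest])
```

Because each split is a threshold rule, the forest can represent step-like or otherwise nonlinear responses without the analyst specifying a functional form, which is the property the text relies on.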
To run the random forest models for the dataset, we utilized the ‘ranger’ package in R41, first generating predictions to measure accuracy and then running regression models for the full sample on the three outcome variables described in Table 4: overall wellbeing, objective wellbeing, and subjective wellbeing. We first constructed a training dataset using a subset of the data, and then ran a model using the full dataset. For the model, we set the number of simulations to 1000. We included all variables listed in Table 5 to determine the importance score of each. We analyzed the outputs by first comparing importance scores from model outputs, and later by constructing and examining accumulated local effects (ALE) plots for each variable. The basic code for the random forest machine learning models is included in the supplemental material file.
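To show what an ALE (accumulated local effects) curve summarizes, the following Python sketch computes a first-order ALE for one predictor from any fitted prediction function. It is a simplified illustration rather than the plotting routine used in this study: the bin edges are taken from sample quantiles, the curve is centred so that its count-weighted mean is zero, and the names `ale_1d`, `predict`, and `n_bins` are assumptions.

```python
def ale_1d(predict, X, j, n_bins=5):
    """First-order ALE of feature j for a model given as predict(row) -> float.

    Returns the upper bin edges and the centred accumulated effect at each edge.
    """
    xs = sorted(row[j] for row in X)
    n = len(xs)
    # Quantile-based bin edges, de-duplicated for tied values.
    edges = sorted(set(xs[round(q * (n - 1) / n_bins)] for q in range(n_bins + 1)))
    if len(edges) < 2:
        return [], []  # the feature does not vary
    effects, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        first = lo == edges[0]
        members = [r for r in X
                   if (lo <= r[j] if first else lo < r[j]) and r[j] <= hi]
        diffs = []
        for r in members:
            up, down = list(r), list(r)
            up[j], down[j] = hi, lo
            # Local effect: change in prediction across the bin, holding the
            # observation's other features at their observed values.
            diffs.append(predict(up) - predict(down))
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(members))
    accumulated, running = [], 0.0
    for effect in effects:
        running += effect
        accumulated.append(running)
    centre = sum(a * c for a, c in zip(accumulated, counts)) / sum(counts)
    return edges[1:], [a - centre for a in accumulated]
```

Because differences are taken only within narrow bins of the observed data, the curve is less distorted by correlated predictors than a partial dependence plot would be.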
Recent studies have found a relationship between household wellbeing and proximity to PAs using experimental designs in which the treatment condition is location within 10 km of a PA and the control condition is location outside the 10 km buffer, in multi-country studies across the developing world18. Others have found similar impacts using larger and smaller buffers in more constrained geographies, for instance, 5 km buffers9. We assume that variables including distance of a household to a PA and size of the nearest PA should have distinct relationships within such buffer zones compared to outside the buffer. Based on that assumption, and following the design of18, we split the sample into those households that were within 10 km of a PA and those that were more than 10 km from a PA. To test whether we could utilize a quasi-experimental design to extend the study beyond the random forest approach alone, we matched households inside the 10 km buffer with households outside on a variety of the factors above. Rather than attempting propensity score matching, which requires discarding unmatched observations, we reweighted the samples to balance the covariates using entropy balancing per previous studies38. In addition to the balancing procedure, country-level fixed effects were included in a linear model. However, the predictive power of the model was very low, and the assumption of linearity was unlikely to hold. The inclusion of categorical variables and the likely non-linear relationship between those variables and wellbeing required a more flexible model; therefore, we utilized these covariates in the machine-learning random forest regression model presented in this study, run on the two subsamples. While this is not the same experimental approach previous studies use, segregating the data according to the buffer and repeating our random forest model provides insight into the importance of absolute distance as a driver of movement in observed wellbeing scores within and outside of the buffer.
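The buffer split and the reweighting idea behind entropy balancing can be sketched in Python as follows. This is a simplified illustration, not the procedure or software used in this study: it balances covariate means only, starts from uniform base weights, solves the dual problem by plain gradient descent, and assumes covariates have been scaled to the [0, 1] range so that the fixed step size is stable. All function and parameter names are ours.

```python
import math


def split_by_buffer(distances_km, buffer_km=10.0):
    """Return indices of households within the buffer and beyond it."""
    inside = [i for i, d in enumerate(distances_km) if d <= buffer_km]
    outside = [i for i, d in enumerate(distances_km) if d > buffer_km]
    return inside, outside


def entropy_balance(controls, target_means, steps=3000, lr=0.5):
    """Weights for control units whose weighted covariate means match the target.

    controls: list of covariate rows for the comparison group.
    target_means: covariate means of the group to be matched.
    The target must lie inside the range spanned by the control covariates,
    otherwise no exact solution exists.
    """
    k = len(target_means)
    centred = [[x[j] - target_means[j] for j in range(k)] for x in controls]
    lam = [0.0] * k

    def weights(lam):
        scores = [sum(l * c for l, c in zip(lam, row)) for row in centred]
        top = max(scores)  # subtract the maximum for numerical stability
        raw = [math.exp(s - top) for s in scores]
        total = sum(raw)
        return [r / total for r in raw]

    for _ in range(steps):
        w = weights(lam)
        # Gradient of the dual objective: the remaining imbalance in each mean.
        grad = [sum(wi * row[j] for wi, row in zip(w, centred)) for j in range(k)]
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return weights(lam)
```

Unlike matching with a caliper, every comparison household is retained and receives a positive weight, which is the reason given above for preferring reweighting to propensity score matching.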