Data
Our survey data on PV adopters and non-adopters at the household level were collected by the National Renewable Energy Laboratory (NREL) between June 2014 and April 2015 as part of its Solar Energy Evolution and Diffusion Studies (SEEDS) project. The Institutional Review Board (IRB) at Portland State University, a key collaborator on the NREL SEEDS project, provided approval to collect our data. All research was performed in accordance with relevant guidelines and regulations, and informed consent was obtained from all survey participants. The data sample comprised single-family United States households in California, Arizona, New Jersey, and New York. We chose these four states for two reasons: (1) they were the top four residential PV markets in 2014 in the U.S.65, and (2) national installers played a major role in these states. In each state, survey participants were primarily voluntary respondents identified by installers and lead-generation companies that collaborated on the research project, supplemented by paid respondents (i.e., panelists recruited through a web-panel company). A minimum of 100 responses per state per customer type was collected. Multi-family households and renter-occupied houses were excluded because of their low probability of adopting solar PV.
Survey items asked about respondents’ monthly winter and summer electricity bills, various socio-demographics (e.g., income, education, age, household size), motivations for considering PV, and experiences surrounding the decision to install PV. We further complemented the survey data with zip-code-level solar irradiance, expressed as PV capacity factor, sourced from NREL’s PVWatts. Although the survey contained both objective and subjective items, this study focuses on a subset of the objective items to build our predictive models, i.e., nine highly visible and easy-to-measure household attributes that can be easily obtained by PV companies. The purpose is to build a parsimonious empirical model with high out-of-sample prediction performance and low data collection costs.
There were 3,592 responses in our dataset, which represents the largest sample in the literature on PV adopters and non-adopters at the household level (Supplementary Table S1). Of the respondents, 48%, 19%, 17%, and 16% were from California, New York, New Jersey, and Arizona, respectively. About 30% of the observations had missing values for key variables needed in our data analysis. The missing data rate was roughly the same across the four states: 30% for California, 29% for New Jersey, 32% for New York, and 23% for Arizona. The relatively low missing data rate for Arizona resulted from using more paid respondents to guarantee the required sample size in each sub-category. After deleting 22 cases due to invalid or missing regional information, we used multiple imputation to obtain 3,570 responses for the final data analysis; 46% of them are adopters and 54% are non-adopters (see Supplementary Table S2 for the sample breakdown by state). The summary statistics for the data are shown in Supplementary Table S3.
Methods
The overall methodology is summarized in the flow chart in Supplementary Fig. S9. To better predict which households are likely to adopt solar PV, we first split the total sample into training, validation, and test datasets. In all model runs, we reserved 20% of the data for out-of-sample testing. For the machine learning algorithms, we also used five-fold cross-validation for hyperparameter tuning, which ensures that each validation fold has a sample size similar to that of the test set.
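A minimal sketch of this data partition in base R follows; the data frame name `seeds` and outcome column `adopter` are illustrative assumptions, while the 20% hold-out and five folds follow the text.

```r
# Minimal sketch of the data partition; `seeds` is an assumed data frame name.
set.seed(2015)                                         # for reproducibility
n        <- nrow(seeds)
test_idx <- sample(seq_len(n), size = round(0.2 * n))  # hold out 20% for out-of-sample testing
test_set  <- seeds[test_idx, ]
train_set <- seeds[-test_idx, ]

# Five-fold cross-validation indices on the training data for hyperparameter
# tuning: each fold (~16% of all data) is comparable in size to the 20% test set.
fold_id <- sample(rep(1:5, length.out = nrow(train_set)))
```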
We chose logistic regression and XGBoost as our two baseline prediction methods, since the former is the typical significance-based method used in the literature and the latter is an award-winning machine learning algorithm with strong predictive performance. We used three criteria for feature selection: (1) household attributes must be objective, highly visible, and easy to measure, so that data collection costs for PV companies are low; (2) household attributes must be commonly used in the PV adoption literature; and (3) the selection of household attributes must not favor either of the two methods being compared. For feature transformation, we took the logarithm of electricity bills and household income to reduce their distributional skewness (Supplementary Table S4) and to obtain better predictive performance (mainly for logistic regression); applying the Box-Cox transformation to remove skewness produces prediction results very similar to those obtained with the log transformation (Supplementary Table S4). Categorical variables such as education level and house size were dummy-coded beforehand.
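As an illustration of these transformations, a short sketch is given below; column names such as `bill_summer`, `income`, `education`, and `house_size` are assumptions rather than the survey's actual variable names.

```r
# Log-transform skewed dollar amounts (illustrative column names).
train_set$log_bill_summer <- log(train_set$bill_summer)
train_set$log_bill_winter <- log(train_set$bill_winter)
train_set$log_income      <- log(train_set$income)

# Dummy-code categorical attributes such as education level and house size;
# model.matrix() expands factors into 0/1 columns (reference levels dropped).
X_train <- model.matrix(adopter ~ log_bill_summer + log_bill_winter + log_income +
                          education + house_size, data = train_set)[, -1]
```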
Prediction rate
In addition to the two classification methods above, which we use to model households’ PV adoption status, we also employ the method developed by Lo et al.35 to calculate the prediction rate of a single household attribute and thereby judge whether it is a strong predictor of the binary outcome variable. Given the binary statuses \(A\) (adopter) and \(N\) (non-adopter), the observed value of a single household attribute \(x\) comes from either the distribution \({f}_{A}\) or \({f}_{N}\). Assuming the costs of false positives and false negatives are equal, the appropriate Bayes rule assigns \(x\) to the status with the larger of \({f}_{A}(x)\) and \({f}_{N}(x)\). When \(x\) is discrete and takes a finite number of values, the error rates of this Bayes rule for adopters and non-adopters are \({\sum }_{x:{f}_{A}\left(x\right)<{f}_{N}(x)}{f}_{A}(x)\) and \({\sum }_{x:{f}_{N}\left(x\right)<{f}_{A}(x)}{f}_{N}(x)\), respectively. The average error rate is then \(0.5*{\sum }_{x}\mathrm{min}\left({f}_{A}(x),{f}_{N}(x)\right)\). As a result, the prediction rate is 1 minus the average error rate:
$$\mathrm{Prediction\ rate}=0.5*{\sum }_{x}\mathrm{max}\left({f}_{A}(x),{f}_{N}(x)\right)$$
(1)
For continuous variables, Eq. (1) can be rewritten with integrals instead of summations.
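A short sketch of this calculation for a discrete attribute is shown below; the vectors `x` and `adopter` (coded 1 for adopters, 0 for non-adopters) are hypothetical inputs.

```r
# Prediction rate of a single discrete household attribute, following Eq. (1).
prediction_rate <- function(x, adopter) {
  lv  <- sort(unique(x))
  f_A <- prop.table(table(factor(x[adopter == 1], levels = lv)))  # distribution among adopters
  f_N <- prop.table(table(factor(x[adopter == 0], levels = lv)))  # distribution among non-adopters
  0.5 * sum(pmax(as.numeric(f_A), as.numeric(f_N)))               # 1 minus the average error rate
}
```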
Logistic regression
Logistic regression (or logit regression) is a statistical model that uses a logistic function to model a binary dependent variable. In our case, the binary dependent variable is whether or not a household has adopted solar PV. We chose logistic regression as the comparison model to the machine learning algorithms because it is the most commonly used method in previous PV adoption research.
Denote the probability of being a PV adopter as \(p\) and the binary dependent variable as \(Y\); then \(p=Prob\left(Y=1\right)\). Logistic regression uses the log odds of the event \(Y=1\) as the transformed dependent variable and assumes a linear relationship with a vector of household attributes \({X}_{i}\) on the right-hand side of the regression model:
$$\mathrm{log}\left(\frac{{p}_{i}}{1-{p}_{i}}\right)=\alpha +{\beta }^{^{\prime}}{X}_{i}+{\varepsilon }_{i}$$
(2)
where \(\alpha\) is the intercept, \(\beta\) is the coefficient vector, and \({\varepsilon }_{i}\) is the error term. Since \(0\le p\le 1\), the logit transformation \(\mathrm{log}\left(\frac{{p}_{i}}{1-{p}_{i}}\right)\) maps \(p\) to a real value in \(\left(-\infty ,+\infty \right)\). The goal of logistic regression is to identify which elements of \(X\) have a significant impact on the probability that \(Y=1\) (i.e., becoming a PV adopter in this case), holding all other factors constant. The magnitude of these marginal impacts is captured by the coefficient vector \(\beta\). However, the assumed linear relationship between \(X\) and the log odds of the event may constrain the ultimate predictive performance.
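A sketch of the corresponding model fit with base R's glm() is shown below; the right-hand-side variables are placeholders for the nine household attributes rather than the study's exact specification.

```r
# Logistic regression for PV adoption (Eq. 2); predictor names are placeholders,
# and the same log transforms are assumed to have been applied to test_set.
logit_fit <- glm(adopter ~ log_bill_summer + log_bill_winter + log_income +
                   education + house_size + age + capacity_factor,
                 family = binomial(link = "logit"), data = train_set)
summary(logit_fit)                                # coefficients, standard errors, significance

# Out-of-sample predicted adoption probabilities, thresholded at 0.5.
p_hat      <- predict(logit_fit, newdata = test_set, type = "response")
pred_class <- as.integer(p_hat > 0.5)
```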
XGBoost
XGBoost (extreme gradient boosting) is a gradient tree boosting method widely used in machine learning challenges such as Kaggle competitions, where it often outperforms other classic methods such as naïve Bayes, random forest, and support vector machines. More than half of the winning solutions in such challenges have used XGBoost, making it even more popular than deep neural networks. The following description of the method is mainly drawn from66, with some reorganization.
One of the defining features of XGBoost is its use of an ensemble of decision trees to boost performance. Assuming there are \(K\) trees in the final ensemble, each producing a prediction score \({f}_{k}\) for subject \(i\) with features \({X}_{i}\), the overall prediction score for this subject is the sum of all \(K\) scores: \({\widehat{y}}_{i}={\sum }_{k=1}^{K}{f}_{k}\left({X}_{i}\right)\).
The goodness of fit for an ensemble of decision trees is defined as:
$$\mathcal{L}={\sum }_{i=1}^{n}l\left({y}_{i},{\widehat{y}}_{i}\right)+{\sum }_{k=1}^{K}\Omega \left({f}_{k}\right)$$
(3)
where \(l\) is the training loss between the observed outcome \({y}_{i}\) and its prediction \({\widehat{y}}_{i}\), and \(\Omega\) is the regularization term that captures the complexities of all \(K\) trees. By default, XGBoost uses a quadratic loss function and the L2 norm, which we adopted in this study. Specifically, \(l\left({y}_{i},{\widehat{y}}_{i}\right)={\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\) and \(\Omega =\upgamma T+\frac{1}{2}\lambda {\sum }_{j=1}^{T}{\omega }_{j}^{2}\), where \(T\) is the number of leaves, \({\omega }_{j}\) are the leaf scores, and \(\upgamma\) and \(\lambda\) are parameters to be calibrated. As such, \(\Omega\) penalizes the complexity of the tree model. Furthermore, the prediction function \({f}_{k}\) and the leaf scores are related by \({f}_{k}\left({X}_{i}\right)={\omega }_{q({X}_{i})}\), where \(q\) is the tree structure that maps subject \(i\) to one of the \(T\) leaves. All subjects in the same leaf receive the same score.
Optimizing a tree proceeds in two major steps: first, find the best prediction function and leaf scores given the tree structure; second, find the best tree structure. For a fixed tree structure \(q\left({X}_{i}\right)\), one can compute the optimal weights, or leaf scores, \({\omega }_{j}^{*}=-\frac{\sum_{i\in {I}_{j}}{g}_{i}}{\sum_{i\in {I}_{j}}{h}_{i}+\lambda }\), and the optimal objective value then becomes:
$${\widetilde{\mathcal{L}}}^{\left(t\right)}\left(q\right)=-\frac{1}{2}\sum_{j=1}^{T}\frac{{\left(\sum_{i\in {I}_{j}}{g}_{i}\right)}^{2}}{\sum_{i\in {I}_{j}}{h}_{i}+\lambda }+\upgamma T$$
(4)
where \({g}_{i}\) and \({h}_{i}\) are the first- and second-order gradient statistics of the loss function \(l\), and \({I}_{j}\) is the set of subjects mapped to leaf \(j\). Equation (4) is similar to the impurity score of a decision tree.
Next, the best tree structure \(q\left({X}_{i}\right)\) is identified. XGBoost uses a greedy method to grow the tree. It starts with a single-node tree of zero depth. Then, for each node of the tree, it enumerates over all features; for each feature, it sorts the subjects by feature value and finds the best split along that feature; it then takes the best split across all features. The best split is the one that maximizes the increase in the objective value of the tree. For example, if one splits the sorted values of a feature \({x}_{i}\) into a left part (\(L\)) and a right part (\(R\)), the increase in the objective is:
$$\Delta \widetilde{\mathcal{L}}=\frac{1}{2}\left[\frac{{G}_{L}^{2}}{{H}_{L}+\lambda }+\frac{{G}_{R}^{2}}{{H}_{R}+\lambda }-\frac{{\left({G}_{L}+{G}_{R}\right)}^{2}}{{H}_{L}+{H}_{R}+\lambda }\right]-\upgamma$$
(5)
where \({G}_{L}=\sum_{i\in L}{g}_{i}\) and \({G}_{R}=\sum_{i\in R}{g}_{i}\), and similarly for \({H}_{L}\) and \({H}_{R}\). This greedy tree-growing method can obviously be computationally expensive; however, XGBoost uses an approximate split-finding algorithm, a cache-aware prefetching algorithm, and parallel learning to enhance speed.
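As a toy illustration of Eqs. (4) and (5), the optimal leaf weight and the split gain can be computed directly from the summed gradient statistics; this is a didactic sketch, not XGBoost's internal code.

```r
# Optimal leaf weight omega* for one leaf, given summed gradients G and Hessians H.
leaf_weight <- function(G, H, lambda) -G / (H + lambda)

# Gain from splitting a node into left (L) and right (R) children, Eq. (5).
split_gain <- function(G_L, H_L, G_R, H_R, lambda, gamma) {
  0.5 * (G_L^2 / (H_L + lambda) +
         G_R^2 / (H_R + lambda) -
         (G_L + G_R)^2 / (H_L + H_R + lambda)) - gamma
}
```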
To achieve strong out-of-sample prediction performance without sacrificing in-sample model fit, we used Bayesian optimization (R package ‘ParBayesianOptimization’ v1.2.467, using its original code) to tune the XGBoost hyperparameters, namely the maximum depth of a tree, eta (learning rate), gamma (minimum loss reduction to make a further partition), minimum child weight (minimum sum of weight in a leaf node), column sampling ratio (ratio of columns used when constructing each tree), and row sampling ratio (ratio of rows used when constructing each tree), as well as the number of boosting rounds. We then used the best set of hyperparameters to re-train the XGBoost model and predict out-of-sample PV adoption outcomes. We also used the R package ‘xgboostExplainer’ v0.168 to probe the XGBoost results and make them more interpretable by calculating the impact of each feature on the prediction at the leaf level. We further constructed waterfall graphs from these marginal impacts and examined their nonlinearity.
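A sketch of how such tuning could be wired together with ‘ParBayesianOptimization’ and xgb.cv() is shown below; the AUC metric, parameter bounds, and iteration counts are assumptions rather than the study's exact settings.

```r
library(xgboost)
library(ParBayesianOptimization)

dtrain <- xgb.DMatrix(data = X_train, label = train_set$adopter)

# Scoring function: 5-fold CV AUC for one hyperparameter setting.
score_fun <- function(max_depth, eta, gamma, min_child_weight,
                      colsample_bytree, subsample) {
  cv <- xgb.cv(params = list(objective = "binary:logistic", eval_metric = "auc",
                             max_depth = max_depth, eta = eta, gamma = gamma,
                             min_child_weight = min_child_weight,
                             colsample_bytree = colsample_bytree,
                             subsample = subsample),
               data = dtrain, nrounds = 1000, nfold = 5,
               early_stopping_rounds = 20, verbose = 0)
  list(Score = max(cv$evaluation_log$test_auc_mean),   # bayesOpt maximizes Score
       nrounds = cv$best_iteration)
}

opt <- bayesOpt(FUN = score_fun,
                bounds = list(max_depth = c(2L, 10L), eta = c(0.01, 0.3),
                              gamma = c(0, 5), min_child_weight = c(1, 10),
                              colsample_bytree = c(0.1, 1), subsample = c(0.5, 1)),
                initPoints = 10, iters.n = 30)
getBestPars(opt)    # best hyperparameters, then re-train with xgb.train()
```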
Decomposing XGBoost. To decompose the enhanced prediction performance of XGBoost over logistic regression, we designed 10 scenarios (Supplementary Table S12) in which to re-run XGBoost. The goal is to control for variable interaction first and then for nonlinearity, since these are the two major factors behind XGBoost's better performance.
The scenario definitions and parameter settings are as follows. First, we restricted the degree of variable interaction used in the decision trees of XGBoost via two parameters, "colsample" (column sampling ratio) and "round" (number of boosting rounds). For example, using 0.07 instead of 1 for "colsample" means that every decision tree randomly draws at most 7% of all columns (i.e., household attributes) from the dataset; since there are 15 columns in our baseline model, only one variable is used in growing each decision tree. This ensures that no interaction occurs within a single decision tree. However, because XGBoost uses an ensemble of decision trees to fit the model and make predictions, adding up prediction scores from many decision trees still implies variable interaction across trees. That is why we also restrict the number of decision trees in XGBoost via the parameter "round" to further remove variable interactions. By limiting the maximum number of allowed decision trees from 1000 to 20 (roughly one decision tree per variable; results using 15 are very similar), we ensure that only very limited variable interaction remains in XGBoost. We cannot remove all interactions by limiting the number of decision trees to one, because in that case, with colsample equal to 0.07, only one variable would be used to fit the whole model.
Second, we restricted the degree of nonlinearity through the parameter "depth" (maximum depth of a tree) in XGBoost. Since each additional level of tree depth adds a new layer of branches to the tree, it translates directly into more nonlinearity even when a single variable is used to grow the tree. Meanwhile, with more variables in use, a greater tree depth would also draw on those other variables to grow the tree, resulting in interactions among variables. That is why we first restrict "colsample + round" and then "depth", and not vice versa. After limiting the tree depth to four, the prediction performance of XGBoost becomes very similar to that of logistic regression.
Thus, by restricting the above three parameters in XGBoost, we are able to control the degree of nonlinearity and interaction included in the method, peeling off its complexity step by step and, in the end, making XGBoost behave similarly to classic methods such as logistic regression.
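For concreteness, one heavily restricted scenario could be specified as in the sketch below; the exact parameter grids for all 10 scenarios are given in Supplementary Table S12, and this snippet is only illustrative.

```r
# One restricted XGBoost run: ~1 of 15 variables per tree (colsample_bytree = 0.07),
# at most 20 boosting rounds to curb cross-tree interaction, and tree depth
# capped at 4 to limit nonlinearity.
restricted_fit <- xgb.train(params = list(objective = "binary:logistic",
                                          colsample_bytree = 0.07,
                                          max_depth = 4),
                            data = dtrain, nrounds = 20)
```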
Cost savings in customer acquisition
The current high customer acquisition costs result from two major components: (1) the leads cost and (2) the sales cost. The leads cost is what an installer company pays to purchase solar leads from a lead-generation company, and the sales cost covers the sales staff's pay for contacting and visiting potential customers, plus their commissions69. After receiving the leads information from the lead-generation company, the installer company contacts all of those leads and makes appointments with some of them. Only a small proportion (e.g., 7%) of leads become final sales (see Supplementary Note for detailed numbers). Machine learning predictions can be used by the installer company to enhance its sales closing rate (or leads conversion rate). Based on the prediction results, installers can tag and group potential customers into predicted PV adopters and non-adopters, and then send more sales staff to contact and visit the predicted PV adopters and fewer to the predicted non-adopters.
The calculation of cost savings in customer acquisition is based on the following three rules: (1) the sales staff should contact and visit ("visit" for short hereafter) all tagged PV adopters, while visiting only a subset of tagged non-adopters, in proportion to the adopter share in that group:
$$\begin{aligned} & Visit_{A} = Tag_{A} \\ & Visit_{N} = { }Visit_{A} \times \frac{{Share_{N} }}{{Share_{A} }} \\ \end{aligned}$$
(6)
where \({Visit}_{A}\) is the number of household visits to tagged adopters (\({Tag}_{A}\)), \({Visit}_{N}\) is the number of household visits to tagged non-adopters, \({Share}_{N}\) is the adopter share in the tagged non-adopter group, and \({Share}_{A}\) is the adopter share in the tagged adopter group.
The second rule is that (2) if the expected payoff from visiting the tagged non-adopter group is well below one sale or installation (say, 0.3 or 0.4 in the statistical sense), the installer company would rather not visit any tagged non-adopters:
$${Visit}_{N}^{\prime}=0\quad \mathrm{if}\quad {Visit}_{N}*{Share}_{N}<0.5$$
(7)
The third rule is that (3) the final payoff, or number of sales, should be the same across the methods being compared:
$${Visit}_{A}*{Share}_{A}+{Visit}_{N}^{^{\prime}}*{Share}_{N}=B$$
(8)
where \(B\) is the required number of sales to be fulfilled whether or not the installer company uses tagging and selective visiting.
The above rules assume that the installer company is risk-neutral and rational, and that its allocation of human resources is based on expected payoffs. Under these rules, the total cost of household visits is \({c}_{1}*({Visit}_{A}+{Visit}_{N}^{^{\prime}})\), where \({c}_{1}\) is the unit visiting cost. The total leads cost is \({c}_{2}*\left({Tag}_{A}+{Tag}_{N}\right)\), where \({c}_{2}\) is the unit leads cost and \({Tag}_{N}=\frac{{Visit}_{N}}{{Share}_{N}}\). These two costs add up to the total customer acquisition cost (see Supplementary Note for calculation details).
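A worked sketch of these rules follows; all numeric inputs (tag counts, adopter shares, and unit costs) are placeholders, not figures from the study.

```r
# Customer acquisition cost under tagging and selective visiting, Eqs. (6)-(8).
acquisition_cost <- function(tag_A, share_A, share_N, c1, c2) {
  visit_A <- tag_A                                  # Rule 1: visit every tagged adopter
  visit_N <- visit_A * share_N / share_A            # ...and tagged non-adopters proportionally
  tag_N   <- visit_N / share_N                      # tagged non-adopter leads, per the text
  visit_N_final <- if (visit_N * share_N < 0.5) 0 else visit_N   # Rule 2 (Eq. 7)
  sales   <- visit_A * share_A + visit_N_final * share_N         # Rule 3: expected sales (= B)
  c1 * (visit_A + visit_N_final) + c2 * (tag_A + tag_N)          # visiting cost + leads cost
}

# Hypothetical example: 100 tagged adopters, adopter shares of 0.6 and 0.1 in the
# two groups, $200 per visit and $40 per lead (placeholder figures).
acquisition_cost(tag_A = 100, share_A = 0.6, share_N = 0.1, c1 = 200, c2 = 40)
```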