Architecture of Imagery2Flow
As shown in Fig. 1a, the top 10 largest MSAs of the United States are selected as the study areas. Each region of interest (ROI) is partitioned into N smaller geographic units, whose boundaries can be defined by administrative divisions (e.g., census tracts) or square grids. In this study, an ROI is a metropolitan statistical area (MSA), and the census tracts are the smaller geographic units for predicting OD flows. That is, there are N census tracts in each ROI. The flow yij between area i and area j denotes the number of people moving from area i to area j for commuting, traveling, or other reasons. Generally, the human mobility OD flow prediction task is defined as follows: given the visual characteristics of the urban environment in satellite imagery and the observed spatial structure of flows between any two geographic areas, the model is trained to predict the unknown OD flows within unseen regions.

a The top 10 largest Metropolitan Statistical Area (MSA) of the United States (U.S.), denoted as M1 to M10, corresponding to the following: New York-Newark-Jersey City, NY-NJ-PA (M1); Los Angeles-Long Beach-Anaheim, CA (M2); Chicago-Naperville-Elgin, IL-IN-WI (M3); Dallas-Fort Worth-Arlington, TX (M4); Houston-The Woodlands-Sugar Land, TX (M5); Washington-Arlington-Alexandria, DC-VA-MD-WV (M6); Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (M7); Miami-Fort Lauderdale-Pompano Beach, FL (M8); Atlanta-Sandy Springs-Alpharetta, GA (M9); and Phoenix-Mesa-Chandler, AZ (M10). b The distribution of the number of jobs versus rank for the ten MSAs, which follows the Zipf’s law. The legend labels of M1 to M10 are the same as Fig. 1a. c. Architecture of the Imagery2Flow deep learning model.
To predict the human mobility flows within each urban area, Imagery2Flow consists of three modules: spatial context embedding, spatial interaction learner, and OD flow predictor (as shown in Fig. 1c). First, it extracts visual features from satellite images as the characteristics of the study areas. For an MSA, the fully covered remote sensing imagery is split into N sub-areas according to the boundaries of each spatial unit. Given a satellite image vi of geographic area i, the image embedding ri is obtained to represent the visual features of terrestrial objects within the area in a self-supervised contrastive learning framework. The data augmentations, including random crop, resize, and color distort, are applied to each satellite image vi twice in an iteration to get two similar satellite views for the same geographic unit, which helps the self-supervised contrastive representation learning for spatial contexts59. Then, a deep neural network (e.g., Convolutional Neural Networks, CNN or Vision Transformers) is adopted as the backbone model to encode the satellite views into embedding ri, then a two-layer nonlinear transformation is employed as a projector to project all the satellite views to a latent space, in which the contrastive loss is calculated to all views for maximizing the embeddings’ agreement from the same area. After the contrastive learning process, each geographic area has gained a satellite image neural embedding as a latent representation of its complex visual features.
Second, in order to consider the geographical location, distance, and adjacency influences when predicting the OD flows, we represent the N sub-areas within each ROI into a graph structure to learn their spatial interactions. This graph is constructed based on geographical adjacency, with distance as edge weight. We employ a Graph Attention Network (GAT) as the spatial interaction learner to consider the importance of neighboring nodes with weighting factors, allowing the model to automatically capture geographic areas that have a large or small impact on human mobility flows in or out of the current geographic unit. The satellite image embedding ri is used as initial features of each node (census tract) i, which offers important information about the spatial contexts of each geographic unit. Following the GAT encoding, the node representations are updated by incorporating structural knowledge and neighborhood information through graph representation learning39,60, facilitated by message passing and aggregation mechanisms. The updated node embedding is used as the feature vector of each census tract and fed into the decoder for OD flow prediction in the last module40,41,42,43. The satellite image embeddings encode comprehensive urban contexts of each geographic unit, while the spatial interaction learner helps the model understand the spatial interaction patterns between nodes. Besides, satellite imagery embeddings are transformed and shared among neighboring nodes. In this process, the graph structure is assimilated, which helps to iteratively improve the node embeddings and the relationship with other nodes, while aiding in capturing both local and global information.
Moreover, we utilize the Batch-based Monte-Carlo Loss (BMC Loss)61 to address the data imbalance issue to achieve better performance than the mean-square-error (MSE) loss. In addition, the mini-batch sampling strategy is utilized in the training process to make the graph represent learning inductive, instead of transductive, by sampling a certain number of nodes as a subgraph in each iteration while training. The nodes in the test set do not appear in the training set as the origins of the OD flows. This ensures the transferability of the model and allows its application to unseen nodes to infer their corresponding OD flows.
Lastly, a bilinear transformation decoder62 as the OD flow predictor is employed in Imagery2Flow to learn the spatial interaction patterns from the GAT encoder during training and finally estimate the OD flows between nodes (i.g., census tracts). In addition, we also consider a model variant named Imagery2Flow-LGBM by replacing the bilinear decoder with a Light Gradient Boosting Machine (LGBM) to demonstrate the flexibility of our modeling framework, as recent OD flow prediction works using graph neural networks also utilize machine learning decoders40,42,43. We will compare the performance with different baseline models with and without the machine learning decoder.
Experiments and model performance
Using the Imagery2Flow framework, we have conducted intensive experiments on human mobility OD flow prediction in the top-10 largest MSAs of the U.S., to provide a comprehensive evaluation of our proposed models. The Census Tracts (CTs) in the MSAs are used as the spatial units for OD flow prediction. The commuting flows among census tracts from the 2020 Origin-Destination Employment Statistics (LODES) dataset in each MSA are utilized as the ground-truth data. The data contains the home and employment locations of workers and their commuting flows. We removed the flows with fewer than 10 commuting workers between census tracts due to their insignificance in accurately capturing stable movement patterns between locations. We use Sentinel-2 and Landsat-8, the two widely used, openly accessible, and stable data sources of satellite imagery with global coverage, with spatial resolutions of 10-meter and 30-mete, respectively, to train deep learning models separately to compare performance. To evaluate the effectiveness of Imagery2Flow in predicting mobility OD flows, we employed a wide range of models as baselines, including traditional physical/mechanistic models, machine learning models, and deep learning models. The details of baselines and evaluation metrics are discussed in the ‘Methods’ section. Each model is independently applied to 10 MSAs. In each MSA, stratified sampling is conducted based on the population in each area, with census tracts divided into a 60% training set, 20% validation set, and 20% test set. The human mobility flows whose origins belong to the training set are involved in the loss computation and the gradient update in the training process but not in the testing process. The 5-fold cross-validation is utilized to get the prediction covering the entire city. For example, in the MSA of New York-Newark-Jersey City, NY-NJ-PA (M1), there are 120,559 flows across 4953 census tracts, with a mean flow intensity of 22.3 and a median of 16. In order to reduce the skewness of the data sampling and improve the performance of the model, we performed a log transformation on the flow intensity during the training of Imagery2Flow43.
To illustrate the generalizability of Imagery2Flow, we conducted experiments on different backbones for visual feature extraction from satellite imagery in the M1 MSA. The results are shown in the Supplementary Table 5, suggesting that vision transformers perform slightly better than the CNN backbones. Therefore, we utilized the ViT-L/16 backbone for subsequent experiments. Additionally, to assess the effectiveness of each module in the Imagery2Flow, we added an ablation experiment. The results are shown in Supplementary Table 4. Specifically, we replaced the visual features extracted from satellite imagery with numerical socioeconomic features (Supplementary Table 3) extracted from OpenStreetMap points of interest (POI) data. The results show that the visual features extracted from Sentinel-2 10-meter satellite imagery have a better performance than the POI data, no matter whether the decoder is a bilinear layer or LGBM. Additionally, compared to the OpenStreetMap data, remote sensing images have a better data quality regarding the completeness and global spatial coverage63 and clear timestamp information, which can ensure that the model has the potential to infer up-to-date human mobility OD flows globally when statistical or survey data is not available. In addition, there is a small proportion of pairs of OD flows in the dataset accounting for a greater flow intensity; such an observed scaling phenomenon is consistent with other studies using various human mobility datasets in different cities3,64,65. The sparsity of large mobility flow pairs and the presence of data skewness pose significant challenges to the model’s performance. The BMC Loss is employed to address this issue by learning the distribution of flow intensity through a cost function analogous to contrastive loss, which outperforms the conventional Mean Square Error (MSE) loss.
Table 1 compares the performance of the models in the top-10 largest U.S. MSAs. Compared with machine learning based and deep neural network-based methods, traditional mechanistic methods, such as the Radiation Model and the Gravity Model, provide worse performance in terms of the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Common Part of Commuters (CPC) metrics. As a parameter-free model, the Radiation Model is more suitable for long-distance migration modeling for spatial units with coarser granularity32. The error indicators of Gravity Model exhibit fluctuations across different cities, suggesting that the performance is influenced by the metropolitan size and spatial structure. The XGBoost and LGBM achieve impressive performance, demonstrating the applicability of tree-based gradient boosting methods to OD Flow prediction, which is essentially a regression task, and the superiority of visual features extracted from satellite imagery for characterizing spatial contexts in predicting mobility flows. As an improved version of the Gravity Model, the Deep Gravity model35 has non-linearity and multiple hidden neural layers in a fully connected network structure, so using population (Deep Gravity-P) or visual features (Deep Gravity-V) achieves better results than mechanistic models. However, Deep Gravity-V is obviously better than Deep Gravity-P, by incorporating the visual features of complex urban environments from remote sensing imagery. Additionally, the results indicate that Deep Gravity-V-10m using higher resolution sensing imagery (i.e., 10-meter) outperforms Deep Gravity-V-30m (i.e., 30-meter spatial resolution) by capturing richer spatial contextual information. Also as deep learning models, SIGCN36 and GMEL42 use graph neural networks for feature encoding, which are distinguished from the fully connected layers of Deep Gravity-V. By replacing the bilinear layer with a gradient boosting regressor after the graph embedding training and fine-tuning process, GMEL narrows the RMSE gap with our Imagery2Flow-LGBM to as low as 2.8% when using 10-m resolution imagery, whereas Deep Gravity-V and SIGCN have a gap of 23.5% and 21.8% respectively. Our proposed Imagery2Flow-LGBM model demonstrates superior performance on most evaluation metrics, utilizing 10-meter imagery in nine other MSAs and 30-meter imagery in Miami-Fort Lauderdale-Pompano Beach, Florida (M8). The widely used CPC metric represents the degree of correct prediction in OD flow matrices and is more representative of the predicted mobility flow patterns than other metrics35. Compared with traditional machine learning models that directly use visual features, our Imagery2Flow-LGBM model uses a graph neural network to further encode the relationships between geographic units, aggregate their neighborhood information, and learn spatial interaction patterns, thus achieving better performance. Specifically, using 10-meter resolution of Sentinel-2 imagery, the RMSE has reduced by a range of 4.5% to 13.9%, MAE by 1.3% to 10.8%, and CPC has increased by 1.3% to 2.6%. The Deep Gravity model is based on a singly constrained gravity model that uses the total outflow volume of each origin area as a known restriction during the OD flow estimation process. Throughout the evaluation of model performance and transferability, such high performance cannot be achieved when the outflow volume is unknown. Similarly, SIGCN builds graphs based on known flows, which restricts its generalizability. When transferring the model to an unseen city or MSA, the Imagery2Flow model and its variant Imagery2Flow-LGBM can be directly applied to the new area with a graph structure between geographic units, requiring only open-source satellite images as input, thus offering greater flexibility.
We further analyze the OD flows predicted by Imagery2Flow and Imagery2Flow-LGBM, group them according to the population of the origin areas, the intensity and distance of flows, and calculate the average CPC of different groups. As shown in Fig. 2d, we split all geographic units (census tracts) in each MSA into ten equal-sized groups (i.e., deciles), sort them based on their population, as decile 1 accounts for the top 10% of areas by population, and decile 10 accounts for the areas with the smallest population. The model performs well and is stable in every decile. The performance of Imagery2Flow-LGBM remains superior in the first nine deciles until slightly declining compared to Imagery2Flow in the least populated areas. Recently, points of interest and geo-referenced social media data that can characterize the geographical environment are often used in mobility flow prediction tasks35,42,66. But these data have some limitations of being unevenly distributed across different regions and more abundant in densely populated areas, such as Manhattan, compared to the suburbs of New Jersey in the M1-New York Region, leading to spatial biases in model performance. In contrast, the spectral information capturing terrestrial objects in remote sensing imagery fully covers the entire MSA, exhibiting relatively uniform information richness. Therefore, the difficulty of the model predicting flow does not change much with the population variation across the region.

There are 120,559 flows among 4953 census tracts. a. Real commuting flows. b Predicted flows of Imagery2Flow-LGBM with 10-meter satellite images. c Predicted flows of Imagery2Flow-LGBM with 30-meter satellite images. d Comparison of the performance in terms of Common Part of Commuters (CPC), varying the decile of the area’s population. e The CPC of predicted flows in various distances. f The CPC of predicted flows in various intensities. Error bars represent the standard deviation of CPC of flows across origins, centered at the mean. A 5-fold cross-validation is employed.
Based on the distance and flow intensity, we divided the OD flows into six levels, as shown in Fig. 2e and f. Both Imagery2Flow and Imagery2Flow-LGBM perform well across short and long distances (Fig. 2a–c). However, for high-intensity flows with more travelers, more errors are observed due to the data imbalance, with a large number of low-intensity flows present in the training set. As the intensity increases, the performance of Imagery2Flow declines more substantially than that of Imagery2Flow-LGBM, which can still achieve the CPC of 0.6 at high flow intensity levels. In addition, there are no significant differences on the performance of visual features provided by remote sensing images of different spatial resolutions (10-meter or 30-meter).
Correlation to urban morphology
Prior studies of human mobility have investigated the effect of geographical environment on human mobility patterns, such as the distance decay phenomenon, i.e., the reduction in collective human movements or individual displacements as the distance increases between origin and destination. This relationship is commonly modeled using a power law24,67 or an exponential form68. In addition to terrain morphology and transportation infrastructure, urban form and development also have an impact on human mobility, which can be reflected in satellite imagery. We therefore perform an empirical analysis based on the commuting flows, illustrating the relationship between the predicted mobility and urban morphology (e.g., compactness and centrality).
To quantify the distance decay phenomenon and understand population mobility patterns, we measure the distance distribution of commuting flows. The Akaike weight69,70 is utilized to determine which type of distribution is most suitable among exponential, power law, and truncated power law. The results are detailed in Supplementary Note 2 (Supplementary Table 2), which validates that the exponential form approximates the distance decay better than other alternatives in our study. Additionally, the distributions of the predicted flow using the Imagery2Flow and Imagery2Flow-LGBM models closely match the observed flow, with both the Kullback-Leibler (KL) divergence and Jenson’s Shannon (JS) divergence being very small (see Supplementary Table 1). Therefore, the human mobility OD flows predicted by satellite imagery can also capture the information of the urban form and spatial structure that is comparable to the commuting flow. Our proposed method can help urban planners connect satellite imagery to urban morphology from the perspective of human mobility, which has been less explored in prior work, as most existing studies focused on land cover, land use, or population mapping55,58,71.
We fit the mobility flow distance-probability distribution based on Maximum Likelihood Estimation (MLE)70 and find that the probability density function of the observed and predicted flows by distance universally exhibit the exponential law P(d) ~ e−λd, as shown in Fig. 3. The exponent λ calculated based on the actual OD flow indicates the rate at which commuting distance decays for residents in various cities. A larger λ value signifies that fewer residents undertake long-distance commutes. We further use urban morphology indices that represent compactness and centrality to investigate their correlation with the exponents68. The exponent λ shows strong correlation with compactness and centrality of the MSAs, with the Pearson correlation coefficient at 0.6-0.8 (p-value < 0.05) in these ten MSAs (Supplementary Fig. 1). The Boyce-Clark Shape Index (BCSI)72 is adopted to quantify the shape irregularity of cities, by projecting n rays from the shape’s centroid at equal angular intervals to its boundary and measuring the variation in radial distances. The formula for BCSI is computed by \(BCSI=\sum | \frac{{r}_{i}}{{\sum }_{i=1}^{n}{r}_{i}}\times 100-\frac{100}{n}|\). As the radial distances ri become more consistent, the BCSI value decreases, indicating a more compact shape. A BCSI value of 0 represents a perfect circle, which has equal radial lengths from the centroid to the boundary, while higher BCSI values represent more irregular shape. For this analysis, we used 36 rays at equal angular intervals from the centroid of each MSA. Supplementary Fig. 1 illustrates a strong positive correlation between the BCSI and λ. In a more compact city, the cost of long-distance commuting is lower due to the more averaged radials, resulting in a higher proportion of long-distance commuters. This is contrary to the correlation obtained in previous research68, which showed that in more compact cities, the radius of gyration (ROG) decays more rapidly, that is, fewer residents have a large ROG. There are three possible explanations for this opposing finding. First, the previous research68 relies on individual-level mobile phone data, which encompasses a range of individual activities, including daily routines such as home-to-work and entertainment. In compact cities with denser facilities, residents’ demand for long-distance movement decreases. However, our research is based on collective commuting survey data, which has fixed activity types and frequencies and is insensitive to urban facilities. This indicates that various types and frequencies of human activities are influenced by urban morphology in distinct ways73. Second, the distance in collective flow data is measured between the centroids of two census tracts, which introduces an aggregation effect on the distance itself. This may differ from the distance decay observed at the individual-level movements. Finally, differences in travel mode preference in different cities and changes over time may also have influence on individuals’ long-distance travel decisions.

The red curves represent an exponential fit to the distance distribution of the empirical observations. a 120,559 flows in New York-Newark-Jersey City, NY-NJ-PA (M1); b 68,712 flows in Los Angeles-Long Beach-Anaheim, CA (M2); c 67,090 flows in Chicago-Naperville-Elgin, IL-IN-WI (M3); d 57,311 flows in Dallas-Fort Worth-Arlington, TX (M4); e 40,628 flows in Houston-The Woodlands-Sugar Land, TX (M5); f 46,261 flows in Washington-Arlington-Alexandria, DC-VA-MD-WV (M6); g 45,025 flows in Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (M7); h 39,593 flows in Miami-Fort Lauderdale-Pompano Beach, FL (M8); i 43,147 flows in Atlanta-Sandy Springs-Alpharetta, GA (M9); j 40,443 flows in Phoenix-Mesa-Chandler, AZ (M10).
Regarding urban centrality factors, this study focuses on population and employment distributions. As population density increases from low to high in different sub-regions, cities shift from a monocentric to a polycentric structure74. This transition can be viewed through the lens of urban growth: while higher population densities initially support single central cores, issues such as traffic congestion, resource accessibility challenges, and the demand for diverse economic activities drive the shift towards a polycentric urban spatial structure. For employment, we use the job count of each census tract at the commuting flow destination to create a log-log plot of job distribution versus its rank (Fig. 1b). The slope for each MSA reflects the rate at which the number of employment opportunities declines. A larger negative slope indicates a slower decline in employment, suggesting a polycentric structure with multiple employment centers. A strong correlation is detected between λ and both indices (population density at 0.74 (p-value < 0.05) and employment at 0.70 (p-value < 0.05), Supplementary Fig. 1). The cost of long-distance commuting is lower in a more polycentric city with multiple centers than that in a more monocentric city.
These results prove that the predicted human mobility from remotely sensed data in cities is closely related to urban morphology and spatial structure, by closely resembling the real commuting flow distribution and offering valuable insights. Satellite imagery can help predict the human mobility flows in unseen cities and obtain corresponding distance-decay human mobility patterns under different urban spatial structures.
Spatial Heterogeneity
After analyzing the modeling results, we observed that the prediction accuracy of OD flows varies across different locations, yet these spatial differences are not adequately represented by the metrics in Section ‘Experiments and Model Performance’. To gain additional insights from the Imagery2Flow model, we explore the spatial heterogeneity of the prediction results across the MSAs. Figure 3 shows the spatial distribution of OD flow prediction errors in New York-Newark-Jersey City (M1), and Supplementary Figs. 2–10 in the Supplementary Note 1 show the results of the remaining nine MSAs. Overall, the Imagery2Flow and its variant Imagery2Flow-LGBM have higher prediction accuracy around suburban areas in each MSA, while the relatively lower accuracy is found near the urban center. This indicates that the model also has saturation effects from areas with high intensity flows or large populations, where larger errors are more prevalent. This phenomenon is similar to the tasks of estimating population75,76,77, vegetation indices78, carbon emissions79 or other variables from remote sensing data sources (including from multi-spectral images and nighttime light data), which is more likely to have larger errors in areas with high values.
Factors that are prone to errors in the OD flow prediction model include differences in the socioeconomics of residents in different neighborhoods80,81, land use and vegetation. We first collect the land use82 and land cover83 data to explore predictive accuracy in different groups of census tracts to reveal the patterns between human mobility and urban context. Figure 4c and d present the OD flow prediction accuracy (average CPC) of different groups, which is divided by the dominant land type of the flows’ origin areas. As shown in the areas circled in Fig. 4a and b (and Supplementary Figs. 11–19a–b), different factors contributing to the spatial heterogeneity of prediction errors are represented by circles of different colors. The circles labeled ‘A’ signify areas with vegetation cover, including greenbelts, wildlife management areas, and regional parks. The appearance of vegetation in remote sensing images is very different and the features are complex and diverse, which makes the model perform poorly in areas dominated by vegetation, such as forests and wetlands (indicated in Fig. 4d). The circles labeled ‘B’ represent land use types and urban functions, such as recreational areas including encampment, golf and bowling clubs, educational facilities like schools, industrial lands, special types of land such as airfield, commercial zones and landmarks. Buildings on these areas may have similar appearances to residential or workspaces (as indicated in Fig. 4c), leading to overestimation of mobility flows by the model. This is the most common factor in places close to the downtown area in cities. In addition, the circles labeled ‘C’ signify socioeconomic heterogeneity, such as neighborhood income levels, age of residents, and ethnic demographics. For instance, it is reasonable that the model significantly overestimates the commuter flow in areas with households of median age greater than 55 years old.

Spatial distribution of flow prediction errors by Imagery2flow-LGBM according to the origins using a Sentinel-2 imagery (10-meter spatial resolution) and b Landsat-8 imagery (30-meter spatial resolution), respectively. Note that the labels A, B, and C on the map show the locations with different land covers, land uses, and urban socioeconomic factors. The label A represents areas with vegetation cover, including greenbelt, wildlife management areas, and regional parks; The label B represents areas with diverse land use types and urban functions, such as recreational areas including encampment, golf and bowling clubs, educational facilities like schools, industrial lands, special types of land such as airfield, commercial zones and landmarks; Label C shows areas with diverse socioeconomic heterogeneity, such as neighborhood income levels, age of residents, and ethnic demographics. c Prediction accuracy of different land use groups in terms of Common Part of Commuters (CPC). d Prediction accuracy (Average CPC) of different land cover groups. Total 4953 origin census tracts are included in New York.
Model spatiotemporal transferability
In order to examine the spatiotemporal transferability or generalizability of the model, which is the ability to estimate human mobility flows in other space or time when trained only with data from one specific MSA (we use the largest city name in each MSA as the label of each MSA), we collected the data of top-10 largest MSAs, with different geographies, development histories and lifestyles. Demonstrating the ability of a time- and space-specific model to generalize across diverse regions and timeframes is of considerable importance77. Spatiotemporal contexts with limited data availability may benefit from leveraging models trained on contexts with abundant data resources. In addition, it helps to explore and explain different spatial interaction patterns of human dynamics across cities.
The mini-batch sampling strategy facilitates the transferability of our model. To simplify the explanation, each pair “City A-City B” indicates that the model is initially trained on data from city A, and then the learned weights are directly employed on city B’s data to predict OD flows in B. In the spatial transfer experiments across all MSAs summarized in Fig. 5, the overall performance of models trained within this MSA (diagonal) is better than that of models trained across MSAs, for training on both 10-meter and 30-meter resolution images. Imagery2Flow’s transfer matrix displays notable patterns (Fig. 5a and b), while Imagery2Flow-LGBM significantly enhanced the overall transferability (Fig. 5c and d). Imagery2Flow employs an end-to-end deep learning framework for inferring OD flows directly from satellite visual features, making its transfer performance inherently dependent on the relevance of these visual features. In contrast, Imagery2Flow-LGBM enhances its approach by retraining the LGBM decoder using features that integrate aggregated neighborhood information and spatial distance. We found two pairs “Dallas-Houston”(M4-M5) and “Washington D.C.-Philadelphia”(M6-M7), that performed particularly well while some other metropolitan pairs showed poor spatial transferability results by Imagery2Flow. Our experiment results reflect the weak replicability in geospatial AI models under which the model architecture or specification might be generalizable but the model’s weights or parameters might be allowed to vary spatially due to the influence of spatial heterogeneity of geographic environments84.

The Common Part of Commuters (CPC) values of the Imagery2Flow model and the Imagery2Flow-LGBM model trained in one Metropolitan Statistical Area (MSA) (on the x axis) and applied to other MSA (on the y axis). a Using Sentinel-2 (10-meter) images with bilinear decoder. b Using Landsat-8 (30-meter) images with bilinear decoder. c Using Sentinel-2 (10-meter) images with LGBM decoder. d Using Landsat-8 (30-meter) images with LGBM decoder. M1-M10 represent the ten United States metropolitan statistical areas: New York-Newark-Jersey City, NY-NJ-PA (M1); Los Angeles-Long Beach-Anaheim, CA (M2); Chicago-Naperville-Elgin, IL-IN-WI (M3); Dallas-Fort Worth-Arlington, TX (M4); Houston-The Woodlands-Sugar Land, TX (M5); Washington-Arlington-Alexandria, DC-VA-MD-WV (M6); Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (M7); Miami-Fort Lauderdale-Pompano Beach, FL (M8); Atlanta-Sandy Springs-Alpharetta, GA (M9); and Phoenix-Mesa-Chandler, AZ (M10).
Imagery2Flow’s transfer performance or generalization capability across regions is affected by land cover type and land use pattern in different areas; the performance correlates with the geographic context similarity. Existing studies have found that green space exposure may help with active transportation (e.g., biking, walking) and the use of public transportation (e.g., subways)85,86, and a lack of vegetation may indicate highways, thoroughfares, etc., which might increase automotive travel and thus link to distinctive human mobility patterns. As shown in Fig. 6, Washington D.C.-Philadelphia (M6-M7) shows the similarity of their geographic contexts, which explains the effective spatial transferability between them to some extent. Most MSAs with extremely low CPC values in flow prediction have different main land cover types, such as Los Angeles-Atlanta (M2-M9), and Miami-Atlanta (M8-M9), etc., on both 10-meter and 30-meter imagery (in Fig. 5a and b).

New York-Newark-Jersey City, NY-NJ-PA (M1); Los Angeles-Long Beach-Anaheim, CA (M2); Chicago-Naperville-Elgin, IL-IN-WI (M3); Dallas-Fort Worth-Arlington, TX (M4); Houston-The Woodlands-Sugar Land, TX (M5); Washington-Arlington-Alexandria, DC-VA-MD-WV (M6); Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (M7); Miami-Fort Lauderdale-Pompano Beach, FL (M8); Atlanta-Sandy Springs-Alpharetta, GA (M9); and Phoenix-Mesa-Chandler, AZ (M10).
But the model’s spatial transferability still cannot be explained by land cover alone. For “Dallas-Houston”(M4-M5), it works well in places where the main land use types are different. But for “Philadelphia-New York”(M7-M1) and “Washington D.C.-New York”(M6-M1), it doesn’t work well with the same main land cover type. This mismatch can be further explained by the spatial forms of population distribution and urban functions in metropolitan areas, which is what urban geographers have long been concerned about and committed to studying. Four distinctive types of residential and nonresidential land use patterns (i.e., typologies of urban sprawl) were identified for all metropolitan areas in the U.S., of which the ten MSAs in this study fell into three71: (1) Less-Intensive, Most-Compact, Less-Mixed, Less Monocentric Development, including Los Angeles (M2), Chicago (M3), Miami (M8), Atlanta (M9), and Phoenix (M10), with little mixing of jobs and housing at a fine-grained scale; (2) More-Intense, More-Compact, More-Mixed, Polycentric Development, including New York (M1), Dallas (M4), and Houston (M5), with more fine-grained mixing of jobs and housing; (3)Least-Intensive, Less-Compact, Most-Mixed, Most-Monocentric Development, including Washington DC (M6) and Philadelphia (M7), with high degree of fine-grained mixing of housing and jobs. The typologies help explain the model transferability: two metropolitan areas with similar land use patterns tend to transfer well, as their density of buildings, the spatial arrangement patterns of terrestrial objects, and the jobs and housing matching exhibit high similarity, which is linked to the neighborhood information aggregation and message passing mechanisms in the graph neural networks. Conversely, when urban elements and their spatial patterns diverge, the model transfer effectiveness tends to diminish. The transferability performance is influenced by multiple factors including land cover and land use patterns, population and socioeconomic activity distributions, and spatial structure. The interaction between these factors, combined with data noise, results in variations in model transferability at “A-B” and “B-A”, and at different spatial resolutions. The complex quantitative relationship among these spatial contextual characteristics deserves further exploration87.
To further verify that the model has truly learned the relationship between urban environments by satellite imagery and human movements, we also applied the models trained on the 2020 imagery to previous years’ data spanning approximately 10 years. Since there were no Sentinel-2 images available before 2015, we collected Landsat-5 images from 2010 and 2002 (2004 for M10). As shown in Table 2, the Imagery2Flow-LGBM model performance in 2010 and 2002 is good (with CPC metric above 0.67 across all ten MSAs), which demonstrates very good temporal transferability. Over time, such as years, cities grow as reflected in satellite imagery by the proliferation and densification of buildings and roads, as well as the expansion of urban boundaries, etc. Imagery2Flow-LGBM has the temporal transferability to be applied to remote sensing imagery from 10 or even 20 years ago to infer the human mobility flows with good accuracy, proving that this model has learned the underlying relationship between complex geographic contexts in satellite images and human mobility flows.
All the above experiment results indicate that satellite imagery has the ability to reveal human mobility patterns, which can be modeled by the Imagery2Flow or Imagery2Flow-LGBM models trained at timestamps and geographic regions with rich data, to predict mobility flows in another time period or other areas with similar spatial contexts where real mobility flows from mobile phone tracking or survey data are unavailable.
