Crop statistics
Firstly, we collected different sources of crop statistics in different study regions (Table 1), which provide the total amount of crop-specific areas. FAOSTAT, the biggest dataset in the field of food and agriculture, provides free access to food and agriculture data for over 245 countries and territories from 1961 to the most recent year available (https://www.fao.org/faostat/en/#data). Here, Crops and livestock products datasets in FAOSTAT’s Production domain were collected, which contained information on harvested area of more than 143 crop types. According to the categories of SPAM products, it is integrated into 42 crop categories through FAO code (Table S2–1).
In countries with large land area such as USA and China, national statistics are insufficient to capture the details of spatial changes. Therefore, we collected statistics at a sub-national level from national official reports in these two counties. As for USA, National Agricultural Statistics Service (NASS) from United States Department of Agriculture (USDA) provides access to census or survey of crop harvested area at the state level (https://quickstats.nass.usda.gov/). In China, statistics of crop-specific areas were acquired from China’s economic and social big data research platform operated by China National Knowledge Infrastructure. Due to the differences in crop categories from multiple sources, we matched 15 and 14 crop types respectively in USA and China according to the categories of SPAM products (Table S2–2, Table S2–3).
In addition, it is also a vital step to relate statistics to georeferenced locations. The Global Administrative Unit Layers (GAUL), one of the most standard spatial datasets of administrative units, compiles and disseminates the best available information on different levels of administrative units for countries in the world. Here, we used GAUL at national levels (ADM0) and sub-national levels (ADM1) to form geo-referenced crop harvest datasets respectively in Africa and USA/China (Table 1).
Base maps
Compared to statistics which represent total amount, base maps provide a detailed portrayal of crop-specific areas, which will be essential for spatiotemporal dynamic modeling. There are multiple maps of crop-specific areas with global coverage and detailed categories. And among which SPAM2010 product were chosen in this study for its sophisticated approaches and latest updates, providing key information about the distribution of crop-specific areas in regions in which remote sensing techniques have not been used widely in crop mapping such as Africa (Table 2).
However, more accurate and timely crop distribution information in countries like USA can be acquired by higher spatial resolution maps produced by remote sensing and field survey. Therefore, we adopted Crop Data Layer (CDL) as the base map in USA (Table 2). The correspondence between CDL’s crop categories and SPAM is presented in the Supplement (Table S2–2). As for China, there is still not an integrated multi-type crop map with medium-high spatial resolutions (10–30 m). Therefore, we collected multi-source base maps of 8 crop types to produce more accurate and timely references (Table S3–1), as for the rest crop types and the regions that are not covered, we still used SPAM2010 as the base map. To ensure uniform resolution, we calculate the proportion of the crop area within a 10 km (~5 arcmin) grid for the base map with higher resolution (10m-30m).
Cropland extent
Cropland extent determines where statistics of crop-specific areas can be allocated. In order to provide an annual-updated basis to track spatiotemporal dynamics, we integrated two datasets to produce annual cropland extent from history to the most recent year. It is an efficient and accurate way to identify cropland from classified land cover products. FROM-GLC Plus provides a framework for near real-time land cover mapping at multi-temporal (annual to daily) and multi-resolution (30 m to sub-meter) levels43. Here, we used FROM-GLC Plus Global Land Cover Products (1982–2021, 1 km subpixel) to extract cropland extent from 1982 to 2021. In periods where there were rarely remotely sensed images (before the 1980s), GCD (Global Cropland Dataset), a collection of historical cropland maps generated by spatially allocating cropland statistics, was used as extent44 (Table S3-5). Given that crop production may take place over several seasons within a year, we also multiplied the cropland extent by the cropping intensity to get the annual maximum harvested area. Annual cropping intensity datasets cover periods from 2001 to 201945. For the remaining years, data from the nearest year is used as a substitute (Table S3-6). To ensure uniform resolution, we resampled the cropland extent and maximum harvested area of all year periods to 10 km (~5 arcmins).
Spatial indicators
To better model spatiotemporal dynamic of crop-specific areas, a total of 26 related spatial indicators were collected from 7 aspects, including climate, agro-system, suitability, potential yield, soil, terrain, and location (Table S3-10). These indicators are proved to be related to the distribution of crop-specific areas in previous studies. To ensure uniform resolution, we resampled the spatial resolution of all spatial indicators to 10 km (~5 arcmin).
Climate determines the areas in which crops are suitable for cultivation, and climate change affects the potential yields and revenue, thus affecting farmer decisions. It is reported that climatic variables explain considerable portions of the variance in crop planted area (22–30%), harvestable fraction (15–28%) and yield (32–50%)46. In this study, five climate variables were calculated from ERA5-Land including annual mean temperature (temp), total annual precipitation (prec), downwards surface solar radiation (radi_down), evaporation from vegetation transpiration (evap_veg) and growing degree day (gdd) (Table S3-10). We present a detailed description of these variables in the Supplement (Sect. S3-2).
In addition, terrain and soil properties are also essential variables of crop suitability. However, these variables are only available at single time period unlike climate variables that updated annually. More specifically, slope and elevation in the year 2010 were calculated from GMTED2010 as the terrain variables while aggregation of soil water content, PH, texture class, organic carbon content, sand content and clay content at different depths were calculated from OpenlandMap datasets as soil variables. We present a detailed description of terrain and soil variables in the Supplement (respectively Sect. S3-3 and Sect. S3-4).
We also selected the suitability assessment result of each crop as input for this prediction model. GAEZv4.0 produces a gridded suitability assessment for 48 major crops in two input levels (i.e., high, low), and two water supply regimes (i.e., irrigated or rainfed) at 5 arcmin resolution. The correspondence between GAEZ v4.0’s crop categories and SPAM is presented in the Supplement (Table S2-4). Most of the SPAM2010 crops are included in GAEZ’s crop categories, those not included are assigned values from similar crops. Potential yield has a greater impact on farmer decisions when multiple crops are suitable to cultivate, this variable could also be accessed by GAEZv4.0 product in two input levels and two water supply regimes. The suitability index and potential yield in three regimes (irrigated, rainfed and high input, rainfed and low input) were selected as input indicators. We present a detailed description of suitability and potential yield variables’ processing progress in the Supplement (respectively Sect. S3-5 and Sect. S3-6).
Agro-system variables are directly related to crop distribution. Firstly, cropland maps determine where and to which extent crop could be cultivated (also mentioned above). Furthermore, crop cultivation suitability and potential yield vary greatly in irrigation or rainfed regimes. Therefore, we integrated two datasets (HID and SPAM) to extract irrigation area proportion in a long time series. HID (historical irrigation data set) provides estimates of the temporal development of the area equipped for irrigation since 1900 at 5 arcmin resolution47, while SPAM contains irrigation area proportion information in year 2000, 2005 and 2010. What’s more, it is said that the proportions of crops being distributed on farms of different sizes vary greatly48. So we select a global field size map as an indicator which classifies farm size into 5 classes49. Last but not least, rural population is closely related to agricultural production, which can be considered as a measure of agricultural labor and also market accessibility. Estimates of rural population from HYDE (History Database of the Global Environment) were selected as the last indicator of the agro-system group50. We present a detailed description of agro-system variables’ processing progress in the Supplement (respectively Sect. S3-7, S3-8, S3-9, S3-10).
Workflow design
There are mainly two steps in annually updating spatiotemporal dynamics of crop-specific areas. The first step is to generate probabilistic spatiotemporal dynamics layers through machine learning based on spatial indicators and base maps, while the latter step is to allocate crop statistics based on the probabilistic layer and harmonize the result based on multiple constraints (Fig. 1).

Generating probabilistic spatiotemporal dynamics
In this part, we trained machine learning models for each crop in each region based on spatial indicators and base maps. To begin with, we collected samples on the base map using a stratified random strategy. The sample set of each crop is set to collect a maximum of 10000 points at the resolution of 10 km. We only collected samples at the pixel where crop-specific area exists. Combined with spatial indicators mentioned above, we chose random forest (RF) regression model to learn the rules by which crop distribution is affected by 26 spatial indicators from sample sets. RF has been proven to be effective and efficient in dealing with multi-dimensional features51 and RF regression has been widely used in tasks of predicting the distribution of geospatial targets such as population, cropland, etc.44,52. More specifically, we use the random forest algorithm in Google Earth Engine to build the regression model. According to the results of sensitivity analysis, the number of decision trees is set to 100 (more details can be found in Supplement Information), and the number of features required for each node for splitting is set to the square root of the number of input features set by default.
Statistic allocation and harmonization
In this part, crop statistics were allocated based on probabilistic layers predicted by the RF regression model and harmonized according to multiple constraints. More specifically, we first calculate the sum of probabilistic values at the administrative unit level (excluding areas with no cropland or suitability index of 0). The sum value was used to compare with crop statistics at the same administrative unit and the rate was used to adjust probabilistic layers at this administrative unit (Eq. (1)). The descriptions of variables in the equations are listed in Table 3 (the same below).
$${{AdjProb}}_{{ijy}}=\frac{{{Stat}}_{{jky}}}{{\sum }_{i\in k}{{Prob}}_{{ijy}}\times {{PixelArea}}_{i}}\times {{Prob}}_{{ijy}}$$
(1)
Then, we defined the maximum crop harvested area in each grid as the limit that no more crop statistics could be allocated (Eq. (2)). More specifically, the maximum crop harvested area is calculated by multiplying cropland and crop intensity (more data sources descriptions in Data Preparation Section).
$${{MaxHarvArea}}_{{iy}}={{Cropland}}_{{iy}}\times {{Crop}{Intensity}}_{{iy}}\times {{PixelArea}}_{i}$$
(2)
Further, we used \({{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}}\) to represent the conflict between the maximum crop harvested area and statistics allocation of all crops in total (Eq. (3)). In order to compare in the same unit, we transformed the adjusted proportion of crop-specific areas from proportion (0-1) into area (ha) by multiplying \({{PixelArea}}_{i}\).
$${{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}}={{MaxHarvArea}}_{{iy}}-\sum _{j}{{AdjProb}}_{{ijy}}\times {{PixelArea}}_{i}$$
(3)
Statistic allocation was performed in a crop-by-crop order which is specified in each administrative unit. We classified the crop type list into several types: annual or perennial, specific name (e.g. wheat) or general name (e.g. other cereals). The priority of perennial crops is higher than annual ones, and then crops with specific name will be processed first compared with that of general names. We further ranked the crop type list in each sub-group according to the crop area statistics in the administrative unit in descending order. To reduce errors due to inconsistent definitions of cropland, permanent crops were excluded from the cropland extent constraints (See more details in Supplement Information S2 Crop categories). It’s worth noting that when maximum crop harvested area is achieved in a certain grid \({\rm{i}}\), the statistic of the crop type \({\rm{j}}\) being processed would not be allocated. The crop type \({\rm{j}}\) and crop types behind \({\rm{j}}\) in the ranked crop list would be saved in a grid-specific list \({{LeftCrop}}_{{iy}}\). We further summed the statistics still not be allocated yet (\({{\rm{StatLeft}}}_{{jky}}\)) in a grid where maximum of crop harvested area is achieved based on \({{LeftCrop}}_{{iy}}\) (Eq. (4)).
$${{\rm{StatLeft}}}_{{jky}}=\sum _{i\in k}{{AdjProb}}_{{ijy}}\times {{PixelArea}}_{i}\;{if}{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}} < 0{and}j\in {{LeftCrop}}_{{iy}}$$
(4)
We adopted two strategies to deal with statistics still not being allocated yet (\({{\rm{StatLeft}}}_{{jky}}\)). The first one is to allocate crop statistics to the grids where the maximum crop harvested area is not achieved according to the probability value predicted by the RF model (Eq. (5)). If there are statistics still not allocated yet, the latter process will allocate crop statistics to the grids where the maximum crop harvested area is not achieved according to the rest space of crop harvested area (Eq. (6)).
$${{AdjProb}}_{{ijy}}+=\frac{{{StatLeft}}_{{jky}}}{{\sum }_{i\in k,{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}} > 0}{{AdjProb}}_{{ijy}}\times {{PixelArea}}_{i}}\times {{AdjProb}}_{{ijy}}{if}{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}} > 0$$
(5)
$${{AdjProb}}_{{ijy}}+=\frac{{{StatLeft}}_{{jky}}}{{\sum }_{i\in k,{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}} > 0}{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}}}\times \frac{{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}}}{{{PixelArea}}_{i}}{if}{{\rm{HarvAreaL}}{\rm{eft}}}_{{iy}} > 0$$
(6)
At last, we got the adjusted proportions of crop-specific areas (after Eqs. (5) or (6)), and the final result \({{\rm{CropArea}}}_{{ijy}}\) represents the harvested area (ha) of crop type \({\rm{j}}\) in grid \({\rm{i}}\) in year \({\rm{y}}\) (Eq. (7)).
$${{\rm{CropArea}}}_{{ijy}}={{AdjProb}}_{{ijy}}\times {{PixelArea}}_{i}$$
(7)
