Is it possible to detect extreme rainfall events areas by clustering spatio-temporal data?
The intensification of weather extremes, which is dramatically changing the climate scenario worldwide, is currently thought to be one of the most important factors related to the greenhouse effect and climate change1,2,3,4,5,6,7,8. The increase in the frequency and intensity of daily temperature extremes has contributed to a widespread escalation of daily precipitation9,10. Moreover, severe weather and climate events, interacting with exposed and vulnerable human and natural systems, can lead to disasters that require an extraordinary ability to adapt2. Mainly for this reason, the study of climate change today is not only about temperature increase but also focuses on catastrophic extreme rainfall events and drought7,11. The concept of extreme precipitation and its changes in response to warming are well described in12. Accordingly, the scientific community faces an increasing demand for regularly updated estimates of evolving climate conditions and extreme weather events1,11,13. Moreover, a correlation between changes in heavy precipitation and landslides in several regions has been found in2. More specifically, it is possible to identify three examples of extreme weather events that have raised the question of a potential link to climate change: more intense precipitation events, increased summer drying over most mid-latitude areas, and increases in tropical cyclone peak wind intensities14. These results show that extreme rainfall events are related to climate change15 and trigger a chain of reactions involving several human activities. The change in temperatures will, in fact, have serious long-term effects16,17,18, while extreme rainfall events also pose a short-term danger to the environment and the population19.
In recent years, several extreme events around the world have caused great loss of life, as well as a sharp increase in economic losses from weather hazards20. Such disasters have led public opinion to consider climate change as the main cause of these events21 and to analyse in depth the economic consequences of climate change in terms of investments and productivity22,23. A relevant example is the wine industry. For instance, in the past two decades, Sicilian winemakers have enhanced the biological production of wine all around the island, especially on the slopes of Mount Etna. Although wine is not essential to human survival, it is an important product of human ingenuity and its economy is rapidly growing24. Agricultural activities depend on climate and are interconnected with weather changes. Any shift in climate and weather patterns may potentially affect the entire local wine industry25 and the stability of many crops, thus undermining the related economies23. Considering all of these aspects, in Mediterranean areas rainfall is probably the most important climatic variable, because it manifests either as a deficient resource (dryness) or as a catastrophic agent, such as water bombs26. Measuring precipitation, however, poses many challenges. For instance, in situ measurements are especially affected by wind effects on the gauge catch, particularly for snow but also for light rain16. To reduce this uncertainty, it is crucial to analyze spatio-temporal data in the most efficient way27,28.
In this regard, over the last decades scientists have conducted several studies on rainfall time series. These studies investigated potential trends in different rainfall indicators, such as total and maximum annual precipitation and mean daily intensity29,30,31. A tendency toward higher frequencies of heavy and extreme rainfall emerged for some areas32. In most of these areas, an increase in total precipitation has also been observed, for instance in26, thanks to the analysis of 247 stations over the 1921–2000 period. However, the correlation between the increase in total precipitation and extreme events is not always clear: in other areas (e.g. Italy) several authors have observed an increase in heavy precipitation together with a tendency towards a decrease in the total amount of precipitation33. Among the studies mentioned, a few focused specifically on the Mediterranean areas, given their peculiar climate, which is affected by interactions between mid-latitude and tropical processes and lies between the arid climate of North Africa and the temperate and rainy climate of central Europe. For these reasons, even relatively minor modifications of the general circulation can lead to substantial changes in the Mediterranean climate29,30,31,32,33,34,35, including rainfall frequency36, thus making these areas vulnerable to climatic changes and in particular to catastrophic precipitation.
In this setting, scientists analysed the region of Sicily to identify climate change signals, as for instance in37. In most of those studies, the authors analysed annual, seasonal and monthly rainfall data over the entire Sicilian region, showing an overall reduction in the total amount of annual rainfall37. For example, in29 the annual maximum rainfall for fixed durations of 1, 3, 6, 12 and 24 h, and the daily rainfall series recorded from 1956 to 2005 in approximately 60 stations, were analyzed using the non-parametric Mann–Kendall test38,39.
The results of this study confirmed an increasing trend for rainfall of short duration, in particular for the 1 h duration. On the other hand, time-persistent rainfall exhibited a decreasing trend38,39. In particular, heavy-torrential precipitation has been reported to be more frequent at a regional scale, while light rainfall has shown negative trends at some sites. In40 the presence of linear and non-linear trends in 16 series from rain gauge stations, mostly located in eastern Sicily, was studied. The results indicated a different behaviour according to the time scale: for short durations, the historical series generally presented increasing trends, which switched to decreasing for longer durations.
A total of 67 sites with daily precipitation records over the 1951–1996 period in Italy were also analyzed in33, considering seasonal and yearly total precipitation, number of wet days and precipitation intensity, with the aim of evaluating the trends both from the single-station records and for larger areas by using averaged series. The results showed that the trend in the number of wet days per year was significantly negative throughout Italy, stronger in the north than in the south, and especially so in winter. A tendency towards an increase in precipitation intensity was also found, although it was overall weaker and less significant than the decrease in the number of wet days.
In41 the authors identified the presence of homogeneous areas over Sicily using Regional Frequency Analysis (RFA), a procedure that estimates the frequency of rare events at one site by using data from several sites42 and is frequently used in the analysis of environmental data43. They also performed a Principal Component Analysis (PCA) followed by a cluster analysis based on the K-Means method to identify regional groups, starting from annual maximum series for rainfall durations of 1, 3, 6, 12 and 24 h over about 130 rain gauges.
One of the most interesting papers studying different rainfall time series in Sicily is32, where the authors investigated temporal changes in extreme rainfall by performing a regional study. In particular, a regional frequency analysis based on the L-moments approach44 was applied to 1, 3, 6, 12 and 24 h annual maximum rainfall (AMR) series grouped by homogeneous regions, identified through a hierarchical cluster analysis45. Changes were investigated over the long term (from 1928 to 2009), with special reference to the last forty years. The study32 detected an increasing trend in extreme rainfall events between 2003 and 2009, with several heavy localized storms all over Sicily and a remarkable tendency towards more intense storm events during the 2000s, affecting mainly the outer western part of the region. By contrast, the increasing trend in extreme rainfall detected in eastern Sicily was considered only apparent, as it was related to a few severe local storms.
In our work we present for the first time a multi-modal spatial and temporal clustering analysis of rainfall data over Sicily, performed using the Affinity Propagation clustering algorithm46. The novelties are manifold. First of all, we collected a new dataset, which we named RSE (Rainfall Sicily Extreme), offering an original perspective on extreme events occurring from 2009 up to 2021, as witnessed by the alarmingly violent rainfall events that occurred in East Sicily at the end of 202147,48. Moreover, the analysis was performed directly on the whole time series, without defining any specific statistical indicator or feature extracted from the data. In this way we avoid the risk of introducing any bias or a priori assumptions, such as homogeneity of the whole of Sicily or its sub-regions, as well as the need to perform dimensionality reduction. Additionally, the data preprocessing phase allowed us to remove data inconsistencies. Finally, the Affinity Propagation algorithm, successfully used in other contexts49,49,50,52, is here applied to climate data for the first time.
Unlike in32 or35, in our study clustering is used not only to identify homogeneous sub-regions, but also to detect critical rainfall sites. Moreover, while in32 the authors focused on finding long-term trends, we concentrated on short-term changes between 2009 and 2021, analyzing high-frequency data, so as to obtain clusters specifically related to extreme events.
Based on the RSE dataset, we carried out the following steps:
- We clustered regions and detected extreme sites according to rainfall data observations.
- We used a multi-modal approach to merge both geographical and temporal information.
- We defined rainfall indicators to further validate the clusters and their meaning.
- We detected an increasing trend in extreme events in East Sicily, in agreement with the state-of-the-art results in32.
Figure 1 shows the corresponding methodology flowchart.

Flowchart of the methodology and the timeline used in this study: data collection, clustering and statistical validation, comparison with other algorithms, conclusions and policy implications.
The paper is structured as follows: in “RSE: the rainfall Sicily extreme dataset” section the regional dataset used in the analysis, including the data pre-processing, is presented. In “Methods” section the methods applied in the study, in particular the adopted clustering algorithm, and the statistical validation methods, are introduced. In “Results and discussion” section we report the discussion of the results, concerning each analyzed variable, and the most relevant conclusions drawn. Furthermore, we report in the supplementary material the analysis concerning the annual histograms of specific rain gauges and local data plots at different levels, as well as the complete annual clustering results.
RSE: the rainfall Sicily extreme dataset
The dataset used in this analysis consists of geographical rainfall records with a 10-minute periodicity from 2009 to 2021, provided by SIAS, the Servizio Informativo Agrometeorologico Siciliano53. The dataset, together with the code, is available in the GitHub repository54.
The most common rainfall measurement gathered from the database is the number of millimeters (mm) of rain in a given period. Accordingly, six collections were considered, as described in Table 1. C.A and C.B contain 13 datasets per station (one per year), with the original data and the weekly mean data, respectively. C.C and C.D include one full dataset per station (covering all the records from 2009 to 2021), with the original data and the weekly mean data, respectively. C.\(A_{s}\) and C.\(B_{s}\) are subsets of C.A and C.B, respectively, in which one station at a time is considered, so that each of them includes 13 datasets.
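As an illustration of how the weekly mean collections could be derived from the 10-minute records, the following is a minimal sketch. It assumes, since the text does not specify it, that a week is an ISO calendar week and that the weekly value is the arithmetic mean of the 10-minute readings. The function name `weekly_mean` and the record layout are hypothetical and are not taken from the authors' repository.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def weekly_mean(
    records: Iterable[Tuple[datetime, float]],
) -> Dict[Tuple[int, int], float]:
    """Average rainfall readings (mm) per ISO week.

    Assumption: a week is an ISO calendar week, keyed as (ISO year, ISO week).
    """
    sums: Dict[Tuple[int, int], float] = defaultdict(float)
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for timestamp, mm in records:
        iso = timestamp.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        sums[key] += mm
        counts[key] += 1
    # Return the weeks in chronological order.
    return {key: sums[key] / counts[key] for key in sorted(sums)}


# Hypothetical 10-minute readings for one station.
example_records = [
    (datetime(2021, 1, 4, 0, 0), 0.0),
    (datetime(2021, 1, 4, 0, 10), 0.5),
    (datetime(2021, 1, 11, 0, 0), 1.5),
]
example_weekly = weekly_mean(example_records)
```

Applying such a function to one year of records per station would correspond to the layout of C.B, and applying it to the full 2009 to 2021 record would correspond to C.D, under the stated assumptions.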
Data preprocessing
We will now describe the initial data selection process, obtained through the analysis of annual data. On the basis of an initial graphical analysis reported in the SI document, we decided to select the most extreme stations. A station is considered extreme if it is possible to observe a high amount of rain in a relatively short time interval. We implemented this concept of “extremeness” using the following strategy.
First, we considered the following data for all the 96 available stations in Sicily and for all the years:
- The total annual precipitation in mm (tot).
- The percentage of rainy days over the year (rd), where a rainy day is a day with more than 1 mm of rain.
- The mm of rain during the rainiest day in the year (dmax).
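The three annual indicators can be sketched as follows for a single station and year. This is a minimal illustration assuming that the 10-minute records have already been aggregated into daily totals in mm; the function name `annual_indicators` and the input layout are hypothetical.

```python
from typing import Dict, Sequence


def annual_indicators(
    daily_mm: Sequence[float], wet_threshold_mm: float = 1.0
) -> Dict[str, float]:
    """Compute tot, rd and dmax from the daily totals (mm) of one year."""
    n_days = len(daily_mm)
    if n_days == 0:
        raise ValueError("at least one daily value is required")
    # A rainy day has strictly more than the threshold (1 mm in the text).
    rainy_days = sum(1 for mm in daily_mm if mm > wet_threshold_mm)
    return {
        "tot": sum(daily_mm),                # total annual precipitation
        "rd": 100.0 * rainy_days / n_days,   # percentage of rainy days
        "dmax": max(daily_mm),               # rainiest day of the year
    }


# Hypothetical four-day "year", used only to show the computation.
example_indicators = annual_indicators([0.0, 2.0, 10.0, 0.5])
```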
Afterwards, a selection strategy was applied. Extreme rainfall events are generally characterized by an increase in drought, in excessive wetness, or in both26. The logical rule below captures precisely these characteristics:
(1) Fix a station.
(2) Compute \(\mu _{1}\): the mean over years of the rd annual indicator.
(3) Compute \(\mu _{2}\): the mean over years of the dmax annual indicator.
(4) Fix a year y.
(5) If the rd value in year y is less than \(\mu _{1}\) and the dmax value in year y is greater than \(\mu _{2}\), then year y is considered extreme; otherwise it is not.
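The rule above can be sketched in a few lines for a single station. This is a minimal illustration, not the authors' implementation: the function name `extreme_years` and the dictionary layout (year mapped to indicator value) are assumptions.

```python
from statistics import mean
from typing import Dict, List


def extreme_years(
    rd_by_year: Dict[int, float], dmax_by_year: Dict[int, float]
) -> List[int]:
    """Return the years flagged as extreme for one station.

    A year y is extreme when rd(y) < mu_1 and dmax(y) > mu_2, where mu_1 and
    mu_2 are the means over years of rd and dmax for that station.
    """
    mu_1 = mean(rd_by_year.values())
    mu_2 = mean(dmax_by_year.values())
    return [
        year
        for year in sorted(rd_by_year)
        if rd_by_year[year] < mu_1 and dmax_by_year[year] > mu_2
    ]


# Hypothetical indicator values for one station (not taken from the RSE data).
example_rd = {2009: 30.0, 2010: 20.0, 2011: 25.0}
example_dmax = {2009: 40.0, 2010: 90.0, 2011: 50.0}
example_extreme = extreme_years(example_rd, example_dmax)
```

In this toy example \(\mu _{1} = 25\) and \(\mu _{2} = 60\), so only 2010 combines fewer rainy days than average with a rainiest day above average.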

Location of rainfall gauging stations in Sicily.
Since the procedure works year by year, we selected the stations satisfying the extreme event detection rule in at least 3 years (requiring only one year would have retained 85 of the 96 stations, that is, almost all of them). In this way, we obtained 32 of the 96 rain gauges. Furthermore, we decided to include all of the provincial capitals in the region, thus obtaining the 34 stations shown in Fig. 2.
After the selection, we inspected the rainfall time series station by station, using full, annual, and monthly data plots, as well as plots of the mean data (all details regarding these initial observations are reported in the SI document). This preliminary analysis led to several considerations. The full plots showed the need to quantify and understand the variation in the behavior of the station time series. In contrast, the annual plots showed a typical seasonality pattern. Moreover, inspecting the plots suggested comparing annual time series with one another. Finally, similar reasoning was applied to the monthly view.
All of the above considerations suggested highlighting the differences and the similarities both among stations and among years, in order to identify multi-modal (geographical and historical) rainfall changes. Instead of performing a classical time series analysis, we proceeded by applying the clustering algorithms described in the following sections.
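The station selection step can be sketched as a threshold on the number of extreme years, followed by a union with the provincial capitals. This is a minimal illustration under stated assumptions: the function name `select_stations`, the station labels, and the input layout (station mapped to its count of extreme years) are hypothetical.

```python
from typing import Dict, Iterable, List


def select_stations(
    extreme_year_counts: Dict[str, int],
    always_include: Iterable[str],
    min_years: int = 3,
) -> List[str]:
    """Keep stations with at least `min_years` extreme years, plus forced ones."""
    selected = {
        station
        for station, n_years in extreme_year_counts.items()
        if n_years >= min_years
    }
    # Stations such as the provincial capitals are added regardless of the rule.
    selected.update(always_include)
    return sorted(selected)


# Hypothetical stations: "A" and "C" pass the rule, "C" and "D" are capitals.
example_counts = {"A": 4, "B": 1, "C": 3, "D": 0}
example_selection = select_stations(example_counts, always_include=["C", "D"])
```

Because a capital may already satisfy the rule, the union can add fewer stations than there are capitals, which is consistent with the count growing only from 32 to 34 in the text.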
