A machine learning estimator trained on synthetic data for real-time earthquake ground-shaking predictions in Southern California

Synthetic physics-based earthquake ground motions

In this work we leveraged a large dataset generated via physics-based wave propagation simulations to generate ground motions from hundreds of thousands of hypothetical earthquakes to train ML algorithms. The data was generated using the CyberShake platform and, in particular, the CSS-15.4 study for the Southern California region.

CyberShake, developed by the Southern California Earthquake Center (SCEC), is an integrated collection of scientific software that performs PSHA by using 3D physics-based modeling. It simulates ground motions for a large suite of earthquakes derived from an earthquake rupture forecast (ERF) and has been used to assess seismic hazards in California in multiple studies^6,7. Simulations are based on seismic reciprocity, so two unit impulses at a given ground site, one in each horizontal direction, are propagated to fault surfaces to calculate their Strain Green Tensor (SGT) response. Then, the SGTs are convolved with slip time histories for each event to produce a seismogram at the site of interest. Thus, CyberShake simulations computationally scale with the number of sites (generally on the order of hundreds), and not with the total number of potential earthquakes (usually in the order of tens to hundreds of thousands), allowing to model an arbitrary number of earthquakes for a given area. Although this platform was developed to perform PSHA, the rich suite of its data products makes it an ideal source of data to feed the MLESmap method.

CSS-15.4 used in this paper is a well-curated computational study to calculate a physics-based PSHA for Southern California at 1 Hz, using the tomographically-derived Community Velocity Model CVM-S4.26-M01, the GPU implementation of AWP-ODC-SGT³⁶, the GP-14 kinematic rupture method with uniform hypocenters³⁷, and the UCERF2 ERF⁷. The computation of the CSS-15.4 study required a total of 37.6M hours in tier-0 supercomputing facilities. In particular, the database contains IMs (derived from simulations) for 153628 scenarios (i.e., hypothetical earthquakes) from the UCERF2 earthquake rupture forecast. Those scenarios were recorded at a collection of sites, i.e., discrete points in space on the free surface, where seismic IMs were extracted from each scenario. Such IMs can be further analyzed to obtain discrete spectral intensity values.

Some relevant characteristics of the CSS-15.4 dataset are shown in Fig. 1. The faults are marked with red lines, the Los Angeles city center with a blue dot, and the network of ground sites where RotD50 is computed with magenta stars in Fig. 1a. Figure 1b shows the RotD50 distribution for the events at different periods. As expected, lower periods attain higher RotD50 values. The magnitude distribution of all synthetic scenarios is shown in Fig. 1c, where a predominant 7.6 magnitude is observed. CyberShake uses a magnitude cutoff of 6.5, so only events with a median magnitude of at least 6.5 are considered, though aleatory magnitude variability implies that some events with lower magnitudes are included. Since CyberShake considers multiple realizations with varying hypocenter locations and slip distributions to sample variability, the distribution of earthquakes in the event set is not directly related to actual earthquake magnitude distribution in the region of study.

Finally, it should be noted that the MLESmap methodology is not limited specifically to training on CyberShake-like databases. In CyberShake, although multiple kinematic rupture scenarios are derived for each rupture in the ERF by varying slip distributions and hypocenter locations across an input fault system, the rupture speeds and slip distributions in each simulation are pre-set. Having a richer set of rupture conditions by considering a synthetic database of dynamic rupture simulations could further improve the applicability of the method.

MLESmap: methodology

The goal of the MLESmap methodology is to rapidly predict IMs with high spatial resolution given, as inputs, the earthquake’s magnitude, hypocentral location, and the relationship between the earthquake hypocentre and the site of ground motions recordings (see Supplementary Fig. 9 for a graphical summary of the MLESmap methodology). In particular, the MLESmap methodology provides a framework for training region-specific ML models of ground motion on databases of synthetic IMs with two algorithms: Random Forest (RF) and Deep Neural Networks (DNN). The resulting ML models, the RF-based GMM and DNN-based GMM predict the distribution of a selected IM at a given period for given source characteristics of a new event in the region.

Given our objective and the associated time constraints, we choose parameters that can be quickly inferred from early data records and are readily provided by international agencies immediately after an event. It should be noted that despite the simple inputs, the methodology takes into account the complex relationship between the finite fault models used in the ground motion simulations that generate the synthetics, the associated hypocentral location, and the resulting spatial distribution of the chosen IM.

In this paper, the MLESmap predictions are made by means of ML models for Southern California trained with CSS-15.4 events only³⁸. The CSS-15.4 database is one of the largest and best-calibrated datasets of synthetics worldwide, thus serving as an ideal first implementation of MLESmap. For both RF and DNN algorithms, four independent ML models considering the data for a single maximum period (T = 2, 3, 5, and 10 s) were built. The logarithm of the RotD50—which represents the median pseudo-spectral acceleration (PSA) value selected from all azimuth directions at each recording station³⁹—was used as a target, as the logarithmic scale was shown to increase the score performance for the four periods considered (see Supplementary Fig. 10). It should be noted that other IMs could serve targets without significant impact on the performance of the MLESmap models.

We refer to the collection of all scenarios recorded at all sites for a discrete spectral intensity, or in general for all spectral intensities, as events. The dataset of 153,628 scenarios was split into 90% for training and 10% for testing and validation. Therefore, considering 253 stations in our dataset (Fig. 1a), a total of 155M events were used for training the models generated with the MLESmap methodology, involving 4 discrete values of RotD50 (at different periods), which were used independently from each other. A total of 3.8M events belonging to the validation subset were used to compute the score metrics and evaluate the MLESmap models’ accuracy. The training dataset constitutes 3.8 GB on disk, while the validation subset is 0.5 GB.

MLESmap: RF regression algorithm

The training and model generation for the RF regressor has been performed using the Distributed Computing Library (dislib)⁴⁰, a Python software built on top of PyCOMPSs⁴¹. dislib is inspired by NumPy and scikit-learn, providing various supervised and unsupervised learning algorithms through an easy-to-use API. dislib has been used due to its efficiency to handle models with a large number of events.

To find the best hyperparameters for a model with the selected input characteristics (magnitude, hypocentral latitude, longitude, and depth, latitude and longitude coordinates of the site, and the Euclidean distance and azimuth that define the spatial relationship between the hypocentre and the site) we use a grid search on the training set, and the k-folds cross-validation functions over the three available parameters in dislib for the algorithm:

maximum tree (d_max): number of levels in each decision tree,
number of estimators (n_est): number of trees in the forest, and
try-features (t_f): maximum number of features considered for splitting a node.

The metric to measure the performance of the models given a specific set of hyper-parameters is the coefficient of determination (R²) score. Supplementary Fig. 10 shows the R²-score values for different d_max values using two different target scales: (a) logarithmic and (b) non-logarithmic. The results show that the treatment of the target on a logarithmic scale increases the score performance for the four periods considered. Moreover, the best parameters are d_max = 30, n_est = 30, and t_f = ‘third’ for all periods.

MLESmap: Deep Neural Network topology

MLESmap uses a fully connected neural network. Many factors were taken into account in determining the DNN architecture, including the nature of the application. Specifically, it is a regression problem with only 8 inputs and a single output. Tackling it with a “sophisticated” neural network model, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or even a Transformer, was considered unnecessary.

In order to select an appropriate network topology for the target problem, we started from the most basic multilayer perceptron (MLP) with eight neurons in the input layer, corresponding to the eight features of the earthquake that are taken into account (magnitude, hypocentral latitude, longitude, and depth, latitude and longitude coordinates of the site, and the Euclidean distance and azimuth that define the spatial relationship between the hypocentre and the site), and one neuron in the output layer which determines the RotD50 component of the earthquake. The exploration of the topology space was carried out with different numbers of hidden layers and units (neurons) on those layers.

The classical “non-generalization” problem in neural networks was tackled by applying regularization, data normalization, and batch normalization. In addition, distinct learning rate schedulers were evaluated to deal with the local minimum deadlock optimization problem. Finally, to avoid problems due to the vanishing and explosion of gradients, we tested different dropout and activation functions.

After concluding this phase of experimentation with the different periods, the best results were obtained with MLPs consisting of either seven or nine hidden layers, respectively, with 32, 64, 128, 256, 128, 64, and 32 or 16, 32, 64, 128, 256, 128, 64, 32 and 16 units per layer. The MLPs, in addition, integrate pre-batch normalization (before the activation function) and a warm anneal learning rate scheduler. Softplus was adopted as the activation function for the hidden layers and sigmoid for the output layer since the IM values were previously normalized between 0 and 1.

The different experiments with neural networks were carried out in Python, making use of the open-source TensorFlow library⁴², and the high-level framework Keras?. These libraries rely on other auxiliary Python libraries such as os, NumPy, matplotlib, etc.

Validation score metrics

In this work, we used different metrics to validate the accuracy of the MLESmap inference on synthetics and real events (see Results section).

To quantify the accuracy of the ML predictions on the synthetic data in Fig. 2, we use common regression metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), coefficient of determination R² and Pearson’s coefficient⁴³. The MAE is calculated as the mean or average of the absolute differences between predicted and expected target values, so the units of the error score correspond to the units of the predictions. The MSE is the mean of the squared differences—the units of the error and of the prediction do not match, but the metric is useful to emphasize and penalize large errors, as the errors are squared before they are averaged. RMSE is the square root of the MSE, so it is measured in the same units as the target variable, yet it still gives a relatively high weight to large errors. MAPE is sensitive to relative errors and so remains insensitive by the scaling of the target variable. R² represents the proportion of variance explained by the independent variables in the model and indicates how well-unseen samples are likely to be predicted by the model. Finally, Pearson’s correlation coefficient measures the strength of the linear association between two variables. Note that all the metrics summarize performance in ways that disregard the direction of over- or under-prediction.

In particular, to evaluate the ML predictions on the validation subset against empirical GMMs, we focus on the RMSE as score metrics, as it is widely used in the community to compare and validate empirical GMMs with observations^13,44,45,46. RMSE ranges from 0 to infinity, is in the same units as the target variable and is most useful when large errors are particularly undesirable, as in the case of the prediction of ground motion IMs for assessing associated risks. However, since RMSE does not provide any information on prediction bias, we also consider the geometric mean ratio of Aida’s number K²¹, where underestimation and overestimation correspond to K > 1 and K < 1, respectively. We compute the logarithm of K as

$$\log (K)=\frac{1}{N} {\sum }_{i=1}^{N}\log \left(\frac{{O}_{i}}{{S}_{i}}\right)$$

(1)

where O_i is the observed (target) value, and S_i is the simulated (predicted) value.

It should be noted that multiple other empirical GMMs were proposed for Southern California (e.g., BSSA⁴⁷, CB⁴⁸, and CY⁴⁹) that generate similar results and thus result in comparable RMSE metrics. We deem ASK-14²⁰ the most suitable for our validation, as it was constructed using the NGA-West2 database⁵⁰ that contains worldwide ground motion data recorded from shallow crustal earthquakes in active tectonic regimes post-2000, as well as a set of small to moderate magnitude earthquakes in California between 1998 and 2011. The ASK-14 values were computed using OpenSHA software⁵¹ with the V_S30 values from the Thompson model⁵².

Source link