
L-shaped MA design representations 47; (a) Three-dimensional perspective view and (b) Two-dimensional cross section.
To validate the effectiveness of the proposed optimization algorithm, the structure previously fabricated and reported in Ref. 47 is selected for initial evaluation. The reported MA is composed of L-shaped copper resonators and a continuous copper ground plane that ensures zero transmission, isolated by a dielectric spacer. The copper layers have a thickness of \(0.035~\textrm{mm}\) with a constant conductivity \(\sigma = 5.96 \times 10^{7}~\mathrm {S/m}\). The dielectric spacer is made of low-cost FR-4, characterized by a relative dielectric constant \(\varepsilon _r = 4.3\), loss tangent \(\tan \delta = 0.025\), and a thickness \(h=1.6~\textrm{mm}\). The unit cell of the referenced design is arranged periodically with a lattice constant of \(a = 3.7~\textrm{mm}\). To broaden the absorption bandwidth, a structural cut of length \(d\) is introduced in the resonator layer, which, along with the L-shaped resonators, facilitates charge redistribution and generates multiple electric and magnetic dipoles. The geometry and 3D configuration of the structure are illustrated in Fig. 2.
All simulations are conducted using CST Microwave Studio with the frequency domain solver. Open boundary conditions are applied along the propagation direction (z-axis), while periodic boundary conditions are enforced in the x- and y-directions around the unit cell. A tetrahedral mesh with a precision of \(10^{-4}\) is used, comprising a total of 184,690 elements.
The absorption coefficient \(A\) of the metamaterial absorber at a given frequency \(f\) can be calculated using reflection (\(S_{11}\)) and transmission (\(S_{21}\)) coefficients by:
$$\begin{aligned} A = 1 - |S_{11}|^2 - |S_{21}|^2 \end{aligned}$$
(7)
Since the bottom metal layer is 0.035 mm thick, significantly greater than the skin depth, it effectively blocks transmission, making \(S_{21} \approx 0\). Thus, the absorption simplifies to:
$$\begin{aligned} A = 1 - |S_{11}|^2 \end{aligned}$$
(8)
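As a rough numerical check of the two statements above, the following minimal sketch evaluates the skin depth of copper and the absorption obtained from reflection data. It assumes the standard good-conductor expression \(\delta = 1/\sqrt{\pi f \mu_0 \sigma}\) for a non-magnetic metal and S-parameters given as linear (complex) amplitudes; the function names are illustrative and are not taken from the authors' workflow.

```python
import numpy as np

MU_0 = 4e-7 * np.pi  # vacuum permeability in H/m


def skin_depth(freq_hz, sigma):
    """Skin depth in metres of a good, non-magnetic conductor (assumed model)."""
    return 1.0 / np.sqrt(np.pi * freq_hz * MU_0 * sigma)


def db_to_linear(mag_db):
    """Convert an S-parameter magnitude from dB to a linear amplitude."""
    return 10.0 ** (np.asarray(mag_db, dtype=float) / 20.0)


def absorption(s11, s21=0.0):
    """A = 1 - |S11|^2 - |S21|^2; s21 defaults to 0 for a ground-backed absorber."""
    return 1.0 - np.abs(s11) ** 2 - np.abs(s21) ** 2


# Copper ground plane from the text: sigma = 5.96e7 S/m, thickness 0.035 mm.
delta = skin_depth(10e9, 5.96e7)   # roughly 0.65 micrometres at 10 GHz
ratio = 0.035e-3 / delta           # ground plane is more than 50 skin depths thick

# A reflection of -10 dB corresponds to 90% absorption when S21 = 0.
a_example = absorption(db_to_linear(-10.0))
```

Because the skin depth shrinks with increasing frequency, the lowest frequency of the band (10 GHz) is the least favourable case for the \(S_{21} \approx 0\) assumption.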
The optimization aims to realize a broadband MA. To this end, the reward function defined in equation 6 is applied: as the reward of the TD3-RL model increases, the total absorptivity of the MA across the spectrum from 10 to 25 GHz increases. The variables d, g, w, and p are selected as design parameters and are optimized within the following ranges: d \(\in [0.1, 1.5]\), p \(\in [0.1, 0.31]\), g \(\in [0.1, 0.31]\), and w \(\in [0.1, 0.31]\). These ranges are chosen to facilitate fabrication and maintain the desired shape of the device. Additionally, 1000 frequency points are sampled across the studied range with an interval of 0.015 GHz. The TD3-RL model was run for 45 iterations, which was sufficient for the proof-of-concept stage. Given the robustness of TD3 to initial conditions, the optimization parameters were initialized randomly to ensure diverse exploration during training. The model achieved its best result after only 23 iterations. Completing all iterations required a total of 2 hours on a Dell Precision 7820 Tower workstation with an Intel® Xeon® Gold 6211 CPU @ 3.60 GHz and 384 GB RAM. It should be noted that the training time is dominated by the electromagnetic simulations at each iteration rather than by the computational cost of the RL algorithm itself.
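The bounded design space described above can be sketched as follows. This is only an illustration under stated assumptions: the ranges are taken to be in millimetres (the text does not give units), the actor is assumed to output normalized actions in \([-1, 1]\) (a common TD3 convention, not confirmed by the text), and the helper names `random_design` and `action_to_design` are hypothetical.

```python
import numpy as np

# Parameter ranges from the text; units assumed to be mm.
BOUNDS = {"d": (0.1, 1.5), "p": (0.1, 0.31), "g": (0.1, 0.31), "w": (0.1, 0.31)}
LOW = np.array([lo for lo, _ in BOUNDS.values()])
HIGH = np.array([hi for _, hi in BOUNDS.values()])


def random_design(rng):
    """Draw a random initial design uniformly inside the bounds."""
    return rng.uniform(LOW, HIGH)


def action_to_design(action):
    """Map a normalized action in [-1, 1]^4 linearly onto the physical ranges."""
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return LOW + 0.5 * (a + 1.0) * (HIGH - LOW)


rng = np.random.default_rng(0)
x0 = random_design(rng)                         # random starting point (d, p, g, w)
x_mid = action_to_design([0.0, 0.0, 0.0, 0.0])  # centre of every range
```

Clipping the action before rescaling guarantees that every design sent to the electromagnetic solver respects the fabrication-motivated bounds, regardless of the exploration noise added to the actor output.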
The model training diagram is shown in Fig. 3a, which plots the immediate reward obtained at each iteration for the corresponding generated design and shows that the model reached the maximum reward at the 23\(^{\text {rd}}\) iteration. It is worth noting that the high-performance design obtained within 23 iterations is not a random outcome but reflects the ability of reinforcement learning algorithms to efficiently identify promising solutions even during the early stages of training. This observation is consistent with previous work, where an A2C-RL model surpassed the particle swarm optimization (PSO) method in the optimization of a grating coupler in only 14 iterations 50. To further examine the effect of the reward formulation, an additional trial was conducted using a smaller penalty factor (-10 instead of -1000). The corresponding training curve is shown in Fig. 3b, where the y-axis scale differs from that of Fig. 3a because of the change in penalty magnitude. In this case, the model was unable to improve its performance over the iterations or reach a high reward value, indicating that the larger penalty scaling used in this work is needed to guide the learning process effectively. In this study, the effectiveness of the designs is compared using the mean absorption of the MA across the frequency range, as defined in equation 9.
To further assess the stability of the optimization process, two additional runs are performed using different initializations: one initialized at the lower bounds of all geometric parameters and another at their upper bounds. As shown in Fig. 3c, all runs converged toward nearly the same reward value within a similar number of iterations, reaching it at the 19\(^{\text {th}}\) and 24\(^{\text {th}}\) iterations for the lower-bound and upper-bound initializations, respectively. This consistency across distinct starting points demonstrates the robustness of the TD3-based optimization framework to variations in the initial design state.
$$\begin{aligned} f(x) = \frac{\displaystyle \sum _{f=10\,\textrm{GHz}}^{f=25\,\textrm{GHz}} A(f)}{\text {number of frequencies}} \end{aligned}$$
(9)
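The figure of merit in equation 9 is a plain average of the sampled absorption values. A minimal sketch is given below, together with a helper that extracts the widest contiguous band above a threshold (the quantity usually quoted as the 90% absorption bandwidth). The frequency grid is one possible reading of "1000 points with an interval of 0.015 GHz", the toy spectrum is synthetic and not simulation data, and the function names are illustrative.

```python
import numpy as np


def mean_absorption(absorption):
    """Equation 9: arithmetic mean of A(f) over the sampled frequencies."""
    return float(np.mean(absorption))


def widest_band(freq_ghz, values, threshold=0.9):
    """Return (f_low, f_high) of the widest contiguous run with values >= threshold."""
    freq_ghz = np.asarray(freq_ghz, dtype=float)
    mask = np.asarray(values) >= threshold
    best, start = None, None
    for i, ok in enumerate(mask):
        if ok and start is None:
            start = i  # a new run begins here
        if start is not None and (not ok or i == len(mask) - 1):
            end = i if ok else i - 1  # last index that belongs to the run
            width = freq_ghz[end] - freq_ghz[start]
            if best is None or width > best[1] - best[0]:
                best = (float(freq_ghz[start]), float(freq_ghz[end]))
            start = None
    return best


# Assumed grid: 1000 samples starting at 10 GHz with a 0.015 GHz step.
freq = 10.0 + 0.015 * np.arange(1000)
# Synthetic spectrum for illustration only: 0.95 inside 12-22 GHz, 0.5 outside.
a_toy = np.where((freq >= 12.0) & (freq <= 22.0), 0.95, 0.5)

f_mean = mean_absorption(a_toy)
band = widest_band(freq, a_toy)
```

The mean rewards absorption everywhere in the band, whereas the band helper reports only where the 90% criterion is met without interruption, so the two numbers are complementary.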

Optimization process evaluation: Training curves for reward formula versus the number of iterations with (a) penalty factor -1000 and (b) penalty factor -10. (c) Training curves for three runs with different initializations. (d) Performance comparison between the optimized MA using TD3-RL model and that reported with parametric sweep in Ref. 47, TRA 51, and A2C-RL 41.
The design parameters obtained at the 23\(^{\text {rd}}\) iteration are applied, and the performance of the resulting design is compared against three relevant works. Specifically, the design optimized using TD3 is compared to that obtained by parametric sweep in Ref. 47 and that optimized using the A2C reinforcement learning technique (A2C-RL) 41. In addition, Etman et al. 51 used the Trust Region Algorithm (TRA) based on a Co-kriging model to optimize the same absorber. It is worth noting that these prior studies considered the exact same L-shaped absorber geometry as this work, ensuring a consistent and fair basis for comparison. The results of these comparisons are summarized in Table 3. Furthermore, Fig. 3d shows the absorption response of the optimized MA for both TE and TM polarizations compared to the results reported in the other studies. The proposed MA demonstrates absorptivity exceeding 90% over a broader frequency range, from 12.2 to 22.4 GHz, effectively covering the majority of the \(Ku\) and \(K\) bands. Therefore, the TD3-RL model successfully identified a design with higher absorptivity and a wider bandwidth while maintaining the same structural configuration, thickness, and periodicity when compared to the A2C-RL and TRA models. Consequently, TD3-RL provides a reliable and effective alternative for the inverse design of metamaterial absorbers, particularly in problems involving continuous design spaces.
To confirm the versatility of the proposed TD3-based optimization framework and evaluate its applicability to other classes of metamaterial absorbers, a second case study was conducted on an all-dielectric metasurface introduced by Cai et al. 52. Fig. 4a presents the schematic of the structure, consisting of a periodic array of gallium arsenide (GaAs) cylindrical resonators placed on a thin GaAs spacer layer and backed by a thick tungsten (W) reflector. GaAs serves as a high-index dielectric resonator capable of supporting multiple optical modes across the visible and near-infrared ranges, while the 200 nm W layer, substantially exceeding the skin depth, ensures negligible transmission. Periodic boundary conditions along the x- and y-directions and an open boundary along z were applied to emulate an infinite array, and the absorbance was computed using equation 8 over the frequency range from 80 to 800 THz.
The same TD3 model and reward function defined in equation 6 were applied to this dielectric metasurface, with all parameters initialized randomly to promote broad exploration of the design space. In this example, four geometric variables were selected for optimization: the cylinder radius r, cylinder height h, GaAs spacer thickness t, and lattice constant a. These parameters were optimized within the following ranges: \(r \in [135, 155]~\text {nm}\), \(h \in [100, 140]~\text {nm}\), \(t \in [20, 40]~\text {nm}\), and \(a \in [380, 420]~\text {nm}\). Using these bounds, the TD3 agent was executed for 50 iterations. Despite the material and geometric differences compared to the metallic absorber, the model rapidly adapted, achieving its highest reward at the 15\(^{\text {th}}\) iteration, as illustrated in Fig. 4b. After applying the geometric parameters obtained at the 15\(^{\text {th}}\) iteration, the absorption spectrum of the optimized metasurface was evaluated and compared with that of the reference design. As shown in Fig. 4a, the optimized structure exhibits a clear enhancement in the overall absorptivity along with a noticeably broader absorption bandwidth, demonstrating the ability of the TD3 model to improve both the magnitude and spectral extent of the device’s absorption performance. The optimized parameters and their corresponding absorption results compared to the reference design are summarized in Table 4. These results indicate that the TD3-RL framework can efficiently optimize heterogeneous metasurface designs when guided by a consistent physical objective and reward structure, supporting the generalizability of the proposed method and its suitability for diverse classes of dielectric metamaterial absorbers.

Optimization results of dielectric metasurface absorber; (a) absorption curves for the optimized and reference designs, where the inset figure shows 3D structural overview of the dielectric metasurface absorber, and (b) the training diagram of the TD3-RL model with the number of iterations.
Based on the successful results obtained from the previous metamaterial absorber optimization, the TD3-RL model is next employed to optimize a new broadband cross-polarization conversion metasurface (CPCM) structure. Figure 5 presents the structural layout of the proposed CPCM, where an additional triangular shape is added inside the L-shapes. The top resonator layer is composed of copper with constant conductivity \(\sigma = 5.96 \times 10^{7}~\mathrm {S/m}\) and a thickness of 0.035 mm. The ground plane, which forms the bottom layer, is also made of copper with the same thickness, ensuring negligible transmission. A dielectric FR-4 substrate with a thickness \(h\) of 1.6 mm is placed between the top and bottom copper layers. The unit cell periodicity \(a\) is set to 3.7 mm.

Design representations: (a) Three-dimensional structural overview and (b) Two-dimensional planar layout of the proposed metamaterial cross-polarizer.
Similar to the previous study, d, g, w, and p are selected as design variables. The parameters are optimized within the following ranges: d \(\in [0.1, 1]\), p \(\in [0.1, 0.31]\), g \(\in [0.1, 0.31]\), and w \(\in [0.1, 0.31]\). The upper bound of d is reduced to prevent the introduced triangle from being removed during optimization. Periodic boundary conditions are used around the unit cell in the x- and y-directions, whereas an open boundary condition is used along the z-direction, which is the propagation direction of the incident wave. Furthermore, a tetrahedral mesh with an accuracy of \(10^{-4}\) is employed, comprising a total of 143,938 elements. The suggested CPCM is optimized using the TD3-RL model to obtain a wide-band cross polarizer.
In a reflective metamaterial cross-polarizer, the polarization conversion performance is commonly characterized using the co- and cross-polarized reflection coefficients. Under normal incidence, an x-polarized plane wave illuminates the metasurface, and the reflected field can be decomposed into x- and y-polarized components. Accordingly, the co-polarized reflection coefficient is defined as \(S_{xx}\), representing the reflected component maintaining the same polarization, whereas the cross-polarized reflection coefficient \(S_{xy}\) represents the reflected component converted into the orthogonal polarization. Assuming negligible transmission due to the metallic ground plane, the polarization conversion ability is quantified by the polarization conversion ratio (PCR), defined as the fraction of the reflected power that is converted into the orthogonal polarization:
$$\begin{aligned} \textrm{PCR}=\frac{|S_{xy}|^2}{|S_{xx}|^2+|S_{xy}|^2}. \end{aligned}$$
(10)
A PCR approaching unity indicates near-perfect cross-polarization conversion, while lower values correspond to dominant co-polarized reflection.
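Equation 10 translates directly into a few lines of code. The sketch below assumes the reflection coefficients are available as linear (possibly complex) amplitudes rather than in dB; the function name `pcr` and the sample values are illustrative only.

```python
import numpy as np


def pcr(s_xx, s_xy):
    """Polarization conversion ratio, equation 10, from linear reflection coefficients."""
    co = np.abs(s_xx) ** 2      # co-polarized reflected power
    cross = np.abs(s_xy) ** 2   # cross-polarized reflected power
    return cross / (co + cross)


# Illustrative values, not simulation data.
strong_conversion = pcr(0.1, 0.9)   # mostly cross-polarized reflection
no_conversion = pcr(0.9 + 0.1j, 0.0)  # purely co-polarized reflection
equal_split = pcr(0.5, 0.5j)        # equal power in both polarizations
```

Since only magnitudes enter the ratio, the PCR is insensitive to the phase reference plane of the measurement, but it does not by itself indicate how much power is absorbed in the lossy substrate.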
The same reward function in equation 6 is used, but in this case y represents the fraction of the frequency spectrum with PCR exceeding 90%, and the model is run for 150 iterations. The number of iterations was raised to account for the greater structural complexity of the novel cross-polarizer design, thereby providing the RL agent with sufficient opportunity to achieve improved optimization results. The initial parameters were chosen randomly, and the model obtained its best result after only 81 iterations. The whole training took 7 hours on the same workstation used in the first example. Figure 6a presents the reward values achieved by the proposed TD3-RL model across successive iterations, demonstrating the model’s ability to converge toward higher reward values. For the parameters obtained at the 81\(^{\text {st}}\) iteration, Figures 6b and 6c present the PCR of the proposed unit cell and the corresponding reflection coefficients \(S_{xx}\) and \(S_{xy}\), respectively. The studied CPCM exhibits a PCR greater than 90% over a broad frequency range from 11.8 to 24.2 GHz, including the full Ku-band and much of the K-band, as shown in Fig. 6b. These results indicate that the proposed design holds strong potential for Ku-band applications. The device performance is summarized in Table 5.
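The bandwidth quantities mentioned above can be made concrete with a short sketch. It computes the fraction of sampled frequencies with PCR above 90% (the quantity y described in the text), the fractional bandwidth of the reported 11.8 to 24.2 GHz band, and its overlap with the Ku and K bands. The band limits of 12 to 18 GHz and 18 to 27 GHz follow the IEEE radar-band designations; the fractional-bandwidth definition (width divided by centre frequency) is a common convention and is an assumption here, as are the function names.

```python
import numpy as np

KU_BAND = (12.0, 18.0)  # GHz, IEEE designation
K_BAND = (18.0, 27.0)   # GHz, IEEE designation


def fraction_above(values, threshold=0.9):
    """Fraction of sampled frequencies whose value exceeds the threshold."""
    return float(np.mean(np.asarray(values) > threshold))


def fractional_bandwidth(f_low, f_high):
    """Bandwidth divided by centre frequency (assumed convention)."""
    return 2.0 * (f_high - f_low) / (f_high + f_low)


def band_coverage(band, f_low, f_high):
    """Share of a named band that lies inside the interval [f_low, f_high]."""
    lo, hi = band
    overlap = max(0.0, min(hi, f_high) - max(lo, f_low))
    return overlap / (hi - lo)


fbw = fractional_bandwidth(11.8, 24.2)          # about 0.69
ku_cov = band_coverage(KU_BAND, 11.8, 24.2)     # Ku-band fully covered
k_cov = band_coverage(K_BAND, 11.8, 24.2)       # roughly two thirds of the K-band
y_toy = fraction_above([0.95, 0.5, 0.92, 0.85])  # toy PCR samples, not simulation data
```

Under these assumptions the reported band corresponds to a fractional bandwidth of roughly 69% around a centre frequency of 18 GHz.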

(a) Training curve of the TD3-RL model showing reward versus number of iterations. (b) Polarization conversion ratio. (c) Co- and cross-polarization components \(S_{xx}\) and \(S_{yx}\).
The stability and performance of the polarization-conversion metasurface can be evaluated by examining its effectiveness over a broad range of incidence angles. Figure 7 illustrates the PCR response for incidence angles from 0° to 60°. Up to an incidence angle of 20°, the structure is fairly stable over the operating bandwidth, with a PCR above 0.8. At 30°, 40°, and 50°, however, the upper edge of the band with a PCR above 0.8 decreases slightly to 20.5 GHz, 20.1 GHz, and 19.3 GHz, respectively. These results show that the CPCM performs well under oblique incidence.

The impact of the incidence angle on the reflector response.
The physical origin of polarization conversion in the proposed metasurface can be understood through the impedance response and the corresponding electromagnetic field distributions at resonance. As shown in Fig. 8, the retrieved normalized input impedance exhibits a near-perfect matching condition within the operating band, where the real part approaches unity and the imaginary part remains close to zero. This impedance matching significantly suppresses the co-polarized reflection component and enhances the excitation of resonant modes, which is essential for achieving efficient polarization conversion. To further elucidate the conversion mechanism, the electric field distributions at the three resonance frequency peaks (12.5 GHz, 18 GHz, and 23.7 GHz) are depicted in Fig. 9. Strong electric-field localization is observed along different arms and corners of the resonator, indicating the excitation of multiple anisotropic resonant modes and the formation of orthogonal field components. In addition, the corresponding surface current density distributions illustrated in Fig. 10 show pronounced non-collinear and rotational current paths along the metallic segments. These induced currents generate scattered fields with significant perpendicular polarization components, thereby producing strong cross-polarized reflected waves. Overall, the combined impedance matching behavior (Fig. 8) and the resonance-driven electric field and current distributions (Figs. 9 and 10) confirm that polarization conversion is primarily governed by anisotropic resonant coupling and current reorientation at the identified resonance frequencies.
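The text does not state how the normalized impedance was retrieved. A widely used S-parameter retrieval expression is \(z = \sqrt{\left[(1+S_{11})^2 - S_{21}^2\right] / \left[(1-S_{11})^2 - S_{21}^2\right]}\), which for a ground-backed structure with \(S_{21} \approx 0\) reduces to \(z = (1+S_{11})/(1-S_{11})\). The sketch below implements that expression as an assumption; which reflection coefficient the authors used (presumably the co-polarized one) is not confirmed by the text, and the function name is illustrative.

```python
import numpy as np


def normalized_impedance(s11, s21=0.0):
    """Normalized impedance from complex S-parameters (assumed retrieval formula)."""
    s11 = np.asarray(s11, dtype=complex)
    s21 = np.asarray(s21, dtype=complex)
    num = (1.0 + s11) ** 2 - s21 ** 2
    den = (1.0 - s11) ** 2 - s21 ** 2
    # The principal square root has a non-negative real part,
    # which selects the branch appropriate for a passive medium.
    return np.sqrt(num / den)


# Illustrative values, not simulation data.
z_matched = normalized_impedance(0.0)         # no reflection: z = 1
z_mismatch = normalized_impedance(0.2)        # z = 1.2 / 0.8 = 1.5
z_reactive = normalized_impedance(0.0 + 0.2j)  # nonzero imaginary part
```

A matched condition corresponds to a real part near 1 and an imaginary part near 0, which is the behaviour described for Fig. 8 within the operating band.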

Relative impedance of the reported CPCM.

Simulated electric field distributions of the proposed MA at frequencies of (a) 12.5 GHz, (b) 18 GHz, and (c) 23.7 GHz.

Simulated surface current density of the proposed design at frequencies of (a) 12.5 GHz, (b) 18 GHz, and (c) 23.7 GHz.
To experimentally validate the performance of the proposed metamaterial cross-polarizer, a prototype is fabricated based on the optimized design parameters. It comprises \(40 \times 40\) unit cells, forming a cuboid with dimensions of \(148~\text {mm} \times 148~\text {mm} \times 1.6~\text {mm}\). Due to fabrication constraints, the resonator width w is adjusted to 0.2 mm. The final set of parameters used for fabrication is detailed in Table 6. The prototype is fabricated using standard photolithography techniques, and its performance is experimentally characterized using two linearly polarized standard-gain horn antennas connected to a vector network analyzer (Agilent FieldFox® model N9918A) in a bi-static reflection configuration. The x- and y-polarized horns are connected to ports 1 and 2 of the VNA, respectively, enabling simultaneous measurement of co- and cross-polarized reflection components. The co-polarized reflectance is obtained from \(S_{xx}\) and \(S_{yy}\), while the cross-polarized reflection is extracted from \(S_{yx}\). This measurement arrangement enables direct evaluation of polarization conversion by comparing the relative magnitudes of co- and cross-polarized reflected signals under identical experimental conditions.
The complete measurement setup is illustrated in Figs. 11a and 11b, while the fabricated prototype is shown in Fig. 11c. A comparison between the simulated and measured reflection coefficients is presented in Fig. 11d for the co-polarized reflection and in Fig. 11e for the cross-polarized reflection. As shown, the measured curves differ slightly from their simulated counterparts. This discrepancy can be attributed to several factors. The finite size of the fabricated sample introduces edge diffraction and scattering effects that are absent in the idealized simulation, which assumes an infinitely periodic structure. Further, minor deviations in geometry or material properties may have occurred during the fabrication process, contributing to performance variation.
In Table 7, the performance of the proposed design is compared with various previously reported wide-band metamaterial cross-polarizers. This comparison shows that the reported design offers several advantages, including compact thickness, near-perfect cross-polarization reflection, and a wide bandwidth, all achieved without the use of multilayer structures or lumped elements.

(a,b) Measurement setup, (c) Fabricated prototype, (d) simulation and measurement results of co-polarization reflection coefficient, and (e) simulation and measurement results of cross-polarization reflection coefficient.
