Interest in airborne communication networks has grown in recent years, encouraging the development of novel techniques for deploying wireless infrastructure1. This growth stems largely from the fact that aerial communication systems can provide better system capacity and coverage. Unmanned aerial vehicles (UAVs), also known as remotely piloted aircraft systems (RPAS) or drones, are small unmanned aircraft that can be deployed quickly2. They have also been considered within the LTE-A (Long-Term Evolution Advanced) framework of the Third Generation Partnership Project (3GPP). In contrast to terrestrial communication, the channel between a UAV and the ground is more likely to be a line-of-sight (LoS) link2, which is favorable for wireless communication. UAVs built on a variety of airborne platforms have drawn significant academic and industry effort on deployment, navigation, and control3. Because key communication issues are involved, resource allocation, which covers transmit power, served users, and sub-channels, is also required to improve the coverage and energy efficiency of UAV communication systems4. UAVs can typically be deployed in less time than terrestrial base stations and offer greater configuration flexibility5. The altitude of UAV-enabled small base stations and the distance between UAV deployments are studied in4. A cyclic packing-based three-dimensional (3D) deployment algorithm is developed in6 to maximize downlink coverage performance. The same work also develops a 3D deployment method for a single UAV that maximizes the number of covered users6. In addition, a continuous UAV placement method that maintains a fixed altitude is proposed in7. This method aims to reduce the overall number of UAVs needed while ensuring that each ground user is covered by at least one UAV8.
Beyond deployment optimization, the design of UAV trajectories to optimize communication performance has received considerable attention, as evidenced by9,10,11. The authors of9 treat UAVs as mobile relays and investigate throughput maximization by jointly optimizing the power allocation and the UAV trajectory. Successive convex approximation (SCA) is then proposed in9 as a method for UAV trajectory design. The authors of9 also examine a UAV trajectory design that minimizes the time needed to complete a task in UAV multicast systems by converting a continuous trajectory into a set of discrete waypoints. Furthermore, the authors of10 consider wireless communication systems that support multiple UAVs and analyze a joint design of trajectory and resource allocation that maximizes the minimum throughput over all users to maintain fairness. To reduce the delay of the sensing task while maintaining the overall rate of a multi-UAV-aided uplink single-cell network, the authors of12 propose a joint sub-channel allocation and trajectory design technique in which the trajectory accounts for both the total rate and the latency of the sensing task. Because UAVs are adaptable and maneuverable, relying on human intervention constrains their control design. Improving the performance of UAV communication systems therefore calls for intelligent UAV control based on machine learning13. The design of neural network-based trajectories for UAVs is examined from the standpoint of manufacturing architecture in14,15. The paper16 proposes a weighted expectation-based on-demand predictive UAV deployment method that minimizes transmit power in UAV-enabled communication systems, using a Gaussian mixture model to construct the data distribution.
In the related work16, the authors investigate autonomous path planning for UAVs by jointly considering energy efficiency, transmission delay, and interference management. To address this complex optimization problem, they propose a deep reinforcement learning framework based on Echo State Networks (ESNs), enabling adaptive decision-making in dynamic environments. Furthermore, the same study presents a resource allocation strategy leveraging Liquid State Machines (LSMs) for efficient spectrum utilization across both licensed and unlicensed LTE bands in cache-enabled UAV networks. In a related work17, a joint channel and time-slot selection mechanism for multi-UAV systems is introduced. The proposed approach employs log-linear learning to optimize spectrum sharing and mitigate collisions in a distributed manner, thereby enhancing the overall communication performance of UAV-enabled networks17.
Machine learning and deep learning are two types of artificial intelligence models that learn directly from data, without a computer system being explicitly programmed for detection and recognition. Both are promising and powerful tools that can provide autonomous and effective solutions for intelligently improving UAV-enabled communication systems18. However, most research contributions have addressed how UAVs are deployed and how their trajectories are designed in communication systems16. Although11,12 discuss resource allocation schemes for UAV-supported communication systems, including transmit power and sub-channels, prior research has primarily focused on time-independent scenarios, in which the optimal design does not depend on time. Resource allocation techniques based on machine learning for time-dependent scenarios were investigated in19,20. However, most of the proposed machine learning algorithms focus on scenarios with a single UAV, or with multiple UAVs under the assumption that each UAV possesses complete network information. In practice, the rapid movement of UAVs21,22 makes it difficult to acquire complete knowledge of the dynamic environment, which poses a significant challenge for the design of reliable UAV wireless communication. Additionally, most earlier research contributions relied on centralized techniques, which make modeling and computation challenging as the network scale grows. For UAV-enabled communication systems, reward-based multi-agent learning (RMAL) can offer a distributed approach to intelligent resource management, which is especially useful when each UAV has access only to its local data23.
In dynamic UAV-enabled communication networks, centralized control or full network state awareness is often impractical due to high mobility, limited energy, and real-time operational constraints. Most existing solutions either assume complete inter-UAV information sharing or rely on static deployment strategies. In contrast, the proposed RMAL (Reward-Based Multi-Agent Learning) framework enables each UAV to make decentralized resource allocation decisions using only local observations, eliminating the need for inter-agent communication. This reduces overhead while retaining adaptability in highly dynamic environments. The motivation for using RMAL lies in its ability to capture environmental uncertainty through a stochastic game formulation, enabling each UAV to maximize long-term rewards independently via Q-learning. This makes the method scalable, practical, and well-suited for real-time UAV applications24.
Based on the proposed framework, the following summarizes our primary contributions:
- To enhance the long-term performance of multi-UAV downlink systems, our work focuses on jointly designing the user, power level, and sub-channel selection strategies. To ensure reliable communication, we define a constrained energy efficiency function based on quality of service (QoS) as the reward. The formulated optimization problem is distinctive in that it is both time-dependent and uncertain. To tackle this challenging problem, we propose a dynamic resource allocation method based on reward learning.
- We analyze the dynamic resource allocation problem of a multi-UAV system using a novel stochastic game formulation. In this formulation, every UAV acts as a learning agent, and every resource allocation strategy corresponds to an action of the UAV, which allows us to describe the dynamic resource allocation problem in a system of several UAVs. In particular, the actions of each UAV in the formulated stochastic game satisfy the Markov property, meaning that a UAV’s reward depends only on its current state and action. The framework can also be used to model resource allocation problems in other dynamic multi-UAV systems.
- We develop an RMAL-based resource allocation algorithm to solve the formulated stochastic game in multi-UAV systems. Each UAV acts as an independent learning agent and runs a standard Q-learning algorithm without taking the actions of the other UAVs into account. This significantly reduces both the amount of information exchanged between UAVs and the computation performed by each UAV. In addition, we show that the RMAL-based resource allocation algorithm converges.
- Simulation results are used to derive the exploitation and exploration parameters of the \(\epsilon\)-greedy algorithm under various system parameters. The simulation results also demonstrate that the RMAL-based resource allocation framework for multi-UAV systems provides a satisfactory trade-off between performance gains and the amount of information that must be exchanged.
To facilitate clarity and improve comprehension of technical terms used throughout this study, a comprehensive list of abbreviations is presented in Table 1. This table provides definitions for commonly used acronyms related to UAV-enabled communication systems, reinforcement learning, and wireless network modeling. The summarized notations serve as a reference for readers to interpret various terminologies consistently within the context of this work.
System model
We consider a multi-UAV air-to-ground (A2G) communication system, depicted in Fig. 1, that operates on a discrete timeline and comprises X single-antenna UAVs and U single-antenna users, indexed by the sets \(\text{X}=\{1,\dots,X\}\) and \(\text{U}=\{1,\dots,U\}\), respectively. The ground users are randomly distributed over a disk of radius \(O_d\). As depicted in Fig. 1, several UAVs fly over the area and communicate directly with the ground users25 via an aerial communication link. The total UAV bandwidth \(\omega\) is divided into K orthogonal sub-channels, indexed by \(\text{K}=\{1,\dots,K\}\). In addition, each UAV is expected to operate autonomously, without human interaction, according to a preprogrammed flight plan, as described in20. In other words, the trajectory of each UAV is predetermined by its flight plan. Figure 1 shows three UAVs flying over the region of interest along predetermined paths. This article studies the dynamic design of resource allocation in multi-UAV systems in terms of user, power level, and sub-channel selection. It is further assumed that the UAVs communicate without a central controller and without global knowledge of the wireless communication environment27. In other words, each UAV has only local knowledge of the CSI between itself and its users. In practice, this assumption is reasonable because of UAV mobility, as in the research contributions21,22.

The proposed system for UAV-enabled communication employs Reinforcement Learning for decentralized dynamic resource allocation.
A2G channel model
Compared with terrestrial propagation, A2G channels depend strongly on altitude, elevation angle, and propagation environment. Following3,21, we study dynamic resource allocation in multi-UAV systems under the following A2G channel models:
- The probabilistic model: As demonstrated in21,29, the A2G communication link can be modeled by a probabilistic path loss model, in which line-of-sight (LoS) and non-line-of-sight (NLoS) links are treated separately with different probabilities. According to29, the probability of a LoS connection between \({UAV}_{x}\) and ground user \(U\) in time slot D is
$$\rho^{LoS}(D)=\frac{1}{1+b\,\exp\left(-a\,\sin^{-1}\left(\frac{H}{d_{x,U}(D)-b}\right)\right)}$$
(1)
where a and b are environment-dependent constants, \(d_{x,U}(D)\) denotes the distance between \({UAV}_{x}\) and user \(U\), and H denotes the altitude of \({UAV}_{x}\).
The corresponding non-line-of-sight (NLoS) probability is:
$$\rho^{NLoS}(D)=1-\rho^{LoS}(D)$$
(2)
The LoS and NLoS path losses between \({UAV}_{x}\) and the authorized ground user U in time slot D can be expressed as follows:
$$PL_{x,U}^{LoS}(D)=L_{x,U}^{RL}(D)+\eta^{LoS}$$
(3)
$$PL_{x,U}^{NLoS}(D)=L_{x,U}^{RL}(D)+\eta^{NLoS}$$
(4)
where \(L_{x,U}^{RL}(D)=20\log\left(d_{x,U}(D)\right)+20\log\left(F_c\right)+20\log\left(\frac{4\pi}{c}\right)\) denotes the free-space path loss, \(F_c\) is the carrier frequency, and c is the speed of light. Additionally, \(\eta^{LoS}\) and \(\eta^{NLoS}\) represent the average additional path losses for LoS and NLoS links, respectively. Consequently, the average path loss between \({UAV}_{x}\) and user \(U\) during time slot D is:
$$L_{x,U}(D)=\rho^{LoS}(D)\cdot PL_{x,U}^{LoS}(D)+\rho^{NLoS}(D)\cdot PL_{x,U}^{NLoS}(D)$$
(5)
- The LoS model: As noted in8, the LoS model provides a good approximation for practical A2G communication. Under the LoS model30, the path loss between an authorized ground user and a UAV depends on their locations and on the type of propagation. The channel gains between the authorized ground users and the UAVs are computed from their relative distances using the LoS model and the free-space path loss model. The LoS channel power gain from the x-th UAV to the U-th authorized ground user in time slot D can be represented as follows:
$$g_{x,U}(D)=\alpha_0\,d_{x,U}^{-\beta}(D)=\frac{\alpha_0}{\left(\left|v_U-u_x(D)\right|^{2}+H_x^{2}\right)^{\beta/2}}$$
(6)
where \(u_x(D)=(x_x(D),y_x(D))\) denotes the horizontal position of \({UAV}_{x}\) in time slot D, \(v_U=(x_U,y_U)\) denotes the location of user \(U\), \(H_x\) is the altitude of \({UAV}_{x}\), \(\alpha_0\) denotes the channel power gain at the reference distance \(d_0=1\) m, and \(\beta\ge 2\) is the path loss exponent.
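The two channel models above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, the LoS probability uses the commonly cited sigmoid form 1 / (1 + a exp(-b(θ - a))) with the elevation angle θ in degrees, which may differ in detail from (1) as printed, and the free-space term assumes the standard expression with base-10 logarithms, distance in metres, and frequency in hertz.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def los_probability(height_m, distance_m, a, b):
    """Sigmoid LoS probability (assumed form); elevation angle in degrees."""
    theta_deg = math.degrees(math.asin(height_m / distance_m))
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))


def free_space_loss_db(distance_m, carrier_hz):
    """Standard free-space path loss in dB (assumed for the L^RL term)."""
    return (20.0 * math.log10(distance_m)
            + 20.0 * math.log10(carrier_hz)
            + 20.0 * math.log10(4.0 * math.pi / C_LIGHT))


def average_path_loss_db(height_m, distance_m, carrier_hz, a, b,
                         eta_los_db, eta_nlos_db):
    """Probability-weighted mix of the LoS and NLoS path losses, as in (3)-(5)."""
    p_los = los_probability(height_m, distance_m, a, b)
    fs = free_space_loss_db(distance_m, carrier_hz)
    return p_los * (fs + eta_los_db) + (1.0 - p_los) * (fs + eta_nlos_db)


def los_power_gain(uav_xy, user_xy, height_m, alpha0, beta):
    """LoS channel power gain of (6): alpha0 / (horizontal^2 + H^2)^(beta/2)."""
    dx = user_xy[0] - uav_xy[0]
    dy = user_xy[1] - uav_xy[1]
    return alpha0 / (dx * dx + dy * dy + height_m ** 2) ** (beta / 2.0)
```

In this sketch the average path loss always lies between the pure LoS and pure NLoS values, and the LoS gain decays with the 3D distance raised to the path loss exponent.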
The signal model
When several UAVs operate on the same sub-channel, their UAV-to-ground transmissions interfere with one another at the ground users. Let \(C_x^k(D)\) be a sub-channel indicator, where \(C_x^k(D)=1\) if \({UAV}_{x}\) occupies sub-channel \(k\) during time slot D and \(C_x^k(D)=0\) otherwise. It satisfies
$$\sum_{k\in K}C_x^k(D)\le 1$$
(7)
In other words, each UAV is restricted to a single sub-channel per time slot. Let \(a_x^U(D)\) be a user indicator, where \(a_x^U(D)=1\) if user \(U\) is served by \({UAV}_{x}\) in time slot D and \(a_x^U(D)=0\) otherwise. The SINR of the UAV-to-ground transmission between \({UAV}_{x}\) and authorized user U on sub-channel \(k\) in time slot D is then:
$$\gamma_{x,U}^{k}(D)=\frac{G_{x,U}^{k}(D)\,a_x^U(D)\,C_x^k(D)\,P_x(D)}{I_{x,U}^{k}(D)+\varphi^{2}}$$
(8)
where \(G_{x,U}^{k}(D)\) denotes the channel gain between \({UAV}_{x}\) and authorized user U on sub-channel \(k\) in time slot D, \(P_x(D)\) denotes the transmit power chosen by \({UAV}_{x}\) in time slot D, \(\varphi^2\) is the noise power, and \(I_{x,U}^{k}(D)=\sum_{i\in X,\,i\ne x}G_{i,U}^{k}(D)\,C_i^k(D)\,P_i(D)\) is the interference received from the other UAVs. For any time slot D, the SINR of \({UAV}_{x}\) can be stated as follows:
$$\gamma_x(D)=\sum_{U\in\text{U}}\sum_{k\in K}\gamma_{x,U}^{k}(D)$$
(9)
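As an illustration of (8) and (9), the following Python sketch computes the SINR from nested lists of channel gains and binary indicators. The function and argument names are hypothetical, and the sketch assumes that an interfering UAV contributes only when it occupies the same sub-channel, which is one reading of the interference term.

```python
def link_sinr(gain, assoc, subch, power, x, u, k, noise_power):
    """SINR of UAV x toward user u on sub-channel k, following (8).

    gain[i][u][k]: channel gain from UAV i to user u on sub-channel k
    assoc[i][u]:   1 if UAV i serves user u, else 0
    subch[i][k]:   1 if UAV i occupies sub-channel k, else 0
    power[i]:      transmit power of UAV i
    """
    signal = gain[x][u][k] * assoc[x][u] * subch[x][k] * power[x]
    # Assumption: only other UAVs transmitting on sub-channel k interfere.
    interference = sum(
        gain[i][u][k] * subch[i][k] * power[i]
        for i in range(len(power))
        if i != x
    )
    return signal / (interference + noise_power)


def uav_sinr(gain, assoc, subch, power, x, noise_power):
    """Total SINR of UAV x, summed over all users and sub-channels as in (9)."""
    num_users = len(assoc[x])
    num_subch = len(subch[x])
    return sum(
        link_sinr(gain, assoc, subch, power, x, u, k, noise_power)
        for u in range(num_users)
        for k in range(num_subch)
    )
```

Because each UAV selects at most one user and one sub-channel, at most one term of the double sum is non-zero.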
As in22, the UAVs employ discrete transmit power control to manage interference and optimize communication performance within the network. The vector \(P=\{P_1,\dots,P_I\}\) lists the transmit power values that each UAV can use to communicate with its associated user. For each \({UAV}_{x}\), a binary variable \(\mathcal{P}_x^i(D)\), \(i\in I=\{1,\dots,I\}\), is defined, where \(\mathcal{P}_x^i(D)=1\) if \({UAV}_{x}\) transmits in time slot D with power level \(P_i\) and \(\mathcal{P}_x^i(D)=0\) otherwise. Note that \({UAV}_{x}\) may choose only a single power level in each time slot D:
$$\:\sum_{i\:\in\:\:I}{\mathcal{P}}_{x}^{i}\left(D\right)\:\le\:1,\:\:\:\:\:\:\forall\:x\:\in\:\:X$$
(10)
Consequently, the finite set of power level selections available to \({UAV}_{x}\) is:
$$\rho_x=\left\{p_x(D)\in P\;\middle|\;\sum_{i\in I}p_x^i(D)\le 1\right\},\quad\forall x\in X.$$
(11)
Similarly, the finite sets of sub-channel selections and user selections of \({UAV}_{x}\) are, respectively:
$$C_x=\left\{c_x(D)\in K\;\middle|\;\sum_{k\in K}c_x^k(D)\le 1\right\},\quad\forall x\in X.$$
(12)
$$A_x=\left\{a_x(D)\in\text{U}\;\middle|\;\sum_{U\in\text{U}}a_x^U(D)\le 1\right\},\quad\forall x\in X.$$
(13)
Furthermore, we assume that the multi-UAV system operates in discrete time, with the timeline divided into equal, non-overlapping time slots, and that the communication parameters remain unchanged within each time slot. Let the integer D denote the time slot index. Specifically, each UAV keeps the CSI of its authorized ground users and its decisions for a preset interval of \(T_S\ge 1\) time slots, which is referred to as the decision cycle. We consider the following scheduling of UAV transmissions: each UAV is assigned a time slot D in which to begin transmission, and its handover must be completed by the end of its decision cycle, that is, in time slot \(D+T_S\). We assume that the UAVs do not know the precise amount of time they will spend in the network. This motivates an online learning approach that maximizes the long-term energy efficiency of multi-UAV networks31.
The framework of stochastic game for multi-UAVs systems
This section first describes the optimization problem addressed in this study. To model the randomness of the environment, a stochastic game is then used to formulate the joint user, power level, and sub-channel selection problem.
Problem formulation
Note from (8) that if each UAV transmits at full power to maximize its own throughput, it causes greater interference to the other UAVs32. To ensure reliable communication, the primary objective of the dynamic design of user, power level, and sub-channel selection is to ensure that the SINR achieved by each UAV does not fall below a predetermined threshold33. Mathematically, this requirement can be written as:
$$\:{\gamma\:}_{x}\left(D\right)\:\ge\:\:{\gamma\:}^{{\prime\:}},\:\:\:\forall\:x\:\in\:X.\:$$
(14)
where \(\gamma'\) is the target QoS threshold for UAV users. If constraint (14) is satisfied in time slot D, \({UAV}_{x}\) obtains a reward \(Sr_x(D)\), defined as the difference between the throughput achieved with the selected user, power level, and sub-channel and the corresponding power cost; otherwise, it receives no reward. The reward function of \({UAV}_{x}\) in time slot D is therefore:
$$Sr_x(D)=\begin{cases}\frac{W}{K}\log\left(1+\gamma_x(D)\right)-\mathcal{w}_x P_x(D), & \text{if }\gamma_x(D)\ge\gamma',\\ 0, & \text{otherwise,}\end{cases}$$
(15)
for every \(x\in X\), where \(Sr_x(D)\) is the instantaneous payoff and \(\mathcal{w}_x\) is the cost per unit of transmit power. The instantaneous reward of \({UAV}_{x}\) in any time slot D depends on the following:
a) Unobserved information: the sub-channels and power levels selected by the other UAVs, as well as their channel gains. Note that we exclude the UAV’s fixed energy consumption, such as that of the control unit and data processing23.
b) Observed information: the user, power level, and sub-channel decisions of \({UAV}_{x}\) itself, i.e., \(a_x(D)\), \(C_x(D)\), and \(\mathcal{P}_x(D)\), together with the current channel gain \(G_{x,U}^{k}(D)\).
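The reward in (15) can be sketched in a few lines of Python. The function name and arguments are hypothetical, and the logarithm is taken to base 2 as an assumption, since the base is not stated; a different base only rescales the throughput term.

```python
import math


def instantaneous_reward(sinr, tx_power, bandwidth, num_subchannels,
                         power_cost, sinr_threshold):
    """Reward of one UAV in one time slot, following (15).

    Returns throughput minus power cost when the QoS threshold is met,
    and zero otherwise.
    """
    if sinr < sinr_threshold:
        return 0.0
    # Assumption: base-2 logarithm, i.e., throughput in bits per second.
    throughput = (bandwidth / num_subchannels) * math.log2(1.0 + sinr)
    return throughput - power_cost * tx_power
```

For example, with a SINR of 3, a bandwidth of 4 split over 2 sub-channels, unit power cost 0.5, and transmit power 2, the throughput term is 2 · log2(4) = 4 and the reward is 4 − 1 = 3, provided the threshold is not above 3.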
To maximize its long-term reward, each UAV selects the served user, transmit power level, and sub-channel in every time slot34. Specifically, we use the discounted future reward24 as the evaluation criterion for each UAV. That is, at any point in the process, the discounted reward equals the reward of the current period plus the future rewards discounted by a constant factor. Consequently, the long-term reward of \({UAV}_{x}\) is given by:
$$\mathcal{v}_x(D)=\sum_{\tau=0}^{+\infty}\varDelta^{\tau}\,Sr_x(D+\tau+1),$$
(16)
where \(0\le\varDelta<1\) is the discount factor. If \(\varDelta\) is near 0, decisions emphasize short-term gain; if \(\varDelta\) is close to 1, far-sighted decisions are made. This value therefore captures how strongly future rewards influence the optimal decisions.
In Eq. (16), the parameter τ is the time-step offset into the future used to compute the discounted cumulative reward from the current time slot D onward. It starts at τ = 0 and increases without bound, reflecting the forward-looking nature of reinforcement learning, in which agents aim to optimize long-term as well as immediate outcomes. The term Δ^τ is the discount weight that reduces the impact of future rewards as τ increases, making the algorithm focus on near-term performance when Δ is small and on long-term performance when Δ approaches 1. Although the sum in (16) runs over an infinite horizon, the contribution of distant rewards becomes negligible for Δ < 1 and large τ, so the sum converges when the rewards are bounded. The cumulative reward vₓ(D) is central to evaluating the utility of a UAV’s current policy and drives the updates in Q-learning.
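A truncated version of the sum in (16) can be evaluated as in the following Python sketch. The function names are hypothetical, and the tail bound assumes that every reward is bounded in magnitude by a known constant, which is an added assumption used only to illustrate why distant terms can be neglected.

```python
def discounted_return(future_rewards, discount):
    """Finite-horizon version of (16).

    future_rewards[tau] plays the role of Sr_x(D + tau + 1).
    """
    total = 0.0
    # Horner-style evaluation from the most distant reward backwards.
    for reward in reversed(future_rewards):
        total = reward + discount * total
    return total


def tail_bound(max_abs_reward, discount, horizon):
    """Upper bound on the discarded tail, sum over tau >= horizon.

    Assumes |reward| <= max_abs_reward and 0 <= discount < 1.
    """
    return max_abs_reward * discount ** horizon / (1.0 - discount)
```

With three unit rewards and a discount factor of 0.5, the truncated sum is 1 + 0.5 + 0.25 = 1.75 and the discarded tail is at most 0.25, which together match the infinite-horizon value 1 / (1 − 0.5) = 2.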
Next, we denote the set of all possible user, sub-channel, and power level decisions of \({UAV}_{x}\), \(x\in X\), by \(\Phi_x=A_x\otimes C_x\otimes\rho_x\), where \(\otimes\) denotes the Cartesian product. The goal of each \({UAV}_{x}\) is to make the decision \(\Omega_x^*(D)=\left(a_x^*(D),C_x^*(D),\mathcal{P}_x^*(D)\right)\in\Phi_x\) that maximizes its long-term performance (16). The optimization problem of \({UAV}_{x}\), \(x\in X\), can therefore be stated as follows:
$$\Omega_x^*(D)=\arg\max_{\Omega_x\in\Phi_x}Sr_x(D)$$
(17)
The optimal design of the multi-UAV system under consideration therefore comprises X sub-problems, one for each of the X UAVs. Moreover, since each UAV lacks knowledge about the other UAVs, such as their rewards, problem (17) cannot be solved exactly.
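The action set of a single UAV, formed as a Cartesian product of its user, sub-channel, and power level choices, can be enumerated as in the following Python sketch. The names are hypothetical, the sketch assumes that exactly one user, one sub-channel, and one power level are chosen per time slot, and the exhaustive search over a given reward function is only an idealized reference, since a UAV cannot evaluate the true reward without knowing the other UAVs' actions.

```python
from itertools import product


def action_space(num_users, num_subchannels, power_levels):
    """All (user, sub-channel, power) triples available to one UAV."""
    return list(product(range(num_users),
                        range(num_subchannels),
                        power_levels))


def best_action(actions, reward_fn):
    """Exhaustive search for the action with the largest reward.

    reward_fn is a hypothetical callable mapping an action triple to a
    scalar; in the actual system this value is not known in advance.
    """
    return max(actions, key=reward_fn)
```

The size of the action set grows as the product of the number of users, sub-channels, and power levels, which is one reason a learning-based approach is attractive when the reward cannot be evaluated in closed form.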
In the following subsections, we formulate the joint user, sub-channel, and power level selection problem as a non-cooperative stochastic game in order to solve the optimization problem (17) in a random environment.
Equation (17) formulates the optimization problem for each UAV as a single-agent objective, aiming to select a combination of user, sub-channel, and power level Ωₓ*(D) ∈ Φₓ that maximizes its instantaneous reward Srₓ(D). However, in a multi-UAV environment, each UAV’s reward is influenced not only by its own action but also by the simultaneous actions of other UAVs due to interference and shared sub-channels. Therefore, the independent optimization of Eq. (17) becomes coupled and interdependent, necessitating a game-theoretic formulation. To capture this interdependence, we reformulate the problem as a stochastic game (Markov game) where each UAV is a rational agent44. The global system state evolves over time, and each UAV selects its strategy based on its observed state. The key to solving this game lies in identifying a Nash equilibrium: a set of strategies µ* = [µ₁*, µ₂*, …, µ_X*] where no UAV can improve its expected cumulative reward by unilaterally deviating from its strategy, given the strategies of others.
Formulation of stochastic game
In this section, we model problem (17) using the framework of a stochastic game (also known as a Markov game)25, which generalizes the Markov decision process to the case of multiple agents.
In the network under consideration, the X UAVs communicate with their users without knowledge of the operating environment. We assume that all UAVs are rational and self-interested. Thus, to maximize the long-term return in (17), each UAV selects its action independently in every time slot D. The action of each \({UAV}_{x}\) is chosen from its action space \(\Phi_x\). The triple \(\Omega_x(D)=\left(a_x(D),C_x(D),\mathcal{P}_x(D)\right)\in\Phi_x\) represents the action performed by \({UAV}_{x}\) in time slot D, where \(a_x(D)\), \(C_x(D)\), and \(\mathcal{P}_x(D)\) denote the user selection, sub-channel, and power level of \({UAV}_{x}\) in time slot D, respectively. For each \({UAV}_{x}\), \(\Omega_{-x}(D)\in\Phi\setminus\Phi_x\) represents the actions performed in time slot D by the other \(X-1\) UAVs.
As a result, the instantaneous SINR of \(\:{\:UAV}_{x}\)in time slot D can be expressed as follows:
$$\gamma_x(D)\left\{\Omega_x(D),\Omega_{-x}(D),G_x(D)\right\}=\sum_{U\in\text{U}}\sum_{k\in K}\frac{Ds_{x,U}^{k}(D)\left\{\Omega_x(D),\Omega_{-x}(D),G_{x,U}(D)\right\}}{I_{x,U}^{k}(D)\left\{\Omega_x(D),\Omega_{-x}(D),G_{x,U}(D)\right\}+\varphi^{2}}$$
(18)
where \(Ds_{x,U}^{k}(D)=G_{x,U}^{k}(D)\,a_x^U(D)\,c_x^k(D)\,P_x(D)\) and \(I_{x,U}^{k}(D)(\cdot)\) is the interference term defined after (8). Additionally, \(G_{x,U}(D)\) denotes the matrix of instantaneous channel responses between the UAVs and authorized ground user U in time slot D:
$$G_{x,U}(D)=\begin{bmatrix}G_{1,U}^{1}(D)&\cdots&G_{1,U}^{K}(D)\\\vdots&\ddots&\vdots\\G_{X,U}^{1}(D)&\cdots&G_{X,U}^{K}(D)\end{bmatrix}$$
(19)
with \(G_{x,U}(D)\in\mathbb{R}^{X\times K}\) for all \(x\in X\) and \(U\in\text{U}\).
Each \({UAV}_{x}\) can measure its current SINR level \(\gamma_x(D)\) in any time slot D. Consequently, the state \(S_x(D)\) of each \({UAV}_{x}\), \(x\in X\), is fully observed and is given by:
$$S_x(D)=\begin{cases}1,&\text{if }\gamma_x(D)\ge\gamma',\\0,&\text{otherwise.}\end{cases}$$
(20)
Let the state vector for all UAVs be \(\:S=[{S}_{1},\dots\:\dots\:\dots\:,\:{S}_{X}]\). As UAVs cannot cooperate, the \(\:{UAV}_{x}\) in this article is unaware of the states of the other UAVs.
We assume that each UAV’s actions satisfy the Markov property, which means that a UAV’s reward depends only on its current state and action. According to26, a Markov chain describes the state dynamics of a stochastic game in which each player has a single action in each state38. The Markov chain is formally defined as follows.
Definition 1
A finite-state Markov chain is a discrete stochastic process defined as follows. Let the set of states \(Ds=[S_1,\dots,S_q]\) be finite, and let E be a \(q\times q\) transition matrix with entries \(0\le E_{i,j}\le 1\) and \(\sum_{j=1}^{q}E_{i,j}=1\) for every \(1\le i\le q\).
The process moves step by step from one state to the next. Assume that the chain is currently in state \(S_i\). The probability that the next state is \(S_j\) is
$$\Pr\left[S(D+1)=S_j\mid S(D)=S_i\right]=E_{i,j}$$
(21)
Because this probability depends only on the current state and not on any past states, the process is said to satisfy the Markov property.
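Definition 1 can be made concrete with a small simulation. The Python sketch below uses hypothetical function names, checks that a matrix is a valid transition matrix (square, entries in [0, 1], rows summing to one), and samples a state trajectory in which each step depends only on the current state. Any transition probabilities passed to it are illustrative inputs, not values from this study.

```python
import random


def is_transition_matrix(E, tol=1e-9):
    """True if E is square, has entries in [0, 1], and rows that sum to 1."""
    return all(
        len(row) == len(E)
        and all(0.0 <= p <= 1.0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in E
    )


def simulate_chain(E, start, steps, seed=0):
    """Sample a state trajectory of length steps + 1 from the chain.

    The next state is drawn using only the row of the current state,
    which is exactly the Markov property of Definition 1.
    """
    rng = random.Random(seed)
    states = [start]
    for _ in range(steps):
        current = states[-1]
        nxt = rng.choices(range(len(E)), weights=E[current], k=1)[0]
        states.append(nxt)
    return states
```

A two-state chain is a natural toy case here, since the state of each UAV in (20) is binary.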
Consequently, the \(\:{UAV}_{x}\) reward function, \(\:x\:\in\:X\), can be expressed as
$$O_x^D=Sr_x\left(\Omega_x^D,\Omega_{-x}^D,S_x^D\right)=S_x^D\left(c_x^D\left\{\Omega_x^D,\Omega_{-x}^D,G_x^D\right\}-\mathcal{w}_x P_x\left\{\Omega_x^D\right\}\right).$$
(22)
For the sake of compact notation, the time slot index D is expressed in superscript here. This notation will also be used for notational simplicity in the next sections.In (22), the action \(\:{{\Omega\:}}_{x}^{D}\) determines the instantaneous transmit power, while the UAV’s instantaneous rate is given by
$${C}_{x}^{D}\left\{{\Omega}_{x}^{D},{\Omega}_{-x}^{D},{G}_{x}^{D}\right\}=\frac{W}{K}\text{log}\left(1+{\gamma}_{x}\left({\Omega}_{x}^{D},{\Omega}_{-x}^{D},{G}_{x}^{D}\right)\right)$$
(23)
From (22), the reward \({O}_{x}^{D}\) that \({UAV}_{x}\) receives in time slot D is determined by the current state \({S}_{x}^{D}\), which is fully observed, and by the actions (\({\Omega}_{x}^{D},{\Omega}_{-x}^{D}\)), which are only partially observed. At the next time slot D + 1, \({UAV}_{x}\) moves to a new random state \({S}_{x}^{D+1}\) whose probabilities depend only on the previous state \({S}_{x}^{D}\) and the chosen actions (\({\Omega}_{x}^{D},{\Omega}_{-x}^{D}\)). This process repeats until all time slots have elapsed. At any time slot D, \({UAV}_{x}\) can observe its own state \({S}_{x}^{D}\) and its own action \({\Omega}_{x}^{D}\), but it knows neither the other players’ actions \({\Omega}_{-x}^{D}\) nor the precise values of \({G}_{x}^{D}\). Each player \({UAV}_{x}\) is also unaware of the state transition probabilities. The UAV system under consideration can therefore be formulated as a stochastic game27.
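The reward in (22) and the rate in (23) can be evaluated with a short sketch. The function names and the numerical inputs are hypothetical, and the base-2 logarithm is an assumption, since (23) does not state the base.

```python
import math

def rate(sinr_linear, bandwidth_hz, num_subchannels):
    """Eq. (23): (W / K) * log(1 + SINR). Base 2 is an assumption (bits/s)."""
    return (bandwidth_hz / num_subchannels) * math.log2(1.0 + sinr_linear)

def reward(state, sinr_linear, power, unit_cost, bandwidth_hz, num_subchannels):
    """Eq. (22): S * (C - w * P), where the state S is 0 or 1."""
    c = rate(sinr_linear, bandwidth_hz, num_subchannels)
    return state * (c - unit_cost * power)

# Hypothetical inputs: W = 195 kHz, K = 3, linear SINR of 3, P = 0.1, w = 100.
r_ok = reward(1, 3.0, 0.1, 100.0, 195e3, 3)    # QoS met (S = 1)
r_fail = reward(0, 3.0, 0.1, 100.0, 195e3, 3)  # QoS not met (S = 0)
```

With the state equal to zero the reward vanishes regardless of the rate, which is how (22) ties the payoff to the QoS indicator.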
Definition 2
A stochastic game can be defined by a tuple \(\phi=(Ds,X,{\Phi},E,Sr)\), where
-
\(Ds\) denotes the state set, with \(Ds={Ds}_{1}\times\dots\times{Ds}_{X}\) and \({Ds}_{x}\in\left\{0,1\right\}\) for all \(x\in X\);
-
\(X\) is the set of players;
-
\({\Phi}_{x}\) is the action set of player \({UAV}_{x}\), while \({\Phi}\) is the joint action set;
-
\(E\) is the state transition probability function, which depends on the actions of all players. Specifically, \(E\left({S}_{x}^{D},{\Omega},{S}_{x}^{D+1}\right)=\text{Pr}\left[{S}_{x}^{D+1}\,|\,{S}_{x}^{D},{\Omega}\right]\) is the probability that the current state \({S}_{x}^{D}\) changes to the next state \({S}_{x}^{D+1}\) when the joint action \({\Omega}=\left[{\Omega}_{1},\dots,{\Omega}_{X}\right]\in{\Phi}\) is carried out;
-
\(Sr=[{Sr}_{1},\dots,{Sr}_{X}]\), where \({Sr}_{x}:{\Phi}\times Ds\to\mathbb{R}\) is the real-valued reward function of player \(x\).
In a stochastic game, a mixed strategy \({\mu}_{x}:{Ds}_{x}\to{\Phi}_{x}\) is a collection of probability distributions over the available actions, relating the state set to the action set. More specifically, the mixed strategy of \({UAV}_{x}\) in state \({S}_{x}\) is \({\mu}_{x}\left({S}_{x}\right)=[{\mu}_{x}\left({S}_{x},{\Omega}_{x}\right)\,|\,{\Omega}_{x}\in{\Phi}_{x}]\), where each element \({\mu}_{x}({S}_{x},{\Omega}_{x})\) is the probability that \({UAV}_{x}\) selects action \({\Omega}_{x}\) in state \({S}_{x}\). A joint strategy is a vector of strategies for the X players, one for each player, of the form \(\mu=[{\mu}_{1}\left({S}_{1}\right),\dots,{\mu}_{X}\left({S}_{X}\right)]\). Let \({\mu}_{-x}=[{\mu}_{1},\dots,{\mu}_{x-1},{\mu}_{x+1},\dots,{\mu}_{X}]\) denote the same strategy profile without the strategy \({\mu}_{x}\) of player \({UAV}_{x}\). Given these definitions, each player \({UAV}_{x}\) in the stochastic game aims to maximize its expected payoff over time. Under a joint strategy \(\mu\) that assigns a strategy \({\mu}_{\mathcal{i}}\) to each \({UAV}_{\mathcal{i}}\), the goal in (14) may be restated for player \({UAV}_{x}\) as
$${Sf}_{x}\left(S,\mu\right)=F\left[\sum_{\tau=0}^{+\infty}{\varDelta}^{\tau}{O}_{x}^{D+\tau+1}\,|\,{S}^{D}=S\right],$$
(24)
where \({O}_{x}^{D+\tau+1}\) is the instantaneous reward received by \({UAV}_{x}\) at time \(D+\tau+1\) and \(F\left[\cdot\right]\) denotes the expectation operator. In the defined stochastic game, the expected reward of each UAV depends on the joint strategy rather than on its own strategy alone. Because the players cannot all maximize their expected rewards simultaneously, it is unrealistic to simply expect every player to do so. We therefore consider a Nash equilibrium solution for the stochastic game28.
Definition 3
A Nash equilibrium is a collection of strategies, one for each player, in which each strategy is a best response to the strategies of the other players. In other words, if the Nash equilibrium solution is \({\mu}^{*}=[{\mu}_{1}^{*},\dots,{\mu}_{X}^{*}]\), then for any \({UAV}_{x}\) the strategy \({\mu}_{x}^{*}\) satisfies
$${Sf}_{x}\left({\mu}_{x}^{*},{\mu}_{-x}^{*}\right)\ge{Sf}_{x}\left({\mu}_{x}^{\prime},{\mu}_{-x}^{*}\right),\quad\forall{\mu}_{x}^{\prime}.$$
(25)
This means that, in a Nash equilibrium, the strategy of each UAV is the best response to the decisions made by the other UAVs. As long as all other UAVs keep their current strategies, no UAV can gain by changing its own. Note that the imperfect information structure of the non-cooperative stochastic game allows the players to interact repeatedly with the stochastic environment and learn their best strategy. Each player \({UAV}_{x}\) is treated as a learning agent whose goal is to find a Nash equilibrium strategy for every state \({S}_{x}\). In the following section, the RMAL framework is presented as a means of optimizing the sum of expected rewards (22) using partial information.
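The best-response condition in (25) can be checked by enumeration in a much simpler setting. The sketch below is a deliberately reduced, stateless, pure-strategy example and not the stochastic game defined above: two hypothetical UAVs each pick one of two sub-channels, and the payoff values (3 when alone on a sub-channel, 1 when sharing it) are invented for illustration.

```python
from itertools import product

def payoff(x, profile):
    """Hypothetical payoff: 3 if UAV x is alone on its sub-channel, else 1."""
    clash = any(profile[y] == profile[x]
                for y in range(len(profile)) if y != x)
    return 1.0 if clash else 3.0

def is_pure_nash(payoff_fn, profile, num_actions):
    """True if no single player gains by deviating unilaterally."""
    for x in range(len(profile)):
        for alt in range(num_actions):
            deviated = profile[:x] + (alt,) + profile[x + 1:]
            if payoff_fn(x, deviated) > payoff_fn(x, profile):
                return False
    return True

# Enumerate all joint actions of two UAVs with two sub-channels each.
equilibria = [p for p in product(range(2), repeat=2)
              if is_pure_nash(payoff, p, 2)]
```

In this toy game the equilibria are the two profiles in which the UAVs use different sub-channels, since either UAV would lower its own payoff by moving onto the occupied one.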
The proposed solution
In this section, the RMAL framework for multi-UAV systems is introduced. A Q-learning-based resource allocation scheme is then proposed to maximize the expected long-term reward of the multi-UAV system under consideration.

RMAL framework for multi-UAV Systems.
RMAL framework for multi-UAV systems
Figure 2 depicts the main components of the RMAL framework studied in this work. For each \({UAV}_{x}\), it shows the information observed locally in time slot D, namely the state \({S}_{x}^{D}\) and the reward \({O}_{x}^{D}\), together with the action that \({UAV}_{x}\) takes in time slot D. When all other players adopt a fixed strategy profile, the decision problem faced by a player in a stochastic game is equivalent to a Markov decision scheme (MDS)26. Each agent runs the decision algorithm independently while following a common structure based on Q-learning. The dynamics of the environment satisfy the Markov property, and the rewards received by the UAVs depend on their current state and action39. The MDS of an agent \({UAV}_{x}\) consists of the following elements:
-
A discrete set of environmental states represented by \(\:{s}_{x}\);
-
A discrete set of possible actions represented by \(\:{\varphi\:}_{x}\);
-
The state transition probabilities, which describe the dynamics of the environment across time slots, \({E}_{{S}_{x}^{D}\to{S}_{x}^{D+1}}=E\left({S}_{x}^{D},{\Omega},{S}_{x}^{D+1}\right)\) for all \({\Omega}_{x}\in{\varphi}_{x}\) and \({S}_{x}^{D},{S}_{x}^{D+1}\in{s}_{x}\);
-
A reward function \({Sr}_{x}\) that gives the expected value of the next reward of \({UAV}_{x}\).
For example, if the current state is \({S}_{x}\), the action \({\Omega}_{x}\) is taken, and the next state is \({S}_{x}^{\prime}\), then \({Sr}_{x}\left({S}_{x},{\Omega}_{x},{S}_{x}^{\prime}\right)=F\left[{O}_{x}^{D+1}\,|\,{S}_{x}^{D}={S}_{x},{\Omega}_{x}^{D}={\Omega}_{x},{S}_{x}^{D+1}={S}_{x}^{\prime}\right]\), where \({O}_{x}^{D+1}\) is the immediate reward that the environment returns to \({UAV}_{x}\) at time \(D+1\). Because the UAVs cannot communicate with one another, each UAV has only partial knowledge of the stochastic environment in which it operates. In this study, Q-learning29 is used to solve MDSs in which the learning agents operate in an unknown stochastic environment and know neither the reward function nor the transition function. The Q-learning technique used to solve the MDS of a UAV is described next; without loss of generality, we consider \({UAV}_{x}\). The two key concepts required to solve the MDS are the state-value function and the action-value function, commonly known as the Q function30.
More precisely, the former is the expected reward for a given state in (22) when the agent follows a specified policy. Similarly, the Q function of \({UAV}_{x}\) is the expected reward obtained when starting in state \({S}_{x}\), taking action \({\Omega}_{x}\), and thereafter following policy \(\mu\), which can be written as
$${Ql}_{x}\left({S}_{x},{\Omega}_{x},\mu\right)=F\left[\sum_{\tau=0}^{+\infty}{\varDelta}^{\tau}{O}_{x}^{D+\tau+1}\,|\,{S}_{x}^{D}={S}_{x},{\Omega}_{x}^{D}={\Omega}_{x}\right]$$
(26)
where the value that corresponds to Eq. (26) is referred to as the action value or the Q-value.
Proposition 1
A recursive relation for the state-value function can be derived from the definition of the expected return. Specifically, for any policy \(\mu\) and any state \({S}_{x}\), the following consistency condition holds between the values of two consecutive states \({S}_{x}^{D}={S}_{x}\) and \({S}_{x}^{D+1}={S}_{x}^{\prime}\), with \({S}_{x},{S}_{x}^{\prime}\in{s}_{x}\):
$$\begin{aligned}{Sf}_{x}\left({S}_{x},\mu\right)&=F\left[\sum_{\tau=0}^{+\infty}{\varDelta}^{\tau}{O}_{x}^{D+\tau+1}\,|\,{S}_{x}^{D}={S}_{x}\right]\\&=\sum_{{\Omega}\in\varphi}\prod_{i\in X}{\mu}_{i}\left({S}_{i},{\Omega}_{i}\right)\sum_{{S}_{x}^{\prime}\in{s}_{x}}E\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)\left({Sr}_{x}\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)+\varDelta\,{Sf}_{x}\left({S}_{x}^{\prime},\mu\right)\right)\end{aligned}$$
(27)
where \({\mu}_{i}\left({S}_{i},{\Omega}_{i}\right)\) is the probability that \({UAV}_{i}\) selects action \({\Omega}_{i}\) in state \({S}_{i}\).
Note that the state-value function \({Sf}_{x}\left({S}_{x},\mu\right)\) denotes the expected reward obtained when starting in state \({S}_{x}\) and thereafter following strategy \(\mu\). Based on Proposition 1, the Q function in Eq. (26) can also be rewritten recursively as
$$\begin{aligned}{Ql}_{x}\left({S}_{x},{\Omega}_{x},\mu\right)&=F\left\{{O}_{x}^{D+1}+\varDelta\sum_{\tau=0}^{+\infty}{\varDelta}^{\tau}{O}_{x}^{D+\tau+2}\,|\,{S}_{x}^{D}={S}_{x},{\Omega}_{x}^{D}={\Omega}_{x}\right\}\\&=\sum_{{\Omega}_{-x}\in{\varphi}_{-x}}\prod_{i\in X\setminus\left[x\right]}{\mu}_{i}\left({S}_{i},{\Omega}_{i}\right)\sum_{{S}_{x}^{\prime}\in{s}_{x}}E\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)\left({Sr}_{x}\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)+\varDelta\,{Sf}_{x}\left({S}_{x}^{\prime},\mu\right)\right)\end{aligned}$$
(28)
Note that, from (26), the behaviour of every UAV is governed by its Q-values. Equations (27) and (28) are the fundamental building blocks of the Q-learning-based reinforcement learning method used to solve the MDS of each UAV36. They also yield the following relation between state values and Q-values:
$${Sf}_{x}\left({S}_{x},\mu\right)=\sum_{{\Omega}_{x}\in{\varphi}_{x}}{\mu}_{x}\left({S}_{x},{\Omega}_{x}\right){Ql}_{x}\left({S}_{x},{\Omega}_{x},\mu\right).$$
(29)
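Relation (29) is a policy-weighted average of Q-values, which a few lines make concrete. The action labels and Q-values below are hypothetical; the sketch also shows numerically that no mixed strategy can exceed the largest Q-value.

```python
def state_value(policy_row, q_row):
    """Eq. (29): sum over actions of mu(S, a) * Ql(S, a, mu)."""
    return sum(policy_row[a] * q_row[a] for a in q_row)

# Hypothetical Q-values of one UAV in one state, indexed by power level.
q_row = {"low": 1.0, "mid": 4.0, "high": 2.0}

uniform = {a: 1.0 / len(q_row) for a in q_row}
greedy = {a: (1.0 if a == "mid" else 0.0) for a in q_row}

v_uniform = state_value(uniform, q_row)  # average of the three Q-values
v_greedy = state_value(greedy, q_row)    # equals the largest Q-value
```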
As noted above, the objective of solving the MDS is to find the strategy that yields the greatest possible payoff. In terms of the state-value function41, the optimal state value for \({UAV}_{x}\) in state \({S}_{x}\) is
$${Sf}_{x}^{*}\left({S}_{x}\right)=\max_{{\mu}_{x}}{Sf}_{x}\left({S}_{x},\mu\right),\quad{S}_{x}\in{s}_{x}$$
(30)
Likewise, the optimal Q-values are
$${Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)=\max_{{\mu}_{x}}{Ql}_{x}\left({S}_{x},{\Omega}_{x},\mu\right),\quad{S}_{x}\in{s}_{x},\ {\Omega}_{x}\in{\varphi}_{x}$$
(31)
Substituting Eq. (28) into Eq. (29), the optimal state-value equation can be rewritten as
$${Sf}_{x}^{*}\left({S}_{x}\right)=\max_{{\Omega}_{x}}{Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)$$
(32)
Note that (32) follows from the fact that \(\sum_{{\Omega}_{x}}\mu\left({S}_{x},{\Omega}_{x}\right){Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)\le\max_{{\Omega}_{x}}{Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)\). Note also that the optimal state-value equation in Eq. (32) maximizes over the action space rather than the strategy space. Combining Eq. (32) with Eqs. (27) and (28), respectively, gives the Bellman optimality equations for state values and Q-values42:
$$\begin{aligned}{Sf}_{x}^{*}\left({S}_{x}\right)&=\sum_{{\Omega}_{-x}\in{\varphi}_{-x}}\prod_{i\in X\setminus\left[x\right]}{\mu}_{i}\left({S}_{i},{\Omega}_{i}\right)\times\\&\quad\max_{{\Omega}_{x}}\sum_{{S}_{x}^{\prime}}E\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)\left\{{Sr}_{x}\left({S}_{x},{\Omega}_{x},{S}_{x}^{\prime}\right)+\varDelta\,{Sf}_{x}^{*}\left({S}_{x}^{\prime}\right)\right\}\end{aligned}$$
(33)
and
$$\begin{aligned}{Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)&=\sum_{{\Omega}_{-x}\in{\varphi}_{-x}}\prod_{i\in X\setminus\left[x\right]}{\mu}_{i}\left({S}_{i},{\Omega}_{i}\right)\times\sum_{{S}_{x}^{\prime}}E\left({S}_{x},{\Omega},{S}_{x}^{\prime}\right)\\&\quad\left\{{Sr}_{x}\left({S}_{x},{\Omega}_{x},{S}_{x}^{\prime}\right)+\varDelta\max_{{\Omega}_{x}^{\prime}}{Ql}_{x}^{*}\left({S}_{x}^{\prime},{\Omega}_{x}^{\prime}\right)\right\}\end{aligned}$$
(34)
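When the strategies of the other UAVs are held fixed, their effect can be folded into the transition and reward terms, and (34) reduces to a single-agent Bellman optimality equation that can be solved by repeated substitution. The sketch below does this for a hypothetical two-state, two-action problem; the arrays `E` and `Sr`, the discount value 0.9 and the number of sweeps are illustrative assumptions, and the transition probabilities are assumed to be known, which is not the case in the learning setting considered here.

```python
def q_value_iteration(E, Sr, discount, sweeps=500):
    """Iterate Q(s,a) = sum_s' E[s][a][s'] * (Sr[s][a][s'] + discount * max_a' Q(s',a'))."""
    states = range(len(E))
    actions = range(len(E[0]))
    Q = [[0.0 for _ in actions] for _ in states]
    for _ in range(sweeps):
        # Synchronous sweep: the right-hand side uses the previous Q table.
        Q = [[sum(E[s][a][s2] * (Sr[s][a][s2] + discount * max(Q[s2]))
                  for s2 in states)
              for a in actions]
             for s in states]
    return Q

# Hypothetical model: state 1 means the QoS target is met.
# Action 0 (low power) reaches state 1 w.p. 0.3, action 1 (high power) w.p. 0.8.
E = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.7, 0.3], [0.2, 0.8]]]
# Reward on reaching state 1: rate minus power cost (illustrative numbers).
Sr = [[[0.0, 1.5], [0.0, 0.5]],
      [[0.0, 1.5], [0.0, 0.5]]]

Q_star = q_value_iteration(E, Sr, discount=0.9)
V_star = [max(row) for row in Q_star]  # Eq. (32): optimal state value
```

In this example the low-power action has the larger Q-value in both states, because its lower success probability is outweighed by its lower cost.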
The optimal strategy is always the one that maximizes the Q-function of the current state in (34), which follows from the optimal policy of always selecting the action with the highest value43. In the multi-agent case, however, the Q-function of each agent depends on the joint action and the joint strategy, which makes it difficult to find an optimal joint strategy30. To address this issue, we treat the UAVs as independent learners (ILs). That is, each UAV cannot observe the rewards and actions of the other UAVs, so it acts and interacts with its environment as if no other UAVs were present.
Resource allocation based on Q-learning for Multi-UAVs systems
In this section, an ILs31 based RMAL algorithm is developed to address the resource allocation problem among UAVs. Each UAV runs a standard Q-learning procedure to learn its optimal Q-values and thereby determines the optimal policy for its MDS45. More specifically, the action selected in each iteration depends on the Q-values of the current state \({S}_{x}\) and of its successor states. The Q-values therefore carry information about the actions that will be taken in subsequent states. The update rule for Q-learning is
$$\begin{aligned}{Ql}_{x}^{D+1}\left({S}_{x},{\Omega}_{x}\right)={Ql}_{x}^{D}\left({S}_{x},{\Omega}_{x}\right)+{\beta}_{D}\left[{O}_{x}^{D}+\varDelta\max_{{\Omega}_{x}^{\prime}\in{\varphi}_{x}}{Ql}_{x}^{D}\left({S}_{x}^{\prime},{\Omega}_{x}^{\prime}\right)-{Ql}_{x}^{D}\left({S}_{x},{\Omega}_{x}\right)\right]\end{aligned}$$
(35)
with \({S}_{x}^{D}={S}_{x}\) and \({\Omega}_{x}^{D}={\Omega}_{x}\), where \({S}_{x}^{\prime}\) and \({\Omega}_{x}^{\prime}\) correspond to \({S}_{x}^{D+1}\) and \({\Omega}_{x}^{D+1}\), respectively. Note that the optimal action-value function can be obtained by iteratively updating the action values46. Specifically, each agent learns the optimal action values by following the update rule in Eq. (35), where \({Ql}_{x}^{D}\) is the action value of \({UAV}_{x}\) in time slot \(D\) and \({\beta}_{D}\) denotes the learning rate. Another crucial component of Q-learning is the action selection mechanism, which determines the actions that the agent takes during the learning process. Its aim is to balance exploration and exploitation, so that the agent can both build on what it already knows to be good decisions and try new actions32. In this work, we consider \(\epsilon\)-greedy exploration. With probability \(\epsilon\), the agent selects an action at random; with probability \(1-\epsilon\), it selects the action with the highest current Q-value. As a result, the probability of selecting action \({\Omega}_{x}\) in state \({S}_{x}\) is
$${\mu}_{x}\left({S}_{x},{\Omega}_{x}\right)=\begin{cases}1-\epsilon,&\text{if }{Ql}_{x}\text{ of }{\Omega}_{x}\text{ is the highest},\\\epsilon,&\text{otherwise}.\end{cases}$$
(36)
where \(\epsilon\in\left(0,1\right)\). To guarantee that Q-learning eventually converges, the learning rate \({\beta}_{D}\) is set as in33:
$${\beta}_{D}=\frac{1}{{\left(D+{C}_{\beta}\right)}^{{\phi}_{\beta}}}$$
(37)
where \({C}_{\beta}>0\) and \({\phi}_{\beta}\in\left(\frac{1}{2},1\right)\). Note that every UAV operates independently during the Q-learning phase of the proposed ILs-based RMAL algorithm. The Q-learning procedure for each \({UAV}_{x}\), \(x\in\text{X}\), is therefore summarized in Algorithm 1.
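The three ingredients in (35), (36) and (37) can be written as small helper functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the value 0.9 used for the exponent is only one admissible choice in the stated range, and the exploration step draws uniformly over all actions, which is a common variant of \(\epsilon\)-greedy rather than a literal transcription of (36).

```python
import random

def learning_rate(D, c_beta=0.6, phi_beta=0.9):
    """Eq. (37): beta_D = 1 / (D + C_beta) ** phi_beta (phi_beta value assumed)."""
    return 1.0 / (D + c_beta) ** phi_beta

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise take the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

def q_update(Q, s, a, reward, s_next, beta, discount):
    """Eq. (35): move Q[s][a] towards reward + discount * max_a' Q[s_next][a']."""
    target = reward + discount * max(Q[s_next])
    Q[s][a] += beta * (target - Q[s][a])
    return Q[s][a]

# One hypothetical transition on a zero-initialised 2 x 2 table.
Q = [[0.0, 0.0], [0.0, 0.0]]
new_value = q_update(Q, s=0, a=1, reward=1.0, s_next=1, beta=0.5, discount=0.9)
choice = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=random.Random(0))
```

Setting `epsilon=0.0` in the last line disables exploration, so the helper returns the index of the largest Q-value.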
Algorithm 1
Because the initial Q-values in Algorithm 1 are always set to zero, this learning method is sometimes referred to as zero-initialized Q-learning34. Because a UAV has no prior information about the initial state, it starts with an equal-probability strategy, \({\mu}_{x}\left({S}_{x},{\Omega}_{x}\right)=\frac{1}{\left|{\varphi}_{x}\right|}\).
Algorithm: Q-learning based RMAL algorithm for multi-UAV systems
-
Begin.
-
Set \(D=0\) and the parameters \(\varDelta,{C}_{\beta}\).
-
for all \(x\in\text{X}\) do
-
Initialize the action values \({Ql}_{x}^{D}\left({S}_{x},{\Omega}_{x}\right)=0\) and the strategy \({\mu}_{x}\left({S}_{x},{\Omega}_{x}\right)=\frac{1}{\left|{\varphi}_{x}\right|}=\frac{1}{XKI}\);
-
Initialize the state \({S}_{x}={S}_{x}^{D}=0\);
-
end for
-
while \(D<\dots\) do
-
for each \({UAV}_{x}\), \(x\in\text{X}\) do
-
Update the learning rate \({\beta}_{D}\) according to
$${\beta}_{D}=\frac{1}{{\left(D+{C}_{\beta}\right)}^{{\phi}_{\beta}}}$$
-
Choose an action \({\Omega}_{x}\) according to the strategy \({\mu}_{x}\left({S}_{x}\right)\).
-
Compute the SINR at the receiver according to
$${\gamma}_{x}\left(D\right)\left\{{\Omega}_{x}\left(D\right),{\Omega}_{-x}\left(D\right),{G}_{x}\left(D\right)\right\}=\sum_{U\in\cup}\sum_{k\in K}\frac{{Ds}_{x,U}^{k}\left(D\right)\left\{{\Omega}_{x}\left(D\right),{\Omega}_{-x}\left(D\right),{G}_{x,U}\left(D\right)\right\}}{{I}_{x,U}^{k}\left(D\right)\left\{{\Omega}_{x}\left(D\right),{\Omega}_{-x}\left(D\right),{G}_{x,U}\left(D\right)\right\}+{\Phi}^{2}}$$
-
if \({\gamma}_{x}\left(D\right)\ge{\gamma}_{x}^{\prime}\) then
-
Set \({S}_{x}^{D}=1\).
-
else
-
Set \({S}_{x}^{D}=0\).
-
end if
-
Obtain the instantaneous reward \({O}_{x}^{D}\) according to Eq. (22).
-
Update the action value according to
$${Ql}_{x}^{D+1}\left({S}_{x},{\Omega}_{x}\right)={Ql}_{x}^{D}\left({S}_{x},{\Omega}_{x}\right)+{\beta}_{D}\left[{O}_{x}^{D}+\varDelta\max_{{\Omega}_{x}^{\prime}\in{\varphi}_{x}}{Ql}_{x}^{D}\left({S}_{x}^{\prime},{\Omega}_{x}^{\prime}\right)-{Ql}_{x}^{D}\left({S}_{x},{\Omega}_{x}\right)\right]$$
-
Update the strategy according to
$${\mu}_{x}\left({S}_{x},{\Omega}_{x}\right)=\begin{cases}1-\epsilon,&\text{if }{Ql}_{x}\text{ of }{\Omega}_{x}\text{ is the highest},\\\epsilon,&\text{otherwise}.\end{cases}$$
-
end for
-
Set \(D=D+1\).
-
end while
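The structure of Algorithm 1 can be sketched as an independent-learner loop. The environment below is a toy stand-in and not the channel model of this paper: each UAV picks one sub-channel, the QoS test is replaced by the assumption that a UAV succeeds exactly when no other UAV picks the same sub-channel, and the reward equals the resulting state. All parameter defaults are hypothetical.

```python
import random

def run_independent_q_learning(num_uavs=2, num_channels=2, slots=2000,
                               epsilon=0.2, discount=0.9,
                               c_beta=0.6, phi_beta=0.9, seed=0):
    rng = random.Random(seed)
    # Q[x][s][a]: zero-initialised table per UAV, states {0, 1}.
    Q = [[[0.0] * num_channels for _ in range(2)] for _ in range(num_uavs)]
    state = [0] * num_uavs
    total_reward = 0.0
    for D in range(slots):
        beta = 1.0 / (D + c_beta) ** phi_beta  # learning rate, Eq. (37)
        # Each UAV chooses its action from its own table only.
        actions = []
        for x in range(num_uavs):
            if rng.random() < epsilon:
                actions.append(rng.randrange(num_channels))
            else:
                row = Q[x][state[x]]
                actions.append(max(range(num_channels), key=row.__getitem__))
        for x in range(num_uavs):
            # Toy stand-in for the SINR threshold test.
            clash = any(actions[y] == actions[x]
                        for y in range(num_uavs) if y != x)
            next_state = 0 if clash else 1
            reward = float(next_state)
            s, a = state[x], actions[x]
            # Q-learning update, Eq. (35); no other UAV's data is used.
            target = reward + discount * max(Q[x][next_state])
            Q[x][s][a] += beta * (target - Q[x][s][a])
            state[x] = next_state
            total_reward += reward
    return Q, total_reward / (slots * num_uavs)

Q_tables, average_reward = run_independent_q_learning()
```

Each UAV reads and writes only its own table, which mirrors the independent-learner assumption; the coupling between UAVs enters solely through the rewards they observe.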
The proposed RMAL algorithm analysis
Here, we study the convergence of the proposed RMAL-based resource allocation algorithm. Note that the RMAL algorithm presented here can be regarded as an independent multi-agent Q-learning method, in which each UAV acts as a learning agent that uses the Q-learning algorithm to make decisions. Its convergence can therefore be understood through the following proposition.
Proposition 2
When the RMAL method of Algorithm 1 is applied, the Q-learning algorithm of every UAV eventually converges to the Q-values of an optimal policy. The following observation is key to proving Proposition 2: because the UAVs are non-cooperative, the convergence of the proposed RMAL algorithm depends on the convergence of the Q-learning method31.
Theorem 1
The Q-learning update rule (35) in Algorithm 1 converges to the optimal value \({Ql}_{x}^{*}\left({S}_{x},{\Omega}_{x}\right)\) with probability one (w.p. 1) if
-
There is a finite number of states and actions;
-
\(\sum_{D=0}^{+\infty}{\beta}_{D}=\infty\) and \(\sum_{D=0}^{+\infty}{\left({\beta}_{D}\right)}^{2}<\infty\) uniformly w.p. 1;
-
Var\(\left[{O}_{x}^{D}\right]\) is bounded.
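The second condition of Theorem 1 can be inspected numerically for the learning rate in (37). The sketch below compares partial sums over two horizons; the exponent 0.9 is an assumed value inside the admissible range, and a finite computation can only illustrate the trend, not prove divergence or convergence.

```python
def partial_sums(n_terms, c_beta=0.6, phi_beta=0.9):
    """Return (sum of beta_D, sum of beta_D squared) for D = 0 .. n_terms - 1."""
    s1 = 0.0
    s2 = 0.0
    for D in range(n_terms):
        beta = 1.0 / (D + c_beta) ** phi_beta
        s1 += beta
        s2 += beta * beta
    return s1, s2

s1_small, s2_small = partial_sums(1_000)
s1_large, s2_large = partial_sums(100_000)

growth_sum = s1_large - s1_small        # keeps growing with the horizon
growth_sum_sq = s2_large - s2_small     # nearly flat
```

The sum of the learning rates continues to grow by more than ten between the two horizons, while the sum of their squares changes by less than 0.01, which is the behaviour the condition requires.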
Simulation results
In this section, the performance of the proposed RMAL-based resource allocation algorithm for multi-UAV systems is evaluated through simulations. We consider a multi-UAV system deployed over a disk of radius \(r=600\) m, with the ground users distributed uniformly at random within the disk. All UAVs are assumed to fly at the same altitude, \(Al=80\) m. The noise power is \({\varphi}^{2}=-70\) dBm, the sub-channel bandwidth is \(\frac{W}{K}=65\) kHz, and \({T}_{s}=0.1\) s. The channel parameters of the probabilistic model follow Eq. (6), with \(A=9.61\) and \(B=0.61\). In addition, the carrier frequency is \(Fc=2\) GHz, with \({\mathfrak{y}}^{Los}=1\) and \({\mathfrak{y}}^{nlos}=2\). In the LoS channel model scenario, the path loss exponent is \(\beta=2\) and the channel power gain at the reference distance \({d}_{0}=1\) m is \({\alpha}_{0}=-60\) dB11. The maximum power per UAV is \({P}_{x}=P=23\) dBm, and the maximum number of power levels is \(I=3\); the maximum power is divided equally into \(I\) discrete power values. The cost per unit of power is \({\mathcal{w}}_{x}=\mathcal{w}=100\), and each user requires a minimum SINR of \({\gamma}_{0}=3\) dB. In addition, \({C}_{\beta}=0.6\), \({\rho}_{\alpha}=0.9\), and \(\varDelta=1\).
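Several of the settings above are given in logarithmic units, so a small conversion sketch may help when reproducing them. The helper names are hypothetical, and the reading of "divided equally" as evenly spaced levels up to the maximum power is an assumption.

```python
def dbm_to_watts(dbm):
    """Convert a power in dBm to watts."""
    return 10 ** ((dbm - 30) / 10)

def db_to_linear(db):
    """Convert a ratio in dB to linear scale."""
    return 10 ** (db / 10)

def power_levels(max_power_watts, num_levels):
    """Evenly spaced levels P * j / num_levels for j = 1 .. num_levels (assumed)."""
    return [max_power_watts * j / num_levels for j in range(1, num_levels + 1)]

p_max = dbm_to_watts(23)        # maximum transmit power, about 0.2 W
noise = dbm_to_watts(-70)       # noise power, 1e-10 W
sinr_min = db_to_linear(3)      # minimum SINR, about 2 in linear scale
levels = power_levels(p_max, 3)
```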

UAVs based Systems with \(\:X=3\:\)and \(\:U=80\).
In Fig. 3, we consider a random realization of the multi-UAV system, in which \(U=80\) users are randomly distributed over a disc of radius \(r=600\) m and three UAVs are initially placed at the disc edge at an angle of \(\varnothing=\frac{\pi}{4}\). Figure 4 shows the average reward and the average reward per time slot for UAVs flying at 40 m/s under the configuration shown in Fig. 3. The average reward in Fig. 4(a) is computed as \({v}^{D}=\frac{1}{X}\sum_{x\in X}{v}_{x}^{D}\). As shown in Fig. 4(a), the average reward increases with the number of algorithm iterations, because the proposed RMAL algorithm improves the long-term reward. However, the average reward curve flattens once the number of time slots exceeds 250. Beyond this point the UAVs have flown out of the disc, so the average reward no longer increases. Correspondingly, Fig. 4(b) shows the average instantaneous reward \({O}^{D}=\sum_{x\in X}{O}_{x}^{D}\) per time slot.
The key simulation parameters and performance observations for evaluating the proposed RMAL-based multi-UAV communication framework are summarized in Table 2. This table outlines critical metrics including environmental setup (e.g., disk radius, number of users, UAV altitude), communication parameters (e.g., noise power, sub-channel bandwidth, carrier frequency), and algorithm-specific observations (e.g., immediate and long-term rewards, SINR thresholds). These metrics provide a foundational basis for assessing the efficiency and stability of the proposed algorithm under realistic operational constraints and dynamic network conditions.

Comparing the Average Rewards with different \(\:\epsilon\:\), X = 3 and U = 80.
In Fig. 4(b), the x-axis represents the number of algorithm iterations, which corresponds to discrete time slots during which each UAV updates its policy based on observed states and rewards. These iterations range from 0 to a predefined simulation horizon (e.g., 500 slots) and are crucial for the convergence of the Q-learning-based RMAL algorithm. The y-axis denotes the UAV speed (in meters per second), which influences how frequently a UAV encounters new users and changes its spatial context. Higher speeds typically allow UAVs to explore the environment more dynamically, while lower speeds may result in more localized communication. It is important to note that both algorithm iterations and UAV speed are inherently non-negative in the actual simulation. Any negative values observed in the figure are purely visual artifacts from the surface plotting function and do not correspond to real-world or simulated states. These have been retained only to provide a smooth visualization of the reward surface.
To analyze the impact of the exploration rate (ϵ) on the learning dynamics of the RMAL-based multi-UAV system, a comparative evaluation of average rewards under different exploration settings was conducted. The results, as summarized in Table 3, highlight the critical role of balancing exploration and exploitation in reinforcement learning environments40.
Specifically, the exploration rate ϵ = 0.5 yielded the highest final average reward, indicating an effective trade-off between exploring new actions and exploiting known high-reward strategies. In contrast, lower exploration (ϵ = 0.2) led to slightly reduced performance, reflecting limited exposure to alternative actions and potentially suboptimal policy convergence. At ϵ = 0.8, the algorithm engaged in broader exploration but exhibited slower convergence and achieved only moderate rewards, suggesting that excessive exploration can delay learning stabilization. Notably, the purely exploitative configuration (ϵ = 0) resulted in the lowest reward values, as the system lacked the exploratory behavior needed to discover optimal strategies in dynamic environments.
As the algorithm iterates more, the average reward per time slot declines, as seen in Fig. 4(b).
This is because the learning rate \({\beta}_{D}\) of the Q-learning update in (35) depends on D and decreases as the number of time slots increases. As the number of iterations grows, \({\beta}_{D}\) falls, so the Q-values are updated more slowly as time progresses. In addition, Fig. 4 shows how the average reward varies with \(\epsilon\in\left\{0,0.2,0.5,0.9\right\}\). If \(\epsilon=0\), each UAV always selects the greedy action, which is a pure exploitation strategy. As \(\epsilon\) approaches 1, each UAV selects a random action with higher probability. As shown in Fig. 4, \(\epsilon=0.5\) is a reliable choice in the considered setup.
Figs. 5 and 6 examine how different system settings affect the average reward. Figure 5 shows the average reward under various settings for the LoS channel model given in Eq. (4).

Average Rewards Comparison for LoS Channel Model with different \(\:\epsilon\:,\) X = 3 and U = 80.
In addition, Fig. 6 shows the average reward under the probabilistic model with \(X=4,K=3\) and \(U=250\). More precisely, the UAVs are placed at random along the cell edge. During the iterations, each UAV flies across the cell, passing over the disk centre, which is also the centre of the cell. As can be seen in Figs. 5 and 6, the average reward curves for the different \(\epsilon\) values follow a pattern similar to that in Fig. 4. In addition, the multi-UAV network under study can achieve the optimal average reward under a variety of network configurations.

Multi-UAVs Systems Illustration with K = 3, M = 4, and U = 250.
In Fig. 7, we assess the average reward of the proposed RMAL algorithm by comparing it with a matching theory-based resource allocation method. We consider the same configuration as in Fig. 4, but with \(I=1\) to simplify the implementation, so that the action of each UAV consists only of the user it selects in each time slot. Furthermore, we assume that the matching theory-based user selection algorithm has complete information exchange among the UAVs; that is, each UAV knows the actions of the other UAVs before making its decision. As a benchmark, the Gale-Shapley (GS) algorithm35 is applied in the matching theory-based user selection in each time slot. We also include a baseline scheme in Fig. 7, the random user selection algorithm (Rand). Figure 7 shows that the matching-based user selection algorithm achieves a higher average reward than the proposed RMAL algorithm. This is because the proposed RMAL algorithm involves no information exchange: each UAV cannot observe information about the other UAVs, such as their rewards and decisions, and therefore makes its decision independently.
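For readers unfamiliar with the GS benchmark, the following is a minimal sketch of the classical one-to-one Gale-Shapley procedure. It is not the exact benchmark configuration used in the simulations: the preference lists are hypothetical, every UAV is assumed to rank every user and vice versa, and the two sides are assumed to have equal size.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching {proposer: receiver} (proposer-optimal)."""
    # rank[r][p]: position of proposer p in receiver r's list (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}  # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])  # previous partner becomes free again
            engaged[r] = p
        else:
            free.append(p)           # rejected, will try the next receiver
    return {p: r for r, p in engaged.items()}

# Hypothetical preferences: both UAVs prefer user "a"; "a" prefers "u2".
uav_prefs = {"u1": ["a", "b"], "u2": ["a", "b"]}
user_prefs = {"a": ["u2", "u1"], "b": ["u1", "u2"]}
matching = gale_shapley(uav_prefs, user_prefs)
```

Building the preference lists requires each side to know how it ranks the other, which is the kind of shared information that the independent learners in the proposed algorithm do not have.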

Average Rewards Comparison with different Algorithms where K = 1, J = 1, M = 2, and U = 80.
The performance comparison of the different algorithms is summarized in Table 4. Among them, the proposed MARLA algorithm achieved the highest final average reward (~80) using a power level of 23 dBm, showing efficient learning and resource allocation37. The Mach algorithm performed moderately well with a final reward of ~50, while the Rand algorithm, based on random actions, performed the worst with a reward of only ~30. These results demonstrate the effectiveness of MARLA in dynamic UAV communication environments.
The proposed RMAL algorithm also achieves a higher average reward than the random user selection strategy, as shown in Fig. 7. Because the random scheme selects users at random, it cannot effectively exploit the observed information. As a result, the proposed RMAL algorithm can balance reducing the information exchange cost against improving overall system performance.

Average Rewards Comparison with different Algorithms where K = 1, I = 1, X = 2, and U = 80.
In Fig. 8, we investigate how the number of algorithm iterations and the UAV's speed affect the average reward. To do this, the UAV takes off from a random point on the disc's edge and then flies directly over the disc's center at different speeds. Figure 8 uses the same setup as Fig. 6, except that \(\:X=1\) and \(\:K=1\) are used for illustration. As can be observed, for a fixed speed, the average reward increases steadily with the number of algorithm iterations. Additionally, when D is less than 150, the average reward rises more quickly at higher speeds than at lower speeds over the same time interval. This is because the user and UAV positions are chosen randomly, so the UAV might not immediately find a user that satisfies its QoS requirements. Figure 8 further shows that, toward the end of the iterations, a higher speed yields a lower average reward. This is because a UAV that flies quickly needs less time to fly out of the disc. As a direct result, the total service time for UAVs flying at faster speeds is shorter than that for UAVs flying at slower speeds.
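The dependence of service time on speed can be made explicit with a simple worked relation. As a sketch under assumptions not stated in the text, let the disc have radius \(\:R\) and let the UAV fly at a constant speed \(\:v\) along a straight line from the disc's edge through its center (the symbols \(\:R\), \(\:v\), and \(\:T_{\text{service}}\) are introduced here for illustration only). The UAV then covers one diameter while inside the disc, so

\[
T_{\text{service}} = \frac{2R}{v}, \qquad \frac{T_{\text{service}}(2v)}{T_{\text{service}}(v)} = \frac{1}{2}.
\]

Under this simplification, doubling the speed halves the time the UAV spends over the disc, which is consistent with the shorter total service time observed for faster UAVs.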
Comparative analysis with MARL algorithms
To evaluate the effectiveness of the proposed RMAL framework, we compare it with three state-of-the-art multi-agent reinforcement learning algorithms:
- QMIX: A value-decomposition-based MARL that combines individual Q-functions into a global Q-function while maintaining consistency.
- MADQN: Multi-agent DQN, which extends DQN to multi-agent settings using shared experience and coordinated updates.
- MAPPO: Multi-Agent Proximal Policy Optimization, a popular actor-critic based algorithm for continuous action spaces.
The simulation environment consists of 3 UAVs (X = 3), 80 users (U = 80), 4 sub-channels (K = 4), and 3 power levels (i = 3). Each algorithm was executed for T = 500 time slots with identical initialization and reward functions.
Table 5 presents a performance comparison between the proposed RMAL algorithm and the other multi-agent reinforcement learning methods. While MAPPO and QMIX yield slightly higher average rewards and SINR satisfaction, they come with higher computational and memory requirements due to centralized training and value decomposition mechanisms47. In contrast, the proposed RMAL algorithm achieves competitive performance with lower computational overhead, faster convergence, and full decentralization, making it suitable for real-time UAV networks where bandwidth and processing power are limited.
