Deep Learning-Based MPPT for PV Systems: LSTM Forecasting and Adaptive TSMC via PPO Agent

Aymen Lachheb; Chabakata Mahamat; Rym Marouani

doi:10.2478/pead-2026-0018

Introduction

The growing global demand for clean energy has positioned photovoltaic (PV) systems as a cornerstone of modern power generation. To maximise the energy output from these systems, it is crucial to implement effective maximum power point tracking (MPPT) algorithms (Bhukya and Kota, 2018). These algorithms dynamically adjust the operating voltage and current of a solar panel to ensure it operates at its optimal power point under varying environmental conditions.

Conventional MPPT techniques, such as Perturb & Observe (P&O) (Bhukya and Kota, 2018) and Incremental Conductance (INC), are widely used due to their simplicity and low implementation cost (Saberi et al., 2023). These algorithms iteratively adjust the operating point to locate the power-voltage (P-V) curve peak. Although effective under uniform solar irradiance, their performance degrades significantly under partial shading conditions (PSCs). Under PSC, the P-V curve exhibits multiple peaks, and these algorithms are prone to becoming trapped at a local maximum rather than the global one. This failure to reach the true maximum power point results in considerable energy loss, slow tracking speeds, and pronounced steady-state oscillations, all of which diminish the overall efficiency of the system (Naick et al., 2017).

To address these deficiencies, researchers have extensively investigated non-linear control strategies. Feedback linearisation controllers, which can globally linearise system non-linearities, have been used to achieve MPPT in PV systems (Kumara and Rao, 2017). Similarly, backstepping control employs a systematic step-by-step design methodology and has demonstrated effectiveness in maximum power extraction from PV systems (Behih and Attoui, 2021). However, both approaches share a critical vulnerability: their dependence on precise system parameter knowledge. Performance degrades significantly under dynamic uncertainties common in practical applications.

Sliding mode control (SMC) methods offer a promising solution to handle the uncertainties inherent in PV systems (Ait-Chekdhidh et al., 2025). These techniques have gained significant attention for MPPT applications due to their robustness, low sensitivity to parameter changes, simplicity, and fast response (Abianeh and Ferdowsi, 2021). The literature encompasses diverse SMC variants for PV maximum power extraction (Ahmad et al., 2020a), including fractional-order SMC (FOSMC) (Tadj et al., 2023), fuzzy-logic-based SMC (FLSMC) (Qureshi et al., 2022), second-order SMC (Dursun and Kulaksiz, 2020), super-twisting SMC (ST-SMC) (Hadj Salah et al., 2023), terminal sliding mode control (TSMC) (Behih and Attoui, 2021), and super-twisting fractional-order terminal SMC (ST-FOTSMC) (Sami et al., 2020). Despite these advantages, a fundamental limitation persists: the requirement for known upper bounds of system uncertainties and disturbances. In complex real-world PV installations, accurate determination of these bounds presents significant challenges, potentially limiting practical applicability.

Terminal sliding mode controllers (TSMC) have shown particular promise in MPPT applications due to their robustness and finite-time convergence characteristics (Saad et al., 2018). However, their effectiveness relies heavily on accurate reference voltage determination, which proves challenging under dynamic operating conditions. Adaptive terminal sliding mode control (ATSMC) addresses several key limitations of conventional SMC by eliminating the need for prior knowledge of uncertainty bounds, ensuring finite-time convergence, and reducing chattering phenomena (Baraean et al., 2023). The adaptive mechanism adjusts parameters in real time without requiring predetermined uncertainty bounds, making it well suited to complex applications. However, non-linear control strategies face several limitations, which have spurred research into more sophisticated intelligent MPPT algorithms, including fuzzy logic controllers (FLC), particle swarm optimisation (PSO) (Al-Majidi et al., 2020), genetic algorithms (GA), artificial neural networks (ANNs), and various machine learning (ML)-based approaches.

FLCs utilise rule-based systems with membership functions to dynamically adjust control parameters, offering enhanced tracking precision and oscillation suppression around the MPP. However, FLCs exhibit limitations when confronted with rapid variations in solar irradiance (G) and temperature (T) due to their static membership functions and limited capacity to model complex input-output relationships (de Lima and Oliveira, 2024).

PSO enhances tracking efficiency by iteratively adjusting the control duty cycle, with particles updating their positions based on individual historical bests and the swarm’s collective knowledge (Abdulkadir et al., 2014). Nonetheless, in rapidly changing environmental conditions, PSO particles may fail to reposition quickly enough, resulting in suboptimal MPP tracking performance.

GAs address the issue of slow adaptation by exploring a broader search space using evolutionary operations such as selection, crossover, and mutation, which improve optimisation during fast environmental changes (Firdouse and Surender Reddy, 2023). However, GAs have limited adaptability to intricate data relationships due to their fixed operations and slower convergence rates (Hadji et al., 2018).

ANNs, which form the foundation of ML and deep learning (DL), generally surpass both traditional and existing intelligent MPPT methods in accuracy and efficiency (Abreo 2021). ANNs learn complex patterns from data, continuously refine their models with new information, and dynamically optimise MPPT performance (Lachheb et al., 2025). Despite their advantages, challenges remain regarding interpretability, overfitting susceptibility, and dynamic adaptability under extreme conditions.

Despite the growing interest in artificial intelligence (AI) for PV systems, a significant gap remains in the specific application of DL techniques to optimise solar PV power generation. Ab-Belkhair et al. (2020) proposed a dual deep neural network (DNN) architecture for a hybrid solar/wind system, utilising a custom feed-forward neural network (FF-NN) for MPPT. While they successfully demonstrated the potential of DNNs for such applications and encouraged their deployment, the methodology presents significant limitations. The approach is computationally intensive, employing two separate DNNs with an excessively large architecture (1,000 hidden neurons) for a relatively small dataset of 66,000 samples. Furthermore, the study omits critical details regarding data preprocessing and visualisation, which are fundamental to validating any data-driven DNN model. The research work investigated by Srinivasan and Ramalingam Balamurugan (2022) presents a recurrent neural network (RNN)-based MPPT algorithm for grid-connected PV systems, with a focus on cell temperature adaptation. While the approach is promising, the study lacks critical methodological details, including the size of the dataset and the specific architecture of the model. Furthermore, as a data-driven approach, it omits essential steps in the data pipeline, such as pre-processing and visualisation, which are crucial for reproducibility and validation. A more significant limitation is the choice of a standard RNN, which is inherently susceptible to the vanishing gradient problem and struggles to learn long-term dependencies in temporal data. This limitation could be effectively addressed by employing more advanced RNN variants, such as long short-term memory (LSTM) or gated recurrent unit (GRU) networks, which are specifically designed with gating mechanisms to mitigate this issue.

The study presented by Dharma Raj et al. (2022) implements a sophisticated 5-layer bidirectional LSTM (Bi-LSTM) network with 600 neurons per layer for MPPT control in a hybrid renewable system. While the Bi-LSTM architecture, which processes data in both forward and backward directions, enhances prediction accuracy, it does so at a considerable cost. The combination of a deep network and bidirectional processing introduces significant computational overhead and training inefficiencies. In contrast, the two-layer stacked LSTM architecture proposed in our work is designed to achieve a more favourable balance between accuracy and computational efficiency.

To address the individual limitations of traditional and intelligent MPPT algorithms, including sensitivity, complexity, and slow response times, researchers have developed hybrid MPPT methods that combine complementary algorithmic strengths. Some studies have combined ANNs with conventional methods to improve tracking speed, while others have merged metaheuristic optimisation algorithms (GA, PSO) with FLC to better handle PSCs (Melhaoui et al., 2025; Abdolrasol et al., 2025).

In their study, Ali et al. (2018) optimised standalone PV systems using a hybrid GA-ANN approach, leveraging GA’s global search capability while utilising ANN’s adaptive learning for real-time MPPT fine-tuning under changing conditions. Similarly, research work presented by Roberts and Bhattacharya (2017) addressed partial shading by integrating ANNs with Hill Climbing (HC), exploiting ANN’s complex data relationship learning while using HC for refined convergence towards optimal solutions.

Despite their advantages, hybrid MPPT approaches face limitations, including slow convergence and insufficient dynamic adaptability due to reliance on predefined rules. Consequently, integrating advanced DL-based MPPT methods becomes crucial for superior data processing and adaptive learning capabilities, which are essential to surpass the performance of existing state-of-the-art MPPT techniques. LSTM networks, a specialised class of RNNs, have demonstrated significant promise in predicting optimal operating points based on environmental data. LSTMs excel in this application due to their capacity for modelling complex, non-linear, time-dependent relationships between environmental conditions and PV panel performance.

More recently, reinforcement learning (RL), particularly deep reinforcement learning (DRL), has emerged as a powerful MPPT tool. RL agents learn optimal control policies through trial-and-error interaction, adapting to complex environments without explicit programming. Algorithms such as proximal policy optimisation (PPO) are particularly suitable, offering stable learning characteristics and effective handling of continuous action spaces for controller parameter adjustment.

Despite these advances, several gaps remain in the literature:

- Many hybrid approaches combine algorithms in a sequential or loosely coupled manner, failing to achieve true real-time, adaptive synergy between predictive and control components.
- Some intelligent methods, especially complex DRL algorithms, can be computationally intensive, limiting their practical application in low-cost PV systems.
- While some methods perform well under a few shading patterns, they often struggle with highly unpredictable and rapidly changing environmental conditions, which are becoming more common.

This research addresses these gaps by proposing a novel, tightly integrated hybrid approach that combines LSTM’s predictive capabilities for voltage reference generation with PPO’s dynamic optimisation for robust TSMC parameter tuning. This synergistic integration is designed to deliver superior performance in efficiency, stability, and real-time adaptability across all operating conditions.

The primary objectives of this paper are:

- To develop and evaluate an LSTM model for accurate maximum power point voltage (V_MPP) prediction based on real-time environmental data.
- To propose a DRL approach using the PPO algorithm to dynamically optimise the parameters of a TSMC.
- To validate the performance of this integrated hybrid system against conventional methods through comprehensive simulation studies, with particular emphasis on partial shading scenarios.

Overview of the Proposed Hybrid Approach

The proposed LSTM–PPO–TSMC architecture integrates three complementary intelligent components to address key limitations of conventional MPPT methods under dynamic PV conditions. Standalone RL or adaptive sliding mode approaches provide partial benefits but fail to ensure fast convergence, stability, and robustness simultaneously.

Pure RL-based MPPT controllers suffer from high sample complexity, slow convergence, and potential instability, while ignoring known physical relationships between environmental variables and the maximum power point. In contrast, adaptive sliding mode controllers are robust but reactive, with performance strongly dependent on accurate reference voltage estimation, which is challenging under partial shading and rapid irradiance variations.

The proposed triple-layer architecture resolves these issues through a clear division of roles. The LSTM network predicts the optimal voltage V_MPP by learning temporal patterns in irradiance and temperature data, reducing the search space for control. The TSMC ensures robust, finite-time tracking of this reference despite disturbances and uncertainties. Meanwhile, the PPO agent dynamically tunes the TSMC parameters (α,σ) in real time, improving adaptability and suppressing chattering.

By combining prediction intelligence, robust non-linear control, and continuous RL-based optimisation, the proposed hybrid method achieves faster response, higher efficiency, and improved stability.

In the proposed LSTM–PPO–TSMC-based MPPT scheme (Figure 1), the LSTM network predicts the reference maximum power point voltage $V_{MPP}^{*}$ using PV electrical and environmental measurements, while the TSMC ensures robust voltage regulation. The PPO agent adaptively tunes the TSMC parameters to enhance dynamic performance and suppress chattering. This hybrid strategy exploits the strengths of each component to achieve intelligent and robust MPPT. The LSTM provides a fast and accurate estimate of V_MPP based on historical and real-time environmental data, allowing the PPO algorithm to initialise its optimisation near the optimal operating point. This significantly accelerates convergence and reduces tracking time. The PPO agent adjusts the TSMC parameters in real time to compensate for prediction errors, system uncertainties, and rapid environmental variations, which improves robustness and maintains performance under fluctuating conditions. Finally, the TSMC guarantees rapid and stable convergence to the reference voltage $V_{MPP}^{*}$, ensuring resilience to disturbances while serving as a robust baseline controller. The synergy between LSTM prediction, PPO-based optimisation, and TSMC regulation results in an MPPT system that is fast, accurate, and highly adaptive.

Figure 2 presents the flow chart of the proposed control system, showing the sequential data flow from the input of real-time PV data to the optimisation of power output using the integrated LSTM, PPO, and TSMC modules. The LSTM network predicts the reference MPP voltage from real-time PV measurements. The PPO agent adaptively optimises the TSMC parameters, while the TSMC generates the control signal for the DC–DC converter to maximise the PV output power.

Design of an Adaptive Robust MPPT Control Strategy

This paper proposes a hybrid LSTM–TSMC control strategy to achieve adaptive MPPT in PV systems under rapidly varying environmental conditions. The LSTM algorithm generates an optimal voltage, while the TSMC component ensures fast convergence and robust tracking despite irradiance fluctuations and partial shading effects.

3.1. LSTM for maximum power point voltage prediction

Accurate prediction of the maximum power point voltage (V_MPP) is a crucial step in the MPPT process, especially in environments with rapid irradiance and temperature variations. To address this requirement, we propose the use of an LSTM neural network, capable of capturing the complex temporal dependencies between environmental and electrical variables of the PV system.

LSTM networks are particularly well suited to time series modelling due to their ability to learn long-term dependencies in data sequences. V_MPP prediction is inherently a time series task because the optimal voltage depends on past and present conditions of irradiance, temperature, and panel performance. Training the LSTM requires a historical dataset that includes simultaneous measurements of G, T, and the true V_MPP, obtained using a highly accurate reference MPPT algorithm. The quality and diversity of the training data are decisive for prediction accuracy.

The LSTM network is trained offline using irradiance and temperature profiles derived from standard EN50530 dynamic test conditions. The corresponding V_MPP values are generated using a validated PV array model. The dataset is chronologically divided into 70% training, 15% validation, and 15% testing sets to preserve temporal correlations and prevent data leakage.
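The chronological 70/15/15 split described above can be sketched as follows. This is a minimal illustration only: the function name `chronological_split` and the stand-in `records` list are hypothetical, and integer percentages are used to avoid floating-point rounding at the split boundaries.

```python
def chronological_split(samples, train_pct=70, val_pct=15):
    """Split a time-ordered sequence into train/validation/test without shuffling."""
    n = len(samples)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # remainder, about 15% here
    return train, val, test


# Hypothetical stand-in for time-ordered (G, T, V_MPP) records; indices act as timestamps.
records = list(range(1000))
train, val, test = chronological_split(records)
```

Because no shuffling takes place, every validation and test sample lies strictly later in time than every training sample, which is what prevents leakage of future information into training.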

The diagram presented in Figure 3 illustrates the architecture of the LSTM model developed to predict V_MPP. This model is based on a sequential approach where the input data including solar irradiance (G), ambient temperature (T), voltage (Vpv), and current (Ipv) are first normalised to ensure stable convergence during training. Then, these data are organised into time sequences using sliding windows, allowing the LSTM network to capture non-linear dynamics and temporal correlations between variables.
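A minimal sketch of the normalisation and sliding-window steps is given below. The source does not state the scaling method, the window length, or how targets are aligned with windows, so min-max scaling, the `window` argument, and the choice of the target at the last step of each window are all assumptions made for illustration; the function names are hypothetical.

```python
def min_max_normalise(column):
    """Scale one feature column to [0, 1] (assumed scaling method)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]


def make_windows(features, targets, window):
    """Build overlapping sequences of `window` consecutive feature rows.

    X[i] holds rows i .. i+window-1 and y[i] is the target at the last
    row of that window (assumed alignment).
    """
    X, y = [], []
    for end in range(window, len(features) + 1):
        X.append(features[end - window:end])
        y.append(targets[end - 1])
    return X, y


# Toy rows standing in for (G, T, Vpv, Ipv) measurements; values are illustrative only.
toy_features = [[i, 2 * i] for i in range(6)]
toy_targets = [10 * i for i in range(6)]
X, y = make_windows(toy_features, toy_targets, window=3)
```

In practice the scaling statistics would normally be computed on the training split only and then reused for the validation and test splits, so that no information from later data enters the preprocessing.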

The network consists of two LSTM layers with 64 hidden units each, followed by a fully connected dense layer with 32 neurons and ReLU activation. Dropout is applied between layers to reduce overfitting. The model is trained using the Adam optimiser with a learning rate of 0.001 and the mean squared error (MSE) as the loss function. A batch size of 128 and a maximum of 100 epochs are used, with early stopping to prevent overtraining. L2 regularisation is incorporated to enhance generalisation, and the output layer employs a linear activation function to predict the maximum power point voltage V_MPP.
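To give a feel for the size of this architecture, the sketch below counts its trainable parameters. It assumes four input features (G, T, Vpv, Ipv) and one bias vector per LSTM gate; frameworks that store two bias vectors per gate report a slightly larger number. The helper names are hypothetical and the count is an illustration, not a figure reported in the source.

```python
def lstm_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights, and one bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def dense_params(input_size, output_size):
    """Weight matrix plus bias vector of a fully connected layer."""
    return input_size * output_size + output_size


n_features = 4  # assumed inputs: G, T, Vpv, Ipv

layer_params = {
    "lstm_1": lstm_params(n_features, 64),
    "lstm_2": lstm_params(64, 64),
    "dense_32": dense_params(64, 32),
    "output": dense_params(32, 1),
}
total_params = sum(layer_params.values())
print(layer_params, total_params)
```

Under these assumptions the model has roughly 53,000 parameters, which is consistent with the stated aim of keeping the predictor compact.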

The LSTM layer, equipped with a tanh activation function, extracts relevant features from these sequences. The output is then passed to a fully connected dense layer, which generates the prediction of V_MPP.

The predictive performance of the model was rigorously evaluated using cross-validation. Quantitatively, the model demonstrates high precision, achieving a coefficient of determination R² of 0.98 and a low root mean square error (RMSE) of 0.42 V. The mean absolute percentage error (MAPE) was calculated at 1.37%, indicating that the model maintains high fidelity across its operational range. The training methodology follows best practices in ML for time-series forecasting, resulting in a robust predictor suitable for real-time MPPT applications.
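For reference, the three reported metrics can be computed as sketched below. The function name and the toy voltage values are hypothetical and do not reproduce the results above; MAPE as written assumes that no true value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAPE in percent) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mape = 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return r2, rmse, mape


# Illustrative values only (volts).
r2, rmse, mape = regression_metrics([10.0, 20.0], [11.0, 19.0])
```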

3.2. PPO agent for TSMC parameter optimisation

The sliding mode control (SMC) is known for its robustness to uncertainties and disturbances. TSMC is an improvement on SMC that aims to reach the setpoint in a finite time, which can improve the speed of convergence to the MPP.

The effectiveness of a TSMC controller is highly dependent on its parameters, which include sliding mode gains, equivalent control law gains, and terms related to the reachability condition. These parameters directly affect the stability, response speed, and tracking accuracy of the MPPT system. Traditionally, setting these parameters is a challenging task, especially in a dynamic environment. To address this, we propose using a PPO RL agent to dynamically optimise the TSMC parameters in real-time. PPO was selected for its stability, robustness, and efficiency in exploring the parameter space, which is crucial for fine-tuning controllers. The PPO agent interacts with the environment (the PV system) and learns to dynamically adjust the TSMC’s parameters (such as α and σ) to maximise the power extracted from the solar panel, even under fluctuating solar irradiance and temperature conditions.

In this study, the PPO agent learns to adjust the TSMC’s parameters based on the system state (voltage, current, power, V_MPP, and voltage error) to maximise the panel’s power output.

A PPO agent is an actor-critic RL algorithm that utilises two distinct but interconnected neural networks to learn an optimal policy. This structure allows the agent to make and evaluate its own decisions, leading to more stable and efficient learning.

The Actor network is the decision-maker. It takes the system’s state as input. Based on this state, the Actor’s role is to determine the best possible action to take. In our case, this action corresponds to the specific parameters (α and σ) that will be used to control the TSMC. The Actor network uses a hidden layer of 64 neurons with a ReLU activation function to process the input, followed by an output layer that provides the final parameter values.

The Critic network acts as the evaluator. It takes the same state information as the Actor and estimates the value of that state. This value represents the expected future reward the agent can anticipate from being in that state and following the current policy. The Critic’s prediction is used to calculate the advantage, which is the difference between the actual reward received after taking an action and the Critic’s predicted value.

This advantage signal is crucial for training the Actor. A positive advantage indicates that the action taken was better than expected, so the Actor’s policy is updated to make that action more likely in the future. Conversely, a negative advantage means the action was worse than expected, and the policy is adjusted to discourage it. This feedback mechanism allows the Actor to learn much more effectively than if it were to rely solely on delayed, long-term rewards. The Critic essentially provides a baseline that reduces the variance in the learning signal, leading to faster and more stable training.
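One common way to form such an advantage signal is the one-step temporal-difference estimate, sketched below. The source does not state which estimator is used, so this choice, the discount factor `gamma`, and the function name are assumptions for illustration.

```python
def td_advantages(rewards, values, next_values, gamma=0.99):
    """One-step advantage: A_k = r_k + gamma * V(s_{k+1}) - V(s_k)."""
    return [
        r + gamma * v_next - v
        for r, v, v_next in zip(rewards, values, next_values)
    ]


# Illustrative numbers: the first action did better than the Critic expected,
# the second did worse.
advantages = td_advantages(
    rewards=[1.0, 1.0],
    values=[0.5, 2.0],
    next_values=[0.0, 1.0],
    gamma=0.5,
)
```

A positive entry pushes the Actor towards the corresponding action and a negative entry pushes it away, matching the behaviour described above.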

The PPO agent’s learning process takes place over several epochs, where the Actor and Critic are updated iteratively. The ppo_update function is the core of this process. It uses a probability ratio mechanism to limit policy updates and ensure that the agent does not stray too far from its previous actions. This prevents abrupt changes and keeps the learning stable.
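The probability ratio mechanism can be illustrated with the standard PPO clipped surrogate objective for a single sample. This is a generic sketch rather than the ppo_update routine itself; the function name and the clipping range `clip_eps = 0.2` are assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# The ratio 1.5 exceeds 1 + eps, so a positive advantage is capped at 1.2 * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# With a negative advantage the pessimistic (unclipped) term is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0)
```

Taking the minimum of the clipped and unclipped terms removes any incentive to move the policy far from the previous one in a single update, which is the stabilising effect described above.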

The reward is a central element of this learning. For the MPPT application, the reward is based on the instantaneous power generated by the PV panel, with the penalty terms given in Eq. (19). The PPO agent’s goal is to maximise the cumulative sum of rewards over time, which means maximising the power extracted from the system. The TSMC controller’s parameters are thus optimised so that the system converges to the maximum power point, even under variable irradiance conditions.

3.3. ATSMC design

Terminal sliding mode control (TSMC) is a variant of conventional sliding mode control specifically designed to ensure finite-time convergence to the desired state. This provides high robustness against system uncertainties and external disturbances. In the context of MPPT, this study uses the TSMC to efficiently track the V_MPP predicted by the LSTM model, even in the presence of rapid variations in irradiance and temperature. The TSMC controller uses the V_MPP voltage predicted by the LSTM model as a dynamic setpoint in real time. This combination allows for anticipating changes in environmental conditions and quickly adjusting the PV system’s operating point, thereby improving overall energy efficiency.

The control design process begins by defining the tracking error variables as follows:

(1) $e_{1}(t) = V_{pv}(t) - V_{pv}^{*}(t)$

(2) $e_{2}(t) = i_{L}(t) - i_{L}^{*}(t)$

where V_pv is the measured voltage of the PV panel and $V_{pv}^{*}$ is the reference voltage predicted by the LSTM.

The inductor current reference $i_{L}^{*}$ is derived from the PV system dynamics. The relationship between the PV current i_pv, the inductor current i_L, and the PV voltage V_pv is given by:

(3) $i_{L}(t) = i_{pv}(t) - C_{pv} \frac{dV_{pv}(t)}{dt}$

The inductor current reference is adopted as:

(4) $i_{L}^{*}(t) = i_{pv}(t) - C_{pv} \dot{V}_{pv}^{*}(t)$

By substituting the expression of the current reference into Eq. (2), the current tracking error e₂ becomes:

(5) $e_{2}(t) = i_{L}(t) - i_{pv}(t) + C_{pv} \dot{V}_{pv}^{*}(t)$

The error dynamics are derived through the following analytical procedure.

(6) $\dot{e}_{1}(t) = \frac{1}{C_{pv}} \left( i_{pv}(t) - i_{L}(t) \right) - \dot{V}_{pv}^{*}(t) = -\frac{e_{2}(t)}{C_{pv}}$

(7) $\dot{e}_{2}(t) = \frac{1}{L} V_{pv}(t) - \frac{1}{L} V_{dc} \left( 1 - u(t) \right) + \delta(t) - i_{L}^{*}(t)$

The proposed control system is designed to simultaneously achieve two primary objectives:

- To drive the tracking errors of Eqs (1) and (2) to zero within a finite time for precise tracking.
- To synthesise pulse width modulation (PWM) control signals that enforce the reaching condition $S\dot{S} < 0$, guaranteeing that the system trajectory reaches the sliding manifold.

The TSMC controller guarantees finite-time convergence of the system trajectory (S) to the sliding manifold, even under matched uncertainties. This is achieved through a co-designed non-linear sliding manifold and a discontinuous control law with fractional-power terms, ensuring Lyapunov stability and deterministic bounded-time convergence.

For robust MPPT, we use sliding mode control with surface defined in Eq. (8), ensuring the system reaches and maintains the desired trajectory by enforcing S = 0.

The terminal sliding surface is defined as:

(8) $S(t) = \frac{1}{\alpha} e_{2}^{x}(t) - e_{1}(t)$

where the variable S denotes the terminal sliding manifold in the TSMC framework, used to ensure finite-time convergence and robust control of the PV system.

Here α > 0 and x = p/q with 1 < x < 2, where p and q are positive odd integers satisfying q < p < 2q. Odd p and q keep the fractional powers real-valued for negative errors, and these choices meet the mathematical requirements that ensure system stability and reliability (Baraean et al., 2023).

When the sliding condition S(t) = 0 is achieved, the current error becomes:

(9) $e_{2}(t) = \alpha^{\frac{1}{x}} e_{1}^{\frac{1}{x}}(t)$

and consequently, the error dynamics described in Eq. (6) become:

(10) $\dot{e}_{1}(t) = -\frac{\alpha^{\frac{1}{x}}}{C_{pv}} e_{1}^{\frac{1}{x}}(t)$

Once the system reaches and remains on the sliding surface (S(t) = 0), the errors (e₁, e₂) are driven to zero.
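The finite-time property of Eq. (10) can be checked numerically. Writing γ = 1/x and separating variables for e₁(0) ≠ 0 gives the settling time $t_f = C_{pv} |e_1(0)|^{1-\gamma} / ((1-\gamma)\,\alpha^{\gamma})$. The sketch below compares this closed form with a forward-Euler integration of Eq. (10). All numerical values (α, x, C_pv, the initial error, step size, and tolerance) are hypothetical, as are the function names.

```python
def settling_time(e0, alpha, x, c_pv):
    """Closed-form time for Eq. (10) to drive e1 from e0 to zero."""
    gamma = 1.0 / x
    return c_pv * abs(e0) ** (1.0 - gamma) / ((1.0 - gamma) * alpha ** gamma)


def simulate_settling(e0, alpha, x, c_pv, dt=1e-7, tol=1e-6, max_steps=1_000_000):
    """Forward-Euler integration of Eq. (10) until |e1| <= tol."""
    gamma = 1.0 / x
    k = alpha ** gamma / c_pv
    e, steps = e0, 0
    while abs(e) > tol and steps < max_steps:
        direction = 1.0 if e > 0 else -1.0
        # Sign-preserving fractional power keeps the update real for e < 0.
        e -= dt * k * direction * abs(e) ** gamma
        steps += 1
    return steps * dt


# Hypothetical values: alpha = 2, x = 5/3 (p = 5, q = 3), C_pv = 1 mF, e1(0) = 5 V.
t_analytic = settling_time(5.0, 2.0, 5.0 / 3.0, 1e-3)
t_simulated = simulate_settling(5.0, 2.0, 5.0 / 3.0, 1e-3)
```

The two values agree to within a few percent, with the residual gap coming from the Euler step and the stopping tolerance. An exponent γ < 1 is what makes the settling time finite; a linear surface (γ = 1) would only give asymptotic convergence.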

To ensure reachability of this surface, we employ a framework that yields the intended outcomes:

Theorem:

Equations (1)–(3) describe the dynamic behaviour of the PV system. Applying robust terminal sliding mode control with the control law u(t) below forces the system to reach the sliding surface S = 0 within a finite time interval and guarantees tracking of the maximum power:

(11) $u(t) = -\frac{1}{V_{dc}} \left[ \frac{1}{L} \left( V_{pv}(t) - V_{dc} \right) + \frac{\alpha e_{2}^{2-x}(t)}{x C_{pv}} - i_{L}^{*}(t) + \sigma(t)\,\mathrm{sign}(S(t)) \right]$

where the parameters α > 0 and σ > 0 ensure the reachability of the sliding surface and robust control under uncertainties.
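A direct transcription of Eq. (11) as printed is sketched below to show how the switching term enters the control signal. The function and argument names are hypothetical, the numerical values are illustrative, and the fractional power of e₂ is evaluated in a sign-preserving way, which is an assumption consistent with p and q being odd. The function returns the raw expression; no saturation to a valid duty-cycle range is applied, since the source does not specify one.

```python
def signed_power(value, exponent):
    """Sign-preserving power so that odd-ratio exponents stay real for value < 0."""
    magnitude = abs(value) ** exponent
    return magnitude if value >= 0 else -magnitude


def sign(value):
    """Return -1, 0, or 1."""
    return (value > 0) - (value < 0)


def tsmc_control(v_pv, v_dc, e2, s, i_l_ref_term, alpha, sigma, x, inductance, c_pv):
    """Evaluate the right-hand side of Eq. (11) term by term."""
    bracket = (
        (v_pv - v_dc) / inductance
        + alpha * signed_power(e2, 2.0 - x) / (x * c_pv)
        - i_l_ref_term
        + sigma * sign(s)
    )
    return -bracket / v_dc


# Hypothetical operating point and parameters.
params = dict(v_pv=30.0, v_dc=48.0, e2=0.2, i_l_ref_term=0.0,
              alpha=2.0, sigma=10.0, x=5.0 / 3.0, inductance=1e-3, c_pv=1e-3)
u_above = tsmc_control(s=0.5, **params)   # trajectory above the surface
u_below = tsmc_control(s=-0.5, **params)  # trajectory below the surface
```

The only difference between the two evaluations is the switching term, so the control changes by 2σ/V_dc when S changes sign. This jump is the source of chattering, and its size is what the PPO agent influences when it tunes σ.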

3.3.1. Lyapunov stability analysis with adaptive parameter updates

The integration of the PPO agent for real-time TSMC parameter adaptation (α and σ) necessitates rigorous stability analysis to ensure that the learning process does not compromise system stability. We present a comprehensive Lyapunov-based analysis that establishes sufficient conditions for stability under adaptive parameter updates.

Theorem 1: Stability Under Bounded Parameter Adaptation

Consider the TSMC system described by Eqs (1)–(11) with time-varying parameters α(t) and σ(t) updated at discrete intervals k with sampling period T_s. If the following conditions hold:

(C1) Parameter bounds: α(t) ∈ [αmin, αmax] and σ(t) ∈ [σmin, σmax], where αmin > 0, σmin > 0
(C2) Rate limitation: |α[k + 1] - α[k]| ≤ Δαmax and |σ[k + 1] - σ[k]| ≤ Δσmax
(C3) Sampling period constraint: T_s < T_s_max = 1/(2*fc), where fc is the system’s closed-loop bandwidth.

Then the closed-loop system is uniformly ultimately bounded (UUB) stable, and the sliding surface S(t) converges to a compact set around zero in finite time.
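Conditions (C1) and (C2) can be enforced by projecting each raw PPO action before it reaches the controller. The sketch below is one possible way of doing so and is not taken from the source; the function name and all numerical bounds are hypothetical.

```python
def project_parameter(previous, proposed, lower, upper, max_step):
    """Apply the rate limit (C2) first, then the magnitude bounds (C1).

    If `previous` already lies in [lower, upper], the final clamp can only
    shorten the step, so both conditions hold for the returned value.
    """
    step = max(-max_step, min(max_step, proposed - previous))
    return max(lower, min(upper, previous + step))


# Hypothetical bounds for sigma: [0.5, 3.0] with at most 0.25 change per update.
sigma_next = project_parameter(previous=1.0, proposed=5.0,
                               lower=0.5, upper=3.0, max_step=0.25)
```

Choosing the lower bound of σ at or above ηmax + ε, as required in Step 3 of the proof, makes the reaching condition independent of whatever action the agent proposes.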

Proof:

Step 1: Lyapunov Function Construction

Consider the Lyapunov function candidate:

(12) $V(S) = \frac{1}{2} S^{2}(t)$

where S(t) is the terminal sliding surface defined in Eq. (8). The function V(S) is positive definite (V > 0 for S ≠ 0) and radially unbounded.

Step 2: Time Derivative Analysis

Taking the time derivative of V along the system trajectories:

(13) $\dot{V}(S) = S \dot{S}$

Differentiating the sliding surface of Eq. (8), with α held constant between parameter updates, and substituting into Eq. (13):

(14) $\dot{V}(S) = S(t) \left[ \frac{x}{\alpha} e_{2}^{x-1}(t)\, \dot{e}_{2}(t) - \dot{e}_{1}(t) \right]$

Applying the TSMC control law u(t) from Eq. (11), we obtain:

(15) $\dot{V}(S) = -\sigma(t) |S(t)| + S(t)\, \eta(t)$

where η(t) represents the lumped uncertainty including model mismatches, external disturbances, and parameter variation effects. Assuming bounded uncertainty |η(t)| ≤ ηmax, we have:

(16) $\dot{V}(S) \leq -\sigma(t) |S(t)| + \eta_{\max} |S(t)|$

(17) $\dot{V}(S) \leq -\left( \sigma(t) - \eta_{\max} \right) |S(t)|$

Step 3: Stability Condition Under Parameter Adaptation

For stability, we require $\dot{V}(S) \leq 0$ outside a boundary layer, which is guaranteed if σ(t) > ηmax + ε, where ε > 0 is a design margin. This condition is enforced by constraining the PPO agent’s action space such that σmin ≥ ηmax + ε. Under condition (C1), this constraint is always satisfied, ensuring $\dot{V}(S) < 0$ whenever |S(t)| > 0.
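As a worked consequence of Eq. (17), a bound on the reaching time can be sketched. The derivation assumes that condition (C1) holds with σmin ≥ ηmax + ε and ignores the perturbation at parameter update instants; the symbol $t_r$ for the reaching time is introduced here for illustration only.

```latex
\begin{aligned}
\dot{V}(S) &\le -\left(\sigma(t) - \eta_{\max}\right)|S(t)|
            \le -\varepsilon\,|S(t)| = -\varepsilon\sqrt{2V(S)}, \\
\frac{d}{dt}\sqrt{2V(S)} &= \frac{\dot{V}(S)}{\sqrt{2V(S)}} \le -\varepsilon
            \qquad (S \neq 0), \\
|S(t)| &\le |S(0)| - \varepsilon\,t
            \;\Longrightarrow\; t_r \le \frac{|S(0)|}{\varepsilon}.
\end{aligned}
```

A larger design margin ε therefore shortens the worst-case reaching time, at the cost of a larger switching gain and hence more chattering.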

Step 4: Effect of Parameter Rate Limitation

The parameter updates occur at discrete intervals k with period T_s. During the interval t ∈ [kT_s, (k + 1)T_s], the parameters remain constant. The rate limitation (C2) ensures that parameter changes are gradual, preventing abrupt jumps that could destabilise the system.

The maximum instantaneous degradation in the Lyapunov derivative at a parameter update instant is bounded by: (18) $\Delta \dot{V}[k] \leq \Delta\sigma_{\max} |S(kT_{s})|$

By choosing Δσmax sufficiently small and T_s small relative to system dynamics (condition C3), the system remains stable between updates, and the overall convergence is preserved.
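One way to enforce these conditions in software is to project each PPO action before it reaches the controller. The sketch below takes (C1) to be a box constraint σmin ≤ σ ≤ σmax and (C2) to be a per-update rate limit |Δσ| ≤ Δσmax, as the text suggests. The function names and all numerical values are hypothetical.

```python
def project_sigma(sigma_prev, sigma_proposed, sigma_min, sigma_max, d_sigma_max):
    """Rate-limit the proposed gain (C2), then clamp it to the admissible box (C1).

    Assumes sigma_prev already lies in [sigma_min, sigma_max].
    """
    step = max(-d_sigma_max, min(d_sigma_max, sigma_proposed - sigma_prev))
    return max(sigma_min, min(sigma_max, sigma_prev + step))

def lyapunov_jump_bound(d_sigma_max, s_at_update):
    """Right-hand side of Eq. (18): worst-case change in V-dot at an update."""
    return d_sigma_max * abs(s_at_update)

# Hypothetical update: the agent asks for a large jump, the projection grants a small one.
sigma_next = project_sigma(0.5, 1.0, sigma_min=0.4, sigma_max=1.0, d_sigma_max=0.05)
```

Applying the rate limit first and the clamp second keeps both constraints satisfied, because clamping can only move the value back towards the previous gain.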

The reward function used in this study is as follows: (19) $r[k] = P_{pv}(k) - \lambda_{1} |e_{1}(k)| - \lambda_{2} |S(k)| - \lambda_{3} |\Delta u[k]|$ where Δu[k] = u[k] - u[k-1] represents the control signal variation (chattering penalty).
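Eq. (19) translates directly into a per-step function. In the minimal sketch below the weights λ1, λ2 and λ3 are left as arguments because their values are not given in this section. The numbers in the usage line are purely illustrative.

```python
def reward(p_pv, e1, s, u, u_prev, lam1, lam2, lam3):
    """Per-step reward of Eq. (19): extracted power minus three penalties."""
    tracking_penalty = lam1 * abs(e1)          # voltage tracking error
    surface_penalty = lam2 * abs(s)            # distance from the sliding surface
    chattering_penalty = lam3 * abs(u - u_prev)  # control signal variation
    return p_pv - tracking_penalty - surface_penalty - chattering_penalty

# Hypothetical sample: 100 W extracted, small error, small control move.
r = reward(p_pv=100.0, e1=-0.5, s=0.2, u=0.6, u_prev=0.5, lam1=1.0, lam2=2.0, lam3=10.0)
```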

Simulation Results and Discussion

To validate the effectiveness of the proposed hybrid MPPT controller, we developed a simulation platform to model a PV system under various environmental conditions. This section details the simulation tools, system parameters, and test scenarios. The entire control system was implemented in MATLAB R2020 (The MathWorks, Inc., Natick, Massachusetts, USA). The simulated system is based on a monocrystalline PV panel with specific electrical characteristics.

Figure 4 provides a comprehensive analysis of the PPO agent’s training process over 20 epochs. The plots illustrate the agent’s learning progress and the distribution of its performance, with Reward being the key metric for optimisation.

The upper-left subplot shows the average reward per epoch. The blue line with markers indicates that the agent’s performance improved significantly from the start of training, peaking around epoch 15. The upward trend from epoch 2 to epoch 15 demonstrates that the PPO algorithm successfully learned a policy to maximise the reward.

The upper-right subplot displays the moving average reward, which smooths out the fluctuations seen in the first plot. This red curve clearly shows a consistent and stable increase in the agent’s performance, confirming that the learning process was successful.

The lower-left subplot correlates the average reward with the overall system tracking efficiency. The graph shows a clear positive correlation: as the PPO agent’s reward increases during training, the overall tracking efficiency also improves. This indicates that the PPO agent’s goal of maximising its reward is well aligned with the real-world objective of maximising the PV system’s efficiency.

The lower-right subplot shows a histogram of the reward distribution, providing insight into the frequency of different reward values obtained by the agent.
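The smoothing used for the moving average reward curve can be sketched as a trailing-window mean. The window length is not stated in the text, so the value of 3 used below, like the reward values themselves, is only an assumption for illustration.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the samples available so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the sample that left the window
        out.append(running / min(i + 1, window))
    return out

# Hypothetical per-epoch rewards, smoothed with a 3-epoch window.
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```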

This section presents the results obtained from the simulations carried out in different environmental scenarios and compares the performance of the proposed hybrid MPPT controller with those of existing conventional and intelligent methods. For fair performance benchmarking, all comparative MPPT methods shown in Figure 5, including classical SMC, TSMC, ANN-MPPT, PSO-MPPT, and P&O, were independently implemented and simulated by the authors within the same MATLAB/Simulink environment (The MathWorks, Inc., Natick, Massachusetts, USA), using identical PV system parameters, converter settings, sampling time, and irradiance/temperature profiles.

The controller parameters of each method were selected according to widely adopted configurations reported in the literature and further fine-tuned to ensure stable operation under the same operating conditions.

The SMC and TSMC gains were chosen based on standard sliding mode design rules, while the ANN and PSO configurations followed standard implementations. Specifically, the ANN model employed a three-layer feedforward neural network with one hidden layer of 10 neurons, trained using the same dataset as the proposed LSTM predictor, whereas a standard PSO algorithm was implemented with 10 particles, inertia weight w = 0.7, and cognitive/social coefficients c₁ = c₂ = 1.5. The PSO directly searched the duty cycle space to maximise power output. This unified simulation framework ensures that the comparison reflects differences in control strategy performance rather than discrepancies in system modelling or simulation conditions.
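The PSO baseline described above can be illustrated with a compact duty cycle search. The sketch uses the stated settings (10 particles, w = 0.7, c₁ = c₂ = 1.5). The two-peak power curve `toy_power`, the iteration count, the even initial spread, and the random seed are assumptions made only so that the example runs on its own. In a real converter each evaluation would be a power measurement at the applied duty cycle.

```python
import math
import random

def toy_power(d):
    """Hypothetical two-peak P(d) curve standing in for a partially shaded array."""
    return (100.0 * math.exp(-((d - 0.6) / 0.1) ** 2)
            + 60.0 * math.exp(-((d - 0.25) / 0.08) ** 2))

def pso_duty_search(power, n_particles=10, w=0.7, c1=1.5, c2=1.5, iters=40, seed=0):
    """Standard global-best PSO over the duty cycle interval [0, 1]."""
    rng = random.Random(seed)
    pos = [(i + 0.5) / n_particles for i in range(n_particles)]  # even initial spread
    vel = [0.0] * n_particles
    pbest = pos[:]
    pbest_val = [power(d) for d in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])    # cognitive pull
                      + c2 * r2 * (gbest - pos[i]))      # social pull
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))  # keep duty cycle in [0, 1]
            val = power(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
            if val > gbest_val:
                gbest, gbest_val = pos[i], val
    return gbest, gbest_val

best_d, best_p = pso_duty_search(toy_power)
```

Because every iteration evaluates all particles, the number of power evaluations per update grows with the swarm size, which fits the higher latency reported for PSO in this study.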

Throughout this paper, tracking efficiency is defined as the ratio of the actual power extracted by the PV system P_pv to the theoretical maximum available power P_max, expressed as: (20) $\eta_{track}\,(\%) = \frac{P_{pv}}{P_{\max}} \times 100$ where P_max is determined from the PV array’s characteristic curve under the prevailing irradiance and temperature conditions. This metric quantifies how effectively the MPPT algorithm locates and maintains operation at the true MPP.
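As a small worked instance of Eq. (20), the function below evaluates the efficiency for one operating point. The power values in the usage line are hypothetical and are not taken from the simulations.

```python
def tracking_efficiency(p_pv, p_max):
    """Tracking efficiency in percent, as defined in Eq. (20)."""
    if p_max <= 0:
        raise ValueError("p_max must be positive")
    return 100.0 * p_pv / p_max

# Hypothetical operating point: 190 W extracted out of 200 W available.
eta_track = tracking_efficiency(190.0, 200.0)
```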

Figure 5 compares six distinct algorithms, including a standard SMC, a hybrid LSTM–TSMC, and a more advanced hybrid LSTM–PPO–TSMC. The ANN model is a three-layer feedforward network with one hidden layer (10 neurons, tanh activation), trained on the same dataset used for the LSTM. The ANN inputs were irradiance G and temperature T, and the output was the estimated V_mpp. The ANN was coupled with a PI controller for duty cycle adjustment.

The LSTM–PPO–TSMC demonstrates the highest level of performance. It tracks the maximum power reference with exceptional accuracy and minimal power fluctuations. The red line remains very close to the top of each power step, showing that the system is consistently operating at or near the optimal point. The small, tight oscillations indicate a highly stable and efficient control strategy.

The TSMC approach performs well, tracking the power reference accurately. However, it exhibits noticeably more oscillations than the LSTM–PPO–TSMC system. The larger fluctuations suggest that without the PPO agent’s dynamic optimisation of the controller’s parameters, the system is less stable and less able to perfectly regulate the output. Moreover, the SMC algorithm shows the least effective performance. It tracks the power reference but with a much lower average power output and significantly more power oscillations compared to the other two methods. The instability and lower efficiency are evident from the wide fluctuations in the blue line, indicating a struggle to consistently operate at the MPP.

The ANN-MPPT showed good steady-state accuracy under stable irradiance, achieving ~96.1% efficiency. However, under rapid irradiance changes the ANN exhibited slower adaptation and larger voltage oscillations compared to the LSTM-based predictor, due to its lack of temporal memory.

PSO demonstrated good global search capability. However, it suffered from high computational latency (~150–200 ms per update) and significant power oscillations during convergence, especially under fast-changing conditions.

All algorithms show a fast response to the step changes in the power reference at t = 1.5 s and t = 2.0 s. There is no significant lag in their ability to begin tracking the new maximum power point. However, the LSTM–PPO–TSMC and LSTM–TSMC systems converge to their respective stable states much more cleanly and with less initial overshoot than the SMC algorithm.

The simulation results demonstrate the superiority of the hybrid LSTM–PPO–TSMC approach. By combining the predictive intelligence of the LSTM, the dynamic optimisation of PPO, and the robust control of TSMC, the system achieves a near-perfect balance of high efficiency, fast response, and outstanding stability. The results suggest that the PPO component is crucial for refining the control parameters in real time, as its absence in the LSTM–TSMC system leads to greater oscillations and a loss of stability. The standard SMC algorithm, lacking both predictive and optimisation capabilities, proves to be the least effective for this application.

The integration of PPO for TSMC parameter optimisation enabled dynamic adaptation of the controller according to environmental conditions. The comparative results are presented in Table 2.

Table 1.

Parameters used for comparative MPPT methods

Method	Main parameters
SMC	λ = 5, switching gain = 10
TSMC	α = 0.8, σ = 0.5
ANN-MPPT	1 hidden layer, 10 neurons, tanh
PSO-MPPT	Particles = 10, w = 0.7, c₁ = c₂ = 1.5
P&O	Step size = 0.01
Proposed	LSTM (2 × 64), PPO adaptive α, σ

ANN, artificial neural networks; MPPT, maximum power point tracking; P&O, perturb & observe; PPO, proximal policy optimisation; PSO, particle swarm optimisation.

These results, presented in Table 2, demonstrate that the proposed hybrid controller outperforms conventional methods in terms of accuracy, speed, and stability. The PPO agent intelligently adjusts the TSMC parameters, while the LSTM anticipates MPP variations, thus reducing the tracking error.

Table 2.

Performance metrics for different MPPT algorithms

MPPT method	Average error	Response time (ms)	Oscillations	Efficiency (%)
Classic SMC	0.65	120	Low	94.3
TSMC	0.52	100	Very low	95.8
ANN-MPPT	0.48	110	Moderate	96.1
PSO-MPPT	0.55	180	High	95.2
TSMC + PPO + LSTM (proposed)	0.25	65	Negligible	97.4

ANN, artificial neural networks; LSTM, long short-term memory; MPPT, maximum power point tracking; PPO, proximal policy optimisation; PSO, particle swarm optimisation.

To validate the proposed approach in non-standard conditions, a fluctuating and dynamic irradiance pattern, which mimics a partly cloudy day, was specifically used during the simulation.

Figure 6 compares the real-time performance of three MPPT algorithms under a fluctuating irradiance profile, which is characteristic of a partly cloudy day. A dashed black line represents the theoretical maximum power (P_max), serving as the optimal reference for the algorithms. The conventional SMC exhibits the weakest performance, showing a noticeable tracking delay and large oscillations. Its output frequently falls below the theoretical maximum power and remains noisy, indicating poor adaptability to rapid and non-linear irradiance variations and resulting in significant energy loss. The TSMC achieves a clear improvement over the standard SMC, tracking the maximum power more accurately, particularly during increasing irradiance. However, noticeable oscillations persist, with repeated over- and undershoots around the ideal power trajectory.

The proposed LSTM–PPO–TSMC algorithm delivers the best performance. Its output closely follows the theoretical maximum power with minimal oscillations, demonstrating superior stability and rapid adaptation to irradiance fluctuations. This improvement stems from the combined effect of accurate LSTM-based prediction and real-time PPO optimisation of the TSMC parameters. Overall, the results confirm that the LSTM–PPO–TSMC approach achieves optimal MPPT performance under dynamic operating conditions. The clear performance gap relative to SMC and LSTM–TSMC highlights the critical role of PPO in enhancing robustness, tracking accuracy, and energy efficiency in practical PV systems.

The results in Table 3 show that classical SMC exhibits the slowest response and the highest overshoot, owing to severe chattering and the lack of predictive capability. LSTM + TSMC improves convergence speed and, by anticipating the new operating point, reduces overshoot, although noticeable overshoot remains under abrupt irradiance changes. The proposed LSTM–PPO–TSMC achieves the fastest settling time and the smallest overshoot, confirming that PPO-based adaptive tuning effectively balances fast convergence and damping. These quantitative results are consistent with the time-domain behaviour observed in Figure 6 and confirm the superior dynamic performance of the proposed controller.

Table 3.

Performance metrics for different MPPT algorithms

MPPT method	Settling time (ms)	Overshoot (%)
Classic SMC	160	9.5
LSTM + TSMC	85	3.8
LSTM + PPO + TSMC (proposed)	55	1.6

LSTM, long short-term memory; MPPT, maximum power point tracking; PPO, proximal policy optimisation.
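Settling time and overshoot of the kind reported in Table 3 can be extracted from a sampled step response as sketched below. The ±2% settling band is an assumption, since the band used in the study is not stated here. The function name and the sample data are hypothetical.

```python
def step_metrics(t, y, y_ref, band=0.02):
    """Return (settling_time, overshoot_percent) for a step response toward y_ref.

    Settling time is the first sample time after which y stays within
    +/- band * |y_ref| of y_ref. It is None if the record ends outside the band.
    Assumes y_ref > 0 and that t is measured from the step instant.
    """
    overshoot = max(0.0, (max(y) - y_ref) / y_ref * 100.0)
    last_outside = -1
    for i, v in enumerate(y):
        if abs(v - y_ref) > band * abs(y_ref):
            last_outside = i
    if last_outside == len(y) - 1:
        return None, overshoot
    return t[last_outside + 1], overshoot

# Hypothetical normalised power response sampled every millisecond.
t_ms = [0, 1, 2, 3, 4, 5, 6]
p_norm = [0.0, 0.5, 1.1, 0.95, 1.01, 1.0, 1.0]
settling_ms, overshoot_pct = step_metrics(t_ms, p_norm, y_ref=1.0)
```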

Figure 7 illustrates the dynamic adjustment of the two key parameters (α and σ) of the TSMC over a 100-second simulation period. These parameters, optimised in real time by the PPO agent, are crucial for the controller’s performance in tracking the maximum power point.

The parameter α is related to the convergence speed of the controller. As seen in Figure 7, α is adjusted from a high value (1.0) to a low value (0.1) and back. A high α is typically used when the system needs to quickly converge to the desired setpoint, such as at the beginning of a simulation or after a sudden change in irradiance. A lower α is used once the system is close to the setpoint to minimise steady-state oscillations and maintain high stability.

The parameter σ governs the robustness of the controller against uncertainties and disturbances. It is adjusted from a low value (0.3) to a high value (1.0) and back. A high σ is needed to handle significant changes or uncertainties in the system, ensuring the controller remains effective even under fluctuating conditions. A lower σ is sufficient when the system is stable and there are fewer disturbances.

The graph clearly shows that the PPO agent dynamically adjusts these parameters in a coordinated manner in response to the changing environmental conditions. For instance, when the system experiences a rapid change (indicated by the sharp drop in the red line), the PPO agent immediately increases the value of σ to ensure robust tracking and stability. This adaptive behaviour is what allows the hybrid LSTM–PPO–TSMC controller to maintain superior performance and efficiency compared to methods with fixed parameters.

Figure 8 presents a frequency-domain analysis of the chattering phenomenon in two different control systems: SMC and the proposed LSTM–PPO–TSMC hybrid controller. The SMC has a consistently high magnitude across a broad range of frequencies. This indicates a significant presence of high-frequency components, which is the primary characteristic of chattering. Chattering is an undesirable effect in SMC that causes high-frequency oscillations in the control signal, leading to increased power losses, mechanical stress, and reduced system lifespan. In stark contrast, the proposed hybrid controller exhibits a significantly lower magnitude of chattering across all frequencies. The blue line is consistently several orders of magnitude lower than the red line on the logarithmic scale, especially at higher frequencies. This reduction is a direct result of the terminal sliding mode control (TSMC) component, which is designed to minimise chattering by ensuring the system’s state converges to the sliding surface in a finite time. The low-frequency oscillations are likely due to the system’s natural dynamics, but the suppression of high-frequency noise is highly effective.

This result provides strong evidence that the LSTM–PPO–TSMC hybrid controller is far superior to the standard SMC in suppressing chattering. By effectively reducing high-frequency oscillations, the proposed method not only improves the system’s efficiency but also its reliability and longevity. This is a critical advantage for real-world applications of MPPT. These findings are consistent with the frequency-domain observations in Figure 8 and provide strong evidence of improved power quality.
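A frequency-domain chattering measure of this kind can be sketched with a plain discrete Fourier transform of the control signal. The test signals, the cut-off bin, and the energy-fraction index below are illustrative assumptions and do not reproduce the analysis behind the figure.

```python
import cmath
import math

def magnitude_spectrum(x):
    """One-sided DFT magnitudes |X[k]| for k = 0..N/2 (naive O(N^2) transform)."""
    n = len(x)
    return [abs(sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)))
            for k in range(n // 2 + 1)]

def high_freq_energy_fraction(x, cutoff_bin):
    """Share of one-sided spectral energy located at or above cutoff_bin."""
    mags = magnitude_spectrum(x)
    total = sum(m * m for m in mags)
    return sum(m * m for m in mags[cutoff_bin:]) / total if total > 0 else 0.0

N = 128
# Hypothetical control signals: a slow sinusoid, and the same signal with
# sample-to-sample switching superimposed to mimic chattering.
smooth = [math.sin(2 * math.pi * 2 * m / N) for m in range(N)]
chattering = [s + 0.5 * (-1) ** m for m, s in enumerate(smooth)]

hf_smooth = high_freq_energy_fraction(smooth, N // 4)
hf_chattering = high_freq_energy_fraction(chattering, N // 4)
```

A signal dominated by switching concentrates its energy in the upper bins, so the index rises towards 1, whereas a smooth control signal keeps it near 0.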

Figure 9 provides clear evidence that the proposed LSTM–PPO–TSMC algorithm is far superior to the standard SMC in maintaining stable efficiency under dynamic, real-world conditions. While the SMC’s performance is highly sensitive to fluctuations in irradiance, the hybrid approach demonstrates exceptional resilience, ensuring continuous and optimal energy harvesting. The proposed controller has been tested under extreme conditions, such as rapid irradiance changes. It maintained remarkable stability, without significant performance loss. In addition, the architecture is flexible enough to be generalised to other types of PV panels or multi-source systems.

From a practical implementation standpoint, the proposed architecture presents several advantages over alternative intelligent MPPT methods. First, the LSTM network is trained entirely offline and deployed in inference-only mode, meaning no online backpropagation is required during converter operation. This drastically reduces the runtime computational burden compared to fully online DRL approaches, which incur 11.3× the relative cost, as shown in Table 4. Second, the PPO agent operates in a constrained action space with pre-trained weights, requiring only a forward pass through a shallow actor-critic network (64-neuron hidden layer) at each update step. Third, the TSMC controller itself is a lightweight analytical control law requiring only 64 operations per step, adding negligible overhead. Compared to PSO-MPPT, which requires continuous evaluation of 10 particles and has a relative cost of 3.1×, the proposed method achieves superior tracking performance at a fraction of the computational cost. Compared to classic SMC and TSMC alone, the additional overhead introduced by the LSTM and PPO components is modest, approximately 4,320 additional operations per step, yet yields substantial gains in tracking efficiency (97.4% vs. 94.3% and 95.8%, respectively). The memory footprint of 38 KB is well within the 1 MB RAM available on the STM32H7 platform, confirming that the proposed method is readily deployable on commercial embedded hardware without requiring specialised AI accelerators or high-performance computing resources.

Table 4.

Comprehensive computational complexity comparison of MPPT methods

MPPT method	Operations/step	Execution Time (ms)	Memory (KB)	Update rate	Relative cost
Classic SMC	32	0.02	2	Full switching	0.08×
TSMC	64	0.05	4	Full switching	0.13×
ANN–MPPT (3-layer)	1,280	0.18	12	1 kHz	0.70×
PSO-MPPT (10 particles)	3,200	0.35	28	100 Hz	3.10×
Full DRL (online)	18,000	2.10	256	10 Hz	11.3×
LSTM–PPO–TSMC (proposed)	4,416	0.45	38	1 kHz	1.0×

ANN, artificial neural networks; DRL, deep reinforcement learning; LSTM, long short-term memory; MPPT, maximum power point tracking; PPO, proximal policy optimisation; PSO, particle swarm optimisation.
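The budget figures in Table 4 lend themselves to a quick arithmetic check. The sketch below uses only values from the table and the 1 MB RAM figure quoted in the text. It assumes that the 0.45 ms execution time applies to each cycle of the 1 kHz update loop, which is not stated explicitly.

```python
# method: (operations per step, execution time in ms, memory in KB), from Table 4
table4 = {
    "Classic SMC": (32, 0.02, 2),
    "TSMC": (64, 0.05, 4),
    "LSTM-PPO-TSMC": (4416, 0.45, 38),
}

ops, exec_ms, mem_kb = table4["LSTM-PPO-TSMC"]
extra_ops_vs_tsmc = ops - table4["TSMC"][0]
extra_ops_vs_smc = ops - table4["Classic SMC"][0]

ram_fraction = mem_kb / 1024.0   # share of the 1 MB RAM quoted for the STM32H7
period_ms = 1.0                  # one update period at 1 kHz
cpu_share = exec_ms / period_ms  # fraction of each period spent executing (assumption above)
```

Under that assumption the controller would occupy a little under half of each update period and under 4% of the available RAM.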

Conclusion

This research presents a novel hybrid MPPT architecture that integrates LSTM networks, TSMC, and PPO for enhanced PV system performance. The proposed LSTM–PPO–TSMC framework combines three functions: LSTM networks provide accurate V_mpp prediction, TSMC ensures robust tracking control, and PPO enables real-time dynamic optimisation of controller parameters. Comprehensive simulation results validate the superiority of this hybrid approach, which achieves 97.4% tracking efficiency, an average tracking error of 0.25, and a response time of 65 ms. These improvements represent substantial advances over conventional MPPT methods, particularly in terms of convergence speed, tracking accuracy, and system stability under varying environmental conditions. The integration of DL prediction capabilities with RL optimisation represents a paradigm shift in MPPT controller design. By combining the temporal pattern recognition strengths of LSTM with the adaptive optimisation power of PPO within a robust TSMC framework, the proposed system demonstrates superior resilience to irradiance fluctuations, temperature variations, and PSCs that typically challenge conventional MPPT algorithms. The research contributions extend beyond immediate performance gains, establishing a foundation for next-generation intelligent PV systems. This hybrid approach adapts autonomously to dynamic conditions through PPO-based parameter tuning, making it a promising solution for maximising energy harvesting efficiency in practical applications.
