Structure based machine learning
In the present study, the CHEMBL was used to assemble a training set of 437 FFAR4 actives (activators), in-actives (inhibitors), decoys (the molecules with the same physicochemical formula of actives but different topology and no activity at the receptor) and randoms. Chemopy Morgan fingerprints were calculated using CHEMDES webserver27 and all subsets were assembled into one training set file (Supplementary Information).
The classification algorithm BayesNET which works on Bayesian network theory was chosen to derive a model as it is most suitable for data with bit arrays entries. Bayesian network algorithm takes the entries and calculates a posterior probability (the model so created is based on the preponderance of occurrence of similar entries and hence taken as features which mostly recur in a particular labelled compounds and hence likely to be essential for the behavior of this compound). The WEKA software based Bayesian network classification reported the best model with the kappa statistic of 0.7. The confusion Matrix resulting from BayesNET algorithm classification and resultant model is shown in Fig. 1.

The confusion matrix for the best model trained.
Kappa statistic is the measure of predictability and generalizability of the model. It is a gauge of the extent to which a huge amount of data can be utilized to provide an accurate result i.e. it denotes the degree to which the study’s data accurately reflect the factors under investigation.
It is possible to calculate Cohen’s kappa using the following formula:
$$\kappa =\frac{{\text{Pr}}\left({\text{a}}\right)-{\text{Pr}}\left(e\right)}{1-{\text{Pr}}(e)}$$
where Pr(e) denotes chance agreement and Pr(a) is the actual observed agreement.
In other words, the kappa statistic of 0.821 indicates that the best model has the generalization ability (for a model applied to an external dataset with what accuracy it is likely to pick a true-active) of around 82%. Other evaluation metrics (F1 score, Mathew Correlation Coefficient (MCC), Precision vs Recall) are presented in table and speak of good predictive accuracy of the model. The detailed statistics of the model are shown in Table 1.
After verification, this model was proceeded to screen the test set. The test set contained 32,000 compounds having applied Modified Lipinski Rule of 3 to 6 million compound in CHEMBL Database. These had Chemopy Morgan fingerprints calculated as per the training set. Upon screening this dataset, 693 compounds were picked up as actives as shown in Fig. 2.

The sorted result of screening by applying the best trained model under the BayesNET algorithm. All the entries in the screening dataset were labelled in-actives at first. Total 693 compound were predicted to be actives.
Molecular interactions
Molecular docking was applied to the 693 hits that came from structure-based screening using machine learning at the target protein’s active site within the designated grid. When choosing the possible hits, docking interactions with important residues in the target site were evaluated. The MOE-implemented PLIF module was used to analyze the shortlisted compounds through protein–ligand interactions fingerprinting (PLIF) analysis. The major interactions between ligand and protein are identified using PLIF, which is regarded as a potent technique. The resulting hits were further narrowed down to 127. The compounds that were shortlisted had binding energies ranging from − 8.85 to − 4.0 kcal/mol. The Protein–Ligand Interaction Profiler (PLIP) was employed to examine the binding poses of the highest-ranked compounds, facilitating the visualization, identification, and in-depth analysis of protein–ligand interactions. Three compounds were finally shortlisted based on the interactions reported earlier14. Interactions of the compounds with the residues Leu196 and Trp198 of FFAR4 were commonly observed in case of all three complexes, while residues Phe25, Glu204, Asp208 also took part in hydrogen bonding. The PLIP interactions of the final 3 compounds shortlisted based on interactions with crucial residues are shown in Fig. 3.

Molecular Interactions observed as a result of molecular docking of the three shortlisted compounds (Compound 1: Maroon, Compound 2: Green, Compound 3: Orange) with FFAR4. The interacting residues are represented as sticks, while black dotted lines represent the hydrogen bonds.
Swiss ADME
Swiss ADME server was used to examine the pharmacokinetic and physicochemical characteristics of 127 hit compounds that were derived from the docking studies. Lipinski’s rule of five, a compound’s solubility, lipophilicity, flexibility, and ability to penetrate the blood–brain barrier are among the factors taken into account in this work. As a result, 27 hit compounds were chosen. All of the compounds’ molecular weights were discovered to fall within the typical range of 95% of the medicines that were tested. It was discovered that every chemical complied with the Lipinski criterion of five drug likelihood with no exceptions. Based on the optimal combinations of PLIP, PLIF (Fig. S2), and ADME profiles identified, three compounds were selected. As shown in Fig. 4, the three compounds projected high oral absorption rate and moderate to well blood–brain barrier permeability. Table S1 provides a comprehensive explanation of the pharmacokinetic parameters for these compounds. In conclusion, it can be deduced that these compounds exhibit favorable ADME profiles, making them promising candidates for further exploration. Additionally, the toxicity profiles of these compounds were also calculated via Protox 3.0 webserver. The three compounds were found to be very non-toxic, representing appropriate safety profiles, as can be seen in Table S2.

ADME profiles in web diagrams of the three final hits along with their structures. The pink area in the web diagrams are acceptable ranges for the values of lipophilicity, solubility, flexibility, unsaturation and molecular size while the white area are disallowed regions. which primarily determine the pharmacokinetic profiles of drug-like compounds.
Molecular binding interactions
In order to observe the molecular binding pattern within the complexes, the resultant trajectories of the simulations were examined with a distance limited at 4 Å. Figure 5 depicts the molecular binding interactions observed after the molecular dynamic simulation of the membrane-bound FFAR4 with compounds 1, 2, and 3. Upon examining the intermolecular interactions, it becomes apparent that the stability of protein–ligand complexes in all three simulated systems primarily relies on a multitude of hydrogen bonds.

Molecular binding interactions observed after the molecular dynamic simulation of the membrane bound FFAR4 with the three compounds (Compound 1: Maroon, Compound 2: Green, Compound 3: Orange). The interacting residues are represented as sticks, while black dotted lines represent the hydrogen bonds.
Compounds 1 and 3 both primarily showed hydrogen bonding interactions with FFAR4 residues including Leu196, Trp198, Glu204 and Asn291 in common at a distance of 2.73, 2.33, 2.73, 2.33 Å and 1.85, 3.25, 2.01, 2.22 Å respectively. These observations are in-line with the interactions observed via molecular docking studies in our present work, as well as the previous reports14. Compound 1 showed an additional bonding with hydrogen atom involving Arg22 at a distance of 3.42 Å and hydrophobic interactions with residue Phe27 and Phe115, while residues Trp277, Ile261, Ile284, Phe211 showed hydrophobic interactions with compound 3. Compound 2 on the other hand only showed one hydrogen bond interaction involving Thr119 at a distance of 1.95 Å.
Structural stability and flexibility
To assess the structural stability, flexibility, and overall dynamics of the FFAR4 complexes, RMSDs of the 100 ns simulated trajectories were calculated and shown in Fig. 6A. The RMSD of any simulated bio molecular system serves as an essential parameter to gauge the system’s stability. In the context of structural biology when studying the atomic coordinates, the lower RMSD values indicate greater system stability whereas larger RMSD values correspond to less stable complexes28. Throughout the simulation, all three systems exhibited fluctuations, with consistent RMSDs. FFAR4 as part of complex 1, 2, and 3 demonstrated an average RMSD of 3.57 ± 0.23, 3.64 ± 0.46, and 3.51 ± 0.34 nm respectively, while Apo FFAR4 exhibited an average of 3.57 ± 0.29 nm deviation. As can be observed in the plot, throughout the entirety of the simulation, comparable RMSDs with minimal fluctuations were observed. Although the fluctuation varied slightly across the three complexes, they remained within the consistent range comparable with the unbound FFAR4.

(A) RMSD of FFAR4 backbone carbon atoms as part of protein–ligand complex 1, 2 and, 3 compared to Apo FFAR4, for the entire duration of the simulation. (B) RMSD of the three compounds as part of protein–ligand complex 1, 2 and, 3 during the course of 100 ns of simulation.
Moreover, further visual analysis of the trajectories was also performed to assess the stability of the bound ligands. The ligands as part of complex 1, 2, and 3 projected an average RMSD of 1.19 ± 0.19 1.91 ± 0.61, and 2.03 ± 0.30 nm, respectively. As depicted in Fig. 6B, compound 2 exhibited the most deviations throughout the simulation, while compound 1 and 3 remained comparatively steady and showed fewer deviations, indicating stable protein–ligand complexes. During the simulations, consistent deviations in the same regions were observed for compound 1 and compound 3, indicating stable binding modes relative to their initial positions. However, the magnitude of deviations is higher in case of complex 3 as compared to others. The instability observed in case of compound 2 can be explained by the inconsistent intermolecular interactions observed (Fig. 5), suggesting instable protein–ligand complex. RMSD of FFAR4 backbone carbon atoms as part of protein–ligand complex 1, 2, 3, and reference ligand compared to Apo FFAR4 (Fig. S3), and the RMSD of the selected compounds when compared with the reference ligand (Fig. S4), indicate similar structural stability.
In order to assess the inherent flexibility of the FFAR4 residues during the course of simulation, Root Mean Square Fluctuations (RMSF) of the simulated trajectories, were computed. The RMSF provides information regarding residue-level perturbations where higher values within a bio molecular system signify flexibility and, consequently a less stable state29. On the other hand, reduced fluctuation imparts greater stability to the system29. The average RMSF (Fig. 7) experienced by the FFAR4 residues in complex 1, 2, and 3 were noted as 6.03 ± 0.57, 5.77 ± 0.66, and, 5.27 ± 0.87 while Apo FFAR4 exhibited an average of 7.34 ± 0.82 nm. As can be observed in the plot, throughout the entirety of the simulation, comparable RMSFs with minimal fluctuations in the same regions were observed with all the complexes, with the exception of complex 1 which showed higher fluctuations. The observed fluctuations can be attributed to the residues involved in binding the ligands and forming the protein–ligand complex. The RMSF of the selected compounds when compared with the reference ligand (Fig. S3), indicates similar flexibility of the FFAR4 residues during the course of simulation. The RMSF plots of the key residues of the FFAR4 binding pocket are provided in Supplementary as Fig. S5.

RMSF of FFAR4 backbone carbon atoms as part of protein–ligand complex 1, 2, and 3, as compared to Apo FFAR4, for the entire duration of the simulation.
Structural compactness
The time-dependent convergence of the radius of gyration (RoG), representing the structural properties of the simulated ensembles, was further examined and illustrated in Fig. 8. RoG, reflecting the compactness of the protein, is defined as the root-mean-squared distance of the components within an object relative to its center of mass. After computing the influence of ligand binding to the receptor at molecular level, the folding behavior of the protein with respect to time was studied via RoG. A properly folded conformation in general features constant gyration values, while on the other hand, distortion in the folding behavior induces the RoG values to change over time30. The gyration values observed for the studied systems displayed fluctuations spanning from 22.8 to 23.5 nm highlighting the divergence in the structural compactness. The mean values of 23.04 ± 0.09, 22.98 ± 0.10, 23.20 ± 0.10, 23.09 ± 0.12, were observed for the Apo FFAR4 and the three complexes, respectively. The RoG data shows that all three complexes were structurally compact, ordered. The RoG of the selected compounds when compared with the reference ligand (Fig. S3), indicates similar compactness in case of complex 1 and 3, while complex 2 showed fluctuations.

The compactness of FFAR4 as part of protein–ligand complexes 1, 2, ad 3, as compared to FFAR4, during the course of entire simulation.
Conformational motions and thermodynamic stability
To capture the substantial conformational changes with respect to free energy, PCA and FEL were computed for all of the simulated systems. We generated 2D FEL plots for both the ligand-free state and ligand-bound complex, utilizing principal component values with a color-coded contour map indicating energy levels and simulation extremes. All three systems demonstrated different conformational patterns with various energy basins.
In general, all of the systems demonstrated different conformational patterns with various energy basins. In Apo state, the membrane bound receptor protein showed a high degree of variance in PC1 and PC2 during both the initial and final MD conformations suggesting instability. As can be observed in Fig. 9A, initially, the PCs values decrease but towards the end of the 100 ns simulation an increase can be observed. Conversely, the stability of the protein backbone in complex 1 is evident from much smaller values of PC1 and PC2 towards the end of simulations as can be seen in Fig. 9B as compared to Apo in which initial variances of PC1 and PC2 begin at -40 and -50 respectively. Complex 2 also demonstrates much smaller values at − 10 and − 40 as seen in Fig. 9C. For complex 3, the final steps show PC1 and PC2 variances at 20 and 40 again indicating a more stabilized complex as seen in Fig. 9D. It can be inferred from the PCA plots, that all three compounds confer stability to the complex which is in-line with the RMSD observations.

The Conformational PCA plots computed from the backbone atoms of (A) Apo FFAR4. (B) Complex 1 (C) Complex 2 (D) Complex 3 next to the corresponding Free energy landscapes (FELs) of (E) Apo FFAR4 (F) Complex 1 (G) Complex 2 (H) Complex 3.
The FEL plot of the Apo FFAR4 show the best free energy profiles at around PC1 and PC2 variances. As can be observed in Fig. 9E, the Apo FFAR4 shows fewer high free energy basins, as compared to the protein–ligand complexes as seen in Figs. 9F–H which show higher free energies and lower variances, which can be attributed to the binding of the ligand. The highest free energy states for the first compound complex lies in in range of 10 to 0 (PC1) and 20 to 0 (PC2) Fig. 9F. In case of the compound 2 these values are − 2 to − 22 (PC1) and − 5 to 0 (PC2) Fig. 9G, while for the third complex, FEL contour hovers around 20 to -5 (PC1) and 20 to − 10 (PC2) Fig. 9H. As can be observed via the FEL plots, the contours shift to wider energy basins, inferring stable protein–ligand complex.
