Model architecture
Predicting the mechanical properties of spider silk from the primary sequence of various spidroins is a complex task because little is known about the microstructural organization of the silk and how it translates to constitutive relations. Additional complexity arises from the fact that different spidroins (MaSp1, MaSp2, MaSp3, MaSp, MiSp) are present in the spider silk and their molecular organization is largely unknown31. This poses challenges for the physics-based modeling of silks and makes data-driven methods that are agnostic to structural data advantageous for this purpose. It also calls for a representation that can capture the effect of each spidroin on the properties of the spider silk. With the advancements in the field of deep learning, one common strategy is to use pre-trained models such as ProtBERT33. ProtBERT has been used previously for predicting properties in other contexts; however, that approach involved fine-tuning more than a million parameters29,33. Such fine-tuning would require approximately 500 or more training examples29, making it infeasible for the Spider Silkome Database (SSD) (see “Dataset” section in “Methods”). For this reason, we present a new representation of the sequence that addresses the complexities and data constraints mentioned above (see “Representation” section in “Methods”). The complete deep-learning framework used for the prediction of mechanical properties of the dragline spider silk is discussed in the “Deep learning model” section of “Methods”.
It has been reported in the literature that certain motifs have a higher impact on the properties31, but the list of known motifs is limited, and a framework is needed to identify the important motifs for different properties. Ideally, a predictive model should help ascertain which segments of the primary sequence (motifs) impact each property, positively or negatively. Identifying influential motifs is crucial for designing new sequences that result in improved mechanical properties, for instance through microbial production of designer sequences to form a silk dope that can be spun into fibers. However, motif identification in proteins is a large combinatorial problem34. Therefore, the secondary goal of this study is to build a framework that can identify critical motifs.
Choice of enrichment descriptor of amino acids
To use the representation of the primary sequence discussed in the “Representation” section in “Methods” for the prediction of mechanical properties, it is necessary to first fix certain parameters through parametric studies. In Supplementary Note 2, we establish appropriate values of the maximum distance (max(m)) between a pair of amino acids used to develop the representation, for all properties (see Supplementary Fig. 2 and Supplementary Tables 1 and 2). In Supplementary Note 3, we choose between the two representations (\(\mathcal{P}\) or \(\mathcal{L}\); see Supplementary Fig. 3) based on the best way to store the descriptors of the amino acids in a pair. As pointed out in the “Deep learning model” section of “Methods”, the input features fi are downselected from representation \(\mathcal{P}\) or \(\mathcal{L}\) based on user-defined cutoff values. These cutoff values are chosen heuristically to keep the number of tunable parameters low. In Supplementary Fig. 4, we also show the impact of the cutoff value on the model performance. Next, we study the relative importance of the amino acids’ enriching descriptor d. Figure 1 compares different choices of d for the best-chosen max(m) and representation for all the properties. The choice d = B-factor with the \(\mathcal{L}\) representation clearly outperforms all other choices of d for all properties except ϵsup, which is evident from the Pr and Pc values given in Tables 1 and 2, respectively. The B-factor therefore renders the features that are most informative for the prediction of mechanical properties. This can be physically justified by the fact that the Debye-Waller factor (another name for the B-factor) of atoms in polymers is found to have an inverse correlation with the bulk modulus32,35, and also serves as an indicator for glass-transition behavior.
We also know that the cohesive energy, which is also related to the Debye-Waller factor, is used to compute the bulk modulus, which governs all other mechanical properties in models like the group interaction model (GIM)36 and other constitutive laws37. This implies that there is a correlation between molecular mobility and mechanical properties. The superior performance of the model using d = B-factor supports the notion that the segmental molecular mobility in proteins is strongly related to macroscale mechanical properties. For ϵsup, d = 1 performs the best. This can be due either to the lack of data in the case of ϵsup or to the fact that ϵsup depends mainly on just the occurrence of certain motifs in the spider silk. In the literature31 it has been shown that ϵsup depends strongly on the occurrence of poly-Alanine motifs. It is also clear from Fig. 1 and Supplementary Table 3 that the developed DL model works the best for ϵsup, with mean R2 > 0.7. This can be attributed to the fact that ϵsup follows a uniform distribution, as shown in Supplementary Fig. 1, leading to a similar range of output values in the train and test datasets. Furthermore, it is important to recognize that the stress-strain curve of the protein relies on the intricate nanomechanics governing its unraveling process38. This complexity is compounded when multiple proteins (such as spidroins in this scenario) are simultaneously subjected to a pulling force. Consequently, predicting the stress-strain curve from the primary sequence of spider silk is a highly non-linear problem, and we therefore observe R2 < 0.7 for properties obtained from the stress-strain curve. We also present the comparison of results from the task-specific model and the model trained on all the properties (multi-task) simultaneously in Supplementary Note 6. The model architecture for multi-task learning is shown in Supplementary Fig. 5. From the results shown in Supplementary Fig. 6, it is clear that the task-specific model is a better option.

a Comparison of R2. b Comparison of PCC. The error bars indicate ± one standard deviation.
Supplementary Note 5 captures the details of best-performing models for all mechanical properties. Based on the parametric studies presented for different representations, properties (d), and max(m) values (see Supplementary Notes 2–4), the best choice for each mechanical property is given in Supplementary Table 3. Supplementary Table 4 gives details about the number of input features to FFNN and the number of trainable parameters in FFNN for each mechanical property. From the above discussions, we have shown that the deep learning model developed for the prediction of mechanical properties of spider silk is robust and accurate, considering the high variability in experimental data as discussed in the “Training details” section in “Methods”.
To further test the robustness of our model, we compare it against an experimental mutation study from the literature. One experimental study39 shows that mutating Tyr (Y) to Phe (F) in MaSp1 of biomimetic spider silk decreases ϵsup. Accordingly, in our test dataset, we replace Y with F in MaSp1 and observe a mean decrease of 71% in ϵsup. Our model thus predicts the same trend as observed in the experiment, which supports its validity.
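The substitution test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained model: `toy_predict` is a hypothetical stand-in for the deep-learning predictor, `mean_percent_change` is a name introduced here only for illustration, and each test example is assumed to be represented by its MaSp1 sequence alone.

```python
def mean_percent_change(sequences, predict, old="Y", new="F"):
    """Mean percentage change in the predicted property after replacing
    every `old` residue with `new` in each sequence."""
    changes = []
    for seq in sequences:
        base = predict(seq)
        mutated = predict(seq.replace(old, new))
        changes.append(100.0 * (mutated - base) / base)
    return sum(changes) / len(changes)


def toy_predict(seq):
    # Hypothetical stand-in predictor (NOT the paper's model): the output
    # simply grows with the number of Tyr residues in the sequence.
    return 1.0 + seq.count("Y")


# Illustrative sequences only; these are not entries from the SSD.
test_sequences = ["GYGY", "GGY"]
change = mean_percent_change(test_sequences, toy_predict)
print(f"mean change: {change:.2f}%")
```

A negative mean change corresponds to the trend reported in the experiment, namely a decrease in the property after the Y to F substitution.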
Motif identification
Having established the robustness of our deep-learning based model, in this section we discuss the motifs identified to be most influential for different mechanical properties. As the first step, we calculate the feature importance (\({\overline{q}}_{i}\)) of all the features (fi) considered for the prediction of the properties using the method discussed in the “Feature importance analysis” section in “Methods”. Subsequently, for the features with \({\overline{q}}_{i}\) > 0.1, the three types of motifs (ϕw, ϕt, and ϕb) are identified and their impact (Pm) is quantified as described in the “Method for motif identification and quantifying their effect” section in “Methods”. The complete information about the motifs and their impact is presented in Supplementary Note 7. It can be observed from Supplementary Tables 5–9 that Pm can take positive or negative values, indicating a positive or negative correlation, respectively, between the number of motifs (θn in Eq. (6)) and the property. At this point, it is essential to physically interpret the impact magnitude Pm. To that end, let us take the motif LVSSGP (from MaSp1) for ϵf as an example, as it is one of the motifs with the highest positive impact on ϵf. It is evident from Supplementary Table 5 that LVSSGP contributes 0.61% of the max ϵf value per unit of θn. Thus, to increase ϵf by 1.83% of the max ϵf value, we need to increase θn of LVSSGP by 3. Based on Eq. (6), θn can be increased either by increasing the number of motifs or by decreasing the number of repeat units in the sequence. It is also interesting to note that the mean B-factor of the LVSSGP motif in MaSp1 is 0.42, which is higher than the mean B-factor of all individual amino acids (Fig. 2a). This suggests that the LVSSGP segment exhibits greater mobility and flexibility within MaSp1, thereby positively impacting strain.
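The LVSSGP arithmetic above can be reproduced with a short sketch. Eq. (6) is not reproduced in this section, so the form of θn used here (motif count divided by the number of repeat units) is an assumption inferred from the statement that θn rises with more motifs or fewer repeat units; the function names are hypothetical.

```python
def theta_n(motif_count, repeat_units):
    """Assumed normalized motif count: occurrences per repeat unit.
    This form is an assumption; Eq. (6) of the paper is not shown here."""
    if repeat_units <= 0:
        raise ValueError("repeat_units must be positive")
    return motif_count / repeat_units


def required_theta_increase(target_gain_pct, impact_pct_per_theta):
    """Increase in theta_n needed for a target gain, where both arguments
    are expressed as a percentage of the maximum property value."""
    return target_gain_pct / impact_pct_per_theta


# LVSSGP contributes 0.61% of max strain at break per unit of theta_n,
# so a 1.83% gain requires an increase of about 3 in theta_n.
needed = required_theta_increase(1.83, 0.61)
print(round(needed, 6))

# Two routes to the same increase under the assumed form of theta_n
# (illustrative counts only):
more_motifs = theta_n(12, 2) - theta_n(6, 2)   # double the motif count
fewer_repeats = theta_n(6, 1) - theta_n(6, 2)  # halve the repeat units
print(more_motifs, fewer_repeats)
```

Both routes raise θn by the same amount in this toy case, which is why the text treats them as interchangeable design levers.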

a Variation of normalized B factor prediction with respect to the amino acids in MaSp1. b, d, f, h Heatmap showing P(Increase > 20%) for σUTS, ϵf, E, T respectively. c, e, g, i Heatmap showing P(Decrease > 20%) for σUTS, ϵf, E, T respectively.
All the mechanical properties of the dragline spider silk arise from the collective effect of several motifs. This is evident from the fact that none of the Pm values in Supplementary Tables 5–9 is extremely high. The contribution of so many different motifs makes it very difficult to come up with one common design rule for optimizing any property. For example, increasing the number of LVSSGP motifs in MaSp1 increases ϵf, but it also increases the number of SS motifs, which have a negative impact on ϵf. Hence, relationships like this need to be considered while designing fibrous protein-based materials. For the same reason, optimizing a primary sequence for two properties at the same time is more difficult, especially when most motifs have contrasting effects on the two properties. For instance, it may be desirable to increase both ϵf and σUTS; however, motifs like SS have a negative impact on ϵf and a positive impact on σUTS, as shown in Supplementary Tables 5 and 7, respectively.
Design rules
One of the aims of this work is to find the mutations that are required for increasing the mechanical properties of the dragline spider silk. To discuss mutations, we introduce the nomenclature used to indicate substitution as well as deletion/insertion in proteins, which follows standard mutation nomenclature40. The nomenclature for substitution is \( <\mathrm{Res}_{b}><\mathrm{pos}><\mathrm{Res}_{a}> \), which means that amino acid Resb is replaced by amino acid Resa at position pos. To indicate deletion/insertion we use \( <\mathrm{Res}_{s}><\mathrm{Res}_{s}\,\mathrm{pos}>\_<\mathrm{Res}_{e}><\mathrm{Res}_{e}\,\mathrm{pos}>\mathrm{delins}<\text{group of newly inserted amino acids}> \), where Ress and Rese indicate the first and last amino acids deleted.
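To make the nomenclature concrete, the sketch below applies a mutation string to a sequence. It assumes one-letter amino acid codes and 1-based positions, which is how the notation is read here rather than something stated in this section; `apply_mutation` and the example sequences are hypothetical.

```python
import re

# Substitution: <Res_b><pos><Res_a>, e.g. "Y3F"
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")
# Deletion/insertion: <Res_s><pos_s>_<Res_e><pos_e>delins<new residues>
DELINS = re.compile(r"^([A-Z])(\d+)_([A-Z])(\d+)delins([A-Z]+)$")


def apply_mutation(seq, mutation):
    """Return the mutated sequence (assumes 1-based positions)."""
    m = SUBSTITUTION.match(mutation)
    if m:
        before, pos, after = m.group(1), int(m.group(2)), m.group(3)
        if not 1 <= pos <= len(seq) or seq[pos - 1] != before:
            raise ValueError(f"{before} not found at position {pos}")
        return seq[:pos - 1] + after + seq[pos:]

    m = DELINS.match(mutation)
    if m:
        first, start = m.group(1), int(m.group(2))
        last, end = m.group(3), int(m.group(4))
        new = m.group(5)
        if not 1 <= start <= end <= len(seq):
            raise ValueError("deleted range is outside the sequence")
        if seq[start - 1] != first or seq[end - 1] != last:
            raise ValueError("flanking residues do not match the sequence")
        # Remove residues start..end (inclusive) and insert the new group.
        return seq[:start - 1] + new + seq[end:]

    raise ValueError(f"unrecognized mutation string: {mutation}")


print(apply_mutation("GAYGG", "Y3F"))             # GAFGG
print(apply_mutation("GAAAAG", "A2_A5delinsAA"))  # GAAG
```

Checking the named residues against the sequence before editing catches off-by-one position errors, which are easy to make when a mutation is written by hand.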
Based on the observations from Supplementary Tables 5–9, we present some mutation recommendations to increase the properties in Table 3. Before interpreting these mutations, it is essential to recall that the spider silk structure consists of crystalline as well as amorphous regions. The crystalline region consists of groups of amino acids forming β-sheets. Based on the literature41, it is understood that the amino acids Ala, Val, Ile, Tyr, Cys, Trp, Phe, and Thr are more likely to be found in β-sheet regions and amino acids Gly, Pro, Asn, and Ser are more likely to be found in the turns connecting two β-sheets. Further considering the likelihood/propensity of the amino acids to form β-sheet, they can be ranked as L, V > A > G42. From the perspective of major ampullate spidroins in spider silk, literature43,44 highlights that Ala and Gly constitute their primary components. Most Ala residues are integrated into the β-sheet structure, underscoring their strong propensity for β-sheet formation. Gly is present in the β-sheets as poly(GA) and in the amorphous regions. Poly-valine is also found to form β-sheets within an amorphous network to improve toughness and strength45. These observations reinforce that A and V have higher β-sheet propensity than G. Additionally it is shown experimentally that Proline usually favors a more amorphous structure46 or is present in β-turns as GPGXX47. Building upon the aforementioned insights regarding the propensity of various amino acids to form β-sheets, we will investigate the impact of mutations among different motifs. To examine this effect, we report the ΔPm value which is defined as the difference between the higher and lower Pm values. In the literature, it has been established that the β-sheet represents a highly ordered domain within spider silk17 and that the local order of a protein region correlates with its B-factor48. 
Consequently, in the next section, we investigate the impact of mutations on the mechanical properties from the perspective of the B-factor.
Taking the above facts into consideration, we study the effect of different mutations on different mechanical properties. We start with ϵf and hypothesize that it increases 50% of the time when the mutations of the amino acids decrease the β-sheet propensity. Next, we examine σUTS and E and observe that, 50% of the time, mutations that increase β-sheet propensity also increase the property. The figure of 50% might look like a coin-toss probability, but it is important to note that in this case there are three possible outcomes for a mutation: high to low propensity, low to high propensity, or no clear trend, as for the mutation G to N. The toughness (T) does not show a clear trend with respect to the β-sheet propensity, unlike the other properties, because T is the area under the stress-strain curve. A higher toughness requires a larger area under the curve, which in turn requires a higher maximum stress (σUTS), a higher strain at break (ϵf), or both. Since different mutations favor σUTS and ϵf, no single direction of change in β-sheet propensity consistently favors T.
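The dependence of toughness on both σUTS and ϵf can be illustrated by integrating a stress-strain curve numerically. The sketch below uses the trapezoidal rule on made-up, linear curves; the data points and the function name `toughness` are illustrative assumptions and do not come from the dataset.

```python
def toughness(strain, stress):
    """Area under the stress-strain curve by the trapezoidal rule.
    With stress in MPa and dimensionless strain, the result is in MJ/m^3."""
    if len(strain) != len(stress) or len(strain) < 2:
        raise ValueError("need two or more matching strain/stress points")
    area = 0.0
    for i in range(1, len(strain)):
        width = strain[i] - strain[i - 1]
        area += 0.5 * (stress[i] + stress[i - 1]) * width
    return area


# Illustrative curves only (not measured silk data).
base = toughness([0.0, 0.1, 0.2], [0.0, 100.0, 200.0])
stronger = toughness([0.0, 0.1, 0.2], [0.0, 200.0, 400.0])  # higher stress
more_extensible = toughness([0.0, 0.1, 0.2, 0.3],
                            [0.0, 100.0, 200.0, 300.0])    # higher strain
print(base, stronger, more_extensible)
```

Either route enlarges the area, which is why a mutation that raises σUTS and one that raises ϵf can both raise T even though they act on β-sheet propensity in opposite directions.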
Upon comparing the motifs documented in the SSD paper31 with those presented in Supplementary Tables 5–9, we observe notable parallels. Specifically, in the case of ϵf, akin to the findings in the SSD paper, we identify that motifs such as SAAAAA and AS exert a negative influence on the property. Conversely, both studies concur that the motif GGAGQ within MaSp1 contributes positively to ϵf. Both our research and the SSD paper indicate that motifs QGPSG and YGPGS in MaSp2 impact ϵf positively and negatively, respectively. Furthermore, the motif GGPGGYG in MaSp2 affects ϵf negatively. In terms of σUTS, our observation regarding the adverse impact of a poly-Ala segment aligns with the findings in the SSD paper. However, augmenting the poly-Ala segment with Q and G demonstrates a positive effect on the property; for instance, the motif AGQGGA positively influences σUTS. Both the SSD paper and our study identify that motifs YGGL and GAGQGGY in MaSp1 positively impact σUTS. Additionally, in MaSp2, both studies ascertain that motifs PGGY and GPGGY positively affect σUTS. Concerning property E, both works indicate that motifs QGGQGG and AGQGGY within MaSp1 exhibit a positive impact. Furthermore, both studies highlight the recurrence of segments GQGG and GP in several motifs affecting E in MaSp1 and MaSp2 respectively. In the case of property T, both investigations reveal that motifs YGGL and YGG in MaSp1 have a positive influence. Moreover, they observe the segment GQ in many significant motifs for property T in MaSp1, while segments QGP and PG emerge in numerous impactful motifs for T in MaSp2. Overall, our approach has introduced an accelerated framework for identifying significant motifs by prioritizing feature importance, rather than relying on exhaustive motif searches. Additionally, we offer a more structured method to measure the influence of motifs through the computation of Pm.
In the case of supercontraction (ϵsup), it is evident from the Pm values in Supplementary Table 9 that the longer the poly-Ala motif, the larger the decrease in ϵsup. This observation is backed by a literature study showing that ϵsup is positively correlated with the amorphous/poly-Ala region length ratio (PCC = 0.53)31. Increasing the length of even one poly-Ala motif leads to a decrease in the amorphous/poly-Ala region length ratio and, subsequently, in ϵsup. Building on this, we observe from Table 3 that mutating a longer poly-Ala motif to a shorter one increases the property. The role of poly-Ala blocks (4 or more Ala) is further highlighted by the presence of several poly-Ala blocks in the motifs reported in Supplementary Tables 5–9. The literature also emphasizes the significance of poly-Ala blocks in facilitating the formation of β-sheets43,49,50. One study49 indicates that a minimum of three poly-Ala blocks is necessary for the formation of β-sheets in spider silk. Beyond three blocks of poly-Ala, an additional increase in the block count enhances crystallinity by 25–39%. It has also been shown experimentally that poly-Ala enhances the ability of the recombinant spider silk protein to form β-sheet structure, thereby increasing σUTS and T50.
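The effect of lengthening a poly-Ala block on the length ratio can be sketched as follows. Poly-Ala blocks are taken as runs of 4 or more Ala, as stated above. Treating every residue outside such a block as amorphous is a simplifying assumption made only for this illustration, and the function names and sequences are hypothetical.

```python
import re

POLY_ALA = re.compile(r"A{4,}")  # a block is a run of 4 or more Ala


def poly_ala_block_lengths(seq):
    """Lengths of all poly-Ala blocks in the sequence."""
    return [len(m.group(0)) for m in POLY_ALA.finditer(seq)]


def amorphous_to_poly_ala_ratio(seq):
    """Ratio of residues outside poly-Ala blocks to residues inside them.
    Assumption: everything outside a poly-Ala block counts as amorphous."""
    poly = sum(poly_ala_block_lengths(seq))
    if poly == 0:
        raise ValueError("sequence has no poly-Ala block")
    return (len(seq) - poly) / poly


# Illustrative sequences only.
original = "GGAAAAAGGYGGAAAAGG"     # blocks of length 5 and 4
lengthened = "GGAAAAAAAGGYGGAAAAGG"  # first block extended to 7
print(poly_ala_block_lengths(original), amorphous_to_poly_ala_ratio(original))
print(poly_ala_block_lengths(lengthened), amorphous_to_poly_ala_ratio(lengthened))
```

Extending a single block lowers the ratio while leaving the block count unchanged, which matches the direction of the trend described for ϵsup.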
Can B-factor explain the effect of mutations?
Based on the observations in the previous section, it is clear that mutations among the amino acids can have an impact on the mechanical properties. To clearly understand the impact of the mutation of one amino acid to another, we first choose certain amino acids from hydrophobic, polar, and charged groups based on the results shown in Supplementary Fig. 7. For this study, we focus on σUTS, ϵf, E, and T as they are all derived from the same stress-strain curve. We want to point out that we did not consider amino acids A and G for the mutation as they both are extremely important for the formation of β-sheet and amorphous region in MaSp1 spidroin respectively51. However, we will briefly discuss the impact of A and G on mechanical properties toward the end of this section. To understand the terms used for studying mutation refer to the “Mutation study” section in “Methods”.
In this section, our analysis focuses on the effects of mutations in MaSp1 and MaSp2. We have chosen not to include MaSp3 and MiSp in the mutation study, for the reasons discussed below. We use the test datasets for the mutation study because the performance of the DL model on the test dataset reflects its true performance. Out of 203 dragline species, only 22 have MaSp3 data documented for them. We allocate 10% of the total examples to the test dataset, which results in a statistically insignificant representation of dragline species with MaSp3 within the test dataset. For this reason, the mutation study on MaSp3 is not included in our work. As for MiSp, although it is primarily associated with auxiliary spiral silk52, the reason for including it when generating the input features fi is discussed in the “Representation” section in “Methods”. Because of this association, some of the input features derived from MiSp might be just noise, leading to unrealistic mutation results. Hence, we have not included the mutation study for MiSp in our work.
From Fig. 2b, c, it can be observed that Q → <D,K> and S → <K,D> in MaSp1 have a reasonably high chance of increasing σUTS, whereas V → E in MaSp1 leads to a decrease in σUTS. This can be explained using the B-factor values shown in Fig. 2a for all amino acids in the MaSp1 spidroin. It is clear from Fig. 2a that Q and S have a higher B-factor than D and K. Even though D and K are generally more likely to exhibit a higher B-factor53, in MaSp1 they exhibit a lower B-factor than the polar amino acid S. This can be explained by the fact that in MaSp1 the amino acids A and T are the most frequent neighbors of D and K, respectively. Amino acid A is mostly found in the crystalline part of the MaSp1 spidroin13, and amino acid T has a higher propensity for forming β-sheets54, which implies that D and K are present in a more crystalline/structured region of MaSp1. On the other hand, amino acid S has the highest chance of being present next to amino acid G, which is mostly present in the amorphous region of MaSp113. This explains the high B-factor of S in MaSp1. Overall, the lower B-factor of D and K implies their ability to form crystalline/structured regions in MaSp1, leading to an increase in σUTS. Conversely, amino acid E has a higher B-factor than V; hence V → E in MaSp1 leads to a decrease in σUTS. For ϵf to be higher, high extensibility and low stiffness are typically needed. This can be achieved by mutating to amino acids that have a higher B-factor, as that can reduce the β-sheet regions in the spider silk. Since P has a higher B-factor and T has a lower B-factor than Q, we observe from Fig. 2d, e that Q → P and Q → T in MaSp1 lead to an increase and a decrease in ϵf, respectively.
The property E is very similar to σUTS in that it also increases with the formation of more β-sheet regions in the spider silk. Hence, mutating to amino acids with a lower B-factor is beneficial for E. This is exactly what we observe in Fig. 2f, g: mutating Q → <D,K,N,S,I> and S → <K,D,T,V,I,P> in MaSp1 leads to an increase in E. On the other hand, V → <E,Q,P> leads to a decrease in the property, as E, Q, and P have a higher B-factor than V.
As discussed in the above section, higher T needs higher σUTS and ϵf. From Fig. 2h, i alone, it is difficult to hypothesize any pattern of T with respect to the B-factor prediction. We therefore plot P(Increase > 20%) and P(Decrease > 20%) against ΔB-factor, as shown in Fig. 3a, b respectively, and also report the correlation (PCC) between the two variables. It is evident from the figures that in MaSp1, mutations that lead to a decrease in the B-factor are favorable for toughness. It can also be hypothesized from Fig. 2h, i that the presence of amino acids F and Y in MaSp1 is favorable for higher T.
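The two quantities plotted against ΔB-factor can be sketched as below. The exact definitions are given in the “Mutation study” section in “Methods”, which is not reproduced here, so this sketch assumes that P(Increase > 20%) is the fraction of test examples whose predicted property rises by more than 20% after the mutation, and that ΔB-factor is the B-factor after mutation minus the B-factor before it. All numbers are illustrative and all function names are hypothetical.

```python
from math import sqrt


def prob_change_over(base, mutated, threshold=0.20, direction=1):
    """Fraction of examples whose relative change exceeds `threshold`.
    direction=+1 counts increases, direction=-1 counts decreases."""
    hits = sum(
        1 for b, m in zip(base, mutated)
        if direction * (m - b) / b > threshold
    )
    return hits / len(base)


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Illustrative predictions for one mutation over four test examples.
before = [1.0, 1.0, 1.0, 1.0]
after = [1.3, 1.1, 0.7, 1.25]
print(prob_change_over(before, after))                # P(Increase > 20%)
print(prob_change_over(before, after, direction=-1))  # P(Decrease > 20%)

# Illustrative (delta B-factor, P(Increase > 20%)) pairs for three mutations.
delta_b = [-1.0, 0.0, 1.0]
p_increase = [0.6, 0.4, 0.1]
print(pearson(delta_b, p_increase))
```

A negative PCC between ΔB-factor and P(Increase > 20%) would indicate that mutations lowering the B-factor tend to favor the property, which is the kind of trend the text reports for toughness in MaSp1.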

a, b P(Increase > 20%) versus ΔB-factor and P(Decrease > 20%) versus ΔB-factor respectively in MaSp1, c, d P(Increase > 20%) versus ΔB-factor and P(Decrease > 20%) versus ΔB-factor respectively in MaSp2. The green marker indicates that the higher values are favorable for T and the red marker indicates the higher values are detrimental for T. A black line is fitted to the data to indicate the nature of the correlation.
It has been argued previously that MaSp1 is mostly responsible for the strength of the spider silk, whereas MaSp2 is responsible for the elasticity and extensibility55. Therefore, we carry out a similar mutation study for the MaSp2 spidroin as well, to see whether mutations have contrasting effects. The two probabilities discussed for Fig. 2 are also calculated for mutations in MaSp2 and shown in Fig. 4.

a Variation of normalized B factor prediction with respect to the amino acids in MaSp2. b, d, f, h Heatmap showing P(Increase > 20%) for σUTS, ϵf, E, T respectively. c, e, g, i Heatmap showing P(Decrease > 20%) for σUTS, ϵf, E, T respectively.
From Fig. 4b it can be observed that the mutations Q → I, P → I, and S → D in MaSp2 lead to an increase in σUTS due to the decrease in B-factor after mutation. From Fig. 4a, d, e, it can be inferred that the mutation P → R in MaSp2 leads to a decrease in the B-factor, and thus to a decrease in ϵf due to the formation of a more ordered region. However, not all the mutations that lead to a decrease in the B-factor (S → D, P → Q, and S → Y in MaSp2) negatively impact ϵf. It is seen in the literature that S has a higher propensity for forming β-sheets54 than D, and that P has a higher probability of being present in β-turns than Q56. These observations can explain why the mutations S → D and P → Q have higher chances of increasing ϵf. To explain the impact of S → Y, we find in the literature31 that motifs such as GS, GGS, and AS impact ϵf more negatively than GY, GGY, and AY, respectively. This explains the increase in ϵf after S → Y.
From Fig. 4f it can be observed that E increases due to the mutation P → (Q, D, or K) in MaSp2, as these mutations lead to a decrease in the B-factor. However, two of the mutations, S → P and V → P, increase both the B-factor and E. Therefore, this trend cannot be explained using the B-factor alone. It has been noted in the literature55 that MaSp2 is a proline-rich spidroin and that this is important for the structure of MaSp2. It has also been observed in the literature31 that an increase in the occurrence of V in the MaSp2 spidroin negatively impacts E. Thus, the increase in E due to the V → P mutation is supported by these prior findings.
Similar to MaSp1, we plot P(Increase > 20%) and P(Decrease > 20%) with respect to ΔB-factor as shown in Fig. 3c, d respectively. Figure 3c does not show any correlation between P(Increase > 20%) and ΔB-factor. But from Fig. 3d it can be observed that in MaSp2, an increase in the B-factor after mutation is favored. It is in accordance with the literature57, where it has been shown that amino acid P participates in β-turns and contributes to the elasticity of the MaSp2 spidroin. It can be hypothesized from Fig. 4h, i that amino acids D, E, and P are favorable in MaSp2 for higher T.
As pointed out above, we did not consider A and G for the mutation study, but we performed A → G and G → A in MaSp1 and MaSp2 to stress test the model. We observe that G → A in MaSp1 strongly favors σUTS, E, and T with P(Increase > 20%) ≈ 0.6 whereas A → G does not have a strong impact on any properties. The G → A mutation leads to a decrease in the B-factor; hence increasing the σUTS, E, and T. In MaSp2, A → G and G → A do not have a huge impact on any properties.
In conclusion, the B-factor explains the effect of mutations on mechanical properties in MaSp1 well. In MaSp2, the B-factor can explain the effect of most of the mutations, except for a few mutations involving proline (P). This is because MaSp2 is a proline-rich spidroin55 in which P mostly participates in β-turns57, contributing to the elasticity of the dragline silk. Additionally, this study highlights a few mutations that can improve or worsen a group of properties. For example, Q → D in MaSp1 increases σUTS, T, and E, and S → < any hydrophobic amino acid > increases T and E. Conversely, Q → R in MaSp1 worsens σUTS and T, and V → E in MaSp1 worsens σUTS and E. Similarly, we find that the P → Q mutation in MaSp2 has a high chance of increasing ϵf, T, and E.
