LLMs have taken the world by storm. Most people interact with them through polished APIs: type a prompt, get an answer. What this hides is the architecture underneath, where these models excel, and where they still need work. Under the hood lie non-obvious design choices that determine speed, cost, and capability; choices that matter deeply if you want to build, fine-tune, or optimize these models.
I implemented GPT-2 from scratch with only PyTorch to understand the architecture end-to-end. On top of it, I added LoRA (Low-Rank Adapters), RoPE (Rotary Positional Embeddings), a KV cache, and more. While implementing, there were several moments that made me scratch my head, and I kept documenting all of them. Today I am sharing 6 of the most important ones. For a deeper look at the architecture, see my previous deep-dive.
1. LoRA vs rsLoRA (Rank-Stabilized)
LoRA fine-tunes a model by training only two low-rank matrices, B and A, with shapes (dimension, rank) and (rank, dimension) respectively, while keeping the original weights W frozen [1]. This reduces the number of trainable parameters drastically (in my case, just 0.18% of all weights).
LoRA comes with two hyperparameters, alpha (α) and rank (r), with α/r acting as the scaling factor applied to the low-rank update.
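Written out, the fine-tuned forward pass from the LoRA paper [1] combines the frozen weights with the scaled low-rank update:

\[
W' = W + \Delta W = W + \frac{\alpha}{r} \cdot (B \cdot A)
\]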
This scaling factor decides how much importance the fine-tuned parameters are given. If α is 32 and r is 16, the scaling factor is 2, so the update is given 2x importance. The scaling factor therefore varies with the choice of α and r. A visual idea is given below:

But there is an issue with LoRA, as reported by Kalajdzievski: if the rank keeps increasing, dividing the update by r eventually shrinks the contribution of the fine-tuned weights [2]. In simple terms: as rank grows, the individual weight updates shrink, and LoRA quietly becomes less effective without you realizing it.
Note: For those interested in the underlying math, I’ve included the statistical proof below. Otherwise, feel free to jump straight to the next section.
Proof: The entries of B and A are randomly initialized when fine-tuning starts; for this analysis, treat them as normally distributed, i.e., Bⱼₖ, Aₖᵢ ~ N(0, σ²).
So as we increase r, the variance of each entry of B·A grows proportionally:
\[
\begin{aligned}
Var(B \cdot A) &= Var\Big(\sum_{k=1}^{r} B_{jk} \cdot A_{ki}\Big) \\
&= Var(X_1 + X_2 + \dots + X_r) \\
&\text{(writing each product } B_{jk} \cdot A_{ki} \text{ as an independent variable } X_k\text{)} \\
&= Var(X_1) + Var(X_2) + \dots + Var(X_r) \\
&\text{(variance of a sum of independent variables)} \\
&= r \cdot c \\
&\text{(each } X_k \text{ has the same constant variance } c\text{)} \\
\text{Result: } &Var(B \cdot A) \propto r
\end{aligned}
\]
But we can’t stop here: we need the variance of the complete fine-tuned update (ΔW), with the scaling factor accounted for:
\[
\begin{aligned}
Var(\Delta W) &= Var\left(\frac{\alpha}{r} \cdot (B \cdot A)\right) \\
&= \frac{\alpha^2}{r^2} \cdot Var(B \cdot A) \\
&\text{(using the rule } Var(aX) = a^2 \, Var(X)\text{)} \\
&= \frac{\alpha^2}{r^2} \cdot (r \cdot c) \\
&\text{(using the result from above)} \\
&= \frac{\alpha^2 \cdot c}{r} \\
&\propto \frac{1}{r} \\
&\text{(since } \alpha^2 \text{ and } c \text{ are constant)} \\
\text{Result: } &Var(\Delta W) \propto 1/r
\end{aligned}
\]
This shows that with increased rank, the variance of the fine-tuned weights decreases, meaning the weight updates become smaller and smaller. To resolve this shrinking issue, Kalajdzievski introduced a simple and effective fix: replace r with √r in the scaling factor. Repeating the calculation with α/√r gives Var(ΔW) = (α/√r)² · r · c = α² · c, a constant. The variance no longer depends on r, so the magnitude of the updates stays stable at every rank (shown in the plot below). Thus it’s better to stick with rsLoRA than LoRA.

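In code, the whole difference between the two methods is one scaling constant. Here is a minimal sketch of a LoRA-wrapped linear layer (illustrative, not the author's exact implementation; class and flag names are mine):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank adapter on top.

    The base weight W is frozen; only A and B are trained. The `use_rslora`
    flag switches the scaling factor from alpha/r (LoRA) to alpha/sqrt(r)
    (rsLoRA), which keeps Var(ΔW) independent of the rank.
    """

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0,
                 use_rslora: bool = True):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W stays frozen
        # Standard init: A ~ Gaussian, B = 0, so ΔW starts at zero
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / (rank ** 0.5) if use_rslora else alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

With rank = 16 and alpha = 32, plain LoRA scales the update by 2 while rsLoRA scales it by 32/√16 = 8; the gap widens further at higher ranks, which is exactly the shrinkage the derivation above predicts.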
2. RoPE instead of Learned Parameters or Sinusoidal Positional Embeddings (PEs)
Positional embeddings are often treated as a secondary detail, but they carry real importance, and a wrong approach can spoil an otherwise well-designed model. The research paper “Attention Is All You Need” [3] used Sinusoidal Positional Embeddings (PEs). This approach involves no parameters, relying on a fixed formula to generate values. However, it carries several caveats: the fixed formula encodes only absolute positions and is not flexible enough to capture relative ones. Another major issue is that these positional embeddings are added directly to the token embeddings, altering the magnitude of the actual information the token embeddings carry.
To overcome these, models like GPT-2 and GPT-3 switched to a learned-parameters approach. Instead of relying on a single fixed formula, the network was left to discover positional information through backpropagation. While this was a step in the right direction, it had its own caveats: it added more parameters to the model (context_size * dimension), and the major problem, direct addition to the token embeddings, still remained.
RoPE (Rotary Positional Embeddings) came to the rescue [4]. It overcame most of the drawbacks the other two approaches carried, and most modern LLMs now ship with RoPE by default, for good reason. Unlike learned or sinusoidal approaches, RoPE encodes position by rotating the Query and Key matrices based on position and frequency, leaving token embeddings untouched. Thus, it achieves two objectives with a single effort: zero parameter load on the model and no direct addition, ensuring the actual information carried by token embeddings is left unchanged.
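The rotation itself is compact. Below is a minimal sketch that pairs adjacent dimensions and rotates each pair by a position- and frequency-dependent angle (this is one common convention; some implementations pair the two halves of the vector instead, and the function name here is mine):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Each adjacent pair of dimensions (2k, 2k+1) is rotated by an angle
    position * base^(-2k/dim). dim must be even. Because every pair
    undergoes a pure rotation, token-vector norms are preserved.
    """
    seq_len, dim = x.shape
    # one frequency per pair of dimensions
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)          # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * freqs[None, :]  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

In a real attention layer this would be applied to Q and K (per head) right before the dot product; the token/value embeddings never see it, which is the point.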
I’ve covered all three in depth with visuals and a pros/cons breakdown: read the full article here.
3. Weight Tying
Weight tying refers to sharing weights between the token embedding layer and the output projection head. Historically, GPT, GPT-2, and BERT all used it. On a 124M-parameter model it saves 38M parameters, roughly 30% of the entire model, which was significant. The intuition also made sense: the embedding maps token → vector and the output head maps vector → token, making them natural transposes of each other. However, as models scaled to billions of parameters, this 38M saving became less than 0.5% of the total, practically meaningless. So most modern LLMs like LLaMA, Mistral, and Falcon keep them separate, also because separate weights give the output head freedom to specialize independently. Weight tying makes sense for small models but quietly disappeared as models scaled.
So if you’re building a small model from scratch, it’s worth keeping. If you’re fine-tuning a billion-parameter model, don’t bother looking for it, it’s likely already gone.
4. Pre-LayerNorm vs Post-LayerNorm
Pre-LN and Post-LN sit on opposite ends of a stability vs. performance tradeoff. The original “Attention Is All You Need” architecture utilized Post-LN (where normalization happens after the residual addition). While Post-LN can lead to better final performance, it is notoriously difficult to train because it can cause gradients to explode or vanish in deep networks.
Starting with GPT-2, the industry switched to Pre-LN (where normalization happens inside the residual block). This choice prioritizes training stability, though it often comes at a slight cost to the model’s ultimate representational power. Researchers have been trying to break this trade-off ever since, leading to modern variations like DeepNorm, RMSNorm, and Double Norm.

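The entire Pre-LN vs Post-LN difference is where the norm sits relative to the residual addition; everything else in the block is identical. A minimal sketch (function names are mine):

```python
import torch
import torch.nn as nn

def post_ln_step(x, sublayer, norm):
    # Post-LN (original Transformer): normalize AFTER adding the residual.
    # The residual stream itself passes through LayerNorm at every block,
    # which is what makes gradients fragile in deep stacks.
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer, norm):
    # Pre-LN (GPT-2 onward): normalize only the input to the sublayer.
    # The residual path stays a clean identity from input to output,
    # giving gradients an unobstructed route through the network.
    return x + sublayer(norm(x))
```

A quick way to see the stability argument: if the sublayer outputs zero, Pre-LN passes x through completely untouched, while Post-LN still normalizes it. That identity residual path is what lets very deep Pre-LN stacks train without warmup tricks.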
5. KV-Cache
The attention mechanism is the core engine of the Transformer, allowing the model to dynamically weight the importance of different tokens across a sequence. It is arguably the most critical innovation in modern AI, as it enables the model to maintain long-range context and “focus” on relevant information.
There exist three different components inside the attention mechanism: Query, Key and Value.
- Query (Q): represents the current token the model is focusing on
- Key (K): used with the query to find the relationship of the current token with other tokens
- Value (V): the actual content a token shares if selected
During inference, tokens are predicted one at a time, autoregressively. Each new token attends to all previous tokens, which means that without caching, the K and V matrices for every previously seen token are recomputed from scratch at every single step. Wasteful.
The fix is simple: cache the K and V matrices as you go. Each new token only needs to compute its own K and V, then retrieve the rest from the cache. This drops the cost of the K/V computations over a generation of length T from O(T²) to O(T).

The actual speedup: at step 15 of a generation, without a KV cache you recompute K and V for all 15 tokens; with the cache, you compute them for 1. That’s roughly a 15x reduction in K/V compute at that step. In practice you see around a 2x overall speedup once the other operations are accounted for.
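A single-head sketch makes the mechanism concrete (illustrative code, names mine; real implementations cache per layer and per head):

```python
import torch

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One autoregressive attention step with a KV cache.

    x_new: (1, d) embedding of the newly generated token.
    cache: dict holding K and V rows for all previous tokens (empty at start).
    """
    q = x_new @ W_q                      # query for the new token only
    k = x_new @ W_k                      # compute K, V for the NEW token only...
    v = x_new @ W_v
    if "k" in cache:                     # ...and reuse everything already cached
        k = torch.cat([cache["k"], k], dim=0)
        v = torch.cat([cache["v"], v], dim=0)
    cache["k"], cache["v"] = k, v        # cache grows by one row per step
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v   # (1, d) attention output
```

Each call does O(1) K/V projections instead of O(t), and the memory cost shows up directly: the cache holds one K row and one V row per generated token, per layer.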
But there is a tradeoff that nobody mentions: KV cache is not free. It consumes memory proportional to the number_of_layers * sequence_length * dimension. For long contexts this becomes significant, which is exactly why memory is the bottleneck in LLM serving, not compute.
This memory overhead has been a major research challenge, and recently, Google Research introduced a breakthrough to address it. In their 2025 paper, “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate” [5], researchers demonstrated a way to compress the KV cache down to just 3 bits per value.
This technique achieves a 5x to 6x reduction in memory consumption with zero accuracy loss. It works by rotating the dimensional coordinates so they follow a Beta distribution, then applying Lloyd-Max Quantization combined with a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to correct residual errors. This approach eases the memory wall, letting a single chip handle massive contexts that previously required multiple GPUs.
6. Quantization Tradeoff: Why LayerNorm is skipped during INT8 quantization
Modern LLMs are enormous. Storing and running them in full 32-bit or 16-bit floating point precision is expensive, both in memory and compute. Quantization is the process of reducing the numerical precision of model weights, typically from 32-bit floats down to 8-bit integers (INT8) or even 4-bit. This makes models significantly cheaper to store and faster to run, which is why almost every production LLM deployment uses some form of quantization [6].
But quantization is not applied blindly to every layer equally, and this is where it gets interesting.
LayerNorm is almost always skipped during INT8 quantization. The reason is a simple cost-benefit calculation that most articles never explain.
- The benefit is negligible: LayerNorm has almost no parameters, just γ and β, a handful of values compared to the millions sitting in a single linear layer. On a 124M parameter model this is a negligible fraction of total memory. The savings from quantizing them are essentially zero.
- The cost is high: LayerNorm is mathematically sensitive. It computes mean and variance across each token’s embedding, then applies γ and β to rescale. Small precision errors in these parameters, which INT8 introduces, directly distort the normalized output, cascading into every subsequent layer.
The tradeoff is clear: quantize LayerNorm and you gain almost nothing while introducing meaningful quality degradation. So it stays in full precision.
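This policy is easy to express in code. Below is a minimal sketch of symmetric per-tensor INT8 quantization that quantizes Linear weights and deliberately skips LayerNorm modules (illustrative only; production stacks like LLM.int8() [6] use more sophisticated per-channel and outlier-aware schemes):

```python
import torch
import torch.nn as nn

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def quantize_model_weights(model: nn.Module):
    """Quantize Linear weights only; LayerNorm stays in full precision."""
    quantized = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            continue  # skipped: tiny savings, high sensitivity to precision loss
        if isinstance(module, nn.Linear):
            quantized[name] = quantize_int8(module.weight.data)
    return quantized
```

The rounding error per weight is bounded by half a quantization step (scale/2): negligible across millions of Linear weights, but enough to distort the handful of γ and β values that every downstream activation flows through.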
This is a broader lesson in quantization: not all parameters are equal. The question is never just “how many bytes does this save?” but “how sensitive is this layer to precision loss, relative to what we save?”
Conclusion
These 6 things are not secrets; they’re hiding in plain sight inside every major LLM. But tutorials rarely stop to explain the why behind them. Why rsLoRA fixes a variance problem most people never notice. Why RoPE leaves token embeddings untouched. Why weight tying quietly disappeared as models scaled. Why Pre-LN trades performance for stability. Why the KV cache turns O(T²) into O(T). Why LayerNorm survives quantization at full precision.
Building from scratch forces you to confront every one of these decisions. You can’t abstract them away. And that’s exactly why I’d recommend it to anyone who wants to truly understand how these systems work, not just use them.
These six observations are just the surface of what I encountered while building this model. In my upcoming posts, I’ll be doing a deep dive into the specific math of quantization errors and the practical challenges of deploying LLMs at scale. If you’re interested in the intersection of statistical theory and ML engineering, follow along for the next installment.
References
[1] E. Hu, Y. Shen, P. Wallis et al., LoRA: Low-Rank Adaptation of Large Language Models (2021), arXiv:2106.09685
[2] D. Kalajdzievski, A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (2023), arXiv:2312.03732
[3] A. Vaswani, N. Shazeer, N. Parmar et al., Attention Is All You Need (2017), arXiv:1706.03762
[4] J. Su, Y. Lu, S. Pan et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021), arXiv:2104.09864
[5] A. Zandieh et al., TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate (2025), arXiv:2504.19874
[6] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022), arXiv:2208.07339
