The 60-Year Hunt for AI's Most Important Function
I was trying to understand how SwiGLU works, but I couldn’t find an explanation that clicked for me.
So I made this video to explain it from first principles.
Check it out: youtu.be/JRaPNrpsQ9s
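For readers who want something to poke at before watching, here is a minimal sketch of a SwiGLU feed-forward block in plain Python. It assumes the commonly cited form, `down(swish(gate(x)) * up(x))`, with no bias terms. The names `W_gate`, `W_up`, `W_down`, and `matvec` are illustrative choices, not taken from the video.

```python
import math

def swish(z):
    # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gate branch goes through Swish, the up branch stays linear
    gate = [swish(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    # Element-wise product: the gate decides how much of each hidden unit passes
    hidden = [g * u for g, u in zip(gate, up)]
    # Project back to the model dimension
    return matvec(W_down, hidden)

# Toy example: model dim 2, hidden dim 1 (assumed weights, for illustration only)
x = [1.0, 2.0]
out = swiglu_ffn(x, W_gate=[[1.0, 0.0]], W_up=[[0.0, 1.0]], W_down=[[1.0], [1.0]])
```

The gating is the point: when the gate pre-activation is strongly negative, the hidden unit is close to switched off regardless of what the linear branch computes.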
**Modern Transformer - Complete Guide**
Interested in learning the recent advances in transformers?
After 14 videos, I've finally completed this series!
🥳🥳🥳
Check out the course here:
www.youtube.com/playlist?lis...
The Most Underrated Layer Inside Every AI Model
Virtually every AI model has normalization layers.
BUT, what makes them so essential? 🤔
New video on the role of normalization in stabilizing training, plus alternatives like DyT and Derf.
youtu.be/JHl_gwVoh-k
How is DeepSeek V4 so INSANELY cheap? 🤔
Compared to a GQA baseline, its new *compressed attention* mechanism (CSA and HCA) slashes the KV cache memory cost by 98% 🤯 at a 1M-token context!
Here’s how: youtu.be/q8holiIirgo
**Modern Transformer architecture explained**
I compiled a list of videos on the Transformer architecture into a short "YouTube course".
www.youtube.com/playlist?lis...
Hopefully this will be helpful for beginners in the community.
Happy learning! 😎
Finally got some time to read the DeepSeek Engram paper!
Idea: Replace repeated reconstruction with direct lookup of common knowledge.
It’s so intuitive that it feels strange this wasn’t part of the design from the start.
Video summary here: youtu.be/87Q8nf1XHKA
How do we make attention actually capture context?
Exclusive Self Attention (XSA) is an interesting variant that improves attention with minimal cost in speed & memory.
Check out the video here: youtu.be/2eZKT4H9_iQ
New video! How do LLMs grow outrageously large yet stay blazingly fast?
The secret: Mixture of Experts (MoE)
In this video, we cover the role of FFNs, how to scale them without slowing down, and how to maintain load balance and training stability.
Full video here: youtu.be/0QQlYR1r6pQ
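A rough sketch of the core MoE idea, in plain Python: a router scores every expert, but only the top `k` actually run, so compute per token stays roughly constant as the expert count grows. This assumes a simple softmax over the selected logits and leaves out load-balancing losses entirely. `moe_forward`, `router_weights`, and the toy experts are hypothetical names for illustration, not the video's implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a short list
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, k=2):
    # One router logit per expert: dot product of a router row with the token
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep the indices of the k highest-scoring experts
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts only
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # only the chosen experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

# Toy setup: 4 "experts" that just scale the input by 1, 2, 3, 4
calls = []
def make_expert(scale, idx):
    def expert(v):
        calls.append(idx)
        return [scale * vi for vi in v]
    return expert

experts = [make_expert(i + 1, i) for i in range(4)]
router_weights = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
out, top = moe_forward([1.0, 0.0], router_weights, experts, k=2)
```

With 4 experts and `k=2`, only half the expert parameters are touched for this token, which is why total parameter count can grow much faster than per-token cost.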