From Maths Anxiety to MicroGPT Magic:

Connecting the Pieces of AI Learning

Today, I had a full-circle moment. Last year around this time I decided I wanted to understand AI. Not how to work with it, but the REAL fundamentals. Modern AI is trained through machine learning (how computers learn patterns from data and make predictions or decisions), and in particular deep learning, a subset of machine learning. I remember I started with the machine learning and deep learning specializations from Andrew Ng and the amazing folks at DeepLearning.ai. I was immediately fascinated by the theory behind it, but my biggest problems were the code and the maths. I have severe maths anxiety, but also a fascination with it. I love watching maths-related documentaries and movies.

My studies became the backbone of my own learning system, the XueCodex, which you can read more about here: How AI Helped Me Fall Back in Love with Learning

That was over 8 months ago. As with many of my projects, the theory took a backseat. I started focusing on building, learning engineering principles, and using my product thinking to create solutions for my own problems. And I have done that: I just launched the v1 of AIropa. But whenever I’m building, I keep getting this nagging feeling inside me. I truly am interested in this. However vague or abstract, I LOVE what AI can do for us, but I also LOVE the theory behind it. I realized for the first time in my life… I want to SPECIALIZE! I know, weird, right, as a very vocal member of Generalist World. But what I really want to do is create custom AI solutions with local AI models. We are currently in the middle of the AI hype, yet almost every solution is a thin wrapper around someone else’s API. AI won’t take over the world, but as the technology becomes more relevant, I firmly believe there will be a moment similar to the rise of cloud infrastructure: we might want to run open-source models on-premises.

I think that an AI Product Engineer with a humanities background and a love for ethics and philosophy can be useful here. That’s why I want to learn more, train more, understand more (I’m an Information Geek after all).

Well, today I had the perfect opportunity. I came across a LinkedIn post where the great Andrej Karpathy released a Python script that trains a MicroGPT in about 200 lines. I’m still focusing on understanding Python, so it sounded like the perfect challenge. Note: it wasn’t about the Python. It was about the logic and concepts I learned over a year ago.

I normally use Bitty (ChatGPT) for my training, but even though ChatGPT has served me well, I am less and less happy with the model. So I decided to take Claude for a spin. Our chat was a combination of code and context engineering. Bitty knows how I learn. Claude doesn’t, but it shows how adaptive foundation models are if you feed them the right information.

When I looked at the code, I didn’t start to cry. I used to think of coding as scary, but I noticed I could read it, that I understood it. The first 20 lines were easy: importing some functions, downloading the data, cleaning it up. I actually understood most of the functions, because I use them in my projects all the time.

The true challenge came at the definition of the Value class: a block of code full of terms like gradients, children, and local derivatives. Pure abstraction. I literally thought that the relationship between “parents” and “children” in the code was about the letters in the names we were trying to predict. It wasn’t. It was about tracking which maths operations led to which results, so the model could trace its mistakes backward.
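That bookkeeping idea can be sketched in a few lines. This is a simplified toy in the spirit of Karpathy's autograd pattern, not his exact code: each Value remembers the values it was computed from (its children), and `backward()` walks that graph in reverse to assign blame.

```python
class Value:
    """A number that remembers which operations created it."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0            # how much the final result changes if this value changes
        self._children = children  # the values this one was computed from
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # local derivative of a sum is 1 for both inputs
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # local derivatives: d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # order the graph so every value comes after its children,
        # then trace the mistake backward from the result to the inputs
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a       # loss = 2*3 + 2 = 8
loss.backward()
print(a.grad)          # d(loss)/da = b + 1 = 4.0
print(b.grad)          # d(loss)/db = a = 2.0
```

So "parents" and "children" really are genealogy, just of arithmetic operations rather than of letters.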

The realization I had this week is that I learn differently. Most coding classes are bottom-up. The historian in me is a naturally divergent, top-down thinker. I need to see the blueprints and the house to understand why the toilet is leaking, or to see the crack in the wall that needs extra mortar. I need the context for it to land, or else it doesn’t.

So what does this GPT actually do? It takes 32,000 baby names, learns the patterns in how letters follow each other, and then hallucinates 20 completely new names. That’s it. I had learned about the fundamentals — backpropagation, loss functions, ReLU — and they were hidden in my own XueCodex. But it was all so abstract and mathematical that I couldn’t connect the pieces.
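What "learning the patterns in how letters follow each other" means can be made concrete with a much simpler stand-in: plain counting. The snippet below uses a tiny made-up name list (not the real 32,000-name dataset) and counts which letter follows which, including start (^) and end ($) markers. The real transformer learns far richer patterns, but the question it answers is the same.

```python
from collections import Counter

names = ["emma", "anna", "ella", "mia"]  # tiny made-up sample

# count which letter follows which, with start (^) and end ($) markers
pairs = Counter()
for name in names:
    padded = "^" + name + "$"
    for a, b in zip(padded, padded[1:]):
        pairs[(a, b)] += 1

# after an "m", which letters has the model seen, and how often?
print([(b, n) for (a, b), n in pairs.items() if a == "m"])
# → [('m', 1), ('a', 1), ('i', 1)]
```

Sampling from counts like these already produces name-ish strings; the GPT replaces the counting table with learned weights that can also use longer context than one letter.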

Claude was trying to explain how the model learns letter relationships using formulas and code examples — weights multiplied by embeddings, parents and children in computation graphs, gradients flowing backward. I kept saying “too abstract.” It tried concrete examples with the name “emma.” Still didn’t land.

Then I asked: “Does it work like a huge chessboard where every letter has a position and the model figures out where they should go?”

Claude refined it: the model doesn’t jump around finding letters. It rearranges the board. Letters that follow each other in real names — like e and m in “emma” — drift closer together during training. Letters that never appear together drift apart.

That’s when everything clicked. Suddenly:

  • The weights were just positions on the board
  • Training was rearranging the board over 1000 rounds
  • Attention was looking at the board through four different lenses, each one catching different patterns
  • The Value class — that scary math thing — was just the machinery that figures out which direction to nudge each piece
  • The loss going down meant the board was getting more organized

The key realization: it’s a grid with numbered positions. The model wants to figure out the optimal arrangement so it can predict names better. All the scary math — autograd, backpropagation, gradient descent — is just the mechanism for nudging pieces in the right direction.
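The "nudging pieces in the right direction" is gradient descent. Here is a hypothetical two-piece board with made-up starting positions, where the objective is simply to pull the pieces for "e" and "m" together; each round moves every coordinate a small step against the gradient of the loss. The real model does the same thing, just with thousands of coordinates and a prediction loss instead of a distance.

```python
# toy 2-D "board positions" for two letters (made-up starting values)
e = [0.0, 0.0]
m = [3.0, 4.0]
lr = 0.1  # learning rate: how big each nudge is

for step in range(50):
    # loss = squared distance between the two pieces
    dx, dy = e[0] - m[0], e[1] - m[1]
    # the gradient tells each coordinate which way to move to shrink the loss
    e[0] -= lr * 2 * dx;  e[1] -= lr * 2 * dy
    m[0] += lr * 2 * dx;  m[1] += lr * 2 * dy

print(e, m)  # both pieces have drifted toward each other
```

After 50 rounds the two pieces meet near the midpoint of their starting positions. In the real script, the Value class computes those `dx`-style gradients automatically for every weight.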

Once that landed, the explanations of the remaining functions came much more easily. I could look back at what I learned and connect it. The maths is still fuzzy, and I believe I will never be able to calculate it by hand. But I get it. I can confidently say I can read 90% of the code Karpathy created and understand what it does and why.

And then I ran it. I watched the loss go down — from 3.37 at step 1, slowly dropping as the chessboard reorganized. And then the names appeared:

sample 1: kamon
sample 2: ann
sample 3: karai
sample 4: jaire
sample 5: vialan
sample 6: karia
sample 7: anna
sample 8: areli
sample 9: kaina
sample 10: anton

Names the model never saw. Names that don’t exist. But they sound right — because the board learned the patterns.

Back in March last year I asked myself if I could learn about Deep Learning despite failing high school maths. Tonight, on Valentine’s Day, I trained my own GPT and watched it generate names I never taught it.

The answer is yes.
