GENAI

We’ve all seen ChatGPT answer questions, write stories, or even help with code — like magic. But have you ever wondered what’s actually happening behind the scenes? How does it know what to say next? At the heart of ChatGPT is a powerful type of AI called a language model, trained to understand and generate human-like text. In this blog, we’ll take a friendly look under the hood to understand the model behind ChatGPT — how it learns, how it talks, and why it sometimes gets things right (or wrong).

What is GPT?

GPT stands for Generative Pre-trained Transformer. It’s a type of AI model designed to understand and generate human-like text. Let’s break that down:

Generative means it can create text — like writing a story, answering a question, or finishing your sentence.
Pre-trained means it learned a lot about language by reading huge amounts of text from the internet before being fine-tuned for specific tasks. These models also have a knowledge cutoff: their training data only goes up to a certain date, so they have no direct information about events after it.
Transformer is the name of the model architecture it’s built on — a smart way for the AI to figure out which words matter and how they relate to each other in a sentence.

Simple analogy of GPT

Imagine GPT as a super-smart autocomplete, like the one on your phone, but trained on a huge collection of books, websites, and conversations.

Think of it like this:

🧠 GPT is like a chef who has read millions of recipes.
It hasn’t cooked every dish, but it knows the patterns — what ingredients go together, what steps come next. So, when you give it a few words (your prompt), it whips up a response based on what it learned from all those recipes (texts).

This chef doesn’t copy recipes — it creates new ones on the spot, based on patterns it recognizes.

So when you ask ChatGPT something, it's not pulling answers from a database — it's predicting, word by word, what sounds most natural or useful based on everything it’s learned.

GPT doesn’t "look up" answers — it predicts them word by word, using what it learned from reading a massive amount of text.

Transformer

A Transformer is an AI model that processes data (like text) by focusing on the relationships between words, no matter where they appear in a sentence.

The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It revolutionized natural language processing (NLP) and is the foundation of models like GPT (used in ChatGPT), BERT, and many others.

Generative AI models, like GPT, are transforming how we interact with technology. This post will explore the key concepts that power these models, including tokenization, vector embeddings, positional encoding, and self-attention.

1. Tokenization: Breaking Down Language

What is Tokenization?
- GenAI models process language by first breaking it down into smaller units called "tokens."
- This process is called tokenization.
- Tokens can be words, subwords, or even individual characters.
- For example, the sentence "The cat sat on the mat." can be tokenized into word tokens: ["The", "cat", "sat", "on", "the", "mat", "."]
Why is Tokenization Important?
- Tokenization is the first step in Natural Language Processing (NLP).
- It allows machines to understand and work with language by converting text into manageable parts.
Real-World Applications of Tokenization
- Search engines use tokenization to break down queries and find relevant results.
- Chatbots and language models like ChatGPT use tokenization to understand and generate text.
- Translation tools rely on tokenized input to accurately convert languages.
The Two Stages of Tokenization in NLP
- Text to Tokens: The initial step of splitting the text.
- Tokens to Numbers (IDs): Converting each token into a numerical ID using a vocabulary or tokenizer model.
  - For example: "I" → 101, "love" → 202, "ice" → 303, "cream" → 404, "." → 999. So, "I love ice cream." becomes [101, 202, 303, 404, 999]. (These IDs are made up for illustration.)
- These numerical IDs are the final input that the model processes.
Tokenization Summary
- A table summarizing the tokenization process:

Stage	Input	Output	Output type
Tokenization	"I love ice cream."	["I", "love", "ice", "cream", "."]	Word or subword tokens
Encoding	["I", "love", "ice", "cream", "."]	[101, 202, 303, 404, 999]	Token IDs (numbers)

Important Note: Different models (e.g., Gemini, GPT, Claude) use their own unique tokenization processes.
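
To make the two stages concrete, here is a minimal Python sketch. It assumes a toy vocabulary that reuses the made-up IDs from the example above, plus a hypothetical `UNK_ID` for unknown tokens. Real tokenizers work on subwords and are far more sophisticated, so treat this as an illustration only.

```python
import re

# Toy vocabulary; the IDs mirror the illustrative numbers used above
# and are not taken from any real tokenizer.
VOCAB = {"I": 101, "love": 202, "ice": 303, "cream": 404, ".": 999}
UNK_ID = 0  # hypothetical ID for tokens missing from the vocabulary


def tokenize(text):
    # Stage 1 (text to tokens): split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def encode(tokens):
    # Stage 2 (tokens to IDs): look up each token in the vocabulary.
    return [VOCAB.get(token, UNK_ID) for token in tokens]


tokens = tokenize("I love ice cream.")
ids = encode(tokens)
print(tokens)  # ['I', 'love', 'ice', 'cream', '.']
print(ids)     # [101, 202, 303, 404, 999]
```

The list of IDs is what the model actually receives as input.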

2. Vector Embeddings: Representing Meaning

What are Vector Embeddings?
- Vector embeddings are a technique in machine learning and NLP to represent data, especially text, as numerical vectors in a high-dimensional space.
- This allows models to understand and process words, sentences, and even images.
Common Embedding Models:
- Word2Vec (Google)
- GloVe (Stanford)
- OpenAI Embeddings (used in search, chatbots, etc.)
Why Use Vector Embeddings?
- Search Engines: To find similar content based on meaning, not just exact words.
- Recommendation Systems: To embed users/items and find close matches.
- Chatbots/AI Models: To understand and respond to user input meaningfully.
The Goal of Vector Embeddings:
- To represent the semantic meaning of words and phrases.
- Example:
  - USA → Trump
  - India → ?
- The answer is Modi. The model has learned that "USA" relates to "Trump" in the same way that "India" relates to "Modi" (country to leader), just as "King" relates to "Man" in the way "Queen" relates to "Woman."
- Vector embeddings capture real-world meanings.
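
Here is a small sketch of how an analogy like King/Queen can fall out of vector arithmetic. The 3-D vectors are hand-made for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and the `analogy` helper is a hypothetical name, not part of any library.

```python
import math

# Hand-made vectors for illustration only; not from a real embedding model.
EMBEDDINGS = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine_similarity(a, b):
    # 1.0 means "same direction", 0.0 means "unrelated".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    # Exclude the three input words, then pick the closest remaining word.
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, EMBEDDINGS[w]))


print(analogy("man", "king", "woman"))  # queen
```

The same idea applies to the country and leader example: the direction from a country to its leader is roughly the same for every pair.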

3. Positional Encoding: Adding Context to Word Order

What is Positional Encoding?
- Positional encoding is crucial in transformer models (like BERT, GPT) when dealing with sequential data like text.
- Transformers, unlike RNNs or LSTMs, process the entire input simultaneously and lack an inherent sense of word order.
- Positional encoding injects information about the position of words in the input.
How Positional Encoding Works
- Consider the sentence: "The cat sat on the mat."
- Word embeddings alone don't tell the model that "The" is the first word or "mat" is the sixth.
- Positional encoding adds a position-specific vector to each token's embedding, so the combined vector carries both what the token is and where it sits in the sequence.
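
One way to build these position vectors is the sinusoidal scheme from the original Transformer paper, sketched below. This is just one option: GPT-style models typically learn their position vectors instead. The 4-dimensional embedding for "the" is made up for the example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding, position):
    # The position vector is simply added to the token's embedding.
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


the_embedding = [0.5, 0.1, -0.3, 0.8]        # made-up embedding for "the"
first_the = add_position(the_embedding, 0)   # "The" at the start
second_the = add_position(the_embedding, 4)  # "the" just before "mat"
print(first_the != second_the)  # True
```

The same word now gets a different vector depending on where it appears, which is exactly the information the model was missing.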

4. Self-Attention Mechanism: Understanding Relationships

What is Self-Attention?
- The self-attention mechanism is a key component of transformers.
- It allows the model to look at other words in the input (including itself) and determine the importance of each word in understanding the current word.
Example of Self-Attention
- In the sentence "The cat sat on the mat," when processing "sat," self-attention might focus more on "cat" and less on "mat" because "cat" is the subject doing the sitting, which makes it more relevant to interpreting "sat."
- Tokens "talk" to each other to adjust their embeddings, allowing their meaning to be refined based on the context of the sentence.
Multi-Head Attention
- Example: "The cat sat on the mat because it was tired."
- Self-attention can help the model connect "it" to "cat."
- Multi-head attention allows the model to consider different aspects of the relationships between words, such as grammatical structure, nearby words, and synonyms.
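
Below is a stripped-down sketch of a single attention head. It is simplified on purpose: real Transformers first pass each embedding through learned query, key, and value projections, while this version uses the raw embeddings for all three. The 2-D embeddings are made up.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    d = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score this token against every token (including itself).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        # The new embedding is a weighted mix of all token embeddings.
        mixed = [sum(w * value[j] for w, value in zip(weights, embeddings))
                 for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["cat", "sat", "mat"]
embeddings = [[1.0, 0.2], [0.9, 0.3], [0.1, 1.0]]  # made-up vectors
outputs, weights = self_attention(embeddings)
sat_weights = dict(zip(tokens, weights[1]))
print(sat_weights["cat"] > sat_weights["mat"])  # True
```

With these toy vectors, "sat" pays more attention to "cat" than to "mat". Multi-head attention runs several such heads in parallel, each with its own projections.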

5. Model Training and Inference

Two Phases of a Model
- Training Phase: The model learns from data.
- Inference Phase: The model uses what it has learned to make predictions on new, unseen data.
  - Inference is the process of using a trained AI model to make an educated guess about something unknown.
Training Phase Steps
- Give the input.
- Get the output.
- Compare the output to the expected output.
- Calculate the "loss" (the difference between the predicted and actual output).
- Backpropagate to update the model's parameters and reduce the loss.
Computational Resources:
- The training process requires a large amount of GPU resources.
Example
- Input: "<START> My name is"
- Expected output: "Sandhya <END>"
- Model's output: "Sandhya jcbsdjst <END>"
- The difference ("jcbsdjst") results in a loss, which is used to update the model.
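
The training steps can be sketched with a deliberately tiny "model". As an assumption for this example, the only parameters are three raw scores, one per word in a three-word vocabulary, and the expected next word is "Sandhya". Real training backpropagates through billions of parameters, but the loop has the same shape.

```python
import math

VOCAB = ["Sandhya", "jcbsdjst", "<END>"]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(target, steps=20, learning_rate=0.5):
    logits = [0.0] * len(VOCAB)  # the toy model's only parameters
    target_index = VOCAB.index(target)
    losses = []
    for _ in range(steps):
        probs = softmax(logits)                # get the output
        loss = -math.log(probs[target_index])  # cross-entropy vs. expected word
        losses.append(loss)
        # Gradient of the loss w.r.t. each logit: probs - one_hot(target).
        grads = [p - (1.0 if i == target_index else 0.0)
                 for i, p in enumerate(probs)]
        # Update the parameters to reduce the loss.
        logits = [l - learning_rate * g for l, g in zip(logits, grads)]
    return logits, losses


logits, losses = train("Sandhya")
print(losses[0] > losses[-1])  # True: the loss shrinks as training proceeds
```

Each pass nudges the score for "Sandhya" up and the score for the junk word down.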

What is Temperature?

Temperature controls how random an AI’s answers are.

Low temperature (like 0.2) = safe, predictable answers.
High temperature (like 1.2) = creative, surprising answers.

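It works by rescaling the model's raw scores before they are turned into probabilities: a low temperature sharpens the distribution so the top choice dominates, while a high temperature flattens it so less likely words get picked more often.

Example: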

Low Temp → "The sky is blue."

High Temp → "The sky dances with color."
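
A common way to apply temperature is to divide the raw scores by it before converting them to probabilities. Here is a minimal sketch with made-up scores for three candidate words. The function name is hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature):
    # Dividing by a small temperature exaggerates the gaps between scores;
    # dividing by a large one shrinks them.
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up raw scores for three candidate words
cool = softmax_with_temperature(scores, 0.2)
warm = softmax_with_temperature(scores, 1.2)
print(cool[0] > warm[0])  # True: low temperature favors the top choice
print(warm[2] > cool[2])  # True: high temperature gives long shots a chance
```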

What is Knowledge Cutoff?

Knowledge Cutoff means the AI only knows information up to a certain date.
If something happens after that date, the AI won't know about it.
For example, GPT-4 Turbo was announced with a cutoff of April 2023 (the original GPT-4 release stopped at September 2021).

Knowledge up to: April 2023

Anything after → ❓

What is Softmax?

Softmax is a function that takes a list of numbers and turns them into probabilities.

The numbers can be positive, negative, or big/small.
After Softmax, the numbers become values between 0 and 1.
All these values add up to 1, like probabilities should.

It is often used in AI at the final step to choose an output (like selecting the most likely word).

Example: AI Choosing the Next Word in ChatGPT

When you type a sentence like:

"The cat is sitting on the"

ChatGPT needs to predict what word should come next.
Maybe it thinks about these options:

Word	Raw Score
mat	3.5
table	2.1
roof	1.0
moon	0.2

👉 These are raw scores, often called logits: higher means the model considers the word a better fit.

Raw scores are not probabilities, though. They can be negative and don't add up to 1,
so the AI uses Softmax to turn them into probabilities it can sample from!

🛠 How Softmax helps here:

Softmax will convert the scores into probabilities.
Then the AI can sample or choose the next word based on those probabilities.

For example, after Softmax:

Word	Probability
mat	73%
table	18%
roof	6%
moon	3%

Now the AI sees "mat" is the most probable, but it could still occasionally pick "table" if a little creativity is needed (depending on the temperature setting too!).
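
Here is a short sketch that runs Softmax on the raw scores from the first table and then samples a next word. The exact values come out close to the rounded figures in the probability table. Sampling with `random.choices` is one simple way to pick a word, not necessarily how ChatGPT does it.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["mat", "table", "roof", "moon"]
raw_scores = [3.5, 2.1, 1.0, 0.2]

probabilities = softmax(raw_scores)
for word, p in zip(words, probabilities):
    print(f"{word}: {p:.1%}")
# mat: 73.2%
# table: 18.1%
# roof: 6.0%
# moon: 2.7%

# Usually "mat", but occasionally one of the others.
next_word = random.choices(words, weights=probabilities, k=1)[0]
print(next_word)
```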

💡 In simple words:

Without Softmax:

Raw numbers are just messy scores.

With Softmax:

AI gets clean chances to pick the next move!

🔥 Where Softmax is used in AI:

AI Task	Why Softmax?
Chatbots (like ChatGPT)	Pick next word
Image Classification	Decide object in image (cat, dog, car...)
Translation Systems	Choose next translated word
Speech Recognition	Predict next sound or word

What is Vocab Size?

Vocab size (short for vocabulary size) means how many different words or tokens an AI model knows and can understand or generate.

A small vocab size = fewer words (good for simple tasks).
A large vocab size = more words (good for complex language).

✨ Simple Example:

If the vocab size = 10,000,
it means the AI model can recognize or produce 10,000 different tokens (like words, punctuation, parts of words).

Tokens can be:

Whole words (like "apple", "run").
Pieces of words (like "ing", "un", "tion").
Symbols (like ".", "?", "!" etc.).

Token ≠ always a full word!
Example: "playing" → could be split into "play" + "ing".

Vocab Size = 5

Tokens the AI knows: [ "I", "love", "cats", "dogs", "." ]

If you ask about "birds", it won't know unless "birds" or "bird" pieces exist in its vocab!
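
To see how word pieces help, here is a simplified greedy longest-match splitter. Real tokenizers use trained schemes such as BPE or WordPiece, so this is only a sketch, and the vocabularies and the `<unk>` marker are assumptions made for the example.

```python
def segment(word, vocab):
    """Split a word into the longest pieces found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]  # no known piece fits: the word is unknown
        pieces.append(word[start:end])
        start = end
    return pieces


tiny_vocab = {"I", "love", "cats", "dogs", "."}
piece_vocab = {"play", "ing", "bird", "s", "un"}

print(segment("playing", piece_vocab))  # ['play', 'ing']
print(segment("birds", piece_vocab))    # ['bird', 's']
print(segment("birds", tiny_vocab))     # ['<unk>']
```

With word pieces in the vocabulary, the model can handle "birds" even though it never stored that exact word.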

🧠 Why is vocab size important?

Bigger vocab → can handle more complex language but needs more memory and training time.
Smaller vocab → faster, simpler models but might miss details.

OpenAI's GPT models have used vocabularies of roughly 50,000 tokens (GPT-2, GPT-3), about 100,000 tokens (GPT-3.5, GPT-4), and about 200,000 tokens (GPT-4o), depending on the version.

✅ In short:
Vocab size = How many different "words or word-pieces" the AI can understand and work with!
