Small Language Models for Dummies: A Practical Guide
Introduction
Welcome to language models for beginners! Think of a language model as a very smart word-prediction machine: it works just like your phone predicting the next word while you type, only much more sophisticated.
Part 1: The Basics - How It Works
The Recipe
Imagine you're baking a cake (our language model):
Ingredients (Input): Words or tokens
Kitchen Tools (Model Parts):
- Mixing Bowl (Embeddings): Turns words into numbers the model can understand
- Beaters (Attention): Figures out which words are important together
- Oven (Transformer Blocks): Where the magic happens
Result (Output): Predicted next words
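Later parts of this guide build a model called SimpleLLM, which is never actually defined. Below is a minimal PyTorch sketch of what such a model might look like, with each layer mapped to the recipe above; the layout is an illustrative assumption, not the guide's official implementation or a standard API.

import torch
import torch.nn as nn

class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim,
                 num_layers, max_seq_length):
        super().__init__()
        # Mixing bowl: turn token ids into vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Position hints so the model knows word order
        self.pos_embedding = nn.Embedding(max_seq_length, embed_dim)
        # Beaters + oven: attention and feed-forward layers
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Result: a score for every word in the vocabulary
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):  # x: (batch, seq_len) of token ids
        positions = torch.arange(x.size(1), device=x.device)
        h = self.embedding(x) + self.pos_embedding(positions)
        # A real LLM would also apply a causal mask here; omitted for simplicity
        h = self.blocks(h)
        return self.out(h)  # (batch, seq_len, vocab_size)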
A Simple Example
Let's say we want to teach our model to complete sentences:
Input: "The cat sits on the"
Expected Output: "mat"
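In code, that training pair is just a sentence split into a prefix and its final word (illustrative variable names):

input_words = ["The", "cat", "sits", "on", "the"]  # what the model sees
target_word = "mat"                                # what it should learn to predict

Part 2 walks through preparing a whole dataset this way.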
Part 2: Baby Steps - Your First Training
Step 1: Preparing Data
from simple_tokenizer import Tokenizer  # A basic word-based tokenizer (a sketch is shown below)

# A tiny illustrative dataset (the guide assumes a `sentences` list but never defines one)
sentences = [
    "The cat sits on the mat",
    "The dog sleeps on the rug",
    "The bird sings in the tree"
]

# Create vocabulary from our tiny dataset
tokenizer = Tokenizer()
for sentence in sentences:
    tokenizer.add_sentence(sentence)

# Convert sentences to numbers
input_sequences = []
target_words = []
for sentence in sentences:
    words = sentence.split()
    # Use all but the last word as input
    input_seq = tokenizer.encode(' '.join(words[:-1]))
    # Use the last word as the target
    target = tokenizer.encode(words[-1])[0]
    input_sequences.append(input_seq)
    target_words.append(target)
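The simple_tokenizer module above is not a standard package. Here is one possible minimal implementation of the word-level Tokenizer it assumes:

class Tokenizer:
    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def add_sentence(self, sentence):
        # Assign the next free id to every new word
        for word in sentence.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word

    def encode(self, text):
        # Note: unknown words raise KeyError, and punctuation stays attached to words
        return [self.word_to_id[word] for word in text.split()]

    def decode(self, token_ids):
        return ' '.join(self.id_to_word[i] for i in token_ids)

    def __len__(self):
        return len(self.word_to_id)  # used as vocab_size later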
Step 2: Training Loop
import torch

# Initialize our tiny model
model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=32,      # Small embedding size
    num_heads=2,       # Just 2 attention heads
    ff_dim=64,         # Small feed-forward size
    num_layers=2,      # Only 2 layers
    max_seq_length=10
)
# The original omits the optimizer; plain Adam with lr=0.001 works for a toy like this
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train for a few epochs
for epoch in range(5):
    for input_seq, target in zip(input_sequences, target_words):
        loss = train_step(model, optimizer, input_seq, target)
    print(f"Epoch {epoch}, Loss: {loss}")
Part 3: Making It Useful - Real Examples
Example 1: Simple Sentence Completion
def complete_sentence(model, tokenizer, start_text):
    # Convert input text to tokens
    input_tokens = tokenizer.encode(start_text)
    # Generate prediction (no gradients needed at inference time)
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        next_word_idx = torch.argmax(output[0, -1]).item()
    # Convert back to text
    predicted_word = tokenizer.decode([next_word_idx])
    return start_text + " " + predicted_word

# Try it out; the function predicts one word at a time
result = complete_sentence(model, tokenizer, "The cat sits on")
print(result)  # Should print something like "The cat sits on the"
Example 2: Simple Question Answering
# Train with question-answer pairs
# (these words must already be in the tokenizer's vocabulary, so add them
# with add_sentence before the model is built)
qa_pairs = [
    ("What color is the sky?", "blue"),
    ("What color is grass?", "green"),
    ("What color is a banana?", "yellow")
]

def train_qa(model, tokenizer, qa_pairs):
    for question, answer in qa_pairs:
        input_seq = tokenizer.encode(question)
        target = tokenizer.encode(answer)[0]
        loss = train_step(model, optimizer, input_seq, target)
    return loss  # loss from the last pair only

def ask_question(model, tokenizer, question):
    input_tokens = tokenizer.encode(question)
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        answer_idx = torch.argmax(output[0, -1]).item()
    return tokenizer.decode([answer_idx])
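Putting it together (assuming the QA vocabulary was added to the tokenizer, and counted in vocab_size, before the model was created):

train_qa(model, tokenizer, qa_pairs)
print(ask_question(model, tokenizer, "What color is the sky?"))  # hopefully "blue"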
Part 4: Common Pitfalls and Solutions
Pitfall 1: Model Doesn't Learn
Symptom: Loss doesn't decrease
Solutions:
- Check your learning rate (try 0.001)
- Verify your data is correctly tokenized
- Make sure your sequences aren't too long
Pitfall 2: Out of Memory
Symptom: CUDA out of memory error
Solutions:
- Reduce batch size
- Reduce model size (embed_dim, num_layers), as in the sketch below
- Shorten input sequences
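For example, roughly halving the toy configuration from Part 2 (same hypothetical SimpleLLM as before):

# A lighter configuration for memory-constrained machines
small_model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=16,      # half the embedding size
    num_heads=2,
    ff_dim=32,         # half the feed-forward size
    num_layers=1,      # one layer instead of two
    max_seq_length=8   # shorter sequences
)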
Pitfall 3: Poor Predictions
Symptom: Random or nonsensical outputs
Solutions:
- Train longer
- Add more diverse training data
- Increase model size slightly
Part 5: Next Steps
Once you're comfortable with the basics, try:
- Adding more training data
- Experimenting with model sizes
- Implementing temperature in generation (see the sketch below)
- Adding beam search for better completions

Remember: Start small, verify each step works, then gradually increase complexity. It's better to have a working tiny model than a broken big one!
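For the temperature idea above, a minimal sketch (sample_next_word is an illustrative name, not part of the code earlier in this guide):

import torch
import torch.nn.functional as F

def sample_next_word(model, tokenizer, text, temperature=0.8):
    # temperature < 1.0 sharpens the distribution; > 1.0 flattens it
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        logits = model(torch.tensor([tokens]))[0, -1]
    probs = F.softmax(logits / temperature, dim=0)
    next_idx = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode([next_idx])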
Debugging Tips
Print Shapes:
print(f"Input shape: {input_tokens.shape}")
print(f"Output shape: {output.shape}")
Check Predictions:
import torch.nn.functional as F

def inspect_prediction(model, tokenizer, text):
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        output = model(torch.tensor([tokens]))
    probs = F.softmax(output[0, -1], dim=0)
    top_k = torch.topk(probs, 5)
    print("Top 5 predictions:")
    for prob, idx in zip(top_k.values, top_k.indices):
        word = tokenizer.decode([idx.item()])
        print(f"{word}: {prob.item():.3f}")
Validate Training Data:
def check_data(tokenizer, text):
    # Round-trip the text to confirm encode/decode agree
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    print(f"Original: {text}")
    print(f"Decoded : {decoded}")
    print(f"Tokens  : {tokens}")
Remember: The key to success with small models is starting simple and gradually scaling up, one verified step at a time.