Small Language Models for Dummies: A Practical Guide
Introduction
Welcome to language models for beginners! Think of a language model as a very smart word-prediction machine: much like how your phone predicts the next word while you're typing, only far more sophisticated.
Part 1: The Basics - How It Works
The Recipe
Imagine you're baking a cake (our language model):
- Ingredients (Input): Words or tokens
- Kitchen Tools (Model Parts):
  - Mixing Bowl (Embeddings): Turns words into numbers the model can understand
  - Beaters (Attention): Figures out which words are important together
  - Oven (Transformer Blocks): Where the magic happens
- Result (Output): Predicted next words
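Here's how that recipe might look in code. The later examples build a model called SimpleLLM but never show its definition, so this is a minimal sketch, assuming PyTorch; the layer choices (nn.TransformerEncoderLayer without a causal mask) are illustrative assumptions, not an official implementation:

import torch
import torch.nn as nn

class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, max_seq_length):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)    # mixing bowl: words -> numbers
        self.pos_embed = nn.Embedding(max_seq_length, embed_dim)  # remembers where each word sits
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,                   # beaters: attention heads
            dim_feedforward=ff_dim, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)  # oven: transformer blocks
        self.to_vocab = nn.Linear(embed_dim, vocab_size)          # result: a score for every word

    def forward(self, x):                    # x: (batch, seq_len) of token ids
        positions = torch.arange(x.size(1), device=x.device)
        h = self.token_embed(x) + self.pos_embed(positions)
        h = self.blocks(h)                   # no causal mask: fine for this toy setup, since
        return self.to_vocab(h)              # we only ever read the last position's prediction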
A Simple Example
Let's say we want to teach our model to complete sentences:
Input: "The cat sits on the"
Expected Output: "mat"
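The training code below imports a Tokenizer from simple_tokenizer, a module this guide assumes you have on hand. If you don't, here's a minimal word-level sketch that matches the interface used later (add_sentence, encode, decode, and len):

# simple_tokenizer.py -- one possible minimal implementation
class Tokenizer:
    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def add_sentence(self, sentence):
        for word in sentence.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word

    def encode(self, text):
        # Unknown words will raise a KeyError -- fine for this toy setup
        return [self.word_to_id[word] for word in text.split()]

    def decode(self, ids):
        return ' '.join(self.id_to_word[i] for i in ids)

    def __len__(self):
        return len(self.word_to_id)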
Part 2: Baby Steps - Your First Training
Step 1: Preparing Data
from simple_tokenizer import Tokenizer  # the word-level tokenizer sketched in Part 1

# A tiny training set of complete sentences
# (the second sentence is just an illustrative addition)
sentences = [
    "The cat sits on the mat",
    "The dog sleeps on the rug",
]

# Create vocabulary from our tiny dataset
tokenizer = Tokenizer()
for sentence in sentences:
    tokenizer.add_sentence(sentence)

# Convert sentences to numbers
input_sequences = []
target_words = []
for sentence in sentences:
    words = sentence.split()
    # Use all but the last word as input
    input_seq = tokenizer.encode(' '.join(words[:-1]))
    # Use the last word as the target
    target = tokenizer.encode(words[-1])[0]
    input_sequences.append(input_seq)
    target_words.append(target)
Step 2: Training Loop
import torch

# Initialize our tiny model
model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=32,       # small embedding size
    num_heads=2,        # just 2 attention heads
    ff_dim=64,          # small feed-forward size
    num_layers=2,       # only 2 layers
    max_seq_length=10,
)

# An optimizer is needed for train_step below (Adam with lr=0.001 is a sensible default)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train for a few epochs
for epoch in range(5):
    for input_seq, target in zip(input_sequences, target_words):
        loss = train_step(model, optimizer, input_seq, target)
    print(f"Epoch {epoch}, Loss: {loss}")
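The loop above calls train_step, which isn't defined anywhere in this guide. A minimal sketch, assuming cross-entropy loss on the model's final position:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_seq, target):
    # One update: predict the word after input_seq, nudge the model toward target
    optimizer.zero_grad()
    output = model(torch.tensor([input_seq]))            # (1, seq_len, vocab_size)
    loss = F.cross_entropy(output[0, -1:], torch.tensor([target]))
    loss.backward()
    optimizer.step()
    return loss.item()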
Part 3: Making It Useful - Real Examples
Example 1: Simple Sentence Completion
def complete_sentence(model, tokenizer, start_text):
    # Convert input text to tokens
    input_tokens = tokenizer.encode(start_text)
    # Generate a prediction
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
    next_word_idx = torch.argmax(output[0, -1]).item()
    # Convert back to text
    predicted_word = tokenizer.decode([next_word_idx])
    return start_text + " " + predicted_word

# Try it out
result = complete_sentence(model, tokenizer, "The cat sits on the")
print(result)  # With the training data above, this should print "The cat sits on the mat"
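Note that complete_sentence appends only a single word. To generate longer completions you can feed the result back in, as in this small usage sketch (just stay under the model's max_seq_length):

text = "The cat"
for _ in range(4):    # append four words, one at a time
    text = complete_sentence(model, tokenizer, text)
print(text)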
Example 2: Simple Question Answering
# Train with question-answer pairs
# Note: these words must already be in the tokenizer's vocabulary
# (and within the model's vocab_size) before training
qa_pairs = [
    ("What color is the sky?", "blue"),
    ("What color is grass?", "green"),
    ("What color is a banana?", "yellow"),
]

def train_qa(model, tokenizer, qa_pairs):
    for question, answer in qa_pairs:
        input_seq = tokenizer.encode(question)
        target = tokenizer.encode(answer)[0]
        loss = train_step(model, optimizer, input_seq, target)
    return loss  # returns the loss from the last pair

def ask_question(model, tokenizer, question):
    input_tokens = tokenizer.encode(question)
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
    answer_idx = torch.argmax(output[0, -1]).item()
    return tokenizer.decode([answer_idx])
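A hypothetical usage run, assuming the vocabulary already covers the question-and-answer words:

for epoch in range(20):    # a few passes over the pairs
    loss = train_qa(model, tokenizer, qa_pairs)

print(ask_question(model, tokenizer, "What color is the sky?"))  # hopefully "blue"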
Part 4: Common Pitfalls and Solutions
Pitfall 1: Model Doesn't Learn
Symptom: Loss doesn't decrease
Solutions:
- Check your learning rate (try 0.001)
- Verify your data is correctly tokenized (see the snippet below)
- Make sure your sequences aren't too long
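For the second point, a quick sanity check is to make sure text survives an encode/decode round trip:

tokens = tokenizer.encode("The cat sits on the mat")
print(tokens)                     # e.g. [0, 1, 2, 3, 4, 5]
print(tokenizer.decode(tokens))   # should match the original text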
Pitfall 2: Out of Memory
Symptom: CUDA out of memory error
Solutions:
- Reduce the batch size
- Reduce the model size (embed_dim, num_layers); see the example below
- Shorten input sequences
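For example, a deliberately smaller configuration of the toy model from Part 2 (the exact numbers are just illustrative):

model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=16,      # halved from 32
    num_heads=2,
    ff_dim=32,         # halved from 64
    num_layers=1,      # one layer instead of two
    max_seq_length=10,
)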
Pitfall 3: Poor Predictions
Symptom: Random or nonsensical outputs
Solutions:
- Train longer
- Add more diverse training data
- Increase the model size slightly
Part 5: Next Steps
Once you're comfortable with the basics, try:
- Adding more training data
- Experimenting with model sizes
- Implementing temperature in generation (see the sketch below)
- Adding beam search for better completions

Remember: Start small, verify each step works, then gradually increase complexity. It's better to have a working tiny model than a broken big one!
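As a taste of the temperature idea, here's a minimal sketch that samples the next word instead of always taking the argmax (assuming the model and tokenizer from earlier):

import torch
import torch.nn.functional as F

def sample_next_word(model, tokenizer, text, temperature=0.8):
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        output = model(torch.tensor([tokens]))
    logits = output[0, -1] / temperature    # <1.0 sharpens, >1.0 flattens the distribution
    probs = F.softmax(logits, dim=0)
    next_idx = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode([next_idx])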
Debugging Tips
Print Shapes:
print(f"Input shape: {input_tokens.shape}") print(f"Output shape: {output.shape}")
Check Predictions:
import torch
import torch.nn.functional as F

def inspect_prediction(model, tokenizer, text):
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        output = model(torch.tensor([tokens]))
    probs = F.softmax(output[0, -1], dim=0)
    top_k = torch.topk(probs, 5)
    print("Top 5 predictions:")
    for prob, idx in zip(top_k.values, top_k.indices):
        word = tokenizer.decode([idx.item()])
        print(f"{word}: {prob.item():.3f}")
Validate Training Data:
def check_data(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    print(f"Original: {text}")
    print(f"Decoded : {decoded}")
    print(f"Tokens  : {tokens}")
Remember: The key to success with small models is starting simple and gradually adding complexity as you understand each part!