# Small Language Models for Dummies: A Practical Guide
## Introduction
Welcome to language models for beginners! Think of a language model as a very smart word-prediction machine: it works like the next-word suggestions on your phone's keyboard, only much more sophisticated.
## Part 1: The Basics - How It Works
### The Recipe
Imagine you're baking a cake (our language model):

1. Ingredients (Input): Words or tokens
2. Kitchen Tools (Model Parts):
   - Mixing Bowl (Embeddings): Turns words into numbers the model can understand
   - Beaters (Attention): Figures out which words are important together
   - Oven (Transformer Blocks): Where the magic happens
3. Result (Output): Predicted next words
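In code, those kitchen tools map onto an embedding layer, attention heads, and a stack of transformer blocks. The guide's actual `SimpleLLM` class isn't shown in this section, so the following is only a minimal sketch of what such a tiny model might look like, assuming PyTorch and its built-in `nn.TransformerEncoder`:

```python
import torch
import torch.nn as nn

# A hypothetical tiny model in the spirit of the SimpleLLM used later in this guide.
class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, num_heads=2, ff_dim=64,
                 num_layers=2, max_seq_length=10):
        super().__init__()
        # "Mixing bowl": turn token ids into vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.position = nn.Embedding(max_seq_length, embed_dim)
        # "Beaters" + "oven": attention inside transformer blocks.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Project back to vocabulary-sized scores for the next word.
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of integer ids, seq_len <= max_seq_length
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embedding(token_ids) + self.position(positions)
        x = self.blocks(x)
        return self.output(x)  # (batch, seq_len, vocab_size)
```

The constructor arguments mirror the ones used in Part 2 (`embed_dim`, `num_heads`, `ff_dim`, `num_layers`, `max_seq_length`), but your own implementation may differ.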
### A Simple Example
Let's say we want to teach our model to complete sentences:
Input: "The cat sits on the" Expected Output: "mat"
```python
# Create a tiny dataset
sentences = [
    "The cat sits on the mat",
    "The dog sits on the rug",
    "A bird sits in the tree"
]
```
## Part 2: Baby Steps - Your First Training
### Step 1: Preparing Data
```python
from simple_tokenizer import Tokenizer  # A basic word-based tokenizer

# Create vocabulary from our tiny dataset
tokenizer = Tokenizer()
for sentence in sentences:
    tokenizer.add_sentence(sentence)

# Convert sentences to numbers
input_sequences = []
target_words = []
for sentence in sentences:
    words = sentence.split()
    # Use all but the last word as input
    input_seq = tokenizer.encode(' '.join(words[:-1]))
    # Use the last word as the target
    target = tokenizer.encode(words[-1])[0]
    input_sequences.append(input_seq)
    target_words.append(target)
```
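The `simple_tokenizer` module imported above isn't included in this section. If you don't have it, a minimal word-level tokenizer like this hypothetical stand-in is enough for the examples here:

```python
# A minimal word-level tokenizer sketch (a stand-in for simple_tokenizer.Tokenizer).
class Tokenizer:
    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def add_sentence(self, sentence):
        # Assign a new id to every word we haven't seen before.
        for word in sentence.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word

    def encode(self, text):
        # Map each word to its id (words must have been added first).
        return [self.word_to_id[w] for w in text.split()]

    def decode(self, token_ids):
        # Map ids back to words and join them with spaces.
        return " ".join(self.id_to_word[i] for i in token_ids)

    def __len__(self):
        return len(self.word_to_id)
```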
### Step 2: Training Loop
```python
# Initialize our tiny model
model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=32,       # Small embedding size
    num_heads=2,        # Just 2 attention heads
    ff_dim=64,          # Small feed-forward size
    num_layers=2,       # Only 2 layers
    max_seq_length=10
)

# Train for a few epochs
for epoch in range(5):
    for input_seq, target in zip(input_sequences, target_words):
        loss = train_step(model, optimizer, input_seq, target)
    print(f"Epoch {epoch}, Loss: {loss}")
```
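The loop above relies on an `optimizer` and a `train_step` helper that aren't defined in this section. A minimal sketch of what they might look like, assuming Adam and a model that returns logits of shape `(batch, seq_len, vocab_size)`:

```python
import torch
import torch.nn.functional as F

# Hypothetical optimizer and training step; adapt them to your own SimpleLLM.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_step(model, optimizer, input_seq, target):
    inputs = torch.tensor([input_seq])      # shape: (1, seq_len)
    target_tensor = torch.tensor([target])  # shape: (1,)

    optimizer.zero_grad()
    logits = model(inputs)                  # shape: (1, seq_len, vocab_size)
    # Predict the next word from the logits at the last position.
    loss = F.cross_entropy(logits[:, -1, :], target_tensor)
    loss.backward()
    optimizer.step()
    return loss.item()
```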
## Part 3: Making It Useful - Real Examples
### Example 1: Simple Sentence Completion
```python
import torch

def complete_sentence(model, tokenizer, start_text):
    # Convert input text to tokens
    input_tokens = tokenizer.encode(start_text)
    # Generate prediction
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        next_word_idx = torch.argmax(output[0, -1]).item()
    # Convert back to text
    predicted_word = tokenizer.decode([next_word_idx])
    return start_text + " " + predicted_word

# Try it out
result = complete_sentence(model, tokenizer, "The cat sits on")
print(result)  # Should print something like "The cat sits on the"
```
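Note that `complete_sentence` only predicts a single word. To finish a longer phrase like "The cat sits on the mat", you can feed the prediction back in and repeat; `complete_n_words` below is a hypothetical helper built on the function above:

```python
def complete_n_words(model, tokenizer, start_text, n_words=2):
    # Repeatedly predict the next word and append it to the running text.
    text = start_text
    for _ in range(n_words):
        text = complete_sentence(model, tokenizer, text)
    return text

print(complete_n_words(model, tokenizer, "The cat sits on", n_words=2))
# With enough training this can produce "The cat sits on the mat"
```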
### Example 2: Simple Question Answering
```python
# Train with question-answer pairs
# (the question and answer words must already be in the tokenizer's vocabulary)
qa_pairs = [
    ("What color is the sky?", "blue"),
    ("What color is grass?", "green"),
    ("What color is a banana?", "yellow")
]

def train_qa(model, tokenizer, qa_pairs):
    for question, answer in qa_pairs:
        input_seq = tokenizer.encode(question)
        target = tokenizer.encode(answer)[0]
        loss = train_step(model, optimizer, input_seq, target)
    return loss  # loss from the last pair

def ask_question(model, tokenizer, question):
    input_tokens = tokenizer.encode(question)
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        answer_idx = torch.argmax(output[0, -1]).item()
    return tokenizer.decode([answer_idx])
```
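A sketch of how these two functions might be used together, assuming the question and answer words were added to the tokenizer before the model was built:

```python
# Run a few passes over the QA pairs, then query the model.
for _ in range(20):
    train_qa(model, tokenizer, qa_pairs)

print(ask_question(model, tokenizer, "What color is grass?"))  # hopefully "green"
```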
## Part 4: Common Pitfalls and Solutions
### Pitfall 1: Model Doesn't Learn
- **Symptom**: Loss doesn't decrease
- **Solution**:
  1. Check your learning rate (try 0.001); see the snippet below
  2. Verify your data is correctly tokenized
  3. Make sure your sequences aren't too long
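A quick sanity check for the first two points, assuming the Adam optimizer and a word-level tokenizer that round-trips text exactly, as sketched earlier:

```python
import torch

# If the loss stays flat, recreate the optimizer with a smaller learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Encoding and then decoding a training sentence should give the text back.
sample = sentences[0]
assert tokenizer.decode(tokenizer.encode(sample)) == sample
```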
### Pitfall 2: Out of Memory
- **Symptom**: CUDA out of memory error
- **Solution**:
  1. Reduce batch size
  2. Reduce model size (embed_dim, num_layers); see the example below
  3. Shorten input sequences
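For example, a lighter configuration, using the same hypothetical `SimpleLLM` constructor as in Part 2:

```python
# Halving the embedding size, feed-forward size, and depth cuts memory use.
model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=16,       # down from 32
    num_heads=2,
    ff_dim=32,          # down from 64
    num_layers=1,       # down from 2
    max_seq_length=8    # shorter sequences also save memory
)
```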
### Pitfall 3: Poor Predictions
- **Symptom**: Random or nonsensical outputs
- **Solution**:
  1. Train longer
  2. Add more diverse training data
  3. Increase model size slightly
## Part 5: Next Steps
Once you're comfortable with the basics, try:

1. Adding more training data
2. Experimenting with model sizes
3. Implementing temperature in generation (see the sketch below)
4. Adding beam search for better completions
Remember: Start small, verify each step works, then gradually increase complexity. It's better to have a working tiny model than a broken big one!
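Temperature is the easiest of those to add: divide the logits by a temperature before the softmax and sample from the resulting distribution instead of taking the argmax. A minimal sketch, reusing the model and tokenizer from earlier (`sample_next_word` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def sample_next_word(model, tokenizer, text, temperature=0.8):
    # Lower temperature -> more predictable; higher temperature -> more varied.
    input_tokens = tokenizer.encode(text)
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        probs = F.softmax(output[0, -1] / temperature, dim=0)
        next_idx = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode([next_idx])

print(sample_next_word(model, tokenizer, "The cat sits on", temperature=0.8))
```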
## Debugging Tips
1. Print Shapes:
```python
input_tensor = torch.tensor([input_tokens])  # encode() returns a plain Python list
print(f"Input shape: {input_tensor.shape}")
print(f"Output shape: {output.shape}")
```
2. Check Predictions:
```python
import torch.nn.functional as F

def inspect_prediction(model, tokenizer, text):
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        output = model(torch.tensor([tokens]))
    probs = F.softmax(output[0, -1], dim=0)
    top_k = torch.topk(probs, 5)
    print("Top 5 predictions:")
    for prob, idx in zip(top_k.values, top_k.indices):
        word = tokenizer.decode([idx.item()])
        print(f"{word}: {prob.item():.3f}")
```
3. Validate Training Data:
```python
def check_data(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    print(f"Original: {text}")
    print(f"Decoded : {decoded}")
    print(f"Tokens  : {tokens}")
```
Remember: The key to success with small models is starting simple and gradually adding complexity as you understand each part!