
Small Language Models for Dummies: A Practical Guide


Introduction

Welcome to language models for beginners! Think of a language model as a very smart word-prediction machine: just like your phone predicting the next word while you type, but much more sophisticated.

Part 1: The Basics - How It Works

The Recipe

Imagine you're baking a cake (our language model):

Ingredients (Input): Words or tokens
Kitchen Tools (Model Parts):
- Mixing Bowl (Embeddings): Turns words into numbers the model can understand
- Beaters (Attention): Figures out which words are important together
- Oven (Transformer Blocks): Where the magic happens
Result (Output): Predicted next words
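
To make the Mixing Bowl step concrete, here is a minimal sketch (assuming PyTorch) of an embedding layer turning token IDs into vectors. The vocabulary size and token IDs are made up for illustration:

import torch
import torch.nn as nn

# Hypothetical 10-word vocabulary, 4 numbers per word
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# "The cat sits" as made-up token IDs
token_ids = torch.tensor([0, 5, 7])
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 4]) -- one vector per word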

A Simple Example

Let's say we want to teach our model to complete sentences:

Input: "The cat sits on the"
Expected Output: "mat"

Part 2 walks through turning sentences like this one into training data.


Part 2: Baby Steps - Your First Training

Step 1: Preparing Data

# NOTE: simple_tokenizer is not a standard package; a minimal sketch of
# the Tokenizer class appears right after this block.
from simple_tokenizer import Tokenizer  # A basic word-based tokenizer

# A tiny example dataset (the first sentence is from Part 1; the rest
# are made up for illustration)
sentences = [
    "The cat sits on the mat",
    "The dog sleeps on the rug",
    "A bird sings in the tree",
]

# Create vocabulary from our tiny dataset
tokenizer = Tokenizer()
for sentence in sentences:
    tokenizer.add_sentence(sentence)

# Convert sentences to numbers
input_sequences = []
target_words = []
for sentence in sentences:
    words = sentence.split()
    # Use all but last word as input
    input_seq = tokenizer.encode(' '.join(words[:-1]))
    # Use last word as target
    target = tokenizer.encode(words[-1])[0]
    input_sequences.append(input_seq)
    target_words.append(target)
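
The simple_tokenizer module above isn't a published package; here is one minimal way to write it, matching the interface the examples assume (add_sentence, encode, decode, and len()):

class Tokenizer:
    # A minimal word-level tokenizer (sketch, not a real library)

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def add_sentence(self, sentence):
        # Give each unseen word the next free ID
        for word in sentence.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word

    def encode(self, text):
        # Toy version: unknown words raise KeyError
        return [self.word_to_id[word] for word in text.split()]

    def decode(self, token_ids):
        return ' '.join(self.id_to_word[i] for i in token_ids)

    def __len__(self):
        return len(self.word_to_id)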

Step 2: Training Loop

import torch

# Initialize our tiny model (see the SimpleLLM sketch below)
model = SimpleLLM(
    vocab_size=len(tokenizer),
    embed_dim=32,     # Small embedding size
    num_heads=2,      # Just 2 attention heads
    ff_dim=64,        # Small feed-forward size
    num_layers=2,     # Only 2 layers
    max_seq_length=10
)

# A standard optimizer choice; train_step below expects one
# (0.001 is the learning rate suggested in Part 4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train for a few epochs
for epoch in range(5):
    for input_seq, target in zip(input_sequences, target_words):
        loss = train_step(model, optimizer, input_seq, target)
    print(f"Epoch {epoch}, Loss: {loss}")  # Loss of the epoch's last example

Part 3: Making It Useful - Real Examples

Example 1: Simple Sentence Completion

def complete_sentence(model, tokenizer, start_text):
    # Convert input text to tokens
    input_tokens = tokenizer.encode(start_text)

    # Generate prediction (eval mode, no gradients needed)
    model.eval()
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        next_word_idx = torch.argmax(output[0, -1]).item()

    # Convert back to text
    predicted_word = tokenizer.decode([next_word_idx])
    return start_text + " " + predicted_word

# Try it out -- note the function predicts one word at a time
result = complete_sentence(model, tokenizer, "The cat sits on")
print(result)  # Should print something like "The cat sits on the"

Example 2: Simple Question Answering

# Train with question-answer pairs
qa_pairs = [
    ("What color is the sky?", "blue"),
    ("What color is grass?", "green"),
    ("What color is a banana?", "yellow")
]

def train_qa(model, optimizer, tokenizer, qa_pairs):
    # Assumes the tokenizer's vocabulary already includes these words
    # (with a word-based tokenizer, "sky?" is a single token)
    for question, answer in qa_pairs:
        input_seq = tokenizer.encode(question)
        # Single-word answers: the first (and only) token is the target
        target = tokenizer.encode(answer)[0]
        loss = train_step(model, optimizer, input_seq, target)
    return loss

def ask_question(model, tokenizer, question):
    input_tokens = tokenizer.encode(question)
    model.eval()
    with torch.no_grad():
        output = model(torch.tensor([input_tokens]))
        answer_idx = torch.argmax(output[0, -1]).item()
    return tokenizer.decode([answer_idx])
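
A quick usage sketch, assuming the model and optimizer from Part 2 and a tokenizer whose vocabulary already covers the question and answer words:

# Teach the answers, then ask
for epoch in range(20):
    train_qa(model, optimizer, tokenizer, qa_pairs)

print(ask_question(model, tokenizer, "What color is grass?"))  # hopefully "green"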

Part 4: Common Pitfalls and Solutions

Pitfall 1: Model Doesn't Learn

Symptom: Loss doesn't decrease
Solutions:
- Check your learning rate (try 0.001)
- Verify your data is correctly tokenized
- Make sure your sequences aren't too long

Pitfall 2: Out of Memory

Symptom: CUDA out of memory error
Solutions:
- Reduce batch size
- Reduce model size (embed_dim, num_layers)
- Shorten input sequences

Pitfall 3: Poor Predictions

Symptom: Random or nonsensical outputs
Solutions:
- Train longer
- Add more diverse training data
- Increase model size slightly

Part 5: Next Steps

Once you're comfortable with the basics, try:

- Adding more training data
- Experimenting with model sizes
- Implementing temperature in generation (see the sketch below)
- Adding beam search for better completions

Remember: Start small, verify each step works, then gradually increase complexity. It's better to have a working tiny model than a broken big one!
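
Temperature isn't implemented anywhere on this page, so here is a minimal sketch of what it could look like, reusing the complete_sentence idea from Part 3. The function name sample_next_word and the example values are made up for illustration:

import torch
import torch.nn.functional as F

def sample_next_word(model, tokenizer, text, temperature=1.0):
    # Higher temperature -> more random picks; lower -> closer to argmax
    tokens = tokenizer.encode(text)
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor([tokens]))[0, -1]
    probs = F.softmax(logits / temperature, dim=0)
    next_idx = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode([next_idx])

# temperature=0.5 stays close to the model's top choice; 1.5 explores more
print(sample_next_word(model, tokenizer, "The cat sits on", temperature=0.5))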

Debugging Tips

Print Shapes:

print(f"Input shape: {input_tokens.shape}")
print(f"Output shape: {output.shape}")

Check Predictions:

import torch.nn.functional as F

def inspect_prediction(model, tokenizer, text):
    tokens = tokenizer.encode(text)
    model.eval()
    with torch.no_grad():
        output = model(torch.tensor([tokens]))
    probs = F.softmax(output[0, -1], dim=0)
    top_k = torch.topk(probs, 5)

    print("Top 5 predictions:")
    for prob, idx in zip(top_k.values, top_k.indices):
        word = tokenizer.decode([idx.item()])
        print(f"{word}: {prob.item():.3f}")

Validate Training Data:

def check_data(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    print(f"Original: {text}")
    print(f"Decoded : {decoded}")
    print(f"Tokens  : {tokens}")

Remember: The key to success with small models is starting simple and gradually adding complexity as you understand each part!