Bucketing example with simple Python
1. We’ll show how to:
- Create a list of input sentences
- Convert them to a simplified numerical encoding
- Bucket them by length range
- Pad each sentence to the maximum length in its bucket
# Step 1: Sample data (sentence tokens)
sentences = [
    ["Hi"],
    ["Hello", "friend"],
    ["Good", "morning", "everyone"],
    ["Welcome", "to", "the", "AI", "session"],
    ["This", "is", "an", "example", "of", "bucketing"]
]

# Step 2: Simple encoding (word to integer ID)
vocab = {}
counter = 1
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab[word] = counter
            counter += 1

# Convert words to numeric tokens
encoded_sentences = [[vocab[word] for word in sentence] for sentence in sentences]

# Step 3: Bucketing
buckets = {"1-2": [], "3-5": [], "6+": []}
for sentence in encoded_sentences:
    l = len(sentence)
    if l <= 2:
        buckets["1-2"].append(sentence)
    elif l <= 5:
        buckets["3-5"].append(sentence)
    else:
        buckets["6+"].append(sentence)

# Step 4: Padding each bucket
def pad_sentences(bucket):
    max_len = max(len(s) for s in bucket)
    padded = [s + [0] * (max_len - len(s)) for s in bucket]
    return padded

# Step 5: Display padded buckets
for label, bucket in buckets.items():
    if bucket:
        padded = pad_sentences(bucket)
        print(f"\nBucket {label} (max length {len(padded[0])}):")
        for p in padded:
            print(p)
Output:

Bucket 1-2 (max length 2):
[1, 0]
[2, 3]

Bucket 3-5 (max length 5):
[4, 5, 6, 0, 0]
[7, 8, 9, 10, 11]

Bucket 6+ (max length 6):
[12, 13, 14, 15, 16, 17]
2. Connect bucketing to a neural network training loop with a real-life impact example
Real-Life Use Case: Customer Support Chatbot
Imagine we’re building an AI-powered customer support chatbot.
- It receives user queries of different lengths:
  - “Hi”
  - “I need help with my internet connection”
  - “My account has been suspended after I moved to another country”
Without bucketing, we’d pad all queries to the maximum length (e.g., 20 tokens), which wastes memory and slows training.
With bucketing, we:
- Reduce wasted padding,
- Build training batches of similar-length queries, which keeps each step efficient,
- Keep short queries from being buried under padding tokens, so the model sees mostly real content.
The sketch below puts rough numbers on how much padding this saves.
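A minimal sketch of that comparison follows. The query lengths and bucket ranges here are illustrative assumptions, not measured data; the point is only to compare padding everything to a global maximum of 20 tokens against padding within length buckets:

# Illustrative token lengths for seven incoming queries (assumed values)
query_lengths = [1, 2, 3, 8, 10, 18, 20]
global_max = 20  # pad-to-max strategy without bucketing

# Strategy 1: pad every query to the global maximum
waste_global = sum(global_max - l for l in query_lengths)

# Strategy 2: bucket by length, pad only to the longest query in each bucket
length_buckets = {"1-4": [], "5-10": [], "11-20": []}
for l in query_lengths:
    if l <= 4:
        length_buckets["1-4"].append(l)
    elif l <= 10:
        length_buckets["5-10"].append(l)
    else:
        length_buckets["11-20"].append(l)

waste_bucketed = sum(
    max(lengths) - l
    for lengths in length_buckets.values() if lengths
    for l in lengths
)

print(f"Pad tokens without bucketing: {waste_global}")   # 78
print(f"Pad tokens with bucketing:    {waste_bucketed}") # 7

With these assumed lengths, global padding spends 78 of 140 input slots on padding, while bucketed padding spends only 7.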
Neural Network Training (Simplified)
We’ll simulate:
- A mini neural network that predicts intent (e.g., “greeting”, “complaint”)
- Training on the bucketed, padded inputs
- A step-by-step explanation of each part
Step-by-Step Pure Python Simulation
Setup
We’ll:
- Define fake intents (labels)
- Bucket and pad as before
- Train a toy neural net (dot-product + sigmoid) using gradient updates
Step 1: Data
# Buckets from earlier
sentences = [
    ["Hi"],                                              # Greeting
    ["Hello", "friend"],                                  # Greeting
    ["Good", "morning", "everyone"],                      # Greeting
    ["Welcome", "to", "the", "AI", "session"],            # Event
    ["This", "is", "an", "example", "of", "bucketing"]    # Technical
]
labels = [0, 0, 0, 1, 2]  # 0: Greeting, 1: Event, 2: Technical

# Encoding
vocab = {}
counter = 1
for sent in sentences:
    for word in sent:
        if word not in vocab:
            vocab[word] = counter
            counter += 1

encoded = [[vocab[word] for word in sent] for sent in sentences]
Step 2: Bucketing + Padding
def bucket_data(inputs, labels):
    buckets = {"1-2": [], "3-5": [], "6+": []}
    label_buckets = {"1-2": [], "3-5": [], "6+": []}
    for i, sent in enumerate(inputs):
        l = len(sent)
        if l <= 2:
            k = "1-2"
        elif l <= 5:
            k = "3-5"
        else:
            k = "6+"
        buckets[k].append(sent)
        label_buckets[k].append(labels[i])
    return buckets, label_buckets

def pad_sentences(bucket):
    max_len = max(len(s) for s in bucket)
    return [s + [0] * (max_len - len(s)) for s in bucket]

buckets, label_buckets = bucket_data(encoded, labels)
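As a quick check, we can print what lands in each non-empty bucket after padding. The vocabulary is built deterministically from the five sentences above, so the output in the comments is exactly what this snippet produces:

# Inspect each non-empty bucket: padded inputs and their labels
for key in buckets:
    if buckets[key]:
        print(key, pad_sentences(buckets[key]), label_buckets[key])

# Output:
# 1-2 [[1, 0], [2, 3]] [0, 0]
# 3-5 [[4, 5, 6, 0, 0], [7, 8, 9, 10, 11]] [0, 1]
# 6+ [[12, 13, 14, 15, 16, 17]] [2]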
Step 3: Mini Neural Net (pure Python)
We’ll simulate:
- Input layer → weights → output layer
- Dot product for simplicity
- Sigmoid activation
- Gradient descent for weight update
import random
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(x):
    sx = sigmoid(x)
    return sx * (1 - sx)

# Initialize weights (same length as input vector)
def init_weights(input_len, num_classes):
    return [[random.uniform(-0.1, 0.1) for _ in range(input_len)]
            for _ in range(num_classes)]

# Train per bucket
def train_bucket(X, y, num_classes, epochs=50, lr=0.01):
    input_len = len(X[0])
    W = init_weights(input_len, num_classes)
    for epoch in range(epochs):
        for xi, yi in zip(X, y):
            # Forward pass
            logits = [sum(w * x for w, x in zip(W[class_i], xi))
                      for class_i in range(num_classes)]
            preds = [sigmoid(logit) for logit in logits]
            # Compute error and backprop
            for class_i in range(num_classes):
                target = 1 if yi == class_i else 0
                error = preds[class_i] - target
                grad = [error * sigmoid_deriv(logits[class_i]) * x for x in xi]
                # Update weights
                W[class_i] = [w - lr * g for w, g in zip(W[class_i], grad)]
    return W

# Training each bucket
trained_weights = {}
num_classes = 3  # Greeting, Event, Technical
for bucket_key in buckets:
    if buckets[bucket_key]:
        X_pad = pad_sentences(buckets[bucket_key])
        trained_weights[bucket_key] = train_bucket(X_pad, label_buckets[bucket_key], num_classes)
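The loop above updates the weights one example at a time. In a real setup, the payoff of bucketing comes from grouping equal-length sequences into mini-batches. The helper below is a minimal sketch of that idea; the batch size of 2 is an arbitrary choice, and train_bucket would need a small change to consume whole batches instead of single examples:

def iter_batches(X, y, batch_size=2):
    # Yield (inputs, labels) mini-batches from one padded bucket.
    # Every sequence in a batch already has the same length, so no extra padding is needed.
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Walk over every bucket in uniform-length mini-batches
for bucket_key in buckets:
    if buckets[bucket_key]:
        X_pad = pad_sentences(buckets[bucket_key])
        for batch_X, batch_y in iter_batches(X_pad, label_buckets[bucket_key]):
            print(bucket_key, batch_X, batch_y)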
Step 4: Predict from Trained Buckets
def predict(x, W):
    logits = [sum(w * xi for w, xi in zip(w_row, x)) for w_row in W]
    preds = [sigmoid(l) for l in logits]
    return preds.index(max(preds))  # Return class with highest probability

# Example test prediction
test_sent = ["Hello", "there"]
encoded_test = [vocab.get(w, 0) for w in test_sent]

bucket_key = "1-2" if len(encoded_test) <= 2 else "3-5" if len(encoded_test) <= 5 else "6+"
max_len = len(trained_weights[bucket_key][0])
test_pad = encoded_test + [0] * (max_len - len(encoded_test))

pred_class = predict(test_pad, trained_weights[bucket_key])
print(f"Predicted class: {pred_class} → {'Greeting' if pred_class == 0 else 'Event' if pred_class == 1 else 'Technical'}")
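Because the weights start from small random values and there are only five training sentences (with raw token IDs as features), results will vary between runs and may not be perfect. A quick sanity check is to run predict over the training sentences themselves, routing each one through its own bucket:

intent_names = {0: "Greeting", 1: "Event", 2: "Technical"}

for sent, true_label in zip(sentences, labels):
    enc = [vocab.get(w, 0) for w in sent]
    key = "1-2" if len(enc) <= 2 else "3-5" if len(enc) <= 5 else "6+"
    max_len = len(trained_weights[key][0])
    padded = enc + [0] * (max_len - len(enc))
    pred = predict(padded, trained_weights[key])
    print(f"{' '.join(sent)!r}: predicted {intent_names[pred]}, true {intent_names[true_label]}")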