Bucketing example with simple Python
1. We’ll show how to:
- Create a list of input sentences
- Convert them to numerical encoding (simplified)
- Bucket them by length ranges
- Pad each to max in its bucket
# Step 1: Sample data (sentence tokens)
sentences = [
    ["Hi"],
    ["Hello", "friend"],
    ["Good", "morning", "everyone"],
    ["Welcome", "to", "the", "AI", "session"],
    ["This", "is", "an", "example", "of", "bucketing"]
]
# Step 2: Simple encoding (word to integer ID)
vocab = {}
counter = 1
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab[word] = counter
            counter += 1

# Convert words to numeric tokens
encoded_sentences = [[vocab[word] for word in sentence] for sentence in sentences]
# Step 3: Bucketing
buckets = {
    "1-2": [],
    "3-5": [],
    "6+": []
}

for sentence in encoded_sentences:
    length = len(sentence)
    if length <= 2:
        buckets["1-2"].append(sentence)
    elif length <= 5:
        buckets["3-5"].append(sentence)
    else:
        buckets["6+"].append(sentence)
# Step 4: Padding each bucket
def pad_sentences(bucket):
    max_len = max(len(s) for s in bucket)
    padded = [s + [0] * (max_len - len(s)) for s in bucket]
    return padded
# Step 5: Display padded buckets
for label, bucket in buckets.items():
    if bucket:
        padded = pad_sentences(bucket)
        print(f"\nBucket {label} (max length {len(padded[0])}):")
        for p in padded:
            print(p)
Bucket 1-2 (max length 2):
[1, 0]
[2, 3]

Bucket 3-5 (max length 5):
[4, 5, 6, 0, 0]
[7, 8, 9, 10, 11]

Bucket 6+ (max length 6):
[12, 13, 14, 15, 16, 17]
2. Connect bucketing to a neural network training loop with a real-life impact example
Real-Life Use Case: Customer Support Chatbot
Imagine we’re building an AI-powered customer support chatbot.
- It receives user queries of different lengths:
  - “Hi”
  - “I need help with my internet connection”
  - “My account has been suspended after I moved to another country”
Without bucketing, we’d pad every query to the longest one in the dataset (e.g., 20 tokens), which wastes memory and slows training.
With bucketing, we:
- Reduce wasted padding,
- Build batches of similar-length queries, so each training step does less useless work,
- Keep the model from spending most of its computation on runs of meaningless pad tokens.
The quick check below puts a number on the padding saving for the toy sentences from the first example.
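The snippet reuses the encoded_sentences and buckets variables from the first example and simply counts pad tokens; treat it as a rough sketch, since the actual saving depends entirely on how your data’s lengths are distributed.

# Pad tokens needed if everything is padded to one global max length
global_max = max(len(s) for s in encoded_sentences)
global_pad = sum(global_max - len(s) for s in encoded_sentences)

# Pad tokens needed if each bucket is padded only to its own max length
bucket_pad = 0
for bucket in buckets.values():
    if bucket:
        bucket_max = max(len(s) for s in bucket)
        bucket_pad += sum(bucket_max - len(s) for s in bucket)

print(f"Pad tokens with one global length: {global_pad}")
print(f"Pad tokens with per-bucket lengths: {bucket_pad}")

For the five toy sentences this works out to 13 pad tokens versus 3; on a real dataset with thousands of variable-length queries, that gap is exactly what turns into memory and speed savings.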
Neural Network Training (Simplified)
We’ll simulate:
- A mini neural network that predicts intent (“greeting”, “event”, or “technical”)
- Trained using bucketed padded inputs
- With step-by-step explanation
Step-by-Step Pure Python Simulation
Setup
We’ll:
- Define fake intents (labels)
- Bucket and pad as before
- Train a toy neural net (dot-product + sigmoid) using gradient updates
Step 1: Data
# Same sentences as before, now with intent labels
sentences = [
    ["Hi"],                                              # Greeting
    ["Hello", "friend"],                                 # Greeting
    ["Good", "morning", "everyone"],                     # Greeting
    ["Welcome", "to", "the", "AI", "session"],           # Event
    ["This", "is", "an", "example", "of", "bucketing"]   # Technical
]
labels = [0, 0, 0, 1, 2]  # 0: Greeting, 1: Event, 2: Technical

# Encoding (word to integer ID), as in the first example
vocab = {}
counter = 1
for sent in sentences:
    for word in sent:
        if word not in vocab:
            vocab[word] = counter
            counter += 1

encoded = [[vocab[word] for word in sent] for sent in sentences]
Step 2: Bucketing + Padding
def bucket_data(inputs, labels):
    buckets = {"1-2": [], "3-5": [], "6+": []}
    label_buckets = {"1-2": [], "3-5": [], "6+": []}
    for i, sent in enumerate(inputs):
        length = len(sent)
        if length <= 2:
            key = "1-2"
        elif length <= 5:
            key = "3-5"
        else:
            key = "6+"
        buckets[key].append(sent)
        label_buckets[key].append(labels[i])
    return buckets, label_buckets

def pad_sentences(bucket):
    max_len = max(len(s) for s in bucket)
    return [s + [0] * (max_len - len(s)) for s in bucket]

buckets, label_buckets = bucket_data(encoded, labels)
Step 3: Mini Neural Net (pure Python)
We’ll simulate:
- Input layer → weights → output layer
- Dot product for simplicity
- Sigmoid activation
- Gradient descent for weight update
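In equation form, the update the code below performs for each class c is a plain squared-error gradient step on a sigmoid output (written here with the same names the code uses; it adds nothing beyond what the loop does):

logit_c = W_c · x (dot product of the class-c weight vector and the input)
pred_c = sigmoid(logit_c)
error_c = pred_c − target_c (target_c is 1 for the true class, 0 otherwise)
W_c ← W_c − lr · error_c · sigmoid′(logit_c) · x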
import random
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(x):
    sx = sigmoid(x)
    return sx * (1 - sx)

# Initialize weights (one weight vector per class, same length as the input vector)
def init_weights(input_len, num_classes):
    return [[random.uniform(-0.1, 0.1) for _ in range(input_len)]
            for _ in range(num_classes)]

# Train per bucket
def train_bucket(X, y, num_classes, epochs=50, lr=0.01):
    input_len = len(X[0])
    W = init_weights(input_len, num_classes)
    for epoch in range(epochs):
        for xi, yi in zip(X, y):
            # Forward pass
            logits = [sum(w * x for w, x in zip(W[class_i], xi))
                      for class_i in range(num_classes)]
            preds = [sigmoid(logit) for logit in logits]
            # Compute error and backprop
            for class_i in range(num_classes):
                target = 1 if yi == class_i else 0
                error = preds[class_i] - target
                grad = [error * sigmoid_deriv(logits[class_i]) * x for x in xi]
                # Update weights
                W[class_i] = [w - lr * g for w, g in zip(W[class_i], grad)]
    return W

# Training each bucket
trained_weights = {}
num_classes = 3  # Greeting, Event, Technical
for bucket_key in buckets:
    if buckets[bucket_key]:
        X_pad = pad_sentences(buckets[bucket_key])
        trained_weights[bucket_key] = train_bucket(X_pad, label_buckets[bucket_key], num_classes)
Step 4: Predict from Trained Buckets
def predict(x, W):
    logits = [sum(w * v for w, v in zip(w_row, x)) for w_row in W]
    preds = [sigmoid(logit) for logit in logits]
    return preds.index(max(preds))  # Return class with highest probability

# Example test prediction
test_sent = ["Hello", "there"]
encoded_test = [vocab.get(w, 0) for w in test_sent]  # Unknown words map to 0
bucket_key = "1-2" if len(encoded_test) <= 2 else "3-5" if len(encoded_test) <= 5 else "6+"
max_len = len(trained_weights[bucket_key][0])  # Input length this bucket was trained with
test_pad = encoded_test + [0] * (max_len - len(encoded_test))
pred_class = predict(test_pad, trained_weights[bucket_key])
print(f"Predicted class: {pred_class} → {'Greeting' if pred_class == 0 else 'Event' if pred_class == 1 else 'Technical'}")
