Semi-Supervised Learning Example in Simple Python
Problem: Classify simple text messages as either “Greeting” or “Question”.
We’ll use:
- A few labeled text messages
- A few unlabeled messages
We’ll guess the label of each unlabeled message by comparing it with the labeled ones using simple text similarity.
Python Code (Text-Based Semi-Supervised Learning)
# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Simple similarity function based on shared words
def text_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)  # count of common words

# Step 4: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:  # strict ">" means the earliest example wins ties
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 5: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Display the results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')
How it works:
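- It counts the words an unlabeled message shares with each labeled example.
- The labeled example with the most shared words decides the label. On a tie, the example that comes first in the list wins.
- If no words are shared at all, every score is 0 and the message takes the label of the first labeled example ("Greeting" here). "Hey, how's it going?" and "Hi!" are labeled that way, because punctuation stays attached to the words, so "hi!" does not match "hi,".
- It’s like saying:
“This message shares words with a known greeting, so it’s probably a greeting too!”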
Example Output:
Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Question
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question
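To see why some of these guesses rest on real overlap and others on none, it can help to call the similarity function on individual pairs. The snippet below is a small illustration that repeats `text_similarity` so it runs on its own; the pairs are taken from the data above.

```python
def text_similarity(text1, text2):
    """Count the distinct whitespace-separated tokens two texts share."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)

# Two shared tokens: "what" and "is".
print(text_similarity("What time is it?", "What is your name?"))   # 2

# Punctuation stays attached, so "hi!" and "hi," are different tokens.
print(text_similarity("Hi!", "Hi, how are you?"))                  # 0

# "you" matches "you" but not "you?".
print(text_similarity("Can you help me?", "Where are you from?"))  # 1
print(text_similarity("Can you help me?", "Hi, how are you?"))     # 0
```

A score of 0 against every labeled example means the guess is only a fallback, not evidence.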
Next, we upgrade our text-based semi-supervised learning example by adding:
1. Punctuation removal
2. Simple stemming (like turning “helping” into “help”)
3. Lowercasing everything
Upgraded Code: Smarter Text Matching (No External Libraries)
import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocessing: lowercase, remove punctuation, simple stemming
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize and apply simple stemming (strip common suffixes)
    words = text.split()
    stemmed_words = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:  # skip short words like "is"
            word = word[:-1]
        stemmed_words.append(word)
    return set(stemmed_words)

# Step 4: Smarter similarity using preprocessed words
def text_similarity(text1, text2):
    words1 = preprocess(text1)
    words2 = preprocess(text2)
    return len(words1 & words2)

# Step 5: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:  # strict ">" means the earliest example wins ties
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')
What’s improved:
| Feature | Benefit |
|---|---|
| Lowercasing | Avoids mismatch like “Hi” vs “hi” |
| Punctuation removal | Avoids mismatch like “Hi!” vs “Hi” |
| Basic stemming | Matches “helping” with “help”, “runs” with “run”, etc. |
Sample Output:
Guessed labels for unlabeled text messages:
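"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Greeting
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question
Note that "Can you help me?" is now labeled Greeting. With punctuation removed, "you" also matches "Hi, how are you?", which ties with "Where are you from?" at one shared word, and the example that comes first in the list wins the tie.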
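The suffix rules are deliberately crude, and it is worth checking what they do to a few words before relying on them. The sketch below copies the same four rules into a standalone helper, here called `simple_stem` (a name chosen for this illustration, not part of the code above), and applies it to some sample words.

```python
def simple_stem(word):
    """Apply the same suffix rules as preprocess() to one lowercase word."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("ed"):
        return word[:-2]
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

for word in ["helping", "runs", "running", "this", "red"]:
    print(f"{word} -> {simple_stem(word)}")
# helping -> help
# runs -> run
# running -> runn
# this -> thi
# red -> r
```

The rules work for "helping" and "runs", but "running" becomes "runn" rather than "run", and ordinary words such as "this" and "red" get truncated. That is fine for a toy example, as long as both texts pass through the same function so the mistakes are at least consistent.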
Next, we take it one level further by adding a mini confidence score to each guessed label.
What is a Confidence Score?
It tells how sure our little text classifier is about its guess — on a simple scale from 0 to 1 (like 0% to 100%).
We’ll compute it like this:
- Compare the unlabeled text to all labeled examples.
- Find the most similar one (best match).
- Divide the number of words shared with the best match by the total number of unique words across both texts. This basic similarity ratio is known as the Jaccard index.
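As a worked example of that ratio, the sketch below computes it for "What time is it?" against "What is your name?". The helper name `jaccard` is an assumption made for this illustration, and the word sets are written out by hand as they look after lowercasing and punctuation removal.

```python
def jaccard(words_a, words_b):
    """Shared words divided by all distinct words across both sets."""
    union = words_a | words_b
    if not union:  # both sets empty: treat the ratio as 0.0
        return 0.0
    return len(words_a & words_b) / len(union)

unlabeled_words = {"what", "time", "is", "it"}    # "What time is it?"
labeled_words = {"what", "is", "your", "name"}    # "What is your name?"

# 2 shared words ("what", "is") out of 6 distinct words overall.
print(round(jaccard(unlabeled_words, labeled_words), 2))  # 0.33
```

The score reaches 1.0 only when both messages contain exactly the same words, so short messages matched against longer ones will always score low.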
Final Upgraded Python Code (with Confidence Score)
import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocess text: lowercase, remove punctuation, simple stemming
def preprocess(text):
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    words = text.split()
    stemmed = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

# Step 4: Get the best label and its confidence
def best_label_and_confidence(text, labeled):
    words_u = preprocess(text)
    best_score = -1
    best_label = None
    best_overlap = 0
    best_union = 1  # placeholder so an empty labeled list gives confidence 0
    for l_item in labeled:
        words_l = preprocess(l_item["text"])
        overlap = len(words_u & words_l)
        union = len(words_u | words_l)
        score = overlap  # the raw overlap chooses the best label
        if score > best_score:
            best_score = score
            best_label = l_item["label"]
            best_overlap = overlap
            best_union = union
    # Guard against an empty union (both texts reduce to no words)
    confidence = best_overlap / best_union if best_union else 0.0
    return best_label, round(confidence, 2)

# Step 5: Assign labels and confidence
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        label, conf = best_label_and_confidence(u_item["text"], labeled)
        u_item["label"] = label
        u_item["confidence"] = conf
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels with confidence scores:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Label: {item["label"]}, Confidence: {item["confidence"]}')
Sample Output:
Guessed labels with confidence scores:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
"Hi!" => Label: Greeting, Confidence: 0.25
"What time is it?" => Label: Question, Confidence: 0.33
Summary:
Now our classifier tells us how sure it is, which can be helpful when:
- We want to flag low-confidence guesses.
- We want a human to double-check uncertain predictions.
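One simple way to act on those scores is a cut-off: anything below it goes to a person. The sketch below assumes a threshold of 0.2, which is an arbitrary value picked for this illustration, and a helper named `split_by_confidence` that is not part of the code above. The predictions are the ones from the sample output.

```python
REVIEW_THRESHOLD = 0.2  # hypothetical cut-off; tune it for your own data

def split_by_confidence(items, threshold=REVIEW_THRESHOLD):
    """Separate predictions into accepted ones and ones needing human review."""
    accepted = [item for item in items if item["confidence"] >= threshold]
    needs_review = [item for item in items if item["confidence"] < threshold]
    return accepted, needs_review

# Predictions in the same shape that label_unlabeled_texts() produces.
predictions = [
    {"text": "Hey, how's it going?", "label": "Greeting", "confidence": 0.14},
    {"text": "Can you help me?", "label": "Greeting", "confidence": 0.14},
    {"text": "Hi!", "label": "Greeting", "confidence": 0.25},
    {"text": "What time is it?", "label": "Question", "confidence": 0.33},
]

accepted, needs_review = split_by_confidence(predictions)
for item in needs_review:
    print(f'Review: "{item["text"]}" (guessed {item["label"]}, confidence {item["confidence"]})')
```

With this threshold, the two 0.14 guesses are flagged, including "Can you help me?", which is arguably a question that was labeled Greeting.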
Pseudo Code: Semi-Supervised Text Classifier with Confidence
1. Initialize labeled and unlabeled text examples
    labeled_texts = list of texts with known labels
    unlabeled_texts = list of texts with no labels
2. Define a function to preprocess text
    function preprocess(text):
        convert text to lowercase
        remove punctuation
        split text into words
        for each word:
            if word ends with "ing", remove "ing"
            else if word ends with "ed", remove "ed"
            else if word ends with "es", remove "es"
            else if word ends with "s" and length > 3, remove "s"
        return set of cleaned/stemmed words
3. Define a function to find the best label and confidence
    function best_label_and_confidence(unlabeled_text, labeled_texts):
        preprocess the unlabeled_text into a set of words
        initialize:
            best_label = None
            best_score = -1
            best_overlap = 0
            best_union = 1 # to avoid divide-by-zero
        for each labeled_item in labeled_texts:
            preprocess labeled_item's text into a set of words
            compute overlap = number of shared words between both sets
            compute union = total unique words from both sets
            if overlap > best_score:
                update best_score to overlap
                update best_label to current item's label
                save current overlap and union as best_overlap and best_union
        confidence = best_overlap / best_union
        return best_label and rounded confidence
4. Define the main labeling function
    function label_unlabeled_texts(labeled_texts, unlabeled_texts):
        for each item in unlabeled_texts:
            call best_label_and_confidence with current text
            assign the returned label and confidence to the item
        return the updated unlabeled_texts
5. Run the labeling and print results
    call label_unlabeled_texts with your labeled and unlabeled data
    for each result in labeled_unlabeled_texts:
        print: original text, guessed label, confidence score
Example Output:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
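The classifier above labels each message once and never learns from its own guesses. A common semi-supervised technique, self-training, goes one step further: guesses that clear a confidence threshold are added to the labeled pool and can then help label the remaining messages. The sketch below is one possible way to do that and is not part of the original code. The names `tokens`, `predict`, and `self_train`, the 0.3 threshold, the round limit, and the message "What time works?" are all assumptions made for this illustration, and stemming is left out to keep it short.

```python
import string

def tokens(text):
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = "".join(c for c in text.lower() if c not in string.punctuation)
    return set(cleaned.split())

def predict(text, labeled):
    """Return the label of the example sharing the most words, plus its Jaccard score."""
    words_u = tokens(text)
    best_label, best_overlap, best_conf = None, -1, 0.0
    for item in labeled:
        words_l = tokens(item["text"])
        overlap = len(words_u & words_l)
        union = len(words_u | words_l)
        if overlap > best_overlap:
            best_label, best_overlap = item["label"], overlap
            best_conf = overlap / union if union else 0.0
    return best_label, best_conf

def self_train(labeled, unlabeled, threshold=0.3, max_rounds=5):
    """Repeatedly move confident guesses into the labeled pool."""
    labeled = list(labeled)      # work on copies, leave the inputs untouched
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        still_unsure, added = [], False
        for text in remaining:
            label, conf = predict(text, labeled)
            if conf >= threshold:
                labeled.append({"text": text, "label": label})
                added = True
            else:
                still_unsure.append(text)
        remaining = still_unsure
        if not added:            # nothing new was learned, so stop early
            break
    return labeled, remaining

seed = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]
final_labeled, leftover = self_train(seed, ["What time works?", "What time is it?"])
for item in final_labeled[len(seed):]:
    print(f'"{item["text"]}" => {item["label"]}')
print("Still unlabeled:", leftover)
```

In the first round "What time works?" scores only about 0.17 against the seed examples and is held back, while "What time is it?" scores 0.33 and joins the pool as a Question. In the second round "What time works?" shares two of five words with that new example (0.4) and is labeled Question as well. The risk is that a wrong but confident guess gets reinforced the same way, which is why the threshold matters.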
Semi-supervised Learning – Basic Math Concepts
