Semi-Supervised Learning Example with Simple Python
Problem: Classify simple text messages as either “Greeting” or “Question”.
We’ll use:
- A few labeled text messages
- A few unlabeled messages
- A simple word-overlap similarity to guess each unlabeled message’s label from the labeled ones
Python Code (Text-Based Semi-Supervised Learning)
# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]
# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]
# Step 3: Simple similarity function based on shared words
def text_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)  # count of common words
# Step 4: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled
# Step 5: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)
# Display the results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')
How it works:
- For each unlabeled message, it counts the words shared with every labeled example.
- The labeled example with the most shared words decides the label. On a tie (including when no words are shared at all), the first labeled example checked wins.
- It’s like saying: “This message shares words with a known greeting, so it’s probably a greeting too!”
Example Output:
Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Question
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question
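It can help to look at the raw overlap counts behind these guesses. The sketch below reuses `text_similarity` and the labeled data from the code above; the helper name `overlap_scores` is introduced here only for illustration.

```python
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

def text_similarity(text1, text2):
    # Same word-overlap count as in the code above
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)

def overlap_scores(text, labeled):
    # Hypothetical helper: one overlap count per labeled example, in order
    return [text_similarity(text, item["text"]) for item in labeled]

for text in ["Hey, how's it going?", "Can you help me?", "Hi!", "What time is it?"]:
    print(f'"{text}" => {overlap_scores(text, labeled_texts)}')
```

With this data, "Hey, how's it going?" and "Hi!" score 0 against every labeled example, because tokens such as "hi!" and "hi," keep their punctuation and never match. Both come out as Greeting only because the first labeled example happens to be a greeting.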
Next, we upgrade the text-based semi-supervised learning example by adding:
1. Punctuation removal
2. Simple stemming (like turning “going” into “go”)
3. Lowercasing everything
Upgraded Code: Smarter Text Matching (Standard Library Only)
import string
# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]
# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]
# Step 3: Preprocessing — lowercase, remove punctuation, simple stemming
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize and simple stemming (remove common suffixes)
    words = text.split()
    stemmed_words = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:  # leave short words like "is" unchanged
            word = word[:-1]
        stemmed_words.append(word)
    return set(stemmed_words)
# Step 4: Smarter similarity using preprocessed words
def text_similarity(text1, text2):
    words1 = preprocess(text1)
    words2 = preprocess(text2)
    return len(words1 & words2)
# Step 5: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled
# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)
# Step 7: Show results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')
What’s improved:
| Feature | Benefit | 
|---|---|
| Lowercasing | Avoids mismatch like “Hi” vs “hi” | 
| Punctuation removal | Avoids mismatch like “Hi!” vs “Hi” | 
| Basic stemming | Matches “helping” with “help”, “runs” with “run”, etc. | 
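Sample Output:
Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Greeting
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question
Note that “Can you help me?” now comes out as Greeting. With punctuation stripped, “you” matches both “Hi, how are you?” and “Where are you from?”, and the tie goes to the labeled example that comes first.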
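To see what the preprocessing step actually produces, the sketch below runs the same `preprocess` rules on a few inputs. The sample strings "running" and "yes" are not part of the data set; they are added here only to show where the suffix rules fall short.

```python
import string

def preprocess(text):
    # Same steps as the upgraded code: lowercase, strip punctuation, trim suffixes
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    stemmed = []
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

for sample in ["Hi!", "Hey, how's it going?", "running", "yes"]:
    print(f"{sample!r} -> {sorted(preprocess(sample))}")
```

The rules work well on this toy data ("going" becomes "go", "how's" becomes "how"), but they are crude: "running" becomes "runn" rather than "run", and "yes" is cut down to "y". For anything beyond a demo, a proper stemmer would likely be a better choice.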
Next, we add a simple confidence score to each guessed label.
What is a Confidence Score?
It tells us how sure our little text classifier is about its guess, on a simple scale from 0 to 1 (like 0% to 100%).
We’ll compute it like this:
- Compare the unlabeled text to all labeled examples.
- Find the most similar one (best match).
- Divide the number of words shared with the best match by the number of distinct words across both texts, which gives a basic similarity ratio.
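This ratio is commonly called the Jaccard similarity. As a minimal sketch, the helper below (the name `jaccard` is introduced here for illustration) computes it for two word sets that are assumed to be already preprocessed:

```python
def jaccard(words_a, words_b):
    # Shared words divided by all distinct words across both sets
    union = words_a | words_b
    if not union:  # both sets empty: treat the ratio as 0.0
        return 0.0
    return len(words_a & words_b) / len(union)

# Word sets as they look after lowercasing and punctuation removal
unlabeled = {"what", "time", "is", "it"}
labeled = {"what", "is", "your", "name"}

print(round(jaccard(unlabeled, labeled), 2))  # 2 shared / 6 distinct = 0.33
```

A score of 1.0 would mean the two messages use exactly the same words, while 0.0 means they share none.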
Final Upgraded Python Code (with Confidence Score)
import string
# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]
# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]
# Step 3: Preprocess text — lower, remove punctuation, simple stemming
def preprocess(text):
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    words = text.split()
    stemmed = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)
# Step 4: Get similarity score and label
def best_label_and_confidence(text, labeled):
    words_u = preprocess(text)
    best_score = -1
    best_label = None
    best_overlap = 0
    best_union = 1  # avoid divide-by-zero
    for l_item in labeled:
        words_l = preprocess(l_item["text"])
        overlap = len(words_u & words_l)
        union = len(words_u | words_l)
        score = overlap  # for choosing best label
        if score > best_score:
            best_score = score
            best_label = l_item["label"]
            best_overlap = overlap
            best_union = union
    confidence = best_overlap / best_union if best_union else 0.0  # guard against two empty word sets
    return best_label, round(confidence, 2)
# Step 5: Assign labels and confidence
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        label, conf = best_label_and_confidence(u_item["text"], labeled)
        u_item["label"] = label
        u_item["confidence"] = conf
    return unlabeled
# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)
# Step 7: Show results
print("Guessed labels with confidence scores:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Label: {item["label"]}, Confidence: {item["confidence"]}')
Sample Output:
Guessed labels with confidence scores:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
"Hi!" => Label: Greeting, Confidence: 0.25
"What time is it?" => Label: Question, Confidence: 0.33
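A low confidence value can also hide a tie between labels. The sketch below is one way to surface that; it reuses the `preprocess` rules and labeled data from the code above, and the helper `tied_labels` is a hypothetical addition, not part of the original classifier.

```python
import string

labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

def preprocess(text):
    # Same rules as above: lowercase, strip punctuation, trim suffixes
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    stemmed = []
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

def tied_labels(text, labeled):
    # Hypothetical helper: the top overlap and every label that reaches it
    words_u = preprocess(text)
    overlaps = [(len(words_u & preprocess(item["text"])), item["label"])
                for item in labeled]
    best = max(score for score, _ in overlaps)
    return best, sorted({label for score, label in overlaps if score == best})

print(tied_labels("Can you help me?", labeled_texts))  # (1, ['Greeting', 'Question'])
print(tied_labels("What time is it?", labeled_texts))  # (2, ['Question'])
```

For "Can you help me?" both labels reach the same top overlap, so the Greeting guess in the sample output reflects the order of the labeled examples rather than a real preference.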
Summary:
Now our classifier tells us how sure it is, which can be helpful when:
- We want to flag low-confidence guesses.
- We want a human to double-check uncertain predictions.
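As a rough sketch of that idea, the snippet below splits results into accepted guesses and ones that need review. The `results` list copies the sample output above; the 0.2 threshold and the name `split_by_confidence` are arbitrary choices for illustration, not part of the original code.

```python
results = [
    {"text": "Hey, how's it going?", "label": "Greeting", "confidence": 0.14},
    {"text": "Can you help me?", "label": "Greeting", "confidence": 0.14},
    {"text": "Hi!", "label": "Greeting", "confidence": 0.25},
    {"text": "What time is it?", "label": "Question", "confidence": 0.33},
]

def split_by_confidence(items, threshold=0.2):
    # Hypothetical helper: separate confident guesses from ones needing review
    accepted = [item for item in items if item["confidence"] >= threshold]
    needs_review = [item for item in items if item["confidence"] < threshold]
    return accepted, needs_review

accepted, needs_review = split_by_confidence(results)
print("Needs human review:")
for item in needs_review:
    print(f'  "{item["text"]}" (guessed {item["label"]}, confidence {item["confidence"]})')
```

With this threshold, the two messages scored 0.14 are sent for review and the other two are accepted. The right cutoff depends on the data, so it is worth trying a few values.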
Pseudo Code: Semi-Supervised Text Classifier with Confidence
1. Initialize labeled and unlabeled text examples
labeled_texts = list of texts with known labels
unlabeled_texts = list of texts with no labels
2. Define a function to preprocess text
function preprocess(text):
    convert text to lowercase
    remove punctuation
    split text into words
    for each word:
        if word ends with "ing", remove "ing"
        else if word ends with "ed", remove "ed"
        else if word ends with "es", remove "es"
        else if word ends with "s" and length > 3, remove "s"
    return set of cleaned/stemmed words
3. Define a function to find the best label and confidence
function best_label_and_confidence(unlabeled_text, labeled_texts):
    preprocess the unlabeled_text into a set of words
    initialize:
        best_label = None
        best_score = -1
        best_overlap = 0
        best_union = 1   # to avoid divide-by-zero
    for each labeled_item in labeled_texts:
        preprocess labeled_item's text into a set of words
        compute overlap = number of shared words between both sets
        compute union = total unique words from both sets
        if overlap > best_score:
            update best_score to overlap
            update best_label to current item's label
            save current overlap and union as best_overlap and best_union
    confidence = best_overlap / best_union
    return best_label and rounded confidence
4. Define the main labeling function
function label_unlabeled_texts(labeled_texts, unlabeled_texts):
    for each item in unlabeled_texts:
        call best_label_and_confidence with current text
        assign the returned label and confidence to the item
    return the updated unlabeled_texts
5. Run the labeling and print results
labeled_unlabeled = call label_unlabeled_texts with your labeled and unlabeled data
for each result in labeled_unlabeled:
    print: original text, guessed label, confidence score
Example Output:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
Semi-supervised Learning – Basic Math Concepts
