Semi-supervised Learning Example with Simple Python

Problem: Classify simple text messages as either “Greeting” or “Question”.

We’ll use:

1. A few labeled text messages
2. A few unlabeled messages

We’ll guess the label of each unlabeled message by comparing it with the labeled ones using simple text similarity.

Python Code (Text-Based Semi-Supervised Learning)

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Simple similarity function based on shared words
def text_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)  # count of common words

# Step 4: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 5: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Display the results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

How it works:

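It counts the words that the unlabeled message shares with each labeled example.
The labeled example with the most shared words decides the label. If no example shares any words, every score ties at 0 and the first labeled example wins by default.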
It’s like saying:
“This message shares words with a known greeting, so it’s probably a greeting too!”
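
To make the word-matching idea concrete, here is a small sketch that calls the same text_similarity function on two pairs of messages. The pairs are our own picks from the example data. Note how punctuation stays glued to the words, so "hi!" and "hi," do not count as a match.

```python
def text_similarity(text1, text2):
    # Same idea as in the code above: count the words two texts share.
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)

# "what" and "is" appear in both texts, so the score is 2.
print(text_similarity("What time is it?", "What is your name?"))  # 2

# Punctuation stays attached: "hi!" and "hi," are different strings.
print(text_similarity("Hi!", "Hi, how are you?"))  # 0
```

A score of 0 against every labeled example means the guess falls back to the first label in the list, which is worth keeping in mind when reading the output.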

Example Output:

Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Question
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question

Next, we upgrade our text-based semi-supervised learning example by adding:

1. Lowercasing everything
2. Punctuation removal
3. Simple stemming (like turning “helping” into “help”)

Upgraded Code: Smarter Text Matching (Standard Library Only)

import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocessing — lowercase, remove punctuation, simple stemming
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize and simple stemming (remove common suffixes)
    words = text.split()
    stemmed_words = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:  # leave short words like "is" alone
            word = word[:-1]
        stemmed_words.append(word)
    return set(stemmed_words)

# Step 4: Smarter similarity using preprocessed words
def text_similarity(text1, text2):
    words1 = preprocess(text1)
    words2 = preprocess(text2)
    return len(words1 & words2)

# Step 5: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

What’s improved:

Feature	Benefit
Lowercasing	Avoids mismatch like “Hi” vs “hi”
Punctuation removal	Avoids mismatch like “Hi!” vs “Hi”
Basic stemming	Matches “helping” with “help”, “runs” with “run”, etc.
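
To see what the preprocessing step actually produces, here is a small sketch that repeats the preprocess function from the code above and prints the resulting word sets. The test inputs "running" and "red" are our own additions, chosen to show where this simple suffix stripping falls short.

```python
import string

def preprocess(text):
    # Lowercase, strip punctuation, then trim a few common suffixes.
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    stemmed = []
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

print(sorted(preprocess("Hey, how's it going?")))  # ['go', 'hey', 'how', 'it']
print(sorted(preprocess("Hi!")))                    # ['hi']

# Limits: there is no rule for doubled consonants,
# and short words can be cut too far.
print(sorted(preprocess("running")))                # ['runn']
print(sorted(preprocess("red")))                    # ['r']
```

For the tiny dataset in this example these limits do not matter, but a real stemmer would be needed for anything larger.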

Sample Output:

Guessed labels for unlabeled text messages:
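"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Greeting
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question

Note that “Can you help me?” is now labeled Greeting. With punctuation removed, its word “you” matches both “Hi, how are you?” and “Where are you from?”, so the two examples tie at one shared word and the one that comes first in the list wins.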
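
To see why a guess comes out the way it does, it helps to print the overlap score against every labeled example. The sketch below reuses the preprocess function and the labeled data from the code above. The names message, scores, and tied_labels are our own and only for illustration.

```python
import string

def preprocess(text):
    # Same preprocessing as above: lowercase, no punctuation, simple stemming.
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    stemmed = []
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

message = "Can you help me?"
words = preprocess(message)

# One overlap score per labeled example, in list order.
scores = [len(words & preprocess(item["text"])) for item in labeled_texts]
print(scores)  # [0, 1, 0, 1]

# Collect every label that reaches the top score to reveal ties.
top = max(scores)
tied_labels = {item["label"] for item, s in zip(labeled_texts, scores) if s == top}
print(sorted(tied_labels))  # ['Greeting', 'Question']
```

When more than one label reaches the top score, the guess depends only on the order of the labeled list, so such messages are good candidates for a second look.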

Next, we take the example one step further by adding a mini confidence score to each guessed label.

What is a Confidence Score?

It tells how sure our little text classifier is about its guess — on a simple scale from 0 to 1 (like 0% to 100%).

We’ll compute it like this:

Compare the unlabeled text to all labeled examples.
Find the most similar one (best match).
Divide the number of words shared with the best match by the total number of unique words across both texts. This basic similarity ratio is known as the Jaccard similarity.
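
As a quick worked example of that ratio, the sketch below computes it for two pairs of word sets. The helper name jaccard is our own, and the word sets are written out by hand as they would look after lowercasing and punctuation removal.

```python
def jaccard(words_a, words_b):
    # Shared words divided by all unique words across both sets.
    union = words_a | words_b
    if not union:  # both sets empty: treat the similarity as 0.0
        return 0.0
    return len(words_a & words_b) / len(union)

# "Hi!" vs "Hi, how are you?": 1 shared word out of 4 unique words.
print(round(jaccard({"hi"}, {"hi", "how", "are", "you"}), 2))  # 0.25

# "What time is it?" vs "What is your name?": 2 shared out of 6 unique.
print(round(jaccard({"what", "time", "is", "it"},
                    {"what", "is", "your", "name"}), 2))  # 0.33
```

A value of 1.0 would mean both messages use exactly the same words, and 0.0 means they share none.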

Final Upgraded Python Code (with Confidence Score)

import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocess text — lower, remove punctuation, simple stemming
def preprocess(text):
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    words = text.split()
    stemmed = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

# Step 4: Get similarity score and label
def best_label_and_confidence(text, labeled):
    words_u = preprocess(text)
    best_score = -1
    best_label = None
    best_overlap = 0
    best_union = 1  # avoid divide-by-zero
    for l_item in labeled:
        words_l = preprocess(l_item["text"])
        overlap = len(words_u & words_l)
        union = len(words_u | words_l)
        score = overlap  # for choosing best label
        if score > best_score:
            best_score = score
            best_label = l_item["label"]
            best_overlap = overlap
            best_union = union
    confidence = best_overlap / best_union
    return best_label, round(confidence, 2)

# Step 5: Assign labels and confidence
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        label, conf = best_label_and_confidence(u_item["text"], labeled)
        u_item["label"] = label
        u_item["confidence"] = conf
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels with confidence scores:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Label: {item["label"]}, Confidence: {item["confidence"]}')

Sample Output:

Guessed labels with confidence scores:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
"Hi!" => Label: Greeting, Confidence: 0.25
"What time is it?" => Label: Question, Confidence: 0.33

Summary:

Now our classifier tells us how sure it is, which can be helpful when:

We want to flag low-confidence guesses.
We want a human to double-check uncertain predictions.
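
Here is a minimal sketch of how such flagging could look. It takes the results from the sample output above and splits them at a cutoff. The cutoff value of 0.2 and the names REVIEW_THRESHOLD and split_by_confidence are assumptions made for this illustration, so pick a threshold that suits your own data.

```python
REVIEW_THRESHOLD = 0.2  # assumed cutoff, not a recommended value

# Results copied from the sample output above.
results = [
    {"text": "Hey, how's it going?", "label": "Greeting", "confidence": 0.14},
    {"text": "Can you help me?", "label": "Greeting", "confidence": 0.14},
    {"text": "Hi!", "label": "Greeting", "confidence": 0.25},
    {"text": "What time is it?", "label": "Question", "confidence": 0.33},
]

def split_by_confidence(items, threshold):
    # Keep guesses at or above the threshold, send the rest to a human.
    accepted = [i for i in items if i["confidence"] >= threshold]
    needs_review = [i for i in items if i["confidence"] < threshold]
    return accepted, needs_review

accepted, needs_review = split_by_confidence(results, REVIEW_THRESHOLD)

print("Accepted guesses:")
for item in accepted:
    print(f'  "{item["text"]}" => {item["label"]} ({item["confidence"]})')

print("Needs a human check:")
for item in needs_review:
    print(f'  "{item["text"]}" => {item["label"]} ({item["confidence"]})')
```

With this cutoff, the two messages that scored 0.14 land in the review list, while “Hi!” and “What time is it?” are accepted.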

Pseudo Code: Semi-Supervised Text Classifier with Confidence

1. Initialize labeled and unlabeled text examples

labeled_texts = list of texts with known labels
unlabeled_texts = list of texts with no labels

2. Define a function to preprocess text

function preprocess(text):
    convert text to lowercase
    remove punctuation
    split text into words
    for each word:
        if word ends with "ing", remove "ing"
        else if word ends with "ed", remove "ed"
        else if word ends with "es", remove "es"
        else if word ends with "s" and length > 3, remove "s"
    return set of cleaned/stemmed words

3. Define a function to find the best label and confidence

function best_label_and_confidence(unlabeled_text, labeled_texts):
    preprocess the unlabeled_text into a set of words
    initialize:
        best_label = None
        best_score = -1
        best_overlap = 0
        best_union = 1   # to avoid divide-by-zero

    for each labeled_item in labeled_texts:
        preprocess labeled_item's text into a set of words
        compute overlap = number of shared words between both sets
        compute union = total unique words from both sets
        if overlap > best_score:
            update best_score to overlap
            update best_label to current item's label
            save current overlap and union as best_overlap and best_union
    confidence = best_overlap / best_union
    return best_label and rounded confidence

4. Define the main labeling function

function label_unlabeled_texts(labeled_texts, unlabeled_texts):
    for each item in unlabeled_texts:
        call best_label_and_confidence with current text
        assign the returned label and confidence to the item
    return the updated unlabeled_texts

5. Run the labeling and print results

call label_unlabeled_texts with your labeled and unlabeled data

for each result in labeled_unlabeled_texts:
    print: original text, guessed label, confidence score

Example Output:

"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14

Semi-supervised Learning – Basic Math Concepts