Semi-supervised learning example with simple Python

Problem: Classify simple text messages as either “Greeting” or “Question”.

We’ll use:

  • A few labeled text messages
  • A few unlabeled messages
  • Simple text similarity: each unlabeled message gets the label of the labeled message it most resembles.

Python Code (Text-Based Semi-Supervised Learning)

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Simple similarity function based on shared words
def text_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)  # count of common words

# Step 4: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 5: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Display the results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

How it works:

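  • It counts the words that the unlabeled message shares with each labeled example.
  • The labeled example with the most shared words decides the label. On a tie (including when nothing matches at all), the earliest labeled example wins, because the code only switches on a strictly higher score.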
  • It’s like saying:
    “This message shares words with a known greeting, so it’s probably a greeting too!”
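To see how the shared-word count behaves on the data above, here is a minimal sketch that scores two of the unlabeled messages against every labeled example. It reuses the text_similarity function from the code above; the names labeled_sentences, scores_hi and scores_what are illustrative only.

```python
def text_similarity(text1, text2):
    # Same word-overlap count as in the code above (no punctuation handling).
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return len(words1 & words2)

# The four labeled messages, in their original order.
labeled_sentences = [
    "Hello there",
    "Hi, how are you?",
    "What is your name?",
    "Where are you from?",
]

scores_hi = [text_similarity("Hi!", s) for s in labeled_sentences]
scores_what = [text_similarity("What time is it?", s) for s in labeled_sentences]

print(scores_hi)    # [0, 0, 0, 0]
print(scores_what)  # [0, 0, 2, 0]
```

Every score for "Hi!" is zero because the token "hi!" is not the same string as "hi,". Its Greeting label therefore comes from the first-example tie-break rather than from a real match, whereas "What time is it?" genuinely shares "what" and "is" with a labeled question.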

Example Output:

Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Question
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question

Upgrading our text-based semi-supervised learning example by adding:

1. Punctuation removal
2. Simple stemming (like turning “helping” into “help”)
3. Lowercasing everything

Upgraded Code: Smarter Text Matching (No External Libraries)

import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocessing — lowercase, remove punctuation, simple stemming
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize and simple stemming (remove common suffixes)
    words = text.split()
    stemmed_words = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:  # leave short words like "is" untouched
            word = word[:-1]
        stemmed_words.append(word)
    return set(stemmed_words)

# Step 4: Smarter similarity using preprocessed words
def text_similarity(text1, text2):
    words1 = preprocess(text1)
    words2 = preprocess(text2)
    return len(words1 & words2)

# Step 5: Semi-supervised text labeling
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        best_label = None
        best_score = -1
        for l_item in labeled:
            score = text_similarity(u_item["text"], l_item["text"])
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
        u_item["label"] = best_label
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels for unlabeled text messages:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

What’s improved:

Feature              | Benefit
Lowercasing          | Avoids mismatches like “Hi” vs “hi”
Punctuation removal  | Avoids mismatches like “Hi!” vs “Hi”
Basic stemming       | Matches “helping” with “help”, “runs” with “run”, etc.
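The benefits listed above can be checked directly. The sketch below repeats the preprocess function from the upgraded code and runs it on a few words; the sample inputs are chosen here for illustration and are not part of the original data. It also shows a limit of this suffix-chopping approach: it is not a real stemmer, so some words are cut too far or not far enough.

```python
import string

def preprocess(text):
    """Lowercase, strip punctuation, and chop a few common suffixes."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    stemmed = set()
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.add(word)
    return stemmed

print(preprocess("Hi!") == preprocess("hi"))  # True
print(preprocess("helping"))                  # {'help'}
print(preprocess("runs"))                     # {'run'}
print(preprocess("running"))                  # {'runn'}  (doubled consonant is kept)
print(preprocess("thing"))                    # {'th'}    (over-stemmed)
```

For toy data like the messages in this example these quirks rarely matter, but they are worth keeping in mind before reusing the function on real text.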

Sample Output:

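Guessed labels for unlabeled text messages:
"Hey, how's it going?" => Guessed Label: Greeting
"Can you help me?" => Guessed Label: Greeting
"Hi!" => Guessed Label: Greeting
"What time is it?" => Guessed Label: Question

Note: “Can you help me?” now comes out as Greeting. After preprocessing, its only shared word is “you”, which appears in both “Hi, how are you?” (Greeting) and “Where are you from?” (Question). The two tie at one shared word, and the earlier labeled example wins.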

Upgrading to the next level by adding a mini confidence score to each guessed label:

What is a Confidence Score?

It tells how sure our little text classifier is about its guess — on a simple scale from 0 to 1 (like 0% to 100%).

We’ll compute it like this:

  • Compare the unlabeled text to all labeled examples.
  • Find the most similar one (best match).
  • Divide the number of words shared with the best match by the number of unique words across both texts. This ratio is the Jaccard similarity of the two word sets.
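As a worked example of that ratio (the Jaccard similarity of two word sets), the sketch below computes it for “What time is it?” against “What is your name?”. The two word sets are written out by hand as they would look after lowercasing and punctuation removal, which is an assumption made to keep the example short.

```python
# Word sets after lowercasing and punctuation removal (written out by hand).
words_u = {"what", "time", "is", "it"}    # "What time is it?"
words_l = {"what", "is", "your", "name"}  # "What is your name?"

overlap = words_u & words_l  # words found in both texts
union = words_u | words_l    # unique words across both texts

confidence = len(overlap) / len(union)

print(sorted(overlap))       # ['is', 'what']
print(len(union))            # 6
print(round(confidence, 2))  # 0.33
```

Two shared words out of six unique words gives 2 / 6, or about 0.33. Short messages with few shared words will always score low on this scale, so the value is better read as a relative signal than as a probability.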

Final Upgraded Python Code (with Confidence Score)

import string

# Step 1: Labeled text data
labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

# Step 2: Unlabeled text data
unlabeled_texts = [
    {"text": "Hey, how's it going?"},
    {"text": "Can you help me?"},
    {"text": "Hi!"},
    {"text": "What time is it?"},
]

# Step 3: Preprocess text — lower, remove punctuation, simple stemming
def preprocess(text):
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    words = text.split()
    stemmed = []
    for word in words:
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

# Step 4: Get similarity score and label
def best_label_and_confidence(text, labeled):
    words_u = preprocess(text)
    best_score = -1
    best_label = None
    best_overlap = 0
    best_union = 1  # avoid divide-by-zero
    for l_item in labeled:
        words_l = preprocess(l_item["text"])
        overlap = len(words_u & words_l)
        union = len(words_u | words_l)
        score = overlap  # for choosing best label
        if score > best_score:
            best_score = score
            best_label = l_item["label"]
            best_overlap = overlap
            best_union = union
    confidence = best_overlap / best_union if best_union else 0.0  # 0.0 when both word sets are empty
    return best_label, round(confidence, 2)

# Step 5: Assign labels and confidence
def label_unlabeled_texts(labeled, unlabeled):
    for u_item in unlabeled:
        label, conf = best_label_and_confidence(u_item["text"], labeled)
        u_item["label"] = label
        u_item["confidence"] = conf
    return unlabeled

# Step 6: Run the labeling
labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

# Step 7: Show results
print("Guessed labels with confidence scores:")
for item in labeled_unlabeled:
    print(f'"{item["text"]}" => Label: {item["label"]}, Confidence: {item["confidence"]}')

Sample Output:

Guessed labels with confidence scores:
"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14
"Hi!" => Label: Greeting, Confidence: 0.25
"What time is it?" => Label: Question, Confidence: 0.33
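The Greeting label for “Can you help me?” may look surprising. The sketch below, which repeats preprocess and the labeled data from the code above, prints the overlap with each labeled example to show where it comes from. Using list.index on the maximum is a stand-in chosen here to mimic the strict "greater than" comparison in the loop; the names query, overlaps and best_index are illustrative.

```python
import string

labeled_texts = [
    {"text": "Hello there", "label": "Greeting"},
    {"text": "Hi, how are you?", "label": "Greeting"},
    {"text": "What is your name?", "label": "Question"},
    {"text": "Where are you from?", "label": "Question"},
]

def preprocess(text):
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    stemmed = []
    for word in text.split():
        if word.endswith("ing"):
            word = word[:-3]
        elif word.endswith("ed"):
            word = word[:-2]
        elif word.endswith("es"):
            word = word[:-2]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        stemmed.append(word)
    return set(stemmed)

query = preprocess("Can you help me?")
overlaps = [len(query & preprocess(item["text"])) for item in labeled_texts]
print(overlaps)  # [0, 1, 0, 1]

# A strict ">" comparison keeps the first example that reaches the top score,
# which is what list.index(max(...)) returns as well.
best_index = overlaps.index(max(overlaps))
print(labeled_texts[best_index]["label"])  # Greeting
```

The message ties with one Greeting and one Question, each sharing only the word "you", so the order of the labeled list decides the result. The low confidence of 0.14 is a fair warning that this guess should not be trusted.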

Summary:

Now our classifier tells us how sure it is, which can be helpful when:

  • We want to flag low-confidence guesses.
  • We want a human to double-check uncertain predictions.
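One way to act on these two points is to split the results by a cut-off value. The sketch below is a minimal illustration, not part of the original classifier: the helper split_by_confidence and the threshold of 0.2 are hypothetical choices, and the results list simply copies the labels and scores from the sample output above.

```python
# Results copied from the sample output above.
results = [
    {"text": "Hey, how's it going?", "label": "Greeting", "confidence": 0.14},
    {"text": "Can you help me?", "label": "Greeting", "confidence": 0.14},
    {"text": "Hi!", "label": "Greeting", "confidence": 0.25},
    {"text": "What time is it?", "label": "Question", "confidence": 0.33},
]

REVIEW_THRESHOLD = 0.2  # hypothetical cut-off; tune it for your own data

def split_by_confidence(items, threshold):
    """Return (accepted, needs_review) lists based on the confidence score."""
    accepted = [item for item in items if item["confidence"] >= threshold]
    needs_review = [item for item in items if item["confidence"] < threshold]
    return accepted, needs_review

accepted, needs_review = split_by_confidence(results, REVIEW_THRESHOLD)

print("Accepted:")
for item in accepted:
    print(f'  "{item["text"]}" => {item["label"]}')

print("Needs human review:")
for item in needs_review:
    print(f'  "{item["text"]}" => {item["label"]} ({item["confidence"]})')
```

With this threshold, “Hi!” and “What time is it?” are accepted, while the two messages scoring 0.14 are sent for review, including the doubtful Greeting guess for “Can you help me?”.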

Pseudo Code: Semi-Supervised Text Classifier with Confidence

1. Initialize labeled and unlabeled text examples

labeled_texts = list of texts with known labels
unlabeled_texts = list of texts with no labels

2. Define a function to preprocess text

function preprocess(text):
    convert text to lowercase
    remove punctuation
    split text into words
    for each word:
        if word ends with "ing", remove "ing"
        else if word ends with "ed", remove "ed"
        else if word ends with "es", remove "es"
        else if word ends with "s" and length > 3, remove "s"
    return set of cleaned/stemmed words

3. Define a function to find the best label and confidence

function best_label_and_confidence(unlabeled_text, labeled_texts):
    preprocess the unlabeled_text into a set of words
    initialize:
        best_label = None
        best_score = -1
        best_overlap = 0
        best_union = 1   # to avoid divide-by-zero

    for each labeled_item in labeled_texts:
        preprocess labeled_item's text into a set of words
        compute overlap = number of shared words between both sets
        compute union = total unique words from both sets
        if overlap > best_score:
            update best_score to overlap
            update best_label to current item's label
            save current overlap and union as best_overlap and best_union
    confidence = best_overlap / best_union
    return best_label and rounded confidence

4. Define the main labeling function

function label_unlabeled_texts(labeled_texts, unlabeled_texts):
    for each item in unlabeled_texts:
        call best_label_and_confidence with current text
        assign the returned label and confidence to the item
    return the updated unlabeled_texts

5. Run the labeling and print results

labeled_unlabeled = call label_unlabeled_texts with your labeled and unlabeled data

for each result in labeled_unlabeled:
    print: original text, guessed label, confidence score

Example Output:

"Hey, how's it going?" => Label: Greeting, Confidence: 0.14
"Can you help me?" => Label: Greeting, Confidence: 0.14

Semi-supervised Learning – Basic Math Concepts