Semi-Supervised Learning Example with Simple Python
Problem: Classify simple text messages as either “Greeting” or “Question”.
We’ll use:
- A few labeled text messages
- A few unlabeled messages
- A simple text-similarity measure that guesses the label of each unlabeled message by comparing it with the labeled ones
Python Code (Text-Based Semi-Supervised Learning)

    # Step 1: Labeled text data
    labeled_texts = [
        {"text": "Hello there", "label": "Greeting"},
        {"text": "Hi, how are you?", "label": "Greeting"},
        {"text": "What is your name?", "label": "Question"},
        {"text": "Where are you from?", "label": "Question"},
    ]

    # Step 2: Unlabeled text data
    unlabeled_texts = [
        {"text": "Hey, how's it going?"},
        {"text": "Can you help me?"},
        {"text": "Hi!"},
        {"text": "What time is it?"},
    ]

    # Step 3: Simple similarity function based on shared words
    def text_similarity(text1, text2):
        words1 = set(text1.lower().split())
        words2 = set(text2.lower().split())
        return len(words1 & words2)  # count of common words

    # Step 4: Semi-supervised text labeling
    def label_unlabeled_texts(labeled, unlabeled):
        for u_item in unlabeled:
            best_label = None
            best_score = -1
            for l_item in labeled:
                score = text_similarity(u_item["text"], l_item["text"])
                if score > best_score:  # strict ">" keeps the earliest example on ties
                    best_score = score
                    best_label = l_item["label"]
            u_item["label"] = best_label
        return unlabeled

    # Step 5: Run the labeling
    labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

    # Display the results
    print("Guessed labels for unlabeled text messages:")
    for item in labeled_unlabeled:
        print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

How it works:
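- It counts the words shared between the unlabeled message and each labeled example.
- The labeled example with the most shared words decides the label. On a tie, including the case where no words are shared at all, the first labeled example in the list wins.
- Punctuation stays attached to words at this stage, so “Hi!” does not match “Hi,”. Both “Hey, how’s it going?” and “Hi!” share no words with any labeled example and end up as Greeting only because a greeting is listed first.
- It’s like saying:
“This message shares words with a known greeting, so it’s probably a greeting too!”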
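To see the word matching in isolation, here is a small sketch that prints the shared words for one pair of messages at a time. The helper name `shared_words` is introduced only for illustration; it uses the same lowercase-and-split tokenization as `text_similarity`.

```python
def shared_words(text1, text2):
    """Return the set of whitespace-separated tokens that both texts share."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    return words1 & words2

# The bare token "you" appears in both messages, so it counts as shared.
print(shared_words("Can you help me?", "Where are you from?"))  # {'you'}

# Here the labeled message contains "you?" (with the question mark attached),
# which is a different token from "you", so nothing is shared.
print(shared_words("Can you help me?", "Hi, how are you?"))     # set()
```

This is why “Can you help me?” is matched to a question in this first version: the only labeled example it shares a token with is “Where are you from?”.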
Example Output:
Guessed labels for unlabeled text messages:
“Hey, how’s it going?” => Guessed Label: Greeting
“Can you help me?” => Guessed Label: Question
“Hi!” => Guessed Label: Greeting
“What time is it?” => Guessed Label: Question
We can upgrade our text-based semi-supervised learning example by adding:
1. Punctuation removal
2. Simple stemming (like turning “helping” into “help”)
3. Lowercasing everything (already done in the first version, now part of a single preprocessing step)
Upgraded Code: Smarter Text Matching (No External Libraries)

    import string

    # Step 1: Labeled text data
    labeled_texts = [
        {"text": "Hello there", "label": "Greeting"},
        {"text": "Hi, how are you?", "label": "Greeting"},
        {"text": "What is your name?", "label": "Question"},
        {"text": "Where are you from?", "label": "Question"},
    ]

    # Step 2: Unlabeled text data
    unlabeled_texts = [
        {"text": "Hey, how's it going?"},
        {"text": "Can you help me?"},
        {"text": "Hi!"},
        {"text": "What time is it?"},
    ]

    # Step 3: Preprocessing: lowercase, remove punctuation, simple stemming
    def preprocess(text):
        # Lowercase
        text = text.lower()
        # Remove punctuation
        text = ''.join(char for char in text if char not in string.punctuation)
        # Tokenize and simple stemming (remove common suffixes)
        words = text.split()
        stemmed_words = []
        for word in words:
            if word.endswith("ing"):
                word = word[:-3]
            elif word.endswith("ed"):
                word = word[:-2]
            elif word.endswith("es"):
                word = word[:-2]
            elif word.endswith("s") and len(word) > 3:  # ignore short words like "is"
                word = word[:-1]
            stemmed_words.append(word)
        return set(stemmed_words)

    # Step 4: Smarter similarity using preprocessed words
    def text_similarity(text1, text2):
        words1 = preprocess(text1)
        words2 = preprocess(text2)
        return len(words1 & words2)

    # Step 5: Semi-supervised text labeling
    def label_unlabeled_texts(labeled, unlabeled):
        for u_item in unlabeled:
            best_label = None
            best_score = -1
            for l_item in labeled:
                score = text_similarity(u_item["text"], l_item["text"])
                if score > best_score:
                    best_score = score
                    best_label = l_item["label"]
            u_item["label"] = best_label
        return unlabeled

    # Step 6: Run the labeling
    labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

    # Step 7: Show results
    print("Guessed labels for unlabeled text messages:")
    for item in labeled_unlabeled:
        print(f'"{item["text"]}" => Guessed Label: {item["label"]}')

What’s improved:
Feature | Benefit |
---|---|
Lowercasing | Avoids mismatch like “Hi” vs “hi” |
Punctuation removal | Avoids mismatch like “Hi!” vs “Hi” |
Basic stemming | Matches “helping” with “help”, “runs” with “run”, etc. |
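A quick way to check what the preprocessing does is to run it on a few inputs. The sketch below repeats the same suffix rules in two hypothetical helpers, `simple_stem` and `clean` (neither is part of the original code), so that it can run on its own. It returns a list rather than a set to keep the word order visible.

```python
import string

def simple_stem(word):
    """Strip at most one common suffix, using the same rules as preprocess()."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("ed"):
        return word[:-2]
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def clean(text):
    """Lowercase, drop punctuation, split into words, and stem each word."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [simple_stem(w) for w in text.split()]

print(clean("Hi!"))                   # ['hi']
print(clean("Hey, how's it going?"))  # ['hey', 'how', 'it', 'go']
print(clean("helping runs running"))  # ['help', 'run', 'runn']
```

The last line shows the limits of suffix stripping: “running” becomes “runn”, not “run”, so it would still not match “run”. A real stemmer handles doubled consonants and many other cases that these four rules ignore.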
Sample Output:
Guessed labels for unlabeled text messages:
“Hey, how’s it going?” => Guessed Label: Greeting
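“Can you help me?” => Guessed Label: Greeting
“Hi!” => Guessed Label: Greeting
“What time is it?” => Guessed Label: Question
Note that “Can you help me?” now comes out as Greeting. With punctuation removed, its word “you” matches both “Hi, how are you?” (Greeting) and “Where are you from?” (Question) with one shared word each, and the tie goes to the labeled example listed first.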
Next, we take the example one level further by adding a mini confidence score to each guessed label:
What is a Confidence Score?
It tells how sure our little text classifier is about its guess — on a simple scale from 0 to 1 (like 0% to 100%).
We’ll compute it like this:
- Compare the unlabeled text to all labeled examples.
- Find the most similar one (best match).
- Divide the number of words shared with the best match by the total number of unique words across both texts. This basic similarity ratio is known as the Jaccard similarity.
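As a worked example of this ratio, the sketch below compares two word sets by hand. The function name `jaccard` is an illustrative choice, and the two sets are what the preprocessing step from the upgraded code produces for “What time is it?” and “What is your name?”.

```python
def jaccard(words_a, words_b):
    """Shared words divided by all unique words across both sets."""
    union = words_a | words_b
    if not union:  # both sets empty: treat the ratio as 0
        return 0.0
    return len(words_a & words_b) / len(union)

unlabeled = {"what", "time", "is", "it"}   # "What time is it?"
labeled = {"what", "is", "your", "name"}   # "What is your name?"

# 2 shared words ("what", "is") out of 6 unique words overall
print(round(jaccard(unlabeled, labeled), 2))  # 0.33
```

A score of 1 would mean the two messages contain exactly the same words after preprocessing, and 0 means they share none.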
Final Upgraded Python Code (with Confidence Score)

    import string

    # Step 1: Labeled text data
    labeled_texts = [
        {"text": "Hello there", "label": "Greeting"},
        {"text": "Hi, how are you?", "label": "Greeting"},
        {"text": "What is your name?", "label": "Question"},
        {"text": "Where are you from?", "label": "Question"},
    ]

    # Step 2: Unlabeled text data
    unlabeled_texts = [
        {"text": "Hey, how's it going?"},
        {"text": "Can you help me?"},
        {"text": "Hi!"},
        {"text": "What time is it?"},
    ]

    # Step 3: Preprocess text: lowercase, remove punctuation, simple stemming
    def preprocess(text):
        text = text.lower()
        text = ''.join(c for c in text if c not in string.punctuation)
        words = text.split()
        stemmed = []
        for word in words:
            if word.endswith("ing"):
                word = word[:-3]
            elif word.endswith("ed"):
                word = word[:-2]
            elif word.endswith("es"):
                word = word[:-2]
            elif word.endswith("s") and len(word) > 3:
                word = word[:-1]
            stemmed.append(word)
        return set(stemmed)

    # Step 4: Get similarity score and label
    def best_label_and_confidence(text, labeled):
        words_u = preprocess(text)
        best_score = -1
        best_label = None
        best_overlap = 0
        best_union = 1  # avoid divide-by-zero
        for l_item in labeled:
            words_l = preprocess(l_item["text"])
            overlap = len(words_u & words_l)
            union = len(words_u | words_l)
            score = overlap  # for choosing best label
            if score > best_score:
                best_score = score
                best_label = l_item["label"]
                best_overlap = overlap
                best_union = union
        confidence = best_overlap / best_union
        return best_label, round(confidence, 2)

    # Step 5: Assign labels and confidence
    def label_unlabeled_texts(labeled, unlabeled):
        for u_item in unlabeled:
            label, conf = best_label_and_confidence(u_item["text"], labeled)
            u_item["label"] = label
            u_item["confidence"] = conf
        return unlabeled

    # Step 6: Run the labeling
    labeled_unlabeled = label_unlabeled_texts(labeled_texts, unlabeled_texts)

    # Step 7: Show results
    print("Guessed labels with confidence scores:")
    for item in labeled_unlabeled:
        print(f'"{item["text"]}" => Label: {item["label"]}, Confidence: {item["confidence"]}')

Sample Output:
Guessed labels with confidence scores:
“Hey, how’s it going?” => Label: Greeting, Confidence: 0.14
“Can you help me?” => Label: Greeting, Confidence: 0.14
“Hi!” => Label: Greeting, Confidence: 0.25
“What time is it?” => Label: Question, Confidence: 0.33
Summary:
Now our classifier tells us how sure it is, which can be helpful when:
- We want to flag low-confidence guesses.
- We want a human to double-check uncertain predictions.
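One possible way to act on these scores is to split the results by a cutoff. The sketch below assumes result dictionaries with `text`, `label`, and `confidence` keys, as produced by the code above. The `needs_review` helper and the 0.2 threshold are hypothetical choices for illustration, not values from the original example.

```python
def needs_review(items, threshold=0.2):
    """Return the items whose confidence falls below the threshold."""
    return [item for item in items if item["confidence"] < threshold]

# Results copied from the sample output of the confidence-score version
results = [
    {"text": "Hey, how's it going?", "label": "Greeting", "confidence": 0.14},
    {"text": "Can you help me?", "label": "Greeting", "confidence": 0.14},
    {"text": "Hi!", "label": "Greeting", "confidence": 0.25},
    {"text": "What time is it?", "label": "Question", "confidence": 0.33},
]

# With the assumed 0.2 cutoff, the two 0.14 guesses are flagged for a human.
for item in needs_review(results):
    print(f'Review: "{item["text"]}" (guessed {item["label"]}, confidence {item["confidence"]})')
```

A sensible threshold depends on the data. With messages this short, even correct matches score low, so the cutoff would need tuning on real examples.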
Pseudo Code: Semi-Supervised Text Classifier with Confidence
1. Initialize labeled and unlabeled text examples
labeled_texts = list of texts with known labels
unlabeled_texts = list of texts with no labels
2. Define a function to preprocess text

    function preprocess(text):
        convert text to lowercase
        remove punctuation
        split text into words
        for each word:
            if word ends with "ing", remove "ing"
            else if word ends with "ed", remove "ed"
            else if word ends with "es", remove "es"
            else if word ends with "s" and length > 3, remove "s"
        return set of cleaned/stemmed words

3. Define a function to find the best label and confidence

    function best_label_and_confidence(unlabeled_text, labeled_texts):
        preprocess the unlabeled_text into a set of words
        initialize:
            best_label = None
            best_score = -1
            best_overlap = 0
            best_union = 1   # to avoid divide-by-zero
        for each labeled_item in labeled_texts:
            preprocess labeled_item's text into a set of words
            compute overlap = number of shared words between both sets
            compute union = total unique words from both sets
            if overlap > best_score:
                update best_score to overlap
                update best_label to current item's label
                save current overlap and union as best_overlap and best_union
        confidence = best_overlap / best_union
        return best_label and rounded confidence

4. Define the main labeling function

    function label_unlabeled_texts(labeled_texts, unlabeled_texts):
        for each item in unlabeled_texts:
            call best_label_and_confidence with current text
            assign the returned label and confidence to the item
        return the updated unlabeled_texts

5. Run the labeling and print results

    call label_unlabeled_texts with your labeled and unlabeled data
    for each result in labeled_unlabeled_texts:
        print: original text, guessed label, confidence score

Example Output:
“Hey, how’s it going?” => Label: Greeting, Confidence: 0.14
“Can you help me?” => Label: Greeting, Confidence: 0.14
Semi-supervised Learning – Basic Math Concepts