Scikit-learn Primary Concepts
A. Basic Level – Getting Started
1. Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
# Generate sample data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Model creation
model = LinearRegression()
model.fit(X_train, y_train)
# Prediction
y_pred = model.predict(X_test)
# Output
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
Concepts: Model fitting, train/test split, coefficients.
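A fitted regression model is usually scored on the held-out split as well. A minimal sketch, regenerating the same synthetic data as above and adding the standard regression metrics (MSE and R²):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

# Same synthetic data as above
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R^2:", r2)
```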
2. Classification using k-NN
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Train model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# Predict
print("Prediction:", knn.predict([[5.1, 3.5, 1.4, 0.2]]))
Concepts: Supervised learning, classification, k-nearest neighbors.
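Note that fitting and predicting on the same data overstates performance. A sketch of the same k-NN classifier evaluated on a held-out split instead (the split parameters here are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Stratify so each class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print("Held-out accuracy:", acc)
```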
B. Intermediate Level – Real Data and Pipelines
3. Data Preprocessing + Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
# Load real dataset
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
df = df[['age', 'fare', 'survived']].dropna()
X = df[['age', 'fare']]
y = df['survived']
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Create pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
# Evaluate
print(classification_report(y_test, pipe.predict(X_test)))
Concepts: Pipelines, scaling, real-world datasets, classification metrics.
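Real tables usually mix numeric and categorical columns, which a plain StandardScaler cannot handle alone. A sketch using ColumnTransformer to route columns to different preprocessors; the tiny DataFrame here is made up for illustration (not the Titanic data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative dataset (values invented for the sketch)
df = pd.DataFrame({
    'age':  [22, 38, 26, 35, 28, 54, 2, 27],
    'fare': [7.25, 71.3, 7.9, 53.1, 8.05, 51.9, 21.1, 11.1],
    'sex':  ['male', 'female', 'female', 'female',
             'male', 'male', 'female', 'male'],
    'survived': [0, 1, 1, 1, 0, 0, 1, 0],
})
X, y = df[['age', 'fare', 'sex']], df['survived']

# Scale the numeric columns, one-hot encode the categorical one
pre = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
clf = Pipeline([('pre', pre), ('model', LogisticRegression())])
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```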
4. GridSearchCV for Hyperparameter Tuning
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
# SVM + grid search (reusing the Titanic train/test split from example 3)
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
Concepts: Model tuning, grid search, cross-validation.
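After fitting, GridSearchCV keeps a refit copy of the best model (`best_estimator_`), so the grid object can be used directly for prediction and scoring. A self-contained sketch on the iris data rather than the Titanic split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X_train, y_train)

# The refit best estimator is used automatically when scoring
print("Best parameters:", grid.best_params_)
print("Test accuracy of best model:", grid.score(X_test, y_test))
```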
C. Advanced Level – Model Stacking and Feature Engineering
5. Feature Selection with RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Fit a random forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# Select features whose importance is at least the median;
# prefit=True tells SelectFromModel the forest is already fitted
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_new = selector.transform(X_train)
print("Reduced feature shape:", X_new.shape)
Concepts: Feature selection, model-based filtering.
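To see which columns survived the cut, `get_support()` returns a boolean mask over the original features. A self-contained sketch on synthetic data where only a few features are informative (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# prefit=True reuses the already-fitted forest
selector = SelectFromModel(rf, threshold="median", prefit=True)
mask = selector.get_support()          # boolean mask of kept features
X_new = selector.transform(X)
print("Kept feature indices:", mask.nonzero()[0])
print("Reduced shape:", X_new.shape)
```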
6. Ensemble Voting with VotingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Create multiple classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = SVC(probability=True)
# Combine with voting
ensemble = VotingClassifier(estimators=[
('lr', clf1), ('dt', clf2), ('svc', clf3)], voting='soft')
ensemble.fit(X_train, y_train)
print("Ensemble Accuracy:", ensemble.score(X_test, y_test))
Concepts: Ensemble learning, soft voting, model fusion.
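Soft voting averages the base models' predicted probabilities; stacking proper instead trains a meta-model on the base models' outputs. A sketch with scikit-learn's StackingClassifier, run on iris for self-containment (the choice of base and final estimators is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners' cross-validated predictions feed a final logistic regression
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```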
7. Pipeline with Imputation + Scaling + Model
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', MinMaxScaler()),
('classifier', GradientBoostingClassifier())
])
pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))
Concepts: Full pipeline with missing value handling.
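Pipelines also compose cleanly with cross-validation: the imputer and scaler are refit inside each fold, so no statistics leak from the validation portion. A sketch on synthetic data with missing values injected to exercise the imputer (the dataset and missing-rate are made-up choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Inject ~10% missing values so the imputer has work to do
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('classifier', GradientBoostingClassifier()),
])

# Each fold imputes and scales using only its own training portion
scores = cross_val_score(pipeline, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```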
Next – Feedforward Neural Networks (FNN): Neurons in the Hidden Layer
