Scikit-learn Primary Concepts
A. Basic Level – Getting Started
1. Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Generate sample data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model creation
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Output
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
Concepts: Model fitting, train/test split, coefficients.
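A quick way to sanity-check the fit is to score the held-out predictions. A minimal sketch, reusing y_test and y_pred from the example above with scikit-learn's built-in regression metrics:

from sklearn.metrics import mean_squared_error, r2_score

# Score the held-out predictions against the true targets
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))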
2. Classification using k-NN
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict
print("Prediction:", knn.predict([[5.1, 3.5, 1.4, 0.2]]))
Concepts: Supervised learning, classification, k-nearest neighbors.
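The example above fits and predicts on the full dataset, so it gives no held-out measure of accuracy, and n_neighbors=3 is simply asserted. One common way to pick k is cross-validation; a minimal sketch, reusing X and y from the iris example:

from sklearn.model_selection import cross_val_score

# Score a few candidate k values with 5-fold cross-validation
for k in [1, 3, 5, 7]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")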
B. Intermediate Level – Real Data and Pipelines
3. Data Preprocessing + Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Load real dataset
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
df = df[['age', 'fare', 'survived']].dropna()
X = df[['age', 'fare']]
y = df['survived']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Evaluate
print(classification_report(y_test, pipe.predict(X_test)))
Concepts: Pipelines, scaling, real-world datasets, classification metrics.
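classification_report gives per-class precision, recall, and F1. Two optional follow-ups, sketched below under the assumption that pipe, X, and y from the example above are in scope: a confusion matrix on the held-out split, and a cross-validated accuracy of the whole pipeline. Because the scaler lives inside the pipeline, it is refit on each training fold, so no test-set statistics leak into the scaling.

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

# Confusion matrix on the held-out split
print(confusion_matrix(y_test, pipe.predict(X_test)))

# Cross-validate the whole pipeline (scaler refit per training fold)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())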
4. GridSearchCV for Hyperparameter Tuning
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Reuses X_train and y_train from the Titanic split in example 3
params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# SVM + Grid Search
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
Concepts: Model tuning, grid search, cross-validation.
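An exhaustive grid gets expensive as the parameter space grows. One common alternative is RandomizedSearchCV, sketched below under the same X_train/y_train assumption: C is sampled from a log-uniform distribution rather than fixed to a grid, and n_iter caps how many candidates are actually tried.

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Sample C from a log-uniform distribution instead of a fixed grid
param_dist = {'C': loguniform(1e-2, 1e2), 'kernel': ['linear', 'rbf']}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X_train, y_train)
print("Best Parameters:", search.best_params_)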
C. Advanced Level – Model Stacking and Feature Engineering
5. Feature Selection with RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a random forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# Select features based on importance; prefit=True reuses the fitted
# forest instead of requiring a second fit of the selector
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_new = selector.transform(X_train)
print("Reduced feature shape:", X_new.shape)
Concepts: Feature selection, model-based filtering.
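Transforming X_train by hand like this works, but the selection step can also live inside a pipeline, where it is fit on training data only. A minimal sketch of that alternative, reusing the Titanic split (which has just two features, so threshold="median" keeps only one; the example is purely illustrative):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# The selector is fit on training data only, so feature choice
# never sees the test set
select_pipe = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100),
                               threshold="median")),
    ('clf', LogisticRegression())
])
select_pipe.fit(X_train, y_train)
print("Accuracy:", select_pipe.score(X_test, y_test))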
6. Model Stacking with VotingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Create multiple classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = SVC(probability=True)  # probability=True is required for soft voting

# Combine with voting
ensemble = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft')
ensemble.fit(X_train, y_train)
print("Ensemble Accuracy:", ensemble.score(X_test, y_test))
Concepts: Ensemble learning, soft voting, model fusion.
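Strictly speaking, VotingClassifier averages its base models' predictions rather than stacking them: true stacking trains a meta-learner on the base models' cross-validated predictions. A minimal sketch with scikit-learn's StackingClassifier, reusing the same split:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Base learners' cross-validated predictions become the
# meta-learner's training inputs
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Stacking Accuracy:", stack.score(X_test, y_test))

Voting is simpler and often good enough when the base models are already strong and diverse; stacking lets the meta-learner learn how much to trust each base model.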
7. Pipeline with Imputation + Scaling + Model
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('classifier', GradientBoostingClassifier())
])
pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))
Concepts: Full pipeline with missing value handling.
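Real tables usually mix numeric and categorical columns, which need different preprocessing. One way to handle that is ColumnTransformer; the sketch below reloads the Titanic CSV to keep the sex column (present in the seaborn dataset but dropped in example 3) and routes numeric and categorical columns through separate branches.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Reload to keep the categorical 'sex' column this time
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
X = df[['age', 'fare', 'sex']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Numeric and categorical columns get separate preprocessing branches
preprocess = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', MinMaxScaler())]), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex'])
])
full_pipe = Pipeline([
    ('preprocess', preprocess),
    ('classifier', GradientBoostingClassifier())
])
full_pipe.fit(X_train, y_train)
print("Accuracy:", full_pipe.score(X_test, y_test))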
Next – FNN Neuron in Hidden Layer