Part 2: Machine Learning#

This final part is an introduction to the very broad topic of machine learning, mainly via Python’s Scikit-Learn package. You can think of machine learning as a class of algorithms that allow a program to detect particular patterns in a dataset, and thus “learn” from the data to draw inferences from it. This is not meant to be a comprehensive introduction to the field of machine learning; that is a large subject that necessitates a more technical approach than we take here. Nor is it meant to be a comprehensive manual for the use of the Scikit-Learn package (for that, you can refer to the resources listed on the first page of the course). Rather, the goals here are:

  • To introduce the fundamental vocabulary and concepts of machine learning

  • To introduce the Scikit-Learn API and show some examples of its use

  • To take a deeper dive into the details of several of the more important classical machine learning approaches, and develop an intuition into how they work and when and where they are applicable

Much of this material is drawn from the Scikit-Learn tutorials and workshops I have given on several occasions at PyCon, SciPy, PyData, and other conferences. Any clarity in the following pages is likely due to the many workshop participants and co-instructors who have given me valuable feedback on this material over the years!

A list of machine learning methods#

1. Supervised Learning#

Classification#

  • Linear Models: Logistic Regression (sketched below), Perceptron

  • Tree-Based: Decision Tree (CART, ID3, C4.5), Random Forest, Gradient Boosting (XGBoost, LightGBM, CatBoost)

  • Kernel-Based: SVM (Linear/RBF/Poly Kernels), Kernel Logistic Regression

  • Probabilistic: Naïve Bayes (Gaussian, Multinomial), Bayesian Networks

  • Neural Networks: MLP, CNN (for images and other grid-structured data), Transformers (e.g., BERT for text)
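
All of the classifiers listed above can be driven through the same Scikit-Learn estimator interface: instantiate, fit, predict (or score). As a minimal sketch on synthetic data (the hyperparameters here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two of the families above, driven through the same fit/score interface
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```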

Regression#

  • Linear: Linear Regression, Ridge (L2), Lasso (L1), Elastic Net (L1+L2); Ridge and Lasso are sketched below

  • Nonlinear: Polynomial Regression, SVR (Support Vector Regression), Gaussian Processes

  • Tree-Based: Regression Trees, Random Forest Regressor, XGBoost Regressor
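
The regularized linear models differ mainly in their penalty term, and a feature transform such as PolynomialFeatures turns them into simple nonlinear regressors. A minimal sketch on synthetic data, with illustrative alpha values:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noisy sine curve: a nonlinear target for a feature-transformed linear model
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# Degree-5 polynomial features feed two differently penalized linear models:
# Ridge (L2) shrinks all coefficients; Lasso (L1) drives some exactly to zero
for reg in (Ridge(alpha=1.0), Lasso(alpha=0.01)):
    model = make_pipeline(PolynomialFeatures(degree=5), StandardScaler(), reg)
    model.fit(X, y)
    print(type(reg).__name__, model.score(X, y))  # R² on the training data
```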

Evaluation Metrics#

  • Classification: Accuracy, Precision/Recall, F1-Score, ROC-AUC, Log-Loss

  • Regression: MSE, RMSE, MAE, R², Adjusted R²
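
All of these metrics live in sklearn.metrics and take (y_true, y_pred) arrays. A small worked example with hand-picked labels:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification metrics
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
print(accuracy_score(y_true, y_pred))   # 5/6 correct
print(precision_score(y_true, y_pred))  # 1.0: no false positives
print(recall_score(y_true, y_pred))     # 0.75: one true positive missed
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

# Regression metrics
y_true_r = np.array([2.0, 3.0, 5.0])
y_pred_r = np.array([2.5, 2.5, 5.5])
print(mean_squared_error(y_true_r, y_pred_r))  # MSE; RMSE is its square root
print(r2_score(y_true_r, y_pred_r))            # fraction of variance explained
```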


2. Unsupervised Learning#

Clustering#

  • Centroid-Based: K-Means (with k-means++ initialization; sketched below), K-Medoids (PAM), Fuzzy C-Means

  • Hierarchical: Agglomerative (single, complete, or average linkage), Divisive

  • Density-Based: DBSCAN (sketched below), along with the related OPTICS and HDBSCAN

  • Distribution-Based: Gaussian Mixture Models (GMM)

  • Spectral Clustering: Graph-cut methods (e.g., Normalized Cuts)
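
To make the contrast between these families concrete, here is a minimal sketch comparing a centroid-based and a density-based method on synthetic blobs (eps and min_samples are illustrative):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Centroid-based: the number of clusters must be chosen up front
# (k-means++ initialization is Scikit-Learn's default)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density-based: the cluster count is inferred; label -1 marks noise points
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
```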

Dimensionality Reduction#

  • Linear: PCA (sketched below), Factor Analysis

  • Nonlinear: t-SNE, UMAP, Autoencoders (Variational, Denoising)
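
As a minimal linear example, PCA projecting the four iris features down to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features
pca = PCA(n_components=2).fit(X)      # find the top 2 principal components
X_2d = pca.transform(X)               # shape (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```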

Association Rules#

  • Apriori, FP-Growth, Eclat
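
Scikit-Learn does not implement association-rule mining; a common choice is the third-party mlxtend package. The sketch below assumes mlxtend is installed (its exact API can vary between versions):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Four toy market-basket transactions
transactions = [["milk", "bread"], ["milk", "eggs"],
                ["bread", "eggs"], ["milk", "bread", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets above a support threshold, then rules above a confidence threshold
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```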

Evaluation Metrics#

  • Clustering: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index

  • Dimensionality Reduction: Reconstruction Error (for autoencoders), Explained Variance (PCA)
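
Because clustering usually has no ground-truth labels, these indices score a labeling against the data itself. A minimal sketch computing all three:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # in [-1, 1]; higher is better
print(davies_bouldin_score(X, labels))     # lower is better
print(calinski_harabasz_score(X, labels))  # higher is better
```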


3. Semi-Supervised Learning#

  • Self-Training: Iterative labeling (e.g., Yarowsky Algorithm)

  • Graph-Based: Label Propagation (sketched below), Laplacian SVM

  • Deep Learning: Semi-Supervised GANs (e.g., SGAN), Pseudo-Labeling

  • Co-Training: Multi-view learning
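
Scikit-Learn ships several of these in sklearn.semi_supervised, with the convention that unlabeled points carry the label -1. A minimal Label Propagation sketch on synthetic data (the 90% masking rate is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=300, random_state=0)

# Scikit-Learn's convention: unlabeled points carry the label -1
rng = np.random.default_rng(0)
y_partial = y.copy()
hidden = rng.random(len(y)) < 0.9  # hide ~90% of the labels
y_partial[hidden] = -1

model = LabelPropagation().fit(X, y_partial)
# transduction_ holds the labels inferred for every training point
print((model.transduction_[hidden] == y[hidden]).mean())
```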

Evaluation Metrics#

  • Adjusted Rand Index (if ground truth is partially known), Purity
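
The Adjusted Rand Index compares two labelings while ignoring how the clusters happen to be numbered. A tiny example:

```python
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]
found_labels = [1, 1, 0, 0, 2, 2]  # same grouping, different cluster numbering
print(adjusted_rand_score(true_labels, found_labels))  # 1.0: ARI ignores numbering
```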


4. Self-Supervised Learning (Emerging)#

  • Pretext Tasks: rotation prediction (RotNet), masked prediction (e.g., BERT for text, MAE for images)

  • Contrastive Learning: SimCLR, MoCo, BYOL (the underlying loss is sketched below)

  • Applications: Pretraining for NLP/CV (e.g., GPT, Vision Transformers)
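
Contrastive methods such as SimCLR optimize an InfoNCE-style objective: embeddings of two augmentations of the same input are pulled together and pushed away from the rest of the batch. Below is a simplified, one-directional NumPy sketch of that loss; the info_nce helper is hypothetical, and the real SimCLR loss is symmetrized over both views:

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """One-directional InfoNCE loss: z1[i] should match z2[i] (hypothetical helper)."""
    # Normalize rows so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / tau  # (batch, batch) similarity matrix
    # Softmax cross-entropy with the matching pair (the diagonal) as the target
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                 # 8 embeddings in 16 dimensions
noisy = z + 0.01 * rng.normal(size=z.shape)  # stand-in for a second augmented view
print(info_nce(noisy, z))  # well below the random-pairing baseline of log(8)
```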


Key Notes:#

  1. Scope: This is a subset of ML methods; real-world projects often combine techniques (e.g., PCA + K-Means, as sketched after this list).

  2. Choice Depends On: Data type (tabular, image, text), problem goals, and computational resources.

  3. Evaluation: Always match metrics to the task (e.g., Silhouette for clustering, F1 for imbalanced classification).
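
As an example of note 1, PCA and K-Means compose naturally in a Scikit-Learn Pipeline; the sketch below (with an arbitrary choice of 10 dimensions and 10 clusters) clusters the digits data in a reduced space:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = load_digits().data  # 1797 samples, 64 pixel features

# Reduce to 10 dimensions, then cluster in the reduced space
pipeline = make_pipeline(PCA(n_components=10),
                         KMeans(n_clusters=10, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)
```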