Part 2: Machine Learning#
This final part is an introduction to the very broad topic of machine learning, mainly via Python’s Scikit-Learn package. You can think of machine learning as a class of algorithms that allow a program to detect particular patterns in a dataset, and thus “learn” from the data to draw inferences from it. This is not meant to be a comprehensive introduction to the field of machine learning; that is a large subject and necessitates a more technical approach than we take here. Nor is it meant to be a comprehensive manual for the use of the Scikit-Learn package (for this, you can refer to the resources listed on the first page of the course). Rather, the goals here are:
To introduce the fundamental vocabulary and concepts of machine learning
To introduce the Scikit-Learn API and show some examples of its use
To take a deeper dive into the details of several of the more important classical machine learning approaches, and develop an intuition into how they work and when and where they are applicable
Much of this material is drawn from the Scikit-Learn tutorials and workshops I have given on several occasions at PyCon, SciPy, PyData, and other conferences. Any clarity in the following pages is likely due to the many workshop participants and co-instructors who have given me valuable feedback on this material over the years!
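Before diving into that vocabulary, here is a minimal preview of the Scikit-Learn estimator API that the rest of this part builds on: instantiate a model, fit it to data, and predict for new inputs. The synthetic data and the choice of LinearRegression below are placeholders rather than part of any particular lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy line (stand-in for any feature matrix X and target y)
rng = np.random.RandomState(42)
X = rng.rand(50, 1) * 10               # features: shape (n_samples, n_features)
y = 2.5 * X.ravel() + rng.randn(50)    # target: shape (n_samples,)

# The common Scikit-Learn pattern: instantiate, fit, predict
model = LinearRegression()
model.fit(X, y)                        # learn model parameters from the data
y_new = model.predict([[4.0]])         # apply the fitted model to new data
print(model.coef_, model.intercept_, y_new)
```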
A list of machine learning methods#
1. Supervised Learning#
Classification#
Linear Models: Logistic Regression, Perceptron
Tree-Based: Decision Tree (CART, ID3, C4.5), Random Forest, Gradient Boosting (XGBoost, LightGBM, CatBoost)
Kernel-Based: SVM (Linear/RBF/Poly Kernels), Kernel Logistic Regression
Probabilistic: Naïve Bayes (Gaussian, Multinomial), Bayesian Networks
Neural Networks: MLP, CNN (for grid-structured data such as images), Transformers (e.g., BERT for text)
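To make this list concrete, here is a minimal sketch that fits two of the classifiers named above (logistic regression and a random forest) to the Iris dataset bundled with Scikit-Learn; the train/test split and hyperparameter settings are illustrative, not tuned.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small labeled dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two of the classifiers listed above, with near-default settings
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))  # mean accuracy on held-out data
```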
Regression#
Linear: Linear Regression, Ridge (L2), Lasso (L1), Elastic Net (L1+L2)
Nonlinear: Polynomial Regression, SVR (Support Vector Regression), Gaussian Processes
Tree-Based: Regression Trees, Random Forest Regressor, XGBoost Regressor
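The regression side follows the same pattern; this sketch compares ordinary least squares with its Ridge (L2) and Lasso (L1) variants on synthetic data in which only two of ten features matter. The alpha values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem with several irrelevant features
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(200)   # only two features matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain least squares vs. L2 (Ridge) and L1 (Lasso) regularization
for reg in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    reg.fit(X_train, y_train)
    print(type(reg).__name__, round(reg.score(X_test, y_test), 3))  # R^2 on held-out data
```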
Evaluation Metrics#
Classification: Accuracy, Precision/Recall, F1-Score, ROC-AUC, Log-Loss
Regression: MSE, RMSE, MAE, R², Adjusted R²
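Most of these metrics are available as plain functions in sklearn.metrics. The sketch below assumes you already have true labels/values and model predictions in hand; the arrays here are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             r2_score, roc_auc_score)

# --- Classification metrics (made-up labels and predicted probabilities) ---
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3])   # predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)                 # hard labels at a 0.5 threshold

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))     # uses scores, not hard labels

# --- Regression metrics (made-up values) ---
y_true_r = np.array([2.0, 3.5, 4.0, 5.5])
y_pred_r = np.array([2.2, 3.0, 4.3, 5.0])

mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE:", mse, "RMSE:", np.sqrt(mse), "R^2:", r2_score(y_true_r, y_pred_r))
```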
2. Unsupervised Learning#
Clustering#
Centroid-Based: K-Means (K-Means++), K-Medoids (PAM), Fuzzy C-Means
Hierarchical: Agglomerative (Single/Complete/Average Linkage), Divisive
Density-Based: DBSCAN and its extensions (OPTICS, HDBSCAN)
Distribution-Based: Gaussian Mixture Models (GMM)
Spectral Clustering: Graph-cut methods (e.g., Normalized Cuts)
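Here is a sketch that applies one method from three of the clustering families above (centroid-, density-, and distribution-based) to the same synthetic blobs; parameter choices such as n_clusters=3 and eps=0.8 are tied to this toy data, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # centroid-based
labels_db = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)                    # density-based (-1 = noise)
labels_gm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)   # distribution-based

print(set(labels_km), set(labels_db), set(labels_gm))
```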
Dimensionality Reduction#
Linear: PCA, Factor Analysis
Nonlinear: t-SNE, UMAP, Autoencoders (Variational, Denoising)
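For dimensionality reduction, this minimal sketch projects the 64-dimensional digits data onto two principal components; nonlinear methods such as t-SNE (sklearn.manifold.TSNE) follow the same fit_transform pattern, though t-SNE cannot transform new, unseen points.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 handwritten digit images flattened to 64 features
X, y = load_digits(return_X_y=True)

# Project onto 2 components, e.g. for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```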
Association Rules#
Apriori, FP-Growth, Eclat
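Association-rule mining is not part of Scikit-Learn; the sketch below assumes the third-party mlxtend package and its apriori function, applied to a small, made-up set of one-hot encoded transactions (API details may vary between mlxtend versions).

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori  # third-party package, not part of Scikit-Learn

# One-hot encoded transactions: each row is a basket, each column an item
transactions = pd.DataFrame(
    [
        {"milk": 1, "bread": 1, "butter": 0},
        {"milk": 1, "bread": 1, "butter": 1},
        {"milk": 0, "bread": 1, "butter": 1},
        {"milk": 1, "bread": 0, "butter": 0},
    ],
    dtype=bool,
)

# Frequent itemsets appearing in at least half of the baskets
itemsets = apriori(transactions, min_support=0.5, use_colnames=True)
print(itemsets)
# mlxtend.frequent_patterns.association_rules can then filter these itemsets
# into rules ranked by confidence, lift, and similar measures.
```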
Evaluation Metrics#
Clustering: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index
Dimensionality Reduction: Reconstruction Error (for autoencoders), Explained Variance (PCA)
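These clustering scores are internal metrics: they need only the data and the cluster labels, not ground truth. This sketch evaluates a K-Means result on toy blobs and reads off PCA's explained variance; the data and parameter values are again arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Higher silhouette/Calinski-Harabasz and lower Davies-Bouldin indicate
# tighter, better-separated clusters
print("silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))

# For PCA, "explained variance" reports how much variance each component keeps
print("PCA explained variance ratio:", PCA(n_components=2).fit(X).explained_variance_ratio_)
```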
3. Semi-Supervised Learning#
Self-Training: Iterative labeling (e.g., Yarowsky Algorithm)
Graph-Based: Label Propagation, Laplacian SVM
Deep Learning: Semi-Supervised GANs (e.g., SGAN), Pseudo-Labeling
Co-Training: Multi-view learning
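Scikit-Learn implements a few of these approaches in sklearn.semi_supervised, using the convention that unlabeled samples carry the label -1. In the sketch below, 90% of the Iris labels are hidden (an arbitrary fraction) before label spreading and self-training are applied.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Hide 90% of the labels; unlabeled samples are marked with -1 by convention
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

# Graph-based label spreading over a k-nearest-neighbors graph
spread = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("LabelSpreading accuracy vs. true labels:", spread.score(X, y))

# Self-training wraps any classifier that exposes predict_proba
self_train = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
print("Self-training accuracy vs. true labels:", self_train.score(X, y))
```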
Evaluation Metrics#
Adjusted Rand Index (if ground truth is partially known), Purity
4. Self-Supervised Learning (Emerging)#
Pretext Tasks: RotNet, Masked Autoencoders (e.g., BERT, MAE)
Contrastive Learning: SimCLR, MoCo, BYOL
Applications: Pretraining for NLP/CV (e.g., GPT, Vision Transformers)
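These methods are normally built with a deep learning framework, which is outside the scope of this part. Purely to illustrate the idea behind contrastive learning, the NumPy sketch below computes the normalized temperature-scaled cross-entropy (NT-Xent) loss used by SimCLR on made-up embedding vectors; the embeddings, batch size, and temperature are all arbitrary.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent contrastive loss for two batches of embeddings.

    z1[i] and z2[i] are two augmented "views" of the same sample; every other
    embedding in the combined batch acts as a negative.
    """
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                        # pairwise similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative

    n = len(z1)
    # Index of each row's positive partner: i <-> i + n
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Cross-entropy of the positive pair against all other pairs, per row
    log_prob = sim[np.arange(2 * n), positives] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# Made-up 8-dimensional embeddings for a batch of 4 samples, two views each
rng = np.random.RandomState(0)
z1 = rng.randn(4, 8)
z2 = z1 + 0.1 * rng.randn(4, 8)   # second view: a small perturbation of the first
print("NT-Xent loss:", nt_xent_loss(z1, z2))
```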
Key Notes#
Scope: This is a subset of ML methods; real-world projects often combine techniques (e.g., PCA + K-Means; see the sketch after these notes).
Choice Depends On: Data type (tabular, image, text), problem goals, and computational resources.
Evaluation: Always match metrics to the task (e.g., Silhouette for clustering, F1 for imbalanced classification).
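As an example of the "combine techniques" note above, this sketch chains PCA and K-Means in a Scikit-Learn Pipeline on the digits data; reducing to 10 components and asking for 10 clusters are illustrative choices, not tuned values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

X, _ = load_digits(return_X_y=True)

# Reduce 64 features to 10 principal components, then cluster in the reduced space
pipeline = Pipeline([
    ("pca", PCA(n_components=10, random_state=0)),
    ("kmeans", KMeans(n_clusters=10, n_init=10, random_state=0)),
])
labels = pipeline.fit_predict(X)
print(labels[:20])
```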