1. Python for ML/AI
Why Python?
Setup
Install Python.
Install packages (numpy, pandas, scipy, matplotlib, seaborn, sklearn).
IPython setup.
Introduction
Keywords and Identifiers
Statements, Indentation and Comments
Variables and Datatypes
Input and Output
Operators
Flow Control
If...else
while loop
for loop
break and continue
Data Structures
Lists
Tuples
Dictionary
Strings
Sets
Functions
Introduction
Types of functions
Function Arguments
Recursive Functions
Lambda Functions
Modules
Packages
File Handling
Exception Handling
Debugging Python
NumPy
Introduction to NumPy.
Numerical operations.
Matplotlib
Pandas
Getting started with pandas
Data Frame Basics
Key Operations on Data Frames.
Computational Complexity: an Introduction
Space and Time Complexity: Find largest number in a list
Binary search
Find elements common in two lists.
Find elements common in two lists using a Hashtable/Dict
Further reading about Computational Complexity.
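As an illustration of the two "elements common in two lists" topics above, here is a minimal sketch contrasting a nested scan with a dict used as a hash table. The function names and sample lists are made up for this example and are not taken from the course material.

```python
def common_elements_naive(a, b):
    """Worst case O(len(a) * len(b)): scan list b for every element of a."""
    result = []
    for x in a:
        if x in b and x not in result:   # both checks are linear list scans
            result.append(x)
    return result


def common_elements_hashed(a, b):
    """Average case O(len(a) + len(b)): one hash lookup per element of a."""
    seen_in_b = {x: True for x in b}     # dict used as a hash table
    emitted = set()                      # avoids reporting duplicates twice
    result = []
    for x in a:
        if x in seen_in_b and x not in emitted:
            result.append(x)
            emitted.add(x)
    return result


list_a = [3, 1, 4, 1, 5, 9, 2, 6]
list_b = [2, 7, 1, 8, 2, 8, 1, 8]
print(common_elements_naive(list_a, list_b))   # [1, 2]
print(common_elements_hashed(list_a, list_b))  # [1, 2]
```

Both versions return the same result; the hashed version trades extra memory for the dict against far fewer comparisons on long lists.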
2. Plotting for exploratory data analysis (EDA)
Iris dataset
Data-point, vector, observation
Dataset
Input variables/features/dimensions/independent variable
Output variable/class label/response label/dependent variable
Objective: Classification.
Scatter-plot: 2D, 3D.
Pair plots.
PDF, CDF, Univariate analysis.
Histogram and PDF
Univariate analysis using PDFs.
Cumulative distribution function (CDF)
Mean, Variance, Std-dev
Median, Percentiles, Quantiles, IQR, MAD and Outliers.
Box-plot with whiskers
Violin plots.
Summarizing plots.
Univariate, Bivariate and Multivariate analysis.
Multivariate probability density, contour plot.
Exercise: Perform EDA on Haberman dataset.
3. Probability and Statistics
Introduction to Probability and Stats
Why learn it?
P(X=x1), Dice and coin example
Random variables: discrete and continuous.
Outliers (or) extreme points.
Population & Sample.
Gaussian/Normal Distribution
Examples: Heights and weights.
Why learn about distributions.
Mu, sigma: Parameters
PDF (iris dataset)
CDF
1-std-dev, 2-std-dev, 3-std-dev range.
Symmetric distribution, Skewness and Kurtosis
Standard normal variate (z) and standardization.
Kernel density estimation.
Sampling distribution & Central Limit theorem.
Q-Q Plot: Is a given random variable Gaussian distributed?
How to randomly sample data points. [UniformDisb.ipynb]
Bernoulli and Binomial distribution
Log-normal and power law distribution:
Log-normal: CDF, PDF, Examples.
Power-law & Pareto distributions: PDF, examples
Converting power law distributions to normal: Box-Cox/Power transform.
Correlation
Co-variance
Pearson Correlation Coefficient
Spearman Rank Correlation Coefficient
Correlation vs Causation
Confidence Intervals
Confidence Interval vs Point estimate.
Computing confidence-interval given a distribution.
For mean of a random variable
Known Standard-deviation: using CLT
Unknown Standard-deviation: using t-distribution
Confidence Interval using empirical bootstrap [BootstrapCI.ipynb]
Hypothesis testing
Hypothesis Testing methodology, Null-hypothesis, test-statistic, p-value.
Resampling and permutation test.
K-S Test for similarity of two distributions.
Code Snippet [KSTest.ipynb]
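For orientation, the two-sample K-S test statistic can be sketched in a few lines of plain Python. This is not the content of KSTest.ipynb; it is a minimal illustration under the assumption that only the statistic (the largest gap between the two empirical CDFs) is needed. The function name and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))      # 0.0 (identical samples)
print(ks_statistic([1, 2, 3, 4], [11, 12, 13, 14]))  # 1.0 (no overlap at all)
```

A statistic near 0 suggests similar distributions and a value near 1 suggests very different ones. Turning the statistic into a p-value is usually left to a library routine such as `scipy.stats.ks_2samp`.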
4. Linear Algebra
Why learn it?
Fundamentals
Point/Vector (2-D, 3-D, n-D)
Dot product and angle between 2 vectors.
Projection, unit vector
Equation of a line (2-D), plane(3-D) and hyperplane (n-D)
Distance of a point from a plane/hyperplane, half-spaces
Equation of a circle (2-D), sphere (3-D) and hypersphere (n-D)
Equation of an ellipse (2-D), ellipsoid (3-D) and hyperellipsoid (n-D)
Square, Rectangle, Hyper-cube and Hyper-cuboid.
5. Dimensionality reduction and Visualization:
What is dimensionality reduction?
Data representation and pre-processing
Row vector, Column vector: Iris dataset example.
Represent a dataset: D= {x_i, y_i}
Represent a dataset as a Matrix.
Data preprocessing: Column Normalization
Mean of a data matrix.
Data preprocessing: Column Standardization
Co-variance of a Data Matrix.
MNIST dataset (784 dimensional)
Explanation of the dataset.
Code to load this dataset.
Principal Component Analysis.
Why learn it.
Geometric intuition.
Mathematical objective function.
Alternative formulation of PCA: distance minimization
Eigenvalues and eigenvectors.
PCA for dimensionality reduction and visualization.
Visualize MNIST dataset.
Limitations of PCA
Code example.
PCA for dimensionality reduction (not-visualization)
t-distributed stochastic neighbor embedding (t-SNE)
What is t-SNE?
Neighborhood of a point, Embedding.
Geometric intuition.
Crowding problem.
How to apply t-SNE and interpret its output (distill.pub)
t-SNE on MNIST.
Code example.
6. Real world problem: Predict sentiment polarity given product reviews on Amazon.
Exploratory Data Analysis.
Dataset overview: Amazon Fine Food reviews
Data Cleaning: Deduplication.
Featurizations: convert text to numeric vectors.
Why convert text to a vector?
Bag of Words (BoW)
Text Preprocessing: Stemming, Stop-word removal, Tokenization, Lemmatization.
uni-gram, bi-gram, n-grams.
tf-idf (term frequency- inverse document frequency)
Word2Vec.
Avg-Word2Vec, tf-idf weighted Word2Vec
Code samples
Bag of Words.
Text Preprocessing
Bi-Grams and n-grams.
TF-IDF
Word2Vec
Avg-Word2Vec and TFIDF-Word2Vec
Exercise: t-SNE visualization of Amazon reviews with polarity based color-coding
7. Classification and Regression Models: K-Nearest Neighbors
Foundations
How does “Classification” work?
Data matrix notation.
Classification vs Regression (examples)
K-Nearest Neighbors
Geometric intuition with a toy example.
Failure cases.
Distance measures: Euclidean (L2), Manhattan (L1), Minkowski, Hamming
Cosine Distance & Cosine Similarity
How to measure the effectiveness of k-NN?
Simple implementation:
Test/Evaluation time and space complexity.
Limitations.
Determining the right “k”
Decision surface for K-NN as K changes.
Overfitting and Underfitting.
Need for Cross validation.
K-fold cross validation.
Visualizing train, validation and test datasets
How to determine overfitting and underfitting?
Time based splitting
k-NN for regression.
Weighted k-NN
Voronoi diagram.
kd-tree based k-NN:
Binary search tree
How to build a kd-tree.
Find nearest neighbors using kd-tree
Limitations.
Extensions.
Locality sensitive Hashing (LSH)
Hashing vs LSH.
LSH for cosine similarity
LSH for Euclidean distance.
Probabilistic class label
Code Samples for K-NN
Decision boundary
Cross Validation
Exercise: Apply k-NN on Amazon reviews dataset.
8. Classification algorithms in various situations:
Introduction
Imbalanced vs balanced dataset.
Multi-class classification.
k-NN, given a distance or similarity matrix
Train and test set differences.
Impact of Outliers
Local Outlier Factor.
Simple solution: mean dist to k-NN.
k-distance (A), N(A)
reachability-distance(A, B)
Local-reachability-density(A)
LOF(A)
Impact of Scale & Column standardization.
Interpretability
Feature importance & Forward Feature Selection
Handling categorical and numerical features.
Handling missing values by imputation.
Curse of dimensionality.
Bias-Variance tradeoff.
Intuitive understanding of bias-variance.
9. Performance measurement of models:
Accuracy
Confusion matrix, TPR, FPR, FNR, TNR
Precision & recall, F1-score.
Receiver Operating Characteristic (ROC) curve and AUC.
Log-loss.
R-Squared/ Coefficient of determination.
Median absolute deviation (MAD)
Distribution of errors.
10. Naive Bayes
Conditional probability.
Independent vs Mutually exclusive events.
Bayes Theorem with examples.
Exercise problems on Bayes Theorem
Naive Bayes algorithm.
Toy example: Train and test stages.
Naive Bayes on Text data.
Laplace/Additive Smoothing.
Log-probabilities for numerical stability.
Cases:
Bias and Variance tradeoff.
Feature importance and interpretability.
Imbalanced data
Outliers.
Missing values.
Handling Numerical features (Gaussian NB)
Multiclass classification.
Similarity or Distance matrix.
Large dimensionality.
Best and worst cases.
Code example
Exercise: Apply Naive Bayes to Amazon reviews.
11. Logistic Regression:
Geometric intuition.
Sigmoid function & Squashing
Optimization problem.
Weight vector.
L2 Regularization: Overfitting and Underfitting.
L1 regularization and sparsity.
Probabilistic Interpretation: Gaussian Naive Bayes
Loss minimization interpretation
Hyperparameter search: Grid Search and Random Search
Column Standardization.
Feature importance and model interpretability.
Collinearity of features.
Train & Run time space and time complexity.
Real world cases.
Non-linearly separable data & feature engineering.
Code sample: Logistic regression, GridSearchCV, RandomizedSearchCV
Exercise: Apply Logistic regression to Amazon reviews dataset.
Extensions to Logistic Regression: Generalized linear models (GLM)
12. Linear Regression and Optimization.
Geometric intuition.
Mathematical formulation.
Cases.
Code sample.
Solving optimization problems
Differentiation.
Online differentiation tools
Maxima and Minima
Vector calculus: Grad
Gradient descent: geometric intuition.
Learning rate.
Gradient descent for linear regression.
SGD algorithm
Constrained optimization & PCA
Logistic regression formulation revisited.
Why L1 regularization creates sparsity?
Exercise: Implement SGD for linear regression
13. Support Vector Machines (SVM)
Geometric intuition.
Mathematical derivation.
Loss minimization: Hinge Loss.
Dual form of SVM formulation.
Kernel trick.
Polynomial kernel.
RBF-Kernel.
Domain specific Kernels.
Train and run time complexities.
nu-SVM: control errors and support vectors.
SVM Regression.
Cases.
Code Sample.
Exercise: Apply SVM to Amazon reviews dataset.
14. Decision Trees
Geometric Intuition: Axis parallel hyperplanes.
Sample Decision tree.
Building a decision Tree:
Entropy
Intuition behind entropy
Information Gain
Gini Impurity.
Constructing a DT.
Splitting numerical features.
Feature standardization.
Categorical features with many possible values.
Overfitting and Underfitting.
Train and Run time complexity.
Regression using Decision Trees.
Cases
Code Samples.
Exercise: Decision Trees on Amazon reviews dataset.
15. Ensemble Models:
What are ensembles?
Bootstrapped Aggregation (Bagging)
Intuition
Random Forests and their construction.
Bias-Variance tradeoff.
Train and Run-time Complexity.
Code Sample.
Extremely randomized trees.
Cases
Boosting:
Intuition
Residuals, Loss functions and gradients.
Gradient Boosting
Regularization by Shrinkage.
Train and Run time complexity.
XGBoost: Boosting + Randomization
AdaBoost: geometric intuition.
Stacking models.
Cascading classifiers.
Kaggle competitions vs Real world.
Exercise: Apply GBDT and RF to Amazon reviews dataset.
16. Featurizations and Feature engineering.
Introduction.
Time-series data.
Moving window.
Fourier decomposition.
Deep learning features: LSTM
Image data.
Image histogram.
Keypoints: SIFT.
Deep learning features: CNN
Relational data.
Graph data.
Feature Engineering.
Indicator variables.
Feature binning.
Interaction variables.
Mathematical transforms.
Model specific featurizations.
Feature orthogonality.
Domain specific featurizations.
Feature slicing.
Kaggle Winners solutions.
17a. Miscellaneous Topics
Calibration of Models.
Need for calibration.
Calibration Plots.
Platt’s Calibration/Scaling.
Isotonic Regression
Code Samples
Modeling in the presence of outliers: RANSAC
Productionizing models.
Retraining models periodically.
A/B testing.
17. Unsupervised learning/Clustering: K-Means
What is Clustering?
Unsupervised learning
Applications.
Metrics for Clustering.
K-Means
Geometric intuition, Centroids
Mathematical formulation: Objective function
K-Means Algorithm.
How to initialize: K-Means++
Failure cases/Limitations.
K-Medoids
Determining the right K.
Time and space complexity.
Code Samples
Exercise: Cluster Amazon reviews.
18. Hierarchical clustering
Agglomerative & Divisive, Dendrograms
Agglomerative Clustering.
Proximity methods: Advantages and Limitations.
Time and Space Complexity.
Limitations of Hierarchical Clustering.
Code sample.
Exercise: Amazon food reviews.
19. Recommender Systems and Matrix Factorization.
Problem formulation: Movie reviews.
Content based vs Collaborative Filtering.
Similarity based Algorithms.
Matrix Factorization:
PCA, SVD
NMF
MF for Collaborative filtering
MF for feature engineering.
Clustering as MF
Hyperparameter tuning.
Matrix Factorization for recommender systems: Netflix Prize Solution
Cold Start problem.
Word Vectors using MF.
Eigen-Faces.
Code example.
Exercise: Word Vectors using Truncated SVD.
20. Deep Learning: Neural Networks.
History of Neural networks and Deep Learning.
How do biological neurons work?
Growth of biological neural networks.
Diagrammatic representation: Logistic Regression and Perceptron
Multi-Layered Perceptron (MLP)
Notation.
Training a single-neuron model.
Training an MLP: Chain rule
Training an MLP: Memoization
Backpropagation algorithm.
Activation functions.
Vanishing and Exploding Gradient problem.
Bias-Variance tradeoff.
Decision surfaces: Playground
21. Deep Learning: Deep Multi-layer perceptrons
1980s to 2010s
Dropout layers & Regularization.
Rectified Linear Units (ReLU).
Weight initialization.
Batch Normalization.
Optimizers:
Hill-descent analogy in 2D
Hill descent in 3D and contours.
SGD recap.
SGD with Momentum.
Nesterov Accelerated Gradient (NAG)
AdaGrad
Adadelta and RMSProp
Adam
Which algorithm to choose when?
Gradient monitoring and Clipping.
Softmax for multi-class classification.
How to train a Deep MLP?
Auto Encoders.
Word2Vec.
CBOW
Skip-gram
Algorithmic Optimizations.
22. Deep Learning: TensorFlow and Keras.
Overview.
GPU vs CPU for Deep Learning.
Google Colaboratory
TensorFlow.
Install TensorFlow.
Online documentation and tutorials.
Softmax Classifier on MNIST dataset.
MLP: Initialization
Model 1: Sigmoid activation.
Model 2: ReLU activation.
Model 3: Batch Normalization.
Model 4: Dropout.
MNIST classification in Keras.
Hyperparameter tuning in Keras.
Exercise: Try different MLP architectures on MNIST dataset.
23. Deep Learning: Convolutional Neural Nets.
Biological inspiration: Visual Cortex
Convolution
Edge Detection on images.
Padding and strides
Convolution over RGB images.
Convolutional layer.
Max-pooling.
CNN Training: Optimization
Example CNN: LeNet [1998]
ImageNet dataset
Data Augmentation.
Convolution Layers in Keras
AlexNet
VGGNet
Residual Network.
Inception Network.
Transfer Learning: Reusing existing models.
What is Transfer learning.
Code example: Cats vs Dogs.
Code Example: MNIST dataset.
Assignment: Try various CNN networks on MNIST dataset.
24. Deep Learning: Recurrent Neural Networks
Why RNNs
Recurrent Neural Network.
Training RNNs: Backprop.
Types of RNNs.
Need for LSTM/GRU.
LSTM.
GRUs.
Deep RNN.
Bidirectional RNN.
Code example: IMDB Sentiment classification
Exercise: Amazon Fine Food reviews LSTM model.
25. Case Study 2: Personalized Cancer Diagnosis.
Business/Real world problem
Overview.
Business objectives and constraints.
ML problem formulation
Data
Mapping real world to ML problem.
Train, CV and Test data construction.
Exploratory Data Analysis
Reading data & preprocessing
Distribution of Class-labels.
“Random” Model.
Univariate Analysis
Gene feature.
Variation Feature.
Text feature.
Machine Learning Models
Data preparation.
Baseline Model: Naive Bayes
K-Nearest Neighbors Classification.
Logistic Regression with class balancing
Logistic Regression without class balancing
Linear-SVM.
Random-Forest with one-hot encoded features
Random-Forest with response-coded features
Stacking Classifier
Majority Voting classifier.
Assignments.
26. Case study 6: StackOverflow.
Business/Real world problem
Problem description.
Business objectives and constraints.
Mapping to an ML problem
Data overview
ML problem formulation.
Performance metrics.
Hamming loss
EDA
Data Loading.
Analysis of tags
Data preprocessing.
ML modeling
Multi-label classification.
Data preparation.
Train-test split.
Featurization.
Logistic regression: One-vs-Rest
Sampling data and tags + Weighted models.
Logistic regression revisited.
Why not use advanced techniques?
Assignments.
27. Case study 7: Quora Question pair similarity
Business/Real world problem.
Problem definition.
Business objectives and constraints.
Mapping to an ML problem
Data overview
ML problem and performance metric.
Train-test split
EDA
Basic Statistics.
Basic Feature Extraction.
Text Preprocessing.
Advanced Feature Extraction.
Feature analysis.
Data Visualization: t-SNE.
TF-IDF weighted word-vector featurization.
ML Models
Loading data.
Random Model.
Logistic Regression & Linear SVM
XGBoost
28. Case studies/Projects:
Amazon Fine Food Reviews and Sentiment Analysis
StackOverflow Tag Prediction
Quora Question pair similarity
Amazon fashion discovery engine for recommendation.
Sentiment analysis for Twitter data