Data Cleansing Master Class in Python
Data preparation may be the most important part of a machine learning project. It is also the most time-consuming part, yet it seems to be the least discussed. Learn data cleansing from start to finish.
Course Introduction
Course Structure
Is this Course Right for You?
Introduction to Data Preparation
The Machine Learning Process
Data Preparation Defined
Choosing a Data Preparation Approach
What is Data?
What is Raw Data?
Machine Learning is Mostly Data Preparation
Common Data Preparation Tasks - Data Cleansing
Common Data Preparation Tasks - Feature Selection
Common Data Preparation Tasks - Data Transforms
Common Data Preparation Tasks - Feature Engineering
Common Data Preparation Tasks - Dimensionality Reduction
Data Leakage
Problem with Naïve Data Preparation
Case Study: Data Leakage: Train/Test Split Naïve Approach
Case Study: Data Leakage: Train/Test Split Correct Approach
Case Study: Data Leakage: K-Fold Naïve Approach
Case Study: Data Leakage: K-Fold Correct Approach
Data Cleansing Overview
Identify Columns That Contain a Single Value
Identify Columns with Few Values
Remove Columns with Low Variance
Identify and Remove Rows That Contain Duplicate Data
Defining Outliers
Remove Outliers - The Standard Deviation Approach
Remove Outliers - The IQR Approach
Automatic Outlier Detection
Mark Missing Values
Remove Rows with Missing Values
Statistical Imputation
Mean Value Imputation
Simple Imputer with Model Evaluation
Compare Different Statistical Imputation Strategies
K-Nearest Neighbors Imputation
KNNImputer and Model Evaluation
Iterative Imputation
IterativeImputer and Model Evaluation
IterativeImputer and Different Imputation Order
Feature Selection Introduction
Feature Selection Defined
Statistics for Feature Selection
Loading a Categorical Dataset
Encode the Dataset for Modeling
Chi-Squared
Mutual Information
Modeling with Selected Categorical Features
Feature Selection with ANOVA on Numerical Input
Feature Selection with Mutual Information
Modeling with Selected Numerical Features
Tuning Number of Selected Features
Select Features for Numerical Output
Linear Correlation with Correlation Statistics
Linear Correlation with Mutual Information
Baseline and Model Built Using Correlation
Model Built Using Mutual Information Features
Tuning Number of Selected Features
Recursive Feature Elimination
RFE for Classification
RFE for Regression
RFE Hyperparameters
Feature Ranking for RFE
Feature Importance Scores Defined
Feature Importance Scores: Linear Regression
Feature Importance Scores: Logistic Regression and CART
Feature Importance Scores: Random Forests
Permutation Feature Importance
Feature Selection with Importance
Scale Numerical Data
Diabetes Dataset for Scaling
MinMaxScaler Transform
StandardScaler Transform
Robust Scaling Data
Robust Scaler Applied to Dataset
Explore Robust Scaler Range
Nominal and Ordinal Variables
Ordinal Encoding
One-Hot Encoding Defined
One-Hot Encoding
Dummy Variable Encoding
OrdinalEncoder Transform on Breast Cancer Dataset
Make Distributions More Gaussian
Power Transform on Contrived Dataset
Power Transform on Sonar Dataset
Box-Cox on Sonar Dataset
Yeo-Johnson on Sonar Dataset
Polynomial Features
Effect of Polynomial Degrees
Transforming Different Data Types
The ColumnTransformer
The ColumnTransformer on Abalone Dataset
Manually Transform Target Variable
Automatically Transform Target Variable
Challenge of Preparing New Data for a Model
Save Model and Data Scaler
Load and Apply Saved Scalers