Machine Learning Lifecycle
- Define the Problem: Identify the task and desired outcome.
- Gather Data: Collect relevant data from various sources.
- Data Preprocessing: Clean and prepare data (handle outliers, missing values, formatting).
- Exploratory Data Analysis (EDA): Visualize the data to understand its patterns and relationships.
- Feature Engineering & Selection: Create and select the most relevant features to improve model performance.
- Choose a Model: Select a suitable machine learning algorithm.
- Split the Data: Divide the data into training and testing (or validation) sets.
- Train the Model: Fit the model to the training data.
- Model Evaluation: Assess the model’s performance using test/validation data.
- Hyperparameter Tuning: Optimize hyperparameters to improve model performance (see the sketch after this list).
- Deployment: Integrate the model into a real-world environment.
- Monitor and Maintain: Track performance over time and retrain as needed.
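To make the split, train, evaluate, and tune steps concrete, here is a minimal scikit-learn sketch. It assumes scikit-learn is installed and uses the bundled Iris dataset as a stand-in for your own data; the classifier and parameter grid are arbitrary choices for illustration, not a recommended setup.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Example dataset, standing in for data gathered and cleaned earlier.
X, y = load_iris(return_X_y=True)

# Split the Data: hold out 20% of the rows for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the Model: fit a baseline classifier on the training set.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Model Evaluation: score on data the model has never seen.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Hyperparameter Tuning: search a small grid with cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
```

Note that the grid search runs cross-validation on the training split only, so the held-out test set stays untouched until the final evaluation.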
The Significance of Data Cleaning
The Impact of Data Quality on Model Performance
- Accuracy and Reliability: High-quality data enables models to make precise predictions. Conversely, poor-quality data leads to inaccurate and unreliable outcomes.
- Generalization: Models trained on clean data generalize better to unseen data, performing well in real-world scenarios.
- Bias and Fairness: Ensuring data is representative and unbiased prevents models from perpetuating or amplifying biases.
- The old computing adage applies here: "garbage in, garbage out." The quality of the input data directly determines the quality of the model's predictions.
Techniques for Data Cleaning
Handling Outliers
- Outliers are data points that are significantly different from other observations.
- Techniques:
- Trim: Remove outliers from the dataset.
- Cap: Replace outliers with the nearest acceptable boundary value (winsorization).
- Transform: Apply transformations like log or square root to reduce the impact of extreme values.
Use visualization tools like box plots to identify outliers before deciding on a handling strategy.
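As a minimal sketch of these techniques, assuming pandas, NumPy, and matplotlib are available: the column name and values below are invented, and the 1.5 × IQR rule is just one common way to define outlier fences.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric column containing one extreme value.
df = pd.DataFrame({"income": [35, 42, 38, 41, 39, 40, 37, 400]})

# Define outlier fences with the common 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trim: drop rows that fall outside the fences.
trimmed = df[df["income"].between(lower, upper)]

# Cap: clip values to the nearest acceptable boundary.
capped = df["income"].clip(lower=lower, upper=upper)

# Transform: compress extreme values with a log transform.
transformed = np.log1p(df["income"])

# A box plot makes the outlier visible before choosing a strategy.
df["income"].plot.box()
plt.show()
```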
Removing or Consolidating Duplicate Data
- Duplicates can skew analysis and model training.
- Techniques:
- Identify and Remove: Use data-frame operations or database queries to find and delete exact duplicates.
- Consolidate: Merge near-duplicate records by averaging numerical values or keeping the most frequent category.
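A minimal pandas sketch of both approaches; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical records; the last two rows repeat earlier customers.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 1, 2],
    "city": ["Leeds", "York", "Hull", "Leeds", "York"],
    "spend": [100.0, 80.0, 60.0, 110.0, 80.0],
})

# Identify and Remove: drop rows that are exact duplicates.
deduped = df.drop_duplicates()

# Consolidate: merge partially varying duplicates per customer,
# averaging numeric values and keeping the most frequent category.
consolidated = df.groupby("customer_id", as_index=False).agg(
    city=("city", lambda s: s.mode().iloc[0]),
    spend=("spend", "mean"),
)
print(consolidated)
```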