Machine Learning Lifecycle
- Define the Problem: Identify the task and desired outcome.
- Gather Data: Collect relevant data from various sources.
- Data Preprocessing: Clean and prepare data (handle outliers, missing values, formatting).
- Exploratory Data Analysis (EDA): Visualize and understand the patterns and relationships in the data.
- Feature Engineering & Selection: Create and select the most relevant features to improve model performance.
- Choose a Model: Select a suitable machine learning algorithm.
- Split the Data: Divide the data into training and testing (or validation) sets.
- Train the Model: Fit the model to the training data.
- Model Evaluation: Assess the model’s performance using test/validation data.
- Hyperparameter Tuning: Optimize hyperparameters to improve performance.
- Deployment: Integrate the model into a real-world environment.
- Monitor and Maintain: Track performance over time and retrain as needed.
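To make the middle of this lifecycle concrete, here is a minimal sketch of the split, train, tune, and evaluate steps in scikit-learn. The synthetic dataset, the random forest classifier, and the hyperparameter grid are illustrative assumptions, not recommendations.

```python
# A minimal sketch of splitting, training, tuning, and evaluating a model.
# The dataset, model, and parameter grid below are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Gather data (synthetic here, purely for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Choose a model and tune hyperparameters with cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)  # Train the model on the training data

# Evaluate the best model on the held-out test set
y_pred = search.best_estimator_.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```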
The Significance of Data Cleaning
The Impact of Data Quality on Model Performance
- Accuracy and Reliability: High-quality data enables models to make precise predictions. Conversely, poor-quality data leads to inaccurate and unreliable outcomes.
- Generalization: Models trained on clean data generalize better to unseen data, performing well in real-world scenarios.
- Bias and Fairness: Ensuring data is representative and unbiased prevents models from perpetuating or amplifying biases.
- Garbage In, Garbage Out: As the old computing adage goes, the quality of the input data directly determines the quality of the model's predictions.
Techniques for Data Cleaning
Handling Outliers
- Outliers are data points that are significantly different from other observations.
- Techniques:
- Trim: Remove outliers from the dataset.
- Cap: Replace outliers with the nearest acceptable value.
- Transform: Apply transformations like log or square root to reduce the impact of extreme values.
Use visualization tools like box plots to identify outliers before deciding on a handling strategy.
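As an illustration, the short pandas sketch below trims, caps, and transforms a single numeric column using IQR-based cutoffs; the column name, sample values, and 1.5 × IQR rule are assumptions chosen for the example.

```python
# Trim, cap, and transform outliers in one numeric column.
# The "income" column, sample values, and 1.5 * IQR cutoffs are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [32_000, 41_000, 38_500, 45_000, 1_200_000]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = df[df["income"].between(lower, upper)]       # Trim: drop the outlier rows
df["income_capped"] = df["income"].clip(lower, upper)  # Cap: pull extremes to the bounds
df["income_log"] = np.log1p(df["income"])              # Transform: compress extreme values

print(df)
print(trimmed)
```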
Removing or Consolidating Duplicate Data
- Duplicates can skew analysis and model training.
- Techniques:
- Identify and Remove: Use software or database queries to find and delete duplicates.
- Consolidate: Merge partially varying duplicates by averaging numerical values or choosing the most frequent category.
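A brief pandas sketch of both approaches follows; the customer_id key, the averaging of age, and the most-frequent-category rule for segment are assumptions for illustration.

```python
# Remove exact duplicates, then consolidate near-duplicates per key.
# The customer_id key and the aggregation rules are assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "age":         [34, 34, 29, 31, 52],
    "segment":     ["gold", "gold", "silver", "silver", "bronze"],
})

# Identify and remove: drop rows that are exact duplicates
deduped = df.drop_duplicates()

# Consolidate: average numeric values and keep the most frequent category per key
consolidated = deduped.groupby("customer_id").agg(
    age=("age", "mean"),
    segment=("segment", lambda s: s.mode().iloc[0]),
).reset_index()

print(consolidated)
```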
Identifying and Correcting Incorrect Data
Techniques:
- Data Validation: Implement rules based on known ranges or formats to identify anomalies.
- Cross-Referencing: Use external sources to validate and correct data.
- Anomaly Detection: Employ machine learning techniques to identify potential inaccuracies.
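For instance, rule-based validation might look like the pandas sketch below; the column names, the plausible age range, and the simple email pattern are assumptions, not a complete validation scheme.

```python
# Rule-based validation: flag values outside known ranges or formats.
# The column names, age range, and email pattern are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 142, 37, -3],
    "email": ["a@example.com", "b@example", "c@example.com", "d@example.com"],
})

# Range rule: ages outside 0-120 are treated as anomalies
bad_age = ~df["age"].between(0, 120)

# Format rule: strings that do not look like an email address
bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

print(df[bad_age | bad_email])  # Rows flagged for correction or review
```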
Filtering Irrelevant Data
Techniques:
- Feature Selection: Use methods like correlation matrices or random forest importance to remove irrelevant features.
- Domain Expertise: Consult experts to identify unnecessary data elements.
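A possible sketch of the feature-selection side with pandas and scikit-learn is shown below; the synthetic regression data, the 0.05 correlation threshold, and the feature names are illustrative assumptions.

```python
# Filter irrelevant features via correlation with the target and
# random forest importance. The synthetic data and 0.05 threshold are assumptions.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

# Correlation with the target: candidates for removal have near-zero correlation
corr = df.corrwith(pd.Series(y)).abs()
weak = corr[corr < 0.05].index.tolist()

# Random forest importance: rank features by their contribution to the model
model = RandomForestRegressor(random_state=0).fit(df, y)
importances = pd.Series(model.feature_importances_, index=df.columns).sort_values()

print("Low-correlation features:", weak)
print(importances)
```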
Transforming Improperly Formatted Data
Techniques:
- Parsing: Reformat data into a usable structure, such as standardizing date formats.
- Regular Expressions: Use patterns to identify and transform data formats.
In the example dataset, dates were standardized to the YYYY-MM-DD format, ensuring consistency across entries.
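A short pandas sketch of both parsing and regex-based reformatting might look like the following; the column names and sample values are made up, and format="mixed" assumes pandas 2.0 or later.

```python
# Standardize mixed date formats to YYYY-MM-DD and clean phone numbers
# with regular expressions. Column names and sample values are made up;
# format="mixed" requires pandas 2.0+.
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["03/14/2023", "2023-03-15", "15 Mar 2023"],
    "phone": ["(555) 123-4567", "555.987.6543", "5551112222"],
})

# Parsing: infer each date's format, then re-emit a single canonical one
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Regular expressions: strip non-digits, then reformat as XXX-XXX-XXXX
digits = df["phone"].str.replace(r"\D", "", regex=True)
df["phone"] = digits.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)

print(df)
```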
Handling Missing Data
Techniques:
- Imputation: Replace missing values with the mean, median, or mode.
- Deletion: Remove records with missing data (listwise or pairwise deletion).
- Predictive Modeling: Use algorithms like regression or decision trees to predict missing values.
Choosing the right technique depends on the nature of the data and the specific requirements of the analysis or model.
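The sketch below shows all three options with pandas and scikit-learn; the column names and sample values are assumptions, and IterativeImputer stands in here for the predictive-modeling approach.

```python
# Imputation, deletion, and model-based imputation for missing values.
# Column names and values are illustrative; IterativeImputer is used here
# as one way to realize the "predictive modeling" approach.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 37, 41],
    "income": [48_000, 52_000, np.nan, 61_000],
})

# Imputation: fill missing values with a simple statistic such as the median
df_median = df.fillna(df.median(numeric_only=True))

# Deletion: drop any row containing a missing value (listwise deletion)
df_dropped = df.dropna()

# Predictive modeling: estimate each missing value from the other columns
df_model = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns
)

print(df_median, df_dropped, df_model, sep="\n\n")
```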
Normalization and Standardization
- Normalization
- Definition: Rescales features to a specific range, typically [0, 1] or [-1, 1].
- Use Cases: Beneficial for algorithms sensitive to input scale, such as gradient descent-based methods.
- Standardization
- Definition: Rescales features to have a mean of 0 and a standard deviation of 1.
- Use Cases: Useful for data with unknown minimum and maximum values or when outliers are present.
Think of normalization as adjusting the volume on a music track so that no part is too loud or too quiet.
Min-max normalization is sensitive to outliers because it depends on the minimum and maximum values, whereas standardization is less affected because it relies on the mean and standard deviation.
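The short scikit-learn sketch below contrasts the two on a made-up feature containing one outlier, to illustrate that difference in sensitivity.

```python
# Contrast min-max normalization and standardization on a feature with an outlier.
# The values are made up to show how each rescaling reacts.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the single outlier

normalized = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
standardized = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1

# With the outlier present, min-max squeezes the first four values toward 0,
# while standardization keeps them relatively more spread out.
print(np.hstack([x, normalized, standardized]))
```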