Cortex automatically connects every step of the Machine Learning process into end-to-end Machine Learning Pipelines that anyone in your business can run. In this guide, we’ll discuss the second step of a Cortex pipeline: data cleaning.
What is data cleaning?
Data cleaning refers to the process of detecting and correcting inaccurate records within a dataset. At enterprise scale, it’s not uncommon for some inaccuracies to surface during data collection or transmission. In Cortex, raw data is cleaned as it flows continuously into your account so that all new data points are consistent with the existing ones. Cleaning techniques are also applied after each pipeline’s feature engineering step.
Why does it matter?
As far as ML is concerned, garbage in means garbage out. In other words, the quality of your predictions is only as good as the quality of your input data. Without proper data cleaning, a machine learning model may latch onto an erroneous data pattern (say, duplicate events) during training. If this data pattern isn’t “real”, the model’s accuracy won’t be replicated on live data that actually reflects your business. Cortex’s automated data cleaning guards against this potential issue.
Which data cleaning techniques does Cortex use?
Cortex has a variety of data cleaning techniques at its disposal to ensure data quality throughout your pipeline. These techniques are applied on a per-pipeline basis where necessary. Some examples include:
Nonsense Data Removal
Remove characters that are not recognized by the Unicode standard, such as unassigned code points or garbled bytes left over from legacy encodings of rarely-used languages.
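One way to sketch this kind of character filtering is with Python's standard `unicodedata` module, dropping code points that the Unicode standard leaves unassigned (category `Cn`) or reserves for surrogates (`Cs`). This is an illustrative approximation, not Cortex's actual implementation:

```python
import unicodedata

def strip_unrecognized(text: str) -> str:
    """Drop code points Unicode does not assign (category 'Cn')
    and lone surrogates (category 'Cs')."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cn", "Cs")
    )
```

Legitimate non-ASCII text (accented letters, non-Latin scripts) passes through untouched, since those characters all carry valid Unicode categories.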
Remove or replace timestamps that occur in the future or the very distant past. For example, an event erroneously timestamped in the year 3000 may be removed, or replaced with the timestamp it was received.
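A minimal sketch of the timestamp rule above might look like the following. The 50-year cutoff for "the very distant past" and the function name are assumptions for illustration; the replacement value is the time the event was received, as described:

```python
from datetime import datetime, timezone

def clean_timestamp(event_time: datetime, received_time: datetime,
                    max_past_years: int = 50) -> datetime:
    """Replace timestamps from the future or the very distant past
    with the time the event was actually received."""
    lower_bound = received_time.replace(
        year=received_time.year - max_past_years)
    if event_time > received_time or event_time < lower_bound:
        return received_time
    return event_time
```

An event stamped in the year 3000 would thus be re-stamped with its arrival time, while plausible timestamps pass through unchanged.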
3 Std. Outlier Removal
Remove or replace outlier values that fall at least 3 standard deviations from the mean. For example, an event indicating a $1M purchase amount may be removed if other purchase events indicate statistically far lower prices.
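The removal variant of this rule can be sketched in a few lines of plain Python (a simplified illustration, not Cortex's production code):

```python
def remove_outliers(values: list[float], num_std: float = 3.0) -> list[float]:
    """Drop values more than num_std standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) <= num_std * std]
```

Note that the mean and standard deviation are computed over the data that includes the outlier, so very small samples may not flag an extreme value; at enterprise data volumes this is rarely a concern.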
Missing Data Imputation
Missing data imputation is performed both on the raw data in your account and on the features generated by the pipeline's feature engineering step.
- Mean imputation: if a data point has a missing value for a field, fill it with the average value for that field. For example, a purchase event with a missing price would be filled with the average price across all recent purchase events.
- Mode imputation: fill the missing value with the most frequently occurring value for that field. For example, a purchase event with a missing price would be filled with the mode price across all recent purchase events.
- Regression imputation: predict the missing value from the other fields that are present for that data point, using a linear regression model. For example, a purchase event with a missing price would be filled with a price prediction based on the event's time, item category, device, etc.
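The mean and mode strategies above can be sketched as follows (the function and field names are illustrative; the regression strategy is omitted for brevity, but would fit a model on the complete rows and predict the missing field):

```python
from statistics import mean, mode

def impute(events: list[dict], field: str, strategy: str = "mean") -> list[dict]:
    """Fill missing (None) values for `field` with the mean or mode
    of the values that are present."""
    present = [e[field] for e in events if e[field] is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        dict(e, **{field: e[field] if e[field] is not None else fill})
        for e in events
    ]
```

For example, imputing `price` over events priced 10.0 and 20.0 fills a missing price with 15.0 under the mean strategy.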
Feature Re-scaling
Re-scaling makes features with very different magnitudes directly comparable. This step is applied to the features generated by the pipeline's feature engineering step.
Re-scale all features so that the average value equals zero, and the standard deviation equals one.
Re-scale all features so that the minimum value equals zero, and the maximum value equals one.
Re-scale all features so that the maximum value equals one.
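The three re-scaling schemes above are commonly called standardization, min-max scaling, and max scaling. A minimal sketch of each (illustrative helper names, not Cortex's API):

```python
def standardize(values: list[float]) -> list[float]:
    """Re-scale so the mean is 0 and the standard deviation is 1."""
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / s for v in values]

def min_max_scale(values: list[float]) -> list[float]:
    """Re-scale so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def max_scale(values: list[float]) -> list[float]:
    """Re-scale so the largest magnitude becomes 1."""
    hi = max(abs(v) for v in values)
    return [v / hi for v in values]
```

After any of these transforms, a feature measured in millions of dollars and one measured in single-digit counts occupy comparable numeric ranges.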
Read more about the other steps of a Cortex pipeline:
- What is a Machine Learning Pipeline?
- Data Preprocessing
- Feature Engineering
- Model Selection
- Prediction Generation
Still have questions? Reach out to firstname.lastname@example.org for more info!