Data Cleaning

Cortex automatically connects every step of the Machine Learning process into end-to-end Machine Learning Pipelines that anyone in your business can run. In this guide, we’ll discuss the second step of a Cortex pipeline: data cleaning.

What is data cleaning?

Data cleaning refers to the process of detecting and correcting inaccurate records within a dataset. At enterprise scale, it’s not uncommon for some inaccuracies to surface during data collection or transmission. In Cortex, raw data is cleaned as it flows continuously into your account so that all new data points are consistent with the existing ones. Cleaning techniques are also applied after each pipeline’s feature engineering step.

Why does it matter?

As far as ML is concerned, garbage in means garbage out. In other words, the quality of your predictions is only as good as the quality of your input data. Without proper data cleaning, a machine learning model may latch onto an erroneous data pattern (say, duplicate events) during training. If this data pattern isn’t “real”, the model’s accuracy won’t be replicated on live data that actually reflects your business. Cortex’s automated data cleaning guards against this potential issue.

Which data cleaning techniques does Cortex use?

Cortex has a variety of data cleaning techniques at its disposal to ensure data quality throughout your pipeline. These techniques are applied on a per-pipeline basis where necessary. Some examples include:

Nonsense Data Removal

Unidentified Characters
Remove characters that are not recognized by the Unicode standard. Examples include characters from rarely-used languages.
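
As a rough illustration (a simplified Python sketch, not Cortex's internal implementation), code points that the Unicode standard leaves unassigned can be filtered out like this:

    import unicodedata

    def strip_unassigned(text):
        # Drop code points the Unicode standard has not assigned (category "Cn").
        return "".join(ch for ch in text if unicodedata.category(ch) != "Cn")

    strip_unassigned("price\u0378update")  # U+0378 is unassigned; returns "priceupdate"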

Out-of-Range Dates
Remove or replace timestamps that occur in the future or the very distant past. For example, an event erroneously timestamped in the year 3000 may be removed, or replaced with the timestamp at which it was received.
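
A minimal sketch of the replacement rule in Python (the year-2000 floor is an assumption chosen for illustration):

    from datetime import datetime, timezone

    EARLIEST_PLAUSIBLE = datetime(2000, 1, 1, tzinfo=timezone.utc)  # assumed cutoff

    def clean_timestamp(event_time, received_time):
        # Replace timestamps in the future or the very distant past with
        # the time the event was received.
        if event_time > received_time or event_time < EARLIEST_PLAUSIBLE:
            return received_time
        return event_time

    # An event "from" the year 3000 falls back to its receipt time.
    clean_timestamp(datetime(3000, 1, 1, tzinfo=timezone.utc),
                    datetime(2024, 6, 1, tzinfo=timezone.utc))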

Outlier Removal

3 Std. Outlier Removal
Remove or replace outlier values that fall at least 3 standard deviations from the mean. For example, an event indicating a $1m purchase amount may be removed if the remaining purchase events cluster around much lower prices.
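
Here is a simplified sketch of the removal variant in Python (the sample prices are hypothetical):

    from statistics import mean, stdev

    def remove_outliers(values, k=3):
        # Keep only values within k sample standard deviations of the mean.
        mu, sigma = mean(values), stdev(values)
        return [v for v in values if abs(v - mu) <= k * sigma]

    # Twenty typical purchase amounts plus one $1m entry; the $1m entry sits
    # more than 3 standard deviations above the mean and is removed.
    prices = [19.99, 24.50, 22.10, 18.75] * 5 + [1_000_000.00]
    cleaned = remove_outliers(prices)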

Missing Data Imputation

These Missing Data Imputation cleaning techniques are applied both to the raw data in your account and to features generated by the feature engineering step later in the pipeline.

Mean Substitution
If some data points have missing values for a particular field, fill with the average value for that field. For example, a purchase event with a missing value for price would be filled with the average price across all recent purchase events.
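
A minimal sketch in Python, with None standing in for a missing price:

    def fill_with_mean(values):
        # Replace missing entries with the mean of the observed values.
        observed = [v for v in values if v is not None]
        average = sum(observed) / len(observed)
        return [average if v is None else v for v in values]

    fill_with_mean([19.99, None, 22.10])  # -> [19.99, 21.045, 22.1]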

Frequent Substitution
If some data points have missing values for a particular field, fill with the most frequently occurring value for that field. For example, a purchase event with a missing value for price would be filled with the mode price across all recent purchase events.
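
The same idea with the mode in place of the mean (again a sketch, with None marking the missing value):

    from statistics import mode

    def fill_with_mode(values):
        # Replace missing entries with the most frequent observed value.
        most_common = mode(v for v in values if v is not None)
        return [most_common if v is None else v for v in values]

    fill_with_mode([9.99, 9.99, None, 14.99])  # -> [9.99, 9.99, 9.99, 14.99]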

Regression Substitution
If some data points have missing values for a particular field, predict the missing value based on other fields that are present for that data point. A linear regression model is used for this prediction. For example, a purchase event with a missing value for price would be filled with a price prediction based on the event’s time, item category, device, etc.
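
A sketch of the idea using scikit-learn's LinearRegression (an assumption made for illustration; the predictor fields here, hour of day and an item-category code, are hypothetical):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Events with known prices; columns are hour of day and item-category code.
    X = np.array([[9, 1], [14, 2], [20, 1], [11, 3]])
    prices = np.array([12.0, 30.0, 15.0, 45.0])

    # Fit on complete events, then predict the price for an event missing it.
    model = LinearRegression().fit(X, prices)
    predicted_price = model.predict(np.array([[16, 2]]))[0]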

Feature Re-Scaling

Re-scaling facilitates comparison across features with very different magnitudes. This step is applied to features generated by the feature engineering step later in the pipeline.

Gaussian Zero-Mean
Re-scale all features so that the average value equals zero, and the standard deviation equals one.
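
In code, this standardization (z-scoring) might look like the following sketch, assuming NumPy:

    import numpy as np

    def gaussian_zero_mean(feature):
        # Shift to zero mean and scale to unit standard deviation.
        feature = np.asarray(feature, dtype=float)
        return (feature - feature.mean()) / feature.std()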

Min-Max Normalization
Re-scale all features so that the minimum value equals zero, and the maximum value equals one.
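
A corresponding sketch for min-max normalization:

    import numpy as np

    def min_max_normalize(feature):
        # Map the smallest value to zero and the largest to one.
        feature = np.asarray(feature, dtype=float)
        return (feature - feature.min()) / (feature.max() - feature.min())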

Decimal Scaling
Re-scale all features by a power of ten so that the maximum absolute value falls below one.
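
And a sketch of decimal scaling, which shifts the decimal point rather than using the observed minimum and maximum:

    import math
    import numpy as np

    def decimal_scale(feature):
        # Divide by the smallest power of ten that brings the largest
        # absolute value below one (e.g., values up to 950 are divided by 1,000).
        feature = np.asarray(feature, dtype=float)
        j = math.floor(math.log10(np.abs(feature).max())) + 1
        return feature / (10 ** j)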

Still have questions? Reach out to support@mparticle.com for more info!