One common question we hear from customers is:
And also a very related question:
These are great questions! The answers, however, are a bit nuanced. There are generally three ways to increase the performance of a machine learning pipeline:
- Adding more data points for the algorithm to train on
- Adding more features to existing data points
- Building a better machine learning pipeline
Note that the first approach, adding more data points, is the main way to realize an ongoing increase in performance. This is especially true in cases where the initial pipeline was trained with only a small amount of data.
Let’s dive into each of these paradigms in a bit more detail below!
Adding More Data Points
One of the amazing aspects of machine learning is its ability to perform better when simply provided with more data points related to an event. As more data of the same kind is provided, the algorithm is able to better approximate the underlying function being estimated, and its performance often increases.
For instance, if you are looking to predict who will churn, having more examples of churned users provides a better understanding of whether a current user will churn. Alternatively, if we are predicting whether a house will increase in value, having more examples of past houses that increased or decreased in value will help the algorithm understand whether any particular house will increase or decrease in value.
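To make the churn example concrete, here is a minimal sketch of a learning curve using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic `days_since_last_login` feature, the distributions it is drawn from, and the toy midpoint-threshold classifier. It is not how any particular product trains its pipelines; it only shows how one might measure accuracy as the number of training points grows.

```python
import random
from statistics import mean


def make_users(n, rng):
    """Return n synthetic (days_since_last_login, churned) pairs, half churned.

    Hypothetical data: churned users are centered on 30 days, active users on 10.
    """
    users = []
    for i in range(n):
        churned = i % 2 == 0
        center = 30.0 if churned else 10.0
        users.append((rng.gauss(center, 10.0), churned))
    return users


def fit_threshold(train):
    """Toy classifier: the midpoint between the two class means."""
    churn_mean = mean(x for x, churned in train if churned)
    active_mean = mean(x for x, churned in train if not churned)
    return (churn_mean + active_mean) / 2.0


def accuracy(threshold, test):
    """Fraction of users for which 'above threshold' matches 'churned'."""
    hits = sum((x > threshold) == churned for x, churned in test)
    return hits / len(test)


def learning_curve(sizes, seed=0):
    """Train on increasingly large samples and score each on one fixed test set."""
    rng = random.Random(seed)
    test = make_users(2000, rng)
    curve = []
    for n in sizes:
        train = make_users(n, rng)
        curve.append((n, accuracy(fit_threshold(train), test)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([10, 100, 1000]):
        print(f"{n:>5} training points -> accuracy {acc:.3f}")
```

With very few training points the learned threshold depends heavily on which users happened to be sampled; with more points it settles near the value that best separates the two groups, which is the effect described above.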
The new data that is added must be statistically similar to the old data. In statistical terms, the data should be stationary: its distribution should not change between the old and new data. Note that this often does not hold for behavioral data. Behavioral data generated on a website a year ago, for instance, will look very different from behavioral data generated over the last week. This often makes it difficult to leverage older data.
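One way to check whether old and new data look statistically similar is to compare the distribution of a single feature across the two periods. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distribution functions) in plain Python. The sample values and the `DRIFT_THRESHOLD` cutoff are hypothetical, and this is a simple per-feature check rather than a full test of stationarity.

```python
import bisect


def ks_statistic(old_values, new_values):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    if not old_values or not new_values:
        raise ValueError("both samples must be non-empty")
    old_sorted = sorted(old_values)
    new_sorted = sorted(new_values)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in old_sorted + new_sorted:
        cdf_old = bisect.bisect_right(old_sorted, x) / len(old_sorted)
        cdf_new = bisect.bisect_right(new_sorted, x) / len(new_sorted)
        gap = max(gap, abs(cdf_old - cdf_new))
    return gap


if __name__ == "__main__":
    # Hypothetical session lengths (minutes) from two time periods.
    last_year = [12.0, 15.0, 11.0, 14.0, 13.0]
    last_week = [31.0, 28.0, 35.0, 30.0, 29.0]

    DRIFT_THRESHOLD = 0.3  # hypothetical cutoff, chosen only for illustration
    gap = ks_statistic(last_year, last_week)
    verdict = "looks different" if gap > DRIFT_THRESHOLD else "looks similar"
    print(f"KS statistic = {gap:.2f}: old data {verdict}")
```

A value near 0 suggests the feature is distributed similarly in both periods; a value near 1 suggests the older data may no longer be representative.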
In the case where the underlying data varies over time (i.e., the data is not stationary), there is still a huge benefit to using machine learning. Why? The algorithm will continuously adapt to the latest trends in the data. Although the absolute performance of the pipeline may not increase, it will stay relatively consistent over time without requiring any manual intervention.
When the data is stationary, continually adding new data to the dataset can deliver a good increase in performance without needing to add new features or change the algorithms used to process the data.
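As an illustration of adapting to the latest trends, the sketch below refits a toy model on only the most recent observations. This is an assumed approach (a sliding training window), not a description of how any specific pipeline handles non-stationary data, and the `RollingThresholdModel` class is hypothetical.

```python
from collections import deque
from statistics import mean


class RollingThresholdModel:
    """Toy model that only remembers the most recent `window` observations."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest value when full.
        self.recent = deque(maxlen=window)

    def update(self, value):
        """Add a new observation, discarding the oldest if the window is full."""
        self.recent.append(value)

    def predict_is_high(self, value):
        """Is `value` above the average of the recent window?

        Requires at least one prior call to update().
        """
        return value > mean(self.recent)


if __name__ == "__main__":
    model = RollingThresholdModel(window=3)

    # Older behavior: short sessions, so a 10-minute session counts as high.
    for minutes in [5, 5, 5]:
        model.update(minutes)
    print("10 minutes is high (old trend):", model.predict_is_high(10))

    # Behavior shifts: longer sessions push the old data out of the window.
    for minutes in [20, 20, 20]:
        model.update(minutes)
    print("10 minutes is high (new trend):", model.predict_is_high(10))
```

The model's notion of "high" follows the data as it shifts, with no manual retuning, at the cost of discarding older observations.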
Adding More Features to the Existing Data
A great way to generate a lift in performance is to add more features to existing data points. If these features help predict the outcome, they will often improve the performance of the pipeline.
For instance, let’s say we are predicting whether someone will buy sneakers in the next week. If we augmented an existing e-commerce dataset with brick-and-mortar data, we might be able to better predict a customer’s interest in purchasing sneakers and thus increase the performance of the pipeline.
Note that adding new features to existing data will typically result in a one-time increase in performance, rather than the continual, gradual increase that can be obtained by steadily increasing the amount of training data.
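In practice, augmenting a dataset like this usually means joining the new source onto existing rows by a shared key. The sketch below shows one way to do that in plain Python. The field names (`customer_id`, `online_sneaker_views`, `store_visits_30d`) and the `add_features` helper are hypothetical and exist only to illustrate the idea.

```python
def add_features(base_rows, extra_by_id, defaults):
    """Return copies of base_rows extended with extra features.

    Rows whose customer_id is missing from extra_by_id get `defaults`,
    so every enriched row ends up with the same set of columns.
    """
    enriched = []
    for row in base_rows:
        extra = extra_by_id.get(row["customer_id"], defaults)
        enriched.append({**row, **extra})
    return enriched


if __name__ == "__main__":
    # Existing e-commerce data points (hypothetical).
    ecommerce = [
        {"customer_id": "a1", "online_sneaker_views": 4},
        {"customer_id": "b2", "online_sneaker_views": 0},
    ]
    # New brick-and-mortar features keyed by the same customer id (hypothetical).
    in_store = {"a1": {"store_visits_30d": 2}}

    for row in add_features(ecommerce, in_store, {"store_visits_30d": 0}):
        print(row)
```

The number of data points stays the same; each one simply carries more information for the model to use.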
Building a Better Machine Learning Pipeline
Another paradigm for increasing performance is to enhance the algorithm used to make the particular prediction. In this blog post, we use the term "algorithm" to cover all stages of the ML pipeline: preprocessing, feature cleaning, feature engineering, and model selection.
For instance, a business might want to augment an existing pipeline with a new type of engineered feature and leverage Vidora’s manual feature engineering functionality to add that feature.
However, similar to augmenting existing data with new features, enhancing the algorithm will typically result in a one-time increase in performance.
Unlocking the Secrets of Increasing Performance
There are many ways to increase the performance of your machine learning pipeline. In this post we covered three general approaches. In future posts, we will provide more information on how to leverage your Cortex account to realize some of the gains described above.