Classification Performance

How do I evaluate the performance of my Classification pipeline?

Once you’ve built a Classification pipeline, Cortex makes it easy to explore results and learn more about your pipeline’s predictive power. In this guide, we’ll show you how to evaluate the performance of a Classification pipeline in Cortex.

Cortex summarizes performance of any pipeline in three ways: Pipeline Quality, Performance Metrics, and Performance Graphs. The following sections describe how to interpret each of these specifically for a Classification pipeline.

Classification Pipeline Quality

Pipeline Quality gives you a quick sense of your pipeline’s performance without having to scrutinize any technical metrics. A pipeline’s Quality is meant to serve as a rough guide to how well its predictions match reality, but what constitutes good performance ultimately depends on the difficulty of the problem: sometimes “Average” actually represents the best possible performance that can be achieved with the data at hand.

For Classification, Pipeline Quality is determined based on AUC, a common measure of performance for binary classification machine learning. AUC is described in more detail in the Metrics section below.

Pipeline Quality    AUC
Excellent           >85%
Very Good           75-85%
Good                65-75%
Average             55-65%
Below Average       <55%
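
The tiers above can be expressed as a small lookup. This is only an illustrative sketch, not Cortex's own code: the function name `pipeline_quality` is hypothetical, and because the table does not say which tier a value of exactly 75% or 65% belongs to, the sketch assumes such boundary values fall into the lower tier.

```python
def pipeline_quality(auc_percent: float) -> str:
    """Map an AUC percentage (0-100) to a Pipeline Quality label."""
    if not 0.0 <= auc_percent <= 100.0:
        raise ValueError("AUC must be between 0 and 100")
    # From the table: "Excellent" is strictly above 85%.
    if auc_percent > 85:
        return "Excellent"
    # Assumption: exactly 75% and exactly 65% go to the lower tier.
    if auc_percent > 75:
        return "Very Good"
    if auc_percent > 65:
        return "Good"
    # From the table: "Below Average" is strictly below 55%.
    if auc_percent >= 55:
        return "Average"
    return "Below Average"


print(pipeline_quality(78.2))
```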

Classification Performance Metrics

Cortex publishes four well-known performance metrics for each Classification pipeline: AUC, precision, recall, and accuracy. All metrics are computed on a holdout set (i.e. data not used during training) to check that your pipeline can generalize to data it has never seen before.

To frame these metrics in real terms, consider a pipeline that predicts each user’s probability of purchasing within the next 14 days. Note, however, that your Cortex account can be configured to make predictions about any type of object tied to your event data (e.g. commerce items, media content, home listings, etc.).


AUC (or AUROC, short for Area Under the Receiver Operating Characteristic curve) is one of the most commonly used measures of classification performance. It is expressed as a percentage from 0 to 100%. The higher your pipeline’s AUC, the better its predictive power.

AUC is derived from the ROC curve (read here for more details), but can be interpreted in a more intuitive way: if one positive example and one negative example are each drawn at random, what is the probability that the positive example was given a higher prediction than the negative one? In our example above, a positive label is a user who purchased within 14 days, while a negative label is a user who did not end up purchasing in the window.
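
The pairwise interpretation can be checked directly on a small sample. The sketch below is illustrative only: the function name `pairwise_auc` and the scores and labels are made up, and counting tied scores as half a win is a common convention rather than something this guide specifies.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos_score, neg_score in product(positives, negatives):
        if pos_score > neg_score:
            wins += 1.0
        elif pos_score == neg_score:
            wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical holdout data: 1 = purchased within 14 days, 0 = did not.
example_scores = [0.9, 0.8, 0.3, 0.2]
example_labels = [1, 0, 1, 0]
print(pairwise_auc(example_scores, example_labels))  # 3 of 4 pairs ranked correctly: 0.75
```

Comparing every pair is quadratic in the number of users, so this form is useful for building intuition on small samples rather than for scoring a full holdout set.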


Precision: Of all the users that your pipeline predicted to be in the positive class, what percentage were actually in the positive class? In terms of our example, of all the users predicted to purchase within 14 days, what percent actually did? Read here for more details.


Recall: Of all the users that were actually in the positive class, what percentage did your pipeline predict to be in the positive class? In terms of our example, of all the users who purchased within 14 days, what percent were predicted to do so? Read here for more details.


Accuracy: What percent of all users did your pipeline classify correctly across both the positive and negative classes? In terms of our example, what percent of all predictions were correct (for both users who were and were not predicted to purchase)? Read here for more details.
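
As a rough sketch of how these three questions turn into numbers, the following counts true and false positives and negatives at a single model score threshold. The function name `classification_metrics` and the sample data are hypothetical, and treating a score exactly equal to the threshold as a positive prediction is an assumption.

```python
def classification_metrics(scores, labels, threshold):
    """Return (precision, recall, accuracy) at one model score threshold."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        # Assumption: a score equal to the threshold counts as positive.
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    # Precision: of the predicted positives, how many were actually positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the actual positives, how many were predicted positive?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy


# Hypothetical holdout data: 1 = purchased within 14 days, 0 = did not.
precision, recall, accuracy = classification_metrics(
    [0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0], threshold=0.5
)
print(precision, recall, accuracy)
```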

Note: Precision, recall, and accuracy are reported at the model score threshold that maximizes the F1 score (the harmonic mean of precision and recall) along each Classification pipeline’s precision-recall curve (described in the Performance Graphs section below).
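
One way to picture the maximum-F1 operating point is to try each distinct model score as a threshold and keep the one with the highest F1. This is a sketch under stated assumptions, not Cortex's actual procedure: the function name `best_f1_threshold`, the candidate thresholds, the tie handling, and the sample data are all illustrative.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1, precision, recall) at the highest F1 found."""
    best = (None, 0.0, 0.0, 0.0)
    # Assumption: every distinct score is tried as a candidate threshold,
    # and a score equal to the threshold counts as a positive prediction.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero, so F1 is zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1, precision, recall)
    return best


# Hypothetical holdout data: 1 = purchased within 14 days, 0 = did not.
best_threshold, best_f1, best_precision, best_recall = best_f1_threshold(
    [0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0]
)
print(best_threshold, round(best_f1, 3))
```

On this sample the best threshold is 0.4, where precision is 0.75 and recall is 1.0.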

Classification Performance Graphs

Precision-Recall Curve

The precision-recall curve shows the tradeoff between precision and recall as the threshold for a positive class prediction changes. The more accurate your pipeline, the larger the area beneath this curve.

For a given pipeline, precision and recall are often inversely related: improving one usually means sacrificing the other. To understand why, note first that Cortex’s Classification predictions take the form of a “model score” between 0 and 1.

user_id model_score
ABC 0.941
DEF 0.897
UVW 0.0212
XYZ 0.0192

In reference to the above table:

  • This sample table is sorted in descending order of users’ model scores.
  • User ABC is predicted as most similar to users in the positive class.
  • User XYZ is predicted as least similar to users in the positive class.

Model scores are useful in that they tell you something about the confidence of each prediction. But precision and recall measure whether a pipeline’s predictions are right or wrong, so to compute these metrics we must first set a model score threshold. Any user scored above this threshold is considered a positive class prediction (e.g. predicted to be a homeowner), and any user scored below it is considered a negative class prediction (e.g. predicted not to be a homeowner).

Raising this threshold is likely to increase our pipeline’s precision but lower its recall. Imagine setting a high threshold such that only a few high-confidence users are considered positive predictions (say, those with model scores above 0.95). Our pipeline would get a lot of these predictions “right” (i.e. have high precision), but would miss out on many other actual positives (i.e. have low recall). If we instead set a low threshold so that most predictions are considered positive (say, those with model scores above 0.1), many of these predictions will be incorrect (low precision), but we’d correctly capture almost all the actual positives (high recall).
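
To make the tradeoff concrete, the sketch below evaluates one made-up set of scored users at the two thresholds mentioned above, 0.95 and 0.1. The helper name `precision_recall` and the data are hypothetical, and a score equal to the threshold is assumed to count as a positive prediction.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when scores at or above threshold are positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scored users: 1 = actually positive, 0 = actually negative.
sample_scores = [0.97, 0.96, 0.90, 0.70, 0.50, 0.30, 0.15, 0.05]
sample_labels = [1, 1, 1, 0, 1, 0, 0, 0]

high = precision_recall(sample_scores, sample_labels, 0.95)
low = precision_recall(sample_scores, sample_labels, 0.10)
print("threshold 0.95:", high)  # few positive predictions, all of them correct
print("threshold 0.10:", low)   # nearly everyone predicted positive
```

On this sample, the high threshold gives precision 1.0 with recall 0.5, while the low threshold gives recall 1.0 with precision of about 0.57.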

The Precision-Recall Curve measures this tradeoff as we vary the model score threshold from high (left side of the curve) to low (right side).
