How to Build a Classification Pipeline
Vidora Cortex is an easy-to-use platform that enables anyone to automate Machine Learning Pipelines from continuous streams of event data. In this guide, we’ll show you how to expand small sets of users into audience segments using Classification pipelines in Cortex.
What are Classification pipelines?
Predictions from a Classification pipeline answer the question: how likely is each user to belong to a certain group? For Classification pipelines, the groups of users you upload (in ML terms, your positive and negative labels) will consist of sets of users who are known to share a particular trait, i.e. the positive set, as well as a set of users who are known to not share this particular trait, i.e. the negative set. This lets you take traits that you’ve collected for a small group and leverage them into broad insights about your entire set of users. Learn more about common survey-based use cases for these pipelines here.
Note that while we’ll be using the example of predicting a user attribute, your Cortex account can be configured to make predictions about any type of object tied to your event data (e.g. commerce items, media content, home listings, etc.).
When should I use Classification pipelines?
- Your prediction should answer a “Yes/No” question about a user attribute (e.g. “Is this user a CEO?”). If you’re looking to identify a numeric attribute (e.g. “What is the age of this user?”), use a Regression pipeline instead.
- You have both a positive set (users who share the trait you’d like to identify in other users) and a corresponding negative set (another group of users who don’t exhibit this trait). If you only have access to positive labels, use a Look Alike pipeline instead.
The following diagram will help explain which pipeline type is best suited for different predictions.
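The same guidelines can also be written as a small decision helper. This is only an illustrative sketch in Python: the function name `choose_pipeline_type` and its arguments are hypothetical and are not part of Cortex, which handles pipeline selection through its own interface.

```python
def choose_pipeline_type(question_kind: str, has_positive: bool, has_negative: bool) -> str:
    """Map the guidelines above to a pipeline type name.

    Hypothetical helper for illustration only; not a Cortex API.
    question_kind is "yes_no" or "numeric".
    """
    if question_kind == "numeric":
        # Numeric attributes (e.g. a user's age) call for Regression.
        return "Regression"
    if question_kind == "yes_no":
        if has_positive and has_negative:
            # Both label sets are available.
            return "Classification"
        if has_positive:
            # Only positive labels are available.
            return "Look Alike"
    raise ValueError("These guidelines do not cover this combination of inputs")


# Example: a yes/no question with both label sets available.
pipeline_type = choose_pipeline_type("yes_no", has_positive=True, has_negative=True)
```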
What are examples of Classification pipelines?
Classification Pipeline Examples
- What is the likelihood that each user is a homeowner, based on sets of homeowners and non-homeowners?
- What is the likelihood that each user is a student, based on sets of students and non-students?
How do I build these pipelines in Cortex?
Step 1: Choose Pipeline Type
Select ‘Create New Pipeline’ from within your Cortex account. Make sure that the “Batch | Real-Time” toggle is set to “Batch”, and choose the Classification pipeline type.
Step 2: Upload Sets
Upload a set of positive and negative user labels. Your positive labels should be a list of User IDs of users you already know exhibit the trait you are looking to predict in the rest of your user base. Your negative labels should be a list of User IDs of users you already know do not exhibit that trait. Cortex will find these IDs in your existing event data, learn what attributes and behaviors are associated with each set, and score each remaining user by how closely they resemble users in the positive set.
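Before uploading, it can help to clean your two lists and confirm that no User ID appears in both. The sketch below shows one way to do that in Python. The function names and the one-column CSV layout with a `user_id` header are assumptions made for illustration; check the upload screen in your Cortex account for the file format it actually expects.

```python
import csv
import io


def prepare_label_sets(positive_ids, negative_ids):
    """Trim, de-duplicate, and sort both ID lists, and reject any overlap."""
    positives = {uid.strip() for uid in positive_ids if uid.strip()}
    negatives = {uid.strip() for uid in negative_ids if uid.strip()}
    overlap = positives & negatives
    if overlap:
        # A user cannot be both a positive and a negative example.
        raise ValueError(f"IDs appear in both sets: {sorted(overlap)}")
    return sorted(positives), sorted(negatives)


def to_csv(user_ids):
    """Serialize IDs as a one-column CSV (assumed layout, not a Cortex spec)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id"])
    for uid in user_ids:
        writer.writerow([uid])
    return buffer.getvalue()


# Example with made-up IDs: homeowners vs. non-homeowners.
positives, negatives = prepare_label_sets(["u1", " u2", "u2"], ["u3", "u4"])
positive_csv = to_csv(positives)
```

Rejecting overlapping IDs up front keeps contradictory examples out of your sets.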
Step 3: Define Dates
Traits can change over time. If the sets that you uploaded were collected in the past, some users may have switched groups over time. To make sure Cortex learns from the right data, define a range of training dates during which your sets are likely to be valid.
By default, your date range will be set to the most recent period. You may want to override this default if there is another window that better satisfies the below conditions:
- Your example sets are likely to be valid. If your examples were collected in the past, sets for some users may be outdated. Find a range when the sets you uploaded are likely to have remained constant.
- Your example users are likely to be active. Cortex learns from event data associated with the IDs you uploaded. In order to provide Cortex as much data as possible, choose a range when the greatest number of your uploaded IDs were active and completing events.
- Event data is likely to reflect the present. If your business’ data has changed over time or contains a lot of seasonality, you should carefully select a range that contains event patterns that are relevant going forward.
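If you have your own copy of your event data, you can get a rough sense of the second condition (how many of your uploaded IDs were active) by counting labeled users per candidate window. The sketch below is a minimal example that assumes events are available as `(user_id, date)` pairs; the function names and data layout are hypothetical, and it does not check label validity or seasonality, which still need your judgment.

```python
from datetime import date, timedelta


def count_active_labeled_users(events, labeled_ids, start, end):
    """Count labeled users with at least one event between start and end, inclusive."""
    return len({uid for uid, day in events if uid in labeled_ids and start <= day <= end})


def best_window(events, labeled_ids, candidate_starts, window_days):
    """Return (start, end, active_count) for the candidate window with the most active labeled users."""
    best = None
    for start in candidate_starts:
        end = start + timedelta(days=window_days - 1)
        active = count_active_labeled_users(events, labeled_ids, start, end)
        # Keep the first window seen with the highest count.
        if best is None or active > best[2]:
            best = (start, end, active)
    return best


# Example with made-up events; "u9" is not in the uploaded sets and is ignored.
events = [
    ("u1", date(2023, 1, 5)),
    ("u2", date(2023, 1, 20)),
    ("u1", date(2023, 2, 3)),
    ("u2", date(2023, 2, 10)),
    ("u3", date(2023, 2, 12)),
    ("u9", date(2023, 2, 15)),
]
labeled_ids = {"u1", "u2", "u3"}
window = best_window(events, labeled_ids, [date(2023, 1, 1), date(2023, 2, 1)], window_days=28)
```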
Step 4: Specify Settings
Specify settings such as your pipeline’s name, schedule, tags, and more.
Every time your pipeline runs, Cortex goes through the end-to-end process of generating fresh predictions from the latest data that’s been ingested. If you’d like to power automation based on predictions that are always up-to-date, make sure your pipeline is set to run repeatedly. If you’re just testing things out or building a pipeline for one-time use, your pipeline should only run once.
Step 5: Review
The final step is to review your pipeline and ensure all settings look accurate! If anything needs to be updated, simply go ‘Back’ in the workflow and update any step.
Step 6: Update Labels Over Time (Optional)
If you’re collecting new positive and negative labels over time, you can import these extra labels into Cortex so that your pipelines are always learning from the most recent information. To upload new labels, hit the “Edit” button on your pipeline (next to “Export Predictions”).
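If you track labels in your own systems, you may want to merge newly collected labels with your earlier ones before uploading. The sketch below assumes that a newer label should replace an older one for the same user, which fits the case where users switch groups over time. The function names and the dictionary layout (User ID mapped to `True` for positive, `False` for negative) are hypothetical, and this does not describe how Cortex itself combines uploads.

```python
def merge_labels(existing, updates):
    """Return a new dict in which newer labels replace older ones for the same user."""
    merged = dict(existing)
    merged.update(updates)
    return merged


def split_sets(labels):
    """Split a {user_id: is_positive} dict into sorted positive and negative ID lists."""
    positives = sorted(uid for uid, is_positive in labels.items() if is_positive)
    negatives = sorted(uid for uid, is_positive in labels.items() if not is_positive)
    return positives, negatives


# Example with made-up IDs: "u2" has switched to the positive set, and "u4" is new.
existing = {"u1": True, "u2": False, "u3": False}
updates = {"u2": True, "u4": False}
positives, negatives = split_sets(merge_labels(existing, updates))
```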
- Classification Performance
- How to Build a Look Alike Pipeline
- How to Build a Regression Pipeline
- How to Build a Future Events Pipeline
- How to Build an Uplift Pipeline
- How to Build a Recommendations Pipeline
Still have questions? Reach out to firstname.lastname@example.org for more info!