Google Cloud Storage

A Data Source in Cortex represents an integration between your Cortex account and a third-party platform used for enterprise data storage. Once your source is connected, data will regularly flow into Cortex where it can be used to build Machine Learning pipelines.

Google Cloud Storage is a service for storing files of any format and size within Google Cloud. In this guide, we’ll walk through how to connect Google Cloud Storage as a Data Source within your Cortex account.

Connecting a Google Cloud Storage Data Source

To connect your Cortex account to a Google Cloud Storage (GCS) bucket, create a Data Source in Cortex by following the steps below. Each Data Source must be associated with a single schema, so if your GCS bucket contains multiple distinct datasets that you’d like to send to Cortex, you should create a separate Data Source for each one.

Step 1: Select Platform

First, navigate to the Data Sources area of the Data tab in Cortex and select the “Google Cloud Storage” icon from the list of available third-party platforms.

Step 2: Define Data

Name your Data Source and specify which of the following dataset types it contains. As noted above, a GCS bucket that holds more than one type of data needs a separate Data Source in Cortex for each type.

Dataset | Description | Example
Events | Timestamped actions taken by your users. | Customer ABC completes a purchase event at time T.
Attributes | Characteristics or traits of your users. | Customer ABC has job title ‘Professor’, and age 49.
Items | Metadata attributes for the items that your users interact with, via events. | Item XYZ has category ‘shoes’, and price 49.99.
ID Mappings | Associations between one set of IDs and another. | Customer ABC has cookie ID 123.

Once you’ve specified which type of dataset your source will provide, Cortex will display a set of schema requirements to which your data must conform. This data should be uploaded to your GCS bucket in CSV, TSV, or JSON files (zipped or unzipped). Click the View Sample <Dataset Type> Files link in Cortex to see and download example files for each dataset type.
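
As a rough illustration of preparing such a file, the Python sketch below writes a few event rows to a CSV file. The column names (user_id, event_type, timestamp), the helper write_events_csv, and the sample row are hypothetical placeholders, not the Cortex schema; use the columns shown in the schema requirements and sample files in Cortex instead.

```python
import csv
from pathlib import Path

# Hypothetical columns for illustration only; replace them with the
# columns listed in the Cortex schema requirements for your dataset type.
EVENT_COLUMNS = ["user_id", "event_type", "timestamp"]


def write_events_csv(rows, path):
    """Write a list of event dicts to a CSV file with a header row."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=EVENT_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Hypothetical sample row: customer ABC completes a purchase.
sample_rows = [
    {"user_id": "ABC", "event_type": "purchase", "timestamp": "2021-05-03T14:30:00Z"},
]
write_events_csv(sample_rows, "sample_events_file.csv")
```

csv.DictWriter raises a ValueError if a row contains a key that is not in EVENT_COLUMNS, which can catch some schema mismatches before a file is uploaded.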

Additionally, these files should be organized into subfolders that are named based on the date of upload (YYYY-MM-DD). For example, the file path of a set of events data uploaded on May 3rd, 2021 might look like:

Example Upload:

gs://my-bucket/my-events-folder/2021-05-03/sample_events_file.csv
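
The date-based layout can be generated rather than typed by hand. The following Python sketch builds a path in this shape; the function name build_upload_path and the bucket, folder, and file names are hypothetical and only illustrate the convention described above.

```python
from datetime import date


def build_upload_path(bucket, folder, upload_date, filename):
    """Return a gs:// path with a YYYY-MM-DD subfolder for the upload date."""
    date_folder = upload_date.isoformat()  # date.isoformat() yields YYYY-MM-DD
    # Strip stray slashes so the folder joins cleanly with its neighbors.
    return f"gs://{bucket}/{folder.strip('/')}/{date_folder}/{filename}"


# Hypothetical bucket, folder, and file names.
print(build_upload_path("my-bucket", "my-events-folder", date(2021, 5, 3), "sample_events_file.csv"))
```

Passing date.today() as upload_date gives the subfolder for the current day.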


Step 3: Authorize Cortex to Access Your GCS Bucket

Before connecting the Data Source, you must first authorize Cortex to access your Cloud Storage bucket. To do so, log in to your Google Cloud Storage console in a new browser window and take the following steps.

1. Find the bucket that you’d like to connect to Cortex. Click the three vertical dots to the right of the row, and select Edit bucket permissions from the list of options.

2. Paste the email address partners@vidora-production.iam.gserviceaccount.com into the “New members” field, and select the “Storage Object Admin” role from the dropdown on the right. Click “Save” when done.
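
If you prefer to script this step, the same permission can in principle be granted through the Cloud Storage API. The Python sketch below is one possible approach, not a Cortex-provided tool. It assumes the google-cloud-storage client library is installed, that your own credentials are allowed to change the bucket's IAM policy, and that the address above is a service account (as its domain suggests). roles/storage.objectAdmin is the IAM identifier for the Storage Object Admin role; the helper names are hypothetical.

```python
CORTEX_SERVICE_ACCOUNT = "partners@vidora-production.iam.gserviceaccount.com"
OBJECT_ADMIN_ROLE = "roles/storage.objectAdmin"  # shown as "Storage Object Admin" in the console


def build_binding(service_account_email):
    """Build the IAM binding that gives a service account the role."""
    return {
        "role": OBJECT_ADMIN_ROLE,
        "members": {f"serviceAccount:{service_account_email}"},
    }


def grant_cortex_access(bucket_name):
    """Add the binding to the bucket's IAM policy.

    Needs network access and credentials that may edit the bucket's policy.
    """
    # Third-party dependency, imported here so build_binding works without it:
    #   pip install google-cloud-storage
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(build_binding(CORTEX_SERVICE_ACCOUNT))
    bucket.set_iam_policy(policy)
```

Calling grant_cortex_access("my-bucket"), with your own bucket name in place of the hypothetical my-bucket, would apply the change. The console steps above remain the documented route.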

Step 4: Connect Your Bucket

Once you’ve authorized Cortex to access the bucket, head back to your Cortex account and hit “Next” to proceed to the Connect step. On this screen, enter the name and folder path of the GCS bucket that you just authorized Cortex to access.
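
If you only have a full gs:// path on hand, the bucket name and folder path can be separated as in the Python sketch below. The function name split_gcs_uri and the example path are hypothetical, and the exact form Cortex expects for the folder path (for example, with or without a trailing slash) is an assumption to check against the Connect screen.

```python
from urllib.parse import urlsplit


def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket name, folder path without outer slashes)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parts.netloc, parts.path.strip("/")


# Hypothetical bucket and folder names.
bucket_name, folder_path = split_gcs_uri("gs://my-bucket/my-events-folder/")
print(bucket_name)
print(folder_path)
```

Here the bucket name is my-bucket and the folder path is my-events-folder.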

Step 5: Activate the Connection

After you’ve entered your bucket details, hit “Connect” to activate the integration. Cortex will automatically test whether the connection is valid before creating the Data Source.

Once your Data Source is live, any valid file that is uploaded into your GCS bucket will be ingested into Cortex. You may click into your newly created Data Source in order to view information about its schema, and a snapshot of sample data recently received by Cortex.

Still have questions? Reach out to support@vidora.com for more info!

