It’s never too early to set up a data lake

I recently helped one of the startup companies I’m advising, Mindspand, get set up with data collection. Mindspand is a lifelong learning course discovery platform that makes it easy for consumers to find a rich variety of online and hands-on classes and workshops matching their lifestyles, budgets, and interests. Large companies may manage millions of calls per day with a dedicated team of engineers; Mindspand measures their searches in the hundreds. They didn’t need sophisticated tools, and they didn’t have the staff to maintain a complex system. But they did want to understand how their APIs were performing, and to piece together a customer’s journey from the backend to supplement the information Google Analytics gave them about the site.

I liked that Mindspand wanted to make data-driven decisions from the start. They are pitching machine learning as a key differentiator as they scale. While that seems like the fashionable thing to pitch today, not all companies have this data-driven mindset in their DNA.

I call the infrastructure I set up “data droplets.” Not quite large enough to be a lake. In fact, it’s small enough that you can view the raw files as customers visit the site (which is quite fun)! But the architecture gives them a pattern to iterate on as they scale.

Setup

Data Pipeline Architecture

Mindspand’s architecture is in AWS. It consists of a front-end, an Express backend, and a Postgres database. Here’s how I put the data lake together:

  • I started in the front-end by assigning a session ID when a user first visits the site. This gets sent to the backend and can later be used to tie calls together
  • In the Express application, I created a middleware component to process the request and response. It packs the request and response details into a JSON object and posts it to Amazon’s Simple Queue Service (SQS)
  • A Lambda function processes objects on the queue with some light transformation (e.g. stripping PII)
  • Finally, the Lambda function writes the results out to S3, with a combination of the session ID and timestamp keeping the object name unique (the middleware and Lambda are both sketched below)
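To make the middleware step concrete, here’s a minimal TypeScript sketch of what that Express component could look like. It assumes the front-end passes the session ID in an x-session-id header and that the queue URL lives in an environment variable; those names are illustrative, not Mindspand’s exact implementation.

```typescript
import { Request, Response, NextFunction } from "express";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.API_EVENTS_QUEUE_URL!; // hypothetical env var

// Express middleware: capture request/response details and push them onto SQS.
export function apiEventLogger(req: Request, res: Response, next: NextFunction) {
  const startedAt = Date.now();

  res.on("finish", () => {
    const event = {
      sessionId: req.header("x-session-id") ?? "unknown", // assigned by the front-end
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      durationMs: Date.now() - startedAt,
      timestamp: new Date(startedAt).toISOString(),
    };

    // Fire and forget: a dropped analytics event should never fail the user's request.
    sqs
      .send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: JSON.stringify(event) }))
      .catch((err) => console.error("failed to enqueue api event", err));
  });

  next();
}
```

The Lambda on the other side of the queue is just as small. This sketch assumes an illustrative list of PII fields and an events/<sessionId>/<timestamp>.json key layout; the real list and layout were specific to Mindspand’s payloads.

```typescript
import { SQSEvent } from "aws-lambda";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = process.env.DATA_DROPLETS_BUCKET!; // hypothetical bucket name

// Fields treated as PII and dropped before storage (illustrative list).
const PII_FIELDS = ["email", "phone", "ip"];

// Consume events from SQS, strip PII, and write each one to S3.
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const body = JSON.parse(record.body);

    for (const field of PII_FIELDS) {
      delete body[field];
    }

    // Session ID plus timestamp keeps the object key unique.
    const key = `events/${body.sessionId}/${body.timestamp}.json`;

    await s3.send(
      new PutObjectCommand({
        Bucket: BUCKET,
        Key: key,
        Body: JSON.stringify(body),
        ContentType: "application/json",
      })
    );
  }
}
```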

Queries

Of course, what fun is a data lake if you can’t query it? At first, I wrote some queries that would run directly against the data lake. With only a few hundred entries a day, it wasn’t hard to scan S3 for the files we wanted for a given query. But as the weeks went on and their traffic grew, there were soon tens of thousands of records to process.
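The “query the lake directly” approach amounted to listing the bucket and filtering keys. A rough sketch of that pattern (the bucket name and matching predicate are placeholders):

```typescript
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Page through every key in the bucket and keep the ones that match.
// Fine at a few hundred objects a day; painful at tens of thousands.
export async function findKeys(
  bucket: string,
  matches: (key: string) => boolean
): Promise<string[]> {
  const keys: string[] = [];
  let continuationToken: string | undefined;

  do {
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: bucket, ContinuationToken: continuationToken })
    );
    for (const obj of page.Contents ?? []) {
      if (obj.Key && matches(obj.Key)) keys.push(obj.Key);
    }
    continuationToken = page.NextContinuationToken;
  } while (continuationToken);

  return keys;
}
```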

To manage this, I considered putting Hive on top of the data lake. In the end, though, I decided to use DynamoDB to store the metadata as described in this article. Given the desire to keep the architecture simple, I didn’t want to manage an EMR cluster just to run Hive queries. Using DynamoDB to save the metadata was straightforward.

S3 bucket metadata stored in DynamoDB

First, I created a table that uses the SessionID as the partition key and the timestamp as the sort key. Then, I created a Lambda function to process writes on the S3 bucket and store an entry in this table with the key of the file in S3. This allowed me to query full sessions at a time, as well as individual date ranges. The modifications to the reports were easy: there was an existing function that found matching S3 keys by scanning through every key in the bucket, and I changed it to instead query the DynamoDB table for the key list.
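Here’s a sketch of what those two pieces could look like, assuming the table and attribute names shown (SessionID partition key, Timestamp sort key, an S3Key attribute) and the events/<sessionId>/<timestamp>.json key layout from earlier; treat the names as illustrative rather than Mindspand’s exact schema.

```typescript
import { S3Event } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.METADATA_TABLE!; // hypothetical table name

// Triggered by object-created events on the data-droplets bucket:
// record one metadata row per file.
export async function handler(event: S3Event): Promise<void> {
  for (const record of event.Records) {
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // Assumed key layout: events/<sessionId>/<timestamp>.json
    const [, sessionId, file] = key.split("/");
    const timestamp = file.replace(/\.json$/, "");

    await ddb.send(
      new PutCommand({
        TableName: TABLE,
        Item: { SessionID: sessionId, Timestamp: timestamp, S3Key: key },
      })
    );
  }
}

// Report helper: fetch the S3 keys for a session, optionally within a date range,
// without scanning the bucket.
export async function keysForSession(
  sessionId: string,
  from?: string,
  to?: string
): Promise<string[]> {
  const ranged = Boolean(from && to);
  const result = await ddb.send(
    new QueryCommand({
      TableName: TABLE,
      KeyConditionExpression: ranged
        ? "SessionID = :sid AND #ts BETWEEN :from AND :to"
        : "SessionID = :sid",
      // "Timestamp" is a reserved word in DynamoDB expressions, hence the alias.
      ExpressionAttributeNames: ranged ? { "#ts": "Timestamp" } : undefined,
      ExpressionAttributeValues: ranged
        ? { ":sid": sessionId, ":from": from, ":to": to }
        : { ":sid": sessionId },
    })
  );
  return (result.Items ?? []).map((item) => item.S3Key as string);
}
```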

Going forward

Is it a stretch to think that “data droplets” are going to power machine learning? Perhaps. But I’ve already seen Mindspand use this data set to make product decisions like improving the discoverability of search features.

It starts with manual queries to dig into specific problems. From there, more will become automated. The queries and the needs will grow more sophisticated as more data comes into the system. And they’re on the path to machine learning.