In Part 1 of this series, I described how we used Spark Streaming and Kafka to create some new reporting infrastructure. I described a system that uses idempotent writes and stable batches to ensure the integrity of our pre-aggregated metrics in the face of system failure. This week, I’m going to explain how we drastically improved the read performance of the system. In order to get idempotent writes, we create a new record for each batch of a metric. Our table schema last week looked like:
CREATE TABLE opens_per_delivery (   batch_id BIGINT,   delivery_id BIGINT,   num_opens BIGINT,   UNIQUE INDEX ix_batch_unique (batch_id, delivery_id) );  
If we were to only use this table directly, we’d have a system that ensured correctness and allowed our aggregation (and potentially data storage writes) to scale horizontally, but we’d have missed our goal of having fast access to the data. We’d be issuing queries like:
SELECT SUM(num_opens) FROM opens_per_delivery WHERE delivery_id=?  
As the number of batches covering this delivery grows, this query will become more and more expensive. Deliveries in Bronto are reasonably short-lived, so this might not be a problem, but Messages can exist for years. We solve this by introducing an additional compaction inside of our database. We introduce the notion of an Intake Table, which is a table (or analogous data separation concept) where we write records broken out by batch as described above. We then have one or more downstream tables that compact across batches. In order to keep track of which rows have been compacted, we create a total ordering over the intake rows using an Intake Id. Batches may be processed in arbitrary orders (in failure cases, for instance, we may process new batches before we complete old ones), but the intake ids will always be increasing. By transactionally keeping track of the maximum Intake Id that has been compacted into the downstream table, we can safely increment the counts in the downstream row. At some point in the future, these rows can be removed from the database in order to free up space. CompactionWe still might do some merging of results from the intake table at read time, but now we only have to merge from the index where the last compaction finished. Within components of your distributed system, you can create rich transactional semantics like our compaction. Between components, we must instead rely on idempotency and at-least-once semantics. Without those guarantees, systems are likely to end up with erroneous or incomplete data in the face of failure. It’s tempting to expect tools like Spark and Kafka to solve these problems for you, but ultimately, the systems we design have to ensure safety across database, framework and queue boundaries. Once we came up with a safe design, we were able to iterate on it to get the performance guarantees we require. See also: Renoyld Xin is the Chief Architect at Databricks, the commercial entity behind Spark, and has a lot to say about streaming. Check out the deck from his recent talk on the future of Spark Streaming. For those of you thinking about incorporating Spark into your stack in the next year or so, this is a great preview of what is coming down the pike.