An Overview of Apache Streaming Technologies
There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering amount of options with sometimes few major differences that are not well documented or easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.
I am known to write large posts, but today I want to make an exception. Without further ado, here’s the overview (click to see the full-screen image):
Flume and NiFi definitely belong into the data collection (DC) and single-event processor (SEP) bucket, whereas the others are more event stream processor (ESP) engines or complex event processors (CEP). Obviously Spark itself and Ignite are more than mere stream processors, but they are listed here because they offer streaming capabilities. The same can be said of Apex too, which is a platform that unifies batch and streaming data, which places it somewhere between a data collection engine, a rudimentary ETL tool, and an event stream processor.
The ’N/A’ in the row with ‘back-pressure’ for Kafka Streams indicates that there is no back-pressure per se, because the queue is managed by Kafka and in itself limited only by the amount of disk space available. After all, Kafka allows messages to be replayed at leisure. Kafka Streams is a Java library for Kafka, the message delivery system initially developed by LinkedIn. Exactly-once semantics and transactions will be added to Kafka in a future version. Kafka is used by LinkedIn, Netflix, and Spotify, to name but a few.
Auto-scaling is sometimes also known as elastic scaling, dynamic scaling, dynamic (resource) allocation, and dynamic work re-balancing. Note that auto-scaling is not the same as a load balancer. A load balancer ‘simply’ distributes the workload, but when the workload becomes too much for the cluster or there are more resources available than needed, a load balancer does not scale up or down.
With ‘in-flight modifications’ I mean the ability to modify data flows on the fly, that is without any downtime or applications re-submission. This is sometimes also called zero-downtime expansion, and ad-hoc or dynamic application modifications. With regard to in-flight modifications, Spooker is a project that is perhaps worth a second look.
Beam is an SDK that requires a runner (e.g. Apex, Flink, Spark, or Google Cloud DataFlow). The entries with an asterisk in the column underneath Beam are for Google Cloud DataFlow only; these entries really depend on the runner that’s being used. If for instance, you choose the Spark runner and you run Spark on Mesos, then technically the resource manager is Mesos, but that fact is hidden from Beam. The advantage of Beam is that the code stays the same but the underlying runner can be exchanged, provided it has been fully implemented.
Gearpump offers at-least-once semantics for replayable sources (e.g. Kafka) and exactly-once semantics when such sources are combined with the persistence of checkpoints to a fault-tolerant source (e.g. HDFS). Prioritization is baked into the software, at least for streaming applications, not natively for individual messages. Support for Mesos is at least listed as a possible future addition. Likewise, a runner for Beam may be in the works at some later date.
For the sake of balance, I want to mention that there are of course many other (commercial) options in the CEP market, for instance: Software AG’s Apama, Microsoft StreamInsight, Oracle Event Processing, SAP ESP, Tibco BusinessEvents, and Streambase. I have listed these in no particular order.
The information in the table has been compiled from the official project pages, by digging through the source code, and from the following sources:
- Cake Solution’s two-part comparison of streaming technologies.
- MapR’s overview of streaming technologies.
- Google’s comparison of DataFlow and Spark Streaming, which I have summarized on Read That For Me.
- Taylor Goetz’s Storm vs Spark Streaming slides.
- A discussion on StackOverflow about Flink vs Storm.
- Databricks’ post on improved fault tolerance in Spark Streaming thanks to WALs.
- data Artisans’ article on Flink’s design.
- MapR’s in-depth article on Flink.
- An article by MapR on Apex.
- Hortonworks’ information on NiFi’s fault tolerance in site-to-site data flows as well as a follow-up piece.
- Confluent’s documentation of Kafka Streams.
- Guozhang Wang’s presentation on Kafka Streams.
- Intel’s description of Gearpump.
- Lightbend’s background on Akka inside Gearpump as well as their detailed overview of the concepts behind it. The table is not intended to be complete, it is unlikely to be entirely accurate, and it will almost certainly to be out-of-date in no time. In case you do discover grave inaccuracies, please let me know and I’ll try and set the record straight.
If you’re interested in a similar overview of file formats and SerDes for Hadoop, please have a look at my previous post on the topic.