Apache Spark vs. Apache Flink
Apache Spark vs. Apache Flink – Introduction
Apache Flink, the high-performance big data stream processing framework, is reaching a first level of maturity. We examine comparisons with Apache Spark and find that Flink is a competitive technology that is easily recommended as a real-time analytics framework.
Since the olden days of Hadoop MapReduce, big data processing frameworks have certainly moved on. Spark in particular has not only massively increased performance but, more importantly, democratized Hadoop by allowing (comparatively) quick and simple development of big data analytics applications.
Enter Apache Flink: not necessarily a new technology compared to its competitors, but one that is rapidly gaining momentum and moving towards enterprise-readiness. Flink supports both streaming and batch paradigms for processing big data, while also shipping with integrated support for machine learning and graph processing.
But, in the current deluge of technologies, do we really need another new technology for data processing? On paper, Apache Spark, which has become a very popular tool over the last couple of years, offers similar features and capabilities. Curt Monash’s statement on this topic was: “Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded”. Hence, by comparing Apache Spark vs. Apache Flink, we aim to determine whether Flink is able to compete with Spark, or whether it is merely another “unneeded” tool in the ever-growing big data ecosystem. We start by providing a brief side-by-side overview of the two technologies.
Apache Spark vs. Apache Flink – What do they have in common?
Both are open source tools developed within the organizational framework of the Apache Software Foundation. Each can be used as a standalone solution, but they are often integrated into a big data environment, e.g. Hadoop (YARN, HDFS), often together with Apache Kafka. Both Spark and Flink offer a similar range of features and APIs, for example support for SQL queries, graph processing, as well as batch and stream processing.
|Feature||Apache Flink||Apache Spark|
|SQL queries||MRQL||Spark SQL|
|Graph Processing||Spargel (base), Gelly (library)||GraphX|
|Stream Processing||Flink Streaming||Spark Streaming|
|APIs||Scala, Java & Python||Scala, Java, Python & R|
|Machine Learning||Flink ML||MLlib & ML|
|Deployment||Standalone, Mesos, EC2, YARN||Standalone, Mesos, EC2, YARN|
|Connectors||Kafka, Amazon S3, ElasticSearch, Twitter, etc.||Kafka, Amazon S3, ElasticSearch, Twitter, etc.|
Equivalent code samples for the two frameworks show that the APIs are very similar, but not feature-identical. Each code sample reads a CSV file containing a list of sold items and subsequently computes the most frequently sold product. At first glance, the approaches of the two technologies are strongly similar and the advantages and disadvantages seem to be in balance. Only when we dive deeper into the features of each framework do the differences become apparent. In this particular example, we can see that the maxBy function in Flink is not (yet?) natively supported in Spark and needs to be worked around using Spark’s windowing capability, but largely the APIs allow for a similar construction of a data processing pipeline.
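Independent of either framework, the computation described here can be sketched in plain Python. This is only an illustration of the pipeline steps (read, group and count, pick the maximum), not Spark or Flink code. The column names `order_id` and `product`, the sample rows, and the function name `most_sold_product` are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the article does not show the real input file.
SAMPLE_CSV = """order_id,product
1,apple
2,pear
3,apple
4,plum
5,apple
6,pear
"""

def most_sold_product(csv_text):
    """Return (product, count) for the product that appears in the most rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # "Group by product and count": the aggregation step of the pipeline.
    counts = Counter(row["product"] for row in reader)
    # "maxBy count": select the group with the largest count.
    return max(counts.items(), key=lambda item: item[1])

result = most_sold_product(SAMPLE_CSV)
print(result)
```

For the sample rows, `apple` appears three times and is therefore selected. The final `max(..., key=...)` call plays the role that maxBy plays in the Flink version.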
Mini buckets vs. water hose
The key difference between Spark and Flink lies in the computational concept underlying each framework. Spark uses a batch concept for both batch and stream processing, whereas Flink is based on a pure streaming approach. Imagine you need to collect and transport water: Spark handles incoming data as a sequence of fixed-size buckets, while Flink sends incoming data drop by drop directly through a water hose. The differences between Flink and Spark are listed in the following table.
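The bucket and hose picture can be made concrete with a small plain-Python sketch. It uses neither framework; the function names `micro_batches` and `per_event` and the sample values are assumptions for illustration. Spark Streaming forms its batches by time interval, so the count-based buckets below are a simplification chosen to keep the example deterministic.

```python
def micro_batches(events, batch_size):
    """Bucket model: collect events into fixed-size batches before processing."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, partially filled bucket
        yield batch

def per_event(events):
    """Hose model: hand every event downstream as soon as it arrives."""
    for event in events:
        yield event

events = [3, 1, 4, 1, 5, 9, 2]

# Each bucket is processed as a unit, so one result appears per batch.
batch_sums = [sum(batch) for batch in micro_batches(events, batch_size=3)]

# Each event updates the result immediately, so one result appears per event.
running_sums = []
total = 0
for event in per_event(events):
    total += event
    running_sums.append(total)

print(batch_sums)
print(running_sums)
```

In the bucket model a result only becomes visible once a batch is complete, which is where the added latency comes from. In the hose model every event can produce an updated result right away.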
|Feature||Apache Spark||Apache Flink|
|Streaming||Micro-batch streaming||Event-based streaming|
|Batch Processing||In-memory processing. Operations are coordinated by Directed Acyclic Graphs (DAGs)||Stream-first approach (Kappa architecture): treats batches as streams with a limit|
|Optimization||Whole-stage code generation for optimized data processing; Dataset-based queries offer optimized execution plans. Manual memory tuning is very important||“Automatic optimization”: a suitable method is chosen depending on the input, output and operations. C++-style memory management within the JVM|
|Data re-utilization & iterations||Execution plans are DAGs, which implies that the same set of instructions has to be scheduled and run in each iteration. Re-used data is cached in memory.||Iterative processing in the engine itself, based on cyclic data flows (one iteration, one schedule). Additionally, it offers delta iterations for operations that change only part of the data|
|Latency||The micro-batch model leads to high latency in the range of seconds||Low latency in the range of milliseconds|
|Out-of-Order Streams||The new release added some basic methods for event-time processing||Using event time, out-of-order events can be processed accurately|
|Support||Supported by all big Hadoop distributions (Cloudera, Hortonworks, etc.). Databricks provides a cloud platform and support packages.||Mostly via mailing lists or forums.|
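The "Out-of-Order Streams" row is easier to follow with a small example. The following plain-Python sketch (no framework involved) groups the same events into 10-second tumbling windows, once by arrival time and once by event time. The window size, the sample events, and the helper names `window_start` and `group_into_windows` are assumptions for illustration. Real engines additionally need a mechanism such as watermarks to decide when a window may be closed, which is omitted here.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # seconds per tumbling window (assumed value)

# (arrival_time, event_time, value) in arrival order. The event "e" was
# created at second 4 but only arrives at second 17, i.e. out of order.
arrivals = [
    (1, 1, "a"),
    (8, 7, "b"),
    (13, 12, "c"),
    (16, 15, "d"),
    (17, 4, "e"),
    (22, 21, "f"),
]

def window_start(timestamp, size=WINDOW_SIZE):
    """Start of the tumbling window that contains the given timestamp."""
    return (timestamp // size) * size

def group_into_windows(events, time_of):
    """Assign each event's value to a window chosen by time_of(event)."""
    windows = defaultdict(list)
    for event in events:
        windows[window_start(time_of(event))].append(event[2])
    return dict(windows)

# Processing-time view: the late event "e" lands in the wrong window.
by_arrival = group_into_windows(arrivals, time_of=lambda e: e[0])

# Event-time view: "e" is counted in the window in which it actually occurred.
by_event_time = group_into_windows(arrivals, time_of=lambda e: e[1])

print(by_arrival)
print(by_event_time)
```

Grouping by arrival time puts the late event into the 10 to 20 second window, whereas grouping by event time places it correctly in the first window.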
The published results for batch processing performance vary somewhat, depending on the specific workload. The Terasort benchmark shows Flink 0.9.1 being faster than Spark 1.5.1. Regarding the performance of the machine learning libraries, Apache Spark has been shown to be the framework with faster runtimes (Flink 1.0.3 against Spark 1.6.0). In September 2016, Flink and Spark were analyzed with regard to their performance on several batch and iterative processing benchmarks. It was shown that Spark is 1.7x faster than Flink for large graph processing, while Flink is up to 1.5x faster for batch and small graph workloads and uses fewer resources. It seems to be a neck-and-neck race between the tools.
The take-away from these performance comparisons is that, to select the faster framework, you have to benchmark your specific workload. Surprisingly for such a hot topic, there are very few public comparisons of recent versions of Spark and Flink (Spark 2.2 and Flink 1.3). This is troublesome, as both platforms have made impressive performance gains even over the past year. In part 2 of our blog, we will provide our own detailed performance comparison, so stay tuned!
Closing thoughts on Apache Spark vs. Apache Flink
Big data, as the triple challenge of ever-increasing volume, high demands on quality, and the demand for ever quicker business insight, continues to require technologies that remain performant with regard to latency and throughput at any scale, while allowing for quick development and high code quality.
If requirements for data stream processing with high throughput, low latency and good fault tolerance are the drivers of development, Flink provides an excellent application framework. If the application should be embedded in a Hadoop distribution like Hortonworks or Cloudera, then Spark would be a better choice, as it is well integrated into the respective platforms and comes with vendor support. Flink and Spark are both continuously improving to offer easier, faster, and smarter data processing features.
Ultimately, the choice of the best framework depends on the question, “which one is more suitable for my requirements?” Even the favorite programming language of the development team can be a crucial factor: Spark’s Java API is derived from the Scala API, which can occasionally lead to unattractive Java code. Data engineers often prefer Python or Scala, which Spark supports with more mature, feature-complete and faster APIs. Spark’s tight integration with R, “the golden child of data science”, makes Spark available within R and thereby integrates well into existing data science toolboxes.
One of Spark’s most touted features is speed, as it can “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk”. Flink provides strong competition, with often similar performance in batch processing and significantly lower latency for stream processing. While the community “hype” around Spark appears to transfer over to Flink, only the future will tell how much impact this has on actual market share.
- https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
- https://spark.apache.org/
- Eventtime-Processing (20-Apr-2017)
- Learn about Apache Flink (07-Apr-2016)
- Rumble in the Big Data Jungle (15-Jul-2016)
- Storm vs. Samza vs. Spark vs. Flink (28-Oct-2016)
- Tuning Spark
- Benchmarking Streaming Computation Engines at Yahoo!
- Terasort for Apache Spark and Apache Flink
- Reproducible experiments on cloud
- García-Gil, D., Ramírez-Gallego, S., García, S. et al. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Analytics, March 2017, 2: 1.
- Inoubli Wissem, Aridhi Sabeur, Mezni Haithem, Jung Alexander. An Experimental Survey on Big Data Frameworks. Mar 2017.
- Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández. Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks. Cluster 2016 – The IEEE 2016 International Conference on Cluster Computing, Sep 2016, Taipei, Taiwan.
- Apache, Spark, Flink, Hadoop and Apache project logos are either registered trademarks or trademarks of The Apache Software Foundation (Copyright © 2014-2017) in the United States or other countries. Sources of images used: https://spark.apache.org/ and https://flink.apache.org/