By Melanie Däschinger

Apache Spark vs. Apache Flink

Apache Spark vs. Apache Flink – Introduction


Apache Flink, the high-performance big data stream processing framework, is reaching a first level of maturity. We compare it with Apache Spark and find that it is a competitive technology that is easy to recommend as a real-time analytics framework.

Big data processing frameworks have come a long way since the early days of Hadoop MapReduce. Spark in particular has not only increased performance massively but, more importantly, democratized Hadoop by allowing (comparatively) quick and simple development of big data analytics applications.

Enter Apache Flink: not necessarily a new technology compared to its competitors, but one that is rapidly gaining momentum and moving towards enterprise-readiness. Flink supports both the streaming and the batch paradigm for processing big data, and ships with integrated support for machine learning and graph processing.

But in the current deluge of technologies, do we really need another tool for data processing? On paper, Apache Spark, which has become very popular over the last couple of years, offers similar features and capabilities. Curt Monash’s statement on this topic was: “Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded” [1]. By comparing Apache Spark and Apache Flink, we therefore aim to determine whether Flink is able to compete with Spark, or whether it is merely another “unneeded” tool in the ever-growing big data ecosystem. We start with a brief side-by-side overview of the two technologies.

Apache Spark vs. Apache Flink – What do they have in common?

Both are open source tools developed under the umbrella of the Apache Software Foundation. Each can be used as a standalone solution, but they are often integrated into a big data environment such as Hadoop (YARN, HDFS and often Apache Kafka). Spark and Flink offer a similar range of features and APIs, for example support for SQL queries, graph processing, and batch and stream processing [3] [4] [5].

Table: Apache Spark vs. Apache Flink

| | Apache Flink | Apache Spark |
|---|---|---|
| SQL queries | MRQL | Spark SQL |
| Graph Processing | Spargel (base), Gelly (library) | GraphX |
| Stream Processing | Flink Streaming | Spark Streaming |
| APIs | Scala, Java & Python | Scala, Java, Python & R |
| Machine Learning | Flink ML | MLlib & ML |
| Stable version | 1.3.2 | 2.2.0 |
| Throughput | High | High |
| Fault-Tolerance | Exactly-once guarantee | Exactly-once guarantee |
| Deployment | Standalone, Mesos, EC2, YARN | Standalone, Mesos, EC2, YARN |
| Connectors | Kafka, Amazon S3, ElasticSearch, Twitter, etc. | Kafka, Amazon S3, ElasticSearch, Twitter, etc. |

 

Figure: Comparison of a batch Java program in Apache Spark and Apache Flink

The code snippets in the figure demonstrate that the APIs are very similar, but not feature-identical. Each sample reads a CSV file containing a list of sold items and then computes the most frequently sold product. At first glance, the two approaches are strongly similar, and their advantages and disadvantages seem balanced. Only when we dive deeper into the features of each framework do the differences become apparent. In this particular example, we can see that Flink’s maxBy function is not (yet?) natively supported in Spark and has to be worked around using Spark’s windowing capability. By and large, though, the APIs allow a data processing pipeline to be constructed in a similar way.
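To make the shape of that pipeline concrete, here is a minimal sketch in plain Python. It deliberately uses neither the Spark nor the Flink API and only mirrors the three steps described above: read the CSV, count per product, pick the maximum. The sample data and the column names `order_id` and `product` are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for the CSV file of sold items.
SALES_CSV = """order_id,product
1,apple
2,pear
3,apple
4,plum
5,apple
6,pear
"""


def most_sold_product(csv_text):
    """Return (product, count) for the most frequently sold product."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # "Group by product and count" step of the pipeline.
    counts = Counter(row["product"] for row in reader)
    # "maxBy" step: pick the (product, count) pair with the largest count.
    return max(counts.items(), key=lambda pair: pair[1])


result = most_sold_product(SALES_CSV)
print(result)  # ('apple', 3)
```

In both frameworks the same logic is expressed as a chain of transformations on a distributed dataset; the difference noted above concerns only how the final "pick the maximum" step is written.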

Mini buckets vs. water hose

The key difference between Spark and Flink is the computational concept underlying each framework. Spark uses a batch concept for both batch and stream processing, whereas Flink is based on a pure streaming approach. Imagine you need to collect and transport water: Spark handles incoming data as a sequence of fixed-size buckets, while Flink sends incoming data drop by drop directly through a water hose. The differences between Flink and Spark [5] [6] are listed in the following table.
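The bucket and hose picture can be sketched in a few lines of plain Python. This is a toy simulation, not how either engine is implemented: the function names are hypothetical, and the bucket size is counted in events to keep the example deterministic.

```python
def process_per_event(events, handle):
    """Hose model: hand every event to the handler as soon as it arrives."""
    for event in events:
        handle([event])


def process_micro_batches(events, handle, batch_size):
    """Bucket model: collect events and hand them over one bucket at a time."""
    bucket = []
    for event in events:
        bucket.append(event)
        if len(bucket) == batch_size:
            handle(bucket)
            bucket = []
    if bucket:  # flush the last, partially filled bucket
        handle(bucket)


events = list(range(1, 8))

hose_calls = []
process_per_event(events, hose_calls.append)

bucket_calls = []
process_micro_batches(events, bucket_calls.append, batch_size=3)

print(hose_calls)    # [[1], [2], [3], [4], [5], [6], [7]]
print(bucket_calls)  # [[1, 2, 3], [4, 5, 6], [7]]
```

In the bucket model, the first event cannot be handled before the third one has arrived. That waiting time is where the latency difference between the two approaches comes from.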

Table: Apache Spark vs. Apache Flink

| | Apache Spark | Apache Flink |
|---|---|---|
| Streaming | Micro-batch streaming | Event-based streaming |
| Batch Processing | In-memory processing. Operations are coordinated by directed acyclic graphs (DAGs) | Stream-first approach (Kappa architecture): treats a batch as a stream with a limit |
| Implemented in | Scala | Java |
| Optimization | Whole-stage code generation for optimized data processing; Dataset-based queries offer optimized execution plans. Manual memory tuning is very important [7] | “Automatic optimization”: a suitable method is chosen depending on the input, output and operations. C++-style memory management within the JVM |
| Data re-utilization & iterations | Execution plans are DAGs, which implies that the same set of instructions has to be scheduled and run in each iteration. Re-used data is cached in memory | Iterative processing in the engine itself, based on cyclic data flows (one iteration, one schedule). Additionally, it offers delta iterations for operations that change only part of the data |
| Latency | The micro-batch model leads to high latency in the range of seconds | Low latency in the range of milliseconds [8] |
| Out of Order Streams | The new release added some basic methods for event-time processing | Using event time, out-of-order events can be processed accurately |
| Support | Supported by all big Hadoop distributions: Cloudera, Hortonworks, etc. Databricks provides a cloud platform and support packages | Mostly via mailing lists or forums |
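The "Out of Order Streams" row is easier to follow with a small example. The sketch below, in plain Python with hypothetical data, assigns each event to a tumbling window based on the timestamp the event carries rather than on its arrival position. It leaves out watermarks and late-data handling, which a real engine needs in order to decide when a window is complete.

```python
from collections import defaultdict


def count_per_event_time_window(events, window_seconds):
    """Count events per tumbling window, keyed by the window start time.

    Each event is (event_time_seconds, payload). The window is derived from
    the timestamp carried by the event, not from its arrival position.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical events in arrival order; the event with timestamp 4 arrives
# after the event with timestamp 12, i.e. out of order.
arrivals = [(1, "a"), (3, "b"), (12, "c"), (4, "d"), (15, "e"), (21, "f")]

window_counts = count_per_event_time_window(arrivals, window_seconds=10)
print(window_counts)  # {0: 3, 10: 2, 20: 1}
```

The late event still lands in the window starting at 0. Grouping by arrival order instead would have counted it together with the events around timestamp 12.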

 

Performance

The published results for batch processing performance vary somewhat, depending on the specific workload. The Terasort benchmark [9] [10] shows Flink 0.9.1 being faster than Spark 1.5.1. Regarding the performance of the machine learning libraries, Apache Spark has been shown to be the framework with faster runtimes (Flink 1.0.3 against Spark 1.6.0) [11] [12]. In September 2016, Flink and Spark were analyzed on several batch and iterative processing benchmarks [13]. The results showed Spark to be 1.7x faster than Flink for large graph processing, whereas Flink was up to 1.5x faster for batch and small graph workloads while using fewer resources. It seems to be a neck-and-neck race between the two tools.

The take-away from these performance comparisons is that, to select the faster framework, you have to benchmark your specific workload. Surprisingly for such a hot topic, there are very few public comparisons of recent versions of Spark and Flink (Spark 2.2 and Flink 1.3). This is troublesome, as both platforms have made impressive performance gains even over the past year. In part 2 of our blog, we will provide our own detailed performance comparison, so stay tuned!
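A workload-specific comparison does not need much scaffolding. The following is a minimal timing harness in plain Python; `engine_a` and `engine_b` are hypothetical stand-ins, and in practice each would submit the same job to one of the two frameworks and wait for it to finish.

```python
import time


def benchmark(candidates, workload, repeats=3):
    """Return the best wall-clock time in seconds for each candidate."""
    results = {}
    for name, run in candidates.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(workload)
            timings.append(time.perf_counter() - start)
        # Taking the minimum reduces the influence of one-off disturbances.
        results[name] = min(timings)
    return results


# Hypothetical stand-ins for "run this job on framework A / framework B".
def engine_a(data):
    return sorted(data)


def engine_b(data):
    return sorted(data, reverse=True)


workload = list(range(100_000, 0, -1))
timings = benchmark({"engine_a": engine_a, "engine_b": engine_b}, workload)
fastest = min(timings, key=timings.get)
print(timings, fastest)
```

Repeating each run and using the same input for every candidate are the two points that matter here; the measured numbers themselves depend entirely on the workload and the cluster.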

Closing thoughts on Apache Spark vs. Apache Flink

Big data poses a triple challenge: ever-increasing volume, high demands on quality, and the demand for ever quicker business insight. It continues to require technologies that remain performant in terms of latency and throughput at any scale, while allowing for quick development and high-quality code.

If requirements for data stream processing with high throughput, low latency and good fault-tolerance are the drivers of development, Flink provides an excellent application framework [1]. If the application should be embedded in a Hadoop distribution like Hortonworks or Cloudera, then Spark would be a better choice as it is well integrated into the respective platforms, with vendor support. Flink and Spark are both continuously improving to offer easier, faster, and smarter data processing features.

Ultimately, the choice of the best framework depends on the question, “Which one is more suitable for my requirements?” Even the development team’s favorite programming language can be a crucial factor: Spark’s Java API is derived from the Scala API, which can occasionally lead to unattractive Java code. Data engineers often prefer Python or Scala, which Spark supports with more mature, feature-complete and faster APIs. Spark’s tight integration with R, “the golden child of data science”, makes Spark available from within R and thereby integrates well into existing data science toolboxes.

One of Spark’s most touted features is speed, as it can “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk” [2]. Flink provides strong competition, with often similar performance in batch processing and significantly lower latency in stream processing. While the community “hype” around Spark appears to be transferring over to Flink, only the future will tell how much impact this has on actual market share.

Sources

  • [1] https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
  • [2] https://spark.apache.org/
  • [3] Eventtime-Processing (20-Apr-2017)
  • [4] Learn about Apache Flink (07-Apr-2016)
  • [5] Rumble in the Big Data Jungle (15-Jul-2016)
  • [6] Storm vs. Samza vs. Spark vs. Flink (28-Oct-2016)
  • [7] Tuning Spark
  • [8] Benchmarking Streaming Computation Engines at Yahoo!
  • [9] Terasort for Apache Spark and Apache Flink
  • [10] Reproducible experiments on cloud
  • [11] García-Gil, D., Ramírez-Gallego, S., García, S. et al. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Analytics, March 2017, 2: 1.
  • [12] Inoubli Wissem, Aridhi Sabeur, Mezni Haithem, Jung Alexander. An Experimental Survey on Big Data Frameworks. Mar 2017.
  • [13] Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández. Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks. Cluster 2016 – The IEEE 2016 International Conference on Cluster Computing, Sep 2016, Taipei, Taiwan.
  • Apache, Spark, Flink, Hadoop and Apache project logos are either registered trademarks or trademarks of The Apache Software Foundation (Copyright © 2014-2017) in the United States or other countries. Sources of images used: https://spark.apache.org/ and https://flink.apache.org/

About the author

After completing her master’s degree in computer science at the University of Augsburg, with a focus on software engineering and programming languages, Melanie joined Woodmark in 2016 as a consultant for big data topics. Her work focuses on data engineering and data science within the Hadoop ecosystem, where she uses components such as Hive, HDFS, Spark, Flink, Pig, Oozie, Sqoop, etc. to process, store and analyze large volumes of data.
