Apache® Spark™[1] is a distributed computing framework that uses in-memory processing to speed up analytic applications, reportedly by up to 100 times compared with older technologies such as Hadoop[2]. Hadoop was long the leading open source Big Data framework, but the newer Spark has recently become more popular. The two do not perform exactly the same tasks and are not mutually exclusive; they can work together. Apache Spark can help reduce the complexity of working with data and increase processing speed.

As noted above, fast computation on Big Data is one of Spark’s biggest merits. Beyond that, from the perspective of an engineer working at a startup, Spark offers several further advantages.

1, Flexibility of options for data storage
For distributed storage, Spark can interface with a wide range of systems, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), and Amazon S3. A customised solution can also be implemented, whereas Hadoop is built primarily around HDFS. For startups in particular, Amazon S3[3] is a reasonable option because of its pricing, redundancy, and extensibility.
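To illustrate the point about storage flexibility, here is a minimal sketch in Python (PySpark) in which the same read call targets different back ends just by changing the URI scheme. The helper names `storage_uri` and `read_lines`, the back-end labels, and every host, bucket, and path are hypothetical. The sketch also assumes that the matching connector (for example the S3A connector for `s3a://`) and its credentials are already configured on the cluster.

```python
# URI schemes commonly used through Spark's Hadoop-compatible file system layer.
# The back-end labels on the left are hypothetical names chosen for this sketch.
SCHEMES = {
    "hdfs": "hdfs://",
    "mapr-fs": "maprfs://",
    "s3": "s3a://",
    "local": "file://",
}


def storage_uri(backend, location):
    """Build a URI such as 's3a://my-bucket/logs/app.log' for a back end."""
    if backend not in SCHEMES:
        raise ValueError(f"unsupported storage back end: {backend!r}")
    return SCHEMES[backend] + location


def read_lines(spark, backend, location):
    """Read text lines into a DataFrame.

    `spark` is an existing SparkSession. The call is identical for every
    back end; only the URI changes.
    """
    return spark.read.text(storage_uri(backend, location))
```

With a running session, `read_lines(spark, "s3", "my-bucket/logs/app.log")` and `read_lines(spark, "hdfs", "namenode:8020/logs/app.log")` would go through the same code path, which is what makes it cheap to switch storage later.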

2, Flexibility of options for programming
Hadoop is implemented in Java. Beyond Java itself, several systems built on top of Hadoop provide data summarization, query, and analysis.

・Hadoop Streaming

As shown above, working with Hadoop often requires additional software, which can introduce further complexity and make maintenance harder.
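For a sense of what this looks like in practice, Hadoop Streaming runs a mapper and a reducer as separate programs that exchange tab-separated lines over standard input and output. The following is a minimal word-count sketch of that contract in Python. The function names are hypothetical, and `simulate_streaming_job` only imitates the map, sort, and reduce stages locally; it does not submit anything to a cluster.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts of consecutive records that share a key.

    The framework sorts mapper output by key before the reduce step,
    so records for the same word arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate_streaming_job(lines):
    """Imitate map -> sort -> reduce locally, without a Hadoop cluster."""
    return list(reduce_counts(sorted(map_words(lines))))
```

In a real job each stage would be its own script reading `sys.stdin`, plus the command line that wires the scripts into the job, which is part of the maintenance cost mentioned above.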

On the other hand, although Spark itself is implemented in Scala, it can be used from several programming languages: it provides APIs for Scala, Java, Python, and R.

Moreover, Spark SQL, one of Spark’s APIs, lets users manipulate data by executing queries written in basic SQL syntax.
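As a rough illustration of the Spark SQL workflow, the sketch below registers a small in-memory data set as a temporary view and queries it with plain SQL from Python. The data, the column names, the view name `events`, and the function name are all hypothetical. It assumes that PySpark and a Java runtime are installed, and it runs Spark in local mode rather than on a cluster.

```python
# Hypothetical sample data: (customer, action) pairs.
EVENTS = [("alice", "signup"), ("bob", "login"), ("alice", "login")]

# Plain SQL; only the view name `events` ties it to the Spark code below.
QUERY = """
SELECT customer, COUNT(*) AS n_events
FROM events
GROUP BY customer
ORDER BY customer
"""


def count_events_per_customer(rows):
    """Run QUERY over `rows` with Spark SQL and return (customer, n_events) tuples."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")  # local mode, single thread
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["customer", "action"])
        df.createOrReplaceTempView("events")  # expose the DataFrame to SQL
        result = spark.sql(QUERY).collect()
        return [(row["customer"], row["n_events"]) for row in result]
    finally:
        spark.stop()
```

Calling `count_events_per_customer(EVENTS)` returns one `(customer, n_events)` pair per customer. An engineer who knows SQL but not Scala only needs to touch the query text.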

This flexibility is helpful for a startup that has difficulty finding engineers familiar with a particular programming language, although using the APIs mentioned above can lower processing speed because of their computing overhead.

3, Open-source
Last but not least, Spark is one of the most active open source big data projects, and it is free. Startups often have limited budgets, so Spark seems to be a good starting point.

[1] Apache Spark [http://spark.apache.org/]

[2] Apache Hadoop [http://hadoop.apache.org/]

[3] Amazon Simple Storage Service (Amazon S3) [https://aws.amazon.com/s3/]