Apache Spark is an open-source, distributed processing system commonly used for big data workloads. It uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries.

Apache Spark on Hadoop YARN is natively supported in Amazon EMR, and you can quickly create managed Apache Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. You can also use other Amazon EMR features, including fast Amazon S3 connectivity through the Amazon EMR File System (EMRFS), integration with the Amazon EC2 Spot market, and Auto Scaling to add or remove instances from your cluster. In addition, you can use Apache Zeppelin to create interactive and collaborative notebooks for data exploration with Apache Spark.

By using a directed acyclic graph (DAG) execution engine, Apache Spark can create efficient query plans for data transformations. Apache Spark can also keep input, output, and intermediate data in memory as resilient distributed datasets (RDDs), which avoids repeated disk I/O and boosts the performance of iterative or interactive workloads.
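As a rough illustration of the lazy DAG and in-memory caching described above, the PySpark sketch below chains two transformations, caches the result, and then runs two actions against it. It assumes PySpark is installed; the function name `squares_of_evens`, the sample numbers, and the local master setting are hypothetical choices for the example and not part of Amazon EMR.

```python
from typing import List, Tuple


def squares_of_evens(spark, numbers: List[int]) -> Tuple[int, int]:
    """Build a small transformation DAG, cache it, and run two actions on it."""
    rdd = spark.sparkContext.parallelize(numbers)

    # Transformations are lazy: these calls only extend the DAG, nothing runs yet.
    evens_squared = rdd.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

    # Mark the result for in-memory storage so later actions can reuse it.
    evens_squared.cache()

    # The first action triggers execution and populates the cache;
    # the second can read the cached partitions instead of recomputing them.
    total = evens_squared.sum()
    count = evens_squared.count()
    return total, count


def main() -> None:
    # Assumption: pyspark is installed; "local[2]" is only for trying this on a laptop.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("dag-cache-sketch")
        .getOrCreate()
    )
    print(squares_of_evens(spark, list(range(10))))
    spark.stop()
```

Calling `main()` runs the sketch on a local session. On a cluster you would typically drop the `master` setting and let the cluster manager supply it.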

Apache Spark natively supports Java, Scala, and Python, giving you a variety of languages for building your applications. You can also submit SQL or HiveQL queries to Apache Spark using the Spark SQL module. In addition to running applications, you can use the Apache Spark API interactively with Python or Scala directly in the Apache Spark shell on your cluster, or use Zeppelin to create interactive and collaborative notebooks for data exploration and visualization.

Apache Spark includes several libraries to help build applications for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX). These libraries are tightly integrated in the Apache Spark ecosystem, and they can be leveraged out of the box to address a variety of use cases.

Submit Apache Spark jobs with the Amazon EMR Step API, and use Apache Spark with EMRFS to access data directly in Amazon S3. Save costs with Amazon EC2 Spot capacity, use Auto Scaling to add and remove capacity dynamically, and launch long-running or ephemeral clusters to match your workload. You can also configure Spark encryption with an Amazon EMR security configuration. Amazon EMR installs and manages Apache Spark on Hadoop YARN, and you can add other Hadoop ecosystem applications to your cluster.
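One way to submit a Spark job through the Step API is sketched below with boto3. The helper names `build_spark_step` and `submit_step`, the bucket path, and the choice of `CONTINUE` on failure are assumptions made for illustration; the sketch also assumes boto3 is installed, AWS credentials are configured, and the cluster already exists.

```python
from typing import Any, Dict, List


def build_spark_step(name: str, script_s3_uri: str, args: List[str]) -> Dict[str, Any]:
    """Return a step definition that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        # Assumption: keep the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }


def submit_step(cluster_id: str, step: Dict[str, Any], region: str) -> str:
    """Add the step to an existing cluster and return the new step ID."""
    # Assumption: boto3 is installed and credentials are available in the environment.
    import boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

For example, `build_spark_step("nightly-report", "s3://example-bucket/jobs/report.py", ["--date", "2020-01-01"])` produces a step that `submit_step` can send to a cluster ID of your own. The script path here is a placeholder read through EMRFS.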


Yelp

Yelp’s advertising targeting team builds prediction models to determine the likelihood of a user interacting with an advertisement. By using Apache Spark on Amazon EMR to process large amounts of data to train machine learning models, Yelp increased revenue and advertising click-through rate.

The Washington Post

The Washington Post uses Apache Spark on Amazon EMR to build models powering its website’s recommendation engine to boost reader engagement and satisfaction. They leverage Amazon EMR's performant connectivity with Amazon S3 to update models in near real-time.

Intent Media

Intent Media operates a platform for advertising on travel commerce sites. The data team uses Apache Spark and MLlib on Amazon EMR to ingest terabytes of e-commerce data daily and use this information to power their decisioning services to optimize customer revenue.

Krux

As part of its Data Management Platform for customer insights, Krux runs many machine learning and general processing workloads using Apache Spark. Krux utilizes ephemeral Amazon EMR clusters with Amazon EC2 Spot Capacity to save costs, and uses Amazon S3 with EMRFS as a data layer for Apache Spark.

GumGum

GumGum, an in-image and in-screen advertising platform, uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3. Spark’s performance enhancements saved GumGum time and money for these workflows.

Hearst Corporation

Hearst Corporation, a large diversified media and information company, has customers viewing content on over 200 web properties. Using Apache Spark Streaming on Amazon EMR, Hearst’s editorial staff can keep a real-time pulse on which articles are performing well and which themes are trending.

CrowdStrike

CrowdStrike provides endpoint protection to stop breaches. They use Amazon EMR with Spark to process hundreds of terabytes of event data and roll it up into higher-level behavioral descriptions on the hosts. From that data, CrowdStrike can pull event data together and identify the presence of malicious activity.


Consume and process real-time data from Amazon Kinesis, Apache Kafka, or other data streams with Spark Streaming on Amazon EMR. Perform streaming analytics in a fault-tolerant way and write results to Amazon S3 or on-cluster HDFS.
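A streaming pipeline of this shape might look like the sketch below, which reads from Apache Kafka and writes Parquet files to Amazon S3. Note that it uses Spark's Structured Streaming API instead of the older DStream-based Spark Streaming API. It assumes the Spark Kafka connector package is available on the cluster. The function name `start_kafka_to_s3` and every broker, topic, and path value are placeholders.

```python
def start_kafka_to_s3(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query that copies Kafka messages to Parquet files."""
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        # Kafka delivers the payload as bytes; keep it as a string plus the event time.
        .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    )

    return (
        events.writeStream.format("parquet")
        .option("path", output_path)
        # The checkpoint lets the query recover its progress after a failure.
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

The returned query object runs until it is stopped. A driver script would usually call `awaitTermination()` on it. Pointing `output_path` at on-cluster HDFS instead of an `s3://` location requires no other change to the sketch.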

 

Apache Spark on Amazon EMR includes MLlib for a variety of scalable machine learning algorithms, or you can use your own libraries. Because Spark can keep datasets in memory for the duration of a job, it performs well on the iterative computations common in machine learning workloads.
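To show why in-memory storage helps iterative workloads, the sketch below fits a single slope with hand-written gradient descent over a cached RDD. This is a deliberately simple stand-in for illustration and not an MLlib algorithm. The function name `fit_slope`, the default learning rate, and the iteration count are assumptions, and the step size would need tuning for real data.

```python
def fit_slope(points_rdd, iterations=50, learning_rate=0.1):
    """Fit y ~ w * x by gradient descent, rescanning a cached RDD on every pass.

    points_rdd is expected to hold (x, y) tuples of floats.
    """
    # Cache once so each iteration reads from memory instead of recomputing the input.
    points = points_rdd.cache()
    n = points.count()

    w = 0.0
    for _ in range(iterations):
        # Gradient of the mean squared error (up to a constant factor of 2).
        gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum() / n
        w -= learning_rate * gradient
    return w
```

With an RDD such as `spark.sparkContext.parallelize([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])`, the returned slope should approach 2. For production models, the algorithms that ship with MLlib are the more appropriate choice.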

 

 

Use Spark SQL for low-latency, interactive queries with SQL or HiveQL. Apache Spark on Amazon EMR can use EMRFS, so you have ad hoc access to your datasets in Amazon S3. You can also connect Zeppelin notebooks or BI tools through ODBC and JDBC connections.
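An ad hoc query over data in Amazon S3 could be sketched as follows. The function name `top_pages`, the `page_views` view, the `page` column, and the Parquet layout are all hypothetical. The sketch assumes an existing SparkSession is passed in as `spark`.

```python
def top_pages(spark, input_path: str, limit: int = 10):
    """Register Parquet data as a temporary view and query it with SQL."""
    # On Amazon EMR, an s3:// path here is read through EMRFS.
    spark.read.parquet(input_path).createOrReplaceTempView("page_views")

    return spark.sql(
        f"""
        SELECT page, COUNT(*) AS views
        FROM page_views
        GROUP BY page
        ORDER BY views DESC
        LIMIT {int(limit)}
        """
    )
```

The result is a DataFrame, so `top_pages(spark, "s3://example-bucket/page_views/").show()` would print the rows in a shell or notebook session.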