Amazon EMR

Amazon EMR is a web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Amazon EMR simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark and Presto in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR securely and reliably handles your big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

Amazon EMR on the AWS Big Data Blog

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR

Anomaly Detection Using PySpark, Hive, and Hue on Amazon EMR

Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams

Submitting User Applications with spark-submit

Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile

What's New from Amazon EMR

Date	Announcement	Link
Aug 02	Amazon EMR release 5.0 now available: Apache Spark 2.0, Hive 2.1, enhanced debugging, and more	Blog
Jun 02	Apache Tez 0.8.3 and Apache Phoenix 4.7.0 now available on Amazon EMR	Blog
Apr 21	Apache HBase 1.2 Now Available on Amazon EMR for Realtime Access for Massive Datasets	Blog
Apr 04	Apache Spark 1.6.1, new versions of Apache Hadoop and Presto, and support for Amazon S3 SSE-KMS now available on Amazon EMR
Mar 14	New open-source applications, dynamic Spark defaults, and Java 8 support now available on Amazon EMR	Blog

Customer Success

Krux uses AWS to manage data processing requirements.

CrowdStrike uses Amazon EMR with Spark to process hundreds of terabytes of event data to identify the presence of malicious activity.

GumGum uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3.

Kik uses Amazon EMR and Hadoop Pig scripts to process large volumes of log file data before it is loaded into Amazon Redshift.

Yelp was able to save $55,000 in upfront hardware costs.

Expedia processes clickstream data from a global network of websites.

The Financial Industry Regulatory Authority uses Amazon EMR to create a flexible platform that can adapt to changing market dynamics.

DataXu Evaluates 30 Trillion Ad Opportunities Monthly on AWS

SnowPlow

Channel 4 analyzes customer interaction data for its video-on-demand service.

Swipely generates insights from millions of credit card transactions.

The analytics team leverages Amazon EMR and Hadoop to aggregate and analyze data.

Open-Source Applications in Amazon EMR

Figure: Amazon EMR Release Velocity

With versioned releases on Amazon EMR, you can easily select and use the latest open source projects on your EMR cluster, including applications in the Apache Hadoop and Spark ecosystems. Software is installed and configured by Amazon EMR, so you spend less time on administrative tasks and can focus on increasing the value of your data.

Features and Benefits

Easy to Use

You can launch an Amazon EMR cluster in minutes. You don’t need to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis.
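
To make the launch step concrete, here is a minimal sketch of the kind of request you might assemble for the `run_job_flow` operation in boto3, the AWS SDK for Python. The cluster name, instance type, node count, and default role names are illustrative assumptions, not values from this page; the release label corresponds to the 5.0 release in the announcements table. The sketch only builds the request and does not contact AWS.

```python
def build_cluster_request(name, node_count, instance_type="m3.xlarge",
                          release_label="emr-5.0.0", applications=("Spark",)):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Every default here is an illustrative assumption, not a recommendation.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    # One master node; any remaining nodes go into a core instance group.
    groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if node_count > 1:
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": node_count - 1,
        })

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed to be the default EMR roles already created in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("example-cluster", node_count=10)

# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or adjust before any cluster is created.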

Low Cost

Amazon EMR pricing is simple and predictable: You pay an hourly rate for every instance hour you use. You can launch a 10-node Hadoop cluster for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
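
As a rough illustration of instance-hour billing, the sketch below multiplies nodes by hours by an hourly rate. The per-node rate is an assumption derived by dividing the quoted $0.15 per hour evenly across 10 nodes; actual rates depend on instance type and region, and the function name is hypothetical.

```python
def estimate_cluster_cost(nodes, hours, rate_per_instance_hour):
    """Return the estimated cost of running `nodes` instances for `hours` hours."""
    if nodes < 0 or hours < 0 or rate_per_instance_hour < 0:
        raise ValueError("nodes, hours, and rate must be non-negative")
    instance_hours = nodes * hours  # one instance running for one hour
    return instance_hours * rate_per_instance_hour


# Assumed rate: the quoted $0.15 per hour spread evenly over 10 nodes.
RATE_PER_INSTANCE_HOUR = 0.15 / 10

one_hour = estimate_cluster_cost(10, 1, RATE_PER_INSTANCE_HOUR)
one_day = estimate_cluster_cost(10, 24, RATE_PER_INSTANCE_HOUR)
print(f"10 nodes for 1 hour:   ${one_hour:.2f}")   # $0.15
print(f"10 nodes for 24 hours: ${one_day:.2f}")    # $3.60
```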

Elastic

With Amazon EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale. You can easily increase or decrease the number of instances and you only pay for what you use.

Reliable

You can spend less time tuning and monitoring your cluster. Amazon EMR has tuned Hadoop for the cloud; it also monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances.

Secure

Amazon EMR automatically configures Amazon EC2 firewall settings that control network access to instances, and you can launch clusters in an Amazon Virtual Private Cloud (VPC), a logically isolated network you define. For objects stored in Amazon S3, you can use Amazon S3 server-side encryption or Amazon S3 client-side encryption with EMRFS, with AWS Key Management Service or customer-managed keys.

Flexible

You have complete control over your cluster. You have root access to every instance, you can easily install additional applications, and you can customize every cluster. Amazon EMR also supports multiple Hadoop distributions and applications.

Use Cases

Clickstream Analysis

Amazon EMR can be used to analyze clickstream data in order to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.

Learn how Razorfish uses EMR for clickstream analysis

Genomics

Amazon EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.

Read about the 1000 Genomes project and AWS

Log Processing

Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of unstructured or semi-structured data into useful insights about their applications or users.

Learn how Yelp uses EMR to drive key website features

Launch Your First Cluster in Minutes

Are you ready to launch your first cluster? See the Getting Started Tutorial. In the tutorial, you will create a cluster that counts the frequency of words in a sample text file. In just a few minutes, your cluster will be up and running.
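
For readers unfamiliar with the word-count task, the following is a minimal single-machine sketch of the idea in plain Python, shaped as a map step and a reduce step. It is not the tutorial's code; the function names, the tokenizing rule, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_frequencies(lines):
    """Run the map step over every line, then reduce the combined pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


# Hypothetical sample text standing in for the tutorial's input file.
sample = ["the quick brown fox", "jumps over the lazy dog", "The end"]
counts = word_frequencies(sample)
print(counts.most_common(1))  # [('the', 3)]
```

On a cluster, the map and reduce steps would run in parallel across many nodes; here they run in sequence on one machine.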