Cloud Dataproc

A managed Spark and Hadoop service that is fast, easy to use, and low cost

Managed Hadoop & Spark

Use Google Cloud Dataproc, an Apache Hadoop, Apache Spark, Apache Pig, and Apache Hive service, to easily process big datasets at low cost. Control your costs by quickly creating managed clusters of any size and turning them off when you're done. Cloud Dataproc integrates across Google Cloud Platform products, giving you a powerful and complete data processing platform.

Fast & Scalable Data Processing

Create Cloud Dataproc clusters quickly and resize them at any time, from three to hundreds of nodes, so your data pipelines do not outgrow your clusters. Each cluster action takes less than 90 seconds on average, so you spend more time on insights and less on infrastructure.

Affordable Pricing

Following Google Cloud Platform pricing principles, Cloud Dataproc has a low-cost, easy-to-understand price structure based on actual use, measured by the minute. Cloud Dataproc clusters can also include lower-cost preemptible instances, giving you powerful clusters at an even lower total cost.

Open Source Ecosystem

The Spark and Hadoop ecosystem provides tools, libraries, and documentation that you can use with Cloud Dataproc. Because Cloud Dataproc offers frequently updated, native versions of Spark, Hadoop, Pig, and Hive, you can get started without learning new tools or APIs, and you can move existing projects or ETL pipelines without redevelopment.

Have You Considered?

Cloud Platform can deliver even more scale, efficiency, and simplicity for key data processing and analysis scenarios. If you use Hive on Hadoop (or SparkSQL), you might consider Google BigQuery, a fast, on-demand SQL analytics service. If you program data transformation pipelines with Spark or MapReduce, you may want to consider Google Cloud Dataflow, a fully managed service that eliminates the busy work required by other tools and executes a wide range of data processing patterns, including ETL, batch, and streaming computation.

Cloud Dataproc Features

Google Cloud Dataproc is a managed Spark and Hadoop service that is fast, easy to use, and low cost.

Automated Cluster Management
Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Your clusters will be stable, scalable, and speedy.
Resizable Clusters
Clusters can be created and scaled quickly with a variety of virtual machine types, disk sizes, number of nodes, and networking options.
Integrated
Built-in integration with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring, giving you a complete and robust data platform.
Versioning
Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Developer Tools
Multiple ways to manage a cluster, including an easy-to-use Web UI, the Google Cloud SDK, RESTful APIs, and SSH access.
Initialization Actions
Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Automatic or Manual Configuration
Cloud Dataproc automatically configures hardware and software on clusters for you while also allowing for manual control.
Flexible Virtual Machines
Clusters can use custom machine types and preemptible virtual machines so they are the perfect size for your needs.
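
Several of the features above (node count, machine type, preemptible workers, image version, initialization actions) come together when a cluster is requested through the RESTful API. The following is only a rough sketch of what such a request body might look like. The field names reflect one reading of the Dataproc v1 REST API and should be checked against the API reference. The helper `build_cluster_request`, the cluster name, the bucket path, and the image version value are all hypothetical.

```python
import json


def build_cluster_request(name, workers, preemptible_workers,
                          machine_type, image_version, init_script):
    """Assemble a dict shaped like a cluster-creation request body.

    Assumption: field names follow the Dataproc v1 REST API as the editor
    understands it; verify them against the official reference before use.
    """
    return {
        "clusterName": name,
        "config": {
            # One master node plus a resizable pool of workers.
            "masterConfig": {"numInstances": 1, "machineTypeUri": machine_type},
            "workerConfig": {"numInstances": workers, "machineTypeUri": machine_type},
            # Lower-cost preemptible capacity is requested as secondary workers.
            "secondaryWorkerConfig": {"numInstances": preemptible_workers},
            # Image versioning selects the Spark and Hadoop versions.
            "softwareConfig": {"imageVersion": image_version},
            # Scripts that run on each node when the cluster is created.
            "initializationActions": [{"executableFile": init_script}],
        },
    }


# Hypothetical values, for illustration only.
request = build_cluster_request(
    name="example-cluster",
    workers=3,
    preemptible_workers=2,
    machine_type="n1-standard-4",
    image_version="1.2",
    init_script="gs://my-bucket/install-libs.sh",
)
print(json.dumps(request, indent=2))
```

Keeping the request as plain data makes it easy to review or store in version control before it is sent with whichever client you prefer.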

Cloud Dataproc Pricing

Cloud Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster (see note 1).

Regions: Iowa, Oregon, Northern Virginia, South Carolina, Belgium, London, Sydney, Taiwan, Tokyo

Machine types:
Standard Machines: 1-64 virtual CPUs
High Memory Machines: 2-64 virtual CPUs
High CPU Machines: 2-64 virtual CPUs
Custom Machines: based on vCPU and memory usage

1 Google Cloud Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster while the cluster is operational. Additional resources used by Cloud Dataproc, such as a Compute Engine network, BigQuery, Cloud Bigtable, and others, are billed as they are consumed. For detailed pricing information, please view the pricing guide.
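
As a worked illustration of the per-vCPU, per-minute model described above, the sketch below estimates only the Cloud Dataproc fee for a cluster. The function name `dataproc_fee` and the rate used in the example are hypothetical. Real rates come from the pricing guide, and Compute Engine and other resource charges are billed separately.

```python
def dataproc_fee(vcpus_per_node, node_count, minutes, rate_per_vcpu_hour):
    """Estimate the Cloud Dataproc fee for one cluster run.

    Assumptions: every node has the same vCPU count, usage is metered in
    whole minutes, and rate_per_vcpu_hour is a placeholder supplied by the
    caller (not an actual published price).
    """
    total_vcpus = vcpus_per_node * node_count
    hours = minutes / 60
    return total_vcpus * hours * rate_per_vcpu_hour


# Example: 3 nodes with 4 vCPUs each, running for 90 minutes,
# at a made-up rate of 0.01 per vCPU-hour.
fee = dataproc_fee(vcpus_per_node=4, node_count=3, minutes=90,
                   rate_per_vcpu_hour=0.01)
print(f"Estimated Dataproc fee: {fee:.2f}")
```

Because the fee scales with both vCPU count and running time, turning a cluster off when a job finishes reduces the charge directly.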