AWS Glue

AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, mapping, and job scheduling. AWS Glue guides you through moving your data with an easy-to-use console that helps you understand your data sources, prepare the data for analytics, and load it reliably from sources to destinations.

AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. AWS Glue automatically crawls your data sources, identifies data formats, and suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git, and your favorite integrated development environment (IDE), and share them with other AWS Glue users. AWS Glue schedules your ETL jobs and provisions and scales all the infrastructure they require, so your ETL jobs run quickly and efficiently at any scale. There are no servers to manage, and you pay only for the resources your ETL jobs consume.
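For instance, a JDBC-compliant store can be registered as a Glue connection through the AWS SDK for Python (boto3). The sketch below is illustrative only; the connection name, JDBC URL, and credentials are placeholders, not values from this page.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Register a JDBC-compliant data store (placeholder values) so that
    # crawlers and ETL jobs can reach it.
    glue.create_connection(
        ConnectionInput={
            "Name": "my-postgres-connection",  # hypothetical name
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.com:5432/sales",
                "USERNAME": "etl_user",
                "PASSWORD": "replace-me",  # prefer a secrets store in practice
            },
        }
    )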


Step 1. Build Your Data Catalog

First, you use the AWS Management Console to register your data sources with AWS Glue. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more. You can also add your own classifiers or choose classifiers from the AWS Glue community to add to your crawls.


[Figure: Step 1. Automatically Build Your Data Catalog]
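As a sketch of what this step looks like outside the console, the same crawl can be set up with boto3. The crawler name, IAM role, database name, and S3 path below are hypothetical.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Point a crawler at an S3 prefix; Glue's built-in classifiers infer the
    # format (JSON, CSV, Parquet, ...) and write the schema to the Data Catalog.
    glue.create_crawler(
        Name="sales-data-crawler",  # hypothetical
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/sales/"}]},
    )
    glue.start_crawler(Name="sales-data-crawler")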


Step 2. Generate and Edit Transformations

Next, select a data source and target, and AWS Glue will generate Python code to extract data from the source, transform the data to match the target schema, and load it into the target. The auto-generated code handles common error cases, such as bad data or hardware failures. You can edit this code using your favorite IDE and test it with your own sample data. You can also browse code shared by other AWS Glue users and pull it into your jobs.


[Figure: Step 2. Generate the Transformations]
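The generated code is ordinary PySpark built on the awsglue library, which is why it can be edited in any IDE. The following sketch shows the general shape of such a script; the catalog database, table, column mappings, and output path are placeholders rather than real generated output.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table the crawler added to the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_sales"  # hypothetical
    )

    # Transform: rename and retype columns to match the target schema.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("order id", "string", "order_id", "string"),
            ("amount", "string", "amount", "double"),
        ],
    )

    # Load: write the result to the target as Parquet on S3.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/sales/"},
        format="parquet",
    )
    job.commit()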


Step 3. Schedule and Run Your Jobs

Lastly, you can use AWS Glue’s flexible scheduler to run your flows on a recurring basis, in response to triggers, or in response to AWS Lambda events. AWS Glue automatically distributes your ETL jobs across Apache Spark nodes, so that your ETL run times remain consistent as data volume grows. AWS Glue coordinates the execution of your jobs in the right sequence and automatically retries failed jobs. AWS Glue elastically scales the infrastructure required to complete your jobs on time while minimizing cost.


[Figure: Step 3. Schedule and Run Your Jobs]
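As one example, a schedule-based trigger can be attached to a job through boto3; the trigger name, job name, and cron expression below are placeholders.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Run the ETL job every day at 06:00 UTC; failed runs are retried
    # according to the job's MaxRetries setting.
    glue.create_trigger(
        Name="nightly-sales-etl",  # hypothetical
        Type="SCHEDULED",
        Schedule="cron(0 6 * * ? *)",
        Actions=[{"JobName": "sales-etl-job"}],
        StartOnCreation=True,
    )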


Done.

That’s it! Once the ETL jobs are in production, AWS Glue helps you track changes to metadata such as schema definitions and data formats, so you can keep your ETL jobs up to date.
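One way to watch for those metadata changes is the Data Catalog’s table-version history. The sketch below assumes the hypothetical sales_db/raw_sales table from the earlier steps.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Schema changes detected by a crawl add a new table version, so comparing
    # the latest two versions shows what changed.
    versions = glue.get_table_versions(
        DatabaseName="sales_db", TableName="raw_sales"
    )["TableVersions"]
    for v in versions[:2]:
        cols = v["Table"]["StorageDescriptor"]["Columns"]
        print(v["VersionId"], [(c["Name"], c["Type"]) for c in cols])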

For the latest information about service availability, sign up here and we will keep you updated via email.
