Oliver Hookins, Principal DevOps Engineer, Sydney, Australia
Setting up AWS Glue jobs with Glue Connections that can reach VPC-internal resources can be a challenge, especially if you need to access both RDS datastores and other non-RDS resources. In this blog post we explore some of the challenges we faced when dealing with this recently, and how to work around the limitations.
May 13th 2022
Glue and Network Connections

Introduction

AWS Glue is a serverless data processing service, quite analogous to Hadoop and other popular services for processing data at scale, but with the advantage that you don't have to manage the infrastructure. Our data engineers use it to analyse a variety of data from our platform and the 3rd-party services we integrate with, in order to build a richer understanding of how our customers' bug bounty programs are meeting their goals and how our researchers are faring, and to keep finding new ways to make our overall platform more effective.

A data processing system is only as effective as the data you are able to feed to it, so naturally there is network connectivity involved. At the most basic level, you need to provide scripts for your data processing logic, and of course the data - which often resides in a Data Lake (and hopefully not a Data Swamp). Common options on AWS would of course be S3, Redshift or other databases in RDS. Now that I've mentioned the "RDS" keyword, let's dive into which issues this raises.

To VPC or not to VPC

Classically, AWS made no distinction between infrastructure that was "on the internet" and infrastructure that was private. In the "VPC Era" (which admittedly we've been in for quite some time), the resources you provision in your AWS account are hosted inside a virtual private network, inaccessible (by default) to the Internet. This started with EC2 instances and has spread to most other services that AWS offers. I say "most" very specifically - a couple of notable examples that do not run (by default) in your VPC are the ubiquitous Lambda function, and of course our friend the Glue job. You could also argue that S3 itself does not run in a VPC.

So, S3 is not so much of a problem, but what of our RDS database? It's not going to be accessible to the Glue job by default, but fortunately AWS offers a solution to this in the form of Glue Connections. There are a couple of different kinds of Connection: JDBC, used for accessing databases including AWS's native RDS types, and Network, which is used to "connect to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC)". Unfortunately, at the time of our exploration of this feature, this phrase was not in the documentation. There are also other documents which muddy the waters of exactly how the interaction between the VPC and Glue operates, such as the guide on setting up VPC interface endpoints for Glue.

That guide covers the reverse direction: reaching the Glue API from inside your VPC, rather than reaching your VPC from a Glue job. Summing up, the options you have are:

  • Create a Glue Network Connection for access of resources inside of your VPC from a Glue job.
  • Create a VPC interface endpoint for access to the Glue API privately (i.e. without going over the public Internet) from within your VPC.

Unfortunately these two have little to do with each other and you might spend some time (as I did) working on the VPC interface endpoint, to ultimately find that it does nothing for providing access to your VPC resources from Glue.
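
To make the difference between the two options concrete, here is a minimal Python sketch of the request payloads for each, shaped for the boto3 calls glue.create_connection and ec2.create_vpc_endpoint. It only builds dictionaries and does not call AWS. The function names and all IDs, names and the region are hypothetical placeholders, not values from our setup.

```python
# Sketch only: request payloads for two unrelated features.
# All IDs, names and the region used with these helpers are hypothetical.

def glue_network_connection_request(name, subnet_id, security_group_ids, availability_zone):
    """Arguments for glue.create_connection: lets a Glue job reach INTO the VPC."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "NETWORK",
            "ConnectionProperties": {},  # a Network connection carries no JDBC settings
            "PhysicalConnectionRequirements": {
                "SubnetId": subnet_id,
                "SecurityGroupIdList": list(security_group_ids),
                "AvailabilityZone": availability_zone,
            },
        }
    }


def glue_api_interface_endpoint_request(vpc_id, subnet_ids, security_group_ids, region):
    """Arguments for ec2.create_vpc_endpoint: lets VPC workloads call the Glue API privately."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,
    }


# With boto3 installed and credentials configured, these would be used as:
#   boto3.client("glue").create_connection(**glue_network_connection_request(...))
#   boto3.client("ec2").create_vpc_endpoint(**glue_api_interface_endpoint_request(...))
```

The first payload is the one that matters for jobs that need VPC resources; the second only changes how clients inside the VPC reach the Glue API.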

Sidebar: Glue Connection operation and caveats

We already had a rich collection of Glue jobs in our AWS account accessing several different RDS databases, and these were functioning correctly using the JDBC type of Glue Connection. When you set up one of these connections, the specific VPC, subnet and security group must be provided. The VPC part of this is fairly self-evident - it must be the same VPC in which your RDS instances reside. The subnet and security group are less clear though. When you create a Glue Connection, behind the scenes AWS is creating an Elastic Network Interface inside the specified subnet and with the specified security group, and uses that for the network communication of the Glue job (assuming you have configured it with this Glue Connection). You might accidentally specify the subnet and security group of your RDS instance, but this would not be correct - and may not work (depending on how you have set up the security group rules).
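
As an illustration of why the Connection needs its own security group rather than the one on the RDS instance, the following Python sketch builds the two ingress rules that such a setup typically needs, shaped for boto3's ec2.authorize_security_group_ingress. It assumes the self-referencing rule for all TCP ports that the AWS documentation asks for on the Glue security group; the function names, group IDs and database port are hypothetical.

```python
# Sketch only: ingress rules for a dedicated Glue security group.
# Group IDs and the port are hypothetical placeholders.

def self_referencing_rule(glue_sg_id):
    """Allow the Glue ENIs in this group to talk to each other on all TCP ports."""
    return {
        "GroupId": glue_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": glue_sg_id}],
        }],
    }


def database_ingress_rule(rds_sg_id, glue_sg_id, port):
    """Allow traffic from the Glue security group into the RDS security group."""
    return {
        "GroupId": rds_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": glue_sg_id}],
        }],
    }


# With boto3 these would be applied as:
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(**self_referencing_rule("sg-0glue"))
#   ec2.authorize_security_group_ingress(**database_ingress_rule("sg-0rds", "sg-0glue", 5432))
```

The point of the second rule is that the RDS group admits the Glue group as a source, instead of the Glue ENI borrowing the RDS group itself.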

Fortunately AWS has added more detailed documentation on this topic recently; when we were dealing with it, it was far less clear. AWS Support was our go-to resource for finding out more about the operation of Glue Connections at the time.

One extremely important thing to bear in mind about the operation of Glue jobs with a configured Glue Connection is that the job will use the Connection for all network communication - not just your JDBC database access. This means that access to S3, Secrets Manager, IAM, or indeed any other AWS service will now be using the connectivity available to that combination of VPC, subnet and security group. What's more, any outbound Internet access will also need to use the NAT Gateway configured for that subnet. If you had hoped to isolate your Glue Connection's subnet and have not provisioned a NAT Gateway, you will quickly find that network access is limited.

This can be a benefit or not, depending on your point of view. If you already have VPC endpoints configured for high-data services (e.g. S3) you can automatically use those and avoid sending traffic over the public Internet. On the other hand, if you have any special Network ACLs (NACLs) in place that limit access for certain subnets you might find that you run into problems. Interface VPC endpoints work by providing DNS entries for the ENI address(es) of the service in question - and DNS is a VPC-wide service.

The use case and problem

You can probably guess from this extended exposition that we ran into a problem at the intersection of Glue and network connections. We had some computation resources sitting in our VPC in a couple of subnets that, for security reasons, needed to be isolated from the rest of our workloads. This meant that there were NACLs in place that prevented them from reaching any other subnets in our VPC. Before we understood exactly how Glue Connections worked, it seemed like we could simply do the following:

  • Leave the existing JDBC Connections in place for the Glue jobs.
  • Add a Network Connection for the subnet for the computation resources, and a security group that allows access to them.
  • Configure this additional Network Connection for the jobs where we need to access these computation resources.

Unfortunately, a limitation that I believe still exists in Glue is that a given job can only use a single Glue Connection. It is possible to configure several, but only the first will be used. Therefore, you are placed in a tricky position if you need both a generic connection to arbitrary VPC resources and a connection to database resources using JDBC. Specifically in our case, what we found was that with the Network Connection configured first in the job, the computation resources would be accessible, but the RDS database would not be, since the computation resources were isolated and could not connect to the RDS subnet. Even more annoyingly, some VPC endpoints we had configured in our VPC (e.g. S3) were also not accessible because of the restrictive NACLs.

Conversely, having the JDBC Connection configured first in the job meant that the RDS database was accessible but not the isolated computational resources. What to do?

The solution

Unfortunately, given the limitations of Glue we did have to make a trade-off. However, it can be done by exploiting the undocumented properties of Glue Connections. While the JDBC Connections seem like they are only used for communicating with an RDS database, they actually function much like a Network Connection but with the added convenience of having JDBC authentication parameters for the database. They still create an ENI in the appropriate VPC and subnet, and with the security group of choice. For non-JDBC network connectivity, they actually function normally, sending traffic out of that same ENI, subject to the same limitations.
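
A small Python sketch shows what this looks like in practice: a JDBC and a Network connection defined with identical network placement, differing only in their connection properties. The dictionaries follow the ConnectionInput shape accepted by boto3's glue.create_connection; every ID, name, URL and secret identifier is a hypothetical placeholder.

```python
# Sketch only: two connection definitions that share one network placement.
# Every ID, name, URL and secret below is a hypothetical placeholder.

SHARED_PLACEMENT = {
    "SubnetId": "subnet-0a1b2c3d4e5f67890",
    "SecurityGroupIdList": ["sg-0a1b2c3d4e5f67890"],
    "AvailabilityZone": "ap-southeast-2a",
}


def connection_input(name, connection_type, properties):
    """Build a ConnectionInput whose ENI lands in the shared subnet and security group."""
    return {
        "Name": name,
        "ConnectionType": connection_type,
        "ConnectionProperties": dict(properties),
        "PhysicalConnectionRequirements": dict(SHARED_PLACEMENT),
    }


network_connection = connection_input("glue-network", "NETWORK", {})

jdbc_connection = connection_input("glue-jdbc-reporting", "JDBC", {
    "JDBC_CONNECTION_URL": "jdbc:postgresql://reporting-db.example.internal:5432/reporting",
    "SECRET_ID": "glue/reporting-db",  # credentials kept in Secrets Manager
})

# With boto3 each would be created as:
#   boto3.client("glue").create_connection(ConnectionInput=jdbc_connection)
```

From a networking point of view the two definitions are interchangeable; only the JDBC one hands database credentials to the job.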

For our specific case, dropping a Network or JDBC Connection into the same subnet as our isolated computation resources wasn't an option - it simply wouldn't be able to talk to RDS (intentionally and by design). As an alternative, we provisioned a new set of subnets dedicated to our Glue use case, which would only be used by our Glue Connections. These subnets were configured to be able to reach our isolated computation resources (via some small changes in NACLs) and were also allowed to reach our RDS databases. In addition, they would be able to reach external Internet resources and VPC endpoints we had provisioned for S3 and other services.
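
The trade-off can be captured in a toy reachability model. The Python sketch below assumes, as described earlier, that only the first Connection configured on a job determines its network path. The subnet names and NACL rules are hypothetical and only mirror the shape of our situation.

```python
# Toy model only. Subnet names and NACL rules are hypothetical.
# NACL_ALLOWS[subnet] is the set of destinations reachable from that subnet.
NACL_ALLOWS = {
    "isolated-compute-subnet": {"compute"},
    "general-subnet": {"rds", "s3-endpoint", "internet"},
    "glue-dedicated-subnet": {"compute", "rds", "s3-endpoint", "internet"},
}


def reachable(job_connections):
    """Destinations a job can reach, assuming only its first connection is used."""
    if not job_connections:
        return set()
    first_subnet = job_connections[0]["subnet"]
    return set(NACL_ALLOWS.get(first_subnet, set()))


def missing(job_connections, needed):
    """Destinations the job needs but cannot reach, sorted for stable output."""
    return sorted(set(needed) - reachable(job_connections))


network_isolated = {"name": "network-isolated", "subnet": "isolated-compute-subnet"}
jdbc_general = {"name": "jdbc-general", "subnet": "general-subnet"}
jdbc_dedicated = {"name": "jdbc-dedicated", "subnet": "glue-dedicated-subnet"}

NEEDED = {"compute", "rds", "s3-endpoint"}

for connections in (
    [network_isolated, jdbc_general],  # Network Connection first
    [jdbc_general, network_isolated],  # JDBC Connection first
    [jdbc_dedicated],                  # single connection in a dedicated subnet
):
    names = [c["name"] for c in connections]
    print(names, "missing:", missing(connections, NEEDED))
```

In this model, neither ordering of the two original connections reaches everything, while a single connection in the dedicated subnet does.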

One additional limitation implicit in the single Network Connection design is that you can only configure a single subnet. While we originally set up several subnets dedicated to Glue Connections, we can only use one of them at a time. This does present an availability issue if that particular Availability Zone (AZ) has an outage. While it would be nice to work around this, for the moment we are living with that risk. If we see a sustained AZ outage, we can alter our Glue Connection configuration to use one of the other subnets (and AZs) that we have provisioned.
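
A manual failover of this kind could be prepared along the following lines. This Python sketch picks a standby subnet outside the failed AZ and builds the arguments for boto3's glue.update_connection, which takes the connection name and a complete ConnectionInput. The subnet list, IDs and function name are hypothetical, and the sketch makes no call to AWS.

```python
# Sketch only: move a Glue Connection to a standby subnet in another AZ.
# Subnet IDs and AZ names are hypothetical placeholders.
GLUE_SUBNETS = [
    {"SubnetId": "subnet-0aaa111122223333a", "AvailabilityZone": "ap-southeast-2a"},
    {"SubnetId": "subnet-0bbb111122223333b", "AvailabilityZone": "ap-southeast-2b"},
    {"SubnetId": "subnet-0ccc111122223333c", "AvailabilityZone": "ap-southeast-2c"},
]


def failover_request(current_input, failed_az, subnets=GLUE_SUBNETS):
    """Build glue.update_connection arguments that place the ENI outside failed_az."""
    candidates = [s for s in subnets if s["AvailabilityZone"] != failed_az]
    if not candidates:
        raise ValueError("no Glue subnet available outside " + failed_az)
    target = candidates[0]

    # Keep the type, properties and security groups; change only the placement.
    new_input = dict(current_input)
    new_input["PhysicalConnectionRequirements"] = {
        **current_input["PhysicalConnectionRequirements"],
        "SubnetId": target["SubnetId"],
        "AvailabilityZone": target["AvailabilityZone"],
    }
    return {"Name": current_input["Name"], "ConnectionInput": new_input}


# With boto3 this would be applied as:
#   boto3.client("glue").update_connection(**failover_request(current, "ap-southeast-2a"))
```

Because the update replaces the whole definition, the existing properties are copied across rather than rebuilt.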

Another aspect of our configuration is that we have both a JDBC connection for jobs that need them, and a basic Network Connection for those that don't. Both are configured to create their ENI in the same subnet and security group, so they have identical VPC access, but if database access is not required, we don't provision it - holding to the access principle of least privilege. While we had not originally intended to provision special subnets and security groups just for Glue Connections, an added benefit is also that we can directly classify any traffic used by our Glue jobs in our VPC flow logs, since it is always going to be originating in subnets specific to that purpose.
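
As a sketch of how that classification might look, the Python snippet below checks whether the source address of a VPC flow log record falls inside the Glue-dedicated subnets. It assumes the default flow log format, where srcaddr is the fourth space-separated field; the CIDR ranges and the function name are hypothetical.

```python
import ipaddress

# Hypothetical CIDR ranges for the subnets dedicated to Glue Connections.
GLUE_SUBNET_CIDRS = [
    ipaddress.ip_network("10.0.40.0/24"),
    ipaddress.ip_network("10.0.41.0/24"),
]


def is_glue_traffic(flow_log_line, glue_cidrs=GLUE_SUBNET_CIDRS):
    """Return True if the record's source address is inside a Glue-dedicated subnet.

    Assumes the default flow log format:
    version account-id interface-id srcaddr dstaddr srcport dstport ...
    """
    fields = flow_log_line.split()
    if len(fields) < 5:
        return False
    try:
        source = ipaddress.ip_address(fields[3])
    except ValueError:
        return False  # records without traffic data carry "-" instead of an address
    return any(source in cidr for cidr in glue_cidrs)


sample = ("2 123456789010 eni-0123456789abcdef0 10.0.40.17 10.0.12.5 "
          "49152 5432 6 20 4249 1652400000 1652400060 ACCEPT OK")
print(is_glue_traffic(sample))
```

A custom flow log format would need the field index adjusted to match.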

Conclusion

With sufficient patience and experimentation, anything is possible! Sadly AWS documentation is not always up to date or as detailed as the products themselves - but isn't that always the case in technology? If you are reading this blog post, I'm sure you have fought this battle for adequate documentation on your own work as well, and can appreciate the challenge.

The other constant challenge with AWS networking (and Cloud provider networking in general) is that even if you understand the fundamentals of TCP/IP, firewalling (e.g. with iptables), Network Address Translation, and even physical network connectivity, it often doesn't translate very well to what you must deal with in Cloud environments. What is a VPC? Is it a VLAN? Are security groups like iptables rules, or are NACLs more like iptables rules? Many of the answers to these questions are a mix of different layers of the networking stack and application behaviour - in other words, a custom flavour of Software Defined Networking.

If you liked reading this post, stay tuned, as we will have more content coming up soon about interesting challenges we've been facing at the intersection of Linux networking and AWS cloud environments.