How a Simple RDS Scheduler Job Led to 21TB Inter-AZ Data Transfer on AWS
We were working on a cost optimization project when we found an unusually high data transfer cost for the past 2 months. When we reviewed the changes and new setups from those 2 months, nothing much had changed; the infrastructure was as stable as before. So we did a deep dive into network analysis to understand what went wrong. We then enabled tagging on all the applications and found that one specific application accounted for most of this data transfer cost.
We first analyzed the type of data transfer charges by API operation name in AWS Cost Explorer, where we identified that InterZone data transfer IN was high. But we couldn't find much detail in the existing VPC flow log, since it captured only a minimal set of fields, so we enabled a VPC flow log with all the available fields and let it capture for 1 day. For that one day we were charged $6, and the breakdown of the internal data transfer is captured below. It's around 640GB per day.
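As a quick sanity check on those two numbers, here is a minimal sketch that converts between a daily charge and a daily volume. The rate of $0.01 per GB is our assumption for inter-AZ transfer in one direction; it is not taken from the bill above, so check the pricing page for your region before relying on it.

```python
# Assumed rate: $0.01 per GB for inter-AZ transfer in one direction.
# This value is an assumption for illustration, not read from the bill.
ASSUMED_RATE_USD_PER_GB = 0.01


def gb_from_charge(charge_usd, rate=ASSUMED_RATE_USD_PER_GB):
    """Estimate the transferred volume (GB) behind a given charge."""
    return charge_usd / rate


def charge_from_gb(gb, rate=ASSUMED_RATE_USD_PER_GB):
    """Estimate the charge (USD) for a given transferred volume."""
    return gb * rate


if __name__ == "__main__":
    print(f"$6/day   is roughly {gb_from_charge(6.0):.0f} GB/day")
    print(f"640 GB/day is roughly ${charge_from_gb(640):.2f}/day")
```

Under that assumed rate, a charge of about $6 and a volume of about 640GB per day are consistent with each other.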
To validate this, we summed the data transfer from the VPC flow log for the whole day. The top contributor was traffic from RDS to the EC2 instance: 715GB (this figure includes some additional metadata, so we can't get the exact value). We know the source is RDS because the SrcAddress in the flow log points to the RDS IP address. You can confirm that an IP belongs to RDS using EC2 console -> Network Interfaces -> search for the SrcAddress.
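The "top contributor" step can be sketched as a small aggregation over flow log records. This is not the exact query we ran; it is a minimal example that assumes the default (version 2) space-separated record format, and the sample records, IP addresses, and ENI IDs are made up. A flow log with all fields enabled uses a custom format, so the field list would need to match your own log format.

```python
from collections import Counter

# Field order of the default (version 2) VPC flow log record format.
# A custom-format flow log needs this list adjusted to match.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def top_talkers(lines, n=5):
    """Sum bytes per (srcaddr, dstaddr) pair and return the n largest."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(DEFAULT_FIELDS):
            continue  # malformed line or unexpected format
        rec = dict(zip(DEFAULT_FIELDS, parts))
        if rec["log_status"] != "OK":
            continue  # header lines and NODATA/SKIPDATA records carry no byte count
        totals[(rec["srcaddr"], rec["dstaddr"])] += int(rec["bytes"])
    return totals.most_common(n)


# Hypothetical sample records: 10.0.1.10 stands in for the RDS IP,
# 10.0.2.20 for the EC2 instance in another AZ.
SAMPLE = [
    "2 123456789012 eni-0aaa 10.0.1.10 10.0.2.20 5432 44321 6 1000 9000000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0aaa 10.0.1.10 10.0.2.20 5432 44322 6 700 6000000 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0bbb 10.0.2.20 10.0.1.10 44321 5432 6 300 40000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0ccc - - - - - - - 1700000000 1700000060 - NODATA",
]

if __name__ == "__main__":
    for (src, dst), total in top_talkers(SAMPLE):
        print(f"{src} -> {dst}: {total / 1e6:.1f} MB")
```

Grouping by the source and destination pair is what makes a single noisy path, such as RDS to one EC2 instance, stand out from the rest of the traffic.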
Then we drilled down to the RDS instance and captured the data traffic every hour. This analysis showed that every hour we were transferring ~30GB of data from RDS to EC2.
To validate this again, we checked the RDS network monitoring metric, which should show a similar value. It showed 9MB per second, and 9MB per second sustained for 1 hour comes to around 30GB.
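The unit conversion is easy to get wrong by a factor of 1000, so here is a small sketch of it. It assumes decimal units (1GB = 1000MB) and a constant rate, which is a simplification of what the metric actually shows.

```python
HOUR_SECONDS = 3600
DAY_SECONDS = 24 * HOUR_SECONDS


def mb_per_sec_to_gb(rate_mb_s, seconds):
    """Volume in GB for a constant rate in MB/s (decimal units assumed)."""
    return rate_mb_s * seconds / 1000


if __name__ == "__main__":
    per_hour = mb_per_sec_to_gb(9, HOUR_SECONDS)
    per_day = mb_per_sec_to_gb(9, DAY_SECONDS)
    print(f"9 MB/s is about {per_hour:.1f} GB per hour")
    print(f"9 MB/s is about {per_day:.1f} GB per day")

    # The flow log total of 715 GB per day, projected over a 30-day month.
    per_month_tb = 715 * 30 / 1000
    print(f"715 GB/day is about {per_month_tb:.2f} TB per month")
```

At 9MB per second this comes to a little over 32GB per hour, in the same range as the ~30GB per hour seen in the flow log, and 715GB per day over a 30-day month is where the roughly 21TB in the title comes from.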
Then we analyzed the complete packets at the application level. We could see one of the TCP connections receiving about 70MB of data in 10 seconds, roughly 7MB per second, which is in line with the 8-9MB per second seen in the RDS metric.
And here is the Network In data at the application server level.
During the investigation, we talked with the developers to check which code-level changes had been pushed, and identified a scheduler that ran every 1 minute to run some queries on RDS just to validate a few things. We stopped the scheduled service, and after that the data transfer from RDS to EC2 dropped drastically.
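To get a feel for how much each scheduler run was pulling, the hourly volume can be divided by the number of runs. This is only a rough sketch: it assumes that all of the ~30GB per hour came from the scheduler and that the runs were evenly sized, neither of which we measured directly.

```python
def data_per_run_mb(gb_per_hour, interval_seconds):
    """Average MB fetched per scheduler run, assuming evenly sized runs."""
    runs_per_hour = 3600 / interval_seconds
    return gb_per_hour * 1000 / runs_per_hour


if __name__ == "__main__":
    # ~30 GB per hour with a run every 60 seconds.
    print(f"about {data_per_run_mb(30, 60):.0f} MB per run")
```

Under those assumptions, each one-minute run was moving on the order of 500MB across AZs, which is a lot for a job that only validates a few things.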
EC2 Network In
Packet-level inspection from the application server: a maximum of about 2MB in a 7-second window, roughly 300KB per second.
And the VPC flow log traffic from RDS to EC2: it dropped from 20+GB to 200MB.