This post is the first in a series, the aim is to create a service that is dependent on multiple storage backends(Postgres, S3), but behaves as if it was only one. Recent data should be kept in Postgres, older data should be transferred to S3.

The rationale behind such a service is to minimize costs, this post is investigating if it is justified to have such a service.

Usecase

we pretend that customers have transaction data, this data is immutable and downloaded once a month(CSV).
The data size varies between 100kb and 1.5 Gb, the larger files are less frequent than the smaller ones.  The data is stored in an AWS RDS Postgresql database. The data is backed up once a month. Total data size < 10 T

For the calculation, I will focus mainly on the storage size and usage-related costs, additional VM costs are omitted.

User data size distribution

for the distribution, I picked a skewed distribution for the distribution
resulting in the following distribution between 100kb and 1.5 Gb:
for creating the distributions the Julia script is available at calculate.jl

AWS RD Postgres, Multi-AZ, us-east 

pricing reference, AWS RDS
storage costs : $0.23 per GB-month
storage backup: $0.010 per GB
data-transfer: $0.09 per GB

function cost_calculation_rds_postgres_per_user(v_invocations, v_gb)
out_costs = 0.09
storage_cost = 0.23
out_costs_per_user = out_costs * v_gb * v_invocations + storage_cost * v_gb
end

function cost_calculation_rds_postgres(v_invocations, v_gb)
user_costs = cost_calculation_rds_postgres_per_user.(v_invocations, v_gb)
backup_costs = sum(v_gb) * 0.010 + sum(v_gb) * 0.022 # s3 1 backup
backup_costs + sum(user_costs)
end

S3 Standard storage

pricing reference, AWS S3
storage costs: $0.23 per GB-month
storage backup: could be considered as a backup
post-costs: $0.005 / 1000 requests * 31 (assuming once a day backup of delta data)
select-costs: 
$0.0004 / 1000 requests

function cost_calculation_s3(v_invocations, v_gb)
storage_costs = sum(v_gb) * 0.023
post_cost = length(v_gb) * 0.005 / 1000 * 31 * 1 # once per day store day transaction file
select_costs = length(v_gb) * v_invocations * 0.0004 / 1000
storage_costs + post_cost + select_costs
end

COST overview per month

Users  Postgres S3
1K $137.61 $9.15
5K $669.60 $44.53
10K $1344.17 $89.38
100K $13539.74 $900.2

Conclusion

It is obvious that the storage for S3 is significantly cheaper even when we duplicate the data as a backup, also additional VM costs should be added for Postgres.
But depending on data access patterns (now fictional), it seems plausible to have a hybrid approach or if the nonfunctional requirements allow it to depend fully on S3 for our storage needs.

To makes things interesting we will go for the hybrid approach in our next blog using GO