Amazon Simple Storage Service, better known as S3, is an AWS service for storing data on the internet.
Buckets are the containers that store data on S3. A bucket can be viewed as a root “folder” that holds every object uploaded to it. Buckets also make it easy to identify how much storage each account uses, so users can be charged properly for what they consume.
Buckets are created in a specific region. This allows the architect building the infrastructure to choose the best place to store the data, based on where the users of the application are located. Choosing the right region can lower latency and, as a result, improve application performance. Transferring data between a bucket and an EC2 instance in the same region is free of charge.
Bucket names are globally unique. If someone creates a bucket called mycompany, that name is no longer available to any other AWS account, even if the root AWS account or the region is different.
It is possible to host a static website using a bucket. To do so, we store our web files in the bucket, make them public and enable website hosting in the bucket properties. After that we define our index page and the page that will be shown when an error happens (e.g. someone accessing a protected file without permission), and that’s it! We have our static website hosted on S3.
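The steps above can be sketched as two request payloads: one that enables website hosting and one that makes the files publicly readable. This is a minimal sketch, not the only way to do it. The bucket name example-bucket and the page names index.html and error.html are assumptions for illustration, and the payload shapes follow what the boto3 calls put_bucket_website and put_bucket_policy expect.

```python
import json


def build_website_configuration(index_page="index.html", error_page="error.html"):
    """Return the website settings: the index page and the error page."""
    return {
        "IndexDocument": {"Suffix": index_page},
        "ErrorDocument": {"Key": error_page},
    }


def build_public_read_policy(bucket_name):
    """Return a bucket policy that lets anyone download objects from the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # hypothetical statement label
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the payloads; with boto3 installed and credentials configured
    # they could be applied roughly like this:
    #   s3 = boto3.client("s3")
    #   s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))
    #   s3.put_bucket_website(Bucket="example-bucket", WebsiteConfiguration=website)
    website = build_website_configuration()
    policy = build_public_read_policy("example-bucket")
    print(json.dumps(website, indent=2))
    print(json.dumps(policy, indent=2))
```

Keeping the payloads in plain functions makes them easy to review before anything is sent to AWS.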
Each AWS account can have 100 buckets by default. If more buckets are needed, we must contact Amazon to have this limit increased. Another thing to keep in mind is that buckets cannot be transferred to another account. If the company wants to move a bucket to another account, it will have to delete the original bucket and recreate it in the desired one.
There is no guarantee that the bucket name will still be available, though, because anyone can claim it as soon as it is released. Another important thing to keep in mind is that a bucket deleted in one region can be instantly recreated in the same region, but it might take up to an hour before it can be recreated in another region.
Although the total storage size is unlimited, a single object stored on S3 has a maximum size of 5 TB. To upload files larger than 100 MB, Amazon recommends using multipart upload, which breaks the data into parts. Breaking the object into parts allows the user to stop and resume the upload when necessary.
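To make the idea of parts concrete, here is a small sketch that plans how an object could be split and which parts are still missing after an interrupted upload. It only does the arithmetic and does not talk to S3. The 100 MiB default part size is an assumption chosen to echo the threshold mentioned above, and the function names are hypothetical. The 5 MiB minimum part size and the 10,000 part maximum are S3's documented multipart limits.

```python
MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=100 * MIB):
    """Return a list of (part_number, offset, length) tuples covering the object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size when needed so the plan never exceeds MAX_PARTS.
    smallest_allowed = -(-object_size // MAX_PARTS)  # integer ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


def remaining_parts(parts, uploaded_numbers):
    """Return the parts that still have to be sent when resuming an upload."""
    return [part for part in parts if part[0] not in uploaded_numbers]


if __name__ == "__main__":
    plan = plan_parts(250 * MIB)
    print(plan)
    # Pretend parts 1 and 3 were uploaded before the connection dropped.
    print(remaining_parts(plan, {1, 3}))
```

Because each part is identified by its number, resuming only requires knowing which numbers were already accepted.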
There are three storage options on AWS that developers usually look into. From the most expensive to the cheapest, they are: S3 Standard Storage, S3 Reduced Redundancy Storage and Amazon Glacier.
S3 Standard Storage, as the name implies, is the default option. It is designed to be highly durable, offering 99.999999999% durability, also known as “eleven nines of durability”. It is also designed to be highly available, reaching 99.99% availability. We should use this option to store important data that we cannot afford to lose and that isn’t easily reproducible.
S3 Reduced Redundancy Storage, also known as S3 RRS, is a cheaper alternative to S3 Standard Storage. It also has 99.99% availability, which means that we should be covered when trying to access the data. The main difference is the durability: it only offers 99.99% durability. In other words, we might lose data more often than we’d like. S3 RRS is a very good choice when dealing with data that is easily reproducible.
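The gap between the two durability figures is easier to see with numbers. The sketch below assumes that durability is read as the average yearly chance of keeping each object, so the expected loss is the object count times one minus the durability. The figure of 10,000,000 objects is an arbitrary example, and the function name is hypothetical. Fractions are used so that the tiny values are not distorted by floating point rounding.

```python
from fractions import Fraction


def expected_annual_loss(object_count, durability_percent):
    """Return the average number of objects lost per year as an exact fraction.

    durability_percent is passed as a string such as "99.99" so that it can
    be converted exactly.
    """
    durability = Fraction(durability_percent) / 100
    return object_count * (1 - durability)


if __name__ == "__main__":
    objects = 10_000_000  # arbitrary example size
    standard = expected_annual_loss(objects, "99.999999999")
    reduced = expected_annual_loss(objects, "99.99")
    print("Standard:", float(standard), "objects per year")
    print("RRS:", float(reduced), "objects per year")
```

Under this reading, ten million objects in S3 Standard lose on average one object every 10,000 years, while the same data in S3 RRS loses on average 1,000 objects every year.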
Amazon Glacier is the cheapest of them. It costs only USD 0.004/GB per month to store data on it. Like the S3 storage types, Glacier is very durable, but it lacks availability. If we need to access data stored there, it might take many hours to retrieve the content. Files that the company wants to keep archived, such as old logs or data that goes unused for long periods of time, are good candidates to be moved to Amazon Glacier.
Lifecycle policies and versioning
Sometimes data stops being important after a period of time and we can move it to another place, or even remove it permanently. This can be done in S3 using lifecycle policies. To make it work, we go into our bucket and add a rule in the lifecycle section.
We can choose whether the rule applies to the whole bucket or only to a certain prefix. Choosing a prefix means that only that namespace inside our bucket behaves in the way we specify. After that, we have to choose what will happen to the object (e.g. move it to Glacier, delete it permanently) and how many days after the object's creation the action should take effect.
Need more than one rule? Not a problem, a bucket supports many rules at the same time.
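A lifecycle rule like the ones described above can be sketched as a plain payload. The helper names, the rule IDs, the logs/ and tmp/ prefixes and the day counts are all assumptions for illustration. The dictionary shape follows what the boto3 call put_bucket_lifecycle_configuration expects, where an empty prefix means the rule covers the whole bucket.

```python
def build_lifecycle_rule(rule_id, prefix="", glacier_after_days=None,
                         delete_after_days=None):
    """Return one lifecycle rule; an empty prefix targets the whole bucket."""
    if glacier_after_days is None and delete_after_days is None:
        raise ValueError("a rule needs at least one action")
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if glacier_after_days is not None:
        # Move objects to Glacier this many days after creation.
        rule["Transitions"] = [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"}
        ]
    if delete_after_days is not None:
        # Permanently delete objects this many days after creation.
        rule["Expiration"] = {"Days": delete_after_days}
    return rule


def build_lifecycle_configuration(*rules):
    """Bundle several rules, since a bucket can hold many at once."""
    return {"Rules": list(rules)}


if __name__ == "__main__":
    config = build_lifecycle_configuration(
        build_lifecycle_rule("archive-logs", prefix="logs/",
                             glacier_after_days=30, delete_after_days=365),
        build_lifecycle_rule("purge-tmp", prefix="tmp/", delete_after_days=7),
    )
    # With boto3 this could be applied roughly as:
    #   s3.put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket", LifecycleConfiguration=config)
    print(config)
```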
What if we need to keep track of the versions of our objects? In this case, we can enable versioning. When versioning is active, every time an object is replaced by a new one S3 keeps the previous version inside the bucket and we have access to both. Besides the obvious version control advantage, this can also be used as a backup method, since versioning also keeps track of deleted objects.
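The behaviour described above can be illustrated with a toy in-memory model. This is not how S3 is implemented and it is not an AWS API: the class name, the method names and the version IDs such as v1 are hypothetical. It only mimics the idea that overwriting an object keeps the old version, and that deleting it adds a marker on top of the history instead of erasing it.

```python
import itertools


class ToyVersionedBucket:
    """Toy model of a versioned bucket, for illustration only."""

    def __init__(self):
        self._history = {}               # key -> list of (version_id, body)
        self._counter = itertools.count(1)

    def _next_id(self):
        return f"v{next(self._counter)}"

    def put(self, key, body):
        """Store a new version and return its id; older versions are kept."""
        version_id = self._next_id()
        self._history.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        """Add a delete marker (body None) instead of removing the history."""
        version_id = self._next_id()
        self._history.setdefault(key, []).append((version_id, None))
        return version_id

    def get(self, key, version_id=None):
        """Return the latest body, or a specific older version if requested."""
        history = self._history.get(key, [])
        if version_id is None:
            if not history or history[-1][1] is None:
                raise KeyError(key)      # missing, or hidden by a delete marker
            return history[-1][1]
        for stored_id, body in history:
            if stored_id == version_id and body is not None:
                return body
        raise KeyError((key, version_id))


if __name__ == "__main__":
    bucket = ToyVersionedBucket()
    first = bucket.put("report.txt", "draft")
    bucket.put("report.txt", "final")
    bucket.delete("report.txt")
    # The object looks deleted, yet the first version is still recoverable.
    print(bucket.get("report.txt", version_id=first))
```

On a real bucket, versioning is switched on in the bucket properties, or with boto3 through put_bucket_versioning and a VersioningConfiguration of {"Status": "Enabled"}.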
Another important thing to keep in mind is that S3 has eventual consistency in some scenarios. If we create a new object and try to access it right away from another availability zone, Amazon guarantees that the object will be there. This comes at the cost of higher latency, though, since the operation needs to ensure all availability zones have been synced. When we make a PUT request to an object that already exists, or a DELETE request, we might run into eventual consistency, which means that if we ask for that object right away the older version might be returned.
All buckets and objects are private by default. This means that without an AWS account with the proper authorization we won’t be able to access the stored data. Even if someone manages to get the URL of an object, they still won’t be able to get the data, because S3 rejects requests that are not signed with valid credentials. Sometimes we want to share content with other people who don’t have an AWS account. What can be done in this scenario?
S3 allows us to make files available to anonymous users. This is done by granting Everyone permission to open or download the specific file. What happens if we want only our site to display the content? In this case, if someone on the internet copies the link and posts it on their own website, it shouldn’t work. To make that happen we can change the CORS configuration of the bucket. From there we can define which HTTP methods are accepted and which origins are allowed. With these rules in place, browsers only let pages from the permitted origins make cross-origin requests for the bucket’s content.
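A CORS configuration of the kind described above can be sketched as follows. The origin https://www.example.com, the helper name and the one hour cache time are assumptions for illustration. The payload shape follows what the boto3 call put_bucket_cors expects, and the method list is limited to the five HTTP methods that S3 CORS rules accept.

```python
S3_CORS_METHODS = {"GET", "PUT", "POST", "DELETE", "HEAD"}


def build_cors_configuration(allowed_origins, allowed_methods=("GET",),
                             max_age_seconds=3600):
    """Return a CORS configuration with a single rule for the given origins."""
    unknown = set(allowed_methods) - S3_CORS_METHODS
    if unknown:
        raise ValueError(f"unsupported methods: {sorted(unknown)}")
    return {
        "CORSRules": [
            {
                "AllowedOrigins": list(allowed_origins),  # sites allowed to ask
                "AllowedMethods": list(allowed_methods),  # HTTP methods accepted
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": max_age_seconds,  # browser preflight cache time
            }
        ]
    }


if __name__ == "__main__":
    config = build_cors_configuration(["https://www.example.com"])
    # With boto3 this could be applied roughly as:
    #   s3.put_bucket_cors(Bucket="example-bucket", CORSConfiguration=config)
    print(config)
```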