S3 and the Million Song Dataset Experiment


This post captures my notes on an S3 cloning/replication experiment. I am using a public AWS dataset (the Million Song Dataset, found here), which is 500G in size. I have created one m1.large EC2 node with a 1TB EBS volume mounted to house the dataset. I plan to copy the dataset into S3 and then clone that 500G set into a 1TB set. Once I have that, I can get a rough estimate of how long it takes to copy a 1TB set around, along with a few other insights.


To start, you need to set up an EC2 node, which is pretty easy to do. You can pick whatever instance size you want and zip through the other forms until you get to storage. On the storage step, add a new volume and fill in the snapshot ID with the one from the Million Song Dataset link (above). After you set that up along with your permissions and such, start the server.

Once the server has finished initializing, go ahead and SSH into it. Below is the list of commands I executed to get the EBS volume mounted (you can run them too; just use your own device ID and mount point). I have included the output along with the commands for reference.

[ec2-user@host]$ lsblk
xvdb  202:16   0   1T  0 disk
xvda1 202:1    0   8G  0 disk /
[ec2-user@host]$ sudo mkdir /million-song-set
[ec2-user@host]$ sudo mount /dev/xvdb /million-song-set
[ec2-user@host]$ ls /million-song-set
AdditionalFiles  data  LICENSE  lost+found  README

Now you have successfully mounted the EBS million song dataset volume!
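If you are not sure which device ID to use in the mount command, one way to spot candidates is to list block devices that have no mount point yet. This is a minimal sketch, assuming a Linux host with `lsblk` and `awk`; `unmounted_devices` is a hypothetical helper name.

```shell
# Print /dev paths for rows that have a device name but no mount point.
# Expects `lsblk -ln -o NAME,MOUNTPOINT` style rows on stdin.
unmounted_devices() {
  awk 'NF == 1 { print "/dev/" $1 }'
}

# Usage on the EC2 node:
#   lsblk -ln -o NAME,MOUNTPOINT | unmounted_devices
```

A parent disk whose partitions are mounted will also show up, so treat the output as a list of candidates rather than a definitive answer.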

Moving on to S3…

Now that we have the dataset, we need to get it into an S3 bucket. I will be doing this with the s3put command that is available on the EC2 node that we just created. Here is an example of this command:

s3put -a YOUR_ACCESS_KEY -s YOUR_ACCESS_SECRET --bucket million-song-set *

Now, if you are like me, you won’t have your access key and access secret on hand. Here is a link to the document that explains how to find them in your account. Once you execute that command, you’ll start getting output saying it’s ‘Copying FILE_NAME to FILE_NAME’; that will likely go on for some time. In fact, I started this process at 13:50 on Feb 17th and will be waiting for it to complete. I will check back in and continue writing after a bit.
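Since the whole point is to find out how long the copy takes, one option is to let the shell record the start and end times rather than noting them by hand. This is a minimal sketch, assuming a bash-like shell; `run_timed` is a hypothetical helper name and is not part of s3put.

```shell
# Run any command, then report the wall-clock time it took on stderr.
# The wrapped command's exit status is passed through unchanged.
run_timed() {
  local start end status
  start=$(date +%s)          # seconds since the epoch, before the command
  "$@"
  status=$?
  end=$(date +%s)            # seconds since the epoch, after the command
  echo "elapsed: $(( end - start )) s" >&2
  return "$status"
}
```

To use it, put `run_timed` in front of the upload command, for example `run_timed s3put -a YOUR_ACCESS_KEY -s YOUR_ACCESS_SECRET --bucket million-song-set *`.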

Update: The process completed on Feb 18th at 17:00, for a total run time of roughly 27 hours.
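Those timestamps give a rough throughput figure and a first guess at the 1TB copy time. The sketch below is back-of-the-envelope arithmetic only. It assumes the full 500G was transferred, that 13:50 to 17:00 the next day is 27 hours 10 minutes (97800 seconds), that 1TB means 1024G, and that copy time scales linearly with size. The function names are hypothetical.

```shell
# Average throughput in MB/s. $1 = data copied (GB), $2 = elapsed seconds.
throughput_mb_s() {
  awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / s }'
}

# Estimated hours to copy a larger set at the same rate.
# $1 = data copied (GB), $2 = elapsed seconds, $3 = target size (GB).
estimate_hours() {
  awk -v gb="$1" -v s="$2" -v target="$3" \
    'BEGIN { printf "%.1f\n", (s / gb) * target / 3600 }'
}

elapsed_s=$(( 27 * 3600 + 10 * 60 ))   # 27 h 10 min = 97800 s

throughput_mb_s 500 "$elapsed_s"       # prints 5.2
estimate_hours 500 "$elapsed_s" 1024   # prints 55.6
```

So at this rate the upload averaged about 5 MB/s, and a 1TB set would take a bit over two days if nothing else changes.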
