
Lately I've been moving legacy thumbnails and user images from our application servers to AWS S3. The cost of storing on S3 is a bargain and S3 can handle a practically unlimited number of objects. I'm really into simplifying our internal systems and removing as much maintenance as possible. During this work I needed to move one folder containing a bit more than 6 million files off to an S3 bucket.

I did this in two steps:

  • first, backing up the folder into a tar archive and moving that archive onto an EC2 instance (so as not to affect production capacity); a rough sketch of this step follows below

  • second, uploading all files to S3
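
For the first step, the commands looked roughly like this. The host name, paths and archive name are placeholders, so read it as a sketch rather than the exact invocation:

# On the application server: pack the folder into a single tar archive
tar -cf images.tar /var/www/uploads/images

# Copy the archive over to the EC2 instance so production isn't affected
scp images.tar ec2-user@<EC2_HOST>:/data/

# On the EC2 instance: extract the archive before uploading
cd /data && tar -xf images.tar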

The first step was easy and pleasant. The second one was the hard part. I started with s3cmd (a tool I've used in the past to upload and download data from S3), but ran into memory issues as s3cmd allocated a lot of memory. I moved on to s4cmd and hit the same thing there; moving this amount of files turned out to be an interesting problem.

After some googling, and some more googling, I finally found a script called s3-parallel-put. The Python script is quite easy to get going with; here's my final configuration.

export AWS_ACCESS_KEY_ID=<KEY_ID>
export AWS_SECRET_ACCESS_KEY=<KEY>

s3-parallel-put \
    --bucket=images.example.com \
    --host=s3.amazonaws.com \
    --put=stupid \
    --insecure \
    --processes=30 \
    --content-type=image/jpeg \
    --quiet \
    .

Success!
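
A quick way to sanity-check the result afterwards is to let the AWS CLI summarize the object count in the bucket (a sketch, assuming the CLI is installed and configured):

# List all keys and print the trailing summary (slow for millions of objects)
aws s3 ls s3://images.example.com --recursive --summarize | tail -n 2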

There are a couple of wins here:

  • --put=stupid always uploads the data without checking whether the object key already exists (one less HEAD request per key)
  • --insecure skips SSL, which is faster
  • --processes=30 gives a lot of parallel uploads

Coming back to the final solution, I ended up putting all user images in an S3 CNAME bucket and added Cloudflare caching in front of it. An elegant and cheap solution.
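
The CNAME setup works because the bucket is named after the hostname, so S3's virtual-hosted style URLs resolve correctly. Roughly, the DNS record looks like the line below (illustrative; with Cloudflare proxying the hostname, responses get cached at the edge):

; Illustrative zone entry: the bucket name must match the hostname
images.example.com.  CNAME  images.example.com.s3.amazonaws.com.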
