We are currently developing a system that aims to analyze every domain on the internet. This is a really challenging task and not something that gets done in a few months. Besides loads of problems, like finding that many domains and parsing them in a reasonable amount of time, we are also building a MongoDB cluster to store the analyzed information. Our database currently holds 200 GB split across two shards, but we expect it to grow to 1-2 TB of data.
There are a lot of posts like this on the web, so I’m probably not telling you anything new (especially if you are a senior dev or DBA who just goes: “oh my god… I knew that 10 years ago, it’s the same with every database” :)), but I really wanted to share the following things, which bugged me for quite a while: