Docker and Salt Redux

Recently I was digging into salt innards again; that meant it was time to dust off the old docker salt-cluster script and shoehorn a few more features in there.

NaCl up close and personal.

There are some couples that you just know ought to get themselves to a relationship counselor asap. Docker and SSHD fall smack dab into that category. [1]  When I was trying to get my base images for the various Ubuntu distros set up, I ran into issues with selinux, auditd and changed default config options for root, among others. The quickest way to deal with all these annoyances is to turn off selinux on the docker host and comment the heck out of a bunch of things in various pam configs and the sshd config.

The great thing about Docker though is that once you have your docker build files tested and have created your base images from those, starting up containers is relatively quick. If you need a configuration of several containers from different images with different things installed you can script that up and then with one command you bring up your test or development environment in almost no time.
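To give the flavor, a stripped-down sketch of that kind of bring-up script might look like this; the image and container names are made up for illustration, not anything official:

    #!/usr/bin/env python
    # Bring up one master container and a handful of minions from
    # prebuilt images, all with one command. Image and container
    # names here are invented for the example.
    import subprocess

    def start_cluster(minions=5):
        # one master container, detached
        subprocess.check_call(
            ["docker", "run", "-d", "--name", "master", "local/salt-master"])
        for i in range(minions):
            # --link gives each minion a "salt" hosts entry for the master
            subprocess.check_call(
                ["docker", "run", "-d", "--name", "minion%02d" % i,
                 "--link", "master:salt", "local/salt-minion"])

    if __name__ == "__main__":
        start_cluster()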

Using this setup, I was able to test multiple combinations of salt minion and master versions on Ubuntu distros, bringing them up in a minute and then throwing them away when done, with no more concern than for tossing a bunch of temp files. I was also able to model our production cluster (running lucid, precise and trusty) with the two versions of salt in play, upgrade it, and poke at salt behavior after the upgrade.

A good dev-ops is a lazy dev-ops, or maybe it’s the other way around. Anyways, I can be as lazy as the best of ’em, and so when it came to setting up and testing the stock redis returner on these various salt and ubuntu versions, that needed to be scriptified too; changing salt configs on the fly is a drag to repeat manually. Expect, ssh, cp and docker ps are your best friends for something like this. [2]
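For example, something like the following (the container naming and the root-over-ssh setup are particulars of my images, not anything standard) will push a new master config into a running container and bounce the master:

    #!/usr/bin/env python
    # Find a container's address via docker inspect, then scp a new
    # config in and restart salt over ssh. Assumes the containers
    # run sshd with root key access, as mine do.
    import subprocess

    def container_ip(name):
        fmt = "{{ .NetworkSettings.IPAddress }}"
        out = subprocess.check_output(["docker", "inspect", "-f", fmt, name])
        return out.decode("ascii").strip()

    def push_master_config(name, localpath):
        ip = container_ip(name)
        subprocess.check_call(
            ["scp", localpath, "root@%s:/etc/salt/master" % ip])
        subprocess.check_call(
            ["ssh", "root@" + ip, "service salt-master restart"])

    if __name__ == "__main__":
        push_master_config("master", "master.conf.redis")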

In the course of getting the redis stuff to work, I ran across some annoying salt behavior, so before you run into it too, I’ll explain it here and maybe save you some aggravation.

The procedure for setting up the redis returner included the following bit:

– update the salt master config with the redis returner details
– restart the master
– copy the update script to the minions via salt
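
Scripted up, the whole dance looks more or less like this; the redis settings are the stock returner ones, while the script paths are invented for the example:

    #!/usr/bin/env python
    # The three steps above, in order: add the redis returner config,
    # restart the master, push the update script to all minions.
    import subprocess

    REDIS_SETTINGS = "redis.db: '0'\nredis.host: 'redishost'\nredis.port: 6379\n"

    def setup_redis_returner():
        # append the returner settings to the master config
        with open("/etc/salt/master", "a") as conf:
            conf.write(REDIS_SETTINGS)
        subprocess.check_call(["service", "salt-master", "restart"])
        # have every minion fetch the script from the master's fileserver
        subprocess.check_call(
            ["salt", "*", "cp.get_file",
             "salt://scripts/update.sh", "/usr/local/bin/update.sh"])

    if __name__ == "__main__":
        setup_redis_returner()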

This failed more often than not, on trusty with 2014.1.10. After these steps, the master would be seen to be running, the minions were running, a test.ping on all the minions came back showing them all responsive, and yet… no script copy.

The first and most obvious thing is that the salt master restart returns right away, but the master is not yet ready to work. It has to read keys, spawn worker threads, each of those has to load a pile of modules, etc.  On my 8-core desktop, for 25 workers this could take up to 10 seconds.

Salt works on a pub/sub model [3], using ZMQ as the underlying transport mechanism. There's no ack from the client; if the client gets the message, it runs the job if it's one of the targets and returns the results. If the client happens to be disconnected, it won't get the message at all. Salt minions do reconnect if their connection goes away, but this takes time.

Salt also encrypts all messages (the encryption is salt's own, layered on top of the ZMQ transport). Upon restart, the master generates a new AES key, but the minions don't learn about this until they receive their first message, typically with some job to run. They will try to use the key they had lying around from a minute ago to decrypt, fail, and then be forced to re-authenticate and try again. But this retry takes time. And while the job will eventually be run and the results sent back to the master, your waiting script may have long since given up and gone away.

With the default salt config, the minion reconnect can take up to 5 seconds. And the minion re-auth retry can take up to 60 seconds. Why so long? Because in a production setting, if you restart the master and thousands of minions all try to connect at once, the thundering herd will kill you. So the 5 seconds is an upper limit, and each minion will wait a random amount of time up to that upper limit before reconnect. Likewise the 60 seconds is an upper limit for re-authentication. [4]

This means that after a master restart, it’s best to wait at least 15 seconds before running any job, 10 for master setup and 5 for the salt minion reconnect. This ensures that the salt minion will actually receive the job. (And after a minion restart, it’s best to wait at least 5 seconds before giving it any work to do, for the same reason.)
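
In script form, with the numbers from my setup (your master may need more or less padding):

    #!/usr/bin/env python
    # Restart the master, then wait out both delays before handing
    # out work: ~10 seconds for master startup on my 8-core box,
    # plus the 5-second upper bound on minion reconnects.
    import subprocess
    import time

    MASTER_STARTUP_PAD = 10   # keys read, workers spawned, modules loaded
    MINION_RECONNECT_PAD = 5  # default upper bound on the reconnect delay

    def restart_master_and_wait():
        subprocess.check_call(["service", "salt-master", "restart"])
        time.sleep(MASTER_STARTUP_PAD + MINION_RECONNECT_PAD)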

Then be sure to run your salt command with a nice long timeout, longer than 60 seconds. This ensures that the re-auth and the job run will get done, and the results returned to the master, before your salt command times out and gives up.

Now the truly annoying bit is that, in the name of perfect forward secrecy, an admittedly worthy goal, the salt master will regenerate its key after 24 hours of use, with the default config. And that means that if you happen to run a job within a few seconds of that regen, whenever it happens, you will hit this issue. Maybe it will be a puppet run that sets a grain, or some other automated task that trips the bug. Solution? Make sure all your scripts check for job returns, allowing for the possibility that the minion had to re-auth first.
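
Here's a sketch of what that checking might look like, using the salt CLI for brevity. Whether a missing return actually shows up as a nonzero exit code depends on your salt version, so treat this as the idea rather than the implementation:

    #!/usr/bin/env python
    # Run a job with a timeout comfortably past the 60-second re-auth
    # window, and go around once more if not everyone answered.
    import subprocess

    def run_allowing_reauth(target, module_func, timeout=90, attempts=2):
        for _ in range(attempts):
            rc = subprocess.call(
                ["salt", "-t", str(timeout), target, module_func])
            if rc == 0:
                return True
            # some minion may have been stuck re-authing; try again
        return False

    if __name__ == "__main__":
        run_allowing_reauth("*", "test.ping")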

Tune in next time for more docker-salt-zmq fun!

[1] Docker ssh issues on github
[2] Redis returner config automation
[3] ZMQ pub/sub docs
[4] Minion re-auth config and Running Salt at scale

Ditching gnome 3 for kde 4

I finally made the switch. I’ve been a longtime fan of gnome, critical of kde memory bloat, and not fond of the lack of integration that has haunted kde and its apps for years. But I finally made the switch.

I have an Nvidia graphics card in this three-and-a-half-year-old laptop, on which I run the Nvidia proprietary drivers. Let’s not kid ourselves; in many cases the open source drivers aren’t up to snuff, and this card and laptop combo is one of those cases. I’m talking about regular use for watching videos, doing my development work and so on; not games, not exotic uses of blender or what have you, nothing out of the ordinary.

Gnome shell has been a memory hog since its inception, with leaks that force the shell to die a horrible death or hang in odd ways after a few days of uptime. Maybe this is caused by interaction with the Nvidia drivers, and maybe not, but it’s a drag.

Nonetheless, it was a drag I was willing to put up with, in the name of ‘use the current technologies, they’ll stabilize eventually’. No, no they won’t. With the latest upgrade to Fedora 20, I noticed a bizarre mouse pointer bug which goes something like this:

Type… typetypetype… woops mouse pointer is gone. Huh, where is it? Try alt-shift-tab to see the window switcher. Ah *whew*, I can at least switch to another window, and now the pointer is back.

Only, that alt-shift-tab trick didn’t always work the first time, and sometimes it didn’t work at all. I was forced often enough to hard power off the laptop (no alternate consoles to switch into, and the system was in hard lockup doing something disk-intensive, who knows what… maybe swapping to death).

After the last round of package updates I started seeing lockups multiple times a day. The bug reporter, on the few times gnome shell would actually segfault, refused to report the bug because it was a dupe and what was I doing using those proprietary drivers anyways.

Usability has a bunch of factors in there, but the most basic is the ability to use the system without lockups. So… kde 4.11. Five days later I have had no mouse pointer issues, no lockups, no OOM, no swapping. I miss my global emacs key bindings, I couldn’t get gnome terminal to work right because of the random shrinking terminal bug, and the world clock isn’t exactly the way I’d like it, but I’ll live with that. Goodbye gnome 3; if you see gnome 4 around some day, my door will be open.

Greek Wikipedian sued for adding sourced documented information

No joke, sadly. Tomorrow is the preliminary injunction hearing where the judge will decide whether the user should be ordered to temporarily remove the content the suing politician doesn’t like, pending trial… only of course to have the edit reverted by someone else, I’m sure. Read all about it here: We are all Diu!

Unfortunately, as goofy as the lawsuit seems, the threat of censorship is real. What happens tomorrow could influence the future of Wikipedia in a big way. Stay tuned! And pass on the word, let people know.

Test clusters via docker containers

baby tux containers

A little while ago we ran into an odd salt bug which was inconvenient to try debugging in production. What I really wanted, since I don’t have a private cloud nor free Rackspace instances, was a way to build and run a cluster of 100 or so salt minions on my desktop for testing. One of the developers suggested using docker containers for this, and so my odyssey began.

Docker has been generating a lot of buzz, and a lot of questions, starting with “Why is this any better than running a virtual machine?” Docker, in a nutshell, runs processes in a chroot, using your host’s kernel. This makes it lighter weight than a VM, and if you want multiple containers running the same process with minor configuration changes, they can all share the same base image, saving disk space. LXC (Linux containers) and devicemapper are used under the hood; docker itself consists of a build system that allows you to write a config file for generating a linux container with specific contents and running specified processes. It also implements a REST-ish API that provides information about images (the base chroot) and containers (the thin copies of the image) as well as allowing for their creation, configuration and deletion.
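
For instance, a couple of the read-only API calls can be exercised with nothing but the python standard library. This assumes you’ve told the daemon to listen on a local TCP port (e.g. with -H tcp://127.0.0.1:4243) instead of, or in addition to, the usual unix socket:

    #!/usr/bin/env python
    # List images and containers via the REST-ish API. The port is
    # whatever you gave the daemon with -H; 4243 is just an example.
    import json
    try:
        from urllib.request import urlopen   # python 3
    except ImportError:
        from urllib2 import urlopen          # python 2

    BASE = "http://127.0.0.1:4243"

    def api_get(path):
        # GET a JSON endpoint and decode the result
        return json.loads(urlopen(BASE + path).read().decode("utf-8"))

    if __name__ == "__main__":
        for image in api_get("/images/json"):
            print("image: %s" % image["Id"][:12])
        for container in api_get("/containers/json?all=1"):
            print("container: %s (%s)" % (container["Id"][:12],
                                          container["Image"]))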

Docker is very much in development so anything that follows may be superseded by the time you try to use it yourself.

Docker drops most capabilities for processes running inside the container, though there is still more work to be done on this front. SELinux was the first problem I encountered; running Ubuntu precise images with sshd under F20 fails to do anything useful because sshd thinks that SELinux is enabled after checking /proc (mounted from the host running the containers).

After putting together a hackish workaround involving a local build of libselinux, I needed the salt master to start up first, collect its key fingerprint, and then get that information onto all the minions before they started up. I also needed to get hostname and ip information into /etc/hosts everywhere, which can’t be done from within the container: in-container processes do not have mount capability, and /etc/hosts is a read-only file mounted from the host for security reasons. Thus was born another hackish workaround, which relies on puppet apply and a tiny python web server with a REST-ish API to add, apply and remove puppet manifests.
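
The web server part is tinier than it sounds; the guts are more or less the following, with the endpoint and paths invented for the sketch (the real thing also handles adding and removing manifests):

    #!/usr/bin/env python
    # Accept a puppet manifest in a POST body and run "puppet apply"
    # on it, reporting success or failure via the HTTP status.
    import subprocess
    try:
        from http.server import BaseHTTPRequestHandler, HTTPServer   # python 3
    except ImportError:
        from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # python 2

    class ManifestHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers["Content-Length"])
            with open("/tmp/incoming.pp", "wb") as manifest:
                manifest.write(self.rfile.read(length))
            rc = subprocess.call(["puppet", "apply", "/tmp/incoming.pp"])
            self.send_response(200 if rc == 0 else 500)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8001), ManifestHandler).serve_forever()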

There were and are a few other fun issues, but to keep a long hackish story short, I can now spin up a cluster of up to a few hundred minions of whichever flavor of salt in just a few minutes, do my testing, and either save the cluster for later or toss it if I’m done. [1]

Besides generating test environments, the other use for docker, and perhaps the thing most people are talking about, is PaaS services/cloud hosting. [2] Before you decide to replace all your VMs with Docker, however, you should know that it’s really intended as a way to package up one or two processes, not as a way to run a full LAMP stack, although that’s been done too, and more, [3] using supervisord to do the work of process management.

[1] https://github.com/apergos/docker-saltcluster
[2] http://www.rackspace.com/blog/how-mailgun-uses-docker-and-contributes-back/
[3] https://github.com/ricardoamaro/docker-drupal and https://github.com/ricardoamaro/drupal-lxc-vagrant-docker

Image derived from https://commons.wikimedia.org/wiki/File:Baby.tux-alpha-800×800.png and https://commons.wikimedia.org/wiki/File:Different_colored_containers_pic1.JPG

If only bz2 used byte-aligned blocks…

this post would have been written ages ago!  But it doesn’t and so here we are.

TL;DR version: new toy available to play with, uses the bz2 multistream enwikipedia XML dump of current pages and articles, displays text of the article of your choice. No fancy rendering, and raw wikitext infoboxes still look like &^%$#@; that’s left as an exercise for the reader. You will need linux, bash, python, the bz2 multistream files from here or from a mirror site, and the code from here. Ah, you’ll also need about 10GB worth of space, the dumps are large these days!

Want more information? Check out the README.txt and HACKING.txt.

A special mention here goes to the author of the Offline Wikipedia project, which used bzip2recover to exploit the block-oriented nature of the compression algorithm, breaking the XML file into tens of thousands of small files and building an index of page titles against these files. If bz2 used byte-aligned blocks, he might have written my code long ago!
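
For the curious, the core trick is small enough to show here. This sketch assumes you’ve already decompressed the index file, whose lines look like offset:pageid:title:

    #!/usr/bin/env python
    # The multistream dump is many separate bz2 streams glued
    # together; the index maps each page to the byte offset of the
    # stream containing it. Seek there and decompress just that one
    # stream (each holds at most 100 pages).
    import bz2

    def find_offset(index_path, title):
        with open(index_path) as index:
            for line in index:
                offset, _pageid, page_title = line.rstrip("\n").split(":", 2)
                if page_title == title:
                    return int(offset)
        return None

    def get_stream(dump_path, offset):
        decomp = bz2.BZ2Decompressor()
        pieces = []
        with open(dump_path, "rb") as dump:
            dump.seek(offset)
            while True:
                data = dump.read(262144)
                if not data:
                    break
                try:
                    pieces.append(decomp.decompress(data))
                except EOFError:
                    break  # stream ended exactly on a read boundary
                if decomp.unused_data:
                    break  # we've read past the end of this stream; done
        return b"".join(pieces)  # XML for up to 100 pages

From there it’s just a matter of fishing the right <page> element out of the XML.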

Return of the revenge of the disk space

We’ve been generating bundles of media in use on the various Wikimedia projects, so that readers or editors of these projects can download the media with just a few clicks.  This approach is great for the downloader but takes more space than we would like, since files hosted on Commons but in use on multiple projects will be stored multiple times.  We were hoping to be clever about this by pulling out the files stored in multiple places and bundling those up separately for download. The first step was to generate a list of projects that have the largest number of files used by some other project.  The results were discouraging.

Below is a list of the projects with the most media in use (the count is in parentheses), each followed by the number of media files it has in common with some other top project, in descending order.

enwiki(2237560): 519734|dewiki 480943|frwiki 393602|eswiki 354075|itwiki 352064|ruwiki 304120|plwiki
dewiki(1426181): 519734|enwiki 318361|frwiki 246937|itwiki 236249|ruwiki 223472|plwiki 222664|eswiki
frwiki(1046546): 480943|enwiki 318361|dewiki 255563|eswiki 250681|itwiki 219759|ruwiki 201138|plwiki
ruwiki(649728):  352064|enwiki 236249|dewiki 219759|frwiki 187919|eswiki 185788|itwiki 173388|plwiki
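
Numbers like these don’t require anything fancier than set intersections over per-project lists of in-use media filenames; here’s a sketch, with the list filenames made up:

    #!/usr/bin/env python
    # Count media files shared pairwise between projects, given one
    # file of in-use media names per project.
    from itertools import combinations

    PROJECTS = ["enwiki", "dewiki", "frwiki", "ruwiki",
                "eswiki", "itwiki", "plwiki"]

    def load_inuse(project):
        with open("media-inuse-%s.txt" % project) as names:
            return set(line.rstrip("\n") for line in names)

    if __name__ == "__main__":
        inuse = dict((proj, load_inuse(proj)) for proj in PROJECTS)
        for first, second in combinations(PROJECTS, 2):
            shared = len(inuse[first] & inuse[second])
            print("%s/%s: %d shared" % (first, second, shared))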

Eliminating all of the duplication between just the first few top projects would entail creating multiple separate files for download, making things significantly less convenient for the downloader without the space gains to justify it.

For just the top five projects by media usage, the number of media files common to them all is only 66979, a pittance.  But even if we took the roughly 500 thousand files in use on both dewiki and enwiki and put them in a separate bundle, with a separate bundle for the rest of enwiki and a separate one for the rest of dewiki, we would save only about half a million duplicate copies, not much of a gain compared to the nearly 6 million unique media files total in use.

So for now we’ll just keep the media bundles per project like they are.  If anyone has any bright space-saving ideas, please chime in with a comment.

(Disk) space: the final frontier

Where are those awesome little cubes of holographic data that we used to see on Star Trek, which contained seemingly endless amounts of data? While we wait for someone to get on that problem, I get to sort out mirrors and backups of media in a world where servers with large raids cost a hefty chunk of change. Just a few days ago our host that serves scaled media files was down to less than 90 GB free.

Lost in time and lost in space on the scaled media server

In theory scaled media can be regenerated from the originals at any time, but in practice we don’t have a media scaler cluster big enough to scale all media at once. This means that we need to be a bit selective in how we “garbage collect”. Typically we generate a list of media not in use on the Wikimedia projects and delete the scaled versions of those files. The situation was so bad, however, that the delete script, which sleeps in between every batch of deletes, put enough pressure on the scaled media server that it became slow to respond, causing the scalers to slow down and thus affecting service for the entire site for a few minutes.
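
The script itself is nothing special; the shape of it is roughly this, with batch size and sleep length tuned to taste:

    #!/usr/bin/env python
    # Delete scaled media for unused originals in small batches,
    # sleeping between batches so the file server can breathe.
    import os
    import sys
    import time

    BATCH = 1000
    SLEEP_SECS = 5

    def delete_listed(list_path):
        with open(list_path) as listed:
            paths = [line.rstrip("\n") for line in listed]
        for start in range(0, len(paths), BATCH):
            for path in paths[start:start + BATCH]:
                try:
                    os.unlink(path)
                except OSError:
                    pass  # already gone; keep moving
            time.sleep(SLEEP_SECS)

    if __name__ == "__main__":
        delete_listed(sys.argv[1])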

The solution to this turned out to be to remove the scaled media server completely from the equation, and rely entirely on the new distributed media backend, OpenStack’s Swift. Whew! But we are really only putting off a discussion that needs to happen soonish: how do we keep our pile of scaled media from expanding at crazy rates?

Consider that we will generate thumbnails of any size requested, on demand if they don’t exist already, and these files are never deleted until the next time we run some sort of cleanup script. With Swift it’s going to be easy to forget that we have limited disk storage and that scaled media are really just long-lived temporary files. Should we limit thumb generation to specific sizes only (which could be pregenerated rather than produced on the fly)? Should we generate anything requested but toss non-standard sizes every day? Should we toss less frequently used thumbs (and how would we know which ones those are) on a daily or weekly basis?