On InnoDB compression in production

Our latest changes have been pushed to the public mysql@facebook branch, allowing this post to happen \o/

Recently we started rolling out InnoDB compression to our main database tier, and that has been a huge undertaking for multiple teams and a major test for MySQL. Nizam was surely the hero of all this work – make sure you don’t miss his talk about it at the MySQL conference.

Though the MySQL manual has quite an introduction to the benefits of compression, we agree that the benefits are good – in theory we can do fewer reads from disk, keep more data in the buffer pool or flashcache, and take up less space on premium disk property. The benefits sounded so great that our engineering team decided to disregard what Oracle has to say about workload characteristics and make compression work with whatever workload we have.

There were major architectural issues – for example, writing full compressed page images to the transaction log is a huge flaw for busy systems, and even with write-behind caching on the underlying hardware it ended up being a bottleneck and a resource hog.

Another important architectural change for OLTP is avoiding failed compressions – those were the major CPU cost. The solution was adaptive padding – the server tries to maintain uncompressed page images at a fill level that nearly always compresses into the smaller block size.
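
To illustrate the idea, here is a minimal sketch of such a feedback loop – this is not the actual InnoDB implementation; the class name, thresholds and step size below are invented:

import zlib

PAGE_SIZE = 16 * 1024      # uncompressed page size
BLOCK_SIZE = 8 * 1024      # target compressed block size on disk
PAD_STEP = 512             # padding adjustment granularity (made up)
MAX_FAIL_PCT = 5           # tolerated share of failed compressions (made up)

class AdaptivePadding:
    """Keep pages slightly under-filled so they nearly always compress."""
    def __init__(self):
        self.pad = 0                   # bytes deliberately left empty per page
        self.attempts = 0
        self.failures = 0

    def usable_bytes(self):
        return PAGE_SIZE - self.pad    # how full the server lets a page get

    def observe(self, page_bytes):
        self.attempts += 1
        if len(zlib.compress(page_bytes)) > BLOCK_SIZE:
            self.failures += 1         # failed compression: the page has to split
        if self.attempts >= 100:       # re-evaluate padding periodically
            if self.failures * 100 > MAX_FAIL_PCT * self.attempts:
                self.pad = min(self.pad + PAD_STEP, PAGE_SIZE // 2)
            else:
                self.pad = max(self.pad - PAD_STEP, 0)
            self.attempts = self.failures = 0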

There were also various bugs that caused servers to melt down if there was even a single compressed table on them, as well as numerous other compression problems to fix.

Obviously compression means much more CPU work, and that is especially costly for the replication thread – it has way less time left to be blocked on disk reads, as it has to spend more time compressing and decompressing. There are two ways to approach that problem: one is doing fewer disk reads, the other is doing less of everything else. Of course, if there are multiple ways to solve a problem, we will take all of them :)

Proper replication prefetching was at the core of this effort – not only does it precache table data from disk, it also decompresses pages for the replication thread, and loads relay logs back in if they have already been paged out. Our newest push has a few stability and performance fixes for Percona’s fake changes – apparently sibling page read-ins for InnoDB latching were nearly 95% of our replication thread I/O at one point.

The “everything else” part consisted of various CPU inefficiencies and stalls. For example, InnoDB waits for five milliseconds if it detects that another thread is already reading the compressed page – and these collisions sure happen with active prefetching and busy workloads – we constantly saw the replication thread “stuck in the 90ies”.

Also, InnoDB was actively double-checksumming pages when decompressing them – though a checksum on disk read is surely understandable, checksumming while reading a page from the buffer pool is certainly not – a few % of CPU went in that direction.

There were a few other evil behaviors in the new code paths – e.g. malloc() was being called while holding the InnoDB buffer pool mutex, escalating stalls from other places into InnoDB lockups.

We’re still trying to understand the implications of the uncompressed-page LRU heuristics – InnoDB will increase the number of pages held uncompressed if it is doing 50x more decompressions than disk reads, which on a 10000 IOPS machine means around 8GB/s of decompressed data. We added the tunables, but for now it looks like some of our machines are still I/O bound, and we’re not sure whether that is a problem on other types of hardware.
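
Back-of-the-envelope arithmetic behind that figure, assuming 16KiB uncompressed pages:

disk_reads_per_s = 10000        # a ~10000 IOPS machine
heuristic_ratio = 50            # threshold: 50x more decompressions than disk reads
page_size = 16 * 1024           # bytes per uncompressed page (assumption)
print(disk_reads_per_s * heuristic_ratio * page_size / 1e9)   # ~8.2 GB/s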

It was a bit of fun to spot that more than 2% of time was spent loading database options (yes, that db.opt file that has pretty much nothing in it) – even though they are needed only for CREATE TABLE.

Of course, more instrumentation and monitoring was necessary to understand and manage compression in production – standard InnoDB gives just a global overview, but to properly understand what is going on one needs per-table information.

There’re plenty of possible next steps for the compression future – more efficient packing, performance improvements, different algorithms, etc. – but for now we see that the first phase worked out.

TL;DR: compression works for OLTP with the newest mysql@facebook changes, and there has been lots of fun work by our database teams.


Google :-(

For the past few days (days!!!) I have not been able to log into my Google account (except GMail, thanks for that! – Voice/Blogger/Plus/Docs/… are all broken though). I either get this message:

Sorry, there seems to be a problem. The service you're looking for is temporarily unavailable. We're working hard to restore your access as soon as possible. Please try again in a few hours. Thanks for your patience.

Or I’m sent into an infinite redirect loop. None of their self-help pages are useful or even mention this error.

Apparently if you get such a message, you should read it the opposite way:

“Sorry (or you should be sorry), there seems to be a problem. This service is looking for you and you seem to be temporarily unavailable. We’re not working on this at all, it may show up on some report at the end of the year though as lost revenue, sad, we see you spent a few hundred dollars on our services, so you trust us, but of course we have enough money from everyone else. You may try again in a few minutes, a few hours or a few days, it won’t help you though. Though we appreciate your patience, you should start researching and debugging this yourself instead, thank you.”

I cleared all Google cookies on my browser and it started working. Apparently this has happened to others too.

Update: again…


on MySQL replication prefetching

For the impatient ones, or those who prefer code to narrative, go here. This is long overdue anyway, and Yoshinori already beat me to it, hehe…

Our database environment is quite busy – there’re millions of row changes a second, millions of I/O operations a second, and the impact of that can be felt on each shard. Especially as we also have to replicate to other datacenters, single-threaded replication in MySQL becomes a real bottleneck.

We use multiple methods to understand and analyze replication lag composition – simple sampling of replication thread states via the MySQL processlist helps to understand the logical workload components (and work in that field yields great results), and pstack/GDB-based replication thread sampling shows server-internal behavior quite well too (a similar technique was used for the accept thread visualisation).

The biggest problem with a single replication thread is that it has to read data in order to execute queries (rather than applying physical page deltas like PG, or just appending to files like HBase, it does logical edits to page data) – we can observe 95% of process time spent in that state. As there is generally just one outstanding data read per replication thread, any other workload hitting the machine also makes replication reads slower.

Generally, the obvious way to deal with slow I/O is to issue more outstanding parallel requests, and the only way to do that, apart from parallel replication, is to predict what will be needed in the future and try to fetch it ahead of time.

Many, many moons ago Paul Tuckfield talked about the YouTube replication prefetcher – it would take write statements yet to be executed in the relay logs, convert them to SELECTs and run them before the replication thread needed that data. He still says that was one of his most satisfying quick hacks :-)

Maatkit (now Percona Toolkit) introduced mk-slave-prefetch (I played with it back in 2008, though I didn’t put it into operation at the time), and eventually it looked like a reasonable option for prefetching statements on our database cluster.

5000 lines of Perl is not the easiest code to work with (or to debug), so the journey was quite bumpy. We got it working in some shape eventually, but Baron, the original author, has something to say about it:

Please don’t use mk-slave-prefetch on MySQL unless you are Facebook. Or at least don’t tell your friends, so they won’t use it.

Anyway, our update rate would saturate mksp.pl if we ran anything fancier on it, so it was a constant balancing act, in which looking at the code was something nobody wanted to do ;-) Still, it was (and is) helping us, so getting rid of it wasn’t possible either.

At some point we decided to run an experiment – what if we executed the statements, then rolled them back? So I did a quick implementation of that method from scratch in Python – the resulting piece of code was relatively small and fun to experiment with.

There were multiple problems with such an approach – one complication was that the queries were grabbing locks for the duration of the statement, and some of those locks would collide with what the actual replication thread was doing. Fixing that would require an immediate lock wait timeout or a transaction kill for the prefetcher thread – so, a relatively deep dive into InnoDB. Another problem was internal InnoDB lock contention on rollbacks – that was an expensive operation, and the benefit of pages being read in was negated by rollback segment lock contention. Fixing that is even more extensive InnoDB work (though probably some people would like their rollbacks to be efficient ;-)

At that moment we came up with the idea that the InnoDB codebase could be instrumented to not do any real work on updates – just page the data in and return to the caller, and if any change accidentally slips in, the commit can fail. That looked like a feasible project for the future.

At some point we were rolling out a new database tier for one product, which was supposed to have a really high volume of changes, all coming in a uniform format. It took less than an hour (as most of the work had already been done for the rollback-based prefetcher) to come up with a prototype that would efficiently extract literals from the uniform statements, then use them for prefetching.
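
As an illustration of the idea only – the table, statement shape and regex below are made up, this is not our actual prototype:

import re

# hypothetical uniform statement shape seen in the relay log
UPDATE_RE = re.compile(
    r"UPDATE user_counters SET hits = hits \+ (\d+) WHERE user_id = (\d+)")

def prefetch_select(statement):
    """Turn a pending write into a cheap SELECT that pages the row and its
    index entries into the buffer pool before the replication thread runs it."""
    m = UPDATE_RE.match(statement)
    if not m:
        return None
    return "SELECT hits FROM user_counters WHERE user_id = %s" % m.group(2)

# a pool of prefetcher connections would run these slightly ahead of the SQL thread
print(prefetch_select(
    "UPDATE user_counters SET hits = hits + 1 WHERE user_id = 12345"))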

This method worked fine – at a tiny fraction of the resources used by mk-slave-prefetch we were preloading secondary indexes and could run with relatively extensive parallelism.

Meanwhile, our main database cluster was getting a more and more uniform query workload, thanks to various libraries, abstractions and middleware – so a day of work on the lowest-hanging fruit provided relatively good coverage of the workload.

We didn’t stop mksp.pl – it still provided some coverage for various odd cases, which were time-consuming to work on manually.

There were a few other problems with the new method – apparently we were targeting our SELECTs too accurately, so UPDATEs were spending plenty of time in records_in_range. Additionally, the optimistic update path was reading in pages that the SELECTs wouldn’t (due to an inefficiency in the B-tree locking code). There were also some odd reads done for INSERTs.

Also, SELECTs use indexes less efficiently – InnoDB can pinpoint entries in secondary indexes by using PK values, yet that ability is not exposed to the SQL layer, so prefetching on indexes that don’t explicitly contain all the needed fields is not that easy.

In theory, all these issues are supposed to be ‘fixed’ by the fake changes concept. Percona recently implemented it in their releases, and we started experimenting with those changes. It is still not a mature concept, so we will be revisiting how things are or should be done, but for now test results are quite positive (we made some changes to reduce locking and avoid a deadlock in REPLACE INTO, among other things).

I still observe I/Os done by the main replication thread, so we’re not in perfect shape yet, but the method seems to be working relatively well (at the very least, it definitely speeds up replication). We still have to do lots of testing to qualify this for large-scale production, but it may allow way more write workload on our machines until we get parallel replication all around.

Our code for the custom-query, fake-changes or rollback prefetcher can be checked out from a public repo together with other tools (oops, Bazaar doesn’t give easy access to subdirectories):

bzr co lp:mysqlatfacebook/tools; cd prefetch

Or browse it online.

P.S. There’s also Tungsten Replicator for those who don’t want to wait for 5.6 parallel replication.



Blowing up in memory

MySQL isn’t too concerned about table handler memory usage – it will allocate a row-sized buffer three times per table invocation. There’s a few-year-old bug discussing UNION memory usage – each mention in a union can allocate nearly 200k of unaccounted memory – so a megabyte-sized query can already consume 7GB of RAM.
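
A rough illustration of that arithmetic (the per-branch overhead is the ~200k figure above; the query itself is nonsense, just bulk):

branches = ["SELECT 1"] * 40000
query = " UNION ALL ".join(branches)        # roughly 0.75MB of SQL text
per_branch_overhead = 200 * 1024            # ~200k unaccounted per union mention
print(len(branches) * per_branch_overhead / 1e9)   # ~8GB, same order as above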

Partitioning adds even more pain here though – it will allocate those three buffers per partition, so a memory profile of opening a table with 1000 partitions shows 191MB allocated just to execute a simple single-row fetching query (I filed a bug on this).

There’re multiple real-life situations where this is painful (e.g. any kind of server stall may lead to multiple concurrent threads reading from the same table, consuming additional gigabytes or tens of gigabytes of memory). It gets even more painful when combined with the UNION bug – with the per-partition buffers multiplying the per-union cost, a megabyte query on an empty table can now consume 7TB of memory, and I doubt anyone has that much on their MySQL servers :-)

P.S. Also, check out how much memory can be wasted for malloc overhead, once discussed here.
P.P.S. And here you can see why innodb_max_dirty_pages_pct=0 doesn’t do what you’d expect.


On connections

MySQL is needlessly slow at accepting new connections. People usually work around that by having various sorts of connection pools, but there’s always a scale at which connection pools are not feasible. Sometimes connection avalanches come unexpectedly, and even if MySQL has no trouble dealing with the queries themselves, it will have problems letting clients in. Something has to be done about it.

Lots of these problems have been low-hanging fruit for years – they ‘were not detected’ by benchmarks because everyone who benchmarks MySQL knows that persistent connections are much faster, and therefore doesn’t look at connection speeds anymore.

Usually people attribute most of the slowness to the LOCK_thread_count mutex – they are only partially right. This mutex does not just guard the counter of active running connections; pretty much every operation that deals with an increase or decrease of threads (thread cache, active thread lists, etc.) has to hold it for a while.

Also, it is common wisdom to use the thread cache, but what people quite often miss is that the thread cache was created back when OS threads were extremely expensive to create, and all it does is cache pthreads. It does not do any MySQL-specific thread caching magic – everything gets completely reinitialized for each incoming connection.

I decided to attack this problem based on a very simple hypothesis – whatever the ‘accept thread’ is doing is a bottleneck for the whole process. It is very simple to analyze everything from this perspective (and I had some success looking at replication threads the same way).

All we need is gdb and two loops – gdb attaches to the accept thread, one loop does ‘breakpoint; continue’, another sends signals at a certain sampling rate (I picked 10Hz in order to avoid profiling bias). I posted those scripts on the PMP page. After a lunch break I had 50k stacks (long lunch ;-) that I fed into graphviz for a full data visualisation, and could also look at them individually.

A picture is worth a thousand words (well, it is easier than looking at thousands of lines of stack aggregations), and I immediately noticed a few things worth looking at:

  • Initializing the THD (MySQL thread) structure is a CPU-heavy task that resides in the choke-point thread
  • There is way too much time spent in syscalls, whatever they do
  • Too much memory allocation is done by the master thread
  • There is mutex contention in the thread cache when waking up worker threads
  • There is needless mutex contention in a few other places

I didn’t want to look at the mutex contention issues first, so I ended up with something as simple as looking at syscall costs:

  • 15% was going into actual accept()
  • 8.5% was going into poll()
  • 8% went into fcntl()
  • 7% went into setsockopt()
  • 1.2% went into getsockname()

An strace on mysqld gives a picture that explains quite a bit:

poll([{fd=12, ...}, {fd=13, ...}], 2, -1) = 1
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(12, {... sin_port=htons(59183), ...) = 32
fcntl(12, F_SETFL, O_RDWR)
getsockname(32, {... sin_port=htons(3306), ...) = 0
fcntl(32, F_SETFL, O_RDONLY)
fcntl(32, F_GETFL) = 0x2 (flags O_RDWR)
setsockopt(32, SOL_SOCKET, SO_RCVTIMEO, ...)
setsockopt(32, SOL_SOCKET, SO_SNDTIMEO, ...)
fcntl(32, F_SETFL, O_RDWR|O_NONBLOCK)
setsockopt(32, SOL_IP, IP_TOS, [8], 4)
setsockopt(32, SOL_TCP, TCP_NODELAY, [1], 4)

I’ll skip walking through the code, but essentially what it does here is this (12 is the accept socket, 32 is the connection socket):

  • poll() checks whether there are pending connections. If the server is busy, trying accept() first and polling on failure is a better approach. There are side effects to that idea though – other sockets may starve a bit, but that is solvable by injecting an occasional poll.
  • What happens next is a bit sad. Instead of storing per-socket flags (nobody is touching them for now anyway), it gets the socket flags, figures out it is a blocking socket, sets it to non-blocking mode, accepts the connection, and sets it back to blocking mode. Just setting it to non-blocking once at the start and using it that way forever is much cheaper and constipates way less (see the sketch after this list).
  • accept() itself can be scaled only by having parallel accept() threads. Maybe most of this post would not be necessary if there were multiple accept threads, but I’m not eager to go into that kind of refactoring for now.
  • getsockname() is used just to verify that the socket is valid (probably catching EINVAL later seemed too complicated); it is a very pessimistic code path for a case that nearly never happens (it was probably added for some random Unix back in the nineties).
  • The next fcntl “get flags” call is quite unnecessary – this is a fresh socket and one shouldn’t expect anything special in it. Non-blocking mode is set later anyway, which overrides whatever was obtained here.
  • Three out of the four setsockopt()s are a necessary evil (one turns off Nagle’s algorithm, the other two set socket timeouts), so they have to be done before network I/O happens on the socket. The fourth setsockopt() is usually completely useless – not every network observes the IP_TOS header, and one has to talk to the network administrator first about decent values. I’d say it could be an optional parameter (yay, more tuning options).
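
A minimal sketch of that cheaper pattern, with the Python socket module standing in for the C syscalls – this is not MySQL code, and the address and backlog are arbitrary:

import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 3306))
listener.listen(128)
listener.setblocking(False)        # set non-blocking once, never touch the flags again

while True:
    try:
        conn, addr = listener.accept()     # optimistic accept(), no poll() first
    except BlockingIOError:
        select.select([listener], [], [])  # nothing pending - only now sleep in poll()
        continue
    # per-connection setup (timeouts, TCP_NODELAY, ...) can move to a worker thread
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)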

Pretty much every connection socket operation can be done later, in a worker thread, without consuming expensive accept thread time, and pretty much every syscall except accept() itself can be removed from a busy accept thread – which is what I did in my testing build.

Once I got rid of the syscalls I started looking at other low-hanging fruit. The most obvious was the sprintf() called inside vio_new(). Though it accounted for only 4% of thread time, its uselessness was depressing. Here it is:

sprintf(vio->desc,
   (vio->type == VIO_TYPE_SOCKET ? "socket (%d)" : "TCP/IP (%d)"),
   vio->sd);

It formats a string that is not used at all in production builds (only a few DBUG messages call vio_description()). Though I removed this code in the non-debug build, as I was moving network initialization over to worker threads the whole my_net_init() and vio() initialization ended up outside the accept thread anyway ;-)

The overall thread cache design is centered around LOCK_thread_count – the lock is held while signaling threads, and the threads that wake up need the lock too – so there is lots of overhead involved in the coordination: 13% of time is spent just passing the task to a worker thread.

Allowing multiple threads to wake up and multiple entries to be placed into the thread cache before it is all drained (more of an InnoDB concurrency-queue with a FLIFO approach) could be somewhat better – so would worker threads accepting connections directly (I already said that, I guess). There is simply too much time wasted waking threads up and sending them to sleep, and quite a bit of that time is on a choke point.

THD initializations are somewhat simpler, as they don’t include SMP madness.

There is some low-hanging fruit there as well, of course. For example, the THD initializer calls sql_rnd_with_mutex(), which locks the thread count mutex. The simplest fix could be using another mutex, though a lockless random function or on-demand variable initialization would help too.

Some initializers there are quite expensive too – e.g. the Warning_info class could initialize its dynamic storage only when actually used, not at the THD initialization chokepoint. THD::init can be moved to a worker thread, and lots of other THD initialization could be moved over with it.

Quite a lot of time (12%) is spent in malloc() – and much of that is for allocating various fixed-size structures – a slab allocator (or just a more efficient malloc implementation) could cut the CPU time there. Of course, the more drastic alternative is not dealing with the THD at all during the accept phase – one can pass a stub structure to build upon later, or (oh, am I writing this again) move the accept() part to individual workers.

So far I have tested just a few optimizations – moving vio/net initialization over to worker threads, reducing the number of syscalls, adding a new mutex for rand initialization – and that alone got me an additional 50% increase in connection accepts. Think how much more one could get from fixing this problem properly ;-)

TL;DR: MySQL sucks at accepting new connections, but there is lots of low-hanging fruit there. Ask your MySQL provider for a fix.

MySQL bug entries:


InnoDB subsystems in color

I tried to put every subdirectory of the InnoDB codebase into a chart that explains some of the relations between subsystems and modules inside the source. This is what I got.

Update: Check Vadim’s diagram for a more operational view of InnoDB
Another update: There’s a vector PDF version
