Mixx’s Engine Room - Stoking the fires at Mixx

One of the great strengths of Ruby on Rails is that it abstracts away database access so that you don’t have to worry about SQL when writing your application. Unfortunately, this hides database accesses from developers, which can lead to serious performance problems. The solution to this problem is no further away then the Rails log.

Consider this simple Mixx-ish application, which displays a list of stories:

The controller:
@stories = Story.find(:all, :conditions => "some condition")

The view:
<ul> <% @stories.each do |s| %> <li>"<%= s.title %>" by <%= s.submitter.display_name %> with <%= s.comments.count %> comments </li> <% end %> </ul>
(I should note that Jason and Doug create much better markup than this. This is what you get when a backend guy like me writes code for the purpose of demonstrating something - we wouldn’t use anything this ugly in Mixx production code.)

Seems pretty reasonable, right? But as the number of stories on the list grows, performance quickly goes bad. Let’s find out why.

First, sample output from a run of this code:

“Obama wins!” by joe with 3 comments
“Storms rage everywhere” by julie with 0 comments
“Mixx launches” by chris with 3 comments
“Eggplant farming” by chris with 1 comments

Next, take a look at the Rails log when this list is generated. This can be found in the application directory in logs/development.log:
Story Load (0.000911) SELECT stories.* FROM stories Rendering stories/show User Load (0.000601) SELECT * FROM `users` WHERE (`users`.`id` = 176) Comment Load (0.000391) SELECT count(DISTINCT `comments`.id) AS count_all FROM `comments` WHERE (`comments`.thingy_id = 545) User Load (0.000441) SELECT * FROM `users` WHERE (`users`.`id` = 188) Comment Load (0.000364) SELECT count(DISTINCT `comments`.id) AS count_all FROM `comments` WHERE (`comments`.thingy_id = 99) User Load (0.000424) SELECT * FROM `users` WHERE (`users`.id' = 6) Comment Load (0.000358) SELECT count(DISTINCT `comments`.id) AS count_all FROM `comments` WHERE (`comments`.thingy_id = 85) User Load (0.000408) SELECT * FROM `users` WHERE (`users`.`id` = 6) Comment Load (0.000269) SELECT count(DISTINCT `comments`.id) AS count_all FROM `comments` WHERE (`comments`.thingy_id = 18)
The first line is reasonable - that’s just getting the list of stories. But look what happens when rendering the view: for each story displayed, there are two queries sent to the database. It’s no wonder that generating a long list of stories takes a long time when you are issuing two queries for each story.

A quick look at the code tells us what is happening: the expression “s.submitter.display_name” gets the submitter of the story (which is in the Story model as a belongs_to attribute) and extracts the display_name from it. But that requires a database retrieval (the query against the users table). And the expression “s.comments.count” requires the query against the comments table.

Happily, there’s a number of techniques that can be used to eliminate these extra queries. These include:

1. Use of the “include:” directive when fetching the data. For example, change the above controller code as follows:
@stories = Story.find(:all, :conditions => "some condition", :include => [:submitter])
and all of the users table queries above are replaced with a single query against the users table.

(Note: once upon a time, Rails could be inefficient in its implementation of the :include directive. Happily, this is fixed in Rails 2.0, so there is no longer any reason to avoid :include.)

2. Store frequently used data in the parent table. Admittedly, this is a violation of database normalization principles. Under these principles, we never want to duplicate data in the database. But in some cases, the performance gains that we get from denormalization are worth the potential problems with having data get out of sync. Especially when the data in question is not crucial.

In this case, we added a comment_count field to the stories table. We use after_create and before_destroy actions on the Comment model to keep the comment_count field updated - when we create a new comment, we increment the comment_count value for its parent Story. (We made a conscious decision that, while we could be clever and manage to keep comment_count accurate, we can live if it is incorrect for some stories.) After that, we can use the comment_count field in Story instead of comments.count, thus avoiding the query required to get the count of comments for a story. The resulting view is as follows:
<ul> <% @stories.each do |s| %> <li><%= s.title %> by <%= s.submitter.display_name %> with <%= s.comment_count %> comments </li> <% end %> </ul>
In conjunction with the controller change noted in #1, the query log now looks like this:
Story Load (0.006564) SELECT * FROM `stories` WHERE ( some condition ) User Load (0.000943) SELECT * FROM `users` WHERE (`users`.id IN ('127','193','6','249','216','239','91','240','196','37','93','176', '188','244','235','136')) Rendering stories/show
What used to take nine queries (and could easily grow to more as the number of stories increased) now takes two.

3. In some cases, you can find the data you need in the model’s attributes. If you can thus avoid retrieving via a belongs_to or has_X relationship, then do so

There isn’t an example of this in the sample code. But suppose that we did not want to display the actual submitter of a story in the above example, but we want to display the story differently if it is submitted by the person viewing the page. (In our case, the User model of the viewer of the page is always contained in the @user variable.) We could use any of the following methods to make this test:
if @user == story.submitter # bad! Requires database retrieval of submitter if @user.id == story.submitter.id # bad! Also requires retrieval. if @user.id == story.submitter_id # good! No retrieval required.
Even if you use the :include directive to get the submitter, the first two methods require at least one database retrieval for the page.

Another common case involves a test for the existence of data related to a model via a belongs_to. For example, suppose we want to test whether a story has a submitter at all. Here are two ways of doing it:
if story.submitter # bad - retrieves submitter if present if story.submitter_id # good - no retrieval required
In summary, even though Rails allows you to abstract away database accesses, if you want your application to perform, you need to be aware of how it is using the database. An excellent method to be aware is to review the Rails log while you are developing your application, paying particular attention to repeating instances of the same query (as in the example above, where the queries against the users and comments tables repeat). In fact, you should frequently review the Rails log whenever you are building an application that has to perform well.

Archive for the "Uncategorized" Category

Performance and the Rails log

Browse the Engine Room

Archives

Get to know us