Archive for the ‘Design’ Category

Database and Performance In Web Applications

Monday, May 17th, 2010

This is sort of a follow up to my last post about caching.  One thing I discovered after chatting with @dkubb was that while caching was a fantastic help to make the site faster, but it was really just covering up the actual issue.

As he’s a bit of a database guy, he looked at some of the things I was doing and showed me some good performance testing tools (more on that in a second) and discovered that there was a lot of slowdown from things like repetitive DB calls and repetitive DB calls on unindexed data.  Anyway, it was fairly eye opening to someone who hasn’t done a huge amount of work with high-traffic sites, and has been blindly believing that since it seems to “just work” it should “just work” no matter what the situation is.  I figure I’d list out a few techniques and commands you can use to deal with your own scaling problems before they become problems.

As always, I’d suggest you actually ignore the rest of this post as it’s going to be a lot of naval gazing, and instead go and watch the Scaling Rails podcast series :)

How Do I Tell If There’s a Problem

First thing I did was enable the slow query log in MySQL, assuming that’s your database of choice.  If you have a modern version, set the threshold very low, I have mine set for 0.05.  This means that any query that takes more than 50 milliseconds will be logged.  If you have an older MySQL you may need to patch it to allow microsecond support older versions only support down t0 1 second.  Depending on how bad things are, even a resolution of 1 second or slower can be a huge help.

When MySQL is restarted, the logs may look something like this (possibly, as it was in my case, lots of messages like these):

# User@Host: db[db] @ localhost []
# Query_time: 0  Lock_time: 0  Rows_sent: 1  Rows_examined: 37
SELECT id
FROM   page_content
WHERE  ( ( content_order = '1' AND page_id = '1' ) ) ORDER BY content_order;

This shows you the database (helpful if you have more than one), how long it took (I assume to a resolution of seconds), how many rows the query looked at (more on this in a second) and the actual db that was run.

Ok, not bad, now we have two things, first is a metric to run by.  If you don’t know how to determine how bad things are, it’s really hard to figure out if you’ve fixed it.  In this case I used the slow query log as the metric, if it was empty, the issue is “fixed”, if there are entries going into there, that means that the queries are slow and there is something to go after, either through design or database indexes.

Using “Explain” and Database Indexes

The first thing I saw in my case was that each time I loaded an item, to get it’s image it had to make one of these queries.  Kinda sucky.  Now realize that the gallery is showing 48 images on each page.  So 48 items each looking through 5600+ rows to get their thumbnails.  I’ll let you do the math, but it equalsreally sucky however you do it.  So how to determine exactly where the error is and how to fix it, or at least make it better.

So the above error in the slow log tells us the most important thing, the query being run.  So what I did was take that and run it in the MySQL command line.  Unsurprisingly, it displayed the one row that was expected.

mysql> SELECT id FROM
   page_content WHERE  ((content_order = '1' AND page_id = '1'))
   ORDER BY content_order;
+----+
| id |
+----+
|  5 |
+----+
1 row IN SET (0.00 sec)

Next step, why did it take so long.  Running the MySQL built in command ‘explain’ before your query will give you some interesting information.

mysql> EXPLAIN SELECT id FROM
    page_content WHERE ((content_order = '1' AND page_id = '1'))
    ORDER BY content_order;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | TABLE        | type | possible_keys | KEY  | key_len | ref  | rows | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | page_content | ALL  | NULL          | NULL |    NULL | NULL |   37 | USING WHERE |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
1 row IN SET (0.00 sec)

What does this mean? It means that to find the content entry for this page, the database had to go through 37 rows of data, and there were no possible keys (database indexes) for it to try to look through. Another thing I’m told to watch out for is “filesort” in the “extra” column, which means the system is doing extra work that will slow it down.

How To Fix This

Let me prefix this next bit of discussion  by saying I’m not a DBA, and know there’s way more to things like creating database indexes than what I’m presenting here, and that depending on what you do, you could make performance worse if you add too many indexes.

Now 37 isn’t all that bad, but it’s a scaling issue, so lets say the client puts more and more content on it, with no indexes on the database you get what’s called an On problem.  That is, the more database rows you have, the more it has to search through.  In fact, it’s pretty much the definition of “doesn’t scale” :)

A database index basically is a look-up table for queries.  It lets the database say “give me the data for conditions XYZ” and the database simply hands back the correct data.  Without indexes the database has to go through each row in the table to find the data matching the given conditions.   In this case I had of course an index on the primary key of this content table, but I had forgotten that it would be accessed primarily through the main “content_id” field, which does not have an index on it.  Oops, lets rectify this now:

mysql> CREATE INDEX page_content_page_id_index
    ON page_content(page_id);
Query OK, 37 rows affected (0.05 sec)
Records: 37  Duplicates: 0  Warnings: 0

And run the “explain” command again:

mysql> EXPLAIN SELECT id FROM
    page_content WHERE ((content_order = '1' AND page_id = '1'))
    ORDER BY content_order;
+----+-------------+--------------+------+----------------------------+----------------------------+---------+-------+------+-------------+
| id | select_type | TABLE        | type | possible_keys              | KEY                        | key_len | ref   | rows | Extra       |
+----+-------------+--------------+------+----------------------------+----------------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | page_content | ref  | page_content_page_id_index | page_content_page_id_index |       4 | const |    2 | USING WHERE |
+----+-------------+--------------+------+----------------------------+----------------------------+---------+-------+------+-------------+
1 row IN SET (0.00 sec)

So the big thing to look at here is the number of rows examined, 2.  Pretty damn good, but why were there 2?  This is because there are two rows with page_id = 1 and only 1 with content_order = 1.  A further optimization we can do is to add this index:

mysql> CREATE INDEX page_content_page_id_content_order_index
    ON page_content(page_id,content_order);

This creates an index on the page_id and content_order fields. Now when we do our query (slightly edited for width) we get:

mysql> EXPLAIN SELECT id FROM
    page_content WHERE ((content_order = '1' AND page_id = '1'))
    ORDER BY content_order;
+----+-------------+--------------+------+--------------------------------+------------------------------------------+---------+-------------+------+-------------+
| id | select_type | TABLE        | type | possible_keys                  | KEY                                      | key_len | ref         | rows | Extra       |
+----+-------------+--------------+------+--------------------------------+------------------------------------------+---------+-------------+------+-------------+
|  1 | SIMPLE      | page_content | ref  | page_content_page_id_index,... | page_content_page_id_content_order_index |       8 | const,const |    1 | USING WHERE |
+----+-------------+--------------+------+--------------------------------+------------------------------------------+---------+-------------+------+-------------+

Now the number of rows examined to find the results of the query is 1, down from 2, so a win right? Well, yes and no. Down this path could lie madness. You don’t want to create indexes on every single combination because as you increase the number of indexes you up the time that the database takes to find the results. Or something like that, I’m honestly a bit fuzzy on the whole thing, but I do know that you don’t want to add a bazillion database indexes if you can avoid it.

Conclusion

As I said above, I’m by no means a database expert (as you can clearly see from the original blunders I made), but the point of this was more providing you with some tips and pointers on how to do some of the troubleshooting and get yourself onto the path of better performance.  To sum it all up:

  • Keep an eye on your logs to keep an eye out for large numbers of database queries.
  • Enable slow query logging and watch for queries that are going over a large number of rows.  Anything that’s taking more than a second (on the high end) is a big red flag.
  • Learn to love the ‘explain’ command in MySQL and add indexes appropriately.
  • Don’t add too many indexes :)

Hope that helps you!  More tips and pointers as always, appreciated.

Evolving A Simple Twitter to Blog Ruby Program Part 1

Tuesday, May 4th, 2010

In a “quick” and dirty exercise I built a little ruby program to grab my twitter posts and colate them into a list for posting a “tweets of the week” type blog post.  It was a lot more about figuring out how to do it than the actual output (I’d hope you just follow me on twitter than rely on me posting my tweets here).  Tonight at the FV.rb meeting @dkubb helped me a lot in pointing out some glaring non-rubyisms and I thought that going through some of the changes might help others moving from “old school” programming (structured, functional, perl-y) to “new hotness” programming (object oriented, yeilds, and awesomeness).

I started out with this code in a gist.  Nothing hugely bad, it pulls in either a URL or a file, parses the XML for the status updates, and for each one does some HTML replacement (@user, #hash, and URLs get auto-linked) and then spits out some HTML that can be copied and pasted into a blog entry.

Starting Off

The first step was figuring out how to parse the XML.   A bit of googling found some possibilities.  Hpricot, libxml-ruby, and Nokogiri.  The first post I saw noted that libxml-ruby was the fastest, which makes sense as it’ll be pretty close to the bare metal C libraries, so I took a run with that.  Not great success, the biggest challenge was figuring out how Ruby dealt with XML structure.  There was a lot of mucking around in IRB.

ruby-1.8.7-p249 > require 'xml'
 => true
ruby-1.8.7-p249 > parser = XML::Parser.file('twitter.xml')
 => #<LibXML::XML::Parser:0x101170140 @context=#<LibXML::XML::Parser::Context:0x101170168>
ruby-1.8.7-p249 > doc = parser.parse
# snip xml spew to STDOUT
ruby-1.8.7-p249 > doc.class
 => LibXML::XML::Document
# Hmm.... does find work?
ruby-1.8.7-p249 > s = doc.find('/status')
 => #<LibXML::XML::XPath::Object:0x1018ff398>
ruby-1.8.7-p249 > s.methods
# snip list of methods, and searching for what to do
ruby-1.8.7-p249 > s.each { |node| puts node.class }
 => nil
ruby-1.8.7-p249 > s.each { |node| puts node.inspect }
 => nil
# WTF? OK, so what now then?

It was a bit frustrating, though probably mostly because I just didn’t grok how the XML was being represented internally, and thinking of it more like a Perl hash-of-hashes than whatever libxml-ruby was using.  So I moved on.  (Ironically while re-doing some of this for this article I went back and was running the commands figuring now I would get it, and failed miserably :)

Next I looked at Hpricot, but the syntax in the readme and examples scared me away.

Starting Progress on the First Iteration

Someone at work suggested that Nokogiri was the way to go, and realizing that parsing a few kb of XML probably wasn’t going to run me into any performance issues, I took a run at it with this.  I soon found that having a static XML file would be the easiest for testing, so I saved twitter.xml in the same directory as I was running IRB out of and played some more.

Much better.  Then to find out to get a list of the statuses:

ruby-1.8.7-p249 > require 'nokogiri'
 => true
ruby-1.8.7-p249 > doc = Nokogiri::XML(File.new('twitter.xml'))
# snip
ruby-1.8.7-p249 > doc.class
 => Nokogiri::XML::Document
ruby-1.8.7-p249 > doc.xpath('/status')
 => []
ruby-1.8.7-p249 > doc.xpath('//status')
# snip lots more xml spew and more testing until...
ruby-1.8.7-p249 > doc.xpath('//status').each { |node| puts node.xpath(".//text").first.content }
# snip lovely output of each of the tweets in the xml file

Ok, so now I could run “.each()” on this, having discovered that the xpath() function basically allowed me to get a list of XML nodes with that path, and then I could get a list of node data from that, remembering to use the “start from current node” syntax (using the ‘.’ to represent the current location in the tree).

The next steps were (relatively) easy.  Looping through each status, get some information (time, status ID, content, etc), format that into HTML, find and implement a couple of “convert @user to an HTML link” bits of code I found online, and voila, first iteration was completed and working.

Now With Some Expert Advice

So after reading the What I wish I had been told a year ago post, I figured the next stage was to convert it to a class, make it more ruby-y, and give it some tests.  Dan Kubb of DataMapper fame thankfully answered my question to help and moved me on to this current version with some helpful advice.

I’ll continue this later on this week with Part 2, in which I iterate into more awesomeness!

HTML In Browser Gaming

Friday, April 30th, 2010

My buddy @milesforrest of CoderPath fame pointed me to his new in browser tile matching game, wanted to pimp it out a bit.  The idea came from this post on hacker news, which has a site with a full tutorial if you want to make your own.  Also @knowtheory on IRC suggested people check out HTMLDemos for more cool stuff you can do (source here on github).