PHP is an excellent language for developing on the web. It’s portable, versatile, and has matured a lot over the past few years. Web applications are becoming more complex each day: they process massive amounts of data, talk to other services, and are expected to do all of this quickly. Eventually they reach a point where the processing is either too server intensive, or it’s doing so much per request that the site becomes less responsive to users. If you take a step back and think about PHP applications asynchronously, you can build faster, more scalable applications.
Web Application Flow
A typical web application’s job is to turn a request into a response. For the application to feel responsive, it should generate that response as quickly as possible – I usually find anything under half a second acceptable. If we build the application asynchronously, we delegate any complex logic, service calls, or other expensive tasks to separate processes. These tasks can usually be broken down into two categories…
Preparing Data For This Page
If a page requires us to perform complex queries, pull data from web services, etc., then that will likely add a considerable delay to page generation. The usual response is caching, but people tend to still use it in a synchronous fashion. Let’s take a Twitter feed as an example (showing the latest 10 tweets).
You’ll see code like this quite often:
public function getTweets($username) {
    // Return the cached copy if we have one.
    if ($tweets = $this->_cache->get("tweets_$username")) {
        return $tweets;
    }

    // Cache miss: fetch from Twitter while the user waits,
    // then cache the result with the default expiration.
    $tweets = $this->_twitter->get($username);
    $this->_cache->set("tweets_$username", $tweets);

    return $tweets;
}
It checks whether our cache entry exists and returns it if it does. If not, it calls the routine that fetches the tweets from Twitter and caches the result (with our default cache expiration). Let’s say that expiration is 30 minutes. The problem with this approach is that every 30 minutes the cache expires and the information has to be downloaded again, making the user wait while the page loads. The other major problem is that whenever Twitter is down (or we are unable to connect for whatever reason), the page breaks, because the cache has expired.
A much better approach is to never attempt this work in the main application flow at all. Instead, let’s have a cron job run every 30 minutes that downloads the information and caches it with no expiration. This way, even if Twitter is down, the cached copy still works. You could keep the above code as a safeguard, but with this approach it should never need to be used as a fallback. You’d only ever want to update the cache when you successfully retrieve the information.
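Here’s a minimal sketch of what that cron job’s script might look like. The $cache and $twitter objects are hypothetical stand-ins for the ones used above, and I’m assuming the cache’s set() method takes an expiration argument where 0 means never expire:
// warm_tweets.php – run from cron every 30 minutes: */30 * * * *
$usernames = array('ourcompany'); // accounts displayed on the site

foreach ($usernames as $username) {
    try {
        $tweets = $twitter->get($username);
    } catch (Exception $e) {
        continue; // Twitter is unreachable; keep serving the old cache entry
    }

    // Only overwrite the cache on success; 0 = no expiration
    $cache->set("tweets_$username", $tweets, 0);
}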
This model works with almost any data you can cache. You’ll need to decide whether you’d rather perform these tasks only on demand (when a user visits a page), or do a little extra work to ensure the data is always available. Depending on the traffic of the site and the number of unique cacheable items you’re working with, you’ll have to weigh the benefits.
Processing Needed Later
The other category usually comes from requests that actually need some sort of action performed. One concept we’re going to look at which really helps scale your applications is a work queue, or job queue. With a queue in place, we can defer the execution of actions and quickly complete our request to response cycle. Deferring something doesn’t mean our data has to fall out of date: a task delayed by one second, or even zero seconds, runs almost immediately, just in another process.
For this to work, we need a few different processes in place:
- Job Queue
  - Storage for tasks to be executed. Example: a print spooler.
- Job Worker
  - A process that reads tasks from the queue and executes them. Workers can be spread across multiple physical machines, if necessary.
- Client
  - Our application, which adds items to the job queue. As with workers, multiple applications can share the same queue.
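To make those roles concrete, here’s a hypothetical minimal interface (the names are mine, not from any real library) that the examples below will code against:
// What a client and a worker need from a job queue.
interface JobQueue {
    // Client side: add a task, optionally delayed by $delay seconds.
    public function put($task, array $payload, $delay = 0);

    // Worker side: fetch the next task that is ready to run, or null.
    public function reserve();

    // Worker side: remove a task once it has completed successfully.
    public function delete($job);
}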
Let’s look at some examples…
Sending Mail
Using PHP to send mail is a relatively simple task. If your server can send mail directly, then it is probably very quick, too. A lot of servers use an internal mail queue to ensure mail goes out at a steady pace and doesn’t overload the server. However, if you are sending mail through a remote server (for example, through Gmail’s SMTP servers), then it will take a lot longer, especially if you are sending lots of mail.
Instead of making the user wait those extra seconds, we can toss the mail into a queue (which should be very quick) and not worry about it. One of our mail queue workers can then process it as soon as it becomes available. Depending on your setup, this could be nearly instant, or it could be noticeably delayed. It’s up to you to decide how important the timing of your mail delivery is.
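With something like the hypothetical JobQueue interface above, the request handler shrinks to a single, fast enqueue, and a worker handles the slow delivery later:
// In the web request (the client): queue it and move on.
$queue->put('send_mail', array(
    'to'      => $user->email,
    'subject' => 'Welcome!',
    'body'    => $body,
));

// In the worker process: do the slow part.
while ($job = $queue->reserve()) {
    $payload = $job->payload;
    // mail() for brevity; a slow remote SMTP call is where the real savings are
    mail($payload['to'], $payload['subject'], $payload['body']);
    $queue->delete($job);
}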
Rebuilding Stats, Caches, etc.
Another common scenario is performing clean-up processing after making changes to your data. Delaying execution until another process can pick it up is usually harmless to the user, and gets them to their next task that much faster.
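The pattern is the same as with mail: save the change, queue the expensive recalculation, and return immediately. A sketch, again using the hypothetical queue from above:
// In the web request: save the change, defer the heavy lifting.
$post->save();
$queue->put('rebuild_stats', array('post_id' => $post->id));
// ...the response goes out without waiting for the stats to rebuild.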
Different Types of Queues
There are two very common approaches you can take.
Running as a Daemon
If you need to minimize the time between queuing and execution (or, more importantly, between scheduled execution and actual execution), you should look at something like Gearman or beanstalkd. These applications run as daemons and can process queued tasks the moment they become ready. However, since they require extra software to be running, they may not be available or possible in some hosting environments, and they take a little more work to set up. I’d also argue that this route is faster than the alternatives, since the daemon doesn’t have to continually poll for new data.
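As a concrete sketch, here’s roughly what the mail example looks like using the PECL gearman extension. This assumes a Gearman server is running on localhost; the 'send_mail' function name is our own:
// Client (inside the web request)
$client = new GearmanClient();
$client->addServer(); // defaults to 127.0.0.1:4730
$client->doBackground('send_mail', json_encode(array(
    'to'      => 'user@example.com',
    'subject' => 'Welcome!',
    'body'    => 'Hello!',
)));

// Worker (a long-running CLI process)
function send_mail_worker(GearmanJob $job) {
    $payload = json_decode($job->workload(), true);
    mail($payload['to'], $payload['subject'], $payload['body']);
}

$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('send_mail', 'send_mail_worker');
while ($worker->work());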
Running Scheduled Tasks
If you are limited in what you can run, a simple option is to set up a scheduled task for your worker. It can be a simple PHP script that connects to your queue (which could be stored in the database) and processes items as they become available. If you want things to run as quickly as possible, you can make your cron job run more frequently; however, this may also burden your system, depending on how frequently you poll and how intensive your scripts are.
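Here’s a rough sketch of such a worker against a database-backed queue using PDO. The jobs table and its columns are hypothetical, and with a single cron-driven worker we can get away without locking (multiple concurrent workers would need it to avoid double-processing):
// worker.php – run from cron, e.g. every minute
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Grab pending jobs that are due to run
$jobs = $db->query(
    "SELECT id, task, payload FROM jobs
     WHERE run_at <= NOW() AND completed_at IS NULL
     ORDER BY run_at LIMIT 50"
);

foreach ($jobs as $job) {
    // Dispatch on the task name; real code would map task names to handlers
    if ($job['task'] == 'send_mail') {
        $payload = json_decode($job['payload'], true);
        mail($payload['to'], $payload['subject'], $payload['body']);
    }

    // Mark the job as done so it isn't picked up again
    $done = $db->prepare("UPDATE jobs SET completed_at = NOW() WHERE id = ?");
    $done->execute(array($job['id']));
}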
I’m going to follow this post up with some examples implementing this with a database-backed queue running as a scheduled task, and also with either Gearman or beanstalkd.