HELP! Server overloading
HELP! My server is having load problems!
First of all, this guide is not a comprehensive list of everything you need to do to optimize a server or to figure out what is overloading it. The guides in my HELP! series are not meant to replace a professional, only to give you a general idea of what you can do. If you finish reading and still cannot find the problem, do not assume that there is nothing you can do; it may simply be time to hire somebody to take a look at it. It is very hard to write down every single thing that might be wrong, and sometimes it just takes a lot of experience to see it.
The first thing to do is determine which bottleneck is slowing your system down. Many things can make the load on a server run out of control, but the main ones are CPU limitations, memory (RAM), and the I/O of your disks. Typically people look at the output of “uptime” to get a general idea of whether load is what is causing issues with a server. In general a load of about 1 for each CPU is reasonable. If you have 2 CPUs with hyperthreading, Linux will see them as 4, which means your load can be around 4 without any major problems. That being said, it is very possible that your server can handle even double that without any problems. A lot of factors go into the load figure from uptime, and if you are interested in finding out more I would suggest looking on Google.
In this guide I am assuming that your server is already optimized, so if I say you are running low on RAM, you either need to optimize it some more or get more RAM for it.
If the load is below the acceptable limit (1 per CPU) and you are still having trouble accessing your server, I would suggest looking at the network you are on and checking that the server is not suffering a lot of packet loss. There may be other things to look at, but I am not going to investigate them here; this guide is about what to do when the load is too high.
The first quick and dirty tool for looking at everything at once and getting a general idea of what is going on is top. Simply log in over SSH as root and type “top” to bring it up. Below I have pasted the output from top:
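The "about 1 per CPU" rule of thumb can be turned into a quick check. This is only a minimal sketch: the helper names load_per_cpu and load_status are made up for this guide, and the cutoff of 1 is the rough guideline from above, not a hard limit.

```shell
# Hypothetical helper: divide a load average by the number of logical CPUs.
# $1 = load average, $2 = number of logical CPUs
load_per_cpu() {
    LC_ALL=C awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", load / cpus }'
}

# Hypothetical helper: apply the "about 1 per CPU" rule of thumb.
# Prints "high" when the load per CPU is above 1, otherwise "ok".
load_status() {
    LC_ALL=C awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (load / cpus > 1) print "high"; else print "ok"
    }'
}

# Example: a load of 4.83 on a machine that Linux sees as 4 CPUs.
load_per_cpu 4.83 4
load_status 4.83 4
```

On a live Linux system the two inputs could come from /proc/loadavg and nproc, for example: load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)".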
13:50:48 up 6 days, 1:11, 2 users, load average: 4.83, 3.40, 2.47
322 processes: 320 sleeping, 1 running, 1 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 0.3% 25.1% 4.0% 0.0% 0.3% 54.1% 15.7%
cpu00 1.1% 24.6% 3.7% 0.0% 1.1% 64.4% 4.7%
cpu01 0.1% 26.9% 2.9% 0.1% 0.1% 43.2% 26.1%
cpu02 0.1% 23.4% 6.1% 0.0% 0.0% 64.8% 5.3%
cpu03 0.0% 25.6% 3.1% 0.0% 0.1% 44.1% 26.8%
Mem: 2061576k av, 2040180k used, 21396k free, 0k shrd, 15684k buff
1500876k actv, 185468k in_d, 30948k in_c
Swap: 2048276k av, 186376k used, 1861900k free 738560k cached
There are 4 CPUs listed above, which means this is a dual-processor machine with hyperthreading enabled. The 3 load numbers show the 1, 5, and 15 minute averages. Since the 5 and 15 minute loads are lower than the 1 minute load, this is probably just a spike. Regardless, it is still a good idea to figure out what is causing the load to go up.
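Reading the three load numbers as a trend can also be scripted. The sketch below assumes the usual order of 1, 5, and 15 minute averages; load_trend is a hypothetical name, and comparing only the first and last number is my own simplification.

```shell
# Hypothetical helper: compare the 1-minute and 15-minute load averages.
# $1 = 1-minute, $2 = 5-minute (unused here), $3 = 15-minute load average
load_trend() {
    LC_ALL=C awk -v one="$1" -v fifteen="$3" 'BEGIN {
        # "+ 0" forces a numeric comparison
        if (one + 0 > fifteen + 0) print "rising"
        else if (one + 0 < fifteen + 0) print "falling"
        else print "steady"
    }'
}

# The three load averages from the top output above.
load_trend 4.83 3.40 2.47
```

A "rising" result with a low 15 minute average points to a recent spike, while three high numbers together would suggest the load has been a problem for a while.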
First we will look at the RAM usage.
Mem: 2061576k av, 2040180k used, 21396k free, 0k shrd, 15684k buff
1500876k actv, 185468k in_d, 30948k in_c
Swap: 2048276k av, 186376k used, 1861900k free 738560k cached
This particular server has 2 GB of RAM in it. Swap is temporary memory that is stored on the hard drive of the server. Because a hard drive is far slower than RAM, as soon as you start to swap you are going to decrease the efficiency of your server. If you are using much swap at all, generally more than a couple hundred MB, you are probably going to need a RAM upgrade. When looking at the actual RAM usage, keep in mind that Linux uses memory very differently from Windows. Just because all your memory is “used” does not mean that you need more RAM. What is important is to look at the buffers and cache. Linux will consume all of the RAM it can, but it keeps the nonessential items in buffers and cache, so if the memory is needed for something more important it can be handed over. Use “free -m” and look at the “-/+ buffers/cache” line to see how much RAM is really available.
# free -m
total used free shared buffers cached
Mem: 2013 1988 24 0 11 757
-/+ buffers/cache: 1219 793
Swap: 2000 190 1810
The key here is that the server still has 793 MB free once buffers and cache are counted, which is fine. Since the server is only using 190 MB of swap, the chances are pretty low that RAM usage is causing the problem. More RAM would help this server, but it is not going to be a cure-all.
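The “-/+ buffers/cache” line is just arithmetic on the “Mem:” row, and a small sketch makes that visible. The helper names are made up for this guide. Also note that newer versions of free drop this line and print an “available” column instead, so the sketch assumes the older layout shown above.

```shell
# Hypothetical helper: memory actually held by applications, in MB.
# $1 = used, $2 = buffers, $3 = cached (columns of the "Mem:" row)
app_used_mb() {
    echo $(( $1 - $2 - $3 ))
}

# Hypothetical helper: memory that is free or can be reclaimed, in MB.
# $1 = free, $2 = buffers, $3 = cached (columns of the "Mem:" row)
reclaimable_free_mb() {
    echo $(( $1 + $2 + $3 ))
}

# Values from the "Mem:" row of the free -m output above.
app_used_mb 1988 11 757          # prints 1220
reclaimable_free_mb 24 11 757    # prints 792
```

The results are within 1 MB of the 1219 and 793 that free printed. My guess is that the small difference comes from rounding each column to whole megabytes.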
The next thing to look at is the I/O (input/output) of the server’s hard drives. A hard drive is a spinning disk, and if it is too busy the server is going to have trouble keeping up with the read/write requests. First we are going to look at the iowait of the server in the output of top. The overall figure for the CPUs is 54.1%, which is much more than ideal; it should be close to 0%. Since it is this high, the processors are spending much of their time waiting on I/O, so the hard drives are the most likely limitation.
The next step in finding what is limiting the server is to look at how fast the disks are able to process information. Since a hard drive is basically a spinning disk, the more that is being done with it, the slower it is going to be. The hdparm command is used to determine just how fast the disk is able to perform at any given time. It is run as hdparm -Tt followed by the device. If the system has SCSI drives the device will probably be /dev/sda, while on an IDE system it will most likely be /dev/hda. It is best to establish a baseline by running this command when nothing else is running. Every drive model is going to perform differently, so take note of that when you are running hdparm on multiple servers. The higher the numbers, the better.
# hdparm -Tt /dev/sda
/dev/sda:
Timing buffer-cache reads: 1644 MB in 2.00 seconds = 822.00 MB/sec
Timing buffered disk reads: 40 MB in 3.15 seconds = 12.70 MB/sec
The buffer-cache figure on this system is pretty high and should not be causing a problem. That is normal most of the time, because this test reads from the Linux buffer cache in memory rather than from the disk itself. The buffered disk reads figure is the more important number and the one we are interested in. Most SCSI systems should be running in the area of 50-60 MB/sec. In many cases, with anything much below 20 MB/sec there are going to be significant performance issues. It is very possible that the system will drop even below 10 MB/sec when it is very busy. This server needs something done to increase its I/O performance. There are many things that can be done, including caching (something like mmcache), upgrading to a better disk (IDE to SCSI, or a higher drive speed or larger cache), or splitting the content across multiple disks. One quick and easy way to split the content is to place MySQL on a second drive. If this is already done, a third disk could be used to store some of the most used system content.
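If you collect hdparm readings regularly, pulling the number out by hand gets old. Below is a minimal sketch that extracts the buffered disk reads figure and compares it to the 20 MB/sec mark mentioned above. The function names are hypothetical, the sample text is the output pasted earlier, and the parsing assumes the line layout shown in this guide.

```shell
# Hypothetical helper: read hdparm -Tt output on stdin and print the
# MB/sec value from the "buffered disk reads" line.
disk_read_rate() {
    LC_ALL=C awk '/buffered disk reads/ { print $(NF-1) }'
}

# Hypothetical helper: rate a throughput figure against the rough
# 20 MB/sec mark from this guide. $1 = MB/sec
disk_rating() {
    LC_ALL=C awk -v rate="$1" 'BEGIN {
        if (rate + 0 < 20) print "slow"; else print "ok"
    }'
}

# Sample input: the hdparm output pasted above.
sample='/dev/sda:
 Timing buffer-cache reads: 1644 MB in 2.00 seconds = 822.00 MB/sec
 Timing buffered disk reads: 40 MB in 3.15 seconds = 12.70 MB/sec'

rate=$(printf '%s\n' "$sample" | disk_read_rate)
echo "$rate MB/sec: $(disk_rating "$rate")"
```

On a real server the input would come from the command itself, for example hdparm -Tt /dev/sda | disk_read_rate, run as root. Compare the result to your own idle baseline rather than relying only on the fixed cutoff.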
If the RAM and I/O both look OK, the remaining problem is simply brute processing power. By eliminating all of the other possible causes, the conclusion can generally be drawn that the server in question needs more CPU power. This could mean upgrading to a chip with a higher bus speed, like a P4 instead of a Celeron, or adding a second CPU. There are also quad-CPU servers available, but generally it will be more cost effective to consider some sort of cluster configuration. I am not going to discuss everything necessary for a cluster, but there is plenty of information available on Google if you are interested.
My hope is that this guide will help people determine their problems and fix them. By the end of the guide you should have a pretty good idea of what is slowing your server down. From that point you can look online for ways to optimize your server. If you are still having trouble after following this guide and are unable to determine the exact problem, it may be in your best interest to hire a professional and have somebody with years of experience take a look at it. Guides can tell you a lot, but experience talks.
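The process of elimination used in this guide (RAM first, then disk I/O, then CPU) can be written down as a small sketch. The name guess_bottleneck is made up, and the cutoffs of 200 MB of swap and 20 MB/sec are only the rough figures from this guide, so treat the answer as a starting point and not a diagnosis.

```shell
# Hypothetical helper: walk through the checks in the order used in this guide.
# $1 = swap used (MB), $2 = buffered disk reads (MB/sec)
guess_bottleneck() {
    LC_ALL=C awk -v swap="$1" -v disk="$2" 'BEGIN {
        if (swap + 0 > 200) print "memory"      # heavy swapping: look at RAM first
        else if (disk + 0 < 20) print "disk"    # slow disk reads: look at I/O
        else print "cpu"                        # nothing else left: processing power
    }'
}

# Values from the example server: 190 MB of swap, 12.70 MB/sec disk reads.
guess_bottleneck 190 12.70
```

For the example server this prints "disk", which matches the conclusion reached above.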