Too Many Linux Processes Causing the CPU to Hang


When reposting, please credit the source: http://blog.csdn.net/guoyjoe/article/details/49924557

1. The mailbox filled up with monitoring alerts like the ones below. The CPU is clearly overloaded, and disk I/O is struggling too:

Host: bwebser2__10.253.5.198
Time: 2015.11.15 15:25:17
Status: PROBLEM
Severity: Warning
Alert reason: Processor load is too high on bwebser2
Details: Processor load (1 min average per core):value=52.53
Original event ID: 30605

Host: bwebser2__10.253.5.198
Time: 2015.11.18 15:42:23
Status: PROBLEM
Severity: Warning
Alert reason: Disk I/O is overloaded on bwebser2
Details: CPU iowait time:value=68.7 %
Original event ID: 30812
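The alert reports load per core rather than raw load. As a rough sketch of how such a figure can be derived, the 1-minute load average is divided by the number of online CPUs. The helper name load_per_core is made up for illustration, and the core count of this host is not given in the alert:

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES with two decimals.
# Hypothetical helper for illustration; not part of the monitoring system.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live Linux host the inputs can be read like this:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```

Anything well above 1.00 per core means work is queueing, so a value above 50 is far outside normal operation.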

2. top shows nearly 2,000 processes

# top
top - 10:00:32 up 184 days, 19:55,  2 users,  load average: 49.39, 52.06, 53.04
Tasks: 1826 total,   1 running, 1825 sleeping,   0 stopped,   0 zombie
Cpu(s): 22.5%us,  3.8%sy,  0.0%ni, 31.7%id, 41.3%wa,  0.7%hi,  0.0%si,  0.0%st
Mem:   8058056k total,  7631808k used,   426248k free,   718780k buffers
Swap:        0k total,        0k used,        0k free,   358720k cached
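top gives the total task count but not which program is multiplying. A minimal sketch for grouping processes by command name follows; count_by_name is an illustrative helper, not something from the original troubleshooting session:

```shell
# count_by_name: read one command name per line on stdin and print
# "count name" pairs, most frequent first.
count_by_name() {
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Typical use on a live system (ten most common command names):
#   ps -eo comm= | count_by_name | head -n 10
```

On a host in the state described here, the mail-related commands would be expected near the top of that list.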

3. Suspecting the mail system (sendmail), check the maillog. It keeps logging the warning: No space left on device

# tail -f /var/log/maillog
Nov 19 10:12:15 bwebser2 postfix/postdrop[19470]: warning: mail_queue_enter: create file maildrop/878633.19470: No space left on device
Nov 19 10:12:15 bwebser2 postfix/postdrop[27287]: warning: mail_queue_enter: create file maildrop/900082.27287: No space left on device
Nov 19 10:12:15 bwebser2 postfix/postdrop[12347]: warning: mail_queue_enter: create file maildrop/919377.12347: No space left on device
Nov 19 10:12:15 bwebser2 postfix/postdrop[21222]: warning: mail_queue_enter: create file maildrop/937001.21222: No space left on device
Nov 19 10:12:16 bwebser2 postfix/postdrop[25028]: warning: mail_queue_enter: create file maildrop/956095.25028: No space left on device
Nov 19 10:12:16 bwebser2 postfix/postdrop[28123]: warning: mail_queue_enter: create file maildrop/980022.28123: No space left on device
Nov 19 10:12:16 bwebser2 postfix/postdrop[26680]: warning: mail_queue_enter: create file maildrop/999360.26680: No space left on device

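4. Use lsof to gauge how many sendmail and postdrop processes exist. lsof prints one line per open file, so the counts below are open-file entries rather than processes, but they are consistent with the roughly 2,000 tasks seen in top. Why are there so many?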

# lsof | grep sendmail | wc -l
24682
# lsof | grep postdrop | wc -l
24108
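Because lsof lists open files, one process shows up on many lines. To turn that output into an actual process count, the distinct PIDs can be counted instead. This is a sketch that assumes the default lsof layout (COMMAND in column 1, PID in column 2); count_pids is an illustrative name:

```shell
# count_pids NAME: read lsof-style lines on stdin and print how many
# distinct PIDs have NAME in the COMMAND column.
count_pids() {
    awk -v name="$1" '$1 == name { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Typical use:
#   lsof | count_pids postdrop
```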

 

5. Check inode usage: the root filesystem has no free inodes left:

# df -i
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
/dev/xvda1      1310720 1310720        0  100% /
tmpfs           1007257       1  1007256    1% /dev/shm
/dev/xvdb1     13107200    6142 13101058    1% /u01

df -Th shows that plenty of disk space remains, so the shortage is inodes, not bytes:
# df -Th
Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/xvda1     ext4    20G  4.1G   15G  22% /
tmpfs          tmpfs  3.9G     0  3.9G   0% /dev/shm
/dev/xvdb1     ext3   197G   18G  170G  10% /u01
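When inodes run out while space is still free, the usual cause is a directory holding a huge number of small files. A rough sketch for finding it counts entries under each top-level subdirectory without crossing into other filesystems. inode_hogs is an illustrative helper; -xdev and -mindepth are GNU find options:

```shell
# inode_hogs DIR: for each entry directly under DIR, print how many files
# and directories live at or below it (same filesystem only), largest first.
inode_hogs() {
    dir=${1%/}
    find "$dir/" -xdev -mindepth 1 2>/dev/null |
        awk -v skip="${#dir}" '{
            rest = substr($0, skip + 2)   # path relative to DIR
            split(rest, part, "/")        # part[1] is the top-level entry
            count[part[1]]++
        }
        END { for (name in count) print count[name], name }' |
        sort -rn
}

# Typical use, repeated one level deeper each time:
#   inode_hogs / | head
```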

6. Free inodes on the root filesystem by deleting old ZooKeeper monitoring logs

# cd /home/zookeeper/monitor
# ll
total 8
drwxrwxr-x 163 zookeeper zookeeper 4096 Nov 12 00:16 charts
drwxrwxr-x 167 zookeeper zookeeper 4096 Nov 18 17:31 statistics
# cd charts
# rm -rf *
# cd ../statistics/
# rm -rf 201506*
# rm -rf 201507*
# rm -rf 201508*
# rm -rf 201509*
# rm -rf 201510*
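Deleting by hand-typed month patterns works, but a dry run by age is less error-prone. A sketch under the assumption that the log directories can be selected by modification time follows; list_old_dirs is an illustrative helper, and the 30-day threshold in the usage line is an arbitrary example:

```shell
# list_old_dirs DIR DAYS: print the immediate subdirectories of DIR whose
# modification time is more than DAYS days ago. Nothing is deleted.
list_old_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" | sort
}

# Review the list first, then delete (GNU xargs; names must not contain spaces):
#   list_old_dirs /home/zookeeper/monitor/statistics 30 | xargs -r rm -rf
```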

7. Kill all sendmail and postdrop processes

# ps -ef | grep sendmail | grep -v grep | awk '{print "kill -9 " $2}' | sh
# ps -ef | grep postdrop | grep -v grep | awk '{print "kill -9 " $2}' | sh
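The grep pipeline matches any line that merely contains the word, including unrelated commands with it in their arguments. A slightly stricter sketch selects PIDs by exact command name; pids_of is an illustrative helper:

```shell
# pids_of NAME: read "PID COMMAND" lines on stdin and print the PIDs whose
# command name is exactly NAME.
pids_of() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Typical use: try a normal TERM first and fall back to -9 only if needed.
#   ps -eo pid=,comm= | pids_of postdrop | xargs -r kill
# pkill -x postdrop does much the same where pkill is installed.
```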

8. lsof now reports zero entries for both

# lsof | grep sendmail | wc -l
0
# lsof | grep postdrop | wc -l
0

9. The overlooked culprit is the sysstat job under /etc/cron.d. Edit it as follows:

# cd /etc/cron.d/
# ll
total 12
-rw-r--r--. 1 root root 113 Nov 23  2013 0hourly
-rw-r--r--. 1 root root 108 Apr  7  2014 raid-check
-rw-r--r--. 1 root root 235 Nov 23  2013 sysstat
 
Edit sysstat with vi and append &>/dev/null to both jobs, so that cron has no output left to mail:
# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1 &>/dev/null
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A &>/dev/null
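The likely chain of events: cron mails any output a job produces, each mail goes through sendmail and postdrop, and with no inodes left those processes presumably could not create their queue file and piled up. A rough sketch for spotting other cron.d jobs that still send output to mail follows. unredirected_jobs is an illustrative helper, and treating any ">" on the line as a redirect is only a heuristic:

```shell
# unredirected_jobs FILE...: print cron job lines that contain no ">" at all.
# Comments, blank lines and VAR=value settings such as MAILTO are skipped.
unredirected_jobs() {
    grep -hEv '^[[:space:]]*(#|$)|^[A-Za-z_]+=' "$@" | grep -v '>'
}

# Typical use:
#   unredirected_jobs /etc/cron.d/*
```

Note that &> is a bash extension. It works here because /bin/sh is bash on CentOS; the portable spelling is >/dev/null 2>&1.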

10. top now shows only about 100 processes, the monitoring alerts stop, and the problem is solved!

# service sendmail restart
sendmail: unrecognized service
# top
top - 10:43:12 up 184 days, 20:37,  2 users,  load average: 1.03, 1.54, 14.15
Tasks: 105 total,   1 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s): 43.4%us,  1.3%sy,  0.0%ni, 47.9%id,  7.0%wa,  0.3%hi,  0.0%si,  0.0%st
Mem:   8058056k total,  6762996k used,  1295060k free,  1422060k buffers
Swap:        0k total,        0k used,        0k free,   381392k cached

 

Copyright notice: this is an original article by the blogger and may not be reposted without the blogger's permission.

Fixing the zabbix "cannot allocate shared memory of size" Error

Symptom:

zabbix_agentd fails to start. System: CentOS 5.8 i386.

Root cause analysis:

The kernel's limits on shared memory are responsible.

The commands used are ipcs [-m|l|a] and sysctl [-a|p].


 

# ipcs -l

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 0
max total shared memory (kbytes) = 0

min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767

------ Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536


 

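The output above reports both max total shared memory and max seg size as 0. That could be read as "no limit", yet zabbix still cannot allocate memory.

Next, check the current shared memory settings: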

# sysctl -a | grep shm
kernel.shmmni = 4096
kernel.shmall = 0
kernel.shmmax = 0

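kernel.shmall is the total amount of shared memory that can be allocated (counted in pages), and kernel.shmmax is the largest size a single segment may have (in bytes). Both are 0 here, so something is definitely wrong.

Next, look at /etc/sysctl.conf: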

kernel.shmmax = 68719476736
kernel.shmall = 4294967296

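So shmall is set to 4G (4294967296) and shmmax is larger still at 64G (68719476736). This is a 32-bit system, so a setting cannot exceed the largest value a 32-bit system can handle, which caps it at a little over 3G. I therefore changed them to: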

kernel.shmmax = 1294967296
kernel.shmall = 3294967296
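One likely explanation for the zeros, assuming the i386 kernel keeps these limits in a 32-bit unsigned field: both original values are exact multiples of 2^32, so nothing is left after truncation, while the replacement values fit. The helper name wrap32 is made up for this sketch, and it needs a shell with 64-bit arithmetic such as bash:

```shell
# wrap32 N: print N modulo 2^32, i.e. what survives in a 32-bit unsigned field.
wrap32() {
    echo $(( $1 % 4294967296 ))
}

wrap32 68719476736   # original shmmax    -> prints 0
wrap32 4294967296    # original shmall    -> prints 0
wrap32 1294967296    # replacement shmmax -> prints 1294967296
wrap32 3294967296    # replacement shmall -> prints 3294967296
```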

Then run sysctl -p to apply the change and check again:

# sysctl -a | grep shm
kernel.shmmni = 4096
kernel.shmall = 3294967296
kernel.shmmax = 1294967296

The change did take effect, and zabbix_agentd now starts successfully. The shared memory allocation looks like this:

# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x7401840e 2916352    root       600        4          0
0x6c0180cf 3047425    zabbix     600        527272     6
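To compare what is actually allocated against the limits, the bytes column of ipcs -m can be added up. This is a small sketch that assumes the column layout shown above; shm_total is an illustrative helper:

```shell
# shm_total: read `ipcs -m` output on stdin and print the sum of the bytes
# column (field 5) over all segment lines, which start with a 0x key.
shm_total() {
    awk '$1 ~ /^0x/ { sum += $5 } END { print sum + 0 }'
}

# Typical use:
#   ipcs -m | shm_total
```

For the two segments listed here the total is 527276 bytes, far below the new shmmax.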

 

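This applies beyond zabbix: many programs that fail with this error can be fixed the same way, because the cause is the kernel's resource limits.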

 

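Monitoring Linux Disk I/O with cacti

cacti has a plugin for monitoring disk I/O (snmpdiskio). It is hard to find for download, and once I finally got hold of it the setup was painful too: several things need changing, and I am not familiar with cacti templates. These are brief notes on using the plugin to monitor Linux disk I/O.

cacti 0.8.8b (the latest version at the time of writing)

linux (centos 6.4)

First download the plugin, then unpack it.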

# cd snmpdiskio-0.9.6/
# cp snmpdiskio /usr/local/bin/
# vi /var/www/html/cacti/resource/snmp_queries/partition.xml    # add the following content

        Get SNMP Partitions
        Queries a host for a list of monitorable partitions
        .1.3.6.1.4.1.2021.13.15
        hdDescr:hdIndex
        numeric
        |chosen_order_field|
 
        
                
                        Index
                        walk
                        value
                        input
                        .1.3.6.1.4.1.2021.13.15.1.1.1
                
                
                        Description
                        walk
                        value
                        input
                        .1.3.6.1.4.1.2021.13.15.1.1.2
                
                
                        Bytes Written
                        walk
                        value
                        output
                        .1.3.6.1.4.1.2021.13.15.1.1.3
                
                
                        Bytes Read
                        walk
                        value
                        output
                        .1.3.6.1.4.1.2021.13.15.1.1.4
                
        

# chown cacti.cacti /var/www/html/cacti/resource/snmp_queries/partition.xml

Next, edit snmpd.conf and append the following lines to the end of the file:

# vi /etc/snmp/snmpd.conf
extend .1.3.6.1.4.1.2021.54 hdNum /bin/sh /usr/local/bin/snmpdiskio hdNum
extend .1.3.6.1.4.1.2021.55 hdIndex /bin/sh /usr/local/bin/snmpdiskio hdIndex
extend .1.3.6.1.4.1.2021.56 hdDescr /bin/sh /usr/local/bin/snmpdiskio hdDescr
extend .1.3.6.1.4.1.2021.57 hdInBlocks /bin/sh /usr/local/bin/snmpdiskio hdInBlocks
extend .1.3.6.1.4.1.2021.58 hdOutBlocks /bin/sh /usr/local/bin/snmpdiskio hdOutBlocks
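Each extend line makes snmpd run the script with one argument and publish whatever it prints. I have not reproduced the plugin's code here; as a purely hypothetical sketch of where such numbers can come from, the kernel's per-device counters in /proc/diskstats hold the device name in field 3, sectors read in field 6 and sectors written in field 10. disk_sectors is an illustrative name, not part of snmpdiskio:

```shell
# disk_sectors: read /proc/diskstats-formatted lines on stdin and print
# "device sectors_read sectors_written" for each device.
# Hypothetical illustration only; this is not the snmpdiskio implementation.
disk_sectors() {
    awk '{ print $3, $6, $10 }'
}

# Typical use:
#   disk_sectors < /proc/diskstats
```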

Then restart snmpd and test whether data can be fetched:

# service snmpd restart
# snmpwalk -v 2c -c mingdao localhost .1.3.6.1.4.1.2021.58
UCD-SNMP-MIB::ucdavis.58.1.0 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.58.2.1.2.11.104.100.79.117.116.66.108.111.99.107.115 = STRING: "/bin/sh"
UCD-SNMP-MIB::ucdavis.58.2.1.3.11.104.100.79.117.116.66.108.111.99.107.115 = STRING: "/usr/local/bin/snmpdiskio hdOutBlocks"
UCD-SNMP-MIB::ucdavis.58.2.1.4.11.104.100.79.117.116.66.108.111.99.107.115 = ""
UCD-SNMP-MIB::ucdavis.58.2.1.5.11.104.100.79.117.116.66.108.111.99.107.115 = INTEGER: 5
UCD-SNMP-MIB::ucdavis.58.2.1.6.11.104.100.79.117.116.66.108.111.99.107.115 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.58.2.1.7.11.104.100.79.117.116.66.108.111.99.107.115 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.58.2.1.20.11.104.100.79.117.116.66.108.111.99.107.115 = INTEGER: 4
UCD-SNMP-MIB::ucdavis.58.2.1.21.11.104.100.79.117.116.66.108.111.99.107.115 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.58.3.1.1.11.104.100.79.117.116.66.108.111.99.107.115 = STRING: "0"
UCD-SNMP-MIB::ucdavis.58.3.1.2.11.104.100.79.117.116.66.108.111.99.107.115 = STRING: "0

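Data comes back, so carry on.

Import the two files from the archive into Cacti as templates:

cacti_data_query_snmp_disk_statistics.xml

cacti_graph_template_disk_io_bytessec.xml

Importing these two is enough. Then, when adding monitoring for a host in cacti, pick the snmp disk query.

In the screenshot, the two disks I had already added are greyed out. I am not sure what the other devices are; this is a cloud host, so they may be related to the virtual machine. Just add the devices you care about.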

Installing zabbix

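0. Environment preparation

Setting up the zabbix web environment
zabbix needs a LAMP or LNMP stack. For convenience, install LAMP directly with yum.
yum install mysql-server httpd php

Additional packages required: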

yum install mysql-devel gcc net-snmp-devel curl-devel perl-DBI php-gd php-mysql php-bcmath php-mbstring php-xm

1. Download

wget -O zabbix-2.2.5.tar.gz http://sourceforge.net/projects/zabbix/files/ZABBIX%20Latest%20Stable/2.2.5/zabbix-2.2.5.tar.gz/download
tar zxvf zabbix-2.2.5.tar.gz
cd zabbix-2.2.5/database/mysql

2. Create the database and user

create database zabbix;
grant all on zabbix.* to [email protected] identified by 'zabbix';
flush privileges;

3. Import the schema and data

mysql -uzabbix -pzabbix zabbix < schema.sql
mysql -uzabbix -pzabbix zabbix < images.sql
mysql -uzabbix -pzabbix zabbix < data.sql

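4. Compile and install (run from the top of the zabbix-2.2.5 source tree, not from database/mysql)

./configure --prefix=/usr/local/zabbix --with-mysql --with-net-snmp --with-libcurl --enable-server --enable-agent --enable-proxy
make install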

5. Add the user and group. This step is required, because the daemons check for this user when they start.

groupadd zabbix
useradd zabbix -g zabbix

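6. Edit the configuration files

ln -s /usr/local/zabbix/etc /etc/zabbix

Update the database name, user and password in the server configuration file:

vim /etc/zabbix/zabbix_server.conf

Change these three settings:

DBName=zabbix

DBUser=zabbix

DBPassword=zabbix

Update the agentd configuration file:

vim /etc/zabbix/zabbix_agentd.conf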

Hostname=LOG01
ServerActive=10.10.10.180:20051
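After editing, it is easy to leave a setting commented out or misspelled. A minimal sketch for reading back what a KEY=value file actually sets follows; conf_get is an illustrative helper, not a zabbix tool:

```shell
# conf_get FILE KEY: print the value of the last uncommented KEY=value line
# in FILE (prints an empty line if KEY is not set).
conf_get() {
    awk -F= -v key="$2" '$1 == key { v = substr($0, length(key) + 2) } END { print v }' "$1"
}

# Typical use:
#   conf_get /etc/zabbix/zabbix_agentd.conf ServerActive
#   conf_get /etc/zabbix/zabbix_server.conf DBName
```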

7. Copy the web frontend into the web server's document directory

cp -r /data/soft/zabbix-2.2.5/frontends/php/ /var/www/html/zabbix/
chown -R zabbix.zabbix /var/www/html/zabbix

8. Edit the PHP configuration file php.ini

vim /etc/php.ini

date.timezone = Asia/Shanghai
post_max_size = 32M
max_execution_time = 300
max_input_time = 300
memory_limit = 128M
mbstring.func_overload = 2

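9. Restart httpd after the change

service httpd restart

10. Start the zabbix server

/usr/local/zabbix/sbin/zabbix_server

11. Start the zabbix agentd

/usr/local/zabbix/sbin/zabbix_agentd

12. Open http://192.168.1.1/zabbix/ in a browser

Follow the installer step by step (no screenshots here) and enter the database user name and password created earlier.

Once installation finishes, the default user name is Admin (case-sensitive) and the password is zabbix.

The initial installation is now complete.

The zabbix agentd installation is covered in a separate document.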

 

done !