Test environment
OS: CentOS 5.10 x86_64 (one admin node, two compute nodes)
Hostnames and their IP addresses:
admin: 192.168.78.11
node1: 192.168.78.12
node2: 192.168.78.13
Software versions (PBS):
torque-3.0.6.tar.gz
maui-3.3.1.tar.gz
openmpi-1.8.1.tar.bz2
Parallel software:
apoa1.tar.gz
NAMD_2.9_Linux-x86_64-multicore.tar.gz
Part 1: Environment configuration
1. Edit the hosts file and add the following entries
192.168.78.11 admin
192.168.78.12 node1
192.168.78.13 node2
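Because the setup may be rerun, it can help to add these entries idempotently. The following is a minimal sketch, not part of the original procedure: add_host_entry is a hypothetical helper that appends a mapping only when the hostname is not already present in the file.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME (hypothetical helper, not from the original setup)
# Appends "IP NAME" to FILE unless NAME already appears as a hostname field.
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    [ -f "$hosts_file" ] || : > "$hosts_file"
    # Skip comment lines; compare NAME against every field after the address.
    if awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"
    then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (as root on the admin node):
#   add_host_entry /etc/hosts 192.168.78.11 admin
#   add_host_entry /etc/hosts 192.168.78.12 node1
#   add_host_entry /etc/hosts 192.168.78.13 node2
```

Running the helper twice with the same arguments leaves the file unchanged the second time.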
2. Set up passwordless SSH access
Run ssh-keygen and press Enter at every prompt to generate a key pair, then go into the .ssh directory, create the authorized_keys file, and set its permissions
[root@admin ~]# cd .ssh/
[root@admin .ssh]# ls
id_rsa id_rsa.pub
[root@admin .ssh]# cp id_rsa.pub authorized_keys
[root@admin .ssh]# chmod 600 authorized_keys
[root@admin .ssh]# ll
total 12
-rw------- 1 root root  394 Aug 23 03:52 authorized_keys
-rw------- 1 root root 1675 Aug 23 03:50 id_rsa
-rw-r--r-- 1 root root  394 Aug 23 03:50 id_rsa.pub
3. Then copy the .ssh directory to all compute nodes
[root@admin ~]# for i in 1 2 ; do scp -r /root/.ssh node$i:/root/ ; done
The first time, you have to enter the root password of both compute nodes; after that, access is passwordless
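Before moving on, it may be worth confirming that key-based login really works without a prompt. This is a rough sketch under assumptions: check_passwordless is a hypothetical helper, and the SSH_CMD variable is only an illustrative override hook (it defaults to ssh). BatchMode=yes makes ssh fail instead of asking for a password.

```shell
#!/bin/sh
# check_passwordless HOST... (hypothetical helper)
# Tries a non-interactive login on each host and reports the result.
check_passwordless() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes: never prompt; ConnectTimeout=5: give up after 5 s.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 \
               "$host" true < /dev/null 2>/dev/null
        then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Example: check_passwordless node1 node2
```

The function returns non-zero if any host still asks for a password or is unreachable, so it can gate later steps in a script.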
4. Copy the hosts file to all compute nodes
[root@admin ~]# for i in 1 2 ; do scp /etc/hosts node$i:/etc/ ; done
5. Configure the NFS service
Use /export on the admin node as the shared directory
[root@admin ~]# mkdir -p /export/{apps,home,scripts,source}   # apps is the shared software directory, home holds the shared home directories
[root@admin ~]# cat /etc/exports
/export 192.168.78.0/255.255.255.0(rw,sync)
6. Start the NFS service and check that it started successfully
[root@admin ~]# chkconfig portmap on ; /etc/init.d/portmap start
Starting portmap: [ OK ]
[root@admin ~]# chkconfig nfs on ; /etc/init.d/nfs start
[root@admin ~]# showmount -e localhost
Export list for localhost:
/export 192.168.78.0/255.255.255.0
[root@admin ~]#
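The showmount check can also be scripted rather than read by eye. A small sketch, assuming a hypothetical filter named has_export that reads showmount -e output on standard input and succeeds only if the given path is listed:

```shell
#!/bin/sh
# has_export PATH (hypothetical helper)
# Reads "showmount -e" output on stdin; exit status 0 if PATH is exported.
has_export() {
    # Line 1 is the "Export list for ..." header, so start at line 2.
    awk -v p="$1" 'NR > 1 && $1 == p { found = 1 } END { exit !found }'
}

# Example:
#   showmount -e localhost | has_export /export && echo "/export is exported"
```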
7. Configure autofs
[root@admin ~]# cat /etc/auto.master
/home /etc/auto.home --timeout=1200
/share /etc/auto.share --timeout=1200
[root@admin ~]# cat /etc/auto.share
* admin:/export/&
[root@admin ~]# cat /etc/auto.home
* -nfsvers=3 admin:/export/home/&
[root@admin ~]#
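In these maps, * matches whatever directory name is requested under the mount point, and & in the location is replaced by that name. To make the substitution concrete, here is an illustrative sketch only: expand_map_entry is a hypothetical function that mimics the replacement for plain names such as a user or directory name; it is not how autofs itself is invoked.

```shell
#!/bin/sh
# expand_map_entry KEY LOCATION (hypothetical, for illustration only)
# Prints LOCATION with every "&" replaced by KEY, as a wildcard map would.
# Assumes KEY is a plain name without "|", "&" or backslashes.
expand_map_entry() {
    key=$1
    location=$2
    printf '%s\n' "$location" | sed "s|&|$key|g"
}

# Accessing /home/linuxidc resolves to:
expand_map_entry linuxidc 'admin:/export/home/&'
# Accessing /share/apps resolves to:
expand_map_entry apps 'admin:/export/&'
```

So /share/apps on any node is backed by /export/apps on the admin node, and each user's /home/<name> by /export/home/<name>.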
8. Start the autofs service
[root@admin ~]# chkconfig autofs on ; /etc/init.d/autofs start
9. Copy auto.master, auto.share and auto.home to all compute nodes
[root@admin ~]# for i in 1 2; do scp /etc/auto.master node$i:/etc/; done
[root@admin ~]# for i in 1 2; do scp /etc/auto.share node$i:/etc/; done
[root@admin ~]# for i in 1 2; do scp /etc/auto.home node$i:/etc/; done
10. Start the autofs service on the compute nodes
[root@admin ~]# for i in 1 2; do ssh node$i /etc/init.d/autofs start; done
[root@admin ~]# for i in 1 2; do ssh node$i chkconfig autofs on; done
11. Configure the NIS service
[root@admin ~]# yum -y install ypserv
[root@admin ~]# nisdomainname linuxidcyf.com
[root@admin ~]# echo "NISDOMAIN=linuxidcyf.com" >> /etc/sysconfig/network
[root@admin ~]# cp /usr/share/doc/ypserv-2.19/securenets /var/yp/
[root@admin ~]# vi /var/yp/securenets
The contents after editing are as follows
[root@admin ~]# grep -v "^#" /var/yp/securenets
255.0.0.0 127.0.0.0
255.255.255.0 192.168.78.0
[root@admin ~]#
12. Start the NIS services
[root@admin ~]# /etc/init.d/ypserv start ; chkconfig ypserv on
Starting YP server services: [ OK ]
[root@admin ~]# /etc/init.d/yppasswdd start ; chkconfig yppasswdd on
Starting YP passwd service: [ OK ]
[root@admin ~]#
13. Edit the /etc/default/useradd file
Change HOME=/home to HOME=/export/home
14. Create a .ssh directory under /etc/skel, and inside it a file named config with the following settings
[root@admin ~]# mkdir /etc/skel/.ssh
[root@admin ~]# touch /etc/skel/.ssh/config
[root@admin ~]# cat /etc/skel/.ssh/config
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
[root@admin ~]# chmod 600 /etc/skel/.ssh/config
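15. Create a command for synchronizing users
◆ Create a script named sync_users in /usr/local/sbin with the following contents:
#!/bin/bash
# Point every account created under /export at its automounted /home path,
# then rebuild the NIS maps.
YPINIT=/usr/lib64/yp/ypinit
# Select accounts whose passwd entry mentions "export"
for USER in $(sed -n '/export/p' /etc/passwd | awk -F ":" '{print $1}')
do
    if [ -z "$USER" ]; then
        $YPINIT -m
    else
        usermod -d /home/$USER $USER
    fi
done
# Rebuild the NIS maps so that clients see the updated accounts
$YPINIT -m
◆ Make it executable
chmod 755 /usr/local/sbin/sync_users
◆ From now on, running sync_users synchronizes newly created users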
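The script selects accounts by matching the word "export" anywhere in the passwd line, which would also catch, say, a user whose name contains that string. As a hedged alternative for previewing which accounts would be touched, the sketch below matches only on the home-directory field; list_export_users is a hypothetical helper and is not part of the original sync_users script.

```shell
#!/bin/sh
# list_export_users [PASSWD_FILE] (hypothetical helper)
# Prints the login names whose home directory (field 6) lies under /export/.
list_export_users() {
    awk -F: '$6 ~ /^\/export\// { print $1 }' "${1:-/etc/passwd}"
}

# Example: preview the accounts before running sync_users
#   list_export_users
```

Running it before sync_users gives a dry-run view of the accounts whose home directory would be rewritten to /home/<name>.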
16. Create a test user linuxidc and synchronize it
[root@admin ~]# useradd linuxidc
[root@admin ~]# echo linuxidc | passwd --stdin linuxidc
[root@admin ~]# sync_users
Note: run sync_users every time a new user is added
17. Configure the NIS clients. ypbind must be installed on all compute nodes; RHEL installs it by default
[root@admin ~]# for i in 1 2; do ssh node$i authconfig --enablenis --nisdomain=linuxidcyf.com \
--nisserver=admin --update; done
18. Verify that the NIS service is configured correctly
[root@node1 ~]# ypcat passwd
linuxidc:$1$tsPKQvPP$Kwom9qG/DNR1w/Lq./cQV.:500:500::/home/linuxidc:/bin/bash
[root@admin ~]# for i in 1 2; do ssh node$i id linuxidc; done
uid=500(linuxidc) gid=500(linuxidc) groups=500(linuxidc)
uid=500(linuxidc) gid=500(linuxidc) groups=500(linuxidc)
The output above shows that the NIS service is configured correctly
Part 2: Install and configure torque (admin node)
1. First install openmpi
[root@admin parallel]# tar xjvf openmpi-1.8.1.tar.bz2 -C /usr/local/src/
[root@admin parallel]# cd /usr/local/src/openmpi-1.8.1/
[root@admin openmpi-1.8.1]# ./configure --prefix=/share/apps/openmpi
[root@admin openmpi-1.8.1]# make
[root@admin openmpi-1.8.1]# make install
[root@admin openmpi-1.8.1]# cp -r examples/ /share/apps/openmpi
2. Add the environment variables. A Path.sh script is created under /share/scripts so that the compute nodes can add the same variables later
[root@admin scripts]# pwd
/share/scripts
[root@admin scripts]# cat Path.sh
#!/bin/bash
grep openmpi /etc/profile || cat >>/etc/profile <<EOF
export PATH=/share/apps/openmpi/bin:\$PATH
export LD_LIBRARY_PATH=/share/apps/openmpi/lib:\$LD_LIBRARY_PATH
EOF
[root@admin scripts]#
[root@admin scripts]# sh Path.sh
[root@admin scripts]# source /etc/profile
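Path.sh escapes each $ so that the variables are written literally into /etc/profile instead of being expanded when the script runs. An equivalent approach, sketched here under assumptions, is to quote the here-document delimiter, which disables expansion for the whole body. The function name add_openmpi_env and its optional file argument are hypothetical, added only so the snippet can be tried on a scratch file.

```shell
#!/bin/sh
# add_openmpi_env [PROFILE_FILE] (hypothetical wrapper around the Path.sh logic)
# Appends the openmpi settings once; <<'EOF' keeps $PATH etc. unexpanded.
add_openmpi_env() {
    profile=${1:-/etc/profile}
    grep -q openmpi "$profile" 2>/dev/null || cat >> "$profile" <<'EOF'
export PATH=/share/apps/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/openmpi/lib:$LD_LIBRARY_PATH
EOF
}

# Example on a scratch file:
#   add_openmpi_env /tmp/profile.test && cat /tmp/profile.test
```

As in Path.sh, the grep guard makes the function safe to run repeatedly.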
3. Check that openmpi was installed successfully
[root@admin scripts]# which mpirun
/share/apps/openmpi/bin/mpirun
[root@admin scripts]# which mpiexec
/share/apps/openmpi/bin/mpiexec
4. Install torque
[root@admin parallel]# tar xzvf torque-3.0.6.tar.gz -C /share/source/
[root@admin parallel]# cd /share/source/torque-3.0.6/
[root@admin torque-3.0.6]# ./configure --enable-syslog --enable-nvidia-gpus --enable-cpuset --disable-gui --with-rcp=scp --with-sendmail
[root@admin torque-3.0.6]# make
[root@admin torque-3.0.6]# make install
[root@admin torque-3.0.6]# pwd
/share/source/torque-3.0.6
[root@admin torque-3.0.6]# cat install.sh
cd /share/source/torque-3.0.6
make install
[root@admin torque-3.0.6]#
5. Initialize torque and create the default queue
[root@admin torque-3.0.6]# ./torque.setup root
initializing TORQUE (admin: root@admin)
PBS_Server admin: Create mode and server database exists,
do you wish to continue y/(n)?y
root 26351 1 0 06:44 ? 00:00:00 pbs_server -t create
Max open servers: 10239
Max open servers: 10239
[root@admin torque-3.0.6]#
6. View the default queue batch that was created
[root@admin torque-3.0.6]# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin
set server admins = root@admin
set server operators = root@admin
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
[root@admin torque-3.0.6]#
7. Change some attributes of the batch queue to suit actual needs
[root@admin torque-3.0.6]# qmgr -c "s q batch resources_default.walltime=24:00:00"
[root@admin torque-3.0.6]# qmgr -c "s s query_other_jobs=true"
8. Create the mom configuration file, which will be copied to all compute nodes
[root@admin mom_priv]# pwd
/var/spool/torque/mom_priv
[root@admin mom_priv]# cat config
$pbsserver admin
$logevent 225
9. Create the node information file
[root@admin server_priv]# pwd
/var/spool/torque/server_priv
[root@admin server_priv]# cat nodes
node1
node2
[root@admin server_priv]#
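TORQUE's nodes file also accepts an optional np=<count> attribute after each hostname, giving the number of job slots on that node; without it, each node is given one slot. A minimal sketch for generating such a file, assuming hosts are named node1..nodeN as in this setup; print_nodes_file is a hypothetical helper and the slot count is a value you would choose for your own hardware.

```shell
#!/bin/sh
# print_nodes_file COUNT NP (hypothetical helper)
# Prints one "nodeN np=NP" line for N = 1..COUNT.
print_nodes_file() {
    count=$1
    np=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'node%d np=%d\n' "$i" "$np"
        i=$((i + 1))
    done
}

# Example (two single-slot nodes, matching this cluster):
#   print_nodes_file 2 1 > /var/spool/torque/server_priv/nodes
```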
10. Check the node information; all nodes are currently in the down state
[root@admin server_priv]# pbsnodes -a
node1
     state = down
     np = 1
     ntype = cluster
     mom_service_port = 15002
     mom_admin_port = 15003
     gpus = 0
node2
     state = down
     np = 1
     ntype = cluster
     mom_service_port = 15002
     mom_admin_port = 15003
     gpus = 0
[root@admin server_priv]#
11. Copy the pbs_server init script and enable it at boot
[root@admin torque-3.0.6]# pwd
/share/source/torque-3.0.6
[root@admin torque-3.0.6]# cp contrib/init.d/pbs_server /etc/init.d/
[root@admin torque-3.0.6]# chmod 755 /etc/init.d/pbs_server
[root@admin torque-3.0.6]# chkconfig pbs_server on
12. Copy the pbs_mom init script, so that it can be copied to the compute nodes later
[root@admin torque-3.0.6]# cp contrib/init.d/pbs_mom /etc/init.d/
13. Install maui
[root@admin parallel]# tar xzvf maui-3.3.1.tar.gz -C /usr/local/src/
[root@admin ~]# cd /usr/local/src/maui-3.3.1/
[root@admin maui-3.3.1]# ./configure --prefix=/usr/local/maui --with-pbs=/usr/local
[root@admin maui-3.3.1]# make
[root@admin maui-3.3.1]# make install
14. Copy the maui init script, set the correct path in it, and enable it at boot
[root@admin maui-3.3.1]# cp etc/maui.d /etc/init.d/mauid
[root@admin maui-3.3.1]# vi /etc/init.d/mauid
Change MAUI_PREFIX=/opt/maui to MAUI_PREFIX=/usr/local/maui
[root@admin maui-3.3.1]# chmod 755 /etc/init.d/mauid
[root@admin maui-3.3.1]# chkconfig mauid on
15. Start the maui scheduling service
[root@admin maui-3.3.1]# /etc/init.d/mauid start
Starting MAUI Scheduler: [ OK ]
[root@admin maui-3.3.1]#
16. Add the maui commands to the PATH environment variable
[root@admin maui-3.3.1]# vi /etc/profile
export PATH=/share/apps/openmpi/bin:/usr/local/maui/bin:$PATH
[root@admin maui-3.3.1]# source /etc/profile
17. Install the parallel software into the shared directory
[root@admin namd]# tar xzvf NAMD_2.9_Linux-x86_64-multicore.tar.gz -C /share/apps/
[root@admin namd]# tar xzvf apoa1.tar.gz -C /share/apps/
[root@admin apps]# pwd
/share/apps
[root@admin apps]# mv NAMD_2.9_Linux-x86_64-multicore/ namd
18. Add namd to the PATH environment variable, and also add it to Path.sh so that the compute nodes pick it up
[root@admin maui-3.3.1]# vi /etc/profile
export PATH=/share/apps/openmpi/bin:/usr/local/maui/bin:/share/apps/namd:$PATH
[root@admin maui-3.3.1]# source /etc/profile
[root@admin scripts]# which namd2
/share/apps/namd/namd2
[root@admin scripts]# cat Path.sh
#!/bin/bash
grep openmpi /etc/profile || cat >>/etc/profile <<EOF
export PATH=/share/apps/openmpi/bin:/share/apps/namd:\$PATH
EOF
[root@admin scripts]#
The admin node configuration is now complete
Part 3: Configure torque on the compute nodes
1. Install torque on the compute nodes
[root@admin ~]# for i in 1 2; do ssh node$i sh /share/source/torque-3.0.6/install.sh; done
2. Copy the mom configuration file to the compute nodes
[root@admin ~]# for i in 1 2; do scp /var/spool/torque/mom_priv/config node$i:/var/spool/torque/mom_priv/; done
3. Copy the mom init script to the compute nodes, start the pbs_mom service, and enable it at boot
[root@admin ~]# for i in 1 2; do scp /etc/init.d/pbs_mom node$i:/etc/init.d/; done
[root@admin ~]# for i in 1 2; do ssh node$i /etc/init.d/pbs_mom start; done
Starting TORQUE Mom: [ OK ]
Starting TORQUE Mom: [ OK ]
[root@admin ~]# for i in 1 2; do ssh node$i chkconfig pbs_mom on; done
4. Set the environment variables
[root@admin ~]# for i in 1 2; do ssh node$i sh /share/scripts/Path.sh; done
5. Check that the environment variables are set correctly
[root@admin ~]# for i in 1 2; do ssh node$i which mpirun; done
/share/apps/openmpi/bin/mpirun
/share/apps/openmpi/bin/mpirun
[root@admin ~]# for i in 1 2; do ssh node$i which namd2; done
/share/apps/namd/namd2
/share/apps/namd/namd2
[root@admin ~]#
6. Check the compute node state again: it has changed to free, which means jobs can now be submitted to the compute nodes
[root@admin apps]# pbsnodes -a
node1
     state = free
     np = 1
     ntype = cluster
     status=rectime=1408751492,varattr=,jobs=,state=free,netload=12996103,gres=,loadave=0.01,ncpus=1,physmem=1024932kb,availmem=2082428kb,totmem=2165536kb,idletime=0,nusers=0,nsessions=0,uname=Linux node1 2.6.18-371.el5 #1 SMP Tue Oct 1 08:35:08 EDT 2013 x86_64,opsys=linux
     mom_service_port = 15002
     mom_admin_port = 15003
     gpus = 0
node2
     state = free
     np = 1
     ntype = cluster
     status=rectime=1408751482,varattr=,jobs=,state=free,netload=12983275,gres=,loadave=0.03,ncpus=1,physmem=1024932kb,availmem=2082444kb,totmem=2165536kb,idletime=0,nusers=0,nsessions=0,uname=Linux node2 2.6.18-371.el5 #1 SMP Tue Oct 1 08:35:08 EDT 2013 x86_64,opsys=linux
     mom_service_port = 15002
     mom_admin_port = 15003
     gpus = 0
[root@admin apps]#
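With more than a couple of nodes, the full pbsnodes listing gets long. The sketch below condenses it to one "name: state" line per node. It is a heuristic, not an official interface: node_states is a hypothetical filter that treats any single-word line as a node name and picks up the following "state = ..." attribute.

```shell
#!/bin/sh
# node_states (hypothetical filter)
# Reads "pbsnodes -a" output on stdin and prints "<node>: <state>" per node.
node_states() {
    awk '
        NF == 1 { node = $1 }                              # a lone word is a node name
        $1 == "state" && $2 == "=" { print node ": " $3 }  # its state attribute
    '
}

# Example:
#   pbsnodes -a | node_states
```

For the listing above, this would reduce the output to one line each for node1 and node2.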
Part 4: Verify that the parallel cluster works
1. Log in to the admin node as the linuxidc user created earlier. First set up passwordless access between the nodes, the same way as for root, except that the .ssh directory does not need to be copied (the home directory is already shared over NFS)
2. Copy the NAMD example apoa1 to the current directory
[linuxidc@admin ~]$ cp -r /share/apps/apoa1/ ./
3. Create the PBS script
[linuxidc@admin ~]$ touch test.pbs
The script contents are as follows
[linuxidc@admin ~]$ cat test.pbs
#!/bin/bash
#PBS -N linuxidcjob1
#PBS -j oe
#PBS -l nodes=2:ppn=1
NP=`cat $PBS_NODEFILE | wc -l`
echo "This job's id is $PBS_JOBID@$PBS_QUEUE"
echo "This job's workdir is $PBS_O_WORKDIR"
echo "This job is running on following nodes:"
cat $PBS_NODEFILE
echo "This job begins at:" `date`
echo
echo
cd $PBS_O_WORKDIR
mpirun -np $NP -machinefile $PBS_NODEFILE namd2 apoa1/apoa1.namd
echo
echo
echo "This job stops at:" `date`
[linuxidc@admin ~]$
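The script derives NP by counting the lines of $PBS_NODEFILE, which holds one line per allocated processor slot, so NP equals nodes × ppn (2 × 1 = 2 here). To make the difference between slots and distinct hosts explicit, here is a small sketch with two hypothetical helpers, count_slots and count_hosts, that take the node file path as an argument.

```shell
#!/bin/sh
# count_slots FILE: number of processor slots (one line per slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# count_hosts FILE: number of distinct hosts in the allocation.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job script (PBS_NODEFILE is set by TORQUE):
#   NP=$(count_slots "$PBS_NODEFILE")
#   echo "$NP slots on $(count_hosts "$PBS_NODEFILE") hosts"
```

With nodes=2:ppn=2, for instance, the file would list each host twice, giving 4 slots on 2 hosts.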
4. Submit the job
[linuxidc@admin ~]$ qsub test.pbs
5. Check the job status
[linuxidc@admin ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.admin                   linuxidcjob1     linuxidc               0 R batch
[linuxidc@admin ~]$ qstat -n
admin:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1.admin              linuxidc batch    linuxidcjob1       6676     2   2    --  24:00 R   --
   node2/0+node1/0
[linuxidc@admin ~]$
The output above shows that the job is running on node1 and node2
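The S column holds the job state (for example Q for queued, R for running, C for completed). For scripting, the state of a single job can be pulled out of the default qstat listing. A minimal sketch, assuming the default column layout in which the state is the second-to-last field; job_state is a hypothetical filter.

```shell
#!/bin/sh
# job_state JOBID (hypothetical filter)
# Reads default "qstat" output on stdin and prints the state letter of JOBID.
job_state() {
    awk -v id="$1" '$1 == id { print $(NF - 1) }'
}

# Example:
#   qstat | job_state 1.admin
```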
With that, the Linux parallel cluster setup is complete