IBM High Performance Computing Cluster Health Check - Important Points


1) An HPC cluster can seem intimidating, but many cluster problems are in fact easily resolved through careful verification steps.
2) Cluster health check tools mostly test individual components, plus a little point-to-point network performance.
3) The purpose of a High Performance Computing (HPC) cluster is to solve scalable problems in a shorter time through parallelism.
4) The goal of the verification (testing) stage is to gain confidence in the hardware and software before introducing them to users.
5) A healthy cluster is built from the bottom to the top. Therefore, you must first make sure that each single device works as expected before performing the next step. This approach leads to the pyramid model of verification shown in Figure 3-2 on page 34.
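
The bottom-up idea in point 5 can be sketched in a few lines of Python. This is only an illustration: the layer names and the placeholder checks are assumptions made up for the example and are not taken from the IBM material. The point is the ordering: a higher layer is not tested until every layer below it has passed.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a (name, check) pair; the check returns True when the layer is healthy.
Layer = Tuple[str, Callable[[], bool]]


def first_failing_layer(layers: List[Layer]) -> Optional[str]:
    """Run the checks from the bottom of the pyramid upward.

    Returns the name of the first layer whose check fails, or None when
    every layer passes. Layers above a failure are deliberately not run.
    """
    for name, check in layers:
        if not check():
            return name
    return None


# Hypothetical layers, ordered bottom to top. Real checks would query
# hardware, interfaces, and the fabric instead of returning constants.
example_layers: List[Layer] = [
    ("single node", lambda: True),
    ("node-to-node network", lambda: False),
    ("whole cluster", lambda: True),
]

print(first_failing_layer(example_layers))
```

Stopping at the first failure keeps the diagnosis simple, because a problem in a lower layer usually makes results from the layers above it unreliable.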


6) The InfiniBand software stack on the compute nodes is Mellanox OFED 1.5.3-4.0.22.3. The
fabric management is done by opensm-4.0.0 as provided by Mellanox OFED.
opensm - opensmd - InfiniBand subnet manager and administrator - runs on the master node or on a switch.
openib - openibd - runs on the compute nodes.
OFED - OpenFabrics Enterprise Distribution packages.
UFD - Unified Fabric Distribution - Administration Tools.
7) Bonding or ether channeling is the mechanism that is used to tie network interfaces together
so that they form a new device. The purpose is either to provide a higher level of redundancy,
or to increase the network throughput.
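
As a rough illustration of checking a bonded device, the sketch below parses status text in the format the Linux bonding driver exposes under /proc/net/bonding/<device> and lists the member interfaces that are not up. The device name bond0, the interface names, and the sample text are assumptions made for the example, and the exact fields vary between driver versions.

```python
from typing import Dict, List


def parse_bonding_status(text: str) -> Dict[str, str]:
    """Map each slave interface name to its MII status string.

    The bond-level "MII Status" line appears before any "Slave Interface"
    line, so it is skipped; only per-slave status lines are recorded.
    """
    slaves: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            slaves[current] = line.split(":", 1)[1].strip()
            current = None
    return slaves


def degraded_slaves(status: Dict[str, str]) -> List[str]:
    """Return the slave interfaces whose MII status is anything but 'up'."""
    return [name for name, state in status.items() if state != "up"]


# Shortened, made-up sample in the style of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

print(degraded_slaves(parse_bonding_status(SAMPLE)))
```

On a real node the text would come from reading the file, for example open("/proc/net/bonding/bond0").read(), assuming the bond is named bond0.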
8) The ibdiagnet tool is one of the most important tools for checking fabric health. It is used to check the
routing, and is good for finding credit loops,
especially those caused by mis-wires or poor topology design. It can also help find slow links.
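
A small wrapper can make ibdiagnet output easier to scan. The sketch below assumes that ibdiagnet is on the PATH and that it prefixes error and warning lines with -E- and -W-. Both points should be checked against the installed version, since the output format differs between releases. The sample text is made up for the example and is not real tool output.

```python
import shutil
import subprocess
from typing import Dict, Optional


def summarize_ibdiagnet(output: str) -> Dict[str, int]:
    """Count lines that start with the assumed -E- and -W- prefixes."""
    counts = {"errors": 0, "warnings": 0}
    for line in output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("-E-"):
            counts["errors"] += 1
        elif stripped.startswith("-W-"):
            counts["warnings"] += 1
    return counts


def run_ibdiagnet() -> Optional[str]:
    """Run ibdiagnet with no options and return its standard output.

    Returns None when the tool is not installed on this machine.
    """
    exe = shutil.which("ibdiagnet")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return result.stdout


# Made-up lines that only imitate the assumed prefix convention.
SAMPLE_OUTPUT = """\
-I- example informational line
-W- example warning line
-E- example error line
-E- another example error line
"""

print(summarize_ibdiagnet(SAMPLE_OUTPUT))
```

A non-zero error count is only a hint to read the full report; the counts do not replace the details that ibdiagnet writes about routing and credit loops.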
