HACMP 5.X New Features and Routine Maintenance
He Bing
hebing@cn.ibm.com

Contents
HACMP concepts review
New features in HACMP 5.x
Common HA architectures
Routine administration

Although Hardware is Now Very Reliable, Hardware Failures Account for a Small Minority of System Outages
Several studies place the hardware share of outages between 20% and 45%
Human error, software error and planned maintenance cause the majority of service outages

HACMP (High Availability Cluster Multi-Processing)
Why is high availability needed?
What is HACMP?
High Availability:
Maximize system availability (uptime)
Minimize system downtime
Multi-Processing:
Each node in a cluster can run multiple applications
Data can be shared or accessed concurrently
The purpose of HACMP
Eliminate single points of failure (SPOF) to achieve high availability
High availability is fault resilient, not fault tolerant

High Availability and Fault Tolerance

Software Layers on a HACMP node
Application
Uses the services made highly available by HACMP
HACMP
Makes services highly available for applications
Co-ordinates resource availability through the cluster
RSCT
Provides reliable communication between nodes
Co-ordination of subsystems
AIX
Operating system services
LVM
Logical storage management
TCP/IP
Manages communications at a logical layer

Environments Not Suited to HACMP
You cannot suffer any downtime
Failovers will cause at least some downtime
Your environment is not stable
HACMP depends on stable software levels and stable configuration
HACMP is susceptible to the “fiddle factor”
Your application needs manual intervention to recover from a failure
Manual reset of a device, etc.

Considerations for Using HACMP
Application must be able to recover from a stop/restart operation
Must release all resources when stopped—either normally or abnormally
Must tolerate a loss of memory contents
Must tolerate a loss of processor state
Must perform a restart from a checkpoint
Must recover from partial data writes
Must operate in a “transactional” protocol
There must not be a single point of failure in the HA cluster
Shared power supply, non-protected disk, etc.
HACMP is a software solution
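
Two of the requirements above (restart from a checkpoint, recover from partial data writes) can be illustrated with a minimal Python sketch. It is not part of HACMP, and the function names `save_checkpoint` and `load_checkpoint` are hypothetical. The sketch assumes the checkpoint lives on a POSIX file system, where a rename within one directory is atomic, so a node failure in the middle of a write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        # Do not leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last complete checkpoint, or default on a first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

An application start script would call `load_checkpoint` first and resume from the returned state, whichever node it is started on, provided the checkpoint file is on a shared volume group.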

HACMP V5.1 New Features (1/3)
This is the version that introduced major changes, from configuration simplification and performance enhancements to changing HACMP terminology.
Some of the important new features in HACMP V5.1 were:
1. HACMP “classic” (HAS) has been dropped; only HACMP/ES is available, based on IBM Reliable Scalable Cluster Technology
2. SMIT “Standard” and “Extended” configuration paths (procedures)
3. Automated configuration discovery
4. Custom resource groups

HACMP V5.1 New Features (2/3)
5. Non-IP networks based on heartbeating over disks
6. Fast disk takeover
7. Forced varyon of volume groups
8. Heartbeating over IP aliases
9. Improved security, by using the cluster communication daemon (eliminating the need for AIX “r” commands, and thus for the /.rhosts file)
10. Improved performance for cluster configuration and synchronization
11. Normalization of HACMP terminology (aligning it with other HA products)
12. Simplification of configuration and maintenance

HACMP V5.1 New Features (3/3)
13. Online Planning Worksheets enhancements
14. Various C-SPOC enhancements
15. GPFS integration

HACMP V5.2 New Features (1/2)
Introduced in July 2004, HACMP V5.2 added further improvements in configuration simplification, automation, and performance. Here is a summary of the improvements in HACMP V5.2:
1. Two-Node Configuration Assistant, with both SMIT menus and a Java™ interface (in addition to the SMIT “Standard” and “Extended” configuration paths)
2. File collections
3. User password management
4. Classic resource groups are no longer used; they are replaced by custom resource groups

HACMP V5.2 New Features (2/2)
5. Automated test procedures
6. Automatic cluster verification
7. Improved Online Planning Worksheets (OLPW) can now import a configuration from an existing HACMP cluster
8. Event management (EM) has been replaced by the Resource Monitoring and Control (RMC) subsystem (standard in AIX)
9. Enhanced security
10. Resource group dependencies
11. Self-healing clusters (correcting certain cluster configuration errors)
12. HACMP Smart Assist for WebSphere® Application Server

HACMP V5.3 New Features (1/3)
Starting in July 2005, HACMP V5.3 continued the development of HACMP by adding further improvements in management, configuration simplification, automation, and performance. Here is a summary of the improvements in HACMP V5.3:
1. Cluster verification at cluster
Additional corrective actions taken during verification
clverify warns of recognizable single points of failure
clverify integrates HACMP/XD options: PPRC, GeoRM, GLVM
clverify automatically creates the clhosts.client file, to be used as the prototype of the clhosts file on client nodes

HACMP V5.3 New Features (2/3)
2. XML file format for OLPW files, and the ability to convert existing snapshot files into XML cluster configuration files
3. OEM volume and file system support (Veritas Volume Manager, Veritas File System)
4. Further integration of HACMP with RSCT
5. More “Smart Assist” options: DB2® and Oracle Application Server
6. Removal of certain site-related restrictions from HACMP
7. Location dependency added for resource groups
8. Distribution preference for the IP service aliases
9. WebSMIT security improved by:
Client data validation before any HACMP commands are executed
Server-side validation of parameters
WebSMIT authentication tools integrated with the AIX authentication mechanisms

HACMP V5.3 New Features (3/3)
10. Cluster manager (clstrmgrES) daemon running at all times (regardless of cluster status, up or down) to support further automation of cluster configuration and enhanced administration
11. Cluster multi-peer extension daemon (clsmuxpdES) and cluster information daemon (clinfoES) changed
12. The Cluster Lock Manager (cllockd or cllockdES) is no longer supported as of HACMP 5.2. During node-by-node migration, it is uninstalled. Installing HACMP 5.2 or 5.3 removes the Lock Manager binaries and definitions
13. In order to improve HACMP security, all HACMP ODMs will be owned by root, group hacmp. Group "hacmp" is created if it does not already exist
14. The command line utilities cldiag and clverify are removed. All functionality is available from SMIT in HACMP 5.3

HACMP 5.4 New Features (1/4): Simpler
1. Expanded and standardized Smart Assist Framework for Automatic Application Discovery and Configuration. Has a common look-and-feel with other Smart Assistants and allows for simpler management of clusters with selected applications
2. Enhanced Smart Assist for Oracle provides assistance to those involved with the installation of Oracle Application Server and/or Oracle Database Manage
3. Improved and extended WebSMIT provides an easier-to-use GUI with enhanced functionality for easier management of the HACMP cluster
4. Enhanced Cluster Test Tool provides several additional test scenarios, enabling more thorough validation of the cluster configuration

HACMP 5.4 New Features (2/4): Simpler (Cont)
5. Manual Resource Group Movement Enhancements for better usability
6. Enhancements to “Forced Down” enable customers to better manage scheduled maintenance and unscheduled downtime
7. Customers will be able to upgrade to new PTF levels and new releases without disrupting their application service
8. Customers are now able to start cluster services without disrupting their application service, thereby allowing better quality of service

HACMP 5.4 New Features (3/4): Faster & Smarter
9. Cluster Verification facilities continue to be expanded, to better help customers prevent problems before they occur
10. Fast Failure Detection allows for faster detection of node failures on certain cluster configurations
11. HACMP now supports Linux on System p hardware with a selected feature set, enabling customers to utilize the proven capabilities of HACMP in a Linux environment

HACMP 5.4 New Features (4/4): Goes the Distance
12. HACMP/XD GLVM enhancements allow customers to utilize up to four data mirroring networks, as well as to use Enhanced Concurrent Volume Groups with GLVM
13. Support for GPFS 2.3, to allow customers to use a cluster-wide file system
14. IPAT support on Geographic Networks to allow for better utilization of network resources
15. HACMP/XD for MetroMirror support leverages customers’ storage and MetroMirror investment for business resilience

Common HA Architectures
Two-node topology
Two-node resource groups
Two-node takeover
Multi-node architectures
Other high-availability system architectures

Two-Node HACMP Topology Diagram
[Figure: two pSeries cluster nodes serving network clients over an IP network through service and standby network adapters, attached to a shared disk, with IP heartbeats and a serial heartbeat between the nodes]

Cluster Nodes
Since the cluster is treated as a single entity, we refer to the individual computers as nodes.
Each node is an independent system
Inter-node communication is defined when the cluster is initialized.

Service IP aliases
"Service Address" or "Service Label" is the connection to the computer
AIX allows many addresses on a single adapter
Does not affect the original configuration
Allows separation of services
Faster to move if necessary

IP Address Takeover (IPAT) Method 1: Replacement
[Figure: addresses on Node A and Node B by adapter type (Boot / Service, Standby) at system boot, with HACMP running, after adapter failure, and after failure]
Two logical IP networks (netmask 255.255.255.0)
One physical network
Clients always access 192.168.0.6
MAC address takeover or ARP cache update is also needed

IP Address Takeover (IPAT) Method 2: Aliasing
[Figure: addresses on Node A and Node B at system boot, with HACMP running, after adapter failure, and after failure]
Initially configured addresses (Boot IP)
Persistent IP addresses: useful for applications like Tivoli
Service IP addresses: used by clients to access the cluster; multiple are allowed

Persistent Node IP Label
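An IP alias that can be assigned to a specific node in the cluster
Always stays on the same node
Can reside on a network adapter that already holds a service or non-service IP label
Requires no additional physical network adapter on the node
Does not belong to any resource group
Can be used to administer a specific node
Only one can be configured per node
Available as soon as the node has booted, and remains available after HACMP services are stopped
If a network adapter fails, it moves only to another adapter on the same network on the same node
If the node fails, the label does not move to another node in the cluster

Heartbeat via Disk
New in HACMP 5.x
Can use any of the following shared disk technologies: Fibre Channel, SCSI, or SSA
The disk used is part of an enhanced concurrent volume group; the only requirement is that this VG is defined on both nodes

How Volume Groups are Handled
Two types: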
Shared
Non-shared
Shared volume groups can "migrate"
Non-Shared volume groups are node bound
Application data must be on a shared volume group to be "moved"
Application code may be on either type of disk

Application Server Scripts
"Application server" is the name given to a series of scripts:
Start the application
Stop the application
Monitor the application (optional)
Re-start the application (optional)
Applications must be able to be started from a previously unknown state by a script
Applications must be able to be stopped by a script

Resource Groups
Logical constructs that group related attributes together
The "container" used by HACMP to "move" resources
Participating node list
Default node priorities
Home node
Have policies on:
Start up
Fall over
Fall back
Distribution policy
Dependent resource groups

Resource Group Policies: Startup
Resource group startup occurs:
During initial cluster startup
On initial acquisition of the resource group
May be modified by a "settling" timer
Online on Home Node Only (OHNO)
The resource group starts only on the highest priority node
Online on First Available Node (OFAN)
The resource group starts on any one node
Online on All Available Nodes (OAAN)
The resource group starts on all nodes
Online Using Distribution Policy (OUDP)
One resource group per network or node, depending on the distribution policy

Resource Group Policies: Fallover
Resource group fallover occurs:
When the current node can no longer support the resource group and it is "moved" to another node
A failure has occurred
Graceful shutdown with takeover of the current node
Fallover to Next Priority Node (FNPN)
Resource group is moved to the next node in the resource group's node list
Fallover using Dynamic Node Priority (FDNP)
Resource group is moved to the next node in the resource group's node list, as recalculated based on the dynamic node priority policy
Bring Offline on Error Node (BOEN)
Resource group is set to an offline state on this node only

Resource Group Policies: Fallback
Resource group fallback occurs when:
The resource group is not on its home node
A higher priority node becomes available
Can be modified by a fallback timer
Fallback to a Higher Priority Node (FHPN)
When the higher priority node is available and/or the optional timer expires, the resource group moves
Never Fallback (NFB)
The resource group does not move, even if a higher priority node becomes available
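
The startup, fallover, and fallback policies can be modeled with a short Python sketch. This is an illustration only, not HACMP code: the function names are hypothetical, the settling and fallback timers are ignored, OUDP and FDNP are left out because they depend on criteria not described here, and "next node" under FNPN is assumed to mean the highest priority node in the list that is still available.

```python
def startup_nodes(node_list, joined, policy):
    """Nodes that bring the group online. node_list is in priority order
    (home node first); joined lists active nodes in the order they joined."""
    up = [n for n in joined if n in node_list]
    if policy == "OHNO":   # Online on Home Node Only
        return [node_list[0]] if node_list[0] in up else []
    if policy == "OFAN":   # Online on First Available Node
        return up[:1]
    if policy == "OAAN":   # Online on All Available Nodes
        return [n for n in node_list if n in up]
    raise ValueError(f"unsupported startup policy: {policy}")


def fallover_target(node_list, failed_node, available, policy="FNPN"):
    """Node that takes over after failed_node fails, or None if the group goes offline."""
    if policy == "BOEN":   # Bring Offline on Error Node
        return None
    if policy != "FNPN":
        raise ValueError(f"unsupported fallover policy: {policy}")
    for node in node_list:
        if node != failed_node and node in available:
            return node
    return None


def fallback_target(node_list, current, available, policy="FHPN"):
    """Node the group should run on once membership changes."""
    if policy == "NFB":    # Never Fallback
        return current
    for node in node_list:  # FHPN: first available node ranked above current
        if node == current:
            break
        if node in available:
            return node
    return current
```

For a node list `["A", "B", "C"]`, `fallover_target(nodes, "A", {"B", "C"})` returns `"B"`, and once A rejoins, `fallback_target(nodes, "B", {"A", "B"})` returns `"A"` under FHPN but `"B"` under NFB.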

HACMP Resource Group (Online on Home Node Only)
Startup: Online on Home Node Only
Fallover: Fallover to Next Priority Node
Fallback: Fallback to a Higher Priority Node

HACMP Resource Group (Online on First Available Node)
Startup: Online on First Available Node
Fallover: Fallover to Next Priority Node
Fallback: Never Fallback

HACMP Resource Group (Online on All Available Nodes)
Startup: Online on All Available Nodes
Fallover: Bring Offline on Error Node
Fallback: Never Fallback

Failover possibilities

Three-node Mutual Takeover Cluster
Increased resiliency vs. 2-node cluster
Redundant connections to storage and networks
Server capacity must be sized to handle additional workload in failover scenarios
Ideally, each node should be sized to run all workloads (in case 2 of 3 nodes failed)
Some increase in complexity of cluster configuration and management

“n + 1” HA Cluster
Increased resiliency vs. 2-node cluster
Some efficiency gain (only one server “on standby”)
Server capacity must be considered
Ideally, Server D capacity should be sized to handle all workloads from Servers A, B, and C
Some clients size Server D smaller; assuming that risk of Servers A, B, and C all failing at once is small
Some increase in complexity of cluster configuration and management
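
The sizing trade-off for the standby server can be made concrete with a small Python sketch. The function name `standby_covers` and the capacity figures are hypothetical; capacity is expressed in arbitrary units (for example, CPU cores), and the check considers only the worst case in which the largest workloads fail at the same time.

```python
def standby_covers(workloads, standby_capacity, failures):
    """True if the standby server can absorb the `failures` largest workloads at once."""
    worst_case = sorted(workloads.values(), reverse=True)[:failures]
    return sum(worst_case) <= standby_capacity


# Hypothetical workloads for Servers A, B, and C, in capacity units.
workloads = {"A": 4, "B": 6, "C": 2}

# A standby sized for every workload survives all three servers failing together.
full_size = standby_covers(workloads, standby_capacity=12, failures=3)

# A smaller standby covers any single failure but not two large ones at once.
single_ok = standby_covers(workloads, standby_capacity=6, failures=1)
double_ok = standby_covers(workloads, standby_capacity=6, failures=2)
```

Sizing Server D for a single failure accepts the risk described above: the cluster stays protected only as long as no more than one primary server is down at a time.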

HA Clustering and Virtualization
Two physical servers are still needed, so that the server itself is not a SPOF
Base example shown at right:
Two-node clusters configured to failover across physical boundaries
Distribute primary nodes evenly across servers so that single server failure results in failover of only 50% of primary nodes
Other common cluster configurations can also be used in a virtual environment

GLVM Architecture
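
Routine Administration
Version management
Installation and configuration
Key testing points
Log management
DARE
C-SPOC
Parameter tuning
NFS & HACMP
Common commands

HACMP Software Planning: Hardware Platform
POWER5+ servers supported by HACMP V5.2, V5.3, and V5.4, and by HACMP for Linux V5.4: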
IBM System p5 505 and 505Q (9115-505)
IBM System p5 510 and 510Q (9110-510 and 9110-51A)
IBM System p5 520 and 520Q (9111-520 and 9131-52A)
IBM System p5 550 and 550Q (9113-550 and 9133-55A)
IBM System p5 560Q (9116-561)
IBM System p5 570 (9117-570)
IBM System p5 590 and 595 (9119-590 and 9119-595)
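IBM System p5 285 (9111-285)

HACMP Software Planning: System Software
Operating system version and fix level requirements
Information:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347
Fix downloads:
http://www-912.ibm.com/eserver/support/fixes/fcgui.jsp

HACMP Support for POWER6

HACMP Software Planning: Application Software
In general, the application software involved in a cluster should be at the same version throughout, which makes it easier to manage.
Because HACMP places no strict restrictions on application software, users can choose which applications to add to the cluster according to their needs, and manage them with their own scripts.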

HACMP Software Installation
Components to install
Operating system fixes
HACMP software
HACMP software fixes
Installation methods
NIM
CD-ROM
Local hard disk

HACMP Software Configuration Process
Preparation before configuring HACMP
Configure IP addresses
Edit the /etc/hosts file
Write the application start/stop scripts
Create the VGs and file systems
Prepare the serial devices and disk heartbeat devices
HACMP Standard configuration path
Add the cluster and nodes
Configure cluster resources
Create cluster resource groups
Synchronize the HACMP configuration
HACMP Extended configuration path
Add heartbeat networks
Customize cluster resources

HACMP Configuration Menus
smitty hacmp: configuration and management
[Screenshots: SMIT menus for Extended Configuration (items 1 to 3), Extended Topology Configuration (1.1 to 1.5), Extended Resource Configuration (2.1 to 2.2), Extended Resources Configuration (2.1.1 to 2.1.2), and Extended Resource Group Configuration (2.2.1 to 2.2.2)]

Starting and Stopping HACMP Services
Starting HACMP services (V5.4)
Stopping HACMP services (V5.4)
Graceful down
Take over
Force down

Verifying That Cluster Services Have Stopped

HACMP Troubleshooting Essentials
Cluster log files
Cluster daemons
Monitoring the cluster:
clstat/xclstat
Check log files
Check daemons with lssrc -g cluster or ps -ef
lsvg -o
ifconfig -a
netstat -in
lslpp -l cluster.*
Deadman switch
Apply patches
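
The daemon check in the list above can be scripted. The following Python sketch runs `lssrc -g cluster` and maps each subsystem to its status. The column layout (Subsystem, Group, PID, Status, with the PID column empty for inoperative subsystems) is an assumption based on typical AIX output, the PID in the sample is made up, and the function names are hypothetical.

```python
import subprocess


def parse_lssrc(output):
    """Map subsystem name to status from `lssrc -g <group>` style output."""
    status = {}
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            # The status is the last column; inoperative subsystems show no PID.
            status[fields[0]] = fields[-1]
    return status


def cluster_daemons():
    """Run lssrc on an AIX cluster node and return the parsed statuses."""
    result = subprocess.run(
        ["lssrc", "-g", "cluster"], capture_output=True, text=True, check=True
    )
    return parse_lssrc(result.stdout)


# Illustrative output only; real subsystem lists and PIDs will differ.
SAMPLE = """\
Subsystem         Group            PID          Status
 clstrmgrES       cluster          204984       active
 clinfoES         cluster                       inoperative
"""

sample_status = parse_lssrc(SAMPLE)
```

A monitoring job could compare the returned dictionary against the set of daemons expected to be active on that node and raise an alert on any mismatch.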

HACMP-Related Log Files (1/7)
/tmp/clstrmgr.debug
Contains time-stamped, formatted messages generated by the clstrmgrES daemon. The default messages are verbose and are typically adequate for troubleshooting most problems; however, IBM Support may direct you to enable additional debugging.
Recommended Use: Information in this file is for IBM Support personnel.
/tmp/cspoc.log
Contains time-stamped, formatted messages generated by HACMP C-SPOC commands. The /tmp/cspoc.log file resides on the node that invokes the C-SPOC command.
Recommended Use: Use the C-SPOC log file when tracing a C-SPOC command’s execution on cluster nodes.
/tmp/emuhacmp.out
Contains time-stamped, formatted messages generated by the HACMP Event Emulator. The messages are collected from output files on each node of the cluster, and cataloged together into the /tmp/emuhacmp.out log file. In verbose mode (recommended), this log file contains a line-by-line record of every event emulated. Customized scripts within the event are displayed, but commands within those scripts are not executed.

HACMP-Related Log Files (2/7)
/tmp/hacmp.out
Contains time-stamped, formatted messages generated by HACMP scripts on the current day. In verbose mode (recommended), this log file contains a line-by-line record of every command executed by scripts, including the values of all arguments to each command. An event summary of each high-level event is included at the end of each event’s details.
Recommended Use: Because the information in this log file supplements and expands upon the information in the /usr/es/adm/cluster.log file, it is the primary source of information when investigating a problem.
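
When investigating a problem, it often helps to reduce the verbose log to its event boundaries first. The Python sketch below does this under one assumption: that events are marked in the log with `EVENT START:`, `EVENT COMPLETED:`, and `EVENT FAILED:` strings. The sample lines, node name, and function name are illustrative and not taken from a real cluster.

```python
import re

# Assumed marker format; adjust the pattern if the log on your release differs.
EVENT_RE = re.compile(r"EVENT (START|COMPLETED|FAILED):\s*(.+)")


def event_lines(log_text):
    """Return (kind, details) tuples for every event marker found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            events.append((match.group(1), match.group(2).strip()))
    return events


# Illustrative log excerpt.
SAMPLE = """\
Mar 12 10:15:02 EVENT START: node_up nodeA
+ some traced command output
Mar 12 10:15:09 EVENT COMPLETED: node_up nodeA 0
"""

sample_events = event_lines(SAMPLE)
```

On a cluster node, the same function could be applied to the contents of /tmp/hacmp.out to list which events ran and in what order before reading the full trace.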
Note: With recent changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In HACMP releases prior to 5.2, non-recoverable event script failures result in the event_error event being run on the cluster node where the failure occurred. The remaining cluster nodes do not indicate the failure. With HACMP 5.2 and up, all cluster nodes run the event_error event if any node has a fatal error. All nodes log the