Oracle 19.8新特性"asmcmd credverify" and "asmcmd credfix"
平台:Redhat Linux 7.6
数据库版本:Oracle 19.10
问题描述:进行OCR磁盘替换DD破坏实验,并删除原有的OCR磁盘组。中间过程中,因spfile文件以及asm密码文件没有进行迁移新的OCR磁盘。导致两个节点crs集群的其中一个节点只能手工进行启动。
注:模拟破坏OCR磁盘过程中,发现OCR备份在OCR磁盘组中,一旦OCR发生损坏,恢复过程中无法读取OCR磁盘组中的OCR备份。无法按照常规restore进行还原。(参考:2717306.1 2256199.1 1929673.1 2264675.1)
1.通过以下命令查看集群状态,发现组件crsd异常、ora.storage异常,其余都是正常的
crsctl stat res -t -init
2.通过检查crsd日志发现问题
该问题是此次事故的第一个具体原因。
3.通过查看gpnp,来进一步观察spfile文件位置
通过以下可以看到文件位置是正常的。
4.通过init.ora文件将asm实例拉起
[grid@bjyctzdb10 ~]$ cat init.ora
+ASM1.__large_pool_size=20971520
+ASM2.__large_pool_size=20971520
+ASM1.__oracle_base=’/u02/app/grid’#ORACLE_BASE set from in memory value
+ASM2.__oracle_base=’/u02/app/grid’#ORACLE_BASE set from in memory value
+ASM1._asm_max_connected_clients=4
+ASM2._asm_max_connected_clients=4
*.asm_diskgroups='ARCH’,'DATA’,'OCR_NEW’#Manual Mount
.asm_diskstring=’/dev/oracleasm/disks/’
*.asm_power_limit=1
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE’
SQL> startup pfile=’/home/grid/init.ora’;
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL>
5.查看集群中spfile位置描述
#从这里发现并未指定spfile文件位置,可能是该原因找不到spfile文件,导致集群无法启动。
SQL> show parameter spfile
NAME TYPE VALUE
-------------------------------------------
spfile string
指定spfile文件位置。
SQL> alter system set spfile=’+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999’;
System altered.
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string +OCR_NEW/bjyctzd-cluster/ASMPA
RAMETERFILE/registry.253.10727
95999
6.重启集群,再次观察–需要重启asm实例才可以
和之前问题一样。
7.再次使用init.ora去启动asm磁盘
SQL> startup pfile=’/home/grid/init.ora’;
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string
SQL> alter system set spfile=’+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999’;
System altered.
SQL> create pfile=’/home/grid/init.ora’ from spfile;
File created.
SQL> shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string +OCR_NEW/bjyctzd-cluster/ASMPA
RAMETERFILE/registry.253.10727
95999
SQL>
8.再次重启集群还是如上问题,asm实例需要手工启动、asm实例启动之后crsd集群组件才可以启动,通过查看日志发现问题
9.通过命令查看asm密码文件位置
发现密码文件和密码文件备份都在之前的OCR磁盘中,但是OCR已经dismount。
[grid@bjyctzdb10 ~]$ srvctl config asm
ASM home:
Password file: +OCR/orapwASM
Backup of Password file: +OCR/orapwASM_backup
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb10 ~]$
修改asm密码文件位置以及备份位置。
[root@bjyctzdb10 ~]# srvctl modify asm -pwfile OCR_NEW
[root@bjyctzdb10 ~]# srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: +OCR/orapwASM_backup
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[root@bjyctzdb10 ~]#
[root@bjyctzdb10 ~]# srvctl modify asm -pwfilebackup ocr_new
[root@bjyctzdb10 ~]# srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[root@bjyctzdb10 ~]#
10.再次尝试关闭集群以及启动集群
11.通过查看日志,还是因为无法找到密码文件打开asm磁盘
12.进一步观察asm密码文件
从这里我们可以看到Password file:后面没有跟具体的密码文件。
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb09 ~]$ exit
13.进入asmcmd,查看是否有密码文件
从下面我们可以发现没有asm密码文件。
ASMCMD> cd ocr_new
ASMCMD> ls
bjyctzd-cluster/
14.重新生成一个密码文件
提示已经存在一个密码文件。
[grid@bjyctzdb09 ~]$ orapwd file=’+ocr_new/orapwasm’ asm=y
Enter password for SYS:
OPW-00010: Could not create the password file. This resource has a Password File.
通过这个,我们可以看到密码文件是OCR_NEW。
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb09 ~]$ exit
15.根据上面提示,我们重新将asm密码文件位置设置为空
[grid@bjyctzdb09 ~]$ srvctl modify asm -pwfile
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file:
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
16.重新生成密码文件,并再次查看密码文件位置
这是我们可以发现,对应的密码文件节点1和节点2已经存在。
[grid@bjyctzdb09 ~]$ orapwd file=’+ocr_new/orapwasm’ asm=y
Enter password for SYS:
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: +OCR_NEW/orapwasm
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb10 ~]$ srvctl config asm
ASM home:
Password file: +OCR_NEW/orapwasm
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
这是我们发现我们的密码文件已经完整的回来了。
17.再次重启集群,观察是否可以自动将crsd资源拉起来
失败,和之前的失败原因是一样的。
18.本次具体的报错信息
alert.log日志报错。
2021-05-17 19:09:52.326 [CRSD(123152)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:52.533 [CRSD(123176)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123176
2021-05-17 19:09:58.341 [CRSD(123176)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:58.346 [CRSD(123176)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:58.555 [CRSD(123220)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123220
ohasd_orarootagent_root.trc报错信息。
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Error [kgfoAl06] in [kgfokge] at kgfo.c:3180
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] ORA-01017: invalid username/password; logon denied
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Category: 7
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] DepInfo: 1017
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] – trace dump end –
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isRunning is reset to false here
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isFinished set to true
2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask m_cv.timewait returned
2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask(kgfoCheckMountExt) executed in 3 seconds
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::parsekgforetcodes retcode = 7, kgfoCheckMount(OCR_NEW), flag 2
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (null) category: 7, operation: kgfoAl06, loc: kgfokge, OS error: 1017, other: ORA-01017: invalid username/password; logon denied
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 260 kgfo returncode 1
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (:CLSN00140:)StorageAgent::check 300 parsekgforretcodes OCR dgName OCR_NEW state 1
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] Storage::start waiting for check to not return PARTIAL or UNPLANNED_OFFLINE 1
2021-05-17 18:58:57.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 160 NODEROLE_HUB getOCRdetails
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread constructor exit this:9c07c820 m_pThnd:0 m_thndMX:9c07c840, m_tintMX:9c07c890 &m_postMX:0x7f0a9c07c870
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets START
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor entry { this:0x7f0a9c03a8e0 m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900, m_tintMX:0x7f0a9c03a950 &m_postMX:0x7f0a9c03a930
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor stop
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop { m_pThnd:0x7f0a9c0851f0 tid:2394859264 running:0 alive:0
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread is not running:0 or alive:0
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread acquire m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread sltstjn
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop release m_thndMX:0x7f0a9c03a900 m_tid:2394859264}
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor m_pThnd:(nil) exit }
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets END
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start { acquire thndMX:9c07c840
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawn pThnd:0x7f0a9c04dff0 thndType:1
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start thread spawned tid:2394859264
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawned release thndMX:9c07c840 }
2021-05-17 18:58:57.799 : CLSNS:2394859264: clsns_SetTraceLevel:trace level set to 1.
2021-05-17 18:58:57.827 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] 9607 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS
19.查看asm实例中的用户名以及权限信息
从这里发现此次事故的第二个具体原因。
[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
对照正常的asm实例用户如下:(12c之后的版本都是如下–11g版本没有验证)。
20.添加对应的用户,并授予对应的权限
[root@bjyctzdb09 dbs]# asmcmd orapwusr --add ASMSNMP
Enter password: *******
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba ASMSNMP
[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
[root@bjyctzdb09 dbs]# orapwusr --add CRSUSER__ASM_001
-bash: orapwusr: command not found
[root@bjyctzdb09 dbs]# asmcmd orapwusr --add CRSUSER__ASM_001
Enter password: ******
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba CRSUSER__ASM_001
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysasm CRSUSER__ASM_001
从这里我们可以发现,用户已经完全正确,且对应的权限也没有什么问题。
[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
CRSUSER__ASM_001 TRUE FALSE TRUE
21.再次重启crs,观察是否能够自动启动
报错,和之前一样的报错信息。
22.进一步查阅mos文档,发现最终问题
数据库版本19.8或者之后的ru补丁后的数据库,带来一个新特性"asmcmd credverify" and “asmcmd credfix” 命令来创建asm密码文件。
执行如下命令。该命令OCR、OLR和密码文件中是否一致,以及验证是否存在连接集群的凭证。
[grid@bjyctzdb10 ~]$ asmcmd --nocp credverify
credverify: Cannot connect using credentials for CRSUSER__ASM_001, please run 'credfix’ to fix the credentials.
根据上述提示,无法使用功能CRSUSER__ASM_001,并需要运行一下命令来修复凭证。
从下面可以看出删除了CRSUSER__ASM_001用户,并创建了新的用户CRSUSER__ASM_002,而且最后将连接集群的证书修复。
[grid@bjyctzdb10 ~]$ asmcmd --nocp credfix
credfix: Failed to connect using credentials for CRSUSER__ASM_001.
op=addcrscreds wrap=/tmp/creds0.xml
credfix: Creating new credentials, no valid credentials in OCR.
credfix: New user CRSUSER__ASM_002 created.
credfix: Starting SSH session on node bjyctzdb09.
credfix: OLR for bjyctzdb09 has been fixed if credentials were created incorrectly. Exiting SSH session.
op=credimport wrap=/tmp/creds0.xml olr=true force=true
credfix: OLR for bjyctzdb10 has been fixed if credentials were created incorrectly.
op=delcrscreds crs_user=CRSUSER__ASM_001
credfix: Deleted CRSUSER__ASM_001 from OCR.
credverify: starting SSH session on node bjyctzdb09
credverify: Credentials created correctly on bjyctzdb09. Exiting SSH session.
credverify: Credentials created correctly on bjyctzdb10.
credfix: Credentials have been fixed if they were created incorrectly
23.再次查看asm实例中的用户和密码
[grid@bjyctzdb10 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
CRSUSER__ASM_002 TRUE FALSE TRUE
24.再次重启集群,观察crs是否自动启动
到此,问题得到完美解决!
墨天轮原文链接:https://www.modb.pro/db/65395(复制到浏览器或者点击“阅读原文”立即查看)