案例 | 12c新特性引发的ora-01017故障问题解决方案

和记平台（OLM）

作者:王飞扬和记平台·北京

近日，某客户反映应用连接数据库异常，应用经常报错ora-01017错误，经了解，数据库版本为12.1.0.2.0RAC环境，操作系统为linuxRedhat 6.5，在询问了详细的异常情况后，得知错误是在应用连接时配置连接串为service name时导致，由于是生产系统，于是决定先做应急处理，让用户把连接串servicename改为sid方式，暂时指定到一个节点上，作为应急方案，不能影响用户使用。

远程连接到客户现场，先验证了几个自己的猜想，首先想到密码错误？用户密码大小写敏感？口令文件出问题了？TNS文件配置问题？……

在查看alert日志后，没有发现什么有价值的信息，随后继续查看监听日志，也没什么信息，而且监听文件内容较多不适合检索，在经过一系列的确认后，暂时未发现故障原因，于是决定手动重现异常现象，当我们使用sqlplus/ as sysdba本地连接时，两节点均不报错，但是使用sqlplus user/passwd@tnsname连接时，故障现象重现了，下面就是本人基于故障现象模拟重现了一次。

解决方案过程

基于应用描述情况，新建用户进行手动测试，测试过程如下：

1、新建数据库测试用户test

2、建立本地tnsname文件，使用两节点的vip建立tnsname文件，大致如下

TEST =

(DESCRIPTION =

(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.3.88)(PORT=1521)))

(CONNECT_DATA =

(SERVER = DEDICATED)

(SERVICE_NAME = orcl)

3、使用sqlplus user/passwd@tnsname方式多次测试连接，故障重现，故障现象为偶发报错ORA-01017错误，并不是每一次都报错，这很奇怪，数据库偶尔可以登录，偶尔失败，且错误只在节点1上发生，节点2无异常现象。

$ sqlplus test/oracle@test -------本次登录正常

SQL*Plus: Release12.1.0.2.0 Production on Wed Jun 6 10:25:35 2018

Last Successful login time:Wed Jun 06 2018 10:25:33 +08:00

Connected to:

Oracle Database 12cEnterprise Edition Release 12.1.0.2.0 - 64bit Production

With the Partitioning, RealApplication Clusters, Automatic Storage Management, OLAP,

Advanced Analytics and RealApplication Testing options

SQL> exit

登陆失败

报错 ora-01017

$ sqlplus test/oracle@test -------再次登录失败

SQL*Plus: Release12.1.0.2.0 Production on Wed Jun 6 10:25:38 2018

ERROR:

ORA-01017: invalidusername/password; logon denied

Enter user-name: ^C

冷静下来仔细思考一番，在排除其他问题原因后，再一次检查监听状态和数据库资源状态，其中检查监听状态时发现，节点1监听状态如下：

$ lsnrctl status

…….

Service "+ASM"has 1 instance(s).

Instance "+ASM1", status READY, has1 handler(s) for this service...

Service"-MGMTDBXDB" has 1 instance(s).

Instance "-MGMTDB", status READY,has 1 handler(s) for this service...

Service "_mgmtdb"has 1 instance(s).

Instance "-MGMTDB", status READY,has 1 handler(s) for this service...

Service "orcl"has 2 instance(s).

Instance "-MGMTDB", status READY,has 1 handler(s) for this service...

Instance "orcl1", status READY, has1 handler(s) for this service...

Service "orclXDB"has 1 instance(s).

Instance "orcl1", status READY, has1 handler(s) for this service...

The command completedsuccessfully

对比节点2监听状态：

Services Summary...

Service "+ASM"has 1 instance(s).

Instance "+ASM2", status READY, has1 handler(s) for this service...

Service "orcl"has 1 instance(s).

Instance "orcl2", status READY, has1 handler(s) for this service...

Service "orclXDB"has 1 instance(s).

Instance "orcl2", status READY, has1 handler(s) for this service...

The command completedsuccessfully

突然发现了之前没有注意的一个地方，就是节点1比节点2多出一个实例-MGMTDB，问了一下客户，这个是什么，客户说他们也不清楚，不是他们的东西，但是仔细一想，它们在同一个service下，如果连接被传递给mgmtdb实例的话，那么肯定会发生ora-01017。于是查阅相关资料，发现其与12c的新特性有关：原本在11g中由Berkeley DB管理的CHM repository改成了Oracle db管理：

官方解释如下：

MGMTDB is new databaseinstance which is used for storing Cluster Health Monitor (CHM) data. In 11gthis was being stored in berkley database but starting Oracle database 12c itis configured as Oracle Database Instance.

在了解了MGMTDB是什么后，猜想这个会不会是12c的bug，于是到mos上进一步查看相关文档，果然发现了如下bug：

MGMTDB registers Database Service (Doc ID 2063662.1)

GIMR (Management Database) Registers Into Same Service that the DatabaseInstance also registers On RAC (Doc ID 2024572.1)

文档中描述，该问题在数据库与cluster name同名时发生，会导致mgmtdb把自己注册到这个与cluster name同名数据库的default service下。

MGMTDB registers with default service which is same as the cluster name. If the database nameis same as the cluster name, MGMTDB registered to the service that isdatabase name because it is same as the cluster name.

于是和客户确认，其数据库名与Cluster name相同，均为orcl。

根据mos文档，GIMR (Management Database) Registers Into Same Service that the DatabaseInstance also registers On RAC (文档 ID2024572.1)

官方给出的解决方案为：

The following workaroundworked for some, so try the following workaround if changing the database nameis not feasible:

1)connect to MGMTDB

$ su - grid
$ export ORACLE_SID=-MGMTDB
$ sqlplus / as sysdba

将MGMTDB实例的local_listener参数设置成监听私网ip

2)modify local_listener ofMGMTDB
SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=<node1interconnect IP>)(PORT=<mgmtlsnr port number>))','(ADDRESS=(PROTOCOL=TCP)(HOST=<node2interconnect IP>)(PORT=<mgmtlsnr port number>))' scope=both;

于是按照MOS给出的解决方案，进行修改。

修改完成后，再次查看两节点监听状态：

节点1

Services Summary...

Service "+ASM"has 1 instance(s).

Instance "+ASM1", status READY, has1 handler(s) for this service...

Service "orcl"has 1 instance(s).

Instance "orcl1", status READY, has1 handler(s) for this service...

Service "orclXDB"has 1 instance(s).

Instance "orcl1", status READY, has1 handler(s) for this service...

The command completedsuccessfully

节点2

Services Summary...

Service "+ASM"has 1 instance(s).

Instance "+ASM2", status READY, has1 handler(s) for this service...

Service "orcl"has 1 instance(s).

Instance "orcl2", status READY, has1 handler(s) for this service...

Service "orclXDB"has 1 instance(s).

Instance "orcl2", status READY, has1 handler(s) for this service...

The command completedsuccessfully

可以看到此处，MGMTDB消失了，于是再次测试是否还会出现ora-01017问题，经测试，问题没有再出现，经检查，两节点状态均正常，于是告知客户，可以把连接串修改回service name，后经应用测试，原有连接方式可以正常使用，故障解决。

以上就是我本次故障处理的经过了，分享给大家希望可以给大家一些帮助。

微信图片_20180626150221.jpg

| 北京 | 上海 | 广州 | 成都 |

4008-906-960