作者:赵世雄 | 和记平台 · 广州分公司
我们通常碰到过的ORACLE数据库控制文件损坏情况有哪些呢?你是否碰到过controlfile sequence number值被耗尽而引起损坏的情况?下面我来跟大家讲述一个和记平台广州团队曾经碰到过的一个案例。
在2017年国庆小长假快结束的时候,和记平台广州团队接到了一个客户的紧急报障电话,客户电话里急促的说道:"我们数据库告警日志出现控制文件损坏的报错",我们的工程师立即远程连到客户系统,查看告警日志:
Completed checkpoint up to RBA [0x1dd32.2.10], SCN: 10184774533893
Fri Oct 06 12:08:31 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/xxx/trace/xxx_ora_37874.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Fri Oct 06 12:08:31 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_m000_32185.trc:
ORA-00235: control file read without a lock inconsistent due to concurrent update
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_37874.trc (incident=4273683):
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Incident
details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4273683/XXX_ora_37874_i4273683.trc
Fri Oct 06 12:08:32 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc (incident=4275459):
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Incident details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4275459/XXX_ora_70265_i4275459.trc
.........................内容较多,省略中间部分内容
.........................内容较多,省略中间部分内容
Fri Oct 06 13:19:00 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
告警日志里,我们看到ctrl01.ctl控制文件持续出现损坏报错,本以为用ctrl02.ctl控制文件copy一份就能修复ctrl01.ctl控制文件。心里嘀咕着:这是个小问题的时候,数据库突然宕掉了。
Fri Oct 06 13:19:00 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
LGWR (ospid: 49423): terminating the instance due to error 227
Fri Oct 06 13:19:15 2017
opiodr aborting process unknown ospid (71057) as a result of ORA-1092
Fri Oct 06 13:19:17 2017
System state dump requested by (instance=1, osid=49423 (LGWR)), summary=[abnormal instance termination].
Fri Oct 06 13:19:25 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:26 2017
opiodr aborting process unknown ospid (71043) as a result of ORA-1092
System State dumped to trace file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_diag_49394.trc
Fri Oct 06 13:19:27 2017
opiodr aborting process unknown ospid (71045) as a result of ORA-1092
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
opiodr aborting process unknown ospid (71047) as a result of ORA-1092
Fri Oct 06 13:22:47 2017
opiodr aborting process unknown ospid (71055) as a result of ORA-1092
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Instance terminated by LGWR, pid = 49423
心里一阵拔凉,问题没那么简单,难道ctrl02.ctl控制文件也损坏了?
确实正如我们工程师所料,使用ctrl02.ctl控制文件重启数据库的时候,报ctrl02.ctl也损坏。
此系统是客户的核心系统,先帮客户恢复系统再分析原因,于是利用备份的控制文件来修复控制文件。
$mv /database/XXX/XXX1/XXX/ctrl01.ctl /database/XXX/XXX1/XXX/ctrl01.ctl_bak
$mv /database/XXX/XXX2/XXX/ctrl02.ctl /database/XXX/XXX2/XXX/ctrl02.ctl_bak
RMAN>restore controlfile from '/backup/ctl.bak';
系统恢复后,开始查找原因。通常碰到控制文件损坏,我们的第一反应就是存储是不是出问题了。于是查看操作系统日志,没发现任何异常。在定位原因的过程时,控制文件再次损坏。我们通过把控制文件创建到本地硬盘、切换到备机、禁用HA Cluseter测试、断开容灾链路测试等一系列测试,问题依旧。在上面这些测试期间,反复跟原厂沟通确认是否是Bug 20324049引起,原厂回复比较含糊,不能百分百肯定。
Bug 20324049 ORA-227 Controlfile Corruption when reaching Maximum Value for Control Seq kccfhcsq
This note gives a brief overview of bug 20324049.
The content was last updated on: 17-APR-2018
Click here for details of each of the sections below.
Affects:
Product (Component) | Oracle Server (Rdbms) |
Range of versions believed to be affected | (Not specified) |
Versions confirmed as being affected | 12.1.0.2 (Server Patch Set) 11.2.0.4 10.2.0.5 8.1.7.4 |
Platforms affected | Generic (all / most platforms affected) |
Fixed:
The fix for 20324049 is first included in | 18.1.0 12.2.0.1.170919 (Sep 2017) Database Release Update (DB RU) 12.2.0.1.DBOCT2017RUR:180417 (Apr 2018) Database Release Update Revision(DB RUR) 12.2.0.1.DBJAN2018RUR:180417(Apr 2018) Database Release Update Revision(DB RUR) 12.1.0.2.180116 (Jan 2018) Database Patch Set Update (DB PSU) 12.1.0.2.180116 (Jan 2018) Database Proactive Bundle Patch 11.2.0.4.180116 (Jan 2018) Database Patch Set Update (DB PSU) 11.2.0.4.180116 (Jan 2018) Exadata Database Bundle Patch 12.2.0.1.171017 (Oct 2017) Bundle Patch for Windows Platforms 12.1.0.2.180116 (Jan 2018) Bundle Patch for Windows Platforms 11.2.0.4.180116 (Jan 2018) Bundle Patch for Windows Platforms |
Interim patches may be available for earlier versions - click here to check.
Symptoms:
Corruption (Logical)
Instance May Crash
HCHECK script reports this issue
Error May Occur
ORA-227 / HCKE-50
Related To:
Instance Startup
Description
The controlfile sequence number is not expected to increment so
rapidly that it would ever reach the architectural limit during
the database lifetime, but if the controlfile sequence number
does reach the limit for some reason, then processes will fail
with ORA-227 errors causing the instance to terminate, then after
this, the database cannot be mounted again because the FG process
doing the mount will always fail with the same ORA-00227 error.
This fix writes regular warning messages to the alert log in the
case where the controlfile sequence number approaches the
architectural limit. And this fix also provides an event which
can be set to modify the behavior of "CREATE CONTROLFILE" with
"NORESETLOGS" so that it will reset the controlfile sequence
number to 1 in the datafiles, the online redo logs, and the
new controlfile.
If hcheck in Note:136697.1 is run, it may detect this issue
with error HCKE-0050, see Note:2128446.1.
Rediscovery Notes
Look for all the following:
- various processes start reporting the following error:
ORA-227: corrupt block detected in control file: (block 1,# blocks 1)
- a common stack trace for the ORA-227 errors might be this:
kcvsursuht kcc_begin_txn_internal kccocx kccchb kcccsi ...
- then a fatal background process gets an ORA-227 error and terminates the instance
- then the database cannot be mounted again
- do a hex dump of the controlfile via
$ xxd -g4 <controlfile> > cf.hexdump
and search forward in the hexdump for the first occurrence
of the database name, eg "R12B" in the example below:
...
0004000: 15c20000 01000000 00000000 00000104 ................
0004010: 59320000 00000000 0000200b 24e7e472 Y2........ .$..r
0004020: 52313242 00000000 00000000 00020000 R12B............
... ^^^^^^^^
the 3rd 32-byte word in the row with the database name is
the controlfile sequence number, it will be zero (but it
should never be zero, and it's the reason for the ORA-227's)
- an "xxd" hex dump of a restored recent backup of the controlfile
should show the controlfile sequence number is some high value
eg e1ffffff (which is 0xffffffe1 after endian conversion).
Workaround
There is no workaround for the ORA-227 problem after it has
happened.
It is possible to proactively prevent the ORA-227 problem from
happening (before it happens) in the specific case where the
controlfile seq# is incrementing very quickly (eg many times per
second) due to storing of last scn/time of nologging operations
in the controlfile; in this specific case, the workaround is to
set: db_unrecoverable_scn_tracking=FALSE
Solution:
After installing this fix:
1. verify the controlfile sequence# is at or above 0xFF000000:
set numwidth 15
select max(FHCSQ)
from x$kcvfh;
2. Generate Trace file to recreate the controlfile:
alter database backup controlfile to trace noresetlogs;
3. shutdown
4. startup nomount
5. alter session set events '20324049 trace name context forever, level 1';
6. execute the commands in the tracefile generated by step#2
7. alter session set events '20324049 trace name context off';
8. confirm the controlfile sequence# is now low with the same query in 1.
Take new backups after applying this solution, otherwise recovery will not be possible
failing with these errors:
ORA-00283: recovery session canceled due to errors
ORA-01122: database file 1 failed verification check
ORA-01110: data file 1: '/oracle/dbs/t_db1.f'
ORA-01207: file is more recent than control file - old control file
此系统的数据库是11.2.0.3版本,Bug 20324049描述影响的版本不涉及到11.2.0.3版本,由于主机、存储上已经做了各种排查也未能发现原因跟其相关。于是我们建议把数据库升级到11.2.0.4版本,并打上Bug 20324049补丁,并重建控制文件,把controlfile sequence number的通过此补丁压回值到1。
CREATE CONTROLFILE is being performed with event 20324049 set, this will now reset the controlfile sequence number to 1 in the datafiles, the online redo logs, and the new controlfile.
datafile 1 header updated: controlfile sequence# reset to 1
datafile 2 header updated: controlfile sequence# reset to 1
datafile 3 header updated: controlfile sequence# reset to 1
datafile 4 header updated: controlfile sequence# reset to 1
datafile 5 header updated: controlfile sequence# reset to 1
datafile 6 header updated: controlfile sequence# reset to 1
datafile 7 header updated: controlfile sequence# reset to 1
datafile 8 header updated: controlfile sequence# reset to 1
datafile 9 header updated: controlfile sequence# reset to 1
datafile 10 header updated: controlfile sequence# reset to 1
datafile 11 header updated: controlfile sequence# reset to 1
datafile 12 header updated: controlfile sequence# reset to 1
..........................................................
..........................................................
datafile 344 header updated: controlfile sequence# reset to 1
datafile 345 header updated: controlfile sequence# reset to 1
datafile 346 header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo01-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo01-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo02-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo02-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo03-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo03-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo04-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo04-2.log header updated: controlfile sequence# reset to 1
当然,我们做这个升级是先利用一份存储镜像出来的数据进行测试,测试一切都进行顺利,且经过应用一个周期的测试,控制文件未出现损坏。于是,我们在国庆小长假结束前的凌晨立即对生产库进行修复。通过和记平台广州团队的协力合作,终于在国庆小长假结束前帮客户恢复了系统的正常使用。
OERR: ORA-00227 corrupt block detected in controlfile: (block %s, # blocks %s) Master Note / Troubleshooting, Diagnostic and Solution (文档 ID 48808.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 8.0.6.0 to 12.1.0.2 [Release 8.0.6 to 12.1]
Information in this document applies to any platform.
PURPOSE
This article provides information about error ORA-00227 and possible actions.
SCOPE
This note is intended for general audience as initial starting point for beginning diagnosis of ORA-00227.
DETAILS
Error: ORA 227
Text: corrupt block detected in controlfile: (block %s, # blocks %s)
..........................................................
Cause: A block header corruption or checksum error was detected on reading the controlfile.
Action: Use the CREATE CONTROLFILE or RECOVER DATABASE USING BACKUP CONTROLFILE command.
Cause
The controlfile is corrupted. This could be caused by a problem external to Oracle like a Hardware/OS problem or an Oracle Defect.
Identify if Bug 20324049 may be causing the error ORA-227
If Bug 20324049 is causing the error ORA-227; then the maximum sequence number in the Database will be close to the maximum value (4294967295):
set numwidth 15
select max(FHCSQ),
case trunc(max(FHCSQ)/4294967295,1)
when 0.9 then 'WARNING: Reference Bug 20324049'
else 'NO Warning' end "Bug 20324049"
from x$kcvfh;
控制文件序列值(controlfile sequence number)最大值为:4294967295(即43亿左右,2的32次方减1),可以利用上面的语句检查你的系统是否存在控制文件序列值超过极限值的风险。
Take a backup of the control files as they are now
Take a backup of the existent control files with a regular copy (cp) or any other mechanism. This can be used for future diagnosis and in case they are needed for additional recovery.
Identify which control file is failing
ORA-00227 is normally accompanied by ORA-202 which prints the affected control file name. Review the alert log for more details.
Execute DBVERIFY on all Control Files
DBVERIFY is a tool that is intended to identify corruptions in Datafiles but in some cases may help to identify block corruptions in Control Files.
Identify control file block size by executing dbfsize. Example:
$ dbfsize /oradata/controlfile/control1.ctl
Database file: /oradata/controlfile/control1.ctl
Database file type: file system
Database file size: 614 16384 byte blocks
In this case the control file block size is 16384.
Alternatively identify the control file block size by querying view V$CONTROLFILE.
Execute DBVERIFY on each Control File copy using the block size identified in 1.1. Example:
dbv file=/oradata/controlfile/control1.ctl blocksize=16384
Solution
Solution 1. Use another mirror copy of the control file
If it is determined that the control file is damaged (the corruption is persistent) and if the database is down, then take one of the other control files from the control_files parameter and copy it over the bad control file noted above or startup the database with one control file at a time.
Try opening the database and if error persists then go to Solution 2:
Solution 2. Recreate the control file
Use the next article to recreate the control file:
Doc ID 735106.1 How to Recreate a Controlfile
Solution 3. Restore a backup of the control file and apply media recovery
Restore a backup of control file and apply media recovery using RECOVER DATABASE USING BACKUP CONTROLFILE
If the database is on IBM AIX, make sure to review the issue described in Note 2237498.1
发生检查点
日志文件的切换
归档online redolog
运行崩溃后的恢复
热备的开始和结束
DML通过nologging等选项执行对象时
对大象进行直接DML操作等
还有其他你能想到的方面吗?
欢迎补充
| 北京 | 上海 | 广州 | 成都 |
4008-906-960