Importance of testing yours backup strategy
Most of you for sure know, that ability to restore data in case of failure is a primary skill for each DBA. You should always be able to restore and recover data you’re responsible for. This is an axiom. To be sure, that you’re able to do it, you should test it on regular basis. There is of course possibility to use some Oracle features, like backup ... validate or restore ... validate commands, but if you want to be certain as much as possible, the only way is just to run real restore and recovery. Doing it periodically for big amount of databases is extremely tough, both because of resources needed and DBA time. That’s why one of our DBAs Ruben Gaspar Aparicio has created recovery system, which is heavily used at CERN. Good news - it is available as Open Source on Sourceforge (http://sourceforge.net/projects/recoveryplat). Since its release we've introduced many modifications to it, but still it could be good starting point to check the source code in order to start developing your own solution. We’re using it as a validation of our backup strategy, to run real restore and recovery every week or two for most of our databases.
To prove (do I really have to? ;)) that doing regular restores and recoveries is crucial (even if you're using very well tested backup strategy and backup system or scripts), I would like to describe problem we've got some time ago in our environment.
Story started when one of our automatic recoveries failed with following error:
channel c1: looking for AUTOBACKUP on day: 20121217 channel c1: looking for AUTOBACKUP on day: 20121216 channel c1: looking for AUTOBACKUP on day: 20121215 channel c1: looking for AUTOBACKUP on day: 20121214 channel c1: looking for AUTOBACKUP on day: 20121213 channel c1: looking for AUTOBACKUP on day: 20121212 channel c1: looking for AUTOBACKUP on day: 20121211 channel c1: no AUTOBACKUP in 7 days found released channel: c1 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of restore command at 12/17/2012 00:56:07 RMAN-06172: no AUTOBACKUP found or specified handle is not a valid copy or piece
After checking a few things (for example if CONTROLFILE AUTOBACKUP is configured properly for this database), I decided to have a look into backup logs. In archived logs backup logs I found such messages:
skipping archived log of thread 1 with sequence 53600; already backed up kgefec: fatal error 0 kgepop: no error frame to pop to for error 603
Afterwards, there were few messages from our backup system, that job finished without problems... So we were not aware about any problems, even though we should be notified by email if RMAN exit code is different than 0.
So I manually tried to run archived logs backup to check the exit code:
Starting backup at 17-DEC-2012 12:22:10 current log archived using target database control file instead of recovery catalog allocated channel: ORA_SBT_TAPE_1 channel ORA_SBT_TAPE_1: SID=403 device type=SBT_TAPE channel ORA_SBT_TAPE_1: Data Protection for Oracle: version 18.104.22.168 allocated channel: ORA_SBT_TAPE_2 channel ORA_SBT_TAPE_2: SID=14 device type=SBT_TAPE channel ORA_SBT_TAPE_2: Data Protection for Oracle: version 22.214.171.124 skipping archived logs of thread 1 from sequence 47439 to 52209; already backed up skipping archived log of thread 1 with sequence 52212; already backed up (...) skipping archived log of thread 1 with sequence 53602; already backed up skipping archived log of thread 1 with sequence 53605; already backed up kgefec: fatal error 0 kgepop: no error frame to pop to for error 603 $ echo $? 0
As you can see, exit code is 0... Quick check in MOS and such error seems to be known: Rman Fails: kgepop: no error frame to pop to for error 603 [ID 1458718.1]. I tested suggested solution:
$ ulimit -s 10240 $ ulimit -s unlimited $ ulimit -s unlimited
Unfortunately it didn't help and we got the same error again, so I decided to check the number of existing archivelogs (in this case they should be deleted because delete all input clause is used in backup command):
SQL> select trunc(completion_time) day, count(*) from v$archived_log where deleted != 'YES' and standby_dest != 'YES' group by trunc(completion_time) order by 1;
It appeared that in the beginning of the month quite many were generated (sizing of them is another question, but this database has not yet been used in production):
DAY COUNT(*) -------------------- -------------- 02-DEC-2012 00:00:00 1568 03-DEC-2012 00:00:00 1508 04-DEC-2012 00:00:00 1461 05-DEC-2012 00:00:00 5585 06-DEC-2012 00:00:00 208 07-DEC-2012 00:00:00 194 08-DEC-2012 00:00:00 224 09-DEC-2012 00:00:00 318 10-DEC-2012 00:00:00 246 11-DEC-2012 00:00:00 222 12-DEC-2012 00:00:00 220 13-DEC-2012 00:00:00 224 14-DEC-2012 00:00:00 196 15-DEC-2012 00:00:00 232 16-DEC-2012 00:00:00 292 17-DEC-2012 00:00:00 198
Because errors were related to memory problems and were generated during deletion (so fortunately they were backed up), I tried to delete many of them manually, using:
RMAN> delete archivelog until time 'sysdate - 3' backed up 1 times to device type sbt;
Afterwards, backups started to finish without problems... And how whole problem appeared? It seems, that because standby database was in place and was not applying redo for few days (which accidentally happened during very high redo generation on primary), backed up archived logs were not deleted. After reaching some number, they could not be deleted because RMAN was crashing due to memory problems. But the worst part is that RMAN was crashing "silently", so we were not aware about any problems with backups. Fortunately, automatic recovery was able to show us the problem, before it was too late...