Importance of testing yours backup strategy

Submitted by sskorupi on
Blog article:

Most of you for sure know, that ability to restore data in case of failure is a primary skill for each DBA. You should always be able to restore and recover data you’re responsible for. This is an axiom. To be sure, that you’re able to do it, you should test it on regular basis. There is of course possibility to use some Oracle features, like backup ... validate or restore ... validate commands, but if you want to be certain as much as possible, the only way is just to run real restore and recovery. Doing it periodically for big amount of databases is extremely tough, both because of resources needed and DBA time. That’s why one of our DBAs Ruben Gaspar Aparicio has created recovery system, which is heavily used at CERN. Good news - it is available as Open Source on Sourceforge (http://sourceforge.net/projects/recoveryplat). Since its release we've introduced many modifications to it, but still it could be good starting point to check the source code in order to start developing your own solution. We’re using it as a validation of our backup strategy, to run real restore and recovery every week or two for most of our databases.

To prove (do I really have to? ;)) that doing regular restores and recoveries is crucial (even if you're using very well tested backup strategy and backup system or scripts), I would like to describe problem we've got some time ago in our environment.

Story started when one of our automatic recoveries failed with following error:

 

channel c1: looking for AUTOBACKUP on day: 20121217
channel c1: looking for AUTOBACKUP on day: 20121216
channel c1: looking for AUTOBACKUP on day: 20121215
channel c1: looking for AUTOBACKUP on day: 20121214
channel c1: looking for AUTOBACKUP on day: 20121213
channel c1: looking for AUTOBACKUP on day: 20121212
channel c1: looking for AUTOBACKUP on day: 20121211
channel c1: no AUTOBACKUP in 7 days found
released channel: c1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/17/2012 00:56:07
RMAN-06172: no AUTOBACKUP found or specified handle is not a valid copy or piece

 

After checking a few things (for example if CONTROLFILE AUTOBACKUP is configured properly for this database), I decided to have a look into backup logs. In archived logs backup logs I found such messages:


skipping archived log of thread 1 with sequence 53600; already backed up
kgefec: fatal error 0
kgepop: no error frame to pop to for error 603

 

Afterwards, there were few messages from our backup system, that job finished without problems... So we were not aware about any problems, even though we should be notified by email if RMAN exit code is different than 0.

So I manually tried to run archived logs backup to check the exit code:


Starting backup at 17-DEC-2012 12:22:10
current log archived
using target database control file instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=403 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Data Protection for Oracle: version 5.5.2.0
allocated channel: ORA_SBT_TAPE_2
channel ORA_SBT_TAPE_2: SID=14 device type=SBT_TAPE
channel ORA_SBT_TAPE_2: Data Protection for Oracle: version 5.5.2.0
skipping archived logs of thread 1 from sequence 47439 to 52209; already backed up skipping archived log of thread 1 with sequence 52212; already backed up
(...)
skipping archived log of thread 1 with sequence 53602; already backed up skipping archived log of thread 1 with sequence 53605; already backed up
kgefec: fatal error 0
kgepop: no error frame to pop to for error 603

$ echo $?
0

 

As you can see, exit code is 0... Quick check in MOS and such error seems to be known: Rman Fails: kgepop: no error frame to pop to for error 603 [ID 1458718.1]. I tested suggested solution:


$ ulimit -s
10240
$ ulimit -s unlimited
$ ulimit -s
unlimited

 

Unfortunately it didn't help and we got the same error again, so I decided to check the number of existing archivelogs (in this case they should be deleted because delete all input clause is used in backup command):


SQL> select trunc(completion_time) day, count(*)
     from v$archived_log
     where deleted != 'YES'
     and standby_dest != 'YES'
     group by trunc(completion_time)
     order by 1;

 

It appeared that in the beginning of the month quite many were generated (sizing of them is another question, but this database has not yet been used in production):


DAY                        COUNT(*)
-------------------- --------------
02-DEC-2012 00:00:00           1568
03-DEC-2012 00:00:00           1508
04-DEC-2012 00:00:00           1461
05-DEC-2012 00:00:00           5585
06-DEC-2012 00:00:00            208
07-DEC-2012 00:00:00            194
08-DEC-2012 00:00:00            224
09-DEC-2012 00:00:00            318
10-DEC-2012 00:00:00            246
11-DEC-2012 00:00:00            222
12-DEC-2012 00:00:00            220
13-DEC-2012 00:00:00            224
14-DEC-2012 00:00:00            196
15-DEC-2012 00:00:00            232
16-DEC-2012 00:00:00            292
17-DEC-2012 00:00:00            198

 

Because errors were related to memory problems and were generated during deletion (so fortunately they were backed up), I tried to delete many of them manually, using:


RMAN> delete archivelog until time 'sysdate - 3' backed up 1 times to device type sbt;

 

Afterwards, backups started to finish without problems... And how whole problem appeared? It seems, that because standby database was in place and was not applying redo for few days (which accidentally happened during very high redo generation on primary), backed up archived logs were not deleted. After reaching some number, they could not be deleted because RMAN was crashing due to memory problems. But the worst part is that RMAN was crashing "silently", so we were not aware about any problems with backups. Fortunately, automatic recovery was able to show us the problem, before it was too late...

 

Add new comment

CAPTCHA
Enter the characters shown in the image.
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.

Disclaimer

The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.

CERN Social Media Guidelines

 

Blogroll