A scary title, isn't it? Please keep in mind that this is definitely neither a supported nor an advised method of solving your problems, and you should be really careful while doing it - hopefully not on a production environment. But it may sometimes happen that you end up in a situation where creating your own merge patch for an Oracle database is not as crazy an idea as it sounds :).
Recently, again thanks to our automatic recovery system, we were able to detect a problem at the beginning of the restore phase of a 27 TB database, which failed with the same error I've already described here:
kgefec: fatal error 0
kgepop: no error frame to pop to for error 603
Like before, it is about memory - precisely the stack size of the RMAN process, which quickly fills up due to the way backup pieces are handled. It should happen only when the number of backup pieces to be restored is too big. In this case it was (2597 backup pieces in total, 870 for the biggest backup set), as our TSM team had recently asked us to try setting a smaller backup piece size (10 GB) for this database. It is the exceptional one still backed up not to disk, but directly to tapes, which means that if a backup runs for a long time, we are "blocking" tape drives which cannot be used by other jobs. By setting a smaller maxpiecesize we are able to tell TSM to back the files up to a staging disk area first, from which they are sent to tapes later.
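Rough arithmetic shows why the piece count exploded - a quick back-of-the-envelope check (assuming the whole ~27 TB is backed up in 10 GB pieces; the real count also depends on compression and how files are grouped into sets):

```shell
# Back-of-the-envelope estimate of the backup piece count
# (assumed values: ~27 TB backed up, 10 GB maxpiecesize).
db_size_gb=$((27 * 1024))   # 27 TB expressed in GB
piece_size_gb=10
echo $((db_size_gb / piece_size_gb))   # prints 2764 - same order of magnitude as the 2597 pieces observed
```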
Apart from the MOS note mentioned in the previous blog post (1458718.1), I've found another one describing exactly this problem during restore: Restore database fails with "kgefec: fatal error 0 gepop: no error frame to pop to for error 603" (Doc ID 1472984.1). The note mentions a workaround - specify a bigger maxpiecesize value so that fewer backup pieces are generated. There are other options to limit their number per backup set, like using a smaller filesperset parameter, which causes fewer datafiles to be put into a single backup set, and/or using multisection backups. This should of course help (and we've already increased maxpiecesize), but only for new backups - the old ones are still not usable. There is also another suggestion: try restoring single tablespaces or even datafiles, so that RMAN processes fewer backup pieces during the job. This could work, depending on the maximum number of backup pieces per backup set, but it would complicate your restore, which is not desired when you're working under stress.
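For reference, the knobs mentioned above are set like this in RMAN (a sketch only - the sizes are illustrative, not our actual values; SECTION SIZE is what enables multisection backups):

```
RMAN> CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' MAXPIECESIZE 64G;

RMAN> BACKUP DATABASE FILESPERSET 4;

RMAN> BACKUP SECTION SIZE 64G DATABASE;
```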
Previously we were only getting this error during backups - it was important to solve it as soon as possible, but - assuming nothing bad happened to the database in the meantime - we had time to analyse it and find a solution. Likewise with the new error, as this was a test recovery. But imagine that it happens during an emergency: you need to restore your database, you start the process and it fails with such an error. That would not be so comfortable...
Fortunately there is also a patch for this problem: Patch 18071181: RMAN RESTORE ACTIONS FAILED WHEN THERE ARE OVER 700 BACKUP PIECES. Unfortunately, it is available only for Oracle 11.2.0.3, not for the version we were running. Additionally, we already have 37 interim patches installed, some of them solving other RMAN bugs by changing the recover.bsq file, which contains the code used by RMAN during its operations. After downloading the patch it appeared that the only file it provides is recover.bsq. So we could not apply the patch in our environment without requesting a specific merge patch, which of course would take time - and once the merge patch was provided, we would still have to test it and finally apply it on the production databases. So I thought it would be good to have another, faster solution in case we needed, for example, to restore the affected database to a point in time before a new full backup with a smaller number of backup pieces had been taken.
I got the idea of trying to find the minimal set of changes needed to solve the problem. As expected, a quick comparison of the patched file and our current one was not very encouraging:
$ diff recover.bsq.patched $ORACLE_HOME/rdbms/admin/recover.bsq | wc -l
1902
But because it was a patch for the bare Oracle 11.2.0.3 file, the best idea was to compare it with the recover.bsq file provided with that version. This time the result was much better:
$ diff recover.bsq.patched recover11203.bsq | wc -l
109
I was able to read both files and play a little bit with diff and vimdiff to understand the minimal set of changes needed to solve the problem, and to introduce them manually into our version of the file. It was then time to create my own merge patch:
$ diff -u recover.bsq recover.bsq.corrected > recover.patch
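The mechanics here are plain diff and patch; the round trip can be rehearsed on throwaway files under /tmp (the file names and contents below are made up, just to show the unified diff format):

```shell
# Throwaway demo of creating a unified diff (names/contents are made up)
demo=/tmp/bsq_diff_demo
mkdir -p "$demo"
printf 'define restore_piece\n<<<\ngoto restore_failover;\n>>>\n' > "$demo/recover.bsq"
# the "corrected" copy has a single line changed
sed 's/goto restore_failover;/return true;/' "$demo/recover.bsq" > "$demo/recover.bsq.corrected"
diff -u "$demo/recover.bsq" "$demo/recover.bsq.corrected" > "$demo/recover.patch"
cat "$demo/recover.patch"   # the ---/+++ header, then one hunk with -goto/+return lines
```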
When the patch was ready, I tried to run the restore again to confirm that the problem was still there:
(...)
new media label is 6781 for piece DBNAME_20150725_c3qanu32_862_1_lvl0A
new media label is 6781 for piece DBNAME_20150725_c3qanu32_863_1_lvl0A
kgefec: fatal error 0
kgepop: no error frame to pop to for error 603
It failed as expected, so I was ready to apply my patch using the very handy patch utility, making a backup of the file first, just in case something went wrong:
$ cp recover.patch $ORACLE_HOME/rdbms/admin
$ cd $ORACLE_HOME/rdbms/admin
$ cp recover.bsq recover.bsq.bak
$ patch --verbose recover.bsq < recover.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- recover.bsq	2015-07-30 18:13:35.967285372 +0200
|+++ recover.bsq.corrected	2015-07-30 17:55:40.917532711 +0200
--------------------------
Patching file recover.bsq using Plan A...
Hunk #1 succeeded at 9939.
Hunk #2 succeeded at 10021.
Hunk #3 succeeded at 10071.
Hunk #4 succeeded at 11035.
done
After that, the restore of the database not only started correctly, but also went further than before, so the problem was solved!
new media label is 6781 for piece DBNAME_20150725_c3qanu32_862_1_lvl0A
new media label is 6781 for piece DBNAME_20150725_c3qanu32_863_1_lvl0A
channel c1: starting datafile backup set restore
channel c1: specifying datafile(s) to restore from backup set
channel c1: restoring datafile 00001 to /ORA/dbs03/oradata/DBNAME/DBNAME_DRAC10/datafile/o1_mf_system_%u_.dbf
channel c1: restoring datafile 00003 to /ORA/dbs03/oradata/DBNAME/DBNAME_DRAC10/datafile/o1_mf_sysaux_%u_.dbf
As I did not want to apply this patch permanently, but rather to have it ready in case the problem appeared again, I used the patch utility once more to roll back all the introduced changes:
$ patch --verbose -R recover.bsq < recover.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- recover.bsq	2015-07-30 18:13:35.967285372 +0200
|+++ recover.bsq.corrected	2015-07-30 17:55:40.917532711 +0200
--------------------------
Patching file recover.bsq using Plan A...
Hunk #1 succeeded at 9939.
Hunk #2 succeeded at 9983.
Hunk #3 succeeded at 10033.
Hunk #4 succeeded at 10981.
done
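The same apply-and-rollback cycle can be rehearsed end to end on throwaway files (again, the names and contents are made up): patch applies the hunks in place, and patch -R reverses them exactly, so the file comes back identical to the backup.

```shell
# Throwaway demo of applying and rolling back a patch (made-up files)
demo=/tmp/bsq_patch_demo
mkdir -p "$demo"
cd "$demo"
printf 'line one\ngoto restore_failover;\nline three\n' > recover.bsq
sed 's/goto restore_failover;/return true;/' recover.bsq > recover.bsq.corrected
diff -u recover.bsq recover.bsq.corrected > recover.patch
cp recover.bsq recover.bsq.bak        # backup, just in case
patch recover.bsq < recover.patch     # apply the fix
grep -c 'return true;' recover.bsq    # prints 1
patch -R recover.bsq < recover.patch  # roll the fix back
cmp recover.bsq recover.bsq.bak && echo identical
```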
The final run of the restore confirmed that the problem was back - which was expected and desired. Please find below the contents of the patch I created. This is just for reference, to give you an idea of what kind of changes were needed to solve the problem; please do not apply it in your environment, as it would not work correctly unless you have exactly the same version of the recover.bsq file as we have:
$ cat recover.patch
--- recover.bsq	2015-07-30 18:13:35.967285372 +0200
+++ recover.bsq.corrected	2015-07-30 17:55:40.917532711 +0200
@@ -9939,12 +9939,50 @@
 end if;
 >>>
+define 'x$print_controlfile'
+<<<
+procedure print_controlfile is
+  src_name  varchar2(512);
+  dest_name varchar2(512);
+begin
+  for i in 1..9999 loop
+    dest_name := sys.dbms_backup_restore.getparm(
+                   sys.dbms_backup_restore.control_file, i);
+    exit when dest_name is null;
+    krmicd.writeMsg(8505, dest_name);
+  end loop;
+end;
+>>>
+
 #
 # restore_piece - restore one backup piece. Applies to all types of full
 # restore conversations.
 #
-define restore_piece
+define 'x$restore_piece'
 <<<
+function restore_piece_int(pieces_done IN OUT binary_integer
+                          ,piecenum    IN     number
+                          ,fhandle     IN     varchar2
+                          ,done        OUT    boolean
+                          ,params      IN     varchar2
+                          ,outhandle   OUT    varchar2
+                          ,outtag      OUT    varchar2
+                          ,failover    OUT    boolean
+                          ,err_msg     OUT    varchar2
+                          ,val_bs_only IN     boolean
+                          ,validate    IN     boolean
+                          ,devtype     OUT    varchar2
+                          ,bmr         IN     boolean
+                          ,set_stamp   IN     number
+                          ,set_count   IN     number
+                          ,start_time  IN     date
+                          ,incremental IN     boolean
+                          ,currcf      IN     boolean) return boolean is
+  elapsed number;
+  hours   number;
+  mins    number;
+  secs    number;
+begin -- restore_piece
 if (pieces_done+1) = piecenum then
 begin
@@ -9983,7 +10021,7 @@
 if (devtype is not null) then
 krmicd.writeMsg(8612, krmicd.getChid, devtype);
 end if;
-goto restore_failover;
+return true;
 when others then
 raise;
 end;
@@ -10033,10 +10071,26 @@
 getFileRestored(FALSE);
 end if;
 sys.dbms_backup_restore.restoreCancel(TRUE);
-return;
+return false;
 end if;
 pieces_done := pieces_done + 1;
 end if;
+return false;
+end;
+>>>
+
+define restore_piece
+<<<
+  if (restore_piece_int(pieces_done, piecenum, fhandle, done, params,
+                        outhandle, outtag, failover, err_msg, val_bs_only, validate,
+                        devtype, bmr, set_stamp, set_count, start_time, incremental,
+                        currcf)) then
+    goto restore_failover;
+  end if;
+
+  if done then
+    return;
+  end if;
 >>>

 define restore_end
@@ -10981,21 +11035,6 @@
 end;
 >>>
-define 'x$print_controlfile'
-<<<
-procedure print_controlfile is
-  src_name  varchar2(512);
-  dest_name varchar2(512);
-begin
-  for i in 1..9999 loop
-    dest_name := sys.dbms_backup_restore.getparm(
-                   sys.dbms_backup_restore.control_file, i);
-    exit when dest_name is null;
-    krmicd.writeMsg(8505, dest_name);
-  end loop;
-end;
->>>
-
 # rpctest: test async RPCs
 #
 define "rpctest"