Starting workload capture hangs - is it really a problem with RAT?
If you plan to introduce changes in your environment and want to estimate their impact, the Real Application Testing (RAT) feature seems to be one of the best options. As we needed to check the influence of changes planned for our databases, I started to look for good candidates for workload capture. I wanted to capture only workloads associated with a small number of schemas, but from several databases, to be able to properly simulate as many types of the production workload existing in our databases as possible. This strategy would also allow us to use and test the new Consolidated Replay feature of RAT. An ideal set of schemas should be responsible for quite a big share of the workload in the database, but should not be too big in terms of space - so it can be easily exported using EXPDP with the flashback_scn option.
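For reference, such a consistent schema export can be done by reading the current SCN first and passing it to EXPDP (schema names and directory below are just placeholders):

SQL> select current_scn from v$database;

$ expdp system schemas=APP1,APP2 directory=DATA_PUMP_DIR dumpfile=app_schemas.dmp flashback_scn=<scn_from_above>

The flashback_scn parameter makes Data Pump export the data as of that SCN, so the replay database can be restored to the state the data had when the capture began.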
I was able to capture a few workloads from several of our databases without problems, except on one quite busy (especially in terms of I/O) database, where dbms_workload_capture.start_capture was hanging forever. I did the usual checks using gv$session and other performance views. It appeared that the session was running full table scans, with a large number of reads (physical and logical), and was generating much more redo than expected. The row_wait_obj# column pointed to the sys.wrm$_wr_control object, which looked strange to me, as did the sql_text showing:
begin prvt_hdm.auto_execute( :dbid, :inst_num , :end_snap_id ); end;
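To map row_wait_obj# to an object name, a query along these lines can be used (the SID is a placeholder for the hanging session):

SQL> select o.owner, o.object_name
       from gv$session s, dba_objects o
      where s.row_wait_obj# = o.object_id
        and s.sid = <sid>;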
MOS and Google were not very helpful, so I decided to trace the session, which allowed me to find that the full table scans were actually performed on sys.wrh$_active_session_history, not sys.wrm$_wr_control. Another weird thing: my usual query against gv$session_longops with the time_remaining > 0 condition was showing nothing. So I queried this view looking just for that specific session and found quite interesting rows, with time_remaining set to null and the message:
Table Scan: SYS.WRH$_ACTIVE_SESSION_HISTORY: 679679 out of 107265 Blocks done.
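The query I used looked more or less like this - dropping the time_remaining filter was the key:

SQL> select sid, message, time_remaining
       from gv$session_longops
      where sid = <sid>
      order by last_update_time desc;

With time_remaining > 0 added, the looping scans were filtered out, because their time_remaining was null.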
With all this data in mind, I started to think about what happens while a capture is starting. Full table scans on AWR-related tables pointed me to snapshot creation. While dbms_workload_capture.start_capture was running, the dba_workload_captures view was still empty - maybe because the capture-related tables were not yet populated due to missing data, like the snapshot id (the awr_begin_snap column in dba_workload_captures)? So I killed my session and ran:
SQL> exec dbms_workload_repository.create_snapshot();
to confirm that the problem was related to this - the symptoms were the same. MOS started to be helpful this time, with the note Troubleshooting: AWR Snapshot Collection issues (Doc ID 1301503.1). Additional tracing showed no problems with the snapshot itself, which directed me to focus on ADDM, which is automatically run after each snapshot is taken. A new search in MOS and... bingo: ADDM Jobs are in Status Executing or Running for a Long Time (Doc ID 1557550.1). Unfortunately the cause is still unknown, but the workaround - disabling automatic ADDM runs after each snapshot - works. To apply it, run:
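A quick way to confirm that ADDM is the culprit is to look for advisor tasks stuck in executing state (a sketch - the exact columns may vary between versions):

SQL> select task_name, advisor_name, status, execution_start
       from dba_advisor_tasks
      where advisor_name = 'ADDM'
        and status = 'EXECUTING';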
SQL> alter system set "_addm_auto_enable"=false scope=both sid='*';
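Keep in mind that this disables automatic ADDM analysis after every snapshot; once the capture work is done, you may want to revert it (and, if needed, run ADDM manually for a snapshot range via the dbms_addm package, e.g. dbms_addm.analyze_db):

SQL> alter system set "_addm_auto_enable"=true scope=both sid='*';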
Only then was I able to properly start the capture, even though the problem was (almost) not related to RAT at all...