How to effectively tune a query that does not even finish (SQL Monitoring and ASH in action)
Have you ever been called to tune a query that runs forever?
Even worse, what if the query is very complex, you don't know anything about the "logic"
behind the SQL - since it was written by someone else - and when you see the execution
plan it is several pages long?
Moreover, what if the query is for retrieving a list of customers for a business critical
campaign that has to run asap, and therefore the beloved management is over your shoulder,
waiting for you to solve the problem!
Problem Definition
Never ending queries, or at least queries that cannot finish within an acceptable time
frame are quite common in large Data Warehouses. Unfortunately, in this situation one
cannot use methods where execution statistics are gathered first and then analysi
s of the execution steps can take place. Simply because the query does not finish!
For example such a great tuning method would be the following:
- execute the query with the hint /*+ gather_plan_statistics */
- (or alternatively issuing an ALTER SESSION SET STATISTICS_LEVEL=ALL prior to query execution), in order to enable the collection of the so-called "row-source statistics"
- (i.e., execution statistics for each operation (row-source) in the execution plan).
- After the completion of the query, use DBMS_XPLAN.DISPLAY_CURSOR, or directly query V$SQL_PLAN_STATISTICS_ALL, in order to:
- Identify problematic cardinality estimations by the optimizer by comparing actual rows
- retrieved in each step of the plan versus estimated rows.
- Identify bottlenecks in the plan, i.e., operations that take too long to complete.
(By the way, two great posts that describe how to use DBMS_XPLAN.DISPLAY_CURSOR to tune a query, can be found here(by Kyle Hailey) and here (by Nikolay Savvinov)).
So what can we do? We need a way to get accurate execution statistics at the individual operation level, so as to easily identify the bottlenecks, while the query is still running. What are our options?
11g has the Solution!
Well, from version 11g and onwards there is a great feature called Real-Time SQL Monitoring that allows you to monitor a query executing in real-time and provides useful statistics at an individual plan operation level that help you easily (and quickly!) identify the bottleneck in an execution plan. (Note that the Real-Time SQL Monitoring feature is licensed with Oracle Tuning Pack in order to use it in production). If you want a more detailed description of this feature check out this Oracle white paper and of course the documentation (here).
Another option is to use one of the greatest-ever Oracle features (available since 10g), the Active Session History (ASH) (licensed with Oracle Diagnostic Pack). Note that from version 11g, ASH has become even greater since it includes columns indicating the individual execution plan operation and thus making the identification of time-consuming steps very easy.
In this post, we will give an example of a complex query, that runs forever and describe how we can tune it (i.e., to identify the bottleneck and then try to fix it) with the use of both of these tools (SQL monitoring and ASH).
A real-life Example
So we have a complex query that has been running for some hours and the execution plan, which is retrieved with a call like this:
nkarag@DWHPRD> select * from table( dbms_xplan.display_cursor('&sql_id', '&child_number', 'ALLSTATS LAST'))2 /
Listing 1: How we retrieve the plan of a running query based on sql_id. ---------------------------------------------------------------------------------------------------------------------
looks like this:
Plan hash value: 2144277924
---------------------------------------------------------------------------------------------------------------------------------| Id | Operation | Name | E-Rows | OMem | 1Mem | Used-Mem | Used-Tmp|| 0 | SELECT STATEMENT | | | | | | |---------------------------------------------------------------------------------------------------------------------------------| 1 | TEMP TABLE TRANSFORMATION | | | | | | || 4 | VIEW | VW_NSO_1 | 642K| | | | || 2 | LOAD AS SELECT | | | 256K| 256K| | ||* 3 | HASH JOIN RIGHT SEMI | | 642K| 25M| 3344K| | ||* 9 | INDEX STORAGE FAST FULL SCAN | UCM_CUSTOMER_DIM_UK2 | 7462K| | | | ||* 5 | HASH JOIN | | 642K| 47M| 5162K| | ||* 6 | HASH JOIN | | 642K| 586K| 586K| | ||* 8 | TABLE ACCESS STORAGE FULL | M_W_INSTALL_ASSETS | 10M| | | | ||* 7 | TABLE ACCESS STORAGE FULL | PRODUCT_DIM | 900 | | | | ||* 10 | INDEX STORAGE FAST FULL SCAN | UCM_CUSTOMER_DIM_UK2 | 7462K| | | | || 15 | PX SEND QC (RANDOM) | :TQ10006 | 872K| | | | || 11 | HASH UNIQUE | | 1 | 436K| 436K| | || 12 | CONCATENATION | | | | | | || 14 | PX COORDINATOR | | | | | | ||* 13 | FILTER | | | | | | ||* 16 | HASH JOIN RIGHT OUTER BUFFERED | | 872K| 686K| 686K| | || 22 | PX RECEIVE | | 981K| | | | || 17 | BUFFER SORT | | | 4096 | 4096 | | || 18 | PX RECEIVE | | 59 | | | | || 20 | TABLE ACCESS STORAGE FULL | SEGMENT_OTE_BASIC_DIM | 59 | | | | || 19 | PX SEND BROADCAST | :TQ10002 | 59 | | | | ||* 21 | HASH JOIN | | 872K| 60M| 5023K| 9M (0)| ||* 30 | HASH JOIN | | 11M| 278M| 13M| 52M (0)| || 23 | PX SEND HASH | :TQ10004 | 981K| | | | ||* 24 | HASH JOIN BUFFERED | | 981K| 84M| 7595K| 21M (0)| 6144 || 26 | PX RECEIVE | | 642K| | | | || 25 | BUFFER SORT | | | 14M| 1032K| | || 27 | PX SEND BROADCAST | :TQ10000 | 642K| | | | || 31 | BUFFER SORT | | | 46M| 1767K| | || 28 | VIEW | | 642K| | | | || 29 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K| | | | || 39 | PX RECEIVE | | 10M| | | | || 32 | PX RECEIVE | | 7462K| | | | || 33 | PX SEND HASH | :TQ10001 | 7462K| | | | || 35 | PX RECEIVE | | 11M| | | | ||* 34 | TABLE ACCESS STORAGE FULL | UCM_CUSTOMER_DIM | 7462K| | | | || 36 | PX SEND HASH | :TQ10003 | 11M| | | | || 40 | PX SEND HASH | :TQ10005 | 10M| | | | || 37 | PX BLOCK ITERATOR | | 11M| | | | ||* 38 | TABLE ACCESS STORAGE FULL | BILLING_ACCOUNT_DIM | 11M| | | | || 49 | PX RECEIVE | | 32198 | | | | || 41 | PX BLOCK ITERATOR | | 10M| | | | ||* 42 | TABLE ACCESS STORAGE FULL | CLI_DIM | 10M| | | | || 44 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K| | | | ||* 43 | VIEW | | 642K| | | | ||* 45 | FILTER | | | | | | ||* 48 | HASH JOIN | | 29649 | 4663K| 1167K| | || 46 | PX COORDINATOR | | | | | | || 47 | PX SEND QC (RANDOM) | :TQ20006 | 29649 | | | | ||* 60 | HASH JOIN RIGHT OUTER BUFFERED| | 374K| 686K| 686K| | || 50 | PX SEND BROADCAST | :TQ20005 | 32198 | | | | ||* 51 | HASH JOIN BUFFERED | | 32198 | 25M| 3498K| | || 52 | BUFFER SORT | | | 73728 | 73728 | | || 54 | PX SEND HASH | :TQ20002 | 642K| | | | || 53 | PX RECEIVE | | 642K| | | | || 55 | VIEW | | 642K| | | | || 58 | PX SEND HASH | :TQ20004 | 374K| | | | || 56 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K| | | | || 57 | PX RECEIVE | | 374K| | | | || 72 | PX BLOCK ITERATOR | | 11M| | | | ||* 59 | FILTER | | | | | | | | 61 | BUFFER SORT | | | 73728 | 73728 | | || 62 | PX RECEIVE | | 59 | | | | || 64 | TABLE ACCESS STORAGE FULL | SEGMENT_OTE_BASIC_DIM | 59 | | | | || 63 | PX SEND BROADCAST | :TQ20000 | 59 | | | | ||* 65 | HASH JOIN | | 11M| 329M| 11M| | || 68 | PX SEND HASH | :TQ20001 | 7462K| | | | || 66 | BUFFER SORT | | | 73728 | 73728 | | || 67 | PX RECEIVE | | 7462K| | | | |13 - filter( IS NOT NULL)|* 69 | TABLE ACCESS STORAGE FULL| UCM_CUSTOMER_DIM | 7462K| | | | | | 70 | PX RECEIVE | | 11M| | | | ||* 73 | TABLE ACCESS STORAGE FULL| BILLING_ACCOUNT_DIM | 11M| | | | || 71 | PX SEND HASH | :TQ20003 | 11M| | | | || 74 | PX BLOCK ITERATOR | | 10M| | | | ||* 75 | TABLE ACCESS STORAGE FULL | CLI_DIM | 10M| | | | || 77 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K| | | | ||* 76 | VIEW | | 642K| | | | |---------------------------------------------------------------------------------------------------------------------------------7 - storage(("A"."ROOT_PRODUCT_DESC"='IP TV' OR "A"."ROOT_PRODUCT_DESC"='TV SAT' OR "A"."ROOT_PRODUCT_DESC"='TV SATPredicate Information (identified by operation id):---------------------------------------------------5 - access("SBL_CUSTOMER_ID"="SIEBEL_CUSTOMER_ID")3 - access("SBL_CUSTOMER_ID"="C1")9 - storage("SBL_CUSTOMER_ID" IS NOT NULL)6 - access("A"."SOC_PRODUCT_SK"="PRODUCT_SK") Hotel'))filter(("A"."ROOT_PRODUCT_DESC"='IP TV' OR "A"."ROOT_PRODUCT_DESC"='TV SAT' OR "A"."ROOT_PRODUCT_DESC"='TV SATHotel')) 8 - storage("STATUS"='Active') filter("STATUS"='Active')76 - filter("D1"."C1"=:B1)filter("SBL_CUSTOMER_ID" IS NOT NULL)10 - storage("SBL_CUSTOMER_ID" IS NOT NULL)filter("SBL_CUSTOMER_ID" IS NOT NULL)16 - access("A"."OTE_BASIC_SEGMENT_SK"="B"."OTE_BASIC_SEGMENT_SK")21 - access("GNV_BILLING_ACCOUNT_NUM"="GNV_ACCOUNT_NUM")24 - access("SBL_CUSTOMER_ID"="D1"."C1")30 - access("SBL_CUSTOMER_ID"="SBL_CUSTOMER_ID")34 - storage("SBL_CUSTOMER_ID" IS NOT NULL)38 - storage(:Z>=:Z AND :Z<=:Z)filter("SBL_CUSTOMER_ID" IS NOT NULL)42 - storage(:Z>=:Z AND :Z<=:Z)48 - access("GNV_BILLING_ACCOUNT_NUM"="GNV_ACCOUNT_NUM")43 - filter("D1"."C1"=:B1)45 - filter(LNNVL( IS NOT NULL))51 - access("SBL_CUSTOMER_ID"="D1"."C1")59 - filter(("B"."OTE_BASIC_SEGMENT_GROUP_DESC"='┼╨┴├├┼╦╠┴╘╔┼╙' OR "B"."OTE_BASIC_SEGMENT_GROUP_DESC"='╧╔╩╔┴╩╧╔'))60 - access("A"."OTE_BASIC_SEGMENT_SK"="B"."OTE_BASIC_SEGMENT_SK")65 - access("SBL_CUSTOMER_ID"="SBL_CUSTOMER_ID")filter("SBL_CUSTOMER_ID" IS NOT NULL)69 - storage("SBL_CUSTOMER_ID" IS NOT NULL)75 - storage(:Z>=:Z AND :Z<=:Z)73 - storage(:Z>=:Z AND :Z<=:Z AND (("ACCOUNT_TYPE_DESC"='┼╨┴├├┼╦╠┴╘╔┼╙' OR "ACCOUNT_TYPE_DESC"='╧╔╩╔┴╩╧╔')))filter(("ACCOUNT_TYPE_DESC"='┼╨┴├├┼╦╠┴╘╔┼╙' OR "ACCOUNT_TYPE_DESC"='╧╔╩╔┴╩╧╔'))
Listing 2: The execution plan of the problematic query. ---------------------------------------------------------------------------------------------------------------------
Take a look how easy it is to identify the bottleneck with SQL monitoring. Here is the srcipt I use:
nkarag@DWHPRD> host cat mons_text.sql
set longchunksize 1000000set long 1000000set linesize 999set heading off verify off autotrace off feedback offset timing offset wrap onset pagesize 1000spool sm.txt replaceaccept sql_id prompt "Enter value for sql_id: "set termout offreport_level => 'ALL') AS reportSELECT DBMS_SQLTUNE.report_sql_monitor(sql_id => '&&sql_id',type => 'TEXT',FROM dual;spool offhost cat sm.txtundef sql_id
Listing 3: A script for retrieving SQL Monitor results for a specific sql_id ---------------------------------------------------------------------------------------------------------------------
and here are the results:
SQL Plan Monitoring Details (Plan Hash Value=2144277924)
==========================================================================================================================================================================================================================================================| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Write | Write | Cell | Mem | Temp | Activity | Activity Detail |==========================================================================================================================================================================================================================================================| | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | Reqs | Bytes | Offload | | | (%) | (# samples) || 0 | SELECT STATEMENT | | | | | | 1 | | | | | | | | | | || 3 | HASH JOIN RIGHT SEMI | | 642K | 414K | 6 | +8 | 1 | 326K | | | | | | | | 0.05 | Cpu (2) || 1 | TEMP TABLE TRANSFORMATION | | | | | | 1 | | | | | | | | | | || 2 | LOAD AS SELECT | | | | 1 | +13 | 1 | 1 | | | 22 | 5MB | | | | | || 8 | TABLE ACCESS STORAGE FULL | M_W_INSTALL_ASSETS | 10M | 392K | 5 | +4 | 1 | 31M | 24205 | 23GB | | | 96.31% | | | 0.02 | Cpu (1) || 4 | VIEW | VW_NSO_1 | 642K | 404K | 6 | +8 | 1 | 742K | | | | | | | | | || 5 | HASH JOIN | | 642K | 404K | 10 | +4 | 1 | 742K | | | | | | | | 0.05 | Cpu (2) || 7 | TABLE ACCESS STORAGE FULL | PRODUCT_DIM | 900 | 1355 | 4 | +1 | 1 | 926 | 376 | 36MB | | | | | | 0.05 | cell multiblock physical read (2) || 6 | HASH JOIN | | 642K | 393K | 7 | +2 | 1 | 742K | | | | | | | | 0.07 | Cpu (3) || 9 | INDEX STORAGE FAST FULL SCAN | UCM_CUSTOMER_DIM_UK2 | 7M | 4659 | 6 | +8 | 1 | 7M | 341 | 315MB | | | | | | 0.07 | Cpu (1) || -> 13 | FILTER | | | | 4168 | +24 | 1 | 791K | | | | | | | | 0.07 | Cpu (3) || | | | | | | | | | | | | | | | | | cell multiblock physical read (2) || 10 | INDEX STORAGE FAST FULL SCAN | UCM_CUSTOMER_DIM_UK2 | 7M | 4659 | 1 | +13 | 1 | 7M | | | | | | | | | || -> 12 | CONCATENATION | | | | 4168 | +24 | 1 | 791K | | | | | | | | | || -> 11 | HASH UNIQUE | | 1 | 1M | 4168 | +24 | 1 | 0 | | | 85 | 8MB | | 3M | 10M | 0.05 | Cpu (2) || -> 14 | PX COORDINATOR | | | | 4168 | +24 | 9 | 791K | 10 | 320KB | | | | | | 0.02 | Cpu (1) || 20 | TABLE ACCESS STORAGE FULL | SEGMENT_OTE_BASIC_DIM | 59 | 4 | 1 | +18 | 1 | 59 | | | | | | | | | || -> 15 | PX SEND QC (RANDOM) | :TQ10006 | 872K | 186K | 4167 | +25 | 8 | 796K | | | | | | | | | || -> 16 | HASH JOIN RIGHT OUTER BUFFERED | | 872K | 186K | 4174 | +18 | 8 | 796K | 296 | 65MB | 406 | 89MB | | 35M | 112M | | || 18 | PX RECEIVE | | 59 | 4 | 2 | +18 | 8 | 472 | | | | | | | | | || 17 | BUFFER SORT | | | | 2 | +18 | 8 | 472 | | | | | | | | | || 19 | PX SEND BROADCAST | :TQ10002 | 59 | 4 | 1 | +18 | 1 | 472 | | | | | | | | | || 28 | VIEW | | 642K | 217 | 1 | +14 | 1 | 326K | | | | | | | | | || 21 | HASH JOIN | | 872K | 186K | 17 | +18 | 8 | 1M | | | | | | | | 0.07 | Cpu (3) || 22 | PX RECEIVE | | 982K | 160K | 2 | +18 | 8 | 717K | | | | | | | | | || 24 | HASH JOIN BUFFERED | | 982K | 160K | 1 | +18 | | | | | | | | | | 0.07 | Cpu (3) || 23 | PX SEND HASH | :TQ10004 | 982K | 160K | | | | | | | | | | | | | || 25 | BUFFER SORT | | | | | | | | | | | | | | | | || 29 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K | 217 | 1 | +14 | 1 | 326K | 7 | 5MB | | | | | | | || 26 | PX RECEIVE | | 642K | 217 | | | | | | | | | | | | | || 27 | PX SEND BROADCAST | :TQ10000 | 642K | 217 | 1 | +14 | 1 | 3M | | | | | | | | | || 37 | PX BLOCK ITERATOR | | 11M | 18775 | 3 | +17 | 8 | 11M | | | | | | | | | || 30 | HASH JOIN | | 11M | 160K | 2 | +17 | | | | | | | | | | 0.26 | Cpu (11) || 31 | BUFFER SORT | | | | 2 | +16 | | | | | | | | | | 0.07 | Cpu (3) || 33 | PX SEND HASH | :TQ10001 | 7M | 141K | 5 | +14 | 1 | 7M | | | | | | | | | || 32 | PX RECEIVE | | 7M | 141K | | | | | | | | | | | | | || 34 | TABLE ACCESS STORAGE FULL | UCM_CUSTOMER_DIM | 7M | 141K | 5 | +14 | 1 | 7M | 14649 | 8GB | | | 92.70% | | | 0.05 | Cpu (2) || 38 | TABLE ACCESS STORAGE FULL | BILLING_ACCOUNT_DIM | 11M | 18775 | 3 | +17 | 120 | 11M | 13150 | 7GB | | | 80.00% | | | 0.17 | Cpu (6) || 35 | PX RECEIVE | | 11M | 18775 | | | | | | | | | | | | | || 36 | PX SEND HASH | :TQ10003 | 11M | 18775 | 3 | +17 | 8 | 11M | | | | | | | | | || | | | | | | | | | | | | | | | | | resmgr:cpu quantum (2) || | | | | | | | | | | | | | | | | | cell smart table scan (1) || 39 | PX RECEIVE | | 10M | 25360 | 17 | +18 | 8 | 10M | | | | | | | | | || 41 | PX BLOCK ITERATOR | | 10M | 25360 | | | | | | | | | | | | | || 40 | PX SEND HASH | :TQ10005 | 10M | 25360 | 1 | +21 | | | | | | | | | | 0.05 | Cpu (2) || 42 | TABLE ACCESS STORAGE FULL | CLI_DIM | 10M | 25360 | 6 | +19 | | | | | | | | | | 0.95 | Cpu (5) || -> 44 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K | 217 | 4168 | +24 | 627K | 4G | 1 | 32768 | | | | | | 26.73 | Cpu (1124) || | | | | | | | | | | | | | | | | | cell single block physical read (35) || -> 43 | VIEW | | 642K | 217 | 4168 | +24 | 627K | 627K | | | | | | | | 70.76 | Cpu (2981) || 55 | VIEW | | 642K | 217 | | | | | | | | | | | | | || 45 | FILTER | | | | | | | | | | | | | | | | || 46 | PX COORDINATOR | | | | | | | | | | | | | | | | || 47 | PX SEND QC (RANDOM) | :TQ20006 | 29649 | 186K | | | | | | | | | | | | | || 49 | PX RECEIVE | | 32198 | 160K | | | | | | | | | | | | | || 48 | HASH JOIN | | 29649 | 186K | | | | | | | | | | | | | || 50 | PX SEND BROADCAST | :TQ20005 | 32198 | 160K | | | | | | | | | | | | | || 53 | PX RECEIVE | | 642K | 217 | | | | | | | | | | | | | || 51 | HASH JOIN BUFFERED | | 32198 | 160K | | | | | | | | | | | | | || 52 | BUFFER SORT | | | | | | | | | | | | | | | | || 67 | PX RECEIVE | | 7M | 141K | | | | | | | | | | | | | || 54 | PX SEND HASH | :TQ20002 | 642K | 217 | | | | | | | | | | | | | | | 56 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K | 217 | | | | | | | | | | | | | || 57 | PX RECEIVE | | 374K | 160K | | | | | | | | | | | | | || 59 | FILTER | | | | | | | | | | | | | | | | || 58 | PX SEND HASH | :TQ20004 | 374K | 160K | | | | | | | | | | | | | || 60 | HASH JOIN RIGHT OUTER BUFFERED | | 374K | 160K | | | | | | | | | | | | | || 63 | PX SEND BROADCAST | :TQ20000 | 59 | 4 | | | | | | | | | | | | | || 61 | BUFFER SORT | | | | | | | | | | | | | | | | || 62 | PX RECEIVE | | 59 | 4 | | | | | | | | | | | | | |==========================================================================================================================================================================================================================================================| 64 | TABLE ACCESS STORAGE FULL | SEGMENT_OTE_BASIC_DIM | 59 | 4 | | | | | | | | | | | | | | | 65 | HASH JOIN | | 11M | 160K | | | | | | | | | | | | | || 68 | PX SEND HASH | :TQ20001 | 7M | 141K | | | | | | | | | | | | | || 66 | BUFFER SORT | | | | | | | | | | | | | | | | || 69 | TABLE ACCESS STORAGE FULL | UCM_CUSTOMER_DIM | 7M | 141K | | | | | | | | | | | | | || 70 | PX RECEIVE | | 11M | 18776 | | | | | | | | | | | | | || 72 | PX BLOCK ITERATOR | | 11M | 18776 | | | | | | | | | | | | | || 71 | PX SEND HASH | :TQ20003 | 11M | 18776 | | | | | | | | | | | | | || 73 | TABLE ACCESS STORAGE FULL | BILLING_ACCOUNT_DIM | 11M | 18776 | | | | | | | | | | | | | || -> 76 | VIEW | | 642K | 217 | 4168 | +24 | 627K | 627K | | | | | | | | | || 74 | PX BLOCK ITERATOR | | 10M | 25360 | | | | | | | | | | | | | || 75 | TABLE ACCESS STORAGE FULL | CLI_DIM | 10M | 25360 | | | | | | | | | | | | | || -> 77 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FD9FD04F_8289D6BF | 642K | 217 | 4168 | +24 | 627K | 4G | | | | | | | | | |
Listing 4: SQL Monitoring results for the problematic query. ---------------------------------------------------------------------------------------------------------------------
Look at line 13 in listing 4 (the FILTER operation). The arrow on the left indicates that this step is currently executing. It has been running for 4168 seconds as the column "Time Active(s)" reports. So for sure it is among the bottlenecks that we search for. In our case, this filter operation has two children operations. Among these, one of the ones that is running for the longest period is operation 44 (TABLE ACCESS STORAGE FULL on a temp table). It is also running for 4168 seconds. If we look a bit to the right, we can immediately see the the "Execs" column for this operation is 627K, which means that this operation has been executed up to now for 627K times! Oops that's a lot of times for doing a full table scan on a large table, isn't it? Moreover, if we look a bit further to the right we will see that up to now this operation has returned (output rows) 4G rows!
So, it is quite easy to understand the problem of this execution plan is this extremely inefficient filter operation (which executes in a nested loop fashion). See how easy it was, when you use the right tools?
Can we have the same results with ASH? Oh yes we can!
Here is a little script that identifies the plan operation that runs for the largest percent of wall-clock time (the script gives you an option to choose sorting of plan operations by wall-clock time descending or by DB time descending. With the former you can identify the most time consuming step while with the latter you can identify the most resource consuming step of the plan):
nkarag@DWHPRD> host cat ash_ops.sqlset pagesize 999set lines 999accept how_to_sort prompt "Do you want operations sorted by DBtime desc (D) or Wall Clock time desc (W) - default is W:"def how_to_sort = "W"col sort_by_wctime noprint new_value _SORT_BY_WCTIMEcol sort_by_dbtime noprint new_value _SORT_BY_DBTIMEset timing offdecode(upper(nvl('&&how_to_sort','W')),'W','','--') sort_by_wctimeselect decode(upper(nvl('&&how_to_sort','W')),'D','','--') sort_by_dbtime,from dual;set timing oncount(*) "DB Time (secs)",select SQL_EXEC_START, SQL_PLAN_LINE_ID, SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type,count(distinct sample_time) "WC Time (secs)",db_time_secs "DB Time Total (secs)",select SQL_EXEC_START, sample_time, SQL_PLAN_LINE_ID, A.SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type, --count(*) over() cnttot,wc_time_secs "WC Time Total (secs)",round(count(*)/db_time_secs *100) "DB Time (%)",round(count(distinct sample_time)/wc_time_secs *100) "WC Time (%)"from (count(*) over(partition by SQL_EXEC_START) db_time_secs,and sql_id = nvl('&sql_id',sql_id) and SQL_CHILD_NUMBER = nvl('&SQL_CHILD_NUMBER',0)count(distinct sample_time) over(partition by SQL_EXEC_START) wc_time_secsfrom gv$active_session_history a left outer join dba_objects b on(a.CURRENT_OBJ# = b.object_id)whereSAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))/and sql_exec_id is not null)group by SQL_EXEC_START, db_time_secs, wc_time_secs, SQL_PLAN_LINE_ID, SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type --, cnttotorder by SQL_EXEC_START desc,&_SORT_BY_DBTIME "DB Time (%)" desc&_SORT_BY_WCTIME "WC Time (%)" desc
Listing 5: A script for returning the plan operation of a running query that consumed the most wall-clock time based on ASH. ---------------------------------------------------------------------------------------------------------------------
and here are the results when we run the above script for the problematic query which is executing:
SQL_EXEC_START SQL_PLAN_LINE_ID SQL_PLAN_OPERATION SQL_PLAN_OPTIONS OWNER OBJECT_NAME OBJECT_TYPE DB Time (secs) WC Time (secs) DB Time Total (secs) WC Time Total (secs) DB Time (%) WC Time (%)------------------- ---------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------- -------------- -------------- -------------------- -------------------- ----------- -----------27/01/2015 16:17:41 44 TABLE ACCESS STORAGE FULL 989 989 3592 3592 28 2827/01/2015 16:17:41 43 VIEW 2595 2595 3592 3592 72 7227/01/2015 16:17:41 13 FILTER 5 5 3592 3592 0 027/01/2015 16:17:41 11 HASH UNIQUE 1 1 3592 3592 0 027/01/2015 16:17:41 14 PX COORDINATOR 2 2 3592 3592 0 0
Listing 6: Results of ASH for the problematic query. ---------------------------------------------------------------------------------------------------------------------
We have run the script for a 60 minute window (the maximun retention period for ASH) and we clearly see that the most delay is in operations 43 and 44. The results in listing 6 are sorted in descending order of percent of wall-clock time consumed by an operation.
An even better approach would be to first run the ASH script in order to identify the most consuming operation and then, run the SQL monitoring script to get an overall view of the plan and get the more accurate execution time metrics, execution counts of each step etc. Imagine that in a very long plan, it is difficult for the human eye to catch the largest values in column "Time Active(s)".
So clearly SQL monitoring ans ASH have helped us easily and quickly identify the problem. Now, of course we have to find a fix.
How to fix the problem
In the previous section we showed how easy it is to identify the "bottleneck operations" of a running query with the help of SQL monitoring and/or ASH in version 11g. Once you identify the problem, then you have only traveled half-way because you also have to fix it. The fix of course depends by the specific case at hand (see some brief categorization of problems here).
For our running example the problem lies in the FILTER operation (operation 13 in Listing 4). The problematic full table scan (operation 44) is a child operation of this filter. The FILTER operation with two child sources is a clear indicator that the optimizer did not manage to unnest a subquery. After some careful examination of the SQL statement, we easily managed to find the pain point! It was an IN subquery after an OR condition. Something with the following form:
SELECT
FROM
WHERE
<condition1>
AND <condition2>
AND (
<confition3>
OR <condition 4>
OR <condition 5>
OR <values> IN (subquery)
)
Once we removed the bold part above from the problematic statement the query finished in 35 seconds! Actually as it turned out the this condition was redundant and could be discarded all together.
Summary
In this post we have shown a real-life example of how to identify efficiently the most time consuming operation in the execution plan of a running query that does not finish. We have used two great 11g features, namely:
- Real-time SQL Monitoring
- Active Session History (essentially the new columns added in 11g that reveal the operation running)
A similar example with the use of ASH we have posted in the past here.
If you have to tune queries as your daily job then you must learn to use SQL monitoring and ASH really good. They are great life saviors!
No comments:
Post a Comment