Oracle LIKE predicate and cardinality estimations

Submitted by fpachot on
Blog article:

There are not many ways to access efficiently to table rows. Either you want lot of them, because your predicate is not very selective, and you read the whole table in the fastest you can do. This is Table Full Scan. Or you use a structure that gives you access to the subset of rows you need. There are mostly two structures for that: sort and hash.
In a sorted structure you can go directly to a range of value: you seek to the first value, read and stop at the last value. This can be a permanent structure like an index, or a range partitioning. Or a temporary build one like the buffer workarea used by of a sort merge join. This works for a range (inequality) or a unique value (equality).
In a hashed structure you can go directly to a unique value, but not a range. This can be a permanent one, like a hash cluster or hash partitioning, or build temporarily like the workarea for a hash join.

The LIKE predicate can be an equality (when you have no wildcard in it), a range (when not prefixed with a wildcard) or nothing, when starting with a pattern. For example:

  • LIKE '12345' is an equality similar to: ='12345'
  • LIKE '1%' is a range, similar to: >='1' AND <'2'
  • LIKE '%5' needs a full scan and no structure can help

The '%' replaces many characters, like the '.*' regexp or '*' shell expansion, and '_' replaces one character, like the '.' regexp or '?' shell expansion. When you use a user input for a LIKE expression, you should always check if those characters were entered as wildcard or as their character, and then you need to escape them.

This post is about '_' in a name used with LIKE in an application which forgot to escape it. As an example, if you want to query DBA_OBJECTS where OBJECT_NAME like 'DBMS_LOCK' you take the risk that one day an object like 'DBMS0LOCK' is created and will also be returned by your query. You have have then a risk of wrong result.

But even if you are sure that this will never happen, doing this will lead to bad execution plans because the optimizer processes it as a wildcard which may have multiple matches.

This long introduction was to set the context where I did a simple test to verify the cardinality estimations.

 


22:00:10 SQL> create table demo as select to_char(rownum,'FM999999999') text from xmltable('1 to 9999999');

Table created.

22:00:21 SQL> exec dbms_stats.gather_table_stats(user,'DEMO',estimate_percent=>100,no_invalidate=>false);

PL/SQL procedure successfully completed.

I have 9999999 rows with values from '1' to '9999999' as text and I am ready to run some LIKE predicates on it. The goal is to see the execution plan cardinality estimation.

 

LIKE '%'


22:00:41 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '%';

MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
1           9999999        9999999 


22:00:42 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');

SQL_ID  0m1dd0mbnpmt6, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '%'                                                                
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:01.08 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:01.08 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |   9999K|   9999K|00:00:01.04 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):       
---------------------------------------------------       
   2 - filter(("TEXT" LIKE '%' AND "TEXT" IS NOT NULL))   

LIKE '%' matches all rows which are not null. The estimation is correct here: 9999999

 

LIKE '10001'

 

                                                          
22:00:42 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '10001';
MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
10001       10001                1 

22:00:42 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  4ykxf8pb2gruw, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '10001'                                                            
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.15 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.15 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |      1 |      1 |00:00:00.15 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):   
---------------------------------------------------   
   2 - filter("TEXT"='10001')                         
                                                      

LIKE '10001' matches only one row. The estimation is correct here. We can see in the predicate note that the optimizer replaced it with an equality as there are no wildcards.

 

LIKE '1%'

 


22:00:42 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '1%';

MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
1           1999999        1111111 

22:00:43 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  g3ma4ghym61bq, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '1%'                                                               
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.62 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.62 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |   1216K|   1111K|00:00:00.14 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):   
---------------------------------------------------   
   2 - filter("TEXT" LIKE '1%')                       
                                                      

LIKE '1%' matches 1, 10-19, 100-199, 1000-1999, 10000-19999... The estimation is correct: 1111K for actually 1216K rows. This one cannot be served by a range scan, as it covers several ranges.

 

LIKE '1____'

 


22:00:43 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '1____';

MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
10000       19999            10000 

22:00:43 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  4jutzqyfj1zhz, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '1____'                                                            
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.51 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.51 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |   1216K|  10000 |00:00:00.01 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):   
---------------------------------------------------   
   2 - filter("TEXT" LIKE '1____')                    

LIKE '1____' matches 10000-19999 and the estimation is not correct at all here: 1216K which is 12% of the total rows, but the reality is 0.1% only. Actually we can see that the optimizer considered '1____'exactly the same as '1%' in the previous test. This estimates for 'all values starting by 1' without considering that the length specified with the underscore wildcards cannot return all of them.

 

LIKE '1_001'

 


22:00:43 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '1_001';
MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
10001       19001               10 

22:00:44 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  9x0d1j3taquuu, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '1_001'                                                            
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.51 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.51 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |   1216K|     10 |00:00:00.01 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):   
---------------------------------------------------   
   2 - filter("TEXT" LIKE '1_001')                    

This is the case I encountered with the application which overlooked that '_' is a wildcard. LIKE '1_001' matches only 10 rows here. In a single-byte characterset, it cannot match more than 255 different values. In any case, it cannot match all rows. But the optimizer considers this again as the same a '1%' and over-estimates to 1216K. This difference can lead to poor optimization on a query with several joins.

 

LIKE '1\_001' ESCAPE '\'

The solution is, of course, to escape the wildcard if not expected as a wildcard.


22:00:44 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '1\_001' escape '\';

MIN(TEXT)   MAX(TEXT)     COUNT(*) 
---------   ---------     --------
                                 0 

22:00:44 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  g2w6b5j87zfhr, child number 0                                                   
-------------------------------------                                                   
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from                  
DEMO where text like '1\_001' escape '\'                                                
-------------------------------------------------------------------------------------   
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
-------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.13 |   17842 |   
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.13 |   17842 |   
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |      1 |      0 |00:00:00.13 |   17842 |   
-------------------------------------------------------------------------------------   
Predicate Information (identified by operation id):   
---------------------------------------------------   
   2 - filter("TEXT" LIKE '1_001' ESCAPE )            
                                                      

Here in my example, I have no rows with this exact '1_001' value (A-Rows=0) and the estimation is E-Rows=1 because the optimizer never put 0. This is correct.

 

Dynamic sampling

If the wildcard is there on purpose, you cannot rely on the optimizer estimation. When you reach the limit of static statistics, the solution is dynamic sampling. Here I just delete the statistics to force it.


22:00:44 SQL> exec dbms_stats.delete_table_stats(user,'DEMO',no_invalidate=>false);
PL/SQL procedure successfully completed.

22:00:46 SQL> select /*+ gather_plan_statistics */ min(text),max(text),count(*) from DEMO where text like '1_001';
MIN(TEXT)   MAX(TEXT)     COUNT(*)
---------   ---------     --------
10001       19001               10

22:00:46 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last');
SQL_ID  9x0d1j3taquuu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ min(text),max(text),count(*) from
DEMO where text like '1_001'
-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.55 |   17843 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.55 |   17843 |
|*  2 |   TABLE ACCESS FULL| DEMO |      1 |    196 |     10 |00:00:00.01 |   17843 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("TEXT" LIKE '1_001')
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

Dynamic sampling has applied the LIKE predicate as-is on a sample of blocks. Estimation is 196 where actual count is 10. This is not too far and will give better execution plan than the static statistics estimate of 1216K.

So what?

It looks like the CBO stops at the first wildcard in a LIKE pattern and calculates the cardinality of the range defined by this prefix. This can be misleading in some cases. If you want to see the example, just query DBA_OBJECTS for the object name 'DBA_OBJECT'. You expect 2 (the view and the public synonym) but the optimized expects 1346 - the number of objects starting with 'DBA':


  1* select count(*) from OBJ$ where name LIKE 'DBA_OBJECTS'
SQL> /

  COUNT(*)
----------
         2

Execution Plan
----------------------------------------------------------
Plan hash value: 2699154286

--------------------------------------------------------------------------------
| Id  | Operation             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |     1 |    19 |    41   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |        |     1 |    19 |            |          |
|*  2 |   INDEX FAST FULL SCAN| I_OBJ2 |  1346 | 25574 |    41   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("NAME" LIKE 'DBA_OBJECTS')

 

The comments are closed on this post, but do not hesitate to give feedback, comment or questions on Twitter (@FranckPachot)

Submitted by Rainer Stenzel (not verified) 5 years ago

Somehow the prefix length is taken into consideration too:

select * from CUSTOMER_BEHAVIOR
union all
select * from CUSTOMER_BEHAVIOR where NAME like 'a_'
union all
select * from CUSTOMER_BEHAVIOR where NAME like 'ab_'
union all
select * from CUSTOMER_BEHAVIOR where NAME like 'abc_'
union all
select * from CUSTOMER_BEHAVIOR where NAME like 'abcd_'

Plan hash value: 3540635705

----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14M| 2227M| 464K (1)| 00:00:19 |
| 1 | UNION-ALL | | | | | |
| 2 | TABLE ACCESS FULL| CUSTOMER_BEHAVIOR | 14M| 2195M| 92901 (1)| 00:00:04 |
|* 3 | TABLE ACCESS FULL| CUSTOMER_BEHAVIOR | 203K| 31M| 92822 (1)| 00:00:04 |
|* 4 | TABLE ACCESS FULL| CUSTOMER_BEHAVIOR | 794 | 125K| 92820 (1)| 00:00:04 |
|* 5 | TABLE ACCESS FULL| CUSTOMER_BEHAVIOR | 4 | 648 | 92820 (1)| 00:00:04 |
|* 6 | TABLE ACCESS FULL| CUSTOMER_BEHAVIOR | 1 | 162 | 92820 (1)| 00:00:04 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("NAME" LIKE 'a_')
4 - filter("NAME" LIKE 'ab_')
5 - filter("NAME" LIKE 'abc_')
6 - filter("NAME" LIKE 'abcd_')

table name masking inspiration from
https://dilbert.com/strip/2000-11-13

Disclaimer

The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.

CERN Social Media Guidelines

 

Blogroll