Do you have tables with a column storing filenames? Long filenames with full path? If this is the case, then you probably realized how an index on this can be large. And when looking at the values sorted, you have seen the inefficiency of it: a big part of the full name is reapeated because it has the same prefix for files in the same (sub)directory. The 12cR2 Advanced Index Compression (COMPRESS ADVANCED LOW) does not help here because it only compresses identical values, like the basic compression of tables. With unique filenames, we cannot expect any benefit. But 12cR2 introduced a more complex algorithm (COMPRESS ADVANCED HIGH) which can compress values while keeping the order.
You can read all details about it on Richard Foote blog: https://richardfoote.wordpress.com/category/advanced-index-compression/
My goal here is to see how it can benefit to this specific case: filenames all different but with repeated prefixes, and query on a range of it (such as all files under a subfolder)
Test case
As I'm running on the Autonomous Datawarehouse Cloud, I will disable parallel query (through parameter) and result cache(through hints)
SQL> column index_name format a20 SQL> set echo on pagesize 1000 linesize 120 SQL> alter session disable parallel query; Session altered. SQL> alter session set optimizer_ignore_hints=false; Session altered.
I create a table for my test. The 3 columns SPELL_NONE, SPELL_LOW, SPELL_HIGH are all the same. I use 3 columns just to be able to create different indexes at the same time and compare them:
SQL> create table DEMO_SPELL (num,spell_none,spell_low,spell_high) compress as select n,s,s,s from ( with n as ( select to_char(date'-4712-01-01'+rownum-1,'JSP') s from xmltable('1 to 25') ) select rownum n,'/'||n1.s||'/'||n2.s||'/'||n3.s||'/'||n4.s||'/'||n5.s||'/FILE'||mod(rownum,10000)||'.txt' s from n n1,n n2,n n3,n n4,n n5 ); Table DEMO_SPELL created.
This creates some filenames like the following, basically generating file names with a full path of 5 directories, each having 25 different values:
/ONE/EIGHTEEN/FOUR/EIGHTEEN/FOURTEEN/FILE7939.txt /TWO/SEVENTEEN/THIRTEEN/SEVEN/TWELVE/FILE8287.txt /FOUR/TWO/TWENTY-THREE/ONE/FOUR/FILE1254.txt
Index size
I'm now creating the indexes, one with no compression, one with 'advanced low' compression and one with 'advanced high' compression.
SQL> create index DEMO_SPELL_NONE on DEMO_SPELL(SPELL_NONE) pctfree 0; Index DEMO_SPELL_NONE created. SQL> create index DEMO_SPELL_LOW on DEMO_SPELL(SPELL_LOW) pctfree 0 compress advanced low; Index DEMO_SPELL_LOW created. SQL> create index DEMO_SPELL_HIGH on DEMO_SPELL(SPELL_HIGH) pctfree 0 compress advanced high; Index DEMO_SPELL_HIGH created.
CREATE INDEX is supposed to gather statistics so I can look at the size
SQL> select index_name,num_rows,compression,blevel,leaf_blocks,clustering_factor,visibility,pct_free from user_indexes where table_name='DEMO_SPELL'; INDEX_NAME NUM_ROWS COMPRESSION BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR VISIBILIT PCT_FREE -------------------- ---------- ------------- ---------- ----------- ----------------- --------- ---------- DEMO_SPELL_NONE 9765625 DISABLED 3 86183 963951 VISIBLE 10 DEMO_SPELL_LOW 9765625 ADVANCED LOW 3 86232 963951 VISIBLE 10 DEMO_SPELL_HIGH 9765625 ADVANCED HIGH 3 0 963951 VISIBLE 10
There's obviously a bug here for high compression. I expect it smaller, but not with zero leaves. I'll gather statistics later, but want to see if this has bad consequences for the optimizer. I run an index range scan. My goal is to use the index to find files belonging to the same directory.
Execution plan
SQL> select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_NONE like '/FOUR/TWO%'; COUNT(*) ---------- 15625 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last +cost'); PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------ SQL_ID 58t8mncyzzz5f, child number 0 ------------------------------------- select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_NONE like '/FOUR/TWO%' Plan hash value: 3383582924 --------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads | --------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 4 (100)| 1 |00:00:01.74 | 128 | 131 | | 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:01.74 | 128 | 131 | |* 2 | INDEX RANGE SCAN| DEMO_SPELL_NONE | 1 | 1 | 4 (0)| 15625 |00:00:00.01 | 128 | 131 | --------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("SPELL_NONE" LIKE '/FOUR/TWO%') filter("SPELL_NONE" LIKE '/FOUR/TWO%') 21 rows selected. SQL> select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_LOW like '/FOUR/TWO%'; COUNT(*) ---------- 15625 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last +cost'); PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------ SQL_ID 284699580541x, child number 0 ------------------------------------- select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_LOW like '/FOUR/TWO%' Plan hash value: 1340321241 -------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 4 (100)| 1 |00:00:01.55 | 129 | 128 | | 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:01.55 | 129 | 128 | |* 2 | INDEX RANGE SCAN| DEMO_SPELL_LOW | 1 | 1 | 4 (0)| 15625 |00:00:00.01 | 129 | 128 | -------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("SPELL_LOW" LIKE '/FOUR/TWO%') filter("SPELL_LOW" LIKE '/FOUR/TWO%') 21 rows selected. SQL> select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_HIGH like '/FOUR/TWO%'; COUNT(*) ---------- 15625 SQL> select * from dbms_xplan.display_cursor(format=>'allstats last +cost'); PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------ SQL_ID 1cmy31s1ny1a4, child number 0 ------------------------------------- select /*+ gather_plan_statistics no_result_cache index(DEMO_SPELL) */ count(*) from DEMO_SPELL where SPELL_HIGH like '/FOUR/TWO%' Plan hash value: 3948406928 --------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads | --------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 3 (100)| 1 |00:00:00.57 | 65 | 67 | | 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:00.57 | 65 | 67 | |* 2 | INDEX RANGE SCAN| DEMO_SPELL_HIGH | 1 | 1 | 3 (0)| 15625 |00:00:00.17 | 65 | 67 | --------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("SPELL_HIGH" LIKE '/FOUR/TWO%') filter("SPELL_HIGH" LIKE '/FOUR/TWO%') 21 rows selected.
I read 131 buffers to get my 15625 file names from the non-compressed index. Nearly the same, 128 buffers, for the 'low' compressed index. This is expected because this compression is about repeated value in blocks, but all my filenames are unique. They have only some part in common. The 'high' compressed index read 2x less buffers, 67 here.
Gather statistic
Because of the bug in statistics, I gather statistics:
SQL> exec dbms_stats.gather_table_stats(user,'DEMO_SPELL'); PL/SQL procedure successfully completed.
Now checking the size
SQL> select index_name,compression,blevel,leaf_blocks,clustering_factor,visibility,pct_free from user_indexes where table_name='DEMO_SPELL'; INDEX_NAME COMPRESSION BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR VISIBILIT PCT_FREE -------------------- ------------- ---------- ----------- ----------------- --------- ---------- DEMO_SPELL_NONE DISABLED 3 91292 1111506 VISIBLE 10 DEMO_SPELL_LOW ADVANCED LOW 3 85578 1036872 VISIBLE 10 DEMO_SPELL_HIGH ADVANCED HIGH 3 38821 997572 VISIBLE 10
This looks correct and follows what we have seen in the queries: no significant improvement for 'compress low' here but 2x smaller with 'compress high'.
Storing full file names is always a problem. If you have some Documentum database, you are probaly familiar with this. It is large, has lot of parts duplicated but cannot benefit from basic compression. It is also badly managed by histograms as they act on a substring only (32 bytes in 11g, 64 bytes in 12c). This is something to think about at design time. A file is a file and a directory is a directory. And we have awesome CONNECT BY or recursive CTE to get the full path from the hierarchy.
Substring
If we can't change the design, then we can look at some workarounds. Let's say that I know that I'll often query with a known prefix less than 10 characters. I can build an index that will find the rows without having to store the whole file name:
SQL> create index DEMO_SPELL_SUBSTR on DEMO_SPELL(substr(SPELL_NONE,1,10)) compress advanced high; Index DEMO_SPELL_SUBSTR created.
This index is obviously smaller:
SQL> exec dbms_stats.gather_table_stats(user,'DEMO_SPELL'); PL/SQL procedure successfully completed. SQL> select index_name,compression,blevel,leaf_blocks,clustering_factor,visibility,pct_free from user_indexes where table_name='DEMO_SPELL'; INDEX_NAME COMPRESSION BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR VISIBILIT PCT_FREE -------------------- ------------- ---------- ----------- ----------------- --------- ---------- DEMO_SPELL_SUBSTR ADVANCED HIGH 2 2386 88769 VISIBLE 10
However, I can't query it with the LIKE statement. See https://blog.dbi-services.com/index-on-truncdate-do-you-still-need-old-index-1/ about it. The optimizer cannot infer a BETWEEN predicate from it and I have to do it myself:
SQL> select /*+ gather_plan_statistics no_result_cache no_parallel index(DEMO_SPELL DEMO_SPELL_SUBSTR) */ count(*) from DEMO_SPELL where SPELL_NONE between '/FOUR/TWO' and '/FOUR/TWO~'; COUNT(*) ---------- 15625
The execution plan shows the usage of my new index:
SQL> select * from dbms_xplan.display_cursor(format=>'allstats last +cost'); PLAN_TABLE_OUTPUT ---------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID dzrf2c3j00ga1, child number 0 ------------------------------------- select /*+ gather_plan_statistics no_result_cache no_parallel index(DEMO_SPELL DEMO_SPELL_SUBSTR) */ count(*) from DEMO_SPELL where SPELL_NONE between '/FOUR/TWO' and '/FOUR/TWO~' Plan hash value: 2610913085 --------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers | --------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 403 (100)| 1 |00:00:00.01 | 135 | | 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:00.01 | 135 | |* 2 | TABLE ACCESS BY INDEX ROWID BATCHED| DEMO_SPELL | 1 | 1 | 403 (1)| 15625 |00:00:00.01 | 135 | |* 3 | INDEX RANGE SCAN | DEMO_SPELL_SUBSTR | 1 | 43945 | 3 (34)| 15625 |00:00:00.01 | 7 | --------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(("SPELL_NONE"<='/FOUR/TWO~' AND "SPELL_NONE">='/FOUR/TWO')) 3 - access("DEMO_SPELL"."SYS_NC00005$">='/FOUR/TWO' AND "DEMO_SPELL"."SYS_NC00005$"<='/FOUR/TWO~')
The access to the index is cheaper, because it is smaller, but I have to go to the table to get the full name. This is just an example where I query only one column. A real life problem will probably have to read the table anyway. Note however that this solution is optimal only if all filtering is done from the index. With a prefix larger than the substring, the risk is to do unnecessary accesses to the table for rows that will be filtered out later.
Of course, all this really depends on the shape of data, and performance requirements, so the idea of this post is not to give solutions, but show what can be tested. But always try to challenge the design. Relational databases are build to avoid redundancy. The relationship between files, directories, their names and hierarchy is something that can be modeled in one table with self reference. Compression should not be a solution for redundancy introduced by wrong design.
The comments are closed on this post, but do not hesitate to give feedback, comment or questions on Twitter (@FranckPachot)