Alex Fatkulin's Blog

Exadata 12c PX Adaptive Offloading

noreply@blogger.com (Alex Fatkulin) — Sat, 15 Jul 2017 18:44:00 +0000

Here is yet another case when you may not see as much offloading on your Exadata as you expect.

I was recently investigating a simple select count(*) query producing a lot of buffered read events:

select count(*) from whs.trans partition (sys_p10236650) f

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  cell single block physical read                41        0.00          0.04
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           24        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.06
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.08          1.57
  PX Deq: Signal ACK EXT                         24        0.01          0.01
  PX Deq: Slave Session Stats                    24        0.00          0.00
  cursor: pin S wait on X                        23        0.01          0.18
  PX Deq: Execution Msg                         362        0.25          5.12
  cell multiblock physical read                3402        0.03          9.07
  gc current grant busy                           5        0.00          0.00
  latch: gc element                              60        0.00          0.00
  gc cr grant 2-way                             136        0.00          0.03
  gc cr multi block request                     131        0.00          0.06
  cell smart table scan                        7714        0.02          2.27
  library cache: mutex X                         15        0.00          0.00
  library cache pin                               3        0.00          0.00
  latch: cache buffers lru chain                 13        0.00          0.00
  latch: ges resource hash list                   1        0.00          0.00
  gc current block 2-way                          1        0.00          0.00
  cell list of blocks physical read               2        0.00          0.01
  gc cr grant congested                           2        0.00          0.00
  latch: object queue header operation            2        0.00          0.00
  latch: gcs resource hash                        3        0.00          0.00
********************************************************************************

As you can see we've got 3402 cell multiblock physical read and 41 cell single block physical read alongside 7714 cell smart table scan events. I looked at all the usual smart scan preventing suspects but came up with nothing.

Tracing

In order to further understand the source of buffered reads I've enabled the following traces:

alter session set events 'trace[nsmtio]';
alter session set "_px_trace" = high,granule,high,execution;

What I discovered in the trace was quite interesting. For smart scans the following sections were present in the trace:

2017-07-15 13:54:17.777305*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:1 offload enabled:1 ratio:0.800000 cellrate:0.000000 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[DirectRead]: Buffer Cache Consumption rate of granule is less and direct read enabled.
...
WAIT #140442514780840: nam='cell smart table scan' ela= 208 cellhash#=459971463 p2=0 p3=0 obj#=36324502 tim=13309302766869
WAIT #140442514780840: nam='cell smart table scan' ela= 224 cellhash#=83369134 p2=0 p3=0 obj#=36324502 tim=13309302767638
WAIT #140442514780840: nam='cell smart table scan' ela= 157 cellhash#=3883045144 p2=0 p3=0 obj#=36324502 tim=13309302768329
...

And for buffered reads the trace had the following:

2017-07-15 13:54:18.151520*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:0 offload enabled:1 ratio:0.800000 cellrate:1320.421053 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[GRANULE_AFFINITIZED]: Either size is small OR medium and can be cached.
...
WAIT #140442514780840: nam='cell multiblock physical read' ela= 2057 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303143810
WAIT #140442514780840: nam='cell multiblock physical read' ela= 3162 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303151338
WAIT #140442514780840: nam='cell multiblock physical read' ela= 1383 cellhash#=2689352169 diskhash#=3123344858 bytes=450560 obj#=36324502 tim=13309303156489
...

The above sections were alternating for each granule received. After receiving a granule the entire granule was processed using either smart scans or buffered reads. It also appeared that the decision was driven by the kxfxContAdaptOff (Control Adaptive Offloading?) function. Notice kxfxContAdaptOff emitting cellrate statistic into the trace as well. This got me thinking that it might be checking for storage cells workload (or something similar) and using it as one of the inputs to arrive at the decision which way to process a granule.

PX Adaptive Offloading

With a little bit of outside help kxfxContAdaptOff function lead us to two underscore parameters:

SQL> select ksppinm, ksppstvl, ksppstdfl, ksppdesc
  2   from x$ksppi x, x$ksppcv y
  3   where (x.indx = y.indx)
  4    and ksppinm like '_px_adaptive_offload%';
KSPPINM                         KSPPSTVL KSPPSTDFL KSPPDESC
------------------------------- -------- --------- --------------------------------------------------------------------------------
_px_adaptive_offload_threshold  10       10        threshold (GB/s) for PQ adaptive offloading of granules
_px_adaptive_offload_percentage 80       30        percentage for PQ adaptive offloading of granules

The description of these parameters appeared to match the behavior I was seeing. It appeared that there was a threshold and a percentage controlling how much granules were subjected to adaptive offloading.

_px_adaptive_offload_threshold

I have repeated my test using _px_adaptive_offload_threshold=0 and here are the events I've got:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  library cache pin                               2        0.00          0.00
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           39        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.02
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.23          4.46
  PX Deq: Signal ACK EXT                         24        0.06          0.07
  PX Deq: Slave Session Stats                    24        0.00          0.00
  library cache lock                              1        0.00          0.00
  PX Deq: Execution Msg                         362        0.47         14.54
  cell multiblock physical read               14536        0.04         44.74
  gc current grant busy                          24        0.00          0.01
  gc cr grant 2-way                             626        0.00          0.11
  gc cr multi block request                     553        0.00          0.18
  cell single block physical read              3459        0.02          2.69
  latch: gcs resource hash                       29        0.00          0.01
  latch: gc element                             149        0.00          0.02
  latch: cache buffers lru chain                 42        0.00          0.01
  latch: cache buffers chains                     4        0.00          0.00
  latch: object queue header operation            3        0.00          0.00
  gc cr grant congested                           1        0.00          0.00
********************************************************************************

Notice complete absence of smart scans -- everything got processed using buffered reads! This confirmed that I was on the right track. And here are the events with _px_adaptive_offload_threshold=1000:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: PS - contention                           29        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.03
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.09          1.37
  PX Deq: Signal ACK EXT                         24        0.08          0.09
  PX Deq: Slave Session Stats                    24        0.00          0.00
  PX Deq: Execution Msg                         362        0.45         16.56
  cell multiblock physical read                2018        0.10          7.04
  gc current grant busy                           5        0.00          0.00
  gc cr grant 2-way                              57        0.00          0.00
  gc cr multi block request                      41        0.00          0.01
  cell smart table scan                        8641        0.10          5.23
  cell single block physical read               602        0.02          0.58
  library cache pin                               2        0.00          0.00
  latch: cache buffers lru chain                  2        0.00          0.00
  latch: gc element                               4        0.00          0.00
  latch: object queue header operation            1        0.00          0.00
  library cache: mutex X                          1        0.00          0.00
********************************************************************************

Notice that while I've got more smart scans the buffered reads were still there as well. Interestingly enough I could not fully eliminate buffered reads no matter how high I've set this parameter to. This likely means that there are more variables at play here and/or that _px_adaptive_offload_percentage dictates that a certain amount of granules always gets processed using buffered reads.

I'm not entire sure what _px_adaptive_offload_percentage does at the moment as setting it to one value or the other did not appear to make any changes. While this is a guess on my part the default value (80) matches ratio:0.800000 reported by kxfxContAdaptOff and it appears to be hard coded in the function.

Final thoughts

Clearly 12c got some new code introduce which can dynamically make a decision on how to process each PX granule. The unfortunate part is none of this appears to be documented and it looks like quite a big change which warrants some sort of an explanation.

Updated a little bit later...

It appears that setting _px_adaptive_offload_threshold=0 disables Adaptive Offloading after all. The reason I've got only buffered reads when I set it to 0 was due to the system running with Auto DOP so In-Memory PQ kicked in. When set together with _parallel_cluster_cache_policy=adaptive it reverted everything to the expected behavior -- everything got processed using smart scans.

12.2 ACFS compression, part I

noreply@blogger.com (Alex Fatkulin) — Fri, 07 Jul 2017 18:06:00 +0000

One of the new 12.2 features is the ability to transparently compress ACFS filesystems.

Compression is enabled using acfsutil utility:

[root@raca1 ~]# acfsutil -h compress on
Usage: acfsutil [-h] compress on [-a ] 
                              - Set default compression algorithm
                                Currently only 'lzo' available
                 - Enable compression on volume

Clearly there is support for more compression algorithms to be added in the future but right now only lzo is supported.

Let's go ahead and enable compression on the ACFS filesystem I have:

[root@raca1 ~]# acfsutil compress on /u02/oradata

Compression status can be checked using acfsutil info fs:

[root@raca1 ~]# acfsutil info fs /u02/oradata
/u02/oradata
    ACFS Version: 12.2.0.1.0
    on-disk version:       46.0
    compatible.advm:       12.2.0.0.0
    ACFS compatibility:    12.2.0.0.0
    flags:        MountPoint,Available
    mount time:   Fri Jul  7 12:53:39 2017
    mount sequence number: 0
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   8589934592  (   8.00 GB )
    total free:   6935588864  (   6.46 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/data-95
        label:
        state:                 Available
        major, minor:          250, 48641
        logical sector size:   4096
        size:                  8589934592  (   8.00 GB )
        free:                  6935588864  (   6.46 GB )
        metadata read I/O count:         23833
        metadata write I/O count:        39103
        total metadata bytes read:       257896448  ( 245.95 MB )
        total metadata bytes written:    421969920  ( 402.42 MB )
        ADVM diskgroup:        DATA
        ADVM resize increment: 67108864
        ADVM redundancy:       unprotected
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: ENABLED

As a quick test I've created a new empty database inside ACFS filesystem. We can now check compressed file sizes:

[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/sysaux01.dbf
Compression Unit size: 8192
Disk storage used:   (  88.70 MB )
Disk storage saved:  ( 461.31 MB )
Storage used is 16% of what the uncompressed file would use.
File is not scheduled for asynchronous compression.
[root@raca1 ora12cr2]# ls -l /u02/oradata/ora12cr2/sysaux01.dbf
-rw-r----- 1 oracle dba 576724992 Jul  7 13:45 /u02/oradata/ora12cr2/sysaux01.dbf

It is worth noting that not all file types are compressed. For example redo logs are left uncompressed:

[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/redo01.log
The file /u02/oradata/ora12cr2/redo01.log is not compressed.

So far so good -- the feature indeed allows you to save some space by compressing the files. I will continue exploring this capability in the next part.

Oracle SuperCluster M7 SLOB LIO Tests vs Intel Xeon E5-2699 V4

noreply@blogger.com (Alex Fatkulin) — Mon, 22 May 2017 16:54:00 +0000

Here are some SLOB LIO figures from a DB zone configured with 16 threads running on an Oracle SuperCluster M7 hardware. For comparison I've also included numbers from an Intel Xeon E5-2699 V4 CPU.

It makes sense to mention that this is not exactly a fair comparison -- a single SPARC M7 core has 8 threads associated with it so my zone is able to utilize a total of two SPARC M7 cores (16 threads total with 8 threads per core). E5-2699 V4 is currently top of the line Intel CPU core packed model with 22 cores. So we're comparing two SPARC M7 cores vs 16 E5-2699 cores. It does however help answer the question -- if you're running on a certain number of Intel cores what kind of performance can you expect when you move over to a heavy threaded M7 if you transfer your workload "as is"?

Below are the results:

The first thing worth mentioning is that M7 still exhibits a large deficit when it comes to single threaded performance -- a single SPARC M7 thread is about 60% performance compared to an E5-2699 V4 core (which is not even a top bin CPU when it comes to frequency). Throughput is a different story -- two M7 cores are almost able to match four E5-2699 V4 cores thanks to a heavily threaded design.

ORA-15410: Disks in disk group do not have equal size

noreply@blogger.com (Alex Fatkulin) — Thu, 23 Jul 2015 19:25:00 +0000

12.1.0.2 ASM introduced a new feature which might prevent you from adding dissimilar size disks into a normal or high redundancy disk groups. The relevant MOS note is Doc ID 1938950.1.

Unfortunately I've found that some information in this note requires further clarification.

COMPATIBLE.ASM

The note suggests that the check is only enforced when the Disk Group has 'COMPATIBLE.ASM'='12.1.0.2' attribute set. This is incorrect and can be easily demonstrated:

SQL> select name, compatibility from v$asm_diskgroup;

NAME                           COMPATIBILITY
------------------------------ --------------------
DATA                           11.2.0.3.0

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.

As you can see my Disk Group has 'COMPATIBLE.ASM'='11.2.0.3.0' which did not prevent it from getting ORA-15410.

What happens if you have disks of dissimilar sizes already in the Disk Group?

Let's say you've upgraded from a previous release and your normal/high redundancy disk group had disks of dissimilar sizes in it already. What happens when you try to add a disk after the upgrade to 12.1.0.2 ASM? Some of the questions that immediately jump to mind:

Will it always rise an error because the existing disks have different sizes forcing you to make everything of the same size first?
Will it let you add a disk as long as the size is the same to one of the disks you already have?
Will it let you add a disk with a new size?

Here I have a 12.1.0.2 normal redundancy Disk Group with two disks of dissimilar size:

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120

Let's add a third disk of some other size:

SQL> alter diskgroup data add disk '/dev/sdd' size 2g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        2048

As you can see, as long as you have disks of dissimilar sizes already in the disk group, the check appears to be ignored. I can resize the disks as well:

SQL> alter diskgroup data resize disk DATA_0002 size 4g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;  2    3    4    5    6    7    8

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        4096

Now what happens if I drop DATA_0001 thus making all disks to be of equal size?

SQL> alter diskgroup data drop disk DATA_0001;

Diskgroup altered.

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.

It appears that the moment the Disk Group got all disks of the same size the check got enforced.

Conclusion

Based on the tests I've demonstrated above the rules for ORA-15410 on 12.1.0.2 ASM appears to be:

If all the disks in a Disk Group are of the same size then the check is enforced, regardless of what 'COMPATIBLE.ASM' value is.
If a Disk Group contains disks of different sizes then the check is ignored, regardless of what 'COMPATIBLE.ASM' value is.

(keeping in mind that the check can only happen for normal/high redundancy Disk Groups)

Wrong Results

noreply@blogger.com (Alex Fatkulin) — Thu, 16 Jul 2015 00:03:00 +0000

It is interesting how a combination of technologies in Oracle can play in a way which produce a seemingly bizarre outcomes.

Consider the following query:

SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);

no rows selected

Note that I'm doing a left outer join, however, the query somehow managed to loose a single row I have in the subquery factoring (and just in case you're wondering I've used subquery factoring for simplicity and replacing it with a real table makes no difference).

The first reaction is how could something as simple as this go so terribly wrong?

Let's take a look at the plan:

SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);  2    3    4    5    6

Execution Plan
----------------------------------------------------------
Plan hash value: 627307704

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |     1 |     9 |     6  (17)| 00:00:01 |
|   1 |  HASH UNIQUE                    |      |     1 |     9 |     6  (17)| 00:00:01 |
|*  2 |   FILTER                        |      |       |       |            |          |
|*  3 |    HASH JOIN OUTER              |      |     1 |     9 |     5   (0)| 00:00:01 |
|   4 |     VIEW                        |      |     1 |     3 |     2   (0)| 00:00:01 |
|   5 |      FAST DUAL                  |      |     1 |       |     2   (0)| 00:00:01 |
|*  6 |     MAT_VIEW REWRITE ACCESS FULL| MV_T |    10 |    60 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MV_T"."GID"=0)
   3 - access("V"."N"="MV_T"."N"(+))
   6 - filter("MV_T"."N"(+) IS NOT NULL)

As it turns out the query rewrite is in play but the real oddity lies in the Predicate Information section. Line #2 reads filter("MV_T"."GID"=0) which took me by surprise (where did it come from?) as well as explained why the row went missing. A predicate like that essentially turned our query into an inner join!

The answer lies in how this materialized view was created and the cool trick the optimizer tried to do which didn't quite work out. Indeed, I have created the table and the materialized view in the following way:

SQL> create table t as
  2   select mod(level, 10) n, level m
  3    from dual
  4    connect by level <= 1000;
Table created

SQL> create materialized view mv_t enable query rewrite as
  2   select n, sum(m), grouping_id(n) gid
  3    from t
  4    group by rollup(n);
Materialized view created

The materialized view has two levels of aggregation due to the use of a rollup. I'm also using a grouping_id function to identify each grouping level. So what the optimizer tried to do is apply the GID = 0 predicate in order to eliminate the rollup data but, unfortunately, it also turned our query into an inner join equivalent as a result (final query taken from the optimizer trace):

SELECT DISTINCT "V"."N" "N"
 FROM (SELECT 20 "N" FROM "SYS"."DUAL" "DUAL") "V", "ROOT"."MV_T" "MV_T"
 WHERE "MV_T"."GID" = 0
  AND "V"."N" = "MV_T"."N"(+)

Of course the correct predicate in this case should be "MV_T"."GID" (+) = 0.

I have tried the above test case on both 11.2.0.3 and 12.1.0.2.0 with both versions producing the same wrong results.

Converting non-CDB database to a PDB when TDE is in use

noreply@blogger.com (Alex Fatkulin) — Sat, 07 Mar 2015 23:52:00 +0000

Converting a non-CDB database to a PDB is a rather straightforward process. However once TDE (Transparent Data Encryption) gets involved certain things become not so obvious so I've decided to write a small guide on how to accomplish that. In order for a non-CDB database to be converted to a PDB it's version needs to be at least 12.1.0.2.0.

Encryption Wallet Location

My encryption wallet location is set to the following (sqlnet.ora):

ENCRYPTION_WALLET_LOCATION=
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /u01/app/oracle/admin/$ORACLE_SID/wallet/tde)
    )
  )

Create a non-CDB database

Let's start by creating a non-CDB database. I will call this database db1 and it will later be converted to a pluggable database called pdb1:

dbca -silent \
-createDatabase \
-templateName New_Database.dbt \
-gdbName db1 \
-createAsContainerDatabase false \
-sysPassword oracle \
-systemPassword oracle \
-emConfiguration none \
-datafileDestination /u02/oradata \
-redoLogFileSize 128 \
-recoveryAreaDestination /u02/fra \
-storageType FS \
-characterSet al32utf8 \
-nationalCharacterSet al16utf16 \
-automaticMemoryManagement false \
-initParams filesystemio_options=setall \
-initParams session_cached_cursors=100 \
-totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete

Since I have PSU2 applied I need to run datapatch once the database has been created:

[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? db1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose

Create the wallet and encrypted table

We're now ready to setup the wallet and create a user with an encrypted table:

[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/db1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 15:51:21 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter system set encryption key identified by "db1";

System altered.

SQL> alter system set db_create_file_dest='/u02/oradata';

System altered.

SQL> create tablespace encrypted datafile size 64m encryption using 'AES256' default storage (encrypt);

Tablespace created.

SQL> create user encrypt_user identified by "encrypt_user" default tablespace encrypted;

User created.

SQL> alter user encrypt_user quota unlimited on encrypted;

User altered.

SQL> create table encrypt_user.z_encrypted as
        select dbms_random.string('x', 100) s
                from dual
                connect by level <= 100;  2    3    4

Table created.

Note that I did not use the new administer key management syntax on purpose and instead created the wallet the old style way as I expect most of the existing environments to be the same. We will explore the significance of this in a moment.

Create a CDB

Before our non-CDB database can be converted to a PDB and plugged into a CDB we need to create the container database first:

[oracle@ora12cr1 ~]$ dbca -silent \
> -createDatabase \
> -templateName New_Database.dbt \
> -gdbName cdb12cr1 \
> -createAsContainerDatabase true \
> -sysPassword oracle \
> -systemPassword oracle \
> -emConfiguration none \
> -datafileDestination /u02/oradata \
> -redoLogFileSize 128 \
> -recoveryAreaDestination /u02/fra \
> -storageType FS \
> -characterSet al32utf8 \
> -nationalCharacterSet al16utf16 \
> -automaticMemoryManagement false \
> -initParams filesystemio_options=setall \
> -initParams session_cached_cursors=100 \
> -totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete

As before I need to run the datapatch utility:

[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? cdb12cr1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose

Setup CDB with a wallet

Our newly created CDB needs to be setup with a wallet before we proceed with plugging a PDB which utilizes TDE. Let's do it now:

[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/cdb12cr1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 17:23:58 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> administer key management
        create keystore '/u01/app/oracle/admin/cdb12cr1/wallet/tde'
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set key identified by "cdb12cr1"
        with backup;  2    3

keystore altered.

Here I have setup the wallet and the master encryption key using the new syntax.

Wallets

Before we move forward let's explore the differences between db1 (created using the old syntax) and cdb12cr1 (created using the new syntax) wallets:

[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
Trusted Certificates:
[oracle@ora12cr1 ~]$  orapki wallet display -wallet /u01/app/oracle/admin/cdb12cr1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Trusted Certificates:

Notice how db1 has the encryption key listed under ORACLE.SECURITY.DB.ENCRYPTION while cdb12cr1 in addition has the encryption key listed under the new ORACLE.SECURITY.KM.ENCRYPTION as well.

In practice what I found is unless your encryption key is listed under ORACLE.SECURITY.KM.ENCRYPTION it will not be exported when doing administer key management export keys. As a result when you import the export file the master key required to decrypt the data will not be there potentially leaving you in a peculiar situation especially if you have used an in-place conversion. Even if you specifically export the key using the with identifier in syntax you will not be able to import the key because specific key exports are not permitted to be imported into the PDBs.

Add ORACLE.SECURITY.KM.ENCRYPTION to db1's wallet

This step is only required if you created the wallet without using the new administer key management syntax. Re-keying the wallet will generate a new master key preserving the old master key necessary to decrypt the data while adding missing ORACLE.SECURITY.KM.ENCRYPTION entries at the same time. Execute while connected to db1:

SQL> administer key management
        set key identified by "db1"
        with backup;  2    3

keystore altered.

If we look at the wallet we can see that it now contains the necessary entires:

[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.MASTERKEY
Trusted Certificates:

Prepare db1 to be converted into a PDB

Before db1 can be plugged into a container database it needs to be converted to a PDB and it's encryption keys exported. Shutdown db1 and open it in read only:

SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      db1
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount restrict
ORACLE instance started.

Total System Global Area 1241513984 bytes
Fixed Size                  2923872 bytes
Variable Size             452985504 bytes
Database Buffers          771751936 bytes
Redo Buffers               13852672 bytes
Database mounted.
SQL> alter database open read only;

Database altered.

We can now export the encryption keys:

SQL> administer key management
        set keystore open
        identified by "db1";  2    3

keystore altered.

SQL> administer key management
        export keys with secret "db1"
        to '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "db1";  2    3    4

keystore altered.

Note that in order for the export operation to work the wallet needs to be explicitly opened with a password otherwise you will receive an error. In case of an auto login (local) wallet you will have to close the wallet and reopen it with a password.

The next step is to generate metadata necessary for PDB conversion and shutdown db1 database:

SQL> exec dbms_pdb.describe(pdb_descr_file => '/u01/app/oracle/db1.xml');

PL/SQL procedure successfully completed.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.

Convert and plug db1 into a CDB

We can now login into cdb12cr1 and perform in-place conversion of db1 which does not requires any datafiles to be copied. Of course if something were to go wrong with the conversion process you might end up in a situation where you need to restore your original database from a backup so use this approach with care.

SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      cdb12cr1
SQL> create pluggable database pdb1
        using '/u01/app/oracle/db1.xml'
        nocopy tempfile reuse;  2    3

Pluggable database created.

Before pdb1 can be opened we need to run the script which will convert pdb1's data dictionary:

SQL> alter session set container=pdb1;

Session altered.

SQL> @?/rdbms/admin/noncdb_to_pdb.sql

Once the script completes we can open pdb1:

SQL> alter pluggable database pdb1 open;

Warning: PDB altered with errors.

The error while opening the PDB tells us that the encryption key is missing (can be seen in pdb_plug_in_violations view). Let's go and import the key now:

[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 18:22:23 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container=pdb1;

Session altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        import keys with secret "db1"
        from '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "cdb12cr1"
        with backup;  2    3    4    5

keystore altered.

SQL> alter pluggable database pdb1 close;

Pluggable database altered.

SQL> alter pluggable database pdb1 open;

Pluggable database altered.

The encryption keys are now imported and we no longer get an error. Note that as with the export in order to import the keys the wallet must be explicitly opened with a password. We can verify that everything is in order by querying the encrypted table:

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> select count(*) from encrypt_user.z_encrypted;

  COUNT(*)
----------
       100

Zone Maps On Commit Refresh Oddities

noreply@blogger.com (Alex Fatkulin) — Tue, 09 Sep 2014 01:34:00 +0000

One of the ways Zone Maps can be refreshed when the underlying table data is changed is fast on commit. This is similar to how materialized views can be refreshed with the exception that a Zone Map does not need a materialized view log to do so.

It can also lead to some peculiar side effects.

Test setup

Let's begin by creating a test table with the on commit refresh materialized zone map:

SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
 
SQL> create materialized zonemap zm$t refresh fast on commit on t (n);
 
Done

SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected

Zone Staleness (first session)

When a session updates a row in the table it will also mark the corresponding zone in the zone map as stale by setting ZONE_STATE$=1:

SQL> update t set n=0 where n=1;
 
1 row updated
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           1        748 <=== Zone marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected

A stale zone is always scanned when performing Zone Map pruning regardless of the predicates involved. This makes sense otherwise we won't be able to select the updated row back while performing Zone Map pruning at the same time because the value is now out of range. With the zone marked as stale we have no such problem:

SQL> select n from t where n=0;
 
         N
----------
         0

A zone will be marked as fresh when the session commits by setting ZONE_STATE$=0. Note that our session did not commit yet.

Second session

So what happens if some other session updates a row belonging to the stale zone map and then commits? Does the zone gets invalidated by being considered fresh again? Let's find out what happens:

SQL> update t set n=2 where n=2;
 
1 row updated
 
SQL> commit;
 
Commit complete
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748 <=== Zone is no longer marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected

Indeed the zone is no longer marked as stale! Now let's go back to our first session...

First session

So what happens if we try to select the same updated row in the first session?

SQL> select n from t where n=0;

no rows selected

We've lost the ability to see the updated row thanks to the second session invalidating the zone! Worse yet we still won't be able to see the row even after this session commits meaning any query in the database which relies on this Zone Map to do the pruning will be unable to see the data either:

SQL> commit;

Commit complete.

SQL> select n from t where n=0;

no rows selected

SQL> select /*+ no_zonemap(t scan) */ n from t where n=0;

         N
----------
         0

The only way to select the row is to disable Zone Map pruning.

The end result is we have a perfectly "valid" Zone Map as long as Oracle is concerned which is out of sync with the actual table data:

SQL> select zonemap_name, invalid, stale, unusable, compile_state from dba_zonemaps;
 
ZONEMAP_NAME INVALID STALE   UNUSABLE COMPILE_STATE
------------ ------- ------- -------- -------------------
ZM$T         NO      NO      NO       VALID

The fact that Oracle documentation claims that on commit refresh zone maps stay transactionally fresh suggests that this behavior is clearly a bug resulted from incorrect handling of concurrent transactions modifying the data belonging to the same zone.

Zone Map Zone ID's

noreply@blogger.com (Alex Fatkulin) — Tue, 19 Aug 2014 16:03:00 +0000

Just a quick follow up to my previous post on how Zone ID's are calculated.

Let's take the following example:

SQL> select rid, sys_op_zone_id(rid) zone_id
  2   from
  3   (
  4    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  5     from dual
  6   );
 
RID                     ZONE_ID
------------------ ------------
AAAS5KAAHAAABYDAAA 324580438021

Recalling that extended ROWID has the following format (a nice picture from Oracle Documentation):

In the binary format that would correspond to:

Data Object Number -- 32 bits
Relative File Number -- 10 bits
Block Number -- 22 bits

Row Number -- 16 bits

We know that Row Number is irrelevant for the Zone Maps because they deal with block ranges. With that in mind a simple conversion to a (base 10) number would be:

(Block Number) + (Relative File Number) * 2 ^ 22 + (Data Object Number) * 2 ^ 32

Applying the above formula to our ROWID:

SQL> select
  2    dbms_rowid.rowid_block_number(rid) +
  3    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  4    dbms_rowid.rowid_object(rid) * power(2,32) base_10
  5   from
  6   (
  7    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  8     from dual
  9   );
 
        BASE_10
---------------
332370368534019

What's left after that is to slice the number into Zone Map chunk size (2^10 by default, thanks to Timur for pointing that out in the comments sections for the previous post):

SQL> select
  2    trunc((
  3    dbms_rowid.rowid_block_number(rid) +
  4    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  5    dbms_rowid.rowid_object(rid) * power(2,32)
  6    )/power(2,10)) zone_id,
  7    sys_op_zone_id(rid) sys_zone_id
  8   from
  9   (
 10    select chartorowid('AAAS5KAAHAAABYDAAA') rid
 11     from dual
 12   );
 
     ZONE_ID  SYS_ZONE_ID
------------ ------------
324580438021 324580438021

That's all there is to it!

Zone Maps

noreply@blogger.com (Alex Fatkulin) — Mon, 18 Aug 2014 17:11:00 +0000

Zone Maps is a new feature that got officially introduced in 12.1.0.2.0 so I've decided to take a closer look.

From the Oracle Documentation:

For full table access, zone maps allow I/O pruning of data based on the physical location of the data on disk, acting like an anti-index.

Let's start by creating a test table:

SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created

I've used a high pctfree setting to make sure the table gets spread out on disk -- each row will occupy it's own block:

SQL> select count(*) num_rows,
  2    count(distinct dbms_rowid.rowid_block_number(rowid)) num_blocks
  3   from t;
 
  NUM_ROWS NUM_BLOCKS
---------- ----------
      5000       5000

Zone Maps do not require attribute clustering, however, I did use clustering in this particular case to make sure that value ranges for column N do not overlap when physically stored on disk to make it easier to see what's happening when a Zone Map is created.

Now let's create a Zone Map:

SQL> create materialized zonemap zm$t on t (n);
 
Done

The first thing to notice is we now got additional table in our schema with the same name as a Zone Map:

SQL> select * from zm$t order by zone_id$;
 
    ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
------------ ---------- ---------- ----------- ----------- ----------
324421046272          1        748           0           0        748
324421046273        749       1756           0           0       1008
324421046274       1757       2764           0           0       1008
324421046275       2765       3772           0           0       1008
324421046276       3773       4780           0           0       1008
324421046277       4781       5000           0           0        220
 
6 rows selected

As you can see we've got six zones defined with most of them covering a range of about thousand rows with the exception of the first and the last ones. I can now map each ZONE_ID$ to it's respective block range on disk:

SQL> select zone_id$, min_block_id, max_block_id, zone_rows$
  2  from (
  3  select zm$t.zone_id$,
  4    min(dbms_rowid.rowid_block_number(t.rowid)) min_block_id,
  5    max(dbms_rowid.rowid_block_number(t.rowid)) max_block_id,
  6    max(zone_rows$) zone_rows$
  7   from t, zm$t
  8   where t.n between zm$t.min_1_n and zm$t.max_1_n
  9   group by zm$t.zone_id$
 10  ) order by zone_id$;
 
  ZONE_ID$   MIN_BLOCK_ID MAX_BLOCK_ID ZONE_ROWS$
------------ ------------ ------------ ----------
324421046272          179         1023        748
324421046273         1026         2047       1008
324421046274         2050         3071       1008
324421046275         3074         4095       1008
324421046276         4098         5119       1008
324421046277         5122         5343        220
 
6 rows selected

Based on a couple more tests I've done the algorithm appears to work itself out until the fist block number in a segment which divides evenly by 1024 after which all subsequent blocks get mapped to 8MB regions. The last map has 221 blocks because that's where the end of the last table block happens to be.

Zone Maps seems to be completely independent from the actual extents on disk. In my tests I was able to get multiple Zone Maps pointing at the same extent and in the case above we have multiple extents covered by the same Zone Map. In all cases zones were 8MB in size (more on that a little bit later). By the way if you're wondering why do we have 1008 rows and a little bit strange starting block_ids (i.e. 1026 instead of 1024 and so on) remember that the fist couple blocks are used for first level bmb and do not store actual table data.

So how is this all pieced together?

SYS_OP_ZONE_ID

SYS_OP_ZONE_ID function computes ZONE_ID$ for a given ROWID value. The calculation appears to be a straightforward math based on the ROWID value and assumes 8MB Zone Map chunks -- this explains why Zone Maps are created adjacent to 8MB boundaries since ZONE_ID$ values in the Zone Map are computed using the same function:

SQL> select rowid, SYS_OP_ZONE_ID(rowid)
  2   from t
  3   where rownum=1;
 
ROWID              SYS_OP_ZONE_ID(ROWID)
------------------ ---------------------
AAAS4kAAFAAAACzAAA          324421046272

Essentially if we imagine the entire possible database space divided into 8MB regions this function would return into which region a particular ROWID value would belong.

SYS_ZMAP_FILTER

Let's look at the explain plan for the following query:

SQL> select * from t where n=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 2931408918

--------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%C
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     1 |   205 |  1380
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| T    |     1 |   205 |  1380
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("N"=1)
       filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
              BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_N" > :1 OR zm."MAX_1_N"
              < :2) THEN 3 ELSE 2 END END FROM "ROOT"."ZM$T" zm WHERE zm."ZONE_LEVEL$"=0 ORDER BY
              zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),1,1)<3 AND "N"=1)

We can see that SYS_ZMAP_FILTER appears to be the function involved in figuring out which Zone Maps needs to be accessed in order to execute the query. The condition inside the query (zm."MIN_1_N" > :1 OR zm."MAX_1_N" < :2) will be used to eliminate Zone Maps which do not have a value we're looking for and is dynamically constructed based on the predicate(-s) we have in the query. From here ZONE_ID$ can be mapped back to ROWID ranges (a reverse operation relative to SYS_OP_ZONE_ID) in order to scan only required portions of the data on disk.

This looks to be a very exciting feature and I can't help but think that it's a little bit disappointing that it's limited to Exadata storage only as it could have been very useful on other systems due to lack of storage indexes support.

AWR Top 5 Timed Foreground Events

noreply@blogger.com (Alex Fatkulin) — Thu, 20 Feb 2014 20:41:00 +0000

I've noticed that people post how to get AWR Top 5 Timed Foreground Events other a range of snapshots using a SQL query from time to time. Since this is something I've done for years here is the version of the SQL I use in case somebody finds it useful:

select case wait_rank when 1 then inst_id end "Inst Num",
 case wait_rank when 1 then snap_id end "Snap Id",
 case wait_rank when 1 then begin_snap end "Begin Snap",
 case wait_rank when 1 then end_snap end "End Snap",
 event_name "Event",
 total_waits "Waits",
 time_waited "Time(s)",
 round((time_waited/total_waits)*1000) "Avg wait(ms)",
 round((time_waited/db_time)*100, 2) "% DB time",
 substr(wait_class, 1, 15) "Wait Class"
from (
select
  inst_id,
  snap_id, to_char(begin_snap, 'DD-MM-YY hh24:mi:ss') begin_snap,
  to_char(end_snap, 'hh24:mi:ss') end_snap,
  event_name,
  wait_class,
  total_waits,
  time_waited,
  dense_rank() over (partition by inst_id, snap_id order by time_waited desc)-1 wait_rank,
  max(time_waited) over (partition by inst_id, snap_id) db_time
from (
select
  s.instance_number inst_id,
  s.snap_id,
  s.begin_interval_time begin_snap,
  s.end_interval_time end_snap,
  event_name,
  wait_class,
  total_waits-lag(total_waits, 1, total_waits) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) total_waits,
  time_waited-lag(time_waited, 1, time_waited) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) time_waited,
  min(s.snap_id) over (partition by s.startup_time, s.instance_number, stats.event_name) min_snap_id
from (
 select dbid, instance_number, snap_id, event_name, wait_class, total_waits_fg total_waits, round(time_waited_micro_fg/1000000, 2) time_waited
  from dba_hist_system_event
  where wait_class not in ('Idle', 'System I/O')
 union all
 select dbid, instance_number, snap_id, stat_name event_name, null wait_class, null total_waits, round(value/1000000, 2) time_waited
  from dba_hist_sys_time_model
  where stat_name in ('DB CPU', 'DB time')
) stats, dba_hist_snapshot s
 where stats.instance_number=s.instance_number
  and stats.snap_id=s.snap_id
  and stats.dbid=s.dbid
  and s.dbid=3870213301
  and s.instance_number=1
  and stats.snap_id between 190 and 195
) where snap_id > min_snap_id and nvl(total_waits,1) > 0
) where event_name!='DB time' and wait_rank <= 5
order by inst_id, snap_id;

Inst Snap  Begin Snap        End Snap Event                            Waits    Time(s) Avg wait(ms)  % DB time Wait Class
---- ----- ----------------- -------- --------------------------- ---------- ---------- ------------ ---------- ---------------
   1   191 20-02-14 14:10:10 14:20:10 cell smart table scan           631829    9569.43           15      79.03 User I/O
                                      DB CPU                                    1202.09                    9.93 
                                      direct path read temp            66074    1006.82           15       8.32 User I/O
                                      PX Deq: Slave Session Stats      11730     429.91           37       3.55 Other
                                      latch: shared pool               28134     162.47            6       1.34 Concurrency
   1   192 20-02-14 14:20:10 14:30:11 cell smart table scan          1391832    4620.11            3      67.39 User I/O
                                      DB CPU                                    1017.78                   14.85 
                                      direct path read temp            76329     977.95           13      14.26 User I/O
                                      PX Deq: Slave Session Stats      25043     401.53           16       5.86 Other
                                      latch free                       38836      214.1            6       3.12 Other
   1   193 20-02-14 14:30:11 14:40:14 cell smart table scan          2448539   11075.29            5       79.3 User I/O
                                      DB CPU                                    1529.93                   10.95 
                                      PX Deq: Slave Session Stats      44242    1520.01           34      10.88 Other
                                      direct path read temp            77583     985.65           13       7.06 User I/O
                                      latch free                       67518     376.52            6        2.7 Other
   1   194 20-02-14 14:40:14 14:50:15 direct path read temp            99224      857.3            9      71.63 User I/O
                                      DB CPU                                     328.78                   27.47 
                                      name-service call wait              91        5.4           59       0.45 Other
                                      PX Deq: Slave Session Stats         83       0.17            2       0.01 Other
                                      direct path write                  194       0.12            1       0.01 User I/O
   1   195 20-02-14 14:50:15 15:00:18 DB CPU                                    1188.84                   98.15 
                                      log switch/archive                   1      10.01        10010       0.83 Other
                                      direct path read temp              775       3.96            5       0.33 User I/O
                                      cell smart table scan             1393        1.1            1       0.09 User I/O
                                      cell single block physical         148        0.9            6       0.07 User I/O
                                      read                                                                      
 
 
25 rows selected

'active txn count during cleanout', part II

noreply@blogger.com (Alex Fatkulin) — Fri, 14 Feb 2014 18:19:00 +0000

In part I I've shown some interesting side effects that happen when you're trying to select from a table block which have an outstanding active transaction in it. In this post we're going to make things a little bit more interesting by introducing indexes into the picture.

Test Setup

I'll create a table with two rows and an index:

SQL> create table test as
  2   select level n
  3    from dual
  4    connect by level <= 2;
 
Table created
 
SQL> create index i_test on test (n);
 
Index created

Session 1

I'll update one row in my first session and leave the transaction open:

SQL> update test set n=3 where n=1;
 
1 row updated

Here is xid for this transaction:

SQL> select '0x'||to_char(XIDUSN, 'fm000x')||'.'||
  2    to_char(XIDSLOT, 'fm00x')||'.'||
  3    to_char(XIDSQN, 'fm0000000x') xid
  4   from v$transaction
  5   where addr=(
  6    select taddr
  7     from v$session
  8     where sid=sys_context('userenv','sid')
  9    );
 
XID
----------------------
0x0004.01c.00001fd5

Index Block Dump 1

Since I only have two rows in the table I will end up with a special case where my index root block will be able to hold all the data essentially playing a role of both the root block and a leaf block at the same time. This makes it easier for me to dump the relevant index block because I know there is only one index block to dump:

Block header dump:  0x0100008b
 Object id on Block? Y
 seg/obj: 0x12f46  csc: 0x00.1efcb3c  itc: 2  flg: E  typ: 2 - INDEX
     brn: 0  bdba: 0x1000088 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0004.01c.00001fd5  0x00c011ac.09a5.0b  ----    2  fsc 0x000e.00000000

We have two rows locked in the index block because the row with value=1 got deleted and a row with value=3 got inserted, as per our update. Let's notice block cleanout scn (csc) value: 0x00.1efcb3c

Session 2

I'll update another row in the second session leaving the transaction open as well:

SQL> update test set n=4 where n=2;
 
1 row updated

XID
----------------------
0x0003.01f.00001eab

Index Block Dump 2

Here is how index block dump looks right now:

Block header dump:  0x0100008b
 Object id on Block? Y
 seg/obj: 0x12f46  csc: 0x00.1efcd0c  itc: 3  flg: E  typ: 2 - INDEX
     brn: 0  bdba: 0x1000088 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0004.01c.00001fd5  0x00c011ac.09a5.0b  ----    2  fsc 0x000e.00000000
0x03   0x0003.01f.00001eab  0x00c00e73.0982.31  ----    2  fsc 0x000e.00000000

Notice that csc value has changed from 0x00.1efcb3c to 0x00.1efcd0c

What happened is just another variation of the theme we saw in part I -- when our second session updates the index block it notices an active transaction in the ITL list and tries to perform a cleanout. It will do the same for the table block as well but since I've shown all the relevant mechanics in the previous post I'll leave it at that.

Undo Segment Header Checks

The important consequence from all the above is that a session which tries to perform a cleanout will have to look into the other transaction(-s) undo segment header block in order to find out whether the other transaction has committed or not:

SQL> select
  2    trn.xidusn,
  3    rbs.file_id,
  4    rbs.block_id header_block,
  5    trn.ubablk undo_block,
  6    '0x'||to_char(trn.XIDUSN, 'fm000x')||'.'||
  7    to_char(trn.XIDSLOT, 'fm00x')||'.'||
  8    to_char(trn.XIDSQN, 'fm0000000x') xid
  9   from v$transaction trn, dba_rollback_segs rbs
 10   where trn.XIDUSN=rbs.segment_id
 11   order by 1;
 
    XIDUSN    FILE_ID HEADER_BLOCK UNDO_BLOCK XID
---------- ---------- ------------ ---------- ----------------------
         3          3          160       3699 0x0003.01f.00001eab
         4          3          176       4524 0x0004.01c.00001fd5

Our first session xid was 0x0004.01c.00001fd5 so when our second session performed the update it had to look into block 176 (undo header block) to check the other transaction status and block 4524 (undo block) in order to rollback the other session changes for write consistency checks:

WAIT #140055864053216: nam='db file sequential read' ela= 341 file#=3 block#=176 blocks=1 obj#=0 tim=1392400391392719
WAIT #140055864053216: nam='db file sequential read' ela= 675 file#=3 block#=4524 blocks=1 obj#=0 tim=1392400391393679

I'll continue setting the up stage for a perfect disaster with delayed block cleanout and parallel DML in the upcoming series.

'active txn count during cleanout', part I

noreply@blogger.com (Alex Fatkulin) — Tue, 28 Jan 2014 14:56:00 +0000

I was going to write a blog post about some peculiar side effects you can get into with the delayed block cleanout when running parallel DML but soon discovered that the entry became so big that I've decided to split it up into a series of more manageable posts.

For a good background on various themes of block cleanout check out Clean it up by Jonathan Lewis.

Active transactions, consistent reads and table scans

First I'm going to show you some interesting observations about what happens when you're trying to select from the block which has an outstanding open transaction in it. Let's create a test table with one row and update it leaving the transaction open. I'm using 11.2.0.4 here:

SQL> create table test (n number, m number);
 
Table created
 
SQL> insert into test values (1, 1);
 
1 row inserted
 
SQL> commit;
 
Commit complete

SQL> select dbms_rowid.rowid_relative_fno(rowid) file#,
  2    dbms_rowid.rowid_block_number(rowid) block#
  3    from test;
 
     FILE#     BLOCK#
---------- ----------
         4        134
 
SQL> update test set n=n;
 
1 row updated

If we were to look into the buffer headers view (x$bh) we would find the following:

     FILE#     DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
         4        134 xcur           0          0
         4        134 cur       595913          0

Now lets select from this table in a different session while checking couple stats at the same time:

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                0
active txn count during cleanout                                          0
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                1
active txn count during cleanout                                          1
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                2
active txn count during cleanout                                          2
 
SQL> select  * from test;
 
         N          M
---------- ----------
         1          1

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications                                3
active txn count during cleanout                                          3

The first thing worth mentioning is that both immediate (CR) block cleanout applications and active txn count during cleanout statistics increment every time we execute the select. Immediate (CR) block cleanout applications indicates that the session is performing delayed block cleanout while doing the consistent read (CR). Active txn count during cleanout indicates how many currently active transactions the cleanout process encountered in each block. From one perspective this makes sense -- when our select reads the block and discovers that there are open transactions it may not know whether these transactions are currently active or the block indeed requires a cleanout. However, something interesting happened when we look into x$bh again:

     FILE#     DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
         4        134 xcur           0          0
         4        134 cur       595913          0
         4        134 cur       595922          0
         4        134 cur       595926          0
         4        134 cur       595940          0

Clearly each subsequent select generated a new consistent read version of the block at a different SCN. Indeed, if we were to dump the table's block before and after the last select here is what we would find:

Before:

buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917d6 seq: 0x01 flg: 0x04 tail: 0x17d60601
frmt: 0x02 chkval: 0xf347 type: 0x06=trans data
...
Block header dump:  0x01000086
 Object id on Block? Y
 seg/obj: 0x113da  csc: 0x00.917d6  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1000080 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0002.01f.00000242  0x00c0c19c.0035.19  C---    0  scn 0x0000.00090e7e
0x02   0x0008.00b.00000241  0x00c00989.0039.38  ----    1  fsc 0x0000.00000000

After:

buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917e4 seq: 0x01 flg: 0x04 tail: 0x17e40601
frmt: 0x02 chkval: 0xf375 type: 0x06=trans data
...
Block header dump:  0x01000086
 Object id on Block? Y
 seg/obj: 0x113da  csc: 0x00.917e4  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1000080 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0002.01f.00000242  0x00c0c19c.0035.19  C---    0  scn 0x0000.00090e7e
0x02   0x0008.00b.00000241  0x00c00989.0039.38  ----    1  fsc 0x0000.00000000

Notice that cleanout scn (cscn) is different and the block got updated with the same scn as well. As you probably have guessed by now, each select generates a block cleanout redo record too:

REDO RECORD - Thread:1 RBA: 0x00000c.00035faf.0010 LEN: 0x006c VLD: 0x05
SCN: 0x0000.000917e4 SUBSCN:  1 01/28/2014 00:16:44
(LWN RBA: 0x00000c.00035faf.0010 LEN: 0001 NST: 0001 SCN: 0x0000.000917e4)
CHANGE #1 TYP:0 CLS:1 AFN:4 DBA:0x01000086 OBJ:70618 SCN:0x0000.000917d6 SEQ:1 OP:4.1 ENC:0 RBL:0
Block cleanout record, scn:  0x0000.000917e4 ver: 0x01 opt: 0x01, entries follow...

From another perspective why even bother doing any of this? It almost looks like the cleanout code is moving along in tiding up the block, updates scn plus cscn and then discovers that there is really nothing to cleanup because the other transaction is still active. But it did changes to the current version of the block by now which then results in proliferation of CR copies from the selects and increased redo generation from the (essentially) shell block cleanout records. There is going to be more interesting details worth mentioning in the next post.

crsd.bin core dumps

noreply@blogger.com (Alex Fatkulin) — Sat, 25 Jan 2014 18:35:00 +0000

Core dump issues sometimes can be notoriously difficult to troubleshoot. I've got a call this morning from one of my customers saying that after a power outage Grid Infrastructure is not able to fully come up on some nodes on their Exadata cluster. After further examining the situation it turned out that crsd.bin binary is simply core dumping upon start up.

Troubleshooting Grid Infrastructure startup issues when nothing is core dumping sometimes could be a chore so what could be more fun when it's not able to fully start due to a major daemon core dumping?

One of the useful things to do when a binary core dumps is to get a stack trace to see which function raised the exception (you can examine the core file the gdb, for example, in order to do that). Let's see what the stack trace holds for us:

Core was generated by `/u01/app/11.2.0.3/grid/bin/crsd.bin reboot'.
Program terminated with signal 6, Aborted.
#0  0x0000003ea3e30285 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003ea3e30285 in raise () from /lib64/libc.so.6
#1  0x0000003ea3e31d30 in abort () from /lib64/libc.so.6
#2  0x0000003ea56bed94 in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib64/libstdc++.so.6
#3  0x0000003ea56bce46 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003ea56bce73 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003ea56bcef9 in __cxa_rethrow () from /usr/lib64/libstdc++.so.6
#6  0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
    aclString=..., useOcr=true, $U7=,
    $U8=, $U9=,
    $V0=, $V1=) at acl.cpp:120
#6  0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
    aclString=..., useOcr=true, $U7=,
    $U8=, $U9=,
    $V0=, $V1=) at acl.cpp:120
#7  0x0000000000df879c in Acl::_ZN3CAA3AclC1ERKSsS2_S2_b (this=0x4556d440,
    $U7=, $U8=,
    $U9=, $V0=,
    $V1=)
#8  0x0000000000a4d81e in SrvResource::initUserId (this=0x7f15803d7550,
    $1=) at clsAgfwSrvResource.cpp:204

We can see that the source of the exception is in the Acl::Acl which is then propagated through the standard libraries. Moreover, function SrvResource::initUserId appears in the stack trace as well, which makes you wonder whether there is some issue with some of the resource's Access Control List, in particular with it's user id setting.

Armed with that knowledge you can now sift through the Grind Infrastructure logs in a much more effective way because these logs are notoriously big and "chatty" (I think my worst nightmare is when the database alert log will become like GI alert log thereby making it much less useful). And there we have it:

Exception: ACL entry creation failed for: owner:ggate:rwx

Turned out the nodes which were core dumping were recently added to the cluster and the user ggate, which is the owner of the GoldenGate resource, simply did not exist on these nodes. Apparently that was enough to cause crsd.bin core dumps. Yikes!

Oracle 12cR1, UDF Pragma and HyperLogLog

noreply@blogger.com (Alex Fatkulin) — Tue, 03 Dec 2013 20:20:00 +0000

One interesting enhancement in 12cR1 PL/SQL is UDF pragma which has the following description:

The UDF pragma tells the compiler that the PL/SQL unit is a user defined function that is used primarily in SQL statements, which might improve its performance.

I though it would be very cool to try it out with my HyperLogLog post I did recently and see if it results in any measurable performance improvement.

Test Table

I'll use the same test table as I did in my original post:

SQL> create table z_hll_test as
  2   select dbms_random.string('x', 4)||rpad('x', 500, 'x') n
  3    from dual
  4    connect by level <= 1000000;
 
Table created

SQL> alter table z_hll_test cache;
 
Table altered

Note that I'm explicitly setting the table to cache in order to make in-memory PQ kick in and eliminate disk I/O as a factor from my test case. For each test I made sure that no physical I/O has happened (including temp I/O for native distinct test).

Regular Function

create or replace function num_zeroes(
 p_n binary_integer
 ) return binary_integer deterministic is
 l_t binary_integer;
 l_b binary_integer;
begin
 --assume 32-bit hash value, 10-bits for bucket
 if (p_n = 0) then return 22; end if;

 l_t := 1;
 l_b := 0;

 while ( bitand(p_n,l_t) = 0 )
 loop
  l_t := l_t*2;
  l_b := l_b+1;
 end loop;

  return l_b;

end num_zeroes;

SQL> select
  2    case
  3      when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4      when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5      else round(hll)
  6    end num_distinct
  7    from (
  8      select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9        from (
 10          select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11            from (
 12              select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
 13                  max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
 14                from z_hll_test z
 15                group by mod(ora_hash(n), 1024)
 16            )
 17        )
 18  );
 
NUM_DISTINCT
------------
      748175
 
Executed in 0.889 seconds

Pragma UDF Function

create or replace function num_zeroes(
 p_n binary_integer
 ) return binary_integer deterministic is
 pragma udf;
 l_t binary_integer;
 l_b binary_integer;
begin
 --assume 32-bit hash value, 10-bits for bucket
 if (p_n = 0) then return 22; end if;

 l_t := 1;
 l_b := 0;

 while ( bitand(p_n,l_t) = 0 )
 loop
  l_t := l_t*2;
  l_b := l_b+1;
 end loop;

  return l_b;

end num_zeroes;

SQL> select
  2    case
  3      when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4      when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5      else round(hll)
  6    end num_distinct
  7    from (
  8      select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9        from (
 10          select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11            from (
 12              select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
 13                  max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
 14                from z_hll_test z
 15                group by mod(ora_hash(n), 1024)
 16            )
 17        )
 18  );
 
NUM_DISTINCT
------------
      748175
 
Executed in 0.593 seconds

Pragma UDF gives us ~33% performance boost which is not too bad considering we didn't have to do anything else. However, that's not the most interesting part -- let's take a look at the native distinct next.

Native Distinct

SQL> select /*+ parallel(z 4) */ count(distinct n) from z_hll_test z;
 
COUNT(DISTINCTN)
----------------
          753204
 
Executed in 0.983 seconds

Note that this was an optimal execution!

Summary

Let's summarize results in a table:

Regular Function	Pragma UDF Function	Native Distinct
0.889	0.593	0.983

As you can see pragma udf actually beats native implementation by a considerable margin which is very impressive given the fact that distict had an optimal execution.

Result Cache Latch and PDBs

noreply@blogger.com (Alex Fatkulin) — Tue, 26 Nov 2013 16:46:00 +0000

One interesting aspect of Oracle 12cR1 database when it comes to PDBs is how latching is done. For example, if all PDBs have to work under the same latch then contention in one PDB can easily affect users in other PDBs too.

Continuing my series of posts on the Result Cache latch today I'll check what happens when you try to acquire RC latch from two different PDBs.

Session 1

The first session is connected under container name TEST:

SQL> select sys_context('userenv', 'con_name') con_name from dual;

CON_NAME
--------------------
TEST

SQL> select addr from v$latch where name='Result Cache: RC Latch';

ADDR
----------------
0000000060041C78

Session 2

The second session is connected under container name TEST2:

SQL> select sys_context('userenv', 'con_name') con_name from dual;

CON_NAME
--------------------
TEST2

SQL> select addr from v$latch where name='Result Cache: RC Latch';

ADDR
----------------
0000000060041C78

As you can see the address of the latch is exactly the same under both containers so both PDBs appear to share exactly the same latch. Let's confirm it by trying to acquire the latch in exclusive mode in both sessions:

Session 1

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetsl_w 0x0000000060041C78 1 1 1 16
Function returned 1

Session 2

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetsl_w 0x0000000060041C78 1 1 1 16
...session hangs...

...and we can confirm that it waits on the RC latch:

SQL> select event, p1, to_char(p1, 'fmxxxxxxxxxxx') addr, state, seconds_in_wait
  2   from v$session_wait
  3   where sid=18;
 
EVENT              P1 ADDR         STATE               SECONDS_IN_WAIT
---------- ---------- ------------ ------------------- ---------------
latch free 1610882168 60041c78     WAITING                          25

The bottom line is that the single RC latch appears to be shared by all PDBs.

How to use HyperLogLog to incrementally maintain number of distinct values

noreply@blogger.com (Alex Fatkulin) — Mon, 18 Nov 2013 17:36:00 +0000

In this post I'll show how extremely easy it is to maintain the number of distinct values when using HyperLogLog. Please reference to my previous post for some description how HyperLogLog works.

Let's assume we have a table with some existing data:

SQL> create table existing_data as
  2   select round(dbms_random.value(0, 77777)) n
  3    from dual
  4    connect by level <= 100000;
 
Table created

Precise number of distinct values:

SQL> select count(distinct n) from existing_data;
 
COUNT(DISTINCTN)
----------------
           56192

Now in order for the incremental refresh to work we first need to create HyperLogLog synopsis:

SQL> create table existing_hll as
  2   select mod(ora_hash(n), 1024) bucket,
  3     max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  4    from existing_data
  5    group by mod(ora_hash(n), 1024);
 
Table created

The table is extremely small as it contains only 1024 rows yet it would be enough to describe data with billions of rows in it. Of course we can now use that synopsis to estimate number of distinct values we have using HyperLogLog:

SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from existing_hll
 12       )
 13   );
 
NUM_DISTINCT
------------
       57676

Again, the result is within 2% of the precise distinct, which is pretty good considering how little information we had to store about the entire data set.

Now let's say there is some new data you want to add into the existing table:

SQL> create table new_data as
  2   select round(dbms_random.value(5555, 111111)) n
  3    from dual
  4    connect by level <= 10000;
 
Table created

So what we're going to do is calculate HyperLogLog synopsis about this new data and then merge it with HyperLogLog synopsis for the existing data:

SQL> merge into existing_hll e
  2   using (
  3    select mod(ora_hash(n), 1024) bucket,
  4      max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  5     from new_data
  6     group by mod(ora_hash(n), 1024)
  7   ) n on (e.bucket=n.bucket)
  8   when matched then update set e.val=greatest(e.val, n.val)
  9   when not matched then insert values (n.bucket, n.val);

1024 rows merged.

SQL> commit;

Commit complete.

Of course the above is a lot more efficient than what you would have to do otherwise, i.e. calculate the number of distinct values from scratch for the entire data set. Once the synopsis has been refreshed we can estimate the new number of distinct values we have:

SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from existing_hll
 12       )
 13   );
 
NUM_DISTINCT
------------
       62288

And it's no different had I computed HyperLogLog for both data sets from scratch:

SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from (
 12        select mod(ora_hash(n), 1024) bucket,
 13          max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
 14         from (
 15          select n from existing_data
 16          union all
 17          select n from new_data
 18         ) group by mod(ora_hash(n), 1024)
 19       )
 20       )
 21   );
 
NUM_DISTINCT
------------
       62288

And the precise distinct count:

SQL> select count(distinct n) from
  2   (
  3   select * from existing_data
  4   union all
  5   select * from new_data
  6   );
 
COUNT(DISTINCTN)
----------------
           60983

I think the ability to incrementally refresh the data is the most powerful aspect of HyperLogLog.

HyperLogLog in Oracle

noreply@blogger.com (Alex Fatkulin) — Sun, 17 Nov 2013 19:45:00 +0000

Calculating number of unique values using Oracle distinct for big data sets has two major problems:

It may require lots of memory for sort/hash group by.
It is very difficult to refresh distinct numbers incrementally meaning every time you append some new data you generally have to perform distinct calculations from scratch.

Oracle introduced approximate NDV algorithm in 11G to solve the above problems but it's practical application is pretty much limited to stats gathering at the moment.

Places like Facebook and Google solve the above problems by using a very interesting algorithm called HyperLogLog. There is a pretty good description of how it works but, in a nutshell, it relies on counting the number of trailing (or leading) zeroes in the binary stream and using that information to estimate number of unique values with a very high degree of accuracy.

HyperLogLog is widely used due to the following properties:

Extremely low memory footprint -- 1KB of memory is enough to estimate the number of unique values with a very high degree of precision across billions of rows.
It can be used to refresh the numbers incrementally by calculating HyperLogLog for the new data set and then combining it with the HyperLogLog for the existing one.
It is very parallel friendly. Each piece of data can be independently computed before being combined for the final result -- essentially utilizing the same mechanism as would be used for incremental refresh.

After reading a couple of HyperLogLog papers I became fascinated by it. The elegance of the math behind it is simply brilliant. So I became naturally curious to try it out in Oracle.

Of course the simplest way was to implement it as a user-defined aggregate, which I did. Unfortunately the performance of that solution left a lot to be desired due to a simple fact that there is quite a bit of overhead in executing a user-defined aggregate function for each row in the data set. Based on that I don't think there is a lot of practical applications for HyperLogLog in this form apart from doing a technology demo.

However, it's quite simple to compute most of the HyperLogLog directly inside the SQL statement itself and, as expected, that provides a much better performance. So here is a quick example which demonstrates HyperLogLog in action.

Fist I will create a test table:

SQL> create table z_hll_test as
  2   select dbms_random.string('x', 4)||rpad('x', 500, 'x') n
  3    from dual
  4    connect by level <= 1000000;
 
Table created

Next I will set my pga_aggregate_target=64m to simulate a situation where distinct will have to do a lot of spilling to disk due to large amounts of data being processed as I don't want to spent a lot of time generating hundreds of billions of rows instead.

First let's run Oracle's distinct:

SQL> set timing on
SQL> select /*+ parallel(z 4) */ count(distinct n) from z_hll_test z;
          753521

Elapsed: 00:00:46.02

As you may figure out, most of the 46 seconds were spent spilling to temp:

Let's see what happens if we use HyperLogLog instead. First I will create a function to calculate the number of trailing zeroes we have in a number:

create or replace function num_zeroes(
 p_n binary_integer
 ) return binary_integer deterministic is
 l_t binary_integer;
 l_b binary_integer;
begin
 --assume 32-bit hash value, 10-bits for bucket
 if (p_n = 0) then return 22; end if;

 l_t := 1;
 l_b := 0;

 while ( bitand(p_n,l_t) = 0 )
 loop
  l_t := l_t*2;
  l_b := l_b+1;
 end loop;

  return l_b;

end num_zeroes;

Now we can use this function to compute the HyperLogLog value directly in a SQL statement:

SQL> select
        case
  2    3                when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
                when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  4    5                else round(hll)
  6     end num_distinct
  7  from (
  8  select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9  from (
 10     select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11     from (
 12             select /*+ parallel(z 4) */
 13                             mod(ora_hash(n), 1024) bucket,
 14                             max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
 15                     from z_hll_test z
 16                     group by mod(ora_hash(n), 1024)
 17             )
 18     )
 19  );
      738105

Elapsed: 00:00:02.08

The SQL might look a bit complicated but it's pretty straightforward once you realize how HyperLogLog works (see the link I provided above). For you see, the performance is 22x faster because the entire thing was able to compute in-memory (there are only 1024 distinct values left after the group by as far as Oracle is concerned) and result is within 2% of the precise distinct count -- very impressive!

The algoirithm can have even more potential in the Exadata environments, if we imagine for a second that Oracle gives us a native implementation which is off-loaded, because each cell can independently compute it's part of the data and then simply sent the results out for final merge. Indeed, this is how places like Facebook and Google scale it across lots of small servers.

Flashback query FTS costs

noreply@blogger.com (Alex Fatkulin) — Tue, 27 Aug 2013 22:29:00 +0000

There has been some information written on the subject already (see this post by Randolf Geist).

In a nutshell, the way optimizer costs full table scans when using flashback query makes it look much more expensive than without. What further complicates the problem is the fact that index access costs remain the same regardless of whether you're using flashback query or not. As a result you will be much more likely to see an index access paths when using flashback query.

Consider the following example:

SQL> create table test as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 10000;
 
Table created
 
SQL> alter table test add constraint pk_test primary key (n);
 
Table altered
 
SQL> exec dbms_stats.gather_table_stats(user, 'test');
 
PL/SQL procedure successfully completed

If I were to execute the following select it will run using a HASH JOIN:

SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
        from v, test t
        where v.n=t.n;  2    3    4    5

Execution Plan
----------------------------------------------------------
Plan hash value: 690578125

---------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |   100 | 21800 |    60   (2)| 00:00:01 |
|*  1 |  HASH JOIN                     |      |   100 | 21800 |    60   (2)| 00:00:01 |
|   2 |   VIEW                         |      |   100 |  1300 |     2   (0)| 00:00:01 |
|*  3 |    CONNECT BY WITHOUT FILTERING|      |       |       |            |          |
|   4 |     FAST DUAL                  |      |     1 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS FULL            | TEST | 10000 |  2001K|    57   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("V"."N"="T"."N")
   3 - filter(LEVEL<=100)

In the particular case I'm talking about the flashback query was hidden behind the view (to make it transparent to the application) so we essentially had the following:

SQL> create or replace view v_test as
  2  select * from test as of scn dbms_flashback.get_system_change_number;
 
View created

(dbms_flashback.get_system_change_number is only used here as a substitute example)

Now let's see what happens if we run the same query as above but join into this view instead:

SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
        from v, v_test t
        where v.n=t.n;  2    3    4    5

Execution Plan
----------------------------------------------------------
Plan hash value: 3196053776

-------------------------------------------------------------------------------------------
| Id  | Operation                       | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |         |   100 | 21800 |   102   (0)| 00:00:02 |
|   1 |  NESTED LOOPS                   |         |       |       |            |          |
|   2 |   NESTED LOOPS                  |         |   100 | 21800 |   102   (0)| 00:00:02 |
|   3 |    VIEW                         |         |   100 |  1300 |     2   (0)| 00:00:01 |
|*  4 |     CONNECT BY WITHOUT FILTERING|         |       |       |            |          |
|   5 |      FAST DUAL                  |         |     1 |       |     2   (0)| 00:00:01 |
|*  6 |    INDEX UNIQUE SCAN            | PK_TEST |     1 |       |     0   (0)| 00:00:01 |
|   7 |   TABLE ACCESS BY INDEX ROWID   | TEST    |     1 |   205 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(LEVEL<=100)
   6 - access("V"."N"="N")

The plan suddenly changed to NL join! What happened is flashback query cranked the FTS cost up while leaving index access cost to be the same thus making an FTS to be much less appealing choice for the optimizer. This can make a lot of a difference especially if you're running in the Exadata environment. So what do you do?

One way to deal with the problem is to see how much more expensive the table scan became when using flashback query:

SQL> select * from test
union all
select * from v_test;  2    3

Execution Plan
----------------------------------------------------------
Plan hash value: 2275963031

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 20000 |  4003K|   367  (85)| 00:00:05 |
|   1 |  UNION-ALL         |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| TEST | 10000 |  2001K|    57   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| TEST | 10000 |  2001K|   310   (0)| 00:00:04 |
---------------------------------------------------------------------------

The cost went up from 57 to 310. That's how much more expensive the view will going to make an FTS look like to the optimizer. So now we can counter-balance that increase with the corresponding increase in cost of index scans using opt_param hint in our view:

SQL> select round((310/57)*100) x from dual;
 
         X
----------
       544
 
SQL> create or replace view v_test as
  2  select /*+ opt_param('optimizer_index_cost_adj',544) */ *
  3   from test as of scn dbms_flashback.get_system_change_number;
 
View created

The plan will now go back to a HASH JOIN

SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
        from v, v_test t
        where v.n=t.n;  2    3    4    5

Execution Plan
----------------------------------------------------------
Plan hash value: 690578125

---------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |   100 | 21800 |   313   (1)| 00:00:04 |
|*  1 |  HASH JOIN                     |      |   100 | 21800 |   313   (1)| 00:00:04 |
|   2 |   VIEW                         |      |   100 |  1300 |     2   (0)| 00:00:01 |
|*  3 |    CONNECT BY WITHOUT FILTERING|      |       |       |            |          |
|   4 |     FAST DUAL                  |      |     1 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS FULL            | TEST | 10000 |  2001K|   310   (0)| 00:00:04 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("V"."N"="N")
   3 - filter(LEVEL<=100)

Of course there are a lot of limitations to this approach, mainly that any query which references the view will get the index access cost adjusted accordingly so if you have a bunch of flashback and non-flashback tables in the same query it deserves a careful thought.

Enkitec E4 2013

noreply@blogger.com (Alex Fatkulin) — Thu, 25 Jul 2013 17:58:00 +0000

Just a quick note that I'll be presenting at this year's Enkitec E4 conference. You can find the schedule here. I did some under the hood investigation regarding how the whole DBFS stack works from the performance perspective and, needless to say, some findings simply left me startled. If you want to see a session which will forever change the way you look at DBFS then this will definitely be the one to attend.

Oracle GoldenGate Integrated Capture #2

noreply@blogger.com (Alex Fatkulin) — Tue, 23 Jul 2013 19:47:00 +0000

I have already written some words on the subject before. However, since then some interesting things have happened.

To recap (or save you some time if you don't want to read the original article) GoldenGate Integrated Capture is nothing else but Oracle Streams Capture in disguise. When running in the Integrated Capture mode pretty much the entire GoldenGate front-end gets yanked out of the picture and replaced with Oracle Streams Capture technology instead.

Let's take a closer look at some of the differences that happened between now and then. Here are two capture processes:

SQL> select capture_name, rule_set_name, purpose from dba_capture order by 1;
 
CAPTURE_NAME                   RULE_SET_NAME                  PURPOSE
------------------------------ ------------------------------ -------------------
OGG$CAP_GG1_EXT1               OGG$GG1_EXT1_CAPTURE_I         GoldenGate Capture
OGG$CAP_GG1_EXT2                                              GoldenGate Capture

OGG$CAP_GG1_EXT1 was created using GoldenGate 11.2.1.0.3 while OGG$CAP_GG1_EXT2 was created using 11.2.1.0.6BP3. Notice how RULE_SET_NAME is empty for the second one? Without the rule set the Capture process will just capture everything that's happening in the database so my first concern was whether it's going to negatively affect the performance?

Indeed, if we take a look at the stats for both processes...

 
       SID CAPTURE_NAME     STARTUP_TIME          TOTAL_PREFILTER_DISCARDED TOTAL_MESSAGES_ENQUEUED
---------- ---------------- --------------------- ------------------------- -----------------------
        33 OGG$CAP_GG1_EXT1 23/07/2013 2:21:56 PM                        47                      71
       147 OGG$CAP_GG1_EXT2 23/07/2013 2:21:56 PM                        28                    2589

...notice a big difference in TOTAL_MESSAGES_ENQUEUED. Both processes were started at exactly the same time and they are capturing from exactly the same objects. This difference makes sense -- without a rule set in place the second process will enqueue every change it sees. By itself, this change would be pretty bad, but something else has happened as well:

SQL> select capture_name, v.program
  2   from v$streams_capture c, v$session v
  3   where c.SID = v.SID;
 
CAPTURE_NAME                   PROGRAM
------------------------------ ------------------------------------------------
OGG$CAP_GG1_EXT1               oracle@gg1.quadro.com (CP01)
OGG$CAP_GG1_EXT2               extract@gg1.quadro.com (TNS V1-V3)

Notice how OGG$CAP_GG1_EXT1 is an Oracle shadow process (a Streams Capture process) while OGG$CAP_GG1_EXT2 is the Extract process itself! What we had before is the Extract process acting as an XStreams client to the Streams Capture process. With this change it is no longer required and we can see that the XStreams client is now detached from the server:

SQL> select server_name, capture_name, status from dba_xstream_outbound order by 1;
 
SERVER_NAME                    CAPTURE_NAME                   STATUS
------------------------------ ------------------------------ --------
OGG$GG1_EXT1                   OGG$CAP_GG1_EXT1               ATTACHED
OGG$GG1_EXT2                   OGG$CAP_GG1_EXT2               DETACHED

That clears the concern of an empty rule set -- the Extract can now filter the records internally because nothing needs to be shipped out anyways where before it would have resulted in a massive traffic between the Capture process, buffered queue and the XStreams client.

Parallel unfriendly

noreply@blogger.com (Alex Fatkulin) — Thu, 14 Mar 2013 02:42:00 +0000

Take a look at the following Parallel section of a SQL Monitor report:

Any query which produces such a report won't care about how much parallel you're running because virtually all the time is spent by the query coordinator (which is a serial process) being busy.

In this case the query in question is quite simple:

select /*+ parallel(t,8) */ median(basket_amount) from whs.fact_sale t

The reason it behaves the way it does has everything to do with how Oracle executes it:

Execution Plan
----------------------------------------------------------
Plan hash value: 712547042

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |            |     1 |     4 |   110K  (3)| 00:00:03 |       |       |        |      |            |
|   1 |  SORT GROUP BY                |            |     1 |     4 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR              |            |       |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10000   |   724M|  2763M|   110K  (3)| 00:00:03 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR         |            |   724M|  2763M|   110K  (3)| 00:00:03 |     1 |1048575|  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS STORAGE FULL| FACT_SALE  |   724M|  2763M|   110K  (3)| 00:00:03 |     1 |1048575|  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------------

Each parallel query slave will gets it's own chunk of the table to read from and then simply send data back to the coordinator. The coordinator will then have to deal with all this data by sorting more than 700M rows which, of course, won't be particularly fast. In this sense median poses an interesting problem since Oracle can't calculate (or, rather, discover) it without having access to the entire data set and query coordinator is the only process which can do it.

So what do you do when you get impacted by a particular choice of algorithm implemented by Oracle? One way to deal with it is to see whether you can trade one set of problem for another, in case the alternative can be executed in a better way. In this particular case the fact table contains the sale transactions for a particular store chain. While there are many different ways to spent money, the number of distinct spending amounts should be relatively low compared to the number of rows we have in the table and in such a case we can calculate the median in a different way.

What we can do instead is count how many occurrences of each spending we have and, when sorted by the spending amount, that will give us a compressed form of the raw data which still retains all the information required to find a median. Let's say you have a table with the following data:

SQL> select n from z_t;
 
         N
----------
         1
         1
         2
         3
         3
         5
         7
 
7 rows selected

The first step is to find how many occurrences of each value do we have:

SQL> select n, count(*) cnt
  2   from z_t
  3   group by n;
 
         N        CNT
---------- ----------
         1          2
         2          1
         5          1
         3          2
         7          1

If the number of distinct values is relatively low, the group by will be able to collapse the result set enough as to make subsequent work to be not very significant as well as do it in a very parallel friendly way. The next step is to calculate the cardinality of the data set, at which places we have each distinct value as well as how many values are there:

SQL> select n, lag(running_sum, 1, 0) over (order by n) prev_running_sum, running_sum, total_row_count
  2  from (
  3  select n,
  4   sum(cnt) over (order by n) running_sum,
  5   sum(cnt) over () total_row_count
  6  from (
  7  select n, count(*) cnt
  8   from z_t
  9   group by n
 10  ));
 
         N PREV_RUNNING_SUM RUNNING_SUM TOTAL_ROW_COUNT
---------- ---------------- ----------- ---------------
         1                0           2               7
         2                2           3               7
         3                3           5               7
         5                5           6               7
         7                6           7               7

So what the above tells us is we have two 1s, followed by a single 2, followed by two 3s and so on. Because we have seven elements in our data set, we know that the median will be the item number four which we can now easily discover:

SQL> select avg(n) from (
  2  select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count
  3  from (
  4  select n,
  5   sum(cnt) over (order by n) value_begin,
  6   sum(cnt) over () total_row_count
  7  from (
  8  select n, count(*) cnt
  9   from z_t
 10   group by n
 11  ))) where total_row_count/2 between prev_value_begin and value_begin;
 
    AVG(N)
----------
         3

The avg is there for a case where we have an even number of elements since the median in this case equals to a mean value of two values in the middle.

Our new real query will look like this:

select avg(n) from (
select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count
from (
select n,
 sum(cnt) over (order by n) value_begin,
 sum(cnt) over () total_row_count
from (
select /*+ parallel(t,8) */ basket_amount n, count(*) cnt
 from whs.fact_sale t
 group by basket_amount
))) where total_row_count/2 between prev_value_begin and value_begin;

So what does a group by and a set of analytic functions is able to bring to a table? Let's take a look:

The total execution time has dropped from almost 26 minutes down to 28 seconds. Moreover, the workload is now much more skewed towards parallel query slaves, which is exactly what we want to see. Of course, the trick only works if the group by is able to collapse the data sufficiently enough.

In-memory PQ and physical reads

noreply@blogger.com (Alex Fatkulin) — Tue, 26 Feb 2013 17:38:00 +0000

In my previous post I've demonstrated how in-memory PQ can access the table directly from the buffer cache even when you're using manual DOP.

One interesting question, however, is what happens when PQ slave needs to read some blocks from disk given that object has been qualified for in-memory (cached) access? Would the slave do it using direct or buffered I/O?

The answer becomes somewhat clear once you realize that in-memory PQ is enabled by simply utilizing buffered reads. Since direct path reads can not take advantage of any blocks stored in the buffer cache (local or remote), trying to do direct path reads will defeat the whole point of in-memory PQ.

To demonstrate the point here is the excerpt from a tkprof output for one of the parallel query slaves:

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us)
         0          0          0   PX COORDINATOR  (cr=0 pr=0 pw=0 time=0 us)
         0          0          0    PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
         1          1          1     SORT AGGREGATE (cr=62838 pr=62500 pw=0 time=7492751 us)
     57696      62500      72120      PX BLOCK ITERATOR (cr=62838 pr=62500 pw=0 time=24528381 us cost=18846 size=0 card=500000)
     57696      62500      72120       TABLE ACCESS FULL Z_TEST (cr=62838 pr=62500 pw=0 time=23944184 us cost=18846 size=0 card=500000)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Execution Msg                         128        0.19          1.06
  db file scattered read                       4546        0.16         43.99
  latch free                                      1        0.00          0.00
  latch: cache buffers lru chain                 31        0.01          0.03
  latch: cache buffers chains                     3        0.00          0.00
  latch: object queue header operation            4        0.00          0.00

Note that the slave waited on db file scattered read event which is nothing else but buffered multiblock reads. If you were to run the same test on the Exadata platform you would see cell multiblock physical read event instead, given that in-memory PQ did get utilized. There are a couple of consequences for this:

There is no need to do object level checkpoint. This makes in-memory PQ somewhat more friendly to be running in OLTP environment since it doesn't need to flush any dirty buffers to disk.
If you running on the Exadata platform, none of the offloading will happen, even if you have to read the significant portion of the table from disk.

On another note it looks like the things came full circle. Serial sessions are now able to utilize direct path reads while parallel query slaves are able to do buffered I/O.

All tests were done on 11.2.0.3 (both Exadata and non-Exadata).

Does in-memory PQ work with PARALLEL_DEGREE_POLICY=MANUAL?

noreply@blogger.com (Alex Fatkulin) — Mon, 18 Feb 2013 20:18:00 +0000

In-memory parallel execution seems to be gaining popularity especially among people running x2-8 and x3-8 Exadata systems or any other system that have large amounts of memory capable of caching lots of data.

Oracle documentation suggests that in order to utilize in-memory PQ, parallel_degree_policy needs to be set to auto.

_parallel_cluster_cache_policy

One of the parameters influenced by parallel_degree_policy is _parallel_cluster_cache_policy. When using Auto DOP _parallel_cluster_cache_policy will be set to cached. The question then becomes what happens if we set _parallel_cluster_cache_policy=cached while still keeping Manual DOP? Will the system use in-memory PQ?

Test table

Below is a test table setup:

SQL> create table z_test tablespace data as
        select level n, rpad('*', 4000, '*') v
                from dual
                connect by level <= 500000;

Table created.

SQL> alter table z_test add constraint pk_z_test primary key (n);

Table altered.

SQL> select bytes/power(1024,2) mb
        from user_segments
        where segment_name='Z_TEST';

        MB
----------
      3968

SQL> exec dbms_stats.gather_table_stats(user, 'z_test');

PL/SQL procedure successfully completed.

The instance is running with 12G buffer cache so it'll have no problems fully caching the above table. All tests were done on my in-house test lab with Oracle 11.2.0.3.3 running inside a Linux VM.

Classic PQ #1

Without setting any additional parameters PQ behave the way it always did -- by utilizing direct path reads directly to the process' memory:

SQL> set timing on
SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;

Elapsed: 00:00:02.86

Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892

--------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 | 18846   (1)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| Z_TEST   |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 8 because of hint


Statistics
----------------------------------------------------------
         25  recursive calls
          0  db block gets
     500525  consistent gets
     500000  physical reads
          0  redo size
        526  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

It took us about 2.86 seconds to full scan the table which equals 1387MB/s throughput (my test lab storage setup is described here). The above clearly shows that we had to do physical reads in order to access the entire table.

Caching the table

Of course, before testing the in-memory PQ, we need to make sure that our entire table sits in the buffer cache. The easiest way to do it is perform an FTS on the table using an index:

SQL> select /*+ index(z_test,pk_z_test) */ v from z_test;

500000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 579016438

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |   500K|  1907M|   501K  (1)| 00:00:03 |
|   1 |  TABLE ACCESS BY INDEX ROWID| Z_TEST    |   500K|  1907M|   501K  (1)| 00:00:03 |
|   2 |   INDEX FULL SCAN           | PK_Z_TEST |   500K|       |  1052   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     534311  consistent gets
     501105  physical reads
          0  redo size
 2021185355  bytes sent via SQL*Net to client
     367187  bytes received via SQL*Net from client
      33335  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     500000  rows processed

Let's check to make sure all table blocks are in the buffer cache:

SQL> set autot off
SQL> select count(*)
        from v$bh
        where objd= (select data_object_id from user_objects where object_name='Z_TEST')
                and status='xcur';

  COUNT(*)
----------
    500001

Now we're good to go!

Classic PQ #2

Even with the table entirely cached we still get it using physical reads when utilizing classic PQ -- as it should be:

SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;

Elapsed: 00:00:02.83

Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892

--------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 | 18846   (1)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| Z_TEST   |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 8 because of hint


Statistics
----------------------------------------------------------
         24  recursive calls
          0  db block gets
     500525  consistent gets
     500000  physical reads
          0  redo size
        526  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

In-memory PQ

Let's flip the parameter responsible for in-memory PQ (while still keeping parallel_degree_policy=manual) and see what happens:

SQL> alter session set "_parallel_cluster_cache_policy"=cached;

Session altered.

Elapsed: 00:00:00.01
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;

Elapsed: 00:00:00.36

Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892

--------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 | 18846   (1)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| Z_TEST   |   500K| 18846   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 8 because of hint


Statistics
----------------------------------------------------------
         24  recursive calls
          0  db block gets
     502709  consistent gets
          0  physical reads
          0  redo size
        526  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

For you see -- the entire table got read from the buffer cache this time and much faster! The fact that we did zero physical IOs shows in-memory PQ kicking in.

Conclusion

It is possible to use in-memory PQ with Manual DOP by setting _parallel_cluster_cache_policy=cached. Of course, always consult with Oracle support before flipping any of the underscore parameters.

GoldenGate and transient PK updates

noreply@blogger.com (Alex Fatkulin) — Mon, 28 Jan 2013 15:37:00 +0000

The problem of transient PK updates is well known and comes from the fact that pretty much every replication solution on the market applies changes using row-by-row approach.

To quickly recap the problem, if you have a table like:

SQL> create table tpk (n number primary key);
 
Table created
 
SQL> insert into tpk values (1);
 
1 row inserted
 
SQL> insert into tpk values (2);
 
1 row inserted
 
SQL> commit;
 
Commit complete

Then executing the following statement...

update tpk set n=n+1

...will result in a transient PK problem since the replication solution will have to decompose it into the following two statements:

update tpk set n=2 where n=1;
update tpk set n=3 where n=2;

There are two immediate (and major) problems with the above statements. The first problem is that we can't execute the first statement without violating the primary key constraint. Another problem is, even if we somehow could execute the first statement, the second statement will result in updating both rows since they now have the same value!

Oracle Streams historically dealt with that problem using internal mechanism which you could leverage by executing a specially constructed LCR. Other (third-party) replication solution were pretty much out of luck and had to rely on elaborate tricks in order to work around the problem.

This is all about to change.

The necessity to better integrate GoldenGate and bring it feature set up has driven quite a bit of exciting innovation. One of these innovations solves the transient PK problem not only for GoldenGate but for everybody else.

dbms_xstream_gg package

The above package has been available at least since 11.2.0.2 and has two procedures which are directly relevant to the problem described above. I'm talking about enable_tdup_workspace and disable_tdup_workspace. Here is a quick demonstration of how they work:

SQL> --this will result in PK violation
SQL> update tpk set n=2 where n=1;
 
update tpk set n=2 where n=1
 
ORA-00001: unique constraint (ROOT.SYS_C005031) violated
 
SQL> exec dbms_xstream_gg.enable_tdup_workspace;
 
PL/SQL procedure successfully completed
 
SQL> --this is now works!
SQL> update tpk set n=2 where n=1;
 
1 row updated
 
SQL> update tpk set n=3 where n=2;
 
1 row updated
 
SQL> exec dbms_xstream_gg.disable_tdup_workspace;
 
PL/SQL procedure successfully completed
 
SQL> commit;
 
Commit complete

As you can see, the procedure allows us to avoid classical transient PK problem! Indeed, that's what GoldenGate uses internally to avoid getting in troubles as well. The implementation seems to be leveraging the same delete+insert trick Oracle Streams did:

SQL> exec dbms_xstream_gg.enable_tdup_workspace;
 
PL/SQL procedure successfully completed
 
SQL> select n, rowid from tpk;
 
         N ROWID
---------- ------------------
         1 AAAO25AAFAAClWFAAA
         2 AAAO25AAFAAClWFAAB
 
SQL> update tpk set n=2 where n=1;
 
1 row updated
 
SQL> select n, rowid from tpk;
 
         N ROWID
---------- ------------------
         2 AAAO25AAFAAClWFAAB
 
SQL> update tpk set n=3 where n=2;
 
1 row updated
 
SQL> select n, rowid from tpk;
 
         N ROWID
---------- ------------------
         2 AAAO25AAFAAClWFAAA
         3 AAAO25AAFAAClWFAAB

Note how the row mysteriously disappears after the first update and then suddenly comes back after the second one?

I think anybody who were into any sort of replication and its problems will find this to be one of the most significant new features made available. The only caveat is that the above package is not documented so anyone thinking about leveraging it needs to carefully think about the way it behaves.

Oracle GoldenGate Sequence Replication

noreply@blogger.com (Alex Fatkulin) — Thu, 20 Dec 2012 21:36:00 +0000

When using Oracle GoldenGate sequence replication there is a number of issues you need to be aware of especially if you replicate quite a lot of busy sequences.

The first issue is that GoldenGate sequence replication does not use bind variables. Let's execute the following statements on the source system:

SQL> create sequence rep1.s1 nocache;
 
Sequence created
 
SQL> select rep1.s1.nextval from dual;
 
   NEXTVAL
----------
         1
 
SQL> select rep1.s1.nextval from dual;
 
   NEXTVAL
----------
         2

GoldenGate uses PL/SQL procedure called replicateSequence each time it needs to sync sequence values. The following calls will be made on the destination system as a result of the above statements:

BEGIN ggext .replicateSequence (TO_NUMBER(2), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;
BEGIN ggext .replicateSequence (TO_NUMBER(3), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;

The first parameter is a target sequence value (seq$.highwater) and it's the one which is causing most of the issues, especially if the sequence has been declared with relatively low cache value (or nocache at all, as in my example). Every time a new sequence last value gets written into the source system data dictionary we get a hard parse on the destination!

When using higher cache values the problem of hard parses gets somewhat mitigated but there is another issue. When replicating such a sequence GoldenGate follows these steps:

Sets sequence to nocache
Executes sequence.nextval until it reaches the target value
Restores altered sequence properties

So if you have, say, a sequence with cache 1000 then each time a new value gets written into seq$.highwater on the source, GoldenGate is going to execute sequence.nextval one thousand times on the destination! As you can imagine this aren't performing particularly fast as getting every next value will result in Oracle updating actual row in the data dictionary. All of the above means that replicating sequences can sometimes put quite a bit of strain on the system. A much faster (and simpler) approach would be to use step to reach target sequence value instead of fetching it on-by-one in nocache mode. Last but not least, you can end up with nocache sequence if procedure crashes in the middle.