Tuesday 13 December 2011

Block Recovery Fails : ORA-00283: recovery session canceled due to errors ; ORA-00267: name of archived log file not needed ; ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 4655 ; ORA-00368: checksum error in redo log block

We had certain blocks in the database corrupted due to issues with the attached SAN disks. Once the SAN disks were fixed, we had to recover the corrupted blocks and used the following procedure:

1. Perform database logical check:
run {
allocate channel d1 type disk;
backup check logical validate database;

release channel d1;
}


2. Check whether there are corrupt blocks in the database
select * from v$database_block_corruption;

3. Recover all the corrupted blocks
run {
allocate channel d1 type disk;
blockrecover corruption list;

release channel d1;
}

Note: The syntax to recover blocks individually is:
RMAN> BLOCKRECOVER DATAFILE <file#> BLOCK <block#>,<block#> DATAFILE <file#> BLOCK <block#>;
eg. RMAN> BLOCKRECOVER DATAFILE 2 BLOCK 12, 13 DATAFILE 3 BLOCK 5, 98, 99 DATAFILE 4 BLOCK 19;

While the recovery was running, the following error was reported and the recovery was canceled:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of blockrecover command at

RMAN-10035: exception raised in RPC:
ORA-00283: recovery session canceled due to errors
ORA-00267: name of archived log file not needed
ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 4655
RMAN-10031: RPC Error: ORA-00283 occurred during call to DBMS_BACKUP_RESTORE.BMRDOMEDIARECOVERY


After doing some analysis, we found that one of the current redo log files had a corrupt block, which was also reported in the alert log file:
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 143704 change 7917341986636 time 09/20/2010 22:26:57
ORA-00334: archived log: '/u01/oradata/DB/redo01a.log'


To fix the above, we first checked that none of the other redo log files had corrupt blocks by running the following against every logfile:
SQL> alter system dump logfile 'full path of redo log' validate;

None of the other logfiles reported any block corruption apart from the one above. As part of good practice we had another member of the same redo log group in another location. Hence, the database was shut down cleanly, the healthy redo log member was copied over the corrupted one and renamed accordingly, and the database was started again.
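For reference, you can confirm that the affected group really does have a second member by listing the members of every group with a query like the following:

SQL> select l.group#, l.status, f.member
     from v$log l, v$logfile f
     where l.group# = f.group#
     order by l.group#;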
After performing the above, the blockrecover was run again and this time it was successful.

Hope this helps others out there.

Monday 14 November 2011

Compressed Tables Improve Query Performance

I performed the following test to see how much performance improvement I could get if I compress a table and run a full scan on it.

1.  Ran the following to get the initial response time of full scan on my table:

SQL> select count(*) from test;

Elapsed: 00:01:33.19

Execution Plan

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 | 72313   (1)| 00:14:28 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| TEST |    24M| 72313   (1)| 00:14:28 |
-------------------------------------------------------------------

2.   Ran the PL/SQL block below against the table to find its compression ratio, which came out at 2.82:
set serveroutput on
declare
  pragma autonomous_transaction;
  -- sample percentage
  pct     number := 0.000099;
  -- block count of the uncompressed sample (the loop grows the sample until ~1000 blocks)
  blkcnt  number := 0;
  -- block count of the compressed copy
  blkcntc number;
begin
  execute immediate 'create table TEMP_UNCOMPRESSED pctfree 0 as select * from &&tabname where rownum < 1';
  while ((pct < 100) and (blkcnt < 1000)) loop
    execute immediate 'truncate table TEMP_UNCOMPRESSED';
    execute immediate 'insert into TEMP_UNCOMPRESSED select * from &&tabname sample block (' || pct || ',10)';
    execute immediate 'select count(distinct(dbms_rowid.rowid_block_number(rowid))) from TEMP_UNCOMPRESSED' into blkcnt;
    pct := pct * 10;
  end loop;
  execute immediate 'create table TEMP_COMPRESSED compress as select * from TEMP_UNCOMPRESSED';
  execute immediate 'select count(distinct(dbms_rowid.rowid_block_number(rowid))) from TEMP_COMPRESSED' into blkcntc;
  execute immediate 'drop table TEMP_COMPRESSED';
  execute immediate 'drop table TEMP_UNCOMPRESSED';
  dbms_output.put_line('Compression Ratio for &&tabname: ' || round(blkcnt/blkcntc, 2));
end;
/

3.  Compressed the table:

alter table test move compress;
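Note that ALTER TABLE ... MOVE changes the rowids, so any indexes on the table are left UNUSABLE and must be rebuilt afterwards. A quick check and rebuild looks like this (idx_test is just an illustrative index name):

SQL> select index_name, status from user_indexes where table_name = 'TEST';
SQL> alter index idx_test rebuild;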

4.  Ran the full scan on the compressed table:

SQL> select count(*) from test;

Elapsed: 00:00:28.16

Execution Plan

--------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |     1 | 22964   (1)| 00:04:36 |
|   1 |  SORT AGGREGATE    |       |     1 |            |          |
|   2 |   TABLE ACCESS FULL| TEST  |    25M| 22964   (1)| 00:04:36 |
--------------------------------------------------------------------


With the above test, it is quite clear that compressing a table with a high compression factor improves response time, simply because a full scan has far fewer blocks to read.
Hence, my rule of thumb is to compress tables whose compression factor is more than 2 and which are fairly static. This is very helpful in a data warehouse environment.

Wednesday 26 October 2011

Apply Linux Kernel Update with ASMLib configured

Before your Linux database server is patched, you need to ensure that you have downloaded the right asmlib rpm for the new kernel version.

The asmlib rpm can be downloaded from:
http://www.oracle.com/technetwork/server-storage/linux/downloads/rhel5-084877.html

After the kernel update is applied on the server, you need to install the new asmlib rpm as the root user. For example, for the kernel version below, you would install the following rpm:
# uname -r
2.6.18-274.7.1.el5
# rpm -Uvh oracleasm-2.6.18-274.7.1.el5-2.0.5-1.el5.x86_64.rpm


Once the new rpm for the patched kernel is installed, you will need to restart ASMLib so that the disks can be discovered. To do this, issue the following as root:

# service oracleasm restart
Dropping Oracle ASMLib disks:                              [  OK  ]
Shutting down the Oracle ASMLib driver:                    [  OK  ]
Initializing the Oracle ASMLib driver:                     [  OK  ]
Scanning the system for Oracle ASMLib disks:               [  OK  ]

# /etc/init.d/oracleasm listdisks
VOL1
VOL2
VOL3
VOL4
VOL5
VOL6

You should now be able to start your ASM instance successfully with all the required diskgroups mounted and ready to go.
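As a quick sanity check, you can confirm the diskgroups are mounted by querying v$asm_diskgroup from the ASM instance:

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup;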

Wednesday 12 October 2011

OUI-67200:Make failed to invoke "/usr/bin/make -f ins_net_client.mk client_sharedlib ..." 'genclntsh: genclntsh: Could not locate /u01/db/tech_st/11.1.0/rdbms/admin/shrept.lst

We had a situation where applying CPU patch to a database generated the following warnings:

OUI-67200:Make failed to invoke "/usr/bin/make -f ins_net_client.mk client_sharedlib ORACLE_HOME=/u01/db/tech_st/11.1.0"....'genclntsh: genclntsh: Could not locate  /u01/db/tech_st/11.1.0/rdbms/admin/shrept.lst
make: *** [client_sharedlib] Error 1
'
OUI-67124:Re-link fails on target "client_sharedlib".
Re-link fails on target "client_sharedlib".

In order to fix it, we simply copied the shrept.lst file from another Oracle Home at the same patchset level into the current $ORACLE_HOME/rdbms/admin location. We re-ran the opatch utility to apply the CPU patch and this time it did not display the above warnings at all.

We were also able to start SQL*Plus successfully.

Wednesday 5 October 2011

OPatch failed to locate Central Inventory; The Central Inventory is corrupted

If you ever get this error while running opatch, don't think it is the end of the world and that you have to reinstall Oracle.
There is a fix for it and it works.
We got the following error for a client environment while running opatch:

$ $ORACLE_HOME/OPatch/opatch lsinventory
Invoking OPatch 11.1.0.6.2

Oracle Interim Patch Installer version 11.1.0.6.2
Copyright (c) 2007, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/product/11.1.0/db_1
Central Inventory : /u01/app/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.1.0.6.2
OUI version       : 11.1.0.7.0
OUI location      : /u01/app/product/11.1.0/db_1/oui
Log file location : /u01/app/product/11.1.0/db_1/cfgtoollogs/opatch/opatch2011-10-06_09-11-34AM.log

OPatch failed to locate Central Inventory.
Possible causes are:
    The Central Inventory is corrupted
    The oraInst.loc file specified is not valid.
LsInventorySession failed: OPatch failed to locate Central Inventory.
Possible causes are:
    The Central Inventory is corrupted
    The oraInst.loc file specified is not valid.

OPatch failed with error code 73
And here is how it was fixed:

1. Ensure /etc/oraInst.loc has the correct entry for inventory location. This was the output for ours:
$ cat /etc/oraInst.loc
inventory_loc=/u01/app/oraInventory
inst_group=oinstall

2. The following command was run to attach the current home to the inventory
$ cd $ORACLE_HOME/oui/bin
$ ./runInstaller -silent -invPtrLoc "/etc/oraInst.loc" -attachHome ORACLE_HOME="/u01/app/product/11.1.0/db_1" ORACLE_HOME_NAME="Ora11gHome"
3. After the above command was run successfully, opatch was running fine:
$ opatch lsinventory
Invoking OPatch 11.1.0.6.2

Oracle Interim Patch Installer version 11.1.0.6.2
Copyright (c) 2007, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/product/11.1.0/db_1
Central Inventory : /u01/app/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.1.0.6.2
OUI version       : 11.1.0.7.0
OUI location      : /u01/app/product/11.1.0/db_1/oui
Log file location : /u01/app/product/11.1.0/db_1/cfgtoollogs/opatch/opatch2011-10-06_09-32-11AM.log

Lsinventory Output file location : /u01/app/product/11.1.0/db_1/cfgtoollogs/opatch/lsinv/lsinventory2011-10-06_09-32-11AM.txt
--------------------------------------------------------------------------------
Installed Top-level Products (2):

Oracle Database 11g                                                  11.1.0.6.0
Oracle Database 11g Patch Set 1                                      11.1.0.7.0
There are 2 products installed in this Oracle Home.


There are no Interim patches installed in this Oracle Home.

--------------------------------------------------------------------------------
OPatch succeeded.

Sunday 25 September 2011

Oracle SQL "like clause" uses full table scan

Oracle will typically do a full table scan when a 'like' search applies a function to the searched column, which is what most case-insensitive searches do.
A SQL statement such as the one below still performs an FTS even if column col2 is indexed, because applying UPPER() to col2 prevents the regular index on it from being used:
select * from tab1 where upper(col2) like 'TEST%';

To let Oracle use an index for such searches, one option is to implement the Oracle Text feature and use the 'contains' clause instead of the 'like' clause.

Another way to do this without Oracle Text is to use a function-based index, which I will show below.

Below is my first test case with a normal index:
1. Create a dummy table and populate it with some test data.
SQL> create table tab1 (col1 number,col2 varchar2(200));
Table created.

SQL> begin
  2  for i in 1..1000 loop
  3  insert into tab1 values (i,'Test full scan');
  4  end loop;
  5  insert into tab1 values (1001,'Test1 full scan');
  6  end;
  7  /

PL/SQL procedure successfully completed.
SQL> commit;
Commit complete.

2. Create index
create index idx1 on tab1(col2);

3. Generate the execution plan of the SQL to see whether the index gets used or not:
set autotrace traceonly explain
select * from tab1 where upper(col2) like 'TEST1%';
Execution Plan
----------------------------------------------------------
Plan hash value: 2211052296

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   115 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TAB1 |     1 |   115 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------


As you can see, a full table scan is done.

Now, let's create a function-based index and see how that works.
1. Drop the old index
drop index idx1;

2. Create a function-based index
create index idx1 on tab1 (upper(col2));

3. Generate the execution plan of the same SQL:
set autotrace traceonly explain
select * from tab1 where upper(col2) like 'TEST%';
------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TAB1 |     1 |   115 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX1 |     9 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------


And voila! The index gets used.

Please note that the function-based index will only get used if the 'like' pattern has the wildcard '%' character at the end of the string. Oracle will still do a full table scan if the wildcard character is placed at the beginning of the string. If that is the case, then I strongly suggest that you implement and use the Oracle Text feature.
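For completeness, here is roughly what the Oracle Text route looks like. This is only a sketch: it assumes the Oracle Text option (CTXSYS) is installed, idx_text is an illustrative name, and a CONTEXT index is not maintained transactionally, so it needs syncing after DML:

create index idx_text on tab1(col2) indextype is ctxsys.context;

-- CONTAINS can match word prefixes anywhere in the column, not just at the start
select * from tab1 where contains(col2, 'Test%') > 0;

-- after DML on tab1, refresh the index
exec ctx_ddl.sync_index('idx_text');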

Wednesday 14 September 2011

SYSAUX Tablespace Space Management

To see what is occupying most of the space in the SYSAUX tablespace, you can run the following on the database:
@?/rdbms/admin/awrinfo.sql

The above will prompt for an output file. Once it finishes running, review the file to see which component occupies the most space by scrolling to this section:

(1b) SYSAUX occupants space usage (v$sysaux_occupants)
********************************************************
|
| Occupant Name        Schema Name               Space Usage
| -------------------- -------------------- ----------------
| SM/OPTSTAT           SYS                       18,911.9 MB
| SM/AWR               SYS                          302.8 MB
| AO                   SYS                           56.8 MB
| XSOQHIST             SYS                           56.8 MB
| XDB                  XDB                           54.9 MB

...

If SM/AWR is occupying the most space, then try to reduce the AWR retention.
We had a situation where SM/OPTSTAT was taking up most of the space. The reason is that from 10g onwards the database retains a history of optimizer statistics in the SYSAUX tablespace; its retention can be found by running the following:

select dbms_stats.get_stats_history_retention from dual;

By default, the retention is 31 days. To reduce it to 10 days, you can run the following:

exec dbms_stats.alter_stats_history_retention(10);

After the above is run, you can let the default database stats job purge the object stats that are older than the new retention.
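If you do not want to wait for the maintenance job, the history can also be purged manually with a standard DBMS_STATS call; the 10-day cut-off below simply matches the new retention:

SQL> exec dbms_stats.purge_stats(sysdate - 10);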

You may also find that after the purging is done, some of the objects still occupy a lot of space and need to be shrunk.
Going through the report, we found the following objects were occupying the most space:

(4) Space usage by non-AWR components (> 500K)
**********************************

COMPONENT        MB SEGMENT_NAME                                                          SEGMENT_TYPE
--------- --------- --------------------------------------------------------------------- ---------------
NON_AWR     7,171.0 SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY                                     TABLE
NON_AWR     6,133.0 SYS.I_WRI$_OPTSTAT_HH_OBJ_ICOL_ST                                     INDEX
NON_AWR     3,817.0 SYS.I_WRI$_OPTSTAT_HH_ST                                              INDEX
NON_AWR       391.0 SYS.WRI$_OPTSTAT_IND_HISTORY                                          TABLE
NON_AWR       339.0 SYS.I_WRI$_OPTSTAT_IND_ST                                             INDEX
NON_AWR       298.0 SYS.I_WRI$_OPTSTAT_IND_OBJ#_ST                                        INDEX
NON_AWR       204.0 SYS.I_WRI$_OPTSTAT_TAB_ST                                             INDEX
...

In order to shrink the above, we moved the SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY table and rebuilt the SYS.I_WRI$_OPTSTAT_HH_OBJ_ICOL_ST and SYS.I_WRI$_OPTSTAT_HH_ST indexes:

alter table SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY move;
alter index SYS.I_WRI$_OPTSTAT_HH_OBJ_ICOL_ST rebuild;
alter index SYS.I_WRI$_OPTSTAT_HH_ST rebuild;

After doing the above, we were able to reclaim around 7GB of space.

Sunday 21 August 2011

Stop Datapump jobs


It really gets frustrating when you are not able to stop or kill a Data Pump job just by killing the client script. Anyway, here are the steps to stop/kill it cleanly:

1. First gather the list of datapump jobs:
CONNECT / as sysdba
SET lines 200
COL owner_name FORMAT a10;
COL job_name FORMAT a20
COL state FORMAT a11
COL operation LIKE state
COL job_mode LIKE state

SELECT owner_name, job_name, operation, job_mode,
state, attached_sessions
FROM dba_datapump_jobs
WHERE job_name NOT LIKE 'BIN$%'
ORDER BY 1,2;

OWNER_NAME JOB_NAME             OPERATION JOB_MODE STATE       ATTACHED_SESSIONS
---------- -------------------- --------- -------- ----------- -----------------
SCOTT      IMPDP_TEST           IMPORT    SCHEMA   EXECUTING                   1


To kill the job, we need to note two values from the above output, which we will use in the next step:
OWNER_NAME (Which is SCOTT)
JOB_NAME (Which is IMPDP_TEST)

2. With that information, we can now stop and kill the job:
CONNECT scott/tiger
SET serveroutput on
SET lines 100
DECLARE
h1 NUMBER;
BEGIN
h1 := DBMS_DATAPUMP.ATTACH('IMPDP_TEST','SCOTT');
DBMS_DATAPUMP.STOP_JOB (h1);
END;
/
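Depending on how the job was defined, STOP_JOB as called above may leave the master table behind so the job can be restarted later. To abort immediately and have the master table dropped regardless, the call can be made as below (a sketch using the same handle attachment as above):

DECLARE
h1 NUMBER;
BEGIN
h1 := DBMS_DATAPUMP.ATTACH('IMPDP_TEST','SCOTT');
-- immediate => 1 aborts running workers, keep_master => 0 drops the master table
DBMS_DATAPUMP.STOP_JOB (handle => h1, immediate => 1, keep_master => 0);
END;
/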

3. Check that the job has stopped and removed:
CONNECT / as sysdba
SET lines 200
COL owner_name FORMAT a10;
COL job_name FORMAT a20
COL state FORMAT a11
COL operation LIKE state
COL job_mode LIKE state
SELECT owner_name, job_name, operation, job_mode,
state, attached_sessions
FROM dba_datapump_jobs
WHERE job_name NOT LIKE 'BIN$%'
ORDER BY 1,2;
no rows selected

Sunday 7 August 2011

How to Drop database in Oracle 10g and above

Dropping a database in 10g and above can be done in three easy steps:

1. Shutdown the database
SQL> shutdown immediate;

2. Startup in restricted mode
SQL> startup restrict mount;

3. Drop database
SQL> drop database;

Note: The above command does not clear off the archived log files. You will have to delete them manually.
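Alternatively, RMAN can remove the database together with its archived logs and backups in one go; the database must be mounted in restricted mode, as above:

$ rman target /
RMAN> startup force mount;
RMAN> sql 'alter system enable restricted session';
RMAN> drop database including backups noprompt;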

Thursday 4 August 2011

ORA-12547: TNS:lost contact

I was getting the following error while trying to connect to database locally as a different user:
oracledbtst: error while loading shared libraries: libskgxp11.so: cannot open shared object file: No such file or directory
ERROR:
ORA-12547: TNS:lost contact

I was told that the Oracle software the database was running from had been copied from another server.

In order to fix it, this is what I did:
1. Stopped the database and the listener running from the home of the database.

2. Gave correct permission to $ORACLE_HOME/bin/oracle:
$ chmod 6751 $ORACLE_HOME/bin/oracle

3. Relinked the Oracle binary as follows since the database version was 11.1 on Linux
$ cd $ORACLE_HOME/bin
$ relink all

4. Started my database and the listener

5. Tested the connection and it worked without any errors.

Wednesday 3 August 2011

ORA-27140: attach to post/wait facility failed

I was getting this error when a different OS user tried to connect to the same database. The database was actually running as OS user oraA while the user trying to connect to it was oraB.

After doing some analysis, it was found that user oraA did not have oinstall as its primary group while oraB did. Also, the permission on $ORACLE_HOME/bin/oracle was -rwxr-xr-x.

To fix it, this is what I did:
1. Shut down the database and listener as user oraA
2. As root user, added oinstall as primary group for oraA user:
# usermod -g oinstall oraA

3. Connected as the oraA user and changed the permission of the oracle binary as follows:
$ chmod 6751 $ORACLE_HOME/bin/oracle
$ ls -l $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oraA oinstall 196020393 Nov 29  2009 oracle

4. Started the listener and the database

5. Tested the connectivity from user oraB and it all worked.

Monday 1 August 2011

Migrating database from 32 bit to 64 bit

In order to migrate an Oracle database from 32-bit to 64-bit, we used the following RMAN restore method.

1. Ensure the new 64-bit server has the 64-bit Oracle binaries installed and configured.

2. Create the pfile of the database on the 64-bit server. This can simply be copied from the 32-bit server. Make sure it is updated with the correct file locations if they differ on the new server.

3. Take a full backup of the database via RMAN on the 32-bit server by running the following command:
RMAN> run 
{ 
allocate channel c1 type disk; 
allocate channel c2 type disk; 
backup database plus archivelog; 
backup current controlfile; 
} 

4. Copy backup files to the 64 bit server.

5. Start the database on the 64-bit server in nomount state:
SQL> startup nomount;

6. Connect to RMAN and set the dbid:
$ rman target /
RMAN> set dbid=<dbid from 32 server>; 
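The DBID to use here should be noted on the source 32-bit database before it is shut down:

SQL> select dbid from v$database;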

7. Restore the controlfile and database:
RMAN> restore controlfile from '<location of the backup>'; 
RMAN> alter database mount; 
RMAN> restore database; 

 
8. Recover database to the point you want. If all the backups are available, then recover the database to the latest archivelog:
RMAN> run 
{ 
set until sequence xxxx; 
recover database; 
} 

RMAN> exit 

9. Once the recovery is done, perform the following tasks in order to start the database on the 64-bit platform:
SQL> recover database until cancel using backup controlfile; 
cancel 
SQL> alter database open resetlogs migrate; 
SQL> @ ?/rdbms/admin/utlirp.sql 
SQL> shutdown immediate;
SQL> startup;
SQL> @ ?/rdbms/admin/utlrp.sql

SQL> shutdown immediate;
SQL> startup

10. The database is now ready to be used.

NOTE: If the database being migrated has a standby database as well, then the standby database should be recreated, since the primary database was opened with RESETLOGS.

Tuesday 26 July 2011

ORA-29279: SMTP permanent error: 530 5.7.1 Client was not authenticated

I was trying to use utl_smtp to send mail from the Oracle database to myself.
This is the PL/SQL code I used to do it:

declare
c utl_smtp.connection;
mailhost VARCHAR2(30) := 'mail server';
dbserver VARCHAR2(30) := 'db server';
from_eml VARCHAR2(40) := 'from email address';
to_eml   VARCHAR2(40) := 'to email address';
begin
c := utl_smtp.open_connection(mailhost, 25); -- SMTP on port 25
utl_smtp.helo(c, dbserver);
utl_smtp.mail(c, from_eml);
utl_smtp.rcpt(c, to_eml);
utl_smtp.open_data(c);
-- headers must be separated from the body by a blank line
utl_smtp.write_data(c, 'Subject: Your Subject Line Here' || utl_tcp.crlf || utl_tcp.crlf);
utl_smtp.write_data(c, 'This is a test message');
utl_smtp.close_data(c);
utl_smtp.quit(c);
end;
/
However, when running the above block, it was giving the following error:

ORA-29279: SMTP permanent error: 530 5.7.1 Client was not authenticated
ORA-06512: at "SYS.UTL_SMTP", line 20
ORA-06512: at "SYS.UTL_SMTP", line 98
ORA-06512: at "SYS.UTL_SMTP", line 221
It turned out that the Exchange server was not configured to relay mail from the database server's IP address.
To fix it, the system admin managing Exchange added the database server's IP address to the relay list.

And it all started to work then.

Monday 27 June 2011

Oracle RAC instance showing different times

We had a situation where the database timezone was set to Australia/Sydney while the server timezone was local New Zealand time. The users were complaining that they were seeing local New Zealand time when they were connected to Instance 1 while it was showing Australia/Sydney time when connected to Instance 2.

The following query returned the right time regardless of which instance the user connected to:
select systimestamp AT TIME ZONE dbtimezone from dual;

However, when they issued the following, the times returned were different on the two instances:
select systimestamp from dual;

After doing some investigation, it was found that the reason for this was that Instance 1 was started via SQL*Plus while Instance 2 was started using srvctl.
Hence, the database was shut down and restarted using srvctl, which fixed the issue.

How to set Database Timezone in Oracle RAC

In order to set the database timezone for Oracle 11g RAC, perform the following:
$ srvctl setenv database -d <dbname> -t TZ=<the TZ you want>
$ gsdctl stop
$ gsdctl start
For example, to set the db timezone of a RAC database called dbtst to Australia/Sydney, perform the following:
$ srvctl setenv database -d dbtst -t TZ=Australia/Sydney
$ gsdctl stop
$ gsdctl start
To check the timezone setting, issue the following:
$ srvctl getenv database -d dbtst

Buffer Busy Waits Tuning

Recently I was involved in resolving an ongoing, degrading performance issue for an application. It had reached a state where the users were just frustrated and the business was not happy at all.
When it ended up in my court, the first thing I did was identify the window when the user experience was really bad. I had to liaise with the app vendors to understand the application a bit and what else was running on it.
It turned out that a small ETL job runs on the database regularly during business hours, four times a day, to fulfill a reporting requirement, so stopping it was out of the question. The application slows down when the ETL kicks off and the response time gets better when the ETL finishes.
After analysing the statspack report (the database was 9i), I found that apart from IO, the top wait event in the slowness window was 'buffer busy waits':

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read                         2,198,322      13,940    40.34
buffer busy waits                               1,273,203       9,852    28.51
CPU time                                                        4,241    12.27
enqueue                                             1,170       3,414     9.88
latch free                                         97,147       1,542     4.46

Note that buffer busy waits do not appear in the top 5 wait events outside the problem window.
I checked whether there was a link between the periods when users were complaining about database performance and high buffer busy waits, and there was one.
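The trend can be pulled from the statspack repository with a query along the following lines. This is only a sketch, assuming the standard stats$waitstat and stats$snapshot tables; the deltas between successive snapshots are what matter:

select to_char(s.snap_time, 'DD-MM-YY HH24:MI:SS') snap_time,
       w.class,
       w.wait_count - lag(w.wait_count) over (partition by w.class order by w.snap_id) wait_count,
       w.time - lag(w.time) over (partition by w.class order by w.snap_id) time_waited
  from stats$waitstat w, stats$snapshot s
 where w.snap_id = s.snap_id
   and w.dbid = s.dbid
   and w.instance_number = s.instance_number
 order by s.snap_time, w.class;

And here is the output showing the trend: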

SNAP_TIME         CLASS                WAIT_COUNT         TIME
----------------- -------------------- ---------- ------------
21-06-11 00:13:06 2nd level bmb                 1            3
21-06-11 00:13:06 data block               74,340      338,940
21-06-11 00:13:06 segment header               18           33
21-06-11 00:13:06 undo block                   42            1
21-06-11 00:43:05 data block               34,519      181,783
21-06-11 00:43:05 undo header                   2            5
21-06-11 01:13:09 data block                2,733        2,744
21-06-11 01:43:12 data block                  144          344
21-06-11 01:43:12 undo header                   2            0
21-06-11 02:13:16 data block                  187          208
21-06-11 02:43:20 data block                   70          155
21-06-11 03:13:25 data block              160,186      318,329
21-06-11 03:13:25 segment header                2            0
21-06-11 03:13:25 undo block                  170           25
21-06-11 03:43:28 data block               40,802       87,378
21-06-11 04:13:32 data block               21,974       19,604
21-06-11 04:13:32 segment header                1            2
21-06-11 04:43:36 data block                   17           47
21-06-11 05:13:40 data block                3,523        2,755
21-06-11 05:13:40 segment header                4            1
21-06-11 05:43:44 data block               15,238       11,911
21-06-11 05:43:44 segment header                6            5
21-06-11 05:43:44 undo header                   7            0
21-06-11 06:13:47 data block                  259          374
21-06-11 06:43:51 data block                  181          402
21-06-11 06:43:51 undo header                   6            2
21-06-11 07:13:56 data block               42,250       22,239
21-06-11 07:13:56 undo header                   3            0
21-06-11 07:43:59 data block              158,719       91,015
21-06-11 08:14:04 1st level bmb                 3            1
21-06-11 08:14:04 2nd level bmb                 1            3
21-06-11 08:14:04 data block              542,143      761,181
21-06-11 08:14:04 segment header                9           14
21-06-11 08:14:04 undo block                   16            2
21-06-11 08:14:04 undo header                  73            0
21-06-11 08:44:09 data block             ########    1,109,580
21-06-11 08:44:09 segment header                2            5
21-06-11 08:44:09 undo block                   65            8
21-06-11 08:44:09 undo header                   9           10
21-06-11 09:14:12 data block             ########      930,479
21-06-11 09:14:12 segment header                3            4
21-06-11 09:14:12 undo block                   20            1
21-06-11 09:14:12 undo header                   4            6
21-06-11 09:44:19 data block             ########      981,900
21-06-11 09:44:19 segment header                1            3
21-06-11 09:44:19 undo block                  115           53
21-06-11 09:44:19 undo header                  34            7
21-06-11 10:14:22 data block              982,942      701,688
21-06-11 10:14:22 segment header                6            7
21-06-11 10:14:22 undo block                   43           32
21-06-11 10:14:22 undo header                   3            5
21-06-11 10:44:26 data block              522,797      342,624
21-06-11 10:44:26 segment header                2            5
21-06-11 10:44:26 undo block                   14            0
21-06-11 10:44:26 undo header                   4            0
21-06-11 11:14:32 1st level bmb                 8            4
21-06-11 11:14:32 2nd level bmb                 6            3
21-06-11 11:14:32 data block              905,206    1,077,409
21-06-11 11:14:32 segment header               46           50
21-06-11 11:14:32 undo block                   62            9
21-06-11 11:14:32 undo header                   7            6
21-06-11 11:44:51 data block             ########    2,316,365
21-06-11 11:44:51 segment header                3           20
21-06-11 11:44:51 undo block                  186          293
21-06-11 11:44:51 undo header                   9           46
21-06-11 12:14:55 data block              153,829      130,860
21-06-11 12:14:55 undo block                   15           10
21-06-11 12:14:55 undo header                   2            1
21-06-11 12:44:59 data block              152,903       84,885
21-06-11 12:44:59 undo block                   13            0
21-06-11 12:44:59 undo header                   2            0
21-06-11 13:15:01 data block               37,928       37,543
21-06-11 13:15:01 segment header               10            4
21-06-11 13:15:01 undo header                   9            7
21-06-11 13:45:06 data block               28,441       12,470
21-06-11 13:45:06 undo header                   4            0
21-06-11 14:15:09 data block              595,511      339,329
21-06-11 14:15:09 undo block                   19            2
21-06-11 14:15:09 undo header                  27            7
21-06-11 14:45:16 data block               24,499       12,706
21-06-11 14:45:16 undo block                   83          130
21-06-11 14:45:16 undo header                  33           18
21-06-11 15:15:18 1st level bmb                 3            3
21-06-11 15:15:18 2nd level bmb                 2            3
21-06-11 15:15:18 data block             ########    2,149,819

The above output confirms that users were complaining about database performance exactly when the buffer busy waits were high, and that the contention was on data blocks.
After running some monitoring scripts to find out which object was affected, a massive table was identified. The database default block size was 32K, meaning a lot of rows per block. To make it worse, the table was sitting in a locally managed tablespace with manual segment space management. Another interesting finding during the analysis was that the reason code for the buffer busy wait event was 300 (the block is being read by another session, and no other suitable block image was found, so we wait until the read is completed; this may also occur after a buffer cache assumed deadlock, where the kernel can't get a buffer in a certain amount of time, assumes a deadlock, and reads the CR version of the block). It simply indicated we were having hot block contention.

Hence, my recommendation to fix the issue was to create another tablespace with an 8K block size and automatic segment space management, and to move the problem table into it. Once the migration was done, the buffer busy waits disappeared from the top 5 wait events and users found the application response time normal even while the ETL was running. The business was happy as well.
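For reference, the change amounts to something like the following. This is a sketch only: the cache size, tablespace name, datafile path, and table name are all illustrative, and a buffer cache for the non-default block size must be configured before the tablespace can be created:

SQL> alter system set db_8k_cache_size = 256M;
SQL> create tablespace data_8k
     datafile '/u01/oradata/DB/data_8k01.dbf' size 10G
     blocksize 8K
     extent management local
     segment space management auto;
SQL> alter table app_owner.big_table move tablespace data_8k;

Remember that the move leaves the table's indexes UNUSABLE, so they have to be rebuilt afterwards.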

The buffer busy wait numbers were also much better after the change:

SNAP_TIME         CLASS                WAIT_COUNT         TIME
----------------- -------------------- ---------- ------------
28-06-11 00:04:56 data block               40,959       90,487
28-06-11 00:04:56 undo block                    5            0
28-06-11 00:34:59 data block               59,194      179,264
28-06-11 00:34:59 undo block                   18            9
28-06-11 01:05:03 data block                1,459        2,088
28-06-11 01:35:06 data block                  145          451
28-06-11 02:05:11 data block                  235          503
28-06-11 02:35:14 data block                  144          516
28-06-11 03:05:19 data block               16,663       20,802
28-06-11 03:35:22 2nd level bmb                 1            4
28-06-11 03:35:22 data block              140,303      404,557
28-06-11 03:35:22 segment header                7           10
28-06-11 03:35:22 undo block                   39           58
28-06-11 03:35:22 undo header                   6           17
28-06-11 04:05:26 data block               31,410       24,654
28-06-11 04:35:29 data block                4,376        6,528
28-06-11 05:05:33 data block                9,393       14,982
28-06-11 05:05:33 segment header               12            8
28-06-11 05:05:33 undo block                    2            0
28-06-11 05:35:38 undo header                   2            0
28-06-11 06:05:41 data block                  266          138
28-06-11 06:35:45 data block              147,655      295,869
28-06-11 06:35:45 segment header                3            1
28-06-11 06:35:45 undo block                    8            0
28-06-11 07:05:49 data block                9,572       11,383
28-06-11 07:05:49 segment header                2            2
28-06-11 07:35:53 data block              141,267      280,172
28-06-11 07:35:53 undo block                    5            1
28-06-11 08:05:57 data block                1,239        1,913
28-06-11 08:05:57 segment header                2            2
28-06-11 08:05:57 undo header                   2            0
28-06-11 08:36:00 data block                8,436       15,255
28-06-11 08:36:00 segment header                3            2
28-06-11 08:36:00 undo header                  11            0
28-06-11 09:06:07 1st level bmb                 3            3
28-06-11 09:06:07 2nd level bmb                 2            0
28-06-11 09:06:07 data block              164,876      520,704
28-06-11 09:06:07 segment header               28           39
28-06-11 09:06:07 undo block                   18            3
28-06-11 09:06:07 undo header                   7            0
28-06-11 09:36:10 data block               15,731       45,259
28-06-11 09:36:10 undo block                    4            5
28-06-11 09:36:10 undo header                   1            2
28-06-11 10:06:19 data block               22,449       22,406
28-06-11 10:06:19 undo block                   11            6
28-06-11 10:06:19 undo header                  19           10
28-06-11 10:36:20 data block                9,059       14,184
28-06-11 10:36:20 segment header                1            4
28-06-11 10:36:20 undo block                    8            2
28-06-11 10:36:20 undo header                  28            1
28-06-11 11:06:24 data block               18,346       46,440
28-06-11 11:06:24 segment header                7            8
28-06-11 11:06:24 undo block                    6            0
28-06-11 11:06:24 undo header                   8            1
28-06-11 11:36:26 data block               14,429       15,896
28-06-11 11:36:26 undo block                    6            0
28-06-11 11:36:26 undo header                   9            3
28-06-11 12:06:32 data block                1,967        5,556
28-06-11 12:06:32 undo block                    3            0
28-06-11 12:06:32 undo header                   3            0
28-06-11 12:36:34 data block                1,992        3,037
28-06-11 12:36:34 undo block                    4            0
28-06-11 12:36:34 undo header                   1            5
28-06-11 13:06:40 1st level bmb                 8            4
28-06-11 13:06:40 2nd level bmb                 8            0
28-06-11 13:06:40 data block              183,334      517,622
28-06-11 13:06:40 segment header               34           43
28-06-11 13:06:40 undo block                    7            2
28-06-11 13:36:42 data block               26,026       28,859
28-06-11 13:36:42 undo header                   3            0
28-06-11 14:06:46 data block                2,078        3,912
28-06-11 14:06:46 segment header                1            3
28-06-11 14:06:46 undo block                    4            0
28-06-11 14:06:46 undo header                  15            2
28-06-11 14:36:52 data block                3,131        5,007
28-06-11 14:36:52 undo block                    3            0
28-06-11 14:36:52 undo header                  15            1
28-06-11 15:06:54 data block               20,727       92,361
28-06-11 15:06:54 segment header                8           24
28-06-11 15:06:54 undo block                   10            9
28-06-11 15:06:54 undo header                  23            7


Monday 13 June 2011

Using Resource Manager to Manage Degree of Parallelism

I was asked to implement a Resource Manager plan to limit the degree of parallelism for all users except a particular user who does more batch processing.
This is what I did:

BEGIN
DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
DBMS_RESOURCE_MANAGER.CREATE_PLAN(PLAN => 'DOP_PLAN',
COMMENT => 'Plan to limit degree of parallelism');
DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP ( consumer_group => 'NOLIMIT_DOP' , comment =>'Batch users consumer group');
DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE (plan => 'DOP_PLAN', group_or_subplan => 'NOLIMIT_DOP',
comment => 'NO limits to parallelism');
DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE (plan => 'DOP_PLAN', group_or_subplan => 'OTHER_GROUPS',
comment => 'Limits use of parallelism',parallel_degree_limit_p1=> 4);
DBMS_RESOURCE_MANAGER.set_consumer_group_mapping(attribute => DBMS_RESOURCE_MANAGER.ORACLE_USER,value => 'BATCH_USER',consumer_group => 'NOLIMIT_DOP');
DBMS_RESOURCE_MANAGER.validate_pending_area();
DBMS_RESOURCE_MANAGER.submit_pending_area();
END;
/
exec DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group('BATCH_USER','NOLIMIT_DOP',false);
exec DBMS_RESOURCE_MANAGER.set_initial_consumer_group(user => 'BATCH_USER',consumer_group =>'NOLIMIT_DOP');
alter system set resource_manager_plan = 'DOP_PLAN' scope=both;

-----------------------------------------------------------------------------------------------------
Ok, this is what each line does:

The pending area is a temporary work area for Resource Manager configuration. The changes in the pending area are not visible until the pending area is submitted.
DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

Create a resource plan called 'DOP_PLAN' which will contain the resource consumer groups.
DBMS_RESOURCE_MANAGER.CREATE_PLAN(PLAN => 'DOP_PLAN',
COMMENT => 'Plan to limit degree of parallelism');

Create a consumer group called 'NOLIMIT_DOP'.
DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP ( consumer_group => 'NOLIMIT_DOP' , comment =>'Batch users consumer group');

Create a plan directive for this group without any limits. This consumer group will have no limit on its degree of parallelism.
DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE (plan => 'DOP_PLAN', group_or_subplan => 'NOLIMIT_DOP',
comment => 'NO limits to parallelism');

All the other users, who fall into OTHER_GROUPS, will have their degree of parallelism limited to 4.
DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE (plan => 'DOP_PLAN', group_or_subplan => 'OTHER_GROUPS',
comment => 'Limits use of parallelism',parallel_degree_limit_p1=> 4);

Add the user called BATCH_USER to the consumer group NOLIMIT_DOP.
DBMS_RESOURCE_MANAGER.set_consumer_group_mapping(attribute => DBMS_RESOURCE_MANAGER.ORACLE_USER,value => 'BATCH_USER',consumer_group => 'NOLIMIT_DOP');

Validate and submit the changes.
DBMS_RESOURCE_MANAGER.validate_pending_area();
DBMS_RESOURCE_MANAGER.submit_pending_area();

Ensure the BATCH_USER user has permission to switch to the NOLIMIT_DOP consumer group and that it is set as the user's initial group.
exec DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group('BATCH_USER','NOLIMIT_DOP',false);
exec DBMS_RESOURCE_MANAGER.set_initial_consumer_group(user => 'BATCH_USER',consumer_group =>'NOLIMIT_DOP');

Enable the resource plan
alter system set resource_manager_plan = 'DOP_PLAN' scope=both;
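Once the plan is active, you can confirm which consumer group each session actually lands in via the standard resource_consumer_group column of v$session:

SQL> select username, resource_consumer_group from v$session where username is not null;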

----------------------------------------------------------------------------------------
In order to see whether it really works, this is what I did.
The database has maximum parallel servers defined as
SQL> show parameter parallel_max
NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------
parallel_max_servers                 integer     24

Connected as user BATCH_USER and issued the following command:
select /*+ parallel(dba_segments,10) */ * from dba_segments,dba_extents;

From another session, I issued the following to see how many parallel processes the user got and how many were requested:

SQL> select a.username, b.sid, b.qcsid, b.degree, b.req_degree
     from v$session a, v$px_session b
     where a.sid = b.qcsid and a.serial# = b.qcserial# and a.status = 'ACTIVE'
     order by 1;

USERNAME        SID      QCSID     DEGREE REQ_DEGREE
---------- ---------- ---------- ---------- ----------
BATCH_USER        112        116         10         10
BATCH_USER        103        116         10         10
BATCH_USER        111        116         10         10
BATCH_USER        124        116         10         10
BATCH_USER        114        116         10         10
BATCH_USER        105        116         10         10
BATCH_USER        106        116         10         10
BATCH_USER        113        116         10         10
BATCH_USER        119        116         10         10
BATCH_USER        144        116         10         10
BATCH_USER        118        116         10         10
BATCH_USER        110        116         10         10
BATCH_USER        115        116         10         10
BATCH_USER        107        116         10         10
BATCH_USER        130        116         10         10
BATCH_USER        109        116         10         10
BATCH_USER        108        116         10         10
BATCH_USER        104        116         10         10
BATCH_USER        142        116         10         10
BATCH_USER        149        116         10         10

From above you can see that the requested degree was 10 and the user got 10.

Now if I connect as a user other than BATCH_USER and request 10 parallel processes, I should not get them all. Let's see. I connected as the SYSTEM user and issued the following:
select /*+ parallel(dba_segments,10) */ * from dba_segments,dba_extents;

Then connected to another session to check how many parallel processes were given:
SQL> select a.username, b.sid, b.qcsid, b.degree, b.req_degree
     from v$session a, v$px_session b
     where a.sid = b.qcsid and a.serial# = b.qcserial# and a.status = 'ACTIVE'
     order by 1;
USERNAME                              SID      QCSID     DEGREE REQ_DEGREE
------------------------------ ---------- ---------- ---------- ----------
SYSTEM                                116        121          4         10
SYSTEM                                115        121          4         10
SYSTEM                                108        121          4         10
SYSTEM                                142        121          4         10
SYSTEM                                104        121          4         10
SYSTEM                                118        121          4         10
SYSTEM                                109        121          4         10
SYSTEM                                130        121          4         10

So you can see that the user only got 4 parallel processes even though it requested 10.

Hence, the resource plan is working the way I wanted it to.