Sunday, 29 May 2011

Oracle Dataguard Problem - Wierd but True

We had a Oracle DR setup for a client that has a physical standby and logical standby for same primary database. The physical standby has dataguard broker configured.

Last week we had an outage on the switch between the 2 servers and after the switch was restored together with the connection between the 2 servers, we started to have an issue whereby the primary database was not able to ship the archivelogs in a seamless fashion. We started to have log gaps between the 2 sites. Our physical standby and logical standby was able to process the logs but was not able to recover the missing logs from the primary site. However, all the missing logs was residing on the primary site.
When the physical standby was trying to fetch the logs from the primary site, we were getting the following error on the primary site' alert log file:


FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 138006-138007
DBID 371416489 branch 631758889
FAL[client]: All defined FAL servers have been attempted.
-------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
-------------------------------------------------------------


The primary site was not reporting any errors. We had to manually copy the missing archive logs on physical standby and logical standby and then it was able to process it. However, this log gaps was happening quite frequently. The issue was not getting resolved even after restarting the DataGuard Broker process on all the databases.

However, finally the issue got resolved by performing this on the primary database server:
1.       Identify the OS process numbers of the database archiver process.  For example “orcl” is the DB name below:
ps -ef | grep arc | grep orcl
2.       Kill the proceses using the unix “kill -9” command.
3.       Wait for the archiver processes to start automatically again.

Once the archiver processes were started, the DR site was able to automatically connect to the primary site via FAL processes and recover all the missing logs.

Should you encounter something like this in future, I suggest that you either request a clean restart of the primary database or just kill the archiver processes as it will restart automatically.

Cool but weird…

STATSPACK.SNAP is slow and got ORA-1652 ON 11.1

This is related to bug BUG:8323663/8226792 in 11.1 and Oracle support has confirmed that this is fixed in 11.2. They have not provided any fix in 11.1.

This is also related to statspack.snap taking longer time to run.

However, if you are stuck with 11.1 database, you can use the following solution as a fix to this issue:

1. Connect to the database as sys
2. Recreate the following view with the below command:
create or replace view STATS$V_$FILESTATXS as
select /*+RULE */ ts.name tsname
, df.name filename
, fs.phyrds
, fs.phywrts
, fs.readtim
, fs.writetim
, fs.singleblkrds
, fs.phyblkrd
, fs.phyblkwrt
, fs.singleblkrdtim
, fw.count wait_count
, fw.time time
, df.file#
from x$kcbfwait fw
, v$filestat fs
, v$tablespace ts
, v$datafile df
where ts.ts# = df.ts#
and fs.file# = df.file#
and fw.indx+1 = df.file#;

3. Run the statspack.snap procedure and it should run quicker with errors.

Enjoy...

How to generate the SQL execution plan for a current SQL run by a session

From 10g onwards, generating the SQL execution plan for a current session has become easire. Simply get the SQL_ID and SQL_CHILD_PLAN from v$session view and feed it in the below statement:

select * from table(dbms_xplan.display_cursor('&SQL_ID','&SQL_CHILD_NUMBER'));



And you will have you execution plan generated on your screen which is being used by the session running an SQL statment.

How to boost the speed of export/import process

I was given the task of migrating some databases for a client from Oracle 9i to Oracle 11g and also to move the database from Sun Solaris platform to IBM AIX platform.

The client wanted the migration to be done quickly as we were bound by space and the number of databases to be migrated were around 30. Normally, the approach in such migration is in the following phases:

- PHASE 1 - Move the database from one platform to another on the same version. But since the version we had was 9i, and the Solaris and AIX are not of same endian, we could not use transportable tablespace feature.

- PHASE 2 - Upgrade the database after the migration. In our situation, we will have to upgrade the database first and then migrate to different platform to use the transparent transportable tablespace feature. The risk here is that if we had some issues during the upgrade phase, rollowing back would take some more time as the old binaries and DB backup will need to be restored.



Taking all the constraints we had, we decided to use the old and beautiful exp/imp utility. However, we all know that if we don't know properly how to use this tool optimally, we could be in big trouble as it may take a very long time to run. So the plan the migration was as follows:

STEP 1 - export the database

SETP 2 - import data only

SETP 3 - import constraints only

STEP 4 - import the rest



Now to change the speed gear to the highest one, here are the tricks that I took to make it all happen in a much lesser time than usual. Basically, we tested it out with the default settings and found that it took approx. 10hrs to import 40GB of data. However, we managed to drop this time down to 4hrs by simply applying the following tricks during the whole process:

Assuming a full export, use direct=y if you’re not using any predicates in the export. Set your buffer to be big (10MB at least). Also try not to collect statistics during the export. This is something we can do after the import.

For example: exp username/password@db file=dump.dmp log=logfile.log full=y buffer=10485760 statistics=none

Alter all your constraints novalidate prior to export if you can bring the app down, and if you can take a consistent export. This helps in reinstating the constraints instantly without forcing Oracle to validate constraints on massive tables
I set workarea_size_policy=manual. This will ensure that manual PGA management is used. Since my import session will be the only session connected to the database, I think this is the simplest way.
Set sort_area_size=4GB (or more). Actually, the AIX server only allowed us to change this parameter to a max valus of 1GB but that helped us.

Set massive online redo logs at 2GB each, 2 members each, 6 groups.

To elobrate why we introduced step 4 in the process is that there seems to be a bug in Oracle whereby if we will import the constraints and indexes together, then the underlying indexes of a constraint is not dropped when the constraint is dropped. To overcome this problem, I always recommend to import the constraints before the you import the indexes.

After the import, I reset everything back down to “normal”. At the end of the day, the client was happy and so was I. It had a very happy ending...

ORA-27126: unable to lock shared memory segment in core

This articale covers the resolution for this error on AIX 5L platform.

This error happens when you set you LOCK_SGA parameter to TRUE for any Oracle database that is running on 10.2.0.4 patchset or higher release and you start the instance. You will not see this error as soon as the you set your LOCK_SGA to false and start the instance.

You will also find similar ORA-600 error in your alert log file when you face this issue:

ORA-00600: internal error code, arguments: [kmgsb_resize_sga_target_1]

This error basically means that the oracle user does not have the necessary privilege at the OS level to pin certain size of memory. To check whether oracle user has the required privilege, run the following:

lsuser -a capabilities oracle


The fix for this is as follows but you will need to have root privilege to do this on the server.

1. Add the CAP_BYPASS_RAC_VMM and CAP_PROPAGATE capabilities to "oracle" user
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle

2. Configure the AIX large page pool by calculating the number of large pages required for the SGA:
num_of_large_pages = INT((total_SGA_size-1)/16MB)+1

3. Configure the number and size of large pages:
# vmo -p -o lgpg_regions=num_of_large_pages -o lgpg_size=16777216


Please ensure that the value you set for lgpg_regions parameter should be sufficient enough for your SGA to pin the memory. Also ensure that your SGA is not bigger than your physical RAM.

You may refer to MOSS note 603489.1

"resmgr:cpu quantum" waits appearing and the performance is slow

The above symptom is noticed on 11.1.0.6 database. We came across this issue on one of our databases which was running slow when the backups were kicking in. Our database was not having any Resource Consumer Plans. However, from 11g onwards, every maintenance window is part of a DEFAULT resource plan by default. As soon as the maintenance windows open, the DEFAULT plan gets associated with it. The issue we had was that our backup window was falling in the maintenance window.

As a workaround to this issue, we disabled the resource plans from the maintenance windows and the wait disappeared.

However, I do recommend that the database running on 11.1.0.6 patchset should be patched to 11.1.0.7 to overcome this issue. And also, Oracle has associated the Maintenance Windows with the DEFAULT resource plans for a reason so we should stick with that which we can do on 11.1.0.7 patchset or higher.


As per MOSS note 6874858.8, the description and workaround is as follows:

When RMAN is running with Resource Manager enabled, the performance of
the system may degrade. Users may have problems logging in.
When this occurs large waits will be seen with the wait event
"resmgr:cpu quantum" with low CPU utilization.

Workaround
Disable Resource Manager.