Sunday 29 May 2011

Oracle Data Guard Problem - Weird but True

We had an Oracle DR setup for a client with a physical standby and a logical standby for the same primary database. The physical standby has the Data Guard broker configured.

Last week we had an outage on the switch between the 2 servers. After the switch and the connection between the 2 servers were restored, the primary database was no longer able to ship archive logs seamlessly, and we started to see log gaps between the 2 sites. Our physical standby and logical standby were able to process the logs they had but were not able to fetch the missing logs from the primary site, even though all the missing logs were still residing on the primary site.
When the physical standby was trying to fetch the logs from the primary site, we were getting the following error in the standby site's alert log file:


FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 138006-138007
DBID 371416489 branch 631758889
FAL[client]: All defined FAL servers have been attempted.
-------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that is sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
-------------------------------------------------------------


The primary site was not reporting any errors. We had to manually copy the missing archive logs to the physical standby and the logical standby, after which they were able to process them. However, these log gaps kept recurring quite frequently. The issue was not resolved even after restarting the Data Guard broker process on all the databases.
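
When copying logs by hand, it helps to know exactly which sequences are missing. Here is a minimal sketch that pulls them out of the gap messages shown above. The function name list_gap_sequences is my own, and it assumes the message format is exactly as it appeared in our alert log:

```shell
#!/bin/sh
# Hypothetical helper: reads alert-log text on stdin and prints one
# "thread sequence" pair per missing archive log.
# Assumes lines of the form: GAP - thread 1 sequence 138006-138007
list_gap_sequences() {
    awk '
        /^GAP - thread [0-9]+ sequence [0-9]+-[0-9]+/ {
            thread = $4                 # thread number
            split($6, range, "-")       # "138006-138007" -> first, last
            for (seq = range[1] + 0; seq <= range[2] + 0; seq++)
                print thread, seq
        }
    '
}

# Example using the message from the alert log above.
printf '%s\n' \
    'FAL[client]: Failed to request gap sequence' \
    'GAP - thread 1 sequence 138006-138007' |
    list_gap_sequences
```

In practice you would feed it the alert log of the standby, whose path depends on your installation.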

Finally, the issue was resolved by doing the following on the primary database server:
1. Identify the OS process IDs of the database archiver processes. For example, where “orcl” is the DB name:
ps -ef | grep arc | grep orcl
2. Kill the processes using the Unix “kill -9” command.
3. Wait for the archiver processes to restart automatically.
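
The steps above can be sketched as a small script. This is only a sketch: the function name archiver_pids is my own, and it assumes the usual Unix naming of archiver processes (ora_arc0_orcl, ora_arc1_orcl and so on). It matches the full process name rather than grepping for "arc", so it does not pick up the grep itself or another instance with a similar name, and it only prints what it would kill:

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef`-style lines on stdin and prints the
# PIDs of the archiver processes belonging to the given instance.
# Assumes process names of the form ora_arc<N>_<SID>.
archiver_pids() {
    sid=$1
    awk -v sid="$sid" '
        # Last field is the command name, second field is the PID.
        $NF ~ ("^ora_arc[0-9a-z]+_" sid "$") { print $2 }
    '
}

# Dry run on sample input (made-up PIDs): only the orcl archivers are listed.
printf '%s\n' \
    'oracle 4101 1 0 May22 ? 00:00:05 ora_arc0_orcl' \
    'oracle 4103 1 0 May22 ? 00:00:04 ora_arc1_orcl' \
    'oracle 4200 1 0 May22 ? 00:00:01 ora_pmon_orcl' |
    archiver_pids orcl |
    while read -r pid; do
        echo "would run: kill -9 $pid"
    done
```

On a real server you would replace the sample input with ps -ef | archiver_pids orcl, review the list, and only then run the kill.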

Once the archiver processes were started, the DR site was able to automatically connect to the primary site via FAL processes and recover all the missing logs.

Should you encounter something like this in future, I suggest that you either request a clean restart of the primary database or just kill the archiver processes, as they will restart automatically.

Cool but weird…
