While newer database releases continue to emerge, countless enterprise architectures worldwide still depend on the robust framework of Oracle 12cR2. Managing these legacy systems requires a deep understanding of core Clusterware mechanics that remain relevant today. Patching an Oracle Real Application Clusters (RAC) environment is frequently cited as the most stressful operational task a database administrator faces: a minor oversight with directory permissions or a missing environment variable can rapidly escalate into a full cluster outage. This guide, originally built around the rigorous demands of a Patch Set Update (PSU), serves as a universal blueprint. It is not merely a repetition of the standard documentation, but a collection of battle-tested strategies, undocumented edge cases, and crucial troubleshooting maneuvers designed to protect your 12c infrastructure during any major update cycle. We will walk through the zero-downtime rolling patch methodology, focusing on the hidden architectural nuances that determine success or failure.
- 1. The Critical Baseline: Auditing Environment Variables
- 2. Strategic Staging Logic and Directory Permissions
- 3. OPatch Engine Upgrades and Ownership Synchronization
- 4. Managing Background Daemons: MGMTDB and CHA
- 5. The Ultimate Parachute: Physical Binary Backups
- 6. Execution Mechanics: The Rolling Patch Sequence
- 7. Mitigating Post-Patch Hangs: The glogin.sql Bypass
1. The Critical Baseline: Auditing Environment Variables
The foundation of a successful cluster intervention begins long before a single binary is downloaded. In a standard Oracle RAC architecture, the administrative workload is distributed across distinct operating system users, typically designated as 'root', 'grid', and 'oracle'. Each profile manages specific layers of the software stack. A catastrophic, yet remarkably common, failure occurs when the root user initiates Clusterware scripts without the correct Grid Infrastructure pathways loaded into its environment profile.
When automated scripts like rootcrs.sh are executed, they implicitly rely on shell variables to locate critical Perl interpreters and shared libraries. If these variables are empty, the script will often terminate silently or leave the Clusterware suspended in an unrecoverable intermediate state. It is mandatory to manually echo these pathways across all user profiles on every node to guarantee operational symmetry.
-- Execute and verify as root user
[root@ol7ora12r11]$ echo $DB_HOME
/u01/app/oracle/product/12c/db_1
[root@ol7ora12r11]$ echo $GRID_HOME
/u01/app/12c/grid
-- Execute and verify as grid user
[+ASM1:grid@ol7ora12r11]$ echo $GRID_HOME
/u01/app/12c/grid
[+ASM1:grid@ol7ora12r11]$ echo $ORACLE_HOME
/u01/app/12c/grid
-- Execute and verify as oracle user
[ora12r11:oracle@ol7ora12r11]$ echo $ORACLE_HOME
/u01/app/oracle/product/12c/db_1
I recall a grueling troubleshooting session where the OPatch utility failed with a vague “Inventory not found” error during the application phase. After hours of tracing, the root cause was discovered: the root user possessed an entirely barren .bash_profile. The Oracle automated routines had spawned a sub-process that inherited this void environment, leading to an immediate failure. Explicitly confirming variable outputs before executing destructive commands is the primary discipline that differentiates veteran engineers from novices. If the terminal returns a blank line, you must explicitly export the missing variable before proceeding.
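The audit above can be automated with a small guard function. The sketch below is illustrative rather than an Oracle tool: the helper name `require_var` is my own, and the hard-coded paths mirror the example homes used throughout this article.

```shell
#!/bin/sh
# Minimal sketch: refuse to proceed when a required variable is empty.
# Paths mirror the example homes used in this article.
GRID_HOME=/u01/app/12c/grid
ORACLE_HOME=/u01/app/oracle/product/12c/db_1

require_var() {
    name="$1"
    eval "value=\${$name}"                 # indirect lookup by name
    if [ -z "$value" ]; then
        echo "FATAL: $name is empty -- export it before patching" >&2
        return 1
    fi
    echo "OK: $name=$value"
}

require_var GRID_HOME
require_var ORACLE_HOME
```

Run it once per profile (root, grid, oracle) on every node; any FATAL line means the session is not yet safe for rootcrs.sh or opatch.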
2. Strategic Staging Logic and Directory Permissions
Staging massive patch files within a clustered environment requires a different approach than single-instance databases. The instinct to use a shared Network File System (NFS) mount to save space is often flawed. During the intense input/output operations of unpacking and applying thousands of library files, network latency or brief locking conflicts on shared storage can silently corrupt the installation process.
I strongly advocate isolating the patch archives to a dedicated, local directory (for example, /tmp/patch) on each participating node. Furthermore, we employ a strategy that often alarms security monitoring tools: implementing open permissions on this temporary staging zone. This is a tactical necessity because the patching workflow requires rapid context switching between the root, grid, and oracle user accounts.
-- Perform on both node 1 and node 2
[root@ol7ora12r11]$ mkdir -pv /tmp/patch
-- Upload patch archives via sftp/scp to the staging directory
[root@ol7ora12r11]$ ls -l /tmp/patch
-rw-r--r--. 1 root root 2393137641 Sep 11 18:29 p33583921_122010_Linux-x86-64.zip
-rw-r--r--. 1 root root  133535622 Sep 11 18:21 p6880880_122010_Linux-x86-64.zip
-- Implement temporary open access for all administrative accounts
[root@ol7ora12r11]$ chmod -R 777 /tmp/patch
[root@ol7ora12r11]$ unzip /tmp/patch/p33583921_122010_Linux-x86-64.zip -d /tmp/patch
[root@ol7ora12r11]$ chmod -R 777 /tmp/patch/33583921
I have frequently witnessed scenarios where files uploaded via scp by the root user retain strict ownership rules. When the grid user subsequently attempts to initiate the prerequisite conflict checks, the system immediately rejects the command due to an inability to read the source archive. Applying chmod -R 777 to the localized staging path eliminates this entire class of trivial, yet blocking, permission conflicts. However, this operational shortcut requires strict discipline: the staging directory must be deleted immediately at the conclusion of the maintenance window to close the temporary security hole.
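One way to enforce that cleanup discipline is to tie the staging area's lifetime to the shell session itself. This is a sketch under my own assumptions: a throwaway mktemp directory stands in for /tmp/patch, and the real unzip step is left as a comment because it needs the actual archive.

```shell
#!/bin/sh
# Sketch: open-permission staging area that is guaranteed to disappear
# when the maintenance session ends, even if a step fails midway.
STAGE=$(mktemp -d /tmp/patch.XXXXXX)

# The trap fires on any exit path, closing the 777 window automatically.
trap 'rm -rf "$STAGE"' EXIT

chmod 777 "$STAGE"
echo "staging area ready: $STAGE"

# Real workflow (requires the actual archive):
# unzip /tmp/patch/p33583921_122010_Linux-x86-64.zip -d "$STAGE"
# chmod -R 777 "$STAGE"
```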
3. OPatch Engine Upgrades and Ownership Synchronization
Oracle’s internal patching utility, OPatch, evolves at a rapid pace to accommodate shifting metadata structures and conflict resolution logic. Attempting to interpret a complex Release Update using an obsolete OPatch version will invariably lead to cryptic parsing failures. Consequently, replacing the OPatch directories within both the Grid and database homes is an unavoidable prerequisite.
The most precarious aspect of this upgrade lies in user ownership. When a compressed OPatch archive is extracted using root privileges, the resulting directory structure is assigned root ownership. If an administrator fails to meticulously revert this ownership back to the appropriate application users (grid:oinstall or oracle:oinstall), the OPatch utility becomes entirely inaccessible to the database engine.
-- 1. Upgrade the Grid Infrastructure OPatch
[root@ol7ora12r11]$ mv $GRID_HOME/OPatch $GRID_HOME/OPatch.bak.archive
[root@ol7ora12r11]$ unzip /tmp/patch/p6880880_122010_Linux-x86-64.zip -d $GRID_HOME
-- Safety protocol: always change directory before a recursive chown
[root@ol7ora12r11]$ cd $GRID_HOME
[root@ol7ora12r11]$ chown grid:oinstall ./OPatch
[root@ol7ora12r11]$ cd OPatch
[root@ol7ora12r11]$ chown -R grid:oinstall *
-- 2. Verify OPatch engine integrity
[+ASM1:grid@ol7ora12r11]$ opatch version
OPatch Version: 12.2.0.1.42

OPatch succeeded.
The structure of the chown command is deliberately cautious. A catastrophic incident I observed involved an administrator attempting to run a recursive ownership change against the entire OPatch path, but a typographical error caused the command to target the root of the Grid home instead. This single mistake altered the core permissions of thousands of critical system binaries, preventing the cluster from starting and necessitating a full system restore. Always navigate directly into the target directory before executing bulk permission changes to severely restrict the potential blast radius.
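That habit can be encoded so a typo cannot widen the target. The wrapper below is my own illustration (the name `safe_chown_r` is not an Oracle tool): it changes into the directory inside a subshell and only ever hands the relative path "." to chown, so a mistyped path fails the cd instead of recursing over the Grid home.

```shell
#!/bin/sh
# Sketch: recursive chown that can only act on the directory it was
# given, because chown sees nothing but "." inside a subshell.
safe_chown_r() {
    owner="$1"; target="$2"
    [ -d "$target" ] || { echo "not a directory: $target" >&2; return 1; }
    # Subshell: the caller's working directory is untouched, and a
    # bad path aborts at the cd rather than at the chown.
    ( cd "$target" && chown -R "$owner" . )
}

# Demonstration against a throwaway tree owned by the current user.
demo=$(mktemp -d)
touch "$demo/file1"
safe_chown_r "$(id -un)" "$demo" && echo "chown scoped to $demo"
```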
4. Managing Background Daemons: MGMTDB and CHA
The architecture of Oracle 12cR2 Clusterware includes autonomous monitoring components, specifically the Management Database (MGMTDB) and the Cluster Health Advisor (CHA). These sophisticated services continuously interrogate the system, maintaining open file handles on core Grid Infrastructure libraries.
If the Cluster Health Advisor is active during the patching sequence, the operating system will forcefully prevent the OPatch utility from overwriting these shared libraries, resulting in a terminal “Text file busy” exception. Furthermore, the overall health of the Management Database must be confirmed; if it is compromised prior to patching, the automated unlocking scripts may hang indefinitely while attempting a graceful shutdown.
-- Assess management listener and database status
[+ASM1:grid@ol7ora12r11]$ srvctl status mgmtlsnr
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): ol7ora12r11
[+ASM1:grid@ol7ora12r11]$ crsctl status resource ora.mgmtdb -t
ora.mgmtdb
      1        ONLINE  ONLINE       ol7ora12r11              Open,STABLE
-- Explicitly terminate the Cluster Health Advisor daemon
[+ASM1:grid@ol7ora12r11]$ srvctl stop cha
[+ASM1:grid@ol7ora12r11]$ crsctl status resource ora.chad -t
ora.chad
      1        OFFLINE OFFLINE      ol7ora12r11              STABLE
      2        OFFLINE OFFLINE      ol7ora12r12              STABLE
If the ora.mgmtdb resource displays an OFFLINE or unstable state, halt the patching operation immediately. Applying new binaries will not magically resolve an underlying repository corruption. I have witnessed scenarios where patching scripts stalled for nearly an hour trying to interact with a compromised Management Database, transforming a routine update into a chaotic rollback scenario. Proactively stopping the CHA daemon is an equally critical defensive measure to ensure all library files are released and available for modification.
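The go/no-go decision can be scripted by matching the crsctl output against the healthy state. The check below is a sketch: the sample text stands in for live `crsctl status resource ora.mgmtdb -t` output, which on a real cluster you would pipe into the function directly.

```shell
#!/bin/sh
# Sketch: gate the patch run on MGMTDB health. Succeeds only when the
# resource line on stdin reports ONLINE/ONLINE and STABLE.
check_resource() {
    grep -Eq 'ONLINE[[:space:]]+ONLINE.*STABLE'
}

# Sample output standing in for the live crsctl command.
sample='ora.mgmtdb
      1        ONLINE  ONLINE       ol7ora12r11              Open,STABLE'

if printf '%s\n' "$sample" | check_resource; then
    echo "mgmtdb healthy -- safe to continue"
else
    echo "mgmtdb not healthy -- abort patching" >&2
fi
```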
5. The Ultimate Parachute: Physical Binary Backups
Relying solely on software-based rollback mechanisms is a high-risk strategy in enterprise environments. If a patch installation fails catastrophically and corrupts the central Oracle inventory, the native OPatch rollback commands will fail alongside it.
The only guaranteed method of recovery is a pristine, physical archive of the entire Oracle binary directory structure prior to any modifications. By utilizing the 'tar' utility, we create a compressed snapshot that allows us to bypass complex logical recoveries; we can simply eradicate the corrupted directory tree and extract the backup to instantly restore the previous operational state.
-- Execute on all participating cluster nodes
-- Create a physical archive of GRID_HOME
[root@ol7ora12r11]$ tar cvpzf /backup/grid_pre_patch_baseline.tar.gz $GRID_HOME
-- Create a physical archive of ORACLE_HOME
[ora12r11:oracle@ol7ora12r11]$ tar cvpzf /backup/db_1_pre_patch_baseline.tar.gz $ORACLE_HOME
An absolutely vital rule: never store these backup archives within the very directory path you are about to patch. If you generate the archive inside $ORACLE_HOME, and later decide to execute an rm -rf command to clear a corrupted installation, you will obliterate your only backup simultaneously. Always route these critical archives to a completely isolated mount point or external storage volume to guarantee their survival.
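The rule is easy to honor mechanically: archive into a separate location, then prove the archive is readable before touching anything. The sketch below uses throwaway directories in place of $GRID_HOME and a /backup volume.

```shell
#!/bin/sh
# Sketch: create the baseline archive outside the source tree and
# verify it lists cleanly before the patch window opens.
src=$(mktemp -d)    # stands in for $GRID_HOME
dest=$(mktemp -d)   # stands in for an isolated /backup mount
mkdir -p "$src/bin"
echo "demo binary" > "$src/bin/oracle"

# -C keeps paths relative, so the restore point is chosen at extract time.
tar czpf "$dest/grid_pre_patch_baseline.tar.gz" -C "$src" .

# A backup you cannot list is not a backup.
tar tzf "$dest/grid_pre_patch_baseline.tar.gz" > /dev/null \
    && echo "backup verified: $dest/grid_pre_patch_baseline.tar.gz"
```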
6. Execution Mechanics: The Rolling Patch Sequence
The core advantage of RAC architecture is the ability to execute a rolling patch, maintaining continuous service availability by updating one node at a time. This intricate dance requires a strict sequence of operations: conflict verification, Clusterware shutdown (unlocking binaries), Grid patching, database patching, and finally, Clusterware initialization (relocking binaries).
We initiate the process using the rootcrs.sh script to suspend high availability services and grant the grid user write access to root-owned files. Once the binaries are exposed, the OPatch utility carefully weaves the new libraries into the existing infrastructure.
-- 1. Execute conflict resolution checks (grid user)
[+ASM1:grid@ol7ora12r11]$ $GRID_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /tmp/patch/33583921/33587128
-- 2. Unlock Grid Infrastructure binaries (root user)
[root@ol7ora12r11]$ $GRID_HOME/crs/install/rootcrs.sh -prepatch
-- 3. Apply Grid patches locally (grid user)
[+ASM1:grid@ol7ora12r11]$ $GRID_HOME/OPatch/opatch apply -oh $GRID_HOME -local /tmp/patch/33583921/33587128
-- 4. Prepare and apply database patches (oracle user)
[ora12r11:oracle@ol7ora12r11]$ /tmp/patch/33583921/33587128/custom/scripts/prepatch.sh -dbhome $ORACLE_HOME
[ora12r11:oracle@ol7ora12r11]$ $ORACLE_HOME/OPatch/opatch apply -oh $ORACLE_HOME -local /tmp/patch/33583921/33587128
During the application phase, it is common to encounter an error stating “chmod: changing permissions of …/extjobo: Operation not permitted.” This is a well-documented anomaly (see Doc ID 2265726.1) resulting from immutable file attributes and can be safely ignored. More importantly, the explicit inclusion of the -local flag is non-negotiable. Without this directive, the OPatch engine may attempt unintended SSH propagation to the secondary node, which is actively processing production traffic, potentially inducing a massive system failure.
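The ordering and the abort-on-failure property are the whole game, so it helps to see them in isolation. In the sketch below each real command is replaced by an echo stub of my own; in production the stubs would call opatch, rootcrs.sh, and prepatch.sh exactly as listed above.

```shell
#!/bin/sh
# Sketch of the per-node rolling sequence with stubbed steps.
set -e   # the first failing step aborts this node before the next runs

check_conflicts() { echo "step 1: opatch prereq conflict check"; }
unlock_grid()     { echo "step 2: rootcrs.sh -prepatch (unlock)"; }
patch_grid()      { echo "step 3: opatch apply -local (grid home)"; }
patch_db()        { echo "step 4: opatch apply -local (db home)"; }

for step in check_conflicts unlock_grid patch_grid patch_db; do
    "$step"
done
echo "node complete -- relock with -postpatch, then move to the next node"
```

Because of `set -e`, a failed conflict check never reaches the unlock step, and a failed Grid apply never touches the database home.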
7. Mitigating Post-Patch Hangs: The glogin.sql Bypass
The final step of the patching sequence involves executing the -postpatch script as root to secure file permissions and restart the cluster stack. This automated routine uses SQL*Plus in the background to verify database integration. However, this reliance introduces a subtle, highly destructive vulnerability related to customized SQL environments.
Many enterprise environments modify the global login script (glogin.sql) to inject security banners or custom prompts. The primitive regular-expression parsers within the Oracle automation scripts cannot interpret these visual enhancements. When a script encounters unexpected text, it assumes a fatal error has occurred or simply hangs indefinitely, waiting for a standard prompt that will never arrive.
-- Execute the necessary binary relinking
[root@ol7ora12r11]$ chown root:root $DB_HOME/bin/extjob
[root@ol7ora12r11]$ chmod 4750 $DB_HOME/bin/extjob
[root@ol7ora12r11]$ $GRID_HOME/rdbms/install/rootadd_rdbms.sh
-- The tactical bypass: temporarily neutralize glogin.sql
[+ASM1:grid@ol7ora12r11]$ mv $GRID_HOME/sqlplus/admin/glogin.sql $GRID_HOME/sqlplus/admin/glogin.sql.disabled
-- Execute the cluster restart protocol
[root@ol7ora12r11]$ $GRID_HOME/crs/install/rootcrs.sh -postpatch
-- Restore the environment profile
[+ASM1:grid@ol7ora12r11]$ mv $GRID_HOME/sqlplus/admin/glogin.sql.disabled $GRID_HOME/sqlplus/admin/glogin.sql
This specific interaction is rarely mentioned in official patch notes, but it is a frequent source of stalled installations. By temporarily renaming the glogin.sql file, we force the SQL*Plus session to load a completely sterile, vanilla environment. This guarantees that the automation scripts receive the exact standard outputs they require to successfully complete the initialization phase, bringing the newly patched node safely back online.
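One refinement worth adding to the bypass: let the shell restore glogin.sql automatically, so a failed -postpatch run cannot strand the renamed file. This sketch demonstrates the trap mechanics against a temporary directory standing in for $GRID_HOME/sqlplus/admin; the prompt content is illustrative.

```shell
#!/bin/sh
# Sketch: glogin.sql bypass with a guaranteed restore via an EXIT trap.
admin=$(mktemp -d)   # stands in for $GRID_HOME/sqlplus/admin
echo "set sqlprompt 'PROD> '" > "$admin/glogin.sql"

# Neutralize the site profile, then promise to put it back on any exit.
mv "$admin/glogin.sql" "$admin/glogin.sql.disabled"
trap 'mv "$admin/glogin.sql.disabled" "$admin/glogin.sql" 2>/dev/null' EXIT

# ... run $GRID_HOME/crs/install/rootcrs.sh -postpatch here ...
echo "postpatch runs against a sterile SQL*Plus environment"
```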
