It is the quintessential nightmare for any database administrator. You are hardening a server or migrating storage mounts, and you intend merely to modify the permissions of one specific diagnostic directory. You type a recursive ownership change, such as chown -R grid:oinstall, but thanks to a typo or an unexpanded shell variable, the command lands directly on the parent GRID_HOME or ORACLE_HOME directory. In that single instant, the entire highly available cluster architecture collapses: the network listeners die, the Automatic Storage Management (ASM) layer becomes unreachable, and the primary database instances crash.
This event is more than a downtime incident; it is a structural catastrophe. Oracle Grid Infrastructure relies on a fragile, highly specific matrix of setuid bits and root-owned binaries that a standard Linux recursive command indiscriminately obliterates. While this article uses Oracle 19c as the primary reference environment, the recovery mechanism detailed here, built around the rootcrs.sh -patch execution mode, is a broadly applicable troubleshooting technique. It can resolve not only accidental permission modifications but also inexplicable node evictions and chronic Oracle High Availability Services (OHASD) startup failures across modern Oracle deployments, sparing you a grueling, multi-hour cluster reinstallation.
- 1. Establishing the Recovery Environment Architecture
- 2. Anatomy of the Catastrophe: Analyzing the Fatal Command
- 3. Diagnosing the Collapse: Interpreting the Error Cascade
- 4. Step 1: Emergency Termination and Unlocking the Grid
- 5. Step 2: Surgical Manual Filesystem Corrections
- 6. Step 3: Deploying the Automated Permission Repair
- 7. Step 4: Resurrecting the Database Home Binaries
- 8. Step 5: Relocking the Cluster and Final Verification
1. Establishing the Recovery Environment Architecture
To demonstrate this recovery procedure accurately, I am using a standardized two-node Real Application Clusters (RAC) architecture. This configuration closely mirrors the mission-critical banking and telecommunications systems I have managed throughout my career.
It is crucial to document the exact specifications of the environment under duress. This procedure was tested on Oracle Linux 7.9 hosting Oracle Database 19c. While the underlying recovery logic is robust, specific file paths may differ slightly depending on your exact Release Update.
Infrastructure specifications:
* Operating system: Oracle Linux Server 7.9
* Database engine: Oracle Database 19c Enterprise Edition
* Release level: Release Update 19.27.0.0.0
* Cluster topology: 2-node configuration
- Node 1: ol7ora19r1 (192.168.0.21)
- Node 2: ol7ora19r2 (192.168.0.22)
I deliberately highlight Release Update (RU) 19.27 because the internal directory structures and the behavior of the automated patch scripts can vary subtly between major releases such as 11g, 12c, and 19c. While the fundamental logic of the rootcrs.sh utility remains remarkably consistent across versions, the precise locations of diagnostic logs and auxiliary tools (such as the Trace File Analyzer) frequently shift. Always verify your specific OPatch inventory configuration before blindly applying recovery commands.
2. Anatomy of the Catastrophe: Analyzing the Fatal Command
The following terminal output simulates the genesis of the administrative error. In a live production environment, this catastrophic failure is typically triggered when an automated shell script relies on an undefined variable: for example, executing chown -R user $LOG_DIR/ when $LOG_DIR is empty, so the path expands to "/" and the command recurses from the root directory.
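A minimal guard against this exact failure mode, assuming a POSIX shell: the ${VAR:?message} expansion aborts the command before chown can ever receive an empty path. The variable name LOG_DIR is illustrative, not taken from any real script.

```shell
#!/bin/sh
# Sketch: stop an unset variable from expanding to "" and sending
# chown recursing from /. ${VAR:?msg} fails the command before chown runs.
# LOG_DIR is deliberately left unset in this demo.
unset LOG_DIR
if ( : "${LOG_DIR:?LOG_DIR is unset - refusing recursive chown}" ) 2>/dev/null
then
    # Only reached when LOG_DIR is set and non-empty.
    echo "would run: chown -R grid:oinstall $LOG_DIR/"
else
    echo "guard tripped: nothing was touched"
fi
```

Adding set -u at the top of maintenance scripts provides the same protection for every expansion in the script, not just the guarded one.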
Why is a simple recursive ownership change so uniformly fatal to an Oracle cluster? The Unix permission model encompasses more than standard user ownership. Critical Oracle binaries, specifically 'oracle', 'extjob', and 'jssu', require the setuid (set user ID) bit. This specialized permission allows the executable to run with the elevated privileges of the file's owner (almost universally root) even when the process is started by a less privileged user such as grid or oracle. Because chown clears the setuid bit on the files it touches, a generic recursive chown strips these bits across the entire tree, instantly leaving the binaries unable to perform the privileged operations they depend on, such as attaching to shared memory or accessing restricted network interfaces.
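Because chown silently discards these bits, it is worth snapshotting the setuid inventory before any maintenance that touches an Oracle home. A small sketch, using a throwaway directory in place of a real GRID_HOME so it is runnable anywhere:

```shell
#!/bin/sh
# Sketch: record every setuid executable under a home directory so the
# state can be diffed after maintenance. DEMO_HOME stands in for a real
# GRID_HOME; the file names mimic the critical Oracle binaries.
DEMO_HOME="$(mktemp -d)"
mkdir -p "$DEMO_HOME/bin"
touch "$DEMO_HOME/bin/oracle" "$DEMO_HOME/bin/extjob" "$DEMO_HOME/bin/helper"
chmod 4755 "$DEMO_HOME/bin/oracle" "$DEMO_HOME/bin/extjob"  # setuid, like the real binaries
chmod 0755 "$DEMO_HOME/bin/helper"                          # ordinary executable
# -perm -4000 matches any file whose mode includes the setuid bit
find "$DEMO_HOME" -perm -4000 -type f | sort > "$DEMO_HOME/setuid_before.txt"
cat "$DEMO_HOME/setuid_before.txt"
```

On a live system, the same find run against the real GRID_HOME before and after a change turns "which binaries lost their bits?" into a one-line diff.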
-- WARNING: destructive simulation - do not execute on production systems
-- Simulating the administrative error across the cluster architecture
-- Node 1: obliterating the GRID_HOME and ORACLE_HOME permissions
[root@ol7ora19r1]$ echo $GRID_HOME
/u01/app/19c/grid
[root@ol7ora19r1]$ chown -R grid:oinstall /u01/app/19c/grid
[root@ol7ora19r1]$ echo $ORACLE_HOME
/u01/app/oracle/product/19c/db_1
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/product/19c/db_1
-- Node 2: replicating the error via an automated deployment script
[root@ol7ora19r2]$ chown -R grid:oinstall /u01/app/19c/grid
[root@ol7ora19r2]$ chown -R oracle:oinstall /u01/app/oracle/product/19c/db_1
3. Diagnosing the Collapse: Interpreting the Error Cascade
In the immediate aftermath of the destructive command, pre-existing database connections may hang in a zombie state, but any attempt to establish a new connection is rejected outright. Administrators are confronted with a cascade of seemingly unrelated errors flooding the diagnostic alert logs and local terminal sessions.
When I first encountered this exact scenario during a live production outage, my immediate instinct was to interrogate the network listener logs. Interestingly, the listener process itself was often still running, yet it was completely incapable of spawning dedicated server processes. The definitive clue resides in the operating system messages and the core binary permissions. If you observe TNS-12547: lost contact errors in tandem with ORA-01012 exceptions, you are almost certainly diagnosing a failure of binary executable capabilities, not a standard networking issue.
-- Symptom 1: a massive flood of alert log connection rejections
Fatal NI connect error 12547, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/u01/app/19c/grid/bin/oracle)...
TNS-12547: TNS:lost contact
TNS-00517: Lost contact
-- Symptom 2: absolute failure of local SQL*Plus connectivity
[ora19r1:oracle@ol7ora19r1]$ sqlplus / as sysdba
Connected.
ERROR:
ORA-01012: not logged on
Process ID: 0
Session ID: 0
Serial number: 0
The ORA-01012 error generated during this crisis is highly misleading. The text explicitly says "not logged on," leading many administrators to frantically troubleshoot user passwords or authentication profiles. The actual root cause is that the oracle binary failed to attach to the System Global Area (SGA) shared memory segment: it could not access the internal process registry because it had been stripped of the root-level privileges required for those low-level memory operations.
4. Step 1: Emergency Termination and Unlocking the Grid
We cannot begin any permission repair while the underlying Grid Infrastructure is still attempting (and failing) to execute its automated startup routines. The Oracle High Availability Services daemon (OHASD) will be trapped in an endless restart loop, so we must intervene and forcefully unlock the installation.
This unlocking procedure immediately terminates the active cluster stack and releases all persistent file locks held against the core binaries. Using the rootcrs.sh -unlock directive is far safer than attempting a standard crsctl stop crs -f in this degraded state. The unlock utility intentionally bypasses several higher-level validation checks that would otherwise hang indefinitely because of the shattered permission model, interfacing directly with the lowest-level initialization scripts.
-- Execute emergency unlock on node 1
[root@ol7ora19r1]$ cd $GRID_HOME/crs/install
[root@ol7ora19r1]$ ./rootcrs.sh -unlock
-- Expected successful output excerpt
Using configuration parameter file: /u01/app/19c/grid/crs/install/crsconfig_params
...
2025/08/10 13:39:56 CLSRSC-347: Successfully unlock /u01/app/19c/grid
-- Critical verification: ensure cluster processes are completely dead
[root@ol7ora19r1]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
-- Repeat the emergency unlock procedure on node 2
[root@ol7ora19r2]$ ./rootcrs.sh -unlock
5. Step 2: Surgical Manual Filesystem Corrections
Before deploying the automated Oracle repair utilities, we must manually correct the baseline ownership of the primary parent directories and of a few highly sensitive socket pathways. Automated recovery tools frequently assume that the high-level parent directories (such as /u01/app/19c/grid) still carry their correct foundational ownership.
The official Oracle Support documentation frequently neglects to emphasize the necessity of cleaning the crsdata directory structure. If you fail to recreate the crsdata/output directory, subsequent cluster restarts may inexplicably fail: residual socket files, generated by the Clusterware while it was running in a broken, unprivileged state, retain corrupted ownership profiles that conflict with the newly restored binaries.
-- Execute these surgical corrections on both cluster nodes
-- 1. Critical: isolate bad sockets by renaming the crsdata output directory
[root@ol7ora19r1]$ mv /u01/app/oracle/crsdata/ol7ora19r1/output /u01/app/oracle/crsdata/ol7ora19r1/output.corrupt_bak
[root@ol7ora19r1]$ mkdir -p /u01/app/oracle/crsdata/ol7ora19r1/output
[root@ol7ora19r1]$ chown grid:oinstall /u01/app/oracle/crsdata/ol7ora19r1/output
[root@ol7ora19r1]$ chmod 755 /u01/app/oracle/crsdata/ol7ora19r1/output
-- 2. Reset the absolute baseline GRID_HOME ownership to root
[root@ol7ora19r1]$ chown -R root:oinstall /u01/app/19c/grid
-- 3. Reset the primary ORACLE_HOME ownership to the database user
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/product/19c/db_1
-- 4. Secure the overarching ORACLE_BASE directory
[root@ol7ora19r1]$ chown -R grid:oinstall /u01/app/oracle
-- 5. Apply granular ownership to diagnostic and administrative pathways
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/product
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/checkpoints
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/diag/rdbms
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/admin
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/cfgtoollogs/dbca
[root@ol7ora19r1]$ chown -R oracle:oinstall /u01/app/oracle/cfgtoollogs/netca
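These resets are easy to mistype under pressure. A hedged sketch (the reset_owner helper and the demo paths are mine, not an Oracle tool) that refuses to touch a path that does not exist, so a typo fails loudly instead of silently recursing somewhere unintended:

```shell
#!/bin/sh
# Sketch: apply ownership resets through a guard that verifies the
# directory exists before recursing. reset_owner is a hypothetical
# helper written for this article, not part of any Oracle toolset.
reset_owner() {
    dir="$1"; owner="$2"
    [ -d "$dir" ] || { echo "refusing: $dir does not exist" >&2; return 1; }
    chown -R "$owner" "$dir"
}

# Demo against a throwaway tree; a real run would list the homes from
# the article, e.g. reset_owner /u01/app/19c/grid root:oinstall.
DEMO="$(mktemp -d)"
mkdir -p "$DEMO/diag/rdbms"
reset_owner "$DEMO/diag/rdbms" "$(id -un):$(id -gn)" && echo "reset ok"
reset_owner "$DEMO/diag/rdbsm" "$(id -un):$(id -gn)" 2>/dev/null || echo "typo caught"
```

The second call deliberately misspells the path ("rdbsm") to show the guard tripping.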
6. Step 3: Deploying the Automated Permission Repair
We have arrived at the most technically sophisticated maneuver in this recovery guide: invoking the rootcrs.sh utility with the -patch execution flag. This is the "magic bullet."
Why use the patch flag when we are not applying new software? When an administrator formally applies a Release Update to the Grid Infrastructure, the patching scripts must meticulously ensure that all permission models, including those elusive setuid bits, are perfectly aligned for the newly compiled binaries. By invoking this mode on a damaged system, we effectively trick the Oracle automation into recursively traversing the entire GRID_HOME: it reads the internal Oracle inventory definitions and reapplies the correct, highly complex permission masks across thousands of files, overriding our earlier destructive errors.
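Conceptually, what -patch does for permissions is replay a recorded description of what each file's mode should be. A toy illustration of that idea follows; the manifest format is invented for this sketch, while the real mechanism reads Oracle's internal inventory:

```shell
#!/bin/sh
# Sketch: repair "damaged" modes by replaying a recorded manifest, the
# same idea rootcrs.sh -patch applies using Oracle's own inventory.
DEMO="$(mktemp -d)"
touch "$DEMO/oracle" "$DEMO/readme.txt"
chmod 0644 "$DEMO/oracle"          # damaged: setuid and execute bits gone

# Invented manifest format: "name mode"
cat > "$DEMO/manifest" <<'EOF'
oracle 4751
readme.txt 0644
EOF

# Replay the recorded modes over the damaged tree
while read -r name mode; do
    chmod "$mode" "$DEMO/$name"
done < "$DEMO/manifest"

ls -l "$DEMO/oracle"   # mode string now begins -rwsr-x--x
```

The point of the sketch is the shape of the operation: the authoritative record of "correct" lives outside the damaged tree, which is why the repair survives arbitrarily bad on-disk state.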
-- Execute the automated repair on node 1, wait for completion, then execute on node 2
[root@ol7ora19r1]$ cd /u01/app/19c/grid/crs/install
[root@ol7ora19r1]$ ./rootcrs.sh -patch
-- Expected successful output stream
Using configuration parameter file: /u01/app/19c/grid/crs/install/crsconfig_params
...
2025/08/10 13:58:34 CLSRSC-4015: Performing install or patch action for Oracle Trace File Analyzer (TFA) Collector.
2025/08/10 13:58:36 CLSRSC-671: Post-patch steps for patching GI home /u01/app/19c/grid succeeded.
Administrators frequently ask: "Why not simply re-execute the standard root.sh script?" The reasoning is critical. The root.sh script is designed strictly for initial, pristine installations. Re-executing it on an already configured cluster will overwrite foundational configuration files (such as the ocr.loc pointers or the Grid Plug and Play profiles), or it will simply fail because it detects that the cluster is already defined. The rootcrs.sh -patch command, by contrast, is idempotent with respect to configuration but aggressive with respect to permissions. It is precisely the right tool for this disaster.
7. Step 4: Resurrecting the Database Home Binaries
At this stage the Grid Infrastructure home has been rehabilitated, but the relational database management system (RDBMS) home remains compromised. The rootcrs.sh utility confines its operations to the Clusterware layer and entirely ignores the database engine paths, so we must run a separate manual procedure to restore the permissions of the core database executables.
We run a precise sequence of three independent root scripts. This orchestrated progression guarantees that the overarching Oracle inventory permissions are corrected via orainstRoot.sh, and that the database engine executables subsequently have their critical root-level privileges (the setuid bits) formally restored via the database-specific root.sh script.
-- 1. Prepare the environment via rootadd_rdbms.sh
[root@ol7ora19r1]$ cd /u01/app/oracle/product/19c/db_1/rdbms/install
[root@ol7ora19r1]$ ./rootadd_rdbms.sh
-- 2. Repair the overarching Oracle inventory permissions
[root@ol7ora19r1]$ cd /u01/app/oraInventory
[root@ol7ora19r1]$ ./orainstRoot.sh
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.
-- 3. Execute the heavy lifting for the RDBMS engine
[root@ol7ora19r1]$ cd /u01/app/oracle/product/19c/db_1
[root@ol7ora19r1]$ ./root.sh
Performing root user operation.
The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u01/app/oracle/product/19c/db_1
...
Finished running generic part of root script.
Before attempting to restart any services, demand visual confirmation that the permissions are physically repaired. The definitive indicator of success is the presence of the 's' (setuid) bit in the permission string of the core executables. If you run ls -al against the 'oracle' and 'extjob' binaries and the 's' is missing, the recovery has failed and the database will not open.
-- Verify the presence of the 'rws' (read, write, setuid) permission blocks
[root@ol7ora19r1]$ ls -al $GRID_HOME/bin/extjob
-rwsr-x---. 1 root oinstall 3035304 Aug 10 14:03 /u01/app/19c/grid/bin/extjob
[root@ol7ora19r1]$ ls -al $ORACLE_HOME/bin/oracle
-rwsr-s--x. 1 oracle oinstall 472659160 Aug 10 14:06 .../bin/oracle
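The eyeball check can also be scripted. A small sketch using the shell's -u file test, which is true only when the setuid bit is set; throwaway demo files stand in for the real binaries so the sketch runs without an Oracle installation:

```shell
#!/bin/sh
# Sketch: scripted version of the setuid eyeball check. [ -u FILE ] is
# true only when the setuid bit is present. DEMO files stand in for
# $GRID_HOME/bin/extjob etc. so this runs anywhere.
DEMO="$(mktemp -d)"
touch "$DEMO/extjob" "$DEMO/oracle"
chmod 4750 "$DEMO/extjob"   # healthy: setuid present
chmod 0755 "$DEMO/oracle"   # damaged: setuid stripped

for bin in "$DEMO/extjob" "$DEMO/oracle"; do
    if [ -u "$bin" ]; then
        echo "OK:      $bin"
    else
        echo "MISSING: $bin   (recovery incomplete)"
    fi
done
```

On the real cluster, the loop would iterate over $GRID_HOME/bin/oracle, $GRID_HOME/bin/extjob, $GRID_HOME/bin/jssu, and $ORACLE_HOME/bin/oracle; any MISSING line means stop and rerun the repair before attempting a restart.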
8. Step 5: Relocking the Cluster and Final Verification
The surgical procedures are complete. The final administrative phase is to re-lock the Grid home, thereby re-enabling its strict security boundaries and file protections, and then formally command the cluster services to boot.
If our manual interventions and automated patching routines were executed correctly, the Oracle High Availability Services will initialize, mount the underlying ASM diskgroups, and automatically bring up the primary database instances across the cluster.
-- 1. Permanently lock the Grid Infrastructure home
[root@ol7ora19r1]$ cd /u01/app/19c/grid/crs/install
[root@ol7ora19r1]$ ./rootcrs.sh -lock
...
2025/08/10 14:10:04 CLSRSC-329: Replacing Clusterware entries in file 'oracle-ohasd.service'
-- 2. Command the Clusterware to boot
[root@ol7ora19r1]$ crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
-- 3. Verify absolute cluster health (allow 5 minutes for full stabilization)
[root@ol7ora19r1]$ crsctl stat res -t
...
ora.asm
      ONLINE  ONLINE       ol7ora19r1               Started,STABLE
ora.ptdb.db
      ONLINE  ONLINE       ol7ora19r1               Open,HOME=...
Recovering an entire cluster from a catastrophic recursive permission change is an incredibly stressful endeavor, but it imparts a deep understanding of the interdependencies between the Linux privilege model and Oracle's core binaries. This precise sequence (emergency unlock, targeted manual intervention, automated repair via the patch directive) saved this environment from an agonizing eight-hour total re-imaging process. The ultimate administrative takeaway: always verify your variable expansions in automated shell scripts, and wherever architecturally possible, restrict direct root logins in favor of targeted sudo access to minimize the blast radius of human error.
