Just making it up as I go along...: Physical Standby Database with Oracle Standard Edition - an alternative approach

Introduction

Oracle Data Guard is a great product with features which, when configured correctly, allow full replication of Oracle databases, with zero data loss (via 'Real Time Apply' of redo), and almost zero loss of availability (through Fast Start Failover and Client Fast Connection Failover). Combined with Oracle Real Application Clusters (RAC) and Oracle Flashback Technologies, it is one of the major components in building a Maximum Availability Architecture (MAA). Additional features such as Active Data Guard give further return on investment from Business Continuity infrastructure.

However, all that comes at a price. Data Guard and Flashback are only available with Oracle Enterprise Edition (EE). Active Data Guard is an additional license, as is RAC for Enterprise Edition. Some businesses are happy to pay the premium, and will be able to justify the additional cost. Not everyone needs, or can afford, an Enterprise Edition license, and make do with Oracle Standard Edition (SE).

So if you are using Oracle SE, and still want to maintain a physical standby database as part of your Business Continuity solution, what do you do? There are commercial solutions such as Dbvisit Standby, that offer added ease-of-use features, but once again they come at a cost, although not as much as an Oracle EE license. I would also suggest every Oracle DBA who has encountered this scenario has implemented a home-grown solution, either built from scratch, or modified from various solutions published on the internet over the years. It is typically a mixture of scripts - shell, perl, SQL, etc - whatever the particular DBA was proficient in, run via cron at various intervals. They are generally site specific, not very robust and have a lot of hard-coding of path names, passwords and the like.

Approach

In approaching a problem like this, and in designing and implementing Oracle infrastructure in general, I like to fall back to a philosophy I have adapted from that developed for creating 'low maintenance' gardens by horticulturist Don Burke. Don's approach was to use a smart selection of plants and landscaping that would look after itself, with minimal maintenance, leaving the gardener to simply enjoy the result. He called himself 'The Lazy Gardener'. Likewise, my approach is 'The Lazy DBA'. By that I mean the DBA should use simple and robust tools and processes, as well as taking advantage of all the technology available, to build infrastructure that can (more-or-less) run itself, requiring minimal input on the part of the DBA, leaving him or her to more interesting tasks, such as planning for the next big project.

So my criteria in building a solution to this particular issue included:

event-driven rather than continuous polling
use existing infrastructure or Oracle technology
work with RAC and single instance
any code developed should run with zero (or minimal) modification for local environment
new technology should be lightweight and minimally intrusive on existing infrastructure
should make use of the existing environment for all information
no passwords stored anywhere
minimal development effort

I wanted to have a cross-platform solution, but the more I looked into it, the more unlikely that seemed without having to build something from scratch, something I wanted to avoid. Given most of my experience is in the Linux/Unix arena I decided to stick with Linux as a first implementation, and look at cross-platform later.

Lets breakdown what needs to happen to update the standby database. The basic process is:

A log switch occurs in the primary database
The previous log is archived
The system detects the existence of the new archived log
The archived log is transferred to the standby system
The standby system detects the existence of the new archived log
The new archived log is registered with the standby database
The standby database performs recovery with the new archived log

The process with gap resolution is:

The standby database is queried for archive gaps

The primary database is notified of the missing archived logs

the primary database places the missing archived logs into the stream for transfer and application against the standby database

Detecting archive logs

Detecting archived logs written to a file system on Linux is not that big a deal. The 'inotify' infrastructure built into the Linux kernel allows monitoring for changes to files and directories. There are a number of utilities and language extensions that hook into this infrastructure.

If you are using ASM - something I think modern Oracle implementations should be doing whether RAC or single instance - it is slightly more problematic, since you cannot use inotify directly with ASM, and AFAICT there is no way to trigger existance of a new archived log from within the database. The workaround is to simply have a second archived log destination pointing to a local file system. Just as an aside, if you are using RAC with Oracle SE, the license conditions state you must use a recovery area located on ASM as your primary log archive destination. You can also have a log archive destination to a local file system, but you cannot use any other shared (i.e. NFS) or clustered file system (i.e. OCFS2) other than ASM.

Log shipping

The options here are the usual suspects - rsync, scp or sftp. All of these have the advantage of being able to work without requiring passwords once ssh user equivalence between the primary and standby nodes have been configured.

Log application

Once the archived log has been copied to the standby system, 'inotify' takes care of detecting its arrival. It is just then a matter of registering and applying it to the standby database. Standard Oracle database utilities like RMAN and SQLPlus can handle this. Once again, no passwords will be required because, as long as we are the 'oracle' user, we can 'connect / as sysdba'.

Other requirements

Other requirements like making use of the existing infrastructure and getting configuration information from the existing environment are admirable solved by Oracle Grid Infrastructure. If you are using RAC, you will already have GI, and even if you aren't using RAC, I would recommend implementing Oracle Restart, which is just Oracle GI on a stand-alone system. You get ASM, and HA components like auto-start on reboot and auto-restart of failed instances. Oracle GI can also manage third party applications (like ours). All the configuration information we need is already available from the database or from the Oracle Cluster Registry (OCR) so there is no need to create separate configuration files for what we need. Actually, that is not quite true, but the configuration file we need can be built automatically from all the other information we have.

Putting it together

So, what we end up with is an environment built on Oracle Grid Infrastructure and ASM, for both single-instance and RAC databases, some application that will get notifications of new files, rsync to transfer the files, and another app that will act once the new files arrive. The apps will be registered with Oracle GI as resources to be managed.

Initially, I was looking at writing something in shell or python to do all the missing logic at both ends, but that was starting to look like too much hard work. It was then that I found lsyncd.

Lsyncd

This is how lsyncd is described on the project page

"Lsyncd watches a local directory trees event monitor interface (inotify or fsevents). It aggregates and combines events for a few seconds and then spawns one (or more) process(es) to synchronize the changes. By default this is rsync. Lsyncd is thus a light-weight live mirror solution that is comparatively easy to install not requiring new filesystems or blockdevices and does not hamper local filesystem performance.
...

Fine-grained customizaton can be achieved through the config file. Custom action configs can even be written from scratch in cascading layers ranging from shell scripts to code written in the Lua language. This way simplicity can be balanced with powerfulness."

What that means for us is the one application, using a very basic configuration file, will monitor, and act upon, file system events, and transfer files via rsync. Not only that, but it has the ability to perform arbitrary actions on file system events, just by including what needs to happen in the configuration file. So instead of having to write a specific 'detect and act upon file system event' application, we have a generic 'detect and act upon file system event' utility, with the rsync stuff built in, already available. Awesome.

All that is left to do is to work out what needs to go into the configuration file. Then write a script that can generate a specific configuration file based on the specific environment. Which is what I have done.

In Part 2, I describe the implementation of the solution.

Just making it up as I go along...

Saturday, 19 July 2014

Physical Standby Database with Oracle Standard Edition - an alternative approach - Part 1