Wednesday, October 19, 2011

High Availability Clustering with PostgreSQL


Firstly, I should thank my company for giving me an opportunity to work mostly with PostgreSQL HA stuff. I have worked with very good clients who has implemented Clustering with PostgreSQL. So, my article here is to give little idea on how HA clustering will work with PostgreSQL.

PostgreSQL has built-in functionality for High Availability like Warm Standby,Hot Standby and Streaming Replication. But, missing few features like Switchover/Switchback, failover automation, minimal downtime etc., which are mostly demanded by the companies. Postgres community member's are working on the demands aggressively and hope we see very new PostgreSQL soon with all features bundled. For now, let see Clustering with PostgreSQL.


There are many clustering architecture diagram's in brief which I have shared links below, but what I made here is just an overview of it. 

What is High Availablity clustering ? 

High availability clustering (HAC)is a feature which provides redundancy and fault tolerance. Its a number of connected devices processing and providing a service. HAC, involves employing both hardware and software technologies, like Server redundancy(including application failover and server clustering), Storage redundancy (including RAID and I/O multipathing), Network redundancy and Power system redundancy.

It's goal is to ensure this service is always available even in the event of a failure. If one server fail's the other servers will continue processing and take on the processing load of the failed server. HA cluster implementation attempt to use redundancy of cluster components to eliminate single points of failure.

Currently Available HA Products

There are many competitive high-availability software products in the market today; deciding which one to purchase can be tough. Following are the list of features you need to look in any HAC product.
  • Clustering capability ( How many servers can be clustered together?)
  • Load-balancing capability 
  • Intelligent monitoring 
  • Centralized management capability
  • Application monitoring
  • Cost (Most importantly though :) )
  • Customer support (Most of the products do this)
I have seen two of them, one is RedHat Cluster Suite (which is commonly used HA package for Linux operating system) and another is Steeleye-LifeKeeper.


Who needs ?

High Availability Clusters are often used by websites serving 24x7x365 not affording any downtime Eg: Amazon.com,Music websites, Customer Service sites etc., or Companies with  Critical Databases.

How it works ?

You need minimum two nodes to start with HA. HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster. In any serious condition, any of the cluster goes down then other node attempts to start services and provides the same service. 

Types of HAC :
  • Active/Passive: In this mode, one node is active (i.e., Primary) and processing service, while other node will be in passive mode meaning its a standby and will only become active if the primary node fails.
  • Active/Active : In this mode, both nodes are active and traffic is load balanced between both nodes and processing service. If one node fails, the other node will take the full processing load, until the failed node becomes active again.
Note: Active/Active mode is not supported with PostgreSQL.


Heartbeat:

A heartbeat is a sensing mechanism which sends a signal across to the primary node, and if the primary node stops responding to the heartbeat for a predefined amount of time, then a failover occurs automatically.

Failover Automation:

Automatic failover is the process of moving active services from the primary node to the standby node when the primary node fails. Usually the standby node continues its services until the primary node has come back up and running. When a device fails another device takes over this process which is referred to as a failover. 

Failover automation is usually implemented on hardware firewalls over networks. You need to configure firewalls on Primary to take over Standby node in case of primary firewall fails. 

HAC support with PostgreSQL

Currently, RHCS or LifeKeeper supports Active/Passive clustering with PostgreSQL. There is no Active/Active support for PostgreSQL yet. As I said, PostgreSQL has no built-in functionality of Failover Automation including third party replication tools like Slony-I, Londiste, etc.. To achieve this you may need to trick with OS level Scripting or take the help of Clustering. 

Below link will help you to understand more about PostgreSQL Clustering with RHCS by Devrim Gunduz(Postgres Community Member).


Setup service from EnterpriseDB on RHCS:


Do post your comments..

--Raghav




Replication in PostgreSQL 9.0


Word "Replication" means a process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

Replication is very interesting subject in any databases. In database competition world, PostgreSQL has its own uniqueness in RDBMS Open source for High availability. Latest PostgreSQL 9.1 has in-built support of Synchronous and Asynchronous replication.  In-built Asynchronous replications are Warm Standby, Hot Standby and Streaming Replication and with third party tools Slony,Londiste,Mammoth etc. Below chart will help you to understand about available Synchronous and Asynchronouse replication.


WAL Shipping (Hot Standby and Warm Standby): 

PostgreSQL has the ability to ship WAL's to another Server i.e, Standby. The Standby server will be running in recovery mode with the pg_standby utility applying the WAL's. Primary Server generates archives (a copy of WAL, usually 16 MB file) and sends them to multiple slaves, later it will be applied by pg_standby utility. 
  • Warm Standby: Primary generates archives and feed them to Slave. Its a WAL Shipping to slave. Slave will be in continous recovery and not accessible for reads.
  • Hot Standby: Hot Standby is the name for the capability to run queries on a database that is currently performing archive recovery. In Hot Standby slaves can be used for read-only access.
Disadvantages:
  • Slave applie's WAL's periodically not continously, means only completed XLOG's will be available to slave as WAL archives and those will be applied. So, lag will be the unfilled or uncompleted WAL which has not generated archive. Data loss will be minimum of one WAL(16 MB).

Trigger Based Replication :

In trigger based replication, tools like Slony,Londiste, Mammoth uses ON INSERT, ON UPDATE, ON DELETE triggers on tables to maintain replication between Master and Slave. Slave will hold consistent Snapshots. 

Streaming Replication :

Its also called as Binary replication. PostgreSQL, XLOG's records generated at primary will be shipped to Standby via network.  Lag in streaming replication is very minimum like single transaction depending on Network Speed and Hot Standby Settings. Multiple Slave can be configured. Streaming replication comes with additional process 'WAL SENDER' at Primary and 'WAL RECEIVER' at Standby.

Advantages:
  • On Primary Crash, standby can be recovered in very less time.
  • Standby can be opened and it will be in READ ONLY mode.
  • It can be used for Reporting Server.
  • Load balancing can be configured using pgpool-II between Primary and Standby.
Disadvantages:
  • Standby Server should hold same amount of Memory/Disk/CPU etc., because, in case of Primary crashes the Slave acts as Primary.
  • Minimal Lag i.e. (one transaction behind Primary)
Slony Replication:

Slony is a asynchronous trigger-based replication. Its a single master to multiple slave replication system for PostgreSQL. Every table or sequence on Master will be replicated via remote triggers to Slave. Updates are committed to one database and are applied to Slave later as EVENTs. Using Slony Switchover and switchback is possible.

Limitations of Slony-I
  • Tables must have a primary key or a unique.
  • Only Tables and sequeces are allowed for replication.
  • Slave databases cannot be modified.
Advantages:
  • Slony-I supports switchback.
  • Using Slony-I, we can upgrade PG from one version to another version without any downtime.
Disadvantages:
  • Slony cannot detect the network failuer, hence causing all the EVENT's created at primary will be queued and are released once Network catch ups.
  • NO DDL changes allowed on the replication Tables while Slony Daemons running. 

Do post your comments, they will be highly appreciated. 

--Raghav