
Alfresco 2.1

Backup and High Availability Guide


Backup and High Availability Guide - i

Copyright (c) 2007 by Alfresco and others.

Information in this document is subject to change without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Alfresco. The trademarks, service marks, logos or other intellectual property rights of Alfresco and others used in this documentation ("Trademarks") are the property of Alfresco and their respective owners. The furnishing of this document does not give you license to these patents, trademarks, copyrights or other intellectual property except as expressly provided in any written agreement from Alfresco.

The United States export control laws and regulations, including the Export Administration Regulations of the U.S. Department of Commerce, and other applicable laws and regulations apply to this documentation, which prohibit the export or re-export of content, products, services, and technology to certain countries and persons. You agree to comply with all export laws, regulations and restrictions of the United States and any foreign agency or authority and assume sole responsibility for any such unauthorized exportation.

If you need technical support for this product, contact Customer Support by email at [email protected]. If you have comments or suggestions about this documentation, contact us at [email protected].

This edition applies to version 2.1 of the licensed program.



Contents

INTRODUCTION
    DATA REQUIRING BACKUP
        Static Data
        Dynamic Data
DOCUMENT LIFE CYCLE EXPLAINED
    DOCUMENT CREATION
BACKUP AND RECOVERY WINDOW
BACKUP OPTIONS
    COLD BACKUP
    HOT BACKUP
    WARM STANDBY
        Warm standby setup procedure using Oracle replication/clusters
        Transaction walkthrough
        Warm standby setup procedure using archive logs
    HOT STANDBY
        Transaction walkthrough
    HIGH AVAILABILITY
        Transaction walkthrough
SOFT DELETE
RESTORING INDIVIDUAL CONTENT FILES
SOLUTION COMPARISON





Introduction

This document describes the options for configuring backup and High Availability solutions with your Alfresco server. The list of configurations is not exhaustive and there may be more than one way of achieving any given server configuration, particularly when using 3rd party solutions.

The document assumes the reader is familiar with the Alfresco extension configuration framework and, where they are employed, with Oracle's Replication and Clustering solutions.

Data Requiring Backup

The first consideration is what data needs to be backed up. Any Alfresco repository can be considered as having two types of data: static and dynamic. Static data includes software components that do not change through normal use of the repository (for example, when content is uploaded or created). Conversely, dynamic data changes as a direct result of using Alfresco.

Static Data

• General Operating System

• Application Server Install (Tomcat, JBoss...)

• Database Install

• Alfresco Extensions

• Supporting Applications (OpenOffice, ImageMagick)

Static data can be rebuilt using the original distributions for the software components. However, it may be desirable to back up static data to save time if the system needs to be rebuilt. Installation-specific configuration files, such as any Alfresco extensions, should always be backed up whenever they are updated.

Dynamic Data

Database

• RDBMS data files

• Tablespaces/Archive Logs/Control Files

Alfresco Content Stores

Alfresco Indexes (Note: these can be rebuilt from the RDBMS data and content stores, although this may be time-consuming for large repositories.)

A backup strategy therefore requires backup of both the content store(s) (file systems) and the RDBMS data.



Document life cycle explained

It is important to understand the Alfresco document life cycle in order to manage disk space and backups. Alfresco stores its content on the file system. The content of a file uploaded to Alfresco is kept intact, but the file is renamed with a unique name followed by a “.bin” extension. A reference to the content and the metadata property values are stored in the database. A content file (in most cases, one uploaded file) is either referenced from the database or not; a file that is not referenced is called an “orphan”.

Document creation

When a document is created in Alfresco, its content file is renamed and stored under the location designated by the property “dir.contentstore”. A reference to that location, together with the document's metadata, is stored in the database.

Document put in the bin

When a user transfers a document to the bin, the content is still referenced by the database and the operation is reversible. Internally the deleted nodes are transferred to a store called “archive://SpaceStore”. At that point, the document content on the file system is still referenced by the database and still occupying space on the file system.

Bin emptied

If the bin is emptied, the content becomes orphaned (no longer referenced by the database). The content is kept in place for a retention period.

Orphan files deletion

After the retention period expires (7 days by default), the default behavior of Alfresco is to move the content to a content store located under “dir.contentstore.deleted” on the file system. Content under that location can be cleaned up at will to recover disk space. The retention period is a property (ProtectDays) of the bean called “ContentStoreCleaner”.
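The life cycle above can be summarised as a small state function. This is a minimal Python sketch, not Alfresco code: the ContentFile class, its orphaned_at field and the state names are hypothetical, and PROTECT_DAYS simply mirrors the 7-day default described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Mirrors the documented 7-day default (ProtectDays); this is an assumption
# for illustration, so check your own ContentStoreCleaner configuration.
PROTECT_DAYS = 7


@dataclass
class ContentFile:
    """Hypothetical model of one .bin file under dir.contentstore."""
    url: str
    referenced: bool                        # still referenced from the database?
    orphaned_at: Optional[datetime] = None  # when the reference was removed


def lifecycle_state(f: ContentFile, now: datetime,
                    protect_days: int = PROTECT_DAYS) -> str:
    """Classify a content file according to the life cycle described above."""
    if f.referenced:
        # Live document, or document in the bin (archive://SpaceStore).
        return "referenced"
    if f.orphaned_at is None or now - f.orphaned_at < timedelta(days=protect_days):
        # Orphan still inside its retention period (or orphan time unknown):
        # the file is left in place.
        return "orphan-retained"
    # Retention period expired: eligible to move to dir.contentstore.deleted.
    return "eligible-for-deleted-store"
```

In practice the move happens only when the cleaner job runs (see Soft delete), so an orphan can stay in dir.contentstore somewhat longer than the retention period.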



Backup and Recovery Window

If practical, by far the simplest backup approach is to shut down the Alfresco server and the RDBMS and then back up the relevant dynamic data files. This ensures that no files can change during the backup procedure. However, if no backup window is available, alternative approaches are required.

Another consideration that is critical to the configuration of a backup strategy is the acceptable system downtime (if any) whilst the system is being recovered. A recovery window of several hours (or even days if over a weekend for example) as opposed to minutes requires a very different approach to restore.

The following provides descriptions of several possible approaches and the pros and cons of each.



Backup Options

Cold backup

This is the simplest form of backup, as there is no need to deal with updates while the backup is taking place.

To perform a cold backup:

1. Shut down Tomcat and Oracle.

2. Back up the file systems containing the dynamic data described previously using operating system utilities or a third party application.

To recover:

• Restore the file systems from the backup data set(s)

The main disadvantage of this approach is that users cannot use the application whilst backup and recovery procedures are taking place.

Hot backup

Hot backup requires that the system remain available whilst the backups are taking place.

As described previously, documents are stored in the file system and metadata is stored in the database. We must therefore ensure that the database and file system backups are synchronized; otherwise the repository will be corrupt, for example:

Scenario 1

1. Object Data exists in RDBMS with reference to file

2. Referenced file on file system does not exist

Outcome: Repository is corrupt.

Scenario 2

1. File exists on the file system

2. No reference to file from object defined in the database

Outcome: Repository is not corrupt.

Therefore, our backup strategy must ensure that the RDBMS backup is never ahead of the file system backup.
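The two scenarios can be expressed as a consistency check between a database backup and a file system backup. In this minimal sketch, db_urls (content URLs referenced by the database) and fs_urls (files found in the content store backup) are hypothetical inputs; how they are extracted is not shown.

```python
from typing import Set, Tuple


def check_backup_pair(db_urls: Set[str],
                      fs_urls: Set[str]) -> Tuple[bool, Set[str], Set[str]]:
    """Compare content URLs referenced by a database backup with the
    content files present in a file system backup."""
    missing = db_urls - fs_urls   # Scenario 1: referenced file is absent
    orphans = fs_urls - db_urls   # Scenario 2: file with no reference
    consistent = not missing      # only Scenario 1 corrupts the repository
    return consistent, missing, orphans
```

A file system backup taken after the RDBMS backup can only add orphans, which is why the RDBMS backup must never be ahead of the file system backup.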

To run a hot backup

The RDBMS must run in a mode that allows hot backup, such as Archivelog mode for Oracle.

The database backup can use RDBMS vendor tools (or 3rd party if preferred) to perform the hot backup of the database.

1. Back up RDBMS

2. Back up file system after RDBMS backup is complete

Note Backup is valid from the point at which the RDBMS backup was completed.
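The ordering rule can be sketched as a check on backup timestamps. The function below is hypothetical and only encodes the rule stated above: the file system backup must not start before the RDBMS backup has completed, and the pair is valid from the RDBMS completion time.

```python
from datetime import datetime


def recovery_point(db_backup_end: datetime, fs_backup_start: datetime) -> datetime:
    """Return the point in time from which a hot backup pair is valid.

    Raises ValueError if the file system backup started before the RDBMS
    backup completed, since the database copy could then reference files
    that the file system copy does not contain.
    """
    if fs_backup_start < db_backup_end:
        raise ValueError("file system backup started before RDBMS backup completed")
    # The backup is valid from the point the RDBMS backup completed.
    return db_backup_end
```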



To recover

The restore procedure depends on what data has been lost.

To recover from a RDBMS crash:

1. Recover the RDBMS from previous backup

2. Apply archive & redo logs

Data Loss: None (except last transaction)

Note The file system may include content that does not exist in the RDBMS. This is not an issue, because it is the integrity of the RDBMS that dictates the integrity of the repository.

To recover from a file system failure

1. Recover the file system from previous backups

2. Roll back the RDBMS to the point of the last file system backup

Data Loss: Everything since last file system backup

The disadvantage of this approach is the potential for data loss, which grows with the interval between file system backups. To avoid this, disk mirroring or RAID is recommended for file storage, so that no data is lost if a file system disk fails.

In addition, Alfresco organises its file system storage directories based on the date and time the content is created or modified in the repository. Every content file also has a unique name.

The following illustration is an example storage file structure.



This makes it very easy to identify when content was created or updated, and the file system backup strategy can take advantage of this. For example, the file system backup procedure can be run daily, backing up all content created that day based on the predictable file structure. If required, once the backup is complete and verified, the content can be removed from the mirror to save storage. A backup and restore procedure using this approach could be:
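Because the directory structure is predictable, a daily job only needs to walk one directory. This sketch assumes the layout <content root>/<year>/<month>/<day>/<hour>/<minute>/<unique name>.bin with unpadded month and day numbers; confirm the layout against your own content store before relying on it. The function names are hypothetical.

```python
import os
from datetime import date
from typing import List


def daily_content_dir(content_root: str, day: date) -> str:
    """Directory holding content written on `day`, assuming an
    unpadded <root>/<year>/<month>/<day>/... layout."""
    return os.path.join(content_root, str(day.year), str(day.month), str(day.day))


def files_to_back_up(content_root: str, day: date) -> List[str]:
    """List every content file created or updated on `day`."""
    base = daily_content_dir(content_root, day)
    found = []
    # os.walk yields nothing if `base` does not exist (no content that day).
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```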

Back up using Mirror

1. Back up RDBMS.

2. Back up the mirror (Only new/updated content is mirrored to secondary disk).

3. Verify the backup.

4. Remove Mirrored content (optional).

Restore from file system failure

1. Recover the file system from previous backups

2. Copy any content created since the last file system backup from the mirror to the master disk

3. Roll back the RDBMS to the point of the last file system backup (only required when mirroring is not being used)

Data Loss: None (Assuming transactional file system mirroring based on Alfresco Replicating Content Store or 3rd party solution)

Note If the backup is performed from a mirror, user performance should not be affected during the backup read operations.



Warm standby

Warm standby makes use of a standby mirror platform which, in addition to acting as the platform against which backups are performed, can also be used as a standby server should the production environment fail. It therefore requires a platform similar, but not necessarily identical, to the main server. For example, the standby could use much lower cost disk storage than the main production server. The following illustration shows an example production-standby configuration.

Backups can be performed against the standby server using the backup techniques described previously without impacting system performance for the users.

On failure, users are redirected to the standby server. Once the production server issues have been addressed it can be brought online in standby mode, where it will automatically catch up with the standby server and can ultimately be returned to production status.

A warm standby server is ready to run after a few minor configuration changes.

Depending on the details of the configuration, the downtime for this configuration is typically less than 5 minutes.

This configuration uses Alfresco Replicating Content Stores to ensure that the standby server has its own copy of the content files, and either Oracle Clustering or Replication to maintain an up-to-date copy of the database.



Replicating Content Stores provide replication of content (both inbound and outbound) and simultaneous access to multiple content stores. In this case, a replicating store automatically creates a copy of the content files in the standby server's content store as part of each update transaction. This is achieved by configuring content replication to be ‘outbound’, which means the content is ‘pushed’ to the remote store, and synchronous, which means the replication is part of the transaction. See Figure 4 – Configuring Replicating Content Stores for an example of how this is defined in the Alfresco configuration file.

Warm standby setup procedure using Oracle replication/clusters

The following describes the general setup procedure to configure a warm standby server using Oracle Replication or Clusters. It assumes the production server has been installed and configured. It is provided for illustration only: there are many ways to set this up, and the specifics depend on your implementation, such as the operating system, database version, disk subsystems and preferred database backup tool.

To configure the base standby environment

1. Set up the same directory paths for the Alfresco and database installations

2. Copy the Alfresco and Oracle installations from your production server to the standby server in the identical locations

Note Alfresco and Oracle could be installed from scratch but care should be taken to ensure the installations are identical.

3. Update any host specific configuration files.

Note For Oracle these are tnsnames.ora and listener.ora. For Alfresco, there are Oracle connection details in custom-repository.properties and host-specific entries in file_servers.xml.

4. Configure the index recovery component to ensure that the Lucene indexes are up to date on the standby server. The Lucene indexes are updated from the L2 cache by the index recovery component. This is scheduled through the Alfresco Quartz scheduler and is turned off by default. Start with <extConfigRoot>/alfresco/extension/index-tracking-context.xml.sample and modify it to run the indexRecoveryComponent every 10 seconds, as shown in the following code sample.

<bean id="indexTrackerTrigger" class="org.alfresco.util.CronTriggerBean">
    <property name="jobDetail">
        <bean class="org.springframework.scheduling.quartz.JobDetailBean">
            <property name="jobClass">
                <value>org.alfresco.repo.node.index.IndexRecoveryJob</value>
            </property>
            <property name="jobDataAsMap">
                <map>
                    <entry key="indexRecoveryComponent">
                        <ref bean="indexTrackerComponent" />
                    </entry>
                </map>
            </property>
        </bean>
    </property>
    <property name="scheduler">
        <ref bean="schedulerFactory" />
    </property>
    <!-- Quartz cron expression: fire at seconds 0, 10, ..., 50 of every minute -->
    <property name="cronExpression">
        <value>0,10,20,30,40,50 * * * * ?</value>
    </property>
</bean>

<!-- Tracks transactions made by other servers and updates the local indexes -->
<bean id="indexTrackerComponent"
      class="org.alfresco.repo.node.index.IndexRemoteTransactionTracker"
      parent="indexRecoveryComponentBase">
    <property name="remoteOnly">
        <value>true</value>
    </property>
</bean>
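The cronExpression above uses Quartz syntax, which differs from Unix cron in having a leading seconds field (seconds, minutes, hours, day-of-month, month, day-of-week, with an optional year). This small sketch labels the fields and lists the seconds at which the trigger fires; it only handles a plain comma-separated seconds list, not ranges or step values, and the helper names are hypothetical.

```python
from typing import Dict, List

# Quartz field order; an optional seventh "year" field is ignored here.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day-of-month", "month", "day-of-week"]


def describe_quartz_cron(expr: str) -> Dict[str, str]:
    """Map each field of a Quartz cron expression to its name.
    '?' means 'no specific value' for day-of-month or day-of-week."""
    return dict(zip(QUARTZ_FIELDS, expr.split()))


def fire_seconds(expr: str) -> List[int]:
    """Seconds of each minute at which the trigger fires, for a
    comma-separated seconds field such as '0,10,20,30,40,50'."""
    return [int(s) for s in expr.split()[0].split(",")]
```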

To configure database updates

• Configure Oracle’s Basic Replication between the production and standby database instances. This can be done via the Oracle Enterprise Manager or the Replication Management API. Alternatively, an Oracle cluster can be used.

Note Using a cluster allows the production database to automatically re-synchronise with the standby server when it is brought back online after a failover.



To configure content replication

• On the production server, configure a synchronous replicating content store between the standby server and the production server. This is configured in a config file that you create in the extensions folder, with a name ending in -context.xml.

• This configuration override must be applied to all servers. See the following code sample where both servers store their content locally in /var/alfresco/content-store. The Shared Backup Store is visible to all servers as /share/alfresco/content-store.

• The write to both stores must be successful for the transaction to commit.

<!-- Local content store on each server -->
<bean id="localDriveContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
    <constructor-arg>
        <value>/var/alfresco/content-store</value>
    </constructor-arg>
</bean>

<!-- Shared backup store, visible to all servers -->
<bean id="networkContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
    <constructor-arg>
        <value>/share/alfresco/content-store</value>
    </constructor-arg>
</bean>

<bean id="fileContentStore" class="org.alfresco.repo.content.replication.ReplicatingContentStore">
    <property name="primaryStore">
        <ref bean="localDriveContentStore" />
    </property>
    <property name="secondaryStores">
        <list>
            <ref bean="networkContentStore" />
        </list>
    </property>
    <property name="inbound">
        <value>true</value> <!-- Pull content from the secondary store -->
    </property>
    <property name="outbound">
        <value>true</value> <!-- Push content to the secondary store -->
    </property>
    <property name="retryingTransactionHelper">
        <ref bean="retryingTransactionHelper"/>
    </property>
</bean>



Startup procedure on failover

1. Restart Standby Database in ‘Normal’ mode.

2. Start Alfresco Tomcat server.

3. Using DNS, redirect the failed server's name to the standby server's IP address.

Transaction walkthrough

The following describes the events that take place at each point during a document upload.

1. Transaction started.

2. Object Created via Hibernate - Hibernate persists object in database.

3. File streamed (stored) in primary content store.

4. File streamed (stored) in standby content store.

5. Local cache (EHCache) updated. Oracle replicates the update to the remote clustered database as part of the transaction's commit process (if using Oracle Cluster).

6. Transaction committed. Oracle queues the transaction to the replica database (if using Oracle basic replication).
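Steps 3 and 4 are what make the replication synchronous: the content must reach both stores before the transaction can commit. The sketch below models that rule only. InMemoryStore and replicated_write are hypothetical stand-ins, not the ReplicatingContentStore implementation.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for a content store (not an Alfresco class)."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail                    # simulate an unavailable store
        self.files: Dict[str, bytes] = {}

    def write(self, url: str, data: bytes) -> None:
        if self.fail:
            raise IOError(f"write to {self.name} failed")
        self.files[url] = data


def replicated_write(url: str, data: bytes, primary: InMemoryStore,
                     secondaries: List[InMemoryStore]) -> bool:
    """Return True if the transaction may commit, False if it must roll back.

    The content is written to the primary store (step 3) and then pushed
    to every secondary store (step 4).
    """
    try:
        primary.write(url, data)
        for store in secondaries:
            store.write(url, data)
    except IOError:
        # Any copy already written is left behind as an orphan; with no
        # database reference it does not corrupt the repository (Scenario 2).
        return False
    return True
```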

Warm standby setup procedure using archive logs

The following describes the general setup procedure to configure a warm standby server using Oracle Archive Logs. Scripts run on a schedule (e.g. every 5 minutes) to apply the archive logs to the standby database. Note that this method has a potential data loss (up to 5 minutes in this example) because the logs are applied on a schedule rather than continuously.

It assumes the production server has been installed and configured. As before, this is provided for illustration as there are many ways to set this up and the specifics will depend on the details of your implementation such as the operating system, database version, disk subsystems, preferred database backup tool etc.

To configure the base standby environment

1. Set up the same directory paths for the Alfresco and database installations

2. Copy the Alfresco and Oracle installations from your production server to the standby server in the identical locations

Note Alfresco and Oracle could be installed from scratch but care should be taken to ensure the installations are identical.

3. Update any host specific configuration files.

Note For Oracle these are tnsnames.ora and listener.ora. For Alfresco, there are Oracle connection details in custom-db-connection.properties and host-specific entries in file_servers.xml.

4. Configure the index recovery component using index-recovery-context.xml. This will ensure the Lucene indexes on the standby server are up to date.



To configure database updates

1. Run Oracle in Archive Log Mode

2. Set archive directory to a directory shared between both servers

3. Configure standby database to run in ‘continuous recovery’ mode

Note On each iteration, the standby database applies the archive logs.

To configure content replication

• On the production server, configure a synchronous replicating content store between the standby server and the production server. This is configured using the replicating-content-services-context.xml config file. See the following code sample as an example.

• Configuring the replication to be synchronous means that replication will be part of the transaction. The write to both stores must be successful for the transaction to commit.

<!-- Local content store on each server -->
<bean id="localDriveContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
    <constructor-arg>
        <value>/var/alfresco/content-store</value>
    </constructor-arg>
</bean>

<!-- Shared backup store, visible to all servers -->
<bean id="networkContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
    <constructor-arg>
        <value>/share/alfresco/content-store</value>
    </constructor-arg>
</bean>

<bean id="fileContentStore" class="org.alfresco.repo.content.replication.ReplicatingContentStore">
    <property name="primaryStore">
        <ref bean="localDriveContentStore" />
    </property>
    <property name="secondaryStores">
        <list>
            <ref bean="networkContentStore" />
        </list>
    </property>
    <property name="inbound">
        <value>true</value> <!-- Pull content from the secondary store -->
    </property>
    <property name="outbound">
        <value>true</value> <!-- Push content to the secondary store -->
    </property>
    <property name="retryingTransactionHelper">
        <ref bean="retryingTransactionHelper"/>
    </property>
</bean>

To configure the backup schedule

Once the basic mechanism has been successfully tested:

1. Create db scripts to run the backup process.

2. Schedule the scripts using cron jobs. Running the scripts every 5 minutes results in a maximum data loss of 5 minutes.
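Each scheduled run essentially finds the archive logs that have not yet been applied and applies them in sequence order. The selection step can be sketched as follows. The file name pattern arch_<thread>_<sequence>.arc is a hypothetical example of an Oracle LOG_ARCHIVE_FORMAT setting, the sketch assumes a single redo thread, and applying the logs to the standby database is not shown.

```python
import re
from typing import List, Tuple

# Hypothetical naming pattern, e.g. arch_1_1234.arc (thread 1, sequence 1234).
# Adjust to match your own LOG_ARCHIVE_FORMAT.
LOG_NAME = re.compile(r"arch_(\d+)_(\d+)\.arc$")


def logs_to_apply(filenames: List[str], last_applied_seq: int) -> List[str]:
    """Return archive logs newer than the last applied sequence, oldest first."""
    pending: List[Tuple[int, str]] = []
    for name in filenames:
        m = LOG_NAME.search(name)
        if m and int(m.group(2)) > last_applied_seq:
            pending.append((int(m.group(2)), name))
    # Logs must be applied strictly in sequence order.
    return [name for _seq, name in sorted(pending)]
```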

Startup procedure on failover

1. Restart Standby Database in ‘Normal’ mode.

2. Start Alfresco Tomcat server.

3. Using DNS, redirect the failed server's name to the standby server's IP address.



Hot standby

Hot and warm standby are very similar. The primary difference is that a hot standby server is always ready to run, so minimal reconfiguration is required to bring it online. This also means that, rather than the standby application server instance being offline to users, hot standby allows read-only or read/write access to the clustered environment. In addition, either server can be brought online or offline for maintenance as required and will automatically catch up with its clustered partner.

Hot standby makes use of EHCache which is configured to be clusterable. The cache operates across transactions and caches entities that have been persisted to the database.

Hot standby setup procedure

The setup procedure for hot standby is the same as for warm standby, except that a clustered cache is required between the Tomcat instances and the database to ensure transaction integrity. Caching is configured using XML configuration.





Transaction walkthrough

1. Transaction started.

2. Object created via Hibernate - Hibernate persists object in database.

3. File streamed (stored) in primary content store.

4. File streamed (stored) in standby content store.

5. Local and remote cache (EHCache) updated. Oracle replicates the update to the remote clustered database as part of the transaction's commit process (if using Oracle Cluster).

6. Transaction committed. Oracle queues the transaction to the replica database (if using Oracle basic replication).



High Availability

The high availability architecture builds on the Hot Standby environment described above to implement a fully clustered environment. The key difference is the use of an active database cluster that replicates updates transactionally between the database instances.

High Availability setup procedure

The setup procedure for High Availability is the same as for hot standby, except that a clustered Oracle database is used to ensure transactional integrity across the databases. This ensures that the databases on the two servers are always in sync. Set up the environment as per the warm standby instructions, then configure the two databases as a master-to-master Oracle Cluster. See Oracle's documentation for details.





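Transaction walkthrough

1. Transaction started.

2. Object created via Hibernate - Hibernate persists object in database.

3. File streamed (stored) in primary content store.

4. File streamed (stored) in standby content store.

5. Local and remote cache (EHCache) updated.

6. Oracle replicates the update to the remote clustered database as part of the transaction's commit process.

7. Transaction committed.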



Soft delete

When content or spaces are deleted, they are actually moved to a deleted items area (similar to the Windows recycle bin). Users can recover items they have deleted from this area, or remove the deleted items to delete them fully. Standard users cannot view or recover objects that have been deleted by other users. Administrators can view, restore and remove objects that have been deleted by any user.

To access this option, go to User Options -> Manage Deleted Items.

The ability for users to remove objects from the deleted items area can be easily disabled via simple customisation. This would only allow administrators to perform a full delete.

Note When items are fully deleted, the content files remain on the file system. The files are only removed from the file system when a job called fileContentStoreCleanerJob is executed. This can be configured to run according to a schedule, for example, every 7 days. Rather than deleting the files, the cleaner job can also be configured to move the content to a different store, such as a ‘deleted items’ store. In the default installation, the cleaner job runs at 4am each day and moves the files to a ‘contentstore.deleted’ store.



Restoring individual content files

Should the situation arise where an object exists in the database but there is no content file on the file system, it is possible to restore the content file from a backup or content store mirror. Administrators can use the Alfresco node browser to view the internal properties for an object including its expected path in the content store. The file, including its path can then be restored to the content store from the backup media or mirrored content store.

Refer to the following example of using the Node Browser to view the path to a content file.
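The node browser shows the content property of a node, from which the file path can be derived. This sketch assumes a property value of the form contentUrl=store://2007/6/12/14/35/<id>.bin|mimetype=...|size=...|encoding=... and a content root such as /alf_data/contentstore; both the format and the paths are assumptions to check against your own node browser output and dir.contentstore setting.

```python
import os


def extract_content_url(property_value: str) -> str:
    """Pull the contentUrl part out of a value such as
    'contentUrl=store://...|mimetype=text/plain|size=12|encoding=UTF-8'."""
    for part in property_value.split("|"):
        key, _sep, value = part.partition("=")
        if key == "contentUrl":
            return value
    raise ValueError("no contentUrl in property value")


def content_url_to_path(content_url: str, content_root: str) -> str:
    """Translate a store:// content URL into the file path under the
    content root (dir.contentstore) that must be restored."""
    prefix = "store://"
    if not content_url.startswith(prefix):
        raise ValueError(f"unexpected content URL: {content_url}")
    relative = content_url[len(prefix):]
    return os.path.join(content_root, *relative.split("/"))
```

The resulting path identifies both the directory to recreate and the file to copy back from the backup media or the mirrored content store.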



Solution comparison

                                    Cold Backup   Hot Backup                Warm Standby  Hot Standby  High Availability
Data Loss                           24 hrs**      Variable, possibly 1 hr   None          None         None
Time to Recover*                    Long          Long                      Short         Very Short   None
Cost                                Low           Low                       Medium        Medium       High
Configuration & Maintenance Effort  Low           Low                       Medium        Medium       High
Complexity                          Low           Low                       Medium        Medium       High
System Availability                 Poor          Poor                      Good          Very Good    Excellent
Flexibility                         Poor          Poor                      Good          Good         Poor

*Actual time will depend on several factors, including amount of data, device speed and available network bandwidth. In this case, long typically means hours, short means several minutes, and very short means a few minutes.

**Assuming a nightly backup.

High availability is clearly the best choice when system availability is critical. However, it is also the most expensive approach. Hot standby provides a good compromise, offering good availability and quick recovery times at a lower cost than high availability.
