
Robust QAD Infrastructure: Leveraging VMware and NetApp

Sekhar Athmakuri, IT Manager

Agenda

• Tower International overview
• QAD deployments at Tower
• QAD infrastructure upgrade and virtualization
• DR architecture
• Additional application and infrastructure upgrades

Tower International at a glance

• Tower International is a leading integrated global manufacturer of engineered structural metal components and assemblies, primarily serving automotive original equipment manufacturers
• Revenue: $2.1 billion
• Employees: 9,000
• Corp. headquarters: Livonia, Michigan, USA
• Locations: Products are manufactured at 29 production facilities strategically located near customers in North America, South America, Europe and Asia, supported by eight engineering and sales locations around the world

QAD at Tower

• QAD deployed globally at all Tower locations
• QAD infrastructure centralized at the USA-based global data center
• QAD versions deployed at Tower include:
  – MfgPro 9.0 / Progress 9.1E
  – eB2 (various SP levels) / Progress 9.1E
  – QAD 2008 SE / Progress 10.1C
  – QAD 2010 SE with .NET / Progress 10.2B
• Total QAD users: around 1,600

Pre-virtualization QAD at Tower

• MfgPro 9.0 and eB2 run centrally from the USA-based data center
• Infrastructure:
  – HP PA-RISC servers running HP-UX 11.x
    • Obsolete hardware; slow performance
    • No hardware fault tolerance: a single server failure risks bringing down QAD for an entire region
  – EMC SAN storage
    • Costly and complex outsourced storage solution
  – DR (disaster recovery) solution with a 2-day recovery time from tape backups
    • Recovery time too long
    • Solution does not scale well
    • Costly outsourced solution

Infrastructure upgrade project

• Project started in 2010 with the following objectives:
  1. Upgrade the QAD server infrastructure (replace obsolete hardware)
  2. Improve QAD performance (5x)
  3. Replace the outsourced SAN storage solution (reduce cost and complexity)
  4. Implement a DR solution that scales across multiple applications (including QAD) with a recovery time objective of 4 hours or less

New solution

• Key infrastructure changes:
  – Migrate from HP-UX to Red Hat Linux
    • Linux made the hardware upgrade possible without requiring a QAD application upgrade
    • Utilize the latest Intel x86 CPUs for improved performance
  – Virtualize servers on VMware
    • Provides hardware fault tolerance without added complexity
    • Simplifies future QAD upgrades: new servers to test upgrades can be set up easily without purchasing new hardware
    • Server portability for a future data center move
  – Migrate from EMC SAN storage to NetApp NAS storage
    • Reduces the cost and complexity of the storage solution
    • Already proven technology at Tower
• New DR solution based on VMware virtualization and NetApp storage replication technologies
  – QAD servers and databases to be replicated to the DR site

New VMware/NetApp infrastructure

[Architecture diagram not included in the transcript]

Migration to Linux

• Key success factors:
  – Red Hat Enterprise Linux 4 with increased CPU and memory capacity
  – Increased DB cache for improved performance
  – Single-tier QAD/Progress architecture: client and DB on a single server, which later allowed self-service client mode to boost performance
  – QAD databases, binaries and users on NFS version 3 TCP mounts, simplifying storage management (a sample mount entry is sketched below)
  – Leveraged BravePoint's Pro dump/load utility to minimize outage windows during migrations
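
As an illustration of the NFSv3-over-TCP layout above, a minimal /etc/fstab sketch; the filer, volume and mount point names are hypothetical, and the options reflect common NFSv3 practice of that era rather than Tower's documented settings:

    # NFSv3 over TCP mount for a QAD database volume (all names are placeholders)
    netapp1:/vol/vol_qaddb  /qad/db  nfs  rw,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768  0 0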

Migration to Linux (contd.)

• Key benefits:
  – LDAP (AD) authentication for users
  – Improved performance (5x)
    • Several large operations, including MRP runs, saw over 5x speed improvement (MRP runs went from around 50 minutes to around 10 minutes)
    • Driven by the larger DB cache, the large read cache on the NetApp, and faster CPUs
• Lessons learned:
  – Code compilation issues: the source code for some custom programs could not be properly identified; better source code management was needed
  – Slow telnet performance on Linux, especially for regions outside the USA; the Nagle algorithm had to be disabled (set the NODELAY flag in /etc/xinetd.d/telnet, as sketched below) to fix telnet performance
  – Direct printing to Korea printers did not work in Linux; the print-to-local feature of NetTerm (terminal emulation software) was used as a workaround
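
The Nagle fix above corresponds to xinetd's NODELAY flag, which sets TCP_NODELAY on the telnet socket. A minimal sketch of /etc/xinetd.d/telnet; only the flags line comes from the deck, the remaining values are typical RHEL defaults:

    service telnet
    {
        flags          = REUSE NODELAY    # NODELAY disables the Nagle algorithm
        socket_type    = stream
        wait           = no
        user           = root
        server         = /usr/sbin/in.telnetd
        log_on_failure += USERID
        disable        = no
    }

Restarting xinetd (service xinetd restart) applies the change.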

Virtualization of QAD / Progress

• Key success factors:
  – VMware vSphere 4 cluster with Intel Nehalem CPUs optimized for virtualization
  – No oversubscription of CPU/memory resources, to ensure consistent performance
  – Simplified VMware configuration with NFS volumes for datastores
  – Separate gigabit network dedicated to storage access
  – VMware best practices, including jumbo frames (MTU 9000); a guest-side configuration sketch follows below
• Key benefits:
  – VMware snapshots facilitate simple back-out from system changes and recovery from OS corruption
  – Automatic load balancing of virtual servers across hosts
  – Ease of changing virtual server resources (CPU/memory) and deploying new servers
  – Cloning: test servers can easily be cloned from production, ensuring the same OS configuration and patches on test
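
Jumbo frames only help when every hop agrees on the MTU, so the guest NIC, vSwitch, physical switch ports and filer interfaces all need MTU 9000. A minimal guest-side sketch for a RHEL-style ifcfg file; the interface name and addresses are hypothetical:

    # /etc/sysconfig/network-scripts/ifcfg-eth1 (storage NIC; addresses are placeholders)
    DEVICE=eth1
    BOOTPROTO=static
    IPADDR=192.168.50.21
    NETMASK=255.255.255.0
    MTU=9000
    ONBOOT=yes

    # Verify the path end to end with a non-fragmenting jumbo ping:
    #   ping -M do -s 8972 netapp1    # 8972 = 9000 minus 28 bytes of IP/ICMP headers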

Virtualization of QAD / Progress (contd.)

• Lessons learned:
  – vMotion of systems with large amounts of memory was disruptive to the systems being moved
    • Set up rules to pin large-memory systems to specific hosts in order to prevent automatic relocation; this did not impact hardware fault tolerance
  – IP-hash-based load balancing across the 4 gigabit ports on the VM host storage network was not optimal: 95% of the traffic stayed on a single gigabit port
  – Storage NICs on the VM hosts were shared between storage traffic and vMotion traffic, which caused contention at times

NetApp storage

• Key success factors:
  – Dual FAS3160 storage controllers for high availability
  – NetApp sizing based on peak IOPS usage on the EMC storage
  – 250 GB of flash-based read cache per controller
  – High-performance 15K RPM Fibre Channel disks with a large stripe size
  – EtherChannel storage network interface with 8 gigabit ports
  – NetApp best practices, including jumbo frames (MTU 9000) on the storage interfaces

NetApp storage (contd.)

• Key benefits:
  – Eliminated frequent dump and loads of QAD databases
    • The old infrastructure required frequent (multiple times a year) dump/loads due to poor performance
    • Performance has been good and consistent since the upgrade, minimizing the need for dump/loads
  – Snapshots: easily create logical snapshots of volumes for backup and recovery
  – Dedupe: eliminates duplicate data blocks on volumes; saw a 35% space saving on the QAD DB volumes (a command sketch follows below)
  – NFS and CIFS access to the same volume eliminated the need for users to FTP to QAD servers to access print files
• Lessons learned:
  – I/O inefficiencies due to misaligned Windows 2003 and Red Hat Linux 4/5 virtual servers
    • Administrative correction was needed to fix the issue
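
For reference, enabling deduplication on a Data ONTAP 7-mode volume of that era took two commands, run on the filer console or over ssh; the volume name is hypothetical:

    sis on /vol/vol_qaddb          # enable deduplication on the volume
    sis start -s /vol/vol_qaddb    # -s scans and dedupes existing blocks, not just new writes
    df -s /vol/vol_qaddb           # report space saved (around 35% on the QAD DB volumes here)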

DR architecture

[DR architecture diagram not included in the transcript]

QAD disaster recovery

• Key network design considerations:
  – Different system IP addresses in DR
    • Employ DNS to manage the IP change (a scripted-update sketch follows below)
  – Fenced DR environment to protect the production environment during DR testing
    • Access to the DR environment is controlled via access lists on a Cisco layer 3 switch
  – Optimize replication traffic between the data center and the DR site
    • Over 80% optimization achieved with Riverbed WAN accelerators
  – Adequate network bandwidth to replicate daily changes between the data center and the DR site
    • A 45 Mbps line is used at the DR site
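
The DNS-driven IP change above can be scripted; a minimal sketch using nsupdate, with hypothetical server, host and address values (the deck does not detail Tower's actual update mechanism):

    # Repoint a QAD server's A record to its DR address (all values are placeholders)
    nsupdate <<'EOF'
    server dns1.example.com
    update delete qadprod01.example.com A
    update add qadprod01.example.com 300 A 10.200.1.15
    send
    EOF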

QAD server/data replication process

• Replication process:
  – Daily replication of all QAD servers and ancillary systems to the DR site using the NetApp SMVI utility
    • Swap partitions are on separate volumes and are not replicated on a daily basis
  – Daily QAD DB replication using NetApp SnapMirror technology (a command sketch follows below):
    • Quiet the QAD DBs with proquiet and take a snapshot (takes a few seconds)
    • Release the QAD DBs to normal operation
    • Replicate the snapshot copy of the DB to the DR site
  – Only data blocks that changed since the previous day's snapshot are replicated
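
A sketch of that nightly quiet-point sequence; proquiet is the standard Progress utility named in the deck, while the snapshot and SnapMirror steps are shown as 7-mode-style commands with hypothetical database, volume and filer names:

    proquiet /qad/db/qadprod enable             # pause DB writes at a quiet point
    ssh netapp1 snap create vol_qaddb nightly   # crash-consistent snapshot (a few seconds)
    proquiet /qad/db/qadprod disable            # release the DB to normal operation
    ssh drfiler snapmirror update vol_qaddb     # pull only the changed blocks to the DR site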

QAD DR startup and testing

• DR startup process:
  – VMware SRM (Site Recovery Manager) is used to automate the following:
    • Change the IP addresses of all servers
    • Change DNS entries to reflect the new IPs
    • Change the DB mounts on all QAD servers to point at the DR NetApp
    • Create a logical copy (FlexClone) of the DR volumes (a command sketch follows below)
    • Start all systems (the entire process takes less than 4 hours)
• DR testing:
  – All QAD environments are validated annually
  – Logical copies of volumes are discarded after testing is completed
  – No impact to production systems or to the replicated data in DR
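
A sketch of the FlexClone step with hypothetical volume and snapshot names (7-mode-style commands); the clone is a writable, space-efficient copy backed by the replicated snapshot, so tests never modify the SnapMirror destination itself:

    ssh drfiler vol clone create vol_qaddb_test -b vol_qaddb nightly   # writable clone of the DR snapshot
    # ... run DR validation against vol_qaddb_test ...
    ssh drfiler vol offline vol_qaddb_test    # discard the clone after testing
    ssh drfiler vol destroy vol_qaddb_test -f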

Key accomplishments of this project

• 5x QAD performance improvement
• Reduced storage cost and complexity with NetApp storage
  – Less than a 2-year return on investment, based on outsourced storage expense savings
• Eliminated frequent database dump/loads, saving 0.5 FTE (full-time equivalent)
• Automatic hardware failure protection with VMware
• Standby DR solution that scales across multiple applications, with a less-than-4-hour system recovery time

QAD 2010 SE multi-domain virtualization

• Server environment: Red Hat 5 64-bit, 8 vCPUs and 64 GB of RAM
• 10 North American plants were migrated from a MfgPro 9.0 environment to a QAD 2010 SE multi-domain environment
• System performance was good until a large plant came on board:
  – Linux system load frequently exceeded 80, resulting in severe performance degradation
  – High CPU utilization
• Resolution steps:
  – Implemented a larger server with 24 vCPUs and 128 GB of RAM
  – System load issues were resolved after enabling the -q parameter (reads code only once) in client connections; see the sketch after this list
  – Performance was significantly improved after implementing self-service client mode
  – Corrected several custom programs that were consuming large amounts of CPU due to indexing issues
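
A sketch of a self-service client startup with -q; the mpro invocation, paths and parameter file are placeholders, since the deck does not show Tower's actual startup scripts:

    # -q (quick request) makes the client search PROPATH for each program once
    # instead of re-checking on every call, which brought the system load down
    mpro /qad/db/qadprod -q -pf client.pf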

MRP (23.2 Full Regen) run time improvements

[Chart: post-upgrade MRP (23.2 full regen) run times in minutes, by day from 4/9/2012 to 4/20/2012, for the auburn, bardstown, chicago, clinton, elkton, madison, meridian, ohio, plymouth and smyrna plants]

NetApp and VMware upgrades

• 10 Gbit storage network
  – Simplified the storage network by reducing the number of ports
  – Much faster vMotion speeds; no impact on systems with large amounts of memory during vMotion
• VMware upgrade to vSphere 5, along with a distributed virtual switch implementation
  – Eliminated the VM host load balancing inefficiencies across multiple NICs
• Upgraded NetApp to a FAS3250 cluster
  – SSDs used for read/write caching
  – Aggregate/volume level caching control with the SSD cache

Questions?