Cloud Computing for Business Continuity


















Network Functions Virtualization (NFV) is a concept that promises to grant network operators the flexibility to quickly develop and provision new network functions and services, which can be hosted in the cloud.

Availability refers to cloud uptime and the cloud's capability to operate continuously. Providing highly available services in cloud computing is essential for maintaining customer confidence and satisfaction and for preventing revenue losses. This chapter covers cloud computing as a business continuity solution and cloud service availability. It also covers the causes of service unavailability and the impact of service unavailability.

Further, this chapter covers various ways to achieve the required cloud service availability.

Introduction

A cloud computing service outage can seriously affect the workloads of enterprise systems as well as consumer data and applications [1, 2], and it can lead to significant financial losses or even endanger human lives [3]. For example, an Amazon cloud services outage resulted in data loss for many high-profile sites and serious business issues for hundreds of IT managers.

In addition, the credibility of cloud providers took a hit because of these service failures [4]. Business continuity (BC) is important for service providers to deliver services to consumers in accordance with the SLAs. In a cloud environment, BC processes should support all the layers (physical, virtual, control, orchestration, and service) to provide uninterrupted services to the consumers.

BC processes are automated through orchestration to reduce manual intervention; for example, if a service requires a VM backup every 6 hours, the VM backup is scheduled automatically every 6 hours [5]. Disaster recovery (DR) is the coordinated process of restoring IT infrastructure, including the data required to support ongoing cloud services, after a natural or human-induced disaster occurs. The basic underlying concept of DR is to have a secondary data center or site (DR site) maintained at a pre-planned level of operational readiness for when an outage happens at the primary data center.
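As an illustration of the orchestration-driven scheduling mentioned above (the 6-hour VM backup example), here is a minimal Python sketch; the backup_vm() call and the VM identifiers are hypothetical placeholders rather than any provider's actual API.

```python
import time
from datetime import datetime, timedelta

BACKUP_INTERVAL = timedelta(hours=6)   # interval taken from the service's SLA

def backup_vm(vm_id):
    # Placeholder for a provider-specific backup API call (hypothetical).
    print(f"{datetime.now().isoformat()} backing up {vm_id}")

def run_backup_schedule(vm_ids, interval=BACKUP_INTERVAL):
    """Trigger a backup of every VM each time the interval elapses."""
    next_run = datetime.now()
    while True:
        if datetime.now() >= next_run:
            for vm_id in vm_ids:
                backup_vm(vm_id)
            next_run = datetime.now() + interval
        time.sleep(60)   # poll once a minute

if __name__ == "__main__":
    run_backup_schedule(["vm-001", "vm-002"])   # runs until interrupted
```

A real orchestrator would also track job success, retry failed backups, and report compliance against the SLA rather than simply looping.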

Expensive service disruption can result from disasters, both man-made and natural. To prevent failure in a cloud service provider's (CSP's) system, two different disaster recovery (DR) models have been proposed: the first is the traditional model and the second is cloud-based, which can be used with either a dedicated or a shared approach. Customers choose the relevant model using cost and speed as the determining factors.

In the dedicated approach, a customer is assigned its own infrastructure, which leads to a higher recovery speed but therefore also a higher cost.

At the other end of the spectrum is the shared model, often referred to as the distributed approach, in which multiple users are assigned a given infrastructure; this results in a cheaper outlay but leads to a lower recovery speed.

Background

Cloud computing offers virtualized computing resources and services in a shared and scalable environment through the network, and access to data is protected by firewalls.

A large percentage of global IT firms and governmental entities have adopted cloud services for a multitude of purposes, including mission-oriented applications and, consequently, sensitive data. To fully support these applications and their sensitive data, it is vital to provide dependable cloud computing environments.

Self-service means that the consumers themselves carry out all the activities required to provision the cloud resource. To enable on-demand self-service, a cloud provider maintains a self-service portal, which allows consumers to view and order cloud services.

The cloud provider publishes a service catalog on the self-service portal. The service catalog lists items such as service offerings, service prices, service functions, request processes, and so on. Usually, end users have no knowledge of the exact location of the resources they access, but they may be able to specify location at a higher level of abstraction (e.g., country, state, or data center).

Examples of such resources include storage, processing, memory, and network bandwidth. The characteristic of rapid elasticity gives consumers the impression that unlimited IT resources can be provisioned at any given time. It enables consumers to adapt to variations in workload within minutes by quickly and dynamically expanding (scaling outward) or reducing (scaling inward) IT resources, and to maintain the required performance level proportionately.
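To make the scale-out/scale-in idea concrete, here is a minimal sketch of an autoscaling decision; the utilization thresholds, instance limits, and the notion of a single instance count are illustrative assumptions rather than any specific provider's policy.

```python
SCALE_OUT_THRESHOLD = 0.75   # assumed: add capacity above 75% average utilization
SCALE_IN_THRESHOLD = 0.30    # assumed: remove capacity below 30% average utilization
MIN_INSTANCES, MAX_INSTANCES = 2, 20

def desired_instance_count(current, avg_utilization):
    """Return the instance count after one scaling decision."""
    if avg_utilization > SCALE_OUT_THRESHOLD and current < MAX_INSTANCES:
        return current + 1   # scale outward (expand)
    if avg_utilization < SCALE_IN_THRESHOLD and current > MIN_INSTANCES:
        return current - 1   # scale inward (reduce)
    return current           # workload within bounds: no change

# Example: 4 instances at 82% average utilization -> scale out to 5.
print(desired_instance_count(4, 0.82))
```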

The metering system continuously monitors resource usage per consumer and provides reports on resource utilization. For example, the metering system monitors utilization of processor time, network bandwidth, and storage capacity. The different deployment models present a number of trade-offs in terms of control, scale, cost, and availability of resources. In the public cloud model, there may be multiple tenants (consumers) who share common cloud resources.

A provider typically has default service levels for all consumers of the public cloud. Some providers may optionally provide features that enable a consumer to configure their account with specific location restrictions. Public cloud services may be free, subscription-based, or provided on a pay-per-use model. Figure 1 below illustrates a generic public cloud that is available to enterprises and to individuals.

Departments and business units within an organization rely on services delivered through a private cloud that is dedicated to them as consumers. Organizations may avoid adopting public clouds because such clouds are accessed by the general public over the Internet. A private cloud offers organizations a greater degree of privacy and control over the cloud infrastructure, applications, and data. There are two variants of a private cloud: on-premise and externally hosted.

Figure 1. A generic public cloud available to enterprises and individuals.

The on-premise private cloud model enables an organization to have complete control over the infrastructure and data.

In some cases, a private cloud may also span multiple sites of an organization, with the sites interconnected via a secure network connection. Figure 2 illustrates a private cloud of an enterprise that is available to itself. In the externally hosted variant, the provider manages the cloud infrastructure and facilitates an exclusive private cloud environment for the organization.

Figure 2. An enterprise private cloud.

Figure 3. An externally hosted private cloud.

Figure 4 illustrates a hybrid cloud composed of an on-premise private cloud deployed by an enterprise and a public cloud serving both enterprise and individual consumers. In the Infrastructure as a Service (IaaS) model, the end user can run a multitude of software packages, encompassing a variety of applications as well as operating systems, while the cloud service provider deploys and manages the underlying cloud infrastructure.

Software, such as the operating system (OS), databases, and applications, can be deployed and configured on the cloud resources by consumers.

In organizations, IaaS users are typically IT system administrators. IaaS can even be implemented internally by an organization, with the IaaS supporting its IT staff in managing resources and services. IaaS pricing can be either subscription-based or based on actual resource usage. The IaaS provider pools the underlying IT resources, which are shared by multiple consumers through a multi-tenant model.

In the Platform as a Service (PaaS) model, the consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, or storage, but has control over the deployed applications and possibly the configuration settings for the application-hosting environment.

Figure 4. A hybrid cloud composed of an on-premise private cloud and a public cloud.

In the Software as a Service (SaaS) model, applications are accessible from various client devices through either a thin client interface, such as a web browser (for example, web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

In the SaaS model, a provider hosts an application centrally in the cloud and offers it to multiple consumers for use as a service. The consumers do not own or manage any aspect of the cloud infrastructure. In a SaaS context, a given version of an application, with a specific configuration (hardware and software), typically provides services to multiple consumers by partitioning their individual sessions and data.

SaaS applications execute in the cloud and usually do not need installation on end-point devices. This enables a consumer to access the application on demand from any location and use it through a web browser on a variety of end-point devices. Some SaaS applications may require a client interface to be locally installed on an endpoint device. Figure 5 illustrates the three cloud service models.

In the Backend as a Service (BaaS) model, services include user management, push notifications, integration with social networking services [6], and more. This is a relatively recent model in cloud computing [7], with most BaaS startups dating from recent years [8], but trends indicate that these services are gaining significant mainstream traction with enterprise consumers.

Figure 5. Cloud service models.

Serverless computing, despite the name, does not actually involve running code without servers [9].

Serverless computing is so named because the business or person that owns the system does not have to purchase, rent, or provision servers or virtual machines for the back-end code to run on. Function as a Service (FaaS) is included under the broader term serverless computing, but the two terms are sometimes used interchangeably [10].

These five participating actors are the cloud provider, the cloud consumer, the cloud broker, the cloud carrier, and the cloud auditor. Table 1 presents the definitions of these actors [4].

Table 1. The five actors in a cloud computing environment.

Provider: Any individual, entity, or organization responsible for making services available and providing computing resources to cloud consumers.
Broker: An IT entity that provides an entry point for managing the performance and QoS of cloud computing services. In addition, it helps cloud providers and consumers with the management of service negotiations.
Auditor: A party that can provide an independent evaluation of cloud services delivered by cloud providers in terms of performance, security and privacy impact, information system operations, and so on.
Carrier: An intermediary party that provides access and connectivity to consumers through access devices such as networks. The cloud carrier transports services from a cloud provider to cloud consumers.

Service providers proposed NFV with the aim of simplifying and speeding up the process of adding new network functions or applications. Multiple VNFs can be added to a standard server and can then be monitored and controlled by a hypervisor. If a VNF running on a virtual machine requires more bandwidth, for example, the decision to scale or move the VNF is taken by the NFV management and orchestration functions, which can move the virtual machine to another physical server or provision another virtual machine on the original server to handle part of the load. This flexibility allows an IT department to respond more agilely to changing business goals and network service demands.

If a customer requests a new function, for example, NFV enables the service provider to add that function by configuring it in a virtual machine, without upgrading or buying new hardware.

Figure 6. Components of NFV architecture.

NFV can therefore help reduce both operational and capital expenditures. Providing enterprises and individuals with a real-time, on-demand, all-online experience requires an end-to-end (E2E) coordinated architecture featuring agile, automatic, and intelligent operation during each phase.

The comprehensive cloud adaptation of networks, operations systems, and services is a prerequisite for this much-anticipated digital transformation. In existing networks, operators have gradually used SDN and NFV to implement ICT network hardware virtualization, but they retain a conventional operational model and software architecture.

Physical networks are constructed around data centers (DCs) to pool hardware resources (including part of the RAN and core network devices), which maximizes resource utilization. Operators transform their networks using an architecture based on the DC, in which all functions and service applications run in the cloud DC; this is referred to as a cloud-native architecture.

In the 5G era, a single network infrastructure can meet diversified service requirements.

Cloud computing for business continuity

Business continuity is a set of processes that includes all the activities a business must perform to mitigate the impact of a service outage. BC entails preparing for, responding to, and recovering from a system outage that adversely affects business operations. It describes the processes and procedures a service provider establishes to ensure that essential functions can continue during and after a disaster.

BC involves proactive measures, such as business impact analysis, risk assessment, building resilient IT infrastructure, and deploying data protection solutions (backup and replication).

It also involves reactive countermeasures, such as disaster recovery, to be invoked in the event of a service failure.

Disaster recovery (DR) is the coordinated process of restoring IT infrastructure, including the data required to support ongoing cloud services, after a natural or human-induced disaster occurs. Cloud service availability refers to the ability of a cloud service to perform its agreed function according to business requirements and customer expectations during its operation. Cloud service providers need to design and build their infrastructure to maximize the availability of the service, while minimizing the impact of an outage on consumers.

Cloud service availability depends primarily on the reliability of the cloud infrastructure (compute, storage, and network components) and of the business applications used to create cloud services, and on the availability of data. The time between two outages, whether scheduled or unscheduled, is commonly referred to as uptime, because the service is available during this time.

Conversely, the time elapsed during an outage, from the moment a service becomes unavailable to the moment it is restored, is referred to as downtime.

A simple mathematical expression of service availability is based on the agreed service time and the downtime. In a cloud environment, a service provider publishes the availability of a service in the SLA; the agreed availability percentage determines how much downtime is permissible over the agreed service time. It is therefore important for the service provider to identify the causes of service failure and analyze their impact on the business. Data center failure is not the only cause of service failure; poor application design or resource configuration errors can also lead to a service outage.
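Returning to the simple expression of availability mentioned above, it is commonly written as follows (a standard formulation, not a formula quoted from the chapter or a specific SLA):

```latex
\text{Availability (\%)} = \frac{\text{agreed service time} - \text{downtime}}{\text{agreed service time}} \times 100
```

For instance, over 8760 hours of agreed service time per year, a commitment of 99.9 percent availability permits roughly 8.76 hours of downtime; these figures are illustrative, not taken from the chapter.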

For example, if the web portal is down for some reason, the services are inaccessible to consumers, which leads to service unavailability. Unavailability of data, due to factors such as data corruption and human error, also leads to service unavailability.

A cloud service might also cease to function due to an outage of dependent services. Perhaps even more impactful on availability are the outages that are required as part of the normal course of doing business. The IT department is routinely required to take on activities such as refreshing the data center infrastructure, migration, running routine maintenance, or even relocating to a new data center.

Any of these activities can have its own significant and negative impact on service availability. Planned outages may include installation and maintenance of new hardware, software upgrades or patches, application and data restores, facility operations (renovation and construction), and migration.

Unplanned outages include failures caused by human errors, database corruption, failure of physical and virtual components, and natural or human-made disasters. The impact of a service outage includes loss of revenue and damage to reputation: loss of revenue includes direct loss, compensatory payments, future revenue loss, billing loss, and investment loss.

Damage to reputation may result in a loss of confidence or credibility with customers, suppliers, financial markets, banks, and business partners. Other possible consequences of a service outage include the cost of additional rented equipment, overtime, and extra shipping.

This process typically involves both operational personnel and automated procedures in order to reactivate the service application at a functioning data center. This requires the transfer of application users, data, and services to the new data center, and it involves the use of redundant infrastructure across different geographic locations, live migration, backup, and replication solutions.

Building a fault-tolerant cloud infrastructure

This section covers the key fault tolerance mechanisms at the cloud infrastructure component level and the concept of service availability zones.

Single point of failure. The general method to avoid single points of failure is to provide redundant components for each necessary resource, so that a service can continue with the available resource even if a component fails. A service provider may also create multiple service availability zones to avoid single points of failure at the data center level.

Usually, each zone is isolated from others, so that the failure of one zone would not impact the other zones. It is also important to have high availability mechanisms that enable automated service failover within and across the zones in the event of component failure, data loss, or disaster.

In an N + 1 redundancy configuration, a set of N components has at least one standby component. In an active/passive arrangement, the standby component becomes active only if one of the active components fails. In an active/active arrangement, the standby component remains active in the service operation even if all other components are fully functional.

The load for the cloud service is balanced between the sites. If one of the sites goes down, the available site manages the service operations and the workload. It is important to have high availability mechanisms that enable automated service failover.

The example shown in Figure 8 represents an infrastructure designed to mitigate single points of failure at the component level. Single points of failure at the compute level can be avoided by implementing redundant compute systems in a clustered configuration.

Single points of failure at the network level can be avoided via path and node redundancy and various fault tolerance protocols. Multiple independent paths can be configured between nodes so that if a component along the main path fails, traffic is rerouted along another. The key techniques for protecting storage from single points of failure are RAID, erasure coding, dynamic disk sparing, and configuring redundant storage system components.

Many storage systems also support a redundant array of independent nodes (RAIN) architecture to improve fault tolerance. The following sections discuss the various fault tolerance mechanisms used to avoid single points of failure at the component level.
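As a toy illustration of the parity idea behind the RAID and erasure coding techniques mentioned above, the sketch below rebuilds one lost data block from the surviving blocks and an XOR parity block; real storage systems use far more sophisticated codes and on-disk layouts.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # equally sized data blocks
parity = xor_blocks(data)            # parity stored on a separate disk or node

# Simulate losing the second block and rebuilding it from the rest plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("recovered block:", rebuilt)
```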

A service availability zone is a location with its own set of resources that is isolated from other zones, so that a failure in one zone does not impact other zones. A zone can be part of a data center or may even comprise a whole data center. This provides redundant cloud computing facilities on which applications or services can be deployed.

Service providers typically deploy multiple zones within a data center to run multiple instances of a service, so that if one of the zones incurs an outage for some reason, the service can be failed over to another zone.

They also deploy multiple zones across geographically dispersed data centers to run multiple instances of a service, so that the service can survive even a failure at the data center level. It is also important to have a mechanism that allows seamless, automated failover of services running in one zone to another.

Figure 8. Implementing redundancy at the component level.

This is because manual steps are often error prone and may take considerable time to implement. Automated failover also provides a reduced recovery time objective (RTO) compared to a manual process. A failover process also depends upon other capabilities, including replication and live migration capabilities, and a reliable network infrastructure between the zones. In the active/passive scenario, all the traffic goes to the active (primary) zone only, and the storage is replicated from the primary zone to the secondary zone.

When a disaster occurs, the service is failed over to the secondary zone: the only requirement is to start the application instances in the secondary zone, after which traffic is rerouted to that location. If the primary zone goes down, the service is thus failed over to the secondary zone and all requests are rerouted there.
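The following sketch outlines the active/passive failover logic just described; the health check, instance start, and traffic rerouting functions are hypothetical placeholders for provider-specific mechanisms, and a real implementation would also guard against false alarms and split-brain conditions.

```python
import time

def zone_is_healthy(zone):
    # Placeholder: in practice this would probe load balancers, APIs, storage, etc.
    return True

def start_service_instances(zone):
    print(f"starting application instances in {zone}")   # placeholder action

def reroute_traffic(zone):
    print(f"rerouting consumer requests to {zone}")      # placeholder action

def monitor_and_failover(primary, secondary, check_interval_s=30):
    """Poll the primary zone and fail the service over when it stops responding."""
    while zone_is_healthy(primary):
        time.sleep(check_interval_s)
    # Storage is already replicated to the secondary zone, so only the
    # application instances need to be started before traffic is rerouted.
    start_service_instances(secondary)
    reroute_traffic(secondary)
```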

The active/active implementation provides faster restore of a service (very low RTO). In this case, both zones are active, running simultaneously and handling consumer requests, and the storage is replicated between the zones.

There should be a mechanism in place to synchronize the data between the two zones. If one of the zones fails, the service is failed over to the other active zone. The key point to note is that, until the primary zone is restored, the secondary zone may experience a sudden increase in workload, so it is important to initiate additional instances at the secondary zone to handle it.

Figure 9. The figure details the underlying techniques, such as live migration of VMs using a stretched cluster, which enable continuous availability of a service in the event of compute, storage, or zone (site) failure.

Data protection solution: backup

This section covers an introduction to backup and recovery as well as a review of the backup requirements in a cloud environment. It also covers guest-level and image-level backup methods. Further, this section covers backup as a service, backup service deployment options, and deduplication for the backup environment.

Typically, organizations implement a data protection solution in order to protect data from accidental file deletion, application crashes, data corruption, and disasters. Data should be protected both at the local location and at a remote location to ensure the availability of a service. For example, when a service is failed over to another zone (data center), the data must already be available at the destination for the failover to succeed and the impact on the service to be minimized.

Individual applications or services and their associated data sets have different business values and require different data protection strategies. As a result, a service provider should implement a well-designed data protection infrastructure that offers a choice of cost-effective options to meet the various tiers of protection needed.

In a tiered approach, data and application services are allocated to categories (tiers) depending on their importance. Using tiers, resources and data protection techniques can be applied more cost-effectively: the more stringent requirements of critical services are met with stronger protection, while less expensive approaches are used for the other tiers. The two key data protection solutions widely implemented are backup and replication.
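The structure below sketches how a provider might map tiers to protection techniques; the tier names, RPO/RTO targets, and technique choices are illustrative examples, not recommendations from the chapter.

```python
# Illustrative mapping of service tiers to data protection policies.
protection_tiers = {
    "tier 1 (mission critical)": {
        "rpo": "near zero", "rto": "minutes",
        "techniques": ["synchronous replication", "active/active zones"],
    },
    "tier 2 (business important)": {
        "rpo": "1 hour", "rto": "a few hours",
        "techniques": ["asynchronous replication", "frequent snapshots"],
    },
    "tier 3 (everything else)": {
        "rpo": "24 hours", "rto": "next business day",
        "techniques": ["daily backup"],
    },
}

for tier, policy in protection_tiers.items():
    print(f"{tier}: RPO {policy['rpo']}, RTO {policy['rto']}, "
          f"protected by {', '.join(policy['techniques'])}")
```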

With growing business and regulatory demands for data storage, retention, and availability, cloud service providers face the task of backing up an ever-increasing amount of data. This task becomes more challenging with the growth of data, reduced IT budgets, and less time available for taking backups. Moreover, service providers need fast backup and recovery of data to meet their service level agreements. The amount of data loss and downtime that a business can endure, expressed in terms of the recovery point objective (RPO) and the RTO, is the primary consideration in selecting and implementing a specific backup strategy.

RPO specifies the time interval between two backups. For example, if a service requires an RPO of 24 hours, the data needs to be backed up every 24 hours. RTO relates to the time taken by the recovery process.

To meet the defined RTO, the service provider should choose the appropriate backup media or backup target to minimize the recovery time. For example, a restore from tapes takes longer to complete than a restore from disks. Service providers need to evaluate the various backup methods along with their recovery considerations and retention requirements to implement a successful backup and recovery solution in a cloud environment.
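As a small worked example of the RPO rule of thumb above, the check below flags whether the most recent backup still satisfies the agreed RPO; the 24-hour figure mirrors the example in the text, while the timestamps are made up.

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup, rpo, now=None):
    """True if the time elapsed since the last backup is within the agreed RPO."""
    now = now or datetime.now()
    return now - last_backup <= rpo

# A 24-hour RPO with the last backup taken 20 hours ago is still compliant.
last_backup = datetime.now() - timedelta(hours=20)
print(meets_rpo(last_backup, timedelta(hours=24)))   # True
```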

Multiple VMs are hosted on single or clustered physical compute systems. The virtualized compute environment is typically managed from a management server, which provides a centralized management console. Integration of the backup application with the management server of the virtualized environment is required: advanced backup methods require the backup application to obtain a view of the virtualized environment and to send backup-related configuration commands to the management server.

The backup may be performed either file by file or as an image. Cloud services have different availability requirements, and these affect the backup strategy, for example when a consumer chooses a higher backup service level. Typically, a cloud environment has a large volume of redundant data. Backing up redundant data significantly lengthens the backup window and increases operating expenditure, so service providers need to consider deduplication techniques to overcome these challenges.
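To illustrate why deduplication shrinks the backup window and storage footprint, here is a toy content-hash chunk store; real deduplication engines use variable-length chunking, compression, and persistent indexes, none of which are shown here.

```python
import hashlib

CHUNK_SIZE = 4096
chunk_store = {}   # chunk hash -> chunk data, each unique chunk stored once

def dedupe_backup(data):
    """Split data into fixed-size chunks and store only chunks not already present."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # skipped if the chunk already exists
        recipe.append(digest)                   # enough to reconstruct this backup
    return recipe

recipe = dedupe_backup(b"A" * 8192 + b"B" * 4096)
print(len(recipe), "chunks referenced,", len(chunk_store), "chunks actually stored")
```

Because two of the three chunks are identical, only two chunks are actually stored, which is the effect deduplication exploits across largely similar VM images and repeated backups.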

It is also important to ensure that most backup and recovery operations are automated. The complexity of the data environment, exemplified by the proliferation and dynamism of virtual machines, constantly outpaces existing data backup plans. Deploying a new backup solution takes weeks of planning, justification, procurement, and setup.

Some organizations find the traditional on-premise backup approach inadequate to the challenge. Service providers offer backup as a service that enables an organization to reduce its backup management overhead.

It also enables the individual consumer to perform backup and recovery anytime, from anywhere, using a network connection. Backup as a service enables enterprises to procure backup services on demand. Consumers do not need to invest in capital equipment in order to implement and manage their backup infrastructure. Mobile workers represent a particular risk because of the increased possibility of lost or stolen machines.

Backing up to the cloud ensures regular, automated backups for sites and mobile workers that lack local IT staff or the time to perform and maintain regular backups. For such consumers, a cloud service provider may offer a replicated backup service that copies backup data to a remote disaster recovery site. Consumers do not maintain backup infrastructure on site; instead, their data is transferred over a network to a backup infrastructure managed by the cloud service provider.

Data protection solution: replication

This section covers replication and its types. It covers local replication methods, such as snapshot and mirroring, and remote replication methods, such as synchronous and asynchronous remote replication, along with continuous data protection (CDP). If a local outage or disaster occurs, fast data and VM restore and restart are essential to ensure business continuity.

One of the ways to ensure BC is replication, which is the process of creating an exact copy (replica) of the data. These replicas are used to restore and restart services if data loss occurs. The service provider should give consumers the option of choosing the location to which the data is replicated, in order to comply with regulatory requirements. Replication can be classified into two major categories: local replication and remote replication.

Local replication refers to replicating data within the same location. Local replicas help to restore data in the event of data loss and enable the application to be restarted immediately to ensure BC. Snapshot and mirroring are the most widely deployed local replication techniques.

Remote replication refers to replicating data across multiple locations (the locations can be geographically dispersed). Remote replication helps organizations mitigate the risks associated with regional outages resulting from natural or human-made disasters. During disasters, services can be moved to a remote location to ensure continuous business operation. In remote replication, data can be replicated synchronously or asynchronously.

Replicas are immediately accessible by the application, but a backup copy must be restored by backup software to make it accessible to applications. Backup is always a point-in-time copy, but a replica can be a point-in-time copy or continuous.

Backup is typically used for operational or disaster recovery but replicas can be used for recovery and restart.

Replicas typically provide a faster RTO compared to recovery from backup. A snapshot can be created by the compute operating environment (hypervisor) or by the storage system operating environment. Typically, the storage system operating environment takes a snapshot at the volume level, and the volume may contain the data and configuration files of multiple VMs. This option does not provide a way to restore an individual VM within the volume.

The most common snapshot technique implemented in a cloud environment is the virtual machine snapshot. A virtual machine snapshot preserves the state and data of a virtual machine at a specific point in time.

A VM snapshot is useful for the quick restore of a VM. For example, a cloud administrator can snapshot a VM and then make changes such as applying patches and software upgrades.

If anything goes wrong, the administrator can simply restore the VM to its previous state using the previously created snapshot. The hypervisor provides an option to create and manage multiple snapshots. When a VM snapshot is created, a child virtual disk (delta disk file) is created from the base image or parent virtual disk. The snapshot mechanism prevents the guest operating system from writing to the base image or parent virtual disk and instead directs all writes to the delta disk file.
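The class below sketches the redirect-on-write behavior just described: the parent disk stays read-only and only changed blocks land in the delta. It is a conceptual model, not any particular hypervisor's disk format.

```python
class SnapshotDisk:
    """Conceptual model of a child (delta) disk layered over a read-only parent."""

    def __init__(self, parent):
        self.parent = parent   # base image or previous child disk (read-only)
        self.delta = {}        # only the blocks written after the snapshot

    def write(self, block_no, data):
        self.delta[block_no] = data                 # never touches the parent

    def read(self, block_no):
        if block_no in self.delta:                  # block changed since the snapshot
            return self.delta[block_no]
        return self.parent.get(block_no, b"\x00")   # unchanged: read from the parent

base_image = {0: b"boot", 1: b"data"}
snapshot = SnapshotDisk(base_image)
snapshot.write(1, b"patched")
print(snapshot.read(0), snapshot.read(1))   # b'boot' b'patched'
print(base_image[1])                        # parent is untouched: b'data'
```

Reverting the VM to the snapshot then amounts to discarding the delta, which is why restores from snapshots are fast.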

Successive snapshots generate a new child virtual disk from the previous child virtual disk in the chain, and snapshots hold only the changed blocks. A VM snapshot can be used to create the image-based backup discussed earlier and thereby offload the backup load from the hypervisor. The following example illustrates mirroring between volumes within a storage system.

The replica is attached to the source and established as a mirror of the source. The data on the source is copied to the replica, and new updates to the source are also applied to the replica.

Local replication: mirroring.

While the replica is attached to the source, it remains unavailable to any other compute system; however, the compute system continues to access the source. After synchronization is complete, the replica can be detached from the source and made available for other business operations, such as backup and testing.

If the source volume becomes unavailable for some reason, the replica enables the service instance to be restarted on it, or the data can be restored to the source volume to make it available for operations again.

Remote replication: synchronous.

In synchronous remote replication, additional writes on the source cannot occur until each preceding write has been completed and acknowledged. This ensures that data is identical on the source and the replica at all times.

Further, writes are transmitted to the remote site in exactly the order in which they are received at the source; therefore, write ordering is maintained. Data can be replicated synchronously across multiple sites. If the primary zone becomes unavailable due to a disaster, the service can be restarted immediately in another zone to meet the required SLA. Because the source must wait for each remote acknowledgment, synchronous replication affects application response time; the degree of impact depends primarily on the distance and the network bandwidth between the sites.

If the bandwidth provided for synchronous remote replication is less than the maximum write workload, there will be times during the day when response times become excessively elongated, causing applications to time out. Typically, synchronous remote replication is deployed only over relatively short distances between the two sites. If data is replicated synchronously between zones that are close together and a disaster strikes, there is a chance that both zones may be impacted.
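To contrast the synchronous write path described above with the asynchronous alternative mentioned earlier, here is a deliberately simplified sketch; the in-memory lists stand in for volumes and replication links, and latency, failures, and batching are ignored.

```python
from collections import deque

remote_replica = []     # stands in for the volume at the remote site
async_queue = deque()   # writes waiting to be shipped asynchronously

def write_synchronous(source, data):
    source.append(data)
    remote_replica.append(data)   # wait for the remote acknowledgment...
    # ...and only then acknowledge the write to the application (RPO ~ zero)

def write_asynchronous(source, data):
    source.append(data)
    async_queue.append(data)      # acknowledged immediately; replicated later

def drain_async_queue():
    while async_queue:            # ships queued writes, preserving their order
        remote_replica.append(async_queue.popleft())

source_volume = []
write_synchronous(source_volume, b"w1")    # pays the round trip on every write
write_asynchronous(source_volume, b"w2")   # lower latency, small data-loss window
drain_async_queue()
print(source_volume, remote_replica)
```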

Through business continuity planning, risk management procedures and processes are established that aim to prevent interruptions to mission-critical services and to reestablish full function of the organization as quickly as possible.

The importance of business continuity and disaster recovery to the function of the business

The main goal of disaster recovery in the cloud is to provide the organization with a way of recovering data and implementing failover in the event of a natural catastrophe or man-made disaster.

Both business continuity planning and disaster recovery are essential parts of overall risk management for the organization. Some risks that exist in the organization cannot be eliminated; therefore, implementing business continuity plans and cloud disaster recovery prepares the organization for potentially disruptive events.

In addition, business continuity and cloud disaster recovery are important because they provide the organization with a detailed strategy for how the business will survive and continue operating immediately after disasters and severe interruptions.

The risks that can be transferred to the cloud provider include the risk of losing data, access to data by unauthorized personnel, the risk of data deletion, reduced visibility and control on the consumer side, data leakage, and incomplete data deletion as a result of transferring data across different devices.

Reduced visibility and control occur when assets or operations are transferred to the cloud, as the organization loses some control over the services. Data stored in the cloud can also be stolen.


