Joel Alcântara (advised by Alysson Bessani)

Master’s thesis, Mestrado em Engenharia Informática, Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Sept. 2016

Abstract: Disaster recovery is a crucial feature to ensure high availability and data protection in modern information systems. The most common approach today consists of replicating the services that make up the system in a set of virtual machines located in a geographically distant public cloud infrastructure. These computational instances are kept executing in passive mode, receiving updates from the primary infrastructure, so that they remain up to date and ready to perform failover if a disaster occurs at the primary infrastructure. This approach leads to significant monetary and management costs for keeping virtual machines executing in the cloud. In this work, we present GINJA, a disaster recovery system for transactional database management systems that relies exclusively on public cloud storage services (e.g., Amazon S3, Azure Blob Storage) to back up its data. By eliminating the need to keep servers running on a secondary site, GINJA substantially reduces the monetary and management costs of disaster recovery. Furthermore, our solution includes a configuration model that gives users precise control over the cost, durability and performance trade-offs, and introduces minimal overhead to the performance of the database management system. Additionally, GINJA is implemented as a specialized file system in user space, which brings major benefits in terms of portability and allows it to be easily extended to support other database management systems. Lastly, we have performed an extensive evaluation of our system that covers aspects such as performance, resource usage and monetary costs. The results show that GINJA is capable of performing disaster recovery with small monetary costs (less than 5 dollars for certain practical configurations), while introducing minimal overhead to the database management system (12% overhead for the TPC-C workloads with at most 20 seconds of data loss in case of disasters).

Project(s): SUPERCLOUD

Research line(s): Fault and Intrusion Tolerance in Open Distributed Systems (FIT)

