Main Memory Recovery

main memory failure. In actual use the archive database may be updated directly from data in main memory or from data in the log.

    Fig. 1. - MMDB Model

This view of a MMDB may seem almost identical to that of a conventional database. However, there exist several major differences:

1. Main memory is assumed to be large enough to hold all databases currently being accessed.
2. Any recovery scheme must deal with restoring the MMDB, not data on secondary storage.

3. No database access can be performed against the archive database. Its use is strictly as a backup to the main memory database.

There are several advantages to the use of MMDBs. Obviously, processing time and throughput rates should improve due to the elimination of I/O overhead. It has been suggested that the improved performance can eliminate the need for concurrency control by allowing the serial execution of MMDB transactions. While we do not agree with this observation, it is certainly conceivable that concurrency control mechanisms specifically designed for the MMDB environment can reduce the overhead and complexity usually associated with concurrency control techniques [3].

Some problems not existing in conventional DBMS systems are introduced in the MMDB environment. The major problem deals with the volatility of main memory. To reduce the impact of this problem, it has been proposed that a small stable main memory be used to support recovery processing [3,6]. If a stable memory is assumed, its size must be large enough to contain all updates of active transactions. Another major problem is the excessive overhead needed to initially load a database into memory for processing. This procedure requires merging data on the archive database and the log to obtain the data needed to recover to the most recent consistent state. To reduce this time, archives should be created very frequently and could be distributed across several secondary storage devices. Additional concerns center around the increase in the number of main memory components and the resulting reliability problems and increase in access time. A unique architecture is currently under investigation at Princeton to address these issues [7,8]. We are currently addressing only what appears to be the major problem: MMDB recovery.

MMDB Recovery Defined

The issues concerning traditional database recovery are well known and understood [2,9,11,18]. The purpose of this section is to examine aspects of database recovery under the MMDB assumption. We investigate the various types of failures affecting database processing, describe the recovery operations necessary for MMDBs, and conclude by listing the desired features a MMDB recovery technique should possess.

When discussing MMDB recovery, it is important to realize that the objective is that of recovering data in main memory. With conventional databases, the current database state exists partly in main memory and partly in secondary storage. With MMDBs, the current state exists completely in main memory. Secondary storage is used solely for recovery purposes. Since MMDB processing incurs no I/O, any I/O required to ensure recoverability can have a significant impact on system performance and become the major bottleneck during processing.

As with traditional database processing, a transaction is assumed to be the unit of recovery and consistency, and the failures which must be anticipated are transaction, system, and media failures. The differences between conventional DBMS recovery and MMDB recovery are introduced when the specific operations required to accomplish recovery are examined. TABLE 1 shows the operations needed to recover from the three failure types in both the traditional and MMDB environments.

TABLE 1
DATABASE RECOVERY OPERATIONS

                          Recovery Operations Required
    Failure Type          Traditional DBMS             MMDB
    Transaction Failure   Transaction UNDO             Transaction UNDO
    System Failure        Global UNDO, Partial REDO    Global REDO
    Media Failure         Global REDO                  Global REDO, Partial REDO

Transaction failure occurs when a transaction does not successfully commit. This type of failure occurs more often than the other two, and thus efficient recovery from it is essential. A rule of thumb is that recovery should occur in a time frame similar to that of the successful completion of the transaction [11]. The normal procedure for recovery after a transaction failure is a Transaction UNDO. This implies that all effects the aborted transaction has had on the primary database copy must be removed. The major concern with transaction UNDO in a MMDB environment is that it be done with as little I/O as possible.

If the rule of thumb is to be achieved, no I/O should be required to accomplish transaction UNDO. Additional processing during transaction UNDO involves the removal of any dirty data from the log or archive database. These operations should also be performed with no I/O.

Recovery from a system failure is quite different with MMDBs than with traditional DBMSs. Traditionally, the effects of any interrupted transactions must be undone, a Global UNDO, and any completed transactions which have not had their updates reflected in the database need to be redone, a Partial REDO. When a system failure occurs, the entire MMDB contents are lost. MMDB recovery must therefore perform a Global REDO by completely reloading all databases in main memory. The archive database is used to reload the databases to some prior state, and then any committed transactions reflected in the log are redone. The Global REDO operation is required in conventional systems only after a media failure causes the loss of the primary database copy on secondary storage.

System failures occur less often than transaction failures and more often than media failures [11]. A goal for system failure recovery is that it be accomplished in a time comparable to that required for the successful completion of all active transactions. With MMDBs, a Global REDO requires I/O from the archive database, and thus it seems impossible to achieve this goal. One way to reload the MMDBs as quickly as possible is to ensure that as many of the recent database updates as possible are reflected in the archive database. This implies that log data must be frequently flushed to the archive database.

In traditional database systems a checkpoint is often used to reduce the work needed to recover from system failures [2,9,11,18]. We view a MMDB checkpoint as recording all data concerning a prior database state into the archive database and writing a corresponding checkpoint record on the log. These checkpoints should be done with as little impact on transaction processing as possible. Frequent flushing to the archive database

thus requires frequent checkpointing.

Global REDO loads MMDBs into main memory. However, not all transactions require all data; therefore transactions can begin processing as soon as some of the data they need is available. This problem is similar to the fetching strategies associated with virtual memory management. At least four possibilities exist when loading databases into main memory (a sketch of the four follows the list):

1. Database Prefetching - Loading an entire database into main memory prior to scheduling any transactions accessing it.
2. Page Prefetching - Prefetching some subset of database pages and allowing transactions to begin accessing them as the remainder of the database is loaded.
3. Demand Loading - Only loading a database when some transaction first accesses it.
4. Demand Paging - Loading database pages after the first access to them.
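
To make the alternatives concrete, here is a minimal sketch in Python; the Archive and MMDB classes and all method names are our own illustration rather than part of any system discussed in this paper, and each read_page call stands in for one disk I/O against the archive database.

    from typing import Dict, Set

    class Archive:
        """Stands in for the archive database on disk: page id -> page image."""
        def __init__(self, pages: Dict[int, bytes]):
            self.pages = pages

        def read_page(self, page_id: int) -> bytes:
            return self.pages[page_id]  # models one disk I/O

    class MMDB:
        """Main memory database image being rebuilt during a Global REDO."""
        def __init__(self, archive: Archive):
            self.archive = archive
            self.resident: Dict[int, bytes] = {}

        # 1. Database Prefetching: load everything before any transaction runs.
        def database_prefetch(self) -> None:
            for pid in self.archive.pages:
                self.resident[pid] = self.archive.read_page(pid)

        # 2. Page Prefetching: load a chosen subset now; transactions may begin
        # while the remaining pages are brought in by a background task.
        def page_prefetch(self, subset: Set[int]) -> None:
            for pid in subset:
                self.resident[pid] = self.archive.read_page(pid)

        # 3./4. Demand Loading and Demand Paging defer all work until first
        # access; demand paging faults in a single page, while demand loading
        # would load the whole database at that point.
        def read(self, page_id: int) -> bytes:
            if page_id not in self.resident:  # page fault
                self.resident[page_id] = self.archive.read_page(page_id)
            return self.resident[page_id]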

More research is needed to determine which of these provides the best performance for Global REDO. The method to be used depends on such factors as the database storage structure, the access methods used, the storage location on disk, and whether any transactions have immediate need of the data.

Although a media failure may occur only once or twice a year, its impact on the recovery of traditional databases can be severe [11]. A memory failure with MMDBs can be treated as a system failure and a Global REDO performed. However, if the specific location of the failure can be identified, a Partial REDO of only the affected area would be warranted. This indicates that the archive database should be physically structured to correspond with memory addresses. Perhaps partitioning of memory and archive databases, and the ability to recover by these partitions, is needed. Future research will examine this idea of partitioning for partial redos. Media failures can affect the archive database or log. Restoring these files creates similar problems for conventional and MMDB systems. Differences do exist in that the archive database may be needed more frequently than in conventional DBMSs, and thus its quick recovery is more important with MMDBs. With the existence of prior archive databases and corresponding log data, recreating an archive simply requires redoing checkpoint processing.

Authors have ignored the problems associated with the failure of stable memories. The use of stable memories does imply that the overhead for recovery from system failures is greatly reduced; however, there is really no such thing as a forever stable memory. Thus any reliable recovery technique must prepare for the failure of stable memory. This implies that any MMDB recovery technique needs to provide the facilities for Global and/or Partial REDO of all of memory - stable and non-stable.

Another issue concerning MMDB recovery is when log I/O operations occur. It is important that any I/O needed be performed asynchronously to normal data-

base processing. This implies that log I/O should not occur only at commit time, but that it be performed throughout transaction processing. Transaction processing should not be dependent on, or held up by, I/O to the log.

We close this section by summarizing the major requirements for MMDB recovery in the following wish list:

1. No I/O required to accomplish transaction UNDO.
2. Frequent checkpoints performed with minimum impact on transaction processing.
3. Asynchronous processing of log I/O and transaction processing.

Certainly an additional requirement is that a minimum amount of redundant data be used. For example, the use of before images on the log should be avoided if possible.

Previous Recovery Techniques

Prior to introducing the new MMDB recovery method, we examine previously proposed techniques and compare their processing to the items on the wish list.

As stated earlier, IBM has implemented MMDBs in the IMS/VS Fast Path Feature [12,13]. At initialization of IMS, the MMDBs are loaded into main memory. Updates are performed in special database buffers, and MMDB pages are not modified until commit time. Commit processing ensures that all after images are written to the log prior to updating the MMDB. System-wide transaction-consistent checkpoints (TCC) [11] are accomplished by an asynchronous IMS task running in parallel with MMDB processing. The major disadvantages of this scheme are that log I/O is performed only at commit time and that entire MMDBs must be read to perform checkpoints.

The Massive Memory Machine (MMM) project at Princeton University has described an architecture specifically designed to support massive amounts of primary storage [7,8]. Associated with this project is the design of a MMDB recovery scheme based upon a hardware logging device, HALO [6]. HALO intercepts all MMDB operations and creates before-image (BFIM) and after-image (AFIM) log data, initially written to a nonvolatile main memory and, as time permits, written out to the log on disk. The use of the stable main memory implies that commit processing need not wait until the log buffers have been flushed. The BFIM log data is needed to accomplish transaction UNDO. Continuous action-consistent checkpointing (ACC) [11] of log data to the archive database is performed. The asynchronous updating of the log and the parallel, continuous updating of the archive database are certainly advantages of this scheme. However, the requirements for specialized hardware and stable main memory, the interception of all database calls, and the requirement for BFIM log data to accomplish transaction UNDOs are disadvantages.

Researchers at the University of California at Berkeley have investigated some of the implementation concerns for a MMDB [3]. Their recovery scheme assumes the use of a log with BFIM and AFIM data plus frequent checkpointing. The notion of a pre-committed transaction is used to achieve asynchronous logging and database processing. When a transaction commits, the commit record is placed in the log buffer, and other conflicting transactions are allowed to progress even though the log buffers have not been flushed to disk. The transaction completes commit processing only when this flush is done. ACC checkpointing occurs continuously and in parallel with transaction processing by reading the entire MMDB and identifying modified pages. Although not specifically discussed, it appears that an entire database must be loaded into main memory prior to access.

The concept of a Database Cache has been proposed for database systems with large amounts of main memory [5]. It is assumed that there is sufficient memory space to store all dirty pages plus some other pages which have been fixed for reading. The database cache and disk database together represent the current database, just as with traditional DBMS systems. Even though this approach is not strictly a MMDB, an unusual recovery scheme appropriate to MMDBs is proposed. This approach recognizes that shadow main memory pages can be used to eliminate the need for transaction UNDO. To avoid the overhead of loading entire databases, a demand paging technique is used to bring pages into main memory. No log is used; rather, a safe located in nonvolatile memory, containing data needed to reconstruct part of the cache after failure, is maintained. As a minimum, the safe contains all pages not currently residing on the disk database. When memory pages are to be modified, a main memory shadow page is used for updating if the disk database does not contain a copy of the page to be modified. In the event of transaction failure, these shadow pages are simply deleted in the cache. Subsequent transactions will either access the other page in the cache or incur a page fault to bring in a new copy of the page. To commit a transaction, all modified pages are written to the safe. Novel procedures are used to limit the number of records on the safe and to determine the exact state of each page in main memory. Only when a page is targeted for replacement is it written back to the disk database.
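
A rough sketch of the idea, with invented names and the novel safe-management procedures omitted: shadowing updated pages in memory makes transaction UNDO a simple deletion, and commit writes modified pages to the safe rather than to a log.

    from typing import Dict

    class DatabaseCache:
        def __init__(self, disk: Dict[int, bytes]):
            self.disk = disk                     # disk database
            self.cache: Dict[int, bytes] = {}    # clean resident pages
            self.shadow: Dict[int, bytes] = {}   # in-memory shadow copies
            self.safe: Dict[int, bytes] = {}     # nonvolatile safe

        def update(self, page_id: int, image: bytes) -> None:
            # Updates go to a main memory shadow page; the disk copy is untouched.
            self.shadow[page_id] = image

        def abort(self) -> None:
            # Transaction UNDO: simply delete the shadow pages from the cache.
            self.shadow.clear()

        def commit(self) -> None:
            # All modified pages are written to the safe; the disk database is
            # updated only later, when a page is targeted for replacement.
            self.safe.update(self.shadow)
            self.cache.update(self.shadow)
            self.shadow.clear()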

A design for a MMDB, including data structure representation and a recovery technique, has been proposed at IBM [1]. It is assumed that a MMDB relation is loaded into main memory at the first reference and, if modified, written to disk at commit time. All recovery overhead is restricted to the commit operation, achieving a type of transaction-oriented checkpoint (TOC) [11]. No separate log is suggested; rather, shadow pages are used on the archive database. At commit time, all modified relations are written to shadow areas on the archive database. Once this has been accomplished, the new directory structure is updated and the old database areas are released.

A design for a MMDB system, MM-DBMS, is currently under way at the University of Wisconsin-Madison [14,15]. This study includes the design of an architecture, query processing, data structures, and a recovery technique for a MMDB. Recovery processing uses a stable log buffer as well as a special log processor to perform checkpointing. Further details concerning the recovery strategy used are not yet available.

TABLE 2 summarizes the different MMDB recovery techniques described. Although some unique recovery ideas are introduced, none of the techniques satisfies all items on our wish list. The Fast Path Feature, DB Cache, and Ammann techniques do not meet the asynchronous log I/O and transaction processing requirement because all log I/O overhead occurs at transaction commit. The MMM and Berkeley methods require BFIM and AFIM log data and must perform log I/O operations to accomplish a transaction UNDO. The information shown in the last row of TABLE 2 describes the type of checkpointing performed by the various techniques. Part of this data indicates whether checkpointing is accomplished by reading the MMDB or log data.

TABLE 2
COMPARISON OF PREVIOUS MMDB RECOVERY TECHNIQUES

Checkpoints for the Fast Path and Berkeley techniques require examining all MMDB pages and therefore must impact transaction processing when checkpointing occurs.

A Better Technique

In this section we describe preliminary results concerning a new MMDB recovery technique which better satisfies the requirements identified in section 3 than previously proposed methods. The highlights of this new method are:

1. Main memory shadow pages
2. Pre-committed transactions
3. Automatic checkpointing
4. Recovery processor

Each of these is discussed in the following paragraphs. A follow-up paper will define this technique in more detail as well as provide simulation results examining its performance. We assume that no stable main memory exists, but note any changes which would be needed in the event nonvolatile main memory were available. The impact of concurrency control is not included in this discussion. It may be assumed that transactions are either run serially or that two-phase locking at the page level is used.

Main memory shadow pages (similar to those proposed for the DB Cache [5]) are used to achieve the goal of no I/O for transaction UNDO. Duplicate copies are made of any pages updated by a transaction, and all modifications occur on these pages. As pages are modified, AFIM records are written into the log buffer for output to disk. At commit time, a commit record is also written into the buffer for output. Subsequent conflicting transactions can begin processing as soon as this occurs. They will use the data in the dirty pages and, if needed, create new copies for their own updates. If a transaction commits (commit record written to disk), the previous clean pages are released and the dirty pages become the new clean ones. When a transaction abnormally terminates, the dirty pages are released.
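
A minimal sketch of this mechanism, assuming invented names and record shapes; note that only after images ever reach the log buffer.

    from typing import Dict, List, Tuple

    class ShadowedTransaction:
        """One transaction's view: clean pages are shared, dirty pages are shadows."""
        def __init__(self, clean: Dict[int, bytes], log_buffer: List[Tuple]):
            self.clean = clean  # current committed page images
            self.dirty: Dict[int, bytes] = {}
            self.log_buffer = log_buffer

        def write(self, page_id: int, image: bytes) -> None:
            # All modifications occur on duplicate (shadow) copies, and each
            # update produces an AFIM record for the log buffer; no BFIM needed.
            self.dirty[page_id] = image
            self.log_buffer.append(("AFIM", page_id, image))

        def commit(self) -> None:
            # Queue the commit record; once it reaches disk the dirty pages
            # become the new clean ones (modeled here as immediate).
            self.log_buffer.append(("CT",))
            self.clean.update(self.dirty)
            self.dirty.clear()

        def abort(self) -> None:
            # Transaction UNDO with no I/O: just release the dirty pages.
            self.dirty.clear()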

To accomplish asynchronous log I/O and transaction processing, the pre-commitment technique is used. As explained above, transactions need not wait until a conflicting transaction has completely committed prior to beginning execution. Also, log I/O is performed throughout a transaction's execution rather than just at commit time. Indeed, a transaction cannot commit until all of its log buffers are written to disk, but without a stable main memory for the buffer this cannot be avoided. If stable main memory were available, the pre-commitment technique would not be necessary, and completely independent log I/O and transaction processing would be possible.
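
The pre-commitment idea can be sketched as follows; the names and the durable-prefix bookkeeping are our own illustration, not the paper's implementation.

    class PrecommitLog:
        def __init__(self):
            self.records: list = []  # volatile log buffer
            self.durable_upto = 0    # length of the prefix already on disk

        def precommit(self, txn_id: int) -> int:
            # Buffer the commit record; locks may be released immediately and
            # conflicting transactions may use the pre-committed data.
            self.records.append(("CT", txn_id))
            return len(self.records)  # position the record must reach on disk

        def flush(self) -> None:
            # Performed by the logger throughout processing, not just at commit.
            self.durable_upto = len(self.records)  # models the disk write

        def is_committed(self, commit_pos: int) -> bool:
            # True commit: the commit record lies within the durable prefix.
            return self.durable_upto >= commit_pos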

The log contains Begin-Transaction (BT), Commit-Transaction (CT), Abort-Transaction (AT), Checkpoint, and AFIM records. All log records except the checkpoint contain the ID of the corresponding transaction. The BT record contains a flag that indicates the state of the corresponding transaction. It is initially written with an indication that the transaction is active; when the transaction is committed or aborted, the flag is appropriately modified. This random access to the log implies that a random access device must be used. As explained below, this flag is used during checkpointing to avoid flushing dirty data to the archive database. It also eliminates the need for removing this dirty data during transaction UNDO.

To accomplish automatic checkpointing, the logger keeps track of the state of the log. Assuming an initial state of 0, state transitions occur when BT, CT, or AT records are written to the log. The BT record increments the state value by 1, while the CT and AT records decrement it by 1. A state value of 0 indicates that a transaction-consistent (TCC) state exists on the disk. When the logger detects a state value of 0, a checkpoint record is written out to the disk before any other records already in the buffer. To find the most recent checkpoint record on the log, the logger also notes the address of this checkpoint record and records it, along with its unique checkpoint ID, at a predefined fixed disk address. Performing this automatic checkpointing requires only two additional I/O operations and is performed independently of transaction processing. To ensure that TCC checkpoints occur, the logger may need to periodically force checkpoints even though the checkpoint state has not occurred.
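
The logger's state counter can be sketched as follows (names and record shapes are assumptions): BT raises the count, CT and AT lower it, and a return to zero queues a checkpoint record.

    class Logger:
        def __init__(self):
            self.state = 0          # transactions currently active on the log
            self.buffer: list = []  # records awaiting output to disk
            self.checkpoint_id = 0

        def append(self, kind: str, txn_id: int) -> None:
            if kind == "BT":
                self.state += 1
            elif kind in ("CT", "AT"):
                self.state -= 1
            self.buffer.append((kind, txn_id))
            if self.state == 0:
                # No transaction is active: once the buffer is flushed, a
                # transaction-consistent (TCC) state exists on disk, so a
                # checkpoint record is queued immediately, ahead of records
                # from any transactions that start later. Recording its ID
                # and log address at a fixed disk location is not shown.
                self.checkpoint_id += 1
                self.buffer.append(("CHECKPOINT", self.checkpoint_id))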

To force a checkpoint in this situation, the logger must write all log records of actively executing transactions to the log before allowing any new transactions to begin writing to the log. Unlike the techniques used in current database systems, where processing is quiesced to reach a TCC, this technique has no impact on executing transactions.

Actual checkpointing to the archive database is accomplished by a separate Recovery Processor (RP); a sketch of its procedure follows Fig. 2. Checkpointing is a two-step process: creating a new copy of the archive database, and then applying log AFIM records for committed transactions to bring the new archive database state up to that indicated by the latest checkpoint on the log. When an archive database is first created, the checkpoint ID and associated address of the latest log checkpoint record are written into a predefined location on the archive database. All AFIM records for successfully committed transactions (as identified by the BT log record) between the checkpoint address on the previous archive database and the address of the latest checkpoint are applied to the archive. If a checkpoint is attempted but the checkpoint ID on the log is the same as that on the prior archive, then the RP must wait until a new checkpoint is taken. To avoid the overhead of copying the entire archive, many checkpoints could be applied to the same copy, with the creation of a new archive database requested at periodic time intervals.

Figure 2 shows the model of the proposed MMDB recovery system. The use of main memory shadow pages ensures that only AFIMs are needed on the log and that no I/O is required for transaction UNDO. Checkpoint records are automatically written to the log, and continuous checkpointing to the archive database is performed by the RP. Both of these operations are performed in parallel with and asynchronously to MMDB transaction processing. Without stable main memory, pre-commitment of transactions and continuous writing to the log buffer provide asynchronous log I/O and transaction processing. With the use of stable memory, a nonvolatile log buffer removes the need for pre-commitment and achieves a truly asynchronous operation of log I/O and transaction execution.

    Fig. 2. - Model of Proposed MMDB Recovery System
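
The RP's two-step procedure described above might look like the following sketch; the record layout, the txn_state map (standing in for the BT-record flags), and all names are assumptions for illustration only.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ArchiveDB:
        pages: Dict[int, bytes] = field(default_factory=dict)
        ckpt_id: int = 0    # ID of the latest log checkpoint applied
        ckpt_addr: int = 0  # log address of that checkpoint record

    # A log record here is (address, kind, txn_id, page_id, after_image).
    def rp_checkpoint(old: ArchiveDB,
                      log: List[Tuple[int, str, int, int, bytes]],
                      txn_state: Dict[int, str],
                      latest_id: int, latest_addr: int) -> ArchiveDB:
        if old.ckpt_id == latest_id:
            raise RuntimeError("no new log checkpoint yet; the RP must wait")
        # Step 1: create a new copy of the archive database.
        new = ArchiveDB(dict(old.pages), latest_id, latest_addr)
        # Step 2: apply AFIMs of committed transactions recorded between the
        # previous archive's checkpoint address and the latest checkpoint.
        for addr, kind, txn, page, afim in log:
            if (old.ckpt_addr < addr <= latest_addr and kind == "AFIM"
                    and txn_state.get(txn) == "COMMITTED"):
                new.pages[page] = afim
        return new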

Summary and Future Research

After defining MMDB recovery, identifying its requirements, and surveying the literature, we have proposed a new MMDB recovery technique which better meets these requirements than previous methods. Our technique requires no transaction UNDO processing after a transaction failure, uses asynchronous log I/O and transaction processing to reduce recovery overhead during transaction processing, and provides continuous system checkpoints in parallel with, yet with no impact on, transaction processing. Recent MMDB recovery performance studies have shown that the major factor concerning efficient recovery is the use of stable memory [4,17]. Next to stable memory, the use of additional logging and checkpointing processors can also impact performance [4]. This is the only known technique proposing the use of a special checkpoint processor.

Many areas for future research remain. A major area of study will address the issue of efficient loading for MMDBs, including the idea of partitioning. This will examine methods for distributing log and archive database information across multiple secondary storage devices. Along this line we also intend to evaluate various storage techniques to be used for the archive databases. Currently, a simulation study is being performed to more accurately compare the proposed technique to previous ones. A future paper will more precisely define our proposed technique and present the results of the simulation experiments.

References

[1] Arthur C. Ammann, Maria Butrico Hanrahan, and Ravi Krishnamurthy, "Design of a Memory Resident DBMS," Proceedings of the IEEE Spring Computer Conference, 1985, pp. 54-57.
[2] C. J. Date, An Introduction to Database Systems, Volume II, Addison-Wesley Publishing Company, July 1984, pp. 1-33.
[3] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David Wood, "Implementation Techniques for Main Memory Database Systems," Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 1-8.
[4] Margaret H. Eich, "A Classification and Comparison of Main Memory Database Recovery Techniques," Southern Methodist University Department of Computer Science Technical Report 86-CSE-15, June 1986.
[5] Klaus Elhardt and Rudolf Bayer, "A Database Cache for High Performance and Fast Restart in Database Systems," ACM Transactions on Database Systems, Vol. 9, No. 4, December 1984, pp. 503-525.

[6] Hector Garcia-Molina, Richard J. Lipton, and Peter Honeyman, "A Massive Memory Database System," Princeton University Department of Electrical Engineering and Computer Science Technical Report, September 1983.
[7] Hector Garcia-Molina, Richard J. Lipton, and Jacobo Valdes, "A Massive Memory Machine," IEEE Transactions on Computers, Vol. C-33, No. 5, May 1984, pp. 391-399.
[8] Hector Garcia-Molina, Richard Cullingford, Peter Honeyman, and Richard Lipton, "The Case for Massive Memory," Princeton University Department of Electrical Engineering and Computer Science Technical Report 326, May 1984.
[9] J. N. Gray, "Notes on Data Base Operating Systems," Lecture Notes in Computer Science No. 60, Springer-Verlag, 1978, pp. 394-481.
[10] Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, and Irving Traiger, "The Recovery Manager of the System R Database Manager," Computing Surveys, Vol. 13, No. 2, June 1981, pp. 223-242.
[11] Theo Haerder and Andreas Reuter, "Principles of Transaction-Oriented Database Recovery," Computing Surveys, Vol. 15, No. 4, December 1983, pp. 287-317.
[12] IBM, IMS/VS Version 1 Fast Path Feature General Information Manual, GH20-9069-2, April 1978.
[13] IBM World Trade Systems Centers, IMS Version 1 Release 1.5 Fast Path Feature Description and Design Guide, G320-5775, 1979.
[14] Tobin J. Lehman and Michael J. Carey, "A Study of Index Structures for Main Memory Database Management Systems," University of Wisconsin-Madison Computer Sciences Department Technical Report #505, July 1985.
[15] Tobin J. Lehman and Michael J. Carey, "Query Processing in Main Memory Database Management Systems," Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1986.
[16] Raymond A. Lorie, "Physical Integrity in a Large Segmented Database," ACM Transactions on Database Systems, Vol. 2, No. 1, March 1977, pp. 91-104.
[17] Kenneth Salem and Hector Garcia-Molina, "Crash Recovery Mechanisms for Main Storage Database Systems," Princeton University Department of Computer Science Technical Report CS-TR-034-86, April 1986.
[18] Joost S. M. Verhofstad, "Recovery Techniques for Database Systems," Computing Surveys, Vol. 10, No. 2, June 1978, pp. 167-195.
