Application API

From The Open Source Backup Wiki (Amanda, MySQL Backup, BackupPC)
Revision as of 15:06, 14 November 2008 by Ted

Introduction

There are two compelling reasons to introduce the Application API:

  • To allow recovery of a single file without transmitting the entire backup archive to the client.
  • To make it easier to support new client backup mechanisms, both at the filesystem and application level.

The Application API addresses these needs by changing backup, restore, selfcheck, and other Amanda client commands.

Historically, Amanda has focused on managing large chunks of data generated by one of only a few hard-coded applications (generally either GNU tar, some version of dump, or smbclient). The Application API addresses both limitations in the following manner:

  • It provides modular support for adding client backup tools, both for filesystems and applications such as databases, mail servers, etc.
  • It extends Amanda to allow more granular backup and restore options.

Application API vs Dumper API

This Application API replaces and supersedes the previous Dumper API proposal.

Why make this change? The Dumper API has a number of limitations that the Application API avoids:

  • The Dumper API has no restore support at all (only backup).
  • The Dumper API includes many details of the dumping applications in the API itself.
  • The Dumper API requires transmitting the entire archive to a client to extract even a single file.
  • No Dumper API implementation work exists, though the proposal is over 4 years old.

Backward Compatibility

The Application API maintains backward compatibility by extending existing behavior rather than replacing it:

  • Legacy clients can be dumped as before. The server writes data to tape in the legacy format.
  • Legacy tapes can be read as before.
    • When restoring to a new client (one using the Application API), the server provides the legacy dump as one large collection.
    • When restoring to a legacy client, the restore works as before.
  • Legacy clients cannot restore data backed up by Application API clients; legacy clients can be restored only from legacy dumps.

Nomenclature

This nomenclature is derived from the SCSI command-set standard INCITS T10/1731-D.

A User Object is the basic unit of backup and restore, from the user perspective. Currently, a user object is a file or directory. In the future other types of data may be supported. Each user object has a hierarchical identifier and a set of associated attributes. Also, each user object is entirely contained within some set of collections, but a single collection may contain data from multiple user objects.

A Collection is the basic unit of backup and restore as it resides on the backup media. A collection is the smallest unit that can be stored or retrieved from media.

Each collection and user object may originate from only a single backup job, collection merge, or collection copy/migration.
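
The relationship between user objects and collections can be illustrated with a small data model. This is only a sketch: the class names, field names, and the collections_needed helper are assumptions made for illustration and are not part of Amanda. It shows how a restore request for a few user objects reduces to the set of collections that must be read from media.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Collection:
    """Smallest unit that can be stored on or retrieved from media."""
    collection_id: int
    offset: int   # byte offset of the collection in the job's data stream
    length: int   # size of the collection in bytes


@dataclass
class UserObject:
    """Unit of backup and restore as the user sees it (a file or directory)."""
    identifier: str                                      # hierarchical identifier
    attributes: dict = field(default_factory=dict)       # e.g. permissions
    collection_ids: list = field(default_factory=list)   # collections holding it


def collections_needed(objects, catalog):
    """Return the collections required to restore the given user objects."""
    ids = set()
    for obj in objects:
        ids.update(obj.collection_ids)
    return [catalog[i] for i in sorted(ids)]


# Hypothetical catalog of four 64k collections from one backup job.
catalog = {n: Collection(n, n * 65536, 65536) for n in range(4)}
passwd = UserObject("/etc/passwd", {"mode": "0644"}, [1])
hosts = UserObject("/etc/hosts", {"mode": "0644"}, [1, 2])

needed = collections_needed([passwd, hosts], catalog)
print([c.collection_id for c in needed])  # [1, 2]
```

Note that collection 1 holds data from both user objects, which matches the rule that a single collection may contain data from multiple user objects.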

Application API Operations

Implementing the Application API requires changes to the backup server, but most of the code that constitutes the API itself resides on the client. The operations listed below are from the perspective of the backup clients.

Backup

Input: Specifies what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Action: Reads the specified object.

Output: A set of collections (containing the backup data), and information on a set of user objects (identifier, attributes, associated collections)

Restore

Input: List of user objects to be restored, relevant collections, and target locations for the restore.

Action: Reads the collections and writes the relevant user objects in their original form to the specified location.

Output: None (other than administrative messages)

Reindex

Input: Octet stream of all the collections from a single job.

Output: Byte offsets for each collection in the stream, and information on the set of user objects in the stream.

Estimate

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Output: An estimate of how much space this data set will consume.

Selfcheck

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Action: Determines if there are any configuration problems.

Output: Success or failure.

Capabilities

Input/Action: none.

Output: Capabilities of this application driver. For example, the application may not support exclusion. This command can also tell if this driver can read a dump from some other version of the same driver.

Print-Command

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Output: Prints a one-line command, if one exists, to restore this data from tape. This can be used for non-Amanda bare-metal disaster recovery.

Examples

Here are some examples of how the generic nomenclature might be applied in a particular application driver.

Dump

User object => Filesystem object (file, directory, socket, pipe, etc.)
Collection => Entire filesystem

GNU tar

User object => Archive object (file, directory, etc.)
Collection => one 512-byte tar block.

Note that having such collections can be problematic; see below.

SQL database

User Object => Database table
Collection => Entire database

Alternative SQL database

User Object => Table row
Collection => Entire database

This conception is only useful if you have very large table rows; otherwise, the indices will be as big as the original database!

Media Formats

At present, there are two tape formats: Traditional and Spanned. In the future, more formats might be added. The Application API will not change this. Indeed, the on-tape format will not change at all; you could still restore a gnutar dump under the Application API using some earlier version of Amanda. The only thing that changes is the terminology:

Traditional

When dumping, we continue to write a 32k Amanda header followed by the complete set of collections provided by the client. As before, these are written as 32k tape blocks. On restore, we can use the BSR command to seek the tape drive to the appropriate tape block, and then read the desired collections. Thus the index need only store the byte offset of each collection.

To recover without Amanda, you can still use dd to read the tape contents into the tool directly.
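
The arithmetic behind seeking to a collection in the Traditional format can be sketched as follows. This assumes that tape blocks are counted from 0 at the start of the dump file, that the Amanda header occupies exactly block 0, and that the index stores a byte offset and length for each collection; the function name is hypothetical.

```python
BLOCK_SIZE = 32 * 1024   # 32k tape blocks
HEADER_BLOCKS = 1        # assumption: the 32k Amanda header fills one block


def blocks_for_collection(offset, length):
    """Map a collection's byte range to (first_block, block_count, skip).

    offset and length are relative to the start of the data stream,
    that is, the first byte after the Amanda header.  skip is the number
    of bytes to discard from the first block that is read.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = HEADER_BLOCKS + offset // BLOCK_SIZE
    last = HEADER_BLOCKS + (offset + length - 1) // BLOCK_SIZE
    return first, last - first + 1, offset % BLOCK_SIZE


print(blocks_for_collection(0, 1))          # (1, 1, 0)
print(blocks_for_collection(40000, 30000))  # (2, 2, 7232)
```

In the second call the collection starts 7232 bytes into tape block 2 and ends inside block 3, so two blocks must be read.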

Spanned

When using the Spanned tape format with the Application API, we again take the complete set of collections provided by the client, treating it as a single BLOB, and dividing it into chunks. We continue to write a 32k Amanda header at the beginning of each chunk, followed by a number of 32k tape blocks. On restore, we again can use the BSR command to find the appropriate tape block and read the desired collections. Under this scenario, we must again note where each collection is stored. This could be done by storing the byte-offset of each collection, along with chunk information for the job as a whole, or else by storing the chunk and byte offset for each piece of each collection.
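
The second indexing option, storing the chunk and byte offset for each piece of each collection, might look like the sketch below. It assumes every chunk carries the same number of data bytes (chunk_size, not counting the chunk's header); the function name and the tuple layout are illustrative only.

```python
def split_into_pieces(offset, length, chunk_size):
    """Split a collection into (chunk_index, offset_in_chunk, piece_length).

    offset and length describe the collection within the job's data
    stream; chunk_size is the number of data bytes stored in each chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = []
    while length > 0:
        chunk = offset // chunk_size
        within = offset % chunk_size
        piece = min(length, chunk_size - within)  # stop at the chunk boundary
        pieces.append((chunk, within, piece))
        offset += piece
        length -= piece
    return pieces


# A 30-byte collection that straddles the boundary between chunks 0 and 1.
print(split_into_pieces(90, 30, 100))   # [(0, 90, 10), (1, 0, 20)]
```

Under the first option only the stream offset would be stored per collection, and the same division by chunk_size would be done at restore time instead.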

Future

There are several limitations in the existing tape formats that may be addressed in the future. A new tape format might also take better advantage of Application API features. But such a change is not directly connected to or required by this API.

Clarifying Common misconceptions

Because the Application API represents a major departure from historical Amanda thinking, misconceptions are common. This section attempts to address some of the most common.

The exact location of user objects is known.

Amanda can restore a user object only by retrieving the associated collections. Aside from tracking the collection that contains it, Amanda doesn't store the exact location of any user object. Amanda still has enough information to efficiently restore a user object without reading the whole dump -- assuming that collections are smaller than a dump.

To put it another way, the object may not be found at any particular byte offset in the backup. Even if it could, Amanda wouldn't know that offset. But nonetheless Amanda has sufficient information to perform restores efficiently.

Collections must not be very small (or very big)

Although Amanda will not enforce any particular size restriction on a collection, the optimal size for a collection roughly corresponds to the size of a user object. In general, there is not much advantage to having collections smaller than about 64k. Very small collections will bloat the index; very large collections may cause slower restores, especially partial restores of small objects from the collection.
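
A quick calculation shows why very small collections bloat the index. The sketch below simply counts how many collections (and therefore how many index entries) a dump of a given size produces; the 1 GiB dump size is an arbitrary figure chosen for illustration.

```python
def collection_count(dump_bytes, collection_bytes):
    """Number of collections needed to hold a dump (ceiling division)."""
    return -(-dump_bytes // collection_bytes)


GIB = 1024 ** 3
print(collection_count(GIB, 512))        # 2097152 (one per 512-byte tar block)
print(collection_count(GIB, 64 * 1024))  # 16384   (64k collections)
```

With 512-byte collections the index needs 128 times as many collection entries as with 64k collections for the same dump.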

The server can understand a collection

As today, the server doesn't know anything about the collections on media -- it can only store and retrieve them. An entire collection (not an entire job) must be sent to an Amanda client running the same Application API for interpretation.

Note, however, that this Amanda client may be on the same physical machine as the Amanda server.

Inputs and outputs above are associated with particular sockets

As there is as yet no line protocol associated with this API, it would be premature to talk about particular sockets. But it is very possible that all output data (octet stream, collection byte offsets, and user object information) will be multiplexed in a single network socket.

On restore, the client may seek to a particular place in the backup data

Although the client could do this, the server doesn't know anything about it. Rather, the server provides the set of collections that includes all the user objects of interest. Then (and only then) the client goes about restoring user objects from this set of collections.

The data stream sent to the server is opaque to the server

Although the collection data itself is opaque, the other data (collection sizes, user object identifiers and attributes) is very much interpreted by the server. There should, for example, be a standard way of representing file permissions and timestamps as user object attributes.

Implementation phases

The implementation will be done in several phases.

Phase 1: Application API that is unaware of collections

  a- tape header change
  b- Small change to the protocol to allow the server to send Application API information to the client.
  c- modify selfcheck/sendsize/sendbackup/amrecover to use the Application API
  d- amrecover/amidxtaped
  support
    --message=line
    --index=line

Phase 2: Tool definition in amanda.conf and amanda-client.conf

  a- conffile
  b- protocol change to send tool property to plugins, <XML> for REQ packet

Phase 3: new format for message in selfcheck/sendsize/sendbackup (XML)

  a- Change in selfcheck/sendsize/sendbackup
  b- Change in amcheck/planner/dumper
  support
    --message=xml

Phase 4: User object index contains attributes

  a- new XML index format
  b- amindexd - sort and parse new index format
  c- protocol between amindexd and amrecover to send attribute
  d- amrecover - can display attribute.
  support
    --index=xml

Phase 5: Allow collection in data stream

  a- the index will contain information about collections
  data stream should not be filtered (compression/encryption); otherwise, we cannot recover a single collection.

Phase 6: amrestore/amfetchdump can restore only one collection instead of a complete data stream

Phase 7: amrecover understands collections

  a- amindexd sends collection information to amrecover
  b- amrecover sends collection information to amidxtaped
  c- amidxtaped sends only the needed collections

Phase 8: filter each collection separately

  a- a new filter must be executed for each collection

Application calling convention

This is subject to change

support command

support [--config config] [--host host] [--disk disk] [--device device] [--PROPERTY_NAME PROPERTY_VALUE]*

Output on fd1
   CONFIG YES|NO
   HOST YES|NO
   DISK YES|NO
   MAX-LEVEL level
   INDEX-LINE YES|NO
   INDEX-XML YES|NO
   MESSAGE-LINE YES|NO
   MESSAGE-XML YES|NO
   RECORD YES|NO
   INCLUDE YES|NO
   INCLUDE-LIST YES|NO
   INCLUDE-OPTIONAL YES|NO
   EXCLUDE YES|NO
   EXCLUDE-LIST YES|NO
   EXCLUDE-OPTIONAL YES|NO
   COLLECTION YES|NO
   CALCSIZE YES|NO
   MULTI-ESTIMATE YES|NO
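
A caller could turn the output of the support command into a dictionary as sketched below. The parsing rules are assumptions based only on the listing above (one "NAME value" pair per line, MAX-LEVEL carrying an integer, everything else YES or NO), and the calling convention itself is subject to change; parse_support is a hypothetical helper name.

```python
def parse_support(text):
    """Parse 'NAME value' lines from the support command into a dict."""
    caps = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name, _, value = line.partition(" ")
        value = value.strip()
        if name == "MAX-LEVEL":
            caps[name] = int(value)          # highest supported dump level
        elif value in ("YES", "NO"):
            caps[name] = (value == "YES")    # boolean capability
        else:
            raise ValueError("unexpected support line: %r" % raw)
    return caps


sample = "CONFIG YES\nMAX-LEVEL 9\nINDEX-LINE YES\nINDEX-XML NO\n"
print(parse_support(sample))
```

The server side could then, for example, refuse to request --index=xml from an application that reports INDEX-XML NO.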

selfcheck command

selfcheck [--message (line|xml)] [--config config] [--host host] [--disk disk] --device device --level level [--record] [--PROPERTY_NAME PROPERTY_VALUE]*

Output on fd1
(if no --message or --message line) (Could be many lines)
   OK [message]
   ERROR [message]
Output on fd1
(if --message xml)
   format not yet defined
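
For the line format, a consumer of selfcheck output might collect the ERROR lines as sketched below. Treating the check as failed when any ERROR line appears is an assumption of this sketch, not something the calling convention states, and selfcheck_passed is a hypothetical name.

```python
def selfcheck_passed(output):
    """Return (ok, errors) for line-format selfcheck output.

    ok is True when no line starts with the ERROR keyword; errors holds
    the message text of every ERROR line.
    """
    errors = []
    for line in output.splitlines():
        status, _, message = line.strip().partition(" ")
        if status == "ERROR":
            errors.append(message)
    return (not errors), errors


print(selfcheck_passed("OK disk is readable\nERROR cannot find gtar\n"))
# (False, ['cannot find gtar'])
```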

estimate command

estimate [--message (line|xml)] [--config config] [--host host] [--disk disk] --device amdevice --level level [--PROPERTY_NAME PROPERTY_VALUE]*

 output on fd1: (if no --message or --message line)
   error message that should be logged.
   SIZE value suffix, where suffix could be K, M, or G
 output on fd1: (if --message xml)
   format not yet defined
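
A SIZE line in the line format could be converted to a byte count as sketched below. Two things here are assumptions: the pattern accepts the suffix with or without a space after the value, since the exact spacing is not pinned down, and the suffixes are taken as binary multiples (1 K = 1024 bytes). parse_size is a hypothetical name.

```python
import re

_SIZE_RE = re.compile(r"^SIZE\s+(\d+)\s*([KMG])$")
_MULTIPLIER = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumed binary units


def parse_size(line):
    """Return the size in bytes for a SIZE line, or None for any other line."""
    match = _SIZE_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)) * _MULTIPLIER[match.group(2)]


print(parse_size("SIZE 10 K"))           # 10240
print(parse_size("SIZE 2M"))             # 2097152
print(parse_size("some other message"))  # None
```

Lines that do not match are returned as None so the caller can log them as messages instead.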

backup command

backup [--message (line|xml)] [--index (line|xml)] [--config config] [--host host] [--disk disk] --device amdevice --level level [--record] [--PROPERTY_NAME PROPERTY_VALUE]*

output on fd1
   data stream
output on fd3
(if no --message or --message line)
   error message
   HEADER variable=value, information that should go in the amanda header.
   SIZE value suffix where suffix could be K, (kilobytes) M, (megabytes) or G (gigabytes)
output on fd3
(if --message xml)
   format not yet defined
output on fd4
(if --index line)
   index stream (One filename by line)
output on fd4
(if --index xml)
   xml index stream (format not yet defined)
Error messages should begin with:
   | for normal output
   ? for strange or error output
   & for unknown output
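
The prefix convention for messages on fd3 can be sketched as a small classifier. The category labels and the handling of lines without a prefix (such as HEADER or SIZE lines) are choices made for this sketch only.

```python
PREFIXES = {"|": "normal", "?": "strange", "&": "unknown"}


def classify(line):
    """Return (kind, text) for one line of line-format message output."""
    kind = PREFIXES.get(line[:1])
    if kind is None:
        return ("unprefixed", line)       # e.g. a HEADER or SIZE line
    return (kind, line[1:].lstrip())      # drop the prefix and padding


print(classify("| 12 files dumped"))    # ('normal', '12 files dumped')
print(classify("? tar: cannot stat"))   # ('strange', 'tar: cannot stat')
print(classify("SIZE 10 K"))            # ('unprefixed', 'SIZE 10 K')
```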

restore command

restore [--message (line|xml)] [--index (line|xml)] [--config config] [--host host] [--disk disk] --device amdevice --level level [--PROPERTY_NAME PROPERTY_VALUE]*

Input on fd0
   what to extract.
Input on fd1
   data stream
Output on fd2
(if no --message or --message line)
   error message
Output on fd2
(if --message xml)
   format not yet defined

index command

index [--message (line|xml)] [--index (line|xml)] [--config config] [--host host] [--disk disk] --device amdevice --level level [--PROPERTY_NAME PROPERTY_VALUE]*

 input fd1:
   data stream
 output on fd3: (if no --message or --message line)
   error message
 output on fd3: (if --message xml)
   format not yet defined
 output on fd4: (if --index line)
   index stream (One filename by line)
 output on fd4: (if --index xml)
   xml index stream

tool property format

Each property is passed as a command-line option. If a property has many values, it must be given one option per value.
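
The rule above can be sketched as a helper that expands a property table into an argument list. It assumes the option name is the property name used verbatim after "--", following the --PROPERTY_NAME PROPERTY_VALUE pattern in the command synopses; property_args is a hypothetical helper, and the values are only examples.

```python
def property_args(properties):
    """Expand {name: value or [values]} into a flat command-line list."""
    args = []
    for name, values in properties.items():
        if isinstance(values, str):
            values = [values]             # a single value is a one-item list
        for value in values:
            args.extend(["--" + name, value])   # one option per value
    return args


props = {"mailto": ["root", "amandabackup"], "SPARSE": "yes"}
print(property_args(props))
# ['--mailto', 'root', '--mailto', 'amandabackup', '--SPARSE', 'yes']
```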

How to use

The application must be defined in amanda.conf

Define the "my_application" application using the "myapplication" binary.

 define application-tool my_application {
    comment "a comment"
    "my_app"                          # inherit config of the my_app application
    plugin  "myapplication"           # name of the application, it must be installed in dumper dir
    property "mailto" "amandabackup"  # can set property
 }


The dumptype must specify the application

Define the "my_dumptype" dumptype using the "my_application" application

 define dumptype my_dumptype {
    program "APPLICATION"
    application "my_application"
 }

Define the "my_dumptype_2" using a modified "my_application" application

 define dumptype my_dumptype_2 {
    program "APPLICATION"
    application {                # define a custom application
       "my_application"          # inherit setting from another application
       property "mailto" "root"  # override property
    }
 }

A DLE using my_dumptype or my_dumptype_2 will use the myapplication application.

Available applications

amgtar

 define application-tool app_amgtar {
     comment "amgtar"
     plugin  "amgtar"
     #property "GNUTAR-PATH" "/path/to/gtar"
     #property "GNUTAR-LISTDIR" "/path/to/gnutar_list_dir"
                   #default from gnutar_list_dir setting in amanda-client.conf
     #property "ONE-FILE-SYSTEM" "yes"  #use '--one-file-system' option
     #property "SPARSE" "yes"           #use '--sparse' option
     #property "ATIME-PRESERVE" "yes"   #use '--atime-preserve=system' option
     #property "CHECK-DEVICE" "yes"     #use '--no-check-device' if set to "no"
 }
 define dumptype dt_amgtar {
     program "APPLICATION"
     application "app_amgtar"
 }

Your DLE must inherit from the dt_amgtar dumptype.

amstar

 define application-tool app_amstar {
     comment "amstar"
     plugin  "amstar"
     #property "STAR-PATH" "/path/to/star"
     #property "STAR-TARDUMP" "/path/to/tardumps"  # default /etc/tardumps
     #property "STAR-DLE-TARDUMP" "no"
         # if 'yes' then create a different tardump file for each DLE,
         # it is required if you do many dumps in parallel (maxdump>1)
     #property "ONE-FILE-SYSTEM" "yes"  #use '-xdev' option
     #property "SPARSE" "yes"           #use '-sparse' option
 }
 define dumptype dt_amstar {
     program "APPLICATION"
     application "app_amstar"
 }

Your DLE must inherit from the dt_amstar dumptype. amstar can only be used to back up a full disk, i.e. the mount point.