Application API: Difference between revisions

From wiki.zmanda.com
No edit summary
m (fix broken links in the Application API page)
 
(85 intermediate revisions by 6 users not shown)
Line 1: Line 1:
==Introduction==
<div style="float:right">__TOC__</div>
There are two compelling reasons to introduce the Application API:
This page documents the Application API from a developer's perspective -- in particular, someone interested in modifying an existing application or creating a new one.  For the basics of ''using'' the Application API in an Amanda configuration, see [[How To:Use Amanda Applications on a Client]]. Note that the implementation of the Application API is [[Application API/Implementation|still in progress]].
* To allow recovery of a single file without transmitting the entire dump to the client.
* To make it easier to add new dumper applications.
The Application API addresses these needs by changing the backup, restore, selfcheck, and other Amanda client operations.


Historically, Amanda has focused on managing large indivisible chunks of data generated by one of only a few hard-coded applications (generally either GNU tar, some kind of dump, or smbclient). The Application API aims to address both limitations: That is, the Application API will modularize the existing tool support, so that it is easy (even trivial) to add a new tool. And the Application API will extend Amanda beyond simple large chunks of data, allowing backup and restore of only part of a backup job.
Most of the useful content is held in subpages:
* [[Application API/Terminology | Terminology]] describes some of the terms used around the API
* [[Application API/Operations | Operations]] describes the API operations in detail
* [[Application API/DAR | DAR]] describes DAR (Direct access recovery)
* [[Application API/Implementation | Implementation]] gives the roadmap for the API's implementation in Amanda
* [[Application API/Misconceptions | Misconceptions]] will set your thinking straight about how the API works


This proposal is ambitious, in that it will require major changes to fundamental Amanda infrastructure. Even more, this will require a change in the way Amanda developers and users think about their backups. The hurdles are not small, but they are surmountable, and the benefits are great.
==Background==
 
There are two compelling reasons to introduce the Application API:
With the Application API in place, we can easily support not only new kinds of dump formats, but entirely new non-filesystem applications as well: Look for support of dumping PostgreSQL and Oracle databases, Subversion repositories, even Microsoft Exchange.
* To allow recovery of a single file without transmitting the entire backup archive to the client.
* To make it easier to support new client backup mechanisms, both at the filesystem and application level.
The Application API addresses these needs by changing the way Amanda client operations work.


=== Application API vs Dumper API ===
Historically, Amanda has focused on managing large chunks of data generated by one of only a few hard-coded applications (generally either GNU '''tar''', some version of '''dump''', or '''smbclient'''). The Application API addresses both limitations in the following manner:
This Application API replaces and supersedes the previous [[Dumper API]] proposal.


Why make this change? The Dumper API has a number of limitations that the Application API avoids:
*It provides modular support for adding client backup tools, both for filesystems and applications such as databases, mail servers, etc.  
* The Dumper API had no restore support at all (only backup)
*It extends Amanda to allow more granular backup and restore options.
* The Dumper API included a lot of the details of dumping applications in the API itself.
* The Dumper API requires transmitting the entire archive to a client to extract even a single file.
* No Dumper API implementation work exists, though the proposal is over 4 years old.


=== Backward Compatibility ===
Although the Application API will make big changes to Amanda's core, it is important to maintain backward compatibility. Since these concepts are an extension of existing behavior, we can easily support old clients and tapes as follows:
The Application API maintains backward compatibility by extending existing behavior rather than replacing it. Essentially, it adds "APPLICATION" as an alternative program to "GNUTAR" and "DUMP". The latter two options remain unchanged.
* Old clients can be dumped as before. The server will write data to tape in the old way.
* Old tapes can be read as before.
** When restoring to a new client (one with the Application API), the server will provide the old dump as one large collection.
** When restoring to an old client, the restore works as before.
* Old clients cannot restore data dumped by new clients.
* Old clients can be restored only
 
This level of backward compatibility will not come easily, but should be achievable.
 
== Nomenclature ==
 
This nomenclature is derived from the SCSI command-set standard INCITS T10/1731-D.
 
A '''User Object''' is the basic unit of backup and restore, as far as the user is concerned. Right now a user object is a file or directory, but in the future other kinds of data may be supported. Each user object has a (hierarchical) identifier and a set of associated attributes. Also, each user object is entirely contained within some set of collections, but a single collection may contain data from multiple user objects.
 
A '''Collection''' is the basic unit of backup and restore, as far as the media back-end is concerned. A collection is the smallest thing that can be stored or retrieved from media.
 
Each collection and user object may originate from only a single backup job, collection merge, or collection copy/migration.
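
The relationship between user objects and collections can be sketched as a small data model. This is only an illustration of the nomenclature above: the class names, field names, and sample paths are assumptions made for the example, not part of any Amanda interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Collection:
    """Smallest unit the media back-end can store or retrieve."""
    collection_id: int  # hypothetical identifier
    size: int           # size in bytes


@dataclass
class UserObject:
    """Unit of backup and restore as the user sees it (for example, a file)."""
    identifier: str                                   # hierarchical, e.g. a path
    attributes: dict = field(default_factory=dict)    # e.g. permissions, timestamps
    collection_ids: list = field(default_factory=list)


# Two small files share collection 0; one large file spans collections 1 and 2.
collections = [Collection(0, 65536), Collection(1, 65536), Collection(2, 65536)]
objects = [
    UserObject("/etc/hosts", {"mode": 0o644}, [0]),
    UserObject("/etc/passwd", {"mode": 0o644}, [0]),
    UserObject("/var/big.log", {"mode": 0o600}, [1, 2]),
]


def objects_in(collection_id, user_objects):
    """Return identifiers of user objects stored, at least partly, in a collection."""
    return [o.identifier for o in user_objects if collection_id in o.collection_ids]
```

Here a single collection holds data from several user objects, while each user object is fully contained in its listed collections.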
 
== Application API Operations ==
 
Implementing the Application API will require changes to the backup server, but most of the code that constitutes the API itself resides on the client. The operations listed below are from the perspective of the backup clients.
 
=== Backup ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Action: Reads relevant information
 
Outputs: A set of collections (contains backup data), and information on a set of user objects (identifier, attributes, associated collections)
 
=== Restore ===
Input: List of user objects to be restored, relevant collections, and location to restore to.
 
Action: Reads the collections and writes the relevant user objects in their original form to some location.
 
Output: None (other than administrative messages)
 
=== Reindex ===
Input: Octet stream with all the collections from a single job.
 
Output: Byte offsets for each collection in the stream, and information on the set of user objects in the stream.
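
As a rough sketch of the offset part of this output, suppose the application has already worked out the size of each collection while reading the stream, and that collections are stored back to back. Both points are assumptions for the example; how a real driver discovers collection boundaries depends on its own format.

```python
def collection_offsets(sizes):
    """Return the starting byte offset of each collection in the stream.

    `sizes` lists collection sizes in stream order (hypothetical input).
    """
    offsets = []
    position = 0
    for size in sizes:
        offsets.append(position)
        position += size
    return offsets
```

For example, three collections of 512, 1024, and 512 bytes start at offsets 0, 512, and 1536.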
 
=== Estimate ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Output: An estimate of how much space this data set will consume.
 
=== Selfcheck ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Action: Determines if there are any configuration problems.
 
Output: Success or failure.
 
=== Capabilities ===
Input/Action: none.
 
Output: Capabilities of this application driver. For example, the application may not support exclusion. This command can also tell if this driver can read a dump from some other version of the same driver.
 
=== Print-Command ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Output: Prints a one-line command, if one exists, to restore this data from tape. This can be used for non-Amanda bare-metal disaster recovery.
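
Taken together, the operations above could be pictured as one client-side interface. The sketch below is hypothetical: the class name, method names, and parameter names are invented for illustration, and the proposal does not define a programming interface or a line protocol.

```python
from abc import ABC, abstractmethod


class Application(ABC):
    """Hypothetical client-side driver exposing the operations listed above."""

    @abstractmethod
    def backup(self, target):
        """Read `target`; produce collections plus user-object information."""

    @abstractmethod
    def restore(self, user_objects, collections, destination):
        """Write the requested user objects from `collections` to `destination`."""

    @abstractmethod
    def reindex(self, stream):
        """Return collection byte offsets and user-object information for a job."""

    @abstractmethod
    def estimate(self, target):
        """Return an estimate of the space the backup of `target` will consume."""

    @abstractmethod
    def selfcheck(self, target):
        """Report success or failure after checking for configuration problems."""

    @abstractmethod
    def capabilities(self):
        """Describe what this driver supports (for example, exclusion)."""

    @abstractmethod
    def print_command(self, target):
        """Return a one-line restore command, if one exists."""
```

A concrete driver (for GNU tar, dump, or a database) would subclass this and implement all seven methods; Python refuses to instantiate a subclass that leaves any of them out.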
 
== Examples ==
Here are some examples of how the generic nomenclature might be applied in a particular application driver.
 
=== Dump ===
User object => Filesystem object (file, directory, socket, pipe, etc.)<BR>
Collection => Entire filesystem
 
=== GNU tar ===
User object => Archive object (file, directory, etc.)<BR>
Collection => one 512-byte tar block.
 
Note that having such collections can be problematic; see below.
 
=== SQL database ===
User Object => Database table<BR>
Collection => Entire database
 
==== Alternative SQL database ====
User Object => Table row<BR>
Collection => Entire database
 
This conception is only useful if you have very large table rows; otherwise, the indices will be as big as the original database!
 
=== Media Formats ===
At present, there are two tape formats: Traditional and Spanned. In the future, more formats might be added, but for now, these are the only two. The Application API will not change this. Indeed, the on-tape format will not change at all; you could still restore a gnutar dump under the Application API using some earlier version of Amanda. The only thing that changes is the terminology:
 
==== Traditional ====
On dump, we continue to write a 32k Amanda header followed by the complete set of collections as provided by the client. As before, these are written as 32k tape blocks. On restore, we can use the BSR command to seek the tape drive to the appropriate tape block, and then read the interesting collections. The only thing this requires us to maintain in the index is the byte offset of each collection.
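
The arithmetic for finding a collection on a Traditional-format tape might look like the following sketch. It assumes that the header occupies exactly one 32 KiB block (block 0) and that byte offsets are measured from the start of the backup data after the header; the function name is made up for the example.

```python
BLOCK_SIZE = 32 * 1024  # 32 KiB tape block, as described above


def locate_in_traditional(byte_offset):
    """Map a collection's byte offset to (tape_block, offset_within_block).

    Block 0 is assumed to hold the Amanda header, so data starts at block 1.
    """
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    block, within_block = divmod(byte_offset, BLOCK_SIZE)
    return block + 1, within_block
```

Under these assumptions, a collection starting at byte offset 100000 begins 1696 bytes into tape block 4.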
 
If you want to recover without Amanda, you can still use <tt>dd</tt> to read the tape contents into the tool directly.
 
==== Spanned ====
When using the Spanned tape format with the Application API, we again take the complete set of collections as provided by the client, treating it as a single BLOB, and dividing it into chunks. We continue to write a 32k Amanda header at the beginning of each chunk, followed by a number of 32k tape blocks. On restore, we again can use the BSR command to seek the tape drive to the appropriate tape block, and read the interesting collections. Under this scenario, we must again note where each collection is stored. This could be done by storing the byte offset of each collection, along with chunk information for the job as a whole, or else by storing the chunk and byte offset for each piece of each collection.
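
The first indexing option (a byte offset per collection plus chunk information for the whole job) can be illustrated with a short calculation. The sketch assumes every chunk holds the same number of 32 KiB data blocks after its one-block header; `blocks_per_chunk` and the function name are hypothetical, and real chunk sizes may vary.

```python
BLOCK_SIZE = 32 * 1024  # 32 KiB tape block


def locate_in_spanned(byte_offset, blocks_per_chunk):
    """Map a byte offset in the job's data to (chunk, block_in_chunk, offset_in_block).

    Block 0 of each chunk is assumed to be that chunk's Amanda header.
    """
    if byte_offset < 0 or blocks_per_chunk <= 0:
        raise ValueError("invalid offset or chunk size")
    chunk_data_size = blocks_per_chunk * BLOCK_SIZE
    chunk, within_chunk = divmod(byte_offset, chunk_data_size)
    block, within_block = divmod(within_chunk, BLOCK_SIZE)
    return chunk, block + 1, within_block
```

With four data blocks per chunk, byte offset 200000 falls in chunk 1, block 3, 3392 bytes into that block.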
 
== Clarifying Common Misconceptions ==
Because the Application API represents a major departure from historical Amanda thinking, misconceptions are common. This section attempts to address some of the most common.
 
=== The exact location of user objects is known. ===
Amanda can restore a user object only by retrieving the associated collections. So, Amanda doesn't know the exact location of any user object (besides which collections it's in), but it does have enough information to efficiently restore a user object without reading the whole dump -- assuming that collections are smaller than a dump.
 
To put it another way, the object may not be found at any particular byte offset in the backup. Even if it could, Amanda wouldn't know that offset. But nonetheless Amanda has sufficient information to do an efficient restore.
 
=== Collections must be very small (or very big) ===
Although Amanda will not enforce any particular size restriction on a collection, the optimal size for a collection is on the order of the size of a user object. In general, there is not much advantage to having collections smaller than about 64k. Very small collections will make for a larger index, and very large collections may make for slower restore, if the user is only interested in a particular (small) user object from that collection.
 
=== The server can understand a collection ===
As today, the server doesn't know anything about the collections on media -- it can only store and retrieve them. An entire collection (not an entire job) must be sent to an Amanda client running the same Application API for interpretation.
 
Note, however, that this Amanda client may be on the same physical machine as the Amanda server.
 
=== Inputs and outputs above are associated with particular sockets ===
As there is as yet no line protocol associated with this API, it would be premature to talk about particular sockets. But it is very possible that all output data (octet stream, collection byte offsets, and user object information) will be multiplexed in a single network socket.
 
=== On restore, the client may seek to a particular place in the backup data ===
Although the client could do this, the server doesn't know anything about it. Rather, the server provides the set of collections that includes all the user objects of interest. Then (and only then) the client goes about restoring user objects from this set of collections.
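
The server-side step described here amounts to a lookup in the index. The following minimal sketch assumes the index is available as a mapping from user object identifier to the collections containing it; that representation and the function name are assumptions for the example.

```python
def collections_for(requested, index):
    """Return the sorted, de-duplicated collection ids covering `requested`.

    `index` maps each user object identifier to a list of collection ids.
    """
    needed = set()
    for identifier in requested:
        needed.update(index[identifier])
    return sorted(needed)


# Hypothetical index for one backup job.
index = {
    "/etc/hosts": [0],
    "/etc/passwd": [0],
    "/var/big.log": [1, 2],
}
```

The server would send only these collections to the client, which then extracts the requested user objects from them.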


=== The data stream sent to the server is opaque to the server ===
* Legacy clients can be dumped as before. The server writes data to tape in the legacy format.
Although the collection data itself is opaque, the other data (collection sizes, user object identifiers and attributes) is very much interpreted by the server. There should, for example, be a standard way of representing file permissions and timestamps as user object attributes.
* Legacy tapes can be read as before.
** When restoring to a new client (one using the Application API), the server provides the legacy dump as one large collection.
** When restoring to a legacy client, the restore works as before.
* Legacy clients cannot restore data backed up by Application API clients; legacy clients can be restored only from legacy dumps.

Latest revision as of 09:09, 17 December 2023

This page documents the Application API from a developer's perspective -- in particular, someone interested in modifying an existing application or creating a new one. For the basics of using the Application API in an Amanda configuration, see How To:Use Amanda Applications on a Client. Note that the implementation of the Application API is still in progress.

Most of the useful content is held in subpages:

  • Terminology describes some of the terms used around the API
  • Operations describes the API operations in detail
  • DAR describes DAR (Direct access recovery)
  • Implementation gives the roadmap for the API's implementation in Amanda
  • Misconceptions will set your thinking straight about how the API works

Background

There are two compelling reasons to introduce the Application API:

  • To allow recovery of a single file without transmitting the entire backup archive to the client.
  • To make it easier to support new client backup mechanisms, both at the filesystem and application level.

The Application API addresses these needs by changing the way Amanda client operations work.

Historically, Amanda has focused on managing large chunks of data generated by one of only a few hard-coded applications (generally either GNU tar, some version of dump, or smbclient). The Application API addresses both limitations in the following manner:

  • It provides modular support for adding client backup tools, both for filesystems and applications such as databases, mail servers, etc.
  • It extends Amanda to allow more granular backup and restore options.

Backward Compatibility

The Application API maintains backward compatibility by extending existing behavior rather than replacing it. Essentially, it adds "APPLICATION" as an alternative program to "GNUTAR" and "DUMP". The latter two options remain unchanged.

  • Legacy clients can be dumped as before. The server writes data to tape in the legacy format.
  • Legacy tapes can be read as before.
    • When restoring to a new client (one using the Application API), the server provides the legacy dump as one large collection.
    • When restoring to a legacy client, the restore works as before.
  • Legacy clients cannot restore data backed up by Application API clients; legacy clients can be restored only from legacy dumps.