Application API: Difference between revisions

From wiki.zmanda.com
No edit summary
m (fix broken links in the Application API page)
 
(96 intermediate revisions by 7 users not shown)
Line 1: Line 1:
==Introduction==
<div style="float:right">__TOC__</div>
Historically, Amanda has focused on managing large indivisible chunks of data generated by one of only a few hard-coded applications (generally either GNU tar, some kind of dump, or smbclient). The Application API aims to address both limitations. First, it will modularize the existing tool support, so that it is easy (even trivial) to add a new tool. Second, it will extend Amanda beyond simple large chunks of data, allowing backup and restore of only part of a backup job.
This page documents the Application API from a developer's perspective -- in particular, someone interested in modifying an existing application or creating a new one. For the basics of ''using'' the Application API in an Amanda configuration, see [[How To:Use Amanda Applications on a Client]].  Note that the implementation of the Application API is [[Application API/Implementation|still in progress]].


This proposal is ambitious, in that it will require major changes to fundamental Amanda infrastructure. Even more, this will require a change in the way Amanda developers and users think about their backups. The hurdles are not small, but they are surmountable, and the benefits are great.
Most of the useful content is held in subpages:
* [[Application API/Terminology | Terminology]] describes some of the terms used around the API
* [[Application API/Operations | Operations]] describes the API operations in detail
* [[Application API/DAR | DAR]] describes DAR (Direct Access Recovery)
* [[Application API/Implementation | Implementation]] gives the roadmap for the API's implementation in Amanda
* [[Application API/Misconceptions | Misconceptions]] will set your thinking straight about how the API works


With the Application API in place, we can support not only new kinds of dump formats with ease, but entirely new non-filesystem applications as well: look for support for dumping PostgreSQL and Oracle databases, Subversion repositories, even Microsoft Exchange.
==Background==
There are two compelling reasons to introduce the Application API:
* To allow recovery of a single file without transmitting the entire backup archive to the client.
* To make it easier to support new client backup mechanisms, both at the filesystem and application level.
The Application API addresses these needs by changing the way Amanda client operations work.


== Nomenclature ==
Historically, Amanda has focused on managing large chunks of data generated by one of only a few hard-coded applications (generally either GNU '''tar''', some version of '''dump''', or '''smbclient'''). The Application API addresses both limitations in the following manner:


A '''User Object''' is the basic unit of backup and restore, as far as the user is concerned. Right now a user object is a file or directory, but in the future other kinds of data may be supported. Each user object has a (hierarchical) identifier and a set of associated attributes. Also, each user object is entirely contained within some set of collections, but a single collection may contain data from multiple user objects.
*It provides modular support for adding client backup tools, both for filesystems and applications such as databases, mail servers, etc.  
*It extends Amanda to allow more granular backup and restore options.


A '''Collection''' is the basic unit of backup and restore, as far as the media back-end is concerned. A collection is the smallest thing that can be stored on or retrieved from media.
=== Backward Compatibility ===
The Application API maintains backward compatibility by extending existing behavior rather than replacing it.  Essentially, it adds "APPLICATION" as an alternative program to "GNUTAR" and "DUMP". The latter two options remain unchanged.


Each collection and user object may originate from only a single backup job.
* Legacy clients can be dumped as before. The server writes data to tape in the legacy format.
 
* Legacy tapes can be read as before.
== Application API Operations ==
** When restoring to a new client (one using the Application API), the server provides the legacy dump as one large collection.
 
** When restoring to a legacy client, the restore works as before.
Implementing the Application API will require changes to the backup server, but most of the code that constitutes the API itself resides on the client. The operations listed below are from the perspective of the backup clients.
* Legacy clients cannot restore data backed up by Application API clients; legacy clients can be restored only from legacy dumps.
 
=== Backup ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Action: Reads relevant information
 
Outputs: A set of collections (contains backup data), and information on a set of user objects (identifier, attributes, associated collections)
 
=== Restore ===
Input: List of user objects to be restored, relevant collections, and location to restore to.
 
Action: Reads the collections and writes the relevant user objects in their original form to some location.
 
Output: None (other than administrative messages)
 
=== Reindex ===
Input: Octet stream with all the collections from a single job.
 
Output: Byte offsets for each collection in the stream, and information on the set of user objects in the stream.
 
=== Estimate ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Output: An estimate of how much space this data set will consume.
 
=== Selfcheck ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Action: Determines if there are any configuration problems.
 
Output: Success or failure.
 
=== Capabilities ===
Input/Action: none.
 
Output: Capabilities of this application driver. For example, the application may not support exclusion.
 
=== Print-Command ===
Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.
 
Output: Prints a one-line command, if one exists, to restore this data from tape. This can be used for non-Amanda bare-metal disaster recovery.
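
The operations above can be pictured as one interface per application driver. The following is only a rough sketch under stated assumptions: the names <code>ApplicationDriver</code>, <code>UserObject</code> and <code>DictDriver</code> are hypothetical and are not part of Amanda, the "filesystem" is just a Python dict, and only five of the operations are modelled.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class UserObject:
    """Hypothetical record for one user object (not an Amanda type)."""
    identifier: str                                   # hierarchical identifier, e.g. a path
    attributes: dict = field(default_factory=dict)    # e.g. size, permissions
    collections: list = field(default_factory=list)   # indices of the collections it needs


class ApplicationDriver(ABC):
    """Hypothetical interface mirroring the operations listed above."""

    @abstractmethod
    def capabilities(self): ...

    @abstractmethod
    def selfcheck(self, target): ...

    @abstractmethod
    def estimate(self, target): ...

    @abstractmethod
    def backup(self, target): ...

    @abstractmethod
    def restore(self, user_objects, collections, destination): ...


class DictDriver(ApplicationDriver):
    """Toy driver: the data set is a dict of name -> bytes, one collection per entry."""

    def capabilities(self):
        return {"backup", "restore", "estimate", "selfcheck"}

    def selfcheck(self, target):
        # Success only if every entry can be backed up as raw bytes.
        return all(isinstance(value, bytes) for value in target.values())

    def estimate(self, target):
        # Estimated space consumed: total payload size in bytes.
        return sum(len(value) for value in target.values())

    def backup(self, target):
        # Outputs: a list of collections plus information on each user object.
        collections, objects = [], []
        for name in sorted(target):
            collections.append(target[name])
            objects.append(
                UserObject(name, {"size": len(target[name])}, [len(collections) - 1])
            )
        return collections, objects

    def restore(self, user_objects, collections, destination):
        # Rebuild each requested user object from the collections it refers to.
        for obj in user_objects:
            destination[obj.identifier] = b"".join(
                collections[i] for i in obj.collections
            )
```

In this toy model, restoring a single user object needs only the collections listed in that object's record, which is the granularity the API is meant to make possible.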
 
== Examples ==
Here are some examples of how the generic nomenclature might be applied in a particular application driver.
 
=== Dump ===
User object => Filesystem object (file, directory, socket, pipe, etc.)<BR>
Collection => Entire filesystem
 
=== GNU tar ===
User object => Archive object (file, directory, etc.)<BR>
Collection => one 512-byte tar block.
 
Note that having such collections can be problematic; see below.
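
To make the GNU tar mapping concrete, here is a hedged sketch of what a Reindex-style pass over an uncompressed tar stream might compute if every 512-byte block is treated as one collection. It uses Python's standard <code>tarfile</code> module and is not how Amanda itself indexes archives; <code>reindex_tar</code> and <code>build_demo_archive</code> are hypothetical helper names introduced only for this illustration.

```python
import io
import tarfile

BLOCK = 512  # tar block size; in this example one block is one collection


def reindex_tar(stream):
    """Map each archive member to the range of 512-byte collections holding it."""
    index = {}
    with tarfile.open(fileobj=stream, mode="r:") as archive:
        for member in archive:
            # The member's header starts at member.offset.
            first = member.offset // BLOCK
            # The data is padded up to a whole number of blocks.
            data_blocks = -(-member.size // BLOCK)
            end = member.offset_data // BLOCK + data_blocks
            index[member.name] = range(first, end)
    return index


def build_demo_archive():
    """Build a small in-memory ustar archive to run the sketch against."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name, payload in [("a.txt", b"x" * 10), ("b.bin", b"y" * 1000)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, blocks in reindex_tar(build_demo_archive()).items():
        print(name, "-> collections", blocks.start, "to", blocks.stop - 1)
```

For the demo archive, <code>a.txt</code> occupies collections 0 to 1 (one header block, one data block) and <code>b.bin</code> occupies collections 2 to 4. Even a 10-byte file needs two collections here, which hints at why such small collections inflate the index.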
 
=== SQL database ===
User Object => Database table<BR>
Collection => Entire database
 
==== Alternative SQL database ====
User Object => Table row<BR>
Collection => Entire database
 
This conception is only useful if you have very large table rows; otherwise, the indices will be as big as the original database!
 
== Clarifying Common Misconceptions ==
Because the Application API represents a major departure from historical Amanda thinking, misconceptions are common. This section attempts to address some of the most common.
 
=== The exact location of user objects is unknown. ===
Although Amanda knows which collections are required to restore a single user object, the exact location of that object is unknown. To put it another way, the object may not be found at any particular byte offset in the backup. Even if it could, Amanda wouldn't know that offset.
 
=== Collections must be very small (or very big) ===
Although Amanda will not enforce any particular size restriction on a collection, the optimal size for a collection is on the order of the size of a user object. In general, there is not much advantage to having collections smaller than about 64k. Very small collections will make for a larger index, and very large collections may make for slower restore, if the user is only interested in a particular (small) user object from that collection.
 
=== The server can understand a collection ===
As is the case today, the server doesn't know anything about the collections on media -- it can only store and retrieve them. An entire collection (not an entire job) must be sent to an Amanda client running the same Application API for interpretation.
 
Note, however, that this Amanda client may be on the same physical machine as the Amanda server.
 
=== Inputs and outputs above are associated with particular sockets ===
As there is as yet no line protocol associated with this API, it would be premature to talk about particular sockets. But it is very possible that all output data (octet stream, collection byte offsets, and user object information) will be multiplexed in a single network socket.
 
=== On restore, the client may seek to a particular place in the backup data ===
Although the client could do this, the server doesn't know anything about it. Rather, the server provides the set of collections that includes all the user objects of interest. Then (and only then) the client goes about restoring user objects from this set of collections.
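
The server-side step described here amounts to a lookup in the index followed by a set union. A minimal sketch, assuming the index is available as a plain mapping from user object identifier to collection numbers (the function name and the index layout are assumptions made for illustration, not Amanda's actual index format):

```python
def collections_for_restore(index, wanted):
    """Return the sorted collection numbers that cover every wanted user object.

    index  -- mapping of user object identifier -> iterable of collection numbers
    wanted -- identifiers of the user objects the user asked to restore
    """
    needed = set()
    for identifier in wanted:
        # A KeyError here means the object is not in this backup job's index.
        needed.update(index[identifier])
    return sorted(needed)


if __name__ == "__main__":
    index = {"/etc/hosts": [0, 1], "/etc/motd": [1, 2], "/var/log/big": [5]}
    print(collections_for_restore(index, ["/etc/hosts", "/etc/motd"]))
```

The server ships exactly these collections to the client and never looks inside them; locating the individual user objects within the collections is left to the client.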
 
=== The data stream sent to the server is opaque to the server ===
Although the collection data itself is opaque, the other data (collection sizes, user object identifiers and attributes) is very much interpreted by the server. There should, for example, be a standard way of representing file permissions and timestamps as user object attributes.

Latest revision as of 09:09, 17 December 2023

This page documents the Application API from a developer's perspective -- in particular, someone interested in modifying an existing application or creating a new one. For the basics of using the Application API in an Amanda configuration, see How To:Use Amanda Applications on a Client. Note that the implementation of the Application API is still in progress.

Most of the useful content is held in subpages:

  • Terminology describes some of the terms used around the API
  • Operations describes the API operations in detail
  • DAR describes DAR (Direct Access Recovery)
  • Implementation gives the roadmap for the API's implementation in Amanda
  • Misconceptions will set your thinking straight about how the API works

Background

There are two compelling reasons to introduce the Application API:

  • To allow recovery of a single file without transmitting the entire backup archive to the client.
  • To make it easier to support new client backup mechanisms, both at the filesystem and application level.

The Application API addresses these needs by changing the way Amanda client operations work.

Historically, Amanda has focused on managing large chunks of data generated by one of only a few hard-coded applications (generally either GNU tar, some version of dump, or smbclient). The Application API addresses both limitations in the following manner:

  • It provides modular support for adding client backup tools, both for filesystems and applications such as databases, mail servers, etc.
  • It extends Amanda to allow more granular backup and restore options.

Backward Compatibility

The Application API maintains backward compatibility by extending existing behavior rather than replacing it. Essentially, it adds "APPLICATION" as an alternative program to "GNUTAR" and "DUMP". The latter two options remain unchanged.

  • Legacy clients can be dumped as before. The server writes data to tape in the legacy format.
  • Legacy tapes can be read as before.
    • When restoring to a new client (one using the Application API), the server provides the legacy dump as one large collection.
    • When restoring to a legacy client, the restore works as before.
  • Legacy clients cannot restore data backed up by Application API clients; legacy clients can be restored only from legacy dumps.