Application API


Introduction

Historically, Amanda has focused on managing large, indivisible chunks of data generated by one of only a few hard-coded applications (generally GNU tar, some flavor of dump, or smbclient). The Application API aims to address both of these limitations: it will modularize the existing tool support, so that adding a new tool is easy (even trivial), and it will extend Amanda beyond simple monolithic chunks of data, allowing backup and restore of only part of a backup job.

This proposal is ambitious, in that it will require major changes to fundamental Amanda infrastructure. Even more, this will require a change in the way Amanda developers and users think about their backups. The hurdles are not small, but they are surmountable, and the benefits are great.

With the Application API in place, we can support not only new kinds of dump formats with ease, but entirely new non-filesystem applications as well: look for support for dumping PostgreSQL and Oracle databases, Subversion repositories, and even Microsoft Exchange.

Nomenclature

A **User Object** is the basic unit of backup and restore, as far as the user is concerned. Right now a user object is a file or directory, but in the future other kinds of data may be supported. Each user object has a (hierarchical) identifier and a set of associated attributes. Also, each user object is entirely contained within some set of collections, but a single collection may contain data from multiple user objects.

A **Collection** is the basic unit of backup and restore, as far as the media back-end is concerned. A collection is the smallest thing that can be stored or retrieved from media.

Each collection and user object may originate from only a single backup job.
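
To make the two terms concrete, here is a minimal sketch of how a driver might model them. The class and field names are assumptions for illustration only; nothing here is part of the proposal.

```python
# Illustrative sketch of the nomenclature above; names are not part of the proposal.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    """Smallest unit the media back-end can store or retrieve."""
    collection_id: str       # how the server refers to it; the contents stay opaque
    data: bytes              # the collection payload


@dataclass
class UserObject:
    """Smallest unit of backup and restore from the user's point of view."""
    identifier: str                                              # hierarchical, e.g. "/etc/passwd"
    attributes: Dict[str, str] = field(default_factory=dict)     # e.g. permissions, timestamps
    collection_ids: List[str] = field(default_factory=list)      # collections holding its data
    # A single collection may hold data from several user objects, and both
    # collections and user objects belong to exactly one backup job.
```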

Application API Operations

Implementing the Application API will require changes to the backup server, but most of the code that constitutes the API itself resides on the client. The operations listed below are from the perspective of the backup clients.

Backup

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Action: Reads relevant information

Outputs: A set of collections (contains backup data), and information on a set of user objects (identifier, attributes, associated collections)
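
For illustration, a hypothetical tar-style driver's backup operation might look roughly like the sketch below. The function name, the one-collection-per-job framing, and the return shape are all assumptions, not part of the proposal.

```python
# Hypothetical sketch of the Backup operation for a tar-style driver.
import io
import tarfile


def backup(path):
    """Return (collections, user_objects) for the data rooted at `path`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Store member names relative to / so restores can be redirected.
        tar.add(path, arcname=path.lstrip("/"))

    # Here the whole archive becomes a single collection; a real driver
    # would more likely split the stream into several collections.
    collections = [buf.getvalue()]

    # Information on the user objects: identifier, attributes, and which
    # collections contain their data.
    user_objects = [{"identifier": path, "attributes": {}, "collections": [0]}]
    return collections, user_objects
```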

Restore

Input: List of user objects to be restored, relevant collections, and location to restore to.

Action: Reads the collections and writes the relevant user objects in their original form to some location.

Output: None (other than administrative messages)
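
A matching restore sketch, again with hypothetical names and the same simplified tar framing as the backup sketch above:

```python
# Hypothetical sketch of the Restore operation for the tar-style driver above.
import io
import tarfile


def restore(wanted_identifiers, collections, destination):
    """Write the requested user objects, in original form, under `destination`."""
    # The backup sketch stored member names without a leading "/".
    wanted = {name.lstrip("/") for name in wanted_identifiers}
    for blob in collections:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
            members = [m for m in tar.getmembers() if m.name in wanted]
            tar.extractall(path=destination, members=members)
    # Nothing is returned beyond administrative messages.
```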

Reindex

Input: Octet stream with all the collections from a single job.

Output: Byte offsets for each collection in the stream, and information on the set of user objects in the stream.
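
A sketch of reindexing under an assumed framing (a 4-byte big-endian length prefix before each collection); the proposal does not actually specify a wire format.

```python
# Hypothetical sketch of the Reindex operation; the length-prefix framing is assumed.
import struct


def reindex(stream):
    """Return a list of (byte_offset, length) pairs, one per collection."""
    offsets = []
    pos = 0
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break                       # end of stream
        (length,) = struct.unpack(">I", header)
        offsets.append((pos, length))
        body = stream.read(length)      # a real driver would also parse this body
        pos += 4 + len(body)            # to rebuild the user-object information
    return offsets
```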

Estimate

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Output: An estimate of how much space this data set will consume.
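
For a filesystem-style driver, the estimate might be as simple as summing file sizes. This is only a sketch; a real driver would account for compression, dump levels, and so on.

```python
# Hypothetical sketch of the Estimate operation for a filesystem-style driver.
import os


def estimate(path):
    """Rough number of bytes a backup of `path` would consume."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass                    # file vanished or is unreadable
    return total
```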

selfcheck

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Action: Determines if there are any configuration problems.

Output: Success or failure.
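
A sketch of what a filesystem-style driver's selfcheck could test; the checks shown are examples, not a required list.

```python
# Hypothetical sketch of the selfcheck operation for a filesystem-style driver.
import os


def selfcheck(path):
    """Return True on success, False if a configuration problem is detected."""
    return os.path.exists(path) and os.access(path, os.R_OK)
```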

Capabilities

Input/Action: none.

Output: Capabilities of this application driver. For example, the application may not support exclusion.
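
A sketch of a capabilities report; the capability names are invented for illustration.

```python
# Hypothetical sketch of the Capabilities operation; the keys are made up.
def capabilities():
    return {
        "exclude": False,       # this driver cannot honor exclude lists
        "incremental": True,    # supports dump levels
        "estimate": True,       # implements the Estimate operation
    }
```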

Print-Command

Input: Information on what is to be backed up: A filesystem, device, particular set of files, database table, etc.

Output: A one-line command, if one exists, that will restore this data from tape. This can be used for non-Amanda bare-metal disaster recovery.
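
A sketch of Print-Command output for a tar-style driver; the exact command is driver-specific and the one shown here is an assumption.

```python
# Hypothetical sketch of the Print-Command operation for a tar-style driver:
# the printed command restores the data from a raw stream without Amanda.
import shlex


def print_command(path):
    # Getting the job off the tape (mt positioning, dd, etc.) is left to the
    # operator; this is only the application half of the pipeline.
    return "tar -xpf - -C / " + shlex.quote(path.lstrip("/"))
```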

Examples

Here are some examples of how the generic nomenclature might be applied in a particular application driver.

Dump

User object => Filesystem object (file, directory, socket, pipe, etc.)
Collection => Entire filesystem

GNU tar

User object => Archive object (file, directory, etc.)
Collection => One 512-byte tar block

Note that collections this small can be problematic; see the discussion of collection size below.

SQL database

User object => Database table
Collection => Entire database

Alternative SQL database

User object => Table row
Collection => Entire database

This conception is only useful if you have very large table rows; otherwise, the indices will be as big as the original database!

Clarifying common misconceptions

Because the Application API represents a major departure from historical Amanda thinking, misconceptions are common. This section attempts to address some of the most common.

The exact location of each user object is known

Although Amanda knows which collections are required to restore a single user object, the exact location of that object within the backup is unknown. To put it another way, the object may not exist at any particular byte offset in the backup, and even if it did, Amanda would not know that offset.

Collections must be very small (or very big)

Although Amanda will not enforce any particular size restriction on a collection, the optimal size for a collection is on the order of the size of a user object. In general, there is not much advantage to having collections smaller than about 64k. Very small collections will make for a larger index, and very large collections may make for slower restore, if the user is only interested in a particular (small) user object from that collection.
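
A back-of-the-envelope illustration of the trade-off (the numbers below are illustrative, not measurements):

```python
# Rough illustration of the index-size trade-off for 1 GiB of backup data.
backup_size = 1 << 30                   # 1 GiB
for collection_size in (512, 64 * 1024, 16 * 1024 * 1024):
    entries = backup_size // collection_size
    print(f"{collection_size:>10}-byte collections -> {entries:>9} index entries")

# 512-byte collections need about 2 million index entries; 64 KB collections
# need about 16 thousand; 16 MB collections need only 64, but restoring a
# single small user object then means pulling 16 MB off the media.
```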

The server can understand a collection

As is the case today, the server doesn't know anything about the collections on media -- it can only store and retrieve them. An entire collection (though not an entire job) must be sent to an Amanda client running the same Application API driver for interpretation.

Note, however, that this Amanda client may be on the same physical machine as the Amanda server.

The inputs and outputs above are associated with particular sockets

As there is as yet no line protocol associated with this API, it would be premature to talk about particular sockets. But it is very possible that all output data (octet stream, collection byte offsets, and user object information) will be multiplexed over a single network socket.

On restore, the client may seek to a particular place in the backup data

Although the client could do this, the server doesn't know anything about it. Rather, the server provides the set of collections that includes all the user objects of interest. Then (and only then) the client goes about restoring user objects from this set of collections.

The data stream sent to the server is opaque to the server

Although the collection data itself is opaque, the other data (collection sizes, user object identifiers and attributes) is very much interpreted by the server. There should, for example, be a standard way of representing file permissions and timestamps as user object attributes.
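
As a sketch of what such a standard representation might look like for a filesystem user object (the attribute names here are assumptions; the proposal only says that a standard should exist):

```python
# Hypothetical sketch of standardized user-object attributes for a file;
# the key names are invented, but the server would be able to interpret them.
import os
import stat


def file_attributes(path):
    st = os.lstat(path)
    return {
        "mode": oct(stat.S_IMODE(st.st_mode)),   # permissions, e.g. "0o644"
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mtime": int(st.st_mtime),               # last modification, Unix time
        "size": st.st_size,
    }
```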