Tape Driver Semantics
From The Open Source Backup Wiki (Amanda, MySQL Backup, BackupPC)
These two articles were originally written by John R Jackson, and unearthed by Jon LaBadie. They are an excellent summary of the behavior of tapes on UNIX systems. They have not been updated at all (they're that good!), so take the use of the term "new" in combination with 8mm tapes in its proper historical context. --Dustin 23:00, 16 January 2010 (UTC)
What is a Tape?
This is a quick summary of what a tape is. Tapes are sequential access removable media devices. That means the transport (drive) and the media (tape) that actually stores the bits are two separate items. At any one time only one piece of media (often called a volume) may be loaded in a transport, but any of several volumes may be put in a transport at different times, subject to physical compatibility.
Note that just being able to physically put a tape into a drive does not guarantee it can be processed. A tape written on a new 8mm drive that supports high densities cannot be read in an older drive that lacks those features. Some drives also require a certain "quality" of tape to be able to write at higher densities. For instance, DLT drives can write more data on a type IV tape than they can on a type III (DAT media has a similar characteristic).
Two types of records are stored on tape media, data and end of file (EOF) marks. EOF markers are also known as tape marks (or file marks). The data recorded between beginning of tape (BOT) or a tape mark and the next tape mark is often called a file. Note that there may be an arbitrary number of files on a tape.
The space between records (data or EOF) is known as an inter-record gap (IRG).
Data records are made up of some number of bytes. Some tape devices require the number of bytes to be constant (fixed record). The majority allow the number of bytes to vary, although there may be an upper (or lower) bound, and they may also have a fixed length mode (often used to emulate disks for booting a machine).
Tape marks are recorded using some special recording method to distinguish them from normal data. The exact details are technology specific. On some devices it might be done with a special signal burst that the hardware knows is not valid data.
Other devices have a special servo track with timing marks to know exactly how far down the tape the transport is. Tape marks may get a special value written into the servo track, or the servo track value may be recorded someplace special on the tape or in the cartridge (DLT media has a catalogue that is read whenever a tape is loaded, for instance).
And there are other variations.
Due to the different recording methods, tape marks may or may not take up actual space on the media, and the amount of space may vary wildly depending on the type of device. For instance, tape marks on the original 8mm drives consumed the equivalent of several MBytes of data space. On the other hand, they appear to take no space at all on some newer technologies.
Physical end of tape (EOT) is, literally, that point at the end of the reel where you run out of tape. However, all devices also have some means of reporting a logical end of medium (EOM), i.e. the point beyond which there is no more valid data.
For nine-track tapes, two tape marks in a row are meant to be interpreted by the software as end of medium. Most other devices record another special type of mark after the last write so they can report end of medium when it is seen.
Once EOM is recorded on a tape, all data previously recorded beyond that point is no longer available (without extraordinary methods).
The multitude of recording methods provided by the various tape technologies is way beyond the scope of this document. However there are two basic techniques worthy of mention, linear and serpentine.
Linear recording (nine track, 4mm, 8mm, etc) uses the entire width of the tape for each record and makes one pass from BOT to EOT. There may be one or more tracks involved. For instance, nine track tape has nine tracks running the length of the media. Eight are used for the eight bits of each byte and the ninth is for parity.
Serpentine recording always involves multiple tracks. When writing from BOT, one (or more) tracks are written until EOT is hit. The tape then reverses direction and continues writing to a different track(s). This is repeated as many times as there are tracks.
For instance, DLT7000 tape, when recorded in 35 GByte mode, has 208 tracks that are recorded in 52 groups of four. So each pass from one end to the other writes four tracks, and it takes 52 passes to transfer the entire volume.
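The pass count above is simple division, and a few lines of Python make the consequence for tape motion visible. This is only a sketch: the function names are made up for illustration, and the assumption that the first pass runs from BOT to EOT with each later pass reversing direction is taken from the description of serpentine recording, not from any drive specification.

```python
# Track figures quoted in the text for a DLT7000 in 35 GByte mode.
TOTAL_TRACKS = 208
TRACKS_PER_PASS = 4


def passes_needed(total_tracks, tracks_per_pass):
    """Number of end-to-end passes needed to write every track once."""
    if total_tracks % tracks_per_pass != 0:
        raise ValueError("tracks do not divide evenly into passes")
    return total_tracks // tracks_per_pass


def physical_direction(pass_index):
    """Direction of travel for a 0-based pass index.

    Assumption: the first pass runs from BOT to EOT and each later pass
    reverses the previous one.
    """
    return "BOT->EOT" if pass_index % 2 == 0 else "EOT->BOT"


passes = passes_needed(TOTAL_TRACKS, TRACKS_PER_PASS)
print(passes)                          # 52
print(physical_direction(0))           # BOT->EOT
print(physical_direction(passes - 1))  # EOT->BOT
```

With an even number of passes, half of a logically "forward" write is done with the media physically moving back toward the start of the reel.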
One common tape mark characteristic is that they may be detected when the transport is in high speed search mode. This mode typically moves the media several times faster than when doing normal I/O and is specifically meant for repositioning the media to a specific point (beginning of tape, end of medium (EOM) or a tape mark).
This type of motion may be "forward" (toward EOM) or "backward" (toward BOT). The functions are typically called "forward space file" (fsf) and "backward space file" (bsf) and always allow a count of the number of files to skip (space) over. Note that in the case of serpentine recording, "forward" and "backward" may not relate to the actual direction the media is moving.
Doing a forward space operation leaves the tape positioned in the gap between the tape mark and the next record.
Doing a backward space operation leaves it in the gap just before the tape mark. Doing a read at this point will return EOF because it will see the tape mark.
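Where the two operations leave the tape is easier to see in a toy model. The sketch below treats a tape as a Python list in which `None` stands for a tape mark; the `ToyTape` class and its method names are invented for illustration and do not correspond to any real driver interface.

```python
TM = None  # stand-in for a tape mark in this toy model


class ToyTape:
    """A tape as a list of records; `pos` is the index of the next record."""

    def __init__(self, records):
        self.records = list(records)
        self.pos = 0  # 0 is BOT

    def fsf(self, count=1):
        """Forward space `count` files; stop in the gap just after a mark."""
        for _ in range(count):
            while self.pos < len(self.records) and self.records[self.pos] is not TM:
                self.pos += 1
            if self.pos >= len(self.records):
                raise OSError("ran off the end of the recorded data")
            self.pos += 1  # step over the tape mark itself

    def bsf(self, count=1):
        """Backward space `count` files; stop in the gap just before a mark."""
        for _ in range(count):
            while self.pos > 0 and self.records[self.pos - 1] is not TM:
                self.pos -= 1
            if self.pos == 0:
                raise OSError("hit BOT before finding a tape mark")
            self.pos -= 1  # now sitting just before the tape mark

    def read(self):
        """Return the next record, or b"" (EOF) if it is a tape mark."""
        if self.pos >= len(self.records):
            raise OSError("nothing left to read")
        record = self.records[self.pos]
        self.pos += 1
        return b"" if record is TM else record


tape = ToyTape([b"a1", b"a2", TM, b"b1", b"b2", TM])
tape.fsf(1)
print(tape.read())  # b'b1'
tape.bsf(1)
print(tape.read())  # b''
```

After the forward space, the next read returns the first record of the second file. After the backward space, the next read hits the tape mark and reports EOF.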
It is common to know which tape file stores a particular data object. Asking the drive to skip from the current file position to the desired one is almost always going to be faster (sometimes by orders of magnitude) than reading through the tape one file at a time.
Unloading a tape from a transport always involves rewinding it to BOT then ejecting it.
In theory, if the tape cartridge has two reels (supply and takeup), unloading could just kick the tape out of the transport without rewinding it. However, in practice, you almost always need to know where you are when you load a tape and this would make that difficult. So even two reel technologies are rewound before unload.
A read operation on a tape always moves the media forward one entire physical record. If the buffer you supply the read(2) function is not large enough, the kernel driver may fill in whatever you do provide with as much as it can, or it may return an error. But in any case, the media will be positioned such that the next read will start with the first byte of the next record.
If a read operation detects a tape mark, the kernel driver translates this situation into Unix EOF, i.e. the read(2) returns zero for the number of bytes transferred. As with reading a data record, the media will be positioned after the tape mark (the record just read) and be ready to read the next record. However, what the kernel driver does at this point depends on how it is configured; see below.
Two successive reads that return EOF (possibly with an FSF in between for SysV semantics) indicate end of medium (EOM).
What happens when you try to read past EOM varies by kernel driver. Some will continue to return EOF. Others will return an error (typically EIO).
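The read rules above can be sketched in a few lines. This is a toy model, not driver code: `ToyDrive` and `read_all_files` are hypothetical names, a short buffer is modelled by truncation (a real driver may return an error instead), EOM is modelled as two tape marks in a row, and no FSF is required between the two EOFs (that is, BSD-style behavior is assumed).

```python
TM = None  # stand-in for a tape mark


class ToyDrive:
    def __init__(self, records):
        self.records = list(records)
        self.pos = 0

    def read(self, bufsize):
        """Consume one whole record; return at most `bufsize` bytes of it."""
        if self.pos >= len(self.records):
            raise OSError("read past end of recorded data")
        record = self.records[self.pos]
        self.pos += 1               # the media always moves a full record
        if record is TM:
            return b""              # a tape mark is reported as EOF
        return record[:bufsize]     # truncating variant; a driver may error instead


def read_all_files(drive, bufsize=65536):
    """Collect the records of each file until two EOFs in a row (EOM)."""
    files, current, last_was_eof = [], [], False
    while True:
        data = drive.read(bufsize)
        if data:
            current.append(data)
            last_was_eof = False
        elif last_was_eof:
            return files            # second EOF in a row: end of medium
        else:
            files.append(current)
            current, last_was_eof = [], True


short = ToyDrive([b"abcdef", b"xyz", TM, TM])
print(short.read(4))  # b'abcd'
print(short.read(4))  # b'xyz'

drive = ToyDrive([b"a1", b"a2", TM, b"b1", TM, TM])
print(read_all_files(drive))  # [[b'a1', b'a2'], [b'b1']]
```

Note that the second short read returns the start of the next record, not the unread tail of the first: the leftover bytes are lost once the media has moved on.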
A write(2) operation to a tape device always writes one physical record. If the device only supports a fixed record size (or is configured to behave that way), the record must be that size or a multiple of the size. If the device supports (or is configured for) variable record sizes, you can write a record of any length, although the transport or kernel driver may impose upper or lower bounds.
Writing a tape mark is done with a special ioctl(2) call, the specifics of which vary from OS to OS. But the end result is the same. First, the transport flushes any data it has in its internal buffers (if any), then one (or more) tape marks are written to the media.
In theory, you should be able to start writing at any point. For instance, you could rewind to BOT, write 100 records, rewind to BOT, read 50 records and then start writing again in that first file (the last 50 records from first write pass would be "lost"/overwritten).
However, many devices only allow writing when positioned at BOT or just after (or before) a tape mark. So the first 100 records would be OK above because they were written at BOT. But the second write after reading 50 records would fail. It would be valid to read all 100 records, and the tape mark, then start writing a second file. It would also be valid to read all 100 records, the tape mark, then backspace over the tape mark and start writing, which would append the new records to the first file.
The kernel driver also does some writing for you. If the last operation you do is a write(2) and you close(2) the device, or use ioctl(2) to do any type of backward motion (e.g. rewind), the driver will write a tape mark. This ensures that in normal circumstances all files end with a tape mark.
One exception case can cause a tape file to not be properly "terminated" with a tape mark. If the device resets itself (machine reboot, SCSI timeout due to some other error, etc), the default action is to rewind the tape even if the last operation was a write. It is unlikely the device will terminate the file, so an error is the most likely result when the tape is later read.
Appending new files to a tape should be trivial. Skip to end of media and start writing. How hard could that be?
Well, it depends on how much you trust tapes and transports. I've been working with them for over 25 years. Believe me. They are not to be trusted. At all. Ever.
The first problem is trusting the forward space operation to end up where you told it to. If the device moves the media at high speed and depends on the ability to detect tape marks as they go flying by the heads, it's certainly possible to miss one (or more) and skip too far. It's also possible to get false counts from data that, at high speed, looks like a tape mark, resulting in stopping too soon.
Some kernel drivers support a "skip to EOM" operation on devices that support EOM in hardware, but in the code I've seen this is just emulated via "fsf 999999" until it gets an error.
However, this is actually probably reasonably safe. Certainly much more so than on nine track tapes where, if you missed a tape mark, you could end up pulling the tape completely off the supply reel.
It's also possibly safer on devices that don't actually record tape marks on the media, but save their tachometer (servo) value.
Even so, the safest way to append is to forward space to the file before you want to append, read that file and verify it is the one you expect it to be, keep reading until the tape mark, then start writing.
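The verify-then-append procedure can be sketched against a toy tape image (a Python list in which `None` stands for a tape mark). Everything here is illustrative: `safe_append` is a made-up helper, and the idea that each file starts with a recognizable header record is an assumption about how the application labels its files.

```python
TM = None  # stand-in for a tape mark


def safe_append(tape, file_index, expected_header, new_records):
    """Append `new_records` as a new file after file number `file_index`.

    `tape` is a list of records, `file_index` counts files from 0, and
    `expected_header` is what the first record of that file should be.
    Returns the new tape image; raises instead of writing if anything
    looks wrong.
    """
    pos = 0
    try:
        for _ in range(file_index):          # forward space to the last file
            while tape[pos] is not TM:
                pos += 1
            pos += 1
        if tape[pos] != expected_header:     # verify before trusting the position
            raise ValueError("unexpected file at this position; not writing")
        while tape[pos] is not TM:           # read on until the tape mark
            pos += 1
        pos += 1
    except IndexError:
        raise ValueError("ran off the recorded data; not writing") from None
    # Writing here discards anything previously recorded beyond this point.
    return tape[:pos] + list(new_records) + [TM]


tape = [b"hdr0", b"x", TM, b"hdr1", b"y", TM]
print(safe_append(tape, 1, b"hdr1", [b"hdr2", b"z"]))
# [b'hdr0', b'x', None, b'hdr1', b'y', None, b'hdr2', b'z', None]
```

The point of the check is that a miscounted forward space shows up as a header mismatch before anything is overwritten, and all motion up to the write is in one direction.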
The second problem is related to the physics of pushing and pulling a band of plastic to and fro across a set of heads. If all the motion is in one direction (from either end, or from a known good position), everything is reasonably OK. But, in general, it's a bad idea to change directions and then start writing.
For instance, in the above example of appending, if you wanted to actually add more data records to the file instead of starting a new file, it would be a bad idea to read the trailing tape mark, skip backward over it, then start writing. Logically, it looks fine. Physically, there are a lot of things that can go wrong.
Picture a piece of string lying on a table. One hand presses it lightly against the table while the other pulls on one end. As you pull, the string slides nicely along. If you try to push it back a bit, it obviously bunches up.
Things are better (and more like actual tape hardware) if instead of pushing you reach to the other side of the hand holding the string down and pull from that side a bit. But, although it's hard to detect, you've actually stretched the string a bit with that change of direction. That, in turn, can lead to errors.
So the suggested solution to the problem of appending records is to know how many records are in the file, read just that many and then start writing. Note that some transports may not allow this as they would not realize you were at a tape mark even though you had not actually read it.
Note that if the drive has internal data buffers, write(2) calls are probably not actually getting your data to the media. The data is transferred to the buffer and your call returns successfully. The transport doesn't actually move the data to the media until some time later.
This means the only time you know all your data has successfully made it to tape is after you issue a write tape mark command and it returns without error. Note also that if it does return an error, you have no way of knowing how much data actually made it to the media. It might have been all of it and the error was for the tape mark itself. Or the internal buffers may have held all your data since the previous tape mark was written and the very first physical write of that failed so none of it made it to the media.
Some devices support so-called "zero tape mark" calls to force flushing of the buffers. A normal tape mark write request is made but with a count of zero. The buffers are flushed but no tape marks are written. However, not all devices support this feature, and using it can cause significant performance degradation.
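A toy model shows why a successful write(2) says so little. The `BufferedDrive` class below is purely hypothetical: it assumes an unbounded internal buffer, a medium with a fixed byte capacity, and tape marks that take no space. None of this describes a particular drive.

```python
class BufferedDrive:
    """Toy drive whose writes land in a buffer, not on the media."""

    def __init__(self, media_capacity):
        self.media_capacity = media_capacity  # bytes the medium can hold
        self.used = 0
        self.buffer = []   # accepted from the host, not yet on tape
        self.media = []    # actually recorded (None marks a tape mark)

    def write(self, record):
        """Accept a record into the buffer and report success at once."""
        self.buffer.append(record)
        return len(record)

    def write_filemarks(self, count):
        """Flush the buffer, then record `count` tape marks (0 = flush only)."""
        while self.buffer:
            record = self.buffer[0]
            if self.used + len(record) > self.media_capacity:
                self.buffer.clear()          # unwritten data is simply gone
                raise OSError("write error while flushing buffered data")
            self.media.append(self.buffer.pop(0))
            self.used += len(record)
        self.media.extend([None] * count)


drive = BufferedDrive(media_capacity=2500)
for _ in range(3):
    print(drive.write(b"x" * 1000))   # each call reports 1000 bytes written
try:
    drive.write_filemarks(1)
except OSError as err:
    print("error:", err)
print(len(drive.media))               # 2: only two records reached the tape
```

All three writes report success, yet only two records are on the media, and the caller sees nothing but a single error from the tape mark request.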
Some drives try to deal with this issue at EOM (the usual place to get a write error) by knowing internally how much media is left. This is the so called logical end of tape (LEOT), as compared to physical end of tape (PEOT). If you try to issue a write past LEOT the drive will return an error. A write tape mark operation will go ahead and flush what it had in the buffers up to that point in the hopes there is enough room. LEOT can be anywhere from KBytes to MBytes ahead of PEOT.
Note that even with this feature, however, you cannot be certain the internal buffers were flushed properly.
Some of the above statements about buffer flushing are not strictly true. If the transport has a SCSI interface, and if it has device specific commands for getting detailed internal information, and if your kernel has support for issuing raw SCSI commands to the transport, it might be possible to tell how much data made it to tape at any point in time. However, the effort involved, especially to be portable across hardware and OS's, is a major undertaking. And in many cases it is just not possible.
All kernel drivers provide two ways to access a drive, the so-called "rewind" and "no-rewind" devices. The difference is in what happens when a program closes its access to the drive.
First, if the last thing the program did was a write(2), the kernel will write a tape mark to close out the current file.
If you were using a "rewind" device name, the tape will be rewound to BOT.
If you were using a "no-rewind" device name, and you were writing to the device, the tape will be left positioned after the tape mark.
If you were using a "no-rewind" device name, and you were reading the device, and you are using BSD semantics, the tape will be left positioned exactly where it is, which could possibly be in the middle of a file.
If you were using a "no-rewind" device name, and you were reading the device, and you are using SysV semantics, the kernel will do a forward space file operation to skip past the next tape mark and leave the tape positioned there.
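The close rules above amount to a small decision table. The sketch below encodes just those rules; the function name and the string labels are invented for illustration.

```python
def close_actions(device, last_op, semantics):
    """What the driver does on close, per the rules above.

    device:    "rewind" or "no-rewind"
    last_op:   "read" or "write"
    semantics: "bsd" or "sysv"
    """
    actions = []
    if last_op == "write":
        actions.append("write tape mark")            # close out the current file
    if device == "rewind":
        actions.append("rewind to BOT")
    elif last_op == "read" and semantics == "sysv":
        actions.append("forward space past next tape mark")
    # no-rewind + read + BSD: nothing happens, the tape stays where it is
    return actions


for device in ("rewind", "no-rewind"):
    for last_op in ("read", "write"):
        for semantics in ("bsd", "sysv"):
            print(device, last_op, semantics, close_actions(device, last_op, semantics))
```

Only one combination leaves the tape in the middle of a file: reading from a no-rewind device with BSD semantics.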
Fixed Record Size
Some devices only support a fixed record size, typically 1 KByte. Records you write must be some multiple of that size.
Many devices that support variable record lengths also support a fixed record length option. Problems can happen when a tape written one way (fixed or variable) needs to be read on a drive configured the other way.
In general, variable is more flexible. A drive configured to read variable length records can (obviously) read variable length records and it can also read fixed length records. There is nothing magic about fixed length records -- it's purely a restriction imposed by the drive when writing.
However, a drive configured to read fixed length records can only read tapes written with that same fixed length size (or a whole multiple of it).
Assuming a transport that uses a 1 KByte record size (which is typical), a tape written with 1 KByte fixed length records can be read as is with the drive reconfigured for variable record size because the physical and logical record sizes are the same.
If the records written are 2 KBytes, the drive will output them as two physical records. When read back with the drive configured for variable record size you will see two 1 KByte records instead of a single 2 KByte record. The data will all be there, it just needs to be recombined.
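The split and the recombination can be illustrated with plain byte strings. This is a sketch under the 1 KByte physical record size assumed in the text; `write_fixed` and `recombine` are hypothetical helper names, and the reader is assumed to know the original logical record size.

```python
PHYSICAL_SIZE = 1024  # the 1 KByte physical record size used in the text


def write_fixed(logical_record, physical_size=PHYSICAL_SIZE):
    """Split one logical record into the physical records a fixed-mode drive emits."""
    if len(logical_record) % physical_size != 0:
        raise ValueError("record must be a multiple of the physical record size")
    return [logical_record[i:i + physical_size]
            for i in range(0, len(logical_record), physical_size)]


def recombine(physical_records, logical_size):
    """Rebuild logical records from physical ones read in variable mode."""
    data = b"".join(physical_records)
    return [data[i:i + logical_size] for i in range(0, len(data), logical_size)]


original = bytes(range(256)) * 8               # one 2 KByte logical record
on_tape = write_fixed(original)
print([len(r) for r in on_tape])               # [1024, 1024]
print(recombine(on_tape, 2048) == [original])  # True
```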
One other oddity about fixed record sizes. Some devices (e.g. 8mm) support a special 512 byte record size. This is typically used to create tapes that can be used to boot a machine (boot sequences tend to like to read 512 byte records, which is what disks typically present). However, what's really happening is that the physical records are still 1 KByte, but only the first 512 bytes are used. This has to be taken into account if you try to read such a tape with a different record size (including variable). It also means the drive throughput will be cut in half since every 512 bytes of data is actually using up 1024 bytes of tape space.
SYSV vs. BSD Tapes
Back in the dark ages (the 70's), two flavors of Unix were being actively developed, BSD (from the University of California, Berkeley) and System V (from AT&T). Among other things destined to annoy system administrators down through the ages, they managed to come up with different ways to treat tapes (which, if you know the pathetic history of tape support in Unix, is really humorous).
A read operation on a tape always moves the media forward one entire physical record, whether it is a data record or a tape mark.
If you read a tape mark, you'll get an EOF indication (zero bytes transferred). However, what happens on the next read depends on BSD vs SysV semantics:
- BSD: the first record from the next file will be returned.
- SysV: an error will be reported until you do an explicit forward space file (FSF) operation (ioctl). Note that this does not actually move forward another file. The driver knows it is sitting just past a tape mark and only uses this operation to clear that flag.
This difference makes coding applications difficult if the programmer does not know which type of semantics are being used. If SysV is assumed, and the program does an fsf operation after getting EOF, it will be skipping over every other file if the driver is really BSD. If BSD is assumed, doing another read after EOF will return an error if the driver is really SysV.
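Both failure modes can be reproduced with a toy driver. The model below is a sketch of the behavior described here, not real driver code: `ToyDriver`, its `eof_flag`, and `first_records_assuming_sysv` are invented names, and the tape is a Python list in which `None` stands for a tape mark.

```python
TM = None  # stand-in for a tape mark


class ToyDriver:
    def __init__(self, records, semantics):
        self.records = list(records)
        self.semantics = semantics   # "bsd" or "sysv"
        self.pos = 0
        self.eof_flag = False        # SysV only: set once a read returns EOF

    def read(self):
        if self.eof_flag:
            raise OSError("read after EOF without an fsf")
        if self.pos >= len(self.records):
            raise OSError("read past end of recorded data")
        record = self.records[self.pos]
        self.pos += 1
        if record is TM:
            self.eof_flag = (self.semantics == "sysv")
            return b""
        return record

    def fsf(self):
        if self.eof_flag:
            self.eof_flag = False    # no motion: only clears the flag
            return
        while self.pos < len(self.records) and self.records[self.pos] is not TM:
            self.pos += 1
        if self.pos >= len(self.records):
            raise OSError("fsf ran off the recorded data")
        self.pos += 1


def first_records_assuming_sysv(driver, nfiles):
    """Read to EOF, then fsf, as a program written for SysV would."""
    firsts = []
    for _ in range(nfiles):
        firsts.append(driver.read())
        while driver.read() != b"":
            pass
        driver.fsf()
    return firsts


records = [b"A1", b"A2", TM, b"B1", TM, b"C1", TM, b"D1", TM]
print(first_records_assuming_sysv(ToyDriver(records, "sysv"), 4))
# [b'A1', b'B1', b'C1', b'D1']
print(first_records_assuming_sysv(ToyDriver(records, "bsd"), 2))
# [b'A1', b'C1']
```

Run against the BSD model, the SysV-style loop silently skips files B and D. The opposite mistake, a plain read right after EOF on the SysV model, raises an error instead.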
One way around this is to close and re-open the drive whenever EOF is detected. On BSD the close leaves the tape where it is, so this has no effect. On SysV the tape does not move either, because the driver sees it is positioned just after a tape mark, and after the re-open reading continues with the next file.
However, this opens a timing window where some other process could, in theory, open the tape drive. Since tape drives do not allow multiple processes to open them at the same time, the original program would fail.
There is no difference between BSD and SysV semantics for writing.
If you were reading and close a device being processed with BSD semantics, the tape position is left alone. You could be left in the middle of a file so that the next open and read operation picks up where you left off.
If you were reading and close a device being processed with SysV semantics, the kernel will do a forward space file (FSF) operation to position the tape just past the next tape mark, unless you just did an fsf operation yourself, in which case the tape is left where it is (at the beginning of a new file or at EOM). A subsequent open and read will get data from the next file regardless of where you stopped reading in the previous file.
Note that the "mt" command opens and closes the tape device to do its work. So it is subject to the same semantic issues.
In particular, if you do an fsf operation you will end up in the same place regardless of semantics. This is because before the close the tape will be positioned after the tape mark. BSD will leave the position alone. SysV will see that it is immediately after a tape mark and also leave the position alone.
However, a bsf (backward space file) operation acts differently. Before the close, the tape will be positioned just before the tape mark. BSD will leave that alone and the next read will get back an EOF. But SysV will advance past the tape mark on the "mt" close and leave the tape ready to read the first record of the next file.
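The combined effect of the motion and the close can be traced in a toy model. In this sketch `mt` is a made-up Python function (not the real command), the device is assumed to be a no-rewind one, and the tape is a list in which `None` stands for a tape mark.

```python
TM = None  # stand-in for a tape mark


def mt(tape, pos, op, semantics):
    """Position left by a toy `mt <op> 1` on a no-rewind device.

    `tape` is a list of records, `pos` the index of the next record,
    `op` is "fsf" or "bsf", and `semantics` is "bsd" or "sysv".
    """
    if op == "fsf":
        while tape[pos] is not TM:
            pos += 1
        pos += 1                      # just past the tape mark
        just_past_mark = True
    elif op == "bsf":
        while pos > 0 and tape[pos - 1] is not TM:
            pos -= 1
        if pos == 0:
            raise OSError("hit BOT before finding a tape mark")
        pos -= 1                      # just before the tape mark
        just_past_mark = False
    else:
        raise ValueError("unsupported operation")
    # The close that ends the mt command:
    if semantics == "sysv" and not just_past_mark:
        while tape[pos] is not TM:
            pos += 1
        pos += 1                      # SysV skips past the next tape mark
    return pos                        # BSD leaves the position alone


tape = [b"a1", b"a2", TM, b"b1", b"b2", TM]
start = 4  # part-way through the second file
print(mt(tape, start, "fsf", "bsd"), mt(tape, start, "fsf", "sysv"))  # 6 6
print(mt(tape, start, "bsf", "bsd"), mt(tape, start, "bsf", "sysv"))  # 2 3
```

Position 2 is the tape mark itself, so the next read under BSD returns EOF. Position 3 is the first record after that mark, which is what a read under SysV returns.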
Which is Which
Whether you get BSD or SysV tape semantics, and how you choose between them, depends on the kernel tape driver you're using, which, in turn, depends on your OS.
On Solaris, for instance, if the device name has a 'b' in it, you're using BSD semantics. So /dev/rmt/0bn is BSD while /dev/rmt/0n is the same drive but accessed with SysV semantics.
If you cannot find out from the system documentation which way your driver works, here's a little test case.
Warning: make sure you use a scratch tape! The following will clobber whatever data is on the tape.
If you use sh/ksh/bash:
    t=/dev/whatever    # use this for sh/ksh/bash
    f1=/etc/hosts      # need a file >= 1 KByte
    f2=/etc/services   # need a file >= 1 KByte
If you use csh/tcsh:
    set t=/dev/whatever    # use this for csh/tcsh
    set f1=/etc/hosts      # need a file >= 1 KByte
    set f2=/etc/services   # need a file >= 1 KByte
The remainder of the tests are shell independent.
First, make sure you are using a no-rewind device:
    mt -f $t rewind
    ( dd if=$f1 bs=1k count=1 ; dd if=$f1 bs=1k count=1 ) > $t
    ( dd if=$f2 bs=1k count=1 ; dd if=$f2 bs=1k count=1 ) > $t
    mt -f $t rewind
    dd if=$t bs=1k count=1 | head
If the "head" command shows output from $f2 instead of $f1, you are using a rewinding device name. This is because the system rewound the tape between the two "dd" statements, so the second one overwrote what the first one did.
Find a no-rewind device name and reset the 't' shell variable before continuing.
Now, see which semantics you are using:
    mt -f $t rewind
    ( dd if=$f1 bs=1k count=1 ; dd if=$f1 bs=1k count=1 ) > $t
    ( dd if=$f2 bs=1k count=1 ; dd if=$f2 bs=1k count=1 ) > $t

    mt -f $t rewind
    dd if=$t bs=1k count=1 | head
    dd if=$t bs=1k count=1 | head
If the output from the first and second "head" calls is the same, you are using a device configured with BSD semantics. This is because the first dd leaves the tape after the first record of the first file so the second dd reads the second record of the first file, which should be the same as the first record.
If the output from the two "head" calls is different, you are using SysV semantics. This is because the kernel driver advances to the second file after the first dd completed, so the second dd reads the first record of the second file.
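The reasoning behind the test can be checked without a tape drive by simulating it. The sketch below is a toy model only: `simulate_test` and `classify` are made-up names, the tape is a Python list in which `None` stands for a tape mark, and a no-rewind device is assumed.

```python
TM = None  # stand-in for a tape mark


def simulate_test(semantics):
    """Return what the two `head` calls would see under "bsd" or "sysv"."""
    # Tape image left by the two write passes on a no-rewind device:
    # two identical records from $f1, a mark, two from $f2, a mark.
    tape = [b"f1", b"f1", TM, b"f2", b"f2", TM]
    pos = 0                                   # after "mt rewind"
    outputs = []
    for _ in range(2):                        # two separate dd runs
        outputs.append(tape[pos])             # dd reads one record...
        pos += 1
        if semantics == "sysv":               # ...and its close skips the file
            while tape[pos] is not TM:
                pos += 1
            pos += 1
    return outputs


def classify(first, second):
    """Apply the rule above: same output means BSD, different means SysV."""
    return "bsd" if first == second else "sysv"


for semantics in ("bsd", "sysv"):
    first, second = simulate_test(semantics)
    print(semantics, first, second, classify(first, second))
```

The test only discriminates if the first KByte of the two source files differs, which is worth checking before trusting the result.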