From: Sergey Poznyakoff Date: Fri, 9 Jun 2006 13:46:34 +0000 (+0000) Subject: New file X-Git-Url: https://git.dogcows.com/gitweb?p=chaz%2Ftar;a=commitdiff_plain;h=97bfe578edf7a92392896ddb4d401a11ad86808e New file --- diff --git a/doc/intern.texi b/doc/intern.texi new file mode 100644 index 0000000..e6f9706 --- /dev/null +++ b/doc/intern.texi @@ -0,0 +1,329 @@ +@c This is part of the paxutils manual. +@c Copyright (C) 2006 Free Software Foundation, Inc. +@c This file is distributed under GFDL 1.1 or any later version +@c published by the Free Software Foundation. + +@menu +* Standard:: Basic Tar Format +* Extensions:: @acronym{GNU} Extensions to the Archive Format +* Snapshot Files:: +* Dumpdir:: +@end menu + +@node Standard +@unnumberedsec Basic Tar Format +@UNREVISED + +While an archive may contain many files, the archive itself is a +single ordinary file. Like any other file, an archive file can be +written to a storage device such as a tape or disk, sent through a +pipe or over a network, saved on the active file system, or even +stored in another archive. An archive file is not easy to read or +manipulate without using the @command{tar} utility or Tar mode in +@acronym{GNU} Emacs. + +Physically, an archive consists of a series of file entries terminated +by an end-of-archive entry, which consists of two 512 blocks of zero +bytes. A file +entry usually describes one of the files in the archive (an +@dfn{archive member}), and consists of a file header and the contents +of the file. File headers contain file names and statistics, checksum +information which @command{tar} uses to detect file corruption, and +information about file types. + +Archives are permitted to have more than one member with the same +member name. One way this situation can occur is if more than one +version of a file has been stored in the archive. For information +about adding new versions of a file to an archive, see @ref{update}. +@FIXME-xref{To learn more about having more than one archive member with the +same name, see -backup node, when it's written.} + +In addition to entries describing archive members, an archive may +contain entries which @command{tar} itself uses to store information. +@xref{label}, for an example of such an archive entry. + +A @command{tar} archive file contains a series of blocks. Each block +contains @code{BLOCKSIZE} bytes. Although this format may be thought +of as being on magnetic tape, other media are often used. + +Each file archived is represented by a header block which describes +the file, followed by zero or more blocks which give the contents +of the file. At the end of the archive file there are two 512-byte blocks +filled with binary zeros as an end-of-file marker. A reasonable system +should write such end-of-file marker at the end of an archive, but +must not assume that such a block exists when reading an archive. In +particular @GNUTAR{} always issues a warning if it does not encounter it. + +The blocks may be @dfn{blocked} for physical I/O operations. +Each record of @var{n} blocks (where @var{n} is set by the +@option{--blocking-factor=@var{512-size}} (@option{-b @var{512-size}}) option to @command{tar}) is written with a single +@w{@samp{write ()}} operation. On magnetic tapes, the result of +such a write is a single record. When writing an archive, +the last record of blocks should be written at the full size, with +blocks after the zero block containing all zeros. When reading +an archive, a reasonable system should properly handle an archive +whose last record is shorter than the rest, or which contains garbage +records after a zero block. + +The header block is defined in C as follows. In the @GNUTAR{} +distribution, this is part of file @file{src/tar.h}: + +@smallexample +@include header.texi +@end smallexample + +All characters in header blocks are represented by using 8-bit +characters in the local variant of ASCII. Each field within the +structure is contiguous; that is, there is no padding used within +the structure. Each character on the archive medium is stored +contiguously. + +Bytes representing the contents of files (after the header block +of each file) are not translated in any way and are not constrained +to represent characters in any character set. The @command{tar} format +does not distinguish text files from binary files, and no translation +of file contents is performed. + +The @code{name}, @code{linkname}, @code{magic}, @code{uname}, and +@code{gname} are null-terminated character strings. All other fields +are zero-filled octal numbers in ASCII. Each numeric field of width +@var{w} contains @var{w} minus 1 digits, and a null. + +The @code{name} field is the file name of the file, with directory names +(if any) preceding the file name, separated by slashes. + +@FIXME{how big a name before field overflows?} + +The @code{mode} field provides nine bits specifying file permissions +and three bits to specify the Set UID, Set GID, and Save Text +(@dfn{sticky}) modes. Values for these bits are defined above. +When special permissions are required to create a file with a given +mode, and the user restoring files from the archive does not hold such +permissions, the mode bit(s) specifying those special permissions +are ignored. Modes which are not supported by the operating system +restoring files from the archive will be ignored. Unsupported modes +should be faked up when creating or updating an archive; e.g., the +group permission could be copied from the @emph{other} permission. + +The @code{uid} and @code{gid} fields are the numeric user and group +ID of the file owners, respectively. If the operating system does +not support numeric user or group IDs, these fields should be ignored. + +The @code{size} field is the size of the file in bytes; linked files +are archived with this field specified as zero. @FIXME-xref{Modifiers, in +particular the @option{--incremental} (@option{-G}) option.} + +The @code{mtime} field is the data modification time of the file at +the time it was archived. It is the ASCII representation of the octal +value of the last time the file's contents were modified, represented +as an integer number of +seconds since January 1, 1970, 00:00 Coordinated Universal Time. + +The @code{chksum} field is the ASCII representation of the octal value +of the simple sum of all bytes in the header block. Each 8-bit +byte in the header is added to an unsigned integer, initialized to +zero, the precision of which shall be no less than seventeen bits. +When calculating the checksum, the @code{chksum} field is treated as +if it were all blanks. + +The @code{typeflag} field specifies the type of file archived. If a +particular implementation does not recognize or permit the specified +type, the file will be extracted as if it were a regular file. As this +action occurs, @command{tar} issues a warning to the standard error. + +The @code{atime} and @code{ctime} fields are used in making incremental +backups; they store, respectively, the particular file's access and +status change times. + +The @code{offset} is used by the @option{--multi-volume} (@option{-M}) option, when +making a multi-volume archive. The offset is number of bytes into +the file that we need to restart at to continue the file on the next +tape, i.e., where we store the location that a continued file is +continued at. + +The following fields were added to deal with sparse files. A file +is @dfn{sparse} if it takes in unallocated blocks which end up being +represented as zeros, i.e., no useful data. A test to see if a file +is sparse is to look at the number blocks allocated for it versus the +number of characters in the file; if there are fewer blocks allocated +for the file than would normally be allocated for a file of that +size, then the file is sparse. This is the method @command{tar} uses to +detect a sparse file, and once such a file is detected, it is treated +differently from non-sparse files. + +Sparse files are often @code{dbm} files, or other database-type files +which have data at some points and emptiness in the greater part of +the file. Such files can appear to be very large when an @samp{ls +-l} is done on them, when in truth, there may be a very small amount +of important data contained in the file. It is thus undesirable +to have @command{tar} think that it must back up this entire file, as +great quantities of room are wasted on empty blocks, which can lead +to running out of room on a tape far earlier than is necessary. +Thus, sparse files are dealt with so that these empty blocks are +not written to the tape. Instead, what is written to the tape is a +description, of sorts, of the sparse file: where the holes are, how +big the holes are, and how much data is found at the end of the hole. +This way, the file takes up potentially far less room on the tape, +and when the file is extracted later on, it will look exactly the way +it looked beforehand. The following is a description of the fields +used to handle a sparse file: + +The @code{sp} is an array of @code{struct sparse}. Each @code{struct +sparse} contains two 12-character strings which represent an offset +into the file and a number of bytes to be written at that offset. +The offset is absolute, and not relative to the offset in preceding +array element. + +The header can hold four of these @code{struct sparse} at the moment; +if more are needed, they are not stored in the header. + +The @code{isextended} flag is set when an @code{extended_header} +is needed to deal with a file. Note that this means that this flag +can only be set when dealing with a sparse file, and it is only set +in the event that the description of the file will not fit in the +allotted room for sparse structures in the header. In other words, +an extended_header is needed. + +The @code{extended_header} structure is used for sparse files which +need more sparse structures than can fit in the header. The header can +fit 4 such structures; if more are needed, the flag @code{isextended} +gets set and the next block is an @code{extended_header}. + +Each @code{extended_header} structure contains an array of 21 +sparse structures, along with a similar @code{isextended} flag +that the header had. There can be an indeterminate number of such +@code{extended_header}s to describe a sparse file. + +@table @asis + +@item @code{REGTYPE} +@itemx @code{AREGTYPE} +These flags represent a regular file. In order to be compatible +with older versions of @command{tar}, a @code{typeflag} value of +@code{AREGTYPE} should be silently recognized as a regular file. +New archives should be created using @code{REGTYPE}. Also, for +backward compatibility, @command{tar} treats a regular file whose name +ends with a slash as a directory. + +@item @code{LNKTYPE} +This flag represents a file linked to another file, of any type, +previously archived. Such files are identified in Unix by each +file having the same device and inode number. The linked-to name is +specified in the @code{linkname} field with a trailing null. + +@item @code{SYMTYPE} +This represents a symbolic link to another file. The linked-to name +is specified in the @code{linkname} field with a trailing null. + +@item @code{CHRTYPE} +@itemx @code{BLKTYPE} +These represent character special files and block special files +respectively. In this case the @code{devmajor} and @code{devminor} +fields will contain the major and minor device numbers respectively. +Operating systems may map the device specifications to their own +local specification, or may ignore the entry. + +@item @code{DIRTYPE} +This flag specifies a directory or sub-directory. The directory +name in the @code{name} field should end with a slash. On systems where +disk allocation is performed on a directory basis, the @code{size} field +will contain the maximum number of bytes (which may be rounded to +the nearest disk block allocation unit) which the directory may +hold. A @code{size} field of zero indicates no such limiting. Systems +which do not support limiting in this manner should ignore the +@code{size} field. + +@item @code{FIFOTYPE} +This specifies a FIFO special file. Note that the archiving of a +FIFO file archives the existence of this file and not its contents. + +@item @code{CONTTYPE} +This specifies a contiguous file, which is the same as a normal +file except that, in operating systems which support it, all its +space is allocated contiguously on the disk. Operating systems +which do not allow contiguous allocation should silently treat this +type as a normal file. + +@item @code{A} @dots{} @code{Z} +These are reserved for custom implementations. Some of these are +used in the @acronym{GNU} modified format, as described below. + +@end table + +Other values are reserved for specification in future revisions of +the P1003 standard, and should not be used by any @command{tar} program. + +The @code{magic} field indicates that this archive was output in +the P1003 archive format. If this field contains @code{TMAGIC}, +the @code{uname} and @code{gname} fields will contain the ASCII +representation of the owner and group of the file respectively. +If found, the user and group IDs are used rather than the values in +the @code{uid} and @code{gid} fields. + +For references, see ISO/IEC 9945-1:1990 or IEEE Std 1003.1-1990, pages +169-173 (section 10.1) for @cite{Archive/Interchange File Format}; and +IEEE Std 1003.2-1992, pages 380-388 (section 4.48) and pages 936-940 +(section E.4.48) for @cite{pax - Portable archive interchange}. + +@node Extensions +@unnumberedsec @acronym{GNU} Extensions to the Archive Format +@UNREVISED + +The @acronym{GNU} format uses additional file types to describe new types of +files in an archive. These are listed below. + +@table @code +@item GNUTYPE_DUMPDIR +@itemx 'D' +This represents a directory and a list of files created by the +@option{--incremental} (@option{-G}) option. The @code{size} field gives the total +size of the associated list of files. Each file name is preceded by +either a @samp{Y} (the file should be in this archive) or an @samp{N}. +(The file is a directory, or is not stored in the archive.) Each file +name is terminated by a null. There is an additional null after the +last file name. + +@item GNUTYPE_MULTIVOL +@itemx 'M' +This represents a file continued from another volume of a multi-volume +archive created with the @option{--multi-volume} (@option{-M}) option. The original +type of the file is not given here. The @code{size} field gives the +maximum size of this piece of the file (assuming the volume does +not end before the file is written out). The @code{offset} field +gives the offset from the beginning of the file where this part of +the file begins. Thus @code{size} plus @code{offset} should equal +the original size of the file. + +@item GNUTYPE_SPARSE +@itemx 'S' +This flag indicates that we are dealing with a sparse file. Note +that archiving a sparse file requires special operations to find +holes in the file, which mark the positions of these holes, along +with the number of bytes of data to be found after the hole. + +@item GNUTYPE_VOLHDR +@itemx 'V' +This file type is used to mark the volume header that was given with +the @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) option when the archive was created. The @code{name} +field contains the @code{name} given after the @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) option. +The @code{size} field is zero. Only the first file in each volume +of an archive should have this type. + +@end table + +You may have trouble reading a @acronym{GNU} format archive on a +non-@acronym{GNU} system if the options @option{--incremental} (@option{-G}), +@option{--multi-volume} (@option{-M}), @option{--sparse} (@option{-S}), or @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) were +used when writing the archive. In general, if @command{tar} does not +use the @acronym{GNU}-added fields of the header, other versions of +@command{tar} should be able to read the archive. Otherwise, the +@command{tar} program will give an error, the most likely one being a +checksum error. + +@node Snapshot Files +@unnumberedsec Format of the Incremental Snapshot Files +@include snapshot.texi + +@node Dumpdir +@unnumberedsec Dumpdir +@include dumpdir.texi