
Handling backups and archiving on LRZ HPC systems

On all LRZ HPC systems, mechanisms are provided which allow the user to restore accidentally deleted or overwritten files, write files to tape, and retrieve them. This document describes usage and recommended practices for these facilities.




Snapshots, Backup and Restoring

Snapshots

For all files in $HOME, backup copies are kept and made available in the special subdirectory $HOME/.snapshot/.

Several snapshots are available:
 
File system     Time of snapshot                      Number of snapshots retained   How to access
/home/hlrb2     daily at 3:15, 9:15, 15:15, 21:15     4                              $HOME/.snapshot/6hrs.<date>_<time>/
/home/hlrb2     daily at 0:10                         10                             $HOME/.snapshot/daily.<date>_0010/
/home/cluster   daily at 3:00, 9:00, 15:00, 21:00     4                              $HOME/.snapshot/hourly.[0-3]/
/home/cluster   daily at 0:00                         10                             $HOME/.snapshot/nightly.[0-9]/

A file can be restored by simply copying the file from the appropriate snapshot directory to its original location. Please note:

  • The directory $HOME/.snapshot/ is not listed by ls or even ls -a and you cannot create it either. It is however possible to do a cd $HOME/.snapshot/ and then see all entries using the ls command.
  • When copying the snapshot file to its original location, some versions of cp might refuse to overwrite the original location (since it uses the same inode). In that case, copy the snapshot file to an alternative location and then move it to the original location.
  • There are no snapshots for the $OPT_TMP and $PROJECT file systems. You have to archive such data yourself if necessary.
  • Deleted files in your ordinary $HOME directory are still contained in the snapshot directories and count toward the volume quota. Because of the way snapshots work, space is reserved for old file versions; this reserve is three times the size of your project quota. For example, if your quota is 25 GB, there are 75 GB of "snapshot reserve" for changes. If you change or delete more than 75 GB of data within a 10-day interval, your project space may fill up, and even deleting files will not free any storage. Please contact the LRZ HPC support group if you run into problems with this mechanism; LRZ sysadmins can manually remove superfluous snapshots.
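
The copy-then-move workaround from the notes above can be wrapped in a small shell function. This is only a minimal sketch: the function name restore_from_snapshot and the temporary file suffix are illustrative choices, not tools provided by LRZ.

```shell
# Hypothetical helper (not an LRZ tool): restore a file from a snapshot
# by copying it to a temporary name first and then moving it into place.
# Usage: restore_from_snapshot <snapshot_file> <original_location>
restore_from_snapshot() {
    snap_file=$1
    target=$2
    # Stage the copy next to the target so that the final mv is a rename
    # within the same file system.
    tmp_file="${target}.restore.$$"
    cp -p "$snap_file" "$tmp_file" || return 1
    mv -f "$tmp_file" "$target"
}
```

On /home/cluster, a call might look like this (results.dat is a placeholder file name):
 restore_from_snapshot $HOME/.snapshot/nightly.0/results.dat $HOME/results.dat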

Tape backups

In addition to the snapshots described above, LRZ also maintains TSM tape backups. TSM tape backups are made less often but are retained longer:
 
File system      Time of tape backup   Number of file versions retained   Life time of unchanged files   Life time of backup for files removed from disk storage
/home/hlrb2/     Saturday 2:00         3                                  duration of the project        1 year
/home/cluster/   Saturday 22:50        3                                  duration of the project        1 year

If you cannot find the version of the data you need in the snapshot directories, please contact the LRZ HPC support group so that we can restore the data from the TSM tape backup for you. Users cannot access the TSM tape backup of $HOME directly.

Please note that there are no tape backups for the $OPT_TMP and $PROJECT file systems.
You need to archive data residing on these file systems yourself, at your own discretion.

Archiving and Retrieving

To archive and retrieve data on HLRB-II, the Tivoli Storage Management Infrastructure (TSM) is provided. A system-wide TSM client configuration is available, so you do not need to install or configure a TSM client yourself.

Archiving data with TSM

Let's assume you have a file myFile stored on the temporary filesystem $OPT_TMP. Since myFile may be automatically removed from $OPT_TMP by high-watermark deletion after some days, you might want to have an archive copy at hand. So here's how to create one. Go to $OPT_TMP and invoke
 dsmc archive myFile
If the file name contains spaces, you have to enclose it in double quotes, e.g. "this is my file". If you want to archive several files myFile1, myFile2, ..., you can use wildcards:
 dsmc archive myFile*
You can also archive complete directory trees. This can be achieved using an additional command-line option:
 dsmc archive MyDirectory/ -subdir=yes
Please note the trailing slash in the directory name. This slash is important since it ensures that dsmc interprets MyDirectory/ as a directory.
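
Since a forgotten trailing slash is easy to miss, the directory call can be wrapped so that the slash is always added. The following is a minimal sketch; the function names and the DSMC variable are assumptions made for this example (DSMC defaults to dsmc and can be set to echo for a dry run that only prints the arguments).

```shell
# Hypothetical wrapper around the dsmc call shown above.
# DSMC is an assumed variable used only in this sketch: it defaults to
# dsmc, and setting DSMC=echo prints the arguments instead of archiving.
with_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

archive_tree() {
    # Refuse an empty argument, which would otherwise turn into "/".
    [ -n "$1" ] || return 1
    dir=$(with_trailing_slash "$1")
    "${DSMC:-dsmc}" archive "$dir" -subdir=yes
}
```

A dry run shows the call that would be issued:
 DSMC=echo archive_tree MyDirectory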

Please also consult the section on Optimal usage of TSM.

Retrieving data with TSM

You can search for archived files in $OPT_TMP by:
 dsmc query archive $OPT_TMP/ -subdir=yes
Again the slash after $OPT_TMP is important to remind dsmc that $OPT_TMP is a directory. A file can be retrieved with the command:
 dsmc retrieve $OPT_TMP/myFile $OPT_TMP/myOldFile
If you omit the second file argument, the file will be restored under its original name. Of course, you can also retrieve complete directory trees:
 dsmc retrieve MyDirectory/ RetrievedDirectory/ -subdir=yes
This will restore the data in RetrievedDirectory/MyDirectory/. Again, directory or file names containing spaces have to be enclosed in double quotes.

Retrieving files on the Cluster Systems: Special cases

This subsection is of interest only to users of the Linux Cluster systems at LRZ.

Due to changes in the system configuration, it is necessary to use one of the following commands to retrieve files archived before April 14, 2009:

  • dsmc retrieve {/home}/cluster/$(id -g -n)/$USER/MyDirectory \
            $HOME/MyDirectory -subdir=yes  -se=HPCArchive
    
    for files archived after the restructuring of the HOME path names; please take note of the curly brackets around the /home path component.
  • dsmc retrieve /home/cluster/$USER/MyDirectory \
            $HOME/MyDirectory -subdir=yes  -se=HPCArchive
    
    for files archived before the restructuring of the HOME path names.
  • dsmc retrieve /lustre[_projects]/.../MyDirectory \
            /lustre[_projects]/.../MyDirectory -subdir=yes  -se=HPCArchive
    
    for files archived from the /lustre or /lustre_projects storage areas.

Retrieving files belonging to other users

To retrieve files that were archived by another user (even by another member of your group), the following steps are necessary:

a) the user who archived the data (e.g. h1100xx)  must execute the command

 h1100xx@a01:~>dsmc set access archive "*" <TSM_NODE> h0000yy

This grants user h0000yy access to all files ("*") archived on the TSM node <TSM_NODE> (the TSM node the archiving user is bound to). The value to enter for <TSM_NODE> is contained in the "servername" entry of the file $DSM_CONFIG; it will typically be of the form HLRBArchive_<number> (on HLRB-II) or LXCL_ARCHIVE_<number> (on the Cluster systems).

b) the other user must execute the commands

 h0000yy@a01:~>source set_dsm_config.sh h1100xx # for C shell: source set_dsm_config.csh h1100xx
 h0000yy@a01:~>dsmc retrieve -fromowner=h1100xx "<archived file>" "<local file>"
The sourced shell script sets the TSM configuration to point at the TSM archive valid for the other user. You can reset this to your own user account by running the script without arguments:
 h0000yy@a01:~>source set_dsm_config.sh
Otherwise, attempts to perform "normal" archiving or retrieving under your own account may fail.
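
For step a), the TSM node name can be read from $DSM_CONFIG instead of being looked up by hand. The helper below is a sketch under one assumption about the file format: that the entry is a line of the form "servername <value>", with the keyword in any letter case. The function name tsm_node_name is made up for this example.

```shell
# Hypothetical helper: print the value of the first "servername" entry
# of a TSM client option file (default: the file named by $DSM_CONFIG).
# Assumption: the entry is a line of the form "servername <value>",
# with the keyword in any letter case and whitespace as separator.
tsm_node_name() {
    awk 'tolower($1) == "servername" { print $2; exit }' "${1:-$DSM_CONFIG}"
}
```

The archiving user could then grant access with:
 dsmc set access archive "*" "$(tsm_node_name)" h0000yy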

Deletion of TSM archives

To prevent users from shooting themselves in the foot, deletion of archives has been disabled. If the same path name is archived several times, each archived version is tagged on the TSM server with at least its archiving date. By default, the most recently archived version is retrieved; if you specify the -pick option at retrieval, you are offered a choice of archived versions.

Optimal usage of TSM

To achieve better performance with TSM archive or retrieve jobs, you should consider the following guidelines.

Archiving/Retrieving large files

Use this procedure only if your archive files are larger than 1 GByte each.
Otherwise, first accumulate your files in tar archives that are larger than 1 GByte, or follow the procedure explained in the next subsection.

If you have multiple files to archive/retrieve, you should archive/retrieve more than one file per dsmc call. For large files, the optimum throughput is achieved with 4 files per dsmc call. For example, if you have 6 files to archive, call:

 dsmc ar file1 file2 file3 file4
 dsmc ar file5 file6
Archiving/retrieving more than 4 files with one dsmc call does not increase TSM performance anymore. Instead it may even lead to a slightly lower overall performance.
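
For a longer list of large files, the grouping into calls of at most 4 files can be scripted. The following is a minimal sketch; the function name archive_in_batches and the DSMC variable (defaults to dsmc, set to echo for a dry run) are assumptions made for this example.

```shell
# Hypothetical helper: pass the given files to "dsmc ar" in groups of
# at most 4 files per call, stopping at the first failing call.
# DSMC is an assumed variable for dry runs; it defaults to dsmc.
archive_in_batches() {
    while [ "$#" -gt 4 ]; do
        "${DSMC:-dsmc}" ar "$1" "$2" "$3" "$4" || return 1
        shift 4
    done
    # Archive the remaining 1 to 4 files, if any.
    if [ "$#" -gt 0 ]; then
        "${DSMC:-dsmc}" ar "$@"
    fi
}
```

With DSMC=echo, the call
 archive_in_batches file1 file2 file3 file4 file5 file6
prints the arguments of the two dsmc calls from the example above.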

Using file lists for many small files

If you want to archive/retrieve more than 100 files which are smaller than 1 GByte each, you should archive/retrieve them via a file list. To do so, create a file fileList.txt which contains the fully qualified path names of the files to archive, one per line. You must not use any wildcard characters, and if a file name contains spaces you must enclose the name in quotes. After that call:
 dsmc ar -filelist=fileList.txt
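
The file list itself can be generated with standard tools. The sketch below writes one fully qualified path per line and encloses names containing spaces in double quotes; the function name make_filelist is made up for this example, and the sketch assumes that file names contain neither newlines nor double quotes.

```shell
# Hypothetical helper: write the fully qualified names of all regular
# files below a directory to a list file, one per line. Names that
# contain spaces are enclosed in double quotes.
# Assumption: file names contain no newlines or double quotes.
# Write the list file outside the directory, or it will list itself.
# Usage: make_filelist <directory> <list_file>
make_filelist() {
    src_dir=$(cd "$1" && pwd) || return 1
    find "$src_dir" -type f | sort | sed '/ /s/.*/"&"/' > "$2"
}
```

For example (smallFiles is a placeholder directory name):
 make_filelist $OPT_TMP/smallFiles $OPT_TMP/fileList.txt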
It is worth mentioning that accumulating many small files into a few large files, using system tools like tar, is beneficial for TSM archive/retrieve performance. So, if possible, create a few large files and archive them using the procedure described in the previous subsection.
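
Accumulating a directory of small files into one tar file might look as follows. This is a minimal sketch; the function name bundle_dir is an illustrative choice, and whether the resulting tar file exceeds 1 GByte depends on your data.

```shell
# Hypothetical helper: pack a directory into a single tar file so that
# one large file is archived instead of many small ones.
# Usage: bundle_dir <directory> <tar_file>
bundle_dir() {
    parent=$(dirname "$1")
    name=$(basename "$1")
    # -C makes tar store the paths relative to the parent directory.
    tar -cf "$2" -C "$parent" "$name"
}
```

The tar file can then be archived like any other large file (run1 is a placeholder directory name):
 bundle_dir $OPT_TMP/run1 $OPT_TMP/run1.tar
 dsmc archive $OPT_TMP/run1.tar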

Reserving CPUs for archiving

TSM archive/retrieve performance can be further improved by dedicating CPUs to the TSM client. You can choose from two options:
  1. Using a batch job for archiving

    Of course it is also possible to write a small batch script, which contains the appropriate dsmc calls for archiving/retrieving. Using this method you should be able to archive/retrieve files with a size of up to at least 16 TB. If you intend to archive files larger than 16 TB please consult the LRZ HPC support group.

  2. Interactive batch job for archiving/retrieving on HLRB-II

    Log in to hlrb2 and initiate an interactive batch job by invoking

     qsub -I -l select=<number of cpus>
    
    Here <number of cpus> should be between 1 and 4, depending on the number of files you want to archive/retrieve in parallel. Using more than 4 CPUs does not increase TSM performance since the TSM server performs best with 4 parallel archive/retrieve jobs. Please keep in mind that interactive batch sessions are terminated after 4 hours. If you plan to archive/retrieve files larger than approximately 1 TB, you should consider using a normal script-driven batch job.
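
Inside such a batch job, the parallel dsmc calls could be started from the job script as sketched below. This is not an LRZ-provided script: the function name archive_parallel and the DSMC variable (defaults to dsmc, set to echo for a dry run) are assumptions made for this example, and the limit of 4 simply mirrors the recommendation above.

```shell
# Hypothetical helper for use inside a batch job: start one dsmc call
# per file in the background (at most 4, matching the limit given
# above) and wait for all of them.
# DSMC is an assumed variable for dry runs; it defaults to dsmc.
archive_parallel() {
    if [ "$#" -gt 4 ]; then
        echo "archive_parallel: at most 4 files per call" >&2
        return 2
    fi
    pids=""
    for f in "$@"; do
        "${DSMC:-dsmc}" ar "$f" &
        pids="$pids $!"
    done
    # Report failure if any of the background calls failed.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

In a job that requested 4 CPUs, a call such as
 archive_parallel file1 file2 file3 file4
would then run the four archive jobs side by side.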

TSM GUI

You can also use a graphical front end, dsmj, to archive and retrieve files. Note that it may be necessary to load a suitable Java environment module before using this facility. LRZ does not support this tool on its HPC platforms, so you use it at your own risk.