Quick start guide
This quick start guide shows how the dtool command line tool can be used to accomplish some common data management tasks.
Organising files into a dataset on local disk
In this scenario one simply wants to organise one or more files into a dataset in the file system on the local computer.
When working on local disk a dataset is simply a standardised directory layout combined with some hidden files used to annotate the dataset and its items.
The first step is to create a “proto” dataset. The command below creates a dataset named fishers-iris-data in the current working directory.
$ dtool create fishers-iris-data
One can now add files to the dataset by moving/copying them into the fishers-iris-data/data directory, or by using the built-in dtool add item command. In the example below the file iris.csv is added to the proto dataset.
$ touch iris.csv
$ dtool add item iris.csv fishers-iris-data
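At this point the proto dataset has the standardised layout mentioned above. It looks roughly like this (an illustrative sketch; the exact set of hidden administrative files depends on the dtool version):

```
fishers-iris-data/
├── README.yml   # descriptive metadata about the dataset
├── data/        # the data items themselves
│   └── iris.csv
└── .dtool/      # hidden administrative files used by dtool
```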
Metadata describing the data is as important as the data itself. Metadata describing the dataset is stored in the file fishers-iris-data/README.yml. An easy way to add content to this file is to use the dtool readme interactive command, which will prompt for input regarding the dataset.
$ dtool readme interactive fishers-iris-data
description [Dataset description]: Fisher's classic iris data, but with an empty file :(
project [Project name]: dtool demo
confidential [False]:
personally_identifiable_information [False]:
name [Your Name]: Tjelvar Olsson
email [olssont@nbi.ac.uk]:
username [olssont]:
creation_date [2017-10-06]:
Updated readme
To edit the readme using your default editor:
$ dtool readme edit fishers-iris-data
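Based on the interactive session above, the resulting README.yml would contain content along these lines (a sketch; the exact keys and formatting produced by dtool may differ):

```yaml
description: "Fisher's classic iris data, but with an empty file :("
project: dtool demo
confidential: false
personally_identifiable_information: false
owners:
  - name: Tjelvar Olsson
    email: olssont@nbi.ac.uk
    username: olssont
creation_date: 2017-10-06
```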
Finally, to convert the proto dataset into a dataset one uses the dtool freeze command.
$ dtool freeze fishers-iris-data
Generating manifest [####################################] 100% iris.csv
Dataset frozen fishers-iris-data
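Conceptually, freezing generates a manifest describing every item in the data directory. The sketch below is plain Python, not the dtool API; the identifier scheme (SHA-1 hash of the item's relative path) is an assumption based on how dtool documents its item identifiers:

```python
import hashlib
import os


def build_manifest(data_dir):
    """Map an identifier to each file's relative path and size.

    Mimics the shape of a dataset manifest: every file under data_dir
    becomes an item, identified by the SHA-1 hash of its relative path.
    """
    items = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            relpath = os.path.relpath(path, data_dir)
            identifier = hashlib.sha1(relpath.encode("utf-8")).hexdigest()
            items[identifier] = {
                "relpath": relpath,
                "size_in_bytes": os.path.getsize(path),
            }
    return items
```

Identifying items by a hash of their relative path means the manifest is stable across storage backends that preserve the data directory layout.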
Copying data from an external hard drive to remote storage as a dataset
Genome sequencing generates large volumes of data, which are often sent from the sequencing company to the user by posting an external hard drive. When backing up such data on a remote storage system one does not want to have to reorganise the data before copying it to the remote storage system.
In this case one can create a “symlink” dataset and copy that to the remote storage. A symlink dataset is a dataset where the data directory is a symlink to another location, for example the data directory on the external hard drive.
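The idea can be illustrated in plain Python (this is not the dtool implementation, and the paths are hypothetical stand-ins): because the data directory is just a symbolic link, the dataset can be assembled without moving or copying any files off the hard drive.

```python
import os
import tempfile

# Hypothetical stand-ins for the dataset directory and the hard drive.
dataset_dir = tempfile.mkdtemp(prefix="bgi-sequencing-12345-")
external_drive = tempfile.mkdtemp(prefix="external-hard-drive-")

# The "data" directory is a symlink into the external drive, so the
# items stay where they are while still appearing inside the dataset.
data_dir = os.path.join(dataset_dir, "data")
os.symlink(external_drive, data_dir)

assert os.path.islink(data_dir)
assert os.path.realpath(data_dir) == os.path.realpath(external_drive)
```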
$ dtool create bgi-sequencing-12345 --symlink-path /mnt/external-hard-drive
Again, adding metadata to the dataset is vital.
$ dtool readme interactive bgi-sequencing-12345
One can then convert the proto dataset into a dataset by “freezing” it.
$ dtool freeze bgi-sequencing-12345
It is now time to copy the dataset to the remote storage. The command below assumes that one has credentials set up to write to the Amazon S3 bucket dtool-demo. The command copies the local dataset to the S3 dtool-demo bucket.
$ dtool cp bgi-sequencing-12345 s3://dtool-demo/
The command above returns feedback on the URI used to identify the dataset in the remote storage, in this case s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16.
The URI used to identify the dataset uses the UUID of the dataset rather than the dataset’s name. This is to avoid name clashes in the object storage.
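The following sketch (plain Python, not the dtool API) shows why UUID-based URIs avoid clashes: two datasets that share a human-readable name still get distinct URIs because each carries its own UUID.

```python
import uuid


def dataset_uri(bucket_uri, dataset_uuid):
    """Construct an object-storage URI from a bucket URI and a dataset UUID."""
    return "{}/{}".format(bucket_uri.rstrip("/"), dataset_uuid)


# Two datasets both named "bgi-sequencing-12345" would collide if the
# name were used as the key; their UUIDs keep the URIs distinct.
uri_a = dataset_uri("s3://dtool-demo", uuid.uuid4())
uri_b = dataset_uri("s3://dtool-demo", uuid.uuid4())
assert uri_a != uri_b
```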
Finally, one may want to confirm that the data transfer was successful. This can be achieved using the dtool diff command, which should show no differences if the transfer was successful.
$ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16
By default only identifiers and file sizes are compared. To check file hashes make use of the --full option.
Warning
When comparing datasets, identifiers, sizes and hashes are compared. When checking that the hashes are identical, the hashes of the first dataset are recalculated using the hashing algorithm of the reference dataset (the second). If the dataset in S3 had been specified as the first argument, all of its files would have had to be downloaded to local disk before their hashes could be calculated, which would have made the command slower.
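The comparison logic can be sketched as follows (plain Python, not dtool's implementation): identifiers are compared first, then sizes, and the more expensive hash comparison only happens on request.

```python
def diff_manifests(manifest_a, manifest_b, full=False):
    """Return a list of (identifier, property) pairs that differ.

    Each manifest maps an item identifier to a dict with a
    "size_in_bytes" key and, optionally, a "hash" key. Hashes are only
    compared when full=True, since recomputing them can be expensive.
    """
    differences = []
    # Items present in one dataset but not the other.
    for identifier in set(manifest_a) ^ set(manifest_b):
        differences.append((identifier, "missing"))
    # Properties of items present in both datasets.
    for identifier in set(manifest_a) & set(manifest_b):
        size_a = manifest_a[identifier]["size_in_bytes"]
        size_b = manifest_b[identifier]["size_in_bytes"]
        if size_a != size_b:
            differences.append((identifier, "size"))
        elif full and manifest_a[identifier].get("hash") != manifest_b[identifier].get("hash"):
            differences.append((identifier, "hash"))
    return differences
```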
Copying a dataset from remote storage to local disk
After having copied a dataset to a remote storage system one may have deleted the copy on the local disk. In this case one may want to be able to get the dataset back onto the local disk.
This can be achieved using the dtool cp command. The command below copies the dataset from S3 to the current working directory.
$ dtool cp s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16 ./
Note that on the local disk the dataset will use the name of the dataset rather than the UUID, in this example bgi-sequencing-12345.
Again, one can verify the data transfer using the dtool diff command.
$ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16