diff --git a/docs/data/big-data-tips-and-tricks.md b/docs/data/big-data-tips-and-tricks.md
new file mode 100644
index 00000000..be702366
--- /dev/null
+++ b/docs/data/big-data-tips-and-tricks.md
@@ -0,0 +1,58 @@
+# Big Data Tips and Tricks
+
+For **Big Data** sets, we recommend using the `boto3` library or the `s5cmd` tool.
+
+## Fewer big files are better than many small files
+
+When transferring **Big Data**, it is better to use fewer, bigger files instead of many small files.
+
+Be aware that when you transfer a lot of small files, the per-object overhead of the transfer process can become significant.
+You can save time and resources by packing the small files into a single big file and transferring it as one object (see the worked example at the end of this page).
+
+## Chunk size matters
+
+When transferring big files, the upload (or download) is split into chunks - so-called `multipart` uploads (or downloads).
+The size of these chunks can have a significant impact on the transfer speed.
+
+The optimal chunk size depends on the size of the files you are transferring and on the network conditions.
+
+There is no one-size-fits-all solution, so you should experiment with different chunk sizes to find the optimal one for your use case.
+We recommend starting with a chunk size of `file_size / 1000` (where `file_size` is the size of the file you are transferring).
+You can then adjust the chunk size based on the results of your experiments.
+
+## Cluster choice matters
+
+Some clusters offer a better network interface than others.
+
+When transferring big files, it is important to choose a cluster with a good network interface.
+One such cluster is `halmir`, whose machines offer a `10 Gbps` network interface.
+
+You can check the available clusters and their network interfaces on the [official website](https://metavo.metacentrum.cz/pbsmon2/nodes/physical) of MetaCentrum.
+
+## Hard disk speed does not matter
+
+Our research has shown that the speed of the hard disk does not have a significant impact on the transfer speed.
+
+When transferring big files, the network interface is the bottleneck, not the hard disk speed.
+
+Therefore, you do not need to worry about using `tmpfs` or a ramdisk when transferring big files.
+
+## Utilize compression
+
+When transferring big files, it is a good idea to utilize compression.
+
+You can compress the files before transferring them, effectively reducing the time and resources needed for the transfer.
+
+The choice of compression algorithm depends on the type of files you are transferring; there is no one-size-fits-all solution.
+We recommend using the `zstandard` algorithm, as it offers a good balance between compression ratio and decompression speed.
+Depending on the type of your files, you can also consider the `gzip`, `bzip2`, or `xz` algorithms.
+
+For more information about compression algorithms, please check this [comparison](https://quixdb.github.io/squash-benchmark/).
+
+## Use the right tool for the job
+
+When transferring big files, it is important to use the right tool for the job.
+
+If you are unsure which tool to use, we recommend checking the [Storage Department](storage-department.md) page, which contains a table of S3 service clients.
+
+In short, we recommend using the `boto3` library or the `s5cmd` tool for **Big Data** transfers.
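+
+## Putting it together
+
+The following is a minimal sketch that combines the tips above: it packs a directory of small files into a single `zstandard`-compressed tar archive and uploads it as one object with `boto3`, using a multipart chunk size derived from the archive size. It assumes the `boto3` and `zstandard` Python packages are installed; the source directory, archive name, and bucket name are placeholders, and the S3 keys are read from your `~/.aws/credentials` file.
+
+```python
+import os
+import tarfile
+
+import boto3
+import zstandard
+from boto3.s3.transfer import TransferConfig
+
+SRC_DIR = "results/"         # directory full of small files (placeholder)
+ARCHIVE = "results.tar.zst"  # the single big object to transfer (placeholder)
+BUCKET = "my-bucket"         # must already exist on the S3 storage (placeholder)
+
+# 1) Pack and compress the small files into one archive.
+with open(ARCHIVE, "wb") as raw:
+    with zstandard.ZstdCompressor(level=3).stream_writer(raw) as compressed:
+        with tarfile.open(fileobj=compressed, mode="w|") as tar:
+            tar.add(SRC_DIR)
+
+# 2) Start from a chunk size of roughly file_size / 1000, kept within S3 part-size limits.
+size = os.path.getsize(ARCHIVE)
+chunk = min(max(size // 1000, 8 * 1024 * 1024), 5 * 1024 ** 3)
+
+# 3) Upload the archive as a single multipart upload with the chosen chunk size.
+s3 = boto3.client("s3", endpoint_url="https://s3.cl4.du.cesnet.cz")
+s3.upload_file(
+    ARCHIVE,
+    BUCKET,
+    ARCHIVE,
+    Config=TransferConfig(multipart_threshold=chunk, multipart_chunksize=chunk),
+)
+```
+
+Treat the sketch as a starting point only - measure the transfer speed and tune the chunk size and compression level for your own data.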
diff --git a/docs/data/storage-department.md b/docs/data/storage-department.md
index d054a7c4..24162038 100644
--- a/docs/data/storage-department.md
+++ b/docs/data/storage-department.md
@@ -1,26 +1,57 @@
 # Storage Department services

-The CESNET Storage Department provides various types of data services. It is available to all users with **MetaCentrum login and password**.
+The CESNET Storage Department provides various types of data services.
+It is available to all users with a **MetaCentrum login and password**.

-Storage Department data policies will be described to a certain level at this page. For more detailed information, users should however navigate the [Storage Department documentation pages](https://docs.du.cesnet.cz).
+Storage Department data policies are described on this page only to a certain level.
+For more detailed information, please refer to the [Storage Department documentation pages](https://docs.du.cesnet.cz).

 !!! warning "Data storage technology in the Data Storage Department has changed by May 2024"
     For a long time the data were stored on hierarchical storage machines ("HSM" for short) with a directory structure accessible from `/storage/du-cesnet`.
    Due to a technological innovation of the operated systems, the HSM storages were disconnected and decommissioned. User data have been transferred to [machines with Object storage technology](https://docs.du.cesnet.cz/en/object-storage-s3/s3-service).
    Object storage is the successor of HSM with a slightly different set of commands, i.e. it **does not** work in the same way.

 ## Object storage

-S3 storage is available for all Metacentrum users. You can generate your credetials via [Gatekeeper service](https://access.du.cesnet.cz/#/). Where you will select your Metacentrum account and you should obtain your `access_key` and `secret_key`.
+S3 storage is available to all MetaCentrum users.
+You can generate your credentials via the [Gatekeeper service](https://access.du.cesnet.cz/#/).
+There you select your MetaCentrum account and obtain your `access_key` and `secret_key`.

 ### Simple storage - use when you commonly need to store your data

-You can use the S3 storage as simple storage to store your data. You can use your credentials to configure some of the supported S3 clients like s3cmd, s5cmd (large datasets) and rclone. The detailed tutorial for S3 client configuration can be found in the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients)
+You can use the S3 storage as simple storage for your data.
+Use your credentials to configure one of the supported S3 clients, such as `s3cmd`, `s5cmd` (large datasets), or `rclone`.
+A detailed tutorial for S3 client configuration can be found in the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients).

 ### Direct usage in the job file

-You can add s5cmd and rclone commands directly into your job file.
+You can add `s5cmd` and `rclone` commands directly into your job file.
+
 !!! warning "Bucket creation"
     Do not forget that the bucket being used for staging MUST exist on the remote S3 data storage. If you plan to stage out your data into a non-existing bucket, the job will fail. You need to prepare the bucket for stage-out in advance. You can use the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients) for a particular S3 client.

+### Big Data transfers
+
+For **Big Data** sets, we recommend using the `boto3` library or the `s5cmd` tool.
+
+For general tips and tricks regarding **Big Data** and the **CESNET S3 storage**, please visit the [Big Data Tips and Tricks](big-data-tips-and-tricks.md) page.
+
+### S3 service clients
+
+| Binary          | Source code language | Library         | Console usage | Python usage | Fit for Big Data transfers |
+|-----------------|----------------------|-----------------|---------------|--------------|----------------------------|
+| aws cli         | Python               | aws cli         | Yes           | Yes          | No                         |
+| s3cmd           | Python               | s3cmd           | Yes           | Yes          | No                         |
+| s4cmd           | Python               | [boto3](#boto3) | No            | Yes          | Yes                        |
+| [s5cmd](#s5cmd) | Go                   | --- ? ---       | Yes           | No           | Yes                        |
+
+For further details and more information about all the possible S3 clients, please refer to the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/docs/object-storage-s3/s3-service).
+
+#### boto3
+
+`boto3` is a **Python** library that allows you to interact with the S3 storage.
+You have to use it from your **Python** scripts - it is not a standalone tool like `s3cmd` or `s5cmd`.
+A minimal stage-in/stage-out sketch is shown at the end of this page.
+
+For more details and information about `boto3`, please check the [Data Storage guide](https://docs.du.cesnet.cz/en/docs/object-storage-s3/boto3).
+
 #### s5cmd
-To use s5cmd tool (preferred) you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.aws/credentials`.
+
+To use the `s5cmd` tool, you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.aws/credentials`.

 ```
 [profile-name]
@@ -32,7 +63,8 @@ multipart_threshold = 128MB
 multipart_chunksize = 32MB
 ```

-Then you can continue to use `s5cmd` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/s5cmd). Alternatively, you can directly add the following lines into your job file.
+Then you can continue to use `s5cmd` via the commands described in the [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/s5cmd).
+Alternatively, you can directly add the following lines into your job file.

 ```
 #define CREDDIR, where you stored your S3 credentials for, default is your home directory
@@ -47,7 +79,9 @@ s5cmd --credentials-file "${S3CRED}" --profile profile-name --endpoint-url=https

 #### rclone

-Alternatively, you can use rclone tool, which is less handy for large data sets. In case of large data sets (tens of terabytes) please use `s5cmd` above. For rclone you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.config/rclone/rclone.conf`.
+Alternatively, you can use the `rclone` tool, which is less handy for large data sets.
+For large data sets (tens of terabytes), please use `s5cmd` or `boto3`, mentioned above.
+For `rclone`, you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.config/rclone/rclone.conf`.

 ```
 [profile-name]
@@ -59,7 +93,8 @@ endpoint = s3.cl4.du.cesnet.cz
 acl = private
 ```

-Then you can continue to use `rclone` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/rclone). Or you can directly add following lines into your job file.
+Then you can continue to use `rclone` via the commands described in the [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/rclone).
+Or you can directly add the following lines into your job file.

 ```
 #define CREDDIR, where you stored your S3 credentials for, default is your home directory
@@ -71,4 +106,3 @@ rclone sync --progress --fast-list --config ${S3CRED} profile-name:my-bucket/h2o
 #stage out command for rclone
 rclone sync --progress --fast-list --config ${S3CRED} ${DATADIR}/h2o.out profile-name:my-bucket/
 ```
-
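+
+#### boto3 example
+
+The following is a minimal sketch of how `boto3` could be used for stage-in and stage-out from a Python script, as referenced in the [boto3](#boto3) section above. It assumes the `profile-name` credentials profile from `~/.aws/credentials` and the `s3.cl4.du.cesnet.cz` endpoint used on this page; the bucket and object names are placeholders that you need to adjust.
+
+```python
+import boto3
+
+# Re-use the profile created for s5cmd in ~/.aws/credentials.
+session = boto3.Session(profile_name="profile-name")
+s3 = session.client("s3", endpoint_url="https://s3.cl4.du.cesnet.cz")
+
+# Stage-in: download the input object from an existing bucket (placeholder names).
+s3.download_file("my-bucket", "input.dat", "input.dat")
+
+# ... run your computation here ...
+
+# Stage-out: upload the result back into the bucket (the bucket must already exist).
+s3.upload_file("output.dat", "my-bucket", "output.dat")
+```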