diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md b/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
index fdf366f95d34b..037212e79a094 100644
--- a/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
+++ b/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
@@ -69,7 +69,7 @@ with Hierarchical Namespaces.
 ## Hierarchical Namespaces (and WASB Compatibility)
 A key aspect of ADLS Gen 2 is its support for
-[hierachical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace)
+[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace)
 These are effectively directories and offer high performance rename and delete
 operations —something which makes a significant improvement in performance in
 query engines writing data to, including MapReduce, Spark, Hive, as well as
 DistCp.
@@ -297,7 +297,7 @@ This is shown in the Authentication section.
 ## Authentication
-Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios).
+Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios) (now Microsoft Entra ID).
 The concepts covered there are beyond the scope of this document to cover;
 developers are expected to have read and understood the concepts therein
@@ -332,7 +332,7 @@ possible
 ### AAD Token fetch retries
-The exponential retry policy used for the AAD token fetch retries can be tuned
+The exponential retry policy used for the AAD (now Entra ID) token fetch retries can be tuned
 with the following configurations.
 * `fs.azure.oauth.token.fetch.retry.max.retries`: Sets the maximum number of
 retries. Default value is 5.
@@ -652,8 +651,7 @@ CustomDelegationTokenManager interface.
   <value>{fully-qualified-class-name-for-implementation-of-CustomDelegationTokenManager-interface}</value>
 </property>
 ```
-In case delegation token is enabled, and the config `fs.azure.delegation.token
-.provider.type` is not provided then an IlleagalArgumentException is thrown.
+If delegation token is enabled and the config `fs.azure.delegation.token.provider.type` is not provided, an IllegalArgumentException is thrown.
 ### Shared Access Signature (SAS) Token Provider
@@ -663,7 +662,7 @@ To know more about how SAS Authentication works refer to
 [Grant limited access to Azure Storage resources using shared access signatures (SAS)](https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview)
 There are three types of SAS supported by Azure Storage:
-- [User Delegation SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-user-delegation-sas): Recommended for use with ABFS Driver with HNS Enabled ADLS Gen2 accounts. It is Identity based SAS that works at blob/directory level)
+- [User Delegation SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-user-delegation-sas): Recommended for use with ABFS Driver with HNS Enabled ADLS Gen2 accounts. It is an identity-based SAS that works at blob/directory level.
 - [Service SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas): Global and works at container level.
 - [Account SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-account-sas): Global and works at account level.
@@ -754,16 +753,16 @@ requests. User can specify them as fixed SAS Token to be used across all the req
 ```
- 1. Fixed SAS Token:
-    ```xml
-    <property>
-      <name>fs.azure.sas.fixed.token</name>
-      <value>FIXED_SAS_TOKEN</value>
-    </property>
-    ```
+ 2. Account SAS (Fixed SAS Token at Account Level):
+    ```xml
+    <property>
+      <name>fs.azure.sas.fixed.token</name>
+      <value>FIXED_SAS_TOKEN</value>
+    </property>
+    ```
- Replace `FIXED_SAS_TOKEN` with fixed Account/Service SAS. You can also
-generate SAS from Azure portal. Account -> Security + Networking -> Shared Access Signature
+   - Replace `FIXED_SAS_TOKEN` with fixed Account/Service SAS. You can also
+     generate SAS from Azure portal. Account -> Security + Networking -> Shared Access Signature
 - **Security**: Account/Service SAS requires account keys to be used which makes
 them less secure. There is no scope of having delegated access to different users.
@@ -864,16 +863,16 @@ Azure OAuth tokens. Consult the source in
 `org.apache.hadoop.fs.azurebfs.extensions`
 and all associated tests to see how to make use of these extension points.
-_Warning_ These extension points are unstable.
+_Warning_: These extension points are unstable.
 ### Networking Layer:
 ABFS Driver can use the following networking libraries:
 - ApacheHttpClient:
   - Library Documentation.
-  - Default networking library.
 - JDK networking library:
   - Library documentation.
+  - Default networking library.
 The networking library can be configured using the configuration
 `fs.azure.networking.library` while initializing the filesystem.
@@ -1007,13 +1006,13 @@ greater than or equal to 0.
 retries of IO operations. Currently this is used only for the server call retry
 logic. Used within `AbfsClient` class as part of the ExponentialRetryPolicy. This
 value indicates the smallest interval (in milliseconds) to wait before retrying
-an IO operation. The default value is 3000 (3 seconds).
+an IO operation. The default value is 500 milliseconds.
 `fs.azure.io.retry.max.backoff.interval`: Sets the maximum backoff interval for
 retries of IO operations. Currently this is used only for the server call retry
 logic. Used within `AbfsClient` class as part of the ExponentialRetryPolicy. This
 value indicates the largest interval (in milliseconds) to wait before retrying
-an IO operation. The default value is 30000 (30 seconds).
+an IO operation. The default value is 25000 (25 seconds).
 `fs.azure.io.retry.backoff.interval`: Sets the default backoff interval for
 retries of IO operations. Currently this is used only for the server call retry
@@ -1023,7 +1022,7 @@ value. This random delta is then multiplied by an exponent of the current IO
 retry number (i.e., the default is multiplied by `2^(retryNum - 1)`) and then
 contstrained within the range of [`fs.azure.io.retry.min.backoff.interval`,
 `fs.azure.io.retry.max.backoff.interval`] to determine the amount of time to
-wait before the next IO retry attempt. The default value is 3000 (3 seconds).
+wait before the next IO retry attempt. The default value is 500 milliseconds.
 `fs.azure.write.request.size`: To set the write buffer size. Specify the value
 in bytes. The value should be between 16384 to 104857600 both inclusive (16 KB
@@ -1361,9 +1360,9 @@ Operation failed: "Server failed to authenticate the request.
 Causes include:
 * Your credentials are incorrect.
-* Your shared secret has expired. in Azure, this happens automatically
+* Your shared secret has expired. In Azure, this happens automatically.
 * Your shared secret has been revoked.
-* host/VM clock drift means that your client's clock is out of sync with the
+* Host/VM clock drift means that your client's clock is out of sync with the
 Azure servers —the call is being rejected as it is either out of date (considered a replay)
 or from the future. Fix: Check your clocks, etc.
diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/blobEndpoint.md b/hadoop-tools/hadoop-azure/src/site/markdown/blobEndpoint.md
index 07c499cea5db8..5116ded54f09d 100644
--- a/hadoop-tools/hadoop-azure/src/site/markdown/blobEndpoint.md
+++ b/hadoop-tools/hadoop-azure/src/site/markdown/blobEndpoint.md
@@ -15,7 +15,7 @@
 # Azure Blob Storage REST API (Blob Endpoint)
 ## Introduction
-The REST API for Blob Storage defines HTTP operations against the storage account, containers(filesystems), and blobs.(files)
+The REST API for Blob Storage defines HTTP operations against the storage account, containers (filesystems), and blobs (files).
 The API includes the operations listed in the following table.
 | Operation | Resource Type | Description |
@@ -27,8 +27,8 @@ The API includes the operations listed in the following table.
 | [List Blobs](#list-blobs) | Filesystem | Lists the paths under the specified directory inside container acting as hadoop filesystem. |
 | [Put Blob](#put-blob) | Path | Creates a new path or updates an existing path under the specified filesystem (container). |
 | [Lease Blob](#lease-blob) | Path | Establishes and manages a lease on the specified path. |
-| [Put Block](#put-block) | Path | Appends Data to an already created blob at specified path. |
-| [Put Block List](#put-block-list) | Path | Flushes The Appended Data to the blob at specified path. |
+| [Put Block](#put-block) | Path | Appends data to an already created blob at specified path. |
+| [Put Block List](#put-block-list) | Path | Flushes the appended data to the blob at specified path. |
 | [Set Blob Metadata](#set-blob-metadata) | Path | Sets the user-defined attributes of the blob at specified path. |
 | [Get Blob Properties](#get-blob-properties) | Path | Gets the user-defined attributes of the blob at specified path. |
 | [Get Blob](#get-blob) | Path | Reads data from the blob at specified path. |
@@ -43,7 +43,7 @@ already exists, the operation fails.
 Rest API Documentation: [Create Container](https://docs.microsoft.com/en-us/rest/api/storageservices/create-container)
 ## Delete Container
-The Delete Container operation marks the specified container for deletion. The container and any blobs contained within it.
+The Delete Container operation marks the specified container and any blobs contained within it for deletion.
 Rest API Documentation: [Delete Container](https://docs.microsoft.com/en-us/rest/api/storageservices/delete-container)
 ## Set Container Metadata
@@ -67,7 +67,7 @@ Partial updates are not supported with Put Blob
 Rest API Documentation: [Put Blob](https://docs.microsoft.com/en-us/rest/api/storageservices/put-blob)
 ## Lease Blob
-The Lease Blob operation creates and manages a lock on a blob for write and delete operations. The lock duration can be 15 to 60 seconds, or can be infinite.
+The Lease Blob operation creates and manages a lock on a blob for file create, open-for-write, and rename operations. The lock duration can be 15 to 60 seconds, or can be infinite.
 Rest API Documentation: [Lease Blob](https://docs.microsoft.com/en-us/rest/api/storageservices/lease-blob)
 ## Put Block
@@ -104,4 +104,4 @@ Rest API Documentation: [Copy Blob](https://docs.microsoft.com/en-us/rest/api/st
 ## Append Block
 The Append Block operation commits a new block of data to the end of an existing append blob.
-Rest API Documentaion: [Append Block](https://learn.microsoft.com/en-us/rest/api/storageservices/append-block)
\ No newline at end of file
+Rest API Documentation: [Append Block](https://learn.microsoft.com/en-us/rest/api/storageservices/append-block)
\ No newline at end of file
diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/fns_blob.md b/hadoop-tools/hadoop-azure/src/site/markdown/fns_blob.md
index 27934b2e25aa5..f104dfd463e59 100644
--- a/hadoop-tools/hadoop-azure/src/site/markdown/fns_blob.md
+++ b/hadoop-tools/hadoop-azure/src/site/markdown/fns_blob.md
@@ -18,10 +18,10 @@
 The ABFS driver is recommended to be used only with HNS Enabled ADLS Gen-2
 accounts for big data analytics because of being more performant and scalable.
-However, to enable users of legacy WASB Driver to migrate to ABFS driver without
-needing them to upgrade their general purpose V2 accounts (HNS-Disabled), Support
+However, to allow users of legacy WASB Driver to migrate to ABFS driver without
+requiring them to upgrade their general purpose V2 accounts (HNS-Disabled), support
 for FNS accounts is being added to ABFS driver.
-Refer to [WASB Deprication](./wasb.html) for more details.
+Refer to [WASB Deprecation](./wasb.html) documentation for more details.
 ## Azure Service Endpoints Used by ABFS Driver
 Azure Services offers two set of endpoints for interacting with storage accounts:
@@ -38,7 +38,7 @@ HNS Enabled accounts will still use DFS Endpoint which continues to be the
 recommended stack based on performance and feature capabilities.
 ## Configuring ABFS Driver for FNS Accounts
-Following configurations will be introduced to configure ABFS Driver for FNS Accounts:
+Following configurations have been introduced to configure ABFS Driver for FNS Accounts:
 1. Account Type: Must be set to `false` to indicate FNS Account
    ```xml
    <property>
      <name>fs.azure.account.hns.enabled</name>
      <value>false</value>
    </property>
    ```
-2. Account Url: It is the URL used to initialize the file system. It is either passed
-directly to file system or configured as default uri using "fs.DefaultFS" configuration.
-In both the cases the URL used must be the blob endpoint url of the account.
+2. Account Url: It is the URL used to initialize the file system. It can either be passed
+   directly to the file system or configured as the default URI using the `fs.defaultFS` configuration.
+   In both cases the URL used must be the blob endpoint URL of the account.
    ```xml
    <property>
      <name>fs.defaultFS</name>
      <value>abfss://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net</value>
    </property>
    ```
-3. Service Type for FNS Accounts: This will allow an override to choose service
-type specially in cases where any local DNS resolution is set for the account and driver is
-unable to detect the intended endpoint from above configured URL. If this is set
-to blob for HNS Enabled Accounts, FS init will fail with InvalidConfiguration error.
+3. Service Type for FNS Accounts: This allows an override to choose the service
+   type especially in cases where local DNS resolution is set for the account and the driver is
+   unable to detect the intended endpoint from above configured URL.
+   If this is set to blob for HNS-enabled accounts, FS initialization will fail with an InvalidConfiguration error.
    ```xml
    <property>
      <name>fs.azure.fns.account.service.type</name>
      <value>BLOB</value>
    </property>
    ```
-4. Service Type for Ingress Operations: This will allow an override to choose service
-type only for Ingress Related Operations like [Create](./blobEndpoint.html#put-blob),
-[Append](./blobEndpoint.html#put-block),
-and [Flush](./blobEndpoint.html#put-block-list). All other operations will still use the
-configured service type.
+4. Service Type for Ingress Operations: This allows an override to choose the service
+   type only for ingress-related operations like [Create](./blobEndpoint.html#put-blob),
+   [Append](./blobEndpoint.html#put-block),
+   and [Flush](./blobEndpoint.html#put-block-list). All other operations will still use the
+   configured service type.
    ```xml
    <property>
      <name>fs.azure.ingress.service.type</name>
      <value>BLOB</value>
    </property>
    ```
@@ -106,40 +106,33 @@ The following configs are related to rename and delete operations.
 - `fs.azure.blob.copy.max.wait.millis`: Maximum time to wait for a blob copy
   operation to complete. The default value is 5 minutes.
-- `fs.azure.blob.atomic.rename.lease.refresh.duration`: Blob rename lease
-  refresh
+- `fs.azure.blob.atomic.rename.lease.refresh.duration`: Blob rename lease refresh
   duration in milliseconds. This setting ensures that the lease on the blob is
-  periodically refreshed during a rename operation to prevent other operations
+  periodically refreshed during a rename operation, preventing other operations
   from interfering. The default value is 60 seconds.
-- `fs.azure.blob.dir.list.producer.queue.max.size`: Maximum number of blob
-  entries
+- `fs.azure.blob.dir.list.producer.queue.max.size`: Maximum number of blob entries
   enqueued in memory for rename or delete orchestration. The default value is
   2 times the default value of list max results, which is 5000, making the
   current value 10000.
 - `fs.azure.blob.dir.list.consumer.max.lag`: It sets a limit on how much blob
   information can be waiting to be processed (consumer lag) during a blob
-  listing
-  operation. If the amount of unprocessed blob information exceeds this limit,
-  the
-  producer will pause until the consumer catches up and the lag becomes
+  listing operation. If the amount of unprocessed blob information exceeds this limit,
+  the producer will pause until the consumer catches up and the lag becomes
   manageable. The default value is equal to the value of default value of list
-  max
-  results which is 5000 currently.
+  max results which is 5000 currently.
 - `fs.azure.blob.dir.rename.max.thread`: Maximum number of threads per blob
-  rename
-  orchestration. The default value is 5.
+  rename orchestration. The default value is 5.
-- `fs.azure.blob.dir.delete.max.thread`: Maximum number of thread per
-  blob-delete
-  orchestration. The default value currently is 5.
+- `fs.azure.blob.dir.delete.max.thread`: Maximum number of threads per blob
+  delete orchestration. The default value currently is 5.
 ## Features currently not supported
-1. **User Delegation SAS** feature is currently not supported but we
+1. **User Delegation SAS** feature is currently not supported, but we
   plan to bring support for it in the future.
   Jira to track this workitem : https://issues.apache.org/jira/browse/HADOOP-19406.
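Taken together, the FNS-related settings documented in the `fns_blob.md` changes above amount to a small block of `core-site.xml` configuration. The sketch below is illustrative only: the property names and values come from the documentation itself, `CONTAINER_NAME` and `ACCOUNT_NAME` are placeholders, and the two service-type overrides are optional, needed only when the endpoint cannot be inferred from the URL.

```xml
<!-- Minimal core-site.xml sketch for an FNS (HNS-disabled) account on the blob endpoint. -->
<!-- Property names are taken from the documentation above; account/container names are placeholders. -->
<configuration>

  <!-- Mark the account as FNS (HNS disabled). -->
  <property>
    <name>fs.azure.account.hns.enabled</name>
    <value>false</value>
  </property>

  <!-- Initialize the filesystem against the account's blob endpoint URL. -->
  <property>
    <name>fs.defaultFS</name>
    <value>abfss://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net</value>
  </property>

  <!-- Optional overrides, useful when local DNS resolution hides the intended endpoint. -->
  <property>
    <name>fs.azure.fns.account.service.type</name>
    <value>BLOB</value>
  </property>
  <property>
    <name>fs.azure.ingress.service.type</name>
    <value>BLOB</value>
  </property>

</configuration>
```

The rename and delete tuning properties listed above (`fs.azure.blob.dir.rename.max.thread`, `fs.azure.blob.dir.delete.max.thread`, and the related list and lease settings) can be added to the same file, but the defaults described in the documentation are a reasonable starting point.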
diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/wasb.md b/hadoop-tools/hadoop-azure/src/site/markdown/wasb.md index 270fd14da4c44..e176a5d890139 100644 --- a/hadoop-tools/hadoop-azure/src/site/markdown/wasb.md +++ b/hadoop-tools/hadoop-azure/src/site/markdown/wasb.md @@ -16,7 +16,7 @@ ## Introduction WASB Driver is a legacy Hadoop File System driver that was developed to support -[FNS(FlatNameSpace) Azure Storage accounts](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) +[FNS (FlatNameSpace) Azure Storage accounts](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) that do not honor File-Folder syntax. HDFS Folder operations hence are mimicked at client side by WASB driver and certain folder operations like Rename and Delete can lead to a lot of IOPs with @@ -93,5 +93,4 @@ Refer to [ABFS Authentication](abfs.html/authentication) for more details. ### ABFS Features Not Available for migrating Users Certain features of ABFS Driver will be available only to users using HNS accounts with ABFS driver. -1. ABFS Driver's SAS Token Provider plugin for UserDelegation SAS and Fixed SAS. -2. Client Provided Encryption Key (CPK) support for Data ingress and egress. +1. Client Provided Encryption Key (CPK) support for Data ingress and egress.
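For users following the migration path described in the `wasb.md` changes above, the account itself is not upgraded; at the filesystem level the switch is largely a change of URI scheme against the same blob endpoint, plus the FNS account flag from the previous section. The snippet below is a hedged sketch rather than part of the patch: the container and account names are placeholders, and the authentication settings (account key, OAuth, or SAS) still need to be migrated separately as described in the ABFS authentication documentation.

```xml
<!-- Illustrative only: switching an existing FNS account from WASB to ABFS on the blob endpoint. -->
<!-- CONTAINER_NAME and ACCOUNT_NAME are placeholders. -->

<!-- Before: legacy WASB driver -->
<property>
  <name>fs.defaultFS</name>
  <value>wasbs://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net</value>
</property>

<!-- After: ABFS driver on the same account, still using the blob endpoint -->
<property>
  <name>fs.defaultFS</name>
  <value>abfss://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net</value>
</property>
<property>
  <name>fs.azure.account.hns.enabled</name>
  <value>false</value>
</property>
```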