4 Copy from AWS S3

Overview

AzCopy v10 (starting with the 10.0.9 release) supports copying data between two Blob storage services (two Azure Storage accounts) as well as from Amazon Web Services (AWS) S3 to Azure Blob storage. To do this, AzCopy uses the Put from URL API from the Azure Blob storage REST APIs, which directly copies a given chunk of publicly accessible data from a given URL to an Azure Blob storage account. For copying data from AWS S3, AzCopy enumerates all objects in a given bucket, creates pre-signed URLs, and then issues Put from URL requests for each object to copy the data to Azure. Note that the copy operation does not use the bandwidth of the machine where AzCopy is run, which makes the copy efficient and performant.

Authentication

AzCopy uses an AWS access key ID and secret access key to authenticate with AWS S3. For the destination Blob storage account, you can use any of the available authentication options (SAS token or Azure Active Directory authentication).
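
For example, on Linux or macOS you can export the credentials in the shell before running AzCopy (the key values below are placeholders; on Windows use set in cmd or $env: in PowerShell instead of export):

  export AWS_ACCESS_KEY_ID="<access-key-id>"
  export AWS_SECRET_ACCESS_KEY="<secret-access-key>"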

Examples

Run azcopy copy --help to see example commands. A complete end-to-end example is shown after the list below.

1. Copy a single object from AWS S3:

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the AWS S3 source.
  • Issue azcopy cp "https://s3.amazonaws.com/[bucket]/[object]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"

2. Copy a directory from AWS S3:

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the AWS S3 source.
  • azcopy cp "https://s3.amazonaws.com/[bucket]/[folder]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/directory]?[SAS]" --recursive=true
  • Refer to https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html for what a "folder" means in S3.

3. Copy a bucket to a Blob container:

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the AWS S3 source.
  • azcopy cp "https://s3.amazonaws.com/[bucket]" "https://[destaccount].blob.core.windows.net/[container]?[SAS]" --recursive=true

4. Copy all buckets in AWS S3 to an Azure Storage account:

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the AWS S3 source.
  • azcopy cp "https://s3.amazonaws.com/" "https://[destaccount].blob.core.windows.net?[SAS]" --recursive=true

5. Copy a bucket to an Azure Storage account:

  • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the AWS S3 source.
  • azcopy cp "https://s3.amazonaws.com/[bucket]/" "https://[destaccount].blob.core.windows.net/?[SAS]" --recursive=true
  • When the source is a bucket and the destination is an Azure Blob service endpoint, a container named [bucket] will be created in the destination storage account to store the data.
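
Putting these pieces together, a complete end-to-end run might look like the following sketch (the bucket name, storage account, container name and SAS token are placeholders):

  # Credentials for the AWS S3 source
  export AWS_ACCESS_KEY_ID="<access-key-id>"
  export AWS_SECRET_ACCESS_KEY="<secret-access-key>"

  # Copy an entire bucket into an existing Blob container
  azcopy cp "https://s3.amazonaws.com/mybucket" "https://mydestaccount.blob.core.windows.net/mycontainer?[SAS]" --recursive=true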

Remarks

a. Objects in AWS S3 with the suffix "/" are treated as folders

Similar to the behavior of the AWS S3 management console, AzCopy treats objects whose keys end with "/" as folders, and does not copy them as objects to Azure Blob storage.
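
To illustrate, a zero-byte object whose key ends with "/", such as one created with the AWS CLI command below (the bucket and folder names are placeholders), acts as a folder marker and is not copied as an object:

  # Creates the zero-byte folder marker "myfolder/" in the bucket
  aws s3api put-object --bucket mybucket --key "myfolder/"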

b. URL styles supported

AzCopy supports virtual-hosted-style and path-style URLs:

  1. Virtual-hosted-style URL

The bucket name is part of the domain name in the URL. Examples:

  a. http://bucket.s3.amazonaws.com
  b. http://bucket.s3-aws-region.amazonaws.com

  2. Path-style URL

Examples:

  a. http://s3.amazonaws.com/bucket (US East (N. Virginia) Region endpoint)
  b. http://s3-aws-region.amazonaws.com/bucket (Region-specific endpoint)

c. Bucket name resolving

AWS S3 has a different set of naming conventions for bucket names than Azure Blob containers, ADLS Gen2 filesystems, and file shares.

For Azure, container/filesystem/share names follow these rules:

  1. Lowercase letters, numbers, and hyphens only.
  2. 3-63 characters in length.
  3. The name must not contain two consecutive hyphens.
  4. The name must not start or end with a hyphen.

For AWS S3, bucket names follow these rules:

  1. The bucket name can be between 3 and 63 characters long, and can contain only lower-case characters, numbers, periods, and dashes.
  2. Each label in the bucket name must start with a lowercase letter or number.
  3. The bucket name cannot contain underscores, end with a dash or period, have consecutive periods, or use dashes adjacent to periods.
  4. The bucket name cannot be formatted as an IP address (198.51.100.24).

AzCopy will auto-resolve the following naming issues:

  1. Bucket names containing periods. AzCopy tries to replace each period with a hyphen, e.g. bucket.with.period -> bucket-with-period
  2. Bucket names containing consecutive hyphens. AzCopy tries to replace each run of consecutive hyphens with -[numberOfHyphens]-, e.g. bucket----hyphens -> bucket-4-hyphens

The resolver in AzCopy also checks for naming collisions with existing bucket names and tries to add a suffix when there is a collision. For example, given buckets named bucket-name and bucket.name, AzCopy will resolve bucket.name -> bucket-name -> bucket-name-2. All resolved names are logged as WARNINGs.
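
The following is a minimal sketch of the resolution rule described above, covering only the period and consecutive-hyphen cases (an illustration, not AzCopy's actual code):

  resolve_bucket_name() {
    # Replace periods with hyphens, then replace each run of two or more
    # hyphens with -[numberOfHyphens]-.
    printf '%s\n' "${1//./-}" | awk '{
      out = ""
      while (match($0, /--+/)) {
        out = out substr($0, 1, RSTART - 1) "-" RLENGTH "-"
        $0 = substr($0, RSTART + RLENGTH)
      }
      print out $0
    }'
  }

  resolve_bucket_name "bucket.with.period"   # -> bucket-with-period
  resolve_bucket_name "bucket----hyphens"    # -> bucket-4-hyphens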

Example:

INFO: Scanning...
INFO: trying to copy the source as bucket/folder/list of files
INFO: source is bucket and destination is an Azure service endpoint, bucket with name "test-7-special-name" will be created in destination to store data

Job 052cfd44-5677-bd43-7921-5e1ca876d7d3 has started
Log file is located at: C:\Users\jiacfan/.AzCopy/052cfd44-5677-bd43-7921-5e1ca876d7d3.log

0 Done, 0 Failed, 3 Pending, 0 Skipped, 3 Total,


Job 052cfd44-5677-bd43-7921-5e1ca876d7d3 summary
Elapsed Time (Minutes): 0.0667
Total Number Of Transfers: 3
Number of Transfers Completed: 3
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 3477801
Final Job Status: Completed

d. Object name handling

Please note that Azure Storage does not permit object names (or any segment of the virtual directory path) to end with trailing dots (e.g. dir1/dir2.../file or dir1/dir2/file...). The Storage service will trim away the trailing dots when the copy operation is performed.

e. Metadata handling

AWS S3 allows a different character set for object metadata keys than Azure Blob storage does. AzCopy provides the s2s-invalid-metadata-handle flag with 3 options for handling invalid metadata keys when transferring objects to Azure: ExcludeIfInvalid, FailIfInvalid, and RenameIfInvalid.
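
On the command line, the option is passed as a flag on the copy command, for example (the bucket, account, container and SAS token are placeholders):

  azcopy cp "https://s3.amazonaws.com/mybucket" "https://mydestaccount.blob.core.windows.net/mycontainer?[SAS]" --recursive=true --s2s-invalid-metadata-handle=FailIfInvalid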

  1. ExcludeIfInvalid

This is the default option. Metadata with an invalid key is excluded from the transfer, while the object itself is still copied to Azure. Use this option if you do not mind losing that metadata from the AWS S3 source. When metadata is excluded, the following event is logged as a WARNING and the file transfer succeeds:

2019/03/13 09:05:21 WARN: [P#0-T#0] METADATAWARNING: For source "https://s3.amazonaws.com/s2scopybucket/test_copy_single_file_from_s3_to_blob_excludeifinvalid", invalid metadata with keys '$%^' '1abc' are excluded

  2. FailIfInvalid

If you set FailIfInvalid for the s2s-invalid-metadata-handle flag, objects with invalid metadata keys will fail to transfer to Azure Blob storage. The failure is logged and included in the failed count in the transfer summary. Use this option if you would like to fix the objects with invalid metadata in the AWS S3 source. Once you have done that, you can restart the AzCopy job using the azcopy resume command to retry the failed objects.

2019/03/13 09:22:38 ERR: [P#0-T#0] COPYFAILED: <sourceURL> : metadata with keys '$%^' '1abc' in source is invalid

  3. RenameIfInvalid

If you set the RenameIfInvalid value for the s2s-invalid-metadata-handle flag, AzCopy automatically resolves the invalid metadata key and copies the object to Azure using the resolved metadata key-value pair. Use this option if you would rather fix the invalid metadata on the Azure side after moving all the data to Azure Storage. This lets you dispose of the contents of your AWS S3 bucket, since all of the information is preserved on the Azure Blob storage side.

The rename logic is as follows (a sketch is shown below):

  1. Replace every invalid character (i.e. any character except [0-9A-Za-z_]) with '_'.
  2. Add 'rename_' as a prefix to the new valid key; this key is used to save the original metadata value.
  3. Add 'rename_key_' as a prefix to the new valid key; this key is used to save the original metadata's invalid key. For example, the invalid metadata pair '123-invalid':'content' is resolved into two new key-value pairs: 'rename_123_invalid':'content' and 'rename_key_123_invalid':'123-invalid'.

You can then recover the metadata on the Azure side, since the original metadata key is preserved as a value on the Blob storage service. The transfer of the object fails if the rename operation fails.
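
The following is a minimal sketch of the rename rule (an illustration only, not AzCopy's implementation), using the '123-invalid' example from above:

  rename_invalid_key() {
    local key="$1" value="$2"
    # Replace every character outside [0-9A-Za-z_] with '_'
    local valid="${key//[^0-9A-Za-z_]/_}"
    # 'rename_<valid>' keeps the original value,
    # 'rename_key_<valid>' keeps the original (invalid) key
    printf 'rename_%s:%s\n' "$valid" "$value"
    printf 'rename_key_%s:%s\n' "$valid" "$key"
  }

  rename_invalid_key "123-invalid" "content"
  # rename_123_invalid:content
  # rename_key_123_invalid:123-invalid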

Performance

AzCopy v10 uses Put from URL APIs to coordinate the transfer of objects between Azure Storage accounts, and from AWS S3 to Azure Blob storage. This means the data is copied directly by the Azure Storage service from its source. By default, AzCopy schedules transfers of chunks (8 MB each) simultaneously across multiple connections, up to 8 x (the number of cores on your VM).

In our ideal-case benchmark setups, we were able to transfer large amounts of data from AWS S3 to Azure Blob storage at 50 Gbps (the default Azure Blob storage ingress limit). If you require a higher ingress limit, you can create a support ticket to have the limit raised.

To optimize performance, consider the following:

a. Minimize copy request latency by placing the VM that runs AzCopy as close as possible to the destination Storage account. For example, if the Storage account is in East US 2, use a VM in East US 2.

b. Try setting a higher number of concurrent connections using the environment variable AZCOPY_CONCURRENCY_VALUE, according to your machine's configuration and workload, as in the example below.
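
For example (the value 256 and the URLs are placeholders; tune the concurrency for your own machine and workload):

  # Raise the number of concurrent connections AzCopy uses, then run the copy
  export AZCOPY_CONCURRENCY_VALUE=256
  azcopy cp "https://s3.amazonaws.com/mybucket" "https://mydestaccount.blob.core.windows.net/mycontainer?[SAS]" --recursive=true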

Limitations

a. For AWS S3 to Azure Blob copies, S3 Versioning is not currently supported, i.e. only the latest version of each object is copied; previous versions are skipped.

b. The storage class of your S3 objects is not copied or mapped to a Blob access tier. Your data is copied using your Storage account's default settings.

c. AWS account access keys (i.e. access key ID and secret access key) are the only supported authentication option for AWS S3.

d. Please be aware of the naming incompatibilities explained in the Remarks section.