diff --git a/DESCRIPTION b/DESCRIPTION
index 4b4498b..f482d3e 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: AzureStor
 Title: Storage Management in 'Azure'
-Version: 2.1.1.9000
+Version: 3.0.0
 Authors@R: c(
     person("Hong", "Ooi", , "hongooi@microsoft.com", role = c("aut", "cre")),
     person("Microsoft", role="cph")
@@ -19,12 +19,10 @@ Imports:
     mime,
     openssl,
     xml2,
-    AzureRMR (>= 2.2.1)
+    AzureRMR (>= 2.3.0)
 Suggests:
     knitr,
     jsonlite,
     testthat
 Roxygen: list(markdown=TRUE)
 RoxygenNote: 6.1.1
-Remotes:
-    Azure/AzureRMR
diff --git a/NEWS.md b/NEWS.md
index f1ce06d..7255445 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,4 +1,4 @@
-# AzureStor 2.1.1.9000
+# AzureStor 3.0.0
 
 ## Significant user-visible changes
 
@@ -10,7 +10,7 @@
 - Significant changes to file storage methods for greater consistency with the other storage types:
   - The default directory for `list_azure_files` is now the root, mirroring the behaviour for blobs and ADLSgen2.
   - The output of `list_azure_files` now includes the full path as part of the file/directory name.
-  - Add `recursive` argument to file storage methods for recursing through subdirectories. Like above, for file storage this can be slow, so try to use a non-recursive solution where possible.
+  - Add `recursive` argument to `list_azure_files`, `create_azure_dir` and `delete_azure_dir` for recursing through subdirectories. Like with file transfers, for Azure file storage this can be slow, so try to use a non-recursive solution where possible.
 - Make output format for `list_adls_files`, `list_blobs` and `list_azure_files` more consistent. The first 2 columns for a data frame output are now always `name` and `size`; the size of a directory is NA. The 3rd column for non-blobs is `isdir` which is TRUE/FALSE depending on whether the object is a directory or file. Any additional columns remain storage type-specific.
 - New `get_storage_metadata` and `set_storage_metadata` methods for managing user-specified properties (metadata) for objects.
 - Revamped methods for getting standard properties, which are now all methods for `get_storage_properties` rather than having specific functions for blobs, files and directories.
diff --git a/R/blob_copyurl.R b/R/blob_copyurl.R
index 958cbba..c56c2ca 100644
--- a/R/blob_copyurl.R
+++ b/R/blob_copyurl.R
@@ -32,7 +32,7 @@ multicopy_url_to_storage.blob_container <- function(container, src, dest, ...)
 #' @param async For `copy_url_to_blob` and `multicopy_url_to_blob`, whether the copy operation should be asynchronous (proceed in the background).
 #' @details
-#' `copy_url_to_blob` transfers the contents of the file at the specified HTTP\[S\] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob1 does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
+#' `copy_url_to_blob` transfers the contents of the file at the specified HTTP\[S\] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob` does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
 #' @rdname blob
 #' @export
 copy_url_to_blob <- function(container, src, dest, lease=NULL, async=FALSE)
@@ -69,12 +69,12 @@ multicopy_url_to_blob <- function(container, src, dest, lease=NULL, async=FALSE,
         stop("'dest' must contain one name per file in 'src'", call.=FALSE)
 
     if(n_src == 1)
-        return(copy_url_to_blob(container, src, dest, ...))
+        return(copy_url_to_blob(container, src, dest, lease=lease, async=async))
 
     init_pool(max_concurrent_transfers)
 
     pool_export("container", envir=environment())
-    pool_map(function(s, d, ...) AzureStor::copy_url_to_blob(container, s, d, ...),
+    pool_map(function(s, d, lease, async) AzureStor::copy_url_to_blob(container, s, d, lease=lease, async=async),
              src, dest, MoreArgs=list(lease=lease, async=async))
     invisible(NULL)
 }
diff --git a/README.md b/README.md
index 9d1b7b7..750dfc9 100644
--- a/README.md
+++ b/README.md
@@ -59,14 +59,15 @@ These functions for working with objects within a storage container:
 - `delete_storage_file`: delete a file or blob
 - `storage_upload`/`storage_download`: transfer a file to or from a storage container
 - `storage_multiupload`/`storage_multidownload`: transfer multiple files in parallel to or from a storage container
-
+- `get_storage_properties`: Get properties for a storage object
+- `get_storage_metadata`/`set_storage_metadata`: Get and set user-defined metadata for a storage object
 
 ```r
 # example of working with files and directories (ADLSgen2)
 cont <- storage_container(ad_end_tok, "myfilesystem")
 list_storage_files(cont)
 create_storage_dir(cont, "newdir")
-storage_download(cont, "/readme.txt", "~/readme.txt")
+storage_download(cont, "/readme.txt")
 storage_multiupload(cont, "N:/data/*.*", "newdir") # uploading everything in a directory
 ```
@@ -76,7 +77,7 @@ AzureStor includes a number of extra features to make transferring files efficie
 ### Parallel connections
 
-As noted above, you can transfer multiple files in parallel using the `multiupload_*`/`multidownload_*` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
+As noted above, you can transfer multiple files in parallel using the `storage_multiupload/download` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
 
 ```r
 # uploading/downloading multiple files at once: use a wildcard to specify files to transfer
 storage_multiupload(cont, src="N:/logfiles/*.zip")
 storage_multidownload(cont, src="/monthly/jan*.*", dest="~/data/january")
@@ -86,22 +87,7 @@
 # or supply a vector of file specs as the source and destination
 src <- c("file1.csv", "file2.csv", "file3.csv")
 dest <- file.path("data/", src)
-storage_multiupload(cont, src, dest)
-```
-
-You can also use the process pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
-
-```r
-files_to_delete <- list_storage_files(cont, "datadir", info="name")
-
-# initialise the background pool with 10 nodes
-AzureRMR::init_pool(10)
-
-# export the container object to the nodes
-AzureRMR::pool_export("cont")
-
-# delete the files
-AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
+storage_multiupload(cont, src=src, dest=dest)
 ```
 
 ### Transfer to and from connections
@@ -120,7 +106,7 @@ storage_upload(cont, src=con, dest="iris.rds")
 
 # downloading files into memory: as a raw vector with dest=NULL, and via a connection
 rawvec <- storage_download(cont, src="iris.json", dest=NULL)
-rawToChar(rawConnectionValue(rawvec))
+rawToChar(rawvec)
 
 con <- rawConnection(raw(0), "r+")
 storage_download(cont, src="iris.rds", dest=con)
diff --git a/man/blob.Rd b/man/blob.Rd
index 8324f37..e30b1b5 100644
--- a/man/blob.Rd
+++ b/man/blob.Rd
@@ -83,7 +83,7 @@ Upload, download, or delete a blob; list blobs in a container.
 \code{upload_blob} and \code{download_blob} can display a progress bar to track the file transfer. You can control whether to display this with \code{options(azure_storage_progress_bar=TRUE|FALSE)}; the default is TRUE.
 
-\code{copy_url_to_blob} transfers the contents of the file at the specified HTTP[S] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob1 does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
+\code{copy_url_to_blob} transfers the contents of the file at the specified HTTP[S] URL directly to blob storage, without requiring a temporary local copy to be made. \code{multicopy_url_to_blob} does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
 }
 \examples{
 \dontrun{
diff --git a/tests/testthat/test02a_blobext.R b/tests/testthat/test02a_blobext.R
index d659b51..4c470b8 100644
--- a/tests/testthat/test02a_blobext.R
+++ b/tests/testthat/test02a_blobext.R
@@ -144,7 +144,7 @@ test_that("Blob multicopy from URL works",
     contname <- paste0(sample(letters, 10, TRUE), collapse="")
     cont <- create_blob_container(bl, contname)
 
-    fnames <- c("DESCRIPTION", "LICENSE", "NAMESPACE")
+    fnames <- c("LICENSE", "LICENSE.md", "CONTRIBUTING.md")
     src_urls <- paste0("https://raw.githubusercontent.com/Azure/AzureStor/master/", fnames)
     origs <- paste0("../../", fnames)
     dests <- c(tempfile(), tempfile(), tempfile())
diff --git a/tests/testthat/test05_generics.R b/tests/testthat/test05_generics.R
index 21ddb8f..128cb6b 100644
--- a/tests/testthat/test05_generics.R
+++ b/tests/testthat/test05_generics.R
@@ -129,7 +129,7 @@ test_that("Blob copy from URL works",
     # use readLines to workaround GH auto-translating CRLF -> LF
     expect_identical(readLines(orig_file), readLines(new_file))
 
-    fnames <- c("DESCRIPTION", "LICENSE", "NAMESPACE")
+    fnames <- c("LICENSE", "LICENSE.md", "CONTRIBUTING.md")
     src_urls <- paste0("https://raw.githubusercontent.com/Azure/AzureStor/master/", fnames)
     origs <- paste0("../../", fnames)
     dests <- c(tempfile(), tempfile(), tempfile())
diff --git a/vignettes/intro.rmd b/vignettes/intro.rmd
index d2a2594..67c251f 100644
--- a/vignettes/intro.rmd
+++ b/vignettes/intro.rmd
@@ -62,13 +62,12 @@
 - `storage_upload`/`storage_download`: transfer a file to or from a storage container
 - `storage_multiupload`/`storage_multidownload`: transfer multiple files in parallel to or from a storage container
-
 
 ```r
 # example of working with files and directories (ADLSgen2)
 cont <- storage_container(ad_end_tok, "myfilesystem")
 list_storage_files(cont)
 create_storage_dir(cont, "newdir")
-storage_download(cont, "/readme.txt", "~/readme.txt")
+storage_download(cont, "/readme.txt")
 storage_multiupload(cont, "N:/data/*.*", "newdir") # uploading everything in a directory
 ```
@@ -78,7 +77,7 @@ AzureStor includes a number of extra features to make transferring files efficie
 ### Parallel connections
 
-As noted above, you can transfer multiple files in parallel using the `multiupload_*`/`multidownload_*` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
+The `storage_multiupload/download` functions transfer multiple files in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
 
 ```r
 # uploading/downloading multiple files at once: use a wildcard to specify files to transfer
 storage_multiupload(cont, src="N:/logfiles/*.zip")
 storage_multidownload(cont, src="/monthly/jan*.*", dest="~/data/january")
@@ -91,21 +90,6 @@
 # or supply a vector of file specs as the source and destination
 src <- c("file1.csv", "file2.csv", "file3.csv")
 dest <- file.path("data/", src)
 storage_multiupload(cont, src, dest)
 ```
 
-You can also use the process pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
-
-```r
-files_to_delete <- list_storage_files(cont, "datadir", info="name")
-
-# initialise the background pool with 10 nodes
-AzureRMR::init_pool(10)
-
-# export the container object to the nodes
-AzureRMR::pool_export("cont")
-
-# delete the files
-AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
-```
-
 ### Transfer to and from connections
 
 You can upload a (single) in-memory R object via a _connection_, and similarly, you can download a file to a connection, or return it as a raw vector. This lets you transfer an object without having to create a temporary file as an intermediate step.
@@ -122,7 +106,7 @@ storage_upload(cont, src=con, dest="iris.rds")
 
 # downloading files into memory: as a raw vector with dest=NULL, and via a connection
 rawvec <- storage_download(cont, src="iris.json", dest=NULL)
-rawToChar(rawConnectionValue(rawvec))
+rawToChar(rawvec)
 
 con <- rawConnection(raw(0), "r+")
 storage_download(cont, src="iris.rds", dest=con)
@@ -165,6 +149,52 @@ For more information, see the [AzCopy repo on GitHub](https://github.com/Azure/a
 
 **Note that AzureStor uses AzCopy version 10. It is incompatible with versions 8.1 and earlier.**
 
+## Other features
+
+### Parallel connections
+
+The `storage_multiupload/download` functions mentioned above use a background process pool supplied by AzureRMR. You can also use this pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
+
+```r
+files_to_delete <- list_storage_files(cont, "datadir", info="name")
+
+# initialise the background pool with 10 nodes
+AzureRMR::init_pool(10)
+
+# export the container object to the nodes
+AzureRMR::pool_export("cont")
+
+# delete the files
+AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
+```
+
+### Metadata
+
+To get and set user-defined properties (metadata) for storage objects, use the `get_storage_metadata` and `set_storage_metadata` functions.
+
+```r
+fs <- storage_container("https://mystorage.dfs.core.windows.net/myshare", key="access_key")
+storage_upload(fs, "iris.csv", "newdir/iris.csv")
+
+set_storage_metadata(fs, "newdir/iris.csv", name1="value1")
+# will be list(name1="value1")
+get_storage_metadata(fs, "newdir/iris.csv")
+
+set_storage_metadata(fs, "newdir/iris.csv", name2="value2")
+# will be list(name1="value1", name2="value2")
+get_storage_metadata(fs, "newdir/iris.csv")
+
+set_storage_metadata(fs, "newdir/iris.csv", name3="value3", keep_existing=FALSE)
+# will be list(name3="value3")
+get_storage_metadata(fs, "newdir/iris.csv")
+
+# deleting all metadata
+set_storage_metadata(fs, "newdir/iris.csv", keep_existing=FALSE)
+
+# if no filename supplied, get/set metadata for the container
+get_storage_metadata(fs)
+```
+
 ## Admin interface
 
 Finally, AzureStor's admin-side interface allows you to easily create and delete resource accounts, as well as obtain access keys and generate a SAS. Here is a sample workflow: