This commit is contained in:
Hong Ooi 2019-11-02 03:46:22 +11:00
Parent cda0178881
Commit f42675c450
8 changed files with 65 additions and 51 deletions

View file

@@ -1,6 +1,6 @@
Package: AzureStor
Title: Storage Management in 'Azure'
Version: 2.1.1.9000
Version: 3.0.0
Authors@R: c(
person("Hong", "Ooi", , "hongooi@microsoft.com", role = c("aut", "cre")),
person("Microsoft", role="cph")
@@ -19,12 +19,10 @@ Imports:
mime,
openssl,
xml2,
AzureRMR (>= 2.2.1)
AzureRMR (>= 2.3.0)
Suggests:
knitr,
jsonlite,
testthat
Roxygen: list(markdown=TRUE)
RoxygenNote: 6.1.1
Remotes:
Azure/AzureRMR

View file

@@ -1,4 +1,4 @@
# AzureStor 2.1.1.9000
# AzureStor 3.0.0
## Significant user-visible changes
@@ -10,7 +10,7 @@
- Significant changes to file storage methods for greater consistency with the other storage types:
- The default directory for `list_azure_files` is now the root, mirroring the behaviour for blobs and ADLSgen2.
- The output of `list_azure_files` now includes the full path as part of the file/directory name.
- Add `recursive` argument to file storage methods for recursing through subdirectories. Like above, for file storage this can be slow, so try to use a non-recursive solution where possible.
- Add `recursive` argument to `list_azure_files`, `create_azure_dir` and `delete_azure_dir` for recursing through subdirectories. Like with file transfers, for Azure file storage this can be slow, so try to use a non-recursive solution where possible.
- Make output format for `list_adls_files`, `list_blobs` and `list_azure_files` more consistent. The first 2 columns for a data frame output are now always `name` and `size`; the size of a directory is NA. The 3rd column for non-blobs is `isdir` which is TRUE/FALSE depending on whether the object is a directory or file. Any additional columns remain storage type-specific.
- New `get_storage_metadata` and `set_storage_metadata` methods for managing user-specified properties (metadata) for objects.
- Revamped methods for getting standard properties, which are now all methods for `get_storage_properties` rather than having specific functions for blobs, files and directories.
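As a rough illustrative sketch of the entries above (the storage account, share and file names here are hypothetical), the new arguments and methods might be used like this:
```r
# hypothetical file share; substitute your own account and access key
share <- storage_container("https://mystorage.file.core.windows.net/myshare", key="access_key")

# recursive listing: the first 2 columns are now always name and size, followed by isdir
list_azure_files(share, "/", recursive=TRUE)

# create and delete nested directories in one call
create_azure_dir(share, "outer/inner", recursive=TRUE)
delete_azure_dir(share, "outer", recursive=TRUE)

# user-defined metadata and standard properties for an existing file
set_storage_metadata(share, "dir/data.csv", category="test")
get_storage_metadata(share, "dir/data.csv")
get_storage_properties(share, "dir/data.csv")
```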

View file

@@ -32,7 +32,7 @@ multicopy_url_to_storage.blob_container <- function(container, src, dest, ...)
#' @param async For `copy_url_to_blob` and `multicopy_url_to_blob`, whether the copy operation should be asynchronous (proceed in the background).
#' @details
#' `copy_url_to_blob` transfers the contents of the file at the specified HTTP\[S\] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob1 does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
#' `copy_url_to_blob` transfers the contents of the file at the specified HTTP\[S\] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob` does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
#' @rdname blob
#' @export
copy_url_to_blob <- function(container, src, dest, lease=NULL, async=FALSE)
@@ -69,12 +69,12 @@ multicopy_url_to_blob <- function(container, src, dest, lease=NULL, async=FALSE,
stop("'dest' must contain one name per file in 'src'", call.=FALSE)
if(n_src == 1)
return(copy_url_to_blob(container, src, dest, ...))
return(copy_url_to_blob(container, src, dest, lease=lease, async=async))
init_pool(max_concurrent_transfers)
pool_export("container", envir=environment())
pool_map(function(s, d, ...) AzureStor::copy_url_to_blob(container, s, d, ...),
pool_map(function(s, d, lease, async) AzureStor::copy_url_to_blob(container, s, d, lease=lease, async=async),
src, dest, MoreArgs=list(lease=lease, async=async))
invisible(NULL)
}
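As an illustrative sketch of the copy-from-URL behaviour documented above (the storage account and container are placeholders; the source URLs point at the AzureStor repository, as in the package tests):
```r
# placeholder blob container; substitute your own account, key and container
cont <- storage_container("https://mystorage.blob.core.windows.net/mycontainer", key="access_key")

# single URL; async=TRUE starts the copy and returns without waiting for it to finish
copy_url_to_blob(cont,
    "https://raw.githubusercontent.com/Azure/AzureStor/master/DESCRIPTION",
    "DESCRIPTION", async=TRUE)

# multiple URLs at once; dest must contain one name per source URL
srcs <- paste0("https://raw.githubusercontent.com/Azure/AzureStor/master/",
               c("LICENSE", "NAMESPACE"))
multicopy_url_to_blob(cont, srcs, c("LICENSE", "NAMESPACE"))
```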

View file

@@ -59,14 +59,15 @@ These functions for working with objects within a storage container:
- `delete_storage_file`: delete a file or blob
- `storage_upload`/`storage_download`: transfer a file to or from a storage container
- `storage_multiupload`/`storage_multidownload`: transfer multiple files in parallel to or from a storage container
- `get_storage_properties`: get properties for a storage object
- `get_storage_metadata`/`set_storage_metadata`: get and set user-defined metadata for a storage object
```r
# example of working with files and directories (ADLSgen2)
cont <- storage_container(ad_end_tok, "myfilesystem")
list_storage_files(cont)
create_storage_dir(cont, "newdir")
storage_download(cont, "/readme.txt", "~/readme.txt")
storage_download(cont, "/readme.txt")
storage_multiupload(cont, "N:/data/*.*", "newdir") # uploading everything in a directory
```
@@ -76,7 +77,7 @@ AzureStor includes a number of extra features to make transferring files efficie
### Parallel connections
As noted above, you can transfer multiple files in parallel using the `multiupload_*`/`multidownload_*` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
As noted above, you can transfer multiple files in parallel using the `storage_multiupload/download` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
```r
# uploading/downloading multiple files at once: use a wildcard to specify files to transfer
@@ -86,22 +87,7 @@ storage_multidownload(cont, src="/monthly/jan*.*", dest="~/data/january")
# or supply a vector of file specs as the source and destination
src <- c("file1.csv", "file2.csv", "file3.csv")
dest <- file.path("data/", src)
storage_multiupload(cont, src, dest)
```
You can also use the process pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
```r
files_to_delete <- list_storage_files(cont, "datadir", info="name")
# initialise the background pool with 10 nodes
AzureRMR::init_pool(10)
# export the container object to the nodes
AzureRMR::pool_export("cont")
# delete the files
AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
storage_multiupload(cont, src=src, dest=dest)
```
### Transfer to and from connections
@@ -120,7 +106,7 @@ storage_upload(cont, src=con, dest="iris.rds")
# downloading files into memory: as a raw vector with dest=NULL, and via a connection
rawvec <- storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)
rawToChar(rawConnectionValue(rawvec))
con <- rawConnection(raw(0), "r+")
storage_download(cont, src="iris.rds", dest=con)

View file

@@ -83,7 +83,7 @@ Upload, download, or delete a blob; list blobs in a container.
\code{upload_blob} and \code{download_blob} can display a progress bar to track the file transfer. You can control whether to display this with \code{options(azure_storage_progress_bar=TRUE|FALSE)}; the default is TRUE.
\code{copy_url_to_blob} transfers the contents of the file at the specified HTTP[S] URL directly to blob storage, without requiring a temporary local copy to be made. `multicopy_url_to_blob1 does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
\code{copy_url_to_blob} transfers the contents of the file at the specified HTTP[S] URL directly to blob storage, without requiring a temporary local copy to be made. \code{multicopy_url_to_blob} does the same, for multiple URLs at once. These functions have a current file size limit of 256MB.
}
\examples{
\dontrun{

View file

@@ -144,7 +144,7 @@ test_that("Blob multicopy from URL works",
contname <- paste0(sample(letters, 10, TRUE), collapse="")
cont <- create_blob_container(bl, contname)
fnames <- c("DESCRIPTION", "LICENSE", "NAMESPACE")
fnames <- c("LICENSE", "LICENSE.md", "CONTRIBUTING.md")
src_urls <- paste0("https://raw.githubusercontent.com/Azure/AzureStor/master/", fnames)
origs <- paste0("../../", fnames)
dests <- c(tempfile(), tempfile(), tempfile())

View file

@@ -129,7 +129,7 @@ test_that("Blob copy from URL works",
# use readLines to workaround GH auto-translating CRLF -> LF
expect_identical(readLines(orig_file), readLines(new_file))
fnames <- c("DESCRIPTION", "LICENSE", "NAMESPACE")
fnames <- c("LICENSE", "LICENSE.md", "CONTRIBUTING.md")
src_urls <- paste0("https://raw.githubusercontent.com/Azure/AzureStor/master/", fnames)
origs <- paste0("../../", fnames)
dests <- c(tempfile(), tempfile(), tempfile())

View file

@@ -62,13 +62,12 @@ These functions for working with objects within a storage container:
- `storage_upload`/`storage_download`: transfer a file to or from a storage container
- `storage_multiupload`/`storage_multidownload`: transfer multiple files in parallel to or from a storage container
```r
# example of working with files and directories (ADLSgen2)
cont <- storage_container(ad_end_tok, "myfilesystem")
list_storage_files(cont)
create_storage_dir(cont, "newdir")
storage_download(cont, "/readme.txt", "~/readme.txt")
storage_download(cont, "/readme.txt")
storage_multiupload(cont, "N:/data/*.*", "newdir") # uploading everything in a directory
```
@@ -78,7 +77,7 @@ AzureStor includes a number of extra features to make transferring files efficie
### Parallel connections
As noted above, you can transfer multiple files in parallel using the `multiupload_*`/`multidownload_*` functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
The `storage_multiupload/download` functions transfer multiple files in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.
```r
# uploading/downloading multiple files at once: use a wildcard to specify files to transfer
@@ -91,21 +90,6 @@ dest <- file.path("data/", src)
storage_multiupload(cont, src, dest)
```
You can also use the process pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
```r
files_to_delete <- list_storage_files(cont, "datadir", info="name")
# initialise the background pool with 10 nodes
AzureRMR::init_pool(10)
# export the container object to the nodes
AzureRMR::pool_export("cont")
# delete the files
AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
```
### Transfer to and from connections
You can upload a (single) in-memory R object via a _connection_, and similarly, you can download a file to a connection, or return it as a raw vector. This lets you transfer an object without having to create a temporary file as an intermediate step.
@@ -122,7 +106,7 @@ storage_upload(cont, src=con, dest="iris.rds")
# downloading files into memory: as a raw vector with dest=NULL, and via a connection
rawvec <- storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)
rawToChar(rawConnectionValue(rawvec))
con <- rawConnection(raw(0), "r+")
storage_download(cont, src="iris.rds", dest=con)
@@ -165,6 +149,52 @@ For more information, see the [AzCopy repo on GitHub](https://github.com/Azure/a
**Note that AzureStor uses AzCopy version 10. It is incompatible with versions 8.1 and earlier.**
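Assuming the azcopy v10 executable is installed and on the system path, a transfer can be routed through it via the `use_azcopy` argument (a hedged sketch; the argument name is taken from the package documentation as I understand it):
```r
# route an upload and a download through azcopy instead of the internal transfer code
storage_upload(cont, src="~/data/bigfile.csv", dest="bigfile.csv", use_azcopy=TRUE)
storage_download(cont, src="bigfile.csv", dest="~/data/bigfile_copy.csv", use_azcopy=TRUE)
```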
## Other features
### Parallel connections
The `storage_multiupload/download` functions mentioned above use a background process pool supplied by AzureRMR. You can also use this pool to parallelise tasks for which there is no built-in function. For example, the following code will delete multiple files in parallel:
```r
files_to_delete <- list_storage_files(cont, "datadir", info="name")
# initialise the background pool with 10 nodes
AzureRMR::init_pool(10)
# export the container object to the nodes
AzureRMR::pool_export("cont")
# delete the files
AzureRMR::pool_sapply(files_to_delete, function(f) AzureStor::delete_storage_file(cont, f))
```
### Metadata
To get and set user-defined properties (metadata) for storage objects, use the `get_storage_metadata` and `set_storage_metadata` functions.
```r
fs <- storage_container("https://mystorage.dfs.core.windows.net/myshare", key="access_key")
storage_upload(fs, "iris.csv", "newdir/iris.csv")
set_storage_metadata(fs, "newdir/iris.csv", name1="value1")
# will be list(name1="value1")
get_storage_metadata(fs, "newdir/iris.csv")
set_storage_metadata(fs, "newdir/iris.csv", name2="value2")
# will be list(name1="value1", name2="value2")
get_storage_metadata(fs, "newdir/iris.csv")
set_storage_metadata(fs, "newdir/iris.csv", name3="value3", keep_existing=FALSE)
# will be list(name3="value3")
get_storage_metadata(fs, "newdir/iris.csv")
# deleting all metadata
set_storage_metadata(fs, "newdir/iris.csv", keep_existing=FALSE)
# if no filename supplied, get/set metadata for the container
get_storage_metadata(fs)
```
## Admin interface
Finally, AzureStor's admin-side interface allows you to easily create and delete storage accounts, as well as obtain access keys and generate a SAS. Here is a sample workflow:
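The workflow itself falls outside this hunk; as a hedged sketch of what it might look like (the tenant, subscription and resource names are placeholders, and the methods follow the AzureRMR/AzureStor admin interface as I understand it):
```r
library(AzureRMR)
library(AzureStor)

# authenticate with Azure Resource Manager and drill down to a resource group
rg <- get_azure_login("mytenant")$
    get_subscription("subscription_id")$
    get_resource_group("myresourcegroup")

# create a new storage account
stor <- rg$create_storage_account("mynewstorage")

# retrieve the access keys and generate an account SAS
stor$list_keys()
sas <- stor$get_account_sas(permissions="rw")

# delete the account (asks for confirmation first)
stor$delete()
```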