Fixes issues with layer remounting (e.g. a running container which then
has `docker cp` used to copy files in or out) by applying the same
refcounting implementation that exists in other graphdrivers like
overlay and aufs.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Save was failing file integrity checksums due to bugs in both
Windows and Docker. This commit includes fixes to file time handling
in tarexport and system.chtimes that are necessary along with
the Windows platform fixes to correctly support save. With this
change, sysfile_backups for windowsfilter driver are no longer
needed, so that code is removed.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
Fix root directory of the mountpoint being owned by real root. This is
unique to ZFS because of the way file mountpoints are created using the
ZFS tooling, and the remapping that happens at layer unpack doesn't
impact this root (already created) holding directory for the layer.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
For btrfs driver, in d.Create(), Get() of parentDir is called but not followed
by Put().
If we apply SElinux mount label, we need to mount btrfs subvolumes in d.Get(),
without a Put() would end up with a later Remove() failure on
"Device resourse is busy".
This calls the subvolume helper function directly in d.Create().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Most storage drivers call graphdriver.GetFSMagic(home),
it is more clean to easy to maintain. So btrfs need to
adopt such change.
Signed-off-by: Kai Qiang Wu(Kennan) <wkqwu@cn.ibm.com>
Right now if somebody has enabled deferred device deletion, then
deleteTransaction() returns success even if device could not be deleted. It
has been marked for deferred deletion. Right now we will mark device ID free
and potentially use it again when somebody tries to create new container. And
that's wrong. Device ID is not free yet. It will become free once devices
has actually been deleted by the goroutine later.
So move the location of call to markDeviceIDFree() to a place where we know
device actually got deleted and was not marked for deferred deletion.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Make sure btrfs mounted subvolumes are owned properly when a remapped
root exists (user namespaces are enabled, for example)
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Instead of creating a "0.0" subdirectory and migrating graphroot
metadata into it when user namespaces are available in the daemon
(currently only in experimental), change the graphroot dir permissions
to only include the execute bit for "other" users.
This allows easy migration to and from user namespaces and will allow
easier integration of user namespace support into the master build.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Support restoreCustomImage for windows with a new interface to extract
the graph driver from the LayerStore.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
The loopback logic is not technically exclusive to the devicemapper
driver. This reorganizes the code such that the loopback code is usable
outside of the devicemapper package and driver.
Signed-off-by: Vincent Batts <vbatts@redhat.com>
Really fixing 2 things:
1. Panic when any error is detected while walking the btrfs graph dir on
removal due to no error check.
2. Nested subvolumes weren't actually being removed due to passing in
the wrong path
On point 2, for a path detected as a nested subvolume, we were calling
`subvolDelete("/path/to/subvol", "subvol")`, where the last part of the
path was duplicated due to a logic error, and as such actually causing
point #1 since `subvolDelete` joins the two arguemtns, and
`/path/to/subvol/subvol` (the joined version) doesn't exist.
Also adds a test for nested subvol delete.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
After the very first init of the graph `docker info` correctly shows the
base fs type under `Backing Filesystem`. This information isn't stored
anywhere. After a restart (w/o erasing `/var/lib/docker`) `docker info`
shows an empty string under `Backing Filesystem`.
This patch records the base fs type after the first run in the metadata
or, to fix old devices that don't have this info in the metadata, just
probe the fs type of the base device at graph startup.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
All underlay dirs need proper remapped ownership. This bug was masked by the
fact that the setupInitLayer code was chown'ing the dirs at startup
time. Since that bug is now fixed, it revealed this permissions issue.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
ext4 filesystem creation can take a long time on 100G thin device and
systemd might time out and kill docker service. Often user is left thinking
why docker is taking so long and logs don't give any hint. Log an info
message in journal for start and end of filesystem creation. That way
a user can look at logs and figure out that filesystem creation is
taking long time.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Add distribution package for managing pulls and pushes. This is based on
the old code in the graph package, with major changes to work with the
new image/layer model.
Add v1 migration code.
Update registry, api/*, and daemon packages to use the reference
package's types where applicable.
Update daemon package to use image/layer/tag stores instead of the graph
package
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This change will allow us to run SELinux in a container with
BTRFS back end. We continue to work on fixing the kernel/BTRFS
but this change will allow SELinux Security separation on BTRFS.
It basically relabels the content on container creation.
Just relabling -init directory in BTRFS use case. Everything looks like it
works. I don't believe tar/achive stores the SELinux labels, so we are good
as far as docker commit.
Tested Speed on startup with BTRFS on top of loopback directory. BTRFS
not on loopback should get even better perfomance on startup time. The
more inodes inside of the container image will increase the relabel time.
This patch will give people who care more about security the option of
runnin BTRFS with SELinux. Those who don't want to take the slow down
can disable SELinux either in individual containers or for all containers
by continuing to disable SELinux in the daemon.
Without relabel:
> time docker run --security-opt label:disable fedora echo test
test
real 0m0.918s
user 0m0.009s
sys 0m0.026s
With Relabel
test
real 0m1.942s
user 0m0.007s
sys 0m0.030s
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
If platform supports xfs filesystem then use xfs as default filesystem
for container rootfs instead of ext4. Reason being that ext4 is pre-allcating
lot of metadata (around 1.8GB on 100G thin volume) and that can take long
enough on AWS storage that systemd times out and docker fails to start.
If one disables pre-allocation of ext4 metadata, then it will be allocated
when containers are mounted and we will have multiple copies of metadata
per container. For a 100G thin device, it was around 1.5GB of metadata
per container.
ext4 has an optimization to skip zeroing if discards are issued and
underlying device guarantees that zero will be returned when discarded
blocks are read back. devicemapper thin devices don't offer that guarantee
so ext4 optimization does not kick in. In fact given discards are optional
and can be dropped on the floor if need be, it looks like it might not be
possible to guarantee that all the blocks got discarded and if read back
zero will be returned.
Signed-off-by: Anusha Ragunathan <anusha@docker.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
If user wants to use a filesystem it can be specified using dm.fs=<filesystem>
option. It is possible that docker already had base image and a filesystem
on that. Later if user wants to change file system using dm.fs= option
and restarts docker, that's not possible. Warn user about it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now if blkid fails we are just logging a debug message and don;t return
the actual error to caller. Caller gets the error message that thin pool
base device UUID verification failed and it might give impression that thin
pool changed. But that's not the case. Thin pool is in such a state that we
could not even query the thin device UUID. Retrun error message appropriately
to make situation more clear.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
The LXC driver was deprecated in Docker 1.8.
Following the deprecation rules, we can remove a deprecated feature
after two major releases. LXC won't be supported anymore starting on Docker 1.10.
Signed-off-by: David Calavera <david.calavera@gmail.com>
This reverts commit d5cd032a86.
Commit caused issues on systems with case-insensitive filesystems.
Revert for now
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
- Move autogen/dockerversion to version
- Update autogen and "builds" to use this package and a build flag
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
cleanupDeleted() takes devices.Lock() but does not drop it if there are
no deleted devices. Hence docker deadlocks if one is using deferred
device deletion feature. (--storage-opt dm.use_deferred_deletion=true).
Fix it. Drop the lock before returning.
Also added a unit test case to make sure in future this can be easily
detected if somebody changes the function.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Generate a hash chain involving the image configuration, layer digests,
and parent image hashes. Use the digests to compute IDs for each image
in a manifest, instead of using the remotely specified IDs.
To avoid breaking users' caches, check for images already in the graph
under old IDs, and avoid repulling an image if the version on disk under
the legacy ID ends up with the same digest that was computed from the
manifest for that image.
When a calculated ID already exists in the graph but can't be verified,
continue trying SHA256(digest) until a suitable ID is found.
"save" and "load" are not changed to use a similar scheme. "load" will
preserve the IDs present in the tar file.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Adds support for the daemon to handle user namespace maps as a
per-daemon setting.
Support for handling uid/gid mapping is added to the builder,
archive/unarchive packages and functions, all graphdrivers (except
Windows), and the test suite is updated to handle user namespace daemon
rootgraph changes.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
When `-s` is not specified, there is no need to ask if there is a plugin
with the specified name.
This speeds up unit tests dramatically since they don't need to wait the
timeout period for each call to `graphdriver.New`.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
There is no need to call `os.Stat` on the driver filesystem path of a
container as `os.RemoveAll` already handles (properly) the case where
the path no longer exists.
Given the results of the stat() were not even being used, there is no
value in erroring out because of the stat call failure, and worse, it
prevents daemon cleanup of containers in "Dead" state unless you re-create
directories that were already removed via a manual cleanup after a
failure. This brings removal in overlay in line with aufs/devicemapper
drivers which don't error out if the filesystem path no longer exists.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Right now we check for the existence of device but don't make sure it is
a thin pool device. We assume it is a thin pool device and call poolStatus()
on the device which returns an error EOF. And that error does not tell
anything.
So before we reach the stage of calling poolStatus() make sure we are working
with a thin pool device otherwise error out.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Start a goroutine which runs every 30 seconds and if there are deferred
deleted devices, it tries to clean those up.
Also it moves the call to cleanupDeletedDevices() into goroutine and
moves the locking completely inside the function. Now function does not
assume that device lock is held at the time of entry.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Finally here is the patch to implement deferred deletion functionality.
Deferred deleted devices are marked as "Deleted" in device meta file.
First we try to delete the device and only if deletion fails and user has
enabled deferred deletion, device is marked for deferred deletion.
When docker starts up again, we go through list of deleted devices and
try to delete these again.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Provide a command line option dm.use_deferred_deletion to enable deferred
device deletion feature. By default feature will be turned off.
Not sure if there is much value in deferred deletion being turned on
without deferred removal being turned on. So for now, this feature can
be enabled only if deferred removal is on.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Currently during startup we walk through all the device files and read
their device ID and mark in a bitmap that device id is used.
We are anyway going through all device files. So we can as well load all
that data into device hash map. This will save us little time when
container is actually launched later.
Also this will help with later patches where cleanup deferred device
wants to go through all the devices and see which have been marked for
deletion and delete these.
So re-organize the code a bit.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Simplify setupBaseImage() even further. Move some more code in a separate
function. Pure code reorganization. No functionality change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Move thin pool related checks in a separate function. Pure code reorganization.
Makes reading code easier.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This moves base device creation function in a separate function. Pure
code reorganization. Makes reading code little easier.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This patch does three things. Following are the descriptions.
===
Create a separate function for delete transactions so that parent function
is little smaller.
Also close transaction if an error happens.
===
When docker is being shutdown, save deviceset metadata first before
trying to remove the devices. Generally caller gives only 10 seconds
for shutdown to complete and then kills it after that. So if some device
is busy, we will wait 20 seconds for it removal and never be able to save
metadata. So first save metadata and then deal with device removal.
===
Move issue discard operation in a separate function. This makes reading code
little easier.
Also don't issue discards if device is still open. That means devices is
still probably being used and issuing discards is not a good idea.
This is especially true in case of deferred deletion. We want to issue
discards when device is not open. At that time device can be deleted too.
Otherwise we will issue discards and deletion will actually fail. Later
we will try deletion again and issue discards again and deletion will
fail again as device is open and busy.
So this will ensure that discards are issued once when device is not open
and it can actually be deleted.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now we seem to have 3 locks.
- devinfo.lock
This is a per device lock
- metaData.devicesLock
This is supposedely protecting map of devices.
- Global DeviceSet lock
This is protecting map of devices as well as serializing calls to libdevmapper.
Semantics of per devices lock and global deviceset lock seem to be very clear.
Even ordering between these two locks has been defined properly.
What is not clear is the need and ordering of metaData.devicesLock. Looks like
this lock is not necessary and global DeviceSet lock should be used to
protect map of devices as it is part of DeviceSet.
This patchset gets rid of metaData.devicesLock and instead uses DeviceSet
lock to protect map of devices.
Also at couple of places during initialization takes devices.Lock(). That
is not strictly necessary as there is supposed to be one thread of execution
during initializaiton. Still it makes the code clearer.
I think this makes code more clear and easier to understand and easier to
make further changes.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
maxDeviceID is upper limit on device Id thin pool can support. Right now
we have this check only during startup. It is a good idea to move this
check in loadMetadata so that any time a device file is loaded and if it
is corrupted and device Id is more than maxDevieceID, it will be detected
right then and there.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Use deactivateDevice() instead of removeDevice() directly. This will make
sure for device deletion, deferred removal is used if user has configured
it in. Also this makes reading code litle easier as there is single function
to remove a device and that is deactivateDevice().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
If a device is still mounted at the time of DeleteDevice(), that means
higher layers have not called Put() properly on the device and are trying
to delete it. This is a bug in the code where Get() and Put() have not been
properly paired up. Fail device deletion if it is still mounted.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Exists() and HasDevice() just check if device file exists or not. It does
not say anything about if device is mounted or not. Fix comments.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
device has map (device.Devices), contains valid devices and we skip all
the files which are not device files. transaction metadata file is not
device file. Skip this file when devices files are being read and loaded
into map.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This fixes the case where directory is removed in
aufs and then the same layer is imported to a
different graphdriver.
Currently when you do `rm -rf /foo && mkdir /foo`
in a layer in aufs the files under `foo` would
only be be hidden on aufs.
The problems with this fix:
1) When a new diff is recreated from non-aufs driver
the `opq` files would not be there. This should not
mean layer differences for the user but still
different content in the tar (one would have one
`opq` file, the others would have `.wh.*` for every
file inside that folder). This difference also only
happens if the tar-split file isn’t stored for the
layer.
2) New files that have the filenames before `.wh..wh..opq`
when they are sorted do not get picked up by non-aufs
graphdrivers. Fixing this would require a bigger
refactoring that is planned in the future.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
The mount syscall does not handle string flags like "noatime",
we must use bitmasks like MS_NOATIME instead.
pkg/mount.Mount already handles this work.
Signed-off-by: Carl Henrik Lunde <chlunde@ping.uio.no>