Граф коммитов

5058 Коммитов

Автор SHA1 Сообщение Дата
Zhao Lei af90204750 btrfs: Move btrfs_raid_array to public
This array is used to record attributes of each raid type,
make it public, and many functions will benifit with this array.

For example, num_tolerated_disk_barrier_failures(), we can
avoid complex conditions in this function, and get raid attribute
simply by accessing above array.

It can also make code logic simple, reduce duplication code, and
increase maintainability.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise e9cf439f0d btrfs: use a single if() statement for one outcome in get_block_rsv()
Rather than have three separate if() statements for the same outcome
we should just OR them together in the same if() statement.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise a099d0fdb3 btrfs: memset cur_trans->delayed_refs to zero
Use memset() to null out the btrfs_delayed_ref_root of
btrfs_transaction instead of setting all the members to 0 by hand.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Byongho Lee 568b1c9cca btrfs: remove unnecessary list_del
We can safely iterate whole list items, without using list_del macro.
So remove the list_del call.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Byongho Lee d7641a49a5 btrfs: replace unnecessary list_for_each_entry_safe to list_for_each_entry
There is no removing list element while iterating over list.
So, replace list_for_each_entry_safe to list_for_each_entry.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise f2f767e734 btrfs: trimming some start_transaction() code away
Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
We're just initializing most fields to 0, false and NULL later on
_anyway_, so to make the code mode readable and potentially gain
a bit of performance (completely untested claim), we should fill our
btrfs_trans_handle with zeros on allocation then just initialize
those five remaining fields (not counting the list_heads) as normal.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise 0412e58c6d btrfs: Fixed declaration of old_len
old_len is used to store the return value of btrfs_item_size_nr().
The return value of btrfs_item_size_nr() is of type u32.
To improve code correctness and avoid mixing signed and unsigned
integers I've changed old_len to be of type u32 as well.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Alexandru Moise ce0eac2a1d btrfs: Fixed dsize and last_off declarations
The return values of btrfs_item_offset_nr and btrfs_item_size_nr are of
type u32. To avoid mixing signed and unsigned integers we should also
declare dsize and last_off to be of type u32.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Chandan Rajendra 0d51e28a11 Btrfs: btrfs_submit_bio_hook: Use btrfs_wq_endio_type values instead of integer constants
btrfs_submit_bio_hook() uses integer constants instead of values from "enum
btrfs_wq_endio_type". Fix this.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:47 +02:00
Qu Wenruo 0f6925fa29 btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
Current code will always truncate tailing page if its alloc_start is
smaller than inode size.

For example, the file extent layout is like:
0	4K	8K	16K	32K
|<-----Extent A---------------->|
|<--Inode size: 18K---------->|

But if calling fallocate even for range [0,4K), it will cause btrfs to
re-truncate the range [16,32K), causing COW and a new extent.

0	4K	8K	16K	32K
|///////|	<- Fallocate call range
|<-----Extent A-------->|<--B-->|

The cause is quite easy, just a careless btrfs_truncate_inode() in a
else branch without extra judgment.
Fix it by add judgment on whether the fallocate range is beyond isize.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-20 19:07:29 -07:00
Filipe Manana 0305cd5f7f Btrfs: fix truncation of compressed and inlined extents
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.

Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:

   1) User A creates a file which consists of an inline and compressed
      extent with a size of 2000 bytes - the file is not accessible to
      any other users (no read, write or execution permission for anyone
      else);

   2) The user truncates the file to a size of 1000 bytes;

   3) User A makes the file world readable;

   4) User B creates a file consisting of an inline extent of 2000 bytes;

   5) User B issues a clone operation from user A's file into its own
      file (using a length argument of 0, clone the whole range);

   6) User B now gets to see the 1000 bytes that user A truncated from
      its file before it made its file world readbale. User B also lost
      the bytes in the range [1000, 2000[ bytes from its own file, but
      that might be ok if his/her intention was reading stale data from
      user A that was never supposed to be public.

Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.

This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").

So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncate it.

The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our test files. File foo is going to be the source of a clone operation
  # and consists of a single inline extent with an uncompressed size of 512 bytes,
  # while file bar consists of a single inline extent with an uncompressed size of
  # 256 bytes. For our test's purpose, it's important that file bar has an inline
  # extent with a size smaller than foo's inline extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
          -c "pwrite -S 0x2a 128 384" \
          $SCRATCH_MNT/foo | _filter_xfs_io
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io

  # Now durably persist all metadata and data. We do this to make sure that we get
  # on disk an inline extent with a size of 512 bytes for file foo.
  sync

  # Now truncate our file foo to a smaller size. Because it consists of a
  # compressed and inline extent, btrfs did not shrink the inline extent to the
  # new size (if the extent was not compressed, btrfs would shrink it to 128
  # bytes), it only updates the inode's i_size to 128 bytes.
  $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo

  # Now clone foo's inline extent into bar.
  # This clone operation should fail with errno EOPNOTSUPP because the source
  # file consists only of an inline extent and the file's size is smaller than
  # the inline extent of the destination (128 bytes < 256 bytes). However the
  # clone ioctl was not prepared to deal with a file that has a size smaller
  # than the size of its inline extent (something that happens only for compressed
  # inline extents), resulting in copying the full inline extent from the source
  # file into the destination file.
  #
  # Note that btrfs' clone operation for inline extents consists of removing the
  # inline extent from the destination inode and copy the inline extent from the
  # source inode into the destination inode, meaning that if the destination
  # inode's inline extent is larger (N bytes) than the source inode's inline
  # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
  # file. Btrfs could copy the source inline extent's data into the destination's
  # inline extent so that we would not lose any data, but that's currently not
  # done due to the complexity that would be needed to deal with such cases
  # (specially when one or both extents are compressed), returning EOPNOTSUPP, as
  # it's normally not a very common case to clone very small files (only case
  # where we get inline extents) and copying inline extents does not save any
  # space (unlike for normal, non-inlined extents).
  $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now because the above clone operation used to succeed, and due to foo's inline
  # extent not being shinked by the truncate operation, our file bar got the whole
  # inline extent copied from foo, making us lose the last 128 bytes from bar
  # which got replaced by the bytes in range [128, 256[ from foo before foo was
  # truncated - in other words, data loss from bar and being able to read old and
  # stale data from foo that should not be possible to read anymore through normal
  # filesystem operations. Contrast with the case where we truncate a file from a
  # size N to a smaller size M, truncate it back to size N and then read the range
  # [M, N[, we should always get the value 0x00 for all the bytes in that range.

  # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
  # not modify our file's bar data/metadata. So its content should be 256 bytes
  # long with all bytes having the value 0xbb.
  #
  # Without the btrfs bug fix, the clone operation succeeded and resulted in
  # leaking truncated data from foo, the bytes that belonged to its range
  # [128, 256[, and losing data from bar in that same range. So reading the
  # file gave us the following content:
  #
  # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
  # *
  # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
  # *
  # 0000400
  echo "File bar's content after the clone operation:"
  od -t x1 $SCRATCH_MNT/bar

  # Also because the foo's inline extent was not shrunk by the truncate
  # operation, btrfs' fsck, which is run by the fstests framework everytime a
  # test completes, failed reporting the following error:
  #
  #  root 5 inode 257 errors 400, nbytes wrong

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-16 21:02:53 +01:00
Linus Torvalds 6aa8ca4df0 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have two more bug fixes for btrfs.

  My commit fixes a bug we hit last week at FB, a combination of lots of
  hard links and an admin command to resolve inode numbers.

  Dave is adding checks to make sure balance on current kernels ignores
  filters it doesn't understand.  The penalty for being wrong is just
  doing more work (not crashing etc), but it's a good fix"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix use after free iterating extrefs
  btrfs: check unsupported filters in balance arguments
2015-10-16 12:55:34 -07:00
Filipe Manana 5e6ecb362b Btrfs: fix double range unlock of hole region when reading page
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.

Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.

Fix this by leaving the unlock exclusively to the caller.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-14 04:37:00 +01:00
Filipe Manana 8039d87d9e Btrfs: fix file corruption and data loss after cloning inline extents
Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.

Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).

So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.

Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.

The following test case for fstests exercises all these cases that result
in file corruption and data loss:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _require_btrfs_fs_feature "no_holes"
  _require_btrfs_mkfs_feature "no-holes"

  rm -f $seqres.full

  test_cloning_inline_extents()
  {
      local mkfs_opts=$1
      local mount_opts=$2

      _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # File bar, the source for all the following clone operations, consists
      # of a single inline extent (50 bytes).
      $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
          | _filter_xfs_io

      # Test cloning into a file with an extent (non-inlined) where the
      # destination offset overlaps that extent. It should not be possible to
      # clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo data after clone operation:"
      # All bytes should have the value 0xaa (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo
      $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io

      # Test cloning the inline extent against a file which has a hole in its
      # first 4K followed by a non-inlined extent. It should not be possible
      # as well to clone the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "File foo2 data after clone operation:"
      # All bytes should have the value 0x00 (clone operation failed and did
      # not modify our file).
      od -t x1 $SCRATCH_MNT/foo2
      $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io

      # Test cloning the inline extent against a file which has a size of zero
      # but has a prealloc extent. It should not be possible as well to clone
      # the inline extent from file bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3

      # Doing IO against any range in the first 4K of the file should work.
      # Due to a past clone ioctl bug which allowed cloning the inline extent,
      # these operations resulted in EIO errors.
      echo "First 50 bytes of foo3 after clone operation:"
      # Should not be able to read any bytes, file has 0 bytes i_size (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo3
      $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size not greater than the size of
      # bar's inline extent (40 < 50).
      # It should be possible to do the extent cloning from bar to this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4

      # Doing IO against any range in the first 4K of the file should work.
      echo "File foo4 data after clone operation:"
      # Must match file bar's content.
      od -t x1 $SCRATCH_MNT/foo4
      $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io

      # Test cloning the inline extent against a file which consists of a
      # single inline extent that has a size greater than the size of bar's
      # inline extent (60 > 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
          | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5

      # Reading the file should not fail.
      echo "File foo5 data after clone operation:"
      # Must have a size of 60 bytes, with all bytes having a value of 0x03
      # (the clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo5

      # Test cloning the inline extent against a file which has no extents but
      # has a size greater than bar's inline extent (16K > 50).
      # It should not be possible to clone the inline extent from file bar
      # into this file.
      $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6

      # Reading the file should not fail.
      echo "File foo6 data after clone operation:"
      # Must have a size of 16K, with all bytes having a value of 0x00 (the
      # clone operation failed and did not modify our file).
      od -t x1 $SCRATCH_MNT/foo6

      # Test cloning the inline extent against a file which has no extents but
      # has a size not greater than bar's inline extent (30 < 50).
      # It should be possible to clone the inline extent from file bar into
      # this file.
      $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7

      # Reading the file should not fail.
      echo "File foo7 data after clone operation:"
      # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
      od -t x1 $SCRATCH_MNT/foo7

      # Test cloning the inline extent against a file which has a size not
      # greater than the size of bar's inline extent (20 < 50) but has
      # a prealloc extent that goes beyond the file's size. It should not be
      # possible to clone the inline extent from bar into this file.
      $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                      -c "pwrite -S 0x88 0 20" \
                      $SCRATCH_MNT/foo8 | _filter_xfs_io
      $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8

      echo "File foo8 data after clone operation:"
      # Must have a size of 20 bytes, with all bytes having a value of 0x88
      # (the clone operation did not modify our file).
      od -t x1 $SCRATCH_MNT/foo8

      _scratch_unmount
  }

  echo -e "\nTesting without compression and without the no-holes feature...\n"
  test_cloning_inline_extents

  echo -e "\nTesting with compression and without the no-holes feature...\n"
  test_cloning_inline_extents "" "-o compress"

  echo -e "\nTesting without compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" ""

  echo -e "\nTesting with compression and with the no-holes feature...\n"
  test_cloning_inline_extents "-O no-holes" "-o compress"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-14 04:36:43 +01:00
Chris Mason dc6c5fb3b5 btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-10-13 18:54:44 -07:00
David Sterba 8eb934591f btrfs: check unsupported filters in balance arguments
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-13 18:53:03 -07:00
Robin Ruede b96b1db039 btrfs: fix resending received snapshot with parent
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.

When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, it's received_uuid is set to the second uuid
instead of the original one. The same was true for the parent_uuid.
This behaviour was partially changed in 37b8d27d, but in that patch
only the parent_uuid was taken from the real original,
not the uuid itself, causing the search for the parent to fail in
the case below.

This happens for example when trying to send a series of linked
snapshots (e.g. created by snapper) from the backup file system back
to the original one.

The following commands reproduce the issue in v4.2.1
(no error in 4.1.6)

    # setup three test file systems
    for i in 1 2 3; do
	    truncate -s 50M fs$i
	    mkfs.btrfs fs$i
	    mkdir $i
	    mount fs$i $i
    done
    echo "content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap1
    echo "changed content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap2

    # works fine:
    btrfs send 1/snap1 | btrfs receive 2/
    btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/

    # ERROR: could not find parent subvolume
    btrfs send 2/snap1 | btrfs receive 3/
    btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/

Signed-off-by: Robin Ruede <rruede+git@gmail.com>
Fixes: 37b8d27de5 ("Btrfs: use received_uuid of parent during send")
Cc: stable@vger.kernel.org # v4.2+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Ed Tomlinson <edt@aei.ca>
2015-10-13 20:04:10 +01:00
Filipe Manana d906d49fc5 Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.

So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).

The following test case for fstests reproduces the problem.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink
  _require_xfs_io_command "fpunch"

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single 100K extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  # Clone our file into a new file named bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now overwrite parts of our foo file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
     -c "pwrite -S 0xcc 90K 10K" \
     -c "fpunch 70K 10k" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
     $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  # We expect the destination filesystem to have exactly the same file
  # data as the original filesystem.
  # The btrfs send implementation had a bug where it sent a clone
  # operation from file foo into file bar covering the whole [0, 100K[
  # range after creating and writing the file foo. This was incorrect
  # because the file bar now included the updates done to file foo after
  # we cloned foo to bar, breaking the COW nature of reflink copies
  # (cloned extents).
  echo "File digests in the new filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Another test case that reproduces the problem when we have compressed
extents:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our file with an extent of 100K starting at file offset 0K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K"       \
                  -c "fsync"                        \
                  $SCRATCH_MNT/foo | _filter_xfs_io

  # Rewrite part of the previous extent (its first 40K) and write a new
  # 100K extent starting at file offset 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K"    \
          -c "pwrite -S 0xcc 100K 100K"      \
          $SCRATCH_MNT/foo | _filter_xfs_io

  # Our file foo now has 3 file extent items in its metadata:
  #
  # 1) One covering the file range 0 to 40K;
  # 2) One covering the file range 40K to 100K, which points to the first
  #    extent we wrote to the file and has a data offset field with value
  #    40K (our file no longer uses the first 40K of data from that
  #    extent);
  # 3) One covering the file range 100K to 200K.

  # Now clone our file foo into file bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Create our snapshot for the send operation.
  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
          $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify we
  # get the same file contents that the original filesystem had.
  # Btrfs send used to issue a clone operation from foo's range
  # [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
  # to by foo's second file extent item, this was incorrect because of bad
  # accounting of the file extent item's data offset field. The correct
  # range to clone from should have been [40K, 100K[.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  echo "File digests in the new filesystem:"
  # Must match the digests we got in the original filesystem.
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-13 01:05:27 +01:00
Chris Mason 6db4a7335d Merge branch 'fix/waitqueue-barriers' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:24:40 -07:00
Chris Mason 62fb50ab7c Merge branch 'anand/sysfs-updates-v4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-12 16:24:15 -07:00
Chris Mason 640926ffdd Merge branch 'cleanup/messages' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:22:26 -07:00
David Sterba ee86395458 btrfs: comment the rest of implicit barriers before waitqueue_active
There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:42:00 +02:00
David Sterba 779adf0f64 btrfs: remove extra barrier before waitqueue_active
Removing barriers is scary, but a call to atomic_dec_and_test implies
a barrier, so we don't need to issue another one.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:33 +02:00
David Sterba a83342aa0c btrfs: add comments to barriers before waitqueue_active
Reduce number of undocumented barriers out there.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:40:04 +02:00
David Sterba 33a9eca7e4 btrfs: comment waitqueue_active implied by locks
Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:35:10 +02:00
David Sterba b666a9cd99 btrfs: add barrier for waitqueue_active in clear_btree_io_tree
waitqueue_active should be preceded by a barrier, in this function we
don't need to call it all the time.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:24:48 +02:00
David Sterba 730d9ec36b btrfs: remove waitqueue_active check from btrfs_rm_dev_replace_unblocked
Normally the waitqueue_active would need a barrier, but this is not
necessary here because it's not a performance sensitive context and we
can call wake_up directly.

Suggested-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:16:38 +02:00
Linus Torvalds 175d58cfed Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are small and assorted.  Neil's is the oldest, I dropped the
  ball thinking he was going to send it in"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: support NFSv2 export
  Btrfs: open_ctree: Fix possible memory leak
  Btrfs: fix deadlock when finalizing block group creation
  Btrfs: update fix for read corruption of compressed and shared extents
  Btrfs: send, fix corner case for reference overwrite detection
2015-10-09 16:39:35 -07:00
David Sterba f14d104dbd btrfs: switch more printks to our helpers
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:08:03 +02:00
David Sterba 9464732266 btrfs: switch message printers to ratelimited variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:04:06 +02:00
David Sterba 1dd6d7ca9d btrfs: introduce ratelimited variants of message printing functions
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:56 +02:00
David Sterba b14af3b46f btrfs: switch message printers to ratelimited _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba 24aa6b41d4 btrfs: introduce ratelimited _in_rcu variants of message printing functions
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba ecaeb14b91 btrfs: switch message printers to _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba 08a84e25a8 btrfs: introduce _in_rcu variants of message printing functions
Due to the missing variants there are messages that lack the information
printed by btrfs_info etc helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
NeilBrown 7d35199e15 BTRFS: support NFSv2 export
The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as
that returned by encode_fh - it may be larger.

With NFSv2, the filehandle is fixed length, so it may appear longer
than expected and be zero-padded.

So we must test that fh_len is at least some value, not exactly equal
to it.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Sterba <dsterba@suse.cz>
2015-10-06 06:55:23 -07:00
chandan e5fffbac4a Btrfs: open_ctree: Fix possible memory leak
After reading one of chunk or tree root tree's root node from disk, if the
root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release
the memory used by the root node. Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
2015-10-06 06:55:22 -07:00
Filipe Manana d9a0540a79 Btrfs: fix deadlock when finalizing block group creation
Josef ran into a deadlock while a transaction handle was finalizing the
creation of its block groups, which produced the following trace:

  [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
  [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
  [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
  [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
  [260445.593137] Call Trace:
  [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
  [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
  [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
  [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
  [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
  [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
  [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
  [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
  [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
  [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
  [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
  [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
  [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
  [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
  [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
  [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
  [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
  [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

This happened because the same transaction handle created a large number
of block groups and while finalizing their creation (inserting new items
and updating existing items in the chunk and device trees) a new metadata
extent had to be allocated and no free space was found in the current
metadata block groups, which made find_free_extent() attempt to allocate
a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
ended up allocating a new system chunk too and exceeded the threshold
of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
final part of block group creation again (at
btrfs_create_pending_block_groups()) and attempt to lock again the root
of the chunk tree when it's already write locked by the same task.

Similarly we can deadlock on extent tree nodes/leafs if while we are
running delayed references we end up creating a new metadata block group
in order to allocate a new node/leaf for the extent tree (as part of
a CoW operation or growing the tree), as btrfs_create_pending_block_groups
inserts items into the extent tree as well. In this case we get the
following trace:

  [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
  [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
  [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
  [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
  [14242.773606] Call Trace:
  [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
  [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
  [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
  [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
  [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
  [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
  [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
  [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
  [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
  [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
  [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
  [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
  [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
  [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
  [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
  [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
  [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
  [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
  [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
  [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
  [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
  [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
  [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
  [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
  [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
  [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
  [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
  [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
  [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
  [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
  [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
  [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by never recursing into the finalization phase of block group
creation and making sure we never trigger the finalization of block group
creation while running delayed references.

Reported-by: Josef Bacik <jbacik@fb.com>
Fixes: 00d80e342c ("Btrfs: fix quick exhaustion of the system array in the superblock")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-05 16:56:38 -07:00
Filipe Manana 808f80b467 Btrfs: update fix for read corruption of compressed and shared extents
My previous fix in commit 005efedf2c ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create our test file with a single extent of 64Kb that is going to
      # be compressed no matter which compression algo is used (zlib/lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
          $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone the compressed extent into an adjacent file offset.
      $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      echo "File digest before unmount:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch

      # Remount the fs or clear the page cache to trigger the bug in
      # btrfs. Because the extent has an uncompressed length that is a
      # multiple of 16 pages, all the pages belonging to the second range
      # of the file (64K to 128K), which points to the same extent as the
      # first range (0K to 64K), had their contents full of zeroes instead
      # of the byte 0xaa. This was a bug exclusively in the read path of
      # compressed extents, the correct data was stored on disk, btrfs
      # just failed to fill in the pages correctly.
      _scratch_remount

      echo "File digest after remount:"
      # Must match the digest we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
2015-10-05 16:56:27 -07:00
Filipe Manana b786f16ac3 Btrfs: send, fix corner case for reference overwrite detection
When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of other inode that
has the same number as the current progress, we were always telling our
caller that the inode's reference was overwritten, which is incorrect
because the other inode might be a new inode (different generation number)
in which case we must return false from did_overwrite_ref() so that its
callers don't use an orphanized path for the inode (as it will never be
orphanized, instead it will be unlinked and the new inode created later).

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K.
  mkdir -p $SCRATCH_MNT/foo
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  echo "File digest before being replaced:"
  md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch

  # Remove the file and then create a new one in the same location with
  # the same name but with different content. This new file ends up
  # getting the same inode number as the previous one, because that inode
  # number was the highest inode number used by the snapshot's root and
  # therefore when attempting to find the a new inode number for the new
  # file, we end up reusing the same inode number. This happens because
  # currently btrfs uses the highest inode number summed by 1 for the
  # first inode created once a snapshot's root is loaded (done at
  # fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
  # tree).
  # Having these two different files in the snapshots with the same inode
  # number (but different generation numbers) caused the btrfs send code
  # to emit an incorrect path for the file when issuing an unlink
  # operation because it failed to realize they were different files.
  rm -f $SCRATCH_MNT/mysnap2/foo/bar
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
      $SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
      $SCRATCH_MNT/mysnap2_ro

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
      $SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap

  echo "File digest in the original filesystem after being replaced:"
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  # Must match the digest from the new file.
  md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch

  status=0
  exit

Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: 8b191a6849 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-05 16:56:27 -07:00
Liu Bo 73416dab23 Btrfs: move kobj stuff out of dev_replace lock range
To avoid deadlock described in commit 084b6e7c76 ("btrfs: Fix a
lockdep warning when running xfstest."), we should move kobj stuff out
of dev_replace lock range.

  "It is because the btrfs_kobj_{add/rm}_device() will call memory
  allocation with GFP_KERNEL,
  which may flush fs page cache to free space, waiting for it self to do
  the commit, causing the deadlock.

  To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
  dev_replace lock range, also involing split the
  btrfs_rm_dev_replace_srcdev() function into remove and free parts.

  Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
  lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
  called out of the lock range."

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[added lockup description]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 18:07:59 +02:00
Anand Jain f190aa471a Btrfs: add helper for closing one device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 18:00:05 +02:00
Anand Jain 097efc966a Btrfs: don't log error from btrfs_get_bdev_and_sb
Originally the message was not in a helper but ended up there. We should
print error messages from callers instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:56:47 +02:00
Anand Jain 9e271ae27e Btrfs: kernel operation should come after user input has been verified
By general rule of thumb there shouldn't be any way that user land
could trigger a kernel operation just by sending wrong arguments.

Here do commit cleanups after user input has been verified.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:45:10 +02:00
Anand Jain 12b1c2637b Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks
This patch updates and renames btrfs_scratch_superblocks, (which is used
by the replace device thread), with those fixes from the scratch
superblock code section of btrfs_rm_device(). The fixes are:
  Scratch all copies of superblock
  Notify kobject that superblock has been changed
  Update time on the device

So that btrfs_rm_device() can use the function
btrfs_scratch_superblocks() instead of its own scratch code. And further
replace deivce code which similarly releases device back to the system,
will have the fixes from the btrfs device delete.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:37:34 +02:00
Anand Jain 29c36d7253 Btrfs: add btrfs_read_dev_one_super() to read one specific SB
This uses a chunk of code from btrfs_read_dev_super() and creates
a function called btrfs_read_dev_one_super() so that next patch
can use it for scratch superblock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed bufhead to bh]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:29:38 +02:00
Anand Jain d74a625987 Btrfs: use BTRFS_ERROR_DEV_MISSING_NOT_FOUND when missing device is not found
Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead
of -ENOENT.  Next this removes the logging when user specifies "missing"
and we don't find it in the kernel device list. Logging are for system
events not for user input errors.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 16:47:16 +02:00
Anand Jain a4553fefb5 Btrfs: consolidate btrfs_error() to btrfs_std_error()
btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:30:00 +02:00
Anand Jain 57d816a15b Btrfs: __btrfs_std_error() logic should be consistent w/out CONFIG_PRINTK defined
error handling logic behaves differently with or without
CONFIG_PRINTK defined, since there are two copies of the same
function which a bit of different logic

One, when CONFIG_PRINTK is defined, code is

__btrfs_std_error(..)
{
::
       save_error_info(fs_info);
       if (sb->s_flags & MS_BORN)
               btrfs_handle_error(fs_info);
}

and two when CONFIG_PRINTK is not defined, the code is

__btrfs_std_error(..)
{
::
       if (sb->s_flags & MS_BORN) {
               save_error_info(fs_info);
               btrfs_handle_error(fs_info);
        }
}

I doubt if this was intentional ? and appear to have caused since
we maintain two copies of the same function and they got diverged
with commits.

Now to decide which logic is correct reviewed changes as below,

 533574c6bc
Commit added two copies of this function

 cf79ffb5b7
Commit made change to only one copy of the function and to the
copy when CONFIG_PRINTK is defined.

To fix this, instead of maintaining two copies of same function
approach, maintain single function, and just put the extra
portion of the code under CONFIG_PRINTK define.

This patch just does that. And keeps code of with CONFIG_PRINTK
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:30:00 +02:00
Anand Jain 92fc03fbdc Btrfs: SB read failure should return EIO for __bread failure
This will return EIO when __bread() fails to read SB,
instead of EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain c1b7e47459 Btrfs: rename super_kobj to fsid_kobj
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain 3257604048 Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain e3bd6973bc Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain 6618a59bfc Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:58 +02:00
Anand Jain 96f3136e51 Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:57 +02:00
Linus Torvalds 03e8f64486 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assorted set I've been queuing up:

  Jeff Mahoney tracked down a tricky one where we ended up starting IO
  on the wrong mapping for special files in btrfs_evict_inode.  A few
  people reported this one on the list.

  Filipe found (and provided a test for) a difficult bug in reading
  compressed extents, and Josef fixed up some quota record keeping with
  snapshot deletion.  Chandan killed off an accounting bug during DIO
  that lead to WARN_ONs as we freed inodes"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: keep dropped roots in cache until transaction commit
  Btrfs: Direct I/O: Fix space accounting
  btrfs: skip waiting on ordered range for special files
  Btrfs: fix read corruption of compressed and shared extents
  Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
  Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-25 12:08:41 -07:00
Josef Bacik 2b9dbef272 Btrfs: keep dropped roots in cache until transaction commit
When dropping a snapshot we need to account for the qgroup changes.  If we drop
the snapshot in all one go then the backref code will fail to find blocks from
the snapshot we dropped since it won't be able to find the root in the fs root
cache.  This can lead to us failing to find refs from other roots that pointed
at blocks in the now deleted root.  To handle this we need to not remove the fs
roots from the cache until after we process the qgroup operations.  Do this by
adding dropped roots to a list on the transaction, and letting the transaction
remove the roots at the same time it drops the commit roots.  This will keep all
of the backref searching code in sync properly, and fixes a problem Mark was
seeing with snapshot delete and qgroups.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-09-22 10:22:56 -07:00
chandan 50745b0a7f Btrfs: Direct I/O: Fix space accounting
The following call trace is seen when generic/095 test is executed,

WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
Modules linked in:
CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
Call Trace:
 [<ffffffff81984058>] dump_stack+0x45/0x57
 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
 [<ffffffff8117ce07>] destroy_inode+0x37/0x60
 [<ffffffff8117cf39>] evict+0x109/0x170
 [<ffffffff8117cfd5>] dispose_list+0x35/0x50
 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
 [<ffffffff81165951>] kill_anon_super+0x11/0x20
 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
 [<ffffffff811660cf>] deactivate_super+0x5f/0x70
 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
 [<ffffffff81069c06>] task_work_run+0x96/0xb0
 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
 [<ffffffff8198cbc2>] int_signal+0x12/0x17

This means that the inode had non-zero "outstanding extents" during
eviction. This occurs because, during direct I/O a task which successfully
used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
not clear the bit after finishing the DIO write. A future DIO write could
actually fail and the unused reserve space won't be freed because of the
previously set BTRFS_INODE_DIO_READY bit.

Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
following issue,
|-----------------------------------+-------------------------------------|
| Task A                            | Task B                              |
|-----------------------------------+-------------------------------------|
| Start direct i/o write on inode X.|                                     |
| reserve space                     |                                     |
| Allocate ordered extent           |                                     |
| release reserved space            |                                     |
| Set BTRFS_INODE_DIO_READY bit.    |                                     |
|                                   | splice()                            |
|                                   | Transfer data from pipe buffer to   |
|                                   | destination file.                   |
|                                   | - kmap(pipe buffer page)            |
|                                   | - Start direct i/o write on         |
|                                   |   inode X.                          |
|                                   |   - reserve space                   |
|                                   |   - dio_refill_pages()              |
|                                   |     - sdio->blocks_available == 0   |
|                                   |     - Since a kernel address is     |
|                                   |       being passed instead of a     |
|                                   |       user space address,           |
|                                   |       iov_iter_get_pages() returns  |
|                                   |       -EFAULT.                      |
|                                   |   - Since BTRFS_INODE_DIO_READY is  |
|                                   |     set, we don't release reserved  |
|                                   |     space.                          |
|                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
| -EIOCBQUEUED is returned.         |                                     |
|-----------------------------------+-------------------------------------|

Hence this commit introduces "struct btrfs_dio_data" to track the usage of
reserved data space. The remaining unused "reserve space" can now be freed
reliably.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-09-21 13:47:55 -07:00
Jeff Mahoney a30e577c96 btrfs: skip waiting on ordered range for special files
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.

filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
next depends on whether there's an open file handle associated with the
inode.  If there is, we write to the block device, which is unexpected
behavior.  If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.

Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.

Cc: <stable@vger.kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-09-15 02:21:08 +01:00
Filipe Manana 005efedf2c Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.

Consider the following example:

  File layout
  [0 - 8K]                      [8K - 24K]
      |                             |
      |                             |
   points to extent X,         points to extent X,
   offset 4K, length of 8K     offset 0, length 16K

  [extent X, compressed length = 4K uncompressed length = 16K]

If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).

So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.

The following test case for fstests triggers the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create a test file with a single extent that is compressed (the
      # data we write into it is highly compressible no matter which
      # compression algorithm is used, zlib or lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                      -c "pwrite -S 0xbb 4K 8K"        \
                      -c "pwrite -S 0xcc 12K 4K"       \
                      $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone our extent into an adjacent offset.
      $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      # Same as before but for this file we clone the extent into a lower
      # file offset.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                      -c "pwrite -S 0xbb 12K 8K"        \
                      -c "pwrite -S 0xcc 20K 4K"        \
                      $SCRATCH_MNT/bar | _filter_xfs_io

      $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
          $SCRATCH_MNT/bar $SCRATCH_MNT/bar

      echo "File digests before unmounting filesystem:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch

      # Evicting the inode or clearing the page cache before reading
      # again the file would also trigger the bug - reads were returning
      # all bytes in the range corresponding to the second reference to
      # the extent with a value of 0, but the correct data was persisted
      # (it was a bug exclusively in the read path). The issue happened
      # only if the same readpages() call targeted pages belonging to the
      # first and second ranges that point to the same compressed extent.
      _scratch_remount

      echo "File digests after mounting filesystem again:"
      # Must match the same digests we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-15 00:59:31 +01:00
Linus Torvalds e91eb6204f Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "These are small cleanups, and also some fixes for our async worker
  thread initialization.

  I was having some trouble testing these, but it ended up being a
  combination of changing around my test servers and a shiny new
  schedule while atomic from the new start/finish_plug in
  writeback_sb_inodes().

  That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
  changing a bunch of things in my test setup at once it would have been
  really clear.  Fix for writeback_sb_inodes() on the way as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
  btrfs: async_thread: Fix workqueue 'max_active' value when initializing
  btrfs: Add raid56 support for updating  num_tolerated_disk_barrier_failures in btrfs_balance
  btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
  btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
  btrfs: Update out-of-date "skip parity stripe" comment
2015-09-11 12:38:25 -07:00
Filipe Manana 85e0a0f21a Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
After commmit e44163e177 ("btrfs: explictly delete unused block groups
in close_ctree and ro-remount"), added in the 4.3 merge window, we have
calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
This can cause a deadlock with a concurrent block group relocation (when
a filesystem balance or shrink operation is in progress for example)
because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
relocation path locks first delete_unused_bgs_mutex and then it locks
cleaner_mutex, resulting in a classic ABBA deadlock:

         CPU 0                                        CPU 1

lock fs_info->cleaner_mutex

                                           __btrfs_balance() || btrfs_shrink_device()
                                             lock fs_info->delete_unused_bgs_mutex
                                             btrfs_relocate_chunk()
                                               btrfs_relocate_block_group()
                                                 lock fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
  lock fs_info->delete_unused_bgs_mutex

Fix this by not taking the cleaner_mutex before calling
btrfs_delete_unused_bgs() because it's no longer needed after
commit 67c5e7d464 ("Btrfs: fix race between balance and unused block
group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
spinlock fs_info->unused_bgs_lock and a block group's spinlock are
enough to get correct serialization between tasks running relocation
and unused block group deletion (as well as between multiple tasks
concurrently calling btrfs_delete_unused_bgs()).

This issue was discussed (in the mailing list) during the review of
the patch titled "btrfs: explictly delete unused block groups in
close_ctree and ro-remount" and it was agreed that acquiring the
cleaner mutex had to be dropped after the patch titled
"Btrfs: fix race between balance and unused block group deletion"
got merged (both patches were submitted at about the same time, but
one landed in kernel 4.2 and the other in the 4.3 merge window).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-10 11:27:57 +01:00
Filipe Manana 6af3e3adca Btrfs: don't initialize a space info as full to prevent ENOSPC
Commit 2e6e518335 ("Btrfs: fix block group ->space_info null pointer
dereference") accidently marked a space info as full when initializing
it with a value of 0 total bytes. This introduces an ENOSPC problem when
writing file data if we mount a filesystem that has no data block groups
allocated, because the data space info is initialized with 0 total bytes,
marked as full, and it never gets its total bytes incremented by a
(positive) value to unmark it as full (because there are no data block
groups loaded when the fs is mounted).
For metadata and system spaces this issue can never happen since we always
have at least one metadata block group and one system block group (even
for an empty filesystem).

So fix this by just not initializing a space info as full, reverting the
offending part of the commit mentioned above.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1

  # Mount our filesystem without space caches enabled so that we do not
  # get any space used from the initial data block group that mkfs creates
  # (space caches used space from data block groups).
  _scratch_mount "-o nospace_cache"

  # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
  # an fs using mixed block groups (used both for data and metadata). We
  # really need to have dedicated block groups for data to reproduce the
  # issue and mkfs.btrfs defaults to mixed block groups only for small
  # filesystems (up to 1Gb).
  _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))

  # Run balance with the purpose of deleting the unused data block group
  # that mkfs created. We could also wait for the background kthread to
  # automatically delete the unused block group, but we do not have a way
  # to make it run and wait for it to complete, so just do a balance
  # instead of some unreliable sleep
  _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT

  # Now unmount the filesystem, mount it again (either with or with space
  # caches enabled, it does not matter to trigger the problem) and attempt
  # to create a file with some data - this used to fail with ENOSPC
  # because there were no data block groups when the filesystem was
  # mounted and the data space info object was marked as full when
  # initialized (because it had 0 total bytes), which prevented the file
  # write path from attempting to allocate a data block group and fail
  # immediately with ENOSPC.
  _scratch_remount
  echo "hello world" > $SCRATCH_MNT/foobar

  echo "Silence is golden"
  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-08 03:25:10 +01:00
Linus Torvalds 7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Linus Torvalds 22365979ab Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has Jeff Mahoney's long standing trim patch that fixes corners
  where trims were missing.  Omar has some raid5/6 fixes, especially for
  using scrub and device replace when devices are missing.

  Zhao Lie continues cleaning and fixing things, this series fixes some
  really hard to hit corners in xfstests.  I had to pull it last merge
  window due to some deadlocks, but those are now resolved.

  I added support for Tejun's new blkio controllers.  It seems to work
  well for single devices, we'll expand to multi-device as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
  btrfs: fix compile when block cgroups are not enabled
  Btrfs: fix file read corruption after extent cloning and fsync
  Btrfs: check if previous transaction aborted to avoid fs corruption
  btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  btrfs: Prevent from early transaction abort
  btrfs: Remove unused arguments in tree-log.c
  btrfs: Remove useless condition in start_log_trans()
  Btrfs: add support for blkio controllers
  Btrfs: remove unused mutex from struct 'btrfs_fs_info'
  Btrfs: fix parity scrub of RAID 5/6 with missing device
  Btrfs: fix device replace of a missing RAID 5/6 device
  Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
  Btrfs: count devices correctly in readahead during RAID 5/6 replace
  Btrfs: remove misleading handling of missing device scrub
  btrfs: fix clone / extent-same deadlocks
  Btrfs: fix defrag to merge tail file extent
  Btrfs: fix warning in backref walking
  btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
  btrfs: Remove root argument in extent_data_ref_count()
  btrfs: Fix wrong comment of btrfs_alloc_tree_block()
  ...
2015-09-05 15:14:43 -07:00
Linus Torvalds 1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Linus Torvalds 089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Tsutomu Itoh 527afb4493 Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:41 -07:00
Qu Wenruo c6dd6ea557 btrfs: async_thread: Fix workqueue 'max_active' value when initializing
At initializing time, for threshold-able workqueue, it's max_active
of kernel workqueue should be 1 and grow if it hits threshold.

But due to the bad naming, there is both 'max_active' for kernel
workqueue and btrfs workqueue.
So wrong value is given at workqueue initialization.

This patch fixes it, and to avoid further misunderstanding, change the
member name of btrfs_workqueue to 'current_active' and 'limit_active'.

Also corresponding comment is added for readability.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:40 -07:00
Zhao Lei 943c6e9925 btrfs: Add raid56 support for updating
num_tolerated_disk_barrier_failures in btrfs_balance

Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.

Reason:
 Above code was wroten in 2012-08-01, together with
 btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.

 Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
 later to support raid56, but code in btrfs_balance() was not
 updated together.

Fix:
 Merge above similar code to a common function:
 btrfs_get_num_tolerated_disk_barrier_failures()
 and make it support both case.

 It can fix this bug with a bonus of cleanup, and make these code
 never in above no-sync state from now on.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:48 -07:00
Zhao Lei 2c4580454f btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
1: Use ARRAY_SIZE(types) to replace a static-value variant:
   int num_types = 4;

2: Use 'continue' on condition to reduce one level tab
   if (!XXX) {
       code;
       ...
   }
   ->
   if (XXX)
       continue;
   code;
   ...

3: Put setting 'num_tolerated_disk_barrier_failures = 2' to
   (num_tolerated_disk_barrier_failures > 2) condition to make
   make logic neat.
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   else if (num_tolerated_disk_barrier_failures > 1) {
       if (XXX)
           num_tolerated_disk_barrier_failures = 1;
       else if (XXX)
           num_tolerated_disk_barrier_failures = 2;
   ->
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   if (num_tolerated_disk_barrier_failures > 1 && XXX)
       num_tolerated_disk_barrier_failures = ;
   if (num_tolerated_disk_barrier_failures > 2 && XXX)
       num_tolerated_disk_barrier_failures = 2;

4: Remove comment of:
   num_mirrors - 1: if RAID1 or RAID10 is configured and more
   than 2 mirrors are used.
   which is not fit with code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:47 -07:00
Zhao Lei 8c204c9657 btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
These variables are not used from introduced version, remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:46 -07:00
Zhao Lei 7955323bdc btrfs: Update out-of-date "skip parity stripe" comment
Because btrfs support scrub raid56 parity stripe now.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:45 -07:00
Chris Mason 3a9508b022 btrfs: fix compile when block cgroups are not enabled
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them.  It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.

The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-21 10:08:13 -07:00
Filipe Manana b84b8390d6 Btrfs: fix file read corruption after extent cloning and fsync
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.

Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with a single 100K extent starting at file
  # offset 800K. We fsync the file here to make the fsync log tree gets
  # a single csum item that covers the whole 100K extent, which causes
  # the second fsync, done after the cloning operation below, to not
  # leave in the log tree two csum items covering two sub-ranges
  # ([0, 20K[ and [20K, 100K[)) of our extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                  -c "fsync"                     \
                   $SCRATCH_MNT/foo | _filter_xfs_io

  # Now clone part of our extent into file offset 400K. This adds a file
  # extent item to our inode's metadata that points to the 100K extent
  # we created before, using a data offset of 20K and a data length of
  # 20K, so that it refers to the sub-range [20K, 40K[ of our original
  # extent.
  $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
      -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # Now fsync our file to make sure the extent cloning is durably
  # persisted. This fsync will not add a second csum item to the log
  # tree containing the checksums for the blocks in the sub-range
  # [20K, 40K[ of our extent, because there was already a csum item in
  # the log tree covering the whole extent, added by the first fsync
  # we did before.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  # The fsync log replay first processes the file extent item
  # corresponding to the file offset 400K (the one which refers to the
  # [20K, 40K[ sub-range of our 100K extent) and then processes the file
  # extent item for file offset 800K. It used to happen that when
  # processing the later, it erroneously left in the csum tree 2 csum
  # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
  # 1 for the whole range of our extent. This introduced a problem where
  # subsequent lookups for the checksums of blocks within the range
  # [40K, 100K[ of our extent would not find anything because lookups in
  # the csum tree ended up looking only at the smaller csum item, the
  # one covering the subrange [20K, 40K[. This made read requests assume
  # an expected checksum with a value of 0 for those blocks, which caused
  # checksum verification failure when the read operations finished.
  # However those checksum failure did not result in read requests
  # returning an error to user space (like -EIO for e.g.) because the
  # expected checksum value had the special value 0, and in that case
  # btrfs set all bytes of the corresponding pages with the value 0x01
  # and produce the following warning in dmesg/syslog:
  #
  #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
  #   1322675045 expected csum 0"
  #
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest he had after cloning the extent and
  # before the power failure happened.
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:46 -07:00
Filipe Manana 1f9b8c8fbc Btrfs: check if previous transaction aborted to avoid fs corruption
While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans->state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans->state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans->state = TRANS_STATE_UNBLOCKED;
    root->fs_info->running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --> starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --> starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --> prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --> fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info->fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans->aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --> succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --> succeeds
                                                                  --> we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:31 -07:00
Michal Hocko 277fb5fc17 btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
transaction but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard to not fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
still fail if the _current_ process is the OOM killer victim. Moreover
there is an attempt to move away from the default no-fail behavior and
allow these allocation to fail more eagerly. This would lead to:

[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Michal Hocko d1b5c5671d btrfs: Prevent from early transaction abort
Btrfs relies on GFP_NOFS allocation when committing the transaction but
this allocation context is rather weak wrt. reclaim capabilities. The
page allocator currently tries hard to not fail these allocations if
they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
currently but there is an attempt to move away from the default no-fail
behavior and allow these allocation to fail more eagerly. And this would
lead to a pre-mature transaction abort as follows:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation
path and use __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei 60d53eb310 btrfs: Remove unused arguments in tree-log.c
Following arguments are not used in tree-log.c:
 insert_one_name(): path, type
 wait_log_commit(): trans
 wait_for_writer(): trans

This patch remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Zhaolei 34eb2a5249 btrfs: Remove useless condition in start_log_trans()
Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
for start_log_trans():
 fs/btrfs/tree-log.c:178 start_log_trans()
 warn: we tested 'root->log_root' before and it was 'false'

 fs/btrfs/tree-log.c
 147          if (root->log_root) {
 We test "root->log_root" here.
 ...

Reason:
 Condition of:
 fs/btrfs/tree-log.c:178: if (!root->log_root) {
 is not necessary after commit: 7237f1833

 It caused a smatch warning, and no functionally error.

Fix:
 Deleting above condition will make smatch shut up,
 but a better way is to do cleanup for start_log_trans()
 to remove duplicated code and make code more readable.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:24:49 -07:00
Oleg Nesterov bee9182d95 introduce __sb_writers_{acquired,release}() helpers
Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:08 +02:00
Kent Overstreet b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Kent Overstreet 0e28997ec4 btrfs: remove bio splitting and merge_bvec_fn() calls
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:43 -06:00
Greg Kroah-Hartman 5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
Chris Mason 46cd28555f Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
Chris Mason da2f0f74cf Btrfs: add support for blkio controllers
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.

Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.

Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.

The end result is able to throttle nicely on single disk filesystems.  A
little more work is required for multi-device filesystems.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:35:06 -07:00
Byongho Lee a4027a20c5 Btrfs: remove unused mutex from struct 'btrfs_fs_info'
The code using 'ordered_extent_flush_mutex' mutex has removed by below
commit.
 - 8d875f95da
   btrfs: disable strict file flushes for renames and truncates
But the mutex still lives in struct 'btrfs_fs_info'.

So, this patch removes the mutex from struct 'btrfs_fs_info' and its
initialization code.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:27 -07:00
Omar Sandoval 4a770891d9 Btrfs: fix parity scrub of RAID 5/6 with missing device
When testing the previous patch, Zhao Lei reported a similar bug when
attempting to scrub a degraded RAID 5/6 filesystem with a missing
device, leading to NULL pointer dereferences from the RAID 5/6 parity
scrubbing code.

The first cause was the same as in the previous patch: attempting to
call bio_add_page() on a missing block device. To fix this,
scrub_extent_for_parity() can just mark the sectors on the missing
device as errors instead of attempting to read from it.

Additionally, the code uses scrub_remap_extent() to map the extent of
the corresponding data stripe, but the extent wasn't already mapped. If
scrub_remap_extent() finds a missing block device, it doesn't initialize
extent_dev, so we're left with a NULL struct btrfs_device. The solution
is to use btrfs_map_block() directly.

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval 73ff61dbe5 Btrfs: fix device replace of a missing RAID 5/6 device
The original implementation of device replace on RAID 5/6 seems to have
missed support for replacing a missing device. When this is attempted,
we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
crashes when we try to dereference it. This happens because
btrfs_map_block() has no choice but to return us the missing device
because RAID 5/6 don't have any alternate mirrors to read from, and a
missing device has a NULL bdev.

The idea implemented here is to handle the missing device case
separately, which better only happen when we're replacing a missing RAID
5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
reconstruct the data from parity, check it with
scrub_recheck_block_checksum(), and write it out with
scrub_write_block_to_dev_replace().

Reported-by: Philip <bugzilla@philip-seeger.de>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval b4ee178268 Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
The current RAID 5/6 recovery code isn't quite prepared to handle
missing devices. In particular, it expects a bio that we previously
attempted to use in the read path, meaning that it has valid pages
allocated. However, missing devices have a NULL blkdev, and we can't
call bio_add_page() on a bio with a NULL blkdev. We could do manual
manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
a separate path that allows us to manually add pages to the rbio.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval 7cb2c4202e Btrfs: count devices correctly in readahead during RAID 5/6 replace
Commit 5fbc7c59fd ("Btrfs: fix unfinished readahead thread for raid5/6
degraded mounting") fixed a problem where we would skip a missing device
when we shouldn't have because there are no other mirrors to read from
in RAID 5/6. After commit 2c8cdd6ee4 ("Btrfs, replace: write dirty
pages into the replace target device"), the fix doesn't work when we're
doing a missing device replace on RAID 5/6 because the replace device is
counted as a mirror so we're tricked into thinking we can safely skip
the missing device. The fix is to count only the real stripes and decide
based on that.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Omar Sandoval 03679ade86 Btrfs: remove misleading handling of missing device scrub
scrub_submit() claims that it can handle a bio with a NULL block device,
but this is misleading, as calling bio_add_page() on a bio with a NULL
->bi_bdev would've already crashed. Delete this, as we're about to
properly handle a missing block device.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:26 -07:00
Mark Fasheh 293a8489f3 btrfs: fix clone / extent-same deadlocks
Clone and extent same lock their source and target inodes in opposite order.
In addition to this, the range locking in clone doesn't take ordering into
account. Fix this by having clone use the same locking helpers as
btrfs-extent-same.

In addition, I do a small cleanup of the locking helpers, removing a case
(both inodes being the same) which was poorly accounted for and never
actually used by the callers.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:34:25 -07:00
Liu Bo 4a3560c4f3 Btrfs: fix defrag to merge tail file extent
The file layout is

[extent 1]...[extent n][4k extent][HOLE][extent x]

extent 1~n and 4k extent can be merged during defrag, and the whole
defrag bytes is larger than our defrag thresh(256k), 4k extent as a
tail is left unmerged since we check if its next extent can be merged
(the next one is a hole, so the check will fail), the layout thus can
be

[new extent][4k extent][HOLE][extent x]
 (1~n)

To fix it, beside looking at the next one, this also looks at the
previous one by checking @defrag_end, which is set to 0 when we
decide to stop merging contiguous extents, otherwise, we can merge
the previous one with our extent.

Also, this makes btrfs behave consistent with how xfs and ext4 do.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:33:50 -07:00
Liu Bo acdf898de8 Btrfs: fix warning in backref walking
When we do backref walking, we search firstly in queued delayed refs
and then the on-disk backrefs, but we parse differently for shared
references, for delayed refs we also add 'ref->root' while for on-disk
backrefs we don't, this can prevent us from merging refs indexed
by the same bytenr and cause find_parent_nodes() to throw a warning at
'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
that happens.

For shared references, no matter if it's delayed or on-disk, ref->root is
not at all used, instead it's ref->parent that really matters, so this has
delayed refs handled as the same way as on-disk refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:33:50 -07:00
Zhaolei 166f66d0bc btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
When a task trying to double lock a extent buffer, there are no
lockdep warning about it because this lock may be in "blocking_lock"
state, and make us hard to debug.

This patch add a WARN_ON() for above condition, it can not report
all deadlock cases(as lock between tasks), but at least helps us
some.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei 9ed0dea09f btrfs: Remove root argument in extent_data_ref_count()
Because it is never used.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei d02207512d btrfs: Fix wrong comment of btrfs_alloc_tree_block()
These wrong comment was copyed from another function(expired) from
init, this patch fixed them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei 93314e3b64 btrfs: abort transaction on btrfs_reloc_cow_block()
When btrfs_reloc_cow_block() failed in __btrfs_cow_block(), current
code just return a err-value to caller, but leave new_created extent
buffer exist and locked.

Then subsequent code (in relocate) try to lock above eb again,
and caused deadlock without any dmesg.
(eb lock use wait_event(), so no lockdep message)

It is hard to do recover work in __btrfs_cow_block() at this error
point, but we can abort transaction to avoid deadlock and operate on
unstable state.a

It also helps developer to find wrong place quickly.
(better than a frozen fs without any dmesg before patch)

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei 147d256e09 btrfs: Remove unnecessary variants in relocation.c
These arguments are not used in functions, remove them for cleanup
and make kernel stack happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:14 -07:00
Zhaolei dc2ee4e244 btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
Remove chunk_objectid argument from btrfs_relocate_chunk() because
it is not necessary, it can also cleanup some code in caller for
prepare its value.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei 4624900dd3 btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
objectid's init-value is not used in any case, remove it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei 4b3576e450 btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
We need error checking code for get_ref_objectid_v0() in
relocate_block_group(), to avoid unpredictable result, especially
for accessing uninitialized value(when function failed) after
this line.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei 55e3a601c8 btrfs: Fix data checksum error cause by replace with io-load.
xfstests btrfs/070 sometimes failed.
In my test machine, its fail rate is about 30%.
In another vm(vmware), its fail rate is about 50%.

Reason:
  btrfs/070 do replace and defrag with fsstress simultaneously,
  after above operation, checksum error is found by scrub.

  Actually, it have no relationship with defrag operation, only
  replace with fsstress can trigger this bug.

  New data writen to target device have possibility rewrited by
  old data from source device by replace code in debug, to avoid
  above problem, we can set target block group to readonly in
  replace period, so new data requested by other operation will
  not write to same place with replace code.

  Before patch(4.1-rc3):
    30% failed in 100 xfstests.
  After patch:
    0% failed in 300 xfstests.

It also happened in btrfs/071 as it's another scrub with IO load tests.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei b708ce969a btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
Use new intruduced scrub_pause_on/off() can make this code block
clean and more readable.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhaolei 0e22be890e btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
It can reduce current duplicated code which is similar to
scrub_blocked_if_needed() but can not call it because little
different.
It also used by my next patch which is in same case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhaolei 868f401ae3 btrfs: Use ref_cnt for set_block_group_ro()
More than one code call set_block_group_ro() and restore rw in fail.

Old code use bool bit to save blockgroup's ro state, it can not
support parallel case(it is confirmd exist in my debug log).

This patch use ref count to store ro state, and rename
set_block_group_ro/set_block_group_rw
to
inc_block_group_ro/dec_block_group_ro.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhao Lei d7cad23895 btrfs: Bypass unrelated items before accessing its contents in scrub
When we access extent_root in scrub_stripe() and
scrub_raid56_parity(), we need bypass unrelated tree item firstly
before using its contents to do other condition.

It is not a bug fix, only making code sequence in logic.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:12 -07:00
Zhao Lei fe8cf654b1 btrfs: Load only necessary csums into list in scrub
We need not load csum of whole strip in scrub because strip is trimed
before use, it is to say, what we really need to calculate csum is
data between [extent_logical, extent_len).

This patch changed to use above segment for btrfs_lookup_csums_range()
in scrub_stripe()

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei a0dd59de3c btrfs: Fix calculate typo caused by ambiguous meaning of logic_end
For example, in scrub_raid56_parity(), following lines are used
to judge is all data processed:
 place1: if (key.objectid > logic_end) ...
 place2: if (logic_start >= logic_end) ...
 ...
 (place2 is typo, is should be ">", it is copied from other
  place, where logic_end's meaning is different, long story...)

We can fix above typo directly, but the root reason is ambiguous
meaning of logic_end in scrub raid56 parity.

In other place, XXX_end is pointed to data which is not included,
and we need to process segment of [XXX_start, XXX_end).

But for scrub raid56 parity, logic_end is pointed to lattest data
need to process, and introduced many "+ 1" and "- 1" in code as
below:
 length = sparity->logic_end - sparity->logic_start + 1
 logic_end - logic_start + 1
 stripe_logical + increment - 1

This patch changed logic_end's meaning to make it in normal understanding
in raid56 parity functions and data struct alone with above bugfix.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei 6fa96d72f7 btrfs: Free checksum list on scrub_extent() fail
When scrub_extent() failed, we need to free previois created
checksum list.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei f2f66a2f88 btrfs: Check cancel and pause in interval of scrub operation
Old code checking cancel and pause request inside scrub stripe
operation, like:
  loop() {
    if (parity) {
      scrub_parity_stripe();
      continue;
    }

    check_cancel_and_pause()

    scrub_normal_stripe();
  }

Reason is when introduce raid56 stripe scrub, new code is inserted
simplely to front of loop.

Better to:
  loop() {
    check_cancel_and_pause()

    if (parity)
      scrub_parity_stripe();
    else
      scrub_normal_stripe();
  }

This patch adjusted code place to realize above sequence.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:11 -07:00
Zhao Lei 78fa177029 btrfs: Show detail information when mount failed on missing devices
When mount failed because missing device, we can see following
dmesg:
 [ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed
 [ 1060.273158] BTRFS: open_ctree failed

This patch add missing_device_number and tolerated_missing_device_number
to above output, to let user know what really happened, and helps
bug-report and debug.

dmesg after patch:
 [  127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed
 [  127.056099] BTRFS: open_ctree failed

Changelog v1->v2:
1: Changed to more clear description, suggested-by:
   Anand Jain <anand.jain@oracle.com>

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:10 -07:00
Zhao Lei a323e8139c btrfs: Fix scrub panic when leaf crosses stripes
Scrub panic in following operation:
  mkfs.ext4 /dev/vdh
  btrfs-convert /dev/vdh
  mount /dev/vdh /mnt/tmp1
  btrfs scrub start -B /dev/vdh
  (panic)

Reason:
  1: In some case, leaf created by btrfs-convert was splited into 2
     strips.
  2: Scrub bypassed part of above wrong leaf data, but remain data
     caused panic in scrub_checksum_tree_block().

For reason 1:
  we can get following information after some simple operation.
  a. mkfs.ext4 /dev/vdh
     btrfs-convert /dev/vdh
  b. btrfs-debug-tree /dev/vdh
     we can see following item in extent tree:
     item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
     Its logical address is [27054080, 27070464)
     and acrossed 2 strips:
     [27000832, 27066368)
     [27066368, 27131904)
  Will be fixed in btrfs-progs(btrfs-convert, btrfsck, ...)

For reason 2:
  Scrub is trying to do a "bypass" in this case, but the result is
  "panic", because current code lacks of some condition in bypass,
  and let some wrong leaf data escaped.

This patch fixed above scrub code.

Before patch:
  # btrfs scrub start -B /dev/vdh
  (panic)

After patch:
  # btrfs scrub start -B /dev/vdh
  scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
  ...
  # dmesg
  ...
  [   59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
  [   59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
  #

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:00:31 -07:00
Filipe Manana 18aa092297 Btrfs: fix stale dir entries after removing a link and fsync
We have one more case where after a log tree is replayed we get
inconsistent metadata leading to stale directory entries, due to
some directories having entries pointing to some inode while the
inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.

To trigger the problem we need to have a file with multiple hard links
belonging to different parent directories. Then if one of those hard
links is removed and we fsync the file using one of its other links
that belongs to a different parent directory, we end up not logging
the fact that the removed hard link doesn't exists anymore in the
parent directory.

Simple reproducer:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test directory and file.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/foo
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3

  # Make sure everything done so far is durably persisted.
  sync

  # Now we remove one of our file's hardlinks in the directory testdir.
  unlink $SCRATCH_MNT/testdir/foo3

  # We now fsync our file using the "foo" link, which has a parent that
  # is not the directory "testdir".
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Silently drop all writes and unmount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger journal/log replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the journal/log is replayed we expect to not see the "foo3"
  # link anymore and we should be able to remove all names in the
  # directory "testdir" and then remove it (no stale directory entries
  # left after the journal/log replay).
  echo "Entries in testdir:"
  ls -1 $SCRATCH_MNT/testdir

  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  status=0
  exit

The test fails with:

  $ ./check generic/107
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+
  MKFS_OPTIONS  -- /dev/sdc
  MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

  generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
    --- tests/generic/107.out	2015-08-01 01:39:45.807462161 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
    @@ -1,3 +1,5 @@
     QA output created by 107
     Entries in testdir:
     foo2
    +foo3
    +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
    ...
    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
      (see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
    _check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
  Ran: generic/107
  Failures: generic/107
  Failed 1 of 1 tests

  $ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
  (...)
  checking fs roots
  root 5 inode 257 errors 200, dir isize wrong
	unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
  (...)

And produces the following warning in dmesg:

  [127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257
  [127298.762081] ------------[ cut here ]------------
  [127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]()
  [127298.767327] BTRFS: Transaction aborted (error -2)
  (...)
  [127298.788611] Call Trace:
  [127298.789137]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [127298.790090]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
  [127298.791157]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [127298.792323]  [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs]
  [127298.793633]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [127298.794699]  [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs]
  [127298.797640]  [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
  [127298.798876]  [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs]
  [127298.800154]  [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed
  [127298.801303]  [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb
  [127298.802450]  [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14
  [127298.803797]  [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b
  [127298.805017]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [127298.806310] ---[ end trace bbfddacb7aaada7b ]---
  [127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry).

So fix this by logging all parent inodes, current and old ones, to make
sure we do not get stale entries after log replay. This is not a simple
solution such as triggering a full transaction commit because it would
imply full transaction commit when an inode is fsynced in the same
transaction that modified it and reloaded it after eviction (because its
last_unlink_trans is set to the same value as its last_trans as of the
commit with the title "Btrfs: fix stale dir entries after unlink, inode
eviction and fsync"), and it would also make fstest generic/066 fail
since one of the fsyncs triggers a full commit and the next fsync will
not find the inode in the log anymore (therefore not removing the xattr).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:04 -07:00
Naohiro Aota dd81d459a3 btrfs: fix search key advancing condition
The search key advancing condition used in copy_to_sk() is loose. It can
advance the key even if it reaches sk->max_*: e.g. when the max key = (512,
1024, -1) and the current key = (512, 1025, 10), it increments the
offset by 1, continues hopeless search from (512, 1025, 11). This issue
make ioctl() to take unexpectedly long time scanning all the leaf a blocks
one by one.

This commit fix the problem using standard way of key comparison:
btrfs_comp_cpu_keys()

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:02 -07:00
Filipe Manana d6589101b6 Btrfs: teach backref walking about backrefs with underflowed offset values
When cloning/deduplicating file extents (through the clone and extent_same
ioctls) we can get data back references with offset values that are a
result of an unsigned integer arithmetic underflow, that is, values that
are much larger then they could be otherwise.

This is not a problem when decrementing or dropping the back references
(happens when we overwrite the extents or punch a hole for example, through
__btrfs_drop_extents()), since we compute the same too large offset value,
but it is a problem for the backref walking code, used by an incremental
send and the ioctls that are used by the btrfs tool "inspect-internal"
commands, as it makes it miss the corresponding file extent items because
the search key is set for an extent item that starts at an offset matching
the exceptionally large offset value of the data back reference. For an
incremental send this causes the send ioctl to fail with -EIO.

So teach the backref walking code to deal with these cases by setting the
search key's offset to 0 if the backref's offset value is larger than
LLONG_MAX (the largest possible file offset). This makes sure the backref
walking code finds the corresponding file extent items at the expense of
scanning more items and leafs in the btree.

Fixing the clone/dedup ioctls to not produce such underflowed results would
require major changes breaking backward compatibility, updating user space
tools, etc.

Simple reproducer case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K starting at file
  # offset 128K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1

  # Now clone parts of the original extent into lower offsets of the file.
  #
  # The first clone operation adds a file extent item to file offset 0
  # that points to our initial extent with a data offset of 16K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709535232, which is the result of file_offset - data_offset
  # = 0 - 16K.
  #
  # The second clone operation adds a file extent item to file offset 16K
  # that points to our initial extent with a data offset of 48K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709518848, which is the result of file_offset - data_offset
  # = 16K - 48K.
  #
  # Those large back reference offsets (result of unsigned arithmetic
  # underflow) confused the back reference walking code (used by an
  # incremental send and the multiple inspect-internal ioctls) and made it
  # miss the back references, which for the case of an incremental send it
  # made it fail with -EIO and print a message like the following to
  # dmesg:
  #
  # "BTRFS error (device sdc): did not find backref in send_root. \
  #  inode=257, offset=0, disk_byte=12845056 found extent=12845056"
  #
  $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo
  $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
      -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $send_files_dir/2.snap

  echo "File digest in the original filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  status=0
  exit

The test's expected golden output is:

  wrote 65536/65536 bytes at offset 131072
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File digest in the original filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
  File digest in the new filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo

But it failed with:

    (...)
    @@ -1,7 +1,5 @@
     QA output created by 097
     wrote 65536/65536 bytes at offset 131072
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -File digest in the original filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    -File digest in the new filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    ...

  $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
  (...)
  ERROR: send ioctl failed with -5: Input/output error

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:17:00 -07:00
Filipe Manana bde6c24202 Btrfs: fix stale dir entries after unlink, inode eviction and fsync
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.

This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.

The following test case for fstests triggers the problem:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with 2 hard links.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/testdir/foo
  ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar

  # Make sure everything done so far is durably persisted.
  sync

  # Now remove one of the links, trigger inode eviction and then fsync
  # our inode.
  unlink $SCRATCH_MNT/testdir/bar
  echo 2 > /proc/sys/vm/drop_caches
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo

  # Silently drop all writes on our scratch device to simulate a power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount the fs to trigger log/journal replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now verify our directory entries.
  echo "Entries in testdir:"
  ls -1 $SCRATCH_MNT/testdir

  # If we remove our inode, its parent should become empty and therefore we should
  # be able to remove the parent.
  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  # The fstests framework will call fsck against our filesystem which will verify
  # that all metadata is in a consistent state.

  status=0
  exit

The test failed on btrfs with:

  generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
    --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
    @@ -1,3 +1,6 @@
     QA output created by 098
     Entries in testdir:
    +bar
     foo
    +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
    +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
    ...
    (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)

  $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
  (...)
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
     unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
     unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
  Checking filesystem on /dev/sdc
  (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:16:58 -07:00
Filipe Manana bb53eda902 Btrfs: fix stale directory entries after fsync log replay
We have another case where after an fsync log replay we get an inode with
a wrong link count (smaller than it should be) and a number of directory
entries greater than its link count. This happens when we add a new link
hard link to our inode A and then we fsync some other inode B that has
the side effect of logging the parent directory inode too. In this case
at log replay time we add the new hard link to our inode (the item with
key BTRFS_INODE_REF_KEY) when processing the parent directory but we
never adjust the link count of our inode A. As a result we get stale dir
entries for our inode A that can never be deleted and therefore it makes
it impossible to remove the parent directory (as its i_size can never
decrease back to 0).

A simple reproducer for fstests that triggers this issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test directory and files.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/testdir/foo
  touch $SCRATCH_MNT/testdir/bar

  # Make sure everything done so far is durably persisted.
  sync

  # Create one hard link for file foo and another one for file bar. After
  # that fsync only the file bar.
  ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
  ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar

  # Silently drop all writes on scratch device to simulate power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount the fs to trigger log/journal replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now verify both our files have a link count of 2.
  echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
  echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"

  # We should be able to remove all the links of our files in testdir, and
  # after that the parent directory should become empty and therefore
  # possible to remove it.
  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  # The fstests framework will call fsck against our filesystem which will verify
  # that all metadata is in a consistent state.

  status=0
  exit

The test fails with:

 -Link count for file foo: 2
 +Link count for file foo: 1
  Link count for file bar: 2
 +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
 +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
 (...)
 _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent

And fsck's output:

  (...)
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
      unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
  Checking filesystem on /dev/sdc
  (...)

So fix this by marking inodes for link count fixup at log replay time
whenever a directory entry is replayed if the entry was created in the
transaction where the fsync was made and if it points to a non-directory
inode.

This isn't a new problem/regression, the issue exists for a long time,
possibly since the log tree feature was added (2008).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:16:56 -07:00
Linus Torvalds af0b3152bb Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have a btrfs quota regression fix.

  I merged this one on Thursday and have run it through tests against
  current master.

  Normally I wouldn't have sent this while you were finalizing rc6, but
  I'm feeding mosquitoes in the adirondacks next week, so I wanted to
  get this one out before leaving.  I'll leave longer tests running and
  check on things during the week, but I don't expect any problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: Fix a regression in qgroup reserved space.
2015-08-09 05:56:31 +03:00
Geert Uytterhoeven d41e36a0ab Btrfs: Spelling s/consitent/consistent/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:13:21 +02:00
Qu Wenruo c05f9429e1 btrfs: qgroup: Fix a regression in qgroup reserved space.
During the change to new btrfs extent-oriented qgroup implement, due to
it doesn't use the old __qgroup_excl_accounting() for exclusive extent,
it didn't free the reserved bytes.

The bug will cause limit function go crazy as the reserved space is
never freed, increasing limit will have no effect and still cause
EQOUT.

The fix is easy, just free reserved bytes for newly created exclusive
extent as what it does before.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-06 14:51:15 -07:00
Greg Kroah-Hartman f368ed6088 char: make misc_deregister a void function
With well over 200+ users of this api, there are a mere 12 users that
actually checked the return value of this function.  And all of them
really didn't do anything with that information as the system or module
was shutting down no matter what.

So stop pretending like it matters, and just return void from
misc_deregister().  If something goes wrong in the call, you will get a
WARNING splat in the syslog so you know how to fix up your driver.
Other than that, there's nothing that can go wrong.

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 10:35:49 -07:00
Linus Torvalds acea568fa9 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe fixed up a hard to trigger ENOSPC regression from our merge
  window pull, and we have a few other smaller fixes"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix quick exhaustion of the system array in the superblock
  btrfs: its btrfs_err() instead of btrfs_error()
  btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
  btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
2015-07-31 17:05:37 -07:00
Jeff Mahoney e33e17ee10 btrfs: add missing discards when unpinning extents with -o discard
When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit.  As a result, any extents that
would be freed when the block group is removed aren't discarded.  In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:29 -07:00
Jeff Mahoney e44163e177 btrfs: explictly delete unused block groups in close_ctree and ro-remount
The cleaner thread may already be sleeping by the time we enter
close_ctree.  If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard.  We also
explicitly remove unused block groups in the ro-remount path
for the same reason.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:27 -07:00
Jeff Mahoney 499f377f49 btrfs: iterate over unused chunk space in FITRIM
Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used.  This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock.  That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls.  We release it
between searches to allow the file system to perform more-or-less
normally.  Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search.  This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:26 -07:00
Jeff Mahoney 86557861df btrfs: skip superblocks during discard
Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use.  We track
the superblocks in memory to ensure they don't get used by removing them
from the free space cache when we load a block group from disk.  Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe.  Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard.  The block groups were
still being released, but the blocks remained on disk.

In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range.  Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:25 -07:00
Jeff Mahoney 4d89d377bb btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
It's possible, though unexpected, to pass unaligned offsets and lengths
to btrfs_issue_discard.  We then shift the offset/length values to sector
units.  If an unaligned offset has been passed, it will result in the
entire sector being discarded, possibly losing data.  An unaligned
length is safe but we'll end up returning an inaccurate number of
discarded bytes.

This patch aligns the offset to the 512B boundary, adjusts the length,
and warns, since we shouldn't be discarding on an offset that isn't
aligned with our sector size.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:24 -07:00
Jeff Mahoney d04c6b8832 btrfs: make btrfs_issue_discard return bytes discarded
Initially this will just be the length argument passed to it,
but the following patches will adjust that to reflect re-alignment
and skipped blocks.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:22 -07:00
Christoph Hellwig 4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Filipe Manana 00d80e342c Btrfs: fix quick exhaustion of the system array in the superblock
Omar reported that after commit 4fbcdf6694 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:

  #!/bin/bash

  truncate -s 100T big.img
  mkfs.btrfs big.img
  mount -o loop big.img /mnt/loop

  num=5
  sz=10T
  for ((i = 0; i < $num; i++)); do
      echo fallocate $i $sz
      fallocate -l $sz /mnt/loop/testfile$i
  done
  btrfs filesystem sync /mnt/loop

  for ((i = 0; i < $num; i++)); do
        echo rm $i
        rm /mnt/loop/testfile$i
        btrfs filesystem sync /mnt/loop
  done
  umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leads in the chunk
tree.

So fix this by ensuring we don't accumulate a too large list of new block
groups in a transaction's handles new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:54 -07:00
Anand Jain 3e303ea60d btrfs: its btrfs_err() instead of btrfs_error()
sorry I indented to use btrfs_err() and I have no idea
how btrfs_error() got there.
infact I was thinking about these kind of oversights
since these two func are too closely named.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:53 -07:00
Zhao Lei 95ab1f6490 btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
When read_tree_block() failed, we can see following dmesg:
 [  134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
 [  134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 [  134.372236] PGD 0
 [  134.372236] Oops: 0000 [#1] SMP
 [  134.372236] Modules linked in:
 [  134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115
 [  134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
 [  134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
 [  134.372236] RIP: 0010:[<ffffffff813a4a51>]  [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
 ...
 [  134.372236] Call Trace:
 [  134.372236]  [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0
 [  134.372236]  [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190
 [  134.372236]  [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0
 [  134.372236]  [<ffffffff8144d017>] ? disk_name+0x97/0xb0
 [  134.372236]  [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0
 ...

Reason:
 read_tree_block() changed to return error number on fail,
 and this value(not NULL) is set to tree_root->node, then subsequent
 code will run to:
  free_root_pointers()
  ->free_root_extent_buffers()
  ->free_extent_buffer()
  ->atomic_read((extent_buffer *)(-E_XXX)->refs);
 and trigger above error.

Fix:
 Set tree_root->node to NULL on fail to make error_handle code
 happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:52 -07:00
Zhao Lei 8a73301304 btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
delayed_iput_sem in xfstests generic/241:
  [ 2061.345955] =============================================
  [ 2061.346027] [ INFO: possible recursive locking detected ]
  [ 2061.346027] 4.1.0+ #268 Tainted: G        W
  [ 2061.346027] ---------------------------------------------
  [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
  [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] but task is already holding lock:
  [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
  [ 2061.346027] other info that might help us debug this:
  [ 2061.346027]  Possible unsafe locking scenario:

  [ 2061.346027]        CPU0
  [ 2061.346027]        ----
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
  [ 2061.346027]
   *** DEADLOCK ***
It is rarely happened, about 1/400 in my test env.

The reason is recursion of btrfs_run_delayed_iputs():
  cleaner_kthread
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock *2
  -> iput()
  -> ...
  -> btrfs_commit_transaction()
  -> btrfs_run_delayed_iputs() *1
  -> get delayed_iput_sem lock (dead lock) *2
  *1: recursion of btrfs_run_delayed_iputs()
  *2: warning of lockdep about delayed_iput_sem

When fs is in high stress, new iputs may added into fs_info->delayed_iputs
list when btrfs_run_delayed_iputs() is running, which cause
second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
again, and cause above lockdep warning.

Actually, it will not cause real problem because both locks are read lock,
but to avoid lockdep warning, we can do a fix.

Fix:
  Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
  cleaner_kthread thread to break above recursion path.
  cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
  and don't need to call btrfs_run_delayed_iputs() again in
  btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.

Test:
  No above lockdep warning after patch in 1200 generic/241 tests.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22 18:20:51 -07:00
Linus Torvalds 8be5701342 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are all from Filipe, and cover a few problems we've had reported
  on the list recently (along with ones he found on his own)"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file corruption after cloning inline extents
  Btrfs: fix order by which delayed references are run
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-17 21:46:57 -07:00
Filipe Manana ed95876264 Btrfs: fix file corruption after cloning inline extents
Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing copy an inline extent from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equals to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
     $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[<ffffffff8111a9d5>]  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424]  [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424]  [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424]  [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424]  [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
  [152154.036424]  [<ffffffff81165f89>] vfs_write+0xa0/0xe4
  [152154.036424]  [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
  [152154.036424]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424]  RSP <ffff880429effc68>
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13a2e ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df8da ("btrfs: replace error code from btrfs_drop_extents")

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-14 16:09:39 +01:00
Filipe Manana cffc3374e5 Btrfs: fix order by which delayed references are run
When we have an extent that got N references removed and N new references
added in the same transaction, we must run the insertion of the references
first because otherwise the last removed reference will remove the extent
item from the extent tree, resulting in a failure for the insertions.

This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create prealloc extent covering range [160K, 620K[
  $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo

  # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
  # space that immediately follows it. This creates a new extent of 40K that spans
  # the range [620K, 660K[.
  $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now 2 back references to the prealloc extent in our
  # extent tree. Both are for our file offset 160K and one relates to a file
  # extent item with a data offset of 0 and a length of 380K, while the other
  # relates to a file extent item with a data offset of 380K and a length of 80K.

  # Make sure everything done so far is durably persisted (all back references are
  # in the extent tree, etc).
  sync

  # Now clone all extents of our file that cover the offset 160K up to its eof
  # (660K at this point) into itself at offset 2M. This leaves a hole in the file
  # covering the range [660K, 2M[. The prealloc extent will now be referenced by
  # the file twice, once for offset 160K and once for offset 2M. The 40K extent
  # that follows the prealloc extent will also be referenced twice by our file,
  # once for offset 620K and once for offset 2M + 460K.
  $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
	$SCRATCH_MNT/foo

  # Now create one new extent in our file with a size of 100Kb. It will span the
  # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
  # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now (in memory) 4 back references to the prealloc
  # extent.
  #
  # Two of them are for file offset 160K, related to file extent items
  # matching the file offsets 160K and 540K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
  #
  # The other two references are for file offset 2M, related to file extent items
  # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
  #
  # The 40K extent has 2 back references, one for file offset 620K and the other
  # for file offset 2M + 460K.
  #
  # The 100K extent has a single back reference and it relates to file offset 3M.

  # Now clone our 100K extent into offset 600K. That offset covers the last 20K
  # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
  # offset 660K.
  $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # At this point there's only one reference to the 40K extent, at file offset
  # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
  # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
  # file offset 3M and a new one for file offset 600K).

  # Now fsync our file to make all its new data and metadata updates are durably
  # persisted and present if a power failure/crash happens after a successful
  # fsync and before the next transaction commit.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file contents.
  # During log replay, the btrfs delayed references implementation used to run the
  # deletion of back references before the addition of new back references, which
  # made the addition fail as it didn't find the key in the extent tree that it
  # was looking for. The failure triggered by this test was related to the 40K
  # extent, which got 1 reference dropped and 1 reference added during the fsync
  # log replay - when running the delayed references at transaction commit time,
  # btrfs was applying the deletion before the insertion, resulting in a failure
  # of the insertion that ended up turning the fs into read-only mode.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:

  [ 8247.578385] ------------[ cut here ]------------
  [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
  (...)
  [ 8247.601697] Call Trace:
  [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
  [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
  [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
  [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
  [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  (...)
  [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---

  [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
  [ 8247.728954] BTRFS: Transaction aborted (error -5)
  (...)
  [ 8247.760866] Call Trace:
  [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
  [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
  [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
  [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
  [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
  [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
  [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
  (...)
  [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
  [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
  [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)

Fixes: c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-07-11 22:36:44 +01:00
Filipe Manana d3efe08400 Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Cc: stable@vger.kernel.org
Fixes: 50d9aa99bd ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2015-07-11 22:35:05 +01:00
Filipe Manana 497b4050e0 Btrfs: fix memory leak in the extent_same ioctl
We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.

This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.

Reported-by: Julian Taylor <jtaylor.debian@googlemail.com>
Reported-by: Marcel Ritter <ritter.marcel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-07-11 22:34:26 +01:00
Filipe Manana c1aa45759e Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.

This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.

Example reproducer:

  $ mkfs.btrfs -O no-holes -f /dev/sdd
  $ mount /dev/sdd /mnt

  # Create our test file with some data and durably persist it.
  $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
  $ sync

  # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset if the following write. So
  # our file gets a hole in the range [128Kb, 256Kb[.
  $ xfs_io -c "truncate 160K" /mnt/foo

  # We expect to see our file with a size of 160Kb, with the first 128Kb
  # of data all having the value 0xaa and the remaining 32Kb of data all
  # having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0500000

  # Now cleanly unmount and mount again the filesystem.
  $ umount /mnt
  $ mount /dev/sdd /mnt

  # We expect to get the same result as before, a file with a size of
  # 160Kb, with the first 128Kb of data all having the value 0xaa and the
  # remaining 32Kb of data all having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000

In the example above the file size/data do not match what they were before
the remount.

Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11 22:33:14 +01:00
Linus Torvalds 31b7a57c9e Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of fixes.  Most of the commits are from Filipe
  (fsync, the inode allocation cache and a few others).  Mark kicked in
  a series fixing corners in the extent sharing ioctls, and everyone
  else fixed up on assorted other problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong check for btrfs_force_chunk_alloc()
  Btrfs: fix warning of bytes_may_use
  Btrfs: fix hang when failing to submit bio of directIO
  Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
  Btrfs: fix memory corruption on failure to submit bio for direct IO
  btrfs: don't update mtime/ctime on deduped inodes
  btrfs: allow dedupe of same inode
  btrfs: fix deadlock with extent-same and readpage
  btrfs: pass unaligned length to btrfs_cmp_data()
  Btrfs: fix fsync after truncate when no_holes feature is enabled
  Btrfs: fix fsync xattr loss in the fast fsync path
  Btrfs: fix fsync data loss after append write
  Btrfs: fix crash on close_ctree() if cleaner starts new transaction
  Btrfs: fix race between caching kthread and returning inode to inode cache
  Btrfs: use kmem_cache_free when freeing entry in inode cache
  Btrfs: fix race between balance and unused block group deletion
  btrfs: add error handling for scrub_workers_get()
  btrfs: cleanup noused initialization of dev in btrfs_end_bio()
  btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-11 10:26:34 -07:00
Linus Torvalds 1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Shilong Wang 9689457b5b Btrfs: fix wrong check for btrfs_force_chunk_alloc()
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
This problem exists since commit c87f08ca4.

With this patch, we might fix some enospc problems for balances.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:22 -07:00
Liu Bo ddba1bfc23 Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().

This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:21 -07:00
Liu Bo ad9ee2053f Btrfs: fix hang when failing to submit bio of directIO
The hang is uncoverd by generic/019.

btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().

This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:20 -07:00
Filipe Manana 9c6429d96d Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
The comment was not correct about the part where it says the endio
callback of the bio might have not yet been called - update it
to mention that by that time the endio callback execution might
still be in progress only.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:19 -07:00
Filipe Manana 61de718fce Btrfs: fix memory corruption on failure to submit bio for direct IO
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:

[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
[161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)

We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:

unreferenced object 0xffff8803f553bf80 (size 96):
  comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
  hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81161ffe>] create_object+0x172/0x29a
    [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
    [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
    [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
    [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
    [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
    [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
    [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
    [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
    [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
    [<ffffffff81165da9>] vfs_write+0xa0/0xe4
    [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
    [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
    [<ffffffffffffffff>] 0xffffffffffffffff

For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.

So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:18 -07:00
Mark Fasheh 1c919a5e13 btrfs: don't update mtime/ctime on deduped inodes
One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.

Clone still wants time changes, so we special case this in the code.

This was tested with the btrfs-extent-same tool.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:17 -07:00
Mark Fasheh 0efa9f48c7 btrfs: allow dedupe of same inode
clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:15 -07:00