Граф коммитов

1759 Коммитов

Автор SHA1 Сообщение Дата
David Sterba 3ffbd68c48 btrfs: simplify pointer chasing of local fs_info variables
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.

If the transaction handle is available and not NULL it's used instead.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba 3750851562 btrfs: simplify some assignments of inode numbers
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
Arnd Bergmann d3c6be6fda btrfs: use timespec64 for i_otime
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Qu Wenruo e41ca58974 btrfs: Get rid of the confusing btrfs_file_extent_inline_len
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Al Viro 32955c5422 btrfs: switch to discard_new_inode()
Make sure that no partially set up inodes can be returned by
open-by-handle.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:29 -04:00
Linus Torvalds d3bc0e67f8 for-4.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAls4zz0ACgkQxWXV+ddt
 WDs0ZhAAplEAcN1BP986BS7GpjrG20vQtP9AnHlnSEJJJmsnykpspBylOcLRkKjF
 LKBfBPCKqIo7kn5ebKT1Kk7zJPkOOEfmxGW7hffVN/oa/oMtmgJbbHDUgl2TDgdu
 rky1O+Bj+S37s5rhiXAJ4oU9ekdpWIlN30GczfynjiPqGigKh/cINsEEhQIIAiJG
 PRDQfSIJeh67x1AP0KE8sJAYSsaeFxT+kHrT/NPs1NFDSzrQSa/QWPFVjGVVuI/Y
 w84Mo0EqdRV7tap7D3QyWyYea6zdP00PG8TyLl0Kba+LckFbzpNN5hP3SUxleBzL
 0ZBJi7/tOqnrMV3YaGm40dLfgD4B+CFt8zDyg2JvWUxxEzfQfYif7KIT2IV8fSqS
 QrVw2NrzQC7EZ4Zu98wCN7dyyOE8yhqbq805YdG3Nj+zT6DqRu01TBo4Yr/Ek8ux
 +ITAtQVbaOZmTIt/qh/Oxc5jRsurAno1FP3XRH+1hfSlS7xc3LfI1CUbX3jAKzXN
 edxdM4/h+d4nekvROnKBH4EheS6+ZVfgzYlYUW9c2rjcJ1RHhDElbh14+IoM6LKJ
 nJ+Cp+744F6W5jaG3oWElJrdhlY31mWUjiZaj2CHl16EcH3MToytrxKMX+OWo95W
 gChnKicrtpO6+9nbED3Tdhp7SkbDysun6jvEpSgdlm8+2H5Kwrw=
 =KWL7
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have a few regression fixes for qgroup rescan status tracking and
  the vm_fault_t conversion that mixed up the error values"

* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix mount failure when qgroup rescan is in progress
  Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
  btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
2018-07-01 12:38:16 -07:00
Chris Mason 717beb96d9 Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

    ret2 = btrfs_delalloc_reserve_space() // returns zero on success

    lock_page(page)
    if (page->mapping != inode->i_mapping)
	goto out_unlock;

...

out_unlock:
    if (!ret2) {
	    ...
	    return VM_FAULT_LOCKED;
    }

This ends up triggering this WARNING in btrfs_destroy_inode()
    WARN_ON(BTRFS_I(inode)->block_rsv.size);

xfstests generic/095 was able to reliably reproduce the errors.

Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.

Fixes: a528a24150 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28 11:30:50 +02:00
Linus Torvalds 84bfed40fc for-4.18-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
 WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
 7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
 4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
 GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
 48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
 hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
 PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
 NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
 dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
 e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
 tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
 =fCdm
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two regression fixes and an incorrect error value propagation fix from
  'rename exchange'"

* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix return value on rename exchange failure
  btrfs: fix invalid-free in btrfs_extent_same
  Btrfs: fix physical offset reported by fiemap for inline extents
2018-06-26 08:41:54 -07:00
Filipe Manana c5b4a50b74 Btrfs: fix return value on rename exchange failure
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-22 12:59:08 +02:00
Linus Torvalds 7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Arnd Bergmann 15eefe2a99 Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64
Pull the timespec64 conversion from Deepa Dinamani:
 "The series aims to switch vfs timestamps to use
  struct timespec64. Currently vfs uses struct timespec,
  which is not y2038 safe.

  The flag patch applies cleanly. I've not seen the timestamps
  update logic change often. The series applies cleanly on 4.17-rc6
  and linux-next tip (top commit: next-20180517).

  I'm not sure how to merge this kind of a series with a flag patch.
  We are targeting 4.18 for this.
  Let me know if you have other suggestions.

  The series involves the following:
  1. Add vfs helper functions for supporting struct timepec64 timestamps.
  2. Cast prints of vfs timestamps to avoid warnings after the switch.
  3. Simplify code using vfs timestamps so that the actual
     replacement becomes easy.
  4. Convert vfs timestamps to use struct timespec64 using a script.
     This is a flag day patch.

  I've tried to keep the conversions with the script simple, to
  aid in the reviews. I've kept all the internal filesystem data
  structures and function signatures the same.

  Next steps:
  1. Convert APIs that can handle timespec64, instead of converting
     timestamps at the boundaries.
  2. Update internal data structures to avoid timestamp conversions."

I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:54:00 +02:00
Souptick Joarder a528a24150 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Use the new return type vm_fault_t for fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Reference commit 1c8f422059 ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduced inline function in 4.17-rc6.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07 17:27:45 +02:00
Deepa Dinamani 95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Omar Sandoval ad7e1a740d Btrfs: clean up error handling in btrfs_truncate()
btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 21:27:32 +02:00
Nikolay Borisov c5794e5178 btrfs: Factor out write portion of btrfs_get_blocks_direct
Now that the read side is extracted into its own function, do the same
to the write side. This leaves btrfs_get_blocks_direct_write with the
sole purpose of handling common locking required. Also flip the
condition in btrfs_get_blocks_direct_write so that the write case
comes first and we check for if (Create) rather than if (!create). This
is purely subjective but I believe makes reading a bit more "linear".
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:44 +02:00
Nikolay Borisov 1c8d0175df btrfs: Factor out read portion of btrfs_get_blocks_direct
Currently this function handles both the READ and WRITE dio cases. This
is facilitated by a bunch of 'if' statements, a goto short-circuit
statement and a very perverse aliasing of "!created"(READ) case
by setting lockstart = lockend and checking for lockstart < lockend for
detecting the write. Let's simplify this mess by extracting the
READ-only code into a separate __btrfs_get_block_direct_read function.
This is only the first step, the next one will be to factor out the
write side as well. The end goal will be to have the common locking/
unlocking code in btrfs_get_blocks_direct and then it will call either
the read|write subvariants. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:43 +02:00
Su Yue 090a127afa btrfs: return error value if create_io_em failed in cow_file_range
In cow_file_range(), create_io_em() may fail, but its return value is
not recorded.  Then return value may be 0 even it failed which is a
wrong behavior.

Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.

Fixes: 6f9994dbab ("Btrfs: create a helper to create em for IO")
CC: stable@vger.kernel.org # 4.11+
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:51:08 +02:00
Gu JinXiang c4c129db5d btrfs: drop unused parameter qgroup_reserved
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata.  So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow write.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien 78d4295b1e btrfs: lift some btrfs_cross_ref_exist checks in nocow path
In nocow path, we check if the extent is snapshotted in
btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
unnecessary search into extent tree.

A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:   unpatched     patched

1 fio round:     46670        46958
snapshot
2 fio round:     51826        54498
3 fio round:     59767        61289

After snapshot, the first fio get about 5% performance gain. As we
continually write to the same file, all writes will resume to nocow mode
and eventually we have no performance gain.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi d19577912d btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Omar Sandoval 399b0bbf5f Btrfs: reserve space for O_TMPFILE orphan item deletion
btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.

Fixes: ef3b9af50b ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:00 +02:00
Omar Sandoval a575ceeb13 Btrfs: get rid of unused orphan infrastructure
Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:57 +02:00
Omar Sandoval 27919067f1 Btrfs: fix ENOSPC caused by orphan items reservations
Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:54 +02:00
Omar Sandoval 4b9d7b59bf Btrfs: refactor btrfs_evict_inode() reserve refill dance
The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:52 +02:00
Omar Sandoval c08db7d8d2 Btrfs: don't return ino to ino cache if inode item removal fails
In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.

Fixes: 581bb05094 ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:51 +02:00
Omar Sandoval 05a5bd7c4d Btrfs: delete dead code in btrfs_orphan_commit_root()
btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e3 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:49 +02:00
Omar Sandoval 7b40b695b4 Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:48 +02:00
Omar Sandoval f7e9e8fc79 Btrfs: stop creating orphan items for truncate
Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:46 +02:00
Omar Sandoval 0552210997 Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.

Fixes: f4b9aa8d3b ("btrfs_truncate")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:45 +02:00
Omar Sandoval fd86a3a315 Btrfs: fix error handling in btrfs_truncate_inode_items()
btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:44 +02:00
Omar Sandoval d1342aadbd Btrfs: update stale comments referencing vmtruncate()
Commit a41ad394a0 ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.

Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:43 +02:00
David Sterba 7b6a221e5b btrfs: rename btrfs_update_iflags to reflect which flags it touches
The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode are i_flags.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:20 +02:00
Nikolay Borisov 20a6800402 btrfs: Unexport and rename btrfs_invalidate_inodes
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:10 +02:00
David Sterba 093258e6eb btrfs: replace waitqueue_actvie with cond_wake_up
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:09 +02:00
Al Viro 7a1b1e7028 btrfs: take the last remnants of ->d_fsdata use out
[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]

The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete().  Which,
incidentally, is all btrfs_dentry_delete() does.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:37 +02:00
Nikolay Borisov 7c8a0d363a btrfs: Add assert in __btrfs_del_delalloc_inode
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:32 +02:00
Nikolay Borisov 3a2f8c07e1 btrfs: Unexport btrfs_alloc_delalloc_work
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:29 +02:00
Nikolay Borisov 076da91cd9 btrfs: Remove delayed_iput member from btrfs_delalloc_work
When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:29 +02:00
Nikolay Borisov 4fbb514785 btrfs: Remove delay_iput parameter from __start_delalloc_inodes
It's always set to 0 so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 76f32e240e btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes
It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 82b3e53b8d btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 40012f96b6 btrfs: Remove btrfs_wait_and_free_delalloc_work
This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov 8ae225a8a4 btrfs: Remove tree argument from extent_writepages
It can be directly referenced from the passed address_space so do that.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov 81f1d39035 btrfs: Use list_empty instead of list_empty_careful
list_empty_careful usually is a signal of something tricky going on. Its
usage in btrfs is actually not needed since both lists it's used on are
local to a function and cannot be modified concurrently. So switch to
plain list_empty. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Nikolay Borisov 2a3ff0adc9 btrfs: Remove redundant tree argument from extent_readpages
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Nikolay Borisov 477a30ba5f btrfs: Sink extent_tree arguments in try_release_extent_mapping
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Misono Tomohiro a79a464d56 btrfs: Allow rmdir(2) to delete an empty subvolume
Change the behavior of rmdir(2) and allow it to delete an empty
subvolume by using btrfs_delete_subvolume() which is used by
btrfs_ioctl_snap_destroy().

This is a change in behaviour and has been requested by users. Deleting
the subvolume by ioctl requires root permissions while the rmdir way
does works with standard tools and syscalls for all users that can
access the subvolume.

The main usecase is to allow 'rm -rf /path/with/subvols' to simply work.
We were not able to find any nasty usability surprises, the intention is
to do the destructive rm. Without allowing rmdir, this would have to be
followed by the ioctl subvolume deletion, which is more of an annoyance.

Implementation details:

The required lock for @dir and inode of @dentry is already acquired in
vfs layer.

We need some check before deleting a subvolume. Permission check is done
in vfs layer, emptiness check is in btrfs_rmdir() and additional check
(i.e. neither the subvolume is a default subvolume nor send is in progress)
is in btrfs_delete_subvolume().

Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after
btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs
layer later.

Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
Misono Tomohiro f60a2364a4 btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy()
Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:

1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes

and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.

Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.

As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
Misono Tomohiro ec42f16734 btrfs: Move may_destroy_subvol() from ioctl.c to inode.c
This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
David Sterba f46b24c945 btrfs: use fs_info for btrfs_handle_em_exist tracepoint
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:17 +02:00
Nikolay Borisov b25f0d0012 btrfs: Use while loop instead of labels in __endio_write_update_ordered
Currently __endio_write_update_ordered uses labels to implement
what is essentially a simple while loop. This makes the code more
cumbersome to follow than it actually has to be. No functional
changes. No xfstest regressions were found during testing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:15 +02:00
Linus Torvalds d7b66b4ab0 for-4.17-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsG/ecACgkQxWXV+ddt
 WDvr3w/8D12pwR9sPcEwxD4pvoLv7LP1VRQy2u+ivSifdBD7MueKh3y0igUMyARR
 LERsK0zUsTQGkkC6c7ZYd4cT9PikPpXtO1P9iATFAKqR/YMDIV/haSqT8DwbI/qb
 7F+ZMeTy1LzL01YlYBrGVDxP8AWVO2Dml6JolYxzplILSLvdPH6G8xOSjei/p9sm
 RK5ERHJENEI0l/cThpiLoAEWjzciPtR39T5Hq45onHyCs3bjJCcx51/QE8sBsl8x
 +BKvCmL40UKd30YKudJZYDM6NgMgWENhfTtIZQIInv99sMNCxIgTEUdX8ExdyjRZ
 24rst/BuQz4d8r/8zqE/hdFsHRGWwnEiYmGWylanPY5KdQ41ULfXC06xuoNOLoW8
 KQwD8SWv+W5vEJW0UQz5cb3vUgv5RnUzPvcmMfSztLeo2K4zj6zCK5L6XJwIJNbM
 1AJR7R4TRkQdf5QEeziFl738Yv1AgsPQuKSiiFa9YwXMLU8dYXlx14ioUzBL8MLe
 1wZPJ03x/N7eKJ0g6OIAAVfUTFFejv4Z2B2IDoObuLLsPwTdK6tS+9tJ5mos7ngG
 Vf1ZVmhmeJdw1qwK8ROzAJHkK807KgGO7LWmA7tIVLwWuZX14F7xLQIg3Ux3MhIh
 NhoBTFy2AGmdE0hFYv/4FA5dnUOU4VTVYVw3QUV4DMc0XIodZrE=
 =iYyx
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A one-liner that prevents leaking an internal error value 1 out of the
  ftruncate syscall.

  This has been observed in practice. The steps to reproduce make a
  common pattern (open/write/fync/ftruncate) but also need the
  application to not check only for negative values and happens only for
  compressed inlined files.

  The conditions are narrow but as this could break userspace I think
  it's better to merge it now and not wait for the merge window"

* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix error handling in btrfs_truncate()
2018-05-24 11:47:43 -07:00
Omar Sandoval d50147381a Btrfs: fix error handling in btrfs_truncate()
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).

To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:

int main(void) {
        char buf[256] = { 0 };
        int ret;
        int fd;

        fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
        if (fd == -1) {
                perror("open");
                return EXIT_FAILURE;
        }

        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                close(fd);
                return EXIT_FAILURE;
        }

        if (fsync(fd) == -1) {
                perror("fsync");
                close(fd);
                return EXIT_FAILURE;
        }

        ret = ftruncate(fd, 128);
        if (ret) {
                printf("ftruncate() returned %d\n", ret);
                close(fd);
                return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
}

Fixes: ddfae63cc8 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-24 11:56:57 +02:00
Linus Torvalds 5997aab0a1 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes all over the place"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: fix io_destroy(2) vs. lookup_ioctx() race
  ext2: fix a block leak
  nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
  cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed
  unfuck sysfs_mount()
  kernfs: deal with kernfs_fill_super() failures
  cramfs: Fix IS_ENABLED typo
  befs_lookup(): use d_splice_alias()
  affs_lookup: switch to d_splice_alias()
  affs_lookup(): close a race with affs_remove_link()
  fix breakage caused by d_find_alias() semantics change
  fs: don't scan the inode cache before SB_BORN is set
  do d_instantiate/unlock_new_inode combinations safely
  iov_iter: fix memory leak in pipe_get_pages_alloc()
  iov_iter: fix return type of __pipe_get_pages()
2018-05-21 11:54:57 -07:00
Nikolay Borisov 2b87733134 btrfs: Split btrfs_del_delalloc_inode into 2 functions
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17 14:18:26 +02:00
Al Viro 1e2e547a93 do d_instantiate/unlock_new_inode combinations safely
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
	lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

	Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org	# 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:36:37 -04:00
David Sterba 92d3217084 btrfs: fix unaligned access in readdir
The last update to readdir introduced a temporary buffer to store the
emitted readdir data, but as there are file names of variable length,
there's a lot of unaligned access.

This was observed on a sparc64 machine:

  Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]

Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe <rene@exactcode.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-19 00:35:08 +02:00
David Sterba c1d7c514f7 btrfs: replace GPL boilerplate by SPDX -- sources
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 16:29:51 +02:00
Qu Wenruo 43b18595d6 btrfs: qgroup: Use separate meta reservation type for delalloc
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.

Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.

Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.

And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.

This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.

And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.

This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.

The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.

So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:14 +02:00
Omar Sandoval 0a0d4415e3 Btrfs: delete dead code in btrfs_orphan_add()
btrfs_orphan_add() has had this case commented out since it was first
introduced in commit d68fc57b7e ("Btrfs: Metadata reservation for
orphan inodes"). Most of the orphan cleanup code has been rewritten
since then, so it's safe to say that this code isn't needed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ switch to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:12 +02:00
Nikolay Borisov c79a70b133 btrfs: drop fs_info parameter from btrfs_run_delayed_refs
It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:41:10 +02:00
Nikolay Borisov 8535dc1967 btrfs: Remove unused root var from relink_file_extents
Added in 38c227d87c ("Btrfs: snapshot-aware defrag") but subsequently
made redundant by 0b246afa62 ("btrfs: root->fs_info cleanup, add
fs_info convenience variables").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:57 +02:00
David Sterba d0ee393493 btrfs: rename submit callbacks and drop double underscores
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:56 +02:00
David Sterba 6c55343587 btrfs: remove unused parameters from extent_submit_bio_done_t
Remove parameters not used by any of the callbacks.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:55 +02:00
David Sterba d0779291b1 btrfs: remove unused parameters from extent_submit_bio_start_t
Remove parameters not used by any of the callbacks.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:55 +02:00
David Sterba 051c98eb11 btrfs: open code trivial helper btrfs_page_exists_in_range
The called function name is self explanatory.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:50 +02:00
Matthew Wilcox 965aab1cfc btrfs: Use filemap_range_has_page()
The current implementation of btrfs_page_exists_in_range() gives the
wrong answer if the workingset code has stored a shadow entry in the
page cache.  The filemap_range_has_page() function does not have this
problem, and it's shared code, so use it instead.

eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:45 +02:00
Su Yue 005d67127f btrfs: adjust return values of btrfs_inode_by_name
Previously, btrfs_inode_by_name() returned 0 which left caller to check
objectid of location even location if the type was invalid.

Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
dir entry is found.  Removal of label out_err also simplifies the
function.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ drop unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:43 +02:00
Nikolay Borisov d02c0e2019 btrfs: Remove root argument from cow_file_range_inline
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Filipe Manana 213e8c5520 Btrfs: skip writeback of last page when truncating file to same size
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.

This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).

So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.

The following script was used to quickly measure the time a receive
operation takes:

 $ cat test_send.sh
 #!/bin/bash

 SRC_DEV=/dev/sdc
 DST_DEV=/dev/sdd
 SRC_MNT=/mnt/sdc
 DST_MNT=/mnt/sdd

 mkfs.btrfs -f $SRC_DEV >/dev/null
 mkfs.btrfs -f $DST_DEV >/dev/null
 mount $SRC_DEV $SRC_MNT
 mount $DST_DEV $DST_MNT

 echo "Creating source filesystem"
 for ((t = 0; t < 10; t++)); do
     (
         for ((i = 1; i <= 20000; i++)); do
             xfs_io -f -c "pwrite -S 0xab 0 5000" \
                $SRC_MNT/file_$i > /dev/null
         done
     ) &
     worker_pids[$t]=$!
 done
 wait ${worker_pids[@]}

 echo "Creating and sending snapshot"
 btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
 /usr/bin/time -f "send took %e seconds"    \
     btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
 /usr/bin/time -f "receive took %e seconds" \
     btrfs receive -f $SRC_MNT/send_file $DST_MNT

 umount $SRC_MNT
 umount $DST_MNT

The results for 5 runs were the following:

* Without this change

average receive time was 26.49 seconds
standard deviation of 2.53 seconds

* With this change

average receive time was 12.51 seconds
standard deviation of 0.32 seconds

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Nikolay Borisov de224b7c56 btrfs: Remove redundant memory barriers around dio_private error status
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:

    - RMW operations that have a return value are fully ordered;

    Fully ordered primitives are ordered against everything prior and
    everything subsequent. Therefore a fully ordered primitive is like
    having an smp_mb() before and an smp_mb() after the primitive.

Given this let's replace the extra memory barriers with comments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
David Sterba e67c718b5b btrfs: add more __cold annotations
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Liu Bo 5811375325 Btrfs: fix unexpected cow in run_delalloc_nocow
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.

In case of free space inode, this ends up with a warning in
cow_file_range().

The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.

With this, run_delalloc_nocow() bails out when errors occur at the two
places.

cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Nikolay Borisov 9678c54388 btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.

Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:

1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.

I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.

Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.

The modinfo confirms that now all the module dependencies are there:

before:
depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

after:
depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Nikolay Borisov c1c3fac2a9 btrfs: Remove btrfs_inode::delayed_iput_count
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Liu Bo 3d5addafd0 Btrfs: do not check inode's runtime flags under root->orphan_lock
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Anand Jain 566b1760b4 btrfs: use ASSERT to report logical error in cow_file_range()
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.

The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain 3752d22fce btrfs: cow_file_range() num_bytes and disk_num_bytes are same
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Linus Torvalds af8c081627 for-4.16-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqbuK8ACgkQxWXV+ddt
 WDsiVQ//eE8Axfw4qWOyHHjozoAIu9kifvOFQIhvKviLiXHJNrD/vBI6YwD1hyD1
 rbbLilMsEl1OD1Sq3AeOUMNSSl5qUFEB+CA8vg9GznFTNRobTkL+p96Zt5xlRDu3
 lsFFV93tED+dK4D/eSGP+xYbknA8hIk/2gWPkd6hpYKyh2QdsPBYDqCnaEXvd79P
 DIP/cAjIfzqQn0FTiZ9wbaES+LPO+NoZgQRC2w9McYQ5CEMc+oAChEmPJRwpPoKy
 YdhuwoULniRNHVnVOIVfi4w9jkSPSz7YIgLeRFli/WGBYGcKeHTMFkMa12KdpuUC
 JUclOogJ5ZMbFV3C0W8XEJ7Vb9ltIevrH8MgfKP/3BScuZzbJZ+n5KkH2Gf7vcpe
 w5cZGOsKDz+35fDCdTmsFpDK9kpGzHq48JlRifOjARbdyqNwVq4emxOeQlO1ygzq
 Y9H5UeMpp/FDAm6g/bV8ezCXzwuwUDV9CwAJBD+WCiZhD2nX85FfIp1kfF2zLcUg
 Y8irqV6A/J/0BFkF7Iu9AuzTxOdo6zMLkcHEKb+sL7yGxZ2248v5gZxDoyV5iNWR
 WY4Y2GaMemZGu6+NyyLFAJzlFyACD5fSy8vT7KD6upCzwmlAtaJ3VULfTrLyi/uM
 SNZAKf3WzKBVe5h52GNGRCi5OLTteH6Yqz5m5/h8r74mBO00IrA=
 =bF96
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - when NR_CPUS is large, a SRCU structure can significantly inflate
   size of the main filesystem structure that would not be possible to
   allocate by kmalloc, so the kvalloc fallback is used

 - improved error handling

 - fix endiannes when printing some filesystem attributes via sysfs,
   this is could happen when a filesystem is moved between different
   endianity hosts

 - send fixes: the NO_HOLE mode should not send a write operation for a
   file hole

 - fix log replay for for special files followed by file hardlinks

 - fix log replay failure after unlink and link combination

 - fix max chunk size calculation for DUP allocation

* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log replay failure after unlink and link combination
  Btrfs: fix log replay failure after linking special file and fsync
  Btrfs: send, fix issuing write op when processing hole in no data mode
  btrfs: use proper endianness accessors for super_copy
  btrfs: alloc_chunk: fix DUP stripe size handling
  btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
  btrfs: handle failure of add_pending_csums
  btrfs: use kvzalloc to allocate btrfs_fs_info
2018-03-04 11:04:27 -08:00
Nikolay Borisov ac01f26a27 btrfs: handle failure of add_pending_csums
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:00 +01:00
Linus Torvalds da370f1d63 for-4.16-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqG8poACgkQxWXV+ddt
 WDuHSA//eC+69XpHwohI6pcPQ7Jbr9UCj1L/Gt0U96YSzijGW4Hv3OQEWLIRBu4c
 nZbzQYtUunpguLYwfXgUUgXRHBTo2Y5bXZNmF2MtL7JcPOLhLh4h/IcGY7eRd2Vq
 qvv2bqr3yAcQo7s6z5U/D8ulohzHQTxG7Jaq/BkVxQqhvu+vdu/9T8ikAWnmSTjw
 lONu8soR5QO7tewxz23Cguw/t1bWe1aMXG9Ykd4avyhQHtgzNE+l82i4DYUhK2CM
 x8M5/CxnDLPe73IJuA2INCUtpPvR4Qufi5Nz6EN3BrJNCGBkmg18sPIvWlH6LsVh
 bsm4Lwz/piq+hkDq2GG+Z79uiGAfCVUWAsnm7yYHwpVyMvwHKlfrcVSAuRZixw5E
 /NZ0JEkEOtvzpv4inZFYbAgD+oKfvYvwj9BW5BXfu2aH6hJBImfAeMSd1aHB3uZI
 kGgy52k2v2P3WKQOFUbmW417P05DvvGmRvRmU+tSFpB+lXAZqRzoiVIuFm0xwhf1
 1SmnYgnSYzPmzIRXAMsSYQeK/8NXDdMZMutaw/AYwX+QBEdIAErf6MWcjI6XZRyG
 g8Gr8JcpwSa+H5/LKN5uswfXxfSAsqVHnZhbOVrjyGX0wyR4KJg3ag3KsHd9SCxb
 LDEjPSYEDn9yfmw6pK2Q6J26FGYiKpuUXaNiYVNymGe6162IiBM=
 =VeA/
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have a few assorted fixes, some of them show up during fstests so I
  gave them more testing"

* tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
  Btrfs: fix null pointer dereference when replacing missing device
  btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
  btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
  Btrfs: fix unexpected -EEXIST when creating new inode
  Btrfs: fix use-after-free on root->orphan_block_rsv
  Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
  Btrfs: fix extent state leak from tree log
  Btrfs: fix crash due to not cleaning up tree log block's dirty bits
  Btrfs: fix deadlock in run_delalloc_nocow
2018-02-16 09:26:18 -08:00
Liu Bo 1a932ef4e4 Btrfs: fix use-after-free on root->orphan_block_rsv
I got these from running generic/475,

WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs]
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs]
Call Trace:
  btrfs_orphan_release_metadata+0x9f/0x200 [btrfs]
  btrfs_orphan_del+0x10d/0x170 [btrfs]
  btrfs_setattr+0x500/0x640 [btrfs]
  notify_change+0x7ae/0x870
  do_truncate+0xca/0x130
  vfs_truncate+0x2ee/0x3d0
  do_sys_truncate+0xaf/0xf0
  SyS_truncate+0xe/0x10
  entry_SYSCALL_64_fastpath+0x1f/0x96

The race is between btrfs_orphan_commit_root and btrfs_orphan_del,
        t1                                        t2
btrfs_orphan_commit_root                     btrfs_orphan_del
   spin_lock
   check (&root->orphan_inodes)
   root->orphan_block_rsv = NULL;
   spin_unlock
                                             atomic_dec(&root->orphan_inodes);
                                             access root->orphan_block_rsv

Accessing root->orphan_block_rsv must be done before decreasing
root->orphan_inodes.

cc: <stable@vger.kernel.org> v3.12+
Fixes: 703c88e035 ("Btrfs: fix tracking of orphan inode count")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:40 +01:00
Liu Bo e8f1bc1493 Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
This regression is introduced in
commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction").

There are two problems,

a) it is ->destroy_inode() that does the final free on inode, not
   ->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.

This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().

Fixes: commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:35 +01:00
Liu Bo e89166990f Btrfs: fix deadlock in run_delalloc_nocow
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.

cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-02 16:24:19 +01:00
Linus Torvalds 31466f3ed7 for-4.16-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
 WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
 Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
 NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
 wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
 wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
 KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
 8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
 DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
 Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
 uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
 axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
 =Irul
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features or user visible changes:

   - fallocate: implement zero range mode

   - avoid losing data raid profile when deleting a device

   - tree item checker: more checks for directory items and xattrs

  Notable fixes:

   - raid56 recovery: don't use cached stripes, that could be
     potentially changed and a later RMW or recovery would lead to
     corruptions or failures

   - let raid56 try harder to rebuild damaged data, reading from all
     stripes if necessary

   - fix scrub to repair raid56 in a similar way as in the case above

  Other:

   - cleanups: device freeing, removed some call indirections, redundant
     bio_put/_get, unused parameters, refactorings and renames

   - RCU list traversal fixups

   - simplify mount callchain, remove recursing back when mounting a
     subvolume

   - plug for fsync, may improve bio merging on multiple devices

   - compression heurisic: replace heap sort with radix sort, gains some
     performance

   - add extent map selftests, buffered write vs dio"

* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
  btrfs: drop devid as device_list_add() arg
  btrfs: get device pointer from device_list_add()
  btrfs: set the total_devices in device_list_add()
  btrfs: move pr_info into device_list_add
  btrfs: make btrfs_free_stale_devices() to match the path
  btrfs: rename btrfs_free_stale_devices() arg to skip_dev
  btrfs: make btrfs_free_stale_devices() argument optional
  btrfs: make btrfs_free_stale_device() to iterate all stales
  btrfs: no need to check for btrfs_fs_devices::seeding
  btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
  Btrfs: noinline merge_extent_mapping
  Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: add extent map selftests
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add helper for em merge logic
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: fix incorrect block_len in merge_extent_mapping
  btrfs: Remove unused readahead spinlock
  ...
2018-01-29 14:04:23 -08:00
Linus Torvalds a4b7fd7d34 inode->i_version rework for v4.16
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJabwjlAAoJEAAOaEEZVoIVeEEP/R84kZJjlZV/vNmFFvY46jM+
 0hpMHXRNym+nW1Du1CKNkesEUAY8ACAQIyzJh63Q72341QTDdz3+asHwPYRNOqdC
 PgryidPieojkNKQg+h7dmoKYlYh1xiCicvn66Q5PFb9B0lH36twekOK4X1qqJj8Z
 breRmRoFLka9looMSuYgwbErts023fmASalvGum6T0ZM/7F9hUj4O3OsQtKTLUNM
 VQ+gLJTQrUqrgzvWUwq3WTMa9YAaKP4oad8nsglNSpiVLG7WtURr5HokW9hAziqL
 k99Y+K2ni1wZJlNGJAyV7PyEG2ieI5Xn+LzM2RM+SndD1QHF2QXACmSTDYfL51k5
 G2RsKeTZvQPtX4qx9+vnCp/4oV6JduvCaq2Mt8SQb9nYZxKjs85TNLrARJv+85eQ
 zP0OTxlH1Gfu3j36n3cny4XemyMYYF4hCFYfRPqTGst37fgLBtfIfUSQ6jedoCK2
 Xcyb6ukGXMh6If/A7DSy91hvSSPrWSH7TPPsbfLy6o+wUOtpAGR4eXVlEuAiXrzc
 gnoAz85oIMUQae66LrdrPk1NyE59qOb24g/yU5gyRBSpi2+/aoboNCKaD73tgs/C
 XIMwGXLYmqkcud7IBQF0tHHiM+jsEkbSM4LUqRXSnqMdwNnS18Z4Q+JKqpdP0cii
 eRdenDvUfu8Gu1Y9vWBv
 =iihN
 -----END PGP SIGNATURE-----

Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull inode->i_version rework from Jeff Layton:
 "This pile of patches is a rework of the inode->i_version field. We
  have traditionally incremented that field on every inode data or
  metadata change. Typically this increment needs to be logged on disk
  even when nothing else has changed, which is rather expensive.

  It turns out though that none of the consumers of that field actually
  require this behavior. The only real requirement for all of them is
  that it be different iff the inode has changed since the last time the
  field was checked.

  Given that, we can optimize away most of the i_version increments and
  avoid dirtying inode metadata when the only change is to the i_version
  and no one is querying it. Queries of the i_version field are rather
  rare, so we can help write performance under many common workloads.

  This patch series converts existing accesses of the i_version field to
  a new API, and then converts all of the in-kernel filesystems to use
  it. The last patch in the series then converts the backend
  implementation to a scheme that optimizes away a large portion of the
  metadata updates when no one is looking at it.

  In my own testing this series significantly helps performance with
  small I/O sizes. I also got this email for Christmas this year from
  the kernel test robot (a 244% r/w bandwidth improvement with XFS over
  DAX, with 4k writes):

    https://lkml.org/lkml/2017/12/25/8

  A few of the earlier patches in this pile are also flowing to you via
  other trees (mm, integrity, and nfsd trees in particular)".

* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
  fs: handle inode->i_version more efficiently
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  fs: only set S_VERSION when updating times if necessary
  IMA: switch IMA over to new i_version API
  xfs: convert to new i_version API
  ufs: use new i_version API
  ocfs2: convert to new i_version API
  nfsd: convert to new i_version API
  nfs: convert to new i_version API
  ext4: convert to new i_version API
  ext2: convert to new i_version API
  exofs: switch to new i_version API
  btrfs: convert to new i_version API
  afs: convert to new i_version API
  affs: convert to new i_version API
  fat: convert to new i_version API
  fs: don't take the i_lock in inode_inc_iversion
  fs: new API for handling inode->i_version
  ntfs: remove i_version handling
  ...
2018-01-29 13:33:53 -08:00
Jeff Layton 3a8c7231d5 btrfs: only dirty the inode in btrfs_update_time if something was changed
At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2018-01-29 06:42:21 -05:00
Jeff Layton c7f88c4e78 btrfs: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
2018-01-29 06:42:20 -05:00
Jeff Layton ae5e165d85 fs: new API for handling inode->i_version
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-29 06:41:30 -05:00
Nikolay Borisov b03ebd992f btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
No functional changes, just makes the code more readable

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo c04e61b5e4 Btrfs: move extent map specific code to extent_map.c
These helpers are extent map specific, move them to extent_map.c.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:22 +01:00
Liu Bo 7b4df058b0 Btrfs: add helper for em merge logic
This is a prepare work for the following extent map selftest, which
runs tests against em merge logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo 18e83ac75b Btrfs: fix unexpected EEXIST from btrfs_get_extent
This fixes a corner case that is caused by a race of dio write vs dio
read/write.

Here is how the race could happen.

Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).

t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).

------------------------------------------------------
             t1                                t2
      btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
       -> btrfs_get_extent()              -> btrfs_get_extent()
           -> lookup_extent_mapping()
           -> add_extent_mapping()            -> lookup_extent_mapping()
              # load [0, 32K)
       -> btrfs_new_extent_direct()
           -> btrfs_drop_extent_cache()
              # split [0, 32K) and
	      # drop [8K, 32K)
           -> add_extent_mapping()
              # add [8K, 32K)
                                              -> add_extent_mapping()
                                                 # handle -EEXIST when adding
                                                 # [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:

a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),

b) search_extent_mapping() then returns [0, 8k) as the existing em,
   even though start == existing->start, em is [0, 32k) so that
   extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,

c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
   (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,

d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
   which is confusing applications.

Here I conclude all the possible situations,
1) start < existing->start

            +-----------+em+-----------+
+--prev---+ |     +-------------+      |
|         | |     |             |      |
+---------+ +     +---+existing++      ++
                +
                |
                +
             start

2) start == existing->start

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
            |
            |
            +
         start

3) start > existing->start && start < (existing->start + existing->len)

      +------------em------------+
      |     +-------------+      |
      |     |             |      |
      +     +----existing-+      +
               |
               |
               +
             start

4) start >= (existing->start + existing->len)

+-----------+em+-----------+
|     +-------------+      | +--next---+
|     |             |      | |         |
+     +---+existing++      + +---------+
                      +
                      |
                      +
                   start

As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).

Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Liu Bo a520a7e0b5 Btrfs: fix incorrect block_len in merge_extent_mapping
%block_len could be checked on deciding if two em are mergeable.

merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.

For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals to em->len, hence this sets em->block_len with em->len.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
Su Yue df6703e15c btrfs: correct wrong comment about magic number of index_cnt
There is no function named btrfs_get_inode_index_count.
Explanation for magic number index_cnt=2 in btrfs_new_inode() is
actually located in btrfs_set_inode_index_count().

So replace 'btrfs_get_inode_index_count' in the comment by
'btrfs_set_inode_index_count'.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:21 +01:00
David Sterba e43bbe5e16 btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov 36f7894f66 btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio
The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov ea057f6daf btrfs: Remove redundant bio_get/set from submit_dio_repair_bio
The bio that is passsed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov 0a9b0e5351 btrfs: sink extent_write_full_page tree argument
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov 5e3ee23648 btrfs: sink extent_write_locked_range tree parameter
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov f3038ee3a3 btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker
This function was introduced by 247e743cbe ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in ENOMEM
situation, yet it's not handled, this could lead to inconsistent state.
So let's handle the failure by setting the mapping error bit.

Cc: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov bf8d32b9b3 btrfs: remove redundant check in btrfs_get_extent_fiemap
Before returning hole_em in btrfs_get_fiemap_extent we check if it's different
than null. However, by the time this null check is triggered we already know
hole_em is not null because it means it points to the em we found and it
has already been dereferenced.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Nikolay Borisov 5c9a702ed1 btrfs: Remove unused variable in btrfs_get_extent
trans was statically assigned to NULL and this never changed over the
course of btrfs_get_extent. So remove any code which checks whether
trans != NULL and just hardcode the fact trans is always NULL.

Resolves-coverity-id: 112806
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Timofey Titovets e9679de3fd Btrfs: compress_file_range() change page dirty status once
We need to call extent_range_clear_dirty_for_io()
on compression range to prevent application from changing
page content, while pages compressing.

extent_range_clear_dirty_for_io() runs on each loop iteration,
"(end - start)" can be much (up to 1024 times) bigger
then compression range (BTRFS_MAX_UNCOMPRESSED).

The start pointer is advanced each time we manage to compress part of
the range. The end pointer does not change so we could redirty the
remaining parts repeatedly.

Fix that behaviour by call extent_range_clear_dirty_for_io()
only once, the first time it happens.

This is the safest but probably not the best behaviour. Previous
iterations of the patch tried to redirty only the range that we were not
able to compress. This has been refused by David for safety reasons, the
writeout callchain is complex and there could be some path that relies
on redirtying the entire unwritten range.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog, the history and safety concerns, add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Nikolay Borisov 4a2d25cd93 btrfs: Remove redundant FLAG_VACANCY
Commit 9036c10208 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent hole - ->block_start =
EXTENT_MAP_HOLE. And also the only place where this flag is checked is
in the fiemap code, but the block_start value is also checked and every
other place in the filesystem detects holes by using block_start
value's. So remove the extra flag. This survived a full xfstest run.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba 7c2871a2f4 btrfs: switch btrfs_trans_handle::adding_csums to bool
The semantics of adding_csums matches bool, 'short' was most likely used
to save space in a698d0755a ("Btrfs: add a type field for the
transaction handle").

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
Edmund Nadolski bf46f52db9 btrfs: remove dead code from btrfs_get_extent
Due to new_inline logic, the create == 0 is always true at this
point in the code, so the create != 0 branch can be removed.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:14 +01:00
David Sterba 0932584b66 btrfs: sink get_extent parameter to extent_readpages
There's only one caller that passes btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba 2135fb9bb4 btrfs: sink get_extent parameter to extent_fiemap
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba deac642d7e btrfs: sink get_extent parameter to extent_write_full_page
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba 916b929831 btrfs: sink get_extent parameter to extent_write_locked_range
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba 433175992c btrfs: sink get_extent parameter to extent_writepages
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
Qu Wenruo bae15d95e2 btrfs: Cleanup existing name_len checks
Since tree-checker has verified leaf when reading from disk, we don't
need the existing verify_dir_item() or btrfs_is_name_len_valid() checks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba 712e36c5f2 btrfs: use GFP_KERNEL in btrfs_alloc_inode
This callback is called directly from VFS, no locks are held at the
allocation time.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba f08dc36f78 btrfs: sink gfp parameter to clear_extent_uptodate
There's only one callsite with GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba ae0f162534 btrfs: sink gfp parameter to clear_extent_bit
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
Liu Bo 4c274bc67b Btrfs: document rules about bio async submit
These rules have been hidden in several if-else and are not
straightforward to follow, for example, dio submit hook's nocsum case
has a bug , i.e. doing async submit instead of sync submit, which has
been fixed recently.

This is documenting the rules for reference.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Liu Bo f5c29bd9db Btrfs: add __init macro to btrfs init functions
Adding __init macro gives kernel a hint that this function is only used
during the initialization phase and its memory resources can be freed up
after.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:11 +01:00
Liu Bo 1805f2ca3f Btrfs: remove redundant btrfs_balance_delayed_items
In functions like btrfs_create(), we run both
btrfs_balance_delayed_items() and btrfs_btree_balance_dirty() after
the operation, but btrfs_btree_balance_dirty() is surely going to run
btrfs_balance_delayed_items().

This keeps only btrfs_btree_balance_dirty().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:10 +01:00
Ming Lei c16a8ac3c0 btrfs: avoid accessing bvec table directly for a cloned bio
Commit 17347cec15f919901c90(Btrfs: change how we iterate bios in endio)
mentioned that for dio the submitted bio may be fast cloned, we
can't access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Acked: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei 263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Justin Maggard b430b77512 btrfs: Fix quota reservation leak on preallocated files
Commit c6887cd111 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.

If we have quotas enabled, the data space reservation also includes a
quota reservation.  But in the rewrite case, the space has already been
accounted for in qgroups.  So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data.  So we're
left with both the qgroup and qgroup reservation accounting for the same
space.

This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.

Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:28:12 +01:00
Filipe Manana e3b8a48585 Btrfs: fix reported number of inode blocks after buffered append writes
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:

  $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
  $ du -h foobar
  0	foobar
  $ sync
  $ du -h foobar
  64K	foobar

The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.

This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().

This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:

  $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt

Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:46 +01:00
Liu Bo 56a0e706fc Btrfs: bail out gracefully rather than BUG_ON
If a file's DIR_ITEM key is invalid (due to memory errors) and gets
written to disk, a future lookup_path can end up with kernel panic due
to BUG_ON().

This gets rid of the BUG_ON(), meanwhile output the corrupted key and
return ENOENT if it's invalid.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Guillaume Bouchard <bouchard@mercs-eng.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:47:01 +01:00
Liu Bo f82b735936 Btrfs: add write_flags for compression bio
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.

It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.

This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:31 +01:00
Josef Bacik ddfae63cc8 btrfs: move btrfs_truncate_block out of trans handle
Since we do a delalloc reserve in btrfs_truncate_block we can deadlock
with freeze.  If somebody else is trying to allocate metadata for this
inode and it gets stuck in start_delalloc_inodes because of freeze we
will deadlock.  Be safe and move this outside of a trans handle.  This
also has a side-effect of making sure that we're not leaving stale data
behind in the other_encoding or encryption case.  Not an issue now since
nobody uses it, but it would be a problem in the future.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik 69fe2d75dd btrfs: make the delalloc block rsv per inode
The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years.  There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.

Fix this by making the delalloc block rsv per-inode.  This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Josef Bacik 8b62f87bad Btrfs: rework outstanding_extents
Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent.  This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits.  This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.

Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself.  This
means we'll have something like this

btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
 btrfs_set_extent_delalloc	- outstanding_extents = 2
btrfs_release_delalloc_extents	- outstanding_extents = 1

for an initial file write.  Now take the append write where we extend an
existing delalloc range but still under the maximum extent size

btrfs_delalloc_reserve_metadata - outstanding_extents = 2
  btrfs_set_extent_delalloc
    btrfs_set_bit_hook		- outstanding_extents = 3
    btrfs_merge_extent_hook	- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extnets = 1

In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with

btrfs_add_ordered_extent	- outstanding_extents = 2
clear_extent_bit		- outstanding_extents = 1
btrfs_remove_ordered_extent	- outstanding_extents = 0

This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.

The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be.  This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write.  This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost.  I have another change coming to mitigate this side-effect
somewhat.

I also added trace points for the counter manipulation.  These were used
by a bpf script I wrote to help track down leak issues.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:35 +01:00
Zygo Blaxell c995ab3cda btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a, b),
userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
David Sterba f51d2b5912 btrfs: allow to set compression level for zlib
Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compess=zlib                 # default
$ mount -o compess=zlib0                # same
$ mount -o compess=zlib9                # level 9, slower sync, less data
$ mount -o compess=zlib1                # level 1, faster sync, more data
$ mount -o remount,compress=zlib3	# level set by remount

The compress-force works the same as compress'.  The level is visible in
the same format in /proc/mounts. Level set via file property does not
work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:29 +01:00
Anand Jain 2dbe0c7718 btrfs: use BLK_STS defines where needed
At few places we could use BLK_STS_OK and BLK_STS_NOSUPP.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Satoru Taekeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ dropped first hunk btrfs_endio_direct_read ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Josef Bacik 84f7d8e624 btrfs: pass root to various extent ref mod functions
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead.  This will be used in
a subsequent patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Liu Bo 736cd52e0c Btrfs: remove nr_async_submits and async_submit_draining
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.

Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.

However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Timofey Titovets 1170862d78 Btrfs: compress_file_range remove dead variable num_bytes
Remove dead assigment of num_bytes.

Also as num_bytes only used in the will_compress block as copy of
total_in just replace that with total_in and drop num_bytes entirely.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Thomas Meyer 897ca8194c btrfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Timofey Titovets 6018ba0a0e Btrfs: cleanup 'start' subtraction from try uncompressed inline extent
Was added in:
  c8b978188c
  "Btrfs: Add zlib compression support"
Survive to near time (from 08.10.2008).

Because 'start' checked for zero before branch, so it's safe to remove
that subtraction.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Nikolay Borisov 8c70c9f81e btrfs: Remove unused parameter from check_direct_IO
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Liu Bo 9b4a9b283d Btrfs: do not async submit for nodatasum inodes
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because

a) we don't have to calculate checksum across processors,

b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
that IOs get much less merges in general.

Lets use sync submit for nodatasum inodes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Linus Torvalds 5ba88cd6e9 Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've collected a bunch of isolated fixes, for crashes, user-visible
  behaviour or missing bits from other subsystem cleanups from the past.

  The overall number is not small but I was not able to make it
  significantly smaller. Most of the patches are supposed to go to
  stable"

* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: log csums for all modified extents
  Btrfs: fix unexpected result when dio reading corrupted blocks
  btrfs: Report error on removing qgroup if del_qgroup_item fails
  Btrfs: skip checksum when reading compressed data if some IO have failed
  Btrfs: fix kernel oops while reading compressed data
  Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
  Btrfs: do not backup tree roots when fsync
  btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
  btrfs: propagate error to btrfs_cmp_data_prepare caller
  btrfs: prevent to set invalid default subvolid
  Btrfs: send: fix error number for unknown inode types
  btrfs: fix NULL pointer dereference from free_reloc_roots()
  btrfs: finish ordered extent cleaning if no progress is found
  btrfs: clear ordered flag on cleaning up ordered extents
  Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
  Btrfs: do not reset bio->bi_ops while writing bio
  Btrfs: use the new helper wbc_to_write_flags
2017-09-29 12:57:35 -07:00
Liu Bo 99c4e3b96c Btrfs: fix unexpected result when dio reading corrupted blocks
commit 4246a0b63b ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.

For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to relfect that.

When some blocks in a file have been corrupted on disk, reading such a
file ends up with

1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.

In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:07 +02:00
Naohiro Aota 67c003f90f btrfs: finish ordered extent cleaning if no progress is found
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.

Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:49:06 +02:00
Naohiro Aota 63d71450c8 btrfs: clear ordered flag on cleaning up ordered extents
Commit 524272607e ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs  pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping:          (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag same as other places calling
btrfs_dec_test_ordered_pending() for every page in the specified range.

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:49:00 +02:00
Linus Torvalds 0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds 66ba772ee3 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The changes range through all types: cleanups, core chagnes, sanity
  checks, fixes, other user visible changes, detailed list below:

   - deprecated: user transaction ioctl

   - mount option ssd does not change allocation alignments

   - degraded read-write mount is allowed if all the raid profile
     constraints are met, now based on more accurate check

   - defrag: do not reset compression afterwards; the NOCOMPRESS flag
     can be now overriden by defrag

   - prep work for better extent reference tracking (related to the
     qgroup slowness with balance)

   - prep work for compression heuristics

   - memory allocation reductions (may help latencies on a loaded
     system)

   - better accounting for io waiting states

   - error handling improvements (removed BUGs)

   - added more sanity checks for shared refs

   - fix readdir vs pagefault deadlock under some circumstances

   - fix for 'no-hole' mode, certain combination of compressed and
     inline extents

   - send: fix emission of invalid clone operations

   - fixup file mode if setting acls fail

   - more fixes from fuzzing

   - oher cleanups"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
  btrfs: submit superblock io with REQ_META and REQ_PRIO
  btrfs: remove unnecessary memory barrier in btrfs_direct_IO
  btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
  btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
  btrfs: pass fs_info to btrfs_del_root instead of tree_root
  Btrfs: add one more sanity check for shared ref type
  Btrfs: remove BUG_ON in __add_tree_block
  Btrfs: remove BUG() in add_data_reference
  Btrfs: remove BUG() in print_extent_item
  Btrfs: remove BUG() in btrfs_extent_inline_ref_size
  Btrfs: convert to use btrfs_get_extent_inline_ref_type
  Btrfs: add a helper to retrive extent inline ref type
  btrfs: scrub: simplify scrub worker initialization
  btrfs: scrub: clean up division in scrub_find_csum
  btrfs: scrub: clean up division in __scrub_mark_bitmap
  btrfs: scrub: use bool for flush_all_writes
  btrfs: preserve i_mode if __btrfs_set_acl() fails
  btrfs: Remove extraneous chunk_objectid variable
  btrfs: Remove chunk_objectid argument from btrfs_make_block_group
  btrfs: Remove extra parentheses from condition in copy_items()
  ...
2017-09-09 13:27:51 -07:00
Omar Sandoval 58efbc9f54 Btrfs: fix blk_status_t/errno confusion
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-24 17:19:02 +02:00
Nikolay Borisov dc59215d4f btrfs: remove unnecessary memory barrier in btrfs_direct_IO
Commit 38851cc19a ("Btrfs: implement unlocked dio write") implemented
unlocked dio write, allowing multiple dio writers to write to
non-overlapping, and non-eof-extending regions. In doing so it also
introduced a broken memory barrier. It is broken due to 2 things:

1. Memory barriers _MUST_ always be paired, this is clearly not the case
   here

2. Checkpatch actually produces a warning if a memory barrier is
   introduced that doesn't have a comment explaining how it's being
   paired.

Specifically for inode::i_dio_count that's wrapped inside
inode_dio_begin, there is no explicit barrier semantics attached, so
removing is fine as the atomic is used in common the waiter/wakeup
pattern.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:49:21 +02:00
Nikolay Borisov e6961cac73 btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Filipe Manana 4a4b964f42 Btrfs: avoid unnecessarily locking inode when clearing a range
If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Josef Bacik 23b5ec7494 btrfs: fix readdir deadlock with pagefault
Readdir does dir_emit while under the btree lock.  dir_emit can trigger
the page fault which means we can deadlock.  Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.

Thread A
readdir  <holding tree lock>
  dir_emit
    <page fault>
      down_read(mmap_sem)

Thread B
mmap write
  down_write(mmap_sem)
    page_mkwrite
      wait_ordered_extents

Process C
finish_ordered_extent
  insert_reserved_file_extent
   try to lock leaf <hang>

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba 1e20d1c45f btrfs: allow defrag compress to override NOCOMPRESS attribute
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a
given file, except when the mount is force-compress. As users have
reported on IRC, this will also prevent compression when requested by
defrag (btrfs fi defrag -c file).

The nocompress flag is set automatically by filesystem when the ratios
are bad and the user would have to manually drop the bit in order to
make defrag -c work. This is not good from the usability perspective.

This patch will raise priority for the defrag -c over nocompress, ie.
any file with NOCOMPRESS bit set will get defragmented. The bit will
remain untouched.

Alternate option was to also drop the nocompress bit and keep the
decision logic as is, but I think this is not the right solution.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba eec63c65dc btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and
property. Previously, a single variable was used and this caused clashes
when the per-file 'compression' was set and a defrag -c was called.

The property-compression is loaded when the file is open, defrag will
overwrite the same variable and reset to 0 (ie. NONE) at when the file
defragmentaion is finished. That's considered a usability bug.

Now we won't touch the property value, use the defrag-compression. The
precedence of defrag is higher than for property (and whole-filesystem).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba b52aa8c93e btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag
and set via properties. This will fix a usability bug when defrag will
reset compression type to NONE. If the file has compression set via
property, it will not apply anymore (until next mount or reset through
command line).

We're going to fix that by adding another variable just for the defrag
call and won't touch the property. The defrag will have higher priority
when deciding whether to compress the data.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Timofey Titovets c2fcdcdf36 Btrfs: add skeleton code for compression heuristic
Add skeleton code for compresison heuristics. Now it iterates over all
the pages, but in the end always says "yes, compress please", ie it does
not change the current behaviour.

In the future we're going to add various heuristics to analyze the data.
This patch can be used as a baseline for measuring if the effectivness
and performance.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhanced changelog, modified comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba 9c17f6cda1 btrfs: account that we're waiting for DIO read
Correctly account for IO when waiting for a submitted DIO read, the case
when we're retrying.  This only for the accounting purposes and should
not change other behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba ea14b57fd1 btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba f1c77c55cd btrfs: cleanup types storing REQ_*
Unify types of local variables and parameters that store various
REQ_* values to unsigned int.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
Anand Jain 19aee8dea3 btrfs: btrfs_inherit_iflags() can be static
btrfs_new_inode() is the only consumer move it to inode.c,
from ioctl.c.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
David Sterba 913e153572 btrfs: drop newlines from strings when using btrfs_* helpers
The helpers append "\n" so we can keep the actual strings shorter. The
extra newline will print an empty line.  Some messages have been
slightly modified to be more consistent with the rest (lowercase first
letter).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Liu Bo 0d1e0bead6 Btrfs: report errors when checksum is not found
When btrfs fails the checksum check, it'll fill the whole page with
"1".

However, if %csum_expected is 0 (which means there is no checksum), then
for some unknown reason, we just pretend that the read is correct, so
userspace would be confused about the dilemma that read is successful but
getting a page with all content being "1".

This can happen due to a bug in btrfs-convert.

This fixes it by always returning errors if checksum doesn't match.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
David Howells bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds bc243704fb Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've identified and fixed a silent corruption (introduced by code in
  the first pull), a fixup after the blk_status_t merge and two fixes to
  incremental send that Filipe has been hunting for some time"

* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix unexpected return value of bio_readpage_error
  btrfs: btrfs_create_repair_bio never fails, skip error handling
  btrfs: cloned bios must not be iterated by bio_for_each_segment_all
  Btrfs: fix write corruption due to bio cloning on raid5/6
  Btrfs: incremental send, fix invalid memory access
  Btrfs: incremental send, fix invalid path for link commands
2017-07-14 22:55:52 -07:00
David Sterba e8f5b395d5 btrfs: btrfs_create_repair_bio never fails, skip error handling
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.

Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:08 +02:00
David Sterba c09abff87f btrfs: cloned bios must not be iterated by bio_for_each_segment_all
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration.  Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions.  This patch adds assertions to all remaining
bio_for_each_segment_all cases.

[1] https://patchwork.kernel.org/patch/9838535/

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:39:31 +02:00
Linus Torvalds a4c20b9a57 Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "These are the percpu changes for the v4.13-rc1 merge window. There are
  a couple visibility related changes - tracepoints and allocator stats
  through debugfs, along with __ro_after_init markings and a cosmetic
  rename in percpu_counter.

  Please note that the simple O(#elements_in_the_chunk) area allocator
  used by percpu allocator is again showing scalability issues,
  primarily with bpf allocating and freeing large number of counters.
  Dennis is working on the replacement allocator and the percpu
  allocator will be seeing increased churns in the coming cycles"

* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix static checker warnings in pcpu_destroy_chunk
  percpu: fix early calls for spinlock in pcpu_stats
  percpu: resolve err may not be initialized in pcpu_alloc
  percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
  percpu: add tracepoint support for percpu memory
  percpu: expose statistics about percpu memory via debugfs
  percpu: migrate percpu data structures to internal header
  percpu: add missing lockdep_assert_held to func pcpu_free_area
  mark most percpu globals as __ro_after_init
2017-07-06 08:59:41 -07:00
Linus Torvalds 8c27cb3566 Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The core updates improve error handling (mostly related to bios), with
  the usual incremental work on the GFP_NOFS (mis)use removal,
  refactoring or cleanups. Except the two top patches, all have been in
  for-next for an extensive amount of time.

  User visible changes:

   - statx support

   - quota override tunable

   - improved compression thresholds

   - obsoleted mount option alloc_start

  Core updates:

   - bio-related updates:
       - faster bio cloning
       - no allocation failures
       - preallocated flush bios

   - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

   - prep work for btree_inode removal

   - dir-item validation

   - qgoup fixes and updates

   - cleanups:
       - removed unused struct members, unused code, refactoring
       - argument refactoring (fs_info/root, caller -> callee sink)
       - SEARCH_TREE ioctl docs"

* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: Remove false alert when fiemap range is smaller than on-disk extent
  btrfs: Don't clear SGID when inheriting ACLs
  btrfs: fix integer overflow in calc_reclaim_items_nr
  btrfs: scrub: fix target device intialization while setting up scrub context
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
  btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Add quick exit for non-fs extents
  Btrfs: rework delayed ref total_bytes_pinned accounting
  Btrfs: return old and new total ref mods when adding delayed refs
  Btrfs: always account pinned bytes when dropping a tree block ref
  Btrfs: update total_bytes_pinned when pinning down extents
  Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
  Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
  btrfs: fix validation of XATTR_ITEM dir items
  btrfs: Verify dir_item in iterate_object_props
  btrfs: Check name_len before in btrfs_del_root_ref
  btrfs: Check name_len before reading btrfs_get_name
  ...
2017-07-05 16:41:23 -07:00
Qu Wenruo bc42bda223 btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

         Task A                  |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserved_data()
                                 | |     Range alinged to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EQUOTA)
                                 | |- free_reserved_data_space()
                                      |- qgroup_free_data()
                                         Range aligned to page range
                                         [0, 4K)
                                         Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo 364ecf3651 btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo a12b877b55 btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
[BUG]
Under the following case, we can underflow qgroup reserved space.

            Task A                |            Task B
---------------------------------------------------------------
 Quota disabled                   |
 Buffered write                   |
 |- btrfs_check_data_free_space() |
 |  *NO* qgroup space is reserved |
 |  since quota is *DISABLED*     |
 |- All pages are copied to page  |
    cache                         |
                                  | Enable quota
                                  | Quota scan finished
                                  |
                                  | Sync_fs
                                  | |- run_delalloc_range
                                  | |- Write pages
                                  | |- btrfs_finish_ordered_io
                                  |    |- insert_reserved_file_extent
                                  |       |- btrfs_qgroup_release_data()
                                  |          Since no qgroup space is
                                             reserved in Task A, we
                                             underflow qgroup reserved
                                             space
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
size of qgroup reserved_space in all cases.
And btrfs_qgroup_release_data() will check if quotas are enabled.

However in the above case, the buffered write happens before quota is
enabled, so we don't have the reserved space for that range.

[FIX]
In insert_reserved_file_extent(), we tell qgroup to release the acctual
byte number it released.
In the above case, since we don't have the reserved space, we tell
qgroups to release 0 byte, so the problem can be fixed.

And thanks to the @reserved parameter introduced by the qgroup rework,
and previous patch to return released bytes, the fix can be as small as
10 lines.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Su Yue e79a33270d btrfs: Check name_len with boundary in verify dir_item
Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.

Example:
	1. Corrupt one dir_item name_len to be 255.
        2. Run 'ls -lar /mnt/test/ > /dev/null'
dmesg:
[   48.451449] BTRFS info (device vdb1): disk space caching is enabled
[   48.451453] BTRFS info (device vdb1): has skinny extents
[   48.489420] general protection fault: 0000 [#1] SMP
[   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
[   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
[   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
[   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
[   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
[   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
[   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
[   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
[   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
[   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
[   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
[   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
[   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.490008] Call Trace:
[   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
[   48.490008]  iterate_dir+0x181/0x1b0
[   48.490008]  SyS_getdents+0xa7/0x150
[   48.490008]  ? fillonedir+0x150/0x150
[   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
[   48.490008] RIP: 0033:0x7fef5032546b
[   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
[   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
[   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
[   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
[   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
[   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
[   48.499455] ---[ end trace 321920d8e8339505 ]---

Fix it by adding a parameter @slot and check name_len with item boundary
by calling btrfs_is_name_len_valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
rev
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Nikolay Borisov 104b4e5139 percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable.  Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add.  In terms of context-safety,
they're equivalent.  The only difference is that the __ version takes
a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.

This patch doesn't cause any functional changes.

tj: Minor updates to patch description for clarity.  Cosmetic
    indentation updates.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20 15:42:32 -04:00
Goldwyn Rodrigues edf064e7c6 btrfs: nowait aio support
Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
David Sterba 79b4f4c605 btrfs: cleanup duplicate return value in insert_inline_extent
The pattern when err is used for function exit and ret is used for
return values of callees is not used here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:12 +02:00
Timofey Titovets 170607ebd9 Btrfs: compression must free at least one sector size
We already skip storing data where compression does not make the result
at least one byte less.  Let's make the logic better and check
that compression frees at least one sector size of bytes, otherwise it's
not that useful.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Liu Bo ef7cdac101 Btrfs: skip checksum verification if IO error occurs
Currently dio read also goes to verify checksum if -EIO has been returned,
although it usually fails on checksum, it's not necessary at all, we could
directly check if there is another copy to read.

And with this, the behavior of dio read is now consistent with that of
buffered read.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use bool for uptodate ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo e3d37faba2 Btrfs: tolerate errors if we have retried successfully
With raid1 profile, dio read isn't tolerating IO errors if read length is
less than the stripe length (64K).

Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read
length is less than 64k.  In this case, if the underlying device returns
error somehow, bio->bi_error has recorded that error.

If we could recover the correct data from another copy in profile raid1/10/5/6,
with btrfs_subio_endio_read() returning 0, bio would have the correct data in
its vector, but bio->bi_error is not updated accordingly so that the following
dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.

This fixes the problem by setting bio's error to 0 if a good copy has been
found.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba 8b6c1d56f2 btrfs: sink gfp parameter to btrfs_bio_clone
All callers pass GFP_NOFS.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba 3aa8e074ab btrfs: btrfs_bio_clone never fails, skip error handling
Update direct callers of btrfs_bio_clone that do error handling, that we
can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
Guoqing Jiang 054ec2f626 btrfs: simplify code with bio_io_error
bio_io_error was introduced in the commit 4246a0b63b
("block: add a bi_error field to struct bio"), so use it to simplify
code.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba 4b5faeac46 btrfs: use generic slab for for btrfs_transaction
Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the nubmer goes down to
1.  This matches the expectations of the transaction lifetime.

For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.

We'll lose the stats about the object use, which could be added later if
we really need them.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Yonghong Song 04a87e3472 Btrfs: add statx support
Return enhanced file attributes from the btrfs, including:
  (1). inode creation time as stx_btime, and
  (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.

Example output:
	[root@localhost ~]# cat t.sh
	touch t
	chattr +aic t
	~/linux/samples/statx/test-statx t
	chattr -aic t
	touch t
	echo "========================================"
	~/linux/samples/statx/test-statx t
	/bin/rm t
	[root@localhost ~]# ./t.sh
	statx(t) = 0
	results=fff
  	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:13.999856591-0700
	Modify: 2017-05-11 16:03:13.999856591-0700
	Change: 2017-05-11 16:03:14.000856663-0700
 	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..)
	========================================
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:14.006857097-0700
	Modify: 2017-05-11 16:03:14.006857097-0700
	Change: 2017-05-11 16:03:14.006857097-0700
 	Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..)
	[root@localhost ~]#

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
Jeff Mahoney 1b86826d12 btrfs: cleanup root usage by btrfs_get_alloc_profile
There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.

This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases.  The next patch will eliminate one of those.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba e03733da5a btrfs: fix bool type in btrfs_page_exists_in_range
We use only a simple bool indicator, int is not a problem here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo e477094f0d Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Arnd Bergmann 3c91ee6964 Btrfs: work around maybe-uninitialized warning
A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:

fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Where the 'bio' variable was previously initialized unconditionally, it
is now set in the "while (submit_len > 0)" loop that would never execute
if submit_len is zero.

Assuming this cannot happen in practice, we can avoid the warning
by simply replacing the while{} loop with a do{}while() loop so
the compiler knows that it will always be entered at least once.

Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to
simplify DIO submit".

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo 3892ac9086 Btrfs: unify naming of btrfs_io_bio
All dio endio functions are using io_bio for struct btrfs_io_bio, this
makes btrfs_submit_direct to follow this convention.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo 629ebf4fad Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
gets returned when repairing, which leaves the rest blocks unrepaired.  The
behavior is different from how buffered read endio works in the same case.
This changes it to record error only and go on repairing the rest blocks.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo 17347cec15 Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio.  Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo 725130bac5 Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we
split bio by adding page to bio one by one, but later we don't modify
the vector of bio at all, thus we can use bio_clone_fast to use the
original bio vector directly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik 7870d0822b Btrfs: don't pass the inode through clean_io_failure
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik c6100a4b4e Btrfs: replace tree->mapping with tree->private_data
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks.  This works fine when everything that has
an extent_io_tree has an inode.  But we are going to remove the
btree_inode, so we need to change this.  Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead.  This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Dan Carpenter 97d038562a Btrfs: remove an unused variable
"item" is never used.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Jens Axboe 8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Liu Bo 452e62b71f Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
Before this, we use 'filled' mode here, ie. if all range has been
filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag
range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG
bits in extent_state until releasing this inode's pages, and that
prevents extent_data from being freed.

This clears the bit if any was found within the ordered extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:29 -07:00
Christoph Hellwig 4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig 4055351cdb fs: remove the unused error argument to dio_end_io()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
David Sterba cc2b702c52 btrfs: use correct types for page indices in btrfs_page_exists_in_range
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are not pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurence of the bug, there are several constaints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org	# 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:17 +02:00
Chris Mason bce19f9d23 Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12 2017-04-27 14:13:09 -07:00
Filipe Manana a7e3b975a0 Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K	/mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K	/mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K	/mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K	/mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.

Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:26 +01:00
Filipe Manana 1c81ba237b Btrfs: fix incorrect space accounting after failure to insert inline extent
When using compression, if we fail to insert an inline extent we
incorrectly end up attempting to free the reserved data space twice,
once through extent_clear_unlock_delalloc(), because we pass it the
flag EXTENT_DO_ACCOUNTING, and once through a direct call to
btrfs_free_reserved_data_space_noquota(). This results in a trace
like the following:

[  834.576240] ------------[ cut here ]------------
[  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[  834.596103] Call Trace:
[  834.596103]  dump_stack+0x67/0x90
[  834.596103]  __warn+0xc2/0xdd
[  834.596103]  warn_slowpath_null+0x1d/0x1f
[  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
[  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
[  834.596103]  async_cow_start+0x32/0x4d [btrfs]
[  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
[  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[  834.596103]  process_one_work+0x273/0x4e4
[  834.596103]  worker_thread+0x1eb/0x2ca
[  834.596103]  ? rescuer_thread+0x2b6/0x2b6
[  834.596103]  kthread+0x100/0x108
[  834.596103]  ? __list_del_entry+0x22/0x22
[  834.596103]  ret_from_fork+0x2e/0x40
[  834.611656] ---[ end trace 719902fe6bdef08f ]---

So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
if an error happened.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:23 +01:00
Filipe Manana a315e68f6e Btrfs: fix invalid attempt to free reserved space on failure to cow range
When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's bytes_may_use counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] ------------[ cut here ]------------
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
[  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
[  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
[  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
[  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's
bytes_may_use counter if we already reserved the extent and an error
happened before creating the ordered extent. We are already correctly
freeing the reserved extent if an error happens, so there's no additional
measure needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:22 +01:00
Qu Wenruo 524272607e btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.

However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.

The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.

After fix, delalloc error will be handled like:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:21 +01:00
Qu Wenruo 4dbd80fb91 btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:21 +01:00
Dan Carpenter 9986277e0e Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs_get_extent() never returns NULL pointers, so this code introduces
a static checker warning.

The btrfs_get_extent() is a bit complex, but trust me that it doesn't
return NULLs and also if it did we would trigger the BUG_ON(!em) before
the last return statement.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ updated subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Liu Bo 09ed2f165c Btrfs: add file item tracepoints
While debugging truncate problems, I found that these tracepoints could
help us quickly know what went wrong.

Two sets of tracepoints are created to track regular/prealloc file item
and inline file item respectively, I put inline as a separate one since
what inline file items cares about are way less than the regular one.

This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ formatting adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo f95fda8751 Btrfs: remove ASSERT in btrfs_truncate_inode_items
After 76b42abbf7 ("Btrfs: fix data loss after truncate when using the
no-holes feature"),

For either NO_HOLES or inline extents, we've set last_size to newsize to
avoid data loss after remount or inode got evicted and read again, thus,
we don't need this check anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Linus Torvalds 4b31ac485d Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Dave Sterba collected a few more fixes for the last rc.

  These aren't marked for stable, but I'm putting them in with a batch
  were testing/sending by hand for this release"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix potential use-after-free for cloned bio
  Btrfs: fix segmentation fault when doing dio read
  Btrfs: fix invalid dereference in btrfs_retry_endio
  btrfs: drop the nossd flag when remounting with -o ssd
2017-04-14 16:53:45 -07:00
Liu Bo 97bf5a5589 Btrfs: fix segmentation fault when doing dio read
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug during iterating bio pages in dio read's endio hook,
and it could end up with segment fault of the dio reading task.

So the reason is 'if (nr_sectors--)', and it makes the code assume that
there is one more block in the same page, so page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.

This also adds ASSERT to check page offset against page size.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:29 +02:00
Liu Bo 2e949b0a55 Btrfs: fix invalid dereference in btrfs_retry_endio
When doing directIO repair, we have this oops:

[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374]  bio_endio+0xed/0x100
[ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224]  process_one_work+0x34d/0xb70
[ 1458.546570]  ? process_one_work+0x29e/0xb70
[ 1458.546938]  worker_thread+0x1cf/0x960
[ 1458.547263]  ? process_one_work+0xb70/0xb70
[ 1458.547624]  kthread+0x17d/0x180
[ 1458.547909]  ? kthread_create_on_node+0x70/0x70
[ 1458.548300]  ret_from_fork+0x31/0x40

It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.

This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.

Also cleanup unused @start (which is too trivial for a separate patch).

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:08 +02:00
Linus Torvalds fe8e12b503 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have three small fixes queued up in my for-linus-4.11 branch"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix an integer overflow check
  btrfs: Change qgroup_meta_rsv to 64bit
  Btrfs: bring back repair during read
2017-03-31 17:58:48 -07:00
Liu Bo 9d0d1c8b1c Btrfs: bring back repair during read
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.

This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest.  We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-29 14:29:07 +02:00
Linus Torvalds 131fbf4f9c Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Zygo tracked down a very old bug with inline compressed extents.

  I didn't tag this one for stable because I want to do individual
  tested backports. It's a little tricky and I'd rather do some extra
  testing on it along the way"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: add missing memset while reading compressed inline extents
  Btrfs: fix regression in lock_delalloc_pages
  btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h
2017-03-23 11:39:33 -07:00
Zygo Blaxell e1699d2d7b btrfs: add missing memset while reading compressed inline extents
This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption found in
the wild:

        item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                inode generation 50 transid 50 size 47424 nbytes 49141
                block group 0 mode 100644 links 1 uid 0 gid 0
                rdev 0 flags 0x0(none)
        item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                inode ref index 3 namelen 10 name: DB_File.so
        item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                inline extent data size 1341 ram 4085 compress(zlib)
        item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                extent data disk byte 5367308288 nr 20480
                extent data offset 0 nr 45056 ram 45056
                extent compression(zlib)

Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096.  The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.

Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

	# touch foo
	# chattr +c foo
	# xfs_io -f -c "pwrite -W 0 1000" foo
	# xfs_io -f -c "falloc 4 8188" foo
	# od -x foo
	# echo 3 >/proc/sys/vm/drop_caches
	# od -x foo

This produce the following on my box:

Correct output:  file contains 1000 data bytes followed
by zeros:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
	0001760 0000 0000 0000 0000 0000 0000 0000 0000
	*
	0020000

Actual output:  the data after the first 1000 bytes
will be different each run:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-03-17 13:47:10 -07:00
David Howells a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Linus Torvalds bbe08c0a43 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "Btrfs round two.

  These are mostly a continuation of Dave Sterba's collection of
  cleanups, but Filipe also has some bug fixes and performance
  improvements"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits)
  btrfs: add dummy callback for readpage_io_failed and drop checks
  btrfs: drop checks for mandatory extent_io_ops callbacks
  btrfs: document existence of extent_io ops callbacks
  btrfs: let writepage_end_io_hook return void
  btrfs: do proper error handling in btrfs_insert_xattr_item
  btrfs: handle allocation error in update_dev_stat_item
  btrfs: remove BUG_ON from __tree_mod_log_insert
  btrfs: derive maximum output size in the compression implementation
  btrfs: use predefined limits for calculating maximum number of pages for compression
  btrfs: export compression buffer limits in a header
  btrfs: merge nr_pages input and output parameter in compress_pages
  btrfs: merge length input and output parameter in compress_pages
  btrfs: constify name of subvolume in creation helpers
  btrfs: constify buffers used by compression helpers
  btrfs: constify input buffer of btrfs_csum_data
  btrfs: constify device path passed to relevant helpers
  btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
  btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
  btrfs: Make btrfs_add_nondir take btrfs_inode
  btrfs: Make btrfs_add_link take btrfs_inode
  ...
2017-03-02 16:03:00 -08:00
Chris Mason e9f467d028 Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 2017-02-28 14:35:09 -08:00
David Sterba 20a7db8ab3 btrfs: add dummy callback for readpage_io_failed and drop checks
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba 4d53dddbec btrfs: document existence of extent_io ops callbacks
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba c3988d630a btrfs: let writepage_end_io_hook return void
There's no error path in any of the instances, always return 0.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:29:24 +01:00
David Sterba e5d7490236 btrfs: derive maximum output size in the compression implementation
The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:36 +01:00
David Sterba 069eac7850 btrfs: use predefined limits for calculating maximum number of pages for compression
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba ff7638665c btrfs: export compression buffer limits in a header
Move the buffer limit definitions out of compress_file_range.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba 4d3a800ebb btrfs: merge nr_pages input and output parameter in compress_pages
The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space.  The
compression implementation will sink the parameter to a local variable
so everything works as before.

The nr_pages variables can also be simply merged in compress_file_range
into one.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
David Sterba 38c3146408 btrfs: merge length input and output parameter in compress_pages
The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 14:26:35 +01:00
Nikolay Borisov 0b581701d9 btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov abcefb1eee btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov cef415af20 btrfs: Make btrfs_add_nondir take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:12 +01:00
Nikolay Borisov db0a669fb0 btrfs: Make btrfs_add_link take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov 9e3e97f45c btrfs: Make btrfs_del_delalloc_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov fc4f21b1d8 btrfs: Make get_extent_t take btrfs_inode
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov 6fc0ef6870 btrfs: Make btrfs_clear_bit_hook take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov 9cdc512410 btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov 73f2e545b6 btrfs: Make btrfs_orphan_add take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov 3d6ae7bb6a btrfs: make btrfs_orphan_del take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov 7ab7956ec3 btrfs: make btrfs_free_io_failure_record take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov b30cb441fc btrfs: make clean_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov 0970a22e58 btrfs: make btrfs_print_data_csum_error take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov 4ac1f4acd7 btrfs: make free_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov a776c6fa1f btrfs: Make btrfs_lookup_ordered_range take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov 7a6d706795 btrfs: Make btrfs_mark_extent_written take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov dcdbc059f0 btrfs: Make btrfs_drop_extent_cache take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Nikolay Borisov 6158e1ce1c btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov 691fa05967 btrfs: all btrfs_delalloc_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov 703b391a03 btrfs: Make btrfs_orphan_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov 8ed7a2a0e0 btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov 70ddc553b5 btrfs: make btrfs_is_free_space_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov 6ef06d2790 btrfs: Make btrfs_i_size_write take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov 877574e254 btrfs: Make btrfs_set_inode_index take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov 4c570655f4 btrfs: make btrfs_set_inode_index_count take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov 8e7611cf38 btrfs: Make btrfs_insert_dir_item take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00