Граф коммитов

308 Коммитов

Автор SHA1 Сообщение Дата
Theodore Ts'o 9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Lukas Czerner b06acd38a4 ext4: remove explicit WARN_ON when ext4_map_blocks() fails
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.

Also use ext4_warning() instead of ext4_msg() and printk().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:21:12 -05:00
Zheng Liu 8bad6fc813 ext4: add punching hole support for non-extent-mapped files
This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:21:37 -05:00
Wang Shilong aebf02430d ext4: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12 16:28:47 -05:00
Theodore Ts'o 860d21e2c5 ext4: return ENOMEM if sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:19:36 -05:00
Forrest Liu c36575e663 ext4: fix extent tree corruption caused by hole punch
When depth of extent tree is greater than 1, logical start value of
interior node is not correctly updated in ext4_ext_rm_idx.

Signed-off-by: Forrest Liu <forrestl@synology.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Cc: stable@vger.kernel.org
2012-12-17 09:55:39 -05:00
Zhi Yong Wu 187fd030d8 ext4: remove unused variable from ext4_ext_in_cache()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Zheng Liu <gnehzuil.liu@gmail.com>
2012-12-10 14:06:04 -05:00
Tao Ma 0c8d414f16 ext4: let fallocate handle inline data correctly
If we are punching hole in a file, we will return ENOTSUPP.
As for the fallocation of some extents, we will convert the
inline data to a normal extent based file first.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:03 -05:00
Tao Ma 941919856c ext4: let fiemap work with inline data
fiemap is used to find the disk layout of a file, as for inline data,
let us just pretend like a file with just one extent.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:02 -05:00
Tao Ma f19d5870cb ext4: add normal write support for inline data
For a normal write case (not journalled write, not delayed
allocation), we write to the inline if the file is small and convert
it to an extent based file when the write is larger than the max
inline size.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:05:51 -05:00
Theodore Ts'o 4a092d7379 ext4: rationalize ext4_extents.h inclusion
Previously, ext4_extents.h was being included at the end of ext4.h,
which was bad for a number of reasons: (a) it was not being included
in the expected place, and (b) it caused the header to be included
multiple times.  There were #ifdef's to prevent this from causing any
problems, but it still was unnecessary.

By moving the function declarations that were in ext4_extents.h to
ext4.h, which is standard practice for where the function declarations
for the rest of ext4.h can be found, we can remove ext4_extents.h from
being included in ext4.h at all, and then we can only include
ext4_extents.h where it is needed in ext4's source files.

It should be possible to move a few more things into ext4.h, and
further reduce the number of source files that need to #include
ext4_extents.h, but that's a cleanup for another day.

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Reported-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-28 13:03:30 -05:00
Lukas Czerner 06348679c9 ext4: simple cleanup in fiemap codepath
This commit is simple cleanup of fiemap codepath which has not been
included in previous commit to make the changes clearer. In this commit
we rename cbex variable to newex in ext4_fill_fiemap_extents() because
callback is no longer present

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-28 12:33:22 -05:00
Lukas Czerner 91dd8c1144 ext4: prevent race while walking extent tree for fiemap
Currently ext4_ext_walk_space() only takes i_data_sem for read when
searching for the extent at given block with ext4_ext_find_extent().
Then it drops the lock and the extent tree can be changed at will.
However later on we're searching for the 'next' extent, but the extent
tree might already have changed, so the information might not be
accurate.

In fact we can hit BUG_ON(end <= start) if the extent got inserted into
the tree after the one we found and before the block we were searching
for. This has been reproduced by running xfstests 225 in loop on s390x
architecture, but theoretically we could hit this on any other
architecture as well, but probably not as often.

Moreover the extent currently in delayed allocation might be allocated
after we search the extent tree and before we search extent status tree
delayed buffers resulting in those delayed buffers being completely
missed, even though completely written and allocated.

We fix all those problems in several steps:

 1. remove unnecessary callback indirection
 2. rename functions
        ext4_ext_walk_space -> ext4_fill_fiemap_extents
        ext4_ext_fiemap_cb -> ext4_find_delayed_extent
 3. move fiemap_fill_next_extent() into ext4_fill_fiemap_extents()
 4. hold the i_data_sem for:
        ext4_ext_find_extent()
        ext4_ext_next_allocated_block()
        ext4_find_delayed_extent()
 5. call fiemap_fill_next_extent after releasing the i_data_sem
 6. move path reinitialization into the critical section.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-28 12:32:26 -05:00
Zheng Liu b3aff3e3f6 ext4: reimplement fiemap using extent status tree
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:37 -05:00
Zheng Liu 7d1b1fbc95 ext4: reimplement ext4_find_delay_alloc_range on extent status tree
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:35 -05:00
Zheng Liu 51865fda28 ext4: let ext4 maintain extent status tree
This patch lets ext4 maintain extent status tree.

Currently it only tracks delay extent status in extent status tree.  When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree.  When a delay extent is written out or invalidated, it will
be removed from this tree.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:32 -05:00
Zheng Liu 3779473246 ext4: fix missing call to trace_ext4_ext_map_blocks_exit
When ext4_ext_handle_uninitialized_extents(), we will directly return
from ext4_ext_map_blocks().  The trace point of
trace_ext4_ext_map_blocks_exit isn't called, and the user doesn't see
any result.  This patch tries to fix this problem.

Meanwhile in ext4_ext_handle_uninitialized_extents it returns errors
or the number of allocated blocks.  So 'ret' variable can be removed
due to previously modifications.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
2012-11-08 14:47:52 -05:00
Zheng Liu 19b303d8b5 ext4: print map->m_flags in trace_ext4_ext/ind_map_blocks_exit
When we use trace_ext4_ext/ind_map_blocks_exit, print the value of
map->m_flags in order that we can understand the extent's current
status.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:34:04 -05:00
Zheng Liu b5645534ce ext4: print 'flags' in ext4_ext_handle_uninitialized_extents
In trace_ext4_ext_handle_uninitialized_extents we don't care about the
value of map->m_flags because this value is probably 0, and we prefer
to get the value of flags because we can know how to handle this
extent in this function.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:33:43 -05:00
Dmitry Monakhov dee1f973ca ext4: race-condition protection for ext4_convert_unwritten_extents_endio
We assumed that at the time we call ext4_convert_unwritten_extents_endio()
extent in question is fully inside [map.m_lblk, map->m_len] because
it was already split during submission.  But this may not be true due to
a race between writeback vs fallocate.

If extent in question is larger than requested we will split it again.
Special precautions should being done if zeroout required because
[map.m_lblk, map->m_len] already contains valid data.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-10-10 01:04:58 -04:00
Dmitry Monakhov 60d4616f3d ext4: serialize fallocate with ext4_convert_unwritten_extents
Fallocate should wait for pended ext4_convert_unwritten_extents()
otherwise following race may happen:

ftruncate( ,12288);
fallocate( ,0, 4096)
io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */
fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */

Later kwork completion will do:
 ->ext4_convert_unwritten_extents (0, 4096)
   ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
    ->ext4_ext_map_blocks() /* Will find new extent:  ex = [0,2] !!!!!! */
      ->ext4_ext_handle_uninitialized_extents()
        ->ext4_convert_unwritten_extents_endio()
        /* convert [0,2] extent to initialized, but only[0,1] was written */

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-10-05 11:32:02 -04:00
Dmitry Monakhov c278531d39 ext4: fix ext4_flush_completed_IO wait semantics
BUG #1) All places where we call ext4_flush_completed_IO are broken
    because buffered io and DIO/AIO goes through three stages
    1) submitted io,
    2) completed io (in i_completed_io_list) conversion pended
    3) finished  io (conversion done)
    And by calling ext4_flush_completed_IO we will flush only
    requests which were in (2) stage, which is wrong because:
     1) punch_hole and truncate _must_ wait for all outstanding unwritten io
      regardless to it's state.
     2) fsync and nolock_dio_read should also wait because there is
        a time window between end_page_writeback() and ext4_add_complete_io()
        As result integrity fsync is broken in case of buffered write
        to fallocated region:
        fsync                                      blkdev_completion
	 ->filemap_write_and_wait_range
                                                   ->ext4_end_bio
                                                     ->end_page_writeback
          <-- filemap_write_and_wait_range return
	 ->ext4_flush_completed_IO
   	 sees empty i_completed_io_list but pended
   	 conversion still exist
                                                     ->ext4_add_complete_io

BUG #2) Race window becomes wider due to the 'ext4: completed_io
locking cleanup V4' patch series

This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
   wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
   prevent endless wait

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-10-05 11:31:55 -04:00
Dmitry Monakhov 6f2080e644 ext4: fix ext_remove_space for punch_hole case
Inode is allowed to have empty leaf only if it this is blockless inode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-30 23:03:50 -04:00
Dmitry Monakhov 02d262dffc ext4: punch_hole should wait for DIO writers
punch_hole is the place where we have to wait for all existing writers
(writeback, aio, dio), but currently we simply flush pended end_io request
which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
held which obviously result in dangerous data corruption due to
write-after-free.

This patch performs following changes:
- Guard punch_hole with i_mutex
- Recheck inode flags under i_mutex
- Block all new dio readers in order to prevent information leak caused by
  read-after-free pattern.
- punch_hole now wait for all writers in flight
  NOTE: XXX write-after-free race is still possible because new dirty pages
  may appear due to mmap(), and currently there is no easy way to stop
  writeback while punch_hole is in progress.

[ Fixed error return from ext4_ext_punch_hole() to make sure that we
  release i_mutex before returning EPERM or ETXTBUSY -- Ted ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-30 23:03:42 -04:00
Dmitry Monakhov 28a535f9a0 ext4: completed_io locking cleanup
Current unwritten extent conversion state-machine is very fuzzy.
- For unknown reason it performs conversion under i_mutex. What for?
  My diagnosis:
  We already protect extent tree with i_data_sem, truncate and punch_hole
  should wait for DIO, so the only data we have to protect is end_io->flags
  modification, but only flush_completed_IO and end_io_work modified this
  flags and we can serialize them via i_completed_io_lock.

  Currently all these games with mutex_trylock result in the following deadlock
   truncate:                          kworker:
    ext4_setattr                       ext4_end_io_work
    mutex_lock(i_mutex)
    inode_dio_wait(inode)  ->BLOCK
                             DEADLOCK<- mutex_trylock()
                                        inode_dio_done()
  #TEST_CASE1_BEGIN
  MNT=/mnt_scrach
  unlink $MNT/file
  fallocate -l $((1024*1024*1024)) $MNT/file
  aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
  sleep 2
  truncate -s 0 $MNT/file
  #TEST_CASE1_END

Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286

This patch makes state machine simple and clean:

(1) xxx_end_io schedule final extent conversion simply by calling
    ext4_add_complete_io(), which append it to ei->i_completed_io_list
    NOTE1: because of (2A) work should be queued only if
    ->i_completed_io_list was empty, otherwise the work is scheduled already.

(2) ext4_flush_completed_IO is responsible for handling all pending
    end_io from ei->i_completed_io_list
    Flushing sequence consists of following stages:
    A) LOCKED: Atomically drain completed_io_list to local_list
    B) Perform extents conversion
    C) LOCKED: move converted io's to to_free list for final deletion
       	     This logic depends on context which we was called from.
    D) Final end_io context destruction
    NOTE1: i_mutex is no longer required because end_io->flags modification
    is protected by ei->ext4_complete_io_lock

Full list of changes:
- Move all completion end_io related routines to page-io.c in order to improve
  logic locality
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
- remove EXT4_IO_END_FSYNC
- Improve SMP scalability by removing useless i_mutex which does not
  protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Rename ext4_end_io_nolock to end4_end_io and make it static
- Check flush completion status to ext4_ext_punch_hole(). Because it is
  not good idea to punch blocks from corrupted inode.

Changes since V3 (in request to Jan's comments):
  Fall back to active flush_completed_IO() approach in order to prevent
  performance issues with nolocked DIO reads.
Changes since V2:
  Fix use-after-free caused by race truncate vs end_io_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-29 00:14:55 -04:00
Dmitry Monakhov 82e5422911 ext4: fix unwritten counter leakage
ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
on error path.

 - add missed error checks to prevent counter leakage
 - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
   that conversion finished.
 - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.

Visible effect of this bug is that unaligned aio_stress may deadlock

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-28 23:36:25 -04:00
Dmitry Monakhov f45ee3a1ea ext4: ext4_inode_info diet
Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.

TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
      to have concurent AIO_DIO requests.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-28 23:21:09 -04:00
Wei Yongjun ba39ebb614 ext4: convert to use leXX_add_cpu()
Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-27 09:37:53 -04:00
Lukas Czerner 63fedaf1c2 ext4: remove unused function ext4_ext_check_cache
Remove unused function ext4_ext_check_cache() and merge the code back to
the ext4_ext_in_cache().

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-26 21:09:06 -04:00
Andrey Sidorov 18888cf088 ext4: speed up truncate/unlink by not using bforget() unless needed
Do not iterate over data blocks scanning for bh's to forget as they're
never exist. This improves time taken by unlink / truncate syscall.
Tested by continuously truncating file that is being written by dd.
Another test is rm -rf of linux tree while tar unpacks it. With
ordered data mode condition unlikely(!tbh) was always met in
ext4_free_blocks. With journal data mode tbh was found only few times,
so optimisation is also possible.

Unlinking fallocated 60G file after doing sync && echo 3 >
/proc/sys/vm/drop_caches && time rm --help

X86 before (linux 3.6-rc4):
# time rm -f test1
real    0m2.710s
user    0m0.000s
sys     0m1.530s

X86 after:
# time rm -f test1
real    0m0.644s
user    0m0.003s
sys     0m0.060s

MIPS before (linux 2.6.37):
# time rm -f test1
real    0m 4.93s
user    0m 0.00s
sys     0m 4.61s

MIPS after:
# time rm -f test1
real    0m 0.16s
user    0m 0.00s
sys     0m 0.06s

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
2012-09-19 14:14:53 -04:00
Wang Sheng-Hui 30cb27d661 ext4: fix trivial typo in comment
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-18 22:38:07 -04:00
Ashish Sangwan e3d2e433e3 ext4: no need to add inode to orphan list during hole punch
While performing punch hole for an inode, i_disksize is not changed.
So, there is no need to add the inode to orphan list.

Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-18 22:29:46 -04:00
Zheng Liu 67a5da564f ext4: make the zero-out chunk size tunable
Currently in ext4 the length of zero-out chunk is set to 7 file system
blocks.  But if an inode has uninitailized extents from using
fallocate to preallocate space, and the workload issues many random
writes, this can cause a fragmented extent tree that will
unnecessarily grow the extent tree.

So create a new sysfs tunable, extent_max_zeroout_kb, which controls
the maximum size where blocks will be zeroed out instead of creating a
new uninitialized extent.  The default of this has been sent to 32kb.

CC: Zach Brown <zab@zabbo.net>
CC: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 09:54:17 -04:00
Theodore Ts'o ecb94f5fdf ext4: collapse a single extent tree block into the inode if possible
If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
an external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.

Google-Bug-Id: 6801242

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 09:44:17 -04:00
Theodore Ts'o 89a4e48f84 ext4: fix kernel BUG on large-scale rm -rf commands
Commit 968dee7722: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

    Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
    Call Trace:
     [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110
     [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0
     [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0
     [<ffffffff81207e05>] ext4_truncate+0xf5/0x100
     [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490
     [<ffffffff811a1312>] evict+0xa2/0x1a0
     [<ffffffff811a1513>] iput+0x103/0x1f0
     [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0
     [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40
     [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50
     [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b
    Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00

    RIP  [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0

This could be reproduced as follows:

The problem in commit 968dee7722 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal.  This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.

Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: Marti Raudsepp <marti@juffo.org>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-08-17 09:42:17 -04:00
Ashish Sangwan 968dee7722 ext4: fix hole punch failure when depth is greater than 0
Whether to continue removing extents or not is decided by the return
value of function ext4_ext_more_to_rm() which checks 2 conditions:
a) if there are no more indexes to process.
b) if the number of entries are decreased in the header of "depth -1".

In case of hole punch, if the last block to be removed is not part of
the last extent index than this index will not be deleted, hence the
number of valid entries in the extent header of "depth - 1" will
remain as it is and ext4_ext_more_to_rm will return 0 although the
required blocks are not yet removed.

This patch fixes the above mentioned problem as instead of removing
the extents from the end of file, it starts removing the blocks from
the particular extent from which removing blocks is actually required
and continue backward until done.

Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-07-22 22:49:08 -04:00
HaiboLiu e7bcf82304 ext4: fix out-of-date comments in extents.c
In this patch, ext4_ext_try_to_merge has been change to merge 
an extent both left and right.  So we need to update the comment
in here.

Signed-off-by: HaiboLiu <HaiboLiu6@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09 16:29:28 -04:00
Zheng Liu f4e95b3316 ext4: honor O_(D)SYNC semantic in ext4_fallocate()
Ext4 must make sure the transaction to be commited to the disk when
user opens a file with O_(D)SYNC flag and do a fallocate(2) call.

This problem had been reported by Christoph Hellwig in this thread:
http://www.spinics.net/lists/linux-btrfs/msg13621.html

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-06-30 19:12:57 -04:00
Hugh Dickins 5e44f8c374 ext4: hole-punch use truncate_pagecache_range
When truncating a file, we unmap pages from userspace first, as that's
usually more efficient than relying, page by page, on the fallback in
truncate_inode_page() - particularly if the file is mapped many times.

Do the same when punching a hole: 3.4 added truncate_pagecache_range()
to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead
of calling truncate_inode_pages_range() directly.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-06-01 00:15:28 -04:00
Zheng Liu 4a3c3a5120 ext4: fix format flag in ext4_ext_binsearch_idx()
fix ext_debug format flag in ext4_ext_binsearch_idx().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:55:16 -04:00
Darrick J. Wong 7ac5990d5a ext4: verify and calculate checksums for extent tree blocks
Calculate and verify the checksum for each extent tree block.  The
checksum is located in the space immediately after the last possible
ext4_extent in the block.  The space is is typically the last 4-8
bytes in the block.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:37:10 -04:00
Darrick J. Wong f84891289e ext4: create a new BH_Verified flag to avoid unnecessary metadata validation
Create a new BH_Verified flag to indicate that we've verified all the
data in a buffer_head for correctness.  This allows us to bypass
expensive verification steps when they are not necessary without
missing them when they are.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:21:10 -04:00
Linus Torvalds 592fe89806 Ext4 regression fixes for 3.4
This fixes a scalability problem reported by Andi Kleen and Tim Chen;
 they were quite secretive about the precise nature of their workload,
 but they later admitted that it only showed up when they were using a
 large sparse file, so the amount of data I/O that was needed was close
 to zero.  I'm not sure how realistic this is and it's only a
 regression if you consider changes made since 2.6.39 to be a
 "regression" vis-a-vis the policy regarding post-merge window bug
 fixes, but Linus agreed it was worth fixing, so I'm including it in
 this pull request.
 
 This also fixes the journalled quota mount options, which I
 accidentally broke while I was cleaning up the mount option handling.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPjbs2AAoJENNvdpvBGATw/4UQAOMsxRlzbMnOAmohIBmesJiB
 nTPX41dNw0QCXstMFjvSyRbUJI0NZPGg6ZDhbtoQ/c42/izZOVNd7gh7QQYHRCt2
 Oqh6WS159P04ixcAe8NgCm5B5AV/C5Er/vSOUZENBaBo430vvrZasifsyUgQnx+b
 PxXlYsfPVzaQVeSxCu/68OeiBRLcLwmKZ6MicOaWt9GCNoCsWzgU+/LNskYuscPI
 841yQL6/BE7redU2E9qoEn/xjxx57hJOj2iiIAuqGPmNmRQqq3VtvTqNZHldNBLp
 Hz5mB3zSZsPg0uwvp+OxrnpP37NQCCn1L64UFIXxqGF47mcCYw7schAoGBtqwGQS
 neGUfkzG4beKk7kojyDawvnrrvVn4iCMaIkR1ZjzjPOk+QBPagckrS2nmuObbYzJ
 l4lmHq1v8nOh0clxqJPioNp5/Y13sbpEOY4tAa6sLpzKLKXF330RNuwwrKKHB7zo
 ZhCvSwVEjmxacgPCPhFJnR3fxtjoXWR8WvJs7H+gZB/QaC8hjEYw6xvvrkw8mAiS
 RNe3cYdYAz6kOJdtJrJaMzp/1CYdHydf+0WvNYCQ/d1poPr7uU5wQY61hdm2gFxh
 owsbVAOiEFjZWJHqrRRdcg2irTpINJTS3iRe4g/ltcvYzzRSeOcZNWvkKFspq3i8
 EUMHRBxLzPMRa+gU6Unm
 =jb3f
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 regression fixes from Ted Ts'o:
 "This fixes a scalability problem reported by Andi Kleen and Tim Chen;
  they were quite secretive about the precise nature of their workload,
  but they later admitted that it only showed up when they were using a
  large sparse file, so the amount of data I/O that was needed was close
  to zero.

  I'm not sure how realistic this is and it's only a regression if you
  consider changes made since 2.6.39 to be a "regression" vis-a-vis the
  policy regarding post-merge window bug fixes, but Linus agreed it was
  worth fixing, so I'm including it in this pull request.

  This also fixes the journalled quota mount options, which I
  accidentally broke while I was cleaning up the mount option handling."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix handling of journalled quota options
  ext4: address scalability issue by removing extent cache statistics
2012-04-17 13:30:34 -07:00
Theodore Ts'o 9cd70b347e ext4: address scalability issue by removing extent cache statistics
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-04-16 12:16:20 -04:00
Al Viro af1584f570 ext4: fix endianness breakage in ext4_split_extent_at()
->ee_len is __le16, so assigning cpu_to_le32() to it is going to do
Bad Things(tm) on big-endian hosts...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-13 10:12:08 -04:00
Lukas Czerner afcff5d80a ext4: remove restrictive checks for EOFBLOCKS_FL
We are going to remove the EOFBLOCKS_FL flag in the future, so this is
the first part of the removal. We can not remove it entirely just now,
since the e2fsck is still checking for it and it might cause headache to
some people. Instead, remove the restrictive checks now and the rest
later, when the new e2fsck code is out and common enough.

This is also needed because punch hole already breaks the EOFBLOCKS_FL
semantics, so it might cause the some troubles. So simply remove it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 21:47:55 -04:00
Theodore Ts'o 92b9781658 ext4: change some printk() calls to use ext4_msg() instead
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:41:49 -04:00
Lukas Czerner dc1841d6cf ext4: give more helpful error message in ext4_ext_rm_leaf()
The error message produced by the ext4_ext_rm_leaf() when we are
removing blocks which accidentally ends up inside the existing extent,
is not very helpful, because we would like to also know which extent did
we collide with.

This commit changes the error message to get us also the information
about the extent we are colliding with.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:07:43 -04:00
Lukas Czerner 7877191c28 ext4: remove unused code from ext4_ext_map_blocks()
Since the commit 'Rewrite punch hole to use ext4_ext_remove_space()'
reworked the punch hole implementation to use ext4_ext_remove_space()
instead of ext4_ext_map_blocks(), we can remove the code which is no
longer needed from the ext4_ext_map_blocks().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:05:43 -04:00
Lukas Czerner 5f95d21fb6 ext4: rewrite punch hole to use ext4_ext_remove_space()
This commit rewrites ext4 punch hole implementation to use
ext4_ext_remove_space() instead of its home gown way of doing this via
ext4_ext_map_blocks(). There are several reasons for changing this.

Firstly it is quite non obvious that punching hole needs to
ext4_ext_map_blocks() to punch a hole, especially given that this
function should map blocks, not unmap it. It also required a lot of new
code in ext4_ext_map_blocks().

Secondly the design of it is not very effective. The reason is that we
are trying to punch out blocks in ext4_ext_punch_hole() in opposite
direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf()
to iterate through the whole tree from the end to the start to find the
requested extent for every extent we are going to punch out.

And finally the current implementation does not use the existing code,
but bring a lot of new code, which is IMO unnecessary since there
already is some infrastructure we can use. Specifically
ext4_ext_remove_space().

This commit changes ext4_ext_remove_space() to accept 'end' parameter so
we can not only truncate to the end of file, but also remove the space
in the middle of the file (punch a hole). Moreover, because the last
block to punch out, might be in the middle of the extent, we have to
split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
remove the whole fist part of split extent, or change its size.

ext4_ext_remove_space() is then used to actually remove the space
(extents) from within the hole, instead of ext4_ext_map_blocks().

Note that this also fix the issue with punch hole, where we would forget
to remove empty index blocks from the extent tree, resulting in double
free block error and file system corruption. This is simply because we
now use different code path, where this problem does not exist.

This has been tested with fsx running for several days and xfstests,
plus xfstest #251 with '-o discard' run on the loop image (which
converts discard requestes into punch hole to the backing file). All of
it on 1K and 4K file system block size.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:03:19 -04:00