The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:
resv_level=0 70%
resv_level=1 21%
resv_level=2 23%
resv_level=3 24%
resv_level=4 60%
resv_level=5 did not test
resv_level=6 60%
resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.
This patch also change the behavior of directory reservations - they now
track file reservations. The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.
Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.
Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.
To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.
With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:
localalloc= avg. fragmentation
8 48
32 16
64 10
120 7
On larger cluster sizes, the difference is more dramatic.
The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:
ocfs2_local_alloc_size(sb) * 8
which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Otherwise, the need for a very large contiguous allocation tends to
wreak havoc on many inode allocation reservations on the local alloc, thus
ruining any chances for contiguousness.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use the reservations system for unindexed dir tree allocations. We don't
bother with the indexed tree as reads from it are mostly random anyway.
Directory reservations are marked seperately, to allow the reservations code
a chance to optimize their window sizes. This patch allocates only 8 bits
for directory windows as they generally are not expected to grow as quickly
as file data. Future improvements to dir window sizing can trivially be
made.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0
since before the kernel moved to git. There is no point in checking
this error.
ocfs2_journal_dirty() has been faithfully returning the status since the
beginning. All over ocfs2, we have blocks of code checking this can't
fail status. In the past few years, we've tried to avoid adding these
checks, because they are pointless. But anyone who looks at our code
assumes they are needed.
Finally, ocfs2_journal_dirty() is made a void function. All error
checking is removed from other files. We'll BUG_ON() the status of
jbd2_journal_dirty_metadata() just in case they change it someday. They
won't.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When the local alloc file changes windows, unused bits are freed back to the
global bitmap. By defnition, those bits can not be in use by any file. Also,
the local alloc will never have been able to allocate those bits if they
were part of a previous truncate. Therefore it makes sense that we should
clear unused local alloc bits in the undo buffer so that they can be used
immediatly.
[ Modified to call it ocfs2_release_clusters() -- Joel ]
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
You can't store a pointer that you haven't filled in yet and expect it
to work.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When replacing a xattr's value, in some case we wipe its name/value
first and then re-add it. The wipe is done by
ocfs2_xa_block_wipe_namevalue() when the xattr is in the inode or
block. We currently adjust name_offset for all the entries which have
(offset < name_offset). This does not adjust the entrie we're replacing.
Since we are replacing the entry, we don't adjust the total entry count.
When we calculate a new namevalue location, we trust the entries
now-wrong offset in ocfs2_xa_get_free_start(). The solution is to
also adjust the name_offset for the replaced entry, allowing
ocfs2_xa_get_free_start() to calculate the new namevalue location
correctly.
The following script can trigger a kernel panic easily.
echo 'y'|mkfs.ocfs2 --fs-features=local,xattr -b 4K $DEVICE
mount -t ocfs2 $DEVICE $MNT_DIR
FILE=$MNT_DIR/$RANDOM
for((i=0;i<76;i++))
do
string_76="a$string_76"
done
string_78="aa$string_76"
string_82="aaaa$string_78"
touch $FILE
setfattr -n 'user.test1234567890' -v $string_76 $FILE
setfattr -n 'user.test1234567890' -v $string_78 $FILE
setfattr -n 'user.test1234567890' -v $string_82 $FILE
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
What we were doing before was to ask for the current window size as the
maximum allocation. This had the effect of limiting the amount of allocation
we could get for the local alloc during times when the window size was
shrunk due to fragmentation. In some cases, that could actually *increase*
fragmentation by artificially limiting the number of bits we can accept. So
while we still want to ask for a minimum number of bits equal to window
size, there is no reason why we should limit the number of bits the local
alloc should accept. Hence always allow the maximum number of local alloc
bits.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_set_acl() and ocfs2_init_acl() were setting i_mode on the in-memory
inode, but never setting it on the disk copy. Thus, acls were some times not
getting propagated between nodes. This patch fixes the issue by adding a
helper function ocfs2_acl_set_mode() which does this the right way.
ocfs2_set_acl() and ocfs2_init_acl() are then updated to call
ocfs2_acl_set_mode().
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In reflink, we need to upate i_blocks for the target inode.
Reported-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_validate_gd_parent, we check bg_chain against the
cl_next_free_rec of the dinode. Actually in resize, we have
the chance of bg_chain == cl_next_free_rec. So add some
additional condition check for it.
I also rename paramter "clean_error" to "resize", since the
old one is not clearly enough to indicate that we should only
meet with this case in resize.
btw, the correpsonding bug is
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1230.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_lock() will skip locks on file which has mode set to 02666. This
is a problem in cases where the mode of the file is changed after a
process has obtained a lock on the file.
ocfs2_lock() should skip the check for mandatory locks when unlocking a
file.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (34 commits)
ACPI: processor: push file static MADT pointer into internal map_madt_entry()
ACPI: processor: refactor internal map_lsapic_id()
ACPI: processor: refactor internal map_x2apic_id()
ACPI: processor: refactor internal map_lapic_id()
ACPI: processor: driver doesn't need to evaluate _PDC
ACPI: processor: remove early _PDC optin quirks
ACPI: processor: add internal processor_physically_present()
ACPI: processor: move acpi_get_cpuid into processor_core.c
ACPI: processor: export acpi_get_cpuid()
ACPI: processor: mv processor_pdc.c processor_core.c
ACPI: processor: mv processor_core.c processor_driver.c
ACPI: plan to delete "acpi=ht" boot option
ACPI: remove "acpi=ht" DMI blacklist
PNPACPI: add bus number support
PNPACPI: add window support
resource: add window support
resource: add bus number support
resource: expand IORESOURCE_TYPE_BITS to make room for bus resource type
acpiphp: Execute ACPI _REG method for hotadded devices
ACPI video: Be more liberal in validating _BQC behaviour
...
There's no real need for a pointer to the MADT to be global. The only
function who uses it is map_madt_entry.
This allows us to remove some more ugly #ifdefs.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Un-nest the if statements for readability.
Remove comments that re-state the obvious.
Change the control flow so that we no longer need a temp variable.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Untangle the nested if conditions to make this function look
more similar to the other map_*apic_id() functions.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Untangle the if() statement a little for readability.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Now that the early _PDC evaluation path knows how to correctly
evaluate _PDC on only physically present processors, there's no
need for the processor driver to evaluate it later when it loads.
To cover the hotplug case, push _PDC evaluation down into the
hotplug paths.
Cc: x86@kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Now that we check for physically present processors before blindly
evaluating _PDC, we no longer need to maintain a DMI opt-in table
nor a kernel param.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Detect if a processor is physically present before evaluating _PDC.
We want this because some BIOS will provide a _PDC even for processors
that are not present. These bogus _PDC methods then attempt to load
non-existent tables, which causes problems.
Avoid those bogus landmines.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Enumerating processors (via MADT/_MAT) belongs in the processor core,
which is always built-in, rather than living in the processor driver
which may not be built.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Rename static get_cpu_id() to acpi_get_cpuid() and export it.
This change also gives us an opportunity to remove the
#ifndef CONFIG_SMP from processor_driver.c and into a header file
where it properly belongs.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
We've renamed the old processor_core.c to processor_driver.c, to
convey the idea that it can be built modular and has driver-like
bits.
Now let's re-create a processor_core.c for the bits needed
statically by the rest of the kernel. The contents of processor_pdc.c
are a good starting spot, so let's just rename that file and
complete our three card monte.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The ACPI processor driver can be built as a module. But it has
pieces of code that should always be built statically into the
kernel.
The plan is for processor_core.c to contain the static bits while
processor_driver.c contains the module-like bits.
Since the bulk of the code in the current processor_core.c is
module-like, first step is to rename the file to processor_driver.c
Next step will re-create processor_core.c and cherry-pick out
the static bits.
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
SuSE added these entries when deploying ACPI in Linux-2.4.
I pulled them into Linux-2.6 on 2003-08-09.
Over the last 6+ years, several entries have proven to be
unnecessary and deleted, while no new entries have been added.
Matthew suggests that they now have negative value, and I agree.
Based-on-patch-by: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for bus number resources. This is for bridges with a range of
bus numbers behind them. Previously, PNP ignored bus number resources.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for resource windows. This is for bridge resources, i.e.,
regions where a bridge forwards transactions from the primary to the
secondary side. This does not add support for *setting* windows via
the /proc interface.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for resource windows. This is for bridge resources, i.e.,
regions where a bridge forwards transactions from the primary to the
secondary side.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for bus number resources. This is for bridges with a range of
bus numbers behind them.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
No functional change; this just makes room for another resource type.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The original code returns a freed pointer. This function is expected to
return NULL on errors.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: James Morris <jmorris@namei.org>
Per ACPI spec, _ERG method should be executed before device driver
gets control for hotpluged device. Firmware might do some configuration
there. See http://bugzilla.kernel.org/show_bug.cgi?id=10805. In this
machine, _REG method of docked device will configure cardbus bridge.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Tested-by: Paul Martin <pm@debian.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Right now, if _BQC returns a value we don't understand we immediately
invalidate it. Change this behaviour so we only invalidate it if it
continues to give an invalid answer after we've already set a brightness.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This file was missed in the original patch that went into Linus' tree.
Cc: "Ben Dooks (embedded platforms)" <ben-linux@fluff.org>
Cc: linux-i2c@vger.kernel.org
Signed-off-by: Richard Röjfors <richard.rojfors@pelagicore.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: i8042 - add ALDI/MEDION netbook E1222 to qurik reset table
Input: ALPS - fix stuck buttons on some touchpads
Input: wm831x-on - convert to use genirq
Input: ads7846 - add wakeup support
Input: appletouch - fix integer overflow issue
Input: ad7877 - increase pen up imeout
Input: ads7846 - add support for AD7843 parts
Input: bf54x-keys - fix system hang when pressing a key
Input: alps - add support for the touchpad on Toshiba Tecra A11-11L
Input: remove BKL, fix input_open_file() locking
Input: serio_raw - remove BKL
Input: mousedev - remove BKL
Input: add driver for TWL4030 vibrator device
Input: enable remote wakeup for PNP i8042 keyboard ports
Input: scancode in get/set_keycodes should be unsigned
Input: i8042 - use platfrom_create_bundle() helper
Input: wacom - merge out and in prox events
Input: gamecon - fix off by one range check
Input: wacom - replace WACOM_PKGLEN_PENABLED
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: remove whitespaces before quoted newlines
nilfs2: remove spaces before tabs
nilfs2: fix various typos in comments
nilfs2: fix typo "cout" -> "count" in error message
nilfs2: fix function name typos in docbook comments
nilfs2: fix discrepancy in use of static specifier
* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
i2c-algo-bit: Add pre- and post-xfer hooks
at24: Init dynamic bin_attribute structures
i2c: Drop configure option I2C_DEBUG_CHIP
tsl2550: Move from i2c/chips to misc
i2c-i801: Don't use the block buffer for I2C block writes
i2c-powermac: Be less verbose in the absence of real errors.
i2c-smbus: Use device_lock/device_unlock
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: Skip check for mandatory locks when unlocking
9p: Fixes a simple bug enabling writes beyond 2GB.
9p: Change the name of new protocol from 9p2010.L to 9p2000.L
fs/9p: re-init the wstat in readdir loop
net/9p: Add sysfs mount_tag file for virtio 9P device
net/9p: Use the tag name in the config space for identifying mount point
Commit f56e8a076 "x86/mce: Fix RCU lockdep splats" introduced the
following build bug:
arch/x86/kernel/cpu/mcheck/mce.c: In function 'mce_log':
arch/x86/kernel/cpu/mcheck/mce.c:166: error: 'mce_read_mutex' undeclared (first use in this function)
arch/x86/kernel/cpu/mcheck/mce.c:166: error: (Each undeclared identifier is reported only once
arch/x86/kernel/cpu/mcheck/mce.c:166: error: for each function it appears in.)
Move the in-the-middle-of-file lock variable up to the variable
definition section, the top of the .c file.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-3-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>