Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt

The Documentation/memory-barriers.txt file was written before the need
for ACCESS_ONCE() was fully appreciated.  It therefore contains no
ACCESS_ONCE() calls, which can be a problem when people lift examples
from it.  This commit therefore adds ACCESS_ONCE() calls.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386799151-2219-1-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11 13:59:04 -08:00 · 2013-12-11 13:59:04 -08:00 · 2ecf810121
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU:
 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

-	Q = P; D = *Q;
+	ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q);

     the CPU will issue the following memory operations:

 	Q = LOAD P, D = LOAD *Q

-     and always in that order.
+     and always in that order.  On most systems, smp_read_barrier_depends()
+     does nothing, but it is required for DEC Alpha.  The ACCESS_ONCE()
+     is required to prevent compiler mischief.  Please note that you
+     should normally use something like rcu_dereference() instead of
+     open-coding smp_read_barrier_depends().

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

-	a = *X; *X = b;
+	a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b;

     the CPU will only issue the following sequence of memory operations:

@@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU:

     And for:

-	*X = c; d = *X;
+	ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X);

     the CPU will only issue:

@@ -224,6 +228,41 @@ There are some minimal guarantees that may be expected of a CPU:

 And there are a number of things that _must_ or _must_not_ be assumed:

+ (*) It _must_not_ be assumed that the compiler will do what you want with
+     memory references that are not protected by ACCESS_ONCE().  Without
+     ACCESS_ONCE(), the compiler is within its rights to do all sorts
+     of "creative" transformations:
+
+     (-) Repeat the load, possibly getting a different value on the second
+         and subsequent loads.  This is especially prone to happen when
+	 register pressure is high.
+
+     (-) Merge adjacent loads and stores to the same location.  The most
+         familiar example is the transformation from:
+
+		while (a)
+			do_something();
+
+         to something like:
+
+		if (a)
+			for (;;)
+				do_something();
+
+         Using ACCESS_ONCE() as follows prevents this sort of optimization:
+
+		while (ACCESS_ONCE(a))
+			do_something();
+
+     (-) "Store tearing", where a single store in the source code is split
+         into smaller stores in the object code.  Note that gcc really
+	 will do this on some architectures when storing certain constants.
+	 It can be cheaper to do a series of immediate stores than to
+	 form the constant in a register and then to store that register.
+
+     (-) "Load tearing", which splits loads in a manner analogous to
+     	 store tearing.
+
 (*) It _must_not_ be assumed that independent loads and stores will be issued
     in the order given.  This means that for:

@@ -455,8 +494,8 @@ following sequence of events:
 	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
 	B = 4;
 	<write barrier>
-	P = &B
-			Q = P;
+	ACCESS_ONCE(P) = &B
+			      Q = ACCESS_ONCE(P);
 			      D = *Q;

 There's a clear data dependency here, and it would seem that by the end of the
@@ -482,8 +521,8 @@ between the address load and the data load:
 	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
 	B = 4;
 	<write barrier>
-	P = &B
-			Q = P;
+	ACCESS_ONCE(P) = &B
+			      Q = ACCESS_ONCE(P);
 			      <data dependency barrier>
 			      D = *Q;

@@ -509,16 +548,17 @@ access:
 	{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
 	M[1] = 4;
 	<write barrier>
-	P = 1
-			Q = P;
+	ACCESS_ONCE(P) = 1
+			      Q = ACCESS_ONCE(P);
 			      <data dependency barrier>
 			      D = M[Q];


-The data dependency barrier is very important to the RCU system, for example.
-See rcu_dereference() in include/linux/rcupdate.h.  This permits the current
-target of an RCU'd pointer to be replaced with a new modified target, without
-the replacement target appearing to be incompletely initialised.
+The data dependency barrier is very important to the RCU system,
+for example.  See rcu_assign_pointer() and rcu_dereference() in
+include/linux/rcupdate.h.  This permits the current target of an RCU'd
+pointer to be replaced with a new modified target, without the replacement
+target appearing to be incompletely initialised.

 See also the subsection on "Cache Coherency" for a more thorough example.

@@ -530,22 +570,23 @@ A control dependency requires a full read memory barrier, not simply a data
 dependency barrier to make it work correctly.  Consider the following bit of
 code:

-	q = &a;
+	q = ACCESS_ONCE(a);
 	if (p) {
 		<data dependency barrier>
-		q = &b;
+		q = ACCESS_ONCE(b);
 	}
 	x = *q;

 This will not have the desired effect because there is no actual data
-dependency, but rather a control dependency that the CPU may short-circuit by
-attempting to predict the outcome in advance.  In such a case what's actually
-required is:
+dependency, but rather a control dependency that the CPU may short-circuit
+by attempting to predict the outcome in advance, so that other CPUs see
+the load from b as having happened before the load from a.  In such a
+case what's actually required is:

-	q = &a;
+	q = ACCESS_ONCE(a);
 	if (p) {
 		<read barrier>
-		q = &b;
+		q = ACCESS_ONCE(b);
 	}
 	x = *q;

@@ -563,11 +604,11 @@ write barrier, though, again, a general barrier is viable:

 	CPU 1		      CPU 2
 	===============	      ===============
-	a = 1;
+	ACCESS_ONCE(a) = 1;
 	<write barrier>
-	b = 2;		x = b;
+	ACCESS_ONCE(b) = 2;   x = ACCESS_ONCE(b);
 			      <read barrier>
-			y = a;
+			      y = ACCESS_ONCE(a);

 Or:

@@ -575,7 +616,7 @@ Or:
 	===============	      ===============================
 	a = 1;
 	<write barrier>
-	b = &a;		x = b;
+	ACCESS_ONCE(b) = &a;  x = ACCESS_ONCE(b);
 			      <data dependency barrier>
 			      y = *x;

@@ -587,12 +628,12 @@ match the loads after the read barrier or the data dependency barrier, and vice
 versa:

 	CPU 1                               CPU 2
-	===============                 ===============
-	a = 1;           }----   --->{  v = c
-	b = 2;           }    \ /    {  w = d
+	===================                 ===================
+	ACCESS_ONCE(a) = 1;  }----   --->{  v = ACCESS_ONCE(c);
+	ACCESS_ONCE(b) = 2;  }    \ /    {  w = ACCESS_ONCE(d);
 	<write barrier>            \        <read barrier>
-	c = 3;           }    / \    {  x = a;
-	d = 4;           }----   --->{  y = b;
+	ACCESS_ONCE(c) = 3;  }    / \    {  x = ACCESS_ONCE(a);
+	ACCESS_ONCE(d) = 4;  }----   --->{  y = ACCESS_ONCE(b);


 EXAMPLES OF MEMORY BARRIER SEQUENCES
@@ -1435,12 +1476,12 @@ three CPUs; then should the following sequence of events occur:

 	CPU 1				CPU 2
 	===============================	===============================
-	*A = a;				*E = e;
+	ACCESS_ONCE(*A) = a;		ACCESS_ONCE(*E) = e;
 	LOCK M				LOCK Q
-	*B = b;				*F = f;
-	*C = c;				*G = g;
+	ACCESS_ONCE(*B) = b;		ACCESS_ONCE(*F) = f;
+	ACCESS_ONCE(*C) = c;		ACCESS_ONCE(*G) = g;
 	UNLOCK M			UNLOCK Q
-	*D = d;				*H = h;
+	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*H) = h;

 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
@@ -1460,17 +1501,17 @@ However, if the following occurs:

 	CPU 1				CPU 2
 	===============================	===============================
-	*A = a;
+	ACCESS_ONCE(*A) = a;
 	LOCK M		     [1]
-	*B = b;
-	*C = c;
+	ACCESS_ONCE(*B) = b;
+	ACCESS_ONCE(*C) = c;
 	UNLOCK M	     [1]
-	*D = d;				*E = e;
+	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*E) = e;
 					LOCK M		     [2]
-					*F = f;
-					*G = g;
+					ACCESS_ONCE(*F) = f;
+					ACCESS_ONCE(*G) = g;
 					UNLOCK M	     [2]
-					*H = h;
+					ACCESS_ONCE(*H) = h;

 CPU 3 might see:

@@ -2177,11 +2218,11 @@ A programmer might take it for granted that the CPU will perform memory
 operations in exactly the order specified, so that if the CPU is, for example,
 given the following piece of code to execute:

-	a = *A;
-	*B = b;
-	c = *C;
-	d = *D;
-	*E = e;
+	a = ACCESS_ONCE(*A);
+	ACCESS_ONCE(*B) = b;
+	c = ACCESS_ONCE(*C);
+	d = ACCESS_ONCE(*D);
+	ACCESS_ONCE(*E) = e;

 they would then expect that the CPU will complete the memory operation for each
 instruction before moving on to the next one, leading to a definite sequence of
@@ -2228,12 +2269,12 @@ However, it is guaranteed that a CPU will be self-consistent: it will see its
 _own_ accesses appear to be correctly ordered, without the need for a memory
 barrier.  For instance with the following code:

-	U = *A;
-	*A = V;
-	*A = W;
-	X = *A;
-	*A = Y;
-	Z = *A;
+	U = ACCESS_ONCE(*A);
+	ACCESS_ONCE(*A) = V;
+	ACCESS_ONCE(*A) = W;
+	X = ACCESS_ONCE(*A);
+	ACCESS_ONCE(*A) = Y;
+	Z = ACCESS_ONCE(*A);

 and assuming no intervention by an external influence, it can be assumed that
 the final result will appear to be:
@@ -2250,7 +2291,12 @@ accesses:

 in that order, but, without intervention, the sequence may have almost any
 combination of elements combined or discarded, provided the program's view of
-the world remains consistent.
+the world remains consistent.  Note that ACCESS_ONCE() is -not- optional
+in the above example, as there are architectures where a given CPU might
+interchange successive loads to the same location.  On such architectures,
+ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
+Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
+special ld.acq and st.rel instructions that prevent such reordering.

 The compiler may also combine, discard or defer elements of the sequence before
 the CPU even sees them.
@@ -2264,13 +2310,13 @@ may be reduced to:

 	*A = W;

-since, without a write barrier, it can be assumed that the effect of the
-storage of V to *A is lost.  Similarly:
+since, without either a write barrier or an ACCESS_ONCE(), it can be
+assumed that the effect of the storage of V to *A is lost.  Similarly:

 	*A = Y;
 	Z = *A;

-may, without a memory barrier, be reduced to:
+may, without a memory barrier or an ACCESS_ONCE(), be reduced to:

 	*A = Y;
 	Z = Y;
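
Note for readers lifting examples from this patch: the following is a minimal
user-space sketch of the while-loop case discussed above, not kernel code.  It
assumes ACCESS_ONCE() is a volatile cast of the form shown (the patch text
itself refers to "the volatile casts used by ACCESS_ONCE()"), spelled here with
GNU C's __typeof__.  The names flag, do_something() and spin_while_set() are
hypothetical, and the demo is deliberately single-threaded so that it
terminates deterministically.

```c
/*
 * Assumed user-space stand-in for the kernel macro: a volatile cast that
 * forces the compiler to emit exactly one access per use.  Requires GNU C
 * (gcc or clang) for __typeof__.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical shared variable; in real code another CPU would modify it. */
int flag = 3;

/* Hypothetical unit of work; here it also counts flag down toward zero. */
void do_something(void)
{
	ACCESS_ONCE(flag) = ACCESS_ONCE(flag) - 1;
}

/*
 * Because flag is read through ACCESS_ONCE(), the compiler must reload it
 * on every iteration and may not hoist the test out of the loop (the
 * "if (a) for (;;)" transformation).  Returns the number of iterations.
 */
int spin_while_set(void)
{
	int iterations = 0;

	while (ACCESS_ONCE(flag)) {
		do_something();
		iterations++;
	}
	return iterations;
}
```

Calling spin_while_set() once runs the loop until flag reaches zero.  Note
that the volatile cast constrains only the compiler; it does not order
accesses between CPUs, which still requires the barriers described in the
document.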