When more than one thread accesses a mutable shared variable, all of those threads must use synchronization, or very bad things can happen. The primary means of synchronization in the Java language is the synchronized keyword (also known as intrinsic locking), which enforces mutual exclusion and ensures that the actions of a thread executing a synchronized block are visible to other threads that later execute a synchronized block protected by the same lock. When used properly, intrinsic locking can make a program thread-safe, but when locks are used to protect short code paths and threads contend for them frequently, locking can become a quite onerous operation.
In this article, we examine atomic variables. Atomic variables provide atomic read-modify-write operations that can safely update shared variables without locks. The memory semantics of atomic variables are similar to those of volatile variables, but because they can also be modified atomically, they can serve as the basis for concurrent algorithms that do not use locks.
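As a quick illustration of those semantics, the short sketch below (not one of the article's listings; the class name is illustrative) exercises java.util.concurrent.atomic.AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicBasics {
    public static void main(String[] args) {
        AtomicInteger value = new AtomicInteger(0);
        value.incrementAndGet();                      // atomic read-modify-write, no lock: 0 -> 1
        boolean swapped = value.compareAndSet(1, 5);  // succeeds only if the current value is 1
        System.out.println(swapped + ", " + value.get());  // prints "true, 5"
    }
}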
The counter in Listing 1 is thread-safe, but the performance cost of the lock bothers some developers. The lock is necessary, though, because although incrementing looks like a single operation, it is shorthand for three independent operations: fetch the value, add one to it, and write the value back. (Synchronization is also required on the getValue method, to ensure that a thread calling getValue sees the latest value. Although many developers reluctantly convince themselves that ignoring this locking requirement is acceptable, doing so is not a good strategy.) When multiple threads request the same lock at the same time, one thread wins and acquires the lock while the others are blocked. JVMs typically implement blocking by suspending the blocked thread and rescheduling it later, and the resulting context switches can cause considerable delay relative to the few instructions the lock protects.
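Listing 1 itself is not reproduced in this section; a minimal sketch of such a lock-based counter, consistent with the description above (class name illustrative), would look like this:

public class SynchronizedCounter {
    private int value;

    // Both methods synchronize on the same intrinsic lock, so the
    // fetch-add-write sequence is atomic and the latest value is visible.
    public synchronized int getValue() {
        return value;
    }

    public synchronized int increment() {
        return ++value;
    }
}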
The NonblockingCounter in Listing 2 shows the simplest nonblocking algorithm: a counter that uses the compareAndSet() (CAS) method of AtomicInteger. The compareAndSet() method says, in effect, "update this variable to the new value, but fail if any other thread has modified its value since I last looked at it." (See the related resources for more explanation of atomic variables and compare-and-set.)
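Listing 2 is likewise not reproduced here; a sketch of such a CAS-based counter, matching the description above, might read:

import java.util.concurrent.atomic.AtomicInteger;

public class NonblockingCounter {
    private AtomicInteger value = new AtomicInteger(0);

    public int getValue() {
        return value.get();
    }

    public int increment() {
        int v;
        do {
            v = value.get();                      // read the current value
        } while (!value.compareAndSet(v, v + 1)); // retry if another thread got there first
        return v + 1;
    }
}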
Atomic variable classes are called atomic because they provide fine-grained atomic updates to numbers and object references, but they are also atomic in the sense that they are the basic building blocks of nonblocking algorithms. Nonblocking algorithms have been the subject of academic research for more than twenty years, but they did not become possible in the Java language until Java 5.0. Modern processors provide special instructions that atomically update shared data and can detect interference from other threads, and compareAndSet() uses these instead of locking. (If all you want to do is increment a counter, AtomicInteger provides increment methods, but they are based on compareAndSet(), just like NonblockingCounter.increment().)

The nonblocking version has several performance advantages over the lock-based version. First, it replaces the JVM's locking code path with the hardware's native equivalent, synchronizing at a finer level of granularity (an individual memory location), and a thread that loses can retry immediately rather than being suspended and rescheduled. The finer granularity reduces the chance of contention, and the ability to retry without being rescheduled reduces the cost of contention. Even with a few failed CAS operations, this approach is still much faster than being rescheduled because of lock contention.

The NonblockingCounter example may be simple, but it demonstrates a basic characteristic of all nonblocking algorithms: some algorithmic step is executed speculatively, in the knowledge that it may have to be redone if the CAS fails. Nonblocking algorithms are often called optimistic because they proceed on the assumption that there will be no interference; if interference is detected, they back off and try again. In the counter example, the speculative step is the increment: it fetches the old value and adds one to it, hoping the value does not change while the update is being computed. If that hope is disappointed, it fetches the value again and redoes the increment.

A slightly more complicated nonblocking algorithm is the ConcurrentStack in Listing 3. Its push() and pop() operations are structurally similar to those of NonblockingCounter, speculatively doing some work and hoping that the underlying assumptions have not been invalidated when the work is "committed". The push() method observes the current top node, constructs a new node to be pushed on the stack, and then, if the topmost node has not changed since the initial observation, installs the new node. If the CAS fails, meaning another thread has modified the stack, the process starts over.

import java.util.concurrent.atomic.AtomicReference;

public class ConcurrentStack<E> {
    AtomicReference<Node<E>> head = new AtomicReference<Node<E>>();

    public void push(E item) {
        Node<E> newHead = new Node<E>(item);
        Node<E> oldHead;
        do {
            oldHead = head.get();
            newHead.next = oldHead;
        } while (!head.compareAndSet(oldHead, newHead));
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = head.get();
            if (oldHead == null)
                return null;
            newHead = oldHead.next;
        } while (!head.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }

    static class Node<E> {
        final E item;
        Node<E> next;

        public Node(E item) {
            this.item = item;
        }
    }
}

Under light to moderate contention, nonblocking algorithms tend to outperform blocking ones, because most of the time the CAS succeeds on the first try, and when it does not, the penalty involves neither thread suspension nor context switching, just a few more loop iterations.
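As a usage sketch (not from the original article; names illustrative), two threads can push onto the same ConcurrentStack concurrently with no locking at all:

public class StackDemo {
    public static void main(String[] args) throws InterruptedException {
        final ConcurrentStack<Integer> stack = new ConcurrentStack<Integer>();
        Runnable pusher = new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++)
                    stack.push(i);       // a failed CAS just means one extra loop iteration
            }
        };
        Thread t1 = new Thread(pusher);
        Thread t2 = new Thread(pusher);
        t1.start(); t2.start();
        t1.join(); t2.join();            // all 2000 pushed elements are now on the stack
    }
}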
An uncontended CAS is much cheaper than an uncontended lock acquisition (this must be true, because an uncontended lock acquisition involves a CAS plus additional bookkeeping), and a contended CAS involves a shorter delay than a contended lock acquisition. Under heavy contention (when many threads are continuously pounding on a single memory location), lock-based algorithms begin to offer better throughput than nonblocking ones, because a blocked thread stops contending and patiently waits its turn, avoiding further contention. Contention that heavy is uncommon, however, because most of the time threads interleave thread-local computation with operations on shared data, giving other threads a chance at the shared data. (Contention that heavy also suggests re-examining the algorithm and working toward less shared data.) Benchmark graphs can be confusing in this regard: when the contention in the measured program is extremely intense, locking appears to be the better solution even for a small number of threads.

The examples so far (the counter and the stack) are very simple nonblocking algorithms, and once you have seen the pattern of CAS in a loop they are easy to imitate. For more sophisticated data structures, nonblocking algorithms are much more complicated, because modifying a linked list, tree, or hash table can involve updating more than one pointer. CAS supports atomic conditional update of a single pointer, but not of two or more. So to build a nonblocking linked list, tree, or hash table, we need a way to update multiple pointers with CAS without ever leaving the data structure in an inconsistent state.

Inserting an element at the tail of a linked list typically involves updating two pointers: the "tail" pointer, which always refers to the last element in the list, and the "next" pointer of the previously last element, which must be made to refer to the newly inserted element. Because two pointers must be updated, two CASes are needed, and updating two pointers in two separate CASes raises two potential problems: what happens if the first CAS succeeds but the second fails? And what happens if another thread tries to access the list between the first and the second CAS?

The "trick" to building nonblocking algorithms for these more complicated data structures is to ensure that the data structure is always in a consistent state, even between the point where a thread starts modifying it and the point where it finishes, and to ensure that other threads can tell not only whether the first thread has finished its update or is still in the middle of it, but also what operations would be required to finish the update if the first thread went AWOL. If a thread arrives and finds the data structure in mid-update, it can "help" the thread performing the update finish, and then carry out its own operation. When the first thread comes back and tries to finish its own update, it will discover that this is no longer necessary and simply return, because the CAS detects the intervention (in this case, constructive intervention) of the helping thread. This "help thy neighbor" requirement is needed to protect the data structure against the failure of an individual thread.
If a thread that found the data structure in mid-update simply waited for the other thread to finish, it could wait forever if the other thread failed in the middle of its operation. Even in the absence of failure, this approach would perform poorly, because the newly arriving thread would have to give up the processor, incurring a context switch, or wait for its own time slice to expire (which is worse).

The LinkedQueue in Listing 4 shows the insertion operation of the Michael-Scott nonblocking queue algorithm, the algorithm implemented by ConcurrentLinkedQueue:

import java.util.concurrent.atomic.AtomicReference;

public class LinkedQueue<E> {
    private static class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next;

        Node(E item, Node<E> next) {
            this.item = item;
            this.next = new AtomicReference<Node<E>>(next);
        }
    }

    private final AtomicReference<Node<E>> head
        = new AtomicReference<Node<E>>(new Node<E>(null, null));
    // The tail starts out referring to the same dummy node as the head.
    private final AtomicReference<Node<E>> tail
        = new AtomicReference<Node<E>>(head.get());

    public boolean put(E item) {
        Node<E> newNode = new Node<E>(item, null);
        while (true) {
            Node<E> curTail = tail.get();
            Node<E> residue = curTail.next.get();
            if (curTail == tail.get()) {
                if (residue == null) /* A */ {
                    if (curTail.next.compareAndSet(null, newNode)) /* C */ {
                        tail.compareAndSet(curTail, newNode) /* D */;
                        return true;
                    }
                } else {
                    tail.compareAndSet(curTail, residue) /* B */;
                }
            }
        }
    }
}

Like many queue algorithms, an empty queue consists of a single dummy node. The head pointer always refers to the dummy node; the tail pointer always refers to either the last node or the second-to-last node. Figure 1 illustrates a queue with two elements under normal conditions.

Inserting an element involves two pointer updates, both performed via CAS: linking the new node from the current last node of the queue (C), and swinging the tail pointer around to the new last node (D). If the first of these fails, the state of the queue is unchanged, and the inserting thread retries until it succeeds. Once that step succeeds, the insertion is considered to have taken effect, and other threads can see the change. The tail pointer still has to be swung to the new node, but that step can be treated as "cleanup", because any thread that finds itself in this situation can tell whether the cleanup is needed and knows how to do it.

The queue is always in one of two states: the normal, or quiescent, state (Figure 1) or the intermediate state (Figure 2). The queue is quiescent before an insertion begins and after the second CAS (D) succeeds; after the first CAS (C) succeeds, it is in the intermediate state. In the quiescent state, the next field of the node referred to by the tail pointer is always null; in the intermediate state, it is non-null. Any thread can therefore tell which state the queue is in by checking whether tail.next is null, and this is the key to letting threads help other threads "finish" their operations.

Before inserting a new element, a thread first checks (A) whether the queue is in the intermediate state. If it is, some other thread must be in the middle of inserting an element, between its steps (C) and (D).
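Expressed as code, the state test just described could be written as the following hypothetical helper inside LinkedQueue (it is not part of Listing 4; the name is illustrative):

    // The queue is quiescent exactly when the node currently referenced
    // by the tail pointer has a null next reference.
    private boolean isQuiescent() {
        Node<E> t = tail.get();
        return t.next.get() == null;
    }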
Rather than waiting for that thread to finish, the current thread can "help" it complete the operation by moving the tail pointer forward (B). It keeps checking the tail pointer and moving it forward if necessary until the queue is quiescent, at which point it can begin its own insertion. The first CAS (C) may fail because two threads are racing for access to the current last element of the queue; in that case no change has been made, and the thread that loses the CAS reloads the tail pointer and tries again. If the second CAS (D) fails, the inserting thread does not need to retry, because another thread has already completed it on its behalf in its own step (B).

If you dig into the JVM and the operating system, you will find nonblocking algorithms everywhere. The garbage collector uses them to speed up concurrent and parallel garbage collection; the scheduler uses them to schedule threads and processes efficiently and to implement intrinsic locking. In Mustang (Java 6.0), the lock-based SynchronousQueue algorithm is replaced by a new nonblocking version. Few developers use SynchronousQueue directly, but it serves as the work queue for thread pools built with the Executors.newCachedThreadPool() factory. Benchmarks comparing cached thread pool performance show that the new nonblocking synchronous queue implementation is nearly three times as fast as the current implementation, and further improvements are planned for the release that follows Mustang, code-named Dolphin.

Nonblocking algorithms are considerably more complicated than lock-based ones. Developing nonblocking algorithms is a rather specialized discipline, and it can be extremely difficult to prove their correctness.
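For reference, a minimal sketch of using the cached thread pool mentioned above (class name and task are illustrative; the factory method is the real java.util.concurrent API):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolDemo {
    public static void main(String[] args) {
        // Internally, this pool hands tasks to worker threads through a SynchronousQueue.
        ExecutorService pool = Executors.newCachedThreadPool();
        pool.execute(new Runnable() {
            public void run() {
                System.out.println("ran on " + Thread.currentThread().getName());
            }
        });
        pool.shutdown();
    }
}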