- Threads
- => improve resource utilization
- => exploit available processing capacity
- => improve responsiveness
- => begin processing new tasks immediately while existing tasks are still running
- Techniques for analyzing, monitoring, and improving concurrent performance => complexity => increase the likelihood of safety and liveness failures.
- Safety always comes first.
- Improving performance means doing more work with fewer resources.
- e.g., CPU cycles, memory, network bandwidth, I/O bandwidth, database requests, disk space, and other resources.
- Performance limited by availability of a resource => CPU-bound, database-bound.
- Using multiple threads => performance costs like:
- Overhead associated with coordinating between threads (locking, signaling, memory synchronization).
- Increased context switching.
- Thread creation and teardown.
- Scheduling overhead.
- Using concurrency to achieve better performance.
- => utilize the processing resources we have more effectively.
- => enable our program to exploit additional processing resources if they become available.
- Application performance => service time, throughput, efficiency, scalability.
- Scalability: the ability to improve throughput or capacity when additional computing resources are added.
- Tuning for scalability => do more work with more resources.
- => often increasing the amount of work done to process each individual task.
- such as dividing tasks into multiple pipelined subtasks.
- Three-tier application model : presentation, business logic, persistence.
- For server applications, "how much" (scalability, throughput, capacity) are of greater concern than "how fast" aspects.
- For interactive applications, latency tends to be more important.
- Avoid premature optimization => optimizations are often undertaken before a clear set of requirements is available.
- The quest for performance is probably the single greatest source of concurrency bugs.
- Measure, don't guess.
- Amdahl's law: how much a program can theoretically be sped up by additional computing resources, based on the proportion of parallelizable and serial components.
- We can achieve a speedup of at most:
$$Speedup \le \frac{1}{F+\frac{(1-F)}{N}}$$
- F := the fraction of the calculation that must be executed serially
- N := the number of processors
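A quick sanity check of the formula: with 10% of the work serialized ($F = 0.1$) on $N = 10$ processors,

$$Speedup \le \frac{1}{0.1+\frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26$$

and even with infinitely many processors the speedup can never exceed $1/F = 10$.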
- Identify the sources of serialization (sketched in the worker loop below):
- synchronization to maintain the work queue's integrity in the face of concurrent access
- accessing any shared data structure => serialization
- result handling => final merge is a source of serialization
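A minimal sketch of where that serialization appears in a typical worker loop (class and field names are illustrative, not the book's listing):

```java
import java.util.concurrent.BlockingQueue;

public class WorkerThread extends Thread {
    private final BlockingQueue<Runnable> queue;

    public WorkerThread(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    public void run() {
        while (true) {
            try {
                // Serial portion: every worker synchronizes on the queue's
                // internal lock (or CAS loop) to dequeue the next task.
                Runnable task = queue.take();
                // Parallel portion: task bodies run concurrently across
                // workers, unless they touch shared state or merge results.
                task.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt and exit
                break;
            }
        }
    }
}
```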
11.2.1 Example: Serialization Hidden in Frameworks
- The synchronized `LinkedList` guards the entire queue state with a single lock that is held for the duration of the `offer` or `remove` call; `ConcurrentLinkedQueue` uses a sophisticated nonblocking queue algorithm that uses atomic references to update individual link pointers.
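A hedged sketch of the two queue implementations being contrasted (field names are illustrative):

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueChoices {
    // One intrinsic lock guards the whole list: every insert/remove holds
    // it for the duration of the call, serializing all producers and consumers.
    private final List<String> syncList =
            Collections.synchronizedList(new LinkedList<String>());

    // Nonblocking algorithm: offer/poll advance individual link pointers
    // with atomic compare-and-swap, so threads rarely block one another.
    private final Queue<String> concurrentQueue =
            new ConcurrentLinkedQueue<String>();
}
```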
- Amdahl's law => quantifies the possible speedup when more computing resources are available.
- Reducing lock granularity: lock splitting (splitting one lock into two) and lock striping (splitting one lock into many).
- lock striping seems much more promising.
- context switch: requires saving the execution context of the currently running thread and restoring the execution context of the newly scheduled thread.
- => not free; thread scheduling requires manipulating shared data structures in the OS and JVM.
- => a flurry of cache misses.
- Schedulers => give each runnable thread a certain minimum time quantum => amortizes the cost of the context switch => improving overall throughput.
- A program that does more blocking (blocking I/O, waiting for contended locks, or waiting on condition variables) => incurs more context switches than one that is CPU-bound => increasing scheduling overhead and reducing throughput.
- A good rule of thumb := a context switch costs the equivalent of 5,000 to 10,000 clock cycles, or several microseconds.
- The visibility guarantees provided by `synchronized` and `volatile` may entail using special instructions called memory barriers that can flush or invalidate caches, flush hardware write buffers, and stall execution pipelines.
- The `synchronized` mechanism is optimized for the uncontended case (`volatile` is always uncontended).
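For intuition, a common use of volatile's visibility guarantee (an assumed example, not from the source's listings): the volatile write and read below are what force the barrier traffic just described:

```java
public class ShutdownFlag {
    // volatile write/read pairs entail memory barriers, so the worker is
    // guaranteed to see the updated flag without acquiring any lock.
    private volatile boolean shutdownRequested = false;

    public void requestShutdown() {
        shutdownRequested = true; // volatile write: made visible to all threads
    }

    public void workLoop() {
        while (!shutdownRequested) { // volatile read: never served from a stale value
            doWork();
        }
    }

    private void doWork() { /* ... */ }
}
```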
- JVMs => optimize away locking that can be proven never to contend.
- JVMs => escape analysis => identify when a local object reference is never published to the heap and is therefore thread-local.
- Compilers => lock coarsening => the merging of adjacent `synchronized` blocks using the same lock (sketched below).
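A sketch of the classic illustration of these optimizations (assuming a HotSpot-style JIT): the Vector never escapes the method, so escape analysis can prove its lock thread-local, and lock coarsening could otherwise merge the adjacent acquisitions:

```java
import java.util.Vector;

public class Stooges {
    public String getStoogeNames() {
        // vec is thread-local (never published), so its lock can be proven
        // uncontended: the JVM may elide all four acquisitions (three add
        // calls plus toString), or coarsen them into a single acquire/release.
        Vector<String> vec = new Vector<String>();
        vec.add("Moe");
        vec.add("Larry");
        vec.add("Curly");
        return vec.toString();
    }
}
```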
- When locking is contended => the losing thread(s) must block.
- Spin-waiting is preferable for short waits and suspension is preferable for long waits.
- Most JVMs just suspend threads waiting for a lock.
- Suspending a thread => two additional context switches and all the attendant OS and cache activity.
- Reducing lock contention => improve both performance and scalability.
- An exclusive resource lock => access is serialized => prevents data corruption (safety) but limits scalability.
- To reduce lock contention:
- => Reduce the duration for which locks are held;
- => Reduce the frequency with which locks are requested; or
- => Replace exclusive locks with coordination mechanisms that permit greater concurrency.
- Hold locks as briefly as possible.
- Moving code that doesn't require the lock out of `synchronized` blocks => less serialized code (see the sketch below).
- Delegating all thread-safety obligations to the underlying thread-safe component.
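A minimal sketch of shrinking a synchronized block, along the lines of the book's attribute-store example (names are illustrative): only the shared-map lookup needs the lock; key construction and the regular-expression match do not:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class AttributeStore {
    private final Map<String, String> attributes = new HashMap<String, String>();

    public boolean userLocationMatches(String name, String regexp) {
        String key = "users." + name + ".location"; // no lock needed to build the key
        String location;
        synchronized (this) {
            location = attributes.get(key); // only the shared lookup is guarded
        }
        if (location == null)
            return false;
        // The potentially expensive match runs without holding the lock.
        return Pattern.matches(regexp, location);
    }
}
```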
- lock splitting and lock striping: using separate locks to guard multiple independent state variables previously guarded by a single lock.
- lock striping: partitioning locking on a variable-sized set of independent objects (see the sketch below).
- => locking the collection for exclusive access is more difficult and costly than with a single lock.
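A lock-striping sketch in the spirit of the book's striped hash map (constants and layout are illustrative): each of the 16 locks guards 1/16 of the buckets, and whole-map operations like clear must visit every stripe:

```java
public class StripedMap {
    private static final int N_LOCKS = 16;
    private final Node[] buckets;
    private final Object[] locks;

    private static class Node {
        final Object key;
        Object value;
        Node next;
        Node(Object key, Object value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    public StripedMap(int numBuckets) {
        buckets = new Node[numBuckets];
        locks = new Object[N_LOCKS];
        for (int i = 0; i < N_LOCKS; i++)
            locks[i] = new Object();
    }

    private int hash(Object key) {
        return Math.abs(key.hashCode() % buckets.length);
    }

    public Object get(Object key) {
        int h = hash(key);
        // Lock only the stripe covering this bucket, not the whole map.
        synchronized (locks[h % N_LOCKS]) {
            for (Node m = buckets[h]; m != null; m = m.next)
                if (m.key.equals(key))
                    return m.value;
        }
        return null;
    }

    public void clear() {
        // Exclusive access now means visiting every stripe; acquiring them
        // one at a time (rather than nesting) keeps the locking simple,
        // at the cost of clear() not being atomic.
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i % N_LOCKS]) {
                buckets[i] = null;
            }
        }
    }
}
```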
- hot field: a field that every mutative operation needs to access.
- For implementing `HashMap`, keeping a separate count to speed up operations like `size` and `isEmpty` works fine for a single-threaded implementation => in a concurrent one, every operation that modifies the map must now update the shared counter => much harder to improve scalability.
- `ConcurrentHashMap` instead has `size` enumerate the stripes and add up the number of elements in each stripe, rather than maintaining a global count.
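`java.util.concurrent.atomic.LongAdder` (Java 8+) packages this striped-counter idea directly; a sketch of using it to keep a hot size counter off the critical path (class name is illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class CountedMap<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<K, V>();
    // LongAdder maintains per-stripe cells that writers update independently,
    // so the counter never becomes a single contended hot field.
    private final LongAdder size = new LongAdder();

    public void put(K key, V value) {
        // ConcurrentHashMap forbids null values, so a null return means
        // the key was absent and the map grew by one entry.
        if (map.put(key, value) == null)
            size.increment(); // cheap, mostly uncontended write
    }

    public long size() {
        // Readers sum the stripes, mirroring the striped size() described
        // above; the result may be momentarily stale, which is acceptable.
        return size.sum();
    }
}
```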
- Concurrency-friendly means of managing shared state:
- using the concurrent collections, read-write locks, immutable objects and atomic variables.
- `ReadWriteLock` enforces a multiple-reader, single-writer locking discipline.
- Atomic variables reduce the cost of updating "hot fields".
- such as statistics counters, sequence generators, or the reference to the first node in a linked data structure.
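A minimal read-write lock sketch (names are illustrative): any number of readers may hold the read lock concurrently, while a writer waits for exclusive access:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadMostlyCache<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        lock.readLock().lock(); // many readers can proceed at once
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(K key, V value) {
        lock.writeLock().lock(); // writers get exclusive access
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```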
- Goal => keep the processors fully utilized.
- Insufficient load.
- You can test by increasing the load and measuring changes in utilization, response time, or service time.
- I/O-bound.
- You can determine whether an application is disk-bound using `iostat` or `perfmon`, and whether it is bandwidth-limited by monitoring traffic levels on your network.
- Externally bound.
- You can test by using a profiler or database administration tools to determine how much time is being spent waiting for answers from the external service.
- Lock contention.
- Profiling tools can tell you how much lock contention your application is experiencing and which locks are "hot".
- If the CPUs are being kept sufficiently hot, you can use monitoring tools to infer whether the application would benefit from additional CPUs.
- Allocating objects is usually cheaper than synchronizing.
- Request an object from a pool => some synchronization is necessary to coordinate access to the pool data structure => potential to block.
- Once contention becomes significant, time per operation is dominated by context switch and scheduling delays, and adding more threads has little effect on throughput.
- Example: logging can block request threads in a server application.
- logging service time := I/O streaming time + blocking time.
- multiple threads logging simultaneously => contention for the output stream lock => threads blocked and switched out.
- longer request service time => more lock contention => undesirable.
- Moving the I/O out of the request-processing thread => shortens the mean service time for request processing, although it introduces the possibility of contention for the message queue (sketched below).
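A minimal sketch of that design, in the spirit of the book's LogWriter (shutdown and overflow handling omitted): request threads contend only briefly on the queue, and a single background thread owns the stream:

```java
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogWriter {
    private final BlockingQueue<String> queue =
            new LinkedBlockingQueue<String>(1000); // bounded: applies back pressure
    private final Thread logger;

    public LogWriter(final PrintWriter writer) {
        logger = new Thread(new Runnable() {
            public void run() {
                try {
                    // Only this thread touches the stream, so the slow I/O
                    // never happens while holding a lock request threads need.
                    while (true)
                        writer.println(queue.take());
                } catch (InterruptedException e) {
                    // shutdown protocol (draining remaining entries) omitted
                } finally {
                    writer.close();
                }
            }
        });
    }

    public void start() {
        logger.start();
    }

    public void log(String msg) throws InterruptedException {
        queue.put(msg); // brief queue contention replaces long stream contention
    }
}
```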