This blog summarizes key learnings from Systems Performance by Brendan Gregg, which covers concepts, tools, and performance tuning for operating systems and applications.
1. Introduction
This chapter introduces systems performance, spanning all major software and hardware components, with the goals of improving the end-user experience by reducing latency and lowering computing cost. The author describes challenges in performance engineering such as subjectivity in the absence of clear goals, complexity that requires a holistic approach, multiple causes, and multiple coexisting performance issues. The author then defines key performance concepts:
- Latency: a measure of time spent waiting; because it can be converted into expected time saved, it allows the maximum possible speedup to be estimated.
- Observability: refers to understanding a system through observation and includes tools that use counters, profiling, and tracing. It relies on counters, statistics, and metrics, which monitoring software can use to trigger alerts. Profiling performs sampling to paint a coarse picture of the target. Tracing is event-based recording, where event data is captured via static or dynamic instrumentation and saved for later analysis. The latest dynamic tracing tools are built on extended BPF (Berkeley Packet Filter), also referred to as eBPF.
- Experimentation tools: benchmark tools that test a specific component; e.g., the following measures TCP network throughput:
iperf -c 192.168.1.101 -i 1 -t 10
The chapter also describes common Linux tools for analysis such as:
- dmesg -T | tail
- vmstat -SM 1
- mpstat -P ALL 1
- pidstat 1
- iostat -sxz 1
- free -m
- sar -n DEV 1
- sar -n TCP,ETCP 1
2. Methodologies
This chapter introduces common performance concepts such as IOPS, throughput, response time, latency, utilization, saturation, bottleneck, workload, and cache. It defines models of system performance such as the System Under Test and the Queueing System.
The chapter continues by defining concepts such as latency, time scales, trade-offs, tuning efforts, load vs architecture, scalability, and metrics. It defines time-based utilization as:
U = B / T
where U = utilization, B = total-busy-time, T = observation period
and in terms of capacity, e.g.
U = % used capacity
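As a minimal worked example of the time-based definition, a small Python sketch (the numbers here are hypothetical):
# Time-based utilization: U = B / T (hypothetical numbers for illustration)
busy_time_s = 18.0       # B: total time the resource was busy
observation_s = 60.0     # T: observation period
utilization = busy_time_s / observation_s
print(f"U = {utilization:.0%}")   # U = 30%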
The chapter defines saturation as the degree to which a resource has queued work it cannot service, and describes caching considerations for improving performance such as the hit ratio, cold cache, warm cache, and hot cache.
The author then describes analysis perspectives: resource analysis, which begins with the system's resources and is used for investigating performance issues and capacity planning; and workload analysis, which examines the performance of applications and is used for identifying and confirming latency issues. Next, the author describes the following methodologies for analysis:
- Streetlight Anti-Method: the absence of a deliberate methodology; the user analyzes with familiar tools, which can be hit or miss.
- Random Change Anti-Method: the user randomly guesses where the problem may be and changes things until it goes away.
- Blame-Someone-Else Anti-Method: In this approach, user blames someone else and redirects the issue to another team.
- Ad Hoc Checklist Method: It’s a common methodology where a user uses an ad hoc list built from recent experience.
- Problem Statement: defines the problem by asking questions such as: has there been a performance issue before, what was changed recently, and who is affected?
- Scientific Method: This approach is summarized as: Question -> Hypothesis -> Prediction -> Test -> Analysis.
- Diagnostic Cycle: This is defined as hypothesis -> instrumentation -> data -> hypothesis.
- Tools Method: lists available performance tools, gathers metrics from each tool, and then interprets the metrics collected.
- Utilization, Saturation, and Errors (USE) Method: This method focuses on system resources and checks utilization, saturation and errors for each resource.
- RED Method: This approach checks request rate, errors, and duration for every service.
- Workload Characterization: answers who is causing the load, why the load is being generated, and what the load characteristics are.
- Drill-Down Analysis: defines a three-stage drill-down methodology for a system resource: monitoring, identification, and analysis.
- Latency Analysis: This approach examines the time taken to complete an operation and then breaks it into smaller components, continuing to subdivide the components.
- Method R: a methodology developed for Oracle databases that focuses on finding the origin of latency.
Modeling
The chapter then defines analytical modeling of a system using the following techniques:
- Enterprise vs Cloud
- Visual Identification uses graphs to identify patterns such as linear scalability, contention, coherence (the propagation of changes), the knee point (where performance stops scaling linearly), and a scalability ceiling.
- Amdahl's Law of Scalability describes contention for a serial resource:
C(N) = N / (1 + a(N - 1))
where C(N) is relative capacity, N is the scaling dimension such as the number of CPUs, and a is the degree of seriality.
- Universal Scalability Law is described as:
C(N) = N / (1 + a(N - 1) + bN(N - 1))
where b is the coherence parameter; when b = 0, this becomes Amdahl's Law.
- Queuing Theory describes Little’s Law as:
L = lambda * W
where L is average number of requests in the system, lambda is average arrival rate, and W is average request time.
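As a sketch, the two scalability laws and Little's Law can be compared in Python with hypothetical parameters (a for seriality, b for coherence):
# Amdahl's Law: capacity is capped by the serial fraction
def amdahl(N, a):
    return N / (1 + a * (N - 1))

# Universal Scalability Law: adds a coherence penalty that grows with N^2
def usl(N, a, b):
    return N / (1 + a * (N - 1) + b * N * (N - 1))

for N in (1, 8, 32, 128):
    print(N, round(amdahl(N, 0.05), 1), round(usl(N, 0.05, 0.001), 1))
# USL capacity peaks and then retrogrades as the coherence cost dominates.

# Little's Law: L = lambda * W (assumed arrival rate and request time)
arrival_rate = 200         # lambda: requests/s
avg_request_time = 0.05    # W: seconds
print(arrival_rate * avg_request_time)   # L = 10 requests in the system on average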
Capacity Planning
This section describes capacity planning: examining how a system will handle load and scale as load increases. It searches for the resource that will become the bottleneck under load, including hardware and software components, and then applies factor analysis to determine which factors to change to achieve the desired performance.
Statistics
This section reviews how to use statistics for analysis such as:
- Quantifying Performance Gains: using observation-based and experimentation-based techniques to compare performance improvements.
- Averages: including the geometric mean (the nth root of multiplied values), harmonic mean (the count of values divided by the sum of their reciprocals), average over time, and decayed average (recent time weighted more heavily).
- Standard Deviation, Percentile, Median
- Coefficient of Variation
- Multimodal Distribution
Monitoring
Monitoring records performance statistics over time for comparison and identification using various time-based patterns such as hourly, daily, weekly, and yearly.
Visualization
This section examines various visualizations such as line charts, scatter plots, heat maps, timeline charts, and surface plots.
3. Operating Systems
This chapter examines the operating system and kernel for systems performance analysis, defining concepts such as:
Background
Kernel
The kernel is the core of the operating system. Linux and the BSDs have a monolithic kernel; other kernel models include microkernels, unikernels, and hybrid kernels. In addition, newer Linux versions include extended BPF for enabling secure kernel-mode applications.
Kernel and User Modes
The kernel runs in kernel mode to access devices and execute privileged instructions. User applications run in user mode, where they request privileged operations through system calls such as ioctl, mmap, brk, and futex.
Interrupts
An interrupt is a signal to the processor that some event has occurred and needs processing; it interrupts the processor's current execution and runs an interrupt service routine to handle the event. Interrupts can be asynchronous, for handling interrupt requests (IRQs) from hardware devices, or synchronous, generated by software instructions such as traps, exceptions, and faults.
Clocks and Idle
In old kernel implementations, tick latency and tick overhead caused some performance issues, but modern implementations have moved much functionality out of the clock routine to on-demand interrupts, creating tickless kernels that improve power efficiency.
Processes
A process is an environment for executing a user-level program and consists of a memory address space, file descriptors, thread stacks, and registers. A process contains one or more threads, where each thread has a stack, registers, and an instruction pointer (PC). Processes are normally created using the fork system call (which on Linux wraps the clone syscall), followed by exec/execve to run a different program.
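A minimal sketch of this fork/exec pattern, using Python's os module on Linux:
import os

pid = os.fork()                    # create a child process (wraps clone on Linux)
if pid == 0:
    # child: replace its address space with a new program image
    os.execv("/bin/echo", ["echo", "hello from the child"])
else:
    _, status = os.waitpid(pid, 0)   # parent: reap the child
    print("child", pid, "exited with code", os.waitstatus_to_exitcode(status))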
Stack
A stack is a memory storage area for temporary data, organized as a last-in, first-out (LIFO) list. It is used to store the return address when calling a function and to pass parameters, which together form a stack frame. The call path can be seen by examining the saved return addresses across all stack frames, which is called a stack trace or stack back trace. While executing a system call, a process thread has two stacks: a user-level stack and a kernel-level stack.
Virtual Memory
Virtual memory is an abstraction of main memory, providing processes and the kernel with their own private view of main memory. It supports multitasking of threads and over-subscription of main memory. The kernel manages memory using process-swapping and paging schemes.
Schedulers
The scheduler schedules processes on processors by dividing CPU time among the active processes and threads. It tracks all threads in the ready-to-run state on priority run queues, where process priority can be modified to improve the performance of a workload. Workloads are classified as either CPU-bound or I/O-bound, and the scheduler may decrease the priority of CPU-bound processes to allow I/O-bound workloads to run sooner.
File Systems
File systems are an organization of data as files and directories. The virtual file system (VFS) abstracts file system types so that multiple file systems may coexist.
Kernels
This section discusses Unix-like kernel implementations with a focus on performance, such as Unix, BSD, and Solaris. In the context of Linux, it describes systemd, a commonly used service manager that replaces the original UNIX init system, and extended BPF, which can be used for networking, observability, and security. BPF programs run in kernel mode and are configured to run on events such as USDT probes, kprobes, uprobes, and perf_events.
4. Observability Tools
This chapter covers observability tool types such as counters, profiling, and tracing, along with static performance tools and crisis tools, including their overhead.
Tools Coverage
This section describes static performance tools like sysctl, dmesg, lsblk, mdadm, ldd, tc, etc., and crisis tools like vmstat, ps, dmesg, lscpu, iostat, mpstat, pidstat, sar, and more.
Tools Type
Observability tools can be categorized as system-wide or per-process, and as counter-based or event-based. For example, top shows a system-wide summary; ps and pmap report per-process; and profilers and tracers are event-based tools. The kernel maintains various counters that are incremented when events occur, such as network packets received and disk I/O completions. Profiling collects a set of samples, such as CPU usage, at a fixed rate or based on untimed hardware events; perf and profile are system-wide profilers, while gprof and cachegrind are per-process profilers. Tracing instruments every occurrence of an event and can store event-based details for later analysis. Examples of system-wide tracing tools include tcpdump, biosnoop, execsnoop, perf, Ftrace, and bpftrace; per-process tracing tools include strace and gdb. Monitoring records statistics continuously for later analysis; sar, Prometheus, and collectd are common monitoring tools.
Observability Sources
The main sources of systems performance statistics are /proc and /sys. /proc is a file system for kernel statistics and is created dynamically by the kernel. For example, ls -F /proc/123
lists per-process statistics for the process with PID 123, such as limits, maps, sched, smaps, stat, status, cgroup, and task. Similarly, ls -Fd /proc/[a-z]*
lists system-wide statistics such as cpuinfo, diskstats, loadavg, meminfo, schedstat, and zoneinfo. /sys was originally designed for device-driver statistics but has been extended to other types of statistics, e.g., find /sys/devices/system/cpu/cpu0 -type f
provides information about CPU caches. Tracepoints are a Linux kernel event source based on static instrumentation that provides insight into kernel behavior. For example, perf list tracepoint
lists available tracepoints, and perf trace -e block:block_rq_issue
traces events. Tracing tools themselves use these sources, e.g., strace -e openat ~/iosnoop
shows the trace files that iosnoop opens, and strace -e perf_event_open ~/biosnoop
shows biosnoop using the perf_event_open syscall. kprobes are a kernel event source based on dynamic instrumentation that can trace function entries and instructions, e.g., bpftrace -e 'kprobe:do_nanosleep { printf("sleep %s\n", comm); }'
. uprobes are a user-space event source for dynamic instrumentation, e.g., bpftrace -l 'uprobe:/bin/bash:*'
. User-level statically-defined tracing (USDT) is the user-space version of tracepoints, and some libraries and applications have added USDT probes, e.g., bpftrace -lv 'usdt:/openjdk/libjvm.so:*'
. Hardware counters (PMCs) are used for observing activity in processors and other devices, e.g., perf stat gzip words
instruments the architectural PMCs.
Sar
Sar is a key monitoring tool that is provided via the sysstat package, e.g., sar -u -n TCP,ETCP
reports CPU and TCP statistics.
5. Applications
This chapter describes performance tuning objectives, application basics, fundamentals for application performance, and strategies for application performance analysis.
Application Basics
This section defines performance goals, including lowering latency, increasing throughput, improving resource utilization, and lowering computing costs. Some companies use a target application performance index (Apdex) as an objective and as a metric to monitor:
Apdex = (satisfactory + 0.5 x tolerable + 0 x frustrating) / total-events
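A worked example of the Apdex formula in Python, with hypothetical event counts:
satisfactory, tolerable, frustrating = 850, 100, 50
total_events = satisfactory + tolerable + frustrating
apdex = (satisfactory + 0.5 * tolerable + 0 * frustrating) / total_events
print(f"Apdex = {apdex:.2f}")   # Apdex = 0.90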
Application Performance Techniques
This section describes common techniques for improving application performance, such as increasing I/O size to improve throughput, caching the results of commonly performed operations, using ring buffers for continuous transfer between components, and using event notifications instead of polling.
It then describes concurrency, the ability to load and begin executing multiple runnable programs whose execution may overlap, and recommends parallelism via multiple processes or threads to take advantage of multiprocessor systems. Multiprocess and multithreaded applications rely on the CPU scheduler, at the cost of context-switch overhead. Alternatively, user-mode applications may implement their own scheduling mechanisms, such as fibers (lightweight threads), co-routines (more lightweight than fibers, e.g., goroutines in Go), and event-based concurrency (as in Node.js). The common models of user-mode multithreaded programming are: a service thread pool, a CPU thread pool, and a staged event-driven architecture (SEDA).
To protect the integrity of shared memory when it is accessed from multiple threads, applications can use mutexes, spin locks, RW locks, and semaphores, as sketched below. Implementations of these synchronization primitives may use a fastpath (using cmpxchg to set the owner), a midpath (optimistic spinning), a slowpath (block and deschedule the thread), or read-copy-update (RCU) mechanisms, depending on the concurrency use case. To avoid the cost of creating and destroying mutex locks, implementations may use a hash table to store a set of mutex locks, instead of a global mutex lock for all data structures or a mutex lock for every data structure. Further, non-blocking I/O allows issuing I/O operations asynchronously without blocking the thread, using the O_ASYNC flag to open, io_submit, sendfile, and io_uring_enter.
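As a minimal illustration of mutex-protected shared state (in Python for brevity; the book discusses these primitives at the C and kernel level):
import threading

counter = 0
lock = threading.Lock()            # a mutex guarding the shared counter

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:                 # critical section: acquire, update, release
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                     # 400000; without the lock, updates may be lost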
Programming Languages
This section describes compiled languages, compiler optimization flags, interpreted languages, virtual machines, and garbage collection.
Methodology
This section describes methodologies for application analysis and tuning using:
- CPU profiling and visualizing via CPU flame graphs.
- off-CPU analysis using sampling, scheduler tracing, and application instrumentation; wait time in off-CPU flame graphs may be difficult to interpret, so zooming or kernel filtering may be required.
- Syscall analysis instruments system calls to study resource-based performance issues; targets include new process tracing, I/O profiling, and kernel time analysis.
- USE method
- Thread state analysis like user, kernel, runnable, swapping, disk I/O, net I/O, sleeping, lock, idle.
- Lock analysis
- Static performance tuning
- Distributed tracing
Observability Tools
This section introduces application performance observability tools:
- perf is the standard Linux profiler with many uses, such as:
- CPU Profiling:
perf record -F 49 -a -g -- sleep 30
perf script --header > out.stacks
- CPU Flame Graphs:
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --hash > out.svg
- Syscall Tracing:
perf trace -p $(pgrep mysqld)
- Kernel Time Analysis:
perf trace -s -p $(pgrep mysqld)
- profile is a timer-based CPU profiler from BCC, e.g.,
profile -F 49 10
- offcputime and bpftrace summarize time spent by threads blocked and off-CPU, e.g.,
offcputime 5
- strace is the Linux system call tracer, e.g.,
strace -ttt -T -p 123
- execsnoop traces new process execution system-wide.
- syscount counts system calls system-wide.
- bpftrace is a BPF-based tracer with a high-level programming language, e.g.,
bpftrace -e 't:syscalls:sys_enter_kill { time("%H:%M:%S "); }'
6. CPU
This chapter provides a basis for CPU analysis:
Models
This section describes the CPU architecture and memory caches such as CPU registers and the L1, L2, and L3 caches. It then describes run queues, which hold software threads that are ready to run; time spent waiting on a CPU run queue is called run-queue latency or dispatcher-queue latency.
Concepts
This section describes concepts regarding CPU performance including:
- Clock Rate: Each CPU instruction may take one or more clock cycles to execute.
- Instructions: CPUs execute instructions chosen from their instruction set.
- Instruction Pipeline: Allows multiple instructions to execute in parallel by running different components of different instructions at the same time. Modern processors may implement branch prediction to keep the pipeline full, executing instructions out of order.
- Instruction Width: A superscalar CPU architecture allows multiple instructions to make progress in each clock cycle, depending on the instruction width.
- SMT: Simultaneous multithreading makes use of a superscalar architecture and hardware multithreading support to improve parallelism.
- IPC, CPI: Instructions per cycle (IPC) describes how a CPU is spending its clock cycles; its inverse is cycles per instruction (CPI).
- Utilization: CPU utilization is measured by the time a CPU instance is busy performing work during an interval.
- User Time/Kernel Time: The CPU time spent executing user-level software is called user time; the time spent executing kernel-level software is kernel time.
- Saturation: A CPU at 100% utilization is saturated, and threads will encounter scheduler latency as they wait to run on-CPU.
- Priority Inversion: Occurs when a lower-priority thread holds a resource, blocking a higher-priority thread from running.
- Multiprocess, Multithreading: Multithreading is generally considered superior.
Architecture
This section describes CPU architecture and implementation:
Hardware
CPU hardware includes the processor and its subsystems:
- Processor: The processor components include P-cache (prefetch-cache), W-cache (write-cache), Clock, Timestamp counter, Microcode ROM, Temperature sensor, and network interfaces.
- P-States and C-States: The advanced configuration and power interface (ACPI) defines P-states, which provide different levels of performance during execution, and C-states, which provide different idle states when execution is halted, saving power.
- CPU caches: These include multiple levels, such as the level-1 instruction cache, level-1 data cache, translation lookaside buffer (TLB), level-2 cache, and level-3 cache. Multiple levels of cache are used to deliver the optimal configuration of size and latency.
- Associativity: Describes a constraint on locating new entries, such as fully associative (e.g., with LRU eviction), direct-mapped, where each entry has only one valid location in the cache, and set-associative, where a subset of the cache is identified by mapping; e.g., four-way set-associative maps an address to four possible locations.
- Cache Line: Cache line size is a range of bytes that are stored and transferred as a unit.
- Cache Coherence: Cache coherence ensures that CPUs are always accessing the correct state of memory.
- MMU: The memory management unit (MMU) is responsible for virtual-to-physical address translation.
- Hardware Counters (PMC): PMCs are processor registers implemented in hardware and include CPU cycles, CPU instructions, Level 1, 2, 3 cache accesses, floating-point unit, memory I/O, and resource I/O.
- GPU: GPUs support graphical displays.
- Software: Kernel software includes the scheduler, which performs time sharing, preemption, and load balancing. The scheduler uses scheduling classes to manage the behavior of runnable threads, such as priorities and scheduling policies. The scheduling classes for Linux kernels include RT (fixed and high priorities for real-time workloads), O(1) (for reduced latency), CFS (completely fair scheduling), Idle, and Deadline. Scheduler policies include RR (round-robin), FIFO, NORMAL, BATCH, IDLE, and DEADLINE (see the sketch below).
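A minimal sketch of inspecting and changing the scheduling policy from Python (Linux-only; switching to a real-time policy such as SCHED_FIFO typically requires root or CAP_SYS_NICE):
import os

print("policy:", os.sched_getscheduler(0))     # 0 == SCHED_OTHER (NORMAL)
try:
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(10))
    print("now SCHED_FIFO at fixed priority 10")
except PermissionError:
    print("insufficient privileges for a real-time policy")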
Methodology
This section describes various methodologies for CPU analysis and tuning such as:
- Tools Method: iterate over available tools like uptime, vmstat, mpstat, perf/profile, showboost/turbostat, and dmesg.
- USE Method: It checks for utilization, saturation, and errors for each CPU.
- Workload Characterization like CPU load average, user-time to system-time ratio, syscall rate, voluntary context switch rate, and interrupt rate.
- Profiling: CPU profiling can be performed by time-based sampling or function tracing.
- Cycle Analysis: Using performance monitoring counters (PMCs) to understand CPU utilization at the cycle level.
- Performance Monitoring: identifies active issues and patterns over time using metrics for CPU like utilization and saturation.
- Static Performance Tuning
- Priority Tuning
- CPU Binding
Observability Tools
This section introduces CPU performance observability tools such as:
- uptime
- load average – an exponentially damped moving average of load, including current resource usage plus queued requests (saturation).
- pressure stall information (PSI)
- vmstat, e.g.
vmstat 1
- mpstat, e.g.
mpstat -P ALL 1
- sar
- ps
- top
- pidstat
- time, ptime
- turbostat
- showboost
- pmcarch
- tlbstat
- perf, e.g.,
perf record -F 99 command
perf stat gzip ubuntu.iso
perf stat -a -- sleep 10
- profile
- cpudist, e.g.,
cpudist 10 1
- runqlat, e.g.,
runqlat 10 1
- runqlen, e.g.,
runqlen 10 1
- softirqs, e.g.,
softirqs 10 1
- hardirqs, e.g.,
hardirqs 10 1
- bpftrace, e.g.,
bpftrace -l 'tracepoint:sched:*'
Visualization
This section introduces CPU utilization heat maps, CPU subsecond-offset heat maps, flame graphs, and FlameScope.
Tuning
Tuning may use scheduling priority, power states, and CPU binding.
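For example, priority tuning and CPU binding can be sketched from Python on Linux (the CPU set here is hypothetical):
import os

os.nice(10)                        # lower this process's scheduling priority
os.sched_setaffinity(0, {0, 1})    # bind the current process to CPUs 0 and 1
print("allowed CPUs:", os.sched_getaffinity(0))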
7. Memory
This chapter provides a basis for memory analysis, including background on key concepts and the architecture of hardware and software memory.
Concepts
Virtual Memory
Virtual memory is an abstraction that provides each process its own private address space. The process address space is mapped by the virtual memory subsystem to main memory and the physical swap device.
Paging
Paging is the movement of pages in and out of main memory. File system paging ("good" paging) is caused by reading and writing pages in memory-mapped files. Anonymous paging ("bad" paging) involves data that is private to a process: the process heap and stacks.
Demand Paging
Demand paging maps pages of virtual memory to physical memory on demand, deferring the CPU overhead of creating the mappings until they are needed.
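A minimal sketch of demand paging through a memory-mapped file in Python (the path and sizes are illustrative):
import mmap, os

fd = os.open("/tmp/demo.dat", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, 16 * 4096)        # 16 pages of backing store
m = mmap.mmap(fd, 16 * 4096)       # mapping created; no pages resident yet
m[0] = 1                           # first touch faults the page into memory
m.close()
os.close(fd)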
Utilization and Saturation
If demand for memory exceeds the amount of main memory, main memory becomes saturated, and the operating system may employ paging or the OOM killer to free it.
Architecture
This section introduces memory architecture:
Hardware Main Memory
The most common type of main memory is dynamic random-access memory (DRAM), and column address strobe (CAS) latency for DDR4 is around 10-20 ns. The main memory architecture can be uniform memory access (UMA) or non-uniform memory access (NUMA). Main memory may be connected via a shared system bus for one or more processors, directly attached, or via an interconnected memory bus. The MMU (memory management unit) translates virtual addresses to physical addresses for each page, with an offset within each page. The MMU uses a TLB (translation lookaside buffer) as a first-level cache for addresses in the page tables.
Software
The kernel tracks free memory in a free list of pages that are available for immediate allocation. When memory is low, the kernel may use swapping, reap any memory that can be freed, or invoke the OOM killer to free memory.
Process Virtual Address Space
The process virtual address space is a range of virtual pages that are mapped to physical pages, and addresses are split into segments such as executable text, executable data, heap, and stack. There are a variety of user- and kernel-level memory allocators, balancing simple APIs (malloc/free), efficient memory usage, performance, and observability.
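For example, the segments of a process's address space can be inspected on Linux via /proc; a small Python sketch:
# Print the heap and stack segments of the current process
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[heap]" in line or "[stack]" in line:
            print(line.rstrip())   # address range, permissions, segment name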
Methodology
This section describes various methodologies for memory analysis:
- Tools Method: involves checking page scanning, pressure stall information (PSI), swapping, vmstat, OOM killer, and perf.
- USE Method: utilization of memory, the degree of page scanning, swapping, OOM killer, and hardware errors.
- Characterizing usage
- Performance monitoring
- Leak detection
- Static performance tuning
Observability Tools
This section includes memory observability tools including:
- vmstat
- PSI, e.g.,
cat /proc/pressure/memory
- swapon
- sar
- slabtop, e.g.,
slabtop -sc
- numastat
- ps
- pmap, e.g.
pmap -x 123
- perf
- drsnoop
- wss
- bpftrace
Tuning
This section describes tunable parameters for Linux kernels such as:
- vm.dirty_background_bytes
- vm.dirty_ratio
- kernel.numa_balancing
8. File Systems
This chapter provides a basis for file system analysis:
Models
- File System Interfaces: File system interfaces include read, write, open, and more.
- File System Cache: The file system cache may cache reads or buffer writes.
- Second-Level Cache: can be any memory or storage type, e.g., RAM or disk.
Concepts
- File System Latency: the primary metric of file system performance, measuring time spent in the file system and disk I/O subsystem.
- Cache: The file system will use main memory as a cache to improve performance.
- Random vs Sequential I/O: A series of logical file system I/O can be either random or sequential based on the file offset of each I/O.
- Prefetch/Read-Ahead: Prefetch detects a sequential read workload and issues disk reads before the application requests them.
- Write-Back Caching: It marks writes as completed after transferring them to main memory, and writes them to disk asynchronously.
- Synchronous Writes: using the O_SYNC, O_DSYNC, or O_RSYNC flags (see the sketch after this list).
- Raw and Direct I/O
- Non-Blocking I/O
- Memory-Mapped Files
- Metadata: information about the logical and physical structure of the file system, which is read and written alongside file data.
- Logical vs Physical I/O
- Access Timestamps
- Capacity
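A minimal sketch of the synchronous write mentioned above, using the O_SYNC flag in Python (the path is illustrative):
import os

fd = os.open("/tmp/sync.dat", os.O_WRONLY | os.O_CREAT | os.O_SYNC)
os.write(fd, b"durable bytes")     # returns only once the data is on stable storage
os.close(fd)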
Architecture
This section introduces generic and specific file system architecture:
- File System I/O Stack
- VFS (virtual file system) common interface for different file system types.
- File System Caches
- Buffer Cache
- Page Cache
- Dentry Cache
- Inode Cache
- File System Features
- Block (fixed-size) vs Extent (pre-allocated contiguous space)
- Journaling
- Copy-On-Write
- Scrubbing
- File System Types
- FFS – the Berkeley Fast File System
- ext3/ext4
- XFS
- ZFS
- Volume and Pools
Methodology
This section describes various methodologies for file system analysis and tuning:
- Disk Analysis
- Latency Analysis
- Transaction Cost
- Workload Characterization
- cache hit ratio
- cache capacity and utilization
- distribution of I/O arrival times
- errors
- Performance monitoring (operation rate and latency)
- Static Performance Tuning
- Cache Tuning
- Workload Separation
- Micro-Benchmarking
- Operation types (read/write)
- I/O size
- File offset pattern
- Write type
- Working set size
- Concurrency
- Memory mapping
- Cache state
- Tuning
Observability Tools
- mount
- free
- vmstat
- sar
- slabtop, e.g.,
slabtop -a
- strace, e.g.,
strace -ttT -p 123
- fatrace
- opensnoop, e.g.,
opensnoop -T
- filetop
- cachestat, e.g.,
cachestat -T 1
- bpftrace
9. Disks
This chapter provides a basis for disk I/O analysis. The parts are as follows:
Models
- Simple Disk: includes an on-disk queue for I/O requests
- Caching Disk: on-disk cache
- Controller: an HBA (host bus adapter) bridges the CPU I/O transport with the storage transport and attached disk devices.
Concepts
- Measuring Time: I/O request time = I/O wait time + I/O service time
- disk service time = utilization / IOPS (see the worked example after this list)
- Time Scales
- Caching
- Random vs Sequential I/O
- Read/Write Ratio
- I/O size
- Utilization
- Saturation
- I/O Wait
- Synchronous vs Asynchronous
- Disk vs Application I/O
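A worked example of the timing formulas above, using hypothetical iostat-style numbers:
wait_ms, service_ms = 2.0, 6.0
request_ms = wait_ms + service_ms          # I/O request time = 8 ms

utilization, iops = 0.60, 100.0            # 60% busy, 100 I/O per second
service_s = utilization / iops             # average disk service time = 6 ms
print(request_ms, service_s * 1000)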
Architecture
- Disk Types
- Magnetic Rotational
- max throughput = max sectors per track x sector size x rpm / 60 s (see the worked example after this list)
- Short-Stroking
- Sector Zoning
- On-Disk Cache
- Solid-State Drives
- Flash Memory
- Persistent Memory
- Magnetic Rotational
- Interfaces
- SCSI
- SAS
- SATA
- NVMe
- Storage Type
- Disk Devices
- RAID
- Operating System Disk I/O Stack
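A worked example of the rotational throughput formula above, with hypothetical disk geometry:
sectors_per_track, sector_size, rpm = 500, 512, 7200
max_throughput = sectors_per_track * sector_size * rpm / 60   # bytes/s
print(max_throughput / 1e6, "MB/s")        # ~30.7 MB/s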
Methodology
- Tools Method
- iostat
- iotop/biotop
- biolatency
- biosnoop
- USE Method
- Performance Monitoring
- Workload Characterization
- I/O rate
- I/O throughput
- I/O size
- Read/write ratio
- Random vs sequential
- Latency Analysis
- Static Performance Tuning
- Cache Tuning
- Micro-Benchmarking
- Scaling
Observability Tools
- iostat
- pressure stall information (PSI)
- perf
- biolatency, e.g.,
biolatency 10 1
- biosnoop
- biotop
- ioping
10. Network
This chapter introduces network analysis. The parts are as follows:
Models
- Network Interface
- Controller
- Protocol Stack
- TCP/IP
- OSI Model
Concepts
- Network and Routing
- Protocols
- Encapsulation
- Packet Size
- Latency
- Connection Latency
- First-Byte Latency
- Round-Trip Time
- Buffering
- Connection Backlog (see the sketch after this list)
- Congestion Avoidance
- Utilization
- Local Connection
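A minimal sketch of the connection backlog concept above: the backlog argument to listen bounds how many completed connections may queue before accept (the port is illustrative):
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 8080))
server.listen(128)    # backlog; the kernel also caps this at net.core.somaxconn
print("listening with a backlog of 128")
server.close()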
Architecture
- Protocols
- IP
- TCP
- Sliding Window
- Congestion avoidance
- TCP SYN cookies
- 3-way Handshake
- Duplicate Ack Detection
- Congestion Controls
- Nagle algorithm
- Delayed Acks
- UDP
- QUIC and HTTP/3
- Hardware
- Interfaces
- Controller
- Switches and Routers
- Firewalls
- Software
- Network Stack
- Linux Stack
- TCP Connection queues
- TCP Buffering
- Segmentation Offload: GSO and TSO
- Network Device Drivers
- CPU Scaling
- Kernel Bypass
Methodology
- Tools Method
- netstat -s
- ip -s link
- ss -tiepm
- nicstat
- tcplife
- tcptop
- tcpdump
- USE Method
- Workload Characterization
- Network interface throughput
- Network interface IOPS
- TCP connection rate
- Latency Analysis
- Performance Monitoring
- Throughput
- Connections
- Errors
- TCP retransmits
- TCP out-of-order packets
- Packet Sniffing
- tcpdump -ni eth4
- TCP Analysis
- Static Performance Tuning
- Resource Controls
- Micro-Benchmarking
Observability Tools
- ss, e.g.,
ss -tiepm
- strace, e.g.,
strace -e sendmsg,recvmsg ss -t
- ip, e.g.,
ip -s link
- ifconfig
- netstat
- sar
- nicstat
- ethtool, e.g.,
ethtool -S eth0
- tcplife, tcptop
- tcpretrans
- bpftrace
- tcpdump
- wireshark
- pathchar
- iperf
- netperf
- tc
Tuning
sysctl -a | grep tcp
11. Cloud Computing
This chapter introduces cloud performance analysis with following parts:
Background
- Instance Types: m5 (general-purpose), c5 (compute optimized)
- Scalable Architecture: horizontal scalability with load balancers, web servers, application servers, and databases.
- Capacity Planning: dynamic sizing (auto scaling) using auto scaling groups, and scalability testing.
- Storage: File store, block store, object store
- Multitenancy
- Orchestration (Kubernetes)
Hardware Virtualization
- Type 1: executes directly on the processors using a native (bare-metal) hypervisor (e.g., Xen)
- Type 2: executes within a host OS, with the hypervisor scheduled by the host kernel
- Implementation: Xen, Hyper-V, KVM, Nitro
- Overhead:
- CPU overhead (binary translation, paravirtualization, hardware assisted)
- Memory Mapping
- Memory Size
- I/O
- Multi-Tenant Contention
- Resource Controls
- CPUs – borrowed virtual time, simple earliest deadline first, and credit-based schedulers
- CPU Caches
- Memory Capacity
- File System Capacity
- Device I/O
- Observability
- xentop
- perf kvm stat live
- bpftrace -lv t:kvm:kvm_exit
- mpstat -P ALL 1
OS Virtualization
- Implementation: Linux supports namespaces and cgroups, which are used to create containers. The Kubernetes architecture builds on these with Pods, kube-proxy, and CNI plugins.
- Namespaces
lsns
- Control Groups (cgroups) limit the usage of resources
- Overhead – CPU, Memory Mapping, Memory Size, I/O, and Multi-Tenant Contention
- Resource Controls – throttle access to resources so they can be shared more fairly
- CPU
- Shares and Bandwidth
- CPU Cache
- Memory Capacity
- Swap Capacity
- File System Capacity
- File System Cache
- Disk I/O
- Network I/O
- Observability
- from Host
kubectl get pod
- docker ps
- kubectl top nodes
- kubectl top pods
- docker stats
- cgroup stats (cpuacct.usage and cpuacct.usage_percpu)
- systemd-cgtop
nsenter -t 123 -m -p top
- Resource Controls (throttled time, non-voluntary context switches, idle CPU, busy, all other tenants idle)
- from guest (container)
iostat -sxz 1
- Lightweight Virtualization
- Lightweight hypervisor based on process virtualization (Amazon Firecracker)
- Implementation – Intel Clear Containers, Kata Containers, Google gVisor, Amazon Firecracker
- Overhead
- Resource Controls
- Observability
- From Host
- From Guest
mpstat -P ALL 1
12. Benchmarking
This chapter discusses benchmarks and provides advice with methodologies. The parts of this chapter include:
Background
- Reasons
- System design
- Proof of concept
- Tuning
- Development
- Capacity planning
- Troubleshooting
- Marketing
- Effective Benchmarking
- Repeatable
- Observable
- Portable
- Easily presented
- Realistic
- Runnable
- Benchmark Analysis
- Benchmark Failures
- Casual Benchmarking – you benchmark A, but measure B, and conclude you measured C, e.g., disk vs file system benchmarks (buffering/caching may affect measurements).
- Blind Faith
- Numbers without Analysis – include description of the benchmark and analysis.
- Complex Benchmark Tools
- Testing the wrong thing
- Ignoring the Environment (not tuning same as production)
- Ignoring Errors
- Ignoring Variance
- Ignoring Perturbations
- Changing Multiple Factors
- Friendly Fire
- Benchmarking Types
- Micro-Benchmarking
- Simulation – simulate customer application workload (macro-benchmarking)
- Replay
- Industry Standards – TPC, SPEC
Methodology
This section describes methodologies for performing benchmarking:
- Passive Benchmarking (an anti-methodology)
- pick a benchmark tool
- run it with a variety of options
- make slides of the results and share them with management
- Problems
- Invalid due to software bugs
- Limited by benchmark software (e.g., single thread)
- Limited by a component that is unrelated to the benchmark (congested network)
- Limited by configuration
- Subject to perturbation
- Benchmarking the wrong thing entirely
- Active Benchmarking
- Analyze performance while benchmarking is running
- bonnie++
- iostat -sxz 1
- CPU Profiling
- USE Method
- Workload Characterization
- Custom Benchmarks
- Ramping Load
- Statistical Analysis
- Selection of benchmark tool, its configuration
- Execution of the benchmark
- Interpretation of the data
- Benchmarking Checklist
- Why not double?
- Did it break limits?
- Did it error?
- Did it reproduce?
- Does it matter?
- Did it even happen?
13. perf
This chapter introduces perf tool:
- Subcommands Overview
perf record -F 99 -a -- sleep 30
- perf Events
perf list
- Hardware Events (PMCs)
- Frequency Sampling
perf record -vve cycles -a sleep 1
- Software Events
perf record -vve context-switches -a -- sleep 1
- Tracepoint Events
perf record -e block:block_rq_issue -a sleep 10; perf script
- Probe Events
- kprobes, e.g.,
perf probe --add do_nanosleep
- uprobes, e.g.,
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add fopen
- USDT
- perf stat
- Interval Statistics
- Per-CPU Balance
- Event Filters
- Shadow Statistics
- perf record
- CPU Profiling, e.g.,
perf record -F 99 -a -g -- sleep 30
- Stack Walking
- perf report
- TUI
- STDIO
- perf script
- Flame graphs
- perf trace
14. Ftrace
This chapter introduces the Ftrace tool. The sections are:
- Capabilities
- tracefs
- tracefs contents, e.g.,
ls -F /sys/kernel/debug/tracing
- Ftrace Function Profiler
- Ftrace Function Tracing
- Tracepoints
- Filter
- Trigger
- kprobes
- Event Tracing
- Argument and Return Values
- Filters and Triggers
- uprobes
- Event Tracing
- Argument and Return Values
- Ftrace function_graph
- Ftrace hwlat
- Ftrace Hist Triggers
- perf ftrace
15. BPF
This chapter introduces BPF tools. The sections are:
- BCC – the BPF Compiler Collection, e.g.,
biolatency.py -mF
- bpftrace – an open-source tracer built upon BPF and BCC
- Programming
16. Case Study
This chapter describes the story of a real-world performance issue.
- Problem Statement – Java application in AWS EC2 Cloud
- Analysis Strategy
- Checklist
- USE method
- Statistics
uptime
mpstat 10
- Configuration
cat /proc/cpuinfo
- PMCs
./pmcarch -p 123 10
- Software Events
perf stat -e cs -a -I 1000
- Tracing
cpudist -p 123 10 1
- Conclusion
- No container neighbors
- LLC size and workload difference
- CPU difference