Extended Stack Profiling - Ideas, Tools and Comments

Blog article:

Topic: This post provides a short summary and pointers to previous work on Extended Stack Profiling for troubleshooting and performance investigations.

Understanding the workload is an important part of troubleshooting activities. We seek answers to questions like: what is the system doing, where is the time spent, which code paths are most used, what are the wait events, etc. Sometimes the relevant diagnostic data is easy to find, other times we need to dig deeper. Stack profiling and flame graph visualizations are very useful techniques for advanced troubleshooting. In particular in the Linux environment stack traces gathered with perf provide a quick and powerful window into CPU-bound workloads, as detailed in the work of Brendan Gregg.

Extended Stack Profiling techniques stem from the experience of using on-CPU stack profiles and flame graphs. The basic idea is to pull together stack traces with OS- and application-specific metrics. This has the advantage of exposing the inner workings of the process under investigation, together with the context of its execution, such as its parent functions and the execution of kernel functions, when relevant. Application metrics that can also be added. Notably, when examining Oracle workloads, data from the wait event interface can be added as an additional dimension to the stack traces.

Examples of using extended stack tracing tools for troubleshooting and investigating Oracle workloads can be found in these blog articles: "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs" and in the presentation "Stack Traces & Flame Graphs for Oracle Troubleshooting".

Three tools and code examples complement the blog articles:

KStackSampler: a tool written in shell script which gathers kernel stack profiling together and process status.
ORA_KStackProfiler: a simple kernel stack profiler written in C, extended with process status information and the option of sampling Oracle wait event information from SGA.
Ptrace_Profiler: an extension of ORA_KStackProfiler with userspace stack sampling implemented using ptrace and libunwind.

Pros: techniques based on extended stack profiling allow to investigate Oracle workloads beyond what is available with the wait event interface. In particular when combined with flame graph visualization they provide insights into the inner workings of complex workloads, which in turns can be used to complement or extend what is available with Oracle instrumentation. The techniques are general and can be used or to troubleshoot many and diverse Linux workloads.

Cons: Limitations of the code discussed here are many and the tools have to to be considered as experimental. One of the main issues is the error that is intrinsic when pasting together different data sources, such as stack traces and process status, that are collected sequentially by the tools. This is particularly evident with the ptrace-based implementation for userspace stack traces which requires to stop the process for a relatively long time (can be 100s of milliseconds). The fact that data collection is based on sampling introduces another potential source of error. The sampling frequency has to be adapted to the workload: not too high to limit the overhead, not too low to avoid losing the details of rapidly varying workloads. Some other 'gotchas' come from the use of flame graphs, in particular it is worth reminding that the horizontal axis of those graphs does not represent the time evolution.

Ideas for future work: the tools can be extended by developing the interfaces to sample more data sources. For Oracle this could be adding more fields from V$SESSION (X$KSUSE) or other V$/X$ structures. More generally, probes can be developed for sampling and investigating a larger range of applications, similarly and extending the the work done for Oracle. Extension for multi-threaded processes would also make the tools more generic. A user interface to simplify the selection of the data sources and integration with flame graph visualization would also be beneficial.

Another area for improvement is with userspace stack tracing. It is currently implemented in Ptrace_Profiler by stopping the process while unwinding the stack, a simple method with a high overhead. There are better methods: for example with techniques for asynchronous stack unwinding that allow to stop the process for a much shorter time. This would reduce the footprint of the measurement and also allow for higher sampling frequency.

Additional work is also needed to better understand the reliability of the measurements and the errors incurred when sampling and pasting together (on-the-fly) the various data sources, as discussed above.

Credits and additional references: Brendan Gregg is the inventor of flame graphs and has published excellent material on this and other related topics of interest for troubleshooting performance. Tanel Poder has covered the topic of stack profiling and many others of interest in his blog. Additional and related investigations of Oracle internals can be found in the blog of Frits Hoogland and in the blog of Stefan Koehler.

Links to previous work: "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs", "Stack Traces & Flame Graphs for Oracle Troubleshooting", "Oracle Optimizer Investigated with Flame Graphs", "Flame Graphs for Oracle". Tools referenced in this post are available on Github and at this web page.