Current HPC clusters and high-end systems feature complex configurations with a huge number of components. Large numbers of nodes are connected via complicated layouts such as multi-level hierarchical trees or high-dimensional tori. Nodes consist of several multi-core processors with non-trivial cache and memory hierarchies. Some of them employ attached accelerators such as Intel MIC or GPUs, resulting in a heterogeneous architecture. Deep software hierarchies of large, complex system and application software components are required to make effective use of such systems. To better support developers in porting and tuning their applications for these architectures, many parallel tools research groups have developed sophisticated integrated performance measurement, analysis, and optimization tool environments. This tutorial samples some of them.