Aerosol
Posted December 29, 2014

Up to now, we have developed a debugger that can attach to and detach from a process, set and remove breakpoints, print registers and a call stack, and modify control flow by changing the executing thread’s context. These are all pretty essential features of a debugger. The topic of this post, debug symbols, is more of a “nice-to-have”. An application may or may not ship with debug symbols, but when it does (for example, because it’s your own application), debugging becomes significantly simpler.

Debug Symbols

At its simplest, a debug symbol is a piece of information that shows how a specific part of a compiled program maps back to the source level. For example, a debug symbol might give the name of the variable at a memory address, or tell which line of code, in which file, a series of assembly instructions maps to. Symbols are typically generated during debug builds and give some clarity to a developer who is debugging (or reverse engineering) a piece of code. There is no universal debug symbol format for a language, and formats may vary between compilers. On the modern Windows platform, debug symbols come in the form of Program Database (PDB) files, which have a .pdb extension.

These files hold a lot of useful information about the compiled executable or DLL. As mentioned above, they can record which source file and line number (or which object file) a symbol at a certain address maps to. They can contain the names and types of global, static, and local variables, as well as classes and structs. They can also contain information about the compiler optimizations that were used when compiling the code. Some of these things may not be present if the code was compiled with stripped symbols.
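To make the address-to-source mapping concrete, here is a toy sketch of a line table lookup. It is only an illustration of the idea and not how PDB files actually store this information; the LineEntry type, the addLine and lookupLine functions, and any addresses used with them are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical record: the source location of the instructions that
// begin at some start address.
struct LineEntry
{
    std::string file;
    unsigned line;
};

// Toy line table, keyed by the first address that each entry covers.
using LineTable = std::map<std::uint64_t, LineEntry>;

// Registers a source line for the instructions starting at startAddress.
void addLine(LineTable &table, std::uint64_t startAddress,
    const std::string &file, unsigned line)
{
    assert(line > 0); // source line numbers are 1-based
    table[startAddress] = LineEntry{ file, line };
}

// Returns the entry covering the address, i.e. the entry with the
// greatest start address that is <= address, or nullptr if the address
// lies before every entry in the table.
const LineEntry *lookupLine(const LineTable &table, std::uint64_t address)
{
    auto it = table.upper_bound(address); // first entry starting after address
    if (it == table.begin())
    {
        return nullptr;
    }
    return &std::prev(it)->second;
}
```

An instruction address in the middle of a statement resolves to the same file and line as the statement’s first instruction, which is the behavior a debugger wants when it annotates a call stack.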
During a debugging session, the debugger will initialize a symbol handler, look for matching PDB files (in common directories and/or user-specified directories, possibly recursively), and parse* them. While the user is debugging, symbol information can then be retrieved, and names and source line numbers can be displayed (if available).

* This is a useful open source parser that can parse the proprietary format of PDB files.

Implementation

Microsoft provides a very rich set of APIs for handling symbols through the DbgHelp API. There are functions to load/enumerate symbols for a module, find a symbol by name or address, enumerate source file and line references found in PDBs, dynamically add or remove entries from the symbol table, interact with symbol stores, and much more. Given the very large API, I’ve chosen to demonstrate only the more common features. One thing to consider is that all functions in the DbgHelp API set are single threaded. The example code is single threaded, but it does no synchronization to guarantee that DbgHelp is only ever called from one thread at a time, so if you build something on top of this code, make sure that you add that synchronization.

Initializing a symbol handler is pretty straightforward: it merely involves calling SymInitialize. The function takes a process handle, which is opened by the debugger when it attaches. The second parameter is the user search path used to locate PDB files, and the third specifies whether the debugger should enumerate all of the loaded modules in the process and load their symbols as well. For an attaching debugger, whether to enable this behavior depends on the situation. There are cases, such as the debugger creating the target process itself, or the target using delay-loaded DLLs, in which some symbols would not be loaded.
Additionally, if this third parameter is set to true and the symbol handler is initialized before all of the LOAD_DLL_DEBUG_EVENT events have been received, then some symbols may not be loaded. The sample code therefore defaults it to false and loads the symbols for each module in the CREATE_PROCESS_DEBUG_EVENT and LOAD_DLL_DEBUG_EVENT event handlers. This ensures that the symbol files for every module are properly loaded.

Prior to initializing the symbol handler, the SymSetOptions function should be called; it configures how the symbol handler loads information and what it loads. Put into code, the initialization routine looks like the following:

    Symbols::Symbols(const HANDLE hProcess, const HANDLE hFile, const bool bLoadAll /*= false*/)
        : m_hProcess{ hProcess }, m_hFile{ hFile }
    {
        // Options must be set before the handler is initialized.
        (void)SymSetOptions(SYMOPT_CASE_INSENSITIVE | SYMOPT_DEFERRED_LOADS |
            SYMOPT_LOAD_LINES | SYMOPT_UNDNAME);

        // nullptr = use the default symbol search path.
        const bool bSuccess = BOOLIFY(SymInitialize(hProcess, nullptr, bLoadAll));
        if (!bSuccess)
        {
            fprintf(stderr, "Could not initialize symbol handler. Error = %X.\n", GetLastError());
        }
    }

The options here specify that symbol searches will be case insensitive, that symbols won’t be loaded until a reference is made (not to be confused with the delay-loaded DLLs mentioned above), that line information will be loaded, and that symbols will be displayed in undecorated form. Case insensitivity and undecorated names are there for convenience; it would be annoying to have to search for exact decorated names such as “?f@@YAHD@Z” otherwise.

When the symbol handler is no longer needed, i.e. when the debugger is detaching from the process, a simple call to SymCleanup terminates it:

    Symbols::~Symbols()
    {
        const bool bSuccess = BOOLIFY(SymCleanup(m_hProcess));
        if (!bSuccess)
        {
            fprintf(stderr, "Could not terminate symbol handler. Error = %X.\n", GetLastError());
        }
    }

That sets up the initialization and termination of the symbol handler.
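Since DbgHelp functions are single threaded, one way to add the synchronization mentioned earlier is to funnel every symbol call through a single lock. The sketch below is a portable illustration and not part of the sample code: SerializedCalls, fakeLookup, and lookupSerialized are hypothetical names, and fakeLookup merely stands in for a real DbgHelp call such as SymFromAddr.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>

// Hypothetical helper: every callable passed to call() runs while holding
// the same mutex, so no two of them can execute at the same time.
// std::mutex is not recursive, so a callable must not call call() again.
class SerializedCalls
{
public:
    template <typename Fn>
    auto call(Fn &&fn) -> decltype(fn())
    {
        std::lock_guard<std::mutex> lock(m_mutex); // released when call() returns
        return std::forward<Fn>(fn)();
    }

private:
    std::mutex m_mutex;
};

// Stand-in for a symbol lookup. It just rounds the address down to a
// 16-byte boundary; a real debugger would query the symbol handler here.
std::uint64_t fakeLookup(std::uint64_t address)
{
    assert(address != 0); // a null address is never a valid query in this toy
    return address & ~static_cast<std::uint64_t>(0xF);
}

// All symbol queries go through the one gate object.
std::uint64_t lookupSerialized(SerializedCalls &gate, std::uint64_t address)
{
    return gate.call([address] { return fakeLookup(address); });
}
```

In a debugger built on the sample code, the Symbols class could own one such gate and wrap each DbgHelp call in it, so that a UI thread and the debug event loop never enter DbgHelp simultaneously.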
Time for everything in between.

Enumerating Symbols

One useful feature of a debugger might be to internally enumerate all symbols of a module. This can allow for storage and fast lookup at a later time, or it can allow for a graphical display for the user and easy navigation from a symbol’s name to its address. Enumerating symbols is a two-step process: first SymLoadModuleEx is called to load the symbol table for the module, then SymEnumSymbols is called with the base address of the module. SymEnumSymbols takes a callback of type PSYM_ENUMERATESYMBOLS_CALLBACK as a parameter. This callback is called for every symbol found in the module’s symbol table and receives a SYMBOL_INFO structure that describes the symbol: its name, its address, whether it is a register, what value it holds if it’s a constant, etc. Put into code, this is rather straightforward:

    const bool Symbols::EnumerateModuleSymbols(const char * const pModulePath, const DWORD64 dwBaseAddress)
    {
        // Note: SymLoadModuleEx also returns 0 when the module is already loaded;
        // in that case GetLastError returns ERROR_SUCCESS.
        DWORD64 dwBaseOfDll = SymLoadModuleEx(m_hProcess, m_hFile, pModulePath, nullptr,
            dwBaseAddress, 0, nullptr, 0);
        if (dwBaseOfDll == 0)
        {
            fprintf(stderr, "Could not load modules for %s. Error = %X.\n", pModulePath, GetLastError());
            return false;
        }

        // Passed through to SymEnumCallback for every symbol found.
        UserContext userContext = { this, pModulePath };
        const bool bSuccess = BOOLIFY(SymEnumSymbols(m_hProcess, dwBaseOfDll, "*!*",
            SymEnumCallback, &userContext));
        if (!bSuccess)
        {
            fprintf(stderr, "Could not enumerate symbols for %s. Error = %X.\n", pModulePath, GetLastError());
        }

        return bSuccess;
    }

Resolving Symbols

There are several ways to resolve symbols, but the two most common are by name and by address. This can be achieved by calling SymFromName and SymFromAddr respectively. Both of these populate a SYMBOL_INFO structure, just as calling SymEnumSymbols does.
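Storing the enumerated symbols makes later lookups by name or by address cheap. As a rough sketch of what such a store could look like, the class below keeps two indexes and resolves an address to the nearest preceding symbol plus a displacement. SymbolCache, CachedSymbol, and their member functions are hypothetical names chosen for this example and are not taken from the sample code; name lookup here is exact and case sensitive.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stored symbol: just a name and a start address.
struct CachedSymbol
{
    std::string name;
    std::uint64_t address;
};

class SymbolCache
{
public:
    // Would be called once per symbol, e.g. from an enumeration callback.
    // Toy simplification: assumes at most one symbol per address.
    void add(const std::string &name, std::uint64_t address)
    {
        assert(!name.empty());
        m_byAddress[address] = CachedSymbol{ name, address };
        m_byName[name] = address;
    }

    // Exact name lookup; returns nullptr when the name is unknown.
    const CachedSymbol *findByName(const std::string &name) const
    {
        auto it = m_byName.find(name);
        return it == m_byName.end() ? nullptr : &m_byAddress.at(it->second);
    }

    // Finds the symbol with the greatest start address <= address and
    // reports how far past the symbol start the address lies.
    const CachedSymbol *findByAddress(std::uint64_t address,
        std::uint64_t *displacement) const
    {
        auto it = m_byAddress.upper_bound(address); // first symbol after address
        if (it == m_byAddress.begin())
        {
            return nullptr; // address precedes every known symbol
        }
        --it;
        if (displacement != nullptr)
        {
            *displacement = address - it->first;
        }
        return &it->second;
    }

private:
    std::map<std::uint64_t, CachedSymbol> m_byAddress;       // ordered, for range lookup
    std::unordered_map<std::string, std::uint64_t> m_byName; // name -> start address
};
```

Because this toy keeps no symbol sizes, an address far beyond the end of the last function would still resolve to that function. SYMBOL_INFO carries a Size field that a fuller implementation might store and use to reject such addresses.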
Invoking SymFromAddr and SymFromName is also rather straightforward:

    const bool Symbols::SymbolFromAddress(const DWORD64 dwAddress, const SymbolInfo **pFullSymbolInfo)
    {
        // SYMBOL_INFO ends in a variable-length name, so reserve room for it after the struct.
        char pBuffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(char)] = { 0 };
        PSYMBOL_INFO pSymInfo = (PSYMBOL_INFO)pBuffer;
        pSymInfo->SizeOfStruct = sizeof(SYMBOL_INFO);
        pSymInfo->MaxNameLen = MAX_SYM_NAME;

        // Receives the offset of dwAddress from the start of the symbol.
        DWORD64 dwDisplacement = 0;
        const bool bSuccess = BOOLIFY(SymFromAddr(m_hProcess, dwAddress, &dwDisplacement, pSymInfo));
        if (!bSuccess)
        {
            fprintf(stderr, "Could not retrieve symbol from address %p. Error = %X.\n",
                (DWORD_PTR)dwAddress, GetLastError());
            return false;
        }

        fprintf(stderr, "Symbol found at %p. Name: %.*s. Base address of module: %p\n",
            (DWORD_PTR)dwAddress, pSymInfo->NameLen, pSymInfo->Name, (DWORD_PTR)pSymInfo->ModBase);

        *pFullSymbolInfo = FindSymbolByName(pSymInfo->Name);

        return bSuccess;
    }

    const bool Symbols::SymbolFromName(const char * const pName, const SymbolInfo **pFullSymbolInfo)
    {
        // Same sizing as in SymbolFromAddress. The original expression added
        // "sizeof(ULONG64) - 1 / sizeof(ULONG64)", which lacks the parentheses of the
        // round-up idiom used with ULONG64 arrays and is not needed for a char buffer.
        char pBuffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(char)] = { 0 };
        PSYMBOL_INFO pSymInfo = (PSYMBOL_INFO)pBuffer;
        pSymInfo->SizeOfStruct = sizeof(SYMBOL_INFO);
        pSymInfo->MaxNameLen = MAX_SYM_NAME;

        const bool bSuccess = BOOLIFY(SymFromName(m_hProcess, pName, pSymInfo));
        if (!bSuccess)
        {
            fprintf(stderr, "Could not retrieve symbol for name %s. Error = %X.\n", pName, GetLastError());
            return false;
        }

        fprintf(stderr, "Symbol found for %s. Name: %.*s. Address: %p. Base address of module: %p\n",
            pName, pSymInfo->NameLen, pSymInfo->Name, (DWORD_PTR)pSymInfo->Address,
            (DWORD_PTR)pSymInfo->ModBase);

        *pFullSymbolInfo = FindSymbolByAddress((DWORD_PTR)pSymInfo->Address);

        return bSuccess;
    }

Here SymbolInfo is an extended structure that also holds information about source files and line numbers (see the example code).

Testing the functionality

To test this functionality, we can take the sample program from the previous post (reproduced below) and see the difference in how call stacks look.
This version adds the ability to resolve symbols for the addresses in the call stack. The debugger was also augmented with two new abilities: dumping all symbols from a module, and setting/removing breakpoints on a symbol by name.

    #include <cstdio>

    void d()
    {
        printf("d called.\n");
    }

    void c()
    {
        printf("c called.\n");
        d();
    }

    void b()
    {
        printf("b called.\n");
        c();
    }

    void a()
    {
        printf("a called.\n");
        b();
    }

    int main(int argc, char *argv[])
    {
        printf("Addresses: \n"
            "a: %p\n"
            "b: %p\n"
            "c: %p\n"
            "d: %p\n",
            a, b, c, d);

        getchar();
        while (true)
        {
            a();
            getchar();
        }

        return 0;
    }

Setting a breakpoint on the d function and printing the call stack shows how much more useful this version of the debugger is than the previous one. The commands entered here are a, s, d, and l; everything else is debugger output.

    a
    [A]ddress or [s]ymbol name? s
    Name: d
    Received breakpoint at address 00401090.
    Press c to continue or s to begin stepping.
    l
    Frame: 0
    Execution address: 00401090
    Stack address: 00000000
    Frame address: 0018FDE8
    Symbol name: d
    Symbol address: 00401090
    Address displacement: 0
    Source file: c:\users\demo\desktop\demoapp\source.cpp
    Line number: 4
    Frame: 1
    Execution address: 0040107C
    Stack address: 00000000
    Frame address: 0018FDEC
    Symbol found at 0040107C. Name: c. Base address of module: 00400000
    Symbol name: c
    Symbol address: 00401060
    Address displacement: 0
    Source file: c:\users\demo\desktop\demoapp\source.cpp
    Line number: 9
    Frame: 2
    Execution address: 0040104C
    Stack address: 00000000
    Frame address: 0018FE40
    Symbol found at 0040104C. Name: b. Base address of module: 00400000
    Symbol name: b
    Symbol address: 00401030
    Address displacement: 0
    Source file: c:\users\demo\desktop\demoapp\source.cpp
    Line number: 15
    Frame: 3
    Execution address: 0040101C
    Stack address: 00000000
    Frame address: 0018FE94
    Symbol found at 0040101C. Name: a. Base address of module: 00400000
    Symbol name: a
    Symbol address: 00401000
    Address displacement: 0
    Source file: c:\users\demo\desktop\demoapp\source.cpp
    Line number: 21
    Frame: 4
    Execution address: 004010EF
    Stack address: 00000000
    Frame address: 0018FEE8
    Symbol found at 004010EF. Name: main. Base address of module: 00400000
    Symbol name: main
    Symbol address: 004010B0
    Address displacement: 0
    Source file: c:\users\demo\desktop\demoapp\source.cpp
    Line number: 27
    Frame: 5
    Execution address: 004013A9
    Stack address: 00000000
    Frame address: 0018FF3C
    Symbol found at 004013A9. Name: __tmainCRTStartup. Base address of module: 00400000
    Symbol name: __tmainCRTStartup
    Symbol address: 00401210
    Address displacement: 0
    Source file: f:\dd\vctools\crt\crtw32\dllstuff\crtexe.c
    Line number: 473
    Frame: 6
    Execution address: 004014ED
    Stack address: 00000000
    Frame address: 0018FF8C
    Symbol found at 004014ED. Name: mainCRTStartup. Base address of module: 00400000
    Symbol name: mainCRTStartup
    Symbol address: 004014E0
    Address displacement: 0
    Source file: f:\dd\vctools\crt\crtw32\dllstuff\crtexe.c
    Line number: 456
    Frame: 7
    Execution address: 76AE919F
    Stack address: 00000000
    Frame address: 0018FF94
    Symbol found at 76AE919F. Name: BaseThreadInitThunk. Base address of module: 00000000
    Symbol name: BaseThreadInitThunk
    Symbol address: 76AE9191
    Address displacement: 0
    Source file: (null)
    Line number: 0
    Frame: 8
    Execution address: 77430BBB
    Stack address: 00000000
    Frame address: 0018FFA0
    Symbol found at 77430BBB. Name: RtlInitializeExceptionChain. Base address of module: 00000000
    Symbol name: RtlInitializeExceptionChain
    Symbol address: 77430B37
    Address displacement: 0
    Source file: (null)
    Line number: 0
    Frame: 9
    Execution address: 77430B91
    Stack address: 00000000
    Frame address: 0018FFE4
    Symbol found at 77430B91. Name: RtlInitializeExceptionChain. Base address of module: 00000000
    Symbol name: RtlInitializeExceptionChain
    Symbol address: 77430B37
    Address displacement: 0
    Source file: (null)
    Line number: 0
    StackWalk64 finished.

This is much more useful than the bare absolute addresses printed by the previous version. For some symbols, the source files can be found on the host machine and presented to the user alongside the raw assembly. Additionally, symbols can be printed for any module, as shown below (the entered input is y, followed by kernel32.dll):

    y
    Enter in module name to dump symbols for: kernel32.dll
    Symbol name: QuirkIsEnabledWorker
    Symbol address: 76AE0010
    Address displacement: 0
    Source file: (null)
    Line number: 0
    Symbol name: EnumCalendarInfoExEx
    Symbol address: 76AE03BD
    Address displacement: 0
    Source file: (null)
    Line number: 0
    Symbol name: GetFileMUIPath
    Symbol address: 76AE03CE
    Address displacement: 0
    Source file: (null)
    Line number: 0
    ...

That concludes the topic of symbols. The implementation presented here only scratches the surface of what the DbgHelp API offers, and I recommend that those interested explore the MSDN documentation further. The next article will conclude the series with a collection of miscellaneous features that debuggers typically possess. It will probably include the ability to step over code (step into is currently implemented), present a disassembly listing to the user for x86 and x64, and allow modification of arbitrary memory instead of just registers and/or a thread context.

Article Roadmap

Future posts will closely follow the items below:

- Basics
- Adding/Removing Breakpoints, Single-stepping
- Call Stack, Registers, Contexts
- Symbols
- Miscellaneous Features

The full source code relating to this can be found here. C++11 features were used, so MSVC 2012/2013 is most likely required.