Nytro Posted September 22, 2017 Report Posted September 22, 2017 HACK THE HACKER – FUZZING MIMIKATZ ON WINDOWS WITH WINAFL & HEATMAPS (0DAY) On 22. Sep In this blogpost, I want to explain two topics from a theoretical and practical point of view: How to fuzz windows binaries with source code available (this part is for developers) and How to deal with big input files (aka heatmap fuzzing) and crash analysis (for security consultants; more technical) Since this blog post got too long and I didn’t want to remove important theory, I marked background knowledge with grey color. Feel free to skip these sections if you are already familiar with this knowledge. If you are only interested in the exploitable mimikatz flaws you can jump to chapter “Practice: Analysis of the identified crashes“. I (René Freingruber from SEC Consult Vulnerability Lab) am going to give a talk at heise devSec (and IT-SECX and DefCamp) about fuzzing binaries for developers and therefore I wanted to test different approaches to fuzz windows applications where source code is available (the audience are most likely developers). To my knowledge there are several blog posts talking about fuzzing Linux applications with AFL or libfuzzer (just compile the application with afl-gcc instead of gcc or add some flags to clang), but there is no blog post explaining the concept and setup for Windows. This blog post tries to fill this gap. Fuzzing is a very important concept during development and therefore all developers should know how to do it correctly and that such a setup can be simple and fast! WHY I CHOSE MIMIKATZ AS FUZZING TARGET To demonstrate fuzzing I had to find a good target where source code is available. At the same time, I wanted to learn more about the internals of the application by reading the source code. Therefore, my decision fell on mimikatz because it’s an extremely useful tool for security consultants (and hackers) and I always wanted to understand how mimikatz internally works. Mimikatz is a powerful hacker tool for Windows which can be used to extract plaintext credentials, hashes of currently logged on users, machine certificates and many other things. Moreover, mimikatz contains over 261 000 lines of code, must parse many different data structures and is therefore likely to be affected by vulnerabilities itself. At this point I also want to say that penetration testing would not be the same without the amazing work of Benjamin Delpy gentilkiwi, the author of mimikatz. The next thing I need is a good attack vector. Why should I search for vulnerabilities if there is no real attack vector which can trigger the vulnerability? Mimikatz can be used to dump cleartext credentials and hashes of currently logged in users from the LSASS process. But if there exists a bug in the parsing code of mimikatz, what exactly could I achieve with this? I could just exploit myself because mimikatz gets executed on the same system. Well, not always. As it turns out, well-educated security consultants and hackers do not directly invoke mimikatz on the owned system. Instead, it’s nowadays widespread practice to create a minidump of the LSASS process on the owned system, download it and invoke mimikatz locally (on the attacker system). Why do we do this? Because dropping mimikatz on the owned system could trigger all kind of alerts (AntiVirus, Application Whitelisting, Windows Logs, …) and because we want to stay under the radar, we don’t want these alerts. Instead, we can use Microsoft signed binaries to dump the LSASS process (e.g.: Task manager, procdump.exe, msbuild.exe or sqldumper.exe). Now we have a good attack vector! We can create a honeypot, inject some malicious stuff into our own LSASS process, wait until a hacker owns it, dumps the LSASS process and invokes mimikatz on his own system reading our manipulated memory dump file and watch reverse shells coming in! This and similar attack vectors can be interesting features in the future for deception technologies such as CyberTrap (https://www.cybertrap.com/). To be fair, exploiting this scenario can be quite difficult because mimikatz is compiled with protections, such as ASLR and DEP, and exploitation of such a client-side application is extremely challenging (we don’t have the luxury of a remote memory leak nor of scripting possibilities such as with browsers or PDF readers). However, such client-side scriptless exploits are not impossible, for example see https://scarybeastsecurity.blogspot.co.at/2016/11/0day-exploit-advancing-exploitation.html. To my surprise, mimikatz contained a very powerful vulnerability which allows us to bypass ASLR (and DEP). THEORY: FUZZING WINDOWS BINARIES We don’t want to write a mimikatz-specific fuzzer (with knowledge about the parsing structures, implemented checks and so on), instead, we want to use something which is called “coverage-guided fuzzing” (or feedback-based fuzzing). This means we want to extract code coverage information during fuzzing. If one of our mutated input files (the memory dump files) generate more code coverage in mimikatz, we want to add it to our fuzzing queue and fuzz this input later as well. That means we start with one dump file and the fuzzer identifies different code paths itself (ideally all) for us and therefore “learn” all the internal parsing logic autonomously! This also means that we only have to write a fuzzer one time and can use it against nearly all kinds of applications! Luckily for us, we don’t even have to write such a fuzzer because someone else already did an excellent job! Currently the most commonly used fuzzer is called AFL (American fuzzy lop) and implements exactly this idea. It turned out that AFL is extremely effective in identifying vulnerabilities. In my opinion, this is because of four major characteristics of AFL: It is extremely fast. It extracts edge coverage (with hit count) and not only code coverage which simplified means that we try to maximize executed “paths” and not “code” (E.g. if we have an IF statement, code coverage would put the input file which executes the code inside the IF into the queue. Edge coverage on the other side would put one entry with, and one entry without the IF body-code into the queue.). Because of 1) it can implement deterministic mutation (do every operation on every byte/bit of the input). More on this later when talking about heatmap fuzzing. It’s extremely simple to use. Fuzzing can be started within several minutes! The big question is now: How does AFL extract the code/edge coverage information? The answer depends on the configured mode of AFL. The default option is to “hack” generated object files from gcc and insert at all possible code locations instrumentation code. There is also a LLVM mode where a LLVM compiler pass is used to insert the instrumentation. And then there is a qemu mode if source code is not available which emulates the binary and adds instrumentation via qemu. There are several forks which extend AFL with other instrumentation possibilities. For example, hardware features can also be used to extract code coverage (WinAFL-IntelPT, kAFL). Another idea is to use dynamic instrumentation frameworks such as PIN or DynamoRio (e.g. WinAFL). These frameworks can dynamically instrument (during runtime) the target binary which means that a dispatcher loads the next instructions which should be executed and adds additional instrumentation code before (and after) them. All that requires dynamic relocation of the instructions while being completely transparent to the target application. This is quite complicated but the framework hides all that logic from the user. Obviously, this approach is not fast (but works on Windows and has many cool features for the future!). When source code is available, I thought that there should be the same possibility as AFL does on Linux (with GCC). My first attempt was to use “libfuzzer” on Windows. Libfuzzer is a fuzzer like AFL but it acts on function level fuzzing and is therefore more suitable for developers. AFL fuzzes full binaries per default but can also be started in a persistent mode where it fuzzes on function level. Libfuzzer uses the feature “Source-based code coverage” from the LLVM clang compiler which is exactly what I wanted to use for coverage-information on Windows. After some initial errors I could compile mimikatz with LLVM on Windows. When adding the flags for “source-based code coverage” I got again some errors which can be fixed by adding the required libraries to the linker include paths. However, this flag added the same functions to all object files which resulted in linker errors. The only way I could solve this was to merge all .c files into one huge file. Since this approach was more a pain than anything else, I didn’t pursue it further for the talk. Note that using LLVM for application analysis on Windows could be a very interesting approach in the future! The next thing I tested was WinAFL in syzygy mode and this was exactly what I was looking for! For a more detailed description of syzygy see https://doar-e.github.io/blog/2017/08/05/binary-rewriting-with-syzygy/. At this point I want to thank Ivan Fratric (developer of WinAFL) and Axel 0vercl0k Souchet (developer of the syzygy mode) for their amazing job and of course Michal Zalewski for AFL! PRACTICE: FUZZING WINDOWS BINARIES You can download WinAFL from here: https://github.com/ivanfratric/winafl All we must do is include the header and add the code: While(__afl_persistent_loop()) { // code to fuzz } to the project which we want to fuzz and compile a 32-bit application with the /PROFILE linker flag (Visual Studio -> project properties -> Linker -> Advanced -> Profile). For mimikatz I removed the command prompt code from the wmain function (inside mimikatz.c) and just called kuhl_m_sekurlsa_all(argc,argc) because I wanted to directly dump the hashes/passwords from the minidump (issue the sekurlsa::logonpasswords command at program invocation). Since mimikatz would extract this information per default from the LSASS process I added a line inside kuhl_m_sekurlsa_all() to load the dump instead. Moreover, I added the persistent loop inside this function. Here is how my new kuhl_m_sekurlsa_all() function looks: NTSTATUS kuhl_m_sekurlsa_all(int argc, wchar_t *argv[]) { While(__afl_persistent_loop()) { kuhl_m_sekurlsa_reset(); // sekurlsa::minidump command pMinidumpName = argv[1]; // sekurlsa::minidump command kuhl_m_sekurlsa_getLogonData(lsassPackages, ARRAYSIZE(lsassPackages); // sekurlsa::logonpasswords command } Return NT_SUCCESS; } Hint: Do not use the above code during fuzzing. I added a subtle flaw, which I’m going to explain later. Can you already spot it? After compilation, we can start the binary and see some additional messages: The next step is to instrument the code. For this to work we must register “msdia140.dll”. This can be done via the command: Regsvr32 C:\path\to\msdia140.dll Then we can call the downloaded instrument.exe binary with the generated mimikatz.exe and mimikatz.pdb file to add the instrumentation: We can start the instrumented binary and see the adapted status message: Now we can generate a minidump file for mimikatz (on a Windows 7 x86 test system I used task manager to dump the LSASS process), put it into the input directory and start fuzzing: As we can see, we have a file size problem! THEORY: FUZZING AND THE INPUT FILE SIZE PROBLEM We can see that AFL restricts input files to a maximum size of 1 MB. Why? Recall, that AFL is doing deterministic fuzzing before switching to random fuzzing for each queue input. That means that it is doing specific bit and byte flips/additions/removals/… for every (!) byte/bit in the input! This includes several strategies like bitflip 1/1, bitflip 2/1, …, arith 8/8, arith 16/8 and so on. For example, bitflip 1/1 means that AFL flips 1 bit per execution and then walks forward 1 bit to flip the next bit. Whereas 2/1 means that 2 bits are flipped and the step length is 1 bit. Arith 16/8 means that the step is 8 and that AFL tries to add interesting values to a value treated as 16-bit integer. And there are several more such deterministic fuzzing strategies. All these strategies have in common that the number of required application executions depend on the file size! Let’s just assume that AFL only (!) does the bitflip 1/1 deterministic strategy. This means that we must execute the target application exactly input-filesize (bit) number of times. WinAFL is doing in-memory fuzzing which means that we don’t have to start the application every time, but let’s forget this for now so that our discussion does not get too complicated. Let’s say that our input binary has a size of 10 kB. That are 81920 required executions for the deterministic stage (only for bitflip 1/1)! AFL conducts many more deterministic strategies. For AFL on Linux this is not very huge, it’s not uncommon to get execution speeds of 1000 – 8000 executions / second (because of the fork-server, persistent-mode, …) per core. This fast execution speed together with deterministic fuzzing (and edge coverage) made AFL so successful. So, if we have 16 cores we can easily do this strategy for one input file within a second. Now let’s assume that our input has a size of 1 MB (AFL limit), which means 8 388 608 required executions for the bitflip 1/1 and that our target application is a little bit slower because it’s bigger and running on Windows (200 exec / sec) and that we just have one core available for fuzzing. Then we need 11,5 hours only to finish the bitflip 1/1 for one input entry! Recall that we must conduct this deterministic stage for every new queue entry (every time we identify new coverage, we must do this again! It’s not uncommon that the queue grows up to several thousand inputs). And if we consider all the other deterministic operations which must be performed, the situation becomes even worse. And in our case the input (memory dump) has a size of 27 MB! This would be 216 036 224 required executions only for bitflip 1/1. AFL detects this and directly aborts because this would just take too long (and AFL would never find vulnerabilities because it’s stuck inside deterministic fuzzing). Of course, we can tell AFL to skip deterministic fuzzing, but that would not be very good because we still have to find the special byte/bit flip which triggers the vulnerability. And the likelihood for this in such a big input file is not very high…. Here is a cite from Michal Zalewski (author of AFL) from the perf_tips.txt document: To illustrate, let’s say that you’re randomly flipping bits in a file, one bit at a time. Let’s assume that if you flip bit #47, you will hit a security bug; flipping any other bit just results in an invalid document. Now, if your starting test case is 100 bytes long, you will have a 71% chance of triggering the bug within the first 1,000 execs – not bad! But if the test case is 1 kB long, the probability that we will randomly hit the right pattern in the same timeframe goes down to 11%. And if it has 10 kB of non-essential cruft, the odds plunge to 1%. The key learning is that input file size is very important during fuzzing! At least as important as plain execution speed of the fuzzer! There are tools and scripts shipped with AFL (cmin/tmin) to minimize the number of input files and the file size. However, I don’t want to talk about them in this blog post. They shrink files via a fuzzing attempt and since the problem is NP-hard they use heuristics. Tmin is also very slow (execution time depends on the file size…) and often leads to problems with files containing offsets (like our dump file) and checksums. Another idea could be to start WinAFL with an empty input file. However, memory dumps are quite complex and I don’t want to “waste” time while WinAFL identifies the format. And here is where heatmaps come into play. PRACTICE: FUZZING AND HEATMAPS My first attempt to minimize the input file was to read the source code of mimikatz and understand how it finds the important memory regions (containing the plaintext passwords and hashes) in the memory dump. I assumed some kind of pattern-search, however, mimikatz parses and uses lots of structures and I quickly discarded the idea of manually creating a smaller input binary by understanding mimikatz. During fuzzing we also don’t want to understand the application to write a specific fuzzer, instead we want that the fuzzer learns everything itself. If we could somehow give the fuzzer the same ability to detect the “important input bytes”… During reading the code I identified something interesting. Mimikatz loads the memory dump via kernel32!MapViewOfFile(). After that it reads nearly all required information from there (sometimes it also copies it via kull_m_minidump_copy, but let’s not get too complicated for the moment). If we can log all memory access attempts to this memory region, we can easily reduce the number of required executions drastically! If mimikatz does not touch a specific memory region, why should we even fuzz the bytes there? Here is a heat map which I generated based on memory read operations from mimikatz on my 27 MB input file (generated via a plotly python script): The black area is never read by mimikatz. The brighter the area, the more memory read operations accessed this region. The start of the file is located at the left bottom of the picture. I print 1000 bytes per line (from left to right), then go one line up and print the next 1000 bytes and so on. At this zoom level, we do not see access attempts to the start (header) of the memory dump. However, this is only because the file size is 27 MB and therefore the smaller red/yellow/white dots are not visible. But we can zoom in: The most important message from the above pictures: We do not have to fuzz the bytes from the black area! But we can even do better, we can start fuzzing the white bytes (many read attempts, like offset 0xaa00 which gets read several thousand times), then continue with yellow bytes (e.g.: offset 0x08 is read several hundred times) and then red areas (which are only read 1-5 times). This itself is not a perfect approach (more read attempts also mean that the likelihood that the input becomes invalid gets higher and therefore it’s maybe not the best idea to start with white areas but since we must do the full deterministic step it basically does not matter with which offsets we start; Hint: A better strategy is to also consider the triggering read-instruction address together with the hitcounts). You are maybe wondering how I extracted the memory-read information. For mimikatz I just used a stupid PoC debugger script, however, I’m currently developing a more sophisticated script with the use of dynamic instrumentation frameworks which can extract such information from any application. The use of a debugger script is “stupid” because it’s very slow and does not work for every application. However, for mimikatz it worked fine because mimikatz reads most of the time from the region which is mapped via MapViewOfFile(). My WinAppDbg script is very simple, I just use the Api-Hooking mechanism of WinAppDbg to hook the post-routine of MapViewOfFile. The return value contains the memory address where the input memory dump is loaded. I placed a memory breakpoint on it and at every breakpoint hit I just log the relative offset and increment the hit counter. You can also extend this script to become more complex (e.g.: hook kull_m_memory_search() or kull_m_minidump_copy(). For reference, here is the script: We start mimikatz via the debugger, as soon as mimikatz finishes (debug.loop() returns), we just sort the offsets based on the hit counts and dump them to a log file. This is the code which hooks MapViewOfFile to obtain the base address of our input in-memory. This code also adds the memory breakpoint (watch_buffer() method) and invokes memory_breakpoint_callback every time an instruction reads from our input. Hint: WinAppDbg has a bug inside watch_buffer() which does not return the bw variable. If the script should be extended (e.g. to disable the memory breakpoint for search-functions), the WinAppDbg source must be modified to return the “bw” variable in the watch_buffer() function. All left is the callback function: I used capstone to disassemble the triggering instruction to obtain the size of the read operation to update the offsets correctly. As you can see, there is no magic involved and the script is really simple (at least this one which only works with mimikatz). The downside of the debugger memory breakpoint approach is that it is extremely slow. On my test system, this script executes approximately 30 minutes to execute mimikatz one time which is really slow (mimikatz without the script executes in under 1 second). However, we only have to do this once. Another down-side is that it does not work for all applications because typically applications copy input-bytes to other buffers and access them there. A more general approach is to write such a script with a dynamic instrumentation framework which is also much faster (on which I’m currently working on). Developing such a script is much harder because we have to use shadow memory for taint tracking, follow copies of tainted memory (this taint propagation is a complex part because it requires correct semantic understanding of the x86 instruction set) but store at the same time which byte depends on which input byte and this should be as fast as possible (for fuzzing) and does not consume too much memory (so it’s useable for big and complex applications like Adobe Reader or web browsers). Please note that there are some similiar solutions, most notable TaintScope (however, no source code or tool public available and it was just a PoC) or VUzzer (based on DataTracker/libdft which themself are based on PIN) and some other frameworks such as Triton (based on PIN) or Panda (based on Qemu) can also do taint analysis. The problem with these tools is that either source code and the tool is not public available, or it is very slow or it does not work on Windows or it does not propagate the file offset information (just marks data is tainted or not tainted) or they are written with PIN which itself is slower than DynamoRio. I’m developing my own tool here to fulfil the above mentioned conditions so that it is useful during fuzzing (works on Windows and Linux, is as fast as possible and does not consume too much memory for huge applications). Please also bear in mind that AFL ships with Tmin, which also tries to reduce the input file size. However, it does this via a fuzzy approach which is very slow (and can’t handle checksums and file offsets so well). We can also verify that the black bytes don’t matter by just removing (zeroing them out) from the input file: The above figure shows that the output of both files is exactly the same, therefore the black bytes really don’t matter. Now we can start fuzzing with a much smaller input file – only 91KB instead of 27MB! PRACTICE: FUZZING RESULTS / READING THE AFL STATUS SCREEN I first started a simple fuzzer (not WinAFL) because I first had to modify the WinAFL code to only fuzz the identified bytes. I recommend to start fuzzing as soon as possible with a simple fuzzer and while it is running work on a better version of the fuzzer. My first fuzzer was a 50 LoC python script which flipped bytes, invoked mimikatz over WinDbg and parsed the output to detect crashes. I started 8 parallel fuzzing jobs with this fuzzer on my 12 core home system (4 cores for private stuff). Three days later I had 28 unique crashes identified by WinDbg. My analysis scripts reduced them to 4 unique bugs in code (the 28 crashes are just variations of the same 4 bugs in the code). Execution speed was approximately 2 exec/sec per job, therefore 16 exec/sec in total on all cores (which is really slow). The first exploitable bug from the next chapter was found after 3 hours, the second exploitable bug was not found at all with this approach. I also modified WinAFL to fuzz only the heatmap bytes. My first approach was to use the “post_handler” which can be used for test-case post-processing (e.g. fixing checksums, however, it’s better to remove the checksum verification code from the target application). Since this handler is not called for dry-runs (the first executions from AFL) this will not work. Instead, write_to_testcase() can be modified. Inside the main function I copy the full memory dump to a heap buffer. In the input directory I stored a file containing only the heatmap bytes and therefore the file size is much smaller like 91KB or 3KB. Next I added code to write_to_testcase where I copy all bytes from the input file over the heap buffer at the correct positions. Therefore, AFL only sees the small files but passes the correct memory dumps to mimikatz. This approach has a little drawback though. As soon as one queue entry has a different size, fuzzing could become ineffective, but for this PoC it was enough. Later I’m going to invoke the heatmap calculation for every queue entry if a heuristic detects this as more efficient. Please also bear in mind that AFL and WinAFL write every testcase to disk. This means we have to write a 27 MB file per execution (also per in-memory execution like a simple bit flip!!). From performance perspective it would be way better if we modify the testcase inside the target application in-memory (we already do in-memory fuzzing and therefore this is possible). Then we can also skip all the process-switches from WinAFL to the target application and back and then we really get the benefit of in-memory fuzzing! We can do this by injecting the AFL code (or even python code… into the target application, an area on which I’m also currently working on. Here is a screenshot of my WinAFL output (running on a RAMDisk): Look at the different stats of the fuzzer. What do you see? Is this type of fuzzing good or not? Should we stop or continue? Here is the same screenshot with some colored boxes: First the good thing: We already identified 16 unique crashes and a total of 137 unique paths (inputs which result in unique coverage; see the green area). Now the bad: Fuzzing speed is only 13 exec / sec (blue) which is extremly slow for AFL (but much faster than the self-written fuzzer which had 2 exec/sec per core). And now the really bad stuff: We are running for 14,5 hours and it didn’t finish the bitflip 1/1 stage yet for the first input file (and 136 others are in queue). You can see this in the red area because we just finished 97% of this stage. And after that 2/1 and 4/1 must be executed, then byte flips, arithmetics and so on. So we can see that continuing will not be very efficient. For demonstration I kept the fuzzer running, but modified my WinAppDbg script to filter better and started a new fuzzing job. The new WinAppDbg script reduced the number of “hot bytes” to 3KB (first we had 27 MB, then we had 91KB and now we just have 3KB to fuzz). This was possible because the new script does not count hits from search or copy functions, but also log access attempts from copied memory regions. This WinAppDbg script was approximatly 800 LoC (because I have to follow copies which go to heap and stack variables and I have to disable logging when the variables are freed). Here is a screenshot of the above job (with the “big” 91KB input files) after 2 days and 7 hours: We can see that the fuzzer finished bitflip 1/1 and 2/1 and is now inside 4/1 (with 98% finished). You can also see that the bitflip 1/1 strategy required 737k executions and 159 of these 737k executions resulted in new coverage (or crashes). Same for 2/1 where the stats are 29/737k. And we found 22 unique crashes in 2 days and 7 hours. And now the fuzzing of the smaller 3KB files after 2 hours: We can see that we already have 30 unique crashes after 2 hours! (Compare this with only 22 crashes after 2 days and 7 hours with the 91KB file!) We can also see that for example the bitflip 1/1 stage only required 17.9k executions (instead of 737k) because of the reduced input file size. Moreover, we can see that we found 245 unique paths and this with just 103k total executions (compared to 184 unique paths with 2.17 million executions from the 91KB test input). And now consider how long it would have taken us to fuzz the full 27MB file (!!) and what results we would have seen after some days of fuzzing. Now you should understand the importance of the input file size. Here is one more demonstration of the different input file sizes. The following animated image (created via Veles https://github.com/wapiflapi/veles) shows a visualization of the 27 MB original minidump file: And now the 27 MB minidump file where all unimportant (not read) bytes are replaced with 0x00 so we only see the 3 KB bytes which we fuzz with WinAFL; you maybe have to zoom in to see them. Look at the intersection of the 3 coordinate axes. However, in my opinion even the 3KB fuzzing job is too slow / inefficient. If we can start the job on multiple cores and let it run over one or two weeks, we should get into deeper levels and the result should be OK. But why are we even so slow? Why do we only have an execution speed of 13 or 16 exec / sec? Later queue entries will result in faster execution speeds because mimikatz will not execute the full code because the inputs trigger error-conditions (which result in execution speeds like 60 exec / sec). But on Linux we often deal with execution speeds of several thousand executions per second. A major point is that we have to load and search a 27MB file with every execution, so reducing this file size could really be a good idea (but requires lots of manual work). On the other hand, we can compare the execution speeds of different setups: Execution speed with WinAFL in syzygy mode: ~13 exec / sec Native execution speed without WinAFL and without instrumentation (in-memory): ~335 exec / sec Execution without WinAFL but with instrumented (syzygy) binary: ~50 exec / sec Execution of native binary (Instrumentation via DynamoRio drcov): ~163 exec / sec So we can see that syzygy instrumentation results in a slow-down factor of approximatly 6. And syzygy+WinAFL a factor of approximatly 25. This is mainly because of the process switches and file writes / reads because we are not doing full in-memory fuzzing, but there is also another big problem! We can see that instrumentation via DynamoRio is faster than syzygy (163 exec/sec vs. 50 exec/sec) and we can also start WinAFL with the DynamoRio mode (which does not require source code at all). If we do this, we get an execution speed of 0,05 – 2 exec/sec with WinAFL. At this point you should recognize that something is not working correctly because DynamoRio mode should be much faster! The reason can be found in our modification of the kull_m_sekurlsa_all() function. I added the two code lines kuhl_m_sekurlsa_reset() and pMinidumpName = argv[1] at the start because this is exactly what the “sekurlsa::minidump” command is doing. What I wanted to achieve is to immediatly execute the “sekurlsa::minidump” command and after that the “sekurlsa::logonpasswords” command and that’s why I used this sequence of code calls. However, this is a huge problem because we exit the function (DynamoRio mode) or the __afl_persistent_loop (syzygy mode) with a state where the input file is still open! This is because we call the kuhl_m_sekurlsa_reset() function at the start of the function and not at the end! That means that we only execute one “execution” in-memory, then WinAFL tries to modify the input file, detects that the file is still open and can’t be written to and then terminates the running process (via TerminateProcess from afl-fuzz.c:2186 in syzygy mode or nudges in DynamoRio mode from afl-fuzz.c:2195) to write to the file. Therefore, we do not do real in-memory fuzzing because we execute the application one time and then the application must be closed and restarted to write the next test case. That’s why the DynamoRio mode is so slow because the application must be started again for every testcase which means the application must be instrumented again and again (DynamoRio is a dynamic instrumentation framework). And because syzygy instrumented statically it’s not affected so heavily by this flaw (it still has to restart the application, but does not have to instrument again). Let’s fix the problem by reordering kuhl_m_sekurlsa_reset() to the end of the function: NTSTATUS kuhl_m_sekurlsa_all(int argc, wchar_t *argv[]) { While(__afl_persistent_loop()) { pMinidumpName = argv[1]; kuhl_m_sekurlsa_getLogonData(lsassPackages, ARRAYSIZE(lsassPackages); kuhl_m_sekurlsa_reset(); } Return NT_SUCCESS; } If we execute this, we still face the same problem. The reason is a bug inside mimikatz in the kuhl_m_sekurlsa_reset() function. Mimikatz opens the input file via three calls: CreateFile CreateFileMapping MapViewOfFile Therefore, we have to close all these three handles / mappings. However, mimikatz fails at closing the CreateFile handle. Here is the important code from kuhl_m_sekurlsa_reset() case KULL_M_MEMORY_TYPE_PROCESS_DMP: toClose = cLsass.hLsassMem->pHandleProcessDmp->hMinidump; break; ... } cLsass.hLsassMem = kull_m_memory_close(cLsass.hLsassMem); CloseHandle(toClose); Kull_m_memory_close() correctly closes the file mapping, but the last CloseHandle(toClose) call should close the handle received from CreateFile. However, toClose stores a heap address from (kull_m_minidump_open()): *hMinidump = (PKULL_M_MINIDUMP_HANDLE) LocalAlloc(LPTR, sizeof(KULL_M_MINIDUMP_HANDLE)); That means the code calls CloseHandle on a heap address and never calls CloseHandle on the original file handle (which gets never stored). After fixing this issue it starts to work and WinAFL gets 30-50 exec / sec! However, these executions are very inconsistent, sometimes drop down to under 1 execution per second (when the application must be restarted like after crashes). Because of this we got overall better fuzzing performance with the syzygy mode (which now has a speed of ~25 exec / sec) which also uses edge coverage. Screenshot of WinAFL DynamoRio mode (please bear in mind that the default mode is basic block coverage and not edge coverage with DynamoRio): Screenshot of WinAFL syzygy mode: Even if DynamoRio mode has a higher execution speed (29.91 exec / sec vs. 24.02 exec / sec) it’s in total slower because the execution speed is inconsistent. We can see this because DynamoRio mode is running since 25 minutes and just got total executions of 13.9k and syzygy mode just runs for 13 minutes but already got 16.6k executions. We can see that currently it’s more efficent to fuzz with syzygy mode if source code is available (especially if the target application crashes very often). And also very important: We had a bug in the code which slowed down the fuzzing process (at least by a factor of 2) and we didn’t see it during syzygy fuzzing! (A status screen entry with in-memory executions vs application restarts would be a great feature!) Please also note that the WinAFL documentation explicitly mentions this in the “How to select a target function” chapter: Close the input file. This is important because if the input file is not closed WinAFL won’t be able to rewrite it. Return normally (So that WinAFL can “catch” this return and redirect execution. “returning” via ExitProcess() and such won’t work) The second point is also very important. If an error condition triggers code which calls ExitProcess during fuzzing, we also end up starting the application again and again (and do not get the benefit of in-memory fuzzing). This is no problem with mimikatz. However, mimikatz crashes very often (e.g. 1697 times out of 16600 executions) and with every crash we have to restart the application. This mainly affects the performance of the DynamoRio mode because of the dynamic instrumentation and then it’s better to use the syzygy mode. Please also note that we can “fix” this by storing the stack pointer in the pre-fuzzing-handler of DynamoRio, implementing a crash handler where we restore the stack and instruction pointer, free the file mappings and handles and just continue fuzzing as if nothing had happend. Memory leaks can also be handled by hooking and replacing the heap implementation to free all allocations from the fuzzing-loop. Only global variables could become a probem, but this discussion goes too far here. At the end I started a fuzzing job with syzygy mode, one master (deterministic fuzzing) and 7 slaves (non-deterministic fuzzing) and let it run for 3 days (plus one day with page heap). In total I identified ~130 unique AFL crash signatures, which can be reduced to 42 unique WinDbg crash signatures. Most of them are not security-relevant, however, two crashes are critical. PRACTICE: ANALYSIS OF THE IDENTIFIED CRASHES Vulnerability 1: Arbitrary partial relative stack overwrites In this chapter I only want to describe the two critical crashes in depth. After fuzzing I had several unique crashes (based on the callstack of the crash) which I sorted with a simple self-written heuristic. From this list of crashes I took the first one (the one where I thought it’s the most likely to be exploitable) and analysed it. Here are the exact steps: This is the code with the vulnerability: The variable “myDir” is completly under our control. myDir points into our memory dump, we can therefore control all content of this struct. You maybe want to find the problem yourself before continuing, here are some hints: The argument length is always 0x40 (64) which is exactly the length of the destination argument buffer The argument source is also under our full control Obviously we want to reach RtlCopyMemory() from line 29 with some malicious arguments Now try to think how we can exploit this. My thoughts Step-by-Step: I want to reach line 29 (RtlCopyMemory() and therefore the IF statement from line 9-13 must be true. We can exploit the situation if “lengthToRead” (size argument) from the RtlCopyMemory call is bigger than 0x40 or if offsetToWrite is bigger than 0x40. The second case would be better for us because such relative writes are extremely powerful (e.g. just partially overwriting return addresses to bypass ASLR or to skip stack cookies and so on). So I decided to try exactly this, somehow control “offsetToWrite” which is only set in line 18 and 23. Line 23 is bad because it sets it to zero, so we take line 18. Since we want to come to line 18, the IF statement from line 15 must become true. First condition: Source < memory64→StartOfMemory This is no problem because we control both values completely. So far, so simple. Now lets check the first IF statement (line 9-13). One of the three conditions (line 10, 11 or 12) must evaluate to true. Lets focus on line 12 because from the IF from line 15 we know that Source < memory64→StartOfMemory must be true which is exactly what the first check is here for. So this one is true. That means we only have to ensure the second check: Second condition: Source + Length > (memory64→StartOfMemoryRange + memory64→DataSize) I leave this condition for one moment and think about line 25-27. What I want in RtlCopyMemory is to get as much control as possible. That means I want to control the target address (if possible, relative to an address to deal with ASLR), the source address to point into my own buffer and the size – that would really be great. The size is exactly lengthToRead and this variable can be set via line 25 and 27. 25 would be a bad choice because Length is fixed and offsetToWrite is already “used” to control the target address. So we must come to line 27. OffsetToRead is always zero because of line 17 and therefore memory64→DataSize completly controls lengthToRead. Now we can fix the next value and say that we want memory64→DataSize to be always 1 (to make 1-byte partial overwrites; or we set it to 4 / 8 to control full addresses). Now we are ready and take the second condition and fill in the known values. What we get is: Second condition: Source + 0x40 > (memory64→StartOfMemoryRange + 1) This check (together with the first condition) is exactly the check which should ensure that dataSize (the number of bytes which we write) is within the size of the buffer (0x40). However, we can force an integer overflow :). We can just set memory64→StartOfMemoryRange to 0xffffffffffffffff and therefore we survive the check because after adding one to it we get zero and that means Source + 0x40 is always bigger than zero (If we don’t overflow source+0x40). At the same time we survive the first condition because source will be smaller than 0xffffffffffffffff. Now we can also control offsetToWrite via line 18: offsetToWrite = 0xffffffffffffffff – source You might think that we can’t fully control offsetToWrite because the source variable is on x86 mimikatz just 32-bit (and the 0xffff… is 64-bit), however, because of integer truncation the upper 32-bit will be removed and therefore we can really address any byte in the address space relative to the destination buffer (on x64 mimikatz it’s 64-bit and therefore it’s also no problem). This is an extremely powerful primitive! We can overwrite bytes on the stack via relative offsets (ASLR is no obstacle here), we have full control over the size of the write-operation and full control over the values! The code can also be triggered multiple times (loop at line 7 is not useful because the destination pointer is not incremented) via loops which call this function (with some limitations). The next question is what offset do we use? Since we write relative to a stack variable we are going to target a return address. The destination variable is passed as argument inside the kull_m_process_ntheaders() function and this is exactly where we place a breakpoint on the first kull_m_memory_copy() call. Then we can extract the “destination” argument address and the address where the return address is stored and subtract them. What we get is the required offset to overwrite the return address. For reference, here are the offsets which I used for my dump file (dump files from other operating systems require very likely different offsets). Offset 0x6B4 stores the “source” variable. I use 0xffffff9c here because this is exactly the correct offset from “destination” to “return address” on the current release version of mimikatz (mimikatz 2.1.1 from Aug 13 2017; However, this offset should also work for older versions!). Offset 0xF0 stores the “myDir” struct content. The first 8 bytes store the NumberOfMemoryRanges (this controls the number of loops and therefore how often we can write). Since we just want to demonstrate the vulnerability we set it to one to make exactly one write operation (overwrite the return address). The next 8 bytes are the BaseRVA. This value controls “ptr” in line 6 which is the source from where we copy stuff. So we have to store there the relative offset in our dump file where we store the new return address (which should overwrite the original one). I’am using the value 0x90 here but we can use any value. It’s only important that we store the new return address at that offset (in my case offset 0x90) then. I therefore wrote 0x41414141 to offset 0x90. The next 8 bytes control the “StartOfMemoryRange” variable. I originally used 0xffffffffffffffff (like in the above example), however, for demonstration purpose I wanted to overwrite 4 bytes of the return address (and not only 1) and therefore had to subtract 4 (the DataSize, check the second condition in line 12). The next 8 bytes control the “DataSize” and as already explained I set this to 4 to write 4 bytes. Here is the file: Hint: In the above figure the malicious bytes start at offset 0xF0. This offset can differ in your minidump file. If we check the byte at 0x0C (= 0x20) we can see that this is the offset to the “streams” directory. Therefore, the above minidump has a streams directory starting at offset 0x20. Every entry there consists of 3 DWORDS (on x86), the first is the type and the last is the offset. We search for the entry with type 0x09 (=Memory64ListStream). This can be found in our figure at offset 0x50. If we take the 3rd DWORD from there we can see that this is exactly 0xF0 – the offset where our malicious bytes start. If this offset is different in your minidump file you may want to patch it first. And here is the proof that we really have control over the return address: Please note that full exploitation is still tricky because we have to find a way around ASLR. Because of DEP our data is marked as not executable, therefore we can’t jump into shellcode. The default bypass technique is to apply ROP which means that we instead invoke already existing code. However, because of ASLR this code is always loaded at randomized addresses. And here is what we can do with the above vulnerability is: We can make a relative write on the stack which means we can overwrite the return address (or arguments or local variables like loop counters or destination pointers). Because of the relative write we can bypass stack ASLR. Next we can choose the size of the write operation which means we can make a partial overwrite. We can therefore overwrite only the two lower bytes of the return address which means we can bypass module level ASLR (These two conditions make the vulnerability so useful). We can check the ranges which we can reach by checking the call-stack to the vulnerable code (every return address on this stack trace can be targeted via a partial overwrite) and we have multiple paths to come to the vulnerable code (and therefore different call-stacks with different ranges; We can even create more different call-stacks by overwriting a return address with an address which creates another call-stack to the vulnerable code). For example, a simple exploit could be to overwrite one of the return addresses with the address of code which calls LoadLibraryA() (e.g. 0x455DF3 which is in range) or LoadLibraryW (e.g. 0x442145). The address must be chosen in a way that the function argument is a pointer to the stack because then we can use the vulnerability to write the target path (UNC Path to a malicious library) to this address on the stack. Next, this exploit could be extended to first call kull_m_file_writeData() to write a library to the file system which gets later loaded via LoadLibrary (this way UNC paths are not required for exploitation). Another idea would be to make a specific write which exchanges the destination and source arguments from the vulnerable code. Then the first write operations can be used to write the upper bytes of return addresses (which are randomized by ASLR) to the memory dump buffer. After that these bytes can be written back to the stack (with the vulnerability) and full ROP chains can be built because we can now write ROP gadget addresses after each other. Without this idea we cannot execute multiple ROP gadgets after each other because they are not stored adjacent on the stack (return addresses are not stored next to each other on stack because there is other memory stored there like arguments, local variables and so on). However, I believe that this exploitation scenario is much more difficult because it requires multiple writes (which must be sorted to surive all checks in the mimikatz code), so the first approach with LoadLibrary should be more simple to implement. Vulnerability 2: Heap overflow The major cause of the second vulnerability can be found inside kull_m_process_getUnicodeString(). The first parameter (string) is a structure with the fields buffer (data pointer), a maximum length of bytes the string can hold and the length (currently stored characters). The content is completely under attacker control because it is parsed from the minidump. Moreover, the source (the second argument) also points to the minidump (attacker controlled). Mimikatz always extracts the string structure from the minidump and calls after that kull_m_process_getUnicodeString() to fill string->buffer with the real string from the minidump. Can you spot the problem? Line 9 allocates space for string->MaximumLength bytes and after that it copies exactly the same number of bytes from the minidump to the heap (line 11). However, the code never checks string->Length and therefore string->Length can be bigger than string->MaximumLength because this value is retrieved from the minidump. If later code uses string->Length a heap overflow can occur. This is for example the case when MSV1.0 (dpapi) credentials are stored in the minidump. Then mimikatz uses the string as input (and output) inside the decryption function in kuhl_m_sekurlsa_nt6_LsaEncryptMemory() and the manipulated length value leads to a heap overflow (however, the MSV1.0 execution path is not trivial/possible to exploit in my opinion, but there are many other paths which use kull_m_process_getUnicodeString()). Vulnerabilities patch status At time of publication (2017-09-22) there is no fix available for mimikatz. I informed the author on 2017-08-26 about the problems and received on 2017-08-28 a very friendly answer, that he will fix the flaws if it does not expand the code base too much. He also pointed out that mimikatz was developed as a proof-of-concept and it could be more secure by using a higher level language on which I totally agree on. On 2017-08-28 I sent the SMIME encrypted vulnerability details (together with this blog post). Since I didn’t receive answers on further emails or twitter messages, I informed him on 2017-09-06 about the blog post release on 2017-09-21. If you are a security consultant and using mimikatz in minidump mode, make sure to only use it inside a special mimikatz-virtual machine which is not connected to the internet/intranet and does not store important information (I hope you are already doing this anyway). To further mitigate the risk (e.g. a VM escape) I recommend to fix the above mentioned vulnerabilities. THEORY: RECOMMENDED FUZZING WORKFLOW In this last chapter, I want to quickly summarize the fuzzing workflow which I recommend: Download as many input files as possible. Calculate a minset of input files which still trigger the full code/edge coverage (corpus distillation). Use a Bloom-Filter for fast detection of different coverage. Minimize the file size of all input files in the minset. For the generated minset, calculate the code coverage (no Bloom-Filter). Now we can statically add breakpoints (byte 0xCC) at all basic blocks which were not hit yet. This modified application can be started with all files from the input and should not crash. However, we can continue downloading more files from the internet and start the modified binary (or just fuzz it). As soon as the application crashes (and our Just-in-Time configured compiler script kicks in) we know that a new code path was taken. Using this approach, we can achieve native execution speed but extract the information about new coverage! (downside: Checksums from files break; Moreover, a fork-server should also be used.) During fuzzing I recommend extraction of edge coverage (which requires instrumentation) and therefore we should fuzz with instrumentation (and sanitizers / heap libraries). For every input, we conduct an analysis phase before fuzzing. This analysis phase does the following: First identify the common code which gets executed for most inputs (for this we had to log the code coverage without the bloom-filter). Then get the code coverage for the current input and subtract the common code coverage from it. What we get is the code coverage which makes this fuzzing input “important” (code which only gets executed for this input). Next, we start our heatmap analysis but we just log the read operations conducted by this “important code”! What we get from this are the bytes which make our input “important”. Now we don’t have to fuzz the full input file, nor we have to fuzz the bytes which are read by the application, instead we only have to fuzz the few bytes which make the file “special”! I recommend focusing on these “special” bytes but also fuzz the other bytes afterwards (Fuzzing the special bytes fuzzes the new code, fuzzing all bytes fuzzes the old code with the new state resulting from the new code). Moreover, we can add additional slow checks which must be done only once for a new input (e.g. log all heap allocations and check for dangling-pointers after a free-operation; similar concept to Edge’s MemGC). Of course, some additional feedback and symbolic execution. Want to learn more about fuzzing? Come to one of my fuzzing talks (heise devSec or IT-SeCX for fuzzing applications with source code available and DefCamp for fuzzing closed-source applications!) or just follow me on Twitter. Sursa: https://www.sec-consult.com/en/blog/2017/09/hack-the-hacker-fuzzing-mimikatz-on-windows-with-winafl-heatmaps-0day/index.html 2 Quote