Search the Community

Showing results for tags 'cpu'.

Found 2 results

  1. There is a special type of DDoS attack, application-level DDoS, which is quite hard to combat. The logic that filters this type of attack must operate at the HTTP message level, so in most cases it is implemented as a custom module for an application-layer (nowadays usually user-space) HTTP accelerator, and Nginx is surely the most widespread platform for such solutions. However, common HTTP servers and reverse proxies were not designed for DDoS mitigation - they are simply the wrong tools for the job. One of the reasons is that they are too slow to cope with massive traffic (see my recent paper and presentation for other reasons).

     If logging is switched off and all content is in the cache, the HTTP parser becomes the hottest spot. Simplified perf output for Nginx under a simple DoS is shown below (Nginx's calls begin with the 'ngx' prefix; memcpy and recv are standard GLIBC calls):

         %      symbol name
         1.5719 ngx_http_parse_header_line
         1.0303 ngx_vslprintf
         0.6401 memcpy
         0.5807 recv
         0.5156 ngx_linux_sendfile_chain
         0.4990 ngx_http_limit_req_handler

     The next hot spots are linked to complicated application logic (ngx_vslprintf) and I/O.

     During Tempesta FW development we studied several HTTP servers and proxies (Nginx, Apache Traffic Server, Cherokee, node.js, Varnish and userver) and learned that all of them use switch- and/or if-else-driven state machines. The problem with this approach is that the HTTP parsing code is comparable in size to the L1i cache and processes one character at a time with a significant number of branches. Modern compilers optimize large switch statements into lookup tables, which minimizes the number of conditional jumps, but branch mispredictions and instruction cache misses still hurt the performance of such a state machine, so the method probably performs poorly.

     The other well-known approach is a table-driven automaton. However, a simple HTTP parser can have more than 200 states and an alphabet cardinality of 72. That gives 200 x 72 = 14400 bytes for the table, which is about half of the L1d cache of a modern microprocessor, so this approach can also be considered inefficient due to its high memory consumption.

     The first obvious alternative to the plain state machine is the Hybrid State Machine (HSM) described in our paper, which combines a very small table with an equally small switch statement. In our case we encode the outgoing transitions of a state with at most 4 ranges; if a state has more outgoing transitions, all transitions beyond those 4 must be encoded in the switch. All actions (like storing HTTP header names and values) are also performed in the switch. Using this technique we can encode each state in only 16 bytes, i.e. one cache line can hold 4 states, so the approach should significantly improve the data cache hit rate (a sketch of such an encoding follows below).
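     The paper itself is not quoted here, so as a rough picture only: a minimal sketch of how a 16-byte HSM state with up to 4 range transitions might be laid out. The field names and layout are my own illustration, not the actual Tempesta FW structures:

         #include <stdint.h>

         typedef struct {
                 /* Up to 4 outgoing transitions; each matches a character
                  * range [start, start + span] and leads to state @next. */
                 struct {
                         uint8_t start; /* first character of the range */
                         uint8_t span;  /* range length minus one */
                         uint8_t next;  /* target state index */
                 } rng[4];
                 uint8_t flags;         /* e.g. "more transitions in the switch" */
                 uint8_t pad[3];        /* pad to 16 bytes = 1/4 cache line */
         } hsm_state_t;

         /* One unsigned comparison checks a whole range:
          * (uint8_t)(c - start) <= span holds iff start <= c <= start + span. */
         static inline int
         hsm_match(const hsm_state_t *s, int i, uint8_t c)
         {
                 return (uint8_t)(c - s->rng[i].start) <= s->rng[i].span;
         }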
     We also know that Ragel generates perfect automatons and combines case labels in a switch statement with direct goto labels (it seems the switch is used to be able to enter the FSM from any state, i.e. to be able to process chunked data). Such automatons spend fewer cycles in the loop and are a bit faster than the traditional one-loop-iteration-per-transition approach. There was a successful attempt to generate simple HTTP parsers using Ragel, but those parsers are limited in functionality. However, there are also several research papers which say that automaton states are just auxiliary information, and that an automaton can be significantly accelerated if the state information is dropped.

     So the second interesting opportunity for generating the fastest HTTP parser is to encode the automaton directly using simple goto statements, even without any explicit loop.

     Basically, an HTTP parser just matches a string against a set of characters (e.g. [A-Za-z_-] for header names), which is what strspn(3) does. SSE 4.2 provides the PCMPSTR instruction family for this purpose (GLIBC since 2.16 uses an SSE 4.2 implementation of strspn()). However, these are vector instructions which don't support accept or reject sets of more than 16 characters, so they are not very usable for HTTP parsers. A minimal sketch of the goto-driven approach follows.
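     This is not the actual http_goto.c from the benchmark, just a minimal sketch of the idea under a simplified accept set for header names: the automaton is encoded directly with labels and goto, with no explicit loop statement.

         /* Return the header name length of a "Name: value" line,
          * or -1 if the line is malformed. */
         static int
         parse_header_name(const unsigned char *p, const unsigned char *end)
         {
                 const unsigned char *name = p;

         name_char:
                 if (p == end)
                         return -1;
                 /* Accept set for header names: [A-Za-z0-9-]. */
                 if ((*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z')
                     || (*p >= '0' && *p <= '9') || *p == '-') {
                         ++p;
                         goto name_char; /* same state, next character */
                 }
                 if (*p == ':')
                         goto done;      /* transition to the value state */
                 return -1;              /* unexpected character */

         done:
                 return (int)(p - name);
         }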
     Results

     I made a simple benchmark of the four approaches described above (http_ngx.c - Nginx HTTP parsing routines, http_table.c - table-driven FSM, http_hsm.c - hybrid state machine, http_goto.c - simple goto-driven FSM). Here are the results (routines with 'opt' or 'lw' are optimized or lightweight versions of the functions):

     Haswell (i7-4650U)

         Nginx HTTP parser:
             ngx_request_line:      730ms
             ngx_header_line:       422ms
             ngx_lw_header_line:    428ms
             ngx_big_header_line:   1725ms
         HTTP Hybrid State Machine:
             hsm_header_line:       553ms
         Table-driven Automaton (DPI):
             tbl_header_line:       473ms
             tbl_big_header_line:   840ms
         Goto-driven Automaton:
             goto_request_line:     470ms
             goto_opt_request_line: 458ms
             goto_header_line:      237ms
             goto_big_header_line:  589ms

     Core (Xeon E5335)

         Nginx HTTP parser:
             ngx_request_line:      909ms
             ngx_header_line:       583ms
             ngx_lw_header_line:    661ms
             ngx_big_header_line:   1938ms
         HTTP Hybrid State Machine:
             hsm_header_line:       433ms
         Table-driven Automaton (DPI):
             tbl_header_line:       562ms
             tbl_big_header_line:   1570ms
         Goto-driven Automaton:
             goto_request_line:     747ms
             goto_opt_request_line: 736ms
             goto_header_line:      375ms
             goto_big_header_line:  975ms

     The goto-driven automaton shows the best performance in all tests on both architectures, and it is also much easier to implement than the HSM. So in Tempesta FW we migrated from the HSM to a goto-driven automaton, with some additional optimizations.

     Lessons Learned

     ** Haswell has a very good BPU **

     On the Core microarchitecture the HSM behaves much better than the switch-driven and table-driven automatons. This is not the case for Haswell, where it loses to both of them. I tried many optimization techniques to improve HSM performance, but the results above are the best, and they are still worse than the simple FSM approaches. The profiler shows that the hot spot of the HSM on Haswell is the following code:

         if (likely((unsigned char)(c - RNG_CB(s, 0)) <= RNG_SUB(s, 0))) {
                 st = RNG_ST(s, 0);
                 continue;
         }

     Here we extract the transition information and compare the current character with the range. In most cases only this first branch is observed in the test; the 3rd and 4th branches are never taken, and the whole automaton is encoded in only 2 cache lines. In the first test case, when the XTrans.x structure is dereferenced to access the ranges, the compiler generates 3 pointer dereferences. In fact these instructions (part of the disassembled branch)

         sub    0x4010c4(%rax),%bl
         cmp    0x4010c5(%rax),%bl
         movzbl 0x4010cc(%rax),%eax

     produce 3 accesses to L1d, and on Haswell the cache has very limited bandwidth per cycle (64 bytes for reading and 32 bytes for writing) with a minimal latency of 4 cycles - even though all the instructions access the same single cache line. So the bottleneck of this test case is L1d bandwidth.

     If we instead use the XTrans.l longs (we need only l[0], which can be loaded with a single L1d access, in all cases) and use bitwise operations to extract the data, then we get a lower number of L1d accesses (4G vs 6.7G for the previous case), but branch mispredictions increase. The problem is that the more complex statement in the condition makes it harder for the Branch Prediction Unit to predict the branches.

     However, we can see that simple branches (in the switch-driven and goto-driven automatons) show perfect performance on Haswell. So the advanced Haswell BPU handles simple automatons perfectly, making the complex HSM inadequate. In fact the HSM is the only test which is slower on Haswell than on the Core Xeon. Probably this is the difference between server and mobile chips: even an old server processor beats a modern mobile CPU on complex loads...

     ** -O3 is ambiguous **

     Sometimes -O3 (GCC 4.8.2) generates slower code than -O2, and the -O3 benchmarks show very strange and unexpected results. For example, this is the result for -O2:

         goto_request_line: 470ms

     However, -O3 shows a worse result:

         goto_request_line: 852ms

     ** Automata must be encoded statically whenever possible **

     The table-driven and HSM automatons are encoded using static constant tables (in contrast to the run-time generated tables of the current DPI parser). This was done during the HSM optimizations. Sometimes the compiler can't optimize code that uses run-time generated tables, and this is crucial for real hot spots (for the HSM, the table is used in the if-statement described above, which takes about 50-70% of the whole function's execution time): after moving to static data, the code gained up to a 50% performance improvement (as was the case for the HSM). A minimal illustration is sketched right after the references.

     Source: High Performance Linux: Fast Finite State Machine for HTTP Parsing

     Refs:
     - Tempesta FW is a hybrid solution which combines a reverse proxy and a firewall at the same time. It accelerates Web applications and provides a high-performance framework with access to all network layers for running complex network traffic classification and blocking modules - http://natsys-lab.com/tpl/tempesta_fw.pdf
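     As promised above, a minimal illustration of the static-encoding lesson (the names are mine, not from the benchmark; range designators like ['A' ... 'Z'] are a GCC extension):

         /* The accept set for header-name characters is known at compile
          * time, so it lives in a static const table in read-only memory
          * instead of being generated at run time. */
         static const unsigned char header_name_chars[256] = {
                 ['A' ... 'Z'] = 1,
                 ['a' ... 'z'] = 1,
                 ['0' ... '9'] = 1,
                 ['-']         = 1,
         };

         /* The hot-path test then compiles to a single load and branch. */
         static inline int
         is_header_name_char(unsigned char c)
         {
                 return header_name_chars[c];
         }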
  2. After reading this tutorial you will know the following things:

     - Physical memory
     - Virtual memory
     - Memory addressing (basics)

     I will use the simplest possible terms so that everyone understands how this works.

     Physical memory

     Physical memory refers to the physical hardware where data is stored. It can be classified into two types: active memory and inactive memory.

     Physical active memory (RAM): this is where programs are kept while they are being executed. The term is used to describe the total amount of RAM installed in the computer. It can be compared to a desk on which you work with various documents.

     Physical inactive memory (hard drive): this is where data is stored long-term; the term is used to describe accessing the data kept on the hard disk. It can be compared to a library.

     Virtual memory

     Virtual memory refers to an architecture that can simulate a memory space larger than the physical active memory (RAM). Suppose a virtual space is defined with addresses from 1 to 10; this space can be used to load programs when they are executed. Many of you will now ask: if the memory is virtual, where does it actually store the data? Well, virtual memory stores the data in RAM and in SWAP (which lives on the hard disk).

     As I said, memory consists of addresses where data can be stored: when the CPU requests certain data for processing, it must know exactly where to fetch it from, so memory is divided into addresses. The addresses of virtual memory correspond to addresses of physical memory (RAM); virtual memory is backed by the same physical RAM, only the address numbers differ. For example, virtual address 1 may correspond to address 11 in RAM. When a program has to store a variable at address 1 of virtual memory, this address is translated to address 11 and the variable is stored in RAM; if RAM is full, it is stored in SWAP until space is freed in RAM. The one thing you should remember about virtual memory is that it is a virtual space, and whatever is stored in it actually ends up in RAM or in SWAP: this virtual space only hands out memory addresses to programs, and those addresses are then translated and backed by RAM or SWAP. The translation of virtual addresses to physical addresses is done by the MMU, a hardware component: the MMU translates virtual addresses into physical addresses and handles every memory access request made by the CPU. (A toy sketch of such a translation is shown below.)

     Memory addressing (basics)

     Note: I will only define a few basic concepts here, because in this tutorial I will not explain the various types of segments (those who have done ASM or C know what I mean; I will cover memory segmentation in another tutorial, where I will explain the stack, heap, bss, text and data concepts).

     Suppose we have an executable, program1. When this program is executed, it is loaded into memory. Say program1 is 5 MB; you will probably wonder how much memory it occupies when executed. Program1 will occupy 5 MB of memory plus whatever memory was allocated when it was written. For example, if program1 stores a variable in memory, then when the program is executed it will occupy 5 MB + the memory of the declared variable + any other memory that was declared, even if it is never used.
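     Here is a toy C sketch (not from the original tutorial) of how an MMU-style translation splits a virtual address into a page number and an offset, then looks the page up in a table; the 4 KB page size and the table contents are illustrative assumptions, not how a real MMU is programmed:

         #include <stdint.h>
         #include <stdio.h>

         #define PAGE_SIZE 4096u

         /* Hypothetical tiny page table: virtual page -> physical frame. */
         static const uint32_t page_table[4] = { 7, 3, 11, 2 };

         static uint32_t translate(uint32_t vaddr)
         {
                 uint32_t page   = vaddr / PAGE_SIZE; /* which virtual page */
                 uint32_t offset = vaddr % PAGE_SIZE; /* position inside it */
                 /* Toy code: no bounds or presence check on the page. */
                 return page_table[page] * PAGE_SIZE + offset;
         }

         int main(void)
         {
                 /* Virtual 4100 = page 1, offset 4 -> frame 3 = 12292. */
                 printf("virtual %u -> physical %u\n", 4100u, translate(4100u));
                 return 0;
         }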
     Where will this program be stored? In which memory? Each piece of the program goes to a certain address in virtual memory, which the MMU translates into an address in physical RAM; if physical RAM is already occupied by other programs, the data is stored in swap, and once memory is freed in RAM it moves into RAM.

     What does a memory address look like? We will create the simplest possible program to demonstrate it:

         #include <stdio.h>

         int main(void)
         {
                 /* Declare the variable var */
                 int var;

                 /* Print the in-memory address of the variable var */
                 printf("Address of var is %p\n", (void *)&var);

                 return 0;
         }

     Once executed, this program produced the following result:

         Address of var is 0xbffff9ec

     The address of this program's variable is 0xbffff9ec, which translated to decimal is 3221223916. In memory the address looks like this:

         10111111111111111111100111101100

     As you can see, I decided how the address of var should be printed; if anyone prefers to print it differently, they can do so. Example (the casts keep the compiler happy on this 32-bit system):

         #include <stdio.h>

         int main(void)
         {
                 /* Declare the variable var */
                 int var;

                 /* Print the in-memory address of var in various formats */
                 printf("Address of var is %p\n", (void *)&var);   /* pointer */
                 printf("Address of var is %X\n", (unsigned)&var); /* capital hex */
                 printf("Address of var is %x\n", (unsigned)&var); /* hex */
                 printf("Address of var is %u\n", (unsigned)&var); /* decimal */

                 return 0;
         }

     Output:

         Address of var is 0xbffff9ec
         Address of var is BFFFF9EC
         Address of var is bffff9ec
         Address of var is 3221223916

     The variable var lives in virtual memory at address 3221223916, in this case inside a memory segment called the stack (memory can be divided into several segments, but as I said above I will not go into detail here); not all variables live on the stack. The MMU translates address 3221223916 into a physical address in RAM; if RAM had been full, the MMU would have translated the virtual address 3221223916 into a physical address in SWAP. I created a diagram which explains where and how the memory of executed programs is stored: as you can see in the image, virtual memory is represented by just a table of addresses which the MMU translates into other addresses in physical memory (RAM), or in SWAP when RAM is occupied.

     If you have questions, feel free to ask. In this tutorial I used as few technical terms as possible to explain how virtual and physical memory work; if you did not understand something, don't hesitate to ask.
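     As a small addition (not from the original tutorial), the same %p trick can show that variables live in different segments - a stack local, a heap allocation and a global:

         #include <stdio.h>
         #include <stdlib.h>

         int global_var; /* data/bss segment */

         int main(void)
         {
                 int stack_var;                            /* stack */
                 int *heap_var = malloc(sizeof *heap_var); /* heap  */

                 printf("stack:  %p\n", (void *)&stack_var);
                 printf("heap:   %p\n", (void *)heap_var);
                 printf("global: %p\n", (void *)&global_var);

                 free(heap_var);
                 return 0;
         }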