Elohim Posted February 19, 2014 (edited)
To work, it needs a file of domains, with or without the http:// prefix at the start.
Usage: python emailscrapper.py Threads File
Example: python emailscrapper.py 50 domenii.txt
The emails are saved to emails.txt.
UPDATE 1.3:
- complete isolation of the processes
- dropped threading; it now runs on fully parallel processes
- more reliable collection of results
- 2-3x faster than the previous version
The minimum version required is Python 2.6.
For customized versions, my Jabber is in the code.

"""
RST eMail Crawler
Version: 1.3
Author: Elohim
Contact Jabber: viktor@rows.io
"""
import urllib2
import re
import sys
import cookielib
from threading import Timer
from multiprocessing import Process, Queue


class GetResults(Process):
    # Single writer process: takes emails off the result queue and appends them to emails.txt.
    def __init__(self, rezqueue):
        Process.__init__(self)
        self.rezqueue = rezqueue

    def run(self):
        while True:
            email = self.rezqueue.get()
            if email is None:  # None is the stop marker
                return False
            with open("emails.txt", "a") as EmailFile:
                EmailFile.write(email.rstrip() + "\n")
            print email


class Crawler(Process):
    # Worker process: takes sites off the work queue until it receives None.
    def __init__(self, queue, rezqueue):
        Process.__init__(self)
        self.queue = queue
        self.rezqueue = rezqueue

    def run(self):
        while True:
            site = self.queue.get()
            if site is None:
                return False
            self.crawl(site)

    def crawl(self, site):
        try:
            WatchIt = Timer(15.0, self.WatchDog)
            WatchIt.start()
            cj = cookielib.CookieJar()
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            opener.addheaders = [('Accept:', '*'), ("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0")]
            # Note: this second assignment replaces the list set on the previous line.
            opener.addheaders = [('Content-Type', 'text/html; charset=utf-8'), ("Accept-Encoding", "")]
            resp = opener.open(site, timeout=10)
            WatchIt.cancel()
            self.getem(resp.read())
        except Exception, e:
            # print e
            pass  # errors for a single site are ignored

    def getem(self, resp):
        try:
            emails = re.findall(r"[A-Za-z0-9%&*+?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[a-z0-9-]*[A-Za-z0-9])?\.)+(?:[A-Za-z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b", str(resp))
            CleanEmails = set(emails)  # drop duplicates found on the same page
            for em in CleanEmails:
                self.rezqueue.put(em.lower())
        except Exception, e:
            return False

    def WatchDog(self):
        return False


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage:", sys.argv[0], "Threads DomainFile.txt"
        print "\tExample: ", sys.argv[0], "30 domains.txt"
        sys.exit()

    queue = Queue(maxsize=30000)
    rezqueue = Queue()

    ThreadNumber = int(sys.argv[1])
    ThreadList = []
    for i in range(ThreadNumber):
        t = Crawler(queue, rezqueue)
        t.daemon = True
        t.start()
        ThreadList.append(t)

    GR = GetResults(rezqueue)
    GR.daemon = True
    GR.start()

    with open(sys.argv[2], "rU") as urls:
        for url in urls:
            try:
                if url.startswith('http://'):
                    queue.put(url.rstrip())
                else:
                    url = 'http://' + url.rstrip()
                    queue.put(url.rstrip())
            except Exception, e:
                print e

    # One None per worker so every Crawler process exits its loop.
    for i in range(ThreadNumber):
        queue.put(None)
    for Worker in ThreadList:
        Worker.join()
    print "All done!"
    rezqueue.put(None)
    GR.join()

Edited July 3, 2015 by Elohim
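The input loop in the script only checks for http://, so a line that already starts with https:// gets a second prefix, and an empty line is queued as a bare http://. A minimal sketch of a stricter normalization step, assuming a hypothetical helper named normalize_url that is not part of the original script:

```python
def normalize_url(line):
    """Return an http(s) URL for one line of the domain file, or None if blank.

    Hypothetical helper for illustration; the original script inlines this logic.
    """
    entry = line.strip()
    if not entry:
        return None  # skip empty lines instead of queueing a bare "http://"
    if entry.lower().startswith(("http://", "https://")):
        return entry  # already has a scheme, leave it alone
    return "http://" + entry


# Sample lines as they might appear in the domain file (placeholder domains).
for raw in ["example.com\n", "http://example.org\n", "https://example.net\n", "\n"]:
    print(repr(normalize_url(raw)))
```

In the main loop, the caller would queue the returned value only when it is not None.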
Elohim (Author) Posted February 20, 2014
Update 20 Feb 2014:
- Minor changes to the regex. Fixed the problem with emails that start with a capital letter, and with those preceded by mailto: and SendIM:.
I'm still waiting for suggestions to improve the regex.
zin0 Posted March 3, 2014
Can you make it just validate a list of emails? For example, I have 100 thousand Yahoo addresses in 1.txt; could it check which ones are valid and which are not?
intrus Posted March 8, 2014
Interesting, but for me it finds one email and then stops.
Elohim (Author) Posted December 28, 2014
Update 28 Dec 2014:
- Added cookie support
- Added header support
- Fixed an issue where it could hang in some cases
If at least 5-6 people are interested in having it propagate on its own and collect links by itself, post here.
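One detail about the header support: in the posted code, opener.addheaders is assigned twice in a row. Since addheaders is a plain list attribute, the second assignment replaces the first instead of adding to it. A small sketch of that behavior, written for Python 3's urllib.request (the script itself uses Python 2's urllib2), with placeholder header values chosen only for illustration:

```python
import urllib.request

# No request is sent here; we only inspect the opener's header list.
opener = urllib.request.build_opener()

# Two assignments in a row: the second list replaces the first.
opener.addheaders = [("Accept", "*/*"), ("User-Agent", "example-agent/1.0")]
opener.addheaders = [("Accept-Encoding", "")]
replaced = dict(opener.addheaders)

# A single list keeps every header.
opener.addheaders = [
    ("Accept", "*/*"),
    ("User-Agent", "example-agent/1.0"),
    ("Accept-Encoding", ""),
]
combined = dict(opener.addheaders)

print(sorted(replaced))
print(sorted(combined))
```

Note also that header names are given without a trailing colon.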
yoNut Posted January 1, 2015
Any idea why I get this error?
  File "search.py", line 14
    print em.lower()
           ^
SyntaxError: invalid syntax
I tried with both Python 2.7 and Python 3.3, both on the same CentOS.
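The error above is what Python 3 reports for the Python 2 print statement. The same line is valid under Python 2.7, so seeing it there may mean the command actually launched a different interpreter (an assumption; the post does not say which binary ran). A minimal sketch for checking which form the running interpreter accepts, using a hypothetical helper named parses:

```python
import sys


def parses(source):
    """Return True if `source` compiles on the interpreter running this file."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


statement_form = "print em.lower()"   # Python 2 only
function_form = "print(em.lower())"   # accepted by both Python 2 and Python 3

print("Python major version:", sys.version_info[0])
print("statement form parses:", parses(statement_form))
print("function form parses:", parses(function_form))
```

Running this with the same command used to start the script shows which major version is really in use.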
Rickets Posted January 1, 2015
An idea for propagation:
If you search Google for this: filetype:txt, "@yahoo.com" it returns text files such as http://www.carpeta.ro/Content/Subscription.txt
The following queries are also good:
filetype:sql, "@yahoo.com"
filetype:txt, "@gmail.com"
filetype:sql, "@gmail.com"
If you make it detect the country as well, I'll buy it.
Hubba Posted March 5, 2015
Why?
  File "/usr/lib/python2.6/threading.py", line 737, in run
  File "/usr/lib/python2.6/threading.py", line 380, in set
  File "/usr/lib/python2.6/threading.py", line 291, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Htich Posted March 5, 2015
Quote:
Why?
  File "/usr/lib/python2.6/threading.py", line 737, in run
  File "/usr/lib/python2.6/threading.py", line 380, in set
  File "/usr/lib/python2.6/threading.py", line 291, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

You're on Python 2.6. Use Python 2.7 and I don't think you'll have any more problems.
Htich Posted March 19, 2015
@meow What errors are you getting? Post the problem you're running into here and maybe we can figure it out.
Htich Posted March 19, 2015 (edited)
It already gave you the result :) Try Google or other larger domains as the URL!
Edited March 20, 2015 by Htich
winraw3 Posted March 20, 2015
It would be a good idea for it to also validate the emails, or at least check whether the hosts are valid. A checker! (Or even one that works on a list of emails.)
nimenis Posted April 23, 2015
Elohim, I'm interested in buying this program if you add a few more things to it. Send me a message with your ID.
mov_0ah_01 Posted April 23, 2015
@nimenis Atomic Email Hunter Discount Coupon Code 2015 - 100% Working
It does what you want, and more...
nimenis Posted April 23, 2015
It does, man, but I'd like some extra filters, such as by country or by different categories of emails. Plus it runs slowly: it takes a long time to crawl even 1 site, let alone 100.
Elohim (Author) Posted July 2, 2015
Updated today to v1.3. Performance is 2-3x better than the previous version. Enjoy.