cmiN Posted November 9, 2009

URL Path Finder (console) for pacealik @ rstcenter.com

#! /usr/bin/env python3.1
# 08.11.2009 <> 09.11.2009 | cmiN
# URL Path Finder (console) for pacealik @ rstcenter.com

import sys
import threading
import urllib.parse
import urllib.request


def main():
    usage = """\t\t URL Path Finder 1.0

\t Usage: upf.py q paths start::end timeout threads

Where q is the word you are searching for
      paths is a file with paths (like /admin /images /forum)
      start is an integer; the search starts here
      end is an integer; the search stops here
      timeout is a float in seconds
      threads is an integer: how many threads run asynchronously

\t Example: upf.py rstcenter C:\\paths.txt 0::20 1 50"""
    args = sys.argv
    if len(args) == 6:
        try:
            print("Please wait...")
            q = args[1]
            # Read the candidate paths, one per line.
            paths = list()
            with open(args[2], "r") as fin:
                for line in fin:
                    paths.append(line.strip())
            start, end = [int(part) for part in args[3].split("::")]
            timeout = float(args[4])
            threads = int(args[5])
            url = "http://www.google.com/search?q={q}&start={start}&hl=en"
            headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0)"}
            # One Scan thread per results page; cap the number of live threads.
            for i in range(start, end + 1):
                while threading.active_count() > threads:
                    pass
                Scan(url.format(q=urllib.parse.quote(q), start=i),
                     headers, timeout, paths).start()
            # Wait for all workers, then write out the deduplicated links.
            while threading.active_count() > 1:
                pass
            with open("links.txt", "w") as fout:
                for link in sorted(set(Scan.links)):
                    fout.write(link + "\n")
        except Exception as message:
            print("An error occurred: {}".format(message))
        else:
            print("Ready!")
    else:
        print(usage)
        input()


class Scan(threading.Thread):

    links = list()  # shared by all threads

    def __init__(self, url, headers, timeout, paths):
        threading.Thread.__init__(self)
        self.url = url
        self.headers = headers
        self.timeout = timeout
        self.paths = paths

    def run(self):
        # Fetch one Google results page and extract the first result's URL.
        request = urllib.request.Request(self.url, headers=self.headers)
        with urllib.request.urlopen(request, timeout=self.timeout) as usock:
            source = usock.read()
        source = source[source.find(b"Search Results"):]
        source = source[:source.find(b"</a>")].decode()
        url = source[source.find("http://"):]
        url = url[:url.find('"')]
        uparser = urllib.parse.urlparse(url)
        url = "{scheme}://{netloc}".format(scheme=uparser.scheme,
                                           netloc=uparser.netloc)
        # Try every candidate path against the extracted host; keep the
        # URLs that respond, drop the rest silently.
        for path in self.paths:
            request = urllib.request.Request(url + path, headers=self.headers)
            try:
                with urllib.request.urlopen(request, timeout=self.timeout) as usock:
                    Scan.links.append(usock.geturl())
            except Exception:
                pass


if __name__ == "__main__":
    main()

1) Download a Python 3.x release
2) Start -> Run -> cmd
3) cd %location%, where %location% is the path of the folder holding the upf.py file with the code above
4) upf.py [see usage]

UPDATED: 13.11.2009
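A side note on the thread cap: the two "while threading.active_count() > ...: pass" loops spin a CPU core while they wait. A minimal sketch of a blocking alternative, assuming a Scan-style worker; MAX_WORKERS, the Worker name, and the sleep are stand-ins, not part of the original script:

import threading
import time

MAX_WORKERS = 50                    # plays the role of the "threads" argument
slots = threading.BoundedSemaphore(MAX_WORKERS)

class Worker(threading.Thread):     # hypothetical stand-in for Scan
    def run(self):
        try:
            time.sleep(0.1)         # stand-in for the real page scan
        finally:
            slots.release()         # free the slot even if the scan fails

for i in range(100):
    slots.acquire()                 # blocks, no busy-wait, until a slot frees up
    Worker().start()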
daatdraqq Posted November 9, 2009

It doesn't work; after the usage text it just dumps you back to the screen again (or... where does it save to?)
cmiN Posted November 9, 2009

Well, the usage tells you which arguments to pass to upf.py. Did you try the example and it didn't work? If it prints "Please wait..." and then "Ready!", it ran as it should; if not, it will definitely throw some error. The extracted links are saved as links.txt in the same folder as upf.py. Give me more details, because I don't understand what you mean.
pacealik Posted November 9, 2009

cmiN, I gave you a reply in the help section.
daatdraqq Posted November 9, 2009

Yes, now it gives me an error. Earlier it wouldn't run at all... probably my fault.
cmiN Posted November 10, 2009

Well, tell me what error you get and what happens.
daatdraqq Posted November 10, 2009

(screenshot of the error; image not preserved)
cmiN Posted November 10, 2009

Guys, did you read the usage carefully?

paths is a file with paths (like /admin /images /forum)

The error shows up because that C:\paths.txt file does not exist. You have to create it yourselves and put in it the paths you want tried against the links found:

/admin
/images
/pub
/forum

It should contain something along those lines, and the output is saved to links.txt next to upf.py.
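For example (hypothetical contents; any paths will do), the file can be created straight from the same cmd window described in the first post, then the scan run right after:

echo /admin> C:\paths.txt
echo /images>> C:\paths.txt
echo /pub>> C:\paths.txt
echo /forum>> C:\paths.txt
upf.py rstcenter C:\paths.txt 0::20 1 50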
daatdraqq Posted November 10, 2009

Done, it works. It was throwing the error because I had named the file "path.txt" while file extensions were hidden, so it had actually become "path.txt.txt".
cmiN Posted November 10, 2009

Exactly!
NullCode Posted November 12, 2009

Won't Google block your IP for making so many requests? Or a captcha will show up after the first 10-20 pages. Maybe it would have been easier to go through the Google API, or to use some proxies?
cmiN Posted November 13, 2009

It hasn't happened to me, but I'll keep testing... I can very easily implement an option that takes a list of proxies, and I'll also add a switch for choosing between an API and the normal search. I didn't go straight for the API because I'd heard it has some restrictions or problems.
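A minimal sketch of that proxy option, assuming a hypothetical proxies.txt with one host:port per line; this is just urllib's standard ProxyHandler, not the author's implementation:

import random
import urllib.request

# Hypothetical proxies.txt: one "host:port" entry per line.
with open("proxies.txt", "r") as fin:
    proxies = [line.strip() for line in fin if line.strip()]

def open_through_proxy(url, headers, timeout):
    # Route the request through a randomly chosen HTTP proxy.
    proxy = random.choice(proxies)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://" + proxy}))
    request = urllib.request.Request(url, headers=headers)
    return opener.open(request, timeout=timeout)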
pacealik Posted November 15, 2009

cmiN, to get past the captcha, instead of the API use this link:

http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm&q=$cuvant&num=100&hl=en&as_qdr=all&start=$start&sa=N

where $cuvant is the word(s) to search for and $start is where the searches should start from. If you don't use that link, you can be sure the captcha will kick in; and the API caps your searches at 8 pages.

A few suggestions:
1. could it show up front how many links it found for the user's query?
2. drop the duplicate links from the output
3. and don't save sites that use mod_rewrite, because it stores junk and gets in your way
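For reference, a sketch of how that link could be slotted into upf.py: the cx value is copied verbatim from the post, $cuvant and $start become format placeholders, and the commented print is one way to cover suggestion 1 (suggestion 2 is already handled by the sorted(set(...)) deduplication in the listing above):

import urllib.parse

# pacealik's custom-search URL with $cuvant/$start turned into placeholders.
CSE_URL = ("http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm"
           "&q={q}&num=100&hl=en&as_qdr=all&start={start}&sa=N")

url = CSE_URL.format(q=urllib.parse.quote("rstcenter"), start=0)
print(url)

# Suggestion 1: report how many links the run produced before writing them.
# print("Found {} links.".format(len(Scan.links)))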