cmiN Posted November 9, 2009

URL Path Finder (console) for pacealik @ rstcenter.com

#! /usr/bin/env python3.1
# 08.11.2009 <> 09.11.2009 | cmiN
# URL Path Finder (console) for pacealik @ rstcenter.com

import sys
import threading
import urllib.parse
import urllib.request


def main():
    usage = """\t\t URL Path Finder 1.0

\t Usage: upf.py q paths start::end timeout threads

Where q is the word you are searching for
      paths is a file with paths (like /admin /images /forum)
      start is an integer; the search starts here
      end is an integer; the search stops here
      timeout is a float in seconds
      threads is an integer: how many threads run asynchronously

\t Example: upf.py rstcenter C:\\paths.txt 0::20 1 50"""
    args = sys.argv
    if len(args) == 6:
        try:
            print("Please wait...")
            q = args[1]
            # Read the candidate paths, one per line.
            paths = list()
            with open(args[2], "r") as fin:
                for line in fin:
                    paths.append(line.strip())
            start, end = [int(part) for part in args[3].split("::")]
            timeout = float(args[4])
            threads = int(args[5])
            url = "http://www.google.com/search?q={q}&start={start}&hl=en"
            headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0)"}
            # One Scan thread per results page; cap the number of live threads.
            for i in range(start, end + 1):
                while threading.active_count() > threads:
                    pass
                Scan(url.format(q=urllib.parse.quote(q), start=i),
                     headers, timeout, paths).start()
            # Wait for all workers, then write out the deduplicated links.
            while threading.active_count() > 1:
                pass
            with open("links.txt", "w") as fout:
                for link in sorted(set(Scan.links)):
                    fout.write(link + "\n")
        except Exception as message:
            print("An error occurred: {}".format(message))
        else:
            print("Ready!")
    else:
        print(usage)
        input()


class Scan(threading.Thread):

    links = list()  # shared by all threads

    def __init__(self, url, headers, timeout, paths):
        threading.Thread.__init__(self)
        self.url = url
        self.headers = headers
        self.timeout = timeout
        self.paths = paths

    def run(self):
        # Fetch one Google results page and extract the first result's URL.
        request = urllib.request.Request(self.url, headers=self.headers)
        with urllib.request.urlopen(request, timeout=self.timeout) as usock:
            source = usock.read()
        source = source[source.find(b"Search Results"):]
        source = source[:source.find(b"</a>")].decode()
        url = source[source.find("http://"):]
        url = url[:url.find('"')]
        uparser = urllib.parse.urlparse(url)
        url = "{scheme}://{netloc}".format(scheme=uparser.scheme,
                                           netloc=uparser.netloc)
        # Try every candidate path against the extracted host; keep the
        # URLs that respond, drop the rest silently.
        for path in self.paths:
            request = urllib.request.Request(url + path, headers=self.headers)
            try:
                with urllib.request.urlopen(request, timeout=self.timeout) as usock:
                    Scan.links.append(usock.geturl())
            except Exception:
                pass


if __name__ == "__main__":
    main()

1) Download a Python 3.x release
2) Start -> Run -> cmd
3) cd %location%, where %location% is the path of the folder holding the upf.py file with the code above
4) upf.py [see usage]

UPDATED: 13.11.2009
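A side note on the thread cap: the two "while threading.active_count() > ...: pass" loops spin a CPU core while they wait. A minimal sketch of a blocking alternative, assuming a Scan-style worker; MAX_WORKERS, the Worker name, and the sleep are stand-ins, not part of the original script:

import threading
import time

MAX_WORKERS = 50                    # plays the role of the "threads" argument
slots = threading.BoundedSemaphore(MAX_WORKERS)

class Worker(threading.Thread):     # hypothetical stand-in for Scan
    def run(self):
        try:
            time.sleep(0.1)         # stand-in for the real page scan
        finally:
            slots.release()         # free the slot even if the scan fails

for i in range(100):
    slots.acquire()                 # blocks, no busy-wait, until a slot frees up
    Worker().start()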
daatdraqq Posted November 9, 2009

It doesn't work; after the usage text it just dumps you back to the screen again (or... where does it save to?)
cmiN Posted November 9, 2009

Well, the usage tells you which arguments to pass to upf.py. Did you try the example and it didn't work? If it prints "Please wait..." and then "Ready!", it ran as it should; if not, it will definitely throw some error. The extracted links are saved as links.txt in the same folder as upf.py. Give me more details, because I don't understand what you mean.
pacealik Posted November 9, 2009

cmiN, I gave you a reply in the help section.
daatdraqq Posted November 9, 2009

Yes, now it gives me an error. Earlier it wouldn't run at all... probably my fault.
cmiN Posted November 10, 2009

Well, tell me what error you get and what happens.
daatdraqq Posted November 10, 2009

(screenshot of the error; image not preserved)
cmiN Posted November 10, 2009

Guys, did you read the usage carefully?

paths is a file with paths (like /admin /images /forum)

The error shows up because that C:\paths.txt file does not exist. You have to create it yourselves and put in it the paths you want tried against the links found:

/admin
/images
/pub
/forum

It should contain something along those lines, and the output is saved to links.txt next to upf.py.
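For example (hypothetical contents; any paths will do), the file can be created straight from the same cmd window described in the first post, then the scan run right after:

echo /admin> C:\paths.txt
echo /images>> C:\paths.txt
echo /pub>> C:\paths.txt
echo /forum>> C:\paths.txt
upf.py rstcenter C:\paths.txt 0::20 1 50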
daatdraqq Posted November 10, 2009

Done, it works. It was throwing the error because I had named the file "path.txt" while file extensions were hidden, so it had actually become "path.txt.txt".
cmiN Posted November 10, 2009

Exactly!
NullCode Posted November 12, 2009

Won't Google block your IP for making so many requests? Or a captcha will show up after the first 10-20 pages. Maybe it would have been easier to go through the Google API, or to use some proxies?
cmiN Posted November 13, 2009

It hasn't happened to me, but I'll keep testing... I can very easily implement an option that takes a list of proxies, and I'll also add a switch for choosing between an API and the normal search. I didn't go straight for the API because I'd heard it has some restrictions or problems.
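A minimal sketch of that proxy option, assuming a hypothetical proxies.txt with one host:port per line; this is just urllib's standard ProxyHandler, not the author's implementation:

import random
import urllib.request

# Hypothetical proxies.txt: one "host:port" entry per line.
with open("proxies.txt", "r") as fin:
    proxies = [line.strip() for line in fin if line.strip()]

def open_through_proxy(url, headers, timeout):
    # Route the request through a randomly chosen HTTP proxy.
    proxy = random.choice(proxies)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://" + proxy}))
    request = urllib.request.Request(url, headers=headers)
    return opener.open(request, timeout=timeout)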
pacealik Posted November 15, 2009

cmiN, to get past the captcha, instead of the API use this link:

http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm&q=$cuvant&num=100&hl=en&as_qdr=all&start=$start&sa=N

where $cuvant is the word(s) to search for and $start is where the searches should start from. If you don't use that link, you can be sure the captcha will kick in; and the API caps your searches at 8 pages.

A few suggestions:
1. could it show up front how many links it found for the user's query?
2. drop the duplicate links from the output
3. and don't save sites that use mod_rewrite, because it stores junk and gets in your way
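For reference, a sketch of how that link could be slotted into upf.py: the cx value is copied verbatim from the post, $cuvant and $start become format placeholders, and the commented print is one way to cover suggestion 1 (suggestion 2 is already handled by the sorted(set(...)) deduplication in the listing above):

import urllib.parse

# pacealik's custom-search URL with $cuvant/$start turned into placeholders.
CSE_URL = ("http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm"
           "&q={q}&num=100&hl=en&as_qdr=all&start={start}&sa=N")

url = CSE_URL.format(q=urllib.parse.quote("rstcenter"), start=0)
print(url)

# Suggestion 1: report how many links the run produced before writing them.
# print("Found {} links.".format(len(Scan.links)))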