
[Python] URL Path Finder (console) [cmiN]


Posted


#! /usr/bin/env python3.1
# 08.11.2009 <> 09.11.2009 | cmiN
# URL Path Finder (console) 4 pacealik @ rstcenter.com


import sys, threading, urllib.request, urllib.parse


def main():
    usage = """\t\t URL Path Finder 1.0

\t Usage: upf.py q paths start::end timeout threads

Where q is the word that you are searching for
paths is a file with paths (like /admin /images /forum)
start is an integer; the search begins here
end is an integer; the search stops here
timeout is a float in seconds
threads is an integer representing how many threads run asynchronously

\t Example: upf.py rstcenter C:\\paths.txt 0::20 1 50"""
    args = sys.argv
    if len(args) == 6:
        try:
            print("Please wait...")
            q = args[1]
            paths = list()
            # read the candidate paths, one per line
            with open(args[2], "r") as fin:
                for line in fin.readlines():
                    paths.append(line.strip("\n"))
            start = int(args[3].split("::")[0])
            end = int(args[3].split("::")[1])
            timeout = float(args[4])
            threads = int(args[5])
            url = "http://www.google.com/search?q={q}&start={start}&hl=en"
            headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 8.0)"}
            for i in range(start, end + 1):
                # throttle: wait until a thread slot is free
                while threading.active_count() > threads:
                    pass
                Scan(url.format(q=q, start=str(i)), headers, timeout, paths).start()
            # wait for all worker threads to finish
            while threading.active_count() > 1:
                pass
            with open("links.txt", "w") as fout:
                # drop duplicates, sort, then write the collected links
                for link in Scan.links:
                    while Scan.links.count(link) > 1:
                        Scan.links.remove(link)
                Scan.links.sort()
                for link in Scan.links:
                    fout.write(link + "\n")
        except Exception as message:
            print("An error occurred: {}".format(message))
        except:
            print("Unknown error.")
        else:
            print("Ready!")
    else:
        print(usage)
    input()


class Scan(threading.Thread):

    links = list()    # shared across threads; every Scan instance appends here

    def __init__(self, url, headers, timeout, paths):
        threading.Thread.__init__(self)
        self.url = url
        self.headers = headers
        self.timeout = timeout
        self.paths = paths

    def run(self):
        # fetch one Google results page and keep only the first result block
        request = urllib.request.Request(self.url, headers=self.headers)
        with urllib.request.urlopen(request, timeout=self.timeout) as usock:
            source = usock.read()
        source = source[source.find(b"Search Results"):]
        source = source[:source.find(b"</a>")]
        source = source.decode()
        # extract that result's base URL (scheme + host only)
        url = source[source.find("http://"):]
        url = url[:url.find('"')]
        uparser = urllib.parse.urlparse(url)
        url = "{scheme}://{netloc}".format(scheme=uparser.scheme, netloc=uparser.netloc)
        # probe every path on that host and remember the ones that answer
        for path in self.paths:
            request = urllib.request.Request(url + path, headers=self.headers)
            try:
                with urllib.request.urlopen(request, timeout=self.timeout) as usock:
                    Scan.links.append(usock.geturl())
            except:
                pass


if __name__ == "__main__":
    main()

1) Download a Python 3.x release

2) Start -> Run -> cmd

3) cd %locatie%, where %locatie% is the path of the folder that contains the upf.py file with the code above

4) upf.py [see usage] (a runnable sketch follows below)
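
A minimal sketch of steps 3) and 4) as a script, assuming upf.py and the paths file both sit in a made-up C:\tools folder and reusing the arguments from the usage example:

# Hypothetical runner; the C:\tools location and the "python" launcher are assumptions.
import subprocess

subprocess.call(
    ["python", "upf.py", "rstcenter", r"C:\tools\paths.txt", "0::20", "1", "50"],
    cwd=r"C:\tools",   # the "cd %locatie%" from step 3
)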

UPDATED: 13.11.2009

Posted

Well, the usage shows you what arguments to call upf.py with. Did you try the example and it didn't work? If it prints "Please wait..." and then "Ready!", it worked as it should; if not, it will certainly throw some error. The extracted links are saved under the name links.txt in the same folder as upf.py. Give more details, because I don't understand what you mean :P.

Posted

Guys, did you actually read the usage carefully?

paths is a file with paths (like /admin /images /forum)

The error appears because the file C:\paths.txt does not exist. You have to create it yourselves and put in it the paths you want tried against the links found by the search:


/admin
/images
/pub
/forum

It has to contain something like that, and the output is saved in links.txt next to upf.py.
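
A minimal sketch that creates such a file from Python, assuming the same C:\paths.txt location used in the usage example:

# Hypothetical helper; adjust the path list and the file location to taste.
paths = ["/admin", "/images", "/pub", "/forum"]

with open(r"C:\paths.txt", "w") as fout:
    for path in paths:
        fout.write(path + "\n")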

Posted

Won't Google block your IP because you are making too many requests? Or a "captcha" will show up after the first 10-20 pages. Maybe it would have been easier to go through the Google API, or to use some proxies?

Posted

It hasn't happened to me; I will test it some more... I can very easily implement an option that takes a list of proxies, and I will also add a switch for choosing between an API and the normal search. I didn't go straight for the API because I had heard it has some restrictions or problems.
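
A minimal sketch of how that proxy option could look with urllib, assuming one proxy is picked per request from a user-supplied list (the addresses below are placeholders, not real proxies):

import random
import urllib.request

proxy_list = ["127.0.0.1:8080", "127.0.0.1:3128"]  # hypothetical entries

def open_through_proxy(request, timeout):
    """Open a request through a randomly chosen HTTP proxy from proxy_list."""
    proxy = random.choice(proxy_list)
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    return opener.open(request, timeout=timeout)

# Scan.run() could then call open_through_proxy(request, self.timeout)
# instead of urllib.request.urlopen(request, timeout=self.timeout).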

Posted

cmiN, to get past the captcha, instead of the API use the link http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm&q=$cuvant&num=100&hl=en&as_qdr=all&start=$start&sa=N

$cuvant = the words you want it to search for

$start = where the searches should start from

If you don't use that link, you can be sure the captcha kicks in. And the API limits your searches to a maximum of 8 pages :)
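
A minimal sketch of how the search template in main() could be swapped for that link, assuming $cuvant and $start correspond to the existing {q} and {start} placeholders (whether the cx id above still answers is not checked here):

# Hypothetical replacement for the url assignment in main(); only the template changes.
url = ("http://www.google.com/cse?cx=013269018370076798483%3Awdba3dlnxqm"
       "&q={q}&num=100&hl=en&as_qdr=all&start={start}&sa=N")
# The rest stays the same:
# Scan(url.format(q=q, start=str(i)), headers, timeout, paths).start()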

1. Could it show at the beginning how many links it found for the search set by the user?

2. It should drop the links that repeat.

3. And it should not save sites that use mod_rewrite, because it saves junk and gets in your way.
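
A minimal sketch for requests 1 and 2, assuming the Scan class from the script above: print how many links were collected and drop duplicates with a set instead of the count/remove loop in main():

# Hypothetical replacement for the links.txt block in main().
unique_links = sorted(set(Scan.links))
print("Found {} links, {} unique.".format(len(Scan.links), len(unique_links)))
with open("links.txt", "w") as fout:
    for link in unique_links:
        fout.write(link + "\n")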
