URL-CRAWLER V1.0.0

URL-CRAWLER is a Python script that extracts all third-party links from a given domain. It fetches the page, uses the BeautifulSoup library to parse the HTML, and filters out links that point back to the crawled domain itself.
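
Conceptually, the core loop is small. Here is a minimal, self-contained sketch of the idea (a simplified illustration, not the script itself, which is mirrored in full below), assuming the domain is passed without a scheme:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def third_party_links(domain):
    # Fetch the page and parse its HTML
    response = requests.get(f'https://{domain}', timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only absolute links whose host differs from the crawled domain
    links = set()
    for a in soup.find_all('a', href=True):
        host = urlparse(a['href']).netloc
        if host and host != domain:
            links.add(a['href'])
    return sorted(links)

print(third_party_links('example.com'))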

 

Requirements

To run this script, you need to have Python 3 installed on your computer, as well as the following Python libraries:

  • requests
  • bs4 (BeautifulSoup)
  • colorama

(argparse, which the script also imports, ships with the Python 3 standard library and needs no separate installation.)

 

You can install the required libraries using pip:

pip install requests beautifulsoup4 colorama
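
To confirm the libraries are importable before running the script, a quick check from the command line:

python -c "import requests, bs4, colorama"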

 

Usage

To use urlcrawler.py, run the script from the command line and provide the domain you want to extract links from as a command line argument. You can also specify additional options, such as whether to show link status codes or save the links to a file.

Here are some examples:

# Extract links from a single domain and display them in the console
python urlcrawler.py -d example.com

# Extract links from a single domain and save them to a file
python urlcrawler.py -d example.com -o links.txt

# Extract links from multiple domains and save them to a file
# (note: -t is a stub in v1.0.0; the code below contains only a TODO for it,
# and -d is still required. See the sketch after these examples.)
python urlcrawler.py -t -o links.txt

# Extract links from a single domain and display their status codes in the console
python urlcrawler.py -d example.com -s

# For more detailed usage instructions, run the script with the -h or --help option.
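
Since the -t branch in main() is currently an empty TODO, here is one possible way to implement multi-domain mode. This is a hypothetical sketch, not part of the original script: the source does not specify where the list of domains should come from, so reading them one per line from a text file is my assumption, and extract_links() is the function from the listing below.

def extract_links_multiple(domains_file):
    # Hypothetical helper (assumption: domains are listed one per line
    # in a plain text file; relies on extract_links() from the script).
    all_links = set()
    with open(domains_file) as f:
        for line in f:
            domain = line.strip()
            if domain:
                all_links.update(extract_links(domain))
    return sorted(all_links)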

 

Author

This script was created by BLACK-SCORP10. You can contact me on Telegram at https://t.me/BLACK_SCORP10 or on Instagram at @hacke_1o1.

 

License

This project is licensed under the MIT License - see the LICENSE file for details.

 

Download: url-crawler-main.zip

 

or

 

git clone https://github.com/BLACK-SCORP10/url-crawler.git

 

Mirror (full source listing):

import argparse
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import colorama

def extract_links(domain):
  # Add "https://" to the beginning of the domain if it does not start with "http" or "https"
  if not domain.startswith('http://') and not domain.startswith('https://'):
    domain = 'https://' + domain
  
  # Make a GET request to the domain
  response = requests.get(domain, timeout=10)
  html = response.text
  
  # Use BeautifulSoup to parse the HTML and extract all links
  soup = BeautifulSoup(html, 'html.parser')
  links = soup.find_all('a')
  
  # Filter the links to only include third-party domains
  own_host = urlparse(domain).netloc
  third_party_links = []
  for link in links:
    href = link.get('href')
    # Keep only absolute links whose host differs from the crawled domain
    if href and href.startswith(('http://', 'https://')) and urlparse(href).netloc != own_host:
      third_party_links.append(href)
  
  # Remove duplicate links
  third_party_links = list(set(third_party_links))
  
  return third_party_links

def main():
  # Initialize colorama
  colorama.init()
  
  # Set up the command line argument parser
  parser = argparse.ArgumentParser(
    description=colorama.Fore.CYAN + 'Extract third party links from a domain',
    epilog=colorama.Fore.YELLOW + "Examples: \n"
           "1. python url-crawler.py -d example.com, \n"
           "2. python url-crawler.py -d example.com -o links.txt, \n"
           "3. python url-crawler.py -t -o links.txt",
    formatter_class=argparse.RawTextHelpFormatter)
  print(colorama.Fore.BLUE + "Made by BLACK-SCORP10")
  print(colorama.Fore.BLUE + "t.me/BLACK_SCORP10")
  print(colorama.Fore.BLUE + "Instagram: @hacke_1o1")
  parser.add_argument('-o', '--output', help='Output file name')
  parser.add_argument('-d', '--domain', help='Domain name', required=True)
  parser.add_argument('-t', '--multiple', help='Extract links from multiple domains', action='store_true')
  parser.add_argument('-s', '--status', help='Show link status code', action='store_true')
  args = parser.parse_args()
  
  print("\n")
  
  # Extract the links from the domain
  if args.multiple:
    # TODO: Extract links from multiple domains
    pass
  else:
    links = extract_links(args.domain)
    
    # Print the links or save them to a file
    if args.output:
      with open(args.output, 'w') as f:
        for link in links:
          if args.status:
            # Make an HTTP HEAD request to the link and get the status code
            try:
              response = requests.head(link, timeout=10)
              status_code = response.status_code
              f.write(f'[{status_code}] : {link}\n')
            except requests.RequestException:
              # Write plain text to the file, without terminal color codes
              f.write(f'[err] : {link}\n')
          else:
            f.write(link + '\n')
    else:
      for link in links:
        if args.status:
          # Make an HTTP HEAD request to the link and get the status code
          try:
            response = requests.head(link, timeout=10)
            status_code = response.status_code
            print(colorama.Fore.GREEN + f'[{status_code}] : {link}')
          except requests.RequestException:
            print(colorama.Fore.RED + f'[err] : {link}')
        else:
          print(link)
          
  # Reset the colorama settings
  colorama.deinit()

if __name__ == '__main__':
  main()

# This code is made and owned by BLACK-SCORP10.
# Feel free to contact me at https://t.me/BLACK_SCORP10
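
Besides the command line interface, extract_links() can be reused from other Python code. A minimal usage sketch, assuming the listing above is saved as urlcrawler.py (the repository's hyphenated file name, url-crawler.py, would not be importable as a module name):

from urlcrawler import extract_links

for link in extract_links('example.com'):
    print(link)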

 

Source
