Che

Python: Accessing a page protected by Incapsula


I want to fetch the source of this page in Python:

https://www.whoscored.com/Matches/1485370/Live/England-Premier-League-2020-2021-Brighton-Leicester

 

This is the Python script:

import ssl

import requests

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

url = 'https://www.whoscored.com/Matches/1485370/Live/England-Premier-League-2020-2021-Brighton-Leicester'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}

response = requests.get(url)
print(response.content)

 

The problem is that it won't let me in. Either I get an error saying the certificate is invalid (which is why I put the try block for the certificate at the top of the code), or it simply refuses because of some problem with Incapsula. I don't understand why it blocks me, since I'm practically emulating a browser perfectly and it still somehow figures out that it isn't a real one.

Can anyone help me, please?

Thanks a lot!
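
A note on the certificate workaround at the top of the script: the try block swaps the standard library's default HTTPS context factory for an unverified one. That factory is what `urllib.request` consults. `requests` exposes its own `verify` argument for certificate checking, so the patch most likely has no effect on `requests.get`. A minimal stdlib-only sketch, offered as an illustration rather than a fix, of how the two kinds of context differ:

```python
import ssl

# Context the standard library builds by default: certificates are verified.
verified = ssl.create_default_context()

# Context produced by the private helper that the workaround installs.
unverified = ssl._create_unverified_context()

for name, ctx in (('verified', verified), ('unverified', unverified)):
    print(name, ctx.verify_mode, ctx.check_hostname)
```

The unverified context skips both the certificate check and the hostname check, which is why turning verification off is usually discouraged outside of debugging.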


At first I thought you hadn't set everything up properly, so I added a few more headers. Then I noticed that you never actually pass the headers to the request (and, of course, I had been put on cooldown in the meantime). This code works fine for me:

import requests

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context

url = 'https://www.whoscored.com/Matches/1485370/Live/England-Premier-League-2020-2021-Brighton-Leicester'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    # Note: 'br' responses are only decoded if a Brotli package is installed.
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Host': 'www.whoscored.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'
}

s = requests.Session()
response = s.get(url, allow_redirects=True, headers=headers)
print(response.content)

After running this you might think you didn't get what you wanted and that you were blocked; that was my first impression too. Then I saw that I had not been put on cooldown. You do receive everything you should, but with the requests module alone you can't execute the page's JavaScript, so you can't see the full content.

Encrypted JavaScript... obviously...

 

[Image: YqLugUL.png]

I think this module is useful if you're too lazy to work out how it gets decrypted: https://pypi.org/project/requests-html/
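
The difference between the two scripts comes down to whether `headers` ever reaches the request. A minimal sketch, assuming only that `requests` is installed, that prepares both variants without sending anything over the network and prints the User-Agent each one would carry. The shortened `headers` dict is for illustration only.

```python
import requests

url = 'https://www.whoscored.com/Matches/1485370/Live/England-Premier-League-2020-2021-Brighton-Leicester'

# Shortened for illustration; the full dict from the post works the same way.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
}

session = requests.Session()

# prepare_request() merges the session defaults with the per-request headers
# but does not send anything.
without_headers = session.prepare_request(requests.Request('GET', url))
with_headers = session.prepare_request(requests.Request('GET', url, headers=headers))

print(without_headers.headers['User-Agent'])  # the library default, python-requests/<version>
print(with_headers.headers['User-Agent'])     # the Firefox string set above
```

A request announcing itself as `python-requests` is trivially distinguishable from a browser, whatever the `headers` variable contained.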


@FoxBlood How do you get that JS code? All I ever get is this:

b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=30&xinfo=7-127053514-0%200NNN%20RT%281617365412939%2074%29%20q%280%20-1%20-1%201%29%20r%280%20-1%29%20B12%2811%2c8628%2c0%29%20U18&incident_id=875000100355227794-412533605997937287&edet=12&cinfo=0b000000&rpinfo=0&cts=nwc8yt0AuMjh0gpLuhos1IuJJSpDlyUzeNlHAg8f8pgX3fHR3fTHQ1klHRlgMFI6" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 875000100355227794-412533605997937287</iframe></body></html>'

@Zatarra I tried it and it still doesn't work. It also still reports an SSL error saying that skipping verification is not a good idea. That one is actually a warning from urllib3. But the error above is the same.
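
A response like the one above can be recognised programmatically instead of by eye. A minimal stdlib-only sketch, assuming the block page always contains the "Incapsula incident ID" text seen in that output. The function name `incapsula_incident_id` is hypothetical and the sample body is abbreviated from the response quoted in the post.

```python
import re

def incapsula_incident_id(body: bytes):
    """Return the incident ID if the body looks like an Incapsula block page, else None."""
    text = body.decode('utf-8', errors='replace')
    match = re.search(r'Incapsula incident ID: ([\d-]+)', text)
    return match.group(1) if match else None

# Abbreviated copy of the block page quoted above.
sample = (
    b'<html><head><script type="text/javascript" '
    b'src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head>'
    b'<body><iframe id="main-iframe">Request unsuccessful. '
    b'Incapsula incident ID: 875000100355227794-412533605997937287</iframe></body></html>'
)

print(incapsula_incident_id(sample))
print(incapsula_incident_id(b'<html><body>ok</body></html>'))
```

Calling it on `response.content` before parsing anything makes it obvious whether the script received the real page or the block page.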


20 minutes ago, gigiRoman said:

PhantomJS is headless Chrome only. Selenium is multi-browser. For GUI tests across several browsers, Selenium is the recommended choice:

https://stackoverflow.com/questions/14099770/casperjs-phantomjs-vs-selenium

So you're saying that PhantomJS is Chrome only and can't be detected? But Selenium is Chrome too, since you download ChromeDriver and use that, which is basically Chrome.


3 minutes ago, Che said:

So you're saying that PhantomJS is Chrome only and can't be detected? But Selenium is Chrome too, since you download ChromeDriver and use that, which is basically Chrome.

Both PhantomJS and Selenium are automated testing frameworks. PhantomJS uses only the Chrome driver, whereas with Selenium you can automate multiple browsers. They are not designed to keep you from being detected; they are meant for test environments. Note that they even send user agents stating that the request comes from Selenium or PhantomJS, respectively. Here you have to take care to use a proxy that rewrites the user agents, and so on.
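
The point about user agents can be illustrated with a small check. A minimal stdlib-only sketch, assuming a few marker substrings that automation tools commonly leave in the User-Agent header. The marker list, the function name `automation_markers`, and the first two sample strings are illustrative assumptions, and nothing here describes how Incapsula itself decides.

```python
# Substrings that commonly reveal an automated client (illustrative, not exhaustive).
AUTOMATION_MARKERS = ('HeadlessChrome', 'PhantomJS', 'python-requests')

def automation_markers(user_agent: str):
    """Return the markers found in a User-Agent string, case-insensitively."""
    lowered = user_agent.lower()
    return [marker for marker in AUTOMATION_MARKERS if marker.lower() in lowered]

samples = {
    'headless chrome': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36',
    'phantomjs': 'Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1',
    'firefox': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
}

for name, user_agent in samples.items():
    print(name, automation_markers(user_agent))
```

The User-Agent is only one signal among many, so a clean result here does not mean a client will pass as a real browser.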
