Scraping a specific website with a search box and JavaScript in Python

On the website https://sray.arabesque.com/dashboard there is a search box (an "input" element in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the info about that company, let the JavaScript run to get the full HTML of the resulting page, and then scrape it for the GC Score, ESG Score, and Temperature Score at the bottom.

    !apt install chromium-chromedriver
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin
    !pip install selenium

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    options = webdriver.ChromeOptions()
    options.add_argument('-headless')
    options.add_argument('-no-sandbox')
    options.add_argument('-disable-dev-shm-usage')

    wd = webdriver.Chrome('chromedriver',options=options)

    companies = ['Anglo American plc']

    for company in companies:
      # dryscrape.start_xvfb()
      # session = dryscrape.Session()
      # session.visit("https://srayapi.arabesque.com/api/sray/company/history/004BTP-E")
      resp = wd.get('https://sray.arabesque.com/dashboard/')
      #print(driver.page_source)
      e = wd.find_element_by_id(id_='mat-input-0')
      e.send_keys(company)
      e.send_keys(Keys.ENTER)
      innerHTML = e.execute_script("return document.body.innerHTML")
      print(innerHTML)

I don’t quite understand how to visit the URL with the info about Anglo American and scrape it when we don’t know that URL until after entering the company name in the search box.

2 Answer(s)

You can do that using Selenium. There are a couple of things you need to update:

When running headless, you need to set a window size.

Use WebDriverWait() to avoid synchronization issues.
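
To make the second point more concrete, here is a minimal sketch of the polling idea behind an explicit wait: keep checking a condition until it holds or a timeout expires. It is not Selenium's own implementation; `wait_until` and `element_ready` are hypothetical names used only for illustration, and the "page" is simulated by a counter.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only becomes available on the third check.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "search box" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
print(found)
```

In real Selenium code the condition would be something like an `expected_conditions` check on the driver, which is why interacting with an element before it has rendered tends to fail without a wait.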

Code:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # Headless Chrome needs an explicit window size
    options.add_argument('--window-size=1920,1080')

    wd = webdriver.Chrome(options=options)

    companies = ['Anglo American plc']

    for company in companies:
        wd.get('https://sray.arabesque.com/dashboard/')
        # Click the "list" link, then type the company name into the search box
        WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='list']"))).click()
        WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='mat-input-0']"))).send_keys(company)
        # Pick the matching suggestion and open its dashboard
        WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, f"//span[contains(.,' {company} ')]"))).click()
        WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//span[normalize-space(.)='Open dashboard'])[1]"))).click()
        # Wait for the score tabs to render before reading them
        WebDriverWait(wd, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.mat-tab-labels")))
        print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'GC Score')]/span").text)
        print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'ESG Score')]/span").text)
        print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'Temp')]/span").text)

    wd.quit()

Output:

    57.03
    53.78
    2.7°C
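
The scraped values come back as text, and the temperature carries a unit suffix. If you want numbers to work with, something like the following sketch could convert them. `parse_score` is a hypothetical helper, and the input list simply repeats the three strings printed above.

```python
import re


def parse_score(text):
    """Extract the first number from a scraped label such as '57.03' or '2.7°C'."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


# The three strings printed by the script above, in the same order.
scraped = ["57.03", "53.78", "2.7°C"]
gc_score, esg_score, temperature = (parse_score(value) for value in scraped)
print(gc_score, esg_score, temperature)
```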

I don't know exactly why you want to use Selenium, go through the search box, and then load another page, but here is what I would do to get the data you are looking for:

    import requests

    session = requests.Session()
    url = 'https://srayapi.arabesque.com/api/sray/q'
    response = session.get(url).json()

    rays = response['data']['rays']
    # Keep only the companies whose name starts with the search term
    results = [ray for ray in rays if ray['name'].startswith('Anglo American')]

Then do whatever you want with the results. For the ESG, GC, and temperature scores, perhaps:

    myObj = [
        {result['name']: {'gc': result['gc'], 'esg': result['esg'], 'temp': result['score_near']}}
        for result in results
    ]
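
If you need this for several companies, the filter and the reshaping can be folded into one function. This is only a sketch: `scores_by_name` is a hypothetical helper, the sample list is made-up data shaped like `response['data']['rays']` so it runs without a network call, and the field names (`name`, `gc`, `esg`, `score_near`) are assumed to match what the API returns, as in the snippet above.

```python
def scores_by_name(rays, prefix):
    """Map each company whose name starts with `prefix` to its three scores."""
    return {
        ray["name"]: {"gc": ray["gc"], "esg": ray["esg"], "temp": ray["score_near"]}
        for ray in rays
        if ray["name"].startswith(prefix)
    }


# Hypothetical sample shaped like response['data']['rays']; values are placeholders.
sample_rays = [
    {"name": "Anglo American plc", "gc": 57.03, "esg": 53.78, "score_near": 2.7},
    {"name": "Example Corp", "gc": 1.0, "esg": 2.0, "score_near": 3.0},
]

print(scores_by_name(sample_rays, "Anglo American"))
```

Unlike the list of single-key dictionaries above, this returns one dictionary keyed by company name, which makes lookups by name simpler.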