Why Isn't My For Loop Scraping the Href of Each Link? I Appreciate Anyone Who Can Figure This Out

At the URL stored in the "sitemap" variable below, I am trying to scrape the "Visit Website" hyperlink for each office on the page.

However, I believe I am making an error with my "listings_div" variable as it does not seem to capture all the offices when I do the for loop. Thank you for your help!!

import requests
from bs4 import BeautifulSoup

sitemap = 'https://www.bhhs.com/office-results-list?office_country=US'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

listings_div = soup.find('section', attrs={'class':'cmp-office-search-results'})

for state in listings_div.find_all('div', attrs={'class':'cmp-office-results-list-view__content'}):
    print(state.find('section', attrs={'class':'cmp-cta'}).get('href'))
1 Answer(s)

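The website loads the office listings with JavaScript, so they are not part of the HTML that requests downloads. That is why listings_div does not contain all the offices. This actually makes your job much easier: the page gets its data from a JSON endpoint that you can call directly.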

The code below requests all 141 result pages and saves the responses to result.json.

import requests, json

results = []

# Pages are numbered from 1; each page returns 10 offices as JSON
for i in range(1,142):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())

# Save every page's response in a single file
with open("result.json", "w") as f:
    json.dump(results, f)

Sending all the requests in one go can cause some of them to fail. I therefore recommend crawling the pages in batches and saving the data after each one: fetch pages 1-10 and save, then pages 11-20 and save, and so on. Afterwards you can consolidate all the scraped results.
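
The batching idea could be sketched roughly as follows. This is only one possible layout, not a tested crawler: the helper names (fetch_page, crawl_in_batches, consolidate), the "batches" folder, the file naming scheme, the 30-second timeout and the one-second pause are all my own assumptions. Only the endpoint URL comes from the code above.

```python
import json
import time
from pathlib import Path

# Endpoint taken from the answer above; {} is filled with the page number
URL = "https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US"


def fetch_page(page):
    """Fetch one result page and return the parsed JSON."""
    import requests  # imported here so the other helpers work without it
    res = requests.get(URL.format(page), timeout=30)  # timeout is an assumption
    res.raise_for_status()  # fail loudly so the batch is not saved half-empty
    return res.json()


def crawl_in_batches(first, last, batch_size=10, out_dir="batches",
                     fetch=fetch_page, pause=1.0):
    """Fetch pages first..last (inclusive) and write one JSON file per batch."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for start in range(first, last + 1, batch_size):
        end = min(start + batch_size - 1, last)
        path = out / "pages_{:04d}_{:04d}.json".format(start, end)
        if path.exists():
            continue  # this batch was saved by an earlier run, skip it
        batch = [fetch(page) for page in range(start, end + 1)]
        with open(path, "w") as f:
            json.dump(batch, f)
        time.sleep(pause)  # short pause between batches (assumed value)


def consolidate(out_dir="batches", target="result.json"):
    """Merge all saved batch files, in page order, into one JSON file."""
    results = []
    for path in sorted(Path(out_dir).glob("pages_*.json")):
        with open(path) as f:
            results.extend(json.load(f))
    with open(target, "w") as f:
        json.dump(results, f)
    return results
```

Calling crawl_in_batches(1, 141) followed by consolidate() would cover the same 141 pages. If a request fails partway through, that batch's file is never written, so running the same call again picks up at the first missing batch instead of starting over.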

Answered on July 16, 2020.
