Why Isn't For Loop Web Scraping for Href? I Appreciate Anyone Who Can Figure This Out
On the url in the "sitemap" below, I am trying to scrape the hyperlinked "Visit Website" for each office on this page.
However, I believe I am making an error with my "listings_div" variable as it does not seem to capture all the offices when I do the for loop. Thank you for your help!!
import requests
from bs4 import BeautifulSoup

sitemap = 'https://www.bhhs.com/office-results-list?office_country=US'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

listings_div = soup.find('section', attrs={'class':'cmp-office-search-results'})
for state in listings_div.find_all('div', attrs={'class':'cmp-office-results-list-view__content'}):
    print(state.find('section', attrs={'class':'cmp-cta'}).get('href'))
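Your job is much easier than it looks. The website uses JavaScript to load this information, so the office listings are not in the HTML that requests.get returns, which is likely why your loop does not find the offices. The page gets its data from a JSON endpoint that you can request directly.
The code below requests all 141 pages from that endpoint and saves the results.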
import requests, json

results = []
for i in range(1,142):
    res = requests.get("https://www.bhhs.com/bin/bhhs/officeSearchServlet?PageSize=10&Sort=1&Page={}&office_country=US".format(i))
    results.append(res.json())

with open("result.json", "w") as f:
    json.dump(results, f)
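To get from the saved JSON back to the "Visit Website" links, you first need to know which key holds the URL. The response structure is not shown here, so the sketch below only assumes that the links sit somewhere in the nested data under a single key. The helper name find_values and the key "website" are placeholders, not names taken from the actual response.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear deeper in the structure.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Example usage, assuming result.json was written by the code above and
# that "website" is the key you found by inspecting the file (hypothetical):
# with open("result.json") as f:
#     data = json.load(f)
# for url in find_values(data, "website"):
#     print(url)
```

Open result.json once to check the real key name before relying on this.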
Sending all the requests in a single run can cause some of them to fail. I therefore recommend crawling the pages in batches and saving the data after each one: pages 1-10, then 11-20, and so on. Afterwards you can consolidate all the scraped results into one file.
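The batching idea could look roughly like the sketch below. The endpoint and the page count of 141 come from the code above. The function names, the batches directory, the file naming scheme, and the 30-second timeout are my own assumptions. A batch whose file already exists is skipped, so a failed run can be restarted without fetching everything again.

```python
import json
from pathlib import Path

# Endpoint from the answer above; Page is filled in per request.
URL = ("https://www.bhhs.com/bin/bhhs/officeSearchServlet"
       "?PageSize=10&Sort=1&Page={}&office_country=US")


def fetch_page(page, timeout=30):
    """Fetch one page of results and return the decoded JSON."""
    # Imported here so the batching helpers work without requests installed.
    import requests
    res = requests.get(URL.format(page), timeout=timeout)
    res.raise_for_status()  # fail loudly instead of saving an error page
    return res.json()


def crawl_in_batches(first, last, batch_size=10, out_dir="batches", fetch=fetch_page):
    """Crawl pages first..last (inclusive), writing one JSON file per batch."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for start in range(first, last + 1, batch_size):
        end = min(start + batch_size - 1, last)
        path = out / f"pages_{start:03d}_{end:03d}.json"
        if path.exists():
            continue  # this batch was saved by an earlier run
        batch = [fetch(page) for page in range(start, end + 1)]
        path.write_text(json.dumps(batch))


def consolidate(out_dir="batches", target="result.json"):
    """Merge every batch file into a single list, in page order."""
    results = []
    # Zero-padded names sort in page order.
    for path in sorted(Path(out_dir).glob("pages_*.json")):
        results.extend(json.loads(path.read_text()))
    Path(target).write_text(json.dumps(results))
    return results


# Example usage (makes real network requests):
# crawl_in_batches(1, 141)
# consolidate()
```

If a batch fails partway through, nothing is written for it, so running crawl_in_batches again retries only the missing batches.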