How to turn a list into a data frame after a list comprehension
I have the following code. How do I properly turn the result into a data frame, with country as one column and population as the other, after looping through my function with a list comprehension?
    from bs4 import BeautifulSoup
    import html
    from urllib.request import urlopen
    import pandas as pd

    countries = ['af', 'ax']

    def get_data(countries):
        url = 'https://www.cia.gov/library/publications/the-world-factbook/geos/'+countries+'.html'
        page = urlopen(url)
        soup = BeautifulSoup(page,'html.parser')
        # geography
        country = soup.find('span', {'class' : 'region'}).text
        population = soup.find('div', {'id' : 'field-population'}).find_next('span').get_text(strip=True)
        dataframe = [country, population]
        dataframe = pd.DataFrame([dataframe])
        return dataframe

    results = [get_data(p) for p in countries]
Here is what I tried, and the data frame it gives me:
    results = pd.DataFrame(results)

                                            0                                       1
    0  0 Afghanistan Name: 0, dtype: object  0 Afghanistan Name: 0, dtype: object
    1  0 Akrotiri Name: 0, dtype: object     0 Akrotiri Name: 0, dtype: object
I’m not quite sure why you’re returning it as a DataFrame from get_data(). If you return it as a dictionary, it will be much more logical for conversion to a dataframe later.
    countries = ['af', 'ax']

    def get_data(countries):
        url = 'https://www.cia.gov/library/publications/the-world-factbook/geos/'+countries+'.html'
        page = urlopen(url)
        soup = BeautifulSoup(page,'html.parser')
        # geography
        country = soup.find('span', {'class' : 'region'}).text
        population = soup.find('div', {'id' : 'field-population'}).find_next('span').get_text(strip=True)
        scraped = {'country':country, 'population':population}
        return scraped

    results = [get_data(p) for p in countries]
This returns a list of dictionaries such as:
    [{'country': 'Afghanistan', 'population': '36,643,815'},
     {'country': 'Akrotiri', 'population': 'approximately 15,500 on the Sovereign Base Areas of Akrotiri and Dhekelia including 9,700 Cypriots and 5,800 Service and UK-based contract personnel and dependents'}]
So when you convert with pd.DataFrame(results) you get:
           country                                         population
    0  Afghanistan                                         36,643,815
    1     Akrotiri  approximately 15,500 on the Sovereign Base Are...
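As a side note, the odd output in the question appears to come from passing a list of DataFrames to pd.DataFrame(), so that each cell ends up holding a whole Series instead of a value. If you did want to keep returning one-row DataFrames, pd.concat is the usual way to stack them. Here is a minimal sketch of that idea; get_data_frame() is a hypothetical stand-in for get_data() that uses hard-coded sample values instead of scraping, so it runs without network access:

```python
import pandas as pd

# Hypothetical stand-in for get_data(): returns a one-row DataFrame,
# as in the question, but from hard-coded sample values (no scraping).
SAMPLE = {
    'af': ('Afghanistan', '36,643,815'),
    'ax': ('Akrotiri', 'approximately 15,500'),
}

def get_data_frame(code):
    country, population = SAMPLE[code]
    return pd.DataFrame([[country, population]],
                        columns=['country', 'population'])

frames = [get_data_frame(code) for code in ['af', 'ax']]

# Stack the one-row frames vertically and renumber the index 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Returning plain dictionaries, as above, is still the lighter option, since it avoids building a small DataFrame per country.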
    In [136]: from bs4 import BeautifulSoup
         ...: import html
         ...: from urllib.request import urlopen
         ...: import pandas as pd
         ...:
         ...: countries = ['af', 'ax']
         ...:
         ...: def get_data(countries):
         ...:     url = 'https://www.cia.gov/library/publications/the-world-factbook/geos/'+countries+'.html'
         ...:     page = urlopen(url)
         ...:     soup = BeautifulSoup(page,'html.parser')
         ...:     # geography
         ...:     country = soup.find('span', {'class' : 'region'}).text
         ...:     population = soup.find('div', {'id' : 'field-population'}).find_next('span').get_text(strip=True)
         ...:     json_str = {"country":country, "population":population}
         ...:     return json_str
         ...: results = [get_data(p) for p in countries]
         ...: df = pd.DataFrame(results)

    In [137]: df
    Out[137]:
           country                                         population
    0  Afghanistan                                         36,643,815
    1     Akrotiri  approximately 15,500 on the Sovereign Base Are...
If you rewrite your original function as:
    def get_data(countries):
        url = 'https://www.cia.gov/library/publications/the-world-factbook/geos/'+countries+'.html'
        page = urlopen(url)
        soup = BeautifulSoup(page,'html.parser')
        # geography
        country = soup.find('span', {'class' : 'region'}).text
        population = soup.find('div', {'id' : 'field-population'}).find_next('span').get_text(strip=True)
        return country, population
and call
    results = [get_data(p) for p in countries]
as you suggested, you can do something like this:
    def listToFrame(res, column_labels=None):
        C = len(res[0])  # number of columns
        if column_labels is None:
            column_labels = list(range(C))
        dct = {}
        for c in range(C):
            col = []
            for r in range(len(res)):
                col.append(res[r][c])
            dct[column_labels[c]] = col
        return pd.DataFrame(dct)

    df = listToFrame(results)
or, even nicer,
    df = listToFrame(results, ['Country', 'Population'])
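A possibly simpler alternative, not part of the approach above: pd.DataFrame accepts a list of tuples directly, and its columns argument supplies the labels, so a helper like listToFrame may not be needed. A small sketch, with hard-coded sample tuples standing in for the values get_data() would return:

```python
import pandas as pd

# Sample values standing in for [get_data(p) for p in countries];
# each tuple is (country, population), as returned by the rewritten function.
results = [
    ('Afghanistan', '36,643,815'),
    ('Akrotiri', 'approximately 15,500'),
]

# Each tuple becomes one row; `columns` names the two fields.
df = pd.DataFrame(results, columns=['Country', 'Population'])
print(df)
```

Note that the population values remain strings here, just as they come out of the scraped page.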