How to solve a KeyError while parsing an XML file in Python

I have the following XML file, which I want to convert to a pandas DataFrame.

row {'Id': '-1', 'Reputation': '1', 'CreationDate': '2009-09-28T00:00:00.000', 'DisplayName': 'Community', 'LastAccessDate': '2010-11-10T17:25:34.627', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Location': 'on the server farm', 'AboutMe': '<p>Hi, I\'m not really a person.</p>\n\n<p>I\'m a background process that helps keep this site clean!</p>\n\n<p>I do things like</p>\n\n<ul>\n<li>Randomly poke old unanswered questions every hour so they get some attention</li>\n<li>Own community questions and answers so nobody gets unnecessary reputation from them</li>\n<li>Own downvotes on spam/evil posts that get permanently deleted</li>\n<li>Own suggested edits from anonymous users</li>\n<li><a href="http://meta.stackexchange.com/a/92006">Remove abandoned questions</a></li>\n</ul>\n', 'Views': '0', 'UpVotes': '21001', 'DownVotes': '27468', 'AccountId': '-1'}
row {'Id': '1', 'Reputation': '21228', 'CreationDate': '2009-09-28T14:35:46.490', 'DisplayName': 'Anton Geraschenko', 'LastAccessDate': '2020-05-17T06:51:32.333', 'WebsiteUrl': 'http://stacky.net', 'Location': 'Palo Alto, CA, United States', 'AboutMe': '<p>You can get in touch with me at [email protected].</p>\n', 'Views': '25360', 'UpVotes': '1052', 'DownVotes': '90', 'AccountId': '36500'}

The following code works for an almost identical XML file, but when I run it on this file I get an error:

CODE

users_tree = ET.parse("/content/Users.xml")
users_root = users_tree.getroot()

file_path_users = r"/content/Users.xml"
dict_list_users = []

for _, elem in ET.iterparse(file_path_users, events=("end",)):
    if elem.tag == "row":
        dict_list_users.append({'UserId': elem.attrib['Id'],
                                'Reputation': elem.attrib['Reputation'],
                                'CreationDate': elem.attrib['CreationDate'],
                                'DisplayName': elem.attrib['DisplayName'],
                                'LastAccessDate': elem.attrib['LastAccessDate'],
                                'WebsiteUrl': elem.attrib['WebsiteUrl'],
                                'Location': elem.attrib['Location'],
                                'AboutMe': elem.attrib['AboutMe'],
                                'Views': elem.attrib['Views'],
                                'UpVotes': elem.attrib['UpVotes'],
                                'DownVotes': elem.attrib['DownVotes'],
                                'AccountId': elem.attrib['AccountId']})
elem.clear()

df_users = pd.DataFrame(dict_list_users)

ERROR

KeyError                                  Traceback (most recent call last)
<ipython-input-18-7af87798bae8> in <module>()
     24                           'DisplayName': elem.attrib['DisplayName'],
     25                           'LastAccessDate': elem.attrib['LastAccessDate'],
---> 26                           'WebsiteUrl': elem.attrib['WebsiteUrl'],
     27                           'Location': elem.attrib['Location'],
     28                           'AboutMe': elem.attrib['AboutMe'],

KeyError: 'WebsiteUrl'

NOTE: This error occurs for every attribute after LastAccessDate, i.e., even if I remove the WebsiteUrl key, I get the same error for the next attribute, and so on.

Please provide me a way to fix this.

1 Answer(s)

The error appears to be caused by attributes that are missing from one or more of the <row> elements: elem.attrib['WebsiteUrl'] raises a KeyError as soon as a row has no WebsiteUrl attribute. Instead of explicitly assigning each dictionary key from a named attribute, consider retrieving all attributes at once. The DataFrame constructor will then fill in NaN for rows that lack an attribute.

for _, elem in ET.iterparse(file_path_users, events=("end",)):
    if elem.tag == "row":
        dict_list_users.append(dict(elem.attrib))  # RETRIEVE ALL ATTRIBUTES (COPIED, SO clear() CANNOT EMPTY THEM)
        elem.clear()                               # SHOULD BE AT NESTED LEVEL

df_users = pd.DataFrame(dict_list_users)
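To see the NaN filling in isolation, here is a minimal self-contained sketch. It assumes a made-up two-row sample (the second row and the names SAMPLE_XML, records, and df_sample are hypothetical, not taken from your Users.xml) in which the second row has no WebsiteUrl or Location attribute:

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: the second <row> omits WebsiteUrl and Location.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="-1" Reputation="1" DisplayName="Community"
       WebsiteUrl="http://meta.stackexchange.com/" Location="on the server farm" />
  <row Id="1" Reputation="21228" DisplayName="Example User" />
</users>
"""

records = []
for _, elem in ET.iterparse(io.BytesIO(SAMPLE_XML), events=("end",)):
    if elem.tag == "row":
        # Copy the attributes so that clear() cannot affect the stored dict.
        records.append(dict(elem.attrib))
        elem.clear()

# Keys missing from a row become NaN in the corresponding column.
df_sample = pd.DataFrame(records)

# Count the missing values per column to see which attributes are absent.
print(df_sample.isna().sum())
```

Running the same isna().sum() check on your real df_users should show which attributes are absent and in how many rows.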

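If the above pulls in more columns than needed, keep only the relevant columns with reindex. Because the XML attribute is named Id rather than UserId, rename it first; otherwise reindex would create a UserId column containing only NaN:

df_users = (df_users.rename(columns={'Id': 'UserId'})
                    .reindex(['UserId', 'Reputation', 'CreationDate', 'DisplayName',
                              'LastAccessDate', 'WebsiteUrl', 'Location', 'AboutMe',
                              'Views', 'UpVotes', 'DownVotes', 'AccountId'],
                             axis='columns'))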
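If you would rather keep an explicit column mapping like the one in the question, another option is dict.get, which returns None instead of raising KeyError when an attribute is absent. The following is only a sketch under assumptions: COLUMN_TO_ATTR, rows_to_records, and the inline sample are hypothetical names and data introduced for illustration, and the mapping is shortened to four columns.

```python
import io
import xml.etree.ElementTree as ET

# Output column name -> XML attribute name (hypothetical, shortened mapping).
COLUMN_TO_ATTR = {
    "UserId": "Id",
    "DisplayName": "DisplayName",
    "WebsiteUrl": "WebsiteUrl",
    "Location": "Location",
}


def rows_to_records(source):
    """Collect one dict per <row>, using None for attributes a row lacks."""
    records = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # .get() returns None for a missing attribute instead of raising.
            records.append(
                {col: elem.attrib.get(attr) for col, attr in COLUMN_TO_ATTR.items()}
            )
            elem.clear()  # free the element once its values are copied out
    return records


# Hypothetical sample: the first row has no WebsiteUrl, neither has a Location.
sample = io.BytesIO(
    b'<users>'
    b'<row Id="1" DisplayName="A" />'
    b'<row Id="2" DisplayName="B" WebsiteUrl="http://stacky.net" />'
    b'</users>'
)
user_records = rows_to_records(sample)
print(user_records)
```

Passing user_records to pd.DataFrame then gives exactly the columns named in the mapping, with missing values where a row lacked the attribute, so no rename or reindex step is needed afterwards.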