Create CSV from XML/Json using Python Pandas
I am trying to parse an XML into multiple different files.
Sample XML
<integration-outbound:IntegrationEntity xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <integrationEntityHeader> <integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber> <referenceCodeForEntity>25428</referenceCodeForEntity> <attachments> <attachment> <id>d6esd1d518b06019e01</id> <name>durance.pdf</name> <size>0</size> </attachment> <attachment> <id>182e60164ddd4236b5bd96109</id> <name>ssds</name> <size>0</size> </attachment> </attachments> <source>SIM</source> <entity>SUPPLIER</entity> <action>CREATE</action> <timestampUTC>20200306T151721</timestampUTC> <zDocBaseVersion>2.0</zDocBaseVersion> <zDocCustomVersion>0</zDocCustomVersion> </integrationEntityHeader> <integrationEntityDetails> <supplier> <requestId>2614352</requestId> <controlBlock> <dataProcessingInfo> <key>MODE</key> <value>Onboarding</value> </dataProcessingInfo> <dataProcessingInfo> <key>Supplier_Type</key> <value>Operational</value> </dataProcessingInfo> </controlBlock> <id>1647059</id> <facilityCode>0001</facilityCode> <systemCode>1</systemCode> <supplierType>Operational</supplierType> <systemFacilityDetails> <systemFacilityDetail> <facilityCode>0001</facilityCode> <systemCode>1</systemCode> <FacilityStatus>ACTIVE</FacilityStatus> </systemFacilityDetail> </systemFacilityDetails> <status>ACTIVE</status> <companyDetails> <displayGSID>254232128</displayGSID> <legalCompanyName>asdasdsads</legalCompanyName> <dunsNumber>03-175-2493</dunsNumber> <legalStructure>1</legalStructure> <website>www.aaadistributor.com</website> <noEmp>25</noEmp> <companyIndicator1099>No</companyIndicator1099> <taxidAndWxformRequired>NO</taxidAndWxformRequired> <taxidFormat>Fed. 
Tax</taxidFormat> <wxForm>182e601649ade4c38cd4236b5bd96109</wxForm> <taxid>27-2204474</taxid> <companyTypeFix>SUPPLIER</companyTypeFix> <fields> <field> <id>LOW_CUURENT_SERV</id> <value>1</value> </field> <field> <id>LOW_COI</id> <value>USA</value> </field> <field> <id>LOW_STATE_INCO</id> <value>US-PA</value> </field> <field> <id>CERT_INSURANCE</id> <value>d6e6e460fe8958564c1d518b06019e01</value> </field> <field> <id>COMP_DBA</id> <value>asdadas</value> </field> <field> <id>LOW_AREUDIVE</id> <value>N</value> </field> <field> <id>LOW_BU_SIZE1</id> <value>SMLBUS</value> </field> <field> <id>EDI_CAP</id> <value>Y</value> </field> <field> <id>EDI_WEB</id> <value>N</value> </field> <field> <id>EDI_TRAD</id> <value>N</value> </field> </fields> </companyDetails> <allLocations> <location> <addressInternalid>1704342</addressInternalid> <isDelete>false</isDelete> <internalSupplierid>1647059</internalSupplierid> <acctGrpid>HQ</acctGrpid> <address1>2501 GRANT AVE</address1> <country>USA</country> <state>US-PA</state> <city>PHILADELPHIA</city> <zip>19114</zip> <phone>(215) 745-7900</phone> </location> </allLocations> <contactDetails> <contactDetail> <contactInternalid>12232</contactInternalid> <isDelete>false</isDelete> <addressInternalid>1704312142</addressInternalid> <contactType>Main</contactType> <firstName>Raf</firstName> <lastName>jas</lastName> <title>Admin</title> <email>[email protected]</email> <phoneNo>123-42-23-23</phoneNo> <createPortalLogin>yes</createPortalLogin> <allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts> </contactDetail> <contactDetail> <contactInternalid>1944938</contactInternalid> <isDelete>false</isDelete> <addressInternalid>1704342</addressInternalid> <contactType>Rad</contactType> <firstName>AVs</firstName> <lastName>asd</lastName> <title>Founder</title> <email>[email protected]</email> <phoneNo>21521-2112-7900</phoneNo> <createPortalLogin>yes</createPortalLogin> 
<allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts> </contactDetail> </contactDetails> <myLocation> <addresses> <myLocationsInternalid>1704342</myLocationsInternalid> <isDelete>false</isDelete> <addressInternalid>1704342</addressInternalid> <usedAt>N</usedAt> </addresses> </myLocation> <bankDetails> <fields> <field> <id>LOW_BANK_KEY</id> <value>123213</value> </field> <field> <id>LOW_EFT</id> <value>123123</value> </field> </fields> </bankDetails> <forms> <form> <id>CATEGORY_PRODSER</id> <records> <record> <Internalid>24348</Internalid> <isDelete>false</isDelete> <fields> <field> <id>CATEGOR_LEVEL_1</id> <value>MR</value> </field> <field> <id>LOW_PRODSERV</id> <value>RES</value> </field> <field> <id>LOW_LEVEL_2</id> <value>keylevel221</value> </field> <field> <id>LOW_LEVEL_3</id> <value>keylevel3127</value> </field> <field> <id>LOW_LEVEL_4</id> <value>keylevel4434</value> </field> <field> <id>LOW_LEVEL_5</id> <value>keylevel5545</value> </field> </fields> </record> <record> <Internalid>24349</Internalid> <isDelete>false</isDelete> <fields> <field> <id>CATEGOR_LEVEL_1</id> <value>MR</value> </field> <field> <id>LOW_PRODSERV</id> <value>RES</value> </field> <field> <id>LOW_LEVEL_2</id> <value>keylevel221</value> </field> <field> <id>LOW_LEVEL_3</id> <value>keylevel3125</value> </field> <field> <id>LOW_LEVEL_4</id> <value>keylevel4268</value> </field> <field> <id>LOW_LEVEL_5</id> <value>keylevel5418</value> </field> </fields> </record> <record> <Internalid>24350</Internalid> <isDelete>false</isDelete> <fields> <field> <id>CATEGOR_LEVEL_1</id> <value>MR</value> </field> <field> <id>LOW_PRODSERV</id> <value>RES</value> </field> <field> <id>LOW_LEVEL_2</id> <value>keylevel221</value> </field> <field> <id>LOW_LEVEL_3</id> <value>keylevel3122</value> </field> <field> <id>LOW_LEVEL_4</id> <value>keylevel425</value> </field> <field> <id>LOW_LEVEL_5</id> <value>keylevel5221</value> </field> </fields> </record> </records> </form> <form> 
<id>OTHER_INFOR</id> <records> <record> <isDelete>false</isDelete> <fields> <field> <id>S_EAST</id> <value>N</value> </field> <field> <id>W_EST</id> <value>N</value> </field> <field> <id>M_WEST</id> <value>N</value> </field> <field> <id>N_EAST</id> <value>N</value> </field> <field> <id>LOW_AREYOU_ASSET</id> <value>-1</value> </field> <field> <id>LOW_SWART_PROG</id> <value>-1</value> </field> </fields> </record> </records> </form> <form> <id>ABDCEDF</id> <records> <record> <isDelete>false</isDelete> <fields> <field> <id>LOW_COD_CONDUCT</id> <value>-1</value> </field> </fields> </record> </records> </form> <form> <id>CODDUC</id> <records> <record> <isDelete>false</isDelete> <fields> <field> <id>LOW_SUPPLIER_TYPE</id> <value>2</value> </field> <field> <id>LOW_DO_INT_BOTH</id> <value>1</value> </field> </fields> </record> </records> </form> </forms> </supplier> </integrationEntityDetails> </integration-outbound:IntegrationEntity>
The goal is to put a common XML-to-CSV conversion in place. Based on an input file, the XML should be flattened, exploded into multiple CSVs, and stored.
The inputs are the XML above and the config CSV below. I need to create 3 CSV files with the corresponding XPaths mentioned in the file:
XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y
I need to create 3 CSV output files from it.
The design is to pick each CSV file, get its XPaths from the config, and fetch the corresponding values from the XML.
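That config-driven design can start by grouping the config rows by their target CSV file. A minimal sketch, assuming the config column headers shown above (`load_config` is a hypothetical helper, not a library function):

```python
import csv
from collections import defaultdict

def load_config(path):
    """Group (XPATH, ColumName) pairs by their target CSV_File_Name."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            groups[row["CSV_File_Name"]].append((row["XPATH"], row["ColumName"]))
    return dict(groups)
```

Each resulting group is then processed independently, producing one output CSV per key.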
Step 1 – Convert the XML to JSON:
import json
import xmltodict

with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Serialize the parsed dict to a JSON file
json_data = json.dumps(data_dict)
with open("data.json", "w") as json_file:
    json_file.write(json_data)

with open("data.json") as f:
    d = json.load(f)
Step 2 – Normalize using the pandas json_normalize function. The code takes each XPath, converts / to . and treats [] as the explode delimiter, and builds the columns to be fetched from the JSON. That is, it looks for /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber, converts it to integrationEntityHeader.integrationTrackingNumber, explodes at the first [], and so on.
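The path-to-column conversion described above can be factored into two small helpers. A sketch under the assumption that the dotted names match what `pd.json_normalize` produces (the helper names are mine):

```python
def xpath_to_column(xpath):
    """Turn a config XPath into the dotted column name json_normalize produces."""
    return xpath.strip("/").replace("/", ".").replace("[]", "")

def explode_points(xpath):
    """Return the dotted prefixes that must be exploded, one per '[]' marker."""
    parts = xpath.strip("/").replace("/", ".").split("[]")
    return ["".join(parts[:i + 1]) for i in range(len(parts) - 1)]
```

For example, `/Company/Employee[]/Addresses/Address[]/City` yields the column `Company.Employee.Addresses.Address.City` and two explode points, `Company.Employee` and `Company.Employee.Addresses.Address`.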
df_1 = pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2 = df_1[['integrationEntityHeader.integrationTrackingNumber',
             'integrationEntityDetails.supplier.requestId',
             'integrationEntityHeader.referenceCodeForEntity',
             'integrationEntityDetails.supplier.id',
             'integrationEntityDetails.supplier.forms.form']]
df_3 = df_2.explode('integrationEntityDetails.supplier.forms.form')
df_3['integrationEntityDetails.supplier.forms.form.id'] = df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records'] = df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))
I was trying to use the metadata from the CSV file to fetch the values, but ran into a challenge:
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))
It failed with the error below.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'
The reason is that when the XML element repeats, the pandas dataframe holds a list instead of a dict, and it cannot be fetched with the method above.
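One way around this is to normalize xmltodict's output before calling `.get()`: a repeated element parses to a list, while a single occurrence parses to a dict, so a tiny wrapper makes both shapes uniform. A sketch (`ensure_list` is a hypothetical helper, not part of pandas or xmltodict):

```python
def ensure_list(value):
    """Wrap a lone dict (or any scalar) in a list; pass lists through unchanged."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]
```

With this, a column can first be mapped through `ensure_list` and then exploded, so the subsequent `.apply(lambda x: x.get(...))` always sees a dict.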
Below is the output generated
integrationEntityHeader.integrationTrackingNumber integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity integrationEntityDetails.supplier.id integrationEntityDetails.supplier.forms.form integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records 0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}} CATEGORY_PRODSER {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, 
{'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]} 0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}} OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}} 0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}} CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}} 0 281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}} PRODUCTSERVICES {'record': {'isDelete': 'false', 
'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}
Expected output, integrationEntityDetailsForms.csv:
integrationTrackingNumber requestId referenceCodeForEntity supplier.id integrationEntityDetails.supplier.forms.form.id InternalId isDelete SupplierFormRecordFieldId SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24348 FALSE LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_PRODSERV RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CATEGORY_PRODSER 24350 FALSE LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE S_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE W_EST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE M_WEST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE N_EAST N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 OTHER_INFOR FALSE LOW_AREYOU_ASSET -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_SWART_PROG -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 CORPORATESUSTAINABILITY FALSE LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_SUPPLIER_TYPE 2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428 1647059 PRODUCTSERVICES FALSE LOW_DO_INT_BOTH 1
Consider XSLT, the special-purpose language designed to transform XML files, including flattening them at certain sections. Python's third-party module lxml can run XSLT 1.0 scripts and XPath 1.0 expressions.
Specifically, XSLT can handle your XPath extractions; then, from the single transformed result tree, the three needed data frames can be built. For well-formedness, the code below assumes the following root and data structure:
<integration-outbound:IntegrationEntity xmlns:integration-outbound="http://example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> ...same content... </integration-outbound:IntegrationEntity>
XSLT (save as .xsl, a special .xml file)
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:integration-outbound="http://example.com"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="integration-outbound:IntegrationEntity">
    <data>
      <xsl:apply-templates select="integrationEntityHeader/descendant::attachment"/>
      <xsl:apply-templates select="integrationEntityDetails/descendant::dataProcessingInfo"/>
      <xsl:apply-templates select="integrationEntityDetails/descendant::forms/descendant::field"/>
    </data>
  </xsl:template>

  <xsl:template match="attachment">
    <integrationEntityHeader>
      <xsl:copy-of select="ancestor::integrationEntityHeader/*[name()!='attachments']"/>
      <xsl:copy-of select="*"/>
    </integrationEntityHeader>
  </xsl:template>

  <xsl:template match="dataProcessingInfo">
    <integrationEntityDetailsControlBlock>
      <xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
      <requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
      <supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
      <xsl:copy-of select="*"/>
    </integrationEntityDetailsControlBlock>
  </xsl:template>

  <xsl:template match="field">
    <integrationEntityDetailsForms>
      <form_id><xsl:value-of select="ancestor::form/id"/></form_id>
      <xsl:copy-of select="ancestor::record/*[name()!='fields']"/>
      <SupplierFormRecordFieldId><xsl:value-of select="id"/></SupplierFormRecordFieldId>
      <SupplierFormRecordFieldValue><xsl:value-of select="value"/></SupplierFormRecordFieldValue>
      <xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
      <requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
      <supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
    </integrationEntityDetailsForms>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse('Input.xml')
style = et.parse('Script.xsl')

# INITIALIZE AND RUN TRANSFORMATION
transformer = et.XSLT(style)
flat_doc = transformer(doc)

# BUILD THREE DATA FRAMES
df_header = pd.DataFrame([{i.tag: i.text for i in el} for el in flat_doc.xpath('integrationEntityHeader')])
df_detailsControlBlock = pd.DataFrame([{i.tag: i.text for i in el} for el in flat_doc.xpath('integrationEntityDetailsControlBlock')])
df_detailsForms = pd.DataFrame([{i.tag: i.text for i in el} for el in flat_doc.xpath('integrationEntityDetailsForms')])
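The `{i.tag: i.text for i in el}` comprehension is what turns each flattened element into one dataframe row. A self-contained illustration of the same pattern on a toy tree, using the stdlib ElementTree instead of lxml (the element names here are invented for the demo):

```python
import xml.etree.ElementTree as ET
import pandas as pd

flat = ET.fromstring(
    "<data>"
    "<row><id>1</id><name>a</name></row>"
    "<row><id>2</id><name>b</name></row>"
    "</data>"
)
# One dict per flattened element: child tag -> child text.
df = pd.DataFrame([{i.tag: i.text for i in el} for el in flat.findall("row")])
```

Each frame built this way can then be written out with `df.to_csv(..., index=False)`.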
The XML is converted to a dict and the parsing logic is written against that; the reason is that the same code can then be used for JSON input. Stack Overflow has been amazingly helpful, and this solution is built on the responses from all those links. For simplicity I have created a 3-level nested XML. This works on Python 3.
<?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>[email protected]</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>[email protected]</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>[email protected]</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>[email protected]</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>[email protected]</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>[email protected]</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
<?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>[email protected]</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>[email protected]</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
In the config file for this XML, every possible array/multi-level/explode column must be marked with []. The header row is required, as the code refers to its column names.
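Each [] level in the config corresponds to one `DataFrame.explode` call. A small sketch of the mixed dict/list case the config has to cope with; the wrapping step mirrors the `type(item) == list` check in the code below, and the column names here are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["Hal", "Iron"],
    "Address": [
        {"City": "Bangalore"},                           # single <Address> -> dict
        [{"City": "Bangalore"}, {"City": "Concord"}],    # repeated <Address> -> list
    ],
})
# Wrap lone dicts in a one-element list so explode treats every row the same way.
df["Address"] = df["Address"].apply(lambda v: v if isinstance(v, list) else [v])
df = df.explode("Address").reset_index(drop=True)
df["City"] = df["Address"].apply(lambda a: a.get("City"))
```

After the explode, the two-row frame becomes three rows, one per address, and scalar columns such as `FirstName` are repeated alongside.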
Change the variables to wherever you store the files: process_config_csv = 'config.csv' and xml_file_name = 'test.xml'.
XPATH,ColumName,CSV_File_Name
/Company/Employee[]/FirstName,FirstName,Name.csv
/Company/Employee[]/LastName,LastName,Name.csv
/Company/Employee[]/ContactNo,ContactNo,Name.csv
/Company/Employee[]/Email,Email,Name.csv
/Company/Employee[]/FirstName,FirstName,Address.csv
/Company/Employee[]/LastName,LastName,Address.csv
/Company/Employee[]/ContactNo,ContactNo,Address.csv
/Company/Employee[]/Email,Email,Address.csv
/Company/Employee[]/Addresses/Address[]/City,City,Address.csv
/Company/Employee[]/Addresses/Address[]/State,State,Address.csv
/Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
/Company/Employee[]/Addresses/Address[]/type,type,Address.csv
/Company/Employee[]/FirstName,FirstName,Form.csv
/Company/Employee[]/LastName,LastName,Form.csv
/Company/Employee[]/ContactNo,ContactNo,Form.csv
/Company/Employee[]/Email,Email,Form.csv
/Company/Employee[]/Addresses/Address[]/type,type,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
/Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
The code that creates multiple CSVs based on the config file is:
import json
import os
import csv
import sys
from collections import defaultdict

import numpy as np
import pandas as pd
import xmltodict


def getMatches(L1, L2):
    R = set()
    for elm in L1:
        for pat in L2:
            if elm.find(pat) != -1:
                if elm.find('.', len(pat) + 1) != -1:
                    R.add(elm[:elm.find('.', len(pat) + 1)])
                else:
                    R.add(elm)
    return list(R)


def xml_parse(xml_file_name):
    try:
        process_xml_file = xml_file_name
        with open(process_xml_file) as xml_file:
            for xml_string in xml_file:
                """Converting the xml to dict"""
                data_dict = xmltodict.parse(xml_string)
                """Converting the dict to a pandas DF"""
                df_processing = pd.json_normalize(data_dict)
                xml_parse_loop(df_processing)
    except Exception as e:
        print(str(e))


def xml_parse_loop(df_processing_input):
    CSV_File_Name = []
    """Getting the list of csv files to be created"""
    with open(process_config_csv, newline='') as csvfile:
        DataCaptured = csv.DictReader(csvfile)
        for row in DataCaptured:
            if row['CSV_File_Name'] not in CSV_File_Name:
                CSV_File_Name.append(row['CSV_File_Name'])
    """Iterating the list of CSVs"""
    for items in CSV_File_Name:
        df_processing = df_processing_input
        df_subset_process = []
        df_subset_list_all_cols = []
        df_process_sub_explode_Level = []
        df_final_column_name = []
        print('Parsing the xml file for creating the file - ' + str(items))
        """Fetching the field list to process from the config file"""
        with open(process_config_csv, newline='') as csvfile:
            DataCaptured = csv.DictReader(csvfile)
            for row in DataCaptured:
                if row['CSV_File_Name'] in items:
                    df_final_column_name.append(row['ColumName'])
                    """Getting the columns until the first []"""
                    df_subset_process.append(row['XPATH'].strip('/').replace("/", ".").split('[]')[0])
                    """Getting all the column names"""
                    df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/", ".").replace("[]", ""))
                    """Getting all the columns to explode"""
                    df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
        explode_ld = defaultdict(set)
        """Mapping explode level to column names"""
        for x in df_process_sub_explode_Level:
            if len(x) > 1:
                explode_ld[len(x) - 1].add(''.join(x[:-1]))
        explode_ld = {k: list(v) for k, v in explode_ld.items()}
        #print(' The all-column list for the file ' + items + " is " + str(df_subset_list_all_cols))
        #print(' The first processing for the file ' + items + " is " + str(df_subset_process))
        #print('The explode level of attributes for the file ' + items + " is " + str(explode_ld))
        """Remove duplicate columns"""
        df_subset_process = list(dict.fromkeys(df_subset_process))
        for col in df_subset_process:
            if col not in df_processing.columns:
                df_processing[col] = np.nan
        df_processing = df_processing[df_subset_process]
        df_processing_col_list = df_processing.columns.tolist()
        print('The total levels to be exploded : %d' % len(explode_ld))
        level = len(explode_ld)
        for i in range(level):
            print(' Exploding the Level : %d' % i)
            df_processing_col_list = df_processing.columns.tolist()
            list_of_explode = set(df_processing_col_list) & set(explode_ld[i + 1])
            #print('List to explode' + str(list_of_explode))
            """If found in the explode list, explode; some xml may not hold a list there - it could be a plain column - handle both"""
            for c in list_of_explode:
                print(' There are columns present which need to be exploded - ' + str(c))
                df_processing = pd.concat((
                    df_processing.iloc[[type(item) == list for item in df_processing[c]]].explode(c),
                    df_processing.iloc[[type(item) != list for item in df_processing[c]]]))
            print(' Finding the columns that need to be fetched ')
            """From the overall column list, fetch the attributes needed to explode"""
            next_level_pro_lst = getMatches(df_subset_list_all_cols, explode_ld[i + 1])
            #print(next_level_pro_lst)
            df_processing_col_list = df_processing.columns.tolist()
            for nex in next_level_pro_lst:
                #print("Fetching " + nex.rsplit('.', 1)[1] + ' from ' + nex.rsplit('.', 1)[0] + ' from ' + nex)
                parent_col = nex.rsplit('.', 1)[0]
                child_col = nex.rsplit('.', 1)[1]
                #print(parent_col)
                #print(df_processing_col_list)
                if parent_col not in df_processing_col_list:
                    df_processing[parent_col] = ""
                try:
                    df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                except AttributeError:
                    df_processing[nex] = ""
                df_processing_col_list = df_processing.columns.tolist()
            if i == level - 1:
                print('Last level, nothing to be done')
            else:
                """Extracting all columns until the next explode column list is found"""
                while len(set(df_processing_col_list) & set(explode_ld[i + 2])) == 0:
                    next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                    #print(next_level_pro_lst)
                    for nextval in next_level_pro_lst:
                        if nextval not in df_processing_col_list:
                            #print("Fetching " + nextval.rsplit('.', 1)[1] + ' from ' + nextval.rsplit('.', 1)[0] + ' from ' + nextval)
                            if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                df_processing[nextval.rsplit('.', 1)[0]] = ""
                            try:
                                df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                            except AttributeError:
                                df_processing[nextval] = ""
                    df_processing_col_list = df_processing.columns.tolist()
        df_processing = df_processing[df_subset_list_all_cols]
        df_processing.columns = df_final_column_name
        # if the file does not exist, write with the header
        if not os.path.isfile(items):
            print("The file does not exist, so writing new")
            df_processing.to_csv('{}'.format(items), header='column_names', index=None)
        else:
            # else it exists, so append without writing the header
            print("The file exists, so appending")
            df_processing.to_csv('{}'.format(items), mode='a', header=False, index=None)


from datetime import datetime

startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
startTime = str(os.getpid()) + "_" + startTime
process_task_name = ''
process_config_csv = 'config.csv'
xml_file_name = 'test.xml'
old_print = print

def timestamped_print(*args, **kwargs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    printheader = now + " xml_parser
" + " " + process_task_name + " - " old_print(printheader, *args, **kwargs) print = timestamped_print xml_parse(xml_file_name)
The output files created are:
```
[, ~]$ cat Name.csv
FirstName,LastName,ContactNo,Email
Hal,Thanos,122131,[email protected]
Iron,Man,12324,[email protected]
Captain,America,13322,[email protected]
Sword,Man,12324,[email protected]
Thor,Odison,156565,[email protected]
Spider,Man,12324,[email protected]
Black,Widow,16767,[email protected]
White,Man,5634,[email protected]

[, ~]$ cat Address.csv
FirstName,LastName,ContactNo,Email,City,State,Zip,type
Iron,Man,12324,[email protected],Bangalore,Karnataka,560212,Permanent
Iron,Man,12324,[email protected],Concord,NC,28027,Temporary
Hal,Thanos,122131,[email protected],Bangalore,Karnataka,560212,
Sword,Man,12324,[email protected],Bangalore,Karnataka,560212,Permanent
Sword,Man,12324,[email protected],Concord,NC,28027,Temporary
Captain,America,13322,[email protected],Trivandrum,Kerala,28115,
Spider,Man,12324,[email protected],Bangalore,Karnataka,560212,Permanent
Spider,Man,12324,[email protected],Concord,NC,28027,Temporary
Thor,Odison,156565,[email protected],Tirunelveli,TamilNadu,36595,
White,Man,5634,[email protected],Bangalore,Karnataka,560212,Permanent
White,Man,5634,[email protected],Concord,NC,28027,Temporary
Black,Widow,16767,[email protected],Mysore,Karnataka,12478,

[, ~]$ cat Form.csv
FirstName,LastName,ContactNo,Email,type,id,value
Iron,Man,12324,[email protected],Temporary,ID1,LIC
Iron,Man,12324,[email protected],Temporary,ID2,PAS
Iron,Man,12324,[email protected],Temporary,ID3,SSN
Iron,Man,12324,[email protected],Temporary,ID2,CC
Hal,Thanos,122131,[email protected],,ID1,LIC
Hal,Thanos,122131,[email protected],,ID2,PAS
Iron,Man,12324,[email protected],Permanent,ID3,LIC
Sword,Man,12324,[email protected],Temporary,ID1,LIC
Sword,Man,12324,[email protected],Temporary,ID2,PAS
Sword,Man,12324,[email protected],Temporary,ID3,SSN
Sword,Man,12324,[email protected],Temporary,ID2,CC
Captain,America,13322,[email protected],,ID1,LIC
Captain,America,13322,[email protected],,ID2,PAS
Sword,Man,12324,[email protected],Permanent,ID3,LIC
Spider,Man,12324,[email protected],Temporary,ID1,LIC
Spider,Man,12324,[email protected],Temporary,ID2,PAS
Spider,Man,12324,[email protected],Temporary,ID3,SSN
Spider,Man,12324,[email protected],Temporary,ID2,CC
Thor,Odison,156565,[email protected],,ID1,LIC
Thor,Odison,156565,[email protected],,ID2,PAS
Spider,Man,12324,[email protected],Permanent,ID3,LIC
White,Man,5634,[email protected],Temporary,ID1,LIC
White,Man,5634,[email protected],Temporary,ID2,PAS
White,Man,5634,[email protected],Temporary,ID3,SSN
White,Man,5634,[email protected],Temporary,ID2,CC
White,Man,5634,[email protected],Permanent,ID3,LIC
Black,Widow,16767,[email protected],,ID1,LIC
```
The pieces of this answer were put together from different threads; thanks to @Mark Tolonen, @Mandy007 and @deadshot:
Create a dict of list using python from csv
https://stackoverflow.com/questions/62837949/extract-a-list-from-a-list
How to explode Panda column with data having different dict and list of dict
This can definitely be made shorter and more performant, and can be enhanced further.
I think this line is missing in the question:
```python
df_3['integrationEntityDetails.supplier.forms.form.records.record'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records'].apply(
        lambda x: x.get('record')
    )
)
```
Then, for the Internalid, you could do this:
```python
df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(
        lambda x: x[0].get('Internalid') if type(x) == list else x.get('Internalid')
    )
)
```
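The dict-or-list guard is needed because `xmltodict` yields a single dict when an element occurs once and a list of dicts when it repeats. A minimal sketch of the pattern on a hypothetical Series:

```python
import pandas as pd

# Hypothetical column: one record arrives as a dict, repeated records as a list.
records = pd.Series([
    {"Internalid": "1"},
    [{"Internalid": "2"}, {"Internalid": "3"}],
])

# Take the first element when it is a list, otherwise use the dict directly.
internal_ids = records.apply(
    lambda x: x[0].get("Internalid") if isinstance(x, list) else x.get("Internalid")
)

print(internal_ids.tolist())  # ['1', '2']
```

Note that this keeps only the first record of a list; if every repeated record matters, explode the column first as the main script does.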
Spark to the rescue!
The following code is in Scala, but it can easily be converted to Python if you wish.
Databricks' spark-xml library makes XML processing easy.
val headers = spark.read.format("xml").option("rowTag", "integrationEntityHeader").load("stackOverflowRafaXML.xml") headers.write.csv(<headerFilename>) // Create CSV from the header file val details = spark.read.format("xml").option("rowTag", "integrationEntityDetails").load("stackOverflowRafaXML.xml") // The details need further unnesting. To get suppliers, for instance, you can do val supplier = spark.read.format("xml").option("rowTag", "supplier").load("stackOverflowRafaXML.xml") supplier.show +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+ | allLocations| bankDetails| companyDetails| contactDetails| controlBlock|facilityCode| forms| id| myLocation|requestId|status|supplierType|systemCode|systemFacilityDetails| +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+ |[[HQ, 2501 GRANT ...|[[[[LOW_BANK_KEY,...|[No, SUPPLIER, 25...|[[[1704312142, SI...|[[[MODE, Onboardi...| 1|[[[CATEGORY_PRODS...|1647059|[[1704342, false,...| 2614352|ACTIVE| Operational| 1| [[ACTIVE, 1, 1]]| +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+