Python lxml extract text when a tag exists in the middle of the text

I am trying to parse and extract all the text inside the claim-text tags and prepare it for a CSV, so that each claim tag gets a column containing all of its claim-text.

The claims come in two styles. In the first, <claim id="CLM-00001" num="00001">, claim-text tags are nested inside an outer claim-text tag. In the second, <claim id="CLM-00002" num="00002">, a <claim-ref> tag sits in the middle of the text, which seems to be my problem.

    <claims id="claims">
        <claim id="CLM-00001" num="00001">
            <claim-text>1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:
                <claim-text>mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;</claim-text>
                <claim-text>compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;</claim-text>
                <claim-text>cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;</claim-text>
                <claim-text>expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and</claim-text>
                <claim-text>cooling the expanded foam material in order to allow the foam material to remain amorphous.</claim-text>
            </claim-text>
        </claim>
        <claim id="CLM-00002" num="00002">
            <claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during expansion.</claim-text>
        </claim>
        <claim id="CLM-00003" num="00003">
            <claim-text>3. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during compaction.</claim-text>
        </claim>
        ... ... ...
    </claims>

I tried this: Python element tree – extract text from element, stripping tags
and this: python xml.etree.ElementTree remove empty tag in the middle of text

I tried the itertext() method. For the very first claim tag it gives me this, which is everything I need for the column:

['1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:\n                ', 'mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;', '\n                ', 'compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;', '\n                ', 'cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;', '\n                ', 'expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and', '\n                ', 'cooling the expanded foam material in order to allow the foam material to remain amorphous.', '\n            ', '\n        '] 

For the next claim tag, <claim id="CLM-00002" num="00002">, it should ideally give me:

The method according to wherein the gas-splitting propellant powder decomposes during expansion. 

but it gets me:

['2. The method according to ', '\n        '] 

The code I am using that gets me this result is:

    result = []
    for doc in root.xpath('//claims/claim/claim-text'):
        textwork = ((doc.getparent()).itertext('claim-text'))
        b=[]
        for texts in textwork:
            b.append(texts)
        result.append([b])
    write_all_to_csv(result, FILENAME_CLAIMS)

Note: The code is a shortened version. I also extract other things from the claims, which works fine. I shortened it to focus on the problem.

1 Answer(s)

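Just remove the tag name from the itertext() call; it will then extract all the relevant text within the tag. With the 'claim-text' filter, itertext() only visits claim-text elements, and the text that follows </claim-ref> is stored as the tail of the claim-ref element, so it gets skipped. Hope this helps.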

    from lxml import etree

    # xml is assumed to hold the <claims> document from the question as a string
    root = etree.fromstring(xml)
    result = []
    for doc in root.xpath('//claims/claim/claim-text'):
        # No tag filter: itertext() yields every text node under the parent <claim>
        textwork = ''.join(doc.getparent().itertext())
        result.append([textwork])
    print(result)
    # write_all_to_csv(result, FILENAME_CLAIMS)
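To see why the tag filter loses the second half of the sentence, here is a minimal sketch on a shortened copy of claim 2 from the question. The sample string and the claim_to_text helper are made up for illustration and are not part of the original code. Collapsing whitespace with split/join is one possible way to tidy the indentation before writing to CSV.

```python
from lxml import etree

# Shortened sample based on claim 2 from the question (illustrative only).
xml = (
    '<claims id="claims">'
    '<claim id="CLM-00002" num="00002">\n'
    '    <claim-text>2. The method according to '
    '<claim-ref idref="CLM-00001">claim 1</claim-ref>'
    ' wherein the gas-splitting propellant powder decomposes during expansion.'
    '</claim-text>\n'
    '</claim>'
    '</claims>'
)
root = etree.fromstring(xml)
claim = root.find('claim')
ref = claim.find('claim-text/claim-ref')

# The text after </claim-ref> belongs to claim-ref itself, as its tail.
print(repr(ref.text))
print(repr(ref.tail))

# With a tag filter, only claim-text elements are visited, so the
# text and tail of claim-ref never show up.
filtered = list(claim.itertext('claim-text'))
print(filtered)


def claim_to_text(claim_element):
    """Hypothetical helper: join all text nodes and collapse whitespace."""
    return ' '.join(''.join(claim_element.itertext()).split())


# Without a filter, every text node is yielded in document order.
full = claim_to_text(claim)
print(full)
```

Calling claim_to_text once per claim element (instead of once per claim-text) would also avoid visiting the same parent more than once if a claim ever had several top-level claim-text children.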
