How can I pull just the actual text out of this XML structure?

I have a block of XML, in its own XML file, that I need to pull only the letters from. I haven’t been able to do this with the <text> tag, because it returns only the information that comes before the letter on each line.

<text font="TimesNewRomanPSMT" bbox="72.024,707.275,78.769,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">T</text>
<text font="TimesNewRomanPSMT" bbox="78.769,707.275,84.289,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">h</text>
<text font="TimesNewRomanPSMT" bbox="84.289,707.275,87.359,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="87.359,707.275,91.653,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
<text font="TimesNewRomanPSMT" bbox="91.697,707.275,94.457,718.315" colourspace="DeviceGray" ncolour="0" size="11.040"> </text>
<text font="TimesNewRomanPSMT" bbox="94.336,707.275,97.405,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="97.449,707.275,101.744,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>

So in theory these 7 lines would give back the string "This is". There may be a stupidly simple answer to this, but I don’t understand XML even in the slightest, so I haven’t any idea what that answer would be.

1 Answer(s)

You can just use BeautifulSoup:

from bs4 import BeautifulSoup as bs

xml = '''
<text font="TimesNewRomanPSMT" bbox="72.024,707.275,78.769,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">T</text>
<text font="TimesNewRomanPSMT" bbox="78.769,707.275,84.289,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">h</text>
<text font="TimesNewRomanPSMT" bbox="84.289,707.275,87.359,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="87.359,707.275,91.653,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
<text font="TimesNewRomanPSMT" bbox="91.697,707.275,94.457,718.315" colourspace="DeviceGray" ncolour="0" size="11.040"> </text>
<text font="TimesNewRomanPSMT" bbox="94.336,707.275,97.405,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="97.449,707.275,101.744,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
'''

# Parse the markup (the 'lxml' parser requires the lxml package to be installed)
soup = bs(xml, 'lxml')

# Collect every <text> element, then join their contents into one string
letters = soup.find_all('text')
print(''.join(i.text for i in letters))

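If you would rather not install anything, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under a couple of assumptions: the <root> wrapper and the names SAMPLE and extract_text are made up for illustration (a standalone XML document needs a single root element, and your real file presumably already has one). The point it shows is that the letter is the element's .text, while font, bbox, size and so on live in .attrib, which may be the "information from before the letter" you were getting back.

```python
import xml.etree.ElementTree as ET

# The seven <text> elements from the question, wrapped in a hypothetical
# <root> element so the snippet is a well-formed XML document.
SAMPLE = """<root>
<text font="TimesNewRomanPSMT" bbox="72.024,707.275,78.769,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">T</text>
<text font="TimesNewRomanPSMT" bbox="78.769,707.275,84.289,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">h</text>
<text font="TimesNewRomanPSMT" bbox="84.289,707.275,87.359,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="87.359,707.275,91.653,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
<text font="TimesNewRomanPSMT" bbox="91.697,707.275,94.457,718.315" colourspace="DeviceGray" ncolour="0" size="11.040"> </text>
<text font="TimesNewRomanPSMT" bbox="94.336,707.275,97.405,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="97.449,707.275,101.744,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
</root>"""


def extract_text(xml_string):
    """Join the character content of every <text> element, in document order."""
    root = ET.fromstring(xml_string)
    # root.iter("text") walks the whole tree, so nesting depth does not matter.
    # elem.text is what sits between the tags; elem.attrib holds font, bbox, etc.
    # "or ''" guards against empty elements, whose .text is None.
    return "".join(elem.text or "" for elem in root.iter("text"))


print(extract_text(SAMPLE))  # prints: This is
```

To run it against the file itself rather than a string, ET.parse("your_file.xml").getroot() gives you the root element to iterate over in the same way (the filename here is a placeholder).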
Answered on July 17, 2020.