How can I pull just the actual text out of this XML structure?
I have this block of XML, in its own XML file, from which I need to pull only the letters. I haven’t been able to use the <text> tag, as it returns only the information from before the letter on each line.
<text font="TimesNewRomanPSMT" bbox="72.024,707.275,78.769,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">T</text>
<text font="TimesNewRomanPSMT" bbox="78.769,707.275,84.289,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">h</text>
<text font="TimesNewRomanPSMT" bbox="84.289,707.275,87.359,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="87.359,707.275,91.653,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
<text font="TimesNewRomanPSMT" bbox="91.697,707.275,94.457,718.315" colourspace="DeviceGray" ncolour="0" size="11.040"> </text>
<text font="TimesNewRomanPSMT" bbox="94.336,707.275,97.405,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="97.449,707.275,101.744,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
So in theory these 7 lines would give back the string "This is". There may be a stupidly simple answer to this, but I don’t understand XML even in the slightest, so I haven’t any idea what that answer would be.
You can just use BeautifulSoup: find every <text> element and join their text.
from bs4 import BeautifulSoup as bs

xml = '''
<text font="TimesNewRomanPSMT" bbox="72.024,707.275,78.769,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">T</text>
<text font="TimesNewRomanPSMT" bbox="78.769,707.275,84.289,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">h</text>
<text font="TimesNewRomanPSMT" bbox="84.289,707.275,87.359,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="87.359,707.275,91.653,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
<text font="TimesNewRomanPSMT" bbox="91.697,707.275,94.457,718.315" colourspace="DeviceGray" ncolour="0" size="11.040"> </text>
<text font="TimesNewRomanPSMT" bbox="94.336,707.275,97.405,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">i</text>
<text font="TimesNewRomanPSMT" bbox="97.449,707.275,101.744,718.315" colourspace="DeviceGray" ncolour="0" size="11.040">s</text>
'''

# The 'lxml' parser requires the lxml package to be installed.
soup = bs(xml, 'lxml')

# Each <text> element holds one character; join them in document order.
print("".join(tag.text for tag in soup.find_all('text')))
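If you would rather not install anything, the standard library's xml.etree.ElementTree should be able to do the same job. This is only a sketch under a couple of assumptions: the <page> root element and the extract_text helper are made up for illustration (a well-formed XML file needs a single root element, but the question does not show what it is called), and the attributes in the sample are shortened.

```python
import xml.etree.ElementTree as ET

# Shortened sample data. The <page> root is an assumption: the question
# only shows the <text> elements, not the element that encloses them.
SAMPLE = """<page>
<text font="TimesNewRomanPSMT" size="11.040">T</text>
<text font="TimesNewRomanPSMT" size="11.040">h</text>
<text font="TimesNewRomanPSMT" size="11.040">i</text>
<text font="TimesNewRomanPSMT" size="11.040">s</text>
<text font="TimesNewRomanPSMT" size="11.040"> </text>
<text font="TimesNewRomanPSMT" size="11.040">i</text>
<text font="TimesNewRomanPSMT" size="11.040">s</text>
</page>"""


def extract_text(xml_source: str) -> str:
    """Concatenate the character data of every <text> element, in order."""
    root = ET.fromstring(xml_source)
    # el.text is None for an empty element, so fall back to "".
    return "".join(el.text or "" for el in root.iter("text"))


print(extract_text(SAMPLE))  # prints: This is
```

To read from the file directly, ET.parse(path).getroot() gives you the same kind of root element, and the root.iter("text") loop works on it unchanged.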