What kind of object is this and how do I work with it?
I called urllib.request.urlopen on a URL and then used read(). This is the content I got:
b'%PDF-1.5\r%\xc8\xc8\xc8\xc8\xc8\xc8\xc8\r1 0 obj\n<</Type/Page/Parent 41 0 R/Resources<</Font<</F1 32460 0 R/F2 32461 0 R/F5 32464 0 R/F4 32463 0 R/F6 32465 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/MediaBox[0 0 595.2 841.6]/Contents 2 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 1>>\r\nendobj\n2 0 obj\n<</Filter/FlateDecode/Length 38817>>stream\r\n\x1e\x9a\x04z\xe6\xd7S\x9e2\xf7_V<oRy\x9f1\x84\x94x\xed\xf8g\xe1.x\nx\r`p\xc2S\x06\x8b{\xb0\xe2\xca\xedG\n\xdcJ\x82\x1e\xa6)\xd9*\xb3\xf9\x8c\x12P!&\xb0\xbc\xb7Z\x8d\xa7@\x11<\x9d\xbba\xee&\xa8v?\xd4\xc9\x83\xfc\xbb2H\x01\xcf\x08\xa2\xee\x90\x8a\x1b\xee\x0e\x19F\xa9\xd8\xbb_|;]\x8e\x8a\xfc\xd7\x11\xf7\xa2\xd2W\x84+\xa5\xe0\xc3\xcc=\x9b\xd4\xf0L\xdep\xf1\xbf40>\x13PY\x89(\xc5\xbd<\xeft\x93\xc5\xa8\xd0\xf6\x16'
(that’s just the first 500 characters)
My goal is to extract the panel data in the PDF.
Can someone please suggest where I can learn more about this and how to read it back into its original structure?
The URL itself points to an online PDF.
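This is a bytes object (note the b'...' prefix), not bytecode. It holds the raw contents of the file, and the %PDF-1.5 header at the start shows that the file is a PDF.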
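To confirm what read() returned, a quick check along these lines may help. It is only a sketch: `data` is a shortened stand-in for the real response body, and `describe_payload` is a hypothetical helper name, not part of any library.

```python
# Shortened stand-in for the value returned by urlopen(url).read().
data = b'%PDF-1.5\r%\xc8\xc8\xc8\xc8\xc8\xc8\xc8\r1 0 obj\n'


def describe_payload(data):
    """Say whether the payload is bytes and whether it starts with a PDF header."""
    if not isinstance(data, (bytes, bytearray)):
        return "not a bytes-like object"
    if data.startswith(b"%PDF-"):
        # The header is plain ASCII, e.g. b"%PDF-1.5"; the version follows the dash.
        version = data[5:8].decode("ascii", errors="replace")
        return f"PDF file, version {version}"
    return "bytes, but not a PDF header"


print(type(data).__name__)     # bytes
print(describe_payload(data))  # PDF file, version 1.5
```

Most of the remaining content is compressed (the /Filter/FlateDecode entry in the output), so decoding it as text will not give anything readable; a PDF parser is needed for that.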
You can read more about bytes objects here (JournalDev) and here (Python Documentation).
You can also find more information about how to deal with PDFs in Python here.
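As a rough illustration of the next step, the bytes can be written to a .pdf file or passed to a PDF library through an in-memory buffer. The sketch below assumes the third-party pypdf package is installed for the text-extraction part. The helper names (`download_bytes`, `save_pdf`, `extract_page_text`) are made up for this example, and the URL is whatever you passed to urlopen.

```python
import io
import tempfile
import urllib.request
from pathlib import Path


def download_bytes(url):
    """Fetch the raw response body, as in the question."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pdf(data, path):
    """Write the bytes unchanged to disk after a basic header check."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("payload does not start with a PDF header")
    Path(path).write_bytes(data)
    return Path(path)


def extract_page_text(data):
    """Return the text of each page. Assumption: pypdf is installed."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(io.BytesIO(data))
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration with stand-in bytes instead of a real download.
sample = b"%PDF-1.5\r%\xc8\xc8\xc8\xc8\xc8\xc8\xc8\r1 0 obj\n"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf(sample, Path(tmp) / "sample.pdf")
    round_trip_ok = saved.read_bytes() == sample
print(round_trip_ok)
```

Text extraction returns plain strings per page, so rebuilding the panel data as a table would still take further parsing, or a library aimed at tables.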