Python 2.7 Beautiful Soup Img Src Extract

Question

for imgsrc in Soup.findAll('img', {'class': 'sizedProdImage'}):
    if imgsrc:
        imgsrc = imgsrc
    else:
        imgsrc = "ERROR"
    patImgSrc = re.compile('src="(.*)".*/>')
    findPatImgSrc = re.findall(patImgSrc, imgsrc)
    print findPatImgSrc

'''
<img height="72" name="proimg" id="image" class="sizedProdImage" src="http://imagelocation" />

This is what I am trying to extract from, and I am getting:

    findimgsrcPat = re.findall(imgsrcPat, imgsrc)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

'''

soulcheck · Accepted Answer

You're passing a BeautifulSoup node to re.findall. You have to convert it to a string first. Try:

findPatImgSrc = re.findall(patImgSrc, str(imgsrc))
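To see why the str() call matters, here is a minimal sketch that needs only the standard library. FakeTag is a hypothetical stand-in for the BeautifulSoup node (it is not part of any library); it only mimics the one relevant property, namely that the object is not a string but str() returns its markup. It is written with print() so it also runs on Python 3.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup node: not a str, but str() yields its markup."""

    def __str__(self):
        return ('<img height="72" name="proimg" id="image" '
                'class="sizedProdImage" src="http://imagelocation" />')


patImgSrc = re.compile('src="(.*)".*/>')
imgsrc = FakeTag()

# Passing the object itself fails: re.findall expects a string to search in.
try:
    re.findall(patImgSrc, imgsrc)
    raised = False
except TypeError:
    raised = True

# Converting to a string first lets the pattern match.
findPatImgSrc = re.findall(patImgSrc, str(imgsrc))
print(raised)
print(findPatImgSrc)
```

The failing call is the same shape as the one in the question's traceback; only the second argument changes in the working call.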

Better yet, use the tools that BeautifulSoup provides:

[x['src'] for x in soup.findAll('img', {'class': 'sizedProdImage'})]

This gives you a list of the src attributes of all img tags with the class 'sizedProdImage'.
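For readers on a current install, the same idea might look like the sketch below. It assumes Beautiful Soup 4 (the bs4 package) rather than the BeautifulSoup 3 import used elsewhere in this thread, and the sample markup beyond the question's img tag is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module

# The first tag is from the question; the other two are invented test cases.
html = '''
<img height="72" name="proimg" id="image" class="sizedProdImage" src="http://imagelocation" />
<img class="sizedProdImage" />
<img class="other" src="http://elsewhere" />
'''

soup = BeautifulSoup(html, "html.parser")

# find_all is the bs4 spelling of findAll; class_ filters on the CSS class.
tags = soup.find_all("img", class_="sizedProdImage")

# tag.get("src") returns None when the attribute is missing,
# whereas tag["src"] would raise KeyError for the second tag above.
srcs = [tag.get("src") for tag in tags if tag.get("src")]
print(srcs)
```

Using .get() instead of indexing is a design choice for pages where some matching tags may lack a src attribute; if every tag is known to have one, x['src'] as in the answer is enough.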

StanleyD · Answer

There is a simpler solution:

 soup.find('img')['src']

Abu Shoeb · Answer

In my example, htmlText contains the img tags, but the same approach can be used for a URL too. See my answer here.

from BeautifulSoup import BeautifulSoup as BSHTML

htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    # print the src attribute of each img tag
    print image['src']

Kirk Strauser · Answer

You're creating an re object and then passing it to re.findall, which expects a string as its first argument:

patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = re.findall(patImgSrc, imgsrc)

Instead, use the .findall method of the patImgSrc object you created:

patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = patImgSrc.findall(imgsrc)
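A small sketch of the compiled-pattern call on plain strings, using only the standard library. The two-tag markup and the non-greedy variant (lazy) are assumptions added for illustration and do not come from the thread; they show what the thread's greedy (.*) does when more than one tag sits on the same line.

```python
import re

# Two tags on one line, so greedy and non-greedy matching give different results.
markup = '<img src="http://a.example/1.png" /><img src="http://a.example/2.png" />'

greedy = re.compile(r'src="(.*)".*/>')  # pattern from the thread
lazy = re.compile(r'src="(.*?)"')       # hypothetical non-greedy variant

# Call .findall on the compiled pattern objects, passing a plain string.
greedy_hits = greedy.findall(markup)
lazy_hits = lazy.findall(markup)

print(len(greedy_hits))
print(lazy_hits)
```

The greedy pattern returns a single match that runs from the first URL through the second, while the lazy variant returns each URL separately.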