Scraping absolute URLs instead of relative paths in Python

Question

I'm trying to get all the hrefs from some HTML and store them in a list for later processing, like this:

Example URL: www.example-page-xl.com

    <body>
      <section>
        <a href="/helloworld/index.php"> Hello World </a>
      </section>
    </body>

I'm using the following code to list the hrefs:

    import bs4 as bs
    import urllib.request

    # Fetch the page and parse it
    sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
    soup = bs.BeautifulSoup(sauce, 'lxml')

    # Print the href of every link in the first <section>
    section = soup.section
    for url in section.find_all('a'):
        print(url.get('href'))

However, I'd like to store each URL as www.example-page-xl.com/helloworld/index.php, not just the relative path /helloworld/index.php.

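Appending or joining the relative path to the URL isn't what I'm after, because dynamic links may come out differently when I join the URL and the relative path.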

In short, I want to scrape the absolute URL (without joining), not just the relative path.

Somil · Accepted Answer

In this case urljoin will help (urllib.parse.urljoin in Python 3, urlparse.urljoin in Python 2). Change your code like this:

    import bs4 as bs
    import urllib.request
    from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

    web_url = 'https://www.example-page-xl.com'
    sauce = urllib.request.urlopen(web_url).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')

    section = soup.section
    for url in section.find_all('a'):
        # Resolve each href against the page URL
        print(urljoin(web_url, url.get('href')))

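Here urljoin handles both cases: a relative href is resolved against web_url, and an href that is already absolute is returned unchanged.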
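Since the question asks for the links to be kept in a list rather than printed, the same call can be wrapped in a small helper. This is only a minimal sketch: `to_absolute` is a hypothetical name, not part of any library, and it assumes the hrefs have already been collected (for example with `url.get('href')`, which returns `None` for an `<a>` tag that has no href attribute).

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Return a list of absolute URLs, resolving each href against base_url.

    Hypothetical helper for illustration. Missing or empty hrefs are skipped.
    """
    return [urljoin(base_url, href) for href in hrefs if href]


# Example values: the first href is from the question, the others are made up.
base = 'https://www.example-page-xl.com'
links = to_absolute(base, ['/helloworld/index.php', None, 'https://other.example.org/a'])
print(links)
```

In the scraping loop above, the list passed in would be something like `[url.get('href') for url in section.find_all('a')]`.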

Andrei Cioara · Answer

urllib.parse.urljoin() might help. It does a join, but it is smart about it and handles both relative and absolute paths. Note that this is Python 3 code.

    >>> import urllib.parse
    >>> base = 'https://www.example-page-xl.com'
    >>> urllib.parse.urljoin(base, '/helloworld/index.php')
    'https://www.example-page-xl.com/helloworld/index.php'
    >>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
    'https://www.example-page-xl.com/helloworld/index.php'
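The base does not have to be the site root. As a rough illustration of the "smart" part, the sketch below resolves a few common href forms against the URL of the page they were found on. Only /helloworld/index.php comes from the question; the other paths and the cdn.example.org host are assumptions made up for the example.

```python
from urllib.parse import urljoin

# Assumed URL of the page being scraped.
page_url = 'https://www.example-page-xl.com/helloworld/index.php'

# Hypothetical hrefs covering forms that commonly appear in HTML.
hrefs = ['about.php', '../contact.php', '/top.php', '//cdn.example.org/app.js']

# Map each raw href to its absolute form.
resolved = {href: urljoin(page_url, href) for href in hrefs}
for href, absolute in resolved.items():
    print(href, '->', absolute)
```

A document-relative href such as `about.php` replaces the last path segment of the base, `..` climbs one directory, a root-relative href replaces the whole path, and a scheme-relative href (`//host/...`) inherits only the `https` scheme. This is why the page's own URL, rather than a hand-built prefix, is the safer base to pass in.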