How to remove any URL within a string in Python

Question

I want to remove all URLs inside a string (replace them with ""). I searched around but couldn't find exactly what I needed.

Example:

text1 text2 http://url.com/bla1/blah1/ text3 text4 http://url.com/bla2/blah2/ text5 text6 http://url.com/bla3/blah3/

I want the result to be:

text1 text2 text3 text4 text5 text6

Ωmega · Accepted Answer

Python script:

import re
# Remove every line that starts with http:// or https://, together with its line break
text = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)

Output:

text1 text2 text3 text4 text5 text6

Test this code here.
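
Because of the ^ anchor combined with re.MULTILINE, this pattern only removes URLs that start a line. The sketch below illustrates that behaviour, assuming each item of the input sits on its own line; the names sample, cleaned, inline and unchanged are illustrative and not part of the answer.

```python
import re

# Illustrative input: one item per line, which is what the anchored pattern expects.
sample = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\nhttp://url.com/bla2/blah2/\ntext4\n"

# Each line starting with http:// or https:// is removed together with its line break.
cleaned = re.sub(r'^https?://.*[\r\n]*', '', sample, flags=re.MULTILINE)
print(cleaned.split())

# A URL in the middle of a line is left untouched, because ^ only matches at a line start.
inline = "see http://url.com/bla1/blah1/ for details"
unchanged = re.sub(r'^https?://.*[\r\n]*', '', inline, flags=re.MULTILINE)
print(unchanged == inline)
```

If the URLs can appear anywhere in a line, an unanchored pattern such as the one in the next answer is likely a better fit.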

tolgayilmaz · Answer

The shortest way:

re.sub(r'http\S+', '', stringliteral)
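
As a rough sketch of how this one-liner could be applied to the sample from the question, the helper below (strip_urls is a hypothetical name) removes the URLs and then collapses the leftover whitespace, which the one-liner alone does not do.

```python
import re

def strip_urls(text):
    """Remove every run of non-whitespace characters that starts with 'http'."""
    without_urls = re.sub(r'http\S+', '', text)
    # Collapse the whitespace left behind so words are separated by single spaces.
    return ' '.join(without_urls.split())

example = ("text1 text2 http://url.com/bla1/blah1/ text3 text4 "
           "http://url.com/bla2/blah2/ text5 text6 http://url.com/bla3/blah3/")
print(strip_urls(example))
```

Since the pattern is only http\S+, any token that merely begins with "http" (for example the word httpd) is removed as well.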

Muhammad Taha · Answer

This worked for me:

import re

thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
URLless_string = re.sub(r'\w+:/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:/[^\s/]*))*', '', thestring)
print(URLless_string)

Result:

text1 text2 text3 text4 text5 text6

Abhranil Das · Answer

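The simplest approach is to use regular expressions, which are available through Python's re module.

As for which regular expression best detects valid URLs, check these SO questions:

What is the best regular expression to check if a string is a valid URL?
What's the cleanest way to extract URLs from a string using Python?
How to match URIs in text?

These have many highly voted answers, so they should give you some direction.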

Pranzell · Answer

Removing HTTP links/URLs mixed into arbitrary text:

import re
text = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)

Lee Martin · Answer

This solution handles http, https, and the other special characters commonly found in URLs.

import re

def remove_urls(vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT

print(remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))

Jon Clements · Answer

You could also look at it from the other direction...

from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]
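
To apply the same idea to a whole string rather than a list, one option is to split on whitespace and drop the tokens that parse as URLs. This is only a sketch: drop_url_tokens is a hypothetical helper, and requiring a network location in addition to a scheme is an extra assumption that goes beyond the answer.

```python
from urllib.parse import urlparse

def drop_url_tokens(text):
    """Keep only whitespace-separated tokens that do not parse as scheme://host."""
    kept = []
    for token in text.split():
        try:
            parsed = urlparse(token)
        except ValueError:
            # urlparse can reject malformed input (e.g. unbalanced brackets); keep such tokens.
            kept.append(token)
            continue
        # Require both a scheme and a network location so that a token such as
        # "note:" is not treated as a URL just because it contains a colon.
        if parsed.scheme and parsed.netloc:
            continue
        kept.append(token)
    return ' '.join(kept)

print(drop_url_tokens("text1 FTP://somewhere.com text2 http://blah.com:8080/foo/bar#header"))
```

Unlike a regex restricted to http and https, this approach also removes URLs with other schemes such as ftp, at the cost of normalising all whitespace to single spaces.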

Gabriel Giraldo-Wingler · Answer

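I couldn't find anything that handled my particular situation, which was removing URLs in the middle of tweets that also contain whitespace in the middle of the URLs, so I wrote my own: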

(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*

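Here is an explanation:
(https?:\/\/) matches http:// or https://
(\s)* optionally matches whitespace
(www\.)? optionally matches www.
(\s)* optionally matches whitespace
((\w|\s)+\.)* matches 0 or more of one or more word characters (or whitespace) followed by a period
([\w\-\s]+\/)* matches 0 or more of one or more word characters (or a dash or a space) followed by '/'
([\w\-]+) matches any remaining path at the end of the URL, followed by an optional ending
((\?)?[\w\s]*=\s*[\w\%&]*)* matches trailing query parameters (even with whitespace, etc.)

Test it out here: https://regex101.com/r/NmVGOo/8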
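
For use from Python, the pattern can be compiled with re.VERBOSE so that each part keeps its explanation next to it. This is a sketch only; SPACED_URL and remove_spaced_urls are hypothetical names, and the sample strings are made up for illustration.

```python
import re

# The pattern from the answer, split across lines so each part can carry a comment.
SPACED_URL = re.compile(r"""
    (https?:\/\/)                   # http:// or https://
    (\s)*                           # optional whitespace
    (www\.)?                        # optional www.
    (\s)*                           # optional whitespace
    ((\w|\s)+\.)*                   # domain labels, each followed by a period
    ([\w\-\s]+\/)*                  # path segments, each followed by a slash
    ([\w\-]+)                       # final path segment
    ((\?)?[\w\s]*=\s*[\w\%&]*)*     # query parameters
""", re.VERBOSE)

def remove_spaced_urls(tweet):
    """Delete every match of SPACED_URL from the given text."""
    return SPACED_URL.sub('', tweet)

print(repr(remove_spaced_urls("see https://www.example.com/a/b?x=1 ok")))
print(repr(remove_spaced_urls("go https:// www.example.com/a now")))
```

Because whitespace is allowed inside the match, the pattern can also swallow ordinary words that follow a URL when a later equals sign or slash appears, for example the " and y=1" in "https://a.com/x and y=1". It is probably best reserved for input where broken-up URLs are expected.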

Shailesh Wadhwa · Answer

The following Python regex works well for detecting URL(s) in text:

import re

source_text = ''' text1 text2 http://url.com/bla1/blah1/ text3 text4 http://url.com/bla2/blah2/ text5 text6 '''
url_reg = r'[a-z]*[:.]+\S+'
result = re.sub(url_reg, '', source_text)
print(result)

Output:

text1 text2 text3 text4 text5 text6
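
One caveat worth checking before adopting this pattern: it does not require a scheme, so it matches any colon or period that is followed by non-space characters. The sketch below uses made-up sample strings (with_url and plain are illustrative names) to show both the intended match and a false positive.

```python
import re

url_reg = r'[a-z]*[:.]+\S+'

# The URL is removed as intended (the surrounding spaces are kept).
with_url = re.sub(url_reg, '', 'text1 http://url.com/bla1/blah1/ text2')
print(repr(with_url))

# Ordinary text is affected too: ".14" and ":30" both match the pattern.
plain = re.sub(url_reg, '', 'version 3.14 at 12:30')
print(repr(plain))
```

For text that contains decimals, times, or abbreviations, a pattern anchored on the scheme (such as https?://\S+) is likely to be safer.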

Samuel Nde · Answer

What you really want to do is remove any string that starts with http:// or https://, followed by any combination of non-whitespace characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz.

# Define the text from which you want to replace the URL with "".
text = '''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''

import re
# Either use:
re.sub(r'http://\S+|https://\S+', '', text)
# OR
re.sub(r'http[s]?://\S+', '', text)

And the result of running either line of code above is

>>> 'The link to this post is '

I prefer the second one because it is more readable.

Nischit Pradhan · Answer

I know this has already been answered and that I'm stupidly late, but I think this should be here. This is a regex that matches any kind of URL:

[^ ]+\.[^ ]+

You can use it as follows:

re.sub(r'[^ ]+\.[^ ]+', '', sentence)
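
Note that [^ ] excludes only the space character, so tabs and newlines count as part of a match. The sketch below shows the effect on multi-line input and compares it with a \S-based variant; that variant is an assumption introduced here for illustration and is not part of the answer.

```python
import re

multiline = "text1\nhttp://url.com/bla\ntext2"

# The input contains no space characters, so the whole string is one match.
space_only = re.sub(r'[^ ]+\.[^ ]+', '', multiline)
print(repr(space_only))

# Variant (not from the answer): \S stops at any whitespace, including newlines.
any_whitespace = re.sub(r'\S+\.\S+', '', multiline)
print(repr(any_whitespace))
```

Both forms still match anything containing a period between non-space characters, such as 3.14 or a file name, so they suit text where a dotted token can safely be assumed to be a URL.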

Rsh · Answer

First, you need to find the pattern the URLs follow in your text file. Once you have found it, you can use a regular expression to remove them.