Log file to Pandas DataFrame

Question

I have a log file with many lines in the following form:

LogLevel [13/10/2015 00:30:00.650] [Message Text]

My goal is to convert each line of the log file into a nice DataFrame. I have tried to do that by splitting the lines on the '[' character, but I am still not getting a neat DataFrame.

My code:

    level = []
    time = []
    text = []

    with open(filename) as inf:
        for line in inf:
            parts = line.split('[')
            if len(parts) > 1:
                level = parts[0]
                time = parts[1]
                text = parts[2]
                print (parts[0],parts[1],parts[2])

    s1 = pd.Series({'Level':level, 'Time': time, 'Text':text})
    df = pd.DataFrame(s1).reset_index()

This is my printed DataFrame:

    Info 10/08/16 10:56:09.843] In Function CCatalinaPrinter::ItemDescription()]
    Info 10/08/16 10:56:09.843] Sending UPC Description Message ]

How can I improve this to strip the whitespace and the extra ']' characters?

Thank you

jezrael · Accepted Answer

You can use read_csv with the separator \s*\[ (optional whitespace followed by [):

    import pandas as pd
    # pandas.compat.StringIO was removed from recent pandas releases; use io.StringIO
    from io import StringIO

    temp = u"""LogLevel [13/10/2015 00:30:00.650] [Message Text]
    LogLevel [13/10/2015 00:30:00.650] [Message Text]
    LogLevel [13/10/2015 00:30:00.650] [Message Text]
    LogLevel [13/10/2015 00:30:00.650] [Message Text]"""

    # after testing, replace StringIO(temp) with filename
    # a regex separator needs the python engine; the raw string keeps the backslashes intact
    df = pd.read_csv(StringIO(temp),
                     sep=r"\s*\[",
                     names=['Level','Time','Text'],
                     engine='python')

Then remove the ] with strip and convert the Time column with to_datetime:

    df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
    df.Text = df.Text.str.strip(']')

    print (df)
          Level                    Time          Text
    0  LogLevel 2015-10-13 00:30:00.650  Message Text
    1  LogLevel 2015-10-13 00:30:00.650  Message Text
    2  LogLevel 2015-10-13 00:30:00.650  Message Text
    3  LogLevel 2015-10-13 00:30:00.650  Message Text

    print (df.dtypes)
    Level            object
    Time     datetime64[ns]
    Text             object
    dtype: object
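As a variation on the same idea (not part of the original answer), the line can also be matched with one regular expression that captures all three fields at once, which tolerates '[' or ']' inside the message text. This is only a sketch: the names LINE_RE and parse_log_lines and the sample lines are made up for illustration, and it assumes every record fits on one line in the `Level [Time] [Text]` layout from the question.

```python
import re

import pandas as pd

# Level, then "[Time]", then "[Text]"; Text is greedy so inner brackets survive.
LINE_RE = re.compile(
    r"^(?P<Level>\S+)\s*\[(?P<Time>[^\]]*)\]\s*\[(?P<Text>.*)\]\s*$"
)


def parse_log_lines(lines):
    """Return (DataFrame, list of lines that did not match the layout)."""
    rows, skipped = [], []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            rows.append(match.groupdict())
        else:
            skipped.append(line)
    df = pd.DataFrame(rows, columns=["Level", "Time", "Text"])
    df["Time"] = pd.to_datetime(df["Time"], format="%d/%m/%Y %H:%M:%S.%f")
    df["Text"] = df["Text"].str.strip()  # drop stray spaces next to the brackets
    return df, skipped


# Hypothetical sample lines in the question's layout.
sample = [
    "Info [13/10/2015 00:30:00.650] [In Function f()]\n",
    "Info [13/10/2015 00:30:01.125] [Array [3] out of range ]\n",
    "not a log line\n",
]
df, skipped = parse_log_lines(sample)
```

Collecting the unmatched lines separately makes it easy to check whether the pattern really covers the whole file before trusting the resulting frame.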

jxramos · Answer

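I had to parse mine manually, because my separator showed up in the message body, and the message body could also span multiple lines (for example, when an exception was thrown from my Flask application and the stack trace was logged).

Here is my log creation format...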

    import datetime
    import logging

    import Utilities  # the module shown below; provides logFolder

    logging.basicConfig(
        filename="%s/%s_MyApp.log" % (Utilities.logFolder,
                                      datetime.datetime.today().strftime("%Y%m%d-%H%M%S")),
        level=logging.DEBUG,
        format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"
    )
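To make the shape of the resulting lines concrete, here is a small sketch that applies the same format string to an in-memory stream instead of a file. The logger name "demo" and the messages are invented for illustration; only the format and datefmt values come from the configuration above.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    fmt="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
))

logger = logging.getLogger("demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# A message that itself contains the separator.
logger.warning("cache miss, falling back to disk")

# A message followed by a multi-line stack trace.
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("request failed")

lines = stream.getvalue().splitlines()
first_fields = lines[0].split(",")
```

The first line splits into more than five comma-separated fields because of the comma in the message, and the traceback lines carry no timestamp prefix at all, which is why a plain read_csv call is not enough for this format.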

And the parsing code in my Utilities module:

Utilities.py

    import re
    import pandas

    logFolder = "./Logs"
    logLevelToString = { "50" : "CRITICAL",
                         "40" : "ERROR"   ,
                         "30" : "WARNING" ,
                         "20" : "INFO"    ,
                         "10" : "DEBUG"   ,
                         "0"  : "NOTSET"  } # https://docs.python.org/3.6/library/logging.html#logging-levels

    def logFile2DataFrame( filePath ) :
        dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
        tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"

        with open( filePath , 'r' ) as logFile :
            numRows = -1
            for line in logFile :
                if re.search( tsPattern , line ) :
                    tokens    = line.split(",")
                    timestamp = tokens[0]
                    module    = tokens[1]
                    processID = tokens[2]
                    level     = logLevelToString[ tokens[3] ]
                    message   = ",".join( tokens[4:] )
                    numRows  += 1
                    dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
                else :
                    # Multiline message, integrate it into last record
                    dfLog.loc[ numRows , 'Message' ] += line

        return dfLog
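For larger files, a possible variation (a sketch, not the code from this answer) is to collect the records in a plain Python list and build the DataFrame once at the end, since growing a DataFrame row by row with .loc gets slow. The names log_lines_to_dataframe, TS_PATTERN and COLUMNS and the sample lines are hypothetical. Unlike the code above, this version leaves the level as the raw numeric string and strips trailing newlines.

```python
import re

import pandas as pd

TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},")
COLUMNS = ["Timestamp", "Module", "ProcessID", "Level", "Message"]


def log_lines_to_dataframe(lines):
    """Parse 'timestamp,name,pid,levelno,message' lines, folding
    continuation lines (e.g. stack traces) into the previous record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # maxsplit=4 keeps any commas inside the message intact
        parts = line.split(",", 4) if TS_PATTERN.match(line) else None
        if parts is not None and len(parts) == 5:
            records.append(parts)
        elif records:
            records[-1][4] += "\n" + line
        # lines that appear before the first timestamped record are ignored
    return pd.DataFrame(records, columns=COLUMNS)


# Hypothetical sample in the format configured above.
sample = [
    "2018-01-05 10:00:00,app,4242,20,started, listening on port 5000\n",
    "2018-01-05 10:00:01,app,4242,40,request failed\n",
    "Traceback (most recent call last):\n",
    "ZeroDivisionError: division by zero\n",
]
df = log_lines_to_dataframe(sample)
```

To parse a real file, the open file object can be passed directly, for example `with open(path) as f: df = log_lines_to_dataframe(f)`.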

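I actually wrote this helper so that I could view my logs directly from my Flask app, since I already have a handy template that renders DataFrames. It should speed up debugging considerably, because wrapping the Flask app in a Tornado WSGI server prevents the normal Flask debug page from being displayed when an exception is thrown. If anyone knows how to restore that functionality in such a setup, please share.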