web-dev-qa-db-ja.com

1つのCSVファイルに複数の区切り文字

私はCSVを持っています。これには、「|」、「」、「;」の3つの異なる区切り文字があります。異なる列の間。

PythonこのCSVを解析するにはどうすればよいですか?

私のデータは以下のようなものです:

2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:30Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:26Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:21:59+0000|TRANSACTIONDelim_secondVIEW_PRIVACY_POLICYDelim_firstCONSUMERIDDelim_secondnullDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:21:59Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:59:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1586a2aa4bc18fDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:21Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:59:36+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:31Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|06:04:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:04:24Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**
5
Sarang

標準ライブラリに固執すると、re.split()は次の文字のいずれかで行を分割できます。

import re

with open(file_name) as fobj:
    for line in fobj:
        line_data = re.split('Delim_first|Delim_second|[|]', line)
        print(line_data)

これは区切り文字で分割されます|Delim_first、およびDelim_second

またはパンダと一緒に:

import pandas as pd
df = pd.read_csv('multi_delim.csv', sep='Delim_first|Delim_second|[|]', 
                  engine='python', header=None)

結果:

enter image description here

3
Mike Müller

目的を達成する簡単な方法の1つは、 pandas パッケージを使用することです。ここに小さな例を示します。

import pandas as pd
import StringIO

data = StringIO.StringIO("""a;b|c;
2016-09-05 10:47:00|1,foo;
2016-09-06 10:47:00;2;foo2;
2016-09-07 10:47:00;3;foo3;""")

df = pd.read_csv(data, sep='[;,|]', engine='python')
for c in ['a', 'b', 'c']:
    print('-' * 80)
    print(df[c])
1
BPL

私のサンプルデータは次のようなものでした:

2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:30Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:26Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:21:59+0000|TRANSACTIONDelim_secondVIEW_PRIVACY_POLICYDelim_firstCONSUMERIDDelim_secondnullDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:21:59Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:59:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1586a2aa4bc18fDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:21Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:59:36+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:31Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|06:04:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:04:24Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**

したがって、「|」が含まれていました区切り文字として、「Delim_first」および「Delim_second」。

3つの区切り文字すべてでデータを分離する必要がありました。

データからpandasデータフレームを作成し、使用しました;

i = 0 
while i < 8:
    df10[i+6]=(df10[2].str[:].str.split('First_delim').apply(pd.Series).astype(str))[i]
    i = i + 1

j = 0 
while j < 8:
    k = 0
    df10[2*j+14]=(df10[j+6+k].str[:].str.split('Second_delim').apply(pd.Series).astype(str))[0]
    df10[2*j+15]=(df10[j+6+k].str[:].str.split('Second_delim').apply(pd.Series).astype(str))[1]
    j = j + 1
    k = k + 1

j=0
for i in df10[1]:
    i = i[:-5]
    df10[1][j]=i
    j = j+1
0
Sarang