pythonで1つのcsvを複数のファイルに分割する

Question

pythonに約5000行のcsvファイルがあります。これを5つのファイルに分割します。

私はそれのためにコードを書きましたが、それは動作していません

import codecs import csv NO_OF_LINES_PER_FILE = 1000 def again(count_file_header,count): f3 = open('write_'+count_file_header+'.csv', 'at') with open('import_1458922827.csv', 'rb') as csvfile: candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL) co = 0 for row in candidate_info_reader: co = co + 1 count = count + 1 if count <= count: pass Elif count >= NO_OF_LINES_PER_FILE: count_file_header = count + NO_OF_LINES_PER_FILE again(count_file_header,count) else: writer = csv.writer(f3,delimiter = ',', lineterminator='
',quoting=csv.QUOTE_ALL) writer.writerow(row) def read_write(): f3 = open('write_'+NO_OF_LINES_PER_FILE+'.csv', 'at') with open('import_1458922827.csv', 'rb') as csvfile: candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL) count = 0 for row in candidate_info_reader: count = count + 1 if count >= NO_OF_LINES_PER_FILE: count_file_header = count + NO_OF_LINES_PER_FILE again(count_file_header,count) else: writer = csv.writer(f3,delimiter = ',', lineterminator='
',quoting=csv.QUOTE_ALL) writer.writerow(row) read_write()

上記のコードは、空のコンテンツを持つ多くのファイルを作成します。

1つのファイルを5つのcsvファイルに分割する方法は？

Rudziankoŭ · Accepted Answer

車輪を発明しないことをお勧めします。既存のソリューションがあります。ソースここ

import os def split(filehandler, delimiter=',', row_limit=1000, output_name_template='output_%s.csv', output_path='.', keep_headers=True): import csv reader = csv.reader(filehandler, delimiter=delimiter) current_piece = 1 current_out_path = os.path.join( output_path, output_name_template % current_piece ) current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter) current_limit = row_limit if keep_headers: headers = reader.next() current_out_writer.writerow(headers) for i, row in enumerate(reader): if i + 1 > current_limit: current_piece += 1 current_limit = row_limit * current_piece current_out_path = os.path.join( output_path, output_name_template % current_piece ) current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter) if keep_headers: current_out_writer.writerow(headers) current_out_writer.writerow(row)

次のように使用します。

split(open('/your/pat/input.csv', 'r'));

Aziz Alto · Answer

Pythonで

これを行うには、readlines()とwritelines()を使用します。以下に例を示します。

>>> csvfile = open('import_1458922827.csv', 'r').readlines() >>> filename = 1 >>> for i in range(len(csvfile)): ... if i % 1000 == 0: ... open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+1000]) ... filename += 1

出力ファイル名には番号が付けられます1.csv、2.csv、...など.

端末から

参考までに、次のようにsplitを使用してコマンドラインからこれを行うことができます。

$ split -l 1000 import_1458922827.csv

Ryan Tuck · Answer

Python3に優しいソリューション：

def split_csv(source_filepath, dest_folder, split_file_prefix, records_per_file): """ Split a source csv into multiple csvs of equal numbers of records, except the last file. Includes the initial header row in each split file. Split files follow a zero-index sequential naming convention like so: `{split_file_prefix}_0.csv` """ if records_per_file <= 0: raise Exception('records_per_file must be > 0') with open(source_filepath, 'r') as source: reader = csv.reader(source) headers = next(reader) file_idx = 0 records_exist = True while records_exist: i = 0 target_filename = f'{split_file_prefix}_{file_idx}.csv' target_filepath = os.path.join(dest_folder, target_filename) with open(target_filepath, 'w') as target: writer = csv.writer(target) while i < records_per_file: if i == 0: writer.writerow(headers) try: writer.writerow(next(reader)) i += 1 except: records_exist = False break if i == 0: # we only wrote the header, so delete that file os.remove(target_filepath) file_idx += 1

Whitefret · Answer

if count <= count: pass

この条件は常に真であるため、毎回合格します

それ以外の場合は、この投稿を見ることができます： CSVファイルを等しい部分に分割しますか？

Aetos · Answer

パンダが提供する可能性を活用することをお勧めします。これを行うために使用できる関数は次のとおりです。

def csv_count_rows(file): """ Counts the number of rows in a file. :param file: path to the file. :return: number of lines in the designated file. """ with open(file) as f: nb_lines = sum(1 for line in f) return nb_lines def split_csv(file, sep=",", output_path=".", nrows=None, chunksize=None, low_memory=True, usecols=None): """ Split a csv into several files. :param file: path to the original csv. :param sep: View pandas.read_csv doc. :param output_path: path in which to output the resulting parts of the splitting. :param nrows: Number of rows to split the original csv by, also view pandas.read_csv doc. :param chunksize: View pandas.read_csv doc. :param low_memory: View pandas.read_csv doc. :param usecols: View pandas.read_csv doc. """ nb_of_rows = csv_count_rows(file) # Parsing file elements : Path, name, extension, etc... # file_path = "/".join(file.split("/")[0:-1]) file_name = file.split("/")[-1] # file_ext = file_name.split(".")[-1] file_name_trunk = file_name.split(".")[0] split_files_name_trunk = file_name_trunk + "_part_" # Number of chunks to partition the original file into nb_of_chunks = math.ceil(nb_of_rows / nrows) if nrows: log_debug_process_start = f"The file '{file_name}' contains {nb_of_rows} ROWS. " \ f"
It will be split into {nb_of_chunks} chunks of a max number of rows : {nrows}." \ f"
The resulting files will be output in '{output_path}' as '{split_files_name_trunk}0 to {nb_of_chunks - 1}'" logging.debug(log_debug_process_start) for i in range(nb_of_chunks): # Number of rows to skip is determined by (the number of the chunk being processed) multiplied by (the nrows parameter). rows_to_skip = range(1, i * nrows) if i else None output_file = f"{output_path}/{split_files_name_trunk}{i}.csv" log_debug_chunk_processing = f"Processing chunk {i} of the file '{file_name}'" logging.debug(log_debug_chunk_processing) # Fetching the original csv file and handling it with skiprows and nrows to process its data df_chunk = pd.read_csv(filepath_or_buffer=file, sep=sep, nrows=nrows, skiprows=rows_to_skip, chunksize=chunksize, low_memory=low_memory, usecols=usecols) df_chunk.to_csv(path_or_buf=output_file, sep=sep) log_info_file_output = f"Chunk {i} of file '{file_name}' created in '{output_file}'" logging.info(log_info_file_output)

そして、メインまたはjupyterノートブックに以下を置きます。

# This is how you initiate logging in the most basic way. logging.basicConfig(level=logging.DEBUG) file = {#Path to your file} split_csv(file,sep=";" ,output_path={#Path where you'd like to output it},nrows = 4000000, low_memory = False)

P.S.1：nrows = 4000000を入れたのは、それが個人的な好みだからです。必要に応じてその番号を変更できます。

P.S.2：メッセージを表示するためにロギングライブラリを使用しました。リモートサーバー上に存在する大きなファイルにこのような関数を適用する場合、「単純な印刷」を避け、ロギング機能を組み込む必要があります。 logging.infoまたはlogging.debugをprintに置き換えることができます

P.S.3：もちろん、コードの{# Blablabla}部分を独自のパラメーターに置き換える必要があります。

Ramesh K · Answer

@ Ryan、Python3コードは私のために働いた。空白行の問題を回避するために、次のようにnewline=''を使用しました。

with open(target_filepath, 'w', newline='') as target: