Using Airflow to stream files to Kafka

Question

What is the best approach for streaming CSV files to a Kafka topic using Airflow?

Should I write a custom operator for Airflow?

Mike · Answer

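It is probably best to use the PythonOperator to process the files line by line. I have a use case where I poll an SFTP server for files and, when I find some, process them line by line and write the results out as JSON. I do things like parsing dates into YYYY-MM-DD format. Something like this might work for you: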

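    import csv

    # Airflow 1.x import path; in Airflow 2 it is airflow.operators.python
    from airflow.operators.python_operator import PythonOperator


    def csv_file_to_kafka(**context):
        f = '/path/to/downloaded/csv_file.csv'
        # Open the file in a with-block so it is closed when the task ends.
        with open(f, 'r') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                pass  # Send the row to Kafka


    csv_file_to_kafka = PythonOperator(
        task_id='csv_file_to_kafka',
        python_callable=csv_file_to_kafka,
        dag=dag  # `dag` is assumed to be defined elsewhere in the DAG file
    )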
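
A minimal sketch of the row-by-row step, under some assumptions: the date column is called `date` and arrives as `MM/DD/YYYY` (both hypothetical), and the function that actually sends to Kafka is passed in as `send`, so any client can be plugged in. The names `clean_row` and `stream_csv` are made up for illustration.

```python
import csv
import json
from datetime import datetime


def clean_row(row, date_field="date", input_format="%m/%d/%Y"):
    """Return a copy of the row with date_field rewritten as YYYY-MM-DD."""
    cleaned = dict(row)
    parsed = datetime.strptime(cleaned[date_field], input_format)
    cleaned[date_field] = parsed.strftime("%Y-%m-%d")
    return cleaned


def stream_csv(path, send, date_field="date", input_format="%m/%d/%Y"):
    """Read a CSV file row by row and pass each row to send() as JSON bytes.

    `send` is any callable that accepts bytes (assumption: it wraps a
    Kafka producer). Returns the number of rows sent.
    """
    count = 0
    with open(path, "r", newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            payload = json.dumps(clean_row(row, date_field, input_format))
            send(payload.encode("utf-8"))
            count += 1
    return count
```

With the kafka-python package, for example, `send` could be `lambda payload: producer.send('my_topic', payload)` for a `KafkaProducer` instance, followed by `producer.flush()` before the task returns; `my_topic` is a placeholder.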

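How you download the file is really up to you. In my case, I use the SSHHook and GoogleCloudStorageHook to get files from an SFTP server, and then pass the names of the files to a task that parses and cleans the CSV files. I do this by pulling the files down from SFTP and putting them into Google Cloud Storage.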

""" HOOKS: Connections to external systems """ def sftp_connection(): """ Returns an SFTP connection created using the SSHHook """ ssh_hook = SSHHook(ssh_conn_id='sftp_connection') ssh_client = ssh_hook.get_conn() return ssh_client.open_sftp() def gcs_connection(): """ Returns an GCP connection created using the GoogleCloudStorageHook """ return GoogleCloudStorageHook(google_cloud_storage_conn_id='my_gcs_connection') """ PYTHON CALLABLES: Called by PythonOperators """ def get_files(**context): """ Looks at all files on the FTP server and returns a list files. """ sftp_client = sftp_connection() all_files = sftp_client.listdir('/path/to/files/') files = [] for f in all_files: files.append(f) return files def save_files(**context): """ Looks to see if a file already exists in GCS. If not, the file is downloaed from SFTP server and uploaded to GCS. A list of """ files = context['task_instance'].xcom_pull(task_ids='get_files') sftp_client = sftp_connection() gcs = gcs_connection() new_files = [] new_outcomes_files = [] new_si_files = [] new_files = process_sftp_files(files, gcs, sftp_client) return new_files def csv_file_to_kafka(**context): """ Untested sample parse csv files and send to kafka """ files = context['task_instance'].xcom_pull(task_ids='save_files') for f in new_files: csvfile = open(f, 'r') reader = csv.DictReader(csvfile) for row in reader: """ Send the row to Kafka """ return get_files = PythonOperator( task_id='get_files', python_callable=get_files, dag=dag ) save_files = PythonOperator( task_id='save_files', python_callable=save_files, dag=dag ) csv_file_to_kafka = PythonOperator( task_id='csv_file_to_kafka', python_callable=csv_file_to_kafka, dag=dag )

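The `process_sftp_files` helper is not shown above. A rough sketch of what such a helper could look like, based only on the docstring of `save_files`, follows. It is a guess, not the actual implementation: the `bucket` and `remote_dir` parameters and their default values are placeholders, and it only assumes that `gcs` offers `exists(bucket, name)` and `upload(bucket, name, filename)` and that `sftp_client` offers `get(remotepath, localpath)`.

```python
import os
import posixpath
import tempfile


def process_sftp_files(files, gcs, sftp_client,
                       bucket="my-bucket", remote_dir="/path/to/files/"):
    """Copy files that are not yet in GCS from the SFTP server into GCS.

    `bucket` and `remote_dir` are placeholder values for illustration.
    Returns the names of the files that were newly uploaded.
    """
    new_files = []
    for name in files:
        if gcs.exists(bucket, name):
            continue  # already copied on an earlier run
        local_path = os.path.join(tempfile.gettempdir(), name)
        # Download from SFTP to a temporary local file, then upload to GCS.
        sftp_client.get(posixpath.join(remote_dir, name), local_path)
        gcs.upload(bucket, name, local_path)
        os.remove(local_path)  # the local copy is no longer needed
        new_files.append(name)
    return new_files
```

Note that the returned values are object names in the bucket, so a later task would have to fetch those objects (or keep a local copy) before opening them with `open()`.
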
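I know I could do all of this in one big Python callable, and that is how I'm refactoring the code now: a single Python function polls the SFTP server, pulls the latest files, and parses them according to my rules. I have heard that using XCom isn't ideal; supposedly, Airflow tasks aren't meant to communicate with each other much.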
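
As an illustration of the single-callable shape, here is a small sketch in which listing, filtering, fetching, and row handling happen in one function, so nothing is passed between tasks via XCom. All four arguments are hypothetical callables introduced for this example; they are not part of Airflow.

```python
import csv


def poll_and_process(list_files, is_new, fetch, handle_row):
    """List remote files, fetch the new ones, and handle each CSV row.

    All four arguments are hypothetical callables:
      list_files()    -> iterable of file names on the server
      is_new(name)    -> True if the file has not been processed yet
      fetch(name)     -> local path of the downloaded file
      handle_row(row) -> e.g. clean the row and send it to Kafka
    Returns the names of the files that were processed.
    """
    processed = []
    for name in list_files():
        if not is_new(name):
            continue
        local_path = fetch(name)
        with open(local_path, "r", newline="") as csvfile:
            for row in csv.DictReader(csvfile):
                handle_row(row)
        processed.append(name)
    return processed
```

A thin wrapper that supplies the four callables could then be used as the `python_callable` of a single PythonOperator.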

Depending on your use case, you might also want to look at something like Apache Nifi; I'm actually looking into that now too.