Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

Question

I am trying to read a CSV file with Dask, and it fails with the error below. The thing is, I want ARTICLE_ID to be read as object (string). Can anyone help me read the data successfully?

The traceback is as follows:

    ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

    +------------+--------+----------+
    | Column     | Found  | Expected |
    +------------+--------+----------+
    | ARTICLE_ID | object | int64    |
    +------------+--------+----------+

    The following columns also raised exceptions on conversion:

    ARTICLE_ID: ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\'",)

    Usually this is due to dask's dtype inference failing, and
    *may* be fixed by specifying dtypes manually by adding:

    dtype={'ARTICLE_ID': 'object'}

    to the call to `read_csv`/`read_table`.

gench · Answer

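You can pass the sample parameter to read_csv, giving it an integer number of bytes to use when determining dtypes. For example, to get the dtypes inferred correctly for a dataset of shape (171907, 161), I had to specify 25000000.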

    import dask.dataframe as dd

    # Use the first 25,000,000 bytes of the file for dtype inference.
    df = dd.read_csv("game_logs.csv", sample=25000000)

https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
