mongodump fails after exactly 10 minutes

Question

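I am running a single-instance MongoDB v4.2.0 on Debian 9.11, on a server with 2 vCPUs and 13 GB of memory, and I am trying to dump a collection of 78 GB and 60M documents. The command is invoked on the same server the database runs on.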

mongodump --username user --password pwd --authenticationDatabase admin --host localhost --gzip --archive=out.gz --db database --collection collection

After 10 minutes, this is the output:

2019-10-17T16:13:09.523+0200 Failed: error creating intents to dump: error counting database.collection: connection(localhost:27017[-2]) unable to decode message length: read tcp 127.0.0.1:57798->127.0.0.1:27017: i/o timeout

Looking at mongod.log, these are the entries for the mongodump connection:

2019-10-17T16:13:09.523+0200 I NETWORK [conn20] end connection 127.0.0.1:57794 (4 connections now open)
2019-10-17T16:13:09.748+0200 I - [conn21] operation was interrupted because a client disconnected
2019-10-17T16:13:10.371+0200 W COMMAND [conn21] Unable to gather storage statistics for a slow operation due to lock aquire timeout
2019-10-17T16:13:10.371+0200 I COMMAND [conn21] command database.collection appName: "mongodump" command: aggregate { aggregate: "collection", pipeline: [ { $match: {} }, { $group: { _id: 1, n: { $sum: 1 } } } ], cursor: {}, lsid: { id: UUID("4759f9ad-7d37-44c7-bd41-8610af565c47") }, $db: "database" } planSummary: COLLSCAN numYields:437454 ok:0 errMsg:"Error in $cursor stage :: caused by :: operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:186 locks:{ ReplicationStateTransition: { acquireCount: { w: 437456 } }, Global: { acquireCount: { r: 437456 } }, Database: { acquireCount: { r: 437455 } }, Collection: { acquireCount: { r: 437455 } }, Mutex: { acquireCount: { r: 2 } } } protocol:op_msg 600852ms

It looks like mongodump internally runs this aggregation query to count the documents:

[ { $match: {} }, { $group: { _id: 1, n: { $sum: 1 } } } ]

The same query does work when it is run from the MongoDB shell:

> db.collection.aggregate([ { $match: {} }, { $group: { _id: 1, n: { $sum: 1 } } } ])
{ "_id" : 1, "n" : 60488853 }

And this is the corresponding mongod.log output:

2019-10-17T15:01:37.130+0200 I COMMAND [conn2] command database.collection appName: "MongoDB Shell" command: aggregate { aggregate: "collection", pipeline: [ { $match: {} }, { $group: { _id: 1.0, n: { $sum: 1.0 } } } ], cursor: {}, lsid: { id: UUID("3b732623-e8b4-4365-bac7-efa710db035c") }, $db: "database" } planSummary: COLLSCAN keysExamined:0 docsExamined:60488853 cursorExhausted:1 numYields:472591 nreturned:1 reslen:136 locks:{ ReplicationStateTransition: { acquireCount: { w: 472593 } }, Global: { acquireCount: { r: 472593 } }, Database: { acquireCount: { r: 472593 } }, Collection: { acquireCount: { r: 472593 } }, Mutex: { acquireCount: { r: 2 } } } storage:{ data: { bytesRead: 79320933249, timeReadingMicros: 610249624 } } protocol:op_msg 653559ms

The count query takes 653 seconds from the shell, whereas the same query issued internally by mongodump times out after 600 seconds (exactly 10 minutes).
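To make the comparison concrete, here is a minimal Python sketch that pulls the trailing duration out of the two slow-operation log lines quoted above and checks it against a 10-minute limit. The sample strings are abbreviated copies of those lines, and the 600,000 ms limit is an assumption inferred from the "i/o timeout" and "client disconnected" messages, not a documented setting. The helper name `duration_ms` is hypothetical.

```python
import re

# Matches the trailing "<number>ms" that ends a mongod slow-operation log line.
DURATION_RE = re.compile(r"(\d+)ms\s*$")


def duration_ms(log_line: str) -> int:
    """Return the operation duration in milliseconds from a mongod log line."""
    match = DURATION_RE.search(log_line)
    if match is None:
        raise ValueError("no trailing duration found in log line")
    return int(match.group(1))


# Abbreviated copies of the two log lines above; only the tail matters here.
mongodump_line = (
    'appName: "mongodump" ... errName:ClientDisconnect errCode:279 '
    "... protocol:op_msg 600852ms"
)
shell_line = (
    'appName: "MongoDB Shell" ... docsExamined:60488853 '
    "... protocol:op_msg 653559ms"
)

# Assumption: the client gives up after 10 minutes (inferred from the logs).
ASSUMED_CLIENT_TIMEOUT_MS = 10 * 60 * 1000

for name, line in (("mongodump", mongodump_line), ("shell", shell_line)):
    ms = duration_ms(line)
    over = ms > ASSUMED_CLIENT_TIMEOUT_MS
    print(f"{name}: {ms / 1000:.1f} s, longer than 10 min: {over}")
```

Both durations are above the assumed limit: the shell waits for the full collection scan to finish, while the mongodump connection is dropped shortly after the 600-second mark, which is consistent with a client-side timeout rather than a server-side failure.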

Other, smaller collections and databases on the same server do not have this problem, only this large one.

How can I work around this timeout, or the slow count query, so that mongodump runs successfully?

ssasa · Answer

The problem no longer occurs after upgrading to MongoDB v4.2.1, so it appears to have been a bug.