Fastest way to remove duplicate documents in MongoDB

Question

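I have approximately 1.7 million documents in MongoDB (more than 10 million in the future). Some of them represent duplicate entries that I do not want. The document structure looks like this: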

{ _id: 14124412, nodes: [ 12345, 54321 ], name: "Some beauty" }

A document is a duplicate if at least one of its nodes is the same as a node in another document with the same name. What is the fastest way to remove the duplicates?
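To make the duplicate rule concrete, here is a minimal pure-Python sketch (no MongoDB involved) that flags a document when an earlier document has the same name and shares at least one node. The function name `find_duplicate_ids` and the sample data are illustrative assumptions, not part of the question.

```python
def find_duplicate_ids(docs):
    """Return the _ids of documents that repeat an earlier (name, node) pair."""
    seen = set()          # (name, node) pairs claimed by the documents we keep
    duplicate_ids = []
    for doc in docs:
        pairs = {(doc["name"], node) for node in doc["nodes"]}
        if pairs & seen:
            # Same name and at least one shared node: treat as a duplicate.
            duplicate_ids.append(doc["_id"])
        else:
            seen |= pairs
    return duplicate_ids


docs = [
    {"_id": 1, "nodes": [12345, 54321], "name": "Some beauty"},
    {"_id": 2, "nodes": [54321, 99999], "name": "Some beauty"},  # shares 54321
    {"_id": 3, "nodes": [12345], "name": "Other name"},          # different name
]
print(find_duplicate_ids(docs))  # prints [2]
```

Which document of a duplicate pair survives depends on iteration order here; the question does not say which one should be kept.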

JohnnyHK · Accepted Answer

If you want to permanently delete the documents that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

As the docs say, use extreme caution with this, because it deletes data from your database. Back up your database first in case it does not do exactly what you expect.

UPDATE

This solution is only valid for MongoDB 2.x, because the dropDups option is no longer available in 3.0 (docs).

Somnath Muluk · Answer

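The dropDups: true option is not available in 3.0.

I have a solution that uses the aggregation framework to collect the duplicates and then remove them in one go.

It may be somewhat slower than a system-level "index" change, but it gives you control over how the duplicate documents are removed.

a. Remove all documents in one go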

var duplicates = [];

db.collectionName.aggregate([
  { $match: {
    name: { "$ne": '' }      // discard selection criteria
  }},
  { $group: {
    _id: { name: "$name" },  // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }      // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}         // For faster processing if set is larger
)                            // You can display result until this and check duplicates
.forEach(function(doc) {
    doc.dups.shift();        // First element skipped for deleting
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);   // Getting all duplicate ids
    });
});

// If you want to check all "_id" which you are deleting, else print statement not needed
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({_id: {$in: duplicates}});

b. You can delete documents one by one.

db.collectionName.aggregate([
  // discard selection criteria, you can remove the "$match" section if you want
  { $match: {
    "source_references.key": { "$ne": '' }   // dotted field names must be quoted
  }},
  { $group: {
    _id: { key: "$source_references.key" },  // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }      // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}         // For faster processing if set is larger
)                            // You can display result until this and check duplicates
.forEach(function(doc) {
    doc.dups.shift();        // First element skipped for deleting
    db.collectionName.remove({_id: {$in: doc.dups}});  // Delete remaining duplicates
});
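As a rough illustration of what the pipeline in variant (a) computes, the following pure-Python sketch runs the same steps on an in-memory list: filter out empty names, group the _ids by name, keep groups with more than one member, and skip the first _id of each group. The function name `ids_to_remove` is hypothetical. Note one assumption that differs from the server: here the "first" _id is the first one in list order, whereas $addToSet does not guarantee any order.

```python
from collections import defaultdict


def ids_to_remove(docs, key="name"):
    """Emulate $match / $group / $match, then drop the first _id per group."""
    groups = defaultdict(list)
    for doc in docs:
        value = doc.get(key)
        if value != "":                  # mirrors { name: { "$ne": '' } }
            groups[value].append(doc["_id"])

    to_remove = []
    for ids in groups.values():
        if len(ids) > 1:                 # mirrors { count: { "$gt": 1 } }
            to_remove.extend(ids[1:])    # keep the first, like doc.dups.shift()
    return to_remove


docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "a"},
    {"_id": 3, "name": "b"},
    {"_id": 4, "name": ""},
    {"_id": 5, "name": ""},
]
print(ids_to_remove(docs))  # prints [2]
```

The resulting list corresponds to the `duplicates` array that variant (a) passes to a single `$in` delete.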

dhythhsba · Answer

Create a dump of the collection using mongodump

Clear the collection

Add a unique index

Restore the collection using mongorestore
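The first and last steps above could be scripted roughly as follows. This sketch only builds the command lines and does not run them; the database name, collection name, and dump directory are placeholders, and the helper name `dump_and_restore_commands` is hypothetical. Clearing the collection and creating the unique index (steps two and three) still happen in the shell in between. By default mongorestore keeps going after a duplicate-key error, which is what lets the unique index filter out the duplicates during the restore.

```python
def dump_and_restore_commands(db_name, coll_name, dump_dir="dump"):
    """Build argv lists for mongodump and mongorestore (nothing is executed)."""
    dump_cmd = [
        "mongodump",
        f"--db={db_name}",
        f"--collection={coll_name}",
        f"--out={dump_dir}",
    ]
    restore_cmd = [
        "mongorestore",
        f"--nsInclude={db_name}.{coll_name}",  # restore only this namespace
        dump_dir,
    ]
    return dump_cmd, restore_cmd


dump_cmd, restore_cmd = dump_and_restore_commands("test", "items")
print(" ".join(dump_cmd))
print(" ".join(restore_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you have verified the dump.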

Ali Abul Hawa · Answer

I found this solution, which works with MongoDB 3.4. I'll assume the field with duplicates is called fieldX:

db.collection.aggregate([
    {
        // only match documents that have this field
        // you can omit this stage if you don't have missing fieldX
        $match: {"fieldX": {$nin: [null]}}
    },
    {
        $group: {"_id": "$fieldX", "doc": {"$first": "$$ROOT"}}
    },
    {
        $replaceRoot: {"newRoot": "$doc"}
    }
],
{allowDiskUse: true})

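Being new to MongoDB, I spent a lot of time on other, lengthier solutions for finding and deleting duplicates. I think this one is neat and easy to understand.

It works by first matching the documents that contain fieldX (I had some documents without this field, and they produced one extra empty result).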

The next stage groups the documents by fieldX and, using $$ROOT, stores only the $first document of each group. Finally, $replaceRoot replaces each aggregated group with the document that was captured with $first and $$ROOT.
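The effect of those three stages can be imitated on an in-memory list, which may help when reasoning about which documents survive. This is only an emulation under stated assumptions: the function name `first_per_key` is hypothetical, the field values must be hashable, and "first" means first in list order, whereas on the server it depends on the order in which documents reach the $group stage.

```python
def first_per_key(docs, field="fieldX"):
    """Keep the first document seen for each distinct value of `field`."""
    kept = {}
    for doc in docs:
        value = doc.get(field)
        if value is None:            # mirrors {$nin: [null]}: skip missing or null
            continue
        kept.setdefault(value, doc)  # like $first with $$ROOT: store the whole doc once
    return list(kept.values())       # like $replaceRoot: emit the stored documents


docs = [
    {"_id": 1, "fieldX": "a"},
    {"_id": 2, "fieldX": "a"},
    {"_id": 3, "fieldX": "b"},
    {"_id": 4},                      # no fieldX, filtered out
]
print([d["_id"] for d in first_per_key(docs)])  # prints [1, 3]
```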

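I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages. The documentation for $first mentions a sort stage before using $first, but it worked for me without one. (I couldn't post a link here because my reputation is less than 10.)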

To save the results to a new collection, add an $out stage...
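For pymongo users, the same pipeline can be written as plain Python data, with the $out stage appended only when a target collection is given. The helper name `dedupe_pipeline` and the collection name `deduped` are illustrative assumptions.

```python
def dedupe_pipeline(field, out_collection=None):
    """Build the match / group / replaceRoot pipeline, optionally ending in $out."""
    pipeline = [
        {"$match": {field: {"$nin": [None]}}},
        {"$group": {"_id": f"${field}", "doc": {"$first": "$$ROOT"}}},
        {"$replaceRoot": {"newRoot": "$doc"}},
    ]
    if out_collection is not None:
        # $out must be the last stage; it writes the results to that collection.
        pipeline.append({"$out": out_collection})
    return pipeline


for stage in dedupe_pipeline("fieldX", "deduped"):
    print(stage)
```

With a pymongo collection handle, the list would be passed as `collection.aggregate(dedupe_pipeline("fieldX", "deduped"), allowDiskUse=True)`.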

Alternatively, if you are only interested in a few fields such as field1 and field2 rather than the whole document, pick them in the group stage and leave out replaceRoot:

db.collection.aggregate([
    {
        // only match documents that have this field
        $match: {"fieldX": {$nin: [null]}}
    },
    {
        $group: {
            "_id": "$fieldX",
            "field1": {"$first": "$$ROOT.field1"},
            "field2": {"$first": "$field2"}
        }
    }
],
{allowDiskUse: true})

amateur · Answer

If you are trying to do it with pymongo, you can do something like this.

import logging

_logger = logging.getLogger(__name__)


def _run_query(collection):
    # `collection` is assumed to be an existing pymongo Collection
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                # pick one document to keep for this combination of values
                retain = collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                # remove() no longer exists in PyMongo 4; use delete_many() there
                collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error(" Error when retaining the record :%s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])

From the shell:

Replace find_one with findOne.
The same remove command should work.

Renny · Answer

With pymongo, this should work.

In unique_field, list the fields that need to be unique for the collection:

unique_field = {"field1": "$field1", "field2": "$field2"}

cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
], allowDiskUse=True)

Slice the dups array according to the number of duplicates (here every duplicated entry had only one extra copy):

items = list(cursor)
removeIds = items[0]['dups']
DB.COL.remove({"uuid": {"$in": removeIds}})
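If some groups contain more than one extra copy, one option is to drop the final $group stage and slice each group's dups list on the client instead. The sketch below assumes the cursor yields one dict per group with a `dups` list, as produced by the first two stages above; the helper name `extra_ids` and the sample data are hypothetical.

```python
def extra_ids(groups, keep=1):
    """Collect every id after the first `keep` entries of each group's dups list."""
    remove_ids = []
    for group in groups:
        remove_ids.extend(group["dups"][keep:])
    return remove_ids


# Shape of the documents after the $group and $match stages (illustrative data).
groups = [
    {"_id": {"field1": "x", "field2": "y"}, "dups": ["u1", "u2", "u3"], "count": 3},
    {"_id": {"field1": "p", "field2": "q"}, "dups": ["u7", "u8"], "count": 2},
]
print(extra_ids(groups))  # prints ['u2', 'u3', 'u8']
```

The returned list can then be used in the same `{"uuid": {"$in": ...}}` filter as above.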

sanair96 · Answer

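The following method merges documents that have the same name, keeping only the unique nodes without repeating them.

I found the $out operator to be a simple way to do this. I unwind the array and then group it by adding to a set. The $out operator lets the aggregation result persist [docs]. If you give it the name of the collection itself, it replaces that collection with the new data. If the name does not exist, it creates a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.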

db.collectionName.aggregate([
    {
        $unwind: {path: "$nodes"}
    },
    {
        $group: {
            _id: "$name",
            nodes: {
                $addToSet: "$nodes"
            }
        }
    },
    {
        $project: {
            _id: 0,
            name: "$_id",   // _id holds the name string after the $group stage
            nodes: 1
        }
    },
    {
        $out: "collectionNameWithoutDuplicates"
    }
])
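To see what the merge produces, here is a small pure-Python emulation of the unwind and group steps. It is a sketch on in-memory data with a hypothetical function name, `merge_by_name`. Note two consequences it makes visible: documents with the same name are merged even when they share no node, and the original _id values are not carried over.

```python
def merge_by_name(docs):
    """Merge documents with the same name into one, keeping each node once."""
    merged = {}
    for doc in docs:
        # Like $unwind: documents with no nodes contribute nothing.
        for node in doc.get("nodes", []):
            nodes = merged.setdefault(doc["name"], [])
            if node not in nodes:        # like $addToSet
                nodes.append(node)
    # Like $project: expose the group key as `name`, without _id.
    return [{"name": name, "nodes": nodes} for name, nodes in merged.items()]


docs = [
    {"_id": 1, "name": "Some beauty", "nodes": [12345, 54321]},
    {"_id": 2, "name": "Some beauty", "nodes": [54321, 99999]},
    {"_id": 3, "name": "Other", "nodes": [1]},
]
print(merge_by_name(docs))
```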