What is the right approach when using an STL container for median calculation?

Question

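Let's say I need to retrieve the median from a sequence of 1000000 random numeric values.

If I use anything but std::list, I have no (built-in) way to sort the sequence for the median calculation.

If I use std::list, I can't randomly access values to retrieve the middle (median) of the sorted sequence.

Is it better to implement sorting myself and go with e.g. std::vector, or is it better to use std::list and walk to the median value in a for loop with std::list::iterator? The latter seems to have less overhead, but it also feels uglier.

Or are there more and better alternatives for me?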

Mike Seymour · Accepted Answer

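Any random-access container (like std::vector) can be sorted with the standard std::sort algorithm, available in the <algorithm> header.

For finding the median, it would be quicker to use std::nth_element; this does enough of a sort to put one chosen element in the correct position, but doesn't completely sort the container. So you could find the median like this: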

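    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Reorders v; v must not be empty.
    int median(std::vector<int> &v)
    {
        std::size_t n = v.size() / 2;
        std::nth_element(v.begin(), v.begin()+n, v.end());
        return v[n];
    }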
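To make the "enough of a sort" point concrete, here is a small sketch that checks the property being relied on: after std::nth_element, position n holds the same value that a full std::sort would put there. The helper name nth_element_agrees_with_sort is made up for this illustration and is not part of the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if std::nth_element leaves at index n the same
// value that a full sort would place there. Works on copies of the input.
bool nth_element_agrees_with_sort(std::vector<int> data, std::size_t n) {
    assert(n < data.size());                      // n must be a valid index
    std::vector<int> sorted = data;
    std::sort(sorted.begin(), sorted.end());      // full ordering, O(N log N)
    std::nth_element(data.begin(), data.begin() + n, data.end()); // partial ordering
    return data[n] == sorted[n];
}
```

Only the element at index n is guaranteed to match; the elements on either side are partitioned around it but otherwise left in an unspecified order.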

Eponymous · Answer

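The median is more complex than Mike Seymour's answer suggests. It depends on whether the sample contains an even or an odd number of items. If the number of items is even, the median is the average of the two middle items. This means that the median of a list of integers can be a fraction. Finally, the median of an empty list is undefined. Here is code that passes my basic test cases: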

    #include <algorithm>
    #include <cstddef>
    #include <exception>

    ///Represents the exception for taking the median of an empty list
    class median_of_empty_list_exception : public std::exception {
    public:
        const char* what() const noexcept override {
            return "Attempt to take the median of an empty list of numbers.  "
                   "The median of an empty list is undefined.";
        }
    };

    ///Return the median of a sequence of numbers defined by the random
    ///access iterators begin and end.  The sequence must not be empty
    ///(median is undefined for an empty set).
    ///
    ///The numbers must be convertible to double.
    template<class RandAccessIter>
    double median(RandAccessIter begin, RandAccessIter end) {
        if (begin == end) { throw median_of_empty_list_exception(); }
        std::size_t size = end - begin;
        std::size_t middleIdx = size/2;
        RandAccessIter target = begin + middleIdx;
        std::nth_element(begin, target, end);

        if (size % 2 != 0) { //Odd number of elements
            return *target;
        } else {             //Even number of elements
            double a = *target;
            RandAccessIter targetNeighbor = target-1;
            std::nth_element(begin, targetNeighbor, end);
            return (a+*targetNeighbor)/2.0;
        }
    }

Alec Jacobson · Answer

Here's a more complete version of Mike Seymour's answer:

    // Could use pass by copy to avoid changing vector
    double median(std::vector<int> &v)
    {
      size_t n = v.size() / 2;
      std::nth_element(v.begin(), v.begin()+n, v.end());
      int vn = v[n];
      if(v.size()%2 == 1)
      {
        return vn;
      }else
      {
        std::nth_element(v.begin(), v.begin()+n-1, v.end());
        return 0.5*(vn+v[n-1]);
      }
    }

It handles odd- or even-length input.

Matthew Fioravante · Answer

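This algorithm handles both even- and odd-sized inputs efficiently, using the STL nth_element algorithm (O(N) on average) and the max_element algorithm (O(N)). Note that nth_element has another guaranteed side effect: every element before position n is guaranteed to be no greater than v[n], although those elements are not necessarily sorted.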

    #include <algorithm>
    #include <vector>

    //post-condition: After returning, the elements in v may be reordered and the resulting order is unspecified.
    double median(std::vector<double> &v)
    {
      if(v.empty()) {
        return 0.0;
      }
      auto n = v.size() / 2;
      std::nth_element(v.begin(), v.begin()+n, v.end());
      auto med = v[n];
      if(!(v.size() & 1)) { //If the set size is even
        auto max_it = std::max_element(v.begin(), v.begin()+n);
        med = (*max_it + med) / 2.0;
      }
      return med;
    }
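The max_element step works because of the partition guarantee described above. The sketch below spells that out for an even-sized input: it checks the guarantee explicitly and then reads the lower middle element as the largest value to the left of n. The names is_partitioned_around, MiddlePair, and middle_pair are invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: nothing before index n is greater than v[n],
// and nothing after index n is smaller than v[n].
bool is_partitioned_around(const std::vector<double>& v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (v[i] > v[n]) return false;
    for (std::size_t i = n + 1; i < v.size(); ++i)
        if (v[i] < v[n]) return false;
    return true;
}

struct MiddlePair { double lower; double upper; };

// Both middle elements of a non-empty, even-sized vector, found with one
// nth_element call plus one linear max_element scan. Reorders v.
MiddlePair middle_pair(std::vector<double>& v) {
    assert(!v.empty() && v.size() % 2 == 0);
    const std::size_t n = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + n, v.end());
    assert(is_partitioned_around(v, n));
    // The largest value left of n is what a full sort would put at n - 1.
    const double lower = *std::max_element(v.begin(), v.begin() + n);
    return {lower, v[n]};
}
```

Averaging lower and upper gives the median for even sizes without a second nth_element call.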

Charles Salvia · Answer

Use a std::vector together with the library function std::sort.

    std::vector<int> vec;
    // ... fill vector with stuff
    std::sort(vec.begin(), vec.end());
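The snippet above stops after sorting. As a rough sketch of the remaining step, assuming a non-empty input and the even/odd convention used in the other answers, the median can then be read from the middle of the sorted vector. The function name median_by_sorting is made up here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median by full sort. Takes the vector by value so the caller's data keeps
// its order. Precondition (an assumption of this sketch): vec is not empty.
double median_by_sorting(std::vector<int> vec) {
    assert(!vec.empty());
    std::sort(vec.begin(), vec.end());
    const std::size_t n = vec.size() / 2;
    if (vec.size() % 2 == 1) {
        return vec[n];                               // odd: single middle element
    }
    // even: average the two middle elements, in double to avoid int overflow
    return (static_cast<double>(vec[n - 1]) + vec[n]) / 2.0;
}
```

This costs O(N log N) rather than the average O(N) of the nth_element approaches, which may still be acceptable when the sorted data is needed anyway.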

Croc Dialer · Answer

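Putting together all the insights from this thread, I ended up with this routine. It works with any STL container, or any class providing input iterators, and handles odd- and even-sized containers. It also does its work on a copy of the container, so the original content is not modified.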

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Precondition: the_container must not be empty.
    template <typename T = double, typename C>
    inline const T median(const C &the_container)
    {
        std::vector<T> tmp_array(std::begin(the_container),
                                 std::end(the_container));
        std::size_t n = tmp_array.size() / 2;
        std::nth_element(tmp_array.begin(), tmp_array.begin() + n, tmp_array.end());

        if(tmp_array.size() % 2){ return tmp_array[n]; }
        else
        {
            // even sized vector -> average the two middle values
            auto max_it = std::max_element(tmp_array.begin(), tmp_array.begin() + n);
            return (*max_it + tmp_array[n]) / 2.0;
        }
    }

ephemient · Answer

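There exists a linear-time selection algorithm. The code below only works when the container has random-access iterators, but it can be modified to work without them; you'll just have to be a bit more careful to avoid shortcuts like end - begin and iter + n.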

    #include <algorithm>
    #include <cstdlib>
    #include <functional>
    #include <iostream>
    #include <random>
    #include <sstream>
    #include <vector>

    template<class A, class C = std::less<typename A::value_type> >
    class LinearTimeSelect {
    public:
        LinearTimeSelect(const A &things) : things(things) {}
        typename A::value_type nth(int n) {
            return nth(n, things.begin(), things.end());
        }
    private:
        static typename A::value_type nth(int n,
                typename A::iterator begin, typename A::iterator end) {
            int size = end - begin;
            if (size <= 5) {
                std::sort(begin, end, C());
                return begin[n];
            }
            typename A::iterator walk(begin), skip(begin);
    #ifdef RANDOM // randomized algorithm, average linear-time
            typename A::value_type pivot = begin[std::rand() % size];
    #else // guaranteed linear-time, but usually slower in practice
            // move the median of each group of 5 to the front
            while (end - skip >= 5) {
                std::sort(skip, skip + 5, C());
                std::iter_swap(walk++, skip + 2);
                skip += 5;
            }
            while (skip != end) std::iter_swap(walk++, skip++);
            // pivot = median of the collected medians
            typename A::value_type pivot = nth((walk - begin) / 2, begin, walk);
    #endif
            // partition: elements less than the pivot go to the front
            // (caveat: if nothing compares less than the pivot, which can
            // happen with many equal values, the range does not shrink)
            for (walk = skip = begin, size = 0; skip != end; ++skip)
                if (C()(*skip, pivot)) std::iter_swap(walk++, skip), ++size;
            if (size <= n) return nth(n - size, walk, end);
            else return nth(n, begin, walk);
        }
        A things;
    };

    int main(int argc, char **argv) {
        std::vector<int> seq;
        {
            int i = 32;
            std::istringstream(argc > 1 ? argv[1] : "") >> i;
            while (i--) seq.push_back(i);
        }
        // std::random_shuffle was removed in C++17; std::shuffle replaces it
        std::mt19937 rng(std::random_device{}());
        std::shuffle(seq.begin(), seq.end(), rng);
        std::cout << "unordered: ";
        for (std::vector<int>::iterator i = seq.begin(); i != seq.end(); ++i)
            std::cout << *i << " ";
        LinearTimeSelect<std::vector<int> > alg(seq);
        std::cout << std::endl << "linear-time medians: "
            << alg.nth((seq.size()-1) / 2) << ", " << alg.nth(seq.size() / 2);
        std::sort(seq.begin(), seq.end());
        std::cout << std::endl << "medians by sorting: "
            << seq[(seq.size()-1) / 2] << ", " << seq[seq.size() / 2] << std::endl;
        return 0;
    }
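As a sketch of what replacing those random-access shortcuts looks like, the function below finds the median of a std::list using std::distance in place of end - begin and std::next in place of iter + n. Note that it sorts with std::list::sort rather than implementing the linear-time selection above, so it only illustrates the iterator handling; the name median_of_list is made up.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Median of a std::list without random access. Takes the list by value, so
// the copy is sorted and the caller's list is untouched.
// Precondition (an assumption of this sketch): values is not empty.
double median_of_list(std::list<int> values) {
    assert(!values.empty());
    values.sort();  // std::list has its own member sort
    // std::distance replaces "end - begin"
    const auto size = std::distance(values.begin(), values.end());
    // std::next replaces "begin + n"; it walks the list step by step
    const auto upper = std::next(values.begin(), size / 2);
    if (size % 2 != 0) {
        return *upper;                               // odd: single middle element
    }
    const auto lower = std::prev(upper);             // bidirectional step back
    return (static_cast<double>(*lower) + *upper) / 2.0;
}
```

Both std::distance and std::next are linear on list iterators, which is the extra cost the answer alludes to when it says more care is needed.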

Lorah Attkins · Answer

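Here's an answer that takes @MatthieuM's suggestion into account, i.e. it does not modify the input vector. It uses a single partial sort (on a vector of indices) for ranges of both even and odd cardinality, while empty ranges are handled through the exception thrown by the vector's at method: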

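    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    double median(std::vector<int> const& v)
    {
        bool isEven = !(v.size() % 2);
        std::size_t n = v.size() / 2;

        // sort indices instead of values, so v itself is never modified
        std::vector<std::size_t> vi(v.size());
        std::iota(vi.begin(), vi.end(), 0);

        // clamp so an empty input never forms an iterator past end();
        // vi.at() below then throws std::out_of_range for it
        std::partial_sort(vi.begin(), vi.begin() + std::min(n + 1, vi.size()), vi.end(),
            [&](std::size_t lhs, std::size_t rhs) { return v[lhs] < v[rhs]; });

        return isEven ? 0.5 * (v[vi.at(n-1)] + v[vi.at(n)]) : v[vi.at(n)];
    }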
