High CPU usage with CFS?

Question

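I asked a previous question to try to isolate the source of an increase in CPU usage when moving an application from RHEL 5 to RHEL 6. The analysis I did for that question seems to indicate that the increase is caused by the CFS in the kernel. I wrote a test application to verify whether this is the case (the original test application was removed to fit within the size limit, but it is still available in the git repo).

I compiled it on RHEL 5 with the following command: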

cc test_select_work.c -O2 -DSLEEP_TYPE=0 -Wall -Wextra -lm -lpthread -o test_select_work

I then adjusted the parameters until the execution time per iteration was about 1 ms on a Dell Precision m6500.

I got the following results on RHEL 5:

./test_select_work 1000 10000 300 4
time_per_iteration: min: 911.5 us avg: 913.7 us max: 917.1 us stddev: 2.4 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1802.6 us avg: 1803.9 us max: 1809.1 us stddev: 2.1 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7580.4 us avg: 8567.3 us max: 9022.0 us stddev: 299.6 us

And the following on RHEL 6:

./test_select_work 1000 10000 300 4
time_per_iteration: min: 914.6 us avg: 975.7 us max: 1034.5 us stddev: 50.0 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1683.9 us avg: 1771.8 us max: 1810.8 us stddev: 43.4 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7997.1 us avg: 8709.1 us max: 9061.8 us stddev: 310.0 us

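On both versions, these results were about what I expected, with the average time per iteration scaling relatively linearly. I then recompiled with -DSLEEP_TYPE=1 and got the following results on RHEL 5: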

./test_select_work 1000 10000 300 4
time_per_iteration: min: 1803.3 us avg: 1902.8 us max: 2001.5 us stddev: 113.8 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1997.1 us avg: 2002.0 us max: 2010.8 us stddev: 5.0 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 6958.4 us avg: 8397.9 us max: 9423.7 us stddev: 619.7 us

And the following results on RHEL 6:

./test_select_work 1000 10000 300 4
time_per_iteration: min: 2107.1 us avg: 2143.1 us max: 2177.7 us stddev: 30.3 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 2903.3 us avg: 2903.8 us max: 2904.3 us stddev: 0.3 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 8877.7.1 us avg: 9016.3 us max: 9112.6 us stddev: 62.9 us

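On RHEL 5, the results were about what I expected: 4 threads take twice as long because of the 1 ms sleep, 8 threads take about the same time as 4 because each thread now sleeps for about half of the time, and beyond that the increase is still fairly linear.

On RHEL 6, however, the time taken with 4 threads was about 15% more than the expected doubling, and the 8-thread case was about 45% more than the expected slight increase. The increase in the 4-thread case seems to come from RHEL 6 actually sleeping a handful of microseconds longer than 1 ms while RHEL 5 sleeps only about 900 us, but this does not explain the unexpectedly large increase in the 8- and 40-thread cases.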

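I saw similar behavior with all 3 -DSLEEP_TYPE values. I also tried adjusting the scheduler parameters in sysctl, but nothing seemed to have a significant impact on the results. Any ideas on how I can further diagnose this issue?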

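UPDATE: 2012-05-07

I added measurements of the user and system CPU usage from /proc/<pid>/task/<tid>/stat to the output of the test to get another point of observation. I also found a problem with the way the mean and standard deviation were being updated, introduced when I added the outer iteration loop, so I will add new plots with the corrected mean and standard deviation measurements. I have included the updated program below. I also created a git repo to track the code, and it is available here.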

#include <limits.h>
#include <math.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/syscall.h>
#include <sys/time.h>

// Apparently GLIBC doesn't provide a wrapper for this function so provide it here
#ifndef HAS_GETTID
pid_t gettid(void)
{
  return syscall(SYS_gettid);
}
#endif

// The different type of sleep that are supported
enum sleep_type {
  SLEEP_TYPE_NONE,
  SLEEP_TYPE_SELECT,
  SLEEP_TYPE_POLL,
  SLEEP_TYPE_USLEEP,
  SLEEP_TYPE_YIELD,
  SLEEP_TYPE_PTHREAD_COND,
  SLEEP_TYPE_NANOSLEEP,
};

// Information returned by the processing thread
struct thread_res {
  long long clock;
  long long user;
  long long sys;
};

// Function type for doing work with a sleep
typedef struct thread_res *(*work_func)(const int pid, const int sleep_time, const int num_iterations, const int work_size);

// Information passed to the thread
struct thread_info {
  pid_t pid;
  int sleep_time;
  int num_iterations;
  int work_size;
  work_func func;
};

// Read the user and system time (in clock ticks) of one thread from /proc.
// Declared static so that a non-inlined call still links under C99 and later.
static inline void get_thread_times(pid_t pid, pid_t tid, unsigned long long *utime, unsigned long long *stime)
{
  char filename[FILENAME_MAX];
  FILE *f;

  sprintf(filename, "/proc/%d/task/%d/stat", pid, tid);
  f = fopen(filename, "r");
  if (f == NULL) {
    *utime = 0;
    *stime = 0;
    return;
  }

  fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu", utime, stime);

  fclose(f);
}

// In order to make SLEEP_TYPE a run-time parameter function pointers are used.
// The function pointer could have been to the sleep function being used, but
// then that would mean an extra function call inside of the "work loop" and I
// wanted to keep the measurements as tight as possible and the extra work being
// done to be as small/controlled as possible so instead the work is declared as
// a series of macros that are called in all of the sleep functions. The code
// is a bit uglier this way, but I believe it results in a more accurate test.

// Fill in a buffer with random numbers (taken from latt.c by Jens Axboe <jens.axboe@Oracle.com>)
#define DECLARE_FUNC(NAME) struct thread_res *do_work_##NAME(const int pid, const int sleep_time, const int num_iterations, const int work_size)

#define DECLARE_WORK() \
  int *buf; \
  int pseed; \
  int inum, bnum; \
  pid_t tid; \
  struct timeval clock_before, clock_after; \
  unsigned long long user_before, user_after; \
  unsigned long long sys_before, sys_after; \
  struct thread_res *diff; \
  tid = gettid(); \
  buf = malloc(work_size * sizeof(*buf)); \
  diff = malloc(sizeof(*diff)); \
  get_thread_times(pid, tid, &user_before, &sys_before); \
  gettimeofday(&clock_before, NULL)

#define DO_WORK(SLEEP_FUNC) \
  for (inum=0; inum<num_iterations; ++inum) { \
    SLEEP_FUNC \
    \
    pseed = 1; \
    for (bnum=0; bnum<work_size; ++bnum) { \
      pseed = pseed * 1103515245 + 12345; \
      buf[bnum] = (pseed / 65536) % 32768; \
    } \
  }

#define FINISH_WORK() \
  gettimeofday(&clock_after, NULL); \
  get_thread_times(pid, tid, &user_after, &sys_after); \
  diff->clock = 1000000LL * (clock_after.tv_sec - clock_before.tv_sec); \
  diff->clock += clock_after.tv_usec - clock_before.tv_usec; \
  diff->user = user_after - user_before; \
  diff->sys = sys_after - sys_before; \
  free(buf); \
  return diff

DECLARE_FUNC(nosleep)
{
  DECLARE_WORK();

  // Let the compiler know that sleep_time isn't used in this function
  (void)sleep_time;

  DO_WORK();

  FINISH_WORK();
}

DECLARE_FUNC(select)
{
  struct timeval ts;
  DECLARE_WORK();

  DO_WORK(
    ts.tv_sec = 0;
    ts.tv_usec = sleep_time;
    select(0, 0, 0, 0, &ts);
    );

  FINISH_WORK();
}

DECLARE_FUNC(poll)
{
  struct pollfd pfd;
  const int sleep_time_ms = sleep_time / 1000;
  DECLARE_WORK();

  pfd.fd = 0;
  pfd.events = 0;

  DO_WORK(
    poll(&pfd, 1, sleep_time_ms);
    );

  FINISH_WORK();
}

DECLARE_FUNC(usleep)
{
  DECLARE_WORK();

  DO_WORK(
    usleep(sleep_time);
    );

  FINISH_WORK();
}

DECLARE_FUNC(yield)
{
  DECLARE_WORK();

  // Let the compiler know that sleep_time isn't used in this function
  (void)sleep_time;

  DO_WORK(
    sched_yield();
    );

  FINISH_WORK();
}

DECLARE_FUNC(pthread_cond)
{
  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  struct timespec ts;
  const int sleep_time_ns = sleep_time * 1000;
  DECLARE_WORK();

  pthread_mutex_lock(&mutex);

  DO_WORK(
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_nsec += sleep_time_ns;
    if (ts.tv_nsec >= 1000000000) {
      ts.tv_sec += 1;
      ts.tv_nsec -= 1000000000;
    }
    pthread_cond_timedwait(&cond, &mutex, &ts);
    );

  pthread_mutex_unlock(&mutex);

  pthread_cond_destroy(&cond);
  pthread_mutex_destroy(&mutex);

  FINISH_WORK();
}

DECLARE_FUNC(nanosleep)
{
  struct timespec req, rem;
  const int sleep_time_ns = sleep_time * 1000;
  DECLARE_WORK();

  DO_WORK(
    req.tv_sec = 0;
    req.tv_nsec = sleep_time_ns;
    nanosleep(&req, &rem);
    );

  FINISH_WORK();
}

void *do_test(void *arg)
{
  const struct thread_info *tinfo = (struct thread_info *)arg;

  // Call the function to do the work
  return (*tinfo->func)(tinfo->pid, tinfo->sleep_time, tinfo->num_iterations, tinfo->work_size);
}

struct thread_res_stats {
  double min;
  double max;
  double avg;
  double stddev;
  double prev_avg;
};

#ifdef LLONG_MAX
  #define THREAD_RES_STATS_INITIALIZER {LLONG_MAX, LLONG_MIN, 0, 0, 0}
#else
  #define THREAD_RES_STATS_INITIALIZER {LONG_MAX, LONG_MIN, 0, 0, 0}
#endif

void update_stats(struct thread_res_stats *stats, long long value, int num_samples, int num_iterations, double scale_to_usecs)
{
  // Calculate the average time per iteration
  double value_per_iteration = value * scale_to_usecs / num_iterations;

  // Update the max and min
  if (value_per_iteration < stats->min)
    stats->min = value_per_iteration;
  if (value_per_iteration > stats->max)
    stats->max = value_per_iteration;
  // Update the average
  stats->avg += (value_per_iteration - stats->avg) / (double)(num_samples);
  // Update the standard deviation
  stats->stddev += (value_per_iteration - stats->prev_avg) * (value_per_iteration - stats->avg);
  // And record the current average for use in the next update
  stats->prev_avg = stats->avg;
}

void print_stats(const char *name, const struct thread_res_stats *stats)
{
  printf("%s: min: %.1f us avg: %.1f us max: %.1f us stddev: %.1f us\n",
      name,
      stats->min,
      stats->avg,
      stats->max,
      stats->stddev);
}

int main(int argc, char **argv)
{
  if (argc <= 6) {
    printf("Usage: %s <sleep_time> <outer_iterations> <inner_iterations> <work_size> <num_threads> <sleep_type>\n", argv[0]);
    printf("  outer_iterations: Number of iterations for each thread (used to calculate statistics)\n");
    printf("  inner_iterations: Number of work/sleep cycles performed in each thread (used to improve consistency/observability)\n");
    printf("  work_size: Number of array elements (in kb) that are filled with pseudo-random numbers\n");
    printf("  num_threads: Number of threads to spawn and perform work/sleep cycles in\n");
    printf("  sleep_type: 0=none 1=select 2=poll 3=usleep 4=yield 5=pthread_cond 6=nanosleep\n");
    return -1;
  }

  struct thread_info tinfo;
  int outer_iterations;
  int sleep_type;
  int s, inum, tnum, num_samples, num_threads;
  pthread_attr_t attr;
  pthread_t *threads;
  struct thread_res *res;
  struct thread_res **times;
  // Track the stats for each of the measurements
  struct thread_res_stats stats_clock = THREAD_RES_STATS_INITIALIZER;
  struct thread_res_stats stats_user = THREAD_RES_STATS_INITIALIZER;
  struct thread_res_stats stats_sys = THREAD_RES_STATS_INITIALIZER;
  // Calculate the conversion factor from clock_t to seconds
  const long clocks_per_sec = sysconf(_SC_CLK_TCK);
  const double clocks_to_usec = 1000000 / (double)clocks_per_sec;

  // Get the parameters
  tinfo.pid = getpid();
  tinfo.sleep_time = atoi(argv[1]);
  outer_iterations = atoi(argv[2]);
  tinfo.num_iterations = atoi(argv[3]);
  tinfo.work_size = atoi(argv[4]) * 1024;
  num_threads = atoi(argv[5]);
  sleep_type = atoi(argv[6]);
  switch (sleep_type) {
    case SLEEP_TYPE_NONE:         tinfo.func = &do_work_nosleep;      break;
    case SLEEP_TYPE_SELECT:       tinfo.func = &do_work_select;       break;
    case SLEEP_TYPE_POLL:         tinfo.func = &do_work_poll;         break;
    case SLEEP_TYPE_USLEEP:       tinfo.func = &do_work_usleep;       break;
    case SLEEP_TYPE_YIELD:        tinfo.func = &do_work_yield;        break;
    case SLEEP_TYPE_PTHREAD_COND: tinfo.func = &do_work_pthread_cond; break;
    case SLEEP_TYPE_NANOSLEEP:    tinfo.func = &do_work_nanosleep;    break;
    default:
      printf("Invalid sleep type: %d\n", sleep_type);
      return -7;
  }

  // Initialize the thread creation attributes
  s = pthread_attr_init(&attr);
  if (s != 0) {
    printf("Error initializing thread attributes\n");
    return -2;
  }

  // Allocate the memory to track the threads
  threads = calloc(num_threads, sizeof(*threads));
  times = calloc(num_threads, sizeof(*times));
  if (threads == NULL || times == NULL) {
    printf("Error allocating memory to track threads\n");
    return -3;
  }

  // Initialize the number of samples
  num_samples = 0;
  // Perform the requested number of outer iterations
  for (inum=0; inum<outer_iterations; ++inum) {
    // Start all of the threads
    for (tnum=0; tnum<num_threads; ++tnum) {
      s = pthread_create(&threads[tnum], &attr, &do_test, &tinfo);

      if (s != 0) {
        printf("Error starting thread\n");
        return -4;
      }
    }

    // Wait for all the threads to finish
    for (tnum=0; tnum<num_threads; ++tnum) {
      s = pthread_join(threads[tnum], (void **)(&res));
      if (s != 0) {
        printf("Error waiting for thread\n");
        return -6;
      }

      // Save the result for processing when they're all done
      times[tnum] = res;
    }

    // For each of the threads
    for (tnum=0; tnum<num_threads; ++tnum) {
      // Increment the number of samples in the statistics
      ++num_samples;
      // Update the statistics with this measurement
      update_stats(&stats_clock, times[tnum]->clock, num_samples, tinfo.num_iterations, 1);
      update_stats(&stats_user, times[tnum]->user, num_samples, tinfo.num_iterations, clocks_to_usec);
      update_stats(&stats_sys, times[tnum]->sys, num_samples, tinfo.num_iterations, clocks_to_usec);
      // And clean it up
      free(times[tnum]);
    }
  }

  // Clean up the thread creation attributes
  s = pthread_attr_destroy(&attr);
  if (s != 0) {
    printf("Error cleaning up thread attributes\n");
    return -5;
  }

  // Finish the calculation of the standard deviation
  stats_clock.stddev = sqrt(stats_clock.stddev / (num_samples - 1));
  stats_user.stddev = sqrt(stats_user.stddev / (num_samples - 1));
  stats_sys.stddev = sqrt(stats_sys.stddev / (num_samples - 1));

  // Print out the statistics of the times
  print_stats("gettimeofday_per_iteration", &stats_clock);
  print_stats("utime_per_iteration", &stats_user);
  print_stats("stime_per_iteration", &stats_sys);

  // Clean up the allocated threads and times
  free(threads);
  free(times);

  return 0;
}

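I reran the tests on a Dell Vostro 200 (dual-core CPU) with several different OS versions. I realize that several of these have different patches applied and are not "pure kernel code", but this was the simplest way to run the tests on different kernel versions and compare them. I generated the plots with gnuplot and have included the version from the bugzilla about this issue.

All of these tests were run with the following script and this command: ./run_test 1000 10 1000 250 8 6 <os_name>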

#!/bin/bash

if [ $# -ne 7 ]; then
  echo "Usage: `basename $0` <sleep_time> <outer_iterations> <inner_iterations> <work_size> <max_num_threads> <max_sleep_type> <test_name>"
  echo "  max_num_threads: The highest value used for num_threads in the results"
  echo "  max_sleep_type: The highest value used for sleep_type in the results"
  echo "  test_name: The name of the directory where the results will be stored"
  exit -1
fi

sleep_time=$1
outer_iterations=$2
inner_iterations=$3
work_size=$4
max_num_threads=$5
max_sleep_type=$6
test_name=$7

# Make sure this results directory doesn't already exist
if [ -e $test_name ]; then
  echo "$test_name already exists";
  exit -1;
fi
# Create the directory to put the results in
mkdir $test_name
# Run through the requested number of SLEEP_TYPE values
for i in $(seq 0 $max_sleep_type)
do
  # Run through the requested number of threads
  for j in $(seq 1 $max_num_threads)
  do
    # Print which settings are about to be run
    echo "sleep_type: $i num_threads: $j"
    # Run the test and save it to the results file
    ./test_sleep $sleep_time $outer_iterations $inner_iterations $work_size $j $i >> "$test_name/results_$i.txt"
  done
done

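Here is a summary of what I observed. I will compare them in pairs this time because I think that is a bit more informative.

CentOS 5.6 vs CentOS 6.2

The wall-clock time (gettimeofday) per iteration varies more on CentOS 5.6 than on 6.2, but this makes sense, since CFS should do a better job of giving the processes equal CPU time, which yields more consistent results. It is also pretty clear that CentOS 6.2 is more accurate and consistent in the amount of time it sleeps with the different sleep mechanisms.

[Plots: gettimeofday CentOS 5.6; gettimeofday CentOS 6.2]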

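The "penalty" is definitely evident on 6.2 with a low number of threads (visible in the gettimeofday and user time plots), but it seems to shrink with a higher number of threads (the user time measurements are quite coarse, so the difference in user time should be read with caution).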

[Plots: utime CentOS 5.6; utime CentOS 6.2]

The system time plots show that the sleep mechanisms on 6.2 consume more system time than they did on 5.6, which matches the earlier results from the simple test that just calls select.

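[Plots: stime CentOS 5.6; stime CentOS 6.2]

Something I believe is worth noting is that using sched_yield() does not induce the same penalty seen with the sleep methods. My conclusion from this is that the scheduler itself is not the source of the issue; rather, the issue is the interaction of the sleep methods with the scheduler.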

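Ubuntu 7.10 vs Ubuntu 8.04-4

The difference in kernel version between these two is smaller than between CentOS 5.6 and 6.2, but they still span the period when CFS was introduced. The first interesting result is that select and poll seem to be the only sleep mechanisms that have the "penalty" on 8.04, and that the penalty persists to a higher number of threads than it did on CentOS 6.2.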

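[Plots: gettimeofday Ubuntu 7.10; gettimeofday Ubuntu 8.04-4]

The user time for select and poll on Ubuntu 7.10 is unreasonably low, so this appears to be some sort of accounting issue that existed at the time, but I believe it is not relevant to the current issue/discussion.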

[Plots: utime Ubuntu 7.10; utime Ubuntu 8.04-4]

The system time does seem to be higher on Ubuntu 8.04 than on Ubuntu 7.10, but the difference is far less distinct than what was seen with CentOS 5.6 vs 6.2.

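[Plots: stime Ubuntu 7.10; stime Ubuntu 8.04-4]

Notes on Ubuntu 11.10 and Ubuntu 12.04

The first thing to note here is that the plots for Ubuntu 12.04 were comparable to those for 11.10, so they are not shown, to avoid unnecessary redundancy.

Overall, the plots for Ubuntu 11.10 show the same sort of trend observed with CentOS 6.2 (which indicates that this is a general kernel issue and not just a RHEL issue). The one exception is that the system time appears to be a bit higher on Ubuntu 11.10 than on CentOS 6.2, but the resolution of this measurement is very coarse, so I think any conclusion other than "it appears to be a bit higher" would be stepping onto thin ice.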

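Ubuntu 11.10 vs Ubuntu 11.10 with BFS

A PPA that uses BFS with the Ubuntu kernel can be found at https://launchpad.net/~chogydan/+archive/ppa, and it was installed to generate this comparison. I could not find an easy way to run CentOS 6.2 with BFS, so I went with this comparison; since the Ubuntu 11.10 results compare so well with CentOS 6.2, I believe it is a fair and meaningful comparison.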

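[Plots: gettimeofday Ubuntu 11.10; gettimeofday Ubuntu 11.10 with BFS]

The main point of note is that with BFS only select and nanosleep induce the "penalty" at low thread counts, but BFS seems to induce a "penalty" similar to (if not greater than) that seen with CFS at higher thread counts.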

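[Plots: utime Ubuntu 11.10; utime Ubuntu 11.10 with BFS]

The other interesting point is that the system time appears to be lower with BFS than with CFS. Once again, this is starting to step onto thin ice because of the coarseness of the data, but some difference does appear to be present, and this result matches the simple 50-process select loop test, which showed less CPU usage with BFS than with CFS.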

[Plots: stime Ubuntu 11.10; stime Ubuntu 11.10 with BFS]

The conclusion I draw from these two points is that BFS does not solve the problem, but it at least seems to reduce its effects in some areas.

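Conclusion

As previously stated, I do not believe this is an issue with the scheduler itself, but with the interaction between the sleep mechanisms and the scheduler. I consider this increased CPU usage in processes that should be sleeping and using little to no CPU to be a regression from CentOS 5.6 and a major hurdle for any program that wants to use an event loop or a polling style of mechanism.

Is there any other data I can get or tests I can run to help further diagnose the problem?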

Update on June 29, 2012

I simplified the test program a bit; it can be found here (the post was starting to exceed the length limit, so it had to be moved).

Nils · Answer

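According to the SLES 11 SP2 release notes, this could be a change introduced in the way CFS is implemented.

Since SLES 11 SP2 is the current SLES version, this behavior is still valid (as it seems to be for all 3.x kernels).

This change was intentional, but it might have "bad" side effects. Perhaps one of the described workarounds will help you...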