Experiment notes
[[MESI-CUDA]]
**caviar2 environment [#k4b91a3b]
Output of deviceQuery, pasted as-is:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static l...
Detected 2 CUDA Capable device(s)
Device 0: "GeForce GTX 680"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4095 MBy...
( 8) Multiprocessors x (192) CUDA Cores/MP: 1536 CUD...
GPU Clock rate: 1058 MHz...
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 b...
Max Texture Dimension Size (x,y,z) 1D=(6553...
Max Layered Texture Size (dim) x layers 1D=(1638...
Total amount of constant memory: 65536 by...
Total amount of shared memory per block: 49152 by...
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1...
Maximum sizes of each dimension of a grid: 21474836...
Maximum memory pitch: 21474836...
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with...
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDe...
Device 1: "Tesla C2075"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBy...
(14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA...
GPU Clock rate: 1147 MHz...
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 b...
Max Texture Dimension Size (x,y,z) 1D=(6553...
Max Layered Texture Size (dim) x layers 1D=(1638...
Total amount of constant memory: 65536 by...
Total amount of shared memory per block: 49152 by...
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1...
Maximum sizes of each dimension of a grid: 65535 x ...
Maximum memory pitch: 21474836...
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with...
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDe...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = ...
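As a rough cross-check of the two devices listed above, the theoretical peak memory bandwidth can be estimated from the reported memory clock and bus width. This is a minimal sketch that assumes the usual double-data-rate formula (clock × 2 × bus width / 8). The function name is hypothetical, and the result is an upper bound rather than a measured value.

```python
def peak_bandwidth_gb_s(mem_clock_mhz, bus_width_bits):
    """Theoretical peak bandwidth in GB/s, assuming double data rate."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * 2 * bytes_per_transfer / 1e9

# (memory clock in MHz, bus width in bits) as reported by deviceQuery above.
devices = {"GeForce GTX 680": (3004, 256), "Tesla C2075": (1566, 384)}
for name, (clk, width) in devices.items():
    print(f"{name}: {peak_bandwidth_gb_s(clk, width):.1f} GB/s")
```

These figures only bound device-memory traffic; they say nothing about host-to-device or peer-to-peer transfers over PCIe.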
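**Matrix multiplication execution time [#w6543136]
This section is meant to summarize the execution times for the GTX680, the C2075, and both used together.
The program used is based on the sample imul-cuda-opt.cu. Each with LOO...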
GTX680
|1024|2048|4096|
|0.372563 sec|2.697491 sec|23.334252 sec|
Tesla C2075
|1024|2048|4096|
|0.551547 sec|3.598030 sec|37.572638 sec|
For the matrix product C=A×B, A is split in half between the GPUs and transferred...
B is transferred in full, and C is also transferred half to each.
GTX680+C2075
|1024|2048|4096|
|0.398800 sec|2.465314 sec|26.126019 sec|
Changing the order in which the code for each GPU is executed changed the execution time considerably, so...
C2075+GTX680
|1024|2048|4096|
|0.299405 sec|1.866932 sec|21.175290 sec|
As a trial, a version that assigns a different amount of data to each GPU:
C2075:GTX680=3:5
|1024|2048|4096|
|0.262415 sec|1.887538 sec|19.437995 sec|
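The single-GPU times above suggest how unevenly the work should be split. The following is a minimal sketch, assuming the split should be proportional to each device's speed (1 / single-GPU time) and ignoring transfer and launch overhead. The function name is hypothetical.

```python
# Single-GPU times in seconds for N=4096, copied from the tables above.
t_gtx680 = 23.334252
t_c2075 = 37.572638

def proportional_shares(t_a, t_b):
    """Split work in proportion to each device's speed (1 / time)."""
    speed_a, speed_b = 1.0 / t_a, 1.0 / t_b
    total = speed_a + speed_b
    return speed_a / total, speed_b / total

share_gtx680, share_c2075 = proportional_shares(t_gtx680, t_c2075)
print(f"GTX680 {share_gtx680:.3f}, C2075 {share_c2075:.3f}")
# Finish time if both devices ended together with no overhead at all.
print(f"ideal time {t_gtx680 * share_gtx680:.2f} sec")
```

Under these assumptions the C2075 share comes out at roughly 0.38, close to the C2075:GTX680=3:5 split (0.375) tried above.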
**Transfer speed measurements [#kba1298c]
cudaMemcpy,cudaMemcpyAsync,cudaMemcpyPeer,cudaMemcpyPeerA...
With cudaMemcpy and cudaMemcpyAsync, from each device's memory...
cudaMemcpy
|1024|2048|4096|
|0.018957 sec|0.075174 sec|0.299940 sec|
cudaMemcpyAsync(1 stream)
|1024|2048|4096|
|0.012623 sec|0.050099 sec|0.199967 sec|
cudaMemcpyAsync(n stream)
|1024|2048|4096|
|0.048603 sec|0.111680 sec|0.402752 sec|
cudaMemcpyPeer
|1024|2048|4096|
|0.000554 sec|0.001848 sec|0.109166 sec|
cudaMemcpyPeerAsync(1 stream)
|1024|2048|4096|
|0.058816 sec|0.214505 sec|0.735079 sec|
cudaMemcpyPeerAsync(n stream)
|1024|2048|4096|
|0.419530 sec|0.996495 sec|2.220573 sec|
Things I noticed
Regarding the difference in transfer time between cudaMemcpy and cudaMemcpyAsync: device 0 and...
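One way to read the transfer tables is to check how the time grows from one column to the next. The sketch below assumes the column headers are the matrix dimension N, so the data volume grows by 4x per column. That interpretation is an assumption, since the transferred size is not stated in the visible text. The helper name is hypothetical.

```python
# Transfer times in seconds for the columns 1024, 2048, 4096 (from the tables above).
times = {
    "cudaMemcpy": [0.018957, 0.075174, 0.299940],
    "cudaMemcpyAsync(1 stream)": [0.012623, 0.050099, 0.199967],
    "cudaMemcpyPeer": [0.000554, 0.001848, 0.109166],
}

def growth_factors(values):
    """Ratio of each measurement to the one in the previous column."""
    return [later / earlier for earlier, later in zip(values, values[1:])]

for name, t in times.items():
    factors = ", ".join(f"{f:.2f}" for f in growth_factors(t))
    print(f"{name}: {factors}")
```

The cudaMemcpy and cudaMemcpyAsync rows stay close to a factor of 4, while cudaMemcpyPeer grows far more than 4x between 2048 and 4096, which marks that measurement as worth a second look.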
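**Overlapping kernel execution [#i8f8b98c]
A single device can execute at most 16 kernel launches concurrently.
Using streams, kernel executions are overlapped...
Execution time of the matrix product of N*N square matrices
device0:GTX680, device1:C2075. The region computed by each device is 0...
With N=2048, divisions=16, LOOP=10, threads=512, the stream...
|Split\Streams|1|2|4|8|16|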
| 0/16|13.765975 sec| 7.284961 sec|4.166925 sec|2.862901 ...
| 1/16|12.894443 sec| 6.821152 sec|3.878094 sec|2.674344 ...
| 2/16|12.045507 sec| 6.372051 sec|3.636947 sec|2.490905 ...
| 3/16|11.178205 sec| 5.922370 sec|3.392846 sec|2.340465 ...
| 4/16|10.322463 sec| 5.473254 sec|3.129212 sec|2.145972 ...
| 5/16| 9.471967 sec| 5.023666 sec|2.859293 sec|1.985736 ...
| 6/16| 8.670135 sec| 4.585810 sec|2.708098 sec|1.806416 ...
| 7/16|10.097119 sec| 5.333367 sec|3.165185 sec|1.956888 ...
| 8/16|11.525391 sec| 6.081710 sec|3.601718 sec|2.236722 ...
| 9/16|12.950572 sec| 6.825689 sec|4.036854 sec|2.504258 ...
|10/16|14.376288 sec| 7.571741 sec|4.447949 sec|2.766733 ...
|11/16|15.802662 sec| 8.316296 sec|4.875347 sec|3.027774 ...
|12/16|17.229950 sec| 9.062237 sec|5.314387 sec|3.289930 ...
|13/16|18.655763 sec| 9.807283 sec|5.732227 sec|3.550273 ...
|14/16|20.083197 sec|10.553235 sec|6.161599 sec|3.813419 ...
|15/16|21.509035 sec|11.297994 sec|6.615043 sec|4.075671 ...
|16/16|22.911528 sec|12.018976 sec|6.977202 sec|4.313314 ...
For every stream count, the 6/16 split (GTX680:C2075=10:6)...
As the number of streams increases, the execution time gets shorter. This is...
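The effect of the stream count can be quantified as a speedup over the single-stream run. The sketch below uses the 6/16 row of the table above. The 16-stream column is cut off in this copy, so it is left out. The function name is hypothetical.

```python
streams = [1, 2, 4, 8]
row_6_16 = [8.670135, 4.585810, 2.708098, 1.806416]  # seconds, row 6/16

def speedups(times):
    """Speedup of each measurement relative to the first (1 stream)."""
    base = times[0]
    return [base / t for t in times]

for s, sp in zip(streams, speedups(row_6_16)):
    # Efficiency is the speedup divided by the stream count.
    print(f"{s} stream(s): x{sp:.2f} speedup, efficiency {sp / s:.2f}")
```

The speedup keeps increasing up to 8 streams, but the efficiency per stream drops, so each doubling of the stream count buys less than the one before.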
**Matrix multiplication work ratio (12/5 version) [#b83eecc1]
device0:GTX680, device1:C2075. The region computed by each device is 0...
With divisions=16, LOOP=10, threads=512, streams=8, the matrix...
|Split\N|1024|2048|4096|
| 0/16|0.553955 sec|2.845587 sec|22.922292 sec|
| 1/16|0.500680 sec|2.673095 sec|21.504184 sec|
| 2/16|0.478309 sec|2.507882 sec|20.093751 sec|
| 3/16|0.438574 sec|2.326454 sec|18.736978 sec|
| 4/16|0.408120 sec|2.131195 sec|17.348398 sec|
| 5/16|0.373951 sec|1.986511 sec|15.924195 sec|
| 6/16|0.340626 sec|1.792023 sec|14.513475 sec|
| 7/16|0.387999 sec|1.954448 sec|16.880663 sec|
| 8/16|0.439925 sec|2.234612 sec|19.230996 sec|
| 9/16|0.490862 sec|2.501240 sec|21.590972 sec|
|10/16|0.543095 sec|2.763124 sec|23.940428 sec|
|11/16|0.594523 sec|3.022979 sec|26.304947 sec|
|12/16|0.646186 sec|3.286041 sec|28.661486 sec|
|13/16|0.697806 sec|3.547687 sec|31.024724 sec|
|14/16|0.749417 sec|3.809034 sec|33.382685 sec|
|15/16|0.800985 sec|4.070498 sec|35.735338 sec|
|16/16|0.844743 sec|4.308685 sec|37.993399 sec|
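In every case, the 6/16 split (GTX680:C2075=10:6) is the fastest.
With the device execution order reversed, likewise GTX680:C2075=10:6...
From the program with the device execution order reversed relative to the one above, ...
|Split\N|1024|2048|4096|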
|10/16|0.341283 sec|1.820906 sec|15.218417 sec|
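The location of the minimum can be reproduced with a simple model. This is a sketch that assumes k/16 is the share given to the C2075 (as the 6/16 = 10:6 remark indicates), that each device's time scales linearly with its share, and that the slower side determines the total. The constants come from the 0/16 and 16/16 rows for N=4096. The names are hypothetical.

```python
# Times for N=4096 with all work on one device, from the first table of this section.
T_GTX680_ONLY = 22.922292   # row 0/16
T_C2075_ONLY = 37.993399    # row 16/16
DIV = 16

def predicted_time(k):
    """Both devices run in parallel; the slower side sets the total."""
    return max((DIV - k) / DIV * T_GTX680_ONLY, k / DIV * T_C2075_ONLY)

best_k = min(range(DIV + 1), key=predicted_time)
print(f"best split: {best_k}/{DIV} on C2075")
for k in (5, 6, 7):
    print(f"{k}/{DIV}: {predicted_time(k):.2f} sec")
```

The model also picks 6/16, and its predictions for 5/16 to 7/16 land within a few tenths of a second of the measured 15.92, 14.51 and 16.88 sec.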
**Dynamic work distribution (12/5 version) [#c7eb0e1d]
(24o,gpu0=0,gpu1=1)
Number of streams
size=4096, LOOP=10, DIV=16,ST=1~16
|ST|1|2|4|8|16|
|TIME|40.931367 sec|23.270021 sec|16.452349 sec|15.843703...
||5/16|5/16|5/16|5/16|5/16|
Overlap of kernel execution is device-dependent. At least this...
Number of divisions
size=4096, LOOP=10, ST=8, DIV=16-2048
|DIV|16|128|512|1024|2048|
|TIME|15.853080 sec|14.605018 sec|14.230065 sec|13.980051...
||5/16|47-48/128|196-197/512|403-407/1024|729-754/2048|
With finer division, the overlap of kernel execution between devices...
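The effect of the division count on dynamic distribution can be illustrated with an idealized simulation. This is only a sketch, not the program measured here. It assumes equal-sized chunks, per-chunk times derived from the single-device N=4096 times of the static-split section, no transfer or launch overhead, and a scheduler that hands the next chunk to whichever device becomes free first. All names are hypothetical.

```python
import heapq

T_GTX680_ONLY = 22.922292  # seconds, all work on GTX680 (N=4096)
T_C2075_ONLY = 37.993399   # seconds, all work on C2075 (N=4096)

def simulate(div):
    """Hand each of `div` equal chunks to whichever device is free first."""
    chunk_time = [T_GTX680_ONLY / div, T_C2075_ONLY / div]
    counts = [0, 0]
    free_at = [(0.0, 0), (0.0, 1)]  # (time the device becomes free, device id)
    heapq.heapify(free_at)
    for _ in range(div):
        t, dev = heapq.heappop(free_at)
        counts[dev] += 1
        heapq.heappush(free_at, (t + chunk_time[dev], dev))
    makespan = max(t for t, _ in free_at)
    return counts, makespan

for div in (16, 128, 1024):
    counts, makespan = simulate(div)
    print(f"DIV={div}: C2075 got {counts[1]}/{div}, finish {makespan:.2f} sec")
```

In this model, finer chunks let the per-device counts track the speed ratio more closely. The measured shares in the table differ somewhat from the idealized ones, which is expected given that the overheads are ignored.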
Size
LOOP=10, DIV=1024, ST=8, N=1024-8192
|N|1024|2048|4096|8192|
|TIME|0.449891 sec|1.976305 sec|13.985616 sec|136.465634 ...
||294-340/1024|360-381/1024|404-405/1024|385-388/1024|
The weight of the kernel execution for each divided piece of data changes with the size...
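To compare the recorded allocations across sizes, the "lo-hi/total" cells can be converted to fractions. The sketch below only parses the values from the table above. Which device the count refers to is not stated in the visible text, so the output is labeled simply as a share. The helper name is hypothetical.

```python
import re

# Allocation cells for each N, copied from the third row of the table above.
recorded = {
    1024: "294-340/1024",
    2048: "360-381/1024",
    4096: "404-405/1024",
    8192: "385-388/1024",
}

def share_range(cell):
    """Parse 'lo-hi/total' (or 'k/total') into (lo/total, hi/total)."""
    m = re.fullmatch(r"(\d+)(?:-(\d+))?/(\d+)", cell)
    lo, hi, total = int(m.group(1)), int(m.group(2) or m.group(1)), int(m.group(3))
    return lo / total, hi / total

for n, cell in recorded.items():
    lo, hi = share_range(cell)
    print(f"N={n}: share {lo:.3f} to {hi:.3f} (spread {hi - lo:.3f})")
```

The spread is widest at N=1024 and much narrower at N=4096 and N=8192, so the allocation varies more between runs for the small matrix.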
**memo [#id418430]
[[HowTo]]
[[覚え書き]]
[[古いデータ]]