Multithreading – Writing to arrays with OmniThreadLibrary is slower in parallel than in serial
I am working on an implementation of the differential evolution optimization algorithm and want to reduce the computation time by evaluating the population members in parallel.
I have reduced the code to its essentials to test the parallelization, and the reduced version shows the same problem: the parallel version is slower than the serial version.
The key point is that I pass in several dynamic arrays, and an output must be written to each of them for every population member. Each array has one dimension dedicated to the population members, so each member accesses a different set of array indexes. This also means that in the parallel implementation no two threads write to the same array element.
Below is the code I used to test (the actual differential evolution code has a DoWork procedure with more const parameters and var arrays):
unit Unit1;
interface
type
TGoalFunction = reference to function(const X,B: array of extended): extended;
TArrayExtended1D = array of extended;
TArrayExtended2D = array of TArrayExtended1D;
TClassToTest = class abstract
private
class procedure DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex,AIndex2: integer);
public
class procedure RunSerial;
class procedure RunParallel;
end;
function HyperSphere(const X,B: array of extended): extended;
const
DIMENSION1 = 5000;
DIMENSION2 = 5000;
LOOPS = 10;
implementation
uses
OtlParallel;
function HyperSphere(const X,B: array of extended): extended;
var
I: Integer;
begin
Result := 0;
for I := 0 to Length(X) - 1 do
Result := Result + X[I]*X[I];
end;
{ TClassToTest }
class procedure TClassToTest.DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex,AIndex2: integer);
var
I: Integer;
begin
AOutputArray1[AIndex] := AGoalFunction(AInputArray[AIndex],[]);
for I := 0 to Length(AOutputArray2[AIndex]) - 1 do
AOutputArray2[AIndex,I] := Random*AIndex2;
end;
class procedure TClassToTest.RunParallel;
var
LGoalFunction: TGoalFunction;
LInputArray: TArrayExtended2D;
LOutputArray1: TArrayExtended1D;
LOutputArray2: TArrayExtended2D;
I,J,K: Integer;
begin
SetLength(LInputArray,DIMENSION1,DIMENSION2);
for I := 0 to DIMENSION1 - 1 do
begin
for J := 0 to DIMENSION2 - 1 do
LInputArray[I,J] := Random;
end;
SetLength(LOutputArray1,DIMENSION1);
SetLength(LOutputArray2,DIMENSION1,DIMENSION2); // one row per member, same shape as LInputArray
LGoalFunction := HyperSphere;
for I := 0 to LOOPS - 1 do
begin
Parallel.ForEach(0,DIMENSION1 - 1).Execute(
procedure (const value: integer)
begin
DoWork(LGoalFunction,LInputArray,LOutputArray1,LOutputArray2,value,I);
end
);
for J := 0 to DIMENSION1 - 1 do
begin
for K := 0 to DIMENSION2 - 1 do
LInputArray[J,K] := LOutputArray2[J,K];
end;
end;
end;
class procedure TClassToTest.RunSerial;
var
LGoalFunction: TGoalFunction;
LInputArray: TArrayExtended2D;
LOutputArray1: TArrayExtended1D;
LOutputArray2: TArrayExtended2D;
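I,J,K: Integer;
begin
// Same setup as RunParallel: fill the input with random data
SetLength(LInputArray,DIMENSION1,DIMENSION2);
for I := 0 to DIMENSION1 - 1 do
begin
for J := 0 to DIMENSION2 - 1 do
LInputArray[I,J] := Random;
end;
SetLength(LOutputArray1,DIMENSION1);
SetLength(LOutputArray2,DIMENSION1,DIMENSION2);
LGoalFunction := HyperSphere;
for I := 0 to LOOPS - 1 do
begin
// Process every member in turn on the calling thread
for J := 0 to DIMENSION1 - 1 do
begin
DoWork(LGoalFunction,LInputArray,LOutputArray1,LOutputArray2,J,I);
end;
// Feed this loop's output back in as the next loop's input
for J := 0 to DIMENSION1 - 1 do
begin
for K := 0 to DIMENSION2 - 1 do
LInputArray[J,K] := LOutputArray2[J,K];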
end;
end;
end;
end.
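I was expecting a speedup of about 6x on my 8-core processor, but I am seeing a slight slowdown instead. What should I change to make the DoWork procedure run faster in parallel?
Note that I would prefer to keep the actual work in the DoWork procedure, because I have to be able to call the same algorithm with and without parallelization (a Boolean flag) while keeping the body of the code shared for ease of maintenance.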
Solution
This is due to Random's lack of thread safety. Its implementation is:
// global var
var
RandSeed: Longint = 0; { Base for random number generator }
function Random: Extended;
const
two2neg32: double = ((1.0/$10000) / $10000); // 2^-32
var
Temp: Longint;
F: Extended;
begin
Temp := RandSeed * $08088405 + 1;
RandSeed := Temp;
F := Int64(Cardinal(Temp));
Result := F * two2neg32;
end;
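Because RandSeed is a global variable that is modified by every call to Random, your threads end up performing contended writes to RandSeed. Those contended writes cause your performance problem. They effectively serialize your parallel code, severely enough to make it slower than the true serial code.
Add the following code to the top of the implementation section of your unit, and you will see the difference: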
threadvar
RandSeed: Longint;
function Random: Double;
const
two2neg32: double = ((1.0/$10000) / $10000); // 2^-32
var
Temp: Longint;
F: Double;
begin
Temp := RandSeed * $08088405 + 1;
RandSeed := Temp;
F := Int64(Cardinal(Temp));
Result := F * two2neg32;
end;
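With this change, which avoids the shared, contended writes, you will find that the parallel version is faster, as expected. You will not get linear scaling with the processor count. My guess is that this is because your memory access pattern is sub-optimal in the parallel version of the code.
I am guessing that you were only using Random as a means of generating some data. But if you really do need an RNG, you will need to arrange for each task to use its own private RNG instance.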
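A minimal sketch of what a private RNG per task could look like, assuming the same linear congruential step as the RTL's Random shown above. TLocalRng, Init and Next are hypothetical names for illustration, not part of the RTL or OmniThreadLibrary, and the seed values are arbitrary. Unlike a threadvar, which cannot be given an initial value in its declaration, each instance here is seeded explicitly:

```pascal
program PerTaskRngSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical helper: one instance per task, so no state is shared.
  TLocalRng = record
    Seed: Cardinal;
    procedure Init(ASeed: Cardinal);
    function Next: Double;
  end;

procedure TLocalRng.Init(ASeed: Cardinal);
begin
  Seed := ASeed;
end;

function TLocalRng.Next: Double;
const
  Two2Neg32: Double = (1.0 / $10000) / $10000; // 2^-32
begin
  // Same LCG step as the RTL's Random, computed in 64 bits and masked back
  // to 32 bits so it behaves the same with overflow checking on or off.
  Seed := Cardinal((UInt64(Seed) * $08088405 + 1) and $FFFFFFFF);
  Result := Seed * Two2Neg32; // uniform in [0, 1)
end;

var
  RngA, RngB: TLocalRng;
  I: Integer;
begin
  // Each "task" owns its own instance with its own (arbitrary) seed.
  RngA.Init(1);
  RngB.Init(2);
  for I := 1 to 3 do
    Writeln(RngA.Next:0:6, '  ', RngB.Next:0:6);
end.
```

One way to apply this would be to pass such a record to DoWork in place of the calls to the global Random. How the instances are handed out to tasks depends on the threading library and is not shown here.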
You can also speed up the code by using Sqr(X) rather than X * X, and by switching to Double instead of Extended.
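As a rough illustration of those two suggestions, here is HyperSphere rewritten with Double and Sqr. This is a sketch only: it keeps the unused B parameter so the shape matches the question's goal function, and the other types in the question (TGoalFunction, TArrayExtended1D, TArrayExtended2D) would need the same Extended-to-Double change, which is not shown:

```pascal
program HyperSphereDoubleSketch;

{$APPTYPE CONSOLE}

// Sum of squares using Double instead of Extended and Sqr instead of X*X.
function HyperSphere(const X, B: array of Double): Double;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(X) do
    Result := Result + Sqr(X[I]);
end;

begin
  // 1 + 4 + 9 = 14
  Writeln(HyperSphere([1.0, 2.0, 3.0], []):0:1);
end.
```

Any gain from either change should be measured on the target compiler and platform rather than assumed.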
