Making GAP code thread safe

Max Horn edited this page Jul 3, 2017 · 1 revision

Making the GAP code thread-safe

Code is thread-safe if it can be safely executed by multiple threads at the same time. On this page we collect guidelines, hints and examples accumulated while working on thread safety in the GAP library. They should also be useful for preparing other GAP code (ranging from GAP packages to simple scripts) to run under HPC-GAP. We envisage that once the GAP library is thread-safe, it will be reasonable to expect existing GAP 4 code that is not too intricate to work in a main execution thread (in this case, even if GAP starts several user interface threads, the actual computation is performed in a single thread).

What has to be fixed?

The design of HPC-GAP already addresses many issues and takes care of backwards compatibility. As a consequence, most problems will fall into one of the following two categories:

  • A failed attempt to get read or write access to an object.
  • Reading or modifying an object when it is not safe to do so (so one or more threads may see a partially modified object).

The good news is that for most objects constructed by GAP (the range is deliberately left vague here, in case you are not yet familiar with the concept of regions) these problems should already be sorted out in HPC-GAP. For example, if you ask "if I construct the symmetric group S_3 in a thread, where will it live? Will it be thread-local or not?", the answer is that S_3 is an atomic component object and is therefore contained in the public region. In the following example, both S_3 and S_4 belong to the public region, although one was created in the main thread and the other in a background thread using a task.

gap> G:=SymmetricGroup(3);
Sym( [ 1 .. 3 ] )
gap> RegionOf(G);
<region: public region>
gap> t:=RunTask(function() G:=SymmetricGroup(4);end);;
gap> G;
Sym( [ 1 .. 4 ] )
gap> RegionOf(G);
<region: public region>

However, when one creates e.g. lists or records, they will belong to the thread-local region:

gap> RegionOf([1,2,3]);
<region: thread region #0>
gap> RegionOf(rec(a:=1));
<region: thread region #0>

To be accessible by other threads, they have to be explicitly migrated to a shared region, public region or read-only region.
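As a minimal sketch of such a migration, the following uses ShareObj, RunTask, TaskResult and the atomic statement that appear elsewhere on this page. The readwrite and readonly qualifiers and all variable names are illustrative assumptions. After ShareObj the list lives in a shared region, so every access, including from the thread that created it, has to happen inside an atomic statement.

```gap
# Illustrative sketch, not taken from the GAP library.
l := [ 1, 2, 3 ];;      # created in the current thread's region
ShareObj( l );;         # migrate l to a new shared region

# A task running in another thread may now modify l, provided it
# holds a lock on the region of l for the duration of the access.
t := RunTask( function()
    local n;
    atomic readwrite l do
        Add( l, 4 );
        n := Length( l );
    od;
    return n;
end );;

TaskResult( t );        # blocks until the task has finished

# The creating thread needs the lock as well, here only for reading.
atomic readonly l do
    Print( l[4], "\n" );
od;
```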

Global variables that hold objects in the main thread's thread-local region can be listed by running the following code after starting GAP:

Filtered( NamesSystemGVars(), x -> 
  IsBoundGlobal(x) and 
  IsThreadLocal(ValueGlobal(x)) and 
  not IsThreadLocalGVar(x) );
Length(last);

These variables may cause problems in a multithreaded environment because they cannot be accessed from other threads, so one of the tasks is to fix this by making them read-only, atomic or shared, depending on their usage.
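The three options could look roughly as follows. The global names are hypothetical and chosen only for illustration. MakeReadOnlyObj and AtomicList are named here as we understand the HPC-GAP primitives, so check the HPC-GAP reference manual for the exact names and behaviour in your version.

```gap
# Hypothetical package globals, for illustration only.

# Data that is filled once and never changed afterwards:
# move it into the read-only region.
MY_PRIME_TABLE := MakeReadOnlyObj( [ 2, 3, 5, 7, 11 ] );;

# A list whose entries are read and replaced one at a time:
# an atomic list needs no explicit locking for single-entry access.
MY_FLAGS := AtomicList( [ false, false, false ] );;
MY_FLAGS[2] := true;;

# A structure that needs compound updates: share it and
# guard every access with an atomic statement.
MY_QUEUE := ShareObj( [] );;
atomic readwrite MY_QUEUE do
    Add( MY_QUEUE, 1 );
od;
```

Which option fits depends on how the variable is used: read-only is the cheapest when the data never changes, while a shared region gives the most freedom at the cost of explicit locking.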

Note, however, that the search over NamesSystemGVars() is not exhaustive. For example, the following piece of code from ffeconway.gi used the fam!.ZCache component to store already known primitive elements of finite fields of characteristic p:

FFECONWAY.ZNC := function(p,d)
    local   fam,  zc,  v;
    fam := FFEFamily(p);
    if not IsBound(fam!.ZCache) then
        fam!.ZCache := [];
    fi;
    zc := fam!.ZCache;
    ...
    return zc[d];
end;

Because fam!.ZCache was not properly shared, a runtime error occurred as soon as another thread tried to access it, while the bug stayed hidden in single-threaded mode. The fix nevertheless required changing just one line, the assignment fam!.ZCache := [], to

        fam!.ZCache := MakeWriteOnceAtomic([]);
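To illustrate why a write-once atomic list suits a cache like this, here is a small stand-alone sketch. The names cache and CachedSquare are hypothetical and this is not the code from ffeconway.gi. It assumes that, under the write-once policy, the first assignment to an entry wins and later assignments to the same entry do not replace it.

```gap
# Hypothetical stand-alone cache, for illustration only.
cache := MakeWriteOnceAtomic( [] );;

# Hypothetical helper: return the cached value for d, computing it on demand.
CachedSquare := function( d )
    if not IsBound( cache[d] ) then
        # If two threads race here, both compute the same value, and the
        # assumed write-once policy keeps whichever assignment comes first.
        cache[d] := d^2;
    fi;
    return cache[d];
end;;

# The cache can be filled from a task and read from any other thread.
t := RunTask( CachedSquare, 5 );;
TaskResult( t );
CachedSquare( 5 );
```

This pattern only works when every thread would compute the same value for a given entry, which is the case for a cache of deterministic results.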

Situations like this one are harder to detect, but quite often they involve the !. syntax, which is used in several thousand lines of the GAP library and in many more lines of GAP packages. Inspecting these lines systematically would reveal more cases to be fixed.

Finally, beware that a package may use global parameters that are immutable objects such as booleans or integers. These belong to the public region, so they can be accessed and reassigned from any thread:

gap> PARAM:=1;;
gap> t:=RunTask(function() PARAM:=2;end);;
gap> PARAM;
2

They will not be reported by the script above, so one should check whether it is safe to allow their modification from any thread or whether they should be made thread-local.
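If such a parameter should be thread-local, a sketch along the following lines may help. PARAM is the hypothetical parameter from the example above. The sketch assumes that BindThreadLocal declares a thread-local global variable and sets the default value each thread starts with; check the HPC-GAP reference manual for the details.

```gap
# PARAM is a hypothetical package parameter, used only for illustration.
# Assumption: BindThreadLocal makes PARAM thread-local and gives every
# thread the default value 1.
BindThreadLocal( "PARAM", 1 );

t := RunTask( function()
    PARAM := 2;     # should change only the copy of the thread running the task
    return PARAM;
end );;
TaskResult( t );

# The copy seen by the current thread, which the task should not have touched.
PARAM;
```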

What may happen?

If you run a computation in the main execution thread (i.e. the one you interact with from the GAP prompt after HPC-GAP is loaded) and get an access error, one possible reason is that some variable has already been shared and lies in a shared region, but the code accessing it is missing an atomic statement. For example,

gap> l:=[1,2,3];          
[ 1, 2, 3 ]
gap> ShareObj(l);
<obj 4406326112 inaccessible in region: 0x106aaa540>
gap> l[1];
Error, Attempt to read object 4406326112 of type list (plain,cyc) without havin\
g read access
not in any function at line 4 of stream
brk> quit;
gap> atomic l do Print(l[1],"\n");od;
1

Below are some more examples of error messages. The next error has already been fixed by making LETTER_WORD_EREP_CACHE and LETTER_WORD_EREP_CACHEVAL thread-local variables.

gap> TaskResult(RunTask(SmallGroup,[8,3])); 
Error, No read access to object 4398002176 of type list (plain,cyc)
in gen/vars.c, line 1419, function EvalElmList(), accessing list in
  if IsIdenticalObj( LETTER_WORD_EREP_CACHE[i], w )  then
    return LETTER_WORD_EREP_CACHEVAL[i];
fi; called from 
ERepLettWord( w ) called from
NumberSyllables( gens[i] ) called from
SingleCollectorByRelators( efam, gens, rels, conflicts ) called from
PolycyclicFactorGroupByRelators( ElementsFamily( FamilyObj( fgrp ) ), GeneratorsOfGroup( fgrp ), 
 rels ) called from
PolycyclicFactorGroup( FreeGroupOfFpGroup( F ), RelatorsOfFpGroup( F ) ) called from
...  at line 0 of *defin*
brk> OBJ_HANDLE(4398002176);
<obj 4398002176 inaccessible in region: thread region #0>
brk> UNSAFE_VIEW(OBJ_HANDLE(4398002176));
[ 1, 1, 1 ]

Another problem showed up because the type is used to store additional information. It was fixed at a more generic level, since extended types (i.e. types with some extra information stored and accessed via the !. notation) are used in many places:

gap> r:= GF(3);;
gap> m:= Group( (1,2,3), (1,2) );
Group([ (1,2,3), (1,2) ])
gap> rm:= FreeMagmaRing( r, m );
<algebra-with-one over GF(3), with 2 generators>
gap> membrm:= Embedding( m, rm );
Error, No write access to object 4384884976 of type object (positional)
in gen/vars.c, line 2108, function ExecAssPosObj(), accessing list in
  called from 
K![POS_DATA_TYPE] := data;SetDataType( Type, [ source, range ] ); called from
TypeOfDefaultGeneralMapping( M, RM, IsEmbeddingMagmaMagmaRing ) called from
<function "unknown">( <arguments> )
called from read-eval loop at line 4 of stream
brk> K;
NewType( NewFamily( "GeneralMappingsFamily", [ 416 ], [ 117, 120, 123, 127, 131, 150, 416 ] ), [ 36, 41, 117, 120, 123, 127, 131, 150, 416, 417, 428, 429, 430, 431, 432, 433, 436, 437, 447, 448, 1870, 1900 ] )
brk>

Sometimes the problem is caused by an attempt to modify a read-only object in place, as ConvertToVectorRepNC does here:

gap> g:= DerivedSubgroup( SO( 1, 8, 4 ) );;
Error, Attempt to write object 4574785072 of type object (data) without having write access in
 CLONE_OBJ( v, vc ); called from 
ConvertToVectorRepNC( v, GF( Q_VEC8BIT( w ) ) ); called from
fc{k} - c * gc called from
QUOTREM_LAURPOLS_LISTS( fc, gc ) called from
QuotRemLaurpols( f, g, 4 ) called from
Quotient( k, g ) called from
...  at line 1 of *stdin*
brk>

or here:

gap> R := PolynomialRing(GF(4),1);
GF(2^2)[x_1]
gap> One(R);
Z(2)^0
gap> Z(4)*One(R);
Error, Attempt to write object 4437429984 of type object (data) without having \
write access in
  called from 
CLONE_OBJ( v, vc );ConvertToVectorRepNC( a, Field( b ) ); called from
coef * tmp[1] called from
<function "ProdCoeffUnivfunc">( <arguments> )
called from read-eval loop at line 3 of stream
brk>

These particular problems were resolved locally by using a shallow copy of an argument, but in the future we would like to get rid of in-place modifications (e.g. changing the representation of an object) completely.
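The copy-before-convert idea can be sketched as follows. CompressedCopy is a hypothetical helper written for illustration; it is not the actual patch applied to the GAP library.

```gap
# Hypothetical helper illustrating the copy-before-convert pattern.
CompressedCopy := function( v, F )
    local w;
    w := ShallowCopy( v );         # fresh mutable list owned by the current thread
    ConvertToVectorRepNC( w, F );  # changes the representation of the copy only
    return w;
end;;

# The argument itself is never modified, so it may be immutable or read-only.
v := MakeImmutable( [ Z(4), Z(4)^2, 0*Z(4) ] );;
w := CompressedCopy( v, GF(4) );;
```

The price is one extra copy per call, which is why removing in-place modifications altogether remains the longer-term goal.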

Further problems may occur when one tries to call a function in another thread (if GAP is started with the multi-threaded UI, you need to switch to the other thread to see the error message). For example:

gap> TaskResult(RunTask(Indeterminate,GF(13)));
Error, No read access to object 4649038464 of type object (data)
in gen/lists.c, line 200, function FuncLENGTH(), accessing list in
  called from 
if Length( cofs ) = 0  then
return String( zero );
fi;StringUnivariateLaurent( fam, cofs, val, name ) called from
DoPrintUnivariateLaurent( FamilyObj( f ), c[1], c[2], IndeterminateNumberOfLaurentPolynomial( f ) ); called from
SHELL( GetBottomLVars(  ), false, false, 3, true, prompt, function (  )
...

or

gap> task:=RunTask(ZmodnZ, 65537); TaskResult(task);
<obj 4394688816 inaccessible in region: 0x105fad540>
Error, No read access to object 4348079584 of type list (plain)
in gen/plist.c, line 663, function TypePlistWithKTnum(), accessing kinds in
 SortParallel( FAMS_FFE_LARGE[1], FAMS_FFE_LARGE[2] ); called from 
FFEFamily( p ) called from
ZmodpZNC( n ) called from
CALL_WITH_CATCH( taskdata.func, taskdata.args ) called from
<function "unknown">( <arguments> )
called from read-eval loop at line 0 of *defin*
brk> OBJ_HANDLE(4348079584);
<obj 4348079584 inaccessible in region: thread region #0>
brk> UNSAFE_VIEW(last);
[ NewType( NewFamily( "CollectionsFamily(...)", [ 55 ], [ 54, 55, 136 ] ), 
   [ 1, 2, 14, 15, 16, 25, 34, 54, 55, 66, 67, 87, 88, 89, 98, 101, 105, 109, 113, 136, 137, 138, 
     227 ] ),,,,,,,,,,,,,,,,,,,,,,,,,, 0 ]

(both already fixed).

See HPC-GAP-debugging-hints to find out how to investigate break loops similar to those shown above.
