Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge sifive_x280 with RISC-V base architecture support changes #3

Closed
wants to merge 25 commits into from

Conversation

leekillough
Copy link

This merges flame#693, flame#735, flame#736, flame#738 and flame#740 into flame#737, and resolves merge conflicts, which are mostly caused by #if/#endif scopes for sifive_x280 and rv32/rv32i/rv64/rv64i overlapping.

Aaron-Hutchinson and others added 14 commits March 29, 2023 11:20
* Move -fPIC insertion to subconfigs' make_defs.mk.

Details:
- Previously, common.mk was appending -fPIC to the CPICFLAGS variables
  set within the various subconfigurations' make_defs.mk files. This
  seemed somewhat unintuitive, and so now the -fPIC flag is assigned to
  the various subconfigs' CPICFLAGS variables in the respective
  make_defs.mk files.
- This also commit changes the logic in common.mk so that instead of
  appending, the variable is overwritten, but now *only* in the case
  of Windows (since apparently -fPIC needs to be omitted there). Thanks
  to Nick Knight for catching and reporting this weirdness.
Details:
- Added two PowerPoint files that contain slides depicting the classic
  Goto algorithm for matrix multiplication as well as its sister
  "panel-block" algorithm. These files reside in docs/diagrams.
Details:
- Added PDF versions of the PowerPoint files added in 17cd260.
Details:
- Added `output.testsuite` to .gitignore since it was previously not 
  being matched by `output.testsuite.*`.
Details:
- Implemented a new configure option, --disable-tls, which allows the
  user to optionally disable the use of thread-local storage qualifiers
  on static variables in BLIS. This option will rarely be needed, but
  in some situations may allow BLIS to compile when TLS is unavailable.
  Thanks to Nick Knight for suggesting this option.
- Unlike the --disable-system option, --disable-tls does not forcibly
  disable threading. Instead, warnings of the possible consequences of
  using threading with TLS disabled are added to:
  - the output of './configure --help';
  - the output of 'configure' the --disable-tls option is parsed;
  - the informational header output by the testsuite.
  Thanks to Minh Quan Ho for suggesting these warnings.
- Modified frame/include/bli_lang_defs.h so that BLIS_THREAD_LOCAL is
  defined to nothing when BLIS_ENABLE_TLS is not defined.
- Defined bli_info_get_enable_tls(), which returns whether the cpp macro
  BLIS_ENABLE_TLS was defined.
- Edited --disable-system configure status output for clarity.
- Whitespace updates.
Added whitespace and other formatting fixes
Restore changes from sifive-blis-private#28
Details:
- Added attributions associated with commits:
  - 98d4678 9b1beec: @bartoldeman
  - 2b05948 059f151: @ct-clmsn
- Reordered attirubtion for @decandia50.
Details:
- There are four RISC-V base configurations: 'rv32i', 'rv32iv', 'rv64i', 
  and 'rv64iv', namely the 32-bit and 64-bit implementations with and 
  without the 'V' vector extension. Additional extensions such as 'M' 
  (multiplication), 'A' (atomics), 'F' ('float' hardware support), 'D' 
  ('double' hardware support), and 'C' (compressed-length instructions), 
  are automatically used when available. If they are not available, then 
  software equivalents (e.g., softfloat and -latomic) are used.
- './configure auto' can be invoked on a RISC-V build platform, and will
  automatically detect RISC-V CPU extensions through the RISC-V C API:
  https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md
- The assembly kernels assume the presence of the vector extension
  RVV 1.0.
- It is possible to build 'rv[32,64]iv' for any value of VLEN.
  However, if VLEN < 128, the targets will fall back to the generic
  kernels and blocksizes.
- The vector microkernels are vector-length agnostic and work with
  every VLEN >=128, but are expected to work best with smaller vector
  lengths, i.e., VLEN <= 512.
- The assembly kernels cover column major storage (rs_c == 1).
- The blocksizes aim at being a good generic choice for out-of-order
  cores. They are not tuned to a specific RISC-V HPC core.
- The vector kernels have been tested using vlen={128,256,512}.
- The single- and double-precision assembly code routines for 'sgemm' 
  and 'dgemm', or for 'cgemm' and 'zgemm', are combined in their RISC-V 
  vector assembly source code, and are differentiated only with macros.
- The XLEN=32 and XLEN=64 versions of the RISC-V assembly code are
  identical, except that callee-saved registers are saved and restored 
  differently. There are RISC-V assembly code #include files for
  handling the saving and restoring of callee-saved registers, and they 
  are future-proof if ever XLEN=128.
- Multiplications, such as computing array strides and offsets, are 
  performed in C, and later passed to the RISC-V assembly kernels. This 
  is so that the compiler can determine whether the 'M' (multiply) 
  extension is available and use multiplication instructions, or call 
  library helper functions instead.
- A new macro called bli_static_assert() has been added to perform
  static assertions at compile-time, regardless of the C/C++ dialect of 
  the compiler. The original motivation of this was to ensure that 
  calling RISC-V assembly kernels would not silently truncate arguments 
  of type 'dim_t' or 'inc_t' (so-called "narrowing conversions").
- RISC-V CI tests have been added to Travis CI, using the
  riscv-gnu-toolchain cross-compiler, and qemu simulator.
- Thanks to Lee Killough for collaborating on this commit.
Details:
- PR flame#738 -- which moved -fPIC flag insertion responsibilities from
  common.mk to the subconfigs' individual make_defs.mk files -- was 
  merged shortly before the introduction of new RISC-V subconfigs in 
  flame#693. This commit brings those RISC-V subconfigs up to date with the 
  new -fPIC conventions.
devinamatthews and others added 5 commits May 5, 2023 14:22
`FC` was used instead of `found_fc`.
Details:
- This commit fixes issue flame#746, in which the _access() function (called
  from within blastest/f2c/open.c) is undeclared when compiling on 
  Windows with clang 16.
Details:
- Consolidated INSERT_GENTFUNC_* (and corresponding GENTPROT) macro sets
  using variadic macros (__VA_ARGS__), which means we no longer need a
  different INSERT_ macro for each possible number of arguments the 
  macro might take. This change seems reasonable given that variadic
  macros are a standard C99 feature and widely supported. I took care 
  not to use variadic macros where 0 variadic arguments are expected 
  since that is a non-standard extension.
- Added pre-typecast parentheses to arithmetic expressions in printf() 
  statements in bli_thread_range_tlb.c.
Add detection of the NVIDIA nvhpc compiler (`nvc`) in `configure`, and adjust some warning options in `config.mk`. Currently, no specific options for `nvc` have been added in the relevant configurations so it may not be usable without further tweaks.
Details:
- Ever since 28b0982, herk, her2k, syrk, and syr2k have been implemented
  in terms of the gemmt expert API. And since the decision of which
  induced method to use (1m or native) is made *below* the level of the
  expert API, executing any of {herk,her2k,syrk,syr2k} results in BLIS 
  checking the enablement status for gemmt.
- This commit applies a band-aid of sorts to this issue by modifying
  bli_l3_ind_oper_get_enable() and bli_l3_ind_oper_set_enable() so that
  any attempts to query or modify the internal enablement status for 
  herk, her2k, syrk, or syr2k instead does so for gemmt.
- This solution isn't perfect since, in theory, the user could enable 1m
  for, say, herk but then disable it for syrk, and then be confused when 
  herk runs via native execution. But we don't anticipate that users
  modify 1m enablement at the operation level, and so in practice this
  solution is likely fine for now.
fgvanzee and others added 4 commits June 7, 2023 16:11
Details:
- Inserted a cache line of padding between various fields of the
  thrcomm_t and, in the case of the (presently defunct) tree barrier,
  fields of the barrier_t. This additional padding ensures that these
  fields, which both serve different purposes when performing a thread
  barrier, are only accessed when needed (and not just due to their
  spatial locality with their cache line neighbors).
- Added a new cpp macro constant, BLIS_CACHE_LINE_SIZE, to
  bli_config_macro_defs. This new constant defines the size of a cache
  line (in bytes) and defaults to 64.
- Special thanks to Leick Robinson for discovering this false sharing
  issue and developing/submitting the patch.
Remove C++ options; fix whitespace; fix spurious merge conflicts
@leekillough
Copy link
Author

This PR, which merges sifive_x280 with the latest master (which includes my and @angsch's generic RISC-V code), passes all fast tests after using the attached script to build the toolchain. I'm running the full testsuite now.

build-riscv.sh.txt

cd
./build-riscv.sh
source  ~/riscv/rv64imafdv_lp64d_vlen512/riscv.sh

cd blis
./configure sifive_x280
make -j
make checkblis-fast
...
check-blistest.sh: All BLIS tests passed!

The CI is the main thing holding things up. @angsch and @fgvanzee have some experience there.

@leekillough
Copy link
Author

$ make checkblis
Running test_libblis.x with output redirected to 'output.testsuite'
check-blistest.sh: All BLIS tests passed!

fgvanzee and others added 2 commits June 12, 2023 17:22
Details:
- Rewrote the defintion of bli_thrcomm_tree_barrier() so that it (a)
  actually worked again, and (b) used atomics instead of a basic C99
  spin loop. (Note that the conventional barrier implementation is
  still enabled by default; the tree barrier must be toggled on
  manually within the configuration.)
- Added an early return to the definition of bli_thrcomm_barrier() in
  the cases where comm == NULL or comm->n_threads == 1.
- Reordered thread-related and thread-dependent header #include
  directives in blis.h so that the BLIS_TREE_BARRIER and
  BLIS_TREE_BARRIER_ARITY macros, which would be defined in the target
  configuration's in the bli_family_*.h file, would be #included prior
  to the inclusion of the thrcomm_t header that uses them.
- Changed the type of barrier_t.count from 'int' to 'dim_t'.
- Changed the type of barrier_t.signal from 'volatile int' to 'gint_t'.
- Special thanks to Leick Robinson for contributing these changes.
- Whitespace changes.
@leekillough
Copy link
Author

Superceded by #4

@leekillough leekillough closed this Jul 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants