Cython perf simplification (#861)
* complete rewrite of cython files

This was not really intended, but the amount
of different code paths that were not using fused types
was annoying and prohibited further development.

I initially tried to implement pure-Python mode
code. However, it seems to be impossible to write
pure-Python mode code and cimport it into another
pure-Python mode module.

Before this is merged, I need to benchmark
the changes.

The main change is that the fused data types
are now used throughout and that the ndarray data
type has been removed almost completely.
So now, typed memoryviews are prevalent.

This means that it is much simpler to track problems.
There is no code duplication in the spin-box calculations.

I am not sure whether the Python-yellow annotations are
only for the int/long variables and the down-casting.
In any case, I think this is easier to manage.

Signed-off-by: Nick Papior <[email protected]>

* redid fold_csr_matrix for much better perf

Simple benchmarks showed that converting Siesta Hamiltonians
to Hk can be made much faster by changing how the
folding of the matrices is done.

Instead of incrementally adding elements and searching
for duplicates before each addition, we now
build the entire array and use numpy.unique to reduce
it. This leverages the fact that numpy.unique
already returns a sorted array.
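
A rough NumPy sketch of the idea (the actual implementation is the Cython
folding routine and differs in detail; the per-row `cols`/`data` arrays here
are purely illustrative):

    import numpy as np

    def fold_row(cols, data):
        """Fold duplicated column entries of one row into a unique, sorted set."""
        # numpy.unique returns the sorted unique columns plus an inverse map,
        # so all duplicates can be accumulated in one vectorized pass instead
        # of searching for each column before insertion.
        ucols, inverse = np.unique(cols, return_inverse=True)
        udata = np.zeros(ucols.size, dtype=data.dtype)
        np.add.at(udata, inverse, data)  # sum values landing on the same column
        return ucols, udata

    # columns 1 and 4 appear twice and get folded (summed) into single entries
    cols = np.array([4, 1, 3, 1, 4])
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(fold_row(cols, data))  # (array([1, 3, 4]), array([6., 3., 6.]))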

It marginally slows down CSR creation for matrices with
few edges per orbital (TB models), but it will be
much faster for larger models stemming from DFT
or the like.

Tests for this commit:

    %timeit H.Hk()
    %timeit H.Hk([0.1] * 3)
    %timeit H.Hk(format="array")
    %timeit H.Hk([0.1] * 3, format="array")

For a *many* edge system, we get:

    67.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    85.4 ms ± 8.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    5.59 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    11.3 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

While for a *few* edge system, we get:

    9.1 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    9.25 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    5.75 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    6.17 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For commit v0.15.1-57-g6bbbde39, the *many* edge system gives:

    196 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    214 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    6.58 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    12.8 ms ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and the *few* edge system gives:

    7.41 ms ± 77.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    7.37 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    6.04 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    5.81 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Enabling the Fortran language is necessary for CMake to populate the
details of the Fortran world.

This should enable simpler access to data.

Signed-off-by: Nick Papior <[email protected]>

* fixed handling of complex matrices in sisl

Much of the code was built for floats only.
This fixes the reading/writing of
sparse matrices in specific data formats.

It allows a more natural way of handling SOC
matrices with complex coupling, say:

    H[0, 0, 2] = Hud

as a single complex value. When dealing with floats,
one instead needs to do this:

    H[0, 0, 2] = Hud.real
    H[0, 0, 3] = Hud.imag

which is not super-intuitive.
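
As a hedged illustration (the graphene geometry and the `spin=`/`dtype=`
constructor arguments are assumptions for this sketch, not taken from the
commit; only the element assignments mirror the example above):

    import numpy as np
    import sisl as si

    geom = si.geom.graphene()
    Hud = 0.1 + 0.2j  # hypothetical up-down coupling element

    # complex storage: the up-down block is a single complex assignment
    Hc = si.Hamiltonian(geom, spin=si.Spin("spin-orbit"), dtype=np.complex128)
    Hc[0, 0, 2] = Hud

    # float storage: real and imaginary parts are split over two indices
    Hf = si.Hamiltonian(geom, spin=si.Spin("spin-orbit"), dtype=np.float64)
    Hf[0, 0, 2] = Hud.real
    Hf[0, 0, 3] = Hud.imag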

Currently there are still *many* hard-coded index conventions,
and we should strive to move these into a common framework
to limit the problems this creates.

Tests have been added that check Hamiltonian eigenvalues
and density-matrix Mulliken charges. So it seems to work
as intended, but not everything has been fully tested.
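
A hedged sketch of the kind of consistency check meant here (the geometry,
hoppings and k-point are illustrative, not the actual test):

    import numpy as np
    import sisl as si

    geom = si.geom.graphene()
    k = [0.1, 0.2, 0]

    # build the same tight-binding model with float and complex storage;
    # the eigenvalues at any k-point should coincide
    eigs = []
    for dtype in (np.float64, np.complex128):
        H = si.Hamiltonian(geom, dtype=dtype)
        H.construct([(0.1, 1.44), (0.0, -2.7)])
        eigs.append(H.eigh(k))

    assert np.allclose(eigs[0], eigs[1])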

* added explanation of how transform works

* added more tests for dtype conversion

Fixed errors in the dtypes of created columns when calling
csr.diags().

Ensured that sparsegeometry.finalize accepts any arguments that
csr.finalize accepts.
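
A small hedged example of the pass-through (assuming `sort` is one of the
keywords csr.finalize understands):

    import sisl as si

    H = si.Hamiltonian(si.geom.graphene())
    H.construct([(0.1, 1.44), (0.0, -2.7)])
    # keyword arguments are now forwarded to the underlying csr.finalize
    H.finalize(sort=False)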

Removed casting errors in mat2dtype. This may *hide* potential
problems when there are non-zero imaginary parts.
We should perhaps later revisit the problem considering
that TB + Peierls has complex overlap matrices.

Fixed an issue when construct was called with Python
intrinsic complex values.

---------

Signed-off-by: Nick Papior <[email protected]>
zerothi authored Nov 26, 2024
1 parent 8559eda commit c6253c8
Showing 51 changed files with 3,513 additions and 4,533 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -16,9 +16,14 @@ we hit release version 1.0.0.
sisl.geom.graphene

### Fixed

- `projection` arguments of several functions have been streamlined

### Changed
- internal Cython code for performance improvements.
  This yields significant performance improvements for DFT sparse matrices
  with *many* edges, but a performance hit for very small TB matrices.


## [0.15.2] - 2024-11-06

27 changes: 14 additions & 13 deletions CMakeLists.txt
@@ -65,7 +65,6 @@ add_compile_definitions(CYTHON_NO_PYINIT_EXPORT=1)
#: lib, perhaps we should change this
set(CMAKE_SHARED_MODULE_PREFIX "")


# Determine whether we are in CIBUILDWHEEL
# and whether we are building for the universal target
set(_def_fortran TRUE)
@@ -81,6 +80,8 @@ option(WITH_FORTRAN

# Define all options for the user
if( WITH_FORTRAN )
enable_language(Fortran)

set(F2PY_REPORT_ON_ARRAY_COPY 10
CACHE STRING
"The minimum (element) size of arrays before warning about copies")
@@ -209,6 +210,18 @@ if(WITH_FORTRAN)
endif(WITH_FORTRAN)


message(STATUS "Python variables:")
list(APPEND CMAKE_MESSAGE_INDENT " ")

cmake_print_variables(Python_INCLUDE_DIRS)
cmake_print_variables(Python_NumPy_INCLUDE_DIRS)
if(WITH_FORTRAN)
cmake_print_variables(Python_NumPy_F2Py_INCLUDE_DIR)
endif()

list(POP_BACK CMAKE_MESSAGE_INDENT)


message(STATUS "sisl options")
list(APPEND CMAKE_MESSAGE_INDENT " ")

@@ -230,18 +243,6 @@ endif()
list(POP_BACK CMAKE_MESSAGE_INDENT)


message(STATUS "Python variables:")
list(APPEND CMAKE_MESSAGE_INDENT " ")

cmake_print_variables(Python_INCLUDE_DIRS)
cmake_print_variables(Python_NumPy_INCLUDE_DIRS)
if(WITH_FORTRAN)
cmake_print_variables(Python_NumPy_F2Py_INCLUDE_DIR)
endif()

list(POP_BACK CMAKE_MESSAGE_INDENT)



# Return in _result whether the _file should be built, or not
# It checks whether the file is present in the NO_COMPILATION
77 changes: 77 additions & 0 deletions benchmarks/optimizations/hamiltonian.ipynb
@@ -0,0 +1,77 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we test and check the performance of the `Hk` implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"import numpy as np\n",
"import sisl as si\n",
"\n",
"files = Path(os.environ[\"SISL_FILES_TESTS\"])\n",
"siesta = files / \"siesta\"\n",
"\n",
"N = 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"H = si.Hamiltonian.read(siesta / \"Si_pdos_k\" / \"Si_pdos.TSHS\").tile(N, 0).tile(N, 1)\n",
"\n",
"%timeit H.Hk()\n",
"%timeit H.Hk([0.1] * 3)\n",
"%timeit H.Hk(format=\"array\")\n",
"%timeit H.Hk([0.1] * 3, format=\"array\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"H = si.Hamiltonian.read(siesta / \"Pt2_soc\" / \"Pt2_xx.TSHS\").tile(N, 0).tile(N // 2, 1)\n",
"\n",
"%timeit H.Hk()\n",
"%timeit H.Hk([0.1] * 3)\n",
"%timeit H.Hk(format=\"array\")\n",
"%timeit H.Hk([0.1] * 3, format=\"array\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
5 changes: 2 additions & 3 deletions benchmarks/run.sh
@@ -15,6 +15,5 @@ profile=$base.profile
# Stats
stats=$base.stats

python -m cProfile -o $profile $script $@
python stats.py $profile > $stats

python3 -m cProfile -o $profile $script $@
python3 stats.py $profile > $stats
20 changes: 0 additions & 20 deletions benchmarks/run3.sh

This file was deleted.

7 changes: 7 additions & 0 deletions src/sisl/CMakeLists.txt
@@ -1,3 +1,9 @@
set_property(DIRECTORY
APPEND
PROPERTY INCLUDE_DIRECTORIES
${CMAKE_CURRENT_SOURCE_DIR}/_core
)

foreach(source _indices _math_small)
add_cython_library(
SOURCE ${source}.pyx
@@ -29,6 +35,7 @@ endforeach()
get_directory_property( SISL_DEFINITIONS DIRECTORY
${CMAKE_CURRENT_SOURCE_DIR}
COMPILE_DEFINITIONS )

# Join to stringify list
list(JOIN SISL_DEFINITIONS " " SISL_DEFINITIONS)

4 changes: 2 additions & 2 deletions src/sisl/__init__.py
@@ -88,6 +88,8 @@
# import the common options used
from ._common import *

from ._core import *

# Import warning classes
# We currently do not import warn and info
# as they are too generic names in case one does from sisl import *
@@ -106,8 +108,6 @@
# Below are sisl-specific imports
from .shape import *

from ._core import *

# Physical quantities and required classes
from .physics import *

2 changes: 1 addition & 1 deletion src/sisl/_core/CMakeLists.txt
@@ -1,4 +1,4 @@
foreach(source _lattice _sparse)
foreach(source _lattice _dtypes _sparse)
add_cython_library(
SOURCE ${source}.pyx
LIBRARY ${source}
102 changes: 102 additions & 0 deletions src/sisl/_core/_dtypes.pxd
@@ -0,0 +1,102 @@
"""
Shared header for fused dtypes
"""
cimport cython

import numpy as np

cimport numpy as cnp
from numpy cimport (
complex64_t,
complex128_t,
float32_t,
float64_t,
int8_t,
int16_t,
int32_t,
int64_t,
uint8_t,
uint16_t,
uint32_t,
uint64_t,
)

# Generic typedefs for sisl internal naming convention
ctypedef size_t size_st
ctypedef Py_ssize_t ssize_st


ctypedef fused ints_st:
int
long


ctypedef fused floats_st:
float
double


ctypedef fused complexs_st:
float complex
double complex


ctypedef fused floatcomplexs_st:
float
double
float complex
double complex


# We need this fused data-type to omit complex data-types
ctypedef fused reals_st:
int
long
float
double

ctypedef fused numerics_st:
int
long
float
double
float complex
double complex

ctypedef fused _type2dtype_types_st:
short
int
long
float
double
float complex
double complex
float32_t
float64_t
#complex64_t # not usable...
#complex128_t
int8_t
int16_t
int32_t
int64_t
uint8_t
uint16_t
uint32_t
uint64_t


cdef object type2dtype(const _type2dtype_types_st v)


ctypedef fused _inline_sum_st:
short
int
long
int16_t
int32_t
int64_t
uint16_t
uint32_t
uint64_t

cdef ssize_st inline_sum(const _inline_sum_st[::1] array) noexcept nogil