Numba usage stats

Jim Pivarski

Note: The URLs to will no longer work because those files have been removed. All but one have been moved to the files-from-AWS directory in this repository. The one that has not been saved, GitHub-numba-user-nonfork-raw-data-1Mcut-imports.tar, was 179.4 GB of repository data. If you're following the instructions below, you'll be able to produce an updated version of that dataset, but the original was too large to keep around. The other files are much smaller and serve as good checkpoints for testing that you're following the procedure correctly.

How the data were collected

Step 1: Scrape the dependents graph for numba/numba on GitHub (the repositories, not the packages).

Web-scraping script in

When I did it, there were 62903 of these.

Step 2: For each of those repos, get the repo metadata using the GitHub API, taking care to not exceed the rate limit.

My list has 62900 of these. (I guess 3 were lost.)

My copy of the repo info can be found in (32.1 MB).

My copy of the user info, also from GitHub API (for bios) can be found in (2.0 MB).
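The rate-limit handling in Step 2 can be sketched as follows. This is a minimal illustration, not the script that was actually used: the function names seconds_to_wait and fetch_repo_metadata and the token parameter are hypothetical. It assumes GitHub's X-RateLimit-Remaining and X-RateLimit-Reset response headers and the GET /repos/{owner}/{repo} endpoint.

```python
import json
import time
import urllib.request


def seconds_to_wait(headers, now):
    """Return how long to sleep before sending the next request.

    `headers` is a mapping of response headers and `now` is the current
    Unix time in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Budget exhausted: wait until the reset time, plus one second of slack.
    return max(0.0, reset_at - now + 1.0)


def fetch_repo_metadata(full_name, token=None):
    """Fetch metadata for one "owner/repo", then pause if the budget is spent."""
    request = urllib.request.Request(f"https://api.github.com/repos/{full_name}")
    request.add_header("Accept", "application/vnd.github+json")
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        metadata = json.load(response)
        delay = seconds_to_wait(response.headers, time.time())
    time.sleep(delay)
    return metadata
```

The "fork" field of the returned metadata is what Step 3 selects on.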

Step 3: For the repos in which "fork": false (users created the repo themselves), download all of the repos.

59233 repos from my previous list are non-fork.

I have the final results on a 380 GB disk, but I think I used a 1 TB disk during the process (all on AWS).

The script performs the giant git clone of all these repos. It's a parallelized pipeline (ProcessPoolExecutor with max_workers=24 on a computer with 4 CPU cores, because the work is I/O limited) with the following steps:

  1. git clone with --depth 1 to get the latest snapshot, but not the history.
  2. Do a grep -i for \bnumba\b to cross-check GitHub's identification of these as depending on Numba and keep the result in a *.grep file beside the final tarball.
  3. Drop any files that are greater than 1 MB (some GitHub repos contain large data files) if they do not have an interesting file suffix: py, PY, ipynb, IPYNB, c, cc, cpp, cp, cxx, c++, C, CC, CPP, CP, CXX, C++, h, hpp, hp, hh, H, HPP, HP, HH, cu, cuh, CU, CUH.
  4. Tarball-and-compress what remains.

Occasionally, one of the 24 workers would get stuck with a large download, but the others moved past it. In the end, I think there were only a couple that couldn't be downloaded after a few attempts. (The script does not re-download repos it already has, so rerunning it cleans up after failed attempts.)
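The grep cross-check and the file-dropping rule from the list above can be sketched as two small predicates. This is an illustration only, not the script itself: the names keep_file, mentions_numba, and SIZE_LIMIT are hypothetical, and "1 MB" is assumed to mean 10**6 bytes.

```python
import re
from pathlib import PurePosixPath

# Suffixes listed in step 3 of the pipeline; files with these are never dropped.
INTERESTING_SUFFIXES = {
    "py", "PY", "ipynb", "IPYNB",
    "c", "cc", "cpp", "cp", "cxx", "c++", "C", "CC", "CPP", "CP", "CXX", "C++",
    "h", "hpp", "hp", "hh", "H", "HPP", "HP", "HH",
    "cu", "cuh", "CU", "CUH",
}

# Assumption: "1 MB" is taken as 10**6 bytes.
SIZE_LIMIT = 1_000_000

# Case-insensitive whole-word match, in the spirit of grep -i '\bnumba\b'.
NUMBA_WORD = re.compile(r"\bnumba\b", re.IGNORECASE)


def keep_file(path, size_in_bytes):
    """Keep small files, and large files only if their suffix is interesting."""
    suffix = PurePosixPath(path).suffix.lstrip(".")
    return size_in_bytes <= SIZE_LIMIT or suffix in INTERESTING_SUFFIXES


def mentions_numba(text):
    """True if the text contains "numba" as a whole word, in any case."""
    return NUMBA_WORD.search(text) is not None
```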

Step 4: Further select only the repos that actually contain 'import numba' or 'from numba import' in some file. After this selection, only 13512 repos were kept (22.8%). Some of the repos that GitHub identified mentioned Numba in text or used it in markdown examples, but didn't import it: GitHub's interpretation of a "dependent repo" is very broad.
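A text-level selection of this kind could look like the sketch below. The pattern is an assumption made for illustration (the exact expression used for the selection is not reproduced in this text), and imports_numba is a hypothetical name. Being a plain regex, it would also match import statements that appear inside strings or docstrings.

```python
import re

# Hypothetical pattern: a line that begins (after optional indentation) with
# "import ..., numba" or "from numba... import".
IMPORTS_NUMBA = re.compile(
    r"^\s*(?:"
    r"import\s+(?:[\w.]+(?:\s+as\s+\w+)?\s*,\s*)*numba\b"
    r"|from\s+numba\b[\w.]*\s+import\b"
    r")",
    re.MULTILINE,
)


def imports_numba(source):
    """True if some line of the source text looks like an import of numba."""
    return IMPORTS_NUMBA.search(source) is not None
```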

Finally, tarball (without compression!) the directory full of gzipped tarballs. My copy is at (179.4 GB).

Step 5: Do a static code analysis on all of the Python and Jupyter notebook files. This is another ProcessPoolExecutor pipeline, which results in a JSON file that will be used in interactive analysis. The steps of the pipeline are:

  1. Identify programming language by file extension, to learn which programming languages are used alongside Numba.
  2. For all C/C++/CUDA files,
     • try to parse it as a pure C file using pycparser (mostly to distinguish between C and C++),
     • look for CUDA's triple angle brackets, and
     • regex-search it for \s*#include [<\"](.*)[>\"] to get a list of includes (and identify if the include-file name matches the name of a file in the repo, so that locally defined files can be excluded).
  3. For all Python and Jupyter notebook files, parse the file with Python 3 (3.10.12) and indicate if parsing failed. For Jupyter, use jupytext to transform the Jupyter JSON into an in-memory pure Python, with IPython magics removed. Then, walk the Python AST to
     • collect all information on top-level imports and nested imports, keeping track of how imported modules or symbols are renamed,
     • if any of these are under the numba module, collect all symbol references and argument lists of function calls, including whether or not a function was used as a decorator, and
     • pay close attention to JIT-compilation functions/decorators: numba.jit, numba.njit, numba.generated_jit, numba.vectorize, numba.guvectorize, numba.cfunc.
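The AST walk for Python files can be sketched with the standard ast module. This is a simplified illustration, not the analysis script: collect_imports, dotted_name, and numba_decorators are hypothetical names, the return formats are invented for the example, and only decorators are resolved (not every symbol reference or argument list).

```python
import ast


def collect_imports(source):
    """Return (top, nested): dicts mapping each local name to what it refers to."""
    tree = ast.parse(source)
    top_level_nodes = {id(node) for node in tree.body}
    top, nested = {}, {}
    for node in ast.walk(tree):
        target = top if id(node) in top_level_nodes else nested
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname is not None:
                    # "import a.b as c" binds c to a.b
                    target[alias.asname] = alias.name
                else:
                    # "import a.b" binds only the root name a
                    root = alias.name.split(".")[0]
                    target[root] = root
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                target[alias.asname or alias.name] = f"{node.module}.{alias.name}"
    return top, nested


def dotted_name(node):
    """Turn a Name/Attribute chain such as nb.cuda.jit into "nb.cuda.jit"."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def numba_decorators(source):
    """List decorators that resolve to something under numba, renames undone."""
    top, nested = collect_imports(source)
    renames = {**nested, **top}
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for decorator in node.decorator_list:
                func = decorator.func if isinstance(decorator, ast.Call) else decorator
                name = dotted_name(func)
                if name is None:
                    continue
                head, _, rest = name.partition(".")
                full = renames.get(head, head) + ("." + rest if rest else "")
                if full == "numba" or full.startswith("numba."):
                    found.append(full)
    return found
```

Undoing the renames is what lets "@nb.jit" and "@fast" (from "from numba import njit as fast") be counted under their canonical names.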

My copy of the static analysis results is at (77.0 MB).
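For the C/C++/CUDA branch of the pipeline, the include search and the CUDA check can be sketched as below. The include pattern is the one quoted in the list of steps; everything else is an assumption for illustration: scan_c_source is a hypothetical name, the CUDA pattern is a guess at what "triple angle brackets" means in practice, and local includes are matched by base name only. The pycparser step is left out.

```python
import re

# The include pattern quoted in step 2 of the pipeline.
INCLUDE = re.compile(r"\s*#include [<\"](.*)[>\"]")

# Assumption: a kernel launch looks like name<<<grid, block>>>(...).
CUDA_LAUNCH = re.compile(r"<<<.*?>>>", re.DOTALL)


def scan_c_source(text, repo_filenames):
    """Split includes into global versus local and count CUDA kernel launches."""
    global_includes, local_includes = [], []
    for line in text.splitlines():
        match = INCLUDE.match(line)
        if match is None:
            continue
        name = match.group(1)
        # An include whose base name matches a file in the repo counts as local.
        if name.rsplit("/", 1)[-1] in repo_filenames:
            local_includes.append(name)
        else:
            global_includes.append(name)
    return {
        "global": global_includes,
        "local": local_includes,
        "num_cuda": len(CUDA_LAUNCH.findall(text)),
    }
```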


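import json
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Read the static analysis results: one JSON document (one repo) per line.
results = []
with open("static-analysis-results.jsons") as file:
    for line in file:
        results.append(json.loads(line))

# One row per C/C++/CUDA file: its suffix, whether it parsed as pure C,
# and whether any CUDA <<< >>> was found in it.
df = pd.DataFrame([{"suffix": cfile["suffix"], "is_c": cfile["data"]["is_c"], "is_cuda": cfile["data"]["num_cuda"] > 0} for result in results for cfile in result["c"]])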
        suffix   is_c  is_cuda
0          cpp   True    False
1            c  False    False
2            c  False    False
3          cpp  False    False
4          cpp  False    False
...        ...    ...      ...
897086       h  False    False
897087       h  False    False
897088       h  False    False
897089       h  False    False
897090       c  False    False

897091 rows × 3 columns

The file extension is useless for determining if something is pure C versus C++.

Suffixes of the files that parse as pure C (is_c is True):

h      9157
cpp    3236
c      3024
hpp    2311
cxx     160
cc      106
cuh      49
cu       20
hh        6
hxx       5
Name: count, dtype: int64

Suffixes of the files that do not (is_c is False):

h      401378
cpp    194933
c       85414
cc      73949
hpp     45596
cu      40778
cxx     24752
cuh      6170
hh       3046
hxx      2982
c++        13
cp          3
hp          3
Name: count, dtype: int64
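The cells that produced the two tallies above are not shown. The tallies add up to the 897091 rows of the table, which is consistent with splitting the rows on is_c and counting suffixes in each half. A sketch of that operation on a toy DataFrame (the data here are made up; only the column names come from the table above):

```python
import pandas as pd

# Toy stand-in for the per-file table, with the same three columns.
toy = pd.DataFrame(
    {
        "suffix": ["cpp", "c", "c", "h", "h", "cu"],
        "is_c": [True, False, True, True, False, False],
        "is_cuda": [False, False, False, False, False, True],
    }
)

# Suffix counts among files that parsed as pure C, and among those that did not.
parses_as_c = toy[toy["is_c"]]["suffix"].value_counts()
does_not_parse = toy[~toy["is_c"]]["suffix"].value_counts()
```

Replacing is_c with is_cuda gives the corresponding split for the CUDA check.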

But the extension is a pretty good indicator that a CUDA file is a CUDA file (unless it's a header file, but then my method of checking for <<< >>> doesn't work, either).

Suffixes of the files with at least one <<< >>> (is_cuda is True):

cu     18197
h       1274
cuh      962
c        101
cpp       85
hpp       82
cc        73
hxx        2
hh         1
Name: count, dtype: int64

Suffixes of the files without one (is_cuda is False):

h      409261
cpp    198084
c       88337
cc      73982
hpp     47825
cxx     24912
cu      22601
cuh      5257
hh       3051
hxx      2985
c++        13
cp          3
hp          3
Name: count, dtype: int64
languages = []
for result in results:
    # Keep only repos in which at least one Python file imports numba
    # (for/else: the else branch runs only if the loop never hit break).
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue

    username, reponame = result["name"].split("/", 1)
    languages.append({"user": username, "repo": reponame})

    # Each C-like file gets exactly one label; a repo accumulates all of them.
    for cfile in result["c"]:
        if cfile["data"]["num_cuda"] > 0 or cfile["suffix"] in ("cu", "cuh"):
            languages[-1]["CUDA"] = True
        elif cfile["data"]["is_c"]:
            languages[-1]["C"] = True
        else:
            languages[-1]["C++"] = True

    # Other languages are identified by file extension alone.
    for k, v in result["other_language"].items():
        if v > 0:
            languages[-1][k] = True

df = pd.DataFrame(languages).fillna(False)
                     user                  repo      C    C++  Cython  Julia  Swift     Go   CUDA   Java  ...      R   Rust  MATLAB  Fortran  Groovy  Scala  Kotlin     F#  Haskell    Ada
    0      JeffreyMinucci       ht_occupational  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
    1           dreamento             dreamento  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
    2           nitin7478  Backorder_Prediction  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
    3              exafmm              pyexafmm  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
    4   astro-informatics               sleplet  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
  ...                 ...                   ...    ...    ...     ...    ...    ...    ...    ...    ...  ...    ...    ...     ...      ...     ...    ...     ...    ...      ...    ...
13087           haowen-xu             tensorkit  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
13088      WONDER-project         OASYS1-WONDER  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
13089      WONDER-project        Orange3-WONDER  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
13090  maciej-sypetkowski            autoascend  False  False   False  False  False  False  False  False  ...  False  False   False    False   False  False   False  False    False  False
13091      FS-CSCI150-F21  FS-CSCI150-F21-Team4   True   True    True  False  False  False  False  False  ...  False  False    True     True   False  False   False  False    False  False

13092 rows × 24 columns

len(df) / len(results)

In the following, "C++" and "C" are mutually exclusive categories of file ("does it compile in pycparser or not?"), but the bars are not mutually exclusive because a repo can contain a C++ file and also a pure C file.
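The distinction between per-file labels and per-repo flags can be made concrete with a small sketch that follows the if/elif chain in the cell above, where a file flagged as CUDA is not additionally labeled C or C++. The function names classify_c_file and repo_languages are hypothetical; the dictionary keys are the ones used in that cell.

```python
def classify_c_file(cfile):
    """Assign exactly one label to a C-like file."""
    if cfile["data"]["num_cuda"] > 0 or cfile["suffix"] in ("cu", "cuh"):
        return "CUDA"
    if cfile["data"]["is_c"]:
        return "C"
    return "C++"


def repo_languages(cfiles):
    """Union of the per-file labels: a repo can carry several at once."""
    return {classify_c_file(cfile) for cfile in cfiles}


# A made-up repo with one pure C file, one C++ file, and one .cu file.
example_repo = [
    {"suffix": "c", "data": {"is_c": True, "num_cuda": 0}},
    {"suffix": "cpp", "data": {"is_c": False, "num_cuda": 0}},
    {"suffix": "cu", "data": {"is_c": False, "num_cuda": 0}},
]
```

Here every file has a single label, yet the repo as a whole is counted under all three bars.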

"CUDA" is not exclusive with respect to "C++" and "C"; it corresponds to any C-like file with <<< >>> in it.

fig, ax = plt.subplots(figsize=(6, 4.5))

(df.drop(columns=["user", "repo"]).sum(axis=0).sort_values() / len(df)).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("non-Python languages (represented by at least one file)")



fig, ax = plt.subplots(figsize=(9, 4))

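# Collapse repos to users: a user counts for a language if any of their repos
# uses it, and the percentage is taken over users rather than repos.
by_user = df.drop(columns=["repo"]).groupby("user").any()
(by_user.sum(axis=0).sort_values()[4:] * 100 / len(by_user)).plot.barh(ax=ax)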
ax.set_xlabel("Percentage of GitHub users who 'import numba' or 'from numba import' in Python")
ax.set_ylabel("Other language used (at least one file)")

# fig.savefig("numba-users-other-language.svg")
# fig.savefig("numba-users-other-language.pdf")


Ada               1
Haskell          13
F#               14
Kotlin           24
Swift            29
Groovy           32
Scala            43
Ruby             90
Rust             91
Julia            92
Go               93
C#              115
Perl            177
Java            300
R               317
Fortran         465
MATLAB          533
Mathematica     692
CUDA           1270
Cython         1406
C              1511
C++            2850
dtype: int64

num_with_numba = 0
python_imports = Counter()
c_imports = Counter()
for result in results:
    # Skip repos in which no Python file imports numba.
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"]):
                if x not in STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        python_imports[x] += 1

    counter = Counter()
    for cfile in result["c"]:
        if cfile["data"] is not None:
            for x in list(cfile["data"]["global"]) + list(cfile["data"]["local"]):
                if x not in C_STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        c_imports[x] += 1

python_imports = sorted(python_imports.items(), key=lambda x: -x[1])
c_imports = sorted(c_imports.items(), key=lambda x: -x[1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 9))

(pd.Series(dict(python_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax1)
ax1.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax1.set_title("top Python imports (not standard library)")
ax1.set_xlim(0, 1)

(pd.Series(dict(c_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax2)
ax2.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax2.set_title("top C and C++ includes (not standard library)")
# ax2.set_xlim(0, 1)




num_with_numba = 0
numba_references = Counter()
for result in results:
    # Skip repos in which no Python file imports numba.
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                y = x.lstrip("@").split("(")[0]
                # Strip the decorator marker before testing, as is done for y above.
                if x.lstrip("@").startswith("numba.jit") and "nopython=True" in x:
                    y = "numba.njit"
                counter[y] += 1

    for x in counter:
        numba_references[x] += 1

numba_references = sorted(numba_references.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 9))

(pd.Series(dict(numba_references[:50])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top Numba API calls")
# ax.set_xlim(0, 1)



JIT_FUNCTIONS = {"numba.jit", "numba.njit", "numba.generated_jit", "numba.vectorize", "numba.guvectorize", "numba.cfunc", "numba.cuda.jit"}
fig, ax = plt.subplots(figsize=(6, 2))

(pd.Series({k: v for k, v in numba_references if k in JIT_FUNCTIONS}).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("Numba JIT API calls")
# ax.set_xlim(0, 1)



num_with_numba = 0
jit_arguments = Counter()
for result in results:
    # Skip repos in which no Python file imports numba.
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                if "(" in x and (x.lstrip("@").startswith("numba.jit") or x.lstrip("@").startswith("numba.njit")):
                    for arg in x.split("(", 1)[1].rstrip(")").split(","):
                        if "=" in arg:
                            counter[arg.strip()] += 1
                if x.lstrip("@").startswith("numba.njit"):
                    counter["nopython=True"] += 1

    for x in counter:
        jit_arguments[x] += 1

jit_arguments = sorted(jit_arguments.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 4.5))

(pd.Series(dict(jit_arguments[:17])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top numba.jit arguments")
# ax.set_xlim(0, 1)



num_with_numba = 0
num_with_numba_cuda = 0
for result in results:
    # Skip repos in which no Python file imports numba.
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    any_cuda = False
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                # Strip the decorator marker so that @numba.cuda.jit is counted too.
                if x.lstrip("@").startswith("numba.cuda"):
                    any_cuda = True

    if any_cuda:
        num_with_numba_cuda += 1

num_with_numba_cuda / num_with_numba