Commit

(blog): next draft
ArhanChaudhary committed Jun 30, 2024
1 parent 66a7839 commit 702481b
Showing 3 changed files with 21 additions and 19 deletions.
40 changes: 21 additions & 19 deletions src/content/blog/My GitHub repository has 100,000 contributors.mdx
@@ -12,13 +12,13 @@ import github500 from "../../assets/blog/my-github-repository-has-100000-contrib

Your eyes don’t deceive you. You can check right now: my [GitHub repository](https://github.com/ArhanChaudhary/everyone) has 100,000 contributors.

Before I explain the how and the why, we first need to go back to a few weeks ago. I was routinely moving around some configuration files, and I accidentally swapped my `~/.gitconfig` file with that of my alternate account. When I pushed to one of my repositories, I was surprised to notice that my alt account was added as a contributor! Its rendered profile picture, username, and hyperlink — all neatly displayed under "Contributors".
Before I explain the why and the how, we first need to go back to a few weeks ago. I was routinely moving around some configuration files, and I accidentally swapped my `~/.gitconfig` file with that of my alternate account. When I pushed to one of my repositories, I was surprised to notice that my alt account was added as a contributor! Its rendered profile picture, username, and hyperlink — all neatly displayed under "Contributors".

There's a good chance you've already seen something [similar](https://github.com/Amog-OS/AmogOS/commit/0bb33e31e2a529bfd13c6013d1ad2dffa2485b61) to this; it isn't hard or particularly new to fake a commit from another user. But my curiosity was sparked. In my mind, the next logical question to ask was: how many contributors could I fake? And thereafter: could I get to exactly 100,000 contributors? It would be a fireplace of ghosts, a bizarre pit stop of GitHub users.

*(Before continuing, I would like to point out that users can sign their commits to cryptographically verify their identity, which makes impersonation ineffective. Regardless, impersonation is against GitHub's [Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-impersonation). Please be careful.)*

I wasn't trying to make 100,000 commits faking a different GitHub user each time, so I did some research and found a more efficient way. If you add two new lines under a commit message, you can co-author GitHub users with their username and email address as shown.
I wasn't trying to make 100,000 commits faking a different user each time, so I did some research and found a more efficient way. If you add two new lines under a commit message, you can co-author users with their username and email address as shown.

```bash
~ % git commit -m "Refactor usability tests.
@@ -32,15 +32,15 @@ And so on, with seemingly no co-author limit. Luckily for my case, each one is a

# The plan

Manually adding this many GitHub users to a commit message obviously isn’t practical. I needed to write an API scraper, and a quick check of GitHub's [policies](https://bounty.github.com/#legal_safe_harbor) affirms I'm allowed to do this provided that I'm not excessive.
Manually adding this many users to a commit message obviously isn’t practical. I needed to write an API scraper, and a quick check of GitHub's [policies](https://bounty.github.com/#legal_safe_harbor) affirms I'm allowed to do this provided that I'm not excessive.

> We do allow the use of automated tools so long as they do not produce excessive amounts of traffic. For example, running one nmap scan against one host is allowed, but sending 65,000 requests in two minutes using Burp Suite Intruder is excessive.

So, I got to brainstorming. The most important conceptual hurdle was figuring out how to get a user's email address, as you can't co-author a user without it. The problem was that even if a user makes their email address publicly visible on their profile, the GitHub API doesn't reliably expose it\*. Surprisingly, there is a very well-known way to circumvent this and find almost anyone's email address on GitHub.

Simply navigate to any commit authored by a user on GitHub and append `.patch` to the commit's URL. Et voilà! On the second line from the top, enclosed within angle brackets, lies their email address, in full form and glory. It sounds concerning that this type of information is so easy to access, but if you think about it, emails are made public for a reason. Attribution and contact are pretty important from an open source perspective.
Simply navigate to any commit authored by a user on GitHub and append `.patch` to the commit's URL. Et voilà! On the second line from the top, enclosed within angled brackets, lies their email address, in full form and glory. It sounds concerning that this type of information is so easy to access, but if you think about it, emails are made public for a reason. Attribution and contact are pretty important from an open source perspective.
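
To make the `.patch` trick concrete, here is a minimal sketch of pulling the email out of a patch's header. It is not the scraper's actual code: `extractAuthorEmail` and `samplePatch` are hypothetical names, and the sample text only imitates the header layout that Git's patch format uses (a `From <hash>` line followed by a `From: Name <email>` line).

```javascript
// Hypothetical helper (not from the original scraper): given the text of a
// commit's `.patch` page, return the address inside the angle brackets of the
// "From:" header, or null if no such header is found.
function extractAuthorEmail(patchText) {
  for (const line of patchText.split("\n")) {
    // The headers end at the first blank line, so stop looking there.
    if (line.trim() === "") break;
    const match = line.match(/^From: .*<([^<>\s]+@[^<>\s]+)>\s*$/);
    if (match) return match[1];
  }
  return null;
}

// Made-up sample that imitates the first lines of a `.patch` page.
const samplePatch = [
  "From 0123456789abcdef0123456789abcdef01234567 Mon Sep 17 00:00:00 2001",
  "From: Mona Lisa <mona@example.com>",
  "Date: Sun, 30 Jun 2024 12:00:00 -0700",
  "Subject: [PATCH] Refactor usability tests.",
  "",
  "---",
].join("\n");

console.log(extractAuthorEmail(samplePatch)); // mona@example.com
```

The first line also starts with `From`, but without a colon, so the pattern skips it and only the second line matches.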

With these types of technical projects that work with big data, social conduct should be taken seriously. I don't want to end up in a similar situation to [this guy](https://github.com/EpicGames/Signup/pull/24) and piss off 100,000 people by publicly leaking their email addresses. So, I decided to only co-author private email addresses that are only valid within GitHub.
With these types of technical projects that work with big data, social conduct should be taken seriously. I don't want to end up in a similar situation as [this guy](https://github.com/EpicGames/Signup/pull/24) and piss off 100,000 people by publicly leaking their email addresses. So, I decided to only co-author private email addresses that are only valid within GitHub.

Admittedly, a small number of real email addresses were committed at the beginning, but only for testing purposes. I believe that this number is small enough to be insignificant and inconsequential.

@@ -49,15 +49,17 @@ How would I find 100,000 users? GitHub provides a Search API for querying for us
Putting everything together, my API scraper will:

1. First use the Search API to find the most followed users on GitHub
2. Use the followers API endpoint to loop through each user's followers *(In retrospect, another approach could be a web crawling algorithm)*
2. Use the followers API endpoint to loop through each user's followers *(In retrospect, another approach could be utilizing a web crawling algorithm)*
3. Use the "hack" as described earlier to find each follower's email address
4. Filter each email address for private email addresses only valid within GitHub and format the co-author message
4. Filter each email address for private email addresses and format the co-author message
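
Step 4 of the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scraper's real code: `isPrivateGitHubEmail`, `formatCoAuthors`, and the sample users are hypothetical, and "private" is taken to mean an address ending in GitHub's `users.noreply.github.com` domain.

```javascript
// Assumption: GitHub's private (noreply) addresses all end with this domain.
const NOREPLY_SUFFIX = "@users.noreply.github.com";

// Hypothetical helper: true only for addresses that are valid within GitHub.
function isPrivateGitHubEmail(email) {
  return email.toLowerCase().endsWith(NOREPLY_SUFFIX);
}

// Hypothetical helper: drop users without a private address and format the
// rest as co-author lines for the end of a commit message.
function formatCoAuthors(users) {
  return users
    .filter((user) => user.email && isPrivateGitHubEmail(user.email))
    .map((user) => `Co-authored-by: ${user.login} <${user.email}>`);
}

// Made-up users for illustration only.
const followers = [
  { login: "alice", email: "1234567+alice@users.noreply.github.com" },
  { login: "bob", email: "bob@example.com" }, // real address: filtered out
  { login: "carol", email: null }, // no address found: filtered out
];

console.log(formatCoAuthors(followers).join("\n"));
// Co-authored-by: alice <1234567+alice@users.noreply.github.com>
```

Matching on the domain suffix keeps the filter simple and leaves real addresses out of the commit message.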

*The situation here is strange. For reasons unknown, the GitHub GraphQL API straight up doesn't expose the majority of users' public emails... UNLESS you use the GitHub REST API; in which case only this works... UNLESS you use the [GitHub GraphQL explorer](https://docs.github.com/en/graphql/overview/explorer) which seems to work with perfect accuracy. It's really weird.
Hopefully you can see that, with a bit of ingenuity, the idea is no longer as Herculean a task as it first seemed.

*The situation here is strange. For reasons unknown, the GitHub GraphQL API straight up doesn't expose the majority of users' public emails... UNLESS you use the GitHub REST API; in which case only this method works... UNLESS you use the [GitHub GraphQL explorer](https://docs.github.com/en/graphql/overview/explorer) which seems to work with the same level of accuracy. I know I'm not going crazy. It's really weird.

# The Script

I decided to write the scraper in JavaScript so I could utilize [octokit.js](https://github.com/octokit/octokit.js), an API wrapper that provides useful built-in functionality such as throttling and retrying. I also decided to use GitHub's GraphQL API instead of their REST API because multiple GraphQL queries could be batched in a single request.
I decided to write the scraper in JavaScript so I could utilize [octokit.js](https://github.com/octokit/octokit.js), an API wrapper that provides useful built-in functionality such as throttling and retrying. I also decided to use GitHub's GraphQL API instead of their REST API because multiple GraphQL queries could be batched in a single request, along with other smaller efficiencies.

After hacking up a prototype, I tested my first commit.

@@ -76,15 +78,15 @@ I didn't feel like running it locally because I knew from testing it would take
[1] 104705
```

I think it's worth mentioning how surprisingly lenient the GraphQL API rate limits are. Even though my scraper was continuously and asynchronously parallelizing multiple requests at once, the highest I ever saw the GraphQL hourly rate limit quota reach was 1,500 out of 5,000 points.
I think it's worth mentioning how surprisingly lenient the GraphQL API rate limits are. Even though my scraper was continuously and asynchronously parallelizing multiple batched queries at once, the highest I ever saw the GraphQL hourly rate limit quota reach was 1,500 out of 5,000 points.
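
For readers unfamiliar with batching, one way it could look is a single GraphQL document that aliases several `user` lookups. This is only a sketch: `buildBatchedUserQuery` is a hypothetical helper, the selected fields (`id`, `login`) are chosen for illustration, and the real scraper's queries are not shown in this post.

```javascript
// Hypothetical helper: build one GraphQL document that looks up many users
// at once by giving each `user` field its own alias (u0, u1, ...).
function buildBatchedUserQuery(logins) {
  const fields = logins.map(
    // JSON.stringify quotes and escapes the login as a string literal.
    (login, i) => `  u${i}: user(login: ${JSON.stringify(login)}) { id login }`
  );
  return `query {\n${fields.join("\n")}\n}`;
}

console.log(buildBatchedUserQuery(["alice", "bob"]));
// query {
//   u0: user(login: "alice") { id login }
//   u1: user(login: "bob") { id login }
// }
```

Aliases are needed because a GraphQL selection cannot request the same field twice with different arguments under one name.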

Nine hours later, after combing through approximately half a million GitHub users, my scraper yielded 100,000 co-authors, each ready to become a contributor. Hooray!

I'm so sorry if it feels like I'm cutting you off, but I first want to discuss a few API anomalies from the error log before we get to the committing. Trust me, what I've found is equally, if not more, interesting.

# The mystery of U_kgDOAMbr8w

Out of those half a million GitHub users, exactly two of them always crash the GraphQL API. That's a *0.0004%* fallthrough rate. Following some investigation, I was able to narrow down the crashing to any valid query containing the substring `history(author: {id: "U_kgDOAMbr8w"})`, where `id` is the user's GraphQL node id. If you want to try yourself, execute this on the [GitHub GraphQL explorer](https://docs.github.com/en/graphql/overview/explorer):
Out of those half a million users, exactly two of them always crash the GraphQL API. That's a *0.0004%* fallthrough rate. Following some investigation, I was able to narrow down the crashing to any valid query containing the substring `history(author: {id: "U_kgDOAMbr8w"})`, where `id` is the user's GraphQL node id. If you want to try yourself, execute this on the [GitHub GraphQL explorer](https://docs.github.com/en/graphql/overview/explorer):

```graphql
query {
@@ -112,11 +114,11 @@ I will provide an update to this section when my ticket receives a response.

Have you ever seen a corrupted repository?

<ContentImage src={corruptedRepository} desc="I won't be revealing the author nor the link to respect their privacy" alt="A repository that is corrupted"></ContentImage>
<ContentImage src={corruptedRepository} desc="I won't be revealing the author nor the link to respect their privacy" alt="A repository that is corrupted" width="700"></ContentImage>

Well, it ended up crashing my script halfway through. Having to re-run it was more annoying than it should have been.

It turns out you can still clone this repository. Checking the commit history presents something interesting.
It turns out you can still clone this repository. Inspecting the commit history presents something interesting.

```bash
Desktop % git clone https://github.com/.../MyWebsite.git
@@ -127,7 +129,7 @@ Date: Fri Dec 2 10:14:22 3194 +25627400
...
```

I knew from a prior CTF competition that GitHub is [perfectly fine](https://github.com/l3rnds/Ft_IRC/commits/main/) with future commit dates. I didn't want to mess around with the chance to corrupt my own account, so I created a test account and replicated the date of this commit.
From a prior CTF competition, I already knew that GitHub was [perfectly fine](https://github.com/l3rnds/Ft_IRC/commits/main/) with future commit dates. I didn't want to accidentally corrupt my own GitHub account, so I created a test account and tried to replicate this commit.

```bash
Test % git commit --allow-empty -m "Testing"
@@ -139,15 +141,15 @@ Test % git update-ref refs/heads/main $new_commit
Test % git push
```

This only works on Git v2.38.2 or earlier, don't ask me how I know that. Sure enough, after pushing, my test repository nuked itself with the same message. I was able to pinpoint this behavior to the invalid UTC offset on the commit. I took it a step further and wanted to see what would happen if I opened a pull request that referenced this commit. Feast your eyes on...
This only works on Git v2.38.2 or earlier, don't ask me how I know that. Sure enough, after pushing, my test repository similarly nuked itself with the same message. I was able to pinpoint this behavior to the invalid UTC offset on the commit date; it seems like GitHub isn't able to properly parse it. I took this a step further and wanted to see what would happen if I opened a pull request that referenced this commit. Feast your eyes on...

<ContentImage src={github500} width={null} alt="GitHub throwing an error 500"></ContentImage>
<ContentImage src={github500} width={null} alt="GitHub throwing an internal server error"></ContentImage>

I really wish I could say I found a denial-of-service PoC and [became $1,000 (or more) richer](https://hackerone.com/github#user-content-performing-your-research). Imagine how cool of a resolution that would sound! Like before, following triage with some of my other programming friends, I surrendered empty-handed and filed a bug report to GitHub support.
I really wish I could say I found a denial-of-service PoC and [became $1,000 (or more) richer](https://hackerone.com/github#user-content-performing-your-research). Imagine how cool of a resolution that would sound! Like before, following triage with some of my friends, I surrendered empty-handed and filed a bug report to GitHub support.

# 100,000 Contributors

Where were we? Oh right, apologies for the side-tangents, I sometimes tend to get distracted. Now that we have all of our co-authors, let's get the party started!
Where were we? Oh right, apologies for the side-tangents, I tend to get distracted. Now that we have all of our co-authors, let's get the party started!

```bash
everyone % scp [email protected]:everyone/results.txt .
@@ -190,7 +192,7 @@ echo "Co-authors successfully processed!"

After running the script, around half an hour later:

<ContentImage src={_100000Contributors} desc="The last 14 contributors are represented by the icons up top. Though, this number seems to gradually decrease over time, probably because changing email addresses can cause it to fluctuate" width="500" alt="A screenshot showing 100,000 contributors"></ContentImage>
<ContentImage src={_100000Contributors} desc="The last 14 contributors are on the icons up top" width="500" alt="A screenshot showing 100,000 contributors"></ContentImage>

Booyah! \*drops mic\*
