|
| 1 | +# SemHub |
| 2 | + |
| 3 | +## About |
| 4 | + |
| 5 | +We were not satisfied with the default search experience for GitHub issues and wanted to see if a semantic search powered by embeddings would perform better. We are open-sourcing this project to share our attempt with the community. For further details: |
| 6 | + |
| 7 | +- [essay](https://tzx.notion.site/What-I-Learned-Building-a-Free-Semantic-Search-Tool-for-GitHub-and-Why-I-Failed-1a09b742c7918033b318f3a5d7dc9751) |
| 8 | +- [HN discussion](https://news.ycombinator.com/item?id=43299659) |
| 9 | +- [X thread](https://x.com/zxt_tzx/status/1896731663801131180) |
| 10 | + |
| 11 | +This is an experimental project by [Coder.com](https://coder.com). |
| 12 | + |
| 13 | +## Development |
| 14 | + |
| 15 | +To develop using this repo, make sure you have installed the following: |
| 16 | + |
| 17 | +- [Bun](https://bun.sh/docs/installation) |
| 18 | +- [SST](https://sst.dev) |
| 19 | + |
| 20 | +### Monorepo |
| 21 | + |
| 22 | +1. `core/` |
| 23 | + |
| 24 | + This is for any shared code. |
| 25 | + |
| 26 | +2. `workers/` |
| 27 | + |
| 28 | + This is for your Cloudflare Workers and it uses the `core` package as a local |
| 29 | + dependency. |
| 30 | + |
| 31 | +3. `scripts/` |
| 32 | + |
| 33 | + This is for any scripts that you can run on your SST app using the |
| 34 | + `sst shell` CLI. |
| 35 | + |
| 36 | +4. `wrangler/` |
| 37 | + |
| 38 | + This is for Cloudflare resources that are deployed via `wrangler`. We use this for Cloudflare resources that cannot be deployed via Pulumi/SST. `wrangler` also provides more configurability. |
| 39 | + |
| 40 | + - We use Cloudflare Workflows to orchestrate the sync process. See [the README](./packages/wrangler/README.md) for more details. |
| 41 | + |
| 42 | +The `infra/` directory allows you to logically split the infrastructure of your app into separate files. This can be helpful as your app grows. |
| 43 | + |
| 44 | +### Environment variables |
| 45 | + |
| 46 | +You need the following environment variables (see `.env.example`) and secrets (see `.secrets.example`): |
| 47 | + |
| 48 | +- `CLOUDFLARE_ACCOUNT_ID`: Your Cloudflare [account ID](https://developers.cloudflare.com/fundamentals/setup/find-account-and-zone-ids/). (This may not be strictly required.) |
| 49 | +- `CLOUDFLARE_API_TOKEN`: Cloudflare API token used to deploy Cloudflare Workers and manage DNS. |
| 50 | + |
| 51 | +We currently also use AWS to deploy the frontend, but this is temporary and will be replaced by Cloudflare in the future. |
| 52 | + |
| 53 | +### Secrets |
| 54 | + |
| 55 | +Copy `.secrets.example` to `.secrets` and `.env.example` to `.env`, then fill in the values described above. To load the secrets into SST, run `bun secret:load`. |
| 56 | + |
| 57 | +### Mobile |
| 58 | + |
| 59 | +To test on mobile, use Ngrok to create a tunnel to your local frontend: |
| 60 | + |
| 61 | +```zsh |
| 62 | +ngrok http 3001 |
| 63 | +``` |
| 64 | + |
| 65 | +### Auth and cookies on local development |
| 66 | + |
| 67 | +Auth in local development takes some extra setup because the frontend runs locally while the API server is on a `.semhub.dev` domain. To be able to set cookies, you need to: |
| 68 | + |
| 69 | +1. Edit your `/etc/hosts` file to add a new entry for `local.semhub.dev` that points to `127.0.0.1` |
| 70 | +2. Install and set up mkcert: |
| 71 | + |
| 72 | + ```bash |
| 73 | + brew install mkcert |
| 74 | + mkcert -install |
| 75 | + ``` |
| 76 | + |
| 77 | +3. Generate the local certificates: |
| 78 | + |
| 79 | + ```bash |
| 80 | + mkcert local.semhub.dev |
| 81 | + ``` |
| 82 | + |
| 83 | + This will create two files: `local.semhub.dev-key.pem` and `local.semhub.dev.pem` |
| 84 | + |
| 85 | +If you look at `vite.config.ts`, you will see that we reference these certificates to provide HTTPS for local development. |
| 86 | + |
| 87 | +### OAuth |
| 88 | + |
| 89 | +We use a GitHub App (instead of an OAuth App) for [these reasons](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/differences-between-github-apps-and-oauth-apps) (more granular control, scaling with the number of users, etc.). We use separate GitHub Apps for dev and prod (the production one lives in the `coder` organization). |
| 90 | + |
| 91 | +To set up a GitHub App: |
| 92 | + |
| 93 | +- [Register a GitHub App](https://docs.github.com/en/apps/creating-github-apps/registering-a-github-app/registering-a-github-app) (dev one can be within your personal account, the [prod one](https://github.com/organizations/coder/settings/apps/coder-semhub) is within the `coder` organization) |
| 94 | + - In terms of permissions: |
| 95 | + - Select the following read-only Repository permissions: Metadata (mandatory), Discussions, Issues, Pull Requests, Contents. (These should be tracked in code via `github-app.ts`.) |
| 96 | + - Select the following read-only User permissions: Emails (strictly speaking, we would already have gotten the user's email from the login process) |
| 97 | + - Select the following read-only Organization permissions: Members (to enable SemHub to work for users in the same organization after it has been installed by an admin) |
| 98 | + - Leave unchecked the box that says "Request user authorization (OAuth) during installation". Our app handles user login + creation. |
| 99 | + - Select redirect on update and use the frontend `/repos` page as the Setup URL |
| 100 | + - Local dev: `https://local.semhub.dev:3001/repos` |
| 101 | + - Prod: `https://semhub.dev/repos` |
| 102 | + - Callback URL is: `https://auth.[stage].stg.semhub.dev/github-login/callback` (see `packages/workers/src/auth/auth.constant.ts`) |
| 103 | + - Webhook URL is: `https://api.[stage].stg.semhub.dev/api/webhook/github`. The webhook secret is automatically generated by SST and can be revealed by modifying `outputs` in `infra/Secret.ts`. Installation events are sent to this webhook automatically, so there is no need to subscribe manually. See [here](https://docs.github.com/en/webhooks/webhook-events-and-payloads#installation). Unlike the callback URL, there can only be one webhook URL per app. |
| 104 | +- Generate and save the private key. NB the default format downloaded from GitHub is PKCS#1, but Octokit uses PKCS#8. You can convert the key using OpenSSL: `openssl pkcs8 -topk8 -inform PEM -in private-key.pem -outform PEM -out private-key-pkcs8.pem -nocrypt`. |
| 105 | +- Create a GitHub Client ID and Secret and load them into the `.secrets.dev` file |
| 106 | +- Go to Optional features and uncheck "User-to-server token expiration" |
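
To check that the key conversion described above took effect, it can help to look at the PEM header: PKCS#1 RSA keys begin with `-----BEGIN RSA PRIVATE KEY-----`, while unencrypted PKCS#8 keys begin with `-----BEGIN PRIVATE KEY-----`. The following is a minimal sketch of such a check. `pemKeyFormat` is a hypothetical helper, not part of this repo or of Octokit.

```typescript
type PemKeyFormat = "pkcs1" | "pkcs8" | "unknown";

// Hypothetical helper: classify a private key by its PEM header line.
// It only inspects the header; it does not parse or validate the key body.
function pemKeyFormat(pem: string): PemKeyFormat {
  const firstLine = (pem.trim().split(/\r?\n/)[0] ?? "").trim();
  if (firstLine === "-----BEGIN RSA PRIVATE KEY-----") return "pkcs1";
  if (firstLine === "-----BEGIN PRIVATE KEY-----") return "pkcs8";
  return "unknown";
}
```

A key freshly downloaded from GitHub would be expected to report `pkcs1`, and the output of the `openssl pkcs8 -topk8 ... -nocrypt` command should report `pkcs8`.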
| 107 | + |
| 108 | +Note that when you use a GitHub App on a personal account, the warning message on the authorization page is misleading. See [this thread](https://github.com/orgs/community/discussions/37117). |
| 109 | + |
| 110 | +## Deployment |
| 111 | + |
| 112 | +Right now, deployment is manual. Eventually, we will set up GitHub Actions to automate this. |
| 113 | + |
| 114 | +### Deploying to new environment |
| 115 | + |
| 116 | +To deploy a given change to a new environment: |
| 117 | + |
| 118 | +1. Load secrets. From the root folder, run `bun secret:load:<env>`. |
| 119 | +1. Run `sst deploy --stage <env>` first to create state in SST. This will fail. |
| 120 | +1. Deploy Cloudflare resources. From `/packages/wrangler`, run `bun run deploy:all:<env>`. |
| 121 | +1. Run database migrations for the target environment. From the `core` folder, run `bun db:migrate:<env>`. |
| 122 | +1. Deploy SST resources again. This time it should succeed. |
| 123 | + |
| 124 | +We should probably set up a script to do this automatically as part of CI/CD. |
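
As a starting point for such a script, the manual steps above can be written down as data. This is only a sketch: `DeployStep` and `deployPlan` are hypothetical names, and the `packages/core` path is an assumption (the steps above only say "`core` folder"). It builds the ordered command list and runs nothing.

```typescript
interface DeployStep {
  cwd: string; // directory to run the command from, relative to the repo root
  command: string;
  expectedToFail: boolean; // the first `sst deploy` is documented to fail
}

// Hypothetical helper: the five manual steps for a given environment, in order.
function deployPlan(env: string): DeployStep[] {
  const sstDeploy = `sst deploy --stage ${env}`;
  return [
    { cwd: ".", command: `bun secret:load:${env}`, expectedToFail: false },
    { cwd: ".", command: sstDeploy, expectedToFail: true },
    { cwd: "packages/wrangler", command: `bun run deploy:all:${env}`, expectedToFail: false },
    // Assumption: the `core` folder lives at `packages/core`.
    { cwd: "packages/core", command: `bun db:migrate:${env}`, expectedToFail: false },
    { cwd: ".", command: sstDeploy, expectedToFail: false },
  ];
}

// Print the plan for a hypothetical `stg` environment.
for (const step of deployPlan("stg")) {
  console.log(`${step.cwd}: ${step.command}`);
}
```

A CI job could pass each step to a process runner and tolerate a non-zero exit code only where `expectedToFail` is set.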
| 125 | + |
| 126 | +## Todos |
| 127 | + |
| 128 | +1. Deal with users who install our GitHub App without creating an account first. |
| 129 | +1. The current codebase assumes that a repo's private/public status and a user's org membership are static. We need to account for changes. (Currently, we query membership only when a subscription is made. We should either receive a webhook or query regularly, so that users who have left an org lose access to its private repos.) |
| 130 | +1. Need to account for `no_issues` repos that later get issues. E.g. just run a daily cron to check? |
| 131 | + |
| 132 | +## Known issues |
| 133 | + |
| 134 | +1. When bulk inserting using Drizzle, make sure that the array in `values()` is not empty. Hence the various checks that either return early if the array is empty or make such insertions conditional. If we accidentally pass an empty array, an error will be thrown, disrupting the control flow. TODO: enforce this by using ESLint? |
| 135 | +1. We need some way to deal with error logging. Logging for SST-deployed workers is off by default (it can be turned on via the console, but it will be overridden at the next update). At scale, we will need to set something up so that we are informed of unknown errors. |
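
The empty-array guard described in the first known issue can be sketched as a small type guard. `NonEmptyArray`, `isNonEmpty`, and `insertIfAny` are hypothetical names, not taken from this codebase. The `insert` callback stands in for a Drizzle call such as `db.insert(table).values(rows)`.

```typescript
// An array type that is guaranteed to hold at least one element.
type NonEmptyArray<T> = [T, ...T[]];

// Type guard: narrows `T[]` to `NonEmptyArray<T>` when it has elements.
function isNonEmpty<T>(rows: T[]): rows is NonEmptyArray<T> {
  return rows.length > 0;
}

// Hypothetical wrapper: skips the insert entirely for an empty batch and
// returns the number of rows handed to the insert callback.
async function insertIfAny<T>(
  rows: T[],
  insert: (rows: NonEmptyArray<T>) => Promise<unknown>,
): Promise<number> {
  if (!isNonEmpty(rows)) return 0; // early return: never call insert with []
  await insert(rows);
  return rows.length;
}
```

Because the callback only accepts `NonEmptyArray<T>`, passing an unchecked `T[]` to it is a compile-time error, which is one possible alternative to enforcing the rule with ESLint.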