Update llm-inference-on-edge.md #2732


Merged · 1 commit · Apr 16, 2025
8 changes: 4 additions & 4 deletions llm-inference-on-edge.md
@@ -264,7 +264,7 @@ Let's run the app on our emulator/simulator as we showed before so we can start

### State Management

- We will start by deleting everyting from the `App.tsx` file, and creating an empty code structure like the following :
+ We will start by deleting everything from the `App.tsx` file and creating an empty code structure like the following:

<details>
<summary><b>App.tsx</b></summary>
@@ -995,15 +995,15 @@ The conversation page should now look like this:

Congratulations, you've just built the core functionality of your first AI chatbot! The code is available [here](https://github.com/MekkCyber/EdgeLLM/blob/main/EdgeLLMBasic/App.tsx). You can now start adding more features to make the app more user-friendly and efficient.

- ### The other Functionnalities
+ ### The other Functionalities

The app is now fully functional: you can download a model, select a GGUF file, and chat with the model, but the user experience could be better. In the [`EdgeLLMPlus`](https://github.com/MekkCyber/EdgeLLM/tree/main/EdgeLLMPlus) repo, I've added some other features, such as on-the-fly generation, automatic scrolling, inference speed tracking, and displaying the thought process of reasoning models like deepseek-qwen-1.5B. We won't go into every detail here, as that would make the blog too long, but we will go through some of the ideas and how to implement them; the whole code is available in the repo.

#### Generation on the fly
The app generates responses incrementally, producing one token at a time rather than delivering the entire response in a single batch. This approach enhances the user experience, allowing users to begin reading the response as it is being formed. We achieve this by utilizing a `callback` function within `context.completion`, which is triggered after each token is generated, enabling us to update the `conversation` state accordingly.
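
Here is a minimal sketch of what that could look like, assuming a llama.rn context and a callback that updates the `conversation` state with the partial assistant message (parameter names are illustrative, not necessarily the repo's exact code):

```typescript
import type { LlamaContext } from 'llama.rn';

// Stream tokens into the UI as they are produced, instead of waiting
// for the full completion. `onPartial` would update the `conversation`
// state with the partial assistant message on every token.
const streamResponse = async (
  context: LlamaContext,
  prompt: string,
  onPartial: (text: string) => void
) => {
  let text = '';
  const result = await context.completion(
    { prompt, n_predict: 512 },
    (data) => {
      // The callback fires once per generated token.
      text += data.token;
      onPartial(text);
    }
  );
  return result;
};
```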

#### Auto Scrolling
- Auto Scrolling ensures that the latest messages or tokens are always visible to the user by automatically scrolling the chat view to the bottom as new content is added. To implement that we need we use a reference to the `ScrollView` to allow programatic control over the scroll position, and we use the `scrollToEnd` method to scroll to the bottom of the `ScrollView` when a new message is added to the `conversation` state. We also define a `autoScrollEnabled` state variable that will be set to false when the user scrolls up more than 100px from the bottom of the `ScrollView`.
+ Auto Scrolling ensures that the latest messages or tokens are always visible to the user by automatically scrolling the chat view to the bottom as new content is added. To implement this, we use a reference to the `ScrollView` to allow programmatic control over the scroll position, and we call the `scrollToEnd` method when a new message is added to the `conversation` state. We also define an `autoScrollEnabled` state variable that is set to false when the user scrolls up more than 100px from the bottom of the `ScrollView`.
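
A rough sketch of the pieces involved, assuming they live inside the App component (handler names and the 100px threshold are illustrative, not the repo's exact code):

```typescript
import { useRef, useState } from 'react';
import {
  ScrollView,
  NativeScrollEvent,
  NativeSyntheticEvent,
} from 'react-native';

// Inside the App component:
const scrollViewRef = useRef<ScrollView>(null);
const [autoScrollEnabled, setAutoScrollEnabled] = useState(true);

// Disable auto-scroll once the user is more than 100px above the bottom.
const handleScroll = (e: NativeSyntheticEvent<NativeScrollEvent>) => {
  const { contentOffset, contentSize, layoutMeasurement } = e.nativeEvent;
  const distanceFromBottom =
    contentSize.height - contentOffset.y - layoutMeasurement.height;
  setAutoScrollEnabled(distanceFromBottom <= 100);
};

// Call after each new message/token is appended to `conversation`.
const scrollToBottom = () => {
  if (autoScrollEnabled) {
    scrollViewRef.current?.scrollToEnd({ animated: true });
  }
};
```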

#### Inference Speed Tracking
Inference Speed Tracking is a feature that tracks the time taken to generate each token and displays it under each message generated by the model. This feature is easy to implement because the `CompletionResult` object returned by the `context.completion` function contains a `timings` property, a dictionary with many metrics about the inference process. We can use the `predicted_per_second` metric to track the speed of the model.
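
For example, something along these lines (the `result.text` field and the returned shape are assumptions for illustration, not necessarily the repo's exact code):

```typescript
import type { LlamaContext } from 'llama.rn';

// Run a completion and extract the tokens-per-second metric from
// `timings`, to be displayed under the generated message.
const runCompletion = async (context: LlamaContext, prompt: string) => {
  const result = await context.completion({ prompt, n_predict: 512 });
  const tokensPerSecond = result.timings.predicted_per_second;
  return { text: result.text, tokensPerSecond };
};
```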
@@ -1107,4 +1107,4 @@ You now have a working React Native app that can:

This implementation serves as a foundation for building more sophisticated AI-powered mobile applications. Remember to consider device capabilities when selecting models and tuning parameters.

- Happy coding! 🚀
+ Happy coding! 🚀