# AI code editing benchmarks for interactive visuals
Test how different AI models perform on code tasks:
```bash
# Run all benchmarks with the default models
npm run benchmark

# Run a specific challenge
npm run benchmark -- --challenge stockPriceChart

# Specify which models to test
npm run benchmark -- --models gpt-4,claude-3

# Enable caching for faster development
npm run benchmark -- --cache
```
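The exact challenge schema isn't shown here, but a challenge such as `stockPriceChart` might be declared roughly like the following sketch. The names `Challenge`, `prompt`, and `validate` are illustrative assumptions, not the repo's actual API:

```typescript
// Hypothetical challenge definition -- the real schema may differ.
interface Challenge {
  id: string;              // used with --challenge, e.g. "stockPriceChart"
  prompt: string;          // task description sent to each model
  models?: string[];       // optional per-challenge override for --models
  validate: (renderedHtml: string) => boolean; // quick automated sanity check
}

const stockPriceChart: Challenge = {
  id: "stockPriceChart",
  prompt:
    "Render an interactive line chart of daily stock prices with hover tooltips.",
  // Minimal smoke test: the model's output produced an SVG at all.
  validate: (html) => html.includes("<svg"),
};

export default stockPriceChart;
```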
The benchmark system includes a grader for evaluating AI-generated visualizations:
```bash
# Launch the grader UI
npm run grade

# Focus on a specific challenge
npm run grade -- --challenge stockPriceChart
```
The grader walks you through each evaluation:

1. **Select Challenge**: choose from the available challenges in the dropdown.
2. **Browse Models**: navigate between the different AI models' solutions.
3. **Review Visualization**: see the rendered visualization and its screenshot.
4. **Inspect Code**: review the generated code.
5. **Assign Scores**:
   - Functionality (0-5): how well the solution meets the requirements
   - Aesthetics (0-5): visual appeal and usability
6. **Add Notes**: provide specific feedback.
7. **Submit Grade**: save the evaluation to the results database (a sketch of a possible record shape follows this list).
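The stored record format isn't documented here; as a rough sketch, "save evaluation to the results database" might persist something like the following. All field names are assumptions:

```typescript
// Hypothetical grade record -- field names are assumptions, not the real schema.
interface Grade {
  challengeId: string;   // e.g. "stockPriceChart"
  modelId: string;       // e.g. "gpt-4"
  functionality: number; // 0-5, per the rubric below
  aesthetics: number;    // 0-5, per the rubric below
  notes?: string;        // free-form reviewer feedback
  gradedAt: string;      // ISO timestamp
}

const example: Grade = {
  challengeId: "stockPriceChart",
  modelId: "gpt-4",
  functionality: 4,
  aesthetics: 3,
  notes: "Tooltips work, but axis labels overlap on narrow viewports.",
  gradedAt: new Date().toISOString(),
};
```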
**Functionality (0-5):**
- 0: Does not work
- 1: Major bugs
- 2: Works but misses some requirements
- 3: Meets the basic requirements
- 4: Implements all requirements well
- 5: Perfect implementation with extras
**Aesthetics (0-5):**
- 0: Unusable layout
- 1: Poor design
- 2: Basic appearance
- 3: Clean design
- 4: Well-designed with good UX
- 5: Exceptional design
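If you want a single number per model, one simple approach (an assumption, not something the repo prescribes; it may weight scores differently) is to average the two rubric scores per grade and then average across grades:

```typescript
// Hypothetical aggregation -- the repo may combine scores differently.
function averageScore(
  grades: { functionality: number; aesthetics: number }[],
): number {
  if (grades.length === 0) return 0;
  const total = grades.reduce(
    (sum, g) => sum + (g.functionality + g.aesthetics) / 2,
    0,
  );
  return total / grades.length; // mean of per-grade midpoints, still on the 0-5 scale
}

// Example: two graded solutions for one model.
console.log(averageScore([
  { functionality: 4, aesthetics: 3 },
  { functionality: 5, aesthetics: 4 },
])); // -> 4
```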