# Veylant IA — Load Tests (k6)
Performance tests for the Veylant proxy using [k6](https://k6.io).
## Prerequisites
```bash
brew install k6   # macOS
# or: https://k6.io/docs/getting-started/installation/
```
The proxy must be running: `make dev` (or point `VEYLANT_URL` at staging).
## Scripts
| Script | Description |
|--------|-------------|
| `k6-load-test.js` | Multi-scenario script (smoke / load / stress / soak) — **use this** |
| `load_test.js` | Sprint 10 single-scenario script (1 000 VUs, 8 min) — kept for reference |
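A multi-scenario script typically exposes one scenario at a time through an env var. A minimal sketch of that gate, assuming a `pickScenario` helper (a hypothetical name; the real logic lives in `k6-load-test.js`) — k6 makes `--env` values available on the global `__ENV` object:

```javascript
// Hypothetical sketch of the scenario gate in a multi-scenario k6 script.
// `env` stands in for k6's global __ENV object.
function pickScenario(env, scenarios) {
  const name = env.SCENARIO || "smoke"; // matches the CI default
  if (!(name in scenarios)) {
    throw new Error(`unknown SCENARIO: ${name}`);
  }
  // k6 runs every entry in options.scenarios, so return only the one requested.
  return { [name]: scenarios[name] };
}

// In the real script, roughly:
//   export const options = { scenarios: pickScenario(__ENV, allScenarios) };
```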
## Running tests
### Via Makefile (recommended)
```bash
make load-test                  # smoke scenario (CI default)
make load-test SCENARIO=load    # 50 VUs, 5 min steady state
make load-test SCENARIO=stress  # 0→200 VUs, find breaking point
make load-test SCENARIO=soak    # 20 VUs, 30 min (detect memory leaks)
```
### Via k6 directly
```bash
# Basic (load scenario, local proxy)
k6 run \
  --env VEYLANT_URL=http://localhost:8090 \
  --env VEYLANT_TOKEN=dev-token \
  --env SCENARIO=load \
  test/k6/k6-load-test.js

# Against staging
k6 run \
  --env VEYLANT_URL=https://api-staging.veylant.ai \
  --env VEYLANT_TOKEN=$STAGING_JWT \
  --env SCENARIO=stress \
  test/k6/k6-load-test.js

# With k6 Cloud output (requires K6_CLOUD_TOKEN)
k6 run --out cloud test/k6/k6-load-test.js
```
## Scenarios
| Scenario | VUs | Duration | Purpose |
|----------|-----|----------|---------|
| `smoke` | 1 | 1 min | Sanity check — runs in CI on every push |
| `load` | 0→50→0 | 7 min | Steady state: validates SLAs under normal load |
| `stress` | 0→200 | 7 min | Find the breaking point (target: > 200 VUs) |
| `soak` | 20 | 30 min | Detect memory leaks / slow GC under sustained load |
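These profiles map naturally onto k6 scenario executors. A sketch of what the `options.scenarios` definitions could look like, mirroring the table above — the executor names (`constant-vus`, `ramping-vus`) are real k6 executors, but the exact stage breakdown is an assumption; `k6-load-test.js` is authoritative:

```javascript
// Hypothetical k6 options.scenarios sketch mirroring the scenario table.
const scenarios = {
  smoke: { executor: "constant-vus", vus: 1, duration: "1m" },
  load: {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { target: 50, duration: "1m" }, // ramp up
      { target: 50, duration: "5m" }, // steady state
      { target: 0, duration: "1m" },  // ramp down
    ],
  },
  stress: {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [{ target: 200, duration: "7m" }], // climb until something breaks
  },
  soak: { executor: "constant-vus", vus: 20, duration: "30m" },
};
```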
## Thresholds (SLAs)
| Metric | Target |
|--------|--------|
| `http_req_duration p(99)` | < 500ms |
| `http_req_duration p(95)` | < 200ms |
| `http_req_failed` | < 1% |
| `veylant_chat_latency_ms p(99)` | < 500ms |
| `veylant_error_rate` | < 1% |
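In k6, SLAs like these are expressed as an `options.thresholds` map of threshold expressions. A sketch matching the table above (the authoritative values live in `k6-load-test.js`):

```javascript
// Hypothetical k6 options.thresholds sketch matching the SLA table.
const thresholds = {
  http_req_duration: ["p(95)<200", "p(99)<500"], // milliseconds
  http_req_failed: ["rate<0.01"],                // < 1% transport-level failures
  veylant_chat_latency_ms: ["p(99)<500"],
  veylant_error_rate: ["rate<0.01"],             // < 1% application-level errors
};
```

k6 aborts the run with a non-zero exit code when any threshold is crossed, which is what lets the CI job fail the build on an SLA breach.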
## Interpreting results
- **`http_req_duration`** — end-to-end latency, including the upstream LLM. For Ollama models on local hardware this includes model inference time.
- **`veylant_error_rate`** — tracks application-level errors (a non-200 response or a missing `choices` array).
- **`veylant_chat_errors`** / **`veylant_health_errors`** — absolute error counts per endpoint type.
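The application-level error check behind `veylant_error_rate` can be sketched as a small predicate — `isChatError` is a hypothetical name for illustration, not the function in the script:

```javascript
// Hypothetical sketch: a chat response counts as an application-level error
// when the status is not 200 or the body lacks a non-empty `choices` array.
function isChatError(status, body) {
  if (status !== 200) return true;
  try {
    const parsed = typeof body === "string" ? JSON.parse(body) : body;
    return !Array.isArray(parsed.choices) || parsed.choices.length === 0;
  } catch (_) {
    return true; // an unparseable body also counts as an error
  }
}
```

In a k6 script the result of such a check would be fed into a `Rate` custom metric on every iteration.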
A passing run looks like:
```
✓ http_req_duration.............: avg=42ms p(95)=112ms p(99)=287ms
✓ http_req_failed...............: 0.00%
✓ veylant_error_rate............: 0.00%
```
## CI integration
The `smoke` scenario runs automatically in the GitHub Actions `load-test` job (see `.github/workflows/ci.yml`). The job uses a mock Ollama that returns static responses to ensure deterministic latency.