For benchmarking the APIs, three JMeter clusters (1 master + 8 slaves per cluster) were set up to run the API tests and verify improvements in parallel.
1. Individual API benchmarks
- These were captured after optimizations were applied to the individual APIs.
- Introduced Redis caching for read APIs. On a cache miss, the API fetches from the system of record and populates the cache (a minimal sketch follows this list)
- Configured dedicated Akka dispatchers for key APIs
- Tuned the dispatcher configuration to optimize for throughput (see the configuration sketch after the table below)
- The tests were run with different server counts to establish that the APIs can scale horizontally
- Empty cells for a server count mean that data capture was abandoned because the objective of the test was already met by an earlier run
- Each API was invoked directly, bypassing the proxy & API manager
- Each API test was run for at least 15 minutes
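As a reference for the caching change above, here is a minimal read-through cache sketch in Java, assuming a Jedis client; the key scheme, TTL and `fetchFromSystemOfRecord` helper are hypothetical, not the actual service code.

```java
import redis.clients.jedis.Jedis;

public class ContentReadCache {

    private static final int TTL_SECONDS = 600; // hypothetical TTL; tune per API

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String read(String contentId) {
        String cacheKey = "content:" + contentId; // hypothetical key scheme
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached; // cache hit: serve from Redis
        }
        // cache miss: fall back to the system of record, then populate the cache
        String fresh = fetchFromSystemOfRecord(contentId);
        jedis.setex(cacheKey, TTL_SECONDS, fresh);
        return fresh;
    }

    private String fetchFromSystemOfRecord(String contentId) {
        // placeholder for the actual read from the backing store
        return "{\"id\":\"" + contentId + "\"}";
    }
}
```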
API | 1 Server (throughput/sec @ avg resp time in ms) | 2 Servers (throughput/sec @ avg resp time in ms) | 3 Servers (throughput/sec @ avg resp time in ms) |
---|---|---|---|
content/v1/read | 2459.7 @ 92 | 4560.4 @ 45 | |
course/v1/hierarchy | 959.5 @ 241 | 1642.2 @ 119 | |
framework/v1/read | 2153 @ 38 | 3604 @ 20 | |
channel/v1/read | 1247.3 @ 182 | 1681.7 @ 123 | |
content/v1/search | 899.6 @ 261 | 1592.1 @ 125 | |
data/v1/telemetry | 285.3 @ 389 | 497 @ 211 | |
device/v3/register | 536.1 @ 426 | 923.9 @ 228 | |
org/v1/search | 553.6 @ 398 | 874.4 @ 238 | |
data/v1/form/read | 386.9 @ 582 | 764.1 @ 281 | |
/v1/tenant/info | 779.2 @ 275 | 1460.1 @ 133 | |
/data/v1/page/assemble | 443.8 @ 506 | 449.1 @ 500 | |
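As a reference for the dispatcher changes above, a dedicated Akka dispatcher is declared in `application.conf` along these lines. The dispatcher name and pool sizes are illustrative, not the values used in these tests; the `throughput` setting controls how many messages an actor processes before yielding its thread, so raising it trades latency fairness for throughput.

```hocon
# Hypothetical dispatcher for a hot API's actors
content-read-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8
    parallelism-factor = 2.0
    parallelism-max = 32
  }
  # messages an actor processes before yielding the thread;
  # raising this optimizes for throughput over fairness
  throughput = 100
}
```

An actor is then bound to it with `Props.create(SomeActor.class).withDispatcher("content-read-dispatcher")`.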
2. APIs invoked in sequence via the Proxy & API Manager
- All times are in milliseconds
- Test duration was 36 minutes
- The proxy was invoked via the intranet, with JMeter servers running in the same network
- Number of threads - 600
- Number of replicas for the test - 8 Proxy, 4 Kong, 6 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform servers, 2 Search servers
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
ContentRead | 180000 | 0 | 981.77 | 2171.95 | 3161.00 | 82.59 |
PageAssemble | 180000 | 0 | 963.76 | 2158.00 | 3156.00 | 82.59 |
DialSearch | 180000 | 0 | 912.69 | 2136.00 | 3157.00 | 82.56 |
FormRead | 180000 | 0 | 622.08 | 2048.00 | 3063.99 | 82.55 |
OrgSearch | 180000 | 2 | 869.24 | 2111.00 | 3131.99 | 82.48 |
SendTelemetry | 180000 | 0 | 607.84 | 2064.00 | 3075.00 | 82.47 |
TenantInfo | 180000 | 0 | 55.21 | 95.00 | 1048.00 | 82.48 |
ContentHierarchy | 180000 | 0 | 1082.27 | 2242.00 | 3219.98 | 82.59 |
Result analysis & findings
The above results were not very encouraging, so we analyzed further and discovered the following issues:
- The proxy was using HTTP/1.0 instead of HTTP/1.1 to connect to upstream systems; HTTP/1.0 does not support keep-alive connections (a configuration sketch follows this list)
- All the sub-systems (content service, API manager, learner service, proxy) were creating a new connection for every service call
- OPTIONS calls were not handled by the proxy and would travel all the way to the actual service
- Some of the services were logging large amounts of unnecessary information, which was hurting performance
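For context, the first two issues are typically fixed in an nginx-based proxy with a configuration change along these lines (the upstream name and port are illustrative): nginx defaults to HTTP/1.0 for proxied requests, and upstream keep-alive additionally requires an `upstream` block with a `keepalive` pool and a cleared `Connection` header.

```nginx
# Hypothetical upstream for the API manager (Kong)
upstream api_manager {
    server kong:8000;
    keepalive 64;                       # idle keep-alive connections kept per worker
}

server {
    location /api/ {
        proxy_http_version 1.1;         # nginx proxies with HTTP/1.0 unless told otherwise
        proxy_set_header Connection ""; # drop "Connection: close" so upstream keep-alive works
        proxy_pass http://api_manager;
    }
}
```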
3. APIs invoked in sequence via the Proxy & API Manager (after optimization)
- All times are in milliseconds
- Test duration was 22 minutes
- The proxy was invoked via the intranet, with JMeter servers running in the same network
- Number of threads - 600
- Number of replicas for the test - 8 Proxy, 4 Kong, 6 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform servers, 2 Search servers
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
ContentRead | 360000 | 0 | 421.29 | 1263.00 | 1810.97 | 279.30 |
PageAssemble | 360000 | 0 | 283.20 | 794.00 | 1270.97 | 279.30 |
DialSearch | 360000 | 0 | 301.07 | 956.00 | 1364.99 | 279.31 |
FormRead | 360000 | 0 | 92.02 | 212.00 | 444.99 | 279.32 |
OrgSearch | 360000 | 1 | 86.14 | 181.00 | 337.99 | 279.32 |
SendTelemetry | 360000 | 0 | 142.74 | 310.00 | 639.00 | 279.32 |
TenantInfo | 360000 | 0 | 35.24 | 113.00 | 205.99 | 279.32 |
ContentHierarchy | 360000 | 0 | 632.79 | 1789.95 | 2420.99 | 279.28 |
Optimizations done
- The content service was closing client connections explicitly and was not using keep-alive when invoking the knowledge platform. These issues were fixed with 3-4 lines of code changes
- The proxy was not using keep-alive connections to its upstreams - the portal & API manager. This was fixed in the portal through a configuration change that also moved it from HTTP/1.0 to HTTP/1.1
- We were using Kong 0.9 as our API manager, and this version did not support keep-alive connections. We had to upgrade to Kong 0.10 to enable keep-alive connections from Kong to the content service, portal, telemetry service and learner service
- Proxy rules were updated to respond to OPTIONS calls without forwarding the request to the upstream service (a minimal sketch follows this list)
- We increased the configured memory for the content service from 750 MB to 1 GB, as this improved the throughput
- On upgrading to Kong 0.10, we started getting occasional timeouts for upstream services; this was resolved as follows:
  - We found that the Play framework (Learner service, Play 2.4.x) does not close connections for close to an hour, while Kong's connections to Play time out after 15 minutes because of the IPVS default timeout of 15 minutes. Docker Swarm offers no way to change this timeout, so we switched the Play applications to the tasks.<service-name> DNS endpoint, which bypasses the IPVS layer and thus avoids the timeout
  - NodeJS applications close idle sockets after 5 seconds by default. This was leading to race conditions where Kong would send a new request just as the socket was being closed by NodeJS. This was resolved by increasing the idle timeout to 5 minutes for all inter-swarm communication
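The OPTIONS short-circuit mentioned above can be expressed in nginx roughly as follows; the CORS headers are illustrative and would need to match whatever the services actually return:

```nginx
location /api/ {
    # answer CORS preflight at the proxy instead of forwarding it upstream
    if ($request_method = OPTIONS) {
        add_header Access-Control-Allow-Origin "*";
        add_header Access-Control-Allow-Methods "GET, POST, PATCH, DELETE, OPTIONS";
        add_header Access-Control-Allow-Headers "Authorization, Content-Type";
        return 204;                  # no body needed for a preflight response
    }
    proxy_pass http://api_manager;
}
```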
4. Benchmarking the APIs after increasing the replicas of the proxy, content service & API manager (long-running test)
- All times are in milliseconds
- Test duration was 9 hours & 20 minutes
- The proxy was invoked via the intranet, with JMeter servers running in the same network
- Number of threads - 600
- Number of replicas for the test - 10 Proxy, 6 Kong, 8 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform servers, 2 Search servers
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
ContentHierarchy | 12000000 | 2 | 370.13 | 251.00 | 407.00 | 353.94 |
ContentRead | 12000000 | 0 | 182.96 | 116.00 | 198.00 | 353.94 |
DialAssemble | 12000000 | 10 | 279.11 | 169.00 | 251.00 | 353.94 |
DialSearch | 12000000 | 1 | 207.79 | 124.95 | 202.00 | 353.94 |
FormRead | 12000000 | 269 | 272.33 | 663.95 | 1185.99 | 353.94 |
OrgSearch | 12000000 | 9 | 84.38 | 75.00 | 133.99 | 353.94 |
SendTelemetry | 12000000 | 72 | 200.59 | 273.00 | 415.99 | 353.94 |
TenantInfo | 12000000 | 12 | 89.97 | 408.00 | 639.99 | 353.94 |
5. Benchmarking the APIs in the ratio seen in a Sunbird production environment
- All times are in milliseconds
- Test duration was 58 minutes
- The proxy was invoked via the intranet, with JMeter servers running in the same network
- Number of threads - 600
- Number of replicas for the test - 10 Proxy, 6 Kong, 8 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform, 2 Search
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
CompositeSearch | 840000 | 0 | 260.81 | 1277.90 | 2940.99 | 244.13 |
ContentHierarchy | 420000 | 0 | 370.66 | 1822.80 | 3285.99 | 122.07 |
ContentRead | 1680000 | 0 | 545.15 | 1668.00 | 2956.97 | 488.23 |
DialAssemble | 420000 | 1 | 253.57 | 639.00 | 946.99 | 122.09 |
DialSearch | 420000 | 0 | 186.57 | 414.00 | 643.99 | 122.09 |
FormRead | 420000 | 0 | 92.92 | 346.95 | 1596.98 | 122.09 |
OrgSearch | 420000 | 1 | 292.84 | 1096.95 | 2590.97 | 122.09 |
SendTelemetry | 1260000 | 0 | 278.62 | 346.00 | 605.99 | 366.25 |
TenantInfo | 420000 | 0 | 34.06 | 83.00 | 212.00 | 122.09 |
6. Benchmarking the APIs via the Internet
- All times are in milliseconds
- Test duration was 29 minutes on AWS and 33 minutes on Azure
- Number of threads - 600
- The APIs were invoked from 15 JMeter slaves (7 on AWS + 8 on Azure) via the domain name loadtest.ntp.net.in
- Number of replicas for the test - 10 Proxy, 6 Kong, 8 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform, 2 Search
AWS load generator summary (8 VMs)
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
ContentHierarchy | 320000 | 0 | 280.80 | 514.00 | 906.95 | 191.24 |
ContentRead | 320000 | 0 | 197.82 | 347.00 | 565.99 | 191.33 |
DialAssemble | 320000 | 0 | 254.36 | 361.00 | 489.00 | 191.32 |
DialSearch | 320000 | 0 | 208.91 | 294.00 | 426.98 | 191.33 |
FormRead | 320000 | 21 | 253.41 | 2663.95 | 3123.00 | 191.33 |
OrgSearch | 320000 | 0 | 89.36 | 244.00 | 332.00 | 191.33 |
SendTelemetry | 320000 | 0 | 282.24 | 475.95 | 912.94 | 191.33 |
TenantInfo | 320000 | 2 | 69.31 | 166.00 | 229.00 | 191.33 |
Azure load generator summary (8 VMs)
API | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
ContentHierarchy | 320000 | 0 | 236.78 | 200.00 | 1071.99 | 161.80 |
ContentRead | 320000 | 0 | 232.05 | 1069.00 | 1122.00 | 161.80 |
DialAssemble | 320000 | 2 | 254.78 | 178.00 | 1131.00 | 161.80 |
DialSearch | 320000 | 0 | 231.48 | 1074.00 | 1134.00 | 161.80 |
FormRead | 320000 | 7 | 250.82 | 760.00 | 1113.99 | 161.81 |
OrgSearch | 320000 | 1 | 113.96 | 1074.95 | 1130.00 | 161.81 |
SendTelemetry | 320000 | 0 | 241.08 | 245.00 | 1116.99 | 161.81 |
TenantInfo | 320000 | 0 | 117.49 | 1075.00 | 1099.00 | 161.81 |
Overall, we got throughput close to that of the test run via the intranet.
7. Benchmarking the difference between invoking the knowledge platform APIs directly vs via the proxy, API manager and content service
- Number of threads - 600
- All times are in milliseconds
- The proxy was invoked via the intranet, with JMeter servers running in the same network
- Number of replicas for the test - 10 Proxy, 6 Kong, 8 Content Service, 6 Learner Service, 6 Telemetry Service, 2 Knowledge Platform servers, 2 Search servers
API | Samples | Error Count | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|
ContentRead (Direct KP) | 3000000 | 0 | 73 | 135 | 4600 |
ContentRead (2 swarm, via Proxy) | 540000 | 46 | 261 | 1098 | 2100 |
ContentHierarchy (Direct KP) | 900000 | 0 | 565 | 1184 | 1698 |
ContentHierarchy (2 swarm, via Proxy) | 900000 | 0 | 1195 | 1869 | 1043 |
8. Benchmarking the proxy calls to blob storage for plugins & assets
(NOT Yet Production Ready)
Static Content Calls (proxied via our Nginx) | Samples | Error Count | Avg Response Time | 95th percentile response time | 99th percentile response time | Throughput (req/sec) |
---|---|---|---|---|---|---|
Without Upstream + HTTPS (Current Production) | 600000 | 212 | 494 | 190 | 1030 | 1046 |
With Upstream + HTTP (Proposed) | 1200000 | 62 | 138 | 66 | 95 | 2300 |
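For reference, the proposed "With Upstream + HTTP" variant corresponds to an nginx configuration along these lines; the storage host name and path are illustrative:

```nginx
# Hypothetical upstream for the blob storage hosting plugins & assets
upstream blob_storage {
    server sunbirdstore.blob.core.windows.net:80;  # illustrative account name
    keepalive 32;                                  # reuse connections to the backend
}

location /assets/ {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host sunbirdstore.blob.core.windows.net;
    proxy_pass http://blob_storage;                # plain HTTP to the storage backend
}
```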