Using LLMs for test data generation
Recently I had the idea to use generative AI for test data generation.
To be precise: To generate JSON payloads out of an OpenAPI specification.
The rough idea is:
- Read the OpenAPI spec.
- Pass the spec to gpt-3.5-turbo or a locally running Llama 2. Give it instructions on which type of request is expected (valid/invalid request, XSS attack, …).
- Execute the requests.
I’ve used the spring-petclinic-rest project to test the idea and implementation.
I’ve created a lot of bad Python code during this, but the idea worked. The following sections describe the implementation.
1. Read the OpenAPI spec
When the Spring application is configured correctly, the OpenAPI spec can be found behind a nice UI or via a URL that outputs YAML/JSON.
Which is good, as JSON and YAML can easily be parsed.
But it is really important to have a well-documented API. The better the documentation, the better the output will be, as the generated result can take constraints into account. Otherwise you may get a JSON payload that is supposed to be valid although it isn’t.
I used the openapi_spec_validator package to parse the specification.
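For illustration, here’s a minimal sketch of this step. The spec URL is an assumption based on a default spring-petclinic-rest setup, and the validate_spec import reflects an older openapi_spec_validator version (newer releases expose validate instead); adjust both to your environment.

import requests
import yaml
from openapi_spec_validator import validate_spec  # newer versions expose `validate` instead

# Assumed default URL of a locally running spring-petclinic-rest instance.
SPEC_URL = "http://localhost:9966/petclinic/v3/api-docs"

def load_spec(url: str) -> dict:
    # The endpoint returns JSON; yaml.safe_load parses it too, since JSON is a YAML subset.
    spec = yaml.safe_load(requests.get(url, timeout=10).text)
    validate_spec(spec)  # raises an exception if the document is not a valid OpenAPI spec
    return spec

spec = load_spec(SPEC_URL)
paths = spec["paths"]  # route -> method -> request/response schema information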
2. Let LLMs generate the payloads
When the spec is parsed, it’s time to pass the data to the LLM.
First, I tried to use a local Llama instance, but my computer was way too slow. Generating one request took 10 seconds and probably cost my computer a fair bit of its remaining lifetime. So I switched over to OpenAI’s API and started using gpt-3.5-turbo.
I’ve stored the raw answers directly in a MongoDB instance to fetch them later when executing the requests.
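A minimal sketch of that storage step, assuming a local MongoDB instance; the database and collection names (testdata, generated_payloads) are made up for illustration.

from pymongo import MongoClient

# Hypothetical connection string and collection names - adjust to your setup.
client = MongoClient("mongodb://localhost:27017")
answers = client["testdata"]["generated_payloads"]

def store_answer(raw_answer: str, route: str, method: str) -> None:
    # Keep the unparsed LLM answer so it can be inspected and replayed later.
    answers.insert_one({"route": route, "method": method, "raw": raw_answer})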
The important part was getting the prompt right. I experimented in the playground and ended up with the following:
You're a tester which sends random requests to an API. The API is defined via the OpenAPI specification.
- You need a valid JSON payload that is handled correctly by the API.
- You get a route, the method and the schemas for the requests.
- You act as ownerId 11
COMMANDS:
- Don't comment anything, not even inline comments in json.
- Fill in the expected output below where fields have the following meanings:
description: Description of the route. Describe the request.
path: the path to send the request to. Fill in the route. Replace any needed route variables in curly braces. In case of invalid requests fill an arbitrary value. In case of valid requests fill a value which makes sense.
method: the REST request method. Fill in the method.
expected_status_code: the http status code which is expected to be returned. Fill in the expected status code.
payload: the payload to send. Generate a <<VALID|INVALID>> payload. None when there is no request payload.
- Answer in the following JSON schema
EXPECTED OUTPUT FORMAT (FILL IN THE ""):
- {"description": "", "path": "", "method": "", "expected_status_code": "", "payload": ""}
SPECIFICATION: <<OpenAPI specification here>>
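As a rough sketch of the generation call, assuming the pre-1.0 openai Python package that was current for gpt-3.5-turbo. PROMPT_TEMPLATE stands for the prompt above, and the idea of passing the spec excerpt as the user message is my own simplification, not necessarily the exact setup used.

import json
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# The prompt shown above, e.g. stored in a separate file.
PROMPT_TEMPLATE = open("prompt.txt", encoding="utf-8").read()

def generate_test_case(route: str, method: str, schemas: dict, kind: str = "VALID") -> str:
    # Fill the <<VALID|INVALID>> placeholder and hand over the relevant part of the parsed spec.
    prompt = PROMPT_TEMPLATE.replace("<<VALID|INVALID>>", kind)
    spec_excerpt = json.dumps({"path": route, "method": method, "schemas": schemas})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": "SPECIFICATION: " + spec_excerpt},
        ],
    )
    # Return the raw answer; it gets stored in MongoDB and parsed later.
    return response["choices"][0]["message"]["content"]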
With that I’ve generated 287 payloads for $0.52!
Furthermore, you can see that I handed over some information about existing data sets and told the LLM which agent the API requester acts as.
2.1. Evaluating the results
The prompt from above gave some good results.
Here are some examples.
For XSS attacks:
{
  "description": "Attempt an XSS attack by injecting a script tag in the firstName field",
  "path": "/api/owners",
  "method": "POST",
  "expected_status_code": 400,
  "payload": {
    "firstName": "<script>alert('XSS Attack')</script>",
    "lastName": "Franklin",
    "address": "110 W. Liberty St.",
    "city": "Madison",
    "telephone": "6085551023"
  }
}
Try to access non-existing routes:
{"description": "Send a POST request to the root path", "path": "/", "method": "POST", "expected_status_code": 405, "payload": null}
However, there were also some false positives, like the following one, which tries to create a pet by sending null as a payload:
{"description": "Create a new pet for the owner.", "path": "/api/owners/8/pets", "method": "post", "expected_status_code": 200, "payload": null}
In rare cases, the answer came back in the following format instead:
PATH: /api/visits/{visitId}
METHOD: get
REQUEST: None
RESPONSE: {'required': ['error', 'message', 'path', 'schemaValidationErrors', 'status', 'timestamp'], 'type': 'object', 'properties': {'status': {'type': 'integer', 'description': 'The HTTP status code.', 'format': 'int32', 'readOnly': True, 'example': 400}, 'error': {'type': 'string', 'description': 'The short error message.', 'readOnly': True, 'example': 'Bad Request'}, 'path': {'type': 'string', 'description': 'The path of the URL for this request.', 'format': 'uri', 'readOnly': True, 'example': '/api/owners'}, 'timestamp': {'type': 'string', 'description': 'The time the error occured.', 'format': 'date-time', 'readOnly': True}, 'message': {'type': 'string', 'description': 'The long error message.', 'readOnly': True, 'example': 'Request failed schema validation'}, 'schemaValidationErrors': {'type': 'array', 'description': 'Validation errors against the OpenAPI schema.', 'items': {'$ref': '#/
3. Executing requests
After the payloads had been generated, the requests were fired at the API. This step is trivial, as there was no authentication etc., so I’ll skip further description and move on to evaluating the payloads.
3.1. Evaluation of the payloads
I added a function to compare the expected_status_code with the actual status code returned by the API.
That means: when the payload is generated ‘correctly’, the expected status code and the one returned by the API should match.
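A rough sketch of how that comparison could look, reusing the hypothetical MongoDB collection from above; the base URL is again an assumption for a default spring-petclinic-rest setup.

import json
import requests
from pymongo import MongoClient

BASE_URL = "http://localhost:9966/petclinic"  # assumed base URL of the API under test
answers = MongoClient("mongodb://localhost:27017")["testdata"]["generated_payloads"]

def execute_and_check(case: dict) -> bool:
    # Fire the generated request and compare the returned status code with the expected one.
    response = requests.request(
        case["method"].upper(),
        BASE_URL + case["path"],
        json=case["payload"],
        timeout=10,
    )
    return response.status_code == int(case["expected_status_code"])

passed = failed = 0
for doc in answers.find():
    case = json.loads(doc["raw"])  # the answer format requested in the prompt
    if execute_and_check(case):
        passed += 1
    else:
        failed += 1
print(f"Executed: {passed + failed}, Passed: {passed}, Failed: {failed}")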
Here’s the overall result:
Executed: 287
✓ Passed: 146
✗ Failed: 141
Just a little more than 50% of the payloads seemed to be correct. But that figure is misleading: when the LLM generated a payload that creates a pet, for example, and the request references a non-existing owner, the request will fail even though the payload itself may be in a valid format.
This would be a next step for enhancement: letting the LLM know the dependencies between the entities, or hooking a database into it.
Some general observations
Overall I’d say that the data was good, but not perfect. There was less randomness in the payloads than expected.
As an example, take this screenshot of the owner table:
The entry with the id 1 (James Carter) already existed in the init data of the Spring petclinic REST API.
This means that gpt-3.5-turbo used this openly available data to generate “new” payloads. That’s another important aspect to take care of.
Conclusion and learnings
Generally, I’d say that when you want to generate a lot of test data or random payloads in different forms, it can be a good idea to use LLMs. Nevertheless, you need some manual steps to validate the generated data.
Here are some learnings:
- When you have a well-documented API/specification, then you get better results.
- False positives exist. These need to be identified manually.
- You get better answers when you hand the LLM information about existing data sets.
- When the web already has some libraries or samples for an API, the probability is high that LLMs will generate requests out of this data. The question is: How much randomness will we get then?
Further look
From the point of view of integration tests, there are some more open possibilities that could be tried.
For example, auto-generating full curl requests or test-framework-specific tests (e.g. with cypress).
Try it out for some things you work on, maybe it’ll speed up some boring manual steps!