Error handling is a critical aspect of building stable, maintainable software, and APIs are no exception. I spend a lot of time building web services, and there are a few principles I follow to make sure I'm baking quality in from the start. I'd like to make a few posts on this topic, so subscribe and comment for more content like this.

Error handling is responding to exceptional situations gracefully AND clearly reporting specific error details.

The single biggest source of ambiguity I've found in API error responses is the improper use of HTTP status codes.

404 - Not Found

We've all seen the big banner on a web site that says something to the effect of "404 - Page not found."

GitHub's 404 page

So does a 404 only refer to pages? Here's the definition from RFC 7231[1].

The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists.

There's no mention of a page, only of a resource, which is a rather generic term. In this context a resource is pretty much anything that can be addressed with a URL. That means it could refer to a page, an image, a script, or even an API resource.

When to use 404

There are times when 404 is more specific than other status codes and reduces ambiguity while interacting with an API. A 404 should be returned from an API when any resource that is referenced by a valid ID in the URL cannot be found. This means that something like https://example.com/users/1234/address should return a 404 if the ID 1234 is well formed and there is no user 1234.

Where JSON responses are used, it could be argued that another way to represent this would be to return an empty array [] or object {} and a 200 status code. After all, if there is no user there are no addresses, and this could potentially streamline client code. While that may seem reasonable, an empty result with a 200 implies that user 1234 exists and simply has no addresses, even though we know the user doesn't exist. This ambiguity could lead to subtle misinterpretations on the part of API consumers and result in defects in integrating code.
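
Here's a minimal sketch of that distinction in Python. The USERS and ADDRESSES dictionaries and the get_user_addresses function are made-up stand-ins for a real data store and route handler, not part of any framework.

```python
from http import HTTPStatus

# Hypothetical in-memory data standing in for a real database.
USERS = {"1001": {"name": "Ada"}}
ADDRESSES = {"1001": []}  # user 1001 exists but has no addresses yet


def get_user_addresses(user_id: str):
    """Return (status, JSON-serializable body) for GET /users/{id}/address."""
    if user_id not in USERS:
        # The parent resource is missing, so say so explicitly.
        return HTTPStatus.NOT_FOUND, {"error": "user_not_found", "user_id": user_id}
    # The user exists, so an empty list can only mean "no addresses".
    return HTTPStatus.OK, ADDRESSES.get(user_id, [])
```

With this split, a 200 with [] and a 404 can never be confused with each other by the client.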

400 - Bad Request

If a resource is referenced by an ID that is not well formed, then that fits much better with a 400 - Bad Request, also defined in RFC 7231[2]. This is because a 400 is much more descriptive of the cause of the error - junk data - than a 404.

The 400 (Bad Request) status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).

It could also be argued that a 400 could be the response for all client request errors, but that kind of blanket policy leaves a lot to be desired in terms of specificity.

When to use 400

A 400 is the perfect response to junk data in a request. This could be a request body that doesn't match the required schema, unique identifiers that couldn't possibly map to a resource, URL query parameters with bad values, request headers with malformed data, etc. Basically any time your API receives something it doesn't expect or can't interpret, a 400 status code is probably a reasonable response - ideally with a descriptive error in a machine-parseable structure in the response body.
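
As a rough sketch of what that might look like, the Python below checks an ID before any lookup happens. The ID format (1 to 10 decimal digits), the validate_user_id name, and the shape of the error body are all assumptions made for illustration.

```python
import re
from http import HTTPStatus

# Assumption: user IDs are 1 to 10 decimal digits. Adjust to the real ID scheme.
USER_ID_PATTERN = re.compile(r"[0-9]{1,10}")


def validate_user_id(raw_id: str):
    """Return None if the ID is well formed, else a (status, body) error pair."""
    if USER_ID_PATTERN.fullmatch(raw_id):
        return None
    # Junk data: report it in a structure a client can parse.
    return HTTPStatus.BAD_REQUEST, {
        "error": "invalid_user_id",
        "field": "user_id",
        "message": "user ID must be 1 to 10 decimal digits",
    }
```

A handler could call this first and only go looking for the user (and possibly return a 404) when it returns None.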

401 and 403

These are often used interchangeably or incorrectly, due to a conflation between authentication and authorization. Let's start with the definitions for the status codes.

401[3]

The 401 (Unauthorized) status code indicates that the request has not been applied because it lacks valid authentication credentials for the target resource.

403[4]

The 403 (Forbidden) status code indicates that the server understood the request but refuses to authorize it.

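So 401 and 403 refer to authentication and authorization, respectively: a 401 says the API doesn't know who the caller is, while a 403 says it knows who the caller is and that caller isn't allowed to do what was asked. If you'd like a quick refresher on the difference between authentication and authorization, Okta has a great article on the subject (not affiliated).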
https://www.okta.com/identity-101/authentication-vs-authorization/
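
One way to picture the split is as two separate checks, sketched below in Python. The TOKENS and PERMISSIONS tables, the permission names, and check_access are hypothetical; a real service would verify credentials with its own auth system.

```python
from http import HTTPStatus
from typing import Optional

# Hypothetical token store and permission table, for illustration only.
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
PERMISSIONS = {"alice": {"users:read", "users:write"}, "bob": {"users:read"}}


def check_access(token: Optional[str], required_permission: str) -> HTTPStatus:
    """Return the status an API might send before running the handler."""
    user = TOKENS.get(token) if token else None
    if user is None:
        # Authentication failed: we don't know who the caller is.
        return HTTPStatus.UNAUTHORIZED
    if required_permission not in PERMISSIONS.get(user, set()):
        # Authentication succeeded, but this caller isn't allowed to do this.
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK
```

Note that RFC 7235 also requires a 401 response to carry a WWW-Authenticate header, which this sketch leaves out.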

409 - Conflict

A 409 is often used to indicate that the basis of a change is no longer valid.[5] An example could be updating a user's phone number after the phone number was already soft-deleted.

The 409 (Conflict) status code indicates that the request could not be completed due to a conflict with the current state of the target resource.

This one can get a bit tricky too. If it can be determined that the update is out of sync with the state of the resource, then it's clearly a 409. But what if the data model doesn't allow for soft-deletes, and instead actually deletes records from the database? In that case there's no way to tell whether this is a conflict or whether the resource never existed in the first place, so we'd have to fall back to a 404 - Not Found. This idea is reinforced by the next sentence in the RFC.[5]

This code is used in situations where the user might be able to resolve the conflict and resubmit the request.

In the soft-delete case where a 409 is returned, the user can re-request the state of the resource, notice that the phone number no longer exists, and either re-add the phone number or simply report that the number was deleted. This way it's reasonably resolvable from the client's perspective.
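
The phone number example could be sketched like this in Python. The PHONE_NUMBERS store, its "deleted" flag, and update_phone_number are invented for illustration and assume a data model that keeps soft-deleted records around.

```python
from http import HTTPStatus

# Hypothetical store of phone numbers keyed by ID; "deleted" marks a soft delete.
PHONE_NUMBERS = {
    "p1": {"number": "555-0100", "deleted": False},
    "p2": {"number": "555-0199", "deleted": True},
}


def update_phone_number(phone_id: str, new_number: str):
    """Return (status, body) for an update to an existing phone number."""
    record = PHONE_NUMBERS.get(phone_id)
    if record is None:
        # No trace of the record, so a conflict can't be told apart from a bad ID.
        return HTTPStatus.NOT_FOUND, {"error": "phone_number_not_found"}
    if record["deleted"]:
        # The record existed but was deleted, so the basis of the update is stale.
        return HTTPStatus.CONFLICT, {"error": "phone_number_deleted"}
    record["number"] = new_number
    return HTTPStatus.OK, {"number": new_number}
```

Without the "deleted" flag, the second branch disappears and every missing record falls through to the 404.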

500 - Internal Server Error

Now we get into the 5xx class of status codes, which are entirely related to errors encountered on the server while processing a valid request. There are a few specific types of status codes in this category, but most often any 5xx represents a completely unexpected situation. Maybe the database or a downstream service is down and refusing traffic, in which case the API server is acting as a proxy and a 502 - Bad Gateway - could be appropriate. If the API service is down for maintenance or is overloaded, it's better to give up and return a 503 - Service Unavailable - than to return nothing at all, because it makes it very clear where the problem is from the caller's perspective.
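
A small Python sketch of that decision follows. UpstreamError, the maintenance_mode flag, and status_for_failure are hypothetical names; how a real service detects a failed dependency or a maintenance window will vary.

```python
from http import HTTPStatus


class UpstreamError(Exception):
    """Hypothetical: raised when a downstream dependency fails or refuses traffic."""


def status_for_failure(exc: Exception, maintenance_mode: bool = False) -> HTTPStatus:
    """Pick a 5xx status for a failure that happened on the server side."""
    if maintenance_mode:
        # The API itself is deliberately not serving requests right now.
        return HTTPStatus.SERVICE_UNAVAILABLE
    if isinstance(exc, UpstreamError):
        # The API is fine, but something it depends on is not.
        return HTTPStatus.BAD_GATEWAY
    # Anything else is an unexpected failure in the API itself.
    return HTTPStatus.INTERNAL_SERVER_ERROR
```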

For almost anything else, a 500 is perfectly acceptable. It indicates that an error occurred while the server was handling the request, and it should always include a machine-readable message telling the client what went wrong (without exposing sensitive details like stack traces, database tables, internal hostnames, etc.). One trick I like to use to easily grep for the relevant log information is to generate a sizeable and unique "event ID", write that to the logs alongside stack traces, and return that in the response body. UUIDs are great for this.
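
The event ID trick might look something like the Python sketch below, using only the standard library. The handle_safely wrapper, the "api" logger name, and the response body fields are assumptions for illustration.

```python
import logging
import uuid
from http import HTTPStatus

logger = logging.getLogger("api")


def handle_safely(handler):
    """Run handler(); on an unexpected exception return a 500 with an event ID."""
    try:
        return handler()
    except Exception:
        event_id = str(uuid.uuid4())
        # logger.exception writes the stack trace next to the event ID.
        logger.exception("unhandled error, event_id=%s", event_id)
        # The client gets the event ID only, never the stack trace.
        return HTTPStatus.INTERNAL_SERVER_ERROR, {
            "error": "internal_error",
            "event_id": event_id,
        }
```

When a client reports the event ID from the response body, grepping the logs for it leads straight to the matching stack trace.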

Why specificity is important

Of course, you are free to design your APIs however you like, but I've found that being as specific as possible about errors goes a long way toward allowing API consumers to write top-notch code. When APIs are easy to understand and specific about what went wrong, consumers have a better chance of handling errors gracefully and correctly. UX still matters for APIs, you know.

References

[1] RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - 404
https://datatracker.ietf.org/doc/html/rfc7231#section-6.5.4

[2] RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - 400
https://datatracker.ietf.org/doc/html/rfc7231#section-6.5.1

[3] RFC 7235 - Hypertext Transfer Protocol (HTTP/1.1): Authentication - 401
https://datatracker.ietf.org/doc/html/rfc7235#section-3.1

[4] RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - 403
https://datatracker.ietf.org/doc/html/rfc7231#section-6.5.3

[5] RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - 409
https://datatracker.ietf.org/doc/html/rfc7231#section-6.5.8