Validating URLs with JavaScript
Published on March 19, 2026
The problem
While adding a new feature to a legacy web application, I came across the following code:
const url = document.querySelector("#urlInput").value;
const urlRegex = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/i;
if (!urlRegex.test(url)) {
// Add some errors to the page
}
It just so happened that I was adding a new field to an existing form, which also required URL validation. While some people enjoy refactoring for its own sake, it is not something I make a habit of; it tends to turn a simple job into a much more complicated one.
However, in this case it seemed appropriate to encapsulate the above code and reuse it for the new feature, so I decided to rethink the previous approach. At first glance it looks thorough: it's a fairly complex regex which ensures the protocol is present, checks the domain format, and even attempts to handle optional ports and paths. But on closer inspection, it's not only brittle and difficult to maintain, it also misses many edge cases.
False positives
For example, the following would pass the regex:
https://trueflux.com/path with spaces // raw spaces are invalid in a URL, but .* happily accepts them
https://trueflux.com:91984 // ports above 65535 are invalid, but {1,5} allows any five-digit number
False negatives
Conversely, these would fail:
https://trueflux.agency // modern top-level domains like .agency are longer than {2,5} allows
http://localhost:3000 // a perfectly good development URL, but the host has no dot
https://example.com?search=test&sort=asc // a query string straight after the host never matches (\/.*)?
These aren’t obscure edge cases. They’re things that can and will show up in user input. Every time your regex gets one wrong, you either reject a valid user submission (bad UX), or accept a broken URL (bad data). Neither is ideal.
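Both lists are easy to verify in a browser console or Node. A quick sketch, using the regex from above (anchored with ^ and $ so it must match the whole input; the sample URLs are made-up):

```javascript
// The legacy regex, anchored so the whole input must match.
const urlRegex = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/i;

// False positives: malformed URLs the regex accepts.
urlRegex.test("https://trueflux.com:91984");            // true, but the port is out of range
urlRegex.test("https://trueflux.com/path with spaces"); // true, but raw spaces are invalid in a URL

// False negatives: valid URLs the regex rejects.
urlRegex.test("https://trueflux.agency"); // false, .agency is longer than {2,5} allows
urlRegex.test("http://localhost:3000");   // false, the host contains no dot
```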
So, what's the solution? We could build a better regex, but even a near-perfect one would have to be very elaborate to handle the complexities of top-level domains, subdomains, query strings and port numbers. And we would still need to maintain it when, or if, the rules change.
The alternative is to depend on something built into the language:
new URL("https://trueflux.agency")
Using the native URL constructor is preferable because it relies on the browser's built-in parsing, which fully understands modern URL structures, including long top-level domains, deeply nested subdomains, query strings, ports, and proper encoding. In short, someone else has already designed, built, and rigorously tested this parser, so we don't need to reinvent the wheel.
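As a quick illustration (the URL here is a made-up example), the constructor hands back every component already parsed:

```javascript
// A hypothetical URL, purely for illustration.
const url = new URL("https://web.app.trueflux.co.uk:8080/search?q=test#results");

url.protocol;              // "https:"
url.hostname;              // "web.app.trueflux.co.uk"
url.port;                  // "8080"
url.pathname;              // "/search"
url.searchParams.get("q"); // "test"
url.hash;                  // "#results"
```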
Refactored code
Here's what our code looks like after refactoring:
function parseWebsiteUrl(input) {
  return new URL(input); // an invalid URL will throw a TypeError
}
const raw = document.querySelector("#urlInput").value;

try {
  const url = parseWebsiteUrl(raw);
  // continue on and use url safely
} catch {
  // Add some errors to the page
}
Aside from being more readable, this approach is more robust and easier to maintain. The URL constructor either returns a fully parsed URL or throws; there's no ambiguity. We handle failure at the boundary and ensure the rest of the code only ever deals with valid data.
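One caveat worth knowing: new URL() validates structure, not intent, so any well-formed scheme parses successfully. If a field specifically expects a website address, it is worth also checking url.protocol. A minimal sketch (isWebsiteUrl is a hypothetical helper, not part of the refactor above):

```javascript
// new URL() accepts any scheme, not just http(s):
new URL("mailto:hi@trueflux.com"); // parses without throwing
new URL("javascript:alert(1)");    // so does this!

// Hypothetical guard for form fields that expect a website address.
function isWebsiteUrl(input) {
  try {
    const { protocol } = new URL(input);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // not parseable as a URL at all
  }
}
```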
If you're maintaining a legacy application, reach out to discuss how we can support and modernise it.
