Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form
Many teams’ web3 apps are doomed.
Why?
Because apps are totally dependent on nodes, and nodes are incredibly difficult to maintain. Some reasons:
These complications make nodes hard to manage and maintain.
But there are two architectural dimensions that significantly improve node reliability, or the ability for a user to successfully interact with your app when they need to:
In web3, a lack of reliability can have detrimental outcomes.
Take a well known NFT marketplace using unreliable infrastructure...
Their platform was unstable, creating:
What was the impact? $6M in lost revenue.
Don't let it happen to you.
In this article, we’ll explain:
There are five primary reasons why reliability is particularly hard to maintain in web3:
To create a highly functional app, you need reliability, combined with scalability and data accuracy.
Nodes were initially designed to be run by individuals for personal and small scale use cases. Although they’re expensive ($1000 / month), for that limited use case, they can create sufficient reliability, and maintain:
However, a development environment is not a production environment, and production requires very different tooling.
Nodes are expected to do an unbounded number of things, e.g., running peer-to-peer software, serving as a scalable database, executing arbitrary code, etc.
Nodes also have no “opinion” about what they should or shouldn’t do.
Together, this has two primary ramifications:
On average, every 5 days, nodes may have issues from:
Each issue takes significant time to diagnose, troubleshoot and fix, and means significantly less time focused on your front-end user experience.
“Working with Alchemy has helped us save the equivalent of three full-time engineers, who otherwise would have to be heads down on infra maintenance, at all times.” - Evgeny Yurtaev, CEO and Co-Founder, Zerion
Network upgrades and hard forks happen on a protocol’s schedule, not yours. Each one requires significant preparation, and getting it wrong can mean nodes crash, with detrimental impact.
“Before we integrated with Alchemy, we ran our own blockchain infrastructure. This approach was impossible to maintain and scale without issues. Any industry-wide update would take us hours to implement and caused headaches across the board.” - Evgeny Yurtaev, CEO & Co-Founder, Zerion
Running reliable node infrastructure is intrinsically challenging, and a lot of common web3 architecture is ineffective at creating reliability:
It's more simple to set up, but subject to a single point of failure, lacking both heterogeneity and redundancy. A single node’s uptime is measured as low as 72%, and scaling a single node is highly inefficient.
This creates slightly more redundancy than a single node, but still no heterogeneity (no diversity in systems), and often, a request that takes down the first node will take down the second node too. Scalability issues still exist, and you’ve introduced data accuracy issues, meaning the nodes will return different responses, as most systems are unable to guarantee that each node is on the same block head (or even the same chain).
This model creates more redundancy, but still no heterogeneity. Data accuracy and scalability issues persist (a series of retries will take out an entire fleet of nodes).
If it’s not clear yet, many “solutions” for connecting to the blockchain don’t sufficiently provide heterogeneity or redundancy, and because of that, ultimately fail. Failure has severe ramifications.
“Before Alchemy, we were plagued with managing backend infrastructure that exhausted unnecessary development resources - and other developer platforms were insufficient in the breadth of solutions they offered. Node-related incidents made up almost all of our on-call incidents.” - Brad Bayliss, Lead Technical Manager, Enjin
Nodes can require days or weeks of recovery time to get up and running again. These issues are expensive, hard to solve and require heads down work across the team.
In 2019, BlockCypher tried to navigate Ethereum’s Constantinople Hard Fork. They ran into an error of missing data, and couldn’t diagnose the root cause. They suffered from over a month of downtime, and eventually had to essentially restart from scratch.
Don’t let it happen to you ;)
The more heterogeneous and redundant your web3 infrastructure is, the more reliable it will be.
“Without question, Alchemy has been the only provider with a comprehensive set of tools that enables us to scale seamlessly, with uninterrupted reliability, while making thousands of calls each day.” - Kyle Gusdorf, Software Engineer, Decentral Games
Supernode is a combination of custom, scalable and distributed systems that enable reliability and scalability, while allowing our API to act as a single node, thereby ensuring data accuracy.
Web2 infrastructure creates reliability and scalability by removing data storage from servers and creating external storage centers.
However, because of the unique nature of node architecture, removing data storage from nodes is essentially impossible.
To create reliability and scalability, Supernode uses advanced architecture, called secondary infrastructure, to replicate nodes’ data.
Vox Nodi, Supernode’s coordination layer, runs constant system checks to ensure the secondary infrastructure has the latest data, and therefore, any requests for data, e.g., block number, txn logs, txn receipts will be accurate.
By routing requests to secondary infrastructure first, the nodes are protected and their “workload” is reduced.
Most requests can be handled by secondary infrastructure, and only edge cases will be routed to nodes.
Let’s take an example:
Users want information about various data types and blocks:
User 1’s request can be served by the secondary infrastructure system. (Importantly, if a request like this were to hit a node directly, the node would crash, whereas, the secondary infrastructure system can handle this load seamlessly).
User 2’s request can also be served by the secondary infrastructure.
User 3’s request cannot be served through secondary infrastructure, because the system does not yet have the latest Blocks data from Block 92. User 3’s request will now be served from the nodes.
In this example, the experience for Users 1, 2 and 3 is the exact same. There is no difference in downtime when a request is responded to via secondary infrastructure or directly from the nodes. But by routing requests to the secondary infra system first, the nodes are protected and their “workload” is reduced.
"Alchemy has allowed us to eliminate having to use secondary / back-up infra. Alchemy never goes down, and their uptime is super consistent. Their platform works perfectly for us.” - Johnny Rhea, Co-Founder & CTO, Element.Fi
Because the Ethereum Virtual Machine (EVM) is Turing complete, you can request that nodes execute code. Because nodes can’t self-regulate, there is constant potential for an arbitrary request to bring down an entire node fleet.
To protect against this, Alchemy de-duplicates ands stores all requests our system has previously responded to.
This means that only a uniquely new request will be run on the nodes, and otherwise, the vast majority of requests can be executed by these storage systems.
In addition to secondary infrastructure systems, Alchemy adds heterogeneity with distinct node types. In the outlier scenarios when secondary infrastructure can’t respond to a request, the system defaults to the most scalable node.
Alchemy runs multiple types of nodes, including:
Distinct properties of node types make them good at different things, so if one node can’t answer a request, the others are likely still healthy and well-suited to respond.
For example, if there is a request for the newest data, which may not be recorded in secondary infrastructure yet, that request can be routed to a fast node, and the archive node will be “protected”, since sifting through the archive node’s data would be significantly more expensive.
To make it easier to spin up new nodes as needed, the Supernode system takes “snapshots” of the nodes, so that to create a new node, the system can start from the most recent “checkpoint,” significantly shortening the the syncing process.
Lastly, Alchemy runs different implementations of the Ethereum client, including geth, Erigon and Parity. This essentially triples the backend maintenance work required, but it is hugely valuable in service of heterogeneity: the chance that these clients have the same bug is essentially zero.
At Alchemy, we’ve spent hundreds of thousands of hours engineering Supernode, a complex set of systems, built from the ground up, to be heterogeneous and redundant, drastically increasing our fault tolerance and ensuring we can reliably respond to any blockchain request.
"Infrastructure that’s both reliable and scalable, so that we can stay up when our customers need us most - that’s huge for Collab.Land. Alchemy is the GOAT here.” - Raymond Feng, CTO, Collab.Land
It’s taken us years to build the complex, interdependent systems that make this reliability possible. Building to control for reliability is extremely challenging but leveraging reliable infrastructure is crucial for the success of your app.
As you explore infrastructure options, one of the first things you should be asking a provider is: “How heterogenous and redundant is your infrastructure? And what evidence can you show me that proves your systems’ reliability?”
Your apps’ success is contingent on this choice. Let us know how we can help!