Title: The Evolution of Web3 Data Access
Authors: Geng Kai, Eric, DFG
The Significance of Data in Blockchain
Data is crucial to blockchain technology and serves as the foundation for developing decentralized applications (dApps). While much of the current discussion revolves around Data Availability (DA) – ensuring that every network participant can access the latest transaction data for verification – there is another equally important aspect that is often overlooked: Data Accessibility.
In the era of modular blockchains, DA solutions have become indispensable. These solutions ensure that all participants can use transaction data to validate in real time and maintain the integrity of the network. However, a DA layer functions more like a billboard than a database: data is not stored indefinitely but is deleted over time, just as a new poster replaces an old one on a billboard.
On the other hand, Data Accessibility focuses on the ability to retrieve historical data, which is crucial for dApp development and blockchain analysis. It is essential for tasks that require access to past data to ensure accurate representation and execution. Although Data Accessibility is discussed less often, it is just as vital as Data Availability. The two play different but complementary roles in the blockchain ecosystem, and a comprehensive data management approach must address both to support robust and efficient blockchain applications.
How Blockchain Data was Retrieved Before
Since its inception, blockchain has completely transformed infrastructure and has driven the creation of decentralized applications (dApps) in various fields such as gaming, finance, and social networks. However, building these dApps requires access to a significant amount of blockchain data, which is both challenging and expensive.
For dApp developers, one option is to host and operate their own archival RPC node. These nodes store all historical blockchain data from the beginning, allowing full access to the data. However, the cost of maintaining archival nodes is high, and their query capabilities are limited, making it difficult to query data in the format developers need. While running cheaper nodes is an option, their data retrieval capabilities are limited, which may hinder the operation of dApps.
Another approach is to use commercial RPC (Remote Procedure Call) node providers. These providers handle the cost and management of nodes and provide data through RPC endpoints. Public RPC endpoints are free but have rate limits that may negatively impact the user experience of dApps. Private RPC endpoints offer better performance by reducing congestion, but even simple data retrieval requires a significant amount of back and forth communication, making them request-intensive and inefficient for complex data queries. Additionally, private RPC endpoints are often challenging to scale and lack compatibility across different networks.
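To make the "back and forth" concrete, the sketch below counts how many JSON-RPC calls it takes to collect one event type over a block range. It only builds `eth_getLogs` request payloads and sends nothing over the network. The 2,000-block cap per request is an assumption (providers set their own limits), and the function name and the all-zero contract address are placeholders for illustration.

```python
# Topic hash of the standard ERC-20 event Transfer(address,address,uint256).
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
# Placeholder contract address (hypothetical, not a real token).
PLACEHOLDER_ADDRESS = "0x" + "00" * 20


def build_get_logs_requests(address, topic, from_block, to_block, max_span=2000):
    """Split [from_block, to_block] into eth_getLogs payloads of at most max_span blocks.

    max_span is an assumed provider limit, not a protocol constant.
    """
    requests = []
    start = from_block
    while start <= to_block:
        end = min(start + max_span - 1, to_block)
        requests.append({
            "jsonrpc": "2.0",
            "id": len(requests) + 1,
            "method": "eth_getLogs",
            "params": [{
                "address": address,
                "topics": [topic],
                "fromBlock": hex(start),
                "toBlock": hex(end),
            }],
        })
        start = end + 1
    return requests


# Scanning the first one million blocks for a single event type.
payloads = build_get_logs_requests(PLACEHOLDER_ADDRESS, TRANSFER_TOPIC, 0, 999_999)
print(f"{len(payloads)} separate requests needed")
```

Every one of these requests still returns raw logs that the dApp has to decode, filter, and join on its own, which is the overhead the paragraph above describes.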
A Better Alternative: Blockchain Indexers
Blockchain indexers play a crucial role in organizing chain data and sending it to databases for query purposes, which is why they are often referred to as the “Google of blockchain.” They work by indexing blockchain data and making it available at all times through a query language similar to SQL (using APIs like GraphQL). By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need using standardized query languages, greatly simplifying the process.
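As a rough illustration of what a unified query interface looks like in practice, the sketch below builds the JSON body of a single GraphQL request. The entity name `transfers`, its fields, and the `where`/`orderBy` arguments are hypothetical; each indexer defines its own schema and filter syntax, so treat this as the shape of a request rather than any specific protocol's API.

```python
import json

# Hypothetical schema: a `transfers` entity with from/to/value/blockNumber fields.
QUERY = """
query RecentTransfers($first: Int!, $minBlock: Int!) {
  transfers(
    first: $first
    where: { blockNumber_gte: $minBlock }
    orderBy: blockNumber
    orderDirection: desc
  ) {
    from
    to
    value
    blockNumber
  }
}
"""


def build_graphql_body(query, variables):
    """Serialize a GraphQL query and its variables into a JSON POST body."""
    return json.dumps({"query": query, "variables": variables})


# One request describes the filter, the ordering, and the fields to return.
body = build_graphql_body(QUERY, {"first": 100, "minBlock": 19_000_000})
print(body[:60])
```

The filtering and ordering are expressed once in the query and executed by the indexer, instead of being reassembled client-side from many raw RPC responses.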
Different types of indexers optimize data retrieval in various ways:
1. Full Node Indexers: These indexers run full blockchain nodes and extract data directly from them, ensuring data integrity and accuracy but requiring significant storage and processing power.
2. Lightweight Indexers: These indexers rely on full nodes to fetch specific data as needed, reducing storage requirements but potentially increasing query time.
3. Specialized Indexers: These indexers are tailored for certain types of data or specific blockchains, optimizing retrieval for specific use cases such as NFT data or DeFi transactions.
4. Aggregated Indexers: These indexers extract data from multiple blockchains and sources, including off-chain information, providing a unified query interface, which is particularly useful for multi-chain dApps.
Ethereum alone requires about 3 TB of storage on an Erigon archival node, and that volume will keep increasing as the blockchain grows. Indexer protocols deploy multiple indexers that efficiently index and query large amounts of data at high speed, something RPC nodes cannot achieve.
Indexers also allow for complex queries, easy data filtering based on different criteria, and post-analysis data extraction. Some indexers even enable aggregation of data from multiple sources, avoiding the need to deploy multiple APIs in multi-chain dApps. By being distributed across multiple nodes, indexers provide enhanced security and performance, while RPC providers may experience interruptions and downtime due to their centralized nature.
Overall, indexers improve the efficiency and reliability of data retrieval compared to RPC node providers, while also reducing the cost of deploying a single node. This makes blockchain indexer protocols the preferred choice for dApp developers.
Use Cases for Indexers
As mentioned earlier, dApps need to retrieve and read blockchain data to operate their services. This applies to every type of dApp, including DeFi, NFT platforms, games, and even social networks, because these platforms must read data before executing other transactions.
DeFi: DeFi protocols require different information to quote specific prices, ratios, fees, etc. Automated Market Makers (AMMs) need price and liquidity information about certain pools to calculate swap rates, while lending protocols need utilization to determine borrowing rates and debt liquidation ratios. It is essential that this information is fed into the dApp before it calculates rates for users.
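The sketch below shows why these inputs have to be read first. It assumes a constant-product AMM (x * y = k) with a 0.3% fee and a simple linear interest-rate model; both are illustrative choices rather than the formulas of any particular protocol, and all parameter values are made up. Floats are used for readability, whereas on-chain implementations use integer math.

```python
def constant_product_out(amount_in, reserve_in, reserve_out, fee=0.003):
    """Output amount for a swap against a constant-product pool (assumed model)."""
    amount_in_after_fee = amount_in * (1 - fee)
    return reserve_out * amount_in_after_fee / (reserve_in + amount_in_after_fee)


def utilization(total_borrowed, available_liquidity):
    """Share of supplied funds that is currently borrowed."""
    total = total_borrowed + available_liquidity
    return total_borrowed / total if total else 0.0


def linear_borrow_rate(util, base_rate=0.02, slope=0.20):
    """Hypothetical linear model: the borrow rate rises with utilization."""
    return base_rate + slope * util


# Pool reserves and lending totals are exactly the data an indexer would supply.
usdc_out = constant_product_out(amount_in=1.0, reserve_in=1_000.0, reserve_out=3_000_000.0)
rate = linear_borrow_rate(utilization(total_borrowed=800.0, available_liquidity=200.0))
print(f"swap output: {usdc_out:.2f} USDC, borrow rate: {rate:.2%}")
```

Neither function can return a meaningful quote without current reserves or borrow totals, so stale or slow data translates directly into mispriced swaps and rates.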
Games: GameFi requires fast indexing and access to data to ensure users can play games smoothly. Only through lightning-fast data retrieval and execution can Web3 games compete in performance with Web2 games, attracting more users. These games require data such as land ownership, in-game token balances, in-game operations, etc. Through indexers, they can better ensure stable data flow and stable uptime to ensure a perfect gaming experience.
NFT: NFT markets and lending platforms require indexed data access to various information such as NFT metadata, ownership and transfer data, royalty information, etc. Quickly indexing such data can avoid manually browsing each NFT to find ownership or NFT attribute data.
Whether it’s an AMM that needs price and liquidity information, a SocialFi app that needs to update new user posts, or any other dApp that requires fast data retrieval, being able to retrieve data quickly is crucial for smooth dApp operation. With indexers, they can efficiently and accurately retrieve data, providing a seamless user experience.
Analytics: Indexers provide a way to extract specific data from raw blockchain data (including smart contract events in each block), opening up opportunities for more specific data analysis and providing comprehensive insights.
For example, a perpetual trading protocol can identify which tokens have high trading volumes and generate the most fees, and use that to decide whether to list those tokens as perpetual contracts on its platform. DEX developers can create dashboards for their products to examine which pools have the highest returns or the deepest liquidity. Public dashboards can also be created, allowing developers to freely and flexibly query any type of data to display on charts.
With multiple blockchain indexers available, identifying the differences between indexer protocols is crucial to ensuring developers choose the one that best suits their needs.
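A minimal sketch of that kind of analysis follows, assuming the indexer has already returned decoded swap records. The record layout (`token`, `volume_usd`, `fee_usd`) and the sample values are invented for illustration.

```python
from collections import defaultdict

# Made-up sample rows standing in for indexed swap events.
SAMPLE_SWAPS = [
    {"token": "AAA", "volume_usd": 1_200.0, "fee_usd": 3.6},
    {"token": "BBB", "volume_usd": 300.0, "fee_usd": 0.9},
    {"token": "AAA", "volume_usd": 800.0, "fee_usd": 2.4},
    {"token": "CCC", "volume_usd": 5_000.0, "fee_usd": 15.0},
]


def rank_tokens_by_volume(swaps):
    """Sum volume and fees per token, then sort by volume, highest first."""
    totals = defaultdict(lambda: {"volume_usd": 0.0, "fee_usd": 0.0})
    for swap in swaps:
        totals[swap["token"]]["volume_usd"] += swap["volume_usd"]
        totals[swap["token"]]["fee_usd"] += swap["fee_usd"]
    return sorted(totals.items(), key=lambda item: item[1]["volume_usd"], reverse=True)


ranked = rank_tokens_by_volume(SAMPLE_SWAPS)
for token, stats in ranked:
    print(token, stats["volume_usd"], stats["fee_usd"])
```

The same aggregation could feed a listing decision or a dashboard chart; the point is that it runs over already-indexed events rather than raw blocks.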
Overview of Blockchain Indexers
Indexer Overview: The Graph
The Graph is the first indexer protocol launched on Ethereum, making it easy to query transaction data that was previously difficult to access. It uses subgraphs to define and filter subsets of data collected from the blockchain, such as all transactions related to the Uniswap v3 USDC/ETH pool, and developers query that data with a SQL-like query language through GraphQL APIs. Indexers stake the native token GRT to provide indexing and query services, and delegators can choose to stake their tokens with them. Curators signal on high-quality subgraphs to help indexers determine which data to index for optimal query fees. As The Graph transitions towards greater decentralization, it will eventually cease its hosted service and require subgraphs to upgrade to its network while offering upgraded indexers.
The infrastructure achieves an average cost of $40 per million queries, which is significantly lower than the cost of self-hosted nodes. Utilizing file data sources, it also supports parallel indexing of on-chain and off-chain data to facilitate efficient data retrieval.
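As a back-of-the-envelope check on that figure, the sketch below converts the quoted $40 per million queries into a monthly bill and a break-even point against a self-hosted node. The $1,000 monthly node cost is a purely hypothetical input, not a number from this article.

```python
def query_cost_usd(num_queries, usd_per_million=40.0):
    """Cost of a query volume at the quoted average rate."""
    return num_queries / 1_000_000 * usd_per_million


def break_even_queries(self_hosted_usd_per_month, usd_per_million=40.0):
    """Monthly query volume at which paying per query equals the self-hosting cost."""
    return self_hosted_usd_per_month / usd_per_million * 1_000_000


monthly_bill = query_cost_usd(2_500_000)
# Hypothetical self-hosted node budget of $1,000 per month.
threshold = break_even_queries(1_000.0)
print(f"bill: ${monthly_bill:.2f}, break-even: {threshold:,.0f} queries/month")
```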
Indexer rewards on The Graph have increased steadily over the past few quarters. This growth is partly due to higher query volume and partly to a rise in the token price amid plans to integrate AI-assisted queries in the future.
Indexer Overview: Subsquid
Subsquid is a peer-to-peer, horizontally scalable decentralized data lake that efficiently aggregates large volumes of on-chain and off-chain data, secured by zero-knowledge proofs. It operates as a decentralized worker network in which each node stores data from a specific subset of blocks, so retrieval is sped up by quickly identifying the nodes that hold the required data. Subsquid also supports real-time indexing, meaning blocks can be indexed before they are finalized. Developers can choose the format in which data is stored, such as Parquet or CSV, which makes analysis easier in tools like BigQuery. Furthermore, subgraphs can be deployed on the Subsquid network without migrating to the Squid SDK, enabling code-free deployment.
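The idea of locating the nodes that hold a given block range can be sketched with a sorted list of chunk boundaries and a binary search. This is an illustrative model only, not Subsquid's actual routing logic; the chunk size, worker names, and assignment table are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical assignment: (first block of the chunk, workers storing that chunk).
CHUNKS = [
    (0, ["worker-a", "worker-b"]),
    (1_000_000, ["worker-b", "worker-c"]),
    (2_000_000, ["worker-a", "worker-c"]),
]
CHUNK_STARTS = [start for start, _ in CHUNKS]


def workers_for_block(block):
    """Return the workers that store the chunk containing the given block."""
    if block < CHUNK_STARTS[0]:
        raise ValueError("block precedes the first chunk")
    index = bisect_right(CHUNK_STARTS, block) - 1
    return CHUNKS[index][1]


def workers_for_range(from_block, to_block):
    """Map each chunk overlapping [from_block, to_block] to its workers."""
    if from_block < CHUNK_STARTS[0] or to_block < from_block:
        raise ValueError("invalid block range")
    first = bisect_right(CHUNK_STARTS, from_block) - 1
    last = bisect_right(CHUNK_STARTS, to_block) - 1
    return {CHUNKS[i][0]: CHUNKS[i][1] for i in range(first, last + 1)}


print(workers_for_block(1_500_000))
print(workers_for_range(900_000, 1_100_000))
```

Because the lookup is a binary search over chunk boundaries, a query only contacts the workers that own the relevant blocks instead of broadcasting to the whole network.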
Although still in the testnet phase, Subsquid has posted impressive numbers: over 80,000 testnet users, more than 60,000 deployed Squid indexers, and over 20,000 verified developers on the network. Its mainnet data lake launched recently, on June 3rd.
In addition to indexing, the Subsquid Network data lake can also serve as an RPC replacement for analytics, ZK/TEE co-processor, AI agent, and oracle use cases.
Indexer Overview: SubQuery
SubQuery is a decentralized middleware infrastructure network providing RPC and indexed data services. It initially supported Polkadot and Substrate networks and has since expanded to over 200 chains. It operates similarly to The Graph, using proof of indexing: indexers index data and serve query requests, while delegators stake with the indexers. In addition, it introduces consumers, who submit purchase orders to ensure that indexers' income is guaranteed.
SubQuery plans to introduce sharded data nodes, which avoid continuously synchronizing new data between every node, optimizing query efficiency while moving towards greater decentralization. Users can choose to pay approximately 1 SQT token per 1,000 requests or set custom fees for indexers through the protocol.
Although SubQuery only launched its token earlier this year, its node and delegator issuance rewards have grown in USD terms, indicating an increase in the number of queries served on its platform. Since the Token Generation Event (TGE), the total staked SQT amount has increased from 6 million to 125 million, highlighting the growth in network participation.
Indexer Overview: Covalent
Covalent is a decentralized indexer network in which Block Specimen Producer (BSP) nodes create copies of blockchain data through batch exports and publish proofs on the Covalent L1 blockchain. Block Result Producer (BRP) nodes then refine this data, applying set rules to extract the relevant records.
Through a unified API, developers can easily extract relevant blockchain data in a consistent request and response format without the need for custom complex queries. CQT tokens settled on Moonbeam can be used as payment to extract these pre-configured data sets from network operators.
Covalent rewards have shown an overall growth trend from Q1 23 to Q1 24, partially due to the increase in the Covalent token CQT price.
Considerations when choosing indexers include data customization, security, speed, scalability, and supported networks to ensure optimal performance and reliability.
In conclusion, while indexers are widely adopted in dApp development, their potential remains immense, especially when integrating AI. As AI becomes more prevalent in both Web2 and Web3, its enhancement relies on access to relevant data for training models and developing AI agents. Ensuring data integrity is crucial for AI applications to prevent biased or inaccurate information from affecting models.
In the realm of indexer solutions, Subsquid has made significant progress in performance and user metrics. Users have started experimenting with building AI agents using Subsquid, showcasing the platform’s versatility and potential in the evolving field of data indexing. Tools like AutoAgora help indexers dynamically price query services on The Graph using AI, while SubQuery supports multiple AI networks like OriginTrail and Oraichain for transparent data indexing.
The integration of AI with indexers is poised to enhance data accessibility and usability in the blockchain ecosystem. By leveraging AI technologies, indexers can provide more efficient and accurate data retrieval, enabling developers to build more sophisticated dApps and analytical tools. With AI and indexers continuing to evolve together, we remain optimistic about the future of data indexing and its role in shaping the decentralized digital landscape.