1. The challenges for the modern blockchain data stack
There are several challenges that a modern blockchain indexer can face, including:
- Large amounts of data. As the amount of data on the blockchain increases, the data index needs to scale up to handle the increased load and provide efficient access to the data. This leads to higher storage costs, slow statistics calculation and increased load on the database server.
- Complex data processing pipeline. Blockchain technology is complex, and building a comprehensive and reliable data index requires a deep understanding of the underlying data structures and algorithms. The diversity of blockchain implementations compounds this. For example, NFTs on Ethereum are usually created within smart contracts following the ERC721 and ERC1155 standards, whereas on Polkadot they are usually built directly into the blockchain runtime. Both should be considered NFTs and should be stored as such.
- Integration capabilities. To provide maximum value to users, a blockchain indexing solution may need to integrate its data index with other systems, such as analytics platforms or APIs. This is challenging and requires significant effort in the architecture design.
As blockchain technology became more widespread, the amount of data stored on the blockchain increased. This is because more people are using the technology, and each transaction adds new data to the blockchain. Additionally, blockchain technology has evolved from simple money transfer applications, such as those involving the use of Bitcoin, to more complex applications involving the implementation of business logic within smart contracts. These smart contracts can generate large amounts of data, contributing to the increased complexity and size of the blockchain. Over time, this led to a larger and more complex blockchain.
In this article, we review the evolution of Footprint Analytics’ technology architecture in phases as a case study to examine how the Iceberg-Trino technology stack addresses the challenges of on-chain data.
Footprint Analytics has indexed data from about 22 public blockchains, 17 NFT marketplaces, 1,900 GameFi projects and more than 100,000 NFT collections into a semantic abstraction data layer. It is the most comprehensive blockchain data warehouse solution in the world.
This blockchain data includes more than 20 billion rows of financial transaction records, which data analysts query regularly. It is very different from the entry logs of traditional data warehouses.
In the past few months we have experienced 3 major upgrades to meet the growing business requirements:
2. Architecture 1.0 Bigquery
At the beginning of Footprint Analytics, we used Google Bigquery as our storage and query engine; Bigquery is a great product. It’s lightning fast, easy to use, and offers dynamic computational power and a flexible UDF syntax that helps us get the job done quickly.
However, Bigquery also has several problems.
- Data is not compressed, which results in high costs, especially when storing the raw data of Footprint Analytics' more than 22 blockchains.
- Insufficient concurrency: Bigquery supports only 100 concurrent queries, which is unsuitable for Footprint Analytics' high-concurrency scenarios when serving many analysts and users.
- Vendor lock-in: Google Bigquery is a closed-source product.
So we decided to explore other alternative architectures.
3. Architecture 2.0 OLAP
We were very interested in some of the OLAP products that became very popular. The most attractive advantage of OLAP is its query response time, which usually takes sub-seconds to return query results for large amounts of data, and it can also support thousands of concurrent queries.
We chose one of the best OLAP databases, Doris, to try it out. The engine worked well. However, we soon ran into some other problems:
- Data types such as Array or JSON were not yet supported (as of November 2022). Arrays are a common data type in some blockchains, for example the topics field in EVM logs. Not being able to compute on arrays directly affected our ability to compute many business metrics.
- Limited support for DBT and for merge statements. These are common requirements for data engineers in ETL/ELT scenarios where we need to update newly indexed data.
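To make these two gaps concrete, here is a minimal sketch in Trino-style SQL of the kinds of statements our pipeline relies on. The table and column names are hypothetical, for illustration only:

```sql
-- 1. Computing over an array column: expand the topics array of EVM
--    logs so individual topics can be filtered and aggregated.
SELECT l.tx_hash, t.topic
FROM evm_logs AS l
CROSS JOIN UNNEST(l.topics) AS t (topic);

-- 2. Upserting newly indexed data with a merge statement, the kind of
--    ETL/ELT operation that was hard to express without merge support.
MERGE INTO token_transfers AS target
USING newly_indexed_transfers AS source
ON target.tx_hash = source.tx_hash
   AND target.log_index = source.log_index
WHEN MATCHED THEN
    UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN
    INSERT (tx_hash, log_index, amount)
    VALUES (source.tx_hash, source.log_index, source.amount);
```

An engine that cannot run statements like these forces the pipeline to pre-flatten arrays and rewrite upserts as delete-then-insert jobs, which is exactly the overhead we were trying to avoid.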
As a result, we could not use Doris for our entire production data pipeline. Instead, we tried using Doris to solve part of the problem, acting as a query engine in the data production pipeline and providing fast, highly concurrent query capabilities.
Unfortunately, we could not replace Bigquery with Doris, so we had to periodically synchronize data from Bigquery to Doris, using Doris only as a query engine. This synchronization process had several problems; one was that update writes quickly piled up when the OLAP engine was busy serving queries to front-end clients. The speed of the writing process then suffered, and synchronization took much longer and sometimes even became impossible to complete.
We realized that OLAP could solve several of the issues we faced but could not become the key solution for Footprint Analytics, especially for the data processing pipeline. Our problem was bigger and more complex; OLAP as a query engine alone was not enough for us.
4. Architecture 3.0 Iceberg + Trino
Welcome to Footprint Analytics Architecture 3.0, a complete overhaul of the underlying architecture. We redesigned the entire architecture from the ground up to separate data storage, computation and querying into three different pieces, taking lessons from Footprint Analytics' two earlier architectures and learning from the experience of other successful big data projects such as Uber, Netflix and Databricks.
4.1. Introducing the data lake
We first turned our attention to the data lake, a new type of data storage for both structured and unstructured data. The data lake is perfect for on-chain data storage, as the formats of on-chain data range widely from unstructured raw data to the structured abstraction data Footprint Analytics is known for. We expected the data lake to solve the problem of data storage, and ideally it would also support mainstream compute engines such as Spark and Flink, so that integrating with different types of processing engines would not be painful as Footprint Analytics evolves. We settled on Apache Iceberg as the table format for our data lake.
Iceberg integrates very well with Spark, Flink, Trino and other computing engines, and we can choose the most suitable calculation for each of our metrics. For example:
- For metrics that need complex computational logic, Spark is the choice.
- Flink is great for real-time calculation.
- For simple ETL tasks that can be performed with SQL, we use Trino.
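As a sketch of the last case, a simple SQL-only ETL task in Trino might aggregate raw transfers into a daily metric table. The table and column names below are hypothetical:

```sql
-- Hypothetical daily aggregation written back to an Iceberg table;
-- a task simple enough that SQL alone, via Trino, is sufficient.
INSERT INTO daily_active_addresses
SELECT date_trunc('day', block_time) AS day,
       count(DISTINCT from_address)  AS active_addresses
FROM token_transfers
GROUP BY 1;
```

Keeping such jobs in plain SQL means they can be scheduled and versioned alongside the rest of the metric definitions, reserving Spark for the genuinely complex computations.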
4.2. Query Engine
With Iceberg solving the storage and computation problems, we had to think about choosing a query engine. There were not many options available. The alternatives we considered were:
The most important thing we considered before going deeper was that the future query engine had to be compatible with our current architecture.
- Support for Bigquery as a data source
- Support for DBT, on which we rely to produce many of our metrics
- Support for the BI tool Metabase
Based on the above, we chose Trino. Trino has very good support for Iceberg, and the team was so responsive that when we reported a bug, it was fixed the next day and released in the latest version the following week. This was the best choice for the Footprint team, which also requires high responsiveness in implementation.
4.3. Performance testing
After deciding on our direction, we did a performance test on the Trino + Iceberg combo to see if it could meet our needs and to our surprise the queries were incredibly fast.
Knowing that Presto + Hive had been the worst performer in OLAP comparisons for years, the combination of Trino + Iceberg completely blew our minds.
Here are the results of our tests.
case 1: join a large dataset
An 800 GB table1 joins another 50 GB table2 and performs complex business calculations
case 2: use a large single table to perform a distinct query
Test SQL: select distinct(address) from table group by day
The Trino+Iceberg combination is about 3 times faster than Doris in the same configuration.
Moreover, there is another pleasant surprise: Iceberg can use data file formats such as Parquet and ORC, which compress the data for storage. Iceberg's table storage takes up only about 1/5 the space of the other data warehouses. The storage size of the same table in the three databases is as follows:
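For instance, with Trino's Iceberg connector the file format is chosen per table at creation time. A minimal sketch, with a hypothetical table and columns, might look like:

```sql
-- Hypothetical DDL; the Iceberg connector in Trino lets you pick a
-- compressed columnar file format (Parquet, ORC) as a table property.
CREATE TABLE iceberg.raw.evm_logs (
    tx_hash    varchar,
    log_index  integer,
    topics     array(varchar)
)
WITH (
    format = 'PARQUET'
);
```

The columnar, compressed layout is what drives the roughly 5x storage saving over row-oriented or uncompressed storage.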
Note: The above tests are examples we encountered in real production and are for reference only.
4.4. Upgrade effect
The performance test reports gave us enough confidence that our team took about 2 months to complete the migration. Here is a diagram of our architecture after the upgrade:
- Multiple computing engines suit our different needs.
- Trino supports DBT and can query Iceberg directly, so we no longer need to deal with data synchronization.
- The amazing performance of Trino + Iceberg allows us to open up all of our Bronze data (raw data) to users.
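To illustrate the DBT point, a dbt model running against Trino can materialize incrementally into Iceberg using plain SQL. The model and table names below are hypothetical:

```sql
-- Hypothetical dbt model (e.g. models/daily_transfers.sql); dbt renders
-- the Jinja config and Trino executes the SELECT against Iceberg tables.
{{ config(materialized='incremental', unique_key='day') }}

SELECT date_trunc('day', block_time) AS day,
       count(*)                      AS transfer_count
FROM {{ source('raw', 'token_transfers') }}
{% if is_incremental() %}
-- On incremental runs, only process data newer than what is materialized.
WHERE block_time > (SELECT max(day) FROM {{ this }})
{% endif %}
GROUP BY 1
```

Because Trino both runs the dbt models and serves interactive queries over the same Iceberg tables, there is no separate synchronization step between the production pipeline and the query layer.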
5. Summary
Since its launch in August 2021, the Footprint Analytics team has completed three architectural upgrades in less than a year and a half, thanks to its strong desire and determination to bring the benefits of the best database technology to its crypto users, and to solid execution in implementing and upgrading its underlying infrastructure and architecture.
The Footprint Analytics Architecture 3.0 upgrade has brought a new experience to its users, enabling users from different backgrounds to gain insights into more diverse usage and applications:
- Built with the Metabase BI tool, Footprint makes it easy for analysts to access decoded on-chain data, explore with complete freedom of choice of tools (no-code or hard-coded), query the entire history, and cross-examine datasets to gain insights in no time.
- Integration of both on-chain and off-chain data for analysis across web2 + web3.
- By building / querying metrics on top of Footprint's business abstraction, analysts and developers save the 80% of their time spent on repetitive data processing work and can focus on meaningful statistics, research and product solutions based on their business.
- A seamless experience from Footprint Web to REST API calls, all based on SQL.
- Real-time alerts and actionable notifications on key signals to support investment decisions.