Why is multimodal modularity an illusion for Web3AI? Why should Web3AI adopt the strategy of encircling the cities from the countryside?

Summary: The future of Web3AI lies not in imitation but in strategic circumvention. From semantic alignment in high-dimensional space, to the information bottlenecks of attention mechanisms, to feature alignment under heterogeneous computing power, I will explain step by step why Web3AI should adopt the strategy of encircling the cities from the countryside.
Movemaker
2025-06-18 18:45:29

Author: @BlazingKevin_, Researcher at Movemaker

NVIDIA has quietly recovered all the losses triggered by DeepSeek and even reached new highs. The evolution of multimodal models has not caused chaos; instead, it has deepened the technical barriers of Web2 AI: from semantic alignment to visual understanding, from high-dimensional embedding to feature fusion, complex models are integrating different modalities of expression at unprecedented speed, building an ever more closed AI stronghold. The U.S. stock market has voted with its feet, with both crypto stocks and AI stocks enjoying a small bull market, yet this wave of enthusiasm has nothing to do with Crypto.

The Web3AI attempts we see, especially the evolution of the Agent direction in recent months, are almost entirely misguided: the naive attempt to assemble a Web2-style multimodal modular system out of decentralized structures is a dual misalignment of technology and thinking. In today's world, where modules are tightly coupled, feature distributions are highly unstable, and computing power demands are increasingly centralized, multimodal modularity cannot stand in Web3. We must point out that the future of Web3AI lies not in imitation but in strategic circumvention. From semantic alignment in high-dimensional space, to the information bottlenecks of attention mechanisms, to feature alignment under heterogeneous computing power, I will explain step by step why Web3AI should adopt the strategy of encircling the cities from the countryside.

Web3AI Built on Flat Multimodal Models: Semantic Misalignment Leads to Poor Performance

In modern Web2 AI multimodal systems, "semantic alignment" means mapping information from different modalities (images, text, audio, video, etc.) into the same, or mutually convertible, semantic space, so that the model can understand and compare the underlying meanings of these originally disparate signals. For example, a photo of a cat and the phrase "a cute cat" need to be projected close to each other in a high-dimensional embedding space, so that during retrieval, generation, or reasoning the model can describe what it sees in an image or associate a sound with an image.
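
As a concrete illustration, the sketch below scores an image against candidate captions using the open-source CLIP model via the Hugging Face transformers library. This is only a minimal sketch of the alignment idea, not the pipeline any particular Web2 system uses; the image path is hypothetical.

```python
# Minimal sketch of image-text semantic alignment with CLIP; assumes the
# `transformers` library and the public "openai/clip-vit-base-patch32"
# checkpoint. "cat.jpg" is a hypothetical local photo of a cat.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
texts = ["a cute cat", "a stock market chart"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image and text are projected into the same embedding space;
# logits_per_image holds the image's similarity to each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))  # "a cute cat" should dominate
```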

Only once a high-dimensional embedding space is in place does it make sense to split the workflow into modules to cut costs and improve efficiency. In Web3 Agent protocols, however, a high-dimensional embedding space cannot be realized, which is precisely why modularity is an illusion for Web3AI.

How should we understand a high-dimensional embedding space? At the most intuitive level, think of it as a coordinate system: on the familiar two-dimensional plane, a point is completely determined by two numbers (x, y), whereas in a "high-dimensional" space each point is described by many more numbers, perhaps 128, 512, or even thousands.

To understand it step by step:

  1. Two-Dimensional Example:
  • Imagine you have marked the coordinates of several cities on a map, such as Beijing (116.4, 39.9), Shanghai (121.5, 31.2), Guangzhou (113.3, 23.1). Each city corresponds to a "two-dimensional embedding vector": the two-dimensional coordinates encode geographical location information into numbers.
  • If you want to measure the "similarity" between cities—cities that are close together on the map often belong to the same economic or climatic zone—you can directly compare their Euclidean distances.
  2. Expanding to Multiple Dimensions:
  • Now suppose you not only want to describe the position in "geographical space" but also want to add some "climatic features" (average temperature, rainfall) and "demographic features" (population density, GDP). You can assign each city a vector containing these 5, 10, or even more dimensions.
  • For example, Guangzhou's 5-dimensional vector might be [113.3, 23.1, 24.5, 1700, 14.5], representing longitude, latitude, average temperature, annual rainfall (in mm), and economic index. This "multi-dimensional space" allows you to compare cities across multiple dimensions such as geography, climate, and economy: if the vectors of two cities are very close, it means they are very similar in these attributes.
  3. Switching to Semantics—Why "Embed":
  • In natural language processing (NLP) or computer vision, we also want to map "words," "sentences," or "images" into such a multi-dimensional vector, allowing "similar meaning" words or images to be closer in space. This mapping process is called "embedding."
  • For example, we train a model to map "cat" to a 300-dimensional vector v₁, map "dog" to another vector v₂, and map an "unrelated" word like "economy" to v₃. In this 300-dimensional space, the distance between v₁ and v₂ will be small (because they are both animals and often appear in similar linguistic contexts), while the distance between v₁ and v₃ will be large.
  • As the model trains on massive amounts of text or image-text pairs, the dimensions it learns do not correspond directly to interpretable attributes like "longitude" or "latitude," but to latent semantic features. Some dimensions may capture the coarse-grained distinction of "animal vs. non-animal," others may distinguish "domestic vs. wild," and still others may correspond to feelings like "cute vs. fierce." In short, hundreds or thousands of dimensions work together to encode complex, intertwined layers of meaning.
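
A minimal numerical sketch of the ideas above, using NumPy. The climate and economy figures for Shanghai and the 300-d "word vectors" are random stand-ins for illustration, not real statistics or trained embeddings:

```python
import numpy as np

# Guangzhou's 5-d vector from the example: [longitude, latitude,
# avg. temperature, annual rainfall (mm), economic index].
guangzhou = np.array([113.3, 23.1, 24.5, 1700.0, 14.5])
shanghai = np.array([121.5, 31.2, 17.1, 1200.0, 38.7])  # illustrative values

# Euclidean distance: smaller means more similar across these attributes.
print(np.linalg.norm(guangzhou - shanghai))

# Toy "word embeddings": random 300-d stand-ins for trained vectors.
rng = np.random.default_rng(0)
cat = rng.normal(size=300)
dog = cat + 0.3 * rng.normal(size=300)  # close to "cat": related semantics
economy = rng.normal(size=300)          # unrelated word, far away

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, dog), cosine(cat, economy))  # high vs. near zero
```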

What is the difference between high and low dimensionality? Only with enough dimensions can diverse, interwoven semantic features be accommodated; only a high-dimensional space can give each of them a clear position along its own semantic axis. When semantics cannot be distinguished, that is, when semantics cannot be aligned, different signals "squeeze" against each other in the low-dimensional space, causing the model to confuse them during retrieval or classification and accuracy to drop sharply. In the strategy-generation phase, subtle differences become hard to capture, so key trading signals may be missed or risk thresholds misjudged, directly dragging down performance. Cross-module collaboration also becomes impossible: each Agent acts on its own, information silos grow severe, overall response latency rises, and robustness declines. Finally, facing complex market scenarios, a low-dimensional structure has almost no capacity to carry multi-source data, making stability and scalability hard to guarantee; long-term operation inevitably runs into performance bottlenecks and maintenance dilemmas, leaving a large gap between the product's post-launch performance and its initial promise.
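
To make the "squeezing" concrete, here is a toy NumPy experiment on entirely synthetic data, an assumption made purely for illustration: ten well-separated clusters in a 300-dimensional space stay almost perfectly distinguishable, but after a random projection down to two dimensions, nearest-centroid retrieval starts confusing them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, dim_high, dim_low, pts = 10, 300, 2, 50

# Ten synthetic "semantic clusters" in 300-d space.
centers = rng.normal(size=(n_clusters, dim_high))
X = np.concatenate([c + rng.normal(size=(pts, dim_high)) for c in centers])
labels = np.repeat(np.arange(n_clusters), pts)

def nearest_centroid_accuracy(points, centroids, labels):
    # Assign each point to its nearest centroid; measure how often it is right.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return (d.argmin(axis=1) == labels).mean()

print("300-d accuracy:", nearest_centroid_accuracy(X, centers, labels))

# Random projection down to 2 dimensions: the clusters "squeeze"
# together and retrieval starts to confuse them.
R = rng.normal(size=(dim_high, dim_low))
print("2-d accuracy:  ", nearest_centroid_accuracy(X @ R, centers @ R, labels))
```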

So, can Web3AI or Agent protocols achieve a high-dimensional embedding space? First, how is high dimensionality achieved at all? Traditionally, "high dimensionality" requires that subsystems such as market intelligence, strategy generation, execution, and risk control align with and complement one another in their data representations and decision processes. Most Web3 Agents, however, simply wrap ready-made APIs (CoinGecko, DEX interfaces, and so on) into standalone "Agents." Lacking a unified central embedding space and cross-module attention mechanisms, information cannot interact between modules across multiple angles and levels; it can only flow through a linear pipeline, so the system exhibits single-purpose functionality and never forms a closed loop of optimization.

Many Agents call external interfaces directly, without even adequate fine-tuning or feature engineering on the returned data. For example, a market analysis Agent may simply take price and volume, a trade-execution Agent may only place orders according to interface parameters, and a risk-control Agent may only raise alerts based on a few thresholds. Each performs its own role, but without multimodal fusion or a deep semantic understanding of the same risk events or market signals, the system cannot quickly generate comprehensive, multi-angle strategies when facing extreme market conditions or cross-asset opportunities.

Therefore, demanding that Web3AI achieve a high-dimensional space amounts to demanding that an Agent protocol develop every involved API interface itself, which contradicts its modular rationale. The modular multimodal systems envisioned by small and mid-sized Web3AI teams are untenable. A high-dimensional architecture requires end-to-end unified training or collaborative optimization: from signal capture to strategy computation to execution and risk control, every link shares the same representations and loss functions. The "module as plugin" approach of Web3 Agents only deepens the fragmentation: each Agent's upgrades, deployments, and parameter tuning happen inside its own silo, iterations are hard to synchronize, and there is no effective centralized monitoring or feedback mechanism, so maintenance costs soar while overall performance stays capped.

To achieve a full-link intelligent agent with industry barriers, systematic engineering is needed for end-to-end joint modeling, unified embedding across modules, and collaborative training and deployment. However, there is currently no such pain point in the market, and naturally, there is no market demand.

In Low-Dimensional Space, Attention Mechanisms Cannot Be Precisely Designed

High-level multimodal models require precisely designed attention mechanisms. The "attention mechanism" is essentially a way to dynamically allocate computational resources, allowing the model to selectively "focus" on the most relevant parts when processing a certain modality input. The most common are self-attention and cross-attention mechanisms in Transformers: self-attention enables the model to measure the dependency relationships between elements in a sequence, such as the importance of each word in a text; cross-attention allows information from one modality (such as text) to determine which image features to "look at" when decoding or generating another modality (such as a sequence of image features). Through multi-head attention, the model can learn various alignment methods simultaneously in different subspaces, capturing more complex and finer-grained associations.

The premise for the attention mechanism to function is that the multimodal data lives in a high-dimensional space; only there can a precisely designed attention mechanism pick out the most relevant parts of a vast representation in the shortest time. Before explaining why attention must operate in a high-dimensional space, let us first look at how Web2 AI, represented by the Transformer decoder, designs its attention mechanisms. The core idea is to dynamically assign "attention weights" to each element when processing a sequence (text, image patches, audio frames), so that the model focuses on the most relevant information rather than treating everything equally.

In simple terms, if the attention mechanism is a car, then designing Query-Key-Value is designing its engine. Q-K-V is the mechanism that identifies key information: the Query asks "what am I looking for?", the Key declares "what labels do I carry?", and the Value holds "what content is here?". For a multimodal model, the input might be a sentence, an image, or an audio clip. To retrieve what is needed from the embedding space, these inputs are cut into minimal units: a character, a small pixel patch, a segment of an audio frame. The model generates a Query, Key, and Value for each of these units and performs attention over them. When processing a given position, the model compares that position's Query against the Keys of all other positions to decide which labels best match the current need, then extracts the corresponding Values weighted by the degree of match, combining them by importance into a new representation that carries both its own information and the relevant global content. In this way, every output can dynamically "ask, retrieve, and integrate" based on context, achieving efficient and precise focusing of information.
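
A minimal NumPy sketch of this "ask, retrieve, integrate" loop, with random matrices standing in for the learned projections (the scaling factor anticipates the stabilizer described in the next paragraph):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
tokens, d_model = rng.normal(size=(4, 8)), 8  # 4 toy tokens, 8-d embeddings

# Learned projections in a real model; random stand-ins here.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v  # Query / Key / Value

# Ask -> retrieve -> integrate: every Query scores every Key, and the
# scores decide how much of each Value is blended into the output.
scores = Q @ K.T / np.sqrt(d_model)  # scaled dot products (see next paragraph)
weights = softmax(scores, axis=-1)   # one attention distribution per token
output = weights @ V                 # context-aware representation per token
print(weights.round(2))
```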

On this engine's foundation, various parts are added, cleverly combining "global interaction" with "controllable complexity": scaling dot products ensure numerical stability, multi-head parallelism enriches expression, positional encoding preserves sequence order, sparse variants balance efficiency, residuals and normalization aid stable training, and cross-attention connects multimodal data. These modular and progressively layered designs enable Web2 AI to possess strong learning capabilities while efficiently operating within a bearable computational power range when handling various sequences and multimodal tasks.
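
Of these additions, cross-attention is the one that directly connects modalities. A minimal sketch with illustrative shapes: the text tokens issue the Queries while the image patches supply the Keys and Values, so each word decides which parts of the image to "look at".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d = 8
text_tokens = rng.normal(size=(3, d))     # e.g. "a", "cute", "cat"
image_patches = rng.normal(size=(16, d))  # 16 patch embeddings from one image

Q = text_tokens        # Queries come from one modality...
K = V = image_patches  # ...Keys and Values from the other

attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (3, 16): patches each word attends to
fused_text = attn @ V                          # text enriched with visual content
print(attn.shape, fused_text.shape)
```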

Why can't modular Web3AI achieve unified attention scheduling? First, the attention mechanism relies on a unified Query-Key-Value space; all input features must be mapped to the same high-dimensional vector space to dynamically calculate weights through dot products. However, independent APIs return data in different formats and distributions—prices, order statuses, threshold alerts—without a unified embedding layer, making it impossible to form a set of interactive Q/K/V. Secondly, multi-head attention allows for simultaneous parallel attention to different information sources at the same layer, then aggregates the results; whereas independent APIs often follow a "call A, then call B, then call C" sequence, where each step's output is merely the next module's input, lacking the ability for parallel, multi-route dynamic weighting, and thus cannot simulate the fine scheduling of the attention mechanism that scores all positions or modalities simultaneously and then integrates them. Finally, a true attention mechanism dynamically allocates weights to each element based on the overall context; under the API model, modules can only see their "independent" context when called, lacking a real-time shared central context, making it impossible to achieve global associations and focusing across modules.

Therefore, merely wrapping individual functions as discrete APIs, with no common vector representation and no parallel weighting and aggregation, cannot build a Transformer-style "unified attention scheduling" capability, just as a car with a weak engine cannot be fixed no matter how the rest of it is modified.
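
The contrast can be seen in a few lines. Below is a hypothetical linear Agent pipeline, with all names invented for illustration: each stage sees only the previous stage's output, so there is no shared context over which anything like attention could be computed.

```python
# A hypothetical Web3 Agent pipeline (all names invented): a strictly
# linear chain in which each stage sees only the previous stage's output.
def market_agent(symbol):
    # e.g. a thin wrapper over a CoinGecko-style price API
    return {"symbol": symbol, "price": 74656.06, "volume": 1.2e9}

def strategy_agent(market):
    # only sees the market dict handed to it
    return {"side": "buy" if market["price"] < 80000 else "hold", "size": 0.1}

def risk_agent(order):
    # a fixed threshold check, blind to the market context upstream
    return {"approved": order["size"] <= 1.0, "order": order}

# Call A, then B, then C: no parallel routes, no shared Q/K/V space,
# no way to re-weight earlier signals in light of later ones.
print(risk_agent(strategy_agent(market_agent("BTC"))))
```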

Discrete Modular Assembly Leads to Shallow Static Feature Fusion

"Feature fusion" is the further combination of feature vectors obtained from processing different modalities based on alignment and attention, for direct use in downstream tasks (classification, retrieval, generation, etc.). Fusion methods can be as simple as concatenation or weighted summation, or as complex as bilinear pooling, tensor decomposition, or even dynamic routing techniques. Higher-order methods involve alternating alignment, attention, and fusion in multi-layer networks, or establishing more flexible message-passing paths between cross-modal features using graph neural networks (GNNs) to achieve deep information interaction.

Needless to say, Web3AI sits at the stage of simplest concatenation, because dynamic feature fusion presupposes a high-dimensional space and a precisely designed attention mechanism. When those prerequisites are unmet, the final stage of feature fusion cannot deliver outstanding performance.

Web2 AI tends to use end-to-end joint training: processing all modal features such as images, text, and audio simultaneously in the same high-dimensional space, optimizing collaboratively with downstream task layers through attention and fusion layers, allowing the model to automatically learn the optimal fusion weights and interaction methods during forward and backward propagation. In contrast, Web3 AI often adopts a discrete modular assembly approach, encapsulating various APIs for image recognition, market capture, risk assessment, etc., into independent Agents, and then simply piecing together their output labels, values, or threshold alerts, with the main logic or humans making comprehensive decisions. This approach lacks a unified training objective and does not allow for cross-module gradient flow.

In Web2 AI, the system relies on attention mechanisms to calculate the importance scores of various features in real-time based on context and dynamically adjust fusion strategies; multi-head attention can also capture various feature interaction patterns in parallel at the same level, balancing local details with global semantics. In contrast, Web3 AI often pre-fixes weights like "image × 0.5 + text × 0.3 + price × 0.2," or uses simple if/else rules to determine whether to fuse, or may not perform any fusion at all, simply presenting the outputs of each module, lacking flexibility.
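
The difference is easy to state in code. A toy sketch on synthetic vectors, with a crude single-head stand-in for real attention: the static variant bakes in "image × 0.5 + text × 0.3 + price × 0.2", while the dynamic variant recomputes the weights from context for every input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feats = {
    "image": rng.normal(size=8),
    "text": rng.normal(size=8),
    "price": rng.normal(size=8),
}

# Static fusion: weights fixed in advance, identical for every input.
static = 0.5 * feats["image"] + 0.3 * feats["text"] + 0.2 * feats["price"]

# Dynamic fusion: a context vector scores each modality, so the weights
# are recomputed per input (a crude one-head stand-in for attention).
context = rng.normal(size=8)  # e.g. the current task or query state
scores = np.array([context @ v for v in feats.values()])
weights = softmax(scores / np.sqrt(8))
dynamic = sum(w * v for w, v in zip(weights, feats.values()))
print(weights.round(2))  # changes whenever the context changes
```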

Web2 AI maps all modal features into thousands of dimensions of high-dimensional space, and the fusion process involves not only vector concatenation but also addition, bilinear pooling, and various high-order interaction operations—each dimension may correspond to some latent semantics, allowing the model to capture deep and complex cross-modal associations. In contrast, the outputs of Web3 AI's various Agents often contain only a few key fields or indicators, with extremely low feature dimensions, making it almost impossible to express nuanced information such as "why image content matches text meaning" or "the subtle relationship between price fluctuations and sentiment trends."

In Web2 AI, the losses of downstream tasks are continuously fed back to various parts of the model through attention and fusion layers, automatically adjusting which features should be reinforced or suppressed, forming a closed-loop optimization. In contrast, Web3 AI often relies on manual or external processes to evaluate and tune parameters after reporting API call results, lacking automated end-to-end feedback, making it difficult for fusion strategies to iterate and optimize online.
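
A minimal sketch of such a closed loop in PyTorch, on synthetic data; this illustrates only the mechanism, not any production system. The downstream loss backpropagates into learnable fusion weights, so the model itself decides which modality to reinforce or suppress.

```python
import torch

torch.manual_seed(0)
img, txt, price = (torch.randn(8) for _ in range(3))
target = torch.randn(1)

logits = torch.zeros(3, requires_grad=True)  # learnable fusion weights
head = torch.nn.Linear(8, 1)                 # stand-in downstream task layer
opt = torch.optim.SGD([logits, *head.parameters()], lr=0.1)

for _ in range(100):
    w = torch.softmax(logits, dim=0)
    fused = w[0] * img + w[1] * txt + w[2] * price
    loss = (head(fused) - target).pow(2).mean()  # downstream task loss
    opt.zero_grad()
    loss.backward()  # the loss flows back into the fusion weights
    opt.step()

print(torch.softmax(logits, dim=0))  # weights adjusted by the loss itself
```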

Barriers in the AI Industry Are Deepening, but Pain Points Have Yet to Emerge

Because it is necessary to simultaneously consider cross-modal alignment, precise attention calculations, and high-dimensional feature fusion in end-to-end training, Web2 AI's multimodal systems are often extremely large engineering projects. They require massive, diverse, and precisely labeled cross-modal datasets, as well as thousands of GPUs for weeks or even months of training time; in terms of model architecture, they integrate various cutting-edge network design concepts and optimization techniques; in engineering implementation, they must build scalable distributed training platforms, monitoring systems, model version management, and deployment pipelines; in algorithm development, continuous research is needed for more efficient attention variants, more robust alignment losses, and lighter fusion strategies. This full-link, full-stack systematic work places extremely high demands on funding, data, computing power, talent, and organizational collaboration, thus forming strong industry barriers and creating the core competitiveness held by a few leading teams to date.

In April, when I reviewed Chinese AI applications and compared them with Web3AI, I made a point: Crypto can break through in industries with strong barriers, meaning industries that are already very mature in traditional markets yet where significant pain points have emerged. High maturity means users are familiar with similar business models; significant pain points mean users are willing to try new solutions, that is, a strong willingness to accept Crypto. Both are indispensable: conversely, if an industry is not yet very mature in traditional markets, Crypto cannot take root in it even where pain points exist, and will have no room to survive, because users will have little willingness to understand it fully and will never perceive its potential ceiling.

Web3AI, or any Crypto product claiming PMF, needs to develop by encircling the cities from the countryside: start with small-scale trials in peripheral positions, consolidate the base, and wait for the core scenarios, the target cities, to emerge. The core of Web3AI is decentralization, and its evolutionary path lies in high parallelism, low coupling, and compatibility with heterogeneous computing power. This gives Web3AI an advantage in scenarios such as edge computing: lightweight, easily parallelized, incentivizable tasks such as LoRA fine-tuning, behavior-alignment post-training, crowdsourced data training and labeling, small foundation model training, and collaborative training across edge devices. The product architecture in these scenarios is lightweight, and the roadmap can be iterated flexibly.

But this does not mean the opportunity is now, because the barriers of Web2AI have only just begun to form. The emergence of DeepSeek has, if anything, accelerated progress on complex multimodal AI tasks; that is a contest among leading enterprises and the early phase of the Web2AI dividend. I believe that only when the Web2AI dividend has largely faded will the pain points it leaves behind become the entry points for Web3AI, just as DeFi was born. Until then, self-invented pain points of Web3AI will keep appearing in the market, and we need to discern carefully whether a protocol truly follows the "encircling the cities from the countryside" playbook:

  • Can it establish a foothold in the countryside (small markets, small scenarios) where its strength is weak and incumbents' roots are shallow, gradually accumulating resources and experience?
  • Can it link points into surfaces and advance wave by wave, continuously iterating and updating its product within a sufficiently small application scenario? If not, relying on PMF to reach a $1 billion market cap will be extremely difficult, and such projects do not belong on the watchlist.
  • Can it fight a protracted war with flexibility? The potential barriers of Web2AI are changing dynamically, and so are the corresponding potential pain points. A Web3AI protocol needs enough agility to pivot quickly across scenarios and move fast between villages so as to approach the target cities as soon as possible. If the protocol itself is too infrastructure-heavy and its network architecture too large, its odds of being eliminated are very high.

About Movemaker

Movemaker is the first official community organization authorized by the Aptos Foundation and jointly initiated by Ankaa and BlockBooster, focusing on promoting the construction and development of the Aptos ecosystem in the Chinese-speaking region. As the official representative of Aptos in the Chinese-speaking area, Movemaker is committed to building a diverse, open, and prosperous Aptos ecosystem by connecting developers, users, capital, and numerous ecological partners.

Disclaimer:

This article/blog is for reference only, representing the author's personal views and does not represent the position of Movemaker. This article does not intend to provide: (i) investment advice or recommendations; (ii) offers or solicitations to buy, sell, or hold digital assets; or (iii) financial, accounting, legal, or tax advice. Holding digital assets, including stablecoins and NFTs, carries high risks, with significant price volatility, and they may even become worthless. You should carefully consider whether trading or holding digital assets is suitable for you based on your financial situation. If you have specific questions, please consult your legal, tax, or investment advisor. The information provided in this article (including market data and statistics, if any) is for general reference only. Reasonable care has been taken in compiling this data and charts, but no responsibility is accepted for any factual errors or omissions expressed therein.
