
VR and 3D Technology Architecture Design and Practice on the Web

This article is based on the 2021 GMTC Global Frontend Technology Conference's 'New Trends in Mobile Technology' sub-topic, sharing the theme 'VR and 3D Technology Architecture Design and Practice on the Web', and is organized from the original presentation.

Beike Product & Technology

VR house viewing is one of the scenarios where VR and 3D technology have landed in practice. Its defining feature is that it lets users immerse themselves through mobile devices and use their intuitive sense of space to feel the character of an entire house. This talk introduces how the frontend team of Beike LuShi designs frontend architecture around VR 3D models, and shares how LuShi explores new business forms on top of VR house-viewing capabilities, along with the technical challenges involved.

1 Frontend Architecture Design Based on VR 3D Models

Before introducing frontend architecture design, let's first detail the composition and form of VR 3D models in the house viewing scenario.

The form of VR 3D models for house viewing is diverse, but the three main forms that users intuitively feel are: 3D model form, point panorama form, and VR glasses perspective form. The following provides a detailed introduction to these three forms.

1.1 3D Model

First, let's consider how three-dimensional models are abstractly represented. The mainstream approach to 3D model representation is the polygon mesh, as shown in Figure 1. Intuitively, the more polygon faces (the higher the face density), the more realistic the reconstructed 3D effect. The simplest polygon is of course the triangle (in most contexts, the 'face' referred to is a triangular face). Every detail of a three-dimensional object can be described with the geometric concepts of the vertices, edges, and faces of triangles. Viewed up close, a face-based 3D model is essentially a geometric body with an extremely high face density.

Therefore, after collecting data using some professional 3D scanners (such as LuShi's self-developed Riemann, Galois, etc.) or panoramic cameras and other equipment, and then processing it through algorithms, we can obtain the triangular face data that describes the three-dimensional structure. Frontend developers can then use technologies like WebGL to render it to the browser. At this point, we can obtain the three-dimensional outline of the property, as shown in Figure 2 (left). Of course, Figure 2 (right) is what we expect. Having only a three-dimensional 'skeleton' outline is not enough. We need to add a 'skin' on this basis, and this 'skin' is the UV texture map.

For three-dimensional models, there are two important coordinate systems: the position (x, y, z) coordinates of the vertices, and the UV coordinates. What is UV? Simply put, it is the basis for mapping a two-dimensional image onto the surface of a three-dimensional model; a typical UV mapping effect is shown in Figure 3. As mentioned earlier, the 3D structure is composed of triangular faces defined by vertices, edges, and faces. Each triangular face is itself two-dimensional: through a mapping relationship stored with the model data, a triangle with matching edges is cut out of the UV texture map and pasted onto the corresponding face. 'UV', then, refers to the information that defines how each point on the two-dimensional texture image maps to a position on a triangular face of the 3D structure. For frontend engineers, the principle is the same as a CSS sprite combining multiple icons into one image.
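To make the vertex-plus-UV idea concrete, here is a minimal sketch in plain JavaScript (the data and function names are illustrative, not LuShi's actual engine API). Each vertex carries both a 3D position and a 2D UV pair, and the UV pair selects a texel on the shared texture atlas, exactly like a CSS sprite picking one region out of a combined image:

```javascript
// One triangular face: three vertices, each with a position (x, y, z)
// and a normalized UV coordinate (u, v) into the texture atlas.
const face = {
  positions: [[0, 0, 0], [1, 0, 0], [0, 1, 0]],
  uvs:       [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]],
};

// Convert a normalized UV coordinate to a pixel position on a texture.
// By WebGL convention v = 0 is the bottom of the image, so we flip it
// to get top-left-origin pixel coordinates.
function uvToPixel([u, v], texWidth, texHeight) {
  return {
    x: Math.round(u * (texWidth - 1)),
    y: Math.round((1 - v) * (texHeight - 1)),
  };
}

const px = uvToPixel(face.uvs[1], 1024, 1024);
console.log(px); // { x: 512, y: 1023 }
```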

Thus, based on the triangular face and UV texture map data, we successfully rendered the 3D model of the property. Of course, for performance considerations, our triangular face density is not particularly high. It is currently unrealistic to rely solely on 3D models to restore the real details of the property on terminal devices (iOS/Android, etc.). With fewer triangular faces, lower data volume, and lower memory usage, we can use 3D models to restore the overall structure of the property. As for details, they are achieved through point-based cubic panoramas.

1.2 Point Panorama

As mentioned earlier, the overall structure of the property is reflected through 3D models, while details are expressed in the form of panoramas. We will select multiple suitable positions in the property to shoot panoramic images, and then render them in the form of cubic panoramas to achieve a 720-degree viewing effect, as shown in Figure 4 (left).

Panorama rendering is a relatively mature technology. The two mainstream implementations are cubic panoramas and spherical panoramas, each with its own trade-offs. Because cubic panoramas are cheap to post-process, LuShi currently adopts the cubic panorama solution. The principle is to render a cube and paste one image on each of its six faces: up, down, front, back, left, and right. Note that any four consecutive side images, spliced together, form a continuous panoramic strip, as the unfolded T-shaped cube texture in Figure 4 (right) shows. With the eye placed at the center point of the cube and looking around, the result is a seamless panoramic effect.
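A small sketch of the cube idea (illustrative only, not the production renderer): given a view direction from the cube's center, the face being looked at is the one whose axis has the largest absolute component of the direction vector. This is the same selection rule GPU cubemap sampling uses internally:

```javascript
// Pick which of the six cube faces a view direction from the center hits.
// The face is determined by the largest absolute component of (x, y, z).
function pickCubeFace([x, y, z]) {
  const ax = Math.abs(x), ay = Math.abs(y), az = Math.abs(z);
  if (ax >= ay && ax >= az) return x > 0 ? 'right' : 'left';
  if (ay >= ax && ay >= az) return y > 0 ? 'up' : 'down';
  return z > 0 ? 'back' : 'front'; // the sign convention here is arbitrary
}

console.log(pickCubeFace([0, 0, -1]));     // 'front'
console.log(pickCubeFace([1, 0.2, 0.3]));  // 'right'
console.log(pickCubeFace([0.1, -0.9, 0.2])); // 'down'
```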

The quality of the panorama depends entirely on the clarity of the texture maps. We therefore shoot high-definition panoramic images at 2048 resolution to capture the detail of the property at a given position. This is the second core form of the VR 3D model for house viewing: the point panorama.

1.3 VR Glasses Panorama

The 3D models and point panorama forms mentioned earlier are all based on two-dimensional displays (naked-eye experience). If you want users to have an immersive feeling, you often need to rely on VR glasses devices. For such devices, we need to adapt to the WebXR Device API. Our current adaptation strategy is to render two identical point cubic panoramas, one for each eye. The final adapted effect is shown in Figure 5.

Due to the fact that most users' devices are still iOS/Android, the current naked-eye VR 3D experience is mainstream. With the promotion of hardware devices, when VR glasses become popular among ordinary users, this more immersive experience will gradually become mainstream.

Of course, in addition to the three forms of 3D model form, point panorama, and VR glasses panorama mentioned in this article, we also have multiple other forms internally, such as model vertical perspective, depth map rendering panoramic perspective, etc., but they are more technical and not deeply perceived by ordinary users, so they will not be introduced in detail here.

Finally, based on these three forms plus a two-dimensional floor plan of the property, we have formed the core structure of our VR 3D model for house viewing. On this basis, we continuously improve various interactions (such as Tween animations for switching between forms) and gradually evolve product functions into the familiar VR house viewing of Beike LuShi.

2 Frontend Architecture Layered Design

Figure 6 shows the layered frontend architecture. Having covered the composition of VR 3D property models and their three core forms, we have realized faithful restoration of property information through 3D technology. Over multiple rounds of product iteration, we have continuously refined this layered design. At present, the frontend of the entire VR client is abstracted into three layers: the Web service layer, the frontend data layer, and the View layer.

We split the View layer into four directions. The first is the pure DOM layer, such as the homepage content, control panel, and information panel; this layer is typically abstracted and reused as React/Vue components. The second is the three-dimensional view rendered via Canvas/WebGL, which provides the VR 3D model interaction described earlier. The third is our 3D plugin ecosystem: new interactions and capabilities are built on top of the VR 3D model as plugins (the in-model compass and TV video, for example, are integrated this way). The last is the protocol-layer abstraction. Our VR is rendered with Web frontend technology and embedded in the terminal App as a WebView, with bidirectional communication over jsBridge. To keep business code consistent, we abstract a protocol layer over third-party dependencies (jsBridge/RTC/WebSocket, etc.) and develop against that protocol, eliminating the differences between terminals.

The second layer is the abstraction of the data layer. The data here does not refer to the data layer facing backend services, but the data layer abstraction of frontend UI interaction. We abstract the state of UI interaction in the form of global frame data. When the UI changes, it is synchronized to the frame data; of course, if the frame data is modified (modifying the frame data object), it will also drive the UI to change accordingly. This process is achieved through JavaScript's Proxy to intercept data objects, as shown in Figure 7. In other words, UI interaction can generate new frame data, and the corresponding UI state can also be restored through frame data. As for why so much effort is spent on this work, it will be explained in detail later.
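The Proxy interception described above can be sketched in a few lines (the function and field names are illustrative, not the real frame-data schema). Writes to the frame-data object are trapped, so mutating frame data drives the UI, while UI interaction can write back into the same object:

```javascript
// Wrap a frame-data object in a Proxy whose `set` trap notifies the UI.
function createFrameData(initial, onChange) {
  return new Proxy(initial, {
    set(target, key, value) {
      const old = target[key];
      target[key] = value;
      if (old !== value) onChange(key, value); // notify UI subscribers
      return true;
    },
  });
}

const uiLog = [];
const frame = createFrameData(
  { mode: 'panorama', pointId: 3 },
  (key, value) => uiLog.push(`${key} -> ${value}`) // stand-in for a UI update
);

frame.mode = 'model'; // "modifying the frame data drives the UI to change"
frame.pointId = 7;
console.log(uiLog); // [ 'mode -> model', 'pointId -> 7' ]
```

Because the state changes are serialized as plain data, the same frame data can later be replayed to restore the corresponding UI state, which is exactly the property the later business scenarios depend on.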

The third layer, the Web layer, has two core services. The HTTP service, based on Node.js/Go, mainly serves the HTML 'shell' and homepage data of the VR page, while the full-duplex data channel based on WebSocket keeps the VR experience in real-time communication with backend services. A WebSocket long connection has clear advantages over traditional HTTP (private protocol, real-time delivery, excellent performance, etc.) and has become irreplaceable for our business intelligence and performance work; the sections on business exploration and performance below will make this concrete.

That is roughly the frontend design of Beike LuShi's client. Most of our core businesses, such as the VR voice guide, VR real-time viewing, and AR house introduction, are developed on top of this design.

3 Differences Between 3D Model Development and Traditional DOM Development

As a frontend engineer working on 3D, I am often asked how development based on 3D models differs from traditional DOM development. There are real differences, but if you adapt to the following three points, you can essentially clear the entry barrier to frontend 3D development.

3.1 Three-dimensional Coordinate System vs DOM Tree

Frontend DOM-tree layout is based on the CSS box model and Flex layout, and most frontend layouts build on these; there are also classic patterns such as the Holy Grail and double-wing layouts. On the two-dimensional plane, with the power of CSS, frontend layout can be almost anything. In three-dimensional space, however, we spend most of our time dealing with coordinate systems and conversions between them.

Figure 8 shows the world coordinate system. The first threshold for 3D development is dealing with various coordinate systems, such as the coordinate system of the three-dimensional object itself (generally called the local coordinate system). A three-dimensional space will have multiple three-dimensional objects. How to place these three-dimensional objects requires a three-dimensional world coordinate system to locate them. In addition, three-dimensional objects in three-dimensional space are usually stationary, and their movement, rotation, etc. are all achieved by controlling the movement of the camera (of course, the camera is also a special three-dimensional object). Moreover, our screen is two-dimensional. The camera, as an 'eye', projects three-dimensional objects onto the two-dimensional screen, which involves plane coordinate systems, homogeneous coordinate systems, etc. Therefore, how to clarify the concepts of these coordinate systems and the mutual conversion between coordinate systems is the first threshold for 3D development. Once you understand these, you will be able to do things 'with ease' in future development.
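The local-to-world conversion mentioned above can be illustrated with a minimal sketch (plain arrays, no engine dependency; the parameter names are assumptions). A vertex in an object's local coordinate system is rotated about the Y axis and then translated by the object's world position, which is the simplest form of the model (local-to-world) transform:

```javascript
// Transform a local-space vertex into world space: rotate about Y, then
// translate by the object's world position.
function localToWorld([x, y, z], { rotationY = 0, position = [0, 0, 0] }) {
  const c = Math.cos(rotationY), s = Math.sin(rotationY);
  return [
    c * x + s * z + position[0],
    y + position[1],
    -s * x + c * z + position[2],
  ];
}

// A local point one unit along +X, on an object rotated 90 degrees
// and placed at (10, 0, 5) in the world:
const world = localToWorld([1, 0, 0], {
  rotationY: Math.PI / 2,
  position: [10, 0, 5],
});
console.log(world.map(n => Math.round(n * 1e6) / 1e6)); // [ 10, 0, 4 ]
```

The further steps the text mentions, projecting through the camera onto the two-dimensional screen, are just more of the same: additional matrix transforms (view and projection) applied in sequence.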

3.2 Facing Asynchronous Hooks Events

When dealing with 3D behavior interaction experience, there is another very obvious difference from traditional frontend, which is that there are much more asynchronous details faced. In DOM-level frontend development, the asynchronous events we come into contact with are mainly concentrated in clicks/touches, scrolling, and Ajax asynchronous requests. However, in 3D interaction, in addition to these, we also frequently encounter various asynchronous behaviors such as zooming in and out, dragging and displacement, and mode switching.

In our underlying rendering engine, we maintain a fairly complete set of asynchronous hook events to handle the many interaction scenarios. For example, the effect shown in Figure 9 (left) is the common point-to-point walking movement in a VR property panorama. The entire process triggers nine asynchronous event callbacks, listed in Figure 9 (right). These callbacks expose every detail of the transition, letting engineers control the experience precisely. Fine-grained control at this level is rare in ordinary frontend work and takes some getting used to at first.
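The hook-event style can be sketched as follows (the hook names here are illustrative; the real engine exposes nine callbacks for a single walk-to-point transition, not the three shown):

```javascript
// A tiny event emitter for movement lifecycle hooks.
class MoveHooks {
  constructor() { this.handlers = {}; }
  on(event, fn) { (this.handlers[event] ??= []).push(fn); return this; }
  emit(event, payload) {
    for (const fn of this.handlers[event] || []) fn(payload);
  }
}

// An async walk between two panorama points, firing hooks along the way.
async function walkToPoint(hooks, from, to) {
  hooks.emit('moveStart', { from, to });
  // Stand-in for the camera animation and panorama texture loading.
  await new Promise(resolve => setTimeout(resolve, 10));
  hooks.emit('panoLoaded', { point: to });
  hooks.emit('moveEnd', { at: to });
}

const events = [];
const hooks = new MoveHooks()
  .on('moveStart', e => events.push(`start:${e.from}->${e.to}`))
  .on('panoLoaded', e => events.push(`loaded:${e.point}`))
  .on('moveEnd', e => events.push(`end:${e.at}`));

walkToPoint(hooks, 2, 5).then(() => console.log(events));
// [ 'start:2->5', 'loaded:5', 'end:5' ]
```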

3.3 Collision Detection

The last relatively obvious difference is collision detection in three-dimensional space.

As Figure 10 shows, placing a new object in three-dimensional space may cause occlusion and overlap, which in practice we try to avoid. The conventional approach to collision detection is to wrap each object in a regular three-dimensional bounding volume and test the volumes for overlap; another approach is to cast a ray, find its intersection points with the two objects, and analyze whether they overlap.

Collision detection usually picks the method appropriate to the scenario. For moving objects, we sometimes also need physics-engine support in the modeling system.
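The bounding-volume idea reduces, in its simplest form, to axis-aligned bounding box (AABB) overlap tests. Here is a minimal sketch (the furniture data is made up for illustration): two boxes overlap only if their intervals intersect on all three axes:

```javascript
// Boxes are { min: [x, y, z], max: [x, y, z] }.
function aabbOverlap(a, b) {
  // Overlap requires interval intersection on every axis.
  return [0, 1, 2].every(
    axis => a.min[axis] <= b.max[axis] && b.min[axis] <= a.max[axis]
  );
}

const sofa  = { min: [0, 0, 0],     max: [2, 1, 1] };
const table = { min: [1.5, 0, 0.5], max: [3, 1, 2] }; // intrudes into the sofa
const lamp  = { min: [5, 0, 5],     max: [5.5, 2, 5.5] };

console.log(aabbOverlap(sofa, table)); // true  -> reposition before placing
console.log(aabbOverlap(sofa, lamp));  // false -> safe to place
```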

4 Exploration and Practice of New Business Scenarios

The above involves technical aspects. Below, I will share some explorations and practices that LuShi has made in business.

4.1 Three-dimensional Space Analysis and Calculation and Secondary Processing

The three-dimensional model comes from a real property (captured with professional equipment and reconstructed by algorithms). We can analyze the model and identify the furniture objects inside it, as shown in Figure 11. Once these objects are identified, we can do some interesting things. If we detect a monitor or TV, we can play advertisements or programs on it to create a more lifelike 3D scene, as in Figure 12 (left); if we detect a smooth floor, we can place a robot vacuum or a 3D treasure chest for marketing activities, as in Figure 12 (right).

Beyond object recognition in space, the floor plan is another key direction for secondary processing. For example, we can strip all furniture and decoration out of a second-hand property to obtain an extremely 'clean' white model; re-plan the original floor structure to turn a two-bedroom, one-living-room property into a three-bedroom, one-living-room one; and then re-apply a decoration style and place furniture. The whole process, shown in Figure 13 (left), goes from a real, cluttered property to a simple white model and on to a fully redecorated home, giving potential buyers an advance look at the property's remodeling potential.

In addition, we have also made some breakthroughs in technical experience. We have achieved the interactive experience of real-time switching and side-by-side comparison of real properties and design properties on the terminal side, and the final effect is shown in Figure 13 (right).

4.2 VR Real-time Viewing: Same-screen Connection, Efficient House Viewing

Another business exploration is online VR real-time viewing. First, why explore this direction at all? Anyone who has bought or rented a house knows that in most cases the broker drives you to the site. You can only see a handful of properties in a day, and you may have to climb stairs, wait at traffic lights, or stand in the sun.

Although VR properties faithfully restore the real scene, three-dimensional interaction is still relatively complicated and requires users to explore for details themselves. Figure 14 (left) shows the classic information-flow layout: search → navigation → recommendation → filtering → list. This is the most efficient way to present information in two dimensions, and almost every data-driven App in China uses it (JD.com for e-commerce, Meituan for dining, Beike for real estate, etc.). Three-dimensional interaction has no equally clear structure: a panorama only shows the current position, most users do not know about panorama roaming, and information such as the surrounding community, schools, and hospitals cannot be conveyed clearly inside the VR 3D model. We therefore shifted from users wandering the VR 3D model aimlessly in search of information to brokers accompanying them with synchronized views and real-time voice explanation.

In addition to end-to-end VR viewing, we have also implemented the business capability of VR real-time voice viewing between terminal Apps (iOS/Android) and WeChat Mini Programs. The entire link channel is shown in Figure 15.

Online VR real-time viewing first went live at the end of 2018. With the COVID-19 pandemic in 2020 confining large numbers of potential buyers and brokers to their homes, it has since become a core scenario of the viewing business.

4.3 VR Smart House Introduction: Intelligent Commentary, Immersive Experience

As mentioned earlier, VR real-time viewing lets users understand a property in the company of a professional broker, solving the inefficiency of information acquisition in VR 3D house viewing. But this scenario has its own shortcomings:

Human resource cost: Brokers may not be able to respond in time, such as during rest hours at night.

Professional level: Brokers cannot be expected to know every property well, and dialects can reduce communication efficiency.

Customer 'social phobia': Not everyone is willing to communicate with strangers, etc.

Given this, we want to make VR 3D interaction smarter. How? First, we cannot rely entirely on live brokers. We collect the likeness and voice timbre of real brokers, then use video stitching and text-to-speech (TTS) services to build a virtual broker, and bring this virtual broker to the user's screen.

With a virtual broker in place, what content should be explained? In VR real-time viewing, the voice comes from a person and the frame data driving the picture also comes from a person. Here, the script must instead be synthesized algorithmically, generating the corresponding audio and sequence-frame data. The overall architecture is shown in Figure 18. The frontend's job is to define the sequence-frame data format specification for on-screen behavior; the AI team's script service and NLG service compute the LRC-style text script and the behavior sequence; the main control service then produces the virtual broker's video with the script audio and hands the behavior sequence-frame data to the frontend to 'translate' into interaction.
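A minimal sketch of what such a behavior sequence and its frontend 'translation' might look like (the field names and timings are assumptions for illustration, not the real format specification). Each frame carries a timestamp plus the UI state to restore, and playback selects the latest frame at or before the current audio time:

```javascript
// A behavior sequence: timestamped frames of UI state, sorted by time.
const behaviorSequence = [
  { t: 0,    mode: 'model',    pointId: null }, // overview of the layout
  { t: 3200, mode: 'panorama', pointId: 1 },    // walk into the living room
  { t: 9600, mode: 'panorama', pointId: 4 },    // then the master bedroom
];

// Return the frame that applies at a given playback time (in ms).
function frameAt(sequence, timeMs) {
  let current = sequence[0];
  for (const frame of sequence) {
    if (frame.t <= timeMs) current = frame;
    else break; // frames are sorted by timestamp
  }
  return current;
}

// While the TTS audio plays, the player asks which frame applies now:
console.log(frameAt(behaviorSequence, 5000));
// { t: 3200, mode: 'panorama', pointId: 1 }
```

Applying the returned frame to the Proxy-wrapped frame data described in the architecture section would then drive the corresponding UI state automatically.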

Too many pieces are involved to cover in detail here. You can scan the QR code in Figure 19, or visit the two-bedroom, one-living-room unit in Zhujiang Roma Jiayuan East District, to experience it. In short, thanks to WebSocket's full-duplex capability and the frontend's sequence-frame data abstraction, the overall VR experience has become markedly more intelligent.

5 Performance Challenges Faced and Coping Strategies

During the past three years of VR viewing and derived business R&D, we have mainly faced two performance bottlenecks: loading time and memory overflow.

5.1 Loading Time

Before August 2019, the average first-screen loading time of Beike LuShi VR was 7.6s. By July 2021 it had dropped to 1.92s; under normal network conditions, users barely wait before experiencing VR. What did we do to achieve such a large improvement? First we had to analyze why it was slow, then prescribe the right remedy. First-screen performance is not improved overnight: we set up a virtual task force for performance experience and kept at it for nearly a year to reach the final 1.92s.

Where is the problem? It is mainly in three aspects:

Dense HTTP requests: As mentioned earlier, VR 3D models depend on large numbers of UV texture maps and panoramic images, plus maps, house-introduction audio and video, and other resources. Browsers limit concurrent requests to the same CDN domain to roughly 3-6 (the exact number varies by browser), so a large share of network requests could only queue.

Real-time computation: The frontend performs heavy real-time computation, such as 3D model file decompression, floor-plan data parsing, three-dimensional spatial analysis, and collision detection. Because JavaScript is single-threaded, these computations also block core logic.

Unreasonable module rendering loading strategy: Due to insufficient consideration in the early stage of VR development, our asynchronous rendering loading strategy design is unreasonable, and the priority strategy division is chaotic.

With the causes identified, the optimization strategy was clear. For dense HTTP requests, we added more CDN domain names, kept concurrent requests within five, and added HTTP/2 support. For time-consuming real-time computation, we leaned heavily on caching (offline precomputation caches, browser caches, server-side computation caches, etc.). We also redesigned the module rendering and loading strategy: each module is assigned a weight and loaded in weight order, and some non-core interactions are loaded and rendered only when the user triggers them. Given the heavy historical baggage, the whole effort took nearly a year and brought first-screen loading from 7.6s down to 2.55s, as shown in Figure 20 (left).
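The CDN domain-name addition can be sketched as domain sharding (the hostnames below are made up). Each resource path is hashed to one of several CDN hostnames, so the browser's per-origin connection limit applies per shard rather than to all resources at once; a stable hash keeps a given path on the same domain, which preserves browser caching:

```javascript
// Illustrative shard list; real deployments would use production CDN hosts.
const CDN_DOMAINS = [
  'cdn1.example.com',
  'cdn2.example.com',
  'cdn3.example.com',
];

// Deterministically map a resource path to one CDN domain.
function shardUrl(path) {
  let hash = 0;
  for (const ch of path) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  const domain = CDN_DOMAINS[hash % CDN_DOMAINS.length];
  return `https://${domain}${path}`;
}

// The same texture always resolves to the same shard:
console.log(shardUrl('/pano/face_front.jpg') === shardUrl('/pano/face_front.jpg')); // true
```

Note that with HTTP/2, which multiplexes many streams over one connection, sharding matters less than it did under HTTP/1.1, which is consistent with the text's pairing of the two measures.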

Beyond the optimizations above, we also tapped client-side capabilities. The first is client-side HTTP request interception, proxying, and caching. The WebView cache pool's ceiling is quite low, while the client's cache pool is much larger; moreover, by our measurements, native client HTTP requests are considerably more efficient than WebView HTTP requests. After adding HTTP request proxying and caching, total loading time dropped by nearly 500ms.

Another core capability is to add client-side first-screen rendering: that is, the client pre-loads the first-screen content before entering the VR page, displays the client content during the loading phase, and then switches to the frontend rendering effect after the frontend completes the first-screen rendering. The whole process is seamless, and users may not even perceive the loading process. The final effect is shown in Figure 20 (right).

5.2 Memory Overflow

The loading time has now achieved relatively good results. The biggest bottleneck we currently face is memory overflow.

In the first-screen optimization mentioned earlier, we spent a lot of time improving the module loading and rendering strategy. Therefore, during VR interaction, as various modules continue to complete rendering, memory usage gradually increases, as shown in Figure 21 (left). The pie chart in Figure 21 (right) also lists the memory usage of different modules. At present, the threshold for WebView memory crash on iOS devices is about 1.5G, while the threshold for Android devices varies from device to device. High-end Android devices are generally much higher than iOS devices, but low-end devices have a threshold far below 1.5G memory.

We take two directions to avoid memory overflow problems:

Increase the memory pool: Currently, we have tested various WebView controls on iOS/Android devices. Except for implementing WebView independent processes, we have not found a way to break through the WebView memory limit.

Reduce memory usage: We made some progress, such as on-demand rendering and destroying modules outside the visible area, but these only lower the crash rate; the effect is not dramatic.

As business continues to iterate, VR capabilities become richer and richer, and memory usage is still increasing. Relying on the technical stack of WebView+WebGL+jsBridge to implement VR experience currently has very obvious limitations. Although the pure native technical stack has been put on the agenda, it is still difficult to implement in the short term. In order to weaken the impact of memory overflow, we currently adopt a strategy of dynamically degrading according to the user's usage scenario to provide users with the most appropriate interactive experience.

Performance optimization is essentially progressive enhancement plus graceful degradation: attend to every detail, do each part well, and overall performance generally improves. We systematically analyzed the factors behind our performance bottlenecks, as shown in Figure 22. In truth, we cannot yet make a breakthrough that completely solves the memory problem; we can only degrade gracefully to protect the experience.

How can 'progressive enhancement' and 'graceful degradation' be made smarter? First, the frontend must support 'hot swapping' of modules, that is, dynamically destroying one module to free memory for others. Second, we maintain a data warehouse of memory bottlenecks. Relying on WebSocket's full-duplex capability, the VR client reports the user's device information and VR behavior, the backend analyzes the device's maximum capacity in real time and pushes the result to the frontend, and the frontend then dynamically loads or unloads modules to enhance or degrade the experience accordingly.
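A minimal sketch of module 'hot swap' under a memory budget (the module names, costs, and priorities are illustrative, not production values). Each module reports an estimated memory cost; when the budget pushed down over WebSocket shrinks, the registry unloads the lowest-priority modules first:

```javascript
// Track loaded modules against a memory budget; evict low-priority ones.
class ModuleRegistry {
  constructor(budgetMB) {
    this.budgetMB = budgetMB;
    this.loaded = []; // { name, costMB, priority }; higher priority = keep longer
  }
  usage() { return this.loaded.reduce((sum, m) => sum + m.costMB, 0); }
  load(mod) {
    this.loaded.push(mod);
    this.enforceBudget();
  }
  setBudget(budgetMB) { // e.g. a new limit pushed from the backend
    this.budgetMB = budgetMB;
    this.enforceBudget();
  }
  enforceBudget() {
    this.loaded.sort((a, b) => b.priority - a.priority);
    while (this.usage() > this.budgetMB && this.loaded.length > 1) {
      const evicted = this.loaded.pop(); // drop the least important module
      console.log(`unloading ${evicted.name}`);
    }
  }
}

const registry = new ModuleRegistry(300);
registry.load({ name: 'model',    costMB: 120, priority: 10 });
registry.load({ name: 'panorama', costMB: 100, priority: 9 });
registry.load({ name: 'tv-video', costMB: 90,  priority: 2 }); // pushes usage to 310
// prints "unloading tv-video"; usage returns to 220
```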

6 Summary

The theme of this speech is 'New Trends in Mobile Technology'. The above tells everyone how the frontend team of Beike LuShi implements architecture design in the Web field based on VR and 3D technology, and shares some business exploration and practice in this field. Finally, from a technical perspective, I will make the following four summaries:

Fun: Development in the three-dimensional field is much more interesting than traditional DOM-based frontend development. For example, the product team has said that secondary processing and decoration design in three-dimensional space is a higher-level 'LEGO'-style game. Welcome everyone to join this field.

Sequence-frame abstraction and data-driven design: Traditional frontend interaction is user-triggered, but in the 3D direction the interaction model also needs to play back automatically to improve how users obtain information. Sequence-frame abstraction in the frontend data layer, with support for data-driven updates, serialization, and deserialization, will be indispensable.

'Hot swap': Memory usage in 3D field development is much higher than that of traditional frontend pages, especially under the limitations of terminal device WebView containers. The encapsulation of modules, components, and plugins all need to support 'hot swap' to achieve dynamic enhancement of experience or downgrade effects.

WebSocket: We have gradually moved away from client-initiated Ajax polling. Both real-time data delivery and intelligence depend on WebSocket's full-duplex capability, and WebSocket services have become core infrastructure for us.

Tags: Performance Optimization · Frontend Development · WebSocket · 3D Modeling · WebGL · Architecture Design · VR Technology
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
