Why HTML is still more “Metaverse” than 3D

Benjaminsfydh
13 min read · Dec 14, 2020
Disclaimer: I am a co-founder and developer of Unbound, a creator-friendly, collaborative 3D technology and app.

Many times over the past few years, I’ve tried to articulate the promise of content creation in the medium of real-time 3D and its potential to go beyond the production of state-of-the-art video games. That’s not an easy feat, as there are many specialized applications, formats, and content creation approaches in this medium. When speaking with tech investors, I noticed that the message of “easy 3D” is more than a little burned out. There have been so many pitches, attempts, and products selling the idea of “easy 3D” over the past 20+ years, none of which have really stuck.

Google made a big push, but sadly have recently given up on their Poly platform. Microsoft even dubbed a major Windows 10 update “3D for Everybody”, but we haven’t heard anything since. Going back further, people have previously taken the Web (e.g. VRML) as inspiration for the potential scale of a 3D medium, but simply expressing 3D as XML might have been a bit too literal of an analogy.

Given these and other large-scale 3D content creation efforts run by experienced teams over decades, it’s hard to believe that we haven’t reached the breakthrough needed to unlock the 3D medium’s full potential. A potential often expressed as the “Metaverse” or a future iteration of the internet. In part, the Metaverse describes a shared 3D reality that empowers every possible creator, business, student, etc. to express, build, share, and collaborate on a massive scale in 3D. I can’t help but think that the limitations to unlocking this potential are somewhere in the structure of the building blocks of the 3D medium itself. How to define these limitations well, and how to measure the success of a more scalable design, is perhaps not immediately obvious.

This led me to draw an analogy to the Web and the 1% Rule of the Internet. The 1% Rule of the Internet basically states that 1% of all people online create all original content, 9% of all users are contributors or remixers, and the remaining 90% are viewers or consumers of content. While one can question the validity of the 1% rule in many ways, it is regardless a very interesting observation that might very well represent the expected distribution from creators to consumers when not limited by technology or the ability to share. While 1% seems like a small number, compared to the pre-Internet world of media, this shift is what unlocked the new world of content creators that we are now so accustomed to.

Assuming the 1% rule is a valid measure of success, let’s bring this “measure” to the medium of real-time 3D and look at it from the perspective of game development, a major driving force of 3D technology. It was recently reported that there were roughly 2.7B gamers on the planet. Based on recent numbers I’ve seen in the press, adding Unity creators, Unreal creators, Roblox creators, and even adding all people working in the entire games industry, we don’t come close to 1%. Whether that is one order of magnitude away from the (arguably more mature) 2D world of the Web or two orders of magnitude if you count remixed content, that says something significant.
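To make the order-of-magnitude argument concrete, here is a rough back-of-envelope calculation. The gamer count is the reported figure cited above; the creator total is a loose assumption chosen for illustration, not a sourced number.

```python
# Back-of-envelope check of the 1% rule against real-time 3D.
# The gamer count is the reported figure; the creator total is an assumption.
gamers = 2_700_000_000        # ~2.7B gamers, as reported
assumed_creators = 3_000_000  # hypothetical: Unity + Unreal + Roblox + industry

creator_share = assumed_creators / gamers
print(f"creator share of all gamers: {creator_share:.3%}")          # 0.111%
print(f"shortfall vs the 1% creator tier:  {round(0.01 / creator_share)}x")   # 9x
print(f"shortfall vs the 10% remixer tier: {round(0.10 / creator_share)}x")   # 90x
```

Even with a generous creator count, the share lands roughly one order of magnitude below the 1% creator tier, and two below the 10% remixer tier.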

In order to achieve this desirable creator-consumer distribution, I propose to look at the 3D medium under the following three aspects: (1) ease of creation; (2) ease of remixing; and (3) ease of sharing. Bear in mind that I am driving towards an idea of unlocking, as opposed to an economic demand model of specific content in various media. Let me elaborate on these three points.

1. Naturally Forgiving

A medium is forgiving if an arbitrary user input (such as banging on the keyboard) more often than not yields valid results. In the real world, anything you create never breaks the physics of the universe. The shadow cast by a sculpture, for example, won’t ever appear glitchy or broken, no matter what the sculptor does. As a digital example, compare 3D voxel modeling with surface modeling (e.g. triangle meshes). Player inputs in Minecraft have a similar quality: not even a 3-year-old can break the medium. This removes almost all of the “how” and immediately lets the creator enjoy the “what” and the “why”. Forgivingness of a medium not only makes it easier for human creators, but also for machines, as in the case of AI and procedural generation of content. For the algorithm author, this removes the burden of solving for validity constraints within the medium and enables more exploration.

Triangle mesh surface modeling, on the other hand, does not have this inherent forgivingness. The basic building blocks of mesh modeling, when placed casually, don’t always result in a closed surface (the probability of self-intersections, non-manifold geometry, triangle soup, etc. is extremely high). For a rendering engine this often has immediate implications, such as breaking lighting, shadowing, or the physics simulation. This lack of forgivingness may contribute to why a medium remains difficult, no matter how excellent its user interface design.
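The contrast can be sketched in a few lines. Any sequence of random voxel writes leaves a valid volume, while three casually placed mesh vertices can already violate the most basic triangle constraint (the degeneracy check below is a toy stand-in for the full set of mesh validity constraints):

```python
import random

# A voxel world is just a dense grid of cells; *any* write is a valid scene.
# "Banging on the keyboard" (thousands of random edits) can never produce an
# invalid state, which is the forgivingness argument above.
N = 16
voxels = [[[0] * N for _ in range(N)] for _ in range(N)]

random.seed(42)
for _ in range(10_000):
    x, y, z = (random.randrange(N) for _ in range(3))
    voxels[x][y][z] ^= 1  # toggle solid/empty: still a well-defined volume

# Contrast: three random points only form a usable mesh triangle if they are
# not degenerate (collinear or coincident), and that is just one of many
# constraints (manifoldness, no self-intersection) a mesh must also satisfy.
def is_degenerate(a, b, c, eps=1e-9):
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - a[0], c[1] - a[1]
    return abs(ux * vy - uy * vx) < eps  # zero-area triangle

print(is_degenerate((0, 0), (1, 0), (2, 0)))  # True: collinear points
```

The voxel grid cannot be put into an invalid state by any input; the mesh building block can, which is exactly the asymmetry described above.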

2. Non-Destructive

Non-destructive is a term often used in the Photoshop community. There, it describes the option to create an image by applying changes in layers with the intention of retaining these steps when saving and sharing the file. The destructive approach would be to apply effects or pixel operations that overwrite previous pixel values and drop the creator’s steps once the app is closed and the in-app history buffer is gone.

In any medium, this distinction is instrumental to allowing better collaboration and remixing of content. Destructive approaches (often called “baking steps”) are often side effects of delivery optimizations for viewing-only situations.
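A minimal sketch of the distinction (a hypothetical document model, not any real app’s API): a non-destructive document records each step as data, so the full history survives a save, while “baking” flattens everything and discards it.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A document that records edits as operations instead of baking them."""
    ops: list = field(default_factory=list)  # the retained creator steps

    def apply(self, name, **params):
        self.ops.append((name, params))      # record the step, don't bake it
        return self

    def bake(self):
        """The destructive path: flatten to final output, discarding history."""
        return [f"{name}({params})" for name, params in self.ops]

doc = Document()
doc.apply("fill", color="red").apply("blur", radius=4)
print(len(doc.ops))  # 2: both steps would survive a save/load round trip
```

Saving `doc.ops` keeps the creation remixable; saving only the baked result is the delivery optimization described above, useful for viewing but fatal for collaboration.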

3. Highly Compact

Without a doubt, network bandwidth is increasing rapidly and access to this bandwidth across the globe is increasing too. Storage is also one of the lesser issues in the digital space. However, there are thresholds that eventually inhibit sharing. Even with a 1Gb/s fiber connection in your home, access to 3D content is more often minutes or hours away versus seconds from the point of inspiration. Downloading a reasonably modern game takes anywhere from 30 minutes to multiple hours. This is fine for what it is, but not even remotely close to the speed of clicking a link to access a webpage. Compactness of content remains important. In a way, it still segments usage into what I want to call different latency categories and differentiates how we share content and at what velocity. It also contributes to how collaborative a certain digital medium can be.

The 2D Web

Taking into consideration these three criteria (forgivingness, non-destructiveness, and compactness), I personally came to the conclusion that perhaps the most mature digital format is HTML (including CSS). Quite apart from being an open, non-proprietary standard, HTML excels when measured against all three. They form a foundational catalyst for the ubiquity and scale HTML has achieved throughout its evolution. Ultimately, these very qualities of its building blocks may have been the key to enabling the kinds of participation that the “1% Rule of the Internet” creator/consumer distribution reflects.

Certainly there were many hiccups throughout the development history of the web browser. Although we might see small layout issues here and there, we almost never see a web page where the HTML code breaks the rendering engine such that there are severe issues like pixel errors, flickers, or other types of glitches. Of course, boxes inside boxes inside boxes aren’t as simple as setting the color of a pixel, but this structure provides naturally forgiving and flexible building blocks for 2D layouts. While it obviously doesn’t prevent you from making something visually displeasing, you cannot break this system like you can triangles connected to triangles connected to triangles.

When you open a web page in your web browser, you have access to all of the page’s code: all its building blocks, idioms, structure, and styling. It’s as if all YouTube videos were available as Adobe After Effects project files. Despite a long history of server-side procedural generation of HTML pages, the Web is inherently a non-destructive medium, with remixing and reuse being a day-to-day tool of every author. Had there been a “baking step” between the content description and what gets delivered to the client, I’d argue that the Web wouldn’t be as huge as it is today. You can also see the benefits of this flexibility in applications like Visual Studio Code. Built on HTML/JS, Visual Studio Code has surpassed the amount and variety of community-created add-ons of its 23-year-old sibling Visual Studio in only 4 years. The medium has reached a level of maturity that now allows state-of-the-art tools such as Squarespace, Figma, Canva, etc. to exist within itself.

And it is compact enough to unlock instantaneous sharing, viewing, and collaboration.

Real-time 3D

To start, I want to acknowledge that a direct comparison of any 2D digital medium with real-time 3D would be unfair, simply because the latter requires a much greater level of computational power. I also want to acknowledge that GPU development has only recently departed from predominantly fixed circuitry towards widely available general-purpose, massively parallel compute power. That shift has already given birth to new possibilities in scientific computation and artificial intelligence. It also provides an opportunity to re-imagine 3D rendering in a more content-creation-friendly way. Independent game developers and established companies alike are beginning to explore this new horizon.

3D Scanning

One approach to making 3D content creation more accessible is 3D scanning. The goal of 3D scanning is to make creating 3D content as simple as taking a photo. One plus of 3D scanning is that creators don’t have to learn complex 3D modeling methodologies. However, for use in scalable, real-time scenes, a creator would still need to resort to the tools, tasks, and skills of an experienced technical artist, and spend time doing laborious post-processing.

Enter Unreal Engine 5. On the very high end of 3D fidelity, one of the most incredible prospects of Unreal Engine 5 is that its rendering tech can handle movie-quality 3D scans and high-density meshes created in professional tools as-is. Not only does this make a giant leap in visual realism, but it also cuts out the very laborious, previously necessary technical tasks and dramatically simplifies the production of high-end content.

Unreal Engine 5 — Tech Demo

This technical development relies heavily on general-purpose compute power and marks a new era of leading rendering tech no longer exclusively relying on the fixed-function GPU hardware of the past 20 years. While only a handful of people can truly appreciate just how major and how complex realizing this piece of technology was, it’s evident that it was created with the creator in mind. And this achievement, of reducing barriers for 3D creators at all levels, is in my opinion still underrated and undervalued by the games industry. It will be very interesting to observe whether it will also grow to support more casual remixability (non-destructiveness) and instantaneous sharing (compactness) in real-time 3D, at rates like the Web.

With regards to instantaneous sharing, Cloud Streaming shows some promise for delivering 3D content, be it megabytes or hundreds of gigabytes, as easily as clicking a link to a Web page. My compactness argument is challenged by cloud streaming in that all assets and rendering remain in the data center and any device capable of streaming video can participate. Files are not exchanged. There are also hybrid streaming approaches such as Mutate, which builds upon the forgivingness of editing voxel data. Though instead of simply streaming a remote rendering, Mutate progressively streams parts of a giant voxel dataset to the client for rendering, which I consider an elegant solution.

Will the availability of reliable high-speed internet, bandwidth caps, or input latency be limiting factors to widespread adoption of this approach? We’ll see.

Signed Distance Fields

Over the past few years, Signed Distance Fields (SDFs) have risen in popularity due to the newly available compute power of modern GPUs. In short, SDFs have the benefit of volumetric modeling like voxels, but can express much more articulate surfaces at much lower resolutions. Lighting techniques that harness SDFs have been in production for years now (e.g. many Unreal Engine games), and have enabled more plausible lighting effects such as long-range ambient occlusion and soft shadowing.
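The defining property of a signed distance field is simple: evaluated at any point, it returns the distance to the nearest surface, negative inside, zero on the surface, positive outside. A sphere is the canonical example:

```python
import math

def sd_sphere(p, center, radius):
    """Signed distance from point p to a sphere: <0 inside, 0 on, >0 outside."""
    dx, dy, dz = (p[i] - center[i] for i in range(3))
    return math.sqrt(dx * dx + dy * dy + dz * dz) - radius

c = (0.0, 0.0, 0.0)
print(sd_sphere((2.0, 0.0, 0.0), c, 1.0))  # 1.0: one unit outside
print(sd_sphere((1.0, 0.0, 0.0), c, 1.0))  # 0.0: exactly on the surface
print(sd_sphere((0.0, 0.0, 0.0), c, 1.0))  # -1.0: one unit inside
```

Because the field is continuous, it carries surface information between grid samples, which is why SDFs can express articulate surfaces at far lower resolutions than plain voxels.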

For modeling, SDFs have gained traction via products like Medium By Adobe, a “best of both worlds” technology that keeps a discrete distance field volume (storing numbers on a 3D grid) as well as the resulting surface mesh around without the creator having to handle these technical differences. The creator benefits from this forgivingness. They can combine arbitrary parts with an existing model without having to worry about breaking topology. Additionally, it retains the ability to apply surface operations like one would do in a purely mesh based medium.

Sculpting in Medium By Adobe. Image Source: https://uploadvr.com/adobe-medium/

Relative to the two other criteria discussed, non-destructiveness and compactness, it’s important to note that Medium By Adobe (1) uses a destructive approach, meaning the creator’s steps cannot be retained when saving a creation to a file; and (2) produces volumetric data that takes hundreds of megabytes for a single model, limiting sharing and collaboration.

Signed Distance Functions

With deep roots in the Demoscene, graphics programmers have a long history of using Signed Distance Functions to achieve highly expressive real-time 3D scenes, and quickly. As opposed to storing large voxel volumes, Signed Distance Functions describe models and scenes by combining small mathematical functions, blending and carving them with sharp and smooth sculpting operations.
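The combining and blending described above can be shown in a few lines. The polynomial “smooth minimum”, a blend widely used in the demoscene, merges two distance functions with a soft fillet instead of a hard crease:

```python
import math

def smin(d1, d2, k):
    """Polynomial smooth minimum: blends two distance values over width k."""
    h = max(0.0, min(1.0, 0.5 + 0.5 * (d2 - d1) / k))
    return d2 * (1 - h) + d1 * h - k * h * (1 - h)

def sd_sphere(p, c, r):
    return math.dist(p, c) - r

def scene(p):
    # Two overlapping unit spheres, smoothly blended into one blob.
    # Carving instead of blending would use max(d1, -d2).
    return smin(sd_sphere(p, (-0.5, 0, 0), 1.0),
                sd_sphere(p, (0.5, 0, 0), 1.0), 0.3)

# At the midpoint between the centers the blend pulls the surface outward,
# so the combined field there is deeper inside than either sphere alone.
print(scene((0.0, 0.0, 0.0)))  # -0.575 (each sphere alone gives -0.5)
```

Note that the whole scene is just those few numbers (two centers, two radii, one blend width), which is the seed of the compactness argument below.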


Work like this by demoscener and Pixar alumnus Iñigo Quilez is written directly in code. Iñigo has a strong mathematical background, and this way of creating is certainly not accessible to every person, at least not quickly. However, because the technique is so powerful, efforts to make it accessible to everyone have been picking up pace recently.

Most notably, Media Molecule’s Dreams uses these techniques as its foundation. With Dreams, creators have access to this magic and can casually create on their PlayStation, even from their couch. There is no need for external software. The resulting visual quality ranges from high-end product shots to realistic nature scenes, rendered and modifiable in real time.

Standouts of this approach are that it is not only incredibly forgiving with respect to the building blocks for creating and sculpting models, but also non-destructive, allowing creators to change and remix creations, whether wholly or partially. There is also no need to store massive volumetric data when saving to disk or sharing over the network, because only the few parameters of the individual building blocks are stored. In a way, this has many similarities to the set of qualities that let the Web achieve the scale it has today.
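A rough size comparison makes the compactness point concrete. All the figures below are illustrative assumptions (primitive count, floats per primitive, volume resolution), not measurements of any particular product:

```python
import struct

# Parametric SDF scene: a handful of floats per building block.
blocks = 200                       # assumed primitives in a fairly rich scene
floats_per_block = 10              # type, position, size, blend, color, ...
parametric_bytes = blocks * floats_per_block * struct.calcsize("f")

# Dense voxel volume at comparable detail: every cell is stored.
resolution = 512
voxel_bytes = resolution ** 3 * 1  # one byte per cell, no compression

print(f"parametric scene: {parametric_bytes / 1024:.1f} KiB")      # ~7.8 KiB
print(f"voxel volume:     {voxel_bytes / 1024 ** 2:.0f} MiB")      # 128 MiB
print(f"ratio: ~{voxel_bytes // parametric_bytes:,}x")
```

A scene description measured in kilobytes can travel at link-click speed, which is the Web-like sharing velocity argued for above.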

With Unbound, though we originally took a Signed Distance Field approach like Medium By Adobe, we now use Signed Distance Functions as the building blocks for everything. You can see this explained in this short video:

Video Source: http://unbound.io/

Outside of this, if you want to try out modeling with distance functions in Unity, there is a plug-in for that called Clayxels, which is available here.

Signed Distance Functions have some limitations too, though nothing that should stop the fearless engineer. For one, they require a bit of extra computational power on the client device, as well as some highly complex technology that can chop the necessary calculations down to the minimum amount of work needed for rendering. They also present a slight paradigm shift in modeling, because they don’t immediately allow for the more traditional sculpting operations that experts are accustomed to in tools like ZBrush. The former limitation will ease as GPUs grow in power every year, including on mobile devices, and will eventually fade away. The latter limitation might be less of an issue for a new generation of creators. As with every new paradigm, it is vastly unexplored, and many artistic techniques are yet to be discovered. There is also the promise of this format becoming a shared language between human and computational creativity, but I’ll leave that to a separate article.
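The rendering-cost point can be illustrated with sphere tracing, the standard technique for rendering distance functions: step along a ray by exactly the distance the field reports (always a safe step, since no surface can be closer), until the surface is hit or the ray escapes. This toy version traces a single hard-coded sphere:

```python
import math

def sd_sphere(p, c, r):
    return math.dist(p, c) - r

def trace(origin, direction, max_steps=64, eps=1e-4, max_dist=100.0):
    """Sphere tracing: march a ray through the distance field."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(origin[i] + t * direction[i] for i in range(3))
        d = sd_sphere(p, (0.0, 0.0, 5.0), 1.0)  # the scene: one unit sphere
        if d < eps:
            return t   # hit: distance along the ray to the surface
        t += d         # safe step: nothing can be closer than d
        if t > max_dist:
            break
    return None        # miss

hit = trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(hit)  # 4.0: sphere centered at z=5 with radius 1
```

Every pixel of every frame runs a loop like this against the whole scene, which is why the technique leans on general-purpose GPU compute and on clever culling of unnecessary evaluations.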

We can’t deny that the dream of a platform that allows for massive, frictionless participation of people has manifested in the 2D digital world with HTML. I do believe we are very close to having a similar breakthrough in 3D, however, the choice of building blocks, in my opinion, is key. The building blocks of 3D worlds are vital to its ubiquity. My research and development has led me to bet on Signed Distance Functions.

The 3D building block choice and approach of Dreams, Clayxels, other similar technologies in the works, and what we have developed at Unbound is very much akin to the Web and HTML in terms of the three criteria I proposed. These three qualities, inherent forgivingness, remixability (non-destructiveness), and a tiny data footprint (compactness), may well be the enabling factors that crack open the medium. We’ll see widespread collaboration in 3D, maybe even a future iteration of the internet like the “Metaverse.” But as long as we haven’t achieved the ubiquity of the Web, there is work to do.

Let’s invest in this goal and work together.
