So you’ve decided to build a VR game in Unity and have settled on the Samsung Gear VR as your target platform. Getting it up and running on the device was easy enough, but there’s a problem—the frame rate is just too low. Your reticle snaps and sticks, there are flickering black bars on the sides of your vision, and motion looks like somebody just kicked the camera operator in the shins. You’ve read about how important it is to maintain a solid frame rate, and now you know why—in mobile VR, anything less than 60 frames per second doesn’t just look bad, it feels bad. Your high-end PC runs the game at about 1000 frames per second, but it sounds like a jet engine and actually levitates slightly when the fans really get going. What you need is a way to optimize your masterpiece to run on a mobile chipset. This series of articles is designed to help you do just that.
Part 1: The Gear VR Environment and Traits of Efficient VR Games
This isn’t an all-encompassing expose on performance optimization for the Gear VR–it’s more of a quick start. In this first post we’ll discuss the Gear VR hardware and traits of a well-designed mobile VR application. A follow-up post will cover performance improvement for apps you’ve already built. I’ve elected to base this article on the behavior (and quirks) of Unity, as it seems to be very popular amongst Gear VR developers. Still, the concepts presented here should apply to just about any game engine.
Know Your Hardware
Before you tear your project apart looking for inefficiencies, it’s worth thinking a little about the performance characteristics of mobile phones. Generally speaking, mobile graphics pipelines rely on a pretty fast CPU that is connected to a pretty fast GPU by a pretty slow bus and/or memory controller, and an OpenGL ES driver with a lot of overhead. The Gear VR runs on the Samsung Note 4 and the Samsung Galaxy S6. These two product lines actually represent a number of different hardware configurations:
- The Note 4 comes in two chipset flavors. Devices sold in North America and Europe are based on Qualcomm’s Snapdragon chipset (specifically, a Snapdragon 805), while those sold in South Korea and some other parts of Asia include Samsung’s Exynos chipset (the Exynos 5433). The Snapdragon is a quad-core CPU configuration, while the Exynos has eight cores. These devices sport two different GPUs: the Adreno 420 and Mali-T760, respectively.
- The Note 4 devices are further segmented by operating system. Most run Android 4.4.4 (KitKat) but Android 5 (Lollipop) is now available as an update on most carriers around the world. The Exynos-based Note 4 devices all run Android 5.
- The Galaxy S6 devices are all based on the same chipset: the Exynos 7420 (with a Mali-T760M8 GPU). There is actually a second version of the S6, the Galaxy S6 Edge, but internally it is the same as the S6.
- All Galaxy S6 devices ship with Android 5.
If this seems like a lot to manage, don’t worry: though the hardware varies from device to device, the performance profile of all these devices is pretty similar (with one serious exception—see “Gotchas” below). If you can make it fast on one device, it should run well on all the others.
As with most mobile chipsets, these devices have pretty reliable characteristics when it comes to 3D graphics performance. Here are the things that generally slow Gear VR projects down (in order of severity):
- Scenes requiring dependent renders (e.g., shadows and reflections) (CPU / GPU cost).
- Binding VBOs to issue draw calls (CPU / driver cost).
- Transparency, multi-pass shaders, per-pixel lighting, and other effects that fill a lot of pixels (GPU / IO cost).
- Large texture loads, blits, and other forms of memcpy (IO / memory controller cost).
- Skinned animation (CPU cost).
- Unity garbage collection overhead (CPU cost).
On the other hand, these devices have relatively large amounts of RAM and can push quite a lot of polygons. Note that the Note 4 and S6 are both 2560×1440 displays, though by default we render to two 1024×1024 textures to save fill rate.
Know the VR Environment
VR rendering throws hardware performance characteristics into sharp relief because every frame must be drawn twice, once for each eye. In Unity 4.6.4p3 and 5.0.1p1 (the latest releases at the time of this writing), that means that every draw call is issued twice, every mesh is drawn twice, and every texture is bound twice. There is also a small amount of overhead involved in putting the final output frame together with distortion and TimeWarp (budget for 2 ms). It is reasonable to expect optimizations that will improve performance in the future, but as of right now we’re stuck with drawing the whole frame twice. That means that some of the most expensive parts of the graphics pipeline cost twice as much time in VR as they would in a flat game.
With that in mind, here are some reasonable targets for Gear VR applications.
- 50 – 100 draw calls per frame
- 50k – 100k polygons per frame
- As few textures as possible (but they can be large)
- 1 ~ 3 ms spent in script execution (Unity Update())
Bear in mind that these are not hard limits; treat them as rules of thumb.
Also note that the Oculus Mobile SDK introduces an API for throttling the CPU and GPU to control heat and battery drain (see OVRModeParams.cs for sample usage). These methods allow you to choose whether the CPU or GPU is more important for your particular scene. For example, if you are bound on draw call submission, clocking the CPU up (and the GPU down) might improve overall frame rate. If you neglect to set these values, your application will be throttled down significantly, so take time to experiment with them.
Finally, Gear VR comes with Oculus’s Asynchronous TimeWarp technology. TimeWarp provides intermediate frames based on very recent head pose information when your game starts to slow down. It works by distorting the previous frame to match the more recent head pose, and while it will help you smooth out a few dropped frames now and then, it’s not an excuse to run at less than 60 frames per second all the time. If you see black flickering bars at the edges of your vision when you shake your head, that indicates that your game is running slowly enough that TimeWarp doesn’t have a recent enough frame to fill in the blanks.
Designing for Performance
The best way to produce a high-performance application is to design for it up-front. For Gear VR applications, that usually means designing your art assets around the characteristics of mobile GPUs.
Before you start, make sure that your Unity project settings are organized for maximum performance. Specifically, ensure that the following values are set:
- Static batching
- Dynamic batching
- GPU skinning
- Multithreaded Rendering
- Default Orientation to Landscape Left
Since we know that draw calls are usually the most expensive part of a Gear VR application, a fantastic first step is to design your art to require as few draw calls as possible. A draw call is a command to the GPU to draw a mesh or a part of a mesh. The expensive part of this operation is actually the selection of the mesh itself. Every time the game decides to draw a new mesh, that mesh must be processed by the driver before it can be submitted to the GPU. The shader must be bound, format conversions might take place, et cetera; the driver has CPU work to do every time a new mesh is selected. It is this selection process that incurs the most overhead when issuing a draw call.
However, that also means that once a mesh (or, more specifically, a vertex buffer object, orVBO) is selected, we can pay the selection cost once and draw it multiple times. As long as no new mesh (or shader, or texture) is selected, the state will be cached in the driver and subsequent draw calls will issue much more quickly. To leverage this behavior to improve performance, we can actually wrap multiple meshes up into a single large array of verts and draw them individually out of the same vertex buffer object. We pay the selection cost for the whole mesh once, then issue as many draw calls as we can from meshes contained within that object. This trick, called batching, is much faster than creating a unique VBO for each mesh, and is the basis for almost all of our draw call optimization.
All of the meshes contained within a single VBO must have the same material settings for batching to work properly: the same texture, the same shader, and the same shader parameters. To leverage batching in Unity, you actually need to go a step further: objects will only be batched properly if they have the same material object pointer. To that end, here are some rules of thumb:
- Macrotexture / Texture Atlases: Use as few textures as possible by mapping as many of your models as possible to a small number of large textures.
- Static Flag: Mark all objects that will never move as Static in the Unity Inspector.
- Material Access: Be careful when accessing Renderer.material. This will duplicate the material and give you back the copy, which will opt that object out of batching consideration (as its material pointer is now unique). Use Renderer.sharedMaterial.
- Ensure batching is turned on: Make sure Static Batching andDynamic Batching are both enabled in Player Settings (see below).
Unity provides two different methods to batch meshes together: static batching and dynamic batching.
When you mark a mesh as static, you are telling Unity that this object will never move, animate, or scale. Unity uses that information to automatically batch together meshes that share materials into a single, large mesh at build time. In some cases, this can be a significant optimization; in addition to grouping meshes together to reduce draw calls, Unity also burns transformations into the vertex positions of each mesh, so that they do not need to be transformed at runtime. The more parts of your scene that you can mark as static, the better. Just remember that this process requires meshes to have the same material in order to be batched.
Note that since static batching generates new conglomerate meshes at build time, it may increase the final size of your application binary. This usually isn’t a problem for Gear VR developers, but if your game has a lot of individual scenes, and each scene has a lot of static mesh, the cost can add up. Another option is to use StaticBatchingUtility.Combine at runtime to generate the batched mesh without bloating the size of your application (at the cost of a one-time significant CPU hit and some memory).
Finally, be careful to ensure that the version of Unity you are using supports static batching (see “Gotchas” below).
Unity can also batch meshes that are not marked as static as long as they conform to the shared material requirement. If you have the Dynamic Batching option turned on, this process is mostly automatic. There is some overhead to compute the meshes to be batched every frame, but it almost always yields a significant net win in terms of performance.
Other batching Issues
Note that there are a few other ways you can break batching. Drawing shadows and other multi-pass shaders requires a state switch and prevents objects from batching correctly. Multi-pass shaders can also cause the mesh to be submitted multiple times, and should be treated with caution on the Gear VR. Per-pixel lighting can have the same effect: using the default Diffuse shader in Unity 4, the mesh will be resubmitted for each light that touches it. This can quickly blow out your draw call and poly count limits. If you need per-pixel lighting, try setting the total number of simultaneous lights in the Quality Settings window to one. The closest light will be rendered per-pixel, and surrounding lights will be calculated using spherical harmonics. Even better, drop all pixel lights and rely on Light Probes. Also note that batching usually doesn’t work on skinned meshes. Transparent objects must be drawn in a certain order and therefore rarely batch well.
The good news is that you can actually test and tune batching in the editor. Both the Unity Profiler (Unity Pro only) and the Stats pane on the Game window can show you how many draw calls are being issued and how many are being saved by batching. If you organize your geometry around a very small number of textures, make sure that you do not instance your materials, and mark static objects with the Static Flag, you should be well on your way to a very efficient scene.
Transparency, Alpha Test, and Overdraw
As discussed above, mobile chipsets are often “fill-bound,” meaning that the cost of filling pixels can be the most expensive part of the frame. The key to reducing fill cost is to try to draw every pixel on the screen only once. Multi-pass shaders, per-pixel lighting effects (such as Unity’s default specular shader), and transparent objects all require multiple passes over the pixels that they touch. If you touch too many of these pixels, you can saturate the bus.
As a best practice, try to limit the Pixel Light Count in Quality Settings to one. If you use more than one per-pixel light, make sure you know which geometry it is being applied to and the cost of drawing that geometry multiple times. Similarly, strive to keep transparent objects small. The cost here is touching pixels, so the fewer pixels you touch, the faster your frame can complete. Watch out for transparent particle effects like smoke that may touch many more pixels than you expect with mostly-transparent quads.
Also note that you should never use alpha test shaders, such as Unity’s cutout shader, on a mobile device. The alpha test operation (as well as clip(), or an explicit discard in the fragment shader) forces some common mobile GPUs to opt out of certain hardware fill optimizations, making it extremely slow. Discarding fragments mid-pipe also tends to cause a lot of ugly aliasing, so stick opaque geometry or alpha-to-coverage for cutouts.
Before you can test the performance of your scene reliably, you need to ensure that your CPU and GPU throttling settings are set. Because VR games push mobile phones to their limit, you are required to select a weighting between the CPU and GPU. If your game is CPU-bound, you can downclock the GPU in order to run the CPU at full speed. If your app is GPU-bound you can do the reverse. And if you have a highly efficient app, you can downclock both and save your users a bunch of battery life to encourage longer play sessions. See “Power Management” in the Mobile SDK documentation for more information about CPU and GPU throttling.
The important point here is that you must select a CPU and GPU throttle setting before you do any sort of performance testing. If you fail to initialize these values, your app will run in a significantly downclocked environment by default. Since most Gear VR applications tend to be bound on CPU-side driver overhead (like draw call submission), it is common to set the clock settings to favor the CPU over the GPU. An example of how to initialize throttling targets can be found in OVRModeParams.cs, which you can copy and paste into a script that executes on game startup.
Here are some tricky things you should keep in the back of your mind while considering your performance profile.
- One particular device profile, specifically the Snapdragon-based Note 4 running Android 5, is slower than everything else; the graphics driver seems to contain a regression related to draw call submission. Games that are already draw call bound may find that this new overhead (which can be as much as a 20% increase in draw call time) is significant enough to cause regular pipeline stalls and drop the overall frame rate. We’re working hard with Samsung and Qualcomm to resolve this regression. Snapdragon-based Note 4 devices running Android 4.4, as well as Exynos-based Note 4 and S6 devices, are unaffected.
- Though throttling the CPU and GPU dramatically reduces the amount of heat generated by the phone, it is still possible for heavy-weight applications to cause the device to overheat during long play sessions. When this happens, the phone warns the user, then dynamically lowers the clock rate of its processors, which usually makes VR applications unusable. If you are working on performance testing and manage to overheat your device, let it sit without the game running for a good five minutes before testing again.
- Unity 4 Free does not support static batching or the Unity Profiler. However, Unity 5 Personal Edition does.
- The S6 does not support anisotropic texture filtering.
That’s all for now. In the next post, we’ll discuss how to go about debugging real-world performance problems.
For more information on optimizing your Unity mobile VR development, see “Best Practices: Mobile” in our Unity Integration guide.
Part 2: Solving Performance Problems
In my last post I discussed ways to build efficient Gear VR games and the traits of Gear VR devices. In this installment I’ll focus on ways to debug Unity applications that are not sufficiently performant on those devices.
Even if you’ve designed your scene well and set reasonable throttling values, you may find that your game does not run at a solid 60 frames per second on the Gear VR device. The next step is to decipher what’s going on using three tools: Unity’s internal profiler log, Unity’s Profiler , and the Oculus Remote Monitor.
The very first thing you should do when debugging Unity performance is to turn on theEnable Internal Profiler option in Player Settings. This will spit a number of important frame rate statistics to the logcat console every few seconds, and should give you a really good idea of where your frame time is going.
To illustrate the common steps to debugging performance, let’s look at some sample data from a fairly complicated scene in a real game running on a Note 4 Gear VR device:
Android Unity internal profiler stats: cpu-player> min: 8.8 max: 44.3 avg: 16.3 cpu-ogles-drv> min: 5.1 max: 6.0 avg: 5.6 cpu-present> min: 0.0 max: 0.3 avg: 0.1 frametime> min: 14.6 max: 49.8 avg: 22.0 draw-call #> min: 171 max: 177 avg: 174 | batched: 12 tris #> min: 153294 max: 153386 avg: 153326 | batched: 2362 verts #> min: 203346 max: 203530 avg: 203411 | batched: 3096 player-detail> physx: 0.1 animation: 0.1 culling 0.0 skinning: 0.0 batching: 0.4 render: 11.6 fixed-update-count: 1 .. 1 mono-scripts> update: 0.9 fixedUpdate: 0.0 coroutines: 0.0 mono-memory> used heap: 3043328 allocated heap: 3796992 max number of collections: 0 collection total duration: 0.0
To capture this sample you’ll need to connect to your device over Wi-Fi using ADB over TCPIP and run
adb logcat while the device is docked in the headset (for more information, see “Android Debugging” in the Mobile SDK documentation).
What the sample above tells us is that our average frame time is 22 ms, which is about 45 frames per second—way below our target of 60 frames per second. We can also see that this scene is heavy on the CPU—16.3 ms of our 22 ms total is spent on the CPU. We’re spending 5 ms in the driver (“cpu-ogles-drv”), which suggests that we’re sending the driver down some internal slow path. The probable culprit is pretty clear: at 174 draw calls per frame, we’re significantly over our target of 100. In addition, we’re pushing more polygons than we would like. This view of the scene doesn’t really tell us what’s happening on the GPU, but it tells us that we can explain our 45 ms frame rate just by looking at the CPU, and that reducing draw calls should be the focus of our attention.
This data also shows that we have regular spikes in the frametime (represented by max frametime of 49.8 ms). To understand where those are coming from, the next step is to connect the Unity Profiler and look at its output.
As expected, the graph shows a regular spike. During non-spike periods, our render time is similar to the values reported above, and there is no other significant contributor to our final frametime.
Wait for present
WaitForPresent (and its cousin, Overhead) appears to be some repeated cost that comes along and destroys our frame. In fact, it does not actually represent some mysterious work being performed. Rather, WaitForPresent records the amount of time that the render pipeline has stalled.
One way to think of the render pipeline is to imagine a train station. Trains leave at reliable times—every 16.6 ms. Let’s say the train only holds one person at a time, and there is a queue of folks waiting to catch a ride. As long as each person in the queue can make it to the platform and get on the next train before it leaves, you’ll be able to ship somebody out to their destination every 16 ms. But if even one guy moves too slowly—maybe he trips on his shoelaces—he’ll not only miss his train, he’ll be sitting there waiting on the platform for another 16 ms for the next train to come. Even though he might have only been 1 ms late for his train, missing it means that he has to sit there and wait a long time.
In a graphics pipeline, the train is the point at which the front buffer (currently displayed) and back buffer (containing the next frame to display) swap. This usually happens when the previous frame finishes its scanout to the display. Assuming that the GPU can execute all of the render commands buffered for that frame in a reasonable amount of time, there should be a buffer swap every 16 ms. To maintain a 60 frames-per-second refresh rate, the game must finish all of its work for the next frame within 16 ms. When the CPU takes too long to complete a frame, even if it’s only late by 1 ms, the swap period is missed, the scanout of the next frame begins using the previous frame’s data, and the CPU has to sit there and wait for the next swap to roll around. To use the parlance of our example above, the train is the swap and the frame commands issued by the CPU are the passengers.
WaitForPresent indicates that this sort of pipeline stall has occurred and records the amount of time the CPU idles while waiting on the GPU. Though less common, this can also happen if the CPU finishes its work very quickly and has to wait for the next swap.
In this particular example, it’s pretty clear that our frame rate is inconsistent enough that we cannot schedule our pipeline reliably. The way to fix WaitForPresent is thus to ignore it in the profiler and concentrate on optimizing everything else, which in this case means reducing the number of draw calls we have in the scene.
Other Profiler Information
The Unity profiler is very useful for digging into all sorts of other runtime information, including memory usage, draw calls, texture allocations, and audio overhead. For serious performance debugging, it’s a good idea to turn off Multithreaded Rendering in the Player Preferences. This will slow the renderer down a lot but also give you a clearer view of where your frame time is going. When you’re done with optimizations, remember to turn Multithreaded Rendering back on.
In addition to draw call batching, other common areas of performance overhead include overdraw (often caused by large transparent objects or poor occlusion culling), skinning and animation, physics overhead, and garbage collection (usually caused by memory leaks or other repeated allocations). Watch for these as you dig into the performance of your scene. Also remember that displaying the final VR output, which includes warping and TimeWarp overhead, costs about 2 ms every frame.
Oculus Remote Monitor
OVRMonitor is a new tool recently released with the Oculus Mobile SDK. It helps developers understand the way pipelining works and identify pipeline stalls. It can also stream low resolution unwarped video from a Gear VR device wirelessly, which is useful for usability testing.
OVRMonitor is currently in development, but this early version can still be used to visualize the graphics pipeline for Gear VR applications. Here’s a shot of a tool inspecting a game running the same scene discussed above:
The yellow bar represents the vertical sync interrupt that indicates that scan out for a frame has completed. The image at the top of the window is a capture of the rendered frame, and the left side of the image is aligned to the point in the frame where drawing began. The red bar in the middle of the image shows the TimeWarp thread, and you can see it running parallel to the actual game. The bottom blue are indicates the load on CPU and GPU, which are constant (i.e., in this case, all four CPUs are running).
This shot actually shows us one of the WaitForPresent spikes we saw above in the Unity Profiler. The frame in the middle of the timeline began too late to complete by the next vertical blank, and as a result the CPU blocked for a full frame (evidenced by the lack of screen shot in the subsequent frame and the 16.25 ms WarpSwapInternal thread time).
OVRMonitor is a good way to get a sense of what is happening in your graphics pipeline from frame to frame. It can be used with any Gear VR app built against the latest SDK. See the documentation in SdkRoot/Tools/OVRMonitor for details. More documentation and features are coming soon.
Tips and Tricks
Here are a few performance tricks we’ve used or heard about from other developers. These are not guaranteed solutions for all VR applications, but they might give you some ideas about potential solutions for your particular scene.
- Draw big, opaque meshes last. Try sorting your skybox into the Geometry+1 render queue so that it draws after all the other opaque geometry. Depending on your scene, this probably causes a lot of the pixels covered by the skybox to be discarded by depth test, thus saving you time. Ground planes and other static, opaque objects that touch a lot of pixels and are likely to be mostly occluded by other objects are candidates for this optimization.
- Dynamically change your CPU / GPU throttling settings. You can change your throttling settings at any time. If you are able to run most of your game at a low setting but have one or two particularly challenging scenes, consider cranking the CPU or the GPU up just during those scenes. You can also drop the speed of one or both processors in order to save battery life and reduce heat during scenes that are known to be simple. For example, why not set GPU to 0 during a scene load?
- Update render targets infrequently. If you have secondary render textures that you are drawing the scene to, try drawing them at a lower frequency than the main scene. For example, a stereoscopically rendered mirror might only refresh its reflection for only one eye each frame. This effectively lowers the frame rate of the mirror to 30 frames per second, but because there is new data every frame in one eye or the other, it looks okay for subtle movements.
- Lower the target eye resolution. By trading a bit of visual quality you can often improve your fill-bound game performance significantly by slightly lowering the size of the render target for each eye. OVRManager.virtualTextureScale is a value between 0.0 and 1.0 that controls the size of the render output. Dropping resolution slightly when running on older devices is often an easy way to support slower hardware.
- Compress your textures. All Gear VR devices support the ASTC compression format, which can significantly reduce the size of your textures. Note that as of this writing, Unity 4.6 expects ASTC compression to go hand-in-hand with OpenGL ES 3.0. If you use ASTC under GLES 2.0, your textures will be decompressed on load, which will probably lengthen your application’s start-up time significantly. ETC1 is a lower-quality but universally supported alternative for GLES 2.0 users.
- Use texture atlases. As described above, large textures that can be shared across a number of meshes will batch efficiently. Avoid lots of small individual textures.
For more information on optimizing your Unity mobile VR development, see “Best Practices: Mobile” in our Unity Integration guide.