ARKit Object Persistence Under the Hood

Update: Since I wrote this, extended tracking has been added in Vuforia 7.2 and ARWorldMap was introduced in ARKit 2.0. While there may still be some viable use cases for registering AR with OpenCV markers, those newer features are likely better solutions for localized or persistent AR. I have also posted an example of the same technique in 100% native iOS code using ArUco.

With Apple’s iOS 11 release of ARKit and Google’s preview release of ARCore, it’s become much easier for developers to ship AR apps that don’t require printed markers. This has made free-movement, world-scale AR possible on many current mobile devices and has led to a rise in impressive and useful new AR apps in the App Store. When building my own AR apps, there was one issue I kept coming up against: not only did I want to anchor an object to the real world, I also wanted to register it to that space in such a way that I could persist object positioning on another device or at a later point in time. If I wanted to build a real-time multiplayer AR game that shares the same space with another device, or needed to give my users the ability to load a previous furniture arrangement in a room-designer app, I was stuck, because ARKit’s SLAM tracking is not registered to real-world space.

SLAM stands for Simultaneous Localization And Mapping, and generally refers to computer vision algorithms that can construct a map of an unknown environment while continuously tracking the position of an object within that space in real time. ARKit and ARCore use a specific SLAM technique called Visual Inertial Odometry (VIO), which compares readings from the inertial sensors with visual feature detection from the camera to estimate the device’s movement through space. If we can detect feature points in an image and track how those points shift relative to each other between successive frames, and if we can also determine how far the camera has moved between those frames, then it’s possible to work out how the device is moving over time. Thankfully, iOS developers like myself need not worry too much about those details. We just ask ARKit to start a session and it updates us with its estimate of the device’s position and orientation as we move around. ARKit’s accuracy is remarkable: I can anchor an object next to my keyboard, walk around the office, come back to my desk, and it’s still pretty much right where I left it.

But the device is only tracked relative to its position at the moment ARKit first establishes tracking. So if ARKit reports that my device has moved 10 meters along the x-axis, that transformation is only meaningful within that particular session. If I try to save that data and use it against the transformation space of a different session, it won’t match. In fact, it would be offset by the difference between the device’s poses at the moments each session began tracking.

To solve this problem we need a way to register ARKit’s transformation space to real-world space during tracking. Devices such as the Tango, which I covered briefly in a past post, save their mapping data as area definition files that the device can localize against, with mixed results. I believe Microsoft’s HoloLens is also able to localize against a similar sort of geometric map of its space. ARKit’s VIO system does not let us save and re-localize against any sort of area file or point cloud that it generates. I have no idea whether this might be possible down the road, but for now it’s a feature we’ll have to work out on our own.

Although one of the main benefits of ARKit is its markerless AR capability, introducing AR markers is a great way to determine the position of a device in real-world space. Using standard AR marker detection techniques with libraries like OpenCV, we can intercept ARFrames from ARKit and process those images to estimate our device’s pose relative to a fixed marker. This lets us establish a positional relationship between ARKit’s transformation space and the marker. We can then place our 3D objects based on this offset, allowing our users to save and load positional data anchored to real-world locations across AR sessions.
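Concretely, this is just a change of basis. Here is a minimal sketch of the math involved (the helper names are mine, not from the demo project): once we know the marker’s pose in the current session’s coordinate space, we can store any object’s pose relative to the marker and later rebuild its session-space pose from a fresh detection of that same physical marker.

using UnityEngine;

public static class MarkerSpaceMath {
    // Save: express an object's session-space pose relative to the detected marker.
    public static Matrix4x4 ObjectInMarkerSpace(Matrix4x4 markerInSession, Matrix4x4 objectInSession) {
        return markerInSession.inverse * objectInSession;
    }

    // Load (in any later session, or on another device): rebuild a session-space
    // pose from a fresh detection of the same physical marker.
    public static Matrix4x4 ObjectInSessionSpace(Matrix4x4 markerInNewSession, Matrix4x4 objectInMarker) {
        return markerInNewSession * objectInMarker;
    }
}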

Before we get into any code, here is a video of how this works in practice. This is an ugly-looking prototype app I built as a proof of concept to demonstrate how we can register ARKit tracking to a marker for indoor wayfinding. The user is able to scan a marker, load in a path, and navigate around an interior space. An admin is also able to localize against the same marker in order to record new paths in the same space.

I built this particular demo in Unity using ARKit, OpenCV, and Firebase for data storage. The key to making this work is accessing the ARKit video frame and processing it through OpenCV so that we end up with two matrices: one for ARKit’s camera transformation and the other for OpenCV’s pose estimation of the marker.

The project setup required a number of dependencies, including the UnityARKit plugin and the OpenCV for Unity plugin. For the purposes of this blog post, I am assuming the reader is familiar with ARKit and has a general understanding of using OpenCV for marker detection; both topics are well covered elsewhere. My goal here is to discuss specifically how to integrate the two systems, and I’ll try to highlight the key components that make this work.

To begin, we’ll set up the UnityARSessionNativeInterface to run with a world-tracking configuration:

//grab the native ARKit session interface provided by the UnityARKit plugin
UnityARSessionNativeInterface arSession = UnityARSessionNativeInterface.GetARSessionNativeInterface();

//planeDetection, startAlignment, getPointCloud, and enableLightEstimation are
//fields on this component (set in the inspector) that map straight onto the config
ARKitWorldTrackingSessionConfiguration config = new ARKitWorldTrackingSessionConfiguration();
config.planeDetection = planeDetection;
config.alignment = startAlignment;
config.getPointCloudData = getPointCloud;
config.enableLightEstimation = enableLightEstimation;
arSession.RunWithConfig(config);

We will also need to assign a delegate to respond to ARFrame events, specifically this event, which fires every time ARKit processes a new frame:

UnityARSessionNativeInterface.ARFrameUpdatedEvent += ARFrameUpdated;

private void ARFrameUpdated(UnityARCamera arCamera) {
   //shouldLocalize set by 'scan marker' button
   if(!shouldLocalize) {
      return;
   }
   shouldLocalize = false;

   //type conversion from UnityARMatrix4x4 to Matrix4x4
   UnityARMatrix4x4 acm = arCamera.worldTransform;
   Matrix4x4 arCameraWorldMat = new Matrix4x4(acm.column0, acm.column1, acm.column2, acm.column3);

   //start processing the camera feed for a marker image
   localizeManager.startProcessingImageForMarker(arCameraWorldMat, arCamera.intrinsics, markerData);
}

This method checks whether we should process a frame based on the state of shouldLocalize, which is set by the scan button. It then converts the camera’s world transform to a standard Matrix4x4 and passes it along to our localizer class, together with two other properties: the camera intrinsics and our markerData collection.

The second parameter, the camera intrinsics, will be used to construct a 3×3 matrix that maps coordinates between 3D camera space and the 2D image plane. It is a Vector4 whose values are hardware-specific and include the focal length and principal point of the camera. This 3-part blog post is a great resource for understanding the camera matrix. We can get the intrinsics either by calibrating the camera directly with a printed checkerboard pattern or, in the case of ARKit, by reading them from a property of ARCamera. Neither approach is without issues.
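For reference, the detector ultimately needs the standard pinhole intrinsic matrix below, where fx and fy are the focal lengths in pixels and (cx, cy) is the principal point. Exactly which Vector4 component lands in which slot depends on how the plugin packs the values and on device orientation, which is handled in detectMarkerInTexture further down.

        | fx   0   cx |
K  =    |  0   fy  cy |
        |  0   0    1 |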

When measuring camera intrinsics directly, I recommend using the VIZARIO.Cam app and a printed checkerboard glued to a stiff piece of cardboard. Take readings from each device type you will support, and pay careful attention to orientation (landscape/portrait) and image resolution. I’ve found that capturing at least 30 frames from many different angles produces the best measurements. You will then need to reference the proper matrix based on the current device type and flip the z and w properties based on its current orientation.
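As a rough sketch of what that per-device lookup and orientation flip might look like (the class, table, and the numbers in it are placeholders for illustration, not measured values):

using System.Collections.Generic;
using UnityEngine;

public static class IntrinsicsTable {
    // Placeholder calibration results, keyed by device model, measured in portrait.
    // Replace these with your own checkerboard measurements.
    static readonly Dictionary<string, Vector4> portraitIntrinsics = new Dictionary<string, Vector4> {
        { "iPhone9,3", new Vector4(1603f, 1603f, 932f, 718f) }  // example only
    };

    public static Vector4 GetIntrinsics() {
        Vector4 k;
        if (!portraitIntrinsics.TryGetValue(SystemInfo.deviceModel, out k)) {
            Debug.LogWarning("No calibration stored for " + SystemInfo.deviceModel);
            return Vector4.zero;
        }

        // Swap the principal-point components when the device is held in landscape,
        // matching the z/w flip described above.
        if (Screen.orientation == ScreenOrientation.LandscapeLeft ||
            Screen.orientation == ScreenOrientation.LandscapeRight) {
            k = new Vector4(k.x, k.y, k.w, k.z);
        }
        return k;
    }
}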

A couple of things about using ARKit’s intrinsic matrices: first, you will probably hit an error when trying to access the intrinsics property of UnityARCamera. That’s because, at the time I’m writing this, the UnityARKit plugin does not implement intrinsics. I noticed a pending pull request for this feature and ended up manually patching the latest UnityARKit release locally. Second, a possible warning: I have noticed some issues on the iPhone X where the intrinsics seem wrong. Since I don’t currently have access to an iPhone X, I haven’t had a chance to dig into this further. I assume it is somehow my error and not Apple’s, but if you are encountering issues with intrinsics on an iPhone X, you are not alone.

The third property passed to our localizer is a collection of markerData objects that our detector will be looking for. These are black-and-white grid patterns encoded as flat boolean arrays, one entry per cell. For marker detection, we’ll be working off the AR Marker Example in the OpenCV for Unity package, and our markerData is stored in a format that mirrors the MarkerDesign data structure required by those classes.

using UnityEngine;
using System.Collections;

namespace OpenCVMarkerBasedAR
{
    [System.Serializable]
    public class MarkerDesign
    {
        public int gridSize = 5;                    // cells per side of the marker grid
        public float markerSizeInMeters = 0.0889f;  // physical size of the printed marker
        public bool[] data = new bool[5 * 5];       // flattened grid, one entry per cell
    }
}

Note that the properties are initialized with typical default values, but it is critical that markerSizeInMeters is set to the exact physical dimensions of each printed marker.
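For illustration, here is one way a 5×5 pattern might be flattened into the data array. The pattern itself is made up for this example, and whether true corresponds to a black or a white cell depends on how the MarkerBasedAR example decodes the grid:

// A made-up 5x5 pattern flattened row by row into the boolean array.
// Check the MarkerBasedAR example for which boolean value maps to which cell color.
MarkerDesign design = new MarkerDesign();
design.gridSize = 5;
design.markerSizeInMeters = 0.0889f;  // exact printed size of this marker
design.data = new bool[] {
    true,  true,  true,  true,  true,
    true,  false, false, false, true,
    true,  false, true,  false, true,
    true,  false, false, false, true,
    true,  true,  true,  true,  true
};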

Within the startProcessingImageForMarker method, we begin by formatting our data for marker detection. First, I convert the markerData set loaded remotely into an array of the MarkerDesign objects discussed above. Next, I convert our camera renderTexture to an OpenCV Mat object. Finally, I define a Matrix4x4 to hold our camera offset and run the marker detection algorithm by calling detectMarkerInTexture:

public void startProcessingImageForMarker(Matrix4x4 cameraWorldTransform, Vector4 cameraIntrinsics, Dictionary<string, MarkerDataItem> markers) {
    //format data
    markerDesigns = setMarkerDesignsFromData(markers);
    Mat imgMat = new Mat (renderTexture.height, renderTexture.width, CvType.CV_8UC4);
    OpenCVForUnity.Utils.textureToMat(renderTexture, imgMat);
    Matrix4x4 offsetMat;

    //run marker detection
    int result = detectMarkerInTexture (imgMat, cameraIntrinsics, markerDesigns, out offsetMat);

    ... code omitted

}
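setMarkerDesignsFromData is just a small mapping step. A sketch might look like the following, assuming MarkerDataItem carries a grid size, a physical size, and the flattened pattern; the field names here are my guesses, not the actual remote data model:

// Sketch only: maps the remotely loaded marker definitions into the MarkerDesign
// format the OpenCV detector expects. MarkerDataItem field names are assumptions.
private MarkerDesign[] setMarkerDesignsFromData(Dictionary<string, MarkerDataItem> markers) {
    List<MarkerDesign> designs = new List<MarkerDesign>();
    foreach (MarkerDataItem item in markers.Values) {
        MarkerDesign design = new MarkerDesign();
        design.gridSize = item.gridSize;
        design.markerSizeInMeters = item.sizeInMeters;
        design.data = item.pattern;
        designs.Add(design);
    }
    return designs.ToArray();
}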

detectMarkerInTexture formats the camera intrinsics into a proper 3×3 matrix and sets default distortion coefficients (which I found sufficient). It then initializes a new MarkerDetector instance with those properties, processes the frame, pulls out the first detected marker, assigns its transformation to the output matrix, and returns the id of the detected marker.

private int detectMarkerInTexture(Mat imgMat, Vector4 cameraIntrinsics, MarkerDesign[] markerDesigns, out Matrix4x4 matrix) {

    MatOfDouble distCoeffs = new MatOfDouble (0, 0, 0, 0);

    //build the 3x3 intrinsic matrix from the Vector4
    //w over z = portrait, z over w = landscape
    Mat arkitCamMatrix = new Mat (3, 3, CvType.CV_64FC1);
    arkitCamMatrix.put (0, 0, cameraIntrinsics.x);  //focal length x
    arkitCamMatrix.put (0, 1, 0);
    arkitCamMatrix.put (0, 2, cameraIntrinsics.w);  //principal point x
    arkitCamMatrix.put (1, 0, 0);
    arkitCamMatrix.put (1, 1, cameraIntrinsics.y);  //focal length y
    arkitCamMatrix.put (1, 2, cameraIntrinsics.z);  //principal point y - flip with w for landscape
    arkitCamMatrix.put (2, 0, 0);
    arkitCamMatrix.put (2, 1, 0);
    arkitCamMatrix.put (2, 2, 1.0f);

    MarkerDetector markerDetector = new MarkerDetector (arkitCamMatrix, distCoeffs, markerDesigns);
    markerDetector.processFrame (imgMat, 1);

    List<Marker> findMarkers = markerDetector.getFindMarkers ();
    if (findMarkers.Count <= 0) {
        matrix = new Matrix4x4 ();
        return -1;
    }

    Marker marker = findMarkers [0];
    matrix = marker.transformation;
    return marker.id;
}

MarkerDetector is a class from the MarkerBasedAR OpenCV example. This is the main algorithm for marker detection, where the frame is processed, patterns are matched, and pose estimation is calculated. Its output is a matrix representing the transformation between the marker image and the AR camera. That’s the offsetMat set in the startProcessingImageForMarker method above. It’s the relationship between this matrix and the AR camera’s transformation that allows us to persist our AR objects within the transformation space of the marker image.

So, taking offsetMat together with the original cameraWorldTransform, we can instantiate a game object in the transformation space of the marker image like so:

//camera pose in Unity world space, from the cameraWorldTransform passed in earlier
Matrix4x4 unityCameraWorldMat = Matrix4x4.TRS (UnityARMatrixOps.GetPosition (cameraWorldTransform), UnityARMatrixOps.GetRotation (cameraWorldTransform), new Vector3 (1, 1, 1));
//flip matrices that convert between OpenCV's and Unity's coordinate conventions
Matrix4x4 invertYM = Matrix4x4.TRS (Vector3.zero, Quaternion.identity, new Vector3 (1, -1, 1));
Matrix4x4 invertZM = Matrix4x4.TRS (Vector3.zero, Quaternion.identity, new Vector3 (1, 1, -1));
//marker pose in Unity world space: camera pose composed with the detected offset
Matrix4x4 markerMat = unityCameraWorldMat * invertYM * offsetMat * invertZM;
//anchor an empty GameObject at the marker's pose to act as our scene root
GameObject sceneHolder = new GameObject ();
MatrixUtils.SetTransformFromMatrix (sceneHolder.transform, ref markerMat);

Now we can add, save, and load objects using the local space of sceneHolder, and that solves the basic problem of persistent AR, albeit with a marker image.
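As a final sketch of what that persistence step might look like (the class and helper names here are mine; the demo used Firebase, but any storage works): parent your content under sceneHolder, save each child’s local pose, and in a later session recreate sceneHolder from a fresh marker scan before loading the poses back in.

using System.Collections.Generic;
using UnityEngine;

[System.Serializable]
public class SavedPose {
    public Vector3 localPosition;
    public Quaternion localRotation;
}

public static class ScenePersistence {
    // Capture each child's pose relative to the marker-anchored scene root.
    public static List<SavedPose> Save(Transform sceneHolder) {
        List<SavedPose> poses = new List<SavedPose>();
        foreach (Transform child in sceneHolder) {
            poses.Add(new SavedPose {
                localPosition = child.localPosition,
                localRotation = child.localRotation
            });
        }
        return poses;  // serialize to JSON, Firebase, etc.
    }

    // Recreate objects under a sceneHolder rebuilt from a fresh marker scan.
    public static void Load(Transform sceneHolder, List<SavedPose> poses, GameObject prefab) {
        foreach (SavedPose pose in poses) {
            GameObject obj = Object.Instantiate(prefab, sceneHolder);
            obj.transform.localPosition = pose.localPosition;
            obj.transform.localRotation = pose.localRotation;
        }
    }
}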