Depth Map Generation for Holographic Displays

Converting 2D Images into RGB-D Assets for Looking Glass Holograms

Looking Glass Factory is pioneering the next generation of holographic displays: devices that let you see and interact with three-dimensional content without wearing a headset or glasses. Their products use light-field technology to project many slightly different perspectives of a scene at once, so your eyes perceive depth naturally as you move around, creating a true 3D illusion that feels tactile and spatial.

Unlike VR or AR, these holographic displays don't isolate you from your surroundings. You're not "looking into" a screen; you're looking through one, at what feels like a physical object floating in space. Artists, developers, and 3D creators use them for 3D photography, volumetric captures and light-field media, interactive 3D applications (Unity, Unreal, Blender, WebGL), and scientific visualization. I released some of my own artworks for the Portrait at disconnect.london a few years ago.


What is RGB-D?

RGB-D stands for Red-Green-Blue + Depth.

  • The RGB part is your regular color image.
  • The D (Depth) part is a grayscale map representing distance: white for near, black for far.

When combined, these form a 4-channel image that lets holographic displays like the Looking Glass reconstruct multiple viewpoints of a single scene.
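
As a tiny illustration of that pairing (NumPy and Pillow only; the file names are placeholders), an RGB image and its depth map can be stacked into a single H×W×4 array:

import numpy as np
from PIL import Image

# Placeholder file names -- substitute your own photo and its depth map.
rgb   = np.array(Image.open("photo.png").convert("RGB"))        # H x W x 3, uint8
depth = np.array(Image.open("photo_depth.png").convert("L"))    # H x W, uint8 (white = near)

rgbd = np.dstack([rgb, depth])   # H x W x 4: R, G, B, D
print(rgbd.shape)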

RGB-D data can be captured directly with devices that include depth sensors or LiDAR, like the old Kinect, RealSense, or newer iPhones and iPads with LiDAR scanners. But unless the sensor is particularly high quality, the resulting depth maps can be noisy or low-resolution, which leads to poor visual fidelity when viewed holographically.

This post walks through how to estimate depth from a regular 2D photo. Since I shoot with a mirrorless camera that already produces high-quality images, the results are noticeably better than what you'd get from an iPhone's portrait mode.

To make this easy to import into Looking Glass Studio, we generate side-by-side images or videos with the RGB image on the left and the depth map on the right.
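
The side-by-side layout is simply two images pasted onto one canvas of double width. A minimal Pillow sketch of the idea (file names are placeholders; the full script below does the same thing in its rgbd_lr_canvas function):

from PIL import Image

rgb   = Image.open("photo.png").convert("RGB")           # placeholder names
depth = Image.open("photo_depth.png").convert("RGB")     # grayscale depth, widened to RGB

W, H = rgb.size
canvas = Image.new("RGB", (W * 2, H))
canvas.paste(rgb, (0, 0))                    # RGB on the left
canvas.paste(depth.resize((W, H)), (W, 0))   # depth on the right
canvas.save("photo_rgbd_lr.png")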


Estimating Depth from a Single Image

Depth estimation is the process of predicting the distance of every pixel in a 2D image without knowing the actual 3D geometry. Modern AI models can do this remarkably well by learning from millions of examples of photos paired with real depth data.

The script at the end of this post uses transformer-based neural networks from Intel's Dense Prediction Transformer (DPT) family, which are state-of-the-art for monocular depth estimation.

  • DPT-Large is optimized for accuracy.
  • DPT-Hybrid (MiDaS) generalizes better across a wider variety of images.
  • Optionally, both can be used together for even better depth fidelity.

These models are accessed through Hugging Face Transformers, running on PyTorch for GPU-accelerated inference.
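
Stripped of all refinements, the core of the approach is just a few lines. A minimal sketch (assuming torch, transformers, and Pillow are installed; the file name is a placeholder):

import numpy as np
import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large").to(device).eval()

image  = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.inference_mode():
    pred = model(**inputs).predicted_depth               # [1, h, w], relative (inverse-depth-style)
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=image.size[::-1],         # back to the original (H, W)
        mode="bicubic", align_corners=False,
    ).squeeze()

depth = pred.cpu().numpy()
depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize to 0-1
Image.fromarray(np.uint8(depth * 255)).save("photo_depth.png")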

Improving Accuracy and Detail

Depth prediction can struggle with fine textures, soft edges, or lighting gradients. This pipeline layers multiple techniques to get the cleanest and most realistic results possible:

  • Multi-scale inference: The image is processed at several resolutions (384, 512, 640 px), and the results are fused to preserve both global structure and small details.
  • Test-time augmentation: The image is flipped horizontally and re-evaluated, then both predictions are averaged.
  • Edge-aware weighting: Sharp edges in the RGB image get higher confidence to keep object boundaries clean.
  • Guided / joint bilateral filtering: The depth map is refined using color edges from the RGB image to smooth flat regions while preserving boundaries.
  • DenseCRF: A probabilistic post-processing step that further sharpens transitions between foreground and background.

These steps combine into what you could think of as a depth super-resolution pipeline, producing near-studio-grade RGB-D pairs from plain photos.
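
As a rough sketch of the flip-and-average step alone (predict_depth here is a hypothetical helper wrapping the model call sketched above; the real pipeline additionally fuses several scales using edge-derived weights):

import numpy as np
from PIL import Image

def tta_depth(image, predict_depth):
    # predict_depth: assumed callable returning an H x W float depth array.
    d      = predict_depth(image)
    d_flip = predict_depth(image.transpose(Image.FLIP_LEFT_RIGHT))
    return 0.5 * (d + np.fliplr(d_flip))   # mirror the flipped prediction back, then average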


Under the Hood

  1. Load the pretrained models from Hugging Face (Intel/dpt-large, Intel/dpt-hybrid-midas).
  2. Pre-process the input image using DPTImageProcessor, resizing and normalizing it.
  3. Run inference with torch.autocast for mixed precision, speeding up processing on GPUs.
  4. Interpolate the output back to the original image size using bicubic upsampling.
  5. Merge multi-scale and TTA predictions using edge-weighted averaging.
  6. Normalize the result into a 0–1 range for depth encoding.
  7. Optionally refine using OpenCV's guided filter or DenseCRF.
  8. Output two images (or videos):
    • <filename>_depth.png: grayscale depth map
    • <filename>_rgbd_lr.png: side-by-side RGB and depth image ready for Looking Glass import.

For videos, each frame goes through the same pipeline, optionally skipping frames with --stride to balance speed and quality.
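
Putting it together, typical invocations look roughly like this (the script file name is whatever you save it as; depth_rgbd.py is just a placeholder):

# Single photo: ensemble both models, flip TTA, guided-filter refinement
python depth_rgbd.py photo.jpg --ensemble --tta --refine

# Video: every 2nd frame at a 1280 px long edge, plus a depth-only MP4
python depth_rgbd.py clip.mp4 --stride 2 --resize-long 1280 --save-depth-video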


From Script to Hologram

Once generated, these RGB-D images or videos can be imported directly into Looking Glass Studio:

  • Select "RGBD Photo and Video" option.
  • Set the Depth Position to Right (default).
  • Adjust depth and focus.

The full script is below.
#!/usr/bin/env python3

import argparse, sys, io, os
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps

# ---------- Optional deps ----------
try:
    import cv2
    _HAS_CV2 = True
    _HAS_XIMGPROC = hasattr(cv2, "ximgproc") and hasattr(cv2.ximgproc, "guidedFilter")
except Exception:
    _HAS_CV2 = False
    _HAS_XIMGPROC = False

try:
    import pydensecrf.densecrf as dcrf
    from pydensecrf.utils import unary_from_softmax
    _HAS_DCRF = True
except Exception:
    _HAS_DCRF = False

# ---------- Torch / Transformers ----------
try:
    import torch
except Exception:
    print("ERROR: install torch/torchvision"); sys.exit(2)
try:
    from transformers import DPTForDepthEstimation, DPTImageProcessor
except ModuleNotFoundError:
    print("ERROR: install transformers timm accelerate"); sys.exit(3)

# ---------- Models ----------
MODEL_REPOS = {
    "dpt-large": "Intel/dpt-large",
    "dpt-hybrid-midas": "Intel/dpt-hybrid-midas",
}

def load_dpt(name, device):
    repo = MODEL_REPOS[name]
    proc = DPTImageProcessor.from_pretrained(repo)
    net  = DPTForDepthEstimation.from_pretrained(repo).to(device)
    net.eval()
    print(f"Loaded: {repo}")
    return net, proc

# ---------- Core inference ----------
@torch.inference_mode()
def _infer_depth_logits(pil_rgb, model, processor, device, use_fp16=True):
    W, H = pil_rgb.size
    inputs = processor(images=pil_rgb, return_tensors="pt")
    inputs = {k: v.to(device) for k,v in inputs.items()}
    autocast_dtype = torch.float16 if (device == "cuda" and use_fp16) else (torch.bfloat16 if device=="cuda" else torch.float32)
    with torch.autocast(device_type=("cuda" if device=="cuda" else "cpu"), dtype=autocast_dtype):
        out = model(**inputs)
        pred = out.predicted_depth  # [1,h,w]
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=(H, W), mode="bicubic", align_corners=False
        ).squeeze(1).squeeze(0).float()
    return pred.cpu().numpy()

def _normalize01(arr, robust=True):
    if robust:
        p2, p98 = np.percentile(arr, (2, 98))
        arr = np.clip(arr, p2, p98)
    rng = arr.max() - arr.min()
    return (arr - arr.min()) / max(1e-8, rng)

def _rgb_edge_weights(pil_rgb, sigma=1.2):
    rgb = np.array(pil_rgb.convert("RGB"), dtype=np.uint8)
    if _HAS_CV2:
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        blur = cv2.GaussianBlur(gray, (0,0), sigma)
        edges = cv2.Laplacian(blur, cv2.CV_32F, ksize=3)
        mag = np.abs(edges)
        mag = mag / (mag.max() + 1e-8)
        return 0.2 + 0.8*mag
    # Fallback without OpenCV: plain gradient magnitude, with the same 0.2
    # confidence floor so flat regions never collapse to zero weight in the merge.
    gx = np.gradient(rgb.mean(axis=2), axis=1)
    gy = np.gradient(rgb.mean(axis=2), axis=0)
    mag = np.sqrt(gx*gx + gy*gy)
    return 0.2 + 0.8*(mag / (mag.max() + 1e-8))

def _detail_aware_merge(depth_list, weight_maps):
    weights = np.stack(weight_maps, axis=0)  # [N,H,W]
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    depths  = np.stack(depth_list, axis=0)   # [N,H,W]
    return (weights * depths).sum(axis=0)

def infer_depth_max(pil_img, device, backends, tta, pyr_sizes, use_fp16=True):
    rgb = pil_img.convert("RGB")
    W, H = rgb.size
    scales = []
    for s in (pyr_sizes or []):
        s = int(s)
        if s >= 64: scales.append(s)
    if not scales:
        scales = [max(W, H)]
    edge_w = _rgb_edge_weights(rgb)

    preds, weights = [], []
    for name in backends:
        model, proc = _MODEL_CACHE[name]
        for s in scales:
            if W >= H:
                newW, newH = s, int(round(H * s / W))
            else:
                newH, newW = s, int(round(W * s / H))
            img_s = rgb.resize((newW, newH), Image.BICUBIC) if s != max(W,H) else rgb

            d = _infer_depth_logits(img_s, model, proc, device, use_fp16=use_fp16)
            if img_s.size != rgb.size:
                d = np.array(Image.fromarray(d).resize((W, H), Image.BICUBIC), dtype=np.float32)
            preds.append(d); weights.append(edge_w)

            if tta:
                img_f = img_s.transpose(Image.FLIP_LEFT_RIGHT)
                d_f = _infer_depth_logits(img_f, model, proc, device, use_fp16=use_fp16)
                if img_s.size != rgb.size:
                    d_f = np.array(Image.fromarray(d_f).resize((W, H), Image.BICUBIC), dtype=np.float32)
                preds.append(np.fliplr(d_f)); weights.append(edge_w)

    merged = _detail_aware_merge(preds, weights)
    return _normalize01(merged, robust=True)

# ---------- Refinement ----------
def refine_guided_or_bilateral(depth01, pil_rgb, strength=6):
    if not _HAS_CV2:
        return depth01
    guide = np.array(pil_rgb.convert("RGB"))
    if _HAS_XIMGPROC:
        gray = cv2.cvtColor(guide, cv2.COLOR_RGB2GRAY)
        radius = int(max(2, strength))
        eps = 1e-3
        d = depth01.astype(np.float32)
        out = cv2.ximgproc.guidedFilter(guide=gray, src=d, radius=radius, eps=eps, dDepth=-1)
        return np.clip(out, 0, 1).astype(np.float32)
    d8 = np.uint8(np.clip(depth01*255, 0, 255))
    out = cv2.bilateralFilter(d8, d=int(2*strength+1), sigmaColor=75, sigmaSpace=75)
    return (out.astype(np.float32)/255.0)

def refine_densecrf(depth01, pil_rgb, iters=5):
    if not _HAS_DCRF:
        return depth01
    H, W = depth01.shape
    rgb = np.array(pil_rgb.convert("RGB"))
    K = 64
    bins = np.linspace(0, 1, K+1)
    # Quantize depth into K labels; clip so depth == 1.0 falls into the last bin,
    # and flatten so the labels line up with the (K, H*W) probability matrix.
    lab = np.clip(np.digitize(depth01, bins) - 1, 0, K - 1).reshape(-1)
    prob = np.zeros((K, H*W), dtype=np.float32)
    prob[lab, np.arange(H*W)] = 1.0
    unary = unary_from_softmax(prob)
    d = dcrf.DenseCRF2D(W, H, K)
    d.setUnaryEnergy(unary)
    d.addPairwiseBilateral(sxy=20, srgb=5, rgbim=rgb, compat=3)
    d.addPairwiseGaussian(sxy=3, compat=1)
    Q = d.inference(iters)
    Q = np.array(Q).reshape(K, H, W)
    return (Q * np.linspace(0, 1, K)[:, None, None]).sum(axis=0).astype(np.float32)

# ---------- RGB-D writer ----------
def rgbd_lr_canvas(rgb_img_pil, depth01):
    W, H = rgb_img_pil.size
    rgb = rgb_img_pil.convert("RGB")
    depth8 = np.uint8(np.clip(depth01*255, 0, 255))
    d_rgb = Image.fromarray(depth8, "L").convert("RGB").resize((W, H), Image.NEAREST)
    canvas = Image.new("RGB", (W*2, H))
    canvas.paste(rgb, (0, 0))
    canvas.paste(d_rgb, (W, 0))
    return canvas

def write_depth_frame_bgr(depth01, size):
    W, H = size
    d8 = np.uint8(np.clip(depth01*255, 0, 255))
    d_rgb = cv2.cvtColor(d8, cv2.COLOR_GRAY2BGR)
    return cv2.resize(d_rgb, (W, H), interpolation=cv2.INTER_NEAREST)

# ---------- Image pipeline ----------
def process_image(in_path: Path, args, device):
    pil = Image.open(in_path)
    pil = ImageOps.exif_transpose(pil)

    backends = build_backends(args)
    depth01 = infer_depth_max(pil, device, backends, args.tta, args.pyr, use_fp16=not args.no_fp16)
    if args.refine:
        depth01 = refine_guided_or_bilateral(depth01, pil, strength=args.refine_strength)
    if args.crf:
        depth01 = refine_densecrf(depth01, pil, iters=5)
    if args.invert:
        depth01 = 1.0 - depth01

    depth_png = in_path.with_name(in_path.stem + "_depth.png")
    Image.fromarray(np.uint8(np.clip(depth01*255,0,255)), "L").save(depth_png, optimize=True, compress_level=9)
    out_png = in_path.with_name(in_path.stem + "_rgbd_lr.png")
    rgbd = rgbd_lr_canvas(pil, depth01)
    rgbd.save(out_png, format="PNG", optimize=True)
    print(f"Saved: {depth_png}")
    print(f"Saved: {out_png}")
    print("Import: RGB-D Photo → Depth Position=Right; toggle inversion if needed.")

# ---------- Video pipeline ----------
def process_video(in_path: Path, args, device):
    if not _HAS_CV2:
        print("ERROR: OpenCV (cv2) is required for video processing."); sys.exit(1)
    cap = cv2.VideoCapture(str(in_path))
    if not cap.isOpened():
        print(f"ERROR: cannot open video: {in_path}"); sys.exit(1)

    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    src_w   = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    src_h   = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total   = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)

    proc_long = args.resize_long if args.resize_long and args.resize_long > 0 else max(src_w, src_h)
    if src_w >= src_h:
        proc_w, proc_h = proc_long, int(round(src_h * proc_long / src_w))
    else:
        proc_h, proc_w = proc_long, int(round(src_w * proc_long / src_h))

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    out_rgbd_path = in_path.with_name(in_path.stem + "_rgbd_lr.mp4")
    out_rgbd = cv2.VideoWriter(str(out_rgbd_path), fourcc, src_fps/max(1,args.stride), (proc_w*2, proc_h))

    out_depth = None
    if args.save_depth_video:
        out_depth_path = in_path.with_name(in_path.stem + "_depth.mp4")
        out_depth = cv2.VideoWriter(str(out_depth_path), fourcc, src_fps/max(1,args.stride), (proc_w, proc_h))

    print(f"Video in: {src_w}x{src_h}@{src_fps:.3f}  frames={total if total>0 else '?'}")
    print(f"Processing at ~{proc_w}x{proc_h}  stride={args.stride}  write fps≈{src_fps/max(1,args.stride):.3f}")

    backends = build_backends(args)

    frame_idx = 0
    written = 0
    limit = args.max_frames if args.max_frames and args.max_frames > 0 else None

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % args.stride != 0:
            frame_idx += 1
            continue

        rgb_bgr = cv2.resize(frame, (proc_w, proc_h), interpolation=cv2.INTER_AREA) if (frame.shape[1],frame.shape[0]) != (proc_w,proc_h) else frame
        rgb_pil = Image.fromarray(cv2.cvtColor(rgb_bgr, cv2.COLOR_BGR2RGB))

        depth01 = infer_depth_max(rgb_pil, device, backends, args.tta, args.pyr, use_fp16=not args.no_fp16)
        if args.refine:
            depth01 = refine_guided_or_bilateral(depth01, rgb_pil, strength=args.refine_strength)
        if args.crf:
            depth01 = refine_densecrf(depth01, rgb_pil, iters=5)
        if args.invert:
            depth01 = 1.0 - depth01

        depth_bgr = write_depth_frame_bgr(depth01, (proc_w, proc_h))
        rgbd_lr = np.concatenate([rgb_bgr, depth_bgr], axis=1)
        out_rgbd.write(rgbd_lr)
        if out_depth is not None:
            out_depth.write(depth_bgr)

        written += 1
        frame_idx += 1
        if limit and written >= limit:
            break
        if written % 30 == 0:
            print(f"Processed {written} frames...")

    cap.release()
    out_rgbd.release()
    if out_depth is not None:
        out_depth.release()

    print(f"Saved: {out_rgbd_path}")
    if args.save_depth_video:
        print(f"Saved: {out_depth_path}")
    print("Import: RGB-D Video → Depth Position=Right; toggle inversion if needed.")

# ---------- Backend builder / cache ----------
_MODEL_CACHE = {}
def build_backends(args):
    backends = ["dpt-large"] if args.backend == "dpt-large" else ["dpt-hybrid-midas"]
    if args.ensemble:
        backends = ["dpt-large", "dpt-hybrid-midas"] if args.backend == "dpt-large" else ["dpt-hybrid-midas", "dpt-large"]
    for b in backends:
        if b not in _MODEL_CACHE:
            _MODEL_CACHE[b] = load_dpt(b, args.device)
    return backends

# ---------- Utils ----------
def is_video_file(path: Path):
    return path.suffix.lower() in {".mp4", ".mov", ".m4v", ".avi", ".mkv", ".webm"}

def is_image_file(path: Path):
    return path.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}

def parse_pyr(pyr_str):
    if not pyr_str or not pyr_str.strip():
        return [384, 512, 640]
    try:
        return [int(x.strip()) for x in pyr_str.split(",") if x.strip()]
    except Exception:
        print("Warning: could not parse --pyr; using defaults 384,512,640")
        return [384, 512, 640]

def pick_file_dialog():
    try:
        import tkinter as tk
        from tkinter import filedialog
        root = tk.Tk(); root.withdraw()
        return filedialog.askopenfilename(
            title="Select an image or video",
            filetypes=[("Media","*.jpg *.jpeg *.png *.bmp *.tif *.tiff *.mp4 *.mov *.m4v *.avi *.mkv *.webm")]
        ) or None
    except Exception:
        return None

# ---------- Main ----------
def main():
    ap = argparse.ArgumentParser("RGB-D LR for images & video")
    ap.add_argument("input", nargs="?", help="Path to input image or video (optional)")
    ap.add_argument("--backend", default="dpt-large",
                    choices=["dpt-large", "dpt-hybrid-midas"],
                    help="Primary depth model (default: dpt-large)")
    ap.add_argument("--ensemble", action="store_true",
                    help="Average dpt-large and dpt-hybrid-midas")
    ap.add_argument("--tta", action="store_true",
                    help="Enable test-time augmentation (hflip)")
    ap.add_argument("--pyr", default="384,512,640",
                    help="Comma-separated long-edge sizes (default: 384,512,640)")
    ap.add_argument("--refine", action="store_true",
                    help="Edge-aware refinement (guided/joint bilateral)")
    ap.add_argument("--refine-strength", type=int, default=6,
                    help="Refinement radius/strength (default: 6)")
    ap.add_argument("--crf", action="store_true",
                    help="Optional DenseCRF (needs pydensecrf)")
    ap.add_argument("--invert", action="store_true",
                    help="Invert near/far after normalization")
    ap.add_argument("--no-fp16", action="store_true",
                    help="Disable CUDA fp16 autocast (use if you see NaNs/artifacts)")

    # Video-specific
    ap.add_argument("--stride", type=int, default=1,
                    help="Process every Nth frame (default: 1 = every frame)")
    ap.add_argument("--resize-long", type=int, default=0,
                    help="Long-edge size for processing/output (0 = source size)")
    ap.add_argument("--max-frames", type=int, default=0,
                    help="Limit number of processed frames (0 = all)")
    ap.add_argument("--save-depth-video", action="store_true",
                    help="Also write a depth-only MP4 alongside RGB-D")

    args = ap.parse_args()

    # If no input provided, open a file dialog (double-click friendly)
    if not args.input:
        picked = pick_file_dialog()
        if not picked:
            print("No input selected. Provide a path or pick a file from the dialog.")
            sys.exit(1)
        args.input = picked

    in_path = Path(args.input).expanduser()
    if not in_path.exists():
        print(f"Not found: {in_path}"); sys.exit(1)

    # Device
    args.device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Device: {args.device}")

    # Parse pyramid
    args.pyr = parse_pyr(args.pyr)

    # Route
    if is_image_file(in_path):
        process_image(in_path, args, args.device)
    elif is_video_file(in_path):
        process_video(in_path, args, args.device)
    else:
        print("Unsupported input type. Use an image (.jpg/.png/...) or video (.mp4/.mov/...).")
        sys.exit(1)

if __name__ == "__main__":
    main()