Depth Map Generation for Holographic Displays
Converting 2D Images into RGB-D Assets for Looking Glass Holograms

Looking Glass Factory is a company pioneering the next generation of holographic displays. These are devices that let you see and interact with three-dimensional content without wearing a headset or glasses. Their products use light-field technology to project many slightly different perspectives of a scene at once. Your eyes naturally perceive depth as you move around, creating a true 3D illusion that feels tactile and spatial.
Unlike VR or AR, these holographic displays don't isolate you from your surroundings. You're not "looking into" a screen; you're looking through one, at what feels like a physical object floating in space. Artists, developers, and 3D creators use them for 3D photography, volumetric captures and light-field media, interactive 3D applications (Unity, Unreal, Blender, WebGL), and scientific visualization. I released some of my own artwork for the Looking Glass Portrait at disconnect.london a few years ago.
What is RGB-D?
RGB-D stands for Red-Green-Blue + Depth.
- The RGB part is your regular color image.
- The D (Depth) part is a grayscale map representing distance. White for near, black for far.
When combined, these form a 4-channel image that lets holographic displays like the Looking Glass reconstruct multiple viewpoints of a single scene.
RGB-D data can be captured directly with devices that include depth sensors or LiDAR, like the old Kinect, RealSense, or newer iPhones and iPads with LiDAR scanners. But unless the sensor is particularly high quality, the resulting depth maps can be noisy or low-resolution, which leads to poor visual fidelity when viewed holographically.
This post walks through how to estimate depth from a regular 2D photo. Since I shoot with a mirrorless camera that already produces high-quality images, the results are noticeably better than what you'd get from an iPhone's portrait mode.

To make this easy to import into Looking Glass Studio, the script generates side-by-side images or videos with the RGB image on the left and the depth map on the right.
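If you already have a color image and a matching depth map, building that side-by-side layout is just a paste into a double-width canvas. Here's a minimal sketch with Pillow (the filenames are placeholders; the full script at the end of this post does the same thing in its rgbd_lr_canvas helper):

from PIL import Image

# Load an existing photo and its grayscale depth map (placeholder filenames).
rgb = Image.open("photo.jpg").convert("RGB")
depth = Image.open("photo_depth.png").convert("L").resize(rgb.size, Image.NEAREST)

# Double-width canvas: color on the left, depth on the right.
W, H = rgb.size
canvas = Image.new("RGB", (W * 2, H))
canvas.paste(rgb, (0, 0))
canvas.paste(depth.convert("RGB"), (W, 0))
canvas.save("photo_rgbd_lr.png")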
Estimating Depth from a Single Image
Depth estimation is the process of predicting the distance of every pixel in a 2D image without knowing the actual 3D geometry. Modern AI models can do this remarkably well by learning from millions of examples of photos paired with real depth data.
The script uses transformer-based neural networks from Intel's Dense Prediction Transformer family, which are state-of-the-art for monocular depth estimation.
- DPT-Large is optimized for accuracy.
- DPT-Hybrid (MiDaS) offers better generalization to a wider variety of images.
- Optionally, both can be used together for even better depth fidelity.
These models are accessed through Hugging Face Transformers, running on PyTorch for GPU-accelerated inference.
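Stripped of the quality tricks described in the next section, the core inference is only a few lines. Here's a minimal single-model sketch, assuming transformers, torch, and Pillow are installed ("photo.jpg" is a placeholder filename):

import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large").to(device).eval()

img = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt").to(device)
with torch.inference_mode():
    relative_depth = model(**inputs).predicted_depth  # [1, h, w], relative (not metric) depth
# Resize the prediction back to the original image resolution.
depth = torch.nn.functional.interpolate(
    relative_depth.unsqueeze(1), size=img.size[::-1],
    mode="bicubic", align_corners=False,
).squeeze().cpu().numpy()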

Improving Accuracy and Detail
Depth prediction can struggle with fine textures, soft edges, or lighting gradients. This pipeline layers multiple techniques to get the cleanest and most realistic results possible:
- Multi-scale inference: The image is processed at several resolutions (384, 512, 640 px), and the results are fused to preserve both global structure and small details.
- Test-time augmentation: The image is flipped horizontally and re-evaluated, then both predictions are averaged (a minimal sketch of this step follows this list).
- Edge-aware weighting: Sharp edges in the RGB image get higher confidence to keep object boundaries clean.
- Guided / joint bilateral filtering: The depth map is refined using color edges from the RGB image to smooth flat regions while preserving boundaries.
- DenseCRF: A probabilistic post-processing step that further sharpens transitions between foreground and background.
These steps combine into what you could think of as a depth super-resolution pipeline, producing near-studio-grade RGB-D pairs from plain photos.
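To make the test-time-augmentation step concrete, here's a minimal sketch of the flip-and-average idea. predict_depth is a stand-in for whatever single-pass inference you already have (for example, the snippet above wrapped in a function); the name is mine, not part of the script:

import numpy as np
from PIL import Image

def depth_with_hflip_tta(img: Image.Image, predict_depth) -> np.ndarray:
    """Run inference on the image and its mirror, un-flip, and average."""
    d = predict_depth(img)
    d_flipped = predict_depth(img.transpose(Image.FLIP_LEFT_RIGHT))
    return 0.5 * (d + np.fliplr(d_flipped))

The full script folds the flipped predictions into the same merge used for the multi-scale results rather than averaging them separately.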
Under the Hood
- Load the pretrained models from Hugging Face (Intel/dpt-large, Intel/dpt-hybrid-midas).
- Pre-process the input image using DPTImageProcessor, resizing and normalising it.
- Run inference with torch.autocast for mixed precision, speeding up processing on GPUs.
- Interpolate the output back to the original image size using bicubic upsampling.
- Merge multi-scale and TTA predictions using edge-weighted averaging.
- Normalize the result into a 0–1 range for depth encoding.
- Optionally refine using OpenCV's guided filter or DenseCRF.
- Output two images (or videos):
  - <filename>_depth.png: a grayscale depth map
  - <filename>_rgbd_lr.png: a side-by-side RGB and depth image ready for Looking Glass import.
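The normalization step is worth a quick look on its own, since robust percentile clipping is what keeps a single very-near or very-far outlier from crushing the rest of the depth range. A minimal sketch, mirroring what the script's _normalize01 helper does:

import numpy as np

def normalize01(depth: np.ndarray) -> np.ndarray:
    """Clip to the 2nd/98th percentiles, then rescale to the 0-1 range."""
    lo, hi = np.percentile(depth, (2, 98))
    depth = np.clip(depth, lo, hi)
    return (depth - depth.min()) / max(1e-8, depth.max() - depth.min())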
For videos, each frame goes through the same pipeline, optionally skipping frames with --stride to balance speed and quality.
From Script to Hologram
Once generated, these RGB-D images or videos can be imported directly into Looking Glass Studio:
- Select the "RGBD Photo and Video" option.
- Set the Depth Position to Right (default).
- Adjust depth and focus.
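The full script is listed below. For reference, here's how I'd typically run it before importing (assuming it's saved as rgbd.py; the script name and example inputs are placeholders):

# Single photo, maximum quality: model ensemble + flip TTA + edge-aware refinement
python rgbd.py photo.jpg --ensemble --tta --refine

# Video: every 2nd frame, long edge capped at 1080 px, plus a depth-only MP4
python rgbd.py clip.mp4 --stride 2 --resize-long 1080 --save-depth-video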
#!/usr/bin/env python3
import argparse, sys, io, os
from pathlib import Path
import numpy as np
from PIL import Image, ImageOps

# ---------- Optional deps ----------
try:
    import cv2
    _HAS_CV2 = True
    _HAS_XIMGPROC = hasattr(cv2, "ximgproc") and hasattr(cv2.ximgproc, "guidedFilter")
except Exception:
    _HAS_CV2 = False
    _HAS_XIMGPROC = False

try:
    import pydensecrf.densecrf as dcrf
    from pydensecrf.utils import unary_from_softmax
    _HAS_DCRF = True
except Exception:
    _HAS_DCRF = False

# ---------- Torch / Transformers ----------
try:
    import torch
except Exception:
    print("ERROR: install torch/torchvision"); sys.exit(2)

try:
    from transformers import DPTForDepthEstimation, DPTImageProcessor
except ModuleNotFoundError:
    print("ERROR: install transformers timm accelerate"); sys.exit(3)
# ---------- Models ----------
MODEL_REPOS = {
    "dpt-large": "Intel/dpt-large",
    "dpt-hybrid-midas": "Intel/dpt-hybrid-midas",
}

def load_dpt(name, device):
    repo = MODEL_REPOS[name]
    proc = DPTImageProcessor.from_pretrained(repo)
    net = DPTForDepthEstimation.from_pretrained(repo).to(device)
    net.eval()
    print(f"Loaded: {repo}")
    return net, proc
# ---------- Core inference ----------
@torch.inference_mode()
def _infer_depth_logits(pil_rgb, model, processor, device, use_fp16=True):
    W, H = pil_rgb.size
    inputs = processor(images=pil_rgb, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    autocast_dtype = (
        torch.float16 if (device == "cuda" and use_fp16)
        else (torch.bfloat16 if device == "cuda" else torch.float32)
    )
    with torch.autocast(device_type=("cuda" if device == "cuda" else "cpu"), dtype=autocast_dtype):
        out = model(**inputs)
        pred = out.predicted_depth  # [1, h, w]
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=(H, W), mode="bicubic", align_corners=False
    ).squeeze(1).squeeze(0).float()
    return pred.cpu().numpy()
def _normalize01(arr, robust=True):
    if robust:
        p2, p98 = np.percentile(arr, (2, 98))
        arr = np.clip(arr, p2, p98)
    rng = arr.max() - arr.min()
    return (arr - arr.min()) / max(1e-8, rng)

def _rgb_edge_weights(pil_rgb, sigma=1.2):
    rgb = np.array(pil_rgb.convert("RGB"), dtype=np.uint8)
    if _HAS_CV2:
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        blur = cv2.GaussianBlur(gray, (0, 0), sigma)
        edges = cv2.Laplacian(blur, cv2.CV_32F, ksize=3)
        mag = np.abs(edges)
        mag = mag / (mag.max() + 1e-8)
        return 0.2 + 0.8 * mag
    gx = np.gradient(rgb.mean(axis=2), axis=1)
    gy = np.gradient(rgb.mean(axis=2), axis=0)
    mag = np.sqrt(gx * gx + gy * gy)
    return mag / (mag.max() + 1e-8)

def _detail_aware_merge(depth_list, weight_maps):
    weights = np.stack(weight_maps, axis=0)  # [N, H, W]
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    depths = np.stack(depth_list, axis=0)    # [N, H, W]
    return (weights * depths).sum(axis=0)
def infer_depth_max(pil_img, device, backends, tta, pyr_sizes, use_fp16=True):
    rgb = pil_img.convert("RGB")
    W, H = rgb.size
    scales = []
    for s in (pyr_sizes or []):
        s = int(s)
        if s >= 64:
            scales.append(s)
    if not scales:
        scales = [max(W, H)]
    edge_w = _rgb_edge_weights(rgb)
    preds, weights = [], []
    for name in backends:
        model, proc = _MODEL_CACHE[name]
        for s in scales:
            if W >= H:
                newW, newH = s, int(round(H * s / W))
            else:
                newH, newW = s, int(round(W * s / H))
            img_s = rgb.resize((newW, newH), Image.BICUBIC) if s != max(W, H) else rgb
            d = _infer_depth_logits(img_s, model, proc, device, use_fp16=use_fp16)
            if img_s.size != rgb.size:
                d = np.array(Image.fromarray(d).resize((W, H), Image.BICUBIC), dtype=np.float32)
            preds.append(d); weights.append(edge_w)
            if tta:
                img_f = img_s.transpose(Image.FLIP_LEFT_RIGHT)
                d_f = _infer_depth_logits(img_f, model, proc, device, use_fp16=use_fp16)
                if img_s.size != rgb.size:
                    d_f = np.array(Image.fromarray(d_f).resize((W, H), Image.BICUBIC), dtype=np.float32)
                preds.append(np.fliplr(d_f)); weights.append(edge_w)
    merged = _detail_aware_merge(preds, weights)
    return _normalize01(merged, robust=True)
# ---------- Refinement ----------
def refine_guided_or_bilateral(depth01, pil_rgb, strength=6):
    if not _HAS_CV2:
        return depth01
    guide = np.array(pil_rgb.convert("RGB"))
    if _HAS_XIMGPROC:
        gray = cv2.cvtColor(guide, cv2.COLOR_RGB2GRAY)
        radius = int(max(2, strength))
        eps = 1e-3
        d = depth01.astype(np.float32)
        out = cv2.ximgproc.guidedFilter(guide=gray, src=d, radius=radius, eps=eps, dDepth=-1)
        return np.clip(out, 0, 1).astype(np.float32)
    d8 = np.uint8(np.clip(depth01 * 255, 0, 255))
    out = cv2.bilateralFilter(d8, d=int(2 * strength + 1), sigmaColor=75, sigmaSpace=75)
    return out.astype(np.float32) / 255.0
def refine_densecrf(depth01, pil_rgb, iters=5):
    if not _HAS_DCRF:
        return depth01
    H, W = depth01.shape
    rgb = np.ascontiguousarray(np.array(pil_rgb.convert("RGB")))
    K = 64
    bins = np.linspace(0, 1, K + 1)
    # Quantize depth into K labels; clip so depth == 1.0 stays inside the last bin.
    lab = np.clip(np.digitize(depth01, bins) - 1, 0, K - 1)
    prob = np.zeros((K, H * W), dtype=np.float32)
    prob[lab.ravel(), np.arange(H * W)] = 1.0
    unary = unary_from_softmax(prob)
    d = dcrf.DenseCRF2D(W, H, K)
    d.setUnaryEnergy(unary)
    d.addPairwiseBilateral(sxy=20, srgb=5, rgbim=rgb, compat=3)
    d.addPairwiseGaussian(sxy=3, compat=1)
    Q = d.inference(iters)
    Q = np.array(Q).reshape(K, H, W)
    # Expected value over the K depth labels -> continuous depth in [0, 1].
    return (Q * np.linspace(0, 1, K)[:, None, None]).sum(axis=0).astype(np.float32)
# ---------- RGB-D writer ----------
def rgbd_lr_canvas(rgb_img_pil, depth01):
    W, H = rgb_img_pil.size
    rgb = rgb_img_pil.convert("RGB")
    depth8 = np.uint8(np.clip(depth01 * 255, 0, 255))
    d_rgb = Image.fromarray(depth8, "L").convert("RGB").resize((W, H), Image.NEAREST)
    canvas = Image.new("RGB", (W * 2, H))
    canvas.paste(rgb, (0, 0))
    canvas.paste(d_rgb, (W, 0))
    return canvas

def write_depth_frame_bgr(depth01, size):
    W, H = size
    d8 = np.uint8(np.clip(depth01 * 255, 0, 255))
    d_rgb = cv2.cvtColor(d8, cv2.COLOR_GRAY2BGR)
    return cv2.resize(d_rgb, (W, H), interpolation=cv2.INTER_NEAREST)
# ---------- Image pipeline ----------
def process_image(in_path: Path, args, device):
    pil = Image.open(in_path)
    pil = ImageOps.exif_transpose(pil)
    backends = build_backends(args)
    depth01 = infer_depth_max(pil, device, backends, args.tta, args.pyr, use_fp16=not args.no_fp16)
    if args.refine:
        depth01 = refine_guided_or_bilateral(depth01, pil, strength=args.refine_strength)
    if args.crf:
        depth01 = refine_densecrf(depth01, pil, iters=5)
    if args.invert:
        depth01 = 1.0 - depth01
    depth_png = in_path.with_name(in_path.stem + "_depth.png")
    Image.fromarray(np.uint8(np.clip(depth01 * 255, 0, 255)), "L").save(depth_png, optimize=True, compress_level=9)
    out_png = in_path.with_name(in_path.stem + "_rgbd_lr.png")
    rgbd = rgbd_lr_canvas(pil, depth01)
    rgbd.save(out_png, format="PNG", optimize=True)
    print(f"Saved: {depth_png}")
    print(f"Saved: {out_png}")
    print("Import: RGB-D Photo → Depth Position=Right; toggle inversion if needed.")
# ---------- Video pipeline ----------
def process_video(in_path: Path, args, device):
    if not _HAS_CV2:
        print("ERROR: OpenCV (cv2) is required for video processing."); sys.exit(1)
    cap = cv2.VideoCapture(str(in_path))
    if not cap.isOpened():
        print(f"ERROR: cannot open video: {in_path}"); sys.exit(1)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    src_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    src_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
    proc_long = args.resize_long if args.resize_long and args.resize_long > 0 else max(src_w, src_h)
    if src_w >= src_h:
        proc_w, proc_h = proc_long, int(round(src_h * proc_long / src_w))
    else:
        proc_h, proc_w = proc_long, int(round(src_w * proc_long / src_h))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    out_rgbd_path = in_path.with_name(in_path.stem + "_rgbd_lr.mp4")
    out_rgbd = cv2.VideoWriter(str(out_rgbd_path), fourcc, src_fps / max(1, args.stride), (proc_w * 2, proc_h))
    out_depth = None
    if args.save_depth_video:
        out_depth_path = in_path.with_name(in_path.stem + "_depth.mp4")
        out_depth = cv2.VideoWriter(str(out_depth_path), fourcc, src_fps / max(1, args.stride), (proc_w, proc_h))
    print(f"Video in: {src_w}x{src_h}@{src_fps:.3f} frames={total if total > 0 else '?'}")
    print(f"Processing at ~{proc_w}x{proc_h} stride={args.stride} write fps≈{src_fps / max(1, args.stride):.3f}")
    backends = build_backends(args)
    frame_idx = 0
    written = 0
    limit = args.max_frames if args.max_frames and args.max_frames > 0 else None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % args.stride != 0:
            frame_idx += 1
            continue
        rgb_bgr = cv2.resize(frame, (proc_w, proc_h), interpolation=cv2.INTER_AREA) if (frame.shape[1], frame.shape[0]) != (proc_w, proc_h) else frame
        rgb_pil = Image.fromarray(cv2.cvtColor(rgb_bgr, cv2.COLOR_BGR2RGB))
        depth01 = infer_depth_max(rgb_pil, device, backends, args.tta, args.pyr, use_fp16=not args.no_fp16)
        if args.refine:
            depth01 = refine_guided_or_bilateral(depth01, rgb_pil, strength=args.refine_strength)
        if args.crf:
            depth01 = refine_densecrf(depth01, rgb_pil, iters=5)
        if args.invert:
            depth01 = 1.0 - depth01
        depth_bgr = write_depth_frame_bgr(depth01, (proc_w, proc_h))
        rgbd_lr = np.concatenate([rgb_bgr, depth_bgr], axis=1)
        out_rgbd.write(rgbd_lr)
        if out_depth is not None:
            out_depth.write(depth_bgr)
        written += 1
        frame_idx += 1
        if limit and written >= limit:
            break
        if written % 30 == 0:
            print(f"Processed {written} frames...")
    cap.release()
    out_rgbd.release()
    if out_depth is not None:
        out_depth.release()
    print(f"Saved: {out_rgbd_path}")
    if args.save_depth_video:
        print(f"Saved: {out_depth_path}")
    print("Import: RGB-D Video → Depth Position=Right; toggle inversion if needed.")
# ---------- Backend builder / cache ----------
_MODEL_CACHE = {}

def build_backends(args):
    backends = ["dpt-large"] if args.backend == "dpt-large" else ["dpt-hybrid-midas"]
    if args.ensemble:
        backends = ["dpt-large", "dpt-hybrid-midas"] if args.backend == "dpt-large" else ["dpt-hybrid-midas", "dpt-large"]
    for b in backends:
        if b not in _MODEL_CACHE:
            _MODEL_CACHE[b] = load_dpt(b, args.device)
    return backends
# ---------- Utils ----------
def is_video_file(path: Path):
    return path.suffix.lower() in {".mp4", ".mov", ".m4v", ".avi", ".mkv", ".webm"}

def is_image_file(path: Path):
    return path.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}

def parse_pyr(pyr_str):
    if not pyr_str or not pyr_str.strip():
        return [384, 512, 640]
    try:
        return [int(x.strip()) for x in pyr_str.split(",") if x.strip()]
    except Exception:
        print("Warning: could not parse --pyr; using defaults 384,512,640")
        return [384, 512, 640]

def pick_file_dialog():
    try:
        import tkinter as tk
        from tkinter import filedialog
        root = tk.Tk(); root.withdraw()
        return filedialog.askopenfilename(
            title="Select an image or video",
            filetypes=[("Media", "*.jpg *.jpeg *.png *.bmp *.tif *.tiff *.mp4 *.mov *.m4v *.avi *.mkv *.webm")]
        ) or None
    except Exception:
        return None
# ---------- Main ----------
def main():
    ap = argparse.ArgumentParser("RGB-D LR for images & video")
    ap.add_argument("input", nargs="?", help="Path to input image or video (optional)")
    ap.add_argument("--backend", default="dpt-large",
                    choices=["dpt-large", "dpt-hybrid-midas"],
                    help="Primary depth model (default: dpt-large)")
    ap.add_argument("--ensemble", action="store_true",
                    help="Average dpt-large and dpt-hybrid-midas")
    ap.add_argument("--tta", action="store_true",
                    help="Enable test-time augmentation (hflip)")
    ap.add_argument("--pyr", default="384,512,640",
                    help="Comma-separated long-edge sizes (default: 384,512,640)")
    ap.add_argument("--refine", action="store_true",
                    help="Edge-aware refinement (guided/joint bilateral)")
    ap.add_argument("--refine-strength", type=int, default=6,
                    help="Refinement radius/strength (default: 6)")
    ap.add_argument("--crf", action="store_true",
                    help="Optional DenseCRF (needs pydensecrf)")
    ap.add_argument("--invert", action="store_true",
                    help="Invert near/far after normalization")
    ap.add_argument("--no-fp16", action="store_true",
                    help="Disable CUDA fp16 autocast (use if you see NaNs/artifacts)")
    # Video-specific
    ap.add_argument("--stride", type=int, default=1,
                    help="Process every Nth frame (default: 1 = every frame)")
    ap.add_argument("--resize-long", type=int, default=0,
                    help="Long-edge size for processing/output (0 = source size)")
    ap.add_argument("--max-frames", type=int, default=0,
                    help="Limit number of processed frames (0 = all)")
    ap.add_argument("--save-depth-video", action="store_true",
                    help="Also write a depth-only MP4 alongside RGB-D")
    args = ap.parse_args()

    # If no input provided, open a file dialog (double-click friendly)
    if not args.input:
        picked = pick_file_dialog()
        if not picked:
            print("No input selected. Provide a path or pick a file from the dialog.")
            sys.exit(1)
        args.input = picked
    in_path = Path(args.input).expanduser()
    if not in_path.exists():
        print(f"Not found: {in_path}"); sys.exit(1)

    # Device
    args.device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Device: {args.device}")

    # Parse pyramid
    args.pyr = parse_pyr(args.pyr)

    # Route
    if is_image_file(in_path):
        process_image(in_path, args, args.device)
    elif is_video_file(in_path):
        process_video(in_path, args, args.device)
    else:
        print("Unsupported input type. Use an image (.jpg/.png/...) or video (.mp4/.mov/...).")
        sys.exit(1)

if __name__ == "__main__":
    main()