Image Voice Memos — macOS App

Features

Everything you need to document photos with your voice.

🎙️

Instant Recording

Audio hardware is warmed up during a 1-second countdown, so recording starts the moment the countdown ends.

📝

Transcription

Automatic speech-to-text using Apple's Speech Framework. Supports German and English.

🌐

Translation

German transcriptions can optionally be translated to English, entirely on-device.

🖼️

RAW Support

Supports NEF, RAF, ORF, DNG and all common image formats. Fast thumbnails via CGImageSource.

📁

Folder-Based

No import needed: just pick a folder. Memos are stored as sidecar files alongside your photos.

🔒

Privacy First

Runs exclusively on Apple Silicon. No cloud, no external APIs. Sandboxed with Hardened Runtime.

Workflow

The complete flow from folder selection to finished transcription.

User Workflow

flowchart TD
    A["📁 Select Folder"] --> B["🖼️ Load Photo Grid"]
    B --> C["👆 Select Photo"]
    C --> D{"Voice memo\nexists?"}

    D -- No --> E["🎙️ Start Recording"]
    D -- Yes --> K["▶️ Playback / 🗑️ Delete"]

    E --> F["⏱️ 1s Countdown\n+ Hardware Warmup"]
    F --> G["🔴 Recording\n+ Waveform Display"]
    G --> H["⏹️ Stop"]
    H --> I["💾 PCM → AAC\nConversion"]
    I --> J["📝 Transcription\n(Speech Framework)"]
    J --> L{"Translation\nenabled?"}
    L -- Yes --> M["🌐 DE → EN\nTranslation"]
    L -- No --> N["✅ Done"]
    M --> N
    K --> C

    style A fill:#1a3a5c,stroke:#5e9eff,color:#e8e8e8
    style G fill:#5c1a1a,stroke:#ff5e5e,color:#e8e8e8
    style N fill:#1a5c2a,stroke:#34d058,color:#e8e8e8

State Machine — VoiceMemoState

stateDiagram-v2
    [*] --> noNote
    noNote --> countingDown : Record
    countingDown --> recording : Countdown = 0
    recording --> converting : Stop
    recording --> noNote : Cancel
    converting --> noteExists : AAC saved
    noteExists --> playing : Play
    noteExists --> countingDown : Re-Record
    noteExists --> noNote : Delete
    playing --> paused : Pause
    playing --> noteExists : Stop / End
    paused --> playing : Resume
    paused --> noteExists : Stop
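
The transitions in the diagram can be encoded as a pure function, which makes invalid transitions easy to reject and test. This is a minimal sketch, not the app's actual code: the state names come from the diagram, while the `VoiceMemoEvent` type, its case names, and the `next(on:)` method are hypothetical names chosen for illustration.

```swift
/// States as named in the diagram.
enum VoiceMemoState: Equatable {
    case noNote, countingDown, recording, converting, noteExists, playing, paused
}

/// Hypothetical events, derived from the transition labels in the diagram.
enum VoiceMemoEvent {
    case record, countdownFinished, stop, cancel, conversionFinished
    case play, reRecord, delete, pause, resume, playbackEnded
}

extension VoiceMemoState {
    /// Returns the next state, or nil if the diagram defines no such transition.
    func next(on event: VoiceMemoEvent) -> VoiceMemoState? {
        switch (self, event) {
        case (.noNote, .record):                 return .countingDown
        case (.countingDown, .countdownFinished): return .recording
        case (.recording, .stop):                return .converting
        case (.recording, .cancel):              return .noNote
        case (.converting, .conversionFinished): return .noteExists
        case (.noteExists, .play):               return .playing
        case (.noteExists, .reRecord):           return .countingDown
        case (.noteExists, .delete):             return .noNote
        case (.playing, .pause):                 return .paused
        case (.playing, .stop),
             (.playing, .playbackEnded):         return .noteExists
        case (.paused, .resume):                 return .playing
        case (.paused, .stop):                   return .noteExists
        default:                                 return nil
        }
    }
}
```

Returning `nil` for undefined pairs means a caller can ignore or log unexpected events, for example a Play request that arrives while a recording is still converting.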

Architecture

MVVM pattern with SwiftUI, @MainActor ViewModels, and specialized services.

System Overview

graph TD
    subgraph V["Views"]
        CV[ContentView] --> DV[DetailView]
        CV --> PG[PhotoGridView]
    end

    subgraph VM["ViewModels"]
        LVM[LibraryViewModel]
        VNVM[VoiceMemoViewModel]
    end

    subgraph S["Services"]
        ARS[AudioRecording]
        TS[Transcription]
        TRS[Translation]
        ILS[ImageLoading]
    end

    subgraph FS["File System"]
        M4A[".m4a"]
        TXT[".txt"]
        ENTXT[".en.txt"]
    end

    PG --> LVM
    DV --> VNVM
    LVM --> ILS
    VNVM --> ARS
    VNVM --> TS
    VNVM --> TRS
    ARS --> M4A
    TS --> TXT
    TRS --> ENTXT

    style V fill:#1a2a3a,stroke:#5e9eff,color:#e8e8e8
    style VM fill:#2a1a3a,stroke:#a05eff,color:#e8e8e8
    style S fill:#1a3a2a,stroke:#34d058,color:#e8e8e8
    style FS fill:#3a2a1a,stroke:#f0883e,color:#e8e8e8
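
One way the ViewModel layer in this overview might coordinate its services is sketched below. Only the name `VoiceMemoViewModel` and the `@MainActor` annotation come from this page. The `Transcribing` and `Translating` protocols, their method signatures, and the `process(audio:)` method are assumptions made for illustration, since the real service interfaces are not shown here.

```swift
import Foundation

// Hypothetical service interfaces; the app's real service APIs are not documented here.
protocol Transcribing: Sendable {
    func transcribe(audio: URL) async throws -> String
}

protocol Translating: Sendable {
    func translate(_ german: String) async throws -> String
}

@MainActor
final class VoiceMemoViewModel {
    private(set) var transcript: String?
    private(set) var translation: String?
    var translationEnabled = false

    private let transcriber: Transcribing
    private let translator: Translating

    init(transcriber: Transcribing, translator: Translating) {
        self.transcriber = transcriber
        self.translator = translator
    }

    /// Transcribes the audio file, then translates the result only if translation is enabled.
    func process(audio: URL) async throws {
        let text = try await transcriber.transcribe(audio: audio)
        transcript = text
        if translationEnabled {
            translation = try await translator.translate(text)
        } else {
            translation = nil
        }
    }
}
```

Injecting the services through protocols keeps the ViewModel testable with stubs, so the flow can be checked without a microphone or the Speech and Translation frameworks.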

Audio Pipeline

flowchart TD
    MIC["Microphone"] --> PCM["PCM 44.1kHz .caf"]
    PCM --> AAC["AAC .m4a"]
    AAC --> PLAY["Playback"]
    AAC --> SF["Speech Framework"]
    SF --> TXT["Transcript .txt"]
    TXT --> TR["Translation Framework"]
    TR --> EN["Translation .en.txt"]

    style MIC fill:#5c1a1a,stroke:#ff5e5e,color:#e8e8e8
    style AAC fill:#1a3a5c,stroke:#5e9eff,color:#e8e8e8
    style TXT fill:#1a5c2a,stroke:#34d058,color:#e8e8e8
    style EN fill:#3a2a1a,stroke:#f0883e,color:#e8e8e8

Tech Stack

Native macOS technologies — zero external dependencies.

SwiftUI

Declarative UI Framework

AVFoundation

Audio Recording & Playback

Speech Framework

On-Device Transcription

Translation

Local Translation DE→EN

CGImageSource

RAW Thumbnail Extraction

App Sandbox

Security-Scoped Bookmarks

XcodeGen

Project Generation via YAML

Hardened Runtime

Security Hardening
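
For the CGImageSource entry above, thumbnail extraction with ImageIO typically looks like the sketch below. The function name `makeThumbnail(for:maxPixelSize:)` and the default size of 256 pixels are assumptions for illustration. The ImageIO calls and option keys are standard, but the options the app actually uses are not stated on this page.

```swift
import Foundation
import ImageIO
import CoreGraphics

/// Returns a thumbnail for the image at `url`, or nil if the file cannot be decoded.
/// Hypothetical helper: name and default size are illustrative.
func makeThumbnail(for url: URL, maxPixelSize: Int = 256) -> CGImage? {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else {
        return nil
    }
    let options: [CFString: Any] = [
        // Use an embedded thumbnail if the file has one; otherwise create one from the image.
        kCGImageSourceCreateThumbnailFromImageIfAbsent: true,
        // Apply the orientation stored in the file's metadata.
        kCGImageSourceCreateThumbnailWithTransform: true,
        // Limit the longer edge of the thumbnail.
        kCGImageSourceThumbnailMaxPixelSize: maxPixelSize
    ]
    return CGImageSourceCreateThumbnailAtIndex(source, 0, options as CFDictionary)
}
```

Preferring the embedded preview is what tends to make RAW thumbnails fast, because the full sensor data does not have to be decoded.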

File Structure — Sidecar Pattern

File	Path	Description
Photo	`/Folder/photo.jpg`	Original image file
Voice Memo	`/Folder/.voicememos/photo.m4a`	AAC audio (converted from PCM)
Transcript	`/Folder/.voicememos/photo.txt`	Speech-to-text result
Translation	`/Folder/.voicememos/photo.en.txt`	English translation
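
The mapping in the table can be expressed as a small helper that derives all three sidecar locations from a photo URL. This is a sketch under the naming scheme shown above. The `SidecarPaths` type and the `sidecarPaths(forPhoto:)` function are hypothetical names, not taken from the app.

```swift
import Foundation

/// Locations of the sidecar files that belong to one photo.
struct SidecarPaths: Equatable {
    let audio: URL
    let transcript: URL
    let translation: URL
}

/// Derives sidecar locations following the table: a hidden `.voicememos`
/// folder next to the photo, with files named after the photo's base name.
func sidecarPaths(forPhoto photo: URL) -> SidecarPaths {
    let directory = photo
        .deletingLastPathComponent()
        .appendingPathComponent(".voicememos", isDirectory: true)
    let baseName = photo.deletingPathExtension().lastPathComponent
    return SidecarPaths(
        audio: directory.appendingPathComponent("\(baseName).m4a"),
        transcript: directory.appendingPathComponent("\(baseName).txt"),
        translation: directory.appendingPathComponent("\(baseName).en.txt")
    )
}
```

In this sketch, two photos that share a base name (for example `photo.jpg` and `photo.nef`) map to the same memo files. This page does not say whether the app behaves the same way.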