🏗️ System Design Hard 2025–26

Design an Analytics SDK

A comprehensive Android system design breakdown of a production-grade analytics SDK — covering event capture, in-memory buffering, offline persistence, batched transport, consent management, and every edge case that distinguishes principal engineers from mid-level candidates.

⏱ 45–60 min interview

📖 ~16 min read

🎯 Senior / Staff+

🔥 Asked at Google, Meta, Flipkart, Swiggy

Understanding the Problem

An analytics SDK is one of the most infrastructure-heavy Android problems an interviewer can pose. It sits at the intersection of background processing, persistence, network reliability, privacy law, and performance — all without being visible to the user. The key insight interviewers are testing: the SDK must be invisible to the host app's UX while being absolutely reliable about event delivery.

💡 Clarifying questions to ask upfront: Is this a first-party SDK (you own the server) or third-party (you can't change the ingestion API)? Do we need real-time streaming or can we batch? What's the guaranteed delivery SLA — at-least-once or at-most-once? Does the SDK need to be consent-aware (GDPR/CCPA)?

Functional Requirements

Core Requirements

Expose a track(eventName, properties) API that is non-blocking and safe to call from any thread, including the main thread
Persist events locally so they survive process death and can be delivered when network becomes available
Batch events and upload them to an ingestion endpoint — reducing battery and network overhead vs. per-event HTTP calls
Guarantee at-least-once delivery: an event that was accepted for tracking must reach the server eventually, even if the device is offline for days

Below the line (out of scope for this session):

Auto-capture of every View click (mention it: ViewTreeObserver + AccessibilityDelegate)
Real-time streaming via WebSocket (can be added as a "hot path" for critical events)
A/B experiment assignment (separate service, SDK would receive assignments via remote config)
Crash reporting (separate SDK concern — separate database table, separate flush path)

Non-Functional Requirements

Core Requirements

Zero ANR risk: track() must return in under 1ms — write to an in-memory channel and return immediately
Battery efficiency: No wake locks for analytics. Flush on natural triggers (network available, app background, periodic WorkManager) — never polling
Disk quota: Max 50 MB of persisted events. When full, drop oldest events — never crash the host app due to analytics storage
Idempotent delivery: Each batch carries a stable batchId. Server deduplicates on re-delivery after network timeout
Privacy-safe: PII fields declared in schema are hashed/stripped before leaving the device. SDK stops all activity immediately when consent is revoked

Below the line:

Sub-100ms upload latency (batching already makes this a non-issue at scale)
End-to-end encryption of event payload (TLS is sufficient; field-level encryption is a separate concern)

The Set Up

SDK Architecture Principles

Before touching code, four design principles constrain every decision in this SDK:

Caller-thread safety. track() is called from UI code. It must never block, never throw, and never touch the network. Hand the event to a background channel in the same call and return.
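Room as the durability guarantee. An event that reaches Room is safe. Everything between track() and Room is best-effort in-memory. Room is the durability boundary: events lost before it are gone, while events past it survive process death and are retried until the server acknowledges them.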
Stateless flush workers. The flush logic reads from Room, sends to network, deletes on success. It must be fully restartable — if the process is killed mid-flush, the next flush picks up from the same Room state with no corruption.
Backpressure by design. Every queue in the system has a maximum size. When full, the oldest data is evicted — analytics data is never worth OOM-killing the host app.

🔷Pattern: Bounded Channel as the Fast Path

The public track() function writes to a Channel<AnalyticsEvent>(capacity = 2000) with onBufferOverflow = DROP_OLDEST. It returns immediately, and the caller never suspends. A single background coroutine drains this channel and writes to Room. This gives you non-blocking ingestion with bounded memory: when the buffer is full, the oldest event is dropped rather than the caller being blocked. The hot path still allocates the event object itself, so keep property maps small.
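The overflow behaviour can be seen in isolation with a tiny channel. This is a minimal sketch, assuming kotlinx.coroutines is on the classpath; the helper name sendAndDrain and the capacity of 3 are illustrative only (the SDK above uses 2000).

```kotlin
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel

// Sends `count` integers into a small DROP_OLDEST channel without ever suspending,
// then drains whatever survived. Hypothetical helper, not part of the SDK.
fun sendAndDrain(capacity: Int, count: Int): List<Int> {
    val channel = Channel<Int>(capacity, BufferOverflow.DROP_OLDEST)
    for (i in 1..count) {
        // trySend never blocks; with DROP_OLDEST a full buffer evicts its oldest element.
        channel.trySend(i)
    }
    val drained = mutableListOf<Int>()
    while (true) {
        val next = channel.tryReceive().getOrNull() ?: break
        drained.add(next)
    }
    return drained
}

fun main() {
    // Five sends into a buffer of three: the two oldest values (1 and 2) are evicted.
    println(sendAndDrain(capacity = 3, count = 5)) // [3, 4, 5]
}
```

Note that DROP_OLDEST is load shedding rather than true backpressure: the producer is never slowed down, so the drop count is worth exposing as a diagnostic.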

Data Models

The SDK manages four entities. The EventStatus state machine is the most important — it tracks exactly where each event is in the delivery pipeline and makes the flush engine fully restartable.

AnalyticsEvent @Entity

Type	Field	Notes
String	eventId	@PrimaryKey, UUID
String	sessionId	FK → SessionRecord
String	name	e.g. "button_tap"
String	propertiesJson	JSON blob, scrubbed
Long	deviceTimestamp	currentTimeMillis
Long	elapsedMs	elapsedRealtime()
Int	schemaVersion	default = 1
EventStatus	status	PENDING → IN_FLIGHT → SENT
String?	batchId	set when IN_FLIGHT
Int	retryCount	max 5 → DEAD_LETTER

SessionRecord @Entity

Type	Field	Notes
String	sessionId	@PrimaryKey, UUID
Long	startTime	session start epoch ms
Long	lastActiveTime	updated on each event
String?	userId	null if anonymous
String	appVersion	BuildConfig.VERSION_NAME
String	osVersion	Build.VERSION.RELEASE
String	deviceModel	Build.MODEL
String	locale	Locale.getDefault()
Boolean	isBackground	from ProcessLifecycle

EventBatch (Network DTO)

Type	Field	Notes
String	batchId	UUID, idempotency key
String	appKey	SDK init token
List<EventPayload>	events	max 100 per batch
Long	sentAt	device time at send
Int	eventCount	for quick server validation

ConsentState (DataStore)

Type	Field	Notes
Boolean	analyticsEnabled	master kill-switch
Boolean	piiAllowed	hash vs. drop PII fields
Long	consentTimestamp	epoch ms of last change
String	consentVersion	legal policy version
Set<String>	allowedCategories	e.g. {"performance","ux"}

⚠️ Two timestamps per event, always. Store both deviceTimestamp (wall clock, System.currentTimeMillis()) and elapsedMs (SystemClock.elapsedRealtime()). Wall clock can drift or be changed by the user. elapsedRealtime() is monotonic since last boot, so the server can use it to reconstruct correct event ordering within a boot session even when the device clock is wrong. It restarts after a reboot, which is why the wall-clock value is still needed as a fallback.

High-Level Design

Component Overview

The SDK has five layers plus a cross-cutting Consent Guard. Each has a single responsibility and communicates only downward, which is what makes the flush engine safely restartable with no shared mutable state between layers.

📡

Capture Layer

Public track() API. Thread-safe. Writes to bounded Channel. Never blocks.

🧠

Buffer Layer

In-memory Channel<AnalyticsEvent>. Drains to Room on background coroutine. Survives config changes.

🗄️

Persistence Layer

Room database. PENDING → IN_FLIGHT → SENT state machine. Survives process death.

⚙️

Flush Engine

Triggered by 4 signals: batch size, timer, app background, network restored. Stateless and restartable.

🌐

Transport Layer

OkHttp with gzip. Retry with exponential backoff. Idempotency key per batch.

🔒

Consent Guard

Checks DataStore before every operation. Purges all data immediately on revocation.

[Figure: Analytics SDK, full layer architecture. Dashed arrows = flush triggers.]

Event Lifecycle — end to end

📊 Event Journey: track() → Server

Step	Layer	What happens	State after
1. track() called	Capture Layer	Consent checked against the cached in-memory flag (kept in sync with DataStore by a background coroutine). Properties scrubbed of PII. Event wrapped with sessionId + timestamps. Sent to Channel.	In Channel buffer
2. Channel drain	Buffer Layer	Background coroutine collects up to 50 events and writes them to Room in a single transaction.	Room: PENDING
3. Flush triggered	Flush Engine	Reads up to 100 PENDING events from Room. Assigns a batchId. Marks them IN_FLIGHT in a transaction.	Room: IN_FLIGHT
4. HTTP POST	Transport Layer	Sends gzip-compressed JSON batch. Sets the batchId header. Awaits response with 30s timeout.	Room: IN_FLIGHT
5a. 200 OK	Flush Engine	Deletes all events with this batchId from Room in one transaction.	Room: deleted ✓
5b. 5xx / timeout	Flush Engine	Resets events to PENDING. Increments retryCount. Schedules retry with exponential backoff.	Room: PENDING
5c. 4xx (perm fail)	Flush Engine	Marks events DEAD_LETTER. They are excluded from future flushes. Logged for diagnostics.	Room: DEAD_LETTER

🔷Pattern: IN_FLIGHT State Makes Redelivery Safe

Marking events IN_FLIGHT before sending, not after, is critical. If the process is killed while the HTTP request is in flight, the events stay IN_FLIGHT in Room. On next launch, a startup check resets all IN_FLIGHT events back to PENDING while keeping their batchId. The retry goes out under that same batchId, the server deduplicates on it, and redelivery is safe.

[Figure: Event status state machine. Room is the durability boundary; dashed line = startup recovery path.]
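The state machine can be written down as a transition table, which also makes illegal transitions easy to assert in tests. This is a sketch derived from the lifecycle table above; canTransition and recoverOnStartup are hypothetical helper names, and a real implementation would apply the same rules in SQL inside the DAO.

```kotlin
enum class EventStatus { PENDING, IN_FLIGHT, SENT, DEAD_LETTER }

// Allowed transitions, taken from the lifecycle table; everything else is rejected.
private val allowed: Map<EventStatus, Set<EventStatus>> = mapOf(
    EventStatus.PENDING to setOf(EventStatus.IN_FLIGHT),
    EventStatus.IN_FLIGHT to setOf(EventStatus.SENT, EventStatus.PENDING, EventStatus.DEAD_LETTER),
    EventStatus.SENT to emptySet(),
    EventStatus.DEAD_LETTER to emptySet()
)

fun canTransition(from: EventStatus, to: EventStatus): Boolean =
    to in allowed.getValue(from)

// Startup recovery: anything still IN_FLIGHT from a previous process goes back to PENDING.
fun recoverOnStartup(statuses: List<EventStatus>): List<EventStatus> =
    statuses.map { if (it == EventStatus.IN_FLIGHT) EventStatus.PENDING else it }

fun main() {
    println(canTransition(EventStatus.PENDING, EventStatus.IN_FLIGHT)) // true
    println(canTransition(EventStatus.SENT, EventStatus.PENDING))      // false
    println(recoverOnStartup(listOf(EventStatus.IN_FLIGHT, EventStatus.DEAD_LETTER)))
}
```

SENT and DEAD_LETTER are terminal here; since the lifecycle table deletes rows on success, SENT may never actually be stored.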

Flush Triggers

There are four independent flush triggers. They all funnel into the same FlushEngine.flush() coroutine, which is guarded by a Mutex, so concurrent triggers are fine: the second call exits immediately if a flush is already in progress.

⚡ Flush Trigger Matrix

Trigger	Mechanism	When fires	Why needed
Batch size	Room `count(PENDING)` ≥ 100	During active use (heavy event rate)	Prevents unbounded memory/disk growth
Timer	WorkManager `PeriodicWorkRequest` every 15 min	Even if app is in background or killed	Guarantees delivery even for low-traffic apps
App background	`ProcessLifecycleOwner` ON_STOP	User leaves app (home, task switch)	Flushes session-end events before process may be killed
Network restored	`ConnectivityManager.NetworkCallback`	Device comes back online after offline period	Delivers backlog immediately when connection returns

💡 WorkManager is the only trigger that survives process death. The other three require the app to be running. WorkManager's NetworkType.CONNECTED constraint means the timer-based flush only runs when connectivity is available, so there are no wasted wake-ups.
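The "second call exits immediately" behaviour is a single-flight guard. The sketch below shows the idea with an AtomicBoolean so that it runs without coroutines; this is a substitution for illustration, since the FlushEngine in this design uses Mutex.tryLock() for the same purpose. SingleFlight is a hypothetical class name.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

class SingleFlight {
    private val running = AtomicBoolean(false)

    // Runs `block` only if no other call is in progress; returns false when skipped.
    fun tryRun(block: () -> Unit): Boolean {
        if (!running.compareAndSet(false, true)) return false
        try {
            block()
        } finally {
            running.set(false)   // always release, even if the flush throws
        }
        return true
    }
}

fun main() {
    val guard = SingleFlight()
    var nestedRan = true
    val outerRan = guard.tryRun {
        // A second trigger arriving mid-flush is skipped rather than queued.
        nestedRan = guard.tryRun { }
    }
    println("outer=$outerRan nested=$nestedRan") // outer=true nested=false
}
```

Skipping rather than queueing is safe here because the running flush loops until no PENDING events remain, so events from the skipped trigger are normally picked up anyway.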

Low-Level Design

The SDK is composed of six classes with strict one-directional dependencies. Each class owns exactly one responsibility and is independently testable. The diagram below shows every class, its key methods, and how they wire together at runtime.

[Figure: Class dependency graph. Arrows show runtime dependencies; dashed purple = flush triggers that call FlushEngine.flush().]

0) SDK Initialization — wiring all classes together

All six classes are composed inside a single AnalyticsSdk container that is built once in Application.onCreate(). There is no service locator or global state beyond the Analytics object itself — every dependency is explicit and injected via the constructor, making each class independently unit-testable with fakes.

// AnalyticsSdk.kt — the composition root
class AnalyticsSdk private constructor(
    val analytics: Analytics
) {
    companion object {
        fun init(context: Context, config: AnalyticsConfig): AnalyticsSdk {
            // 1. persistence
            val db        = AnalyticsDatabase.build(context)
            val eventDao   = db.eventDao()
            val dataStore  = context.analyticsDataStore      // extension val

            // 2. privacy layer — built before anything can call track()
            val piiScrubber   = PiiScrubber(config.piiSchema)
            val consentGuard  = ConsentGuard(dataStore)

            // 3. session management
            val sessionMgr = SessionManager(db)

            // 4. transport + flush engine
            val okHttp     = OkHttpClient.Builder()
                .connectTimeout(15, TimeUnit.SECONDS)
                .readTimeout(30, TimeUnit.SECONDS)
                .build()
            val transport  = AnalyticsTransport(okHttp, config.endpointUrl, config.appKey)
            val flushEngine = FlushEngine(eventDao, transport, consentGuard)

            // 5. public entry point
            val analytics  = Analytics(consentGuard, piiScrubber, sessionMgr, eventDao, flushEngine)
            analytics.start() // launches drain coroutine + registers lifecycle observers

            // 6. schedule WorkManager periodic flush (idempotent — KEEP policy)
            AnalyticsFlushWorker.schedule(WorkManager.getInstance(context))

            return AnalyticsSdk(analytics)
        }
    }
}

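// Usage in Application.onCreate(). Note: init() as written builds the database, the OkHttp
// client and the WorkManager request eagerly; defer those to a background thread if cold start matters.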
class App : Application() {
    override fun onCreate() {
        super.onCreate()
        AnalyticsSdk.init(this, AnalyticsConfig(
            endpointUrl = "https://ingest.example.com",
            appKey      = BuildConfig.ANALYTICS_KEY,
            piiSchema   = PiiSchema(mapOf(
                "email"  to PiiPolicy.HASH,
                "phone"  to PiiPolicy.STRIP,
                "userId" to PiiPolicy.HASH
            ))
        ))
    }
}
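The PiiScrubber wired up above is not shown in this walkthrough. A minimal sketch of what it might look like follows, assuming the PiiSchema and PiiPolicy shapes implied by the usage above (a map of property name to HASH or STRIP); the unsalted SHA-256 hashing is an assumption for illustration.

```kotlin
import java.security.MessageDigest

enum class PiiPolicy { HASH, STRIP }

class PiiSchema(val policies: Map<String, PiiPolicy>)

class PiiScrubber(private val schema: PiiSchema) {
    // Returns a copy of `properties` with declared PII keys hashed or removed. CPU only, no I/O.
    fun scrub(properties: Map<String, Any?>): Map<String, Any?> {
        val out = LinkedHashMap<String, Any?>(properties.size)
        for ((key, value) in properties) {
            when (schema.policies[key]) {
                PiiPolicy.STRIP -> Unit                                   // drop the field entirely
                PiiPolicy.HASH -> out[key] = value?.let { sha256Hex(it.toString()) }
                null -> out[key] = value                                  // not declared as PII
            }
        }
        return out
    }

    private fun sha256Hex(input: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(input.toByteArray(Charsets.UTF_8))
            .joinToString("") { "%02x".format(it) }
}

fun main() {
    val scrubber = PiiScrubber(PiiSchema(mapOf("email" to PiiPolicy.HASH, "phone" to PiiPolicy.STRIP)))
    println(scrubber.scrub(mapOf("email" to "a@example.com", "phone" to "555-0100", "screen" to "home")))
}
```

Hashing of this kind is pseudonymisation rather than anonymisation: low-entropy values such as phone numbers can be recovered by brute force, so a keyed hash or STRIP is usually the safer default.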

1) Capture Layer — Analytics.track()

The capture layer is the only public surface of the SDK. It must return in under 1ms on the main thread — no I/O, no locks, no suspension. The sequence below shows exactly what happens on each track() call, and how the event moves from the caller into Room asynchronously.

[Figure: track() call sequence. The main thread returns after ~10µs; the Room write happens off-thread.]

// Analytics.kt: the public entry point
import android.os.SystemClock
import java.util.UUID
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.currentCoroutineContext
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

class Analytics(
    private val consentGuard:  ConsentGuard,
    private val piiScrubber:   PiiScrubber,
    private val sessionManager: SessionManager,
    private val eventDao:       AnalyticsEventDao,
    private val flushEngine:    FlushEngine
) {
    private val scope   = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    private val channel = Channel<RawEvent>(2000, BufferOverflow.DROP_OLDEST)

    // Called from the UI thread. No I/O, no locks; never suspends.
    fun track(name: String, properties: Map<String, Any?> = emptyMap()) {
        if (!consentGuard.isAnalyticsEnabled()) return    // @Volatile read, no I/O
        val event = RawEvent(
            eventId         = UUID.randomUUID().toString(),
            sessionId       = sessionManager.currentSessionId,  // @Volatile
            name            = name,
            properties      = piiScrubber.scrub(properties),    // CPU only
            deviceTimestamp = System.currentTimeMillis(),
            elapsedMs       = SystemClock.elapsedRealtime(),
            schemaVersion   = SCHEMA_VERSION
        )
        channel.trySend(event)  // non-blocking; DROP_OLDEST on overflow
    }

    // Started once in AnalyticsSdk.init(); runs for the lifetime of the app
    fun start() {
        scope.launch {
            sessionManager.initSession()  // create the first session before draining
            drainLoop()
        }
        scope.launch { observeConsent() }
        registerLifecycleObserver()       // ON_STOP -> flushEngine.requestFlush() (not shown)
        registerNetworkCallback()         // onAvailable -> flushEngine.requestFlush() (not shown)
    }

    private suspend fun drainLoop() {
        val buffer = mutableListOf<RawEvent>()
        while (currentCoroutineContext().isActive) {
            buffer.clear()
            buffer.add(channel.receive())                           // suspend until at least one event
            while (buffer.size < 50) {                              // then take up to 49 more without waiting
                buffer.add(channel.tryReceive().getOrNull() ?: break)
            }
            eventDao.insertAll(buffer.map { it.toEntity() })        // single txn
            sessionManager.touchSession()                           // refresh or rotate the session
            if (eventDao.countPending() >= 100) flushEngine.requestFlush()
        }
    }

    private suspend fun observeConsent() {
        consentGuard.consentFlow.collect { state ->
            if (!state.analyticsEnabled) flushEngine.cancelAndPurge()
        }
    }

    private companion object {
        const val SCHEMA_VERSION = 1  // matches the AnalyticsEvent.schemaVersion default
    }
}

2) SessionManager — 30-minute inactivity window

SessionManager maintains a single @Volatile field for the current session ID so track() can read it lock-free. The Mutex only activates on session creation or expiry — a rare path. Using elapsedRealtime() instead of wall clock means the timeout is immune to the user changing the device time.

class SessionManager(private val db: AnalyticsDatabase) {

    private val TIMEOUT_MS = 30 * 60_000L
    @Volatile var currentSessionId: String = ""  // read lock-free by track()
    private val mutex = Mutex()
    private var lastActiveElapsed = 0L

    // Called once on start() to eagerly create a session
    suspend fun initSession() = mutex.withLock { createNewSession() }

    // Called by the drain coroutine after each batch; checks expiry
    suspend fun touchSession() {
        val now = SystemClock.elapsedRealtime()
        if (now - lastActiveElapsed < TIMEOUT_MS) {
            lastActiveElapsed = now               // still active, just update
            // The persisted value is wall-clock (SessionRecord.lastActiveTime);
            // the monotonic value stays in memory because it is meaningless after a reboot.
            db.sessionDao().updateLastActive(currentSessionId, System.currentTimeMillis())
        } else {
            mutex.withLock { createNewSession() } // expired, start a new session
        }
    }

    private suspend fun createNewSession() {
        val wallClockNow = System.currentTimeMillis()
        val session = SessionRecord(
            sessionId      = UUID.randomUUID().toString(),
            startTime      = wallClockNow,
            lastActiveTime = wallClockNow,
            appVersion     = BuildConfig.VERSION_NAME,
            osVersion      = Build.VERSION.RELEASE,
            deviceModel    = Build.MODEL,
            locale         = Locale.getDefault().toLanguageTag()
        )
        db.sessionDao().insert(session)
        currentSessionId  = session.sessionId   // @Volatile write, visible to track()
        lastActiveElapsed = SystemClock.elapsedRealtime()
    }
}

3) FlushEngine — stateless, Mutex-guarded

The flush engine reads from Room, marks events IN_FLIGHT, sends to the server, and deletes on success. Every step is designed so it can be interrupted and replayed safely. The sequence below shows one complete flush cycle including the failure paths.

[Figure: flush() sequence. Three outcome paths: success (green), transient retry (amber), permanent failure (red).]

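import java.util.UUID
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

class FlushEngine(
    private val dao:          AnalyticsEventDao,
    private val transport:    AnalyticsTransport,
    private val consentGuard: ConsentGuard
) {
    private val scope      = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    private val flushMutex = Mutex()

    fun requestFlush() { scope.launch { flush() } }  // fire-and-forget

    suspend fun flush() {
        if (!consentGuard.isAnalyticsEnabled()) return
        if (!flushMutex.tryLock()) return           // already flushing
        try {
            // Recover from a crash mid-send. Assumption: the DAO keeps each row's batchId
            // when it resets the status, so a retry can reuse it.
            dao.resetInFlightToPending()
            while (consentGuard.isAnalyticsEnabled()) {
                // Assumption: queryPending returns previously attempted rows first, grouped by batchId.
                val pending = dao.queryPending(limit = 100)
                if (pending.isEmpty()) break
                // Reuse the id of an already-attempted batch so the server can deduplicate the
                // retry; only events that were never sent get a fresh id.
                val retryId = pending.first().batchId
                val events  = pending.filter { it.batchId == retryId }
                val batchId = retryId ?: UUID.randomUUID().toString()
                dao.markInFlight(events.map { it.eventId }, batchId)

                when (val result = transport.sendBatch(EventBatch(batchId, events))) {
                    is SendResult.Success          -> dao.deleteByBatchId(batchId)
                    is SendResult.TransientFailure  -> { dao.resetToRetry(batchId); break }
                    is SendResult.PermanentFailure  -> dao.markDeadLetter(batchId)
                }
            }
        } finally {
            flushMutex.unlock()
        }
    }

    // Called on consent revocation. Waits for any active flush to release the mutex,
    // then deletes every stored event.
    suspend fun cancelAndPurge() {
        flushMutex.withLock { dao.deleteAll() }
    }
}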

4) AnalyticsTransport — gzip + exponential backoff

AnalyticsTransport is deliberately stateless — it receives a batch, sends it, and returns a typed SendResult. All retry state (attempt count, backoff timing) lives inside this function call, not in any field. This means it can be swapped for a fake in tests with zero ceremony, and the flush engine doesn't need to know anything about HTTP.

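import com.google.gson.Gson
import java.io.ByteArrayOutputStream
import java.io.IOException
import java.util.zip.GZIPOutputStream
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

class AnalyticsTransport(
    private val httpClient:  OkHttpClient,
    private val endpointUrl: String,
    private val appKey:      String
) {
    private val gson     = Gson()                            // assumes Gson is the JSON library
    private val jsonType = "application/json".toMediaType()

    suspend fun sendBatch(batch: EventBatch): SendResult {
        val body = gzip(gson.toJson(batch).toByteArray())
        val request = Request.Builder()
            .url("$endpointUrl/v1/events")
            .header("X-Batch-ID",     batch.batchId)     // idempotency key
            .header("X-App-Key",      appKey)
            .header("Content-Encoding", "gzip")
            .header("X-SDK-Version",  BuildConfig.SDK_VERSION)
            .post(body.toRequestBody(jsonType))
            .build()

        repeat(3) { attempt ->
            try {
                // execute() blocks, so run it on the IO dispatcher; use {} closes the response body.
                val code = withContext(Dispatchers.IO) {
                    httpClient.newCall(request).execute().use { it.code }
                }
                return when {
                    code in 200..299           -> SendResult.Success
                    code == 408 || code == 429 -> SendResult.TransientFailure(code)  // timeout / rate limit
                    code in 400..499           -> SendResult.PermanentFailure(code)
                    else                       -> SendResult.TransientFailure(code)
                }
            } catch (e: IOException) {
                if (attempt < 2) delay(1_000L shl attempt)  // wait 1s, then 2s, before the next attempt
            }
        }
        return SendResult.TransientFailure(0)               // all three attempts hit an IOException
    }

    private fun gzip(bytes: ByteArray): ByteArray =
        ByteArrayOutputStream().also { out ->
            GZIPOutputStream(out).use { it.write(bytes) }
        }.toByteArray()
}

// Sealed class: exhaustive when() in FlushEngine, no integer codes leaking out
sealed class SendResult {
    object  Success                                      : SendResult()
    data class TransientFailure(val code: Int)            : SendResult()
    data class PermanentFailure(val code: Int)            : SendResult()
}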
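The backoff arithmetic used by the transport (1_000L shl attempt) can be pulled into a small pure function and unit-tested on its own. This sketch adds a cap and optional full jitter, neither of which is in the design above; both are labelled assumptions, as is the function name backoffDelayMs.

```kotlin
import kotlin.random.Random

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at capMs.
// Passing a Random enables "full jitter": a uniform value in [0, delay].
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 1_000L,
    capMs: Long = 30_000L,      // assumed cap, not from the article
    random: Random? = null      // assumed extension, not from the article
): Long {
    val exponential = (baseMs shl attempt.coerceIn(0, 20)).coerceAtMost(capMs)
    return if (random == null) exponential else random.nextLong(exponential + 1)
}

fun main() {
    println((0..5).map { backoffDelayMs(it) }) // [1000, 2000, 4000, 8000, 16000, 30000]
}
```

Jitter matters mostly when many devices come back online at once, for example after an outage, because it spreads their retries instead of letting them hit the ingestion endpoint in lockstep.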

6) AnalyticsFlushWorker — the only crash-safe trigger

WorkManager is the only flush trigger that survives process death. The three in-process triggers (batch size, lifecycle, network callback) all require the app to be running. AnalyticsFlushWorker is a thin wrapper — it holds no state and just delegates to FlushEngine.flush().

class AnalyticsFlushWorker(
    appContext: Context, params: WorkerParameters,
    private val flushEngine: FlushEngine        // injected via WorkerFactory
) : CoroutineWorker(appContext, params) {

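    override suspend fun doWork(): Result = try {
        flushEngine.flush()
        Result.success()
    } catch (e: kotlinx.coroutines.CancellationException) {
        throw e                                    // worker was stopped; never swallow cancellation
    } catch (e: Exception) {
        if (runAttemptCount < 3) Result.retry() else Result.failure()
    }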

    companion object {
        fun schedule(wm: WorkManager) = wm.enqueueUniquePeriodicWork(
            "analytics_flush",
            ExistingPeriodicWorkPolicy.KEEP,           // don't reset if scheduled
            PeriodicWorkRequestBuilder<AnalyticsFlushWorker>(15, TimeUnit.MINUTES)
                .setConstraints(Constraints(requiredNetworkType = NetworkType.CONNECTED))
                .setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.SECONDS)
                .addTag("analytics_flush")
                .build()
        )
    }
}

Edge Cases

These are the eight scenarios that separate senior engineers from staff — each one has caused real production incidents in analytics systems at scale.

💀 Process Killed Mid-Flush

The app is killed by the OS (OOM killer, user force-stop, or system resource pressure) while an HTTP request is in flight. The events are marked IN_FLIGHT in Room but no server acknowledgement was received. Without a recovery step, these events remain IN_FLIGHT indefinitely after the next app launch: never retried, never delivered.

Fix: On every SDK initialisation, run eventDao.resetInFlightToPending() on the SDK's background dispatcher before the first flush. It is a database write, so it should not run synchronously on the main thread in Application.onCreate(). This converts all orphaned IN_FLIGHT events back to PENDING so they're included in the next flush. Because the events keep their original batchId, the server can deduplicate if the original request actually completed, so at-least-once delivery is preserved with no double-counting.

🕐 Device Clock Skew

The user manually sets their device clock backward or forward — a common anti-pattern to exploit time-limited promotions. If you use System.currentTimeMillis() as the only timestamp, your event timeline will have events with impossible ordering: a button tap 3 days before the session started, or events timestamped in the future. This corrupts funnel analysis irreparably.

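Fix: Store two timestamps per event: deviceTimestamp (System.currentTimeMillis()) and elapsedMs (SystemClock.elapsedRealtime()). The elapsed time is monotonic since device boot and cannot be altered by the user. If each batch also carries the elapsedRealtime() value at send time (batchSentElapsed), the server computes the corrected wall-clock time as: serverReceivedAt - (batchSentElapsed - eventElapsedMs). This gives a consistent event timeline regardless of device clock manipulation, as long as the event and the send happened within the same boot. After a reboot the elapsed clock restarts, so for events recorded before the reboot the server has to fall back to deviceTimestamp.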
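As a worked example of the correction formula, here is a small server-side sketch. The function name correctedEventTime is hypothetical, and batchSentElapsed is assumed to be an extra field the SDK adds to each batch (the EventBatch model above only lists a wall-clock sentAt).

```kotlin
// All values are in milliseconds.
// Returns the corrected wall-clock time of the event, or null when the elapsed values
// cannot belong to the same boot (the caller should then fall back to deviceTimestamp).
fun correctedEventTime(serverReceivedAt: Long, batchSentElapsed: Long, eventElapsedMs: Long): Long? {
    if (eventElapsedMs > batchSentElapsed) return null   // device rebooted between event and send
    val ageAtSend = batchSentElapsed - eventElapsedMs    // how old the event was when sent
    return serverReceivedAt - ageAtSend
}

fun main() {
    val serverReceivedAt = 1_700_000_000_000L
    // Event recorded 5 minutes (300 000 ms) before the batch was sent.
    val corrected = correctedEventTime(
        serverReceivedAt,
        batchSentElapsed = 5_400_000L,
        eventElapsedMs = 5_100_000L
    )
    println(corrected) // 1699999700000
}
```

The check only catches reboots that make the event's elapsed value larger than the batch's; a batch-level boot identifier would be needed to detect every case.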

🔁 Duplicate Events on Network Retry

The SDK sends a batch, the server processes it successfully and returns 200, but the response is lost in transit (network drops after the ACK leaves the server). The SDK times out, treats it as a transient failure, resets the batch to PENDING, and resends. The server now processes the same 100 events twice — inflating pageview counts, double-counting purchases, corrupting cohort analysis.

Fix: Every batch carries a UUID batchId that is generated once and remains stable across retries. The server maintains a deduplication index keyed by (appKey, batchId) with a 7-day TTL. On receipt of a duplicate batchId, the server returns 200 immediately without reprocessing. This is a standard idempotency key pattern — the same mechanism used for payment deduplication.
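On the server side, the deduplication index might look like the sketch below. It is an in-memory stand-in for illustration only: a production service would more likely use a shared store with native key expiry. BatchDeduplicator and shouldProcess are hypothetical names.

```kotlin
class BatchDeduplicator(private val ttlMs: Long = 7L * 24 * 60 * 60 * 1000) {
    // (appKey, batchId) -> time the batch was first seen
    private val seen = HashMap<Pair<String, String>, Long>()

    // Returns true if the batch should be processed, false if it is a redelivery.
    @Synchronized
    fun shouldProcess(appKey: String, batchId: String, nowMs: Long): Boolean {
        seen.entries.removeAll { nowMs - it.value > ttlMs }   // expire keys older than the TTL
        val key = appKey to batchId
        if (key in seen) return false                         // duplicate: ack with 200, skip processing
        seen[key] = nowMs
        return true
    }
}

fun main() {
    val dedup = BatchDeduplicator()
    println(dedup.shouldProcess("app-1", "batch-42", nowMs = 0L))      // true
    println(dedup.shouldProcess("app-1", "batch-42", nowMs = 5_000L))  // false
}
```

The TTL only needs to outlast the longest period a client might hold a batch before retrying it; the 7 days used here follow the figure in the text.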

🌊 Event Flood — High-Frequency Tracking

A game or animation-heavy app calls track() 50–100 times per second (frame events, position updates, collision callbacks). The in-memory Channel fills up, Room write throughput is exceeded, the database grows at 1 MB/min, and the app lags due to I/O contention on the SQLite WAL.

Fix: Two mechanisms in concert. First, declare high-frequency event types as sampled in the SDK config, e.g. SamplingConfig("frame_rendered", rate = 0.01f) means only 1% of these events are tracked. The track() function checks the sampling rate before inserting into the Channel. Second, aggregate events client-side: instead of individual "scroll_pixel" events, emit "scroll_session" with totalPx and durationMs on scroll end. Together these can cut event volume by orders of magnitude with little loss of analytical signal.
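The sampling check is a one-line comparison against a random number. A minimal sketch, assuming a per-event-name rate map (the Sampler class is a hypothetical stand-in for however SamplingConfig is consumed):

```kotlin
import kotlin.random.Random

class Sampler(
    private val rates: Map<String, Float>,          // event name -> fraction to keep, 0.0..1.0
    private val random: Random = Random.Default
) {
    // Events without a configured rate are always kept.
    fun shouldTrack(eventName: String): Boolean {
        val rate = rates[eventName] ?: return true
        return random.nextFloat() < rate            // nextFloat() is in [0, 1)
    }
}

fun main() {
    val sampler = Sampler(mapOf("frame_rendered" to 0.01f), Random(42))
    val kept = (1..10_000).count { sampler.shouldTrack("frame_rendered") }
    println("kept $kept of 10000 frame events")      // roughly 1%
    println(sampler.shouldTrack("purchase"))         // true: not a sampled event type
}
```

If sampled counts are later scaled back up on the server, each event should carry the rate it was sampled at; otherwise a config change silently shifts the metric.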

💾 Offline for Days — Disk Quota Exceeded

A field worker uses the app in an area without connectivity for 72 hours. The SDK accumulates events in Room the whole time. After 3 days, an unbounded analytics database could reach 200 MB, which competes with the host app's own Room databases and the image cache and pushes the device toward the OS low-storage threshold, where background work may start to be restricted.

Fix: Enforce a hard disk quota of 50 MB (configurable at SDK init). Before every Room insert, check current database file size. When the quota is exceeded, drop the oldest 500 pending events to make room: DELETE FROM analytics_events WHERE eventId IN (SELECT eventId FROM analytics_events WHERE status = 'PENDING' ORDER BY deviceTimestamp ASC LIMIT 500). The subquery form is used because DELETE ... ORDER BY ... LIMIT only works when SQLite is built with an optional compile-time flag, which you cannot rely on across Android devices. Log a quota_exceeded meta-event (which itself is exempt from the quota) so the server knows data was trimmed. Deleting rows does not shrink the file by itself, so run VACUUM afterwards to reclaim SQLite file space, but sparingly, since it rewrites the whole database. This is intentional lossy behaviour: analytics data should never crash or degrade the host app.

🚫 Consent Revoked Mid-Session

The user opens the privacy settings and toggles off analytics while the app is running. Under GDPR (withdrawal of consent, Article 7(3), and the right to erasure, Article 17), the app must stop collecting data and delete stored personal data without undue delay. In practice that means now, not at the next app restart. Events currently in the Channel, in Room, and potentially in-flight to the server must all be purged.

Fix: The ConsentGuard exposes a DataStore Flow observed by the SDK. When it emits analyticsEnabled = false, the SDK executes a four-step purge: (1) set the cached consent flag so new track() calls drop immediately, (2) clear the in-memory Channel buffer, (3) delete all rows from Room in a single transaction, (4) cancel all WorkManager tasks tagged "analytics_flush". If an HTTP request is currently in flight, the response is ignored on arrival — the SDK checks consent before acting on any HTTP callback. Send a DELETE request to the server to purge server-side data if the ingestion API supports it.
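The four purge steps can be sketched without any Android dependency. Here a ConcurrentLinkedQueue stands in for the Channel, and the Room delete and WorkManager cancellation are passed in as lambdas; ConsentAwareBuffer and its method names are hypothetical.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

class ConsentAwareBuffer {
    @Volatile private var analyticsEnabled = true          // cached flag read by track()
    private val buffer = ConcurrentLinkedQueue<String>()   // stand-in for the in-memory Channel

    fun track(name: String) {
        if (!analyticsEnabled) return                       // step 1 takes effect here
        buffer.add(name)
    }

    // Called when the consent Flow emits analyticsEnabled = false.
    fun onConsentRevoked(deleteAllRows: () -> Unit, cancelScheduledWork: () -> Unit) {
        analyticsEnabled = false   // (1) new track() calls drop immediately
        buffer.clear()             // (2) discard events not yet persisted
        deleteAllRows()            // (3) e.g. a DAO deleteAll() in one transaction
        cancelScheduledWork()      // (4) e.g. cancel work tagged "analytics_flush"
    }

    fun pendingCount(): Int = buffer.size
}

fun main() {
    val sdk = ConsentAwareBuffer()
    sdk.track("screen_view")
    var rowsDeleted = false
    var workCancelled = false
    sdk.onConsentRevoked({ rowsDeleted = true }, { workCancelled = true })
    sdk.track("button_tap")                                  // dropped: consent is off
    println("${sdk.pendingCount()} $rowsDeleted $workCancelled") // 0 true true
}
```

The order matters: the flag is flipped first so that nothing new enters the buffer while it is being cleared. A track() call that read the flag just before the flip can still slip one event in, so the drain path should re-check consent before writing to Room.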

📦 Schema Migration — Old Events, New Fields

You ship SDK v2 which adds a required experimentGroup field to all events. Events from SDK v1 devices don't have this field. The server-side pipeline expects experimentGroup and throws a parse error, sending all v1 events to a dead-letter queue. Weeks of data from users who haven't updated are silently lost.

Fix: Every event carries a schemaVersion: Int field identifying the event schema the producing SDK used (it defaults to 1 and is bumped whenever the schema changes). The server applies a schema registry pattern: each schemaVersion maps to a transformer that fills in defaults for missing fields before the event enters the pipeline. This means old events are always processable. On the client, Room migration via Migration(1, 2) adds new columns with sensible defaults so the database doesn't break on upgrade. Never make new fields required in the server schema; always provide server-side defaults for absent fields.
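A server-side transformer chain for the schema registry pattern could look like this sketch. Events are modelled as plain maps, and the default value "unassigned" for experimentGroup is an assumption chosen for illustration; upgraders and upgradeToLatest are hypothetical names.

```kotlin
typealias Event = Map<String, Any?>

const val LATEST_SCHEMA = 2

// One upgrader per schema version: takes an event at version N and returns it at version N + 1.
val upgraders: Map<Int, (Event) -> Event> = mapOf(
    1 to { e: Event -> e + ("experimentGroup" to "unassigned") + ("schemaVersion" to 2) }
)

// Applies upgraders in sequence until the event reaches the latest schema.
fun upgradeToLatest(event: Event): Event {
    var current = event
    var version = (current["schemaVersion"] as? Int) ?: 1   // events without a version are treated as v1
    while (version < LATEST_SCHEMA) {
        val upgrade = upgraders[version] ?: error("no upgrader registered for schema v$version")
        current = upgrade(current)
        version++
    }
    return current
}

fun main() {
    val v1Event: Event = mapOf("name" to "button_tap", "schemaVersion" to 1)
    println(upgradeToLatest(v1Event))
    // {name=button_tap, schemaVersion=2, experimentGroup=unassigned}
}
```

Chaining single-step upgraders keeps each schema change small: adding v3 means writing one v2-to-v3 function, and v1 events reach v3 by passing through both.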

⛔ ANR from Synchronous Tracking

A junior engineer on the host app team calls Analytics.track() inside a RecyclerView.onBindViewHolder() loop for 200 items during fast fling. If the SDK performs any I/O — even a SharedPreferences read to check consent — the main thread stalls. At 200 items × even 1ms of I/O = 200ms lag, causing jank. A heavier SDK that hits Room synchronously causes an ANR (5s main thread block threshold on Android).

Fix: The track() function must be provably non-blocking. The consent check reads from a @Volatile in-memory field updated by a background coroutine, never from DataStore directly. The Channel trySend() is non-blocking by design. No Room access, no SharedPreferences, no synchronisation primitives on the main-thread path. In debug builds, enable a StrictMode thread policy (detectDiskReads(), detectDiskWrites(), detectNetwork()) with a loud penalty so that any regression that adds I/O to the hot path surfaces immediately.

What Interviewers Expect at Each Level

Mid-Level

Can describe batching to reduce network calls
Knows WorkManager for background jobs
Aware of offline persistence with Room
Understands basic retry logic on failure
Mentions threading — doesn't block main thread
Can define basic Event and Session models

Senior

Designs the full state machine: PENDING → IN_FLIGHT → SENT, with DEAD_LETTER for permanent failures
Articulates why IN_FLIGHT state prevents event loss on process death
Uses bounded Channel with DROP_OLDEST for back-pressure
Separates flush triggers (4 signals) and explains each tradeoff
Handles dual timestamps for clock skew resilience
Designs PII scrubbing at capture time, not flush time
Discusses idempotency keys and server-side deduplication

Staff / Principal

Designs the SDK as a reusable library with a clean public API surface
Discusses schema versioning and backwards-compatible server pipelines
Covers sampling strategies for high-frequency event sources
Addresses GDPR/CCPA consent with immediate purge guarantees
Designs for multi-process apps (separate DB connection per process)
Discusses cold-start telemetry: SDK must not delay Application.onCreate()
Considers SDK size budget: analytics shouldn't add more than 200 KB to APK
Proposes a testing strategy: fake transport layer, Room in-memory, StrictMode validation

Interview Q&A

These cover the questions most commonly asked at Google, Meta, Flipkart, and Swiggy for senior Android roles.

Q1 Easy Why should track() never touch Room or the network directly?

track() is called from the UI thread: RecyclerView bind, button click handlers, Fragment lifecycle callbacks. Any I/O on the main thread risks an ANR (Android raises one when the main thread fails to respond to input for about 5 seconds). Even a 5ms SharedPreferences read repeated in a RecyclerView loop of 200 items adds up to 1 second of jank. The correct pattern is to write to an in-memory, non-blocking structure, a Channel with trySend(), and let a background coroutine drain it to Room asynchronously. The consent check must also be a @Volatile in-memory read, never a DataStore or SharedPreferences read on the call path.

Q2 Easy What is the difference between at-least-once and at-most-once delivery? Which does an analytics SDK need?

At-most-once: an event is sent once and if lost (network failure), it is not retried. Zero duplicates guaranteed, but events can be dropped. At-least-once: an event is retried until the server acknowledges it. Events are guaranteed to arrive but may arrive more than once. Analytics SDKs need at-least-once delivery — a missing purchase event or funnel drop-off corrupts your metrics. Duplicates are handled server-side via idempotency keys (batchId). Exactly-once delivery would require distributed transactions which are impractical on a mobile client.

Q3 Medium Why do you need an IN_FLIGHT state? Why not just PENDING and SENT?

Without IN_FLIGHT, the flush engine has no way to know which events have already been handed to the network. If the process is killed during an HTTP request, those events stay PENDING and are re-selected by the next flush, but the original request may have already been received by the server. Because nothing records which batch they belonged to, they would be regrouped under a new batchId and the server could not recognise them as duplicates. IN_FLIGHT marks a "claim" on a set of events for a specific batchId. On startup, any orphaned IN_FLIGHT events (from a crashed previous session) are reset to PENDING and re-sent with the same batchId, allowing the server to deduplicate safely.

Q4 Medium How do you handle an app that generates 100 events per second?

Two strategies in combination: sampling and client-side aggregation. Sampling means the SDK only tracks 1 in N events for declared high-frequency event types, configured at SDK init via a SamplingConfig. For positional/animation events, client-side aggregation is better: instead of tracking every frame, track a single "scroll_session_ended" event with aggregated metrics (totalScrollPx, duration, direction changes). The bounded Channel(capacity=2000, DROP_OLDEST) also acts as a hard load-shedding limit: if the drain to Room stalled completely at 100 events/second, the channel would fill in 20 seconds (2000 ÷ 100) and then drop the oldest events, self-regulating the pipeline without any special logic.
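
The drop-oldest behaviour is easy to see in isolation. The sketch below is a standard-library stand-in for the bounded Channel, not the coroutine Channel itself; DropOldestBuffer and its dropped counter are hypothetical names introduced for illustration.

```kotlin
// Standard-library stand-in for a bounded, drop-oldest event buffer.
class DropOldestBuffer<T>(private val capacity: Int) {
    private val items = ArrayDeque<T>()

    // Number of events evicted so far; useful as an SDK health metric.
    var dropped = 0L
        private set

    init {
        require(capacity > 0) { "capacity must be positive" }
    }

    // Never blocks: when full, the oldest element is evicted to make room.
    @Synchronized
    fun offer(item: T) {
        if (items.size == capacity) {
            items.removeFirst()
            dropped++
        }
        items.addLast(item)
    }

    // Removes and returns up to `max` elements in insertion order.
    @Synchronized
    fun drain(max: Int): List<T> {
        val out = ArrayList<T>(minOf(max, items.size))
        while (out.size < max && items.isNotEmpty()) out += items.removeFirst()
        return out
    }
}
```

With a capacity of 3, offering the values 1 through 5 evicts 1 and 2, and a subsequent drain returns 3, 4, 5. Exposing the dropped count lets the SDK report how much data the limit is shedding.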

Q5 Medium How do you guarantee the SDK doesn't delay Application.onCreate() cold-start time?

SDK initialisation in Application.onCreate() must be sub-millisecond. Avoid any I/O, building the Room database, or OkHttp client creation on the main thread. Use lazy initialisation: the AnalyticsDatabase and OkHttpClient are built on first access inside a background coroutine, not at init time. The public Analytics object is created synchronously (just allocating the Channel and CoroutineScope), but the heavy lifting is deferred. A ContentProvider (like Firebase's FirebaseInitProvider) can auto-initialise the SDK without requiring a call in Application.onCreate(), but its onCreate() also runs on the main thread during startup, so it must be verified with Systrace (or Perfetto) to confirm it doesn't add to TTID.

Q6 Medium How does the SDK behave when the user is offline for 72 hours?

Events accumulate in Room with status PENDING. The periodic WorkManager request stays scheduled at its 15-minute interval, but WorkManager does not start the worker while its network constraint (NetworkType.CONNECTED) is unmet, so no battery is wasted on doomed attempts. When connectivity is restored, two things happen: the NetworkCallback triggers an immediate flush, and the constrained work becomes eligible to run. The flush engine processes batches of 100 events, deleting each confirmed batch before reading the next. For 72 hours of events, the flush may take several minutes, all in the background. A disk quota check prevents the 72-hour backlog from ever exceeding 50 MB by evicting the oldest events if the threshold is reached.

Q7 Medium Where do you scrub PII — at track() time or at flush time?

Always at track() time — before the event enters the Channel. If you scrub at flush time, PII data is stored unredacted in Room between capture and flush. Room databases are accessible via ADB backup, crash reporting tools (if you log Room contents), and if the device is compromised. Scrubbing at the capture boundary ensures PII never reaches any persistent store. The scrubbing function is synchronous and CPU-only (SHA-256 hash or null replacement) — it adds approximately 50–100µs, which is still safely non-blocking on the main thread.
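
A scrubbing step of this kind might look like the sketch below. The key lists and the scrub() and sha256Hex() helpers are assumptions for illustration; a real SDK would take the sensitive keys from its configuration. Note that an unsalted hash of a low-entropy value such as a phone number can be reversed by brute force, so hashing alone may not count as anonymisation.

```kotlin
import java.security.MessageDigest

// Hypothetical key lists; a real SDK would read these from its configuration.
val HASHED_KEYS = setOf("email", "phone")
val DROPPED_KEYS = setOf("password", "card_number")

// Hex-encoded SHA-256 of the UTF-8 bytes of `value`.
fun sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// CPU-only and synchronous, so it can run inside track() before the event
// is handed to any queue or persistent store.
fun scrub(properties: Map<String, Any?>): Map<String, Any?> =
    properties
        .filterKeys { it !in DROPPED_KEYS }
        .mapValues { (key, value) ->
            if (key in HASHED_KEYS && value != null) sha256Hex(value.toString()) else value
        }
```

Calling scrub() on the properties map before the event is enqueued means neither the in-memory buffer nor Room ever holds the raw values.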

Q8 Medium How do you handle a GDPR consent revocation while a flush is in progress?

The consent revocation and the flush run concurrently on different coroutines, so you need a cooperative cancellation approach. The ConsentGuard sets a @Volatile analyticsEnabled = false immediately, then launches a coroutine to purge Room. The flush engine checks consentGuard.isAnalyticsEnabled() at the start of each batch loop iteration, so it exits after the current batch (which may already be in flight) completes. The in-flight batch response is ignored if consent is revoked before it arrives. This is a best-effort approach that stops new uploads within one batch; GDPR's erasure obligation is worded as "without undue delay" rather than as instantaneous. For stricter compliance, cancel the OkHttp call directly using Call.cancel().
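
The cooperative check can be sketched without any Android dependencies. ConsentGuard follows the description above; flushAll() and its send parameter are hypothetical stand-ins for the flush engine and the transport, and the Room purge is left out.

```kotlin
class ConsentGuard {
    @Volatile
    private var analyticsEnabled = true

    fun isAnalyticsEnabled(): Boolean = analyticsEnabled

    // Visible to all threads immediately; the Room purge would be launched separately.
    fun revoke() {
        analyticsEnabled = false
    }
}

// `send` is a hypothetical transport call that returns true on a server ack.
// Returns the number of batches whose ack was accepted.
fun flushAll(
    batches: List<List<String>>,
    consent: ConsentGuard,
    send: (List<String>) -> Boolean,
): Int {
    var confirmed = 0
    for (batch in batches) {
        // Checked before every batch, so revocation stops the loop within one batch.
        if (!consent.isAnalyticsEnabled()) break
        val acked = send(batch)
        // The response is ignored if consent was revoked while the request was in flight.
        if (acked && consent.isAnalyticsEnabled()) confirmed++
    }
    return confirmed
}
```

If consent is revoked during the second of three uploads, the second response is discarded and the third batch is never sent.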

Q9 Hard How do you design the SDK for a multi-process Android app?

Multi-process apps (a main process plus a :background process for a service, for example) have a critical constraint: a Room database should not be written from two processes at once. Each process opens its own database instance with its own connections and caches, change notifications do not cross the process boundary unless enableMultiInstanceInvalidation() is set, and concurrent writers contend on SQLite's file lock. Two solutions: (1) Single writer process: only the main process writes to Room; the background process sends events to the main process via AIDL or ContentProvider-backed IPC, and the main process inserts them. (2) Separate databases per process with a periodic merge operation: simpler, but each process mints its own session IDs, which makes cross-process funnel analysis harder. Option 1 is preferred for correctness. The ContentProvider approach (exposing an insert() URI) is the cleanest Android-native IPC mechanism for this use case.

Q10 Hard How would you design a "critical event" fast path that bypasses batching?

Some events — payment confirmations, crash signals, security events — should be sent immediately rather than waiting for the next batch. Design a dual-path system: track() for standard events (goes through Channel → Room → batch flush) and trackImmediate() for critical events (writes to Room then immediately triggers a single-event flush coroutine, bypassing the batch size and timer conditions). The trackImmediate() path still writes to Room first to preserve durability — it just skips the wait. On the server, separate the ingestion endpoint for critical events so they can be prioritised in the processing pipeline without being blocked by high-volume standard event batches.

Q11 Hard How do you test an analytics SDK reliably?

Testing requires controlled substitution at every layer. Use a fake transport that records batches without making network calls, and verify event names, property values, and batch sizing in unit tests. Use a Room in-memory database (Room.inMemoryDatabaseBuilder()) for fast integration tests without disk I/O. For the flush engine, inject a clock (or use the virtual time of a coroutine test dispatcher) instead of reading SystemClock directly, so timer-triggered flushes can be tested without actually waiting 15 minutes. Use StrictMode in instrumented tests with detectDiskReads() and detectDiskWrites() on the main thread; this catches any regression that adds I/O to the track() hot path. For WorkManager, use TestWorkerBuilder (or TestListenableWorkerBuilder for a CoroutineWorker) to run the flush worker synchronously in tests. Add a deliberate process-kill simulation test: insert events, mark them IN_FLIGHT, restart the SDK, verify they become PENDING again.

Q12 Easy Why use gzip compression on event payloads?

Analytics event JSON is highly repetitive: field names like "eventId", "sessionId", "schemaVersion" are repeated in every event in a batch of 100. gzip achieves 5–8× compression on typical event payloads, reducing a 40 KB batch to ~6 KB. This matters for three reasons: (1) faster upload on slow connections, (2) lower mobile data usage (important in price-sensitive markets), and (3) lower ingress bandwidth cost on the server side at scale (millions of devices × hundreds of events/day = significant bandwidth). OkHttp decompresses gzip responses transparently but does not compress request bodies on its own; a small interceptor (the GzipRequestInterceptor recipe from OkHttp's documentation) wraps the body and sets Content-Encoding: gzip, and the ingestion endpoint must accept that encoding. A few lines of code, significant real-world impact.
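
The effect is easy to measure with the JDK's own gzip streams. The sketch below compresses a synthetic batch; sampleBatchJson() and its field values are made up for illustration, and the ratio on real payloads will depend on how varied the property values are.

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Compresses `bytes` with gzip; closing the stream writes the gzip trailer.
fun gzip(bytes: ByteArray): ByteArray {
    val out = ByteArrayOutputStream()
    GZIPOutputStream(out).use { it.write(bytes) }
    return out.toByteArray()
}

// Inverse of gzip(), used here only to verify the round trip.
fun gunzip(bytes: ByteArray): ByteArray =
    GZIPInputStream(bytes.inputStream()).use { it.readBytes() }

// Builds a synthetic JSON array of `count` events with repeated field names.
fun sampleBatchJson(count: Int): String =
    (0 until count).joinToString(",", "[", "]") { i ->
        """{"eventId":"evt-$i","sessionId":"session-1","schemaVersion":1,"name":"button_tap"}"""
    }
```

Comparing sampleBatchJson(100).toByteArray().size with the size of its gzip() output gives a quick sanity check before wiring compression into the transport.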

Q13 Medium How do you define a session? What edge cases exist in session management?

A session is a contiguous period of user engagement, typically defined as activity within a rolling 30-minute window. Session edge cases: (1) App backgrounded and resumed — if the app was in the background for less than 30 minutes, it should continue the existing session, not start a new one. Use ProcessLifecycleOwner to track background time via elapsedRealtime(). (2) Device restart — elapsedRealtime() resets to zero on reboot, so always start a new session after a device restart regardless of the timestamp comparison. (3) User identity change — if Analytics.identify(userId) is called mid-session (user logs in), consider starting a new session to avoid attributing pre-login events to the identified user. (4) Clock change during session — don't use wall clock for timeout calculation; always use elapsedRealtime().
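
These rules can be sketched as a small state holder with an injected monotonic clock. SessionManager and its method names are hypothetical; elapsedMs stands in for SystemClock.elapsedRealtime(), and the callbacks would be driven by ProcessLifecycleOwner in a real app.

```kotlin
import java.util.UUID

class SessionManager(
    private val elapsedMs: () -> Long,                 // stand-in for elapsedRealtime()
    private val timeoutMs: Long = 30 * 60 * 1000L,     // 30-minute rolling window
) {
    private var sessionId: String = UUID.randomUUID().toString()
    private var backgroundedAt: Long? = null

    @Synchronized
    fun onBackground() {
        backgroundedAt = elapsedMs()
    }

    // Returns the session to use after the app comes back to the foreground.
    @Synchronized
    fun onForeground(): String {
        val since = backgroundedAt
        backgroundedAt = null
        if (since != null) {
            val now = elapsedMs()
            // A reading smaller than the stored one means the monotonic clock was
            // reset (reboot). This only arises if backgroundedAt is persisted.
            if (now < since || now - since >= timeoutMs) rotate()
        }
        return sessionId
    }

    // Called when identify(userId) changes the user mid-session.
    @Synchronized
    fun onIdentityChanged(): String {
        rotate()
        return sessionId
    }

    private fun rotate() {
        sessionId = UUID.randomUUID().toString()
    }
}
```

Injecting the clock keeps wall-clock changes out of the timeout calculation and makes the 30-minute boundary testable without waiting.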

Q14 Medium How do you handle schema changes between SDK versions?

Every event carries a schemaVersion integer. The server maintains a schema registry mapping each version to a transformer function. When SDK v2 adds a new required field, the server-side transformer for v1 events fills in a sensible default (e.g., experimentGroup = "unknown"). Room database migrations use Migration(1, 2) { database -> database.execSQL("ALTER TABLE analytics_events ADD COLUMN schemaVersion INTEGER NOT NULL DEFAULT 1") } to add new columns without dropping old events. The key principle: never make a field required on the server that didn't exist in older SDK versions. Always deploy server-side schema changes before releasing the SDK version that produces them (expand-then-migrate pattern).
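
The server-side registry can be pictured as a chain of per-version upgraders. The sketch below is written in Kotlin for consistency with the rest of the article; the upgraders map, LATEST_SCHEMA_VERSION and upgradeToLatest() are hypothetical, and the ingestion service may well be written in another language.

```kotlin
val LATEST_SCHEMA_VERSION = 2

// Hypothetical registry: the entry for version N upgrades an event to version N + 1.
val upgraders: Map<Int, (Map<String, Any?>) -> Map<String, Any?>> = mapOf(
    // v1 events predate experimentGroup, so a default is filled in.
    1 to { e: Map<String, Any?> -> e + ("experimentGroup" to "unknown") },
)

// Applies upgraders one version at a time until the event is at the latest schema.
fun upgradeToLatest(event: Map<String, Any?>): Map<String, Any?> {
    var current = event
    // Events without a version are treated as v1 (an assumption of this sketch).
    var version = (current["schemaVersion"] as? Int) ?: 1
    while (version < LATEST_SCHEMA_VERSION) {
        val step = upgraders[version] ?: error("no upgrader for schemaVersion $version")
        version++
        current = step(current) + ("schemaVersion" to version)
    }
    return current
}
```

Because every step moves exactly one version forward, adding v3 later only requires registering a 2-to-3 upgrader; events from old SDKs still pass through the whole chain.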

Q15 Hard How would you implement auto-capture of screen views without manual track() calls?

Register an Application.ActivityLifecycleCallbacks in the SDK initialisation. In onActivityResumed(activity), automatically call track("screen_view", mapOf("screen_name" to activity.localClassName)). For Fragment tracking, register a FragmentManager.FragmentLifecycleCallbacks on every Activity's FragmentManager from the lifecycle callback, passing recursive = true so child fragments are covered. The screen name can be derived from the Fragment class name, or from a custom annotation (@ScreenName("Home Feed")) if more readable names are needed. For Jetpack Compose, where a single Activity usually hosts every screen, register a NavController.OnDestinationChangedListener via addOnDestinationChangedListener() and report the destination route. The host app opts in to auto-capture at SDK init; it should be off by default since some apps have confidential screen names.

Q16 Hard What happens to events tracked during the very first cold start before the SDK is initialised?

This is a classic bootstrap problem. Some events, such as "app_opened" and "splash_screen_shown", are logically the first events in any session, but they happen before Analytics.init() is called in Application.onCreate(). Solution: implement a pre-init buffer. Before initialisation, track() calls write to a simple in-memory list (not a Channel, not Room, just an ArrayList guarded by a lock, since track() can be called from any thread). When init() is called, the pre-init buffer is drained into the real Channel first, preserving event ordering. This buffer has a hard cap of 50 events and is discarded if init() is never called (e.g., if the SDK is not configured). The timestamps are captured at track() call time, so they're accurate even if replayed after init.
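
A pre-init buffer along these lines could be sketched as follows. PreInitAnalytics and BufferedEvent are hypothetical names, and the sink lambda stands in for the real Channel; the point is the ordering guarantee, not the exact API.

```kotlin
data class BufferedEvent(val name: String, val timestampMs: Long)

class PreInitAnalytics(private val clock: () -> Long = System::currentTimeMillis) {
    private val lock = Any()
    private val preInit = ArrayList<BufferedEvent>()
    private var sink: ((BufferedEvent) -> Unit)? = null

    fun track(name: String) {
        // Timestamp is taken at call time, so a later replay stays accurate.
        val event = BufferedEvent(name, clock())
        val target = synchronized(lock) {
            val current = sink
            // Before init: buffer up to the cap and silently drop the rest.
            if (current == null && preInit.size < PRE_INIT_CAP) preInit += event
            current
        }
        target?.invoke(event)
    }

    // `realSink` stands in for sending into the SDK's real Channel.
    fun init(realSink: (BufferedEvent) -> Unit) {
        synchronized(lock) {
            preInit.forEach(realSink)   // replay buffered events first, in order
            preInit.clear()
            sink = realSink
        }
    }

    companion object {
        const val PRE_INIT_CAP = 50
    }
}
```

Replaying inside the lock ensures that an event tracked concurrently with init() cannot overtake the buffered ones.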

Q17 Easy Why use WorkManager instead of AlarmManager or a Foreground Service for periodic flush?

AlarmManager is built for time-based wake-ups, not deferrable work: it has no notion of network constraints, its alarms are cleared on reboot unless the app re-registers them, standard alarms are deferred in Doze, and the exact variants that bypass Doze cost battery and need a special permission on Android 12+. Foreground Service requires a persistent notification, which is unacceptable UX for an analytics SDK that should be invisible. WorkManager is designed exactly for deferrable background work: it respects Doze mode and App Standby, coalesces work to minimise wake-ups, uses JobScheduler on API 23+ (battery-optimal), survives app restarts, and supports constraint-based execution (only run when network is available). For analytics, the 15-minute minimum interval is fine because there's no requirement to flush in real time. WorkManager is the right default for a production SDK.

Q18 Hard How would you implement remote sampling configuration — changing sampling rates without an app update?

The SDK fetches a sampling configuration from a remote config endpoint at initialisation and caches it in DataStore. The config maps event names (or categories) to sampling rates: {"scroll_event": 0.01, "button_tap": 1.0, "purchase": 1.0}. The track() function looks up the rate for the event name in an in-memory copy of the config and uses a fast, contention-free random source such as a simple LCG or ThreadLocalRandom (not a shared java.util.Random, whose single atomic seed becomes a contention point when many threads hit it on the hot path) to decide whether to accept the event. The remote config TTL should be short enough to respond to production incidents (e.g., a new event type flooding the pipeline) but long enough to avoid excessive network calls; a 1-hour cache with background refresh is typical. Always default to samplingRate = 1.0 if the remote config is unavailable, so you don't silently drop events on first install.
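
The lookup-and-roll step might be sketched like this. Sampler is a hypothetical class; ThreadLocalRandom is used here as the contention-free random source in place of a hand-written LCG, and the random source is injectable so the decision can be tested deterministically.

```kotlin
import java.util.concurrent.ThreadLocalRandom

class Sampler(
    private val nextDouble: () -> Double = { ThreadLocalRandom.current().nextDouble() },
) {
    // Swapped wholesale when a fresh remote config arrives; reads never lock.
    @Volatile
    private var rates: Map<String, Double> = emptyMap()

    fun update(newRates: Map<String, Double>) {
        rates = newRates
    }

    fun shouldTrack(eventName: String): Boolean {
        // Events missing from the config default to 1.0, so nothing is
        // dropped before the first config fetch succeeds.
        val rate = rates[eventName] ?: 1.0
        return when {
            rate >= 1.0 -> true
            rate <= 0.0 -> false
            else -> nextDouble() < rate
        }
    }
}
```

With the example config from the text, shouldTrack("scroll_event") passes roughly 1 call in 100, while "purchase" and any unlisted event are always kept.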

Q19 Medium How do you handle the case where retryCount exceeds the maximum and events become DEAD_LETTER?

Events are moved to DEAD_LETTER status after exceeding maxRetryCount (typically 5). They are excluded from all future flush queries. The SDK should: (1) emit a meta-event (sdk_dead_letter) with the count and event names — this reaches the server only if the SDK itself is working, providing a signal that something is wrong with specific event types. (2) Purge DEAD_LETTER events after 7 days via a scheduled cleanup query — they're consuming disk space with no hope of delivery. (3) Log the failure reason locally (via the host app's logging hook if provided) so developers can diagnose whether it was a 4xx schema error or a persistent network issue. Never silently drop DEAD_LETTER events before logging them.
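
The retry bookkeeping can be sketched over an in-memory list standing in for the Room table. RetryRow, recordFailedBatch() and purgeExpired() are hypothetical names; the returned map is the payload for the sdk_dead_letter meta-event, and log stands in for the host app's logging hook.

```kotlin
enum class DeliveryStatus { PENDING, DEAD_LETTER }

// Hypothetical row shape; in the real SDK these fields live on the Room entity.
data class RetryRow(
    val name: String,
    var retryCount: Int = 0,
    var status: DeliveryStatus = DeliveryStatus.PENDING,
    var deadSinceMs: Long? = null,
)

val MAX_RETRY_COUNT = 5
val DEAD_LETTER_TTL_MS = 7L * 24 * 60 * 60 * 1000   // 7 days

// Records one failed upload attempt for every row in the batch and returns
// the newly dead-lettered event names with their counts.
fun recordFailedBatch(batch: List<RetryRow>, nowMs: Long, log: (String) -> Unit): Map<String, Int> {
    val newlyDead = mutableListOf<RetryRow>()
    for (row in batch) {
        row.retryCount++
        if (row.retryCount > MAX_RETRY_COUNT && row.status == DeliveryStatus.PENDING) {
            row.status = DeliveryStatus.DEAD_LETTER
            row.deadSinceMs = nowMs
            log("dead-lettered ${row.name} after ${row.retryCount} attempts")   // logged before any purge
            newlyDead += row
        }
    }
    return newlyDead.groupingBy { it.name }.eachCount()
}

// Scheduled cleanup: removes dead-letter rows older than the TTL, returns how many.
fun purgeExpired(rows: MutableList<RetryRow>, nowMs: Long): Int {
    val before = rows.size
    rows.removeAll {
        it.status == DeliveryStatus.DEAD_LETTER &&
            nowMs - (it.deadSinceMs ?: nowMs) >= DEAD_LETTER_TTL_MS
    }
    return before - rows.size
}
```

Logging happens at the moment of the status change, so the failure is recorded even if the row is purged before anyone inspects the database.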

Q20 Hard How do you ensure the analytics SDK adds minimal size to the host app's APK?

SDK size budget is a real constraint: adding 2 MB to an APK can noticeably hurt install conversion rates. Strategies: (1) Minimise transitive dependencies. Avoid pulling in OkHttp if the host app already uses it; declare it as compileOnly and require the host to provide it, or use the platform's HttpURLConnection as a fallback. (2) ProGuard rules. Ship a consumer-rules.pro file that keeps only the public API and lets R8 shrink the rest. (3) No Gson/Moshi. Serialise the fixed event schema with the platform's JSONObject or a small handwritten serialiser to avoid pulling in a full JSON library. (4) Lean on Jetpack libraries the host app very likely ships already (WorkManager, DataStore) rather than bundling a custom scheduler or preferences wrapper; they are not part of the platform, so count them in the budget if the host does not include them. Measure the AAR size with ./gradlew assembleRelease and track it in CI. A well-optimised analytics SDK should add under 200 KB to the release APK.