Design an Analytics SDK
A comprehensive Android system design breakdown of a production-grade analytics SDK, covering event capture, in-memory buffering, offline persistence, batched transport, consent management, and every edge case that distinguishes principal engineers from mid-level candidates.
An analytics SDK is one of the most infrastructure-heavy Android problems an interviewer can pose. It sits at the intersection of background processing, persistence, network reliability, privacy law, and performance, all without being visible to the user. The key insight interviewers are testing: the SDK must be invisible to the host app's UX while being absolutely reliable about event delivery.
💡 Clarifying questions to ask upfront: Is this a first-party SDK (you own the server) or third-party (you can't change the ingestion API)? Do we need real-time streaming or can we batch? What's the guaranteed delivery SLA: at-least-once or at-most-once? Does the SDK need to be consent-aware (GDPR/CCPA)?
- Expose a track(eventName, properties) API that is non-blocking and safe to call from any thread, including the main thread
- Persist events locally so they survive process death and can be delivered when network becomes available
- Batch events and upload them to an ingestion endpoint, reducing battery and network overhead vs. per-event HTTP calls
- Guarantee at-least-once delivery: an event that was accepted for tracking must reach the server eventually, even if the device is offline for days
- Auto-capture of every View click (mention it: ViewTreeObserver + AccessibilityDelegate)
- Real-time streaming via WebSocket (can be added as a "hot path" for critical events)
- A/B experiment assignment (separate service, SDK would receive assignments via remote config)
- Crash reporting (separate SDK concern: separate database table, separate flush path)
- Zero ANR risk: track() must return in under 1ms. Write to an in-memory channel and return immediately
- Battery efficiency: No wake locks for analytics. Flush on natural triggers (network available, app background, periodic WorkManager), never by polling
- Disk quota: Max 50 MB of persisted events. When full, drop oldest events. Never crash the host app due to analytics storage
- Idempotent delivery: Each batch carries a stable batchId. Server deduplicates on re-delivery after network timeout
- Privacy-safe: PII fields declared in schema are hashed/stripped before leaving the device. SDK stops all activity immediately when consent is revoked
- Sub-100ms upload latency (batching already makes this a non-issue at scale)
- End-to-end encryption of event payload (TLS is sufficient; field-level encryption is a separate concern)
Before touching code, four design principles constrain every decision in this SDK:
- Caller-thread safety. track() is called from UI code. It must never block, never throw, and never touch the network. Hand the event to a background channel in the same call and return.
- Room as the durability guarantee. An event that reaches Room is safe. Everything between track() and Room is best-effort in-memory. This is the durability boundary: at-least-once delivery is promised only for events that have crossed it.
- Stateless flush workers. The flush logic reads from Room, sends to network, deletes on success. It must be fully restartable: if the process is killed mid-flush, the next flush picks up from the same Room state with no corruption.
- Backpressure by design. Every queue in the system has a maximum size. When full, the oldest data is evicted. Analytics data is never worth OOM-killing the host app.
The public track() function writes to a Channel<AnalyticsEvent>(capacity = 2000) with onBufferOverflow = DROP_OLDEST. It returns immediately; the caller never suspends. A single background coroutine drains this channel and writes to Room. This gives you non-blocking, bounded ingestion where the only allocation on the hot path is the event object itself.
The SDK manages four entities. The EventStatus state machine is the most important: it tracks exactly where each event is in the delivery pipeline and makes the flush engine fully restartable.
⚠️ Two timestamps per event, always. Store both deviceTimestamp (wall clock, System.currentTimeMillis()) and elapsedMs (SystemClock.elapsedRealtime()). Wall clock can drift or be changed by the user. elapsedRealtime() is monotonic since last boot, so the server can use it to reconstruct correct event ordering even when the device clock is wrong.
The SDK has five layers. Each has a single responsibility and communicates only downward. This is what makes the flush engine safely restartable with no shared mutable state between layers.
track() API. Thread-safe. Writes to bounded Channel. Never blocks.

| Step | Layer | What happens | State after |
|---|---|---|---|
| 1. track() called | Capture Layer | Consent checked (DataStore). Properties scrubbed of PII. Event wrapped with sessionId + timestamps. Sent to Channel. | In Channel buffer |
| 2. Channel drain | Buffer Layer | Background coroutine batches up to 50 events per Room transaction. Writes in one transaction for efficiency. | Room: PENDING |
| 3. Flush triggered | Flush Engine | Reads up to 100 PENDING events from Room. Assigns a batchId. Marks them IN_FLIGHT in a transaction. | Room: IN_FLIGHT |
| 4. HTTP POST | Transport Layer | Sends gzip-compressed JSON batch. Sets batchId header. Awaits response with 30s timeout. | Room: IN_FLIGHT |
| 5a. 200 OK | Transport Layer | Deletes all events with this batchId from Room in one transaction. | Room: deleted ✓ |
| 5b. 5xx / timeout | Flush Engine | Resets events to PENDING. Increments retryCount. Schedules retry with exponential backoff. | Room: PENDING |
| 5c. 4xx (perm fail) | Flush Engine | Marks events DEAD_LETTER. They are excluded from future flushes. Logged for diagnostics. | Room: DEAD_LETTER |
Marking events IN_FLIGHT before sending, not after, is critical. If the process is killed while the HTTP request is in flight, the events stay IN_FLIGHT in Room. On next launch, a startup check resets all IN_FLIGHT events back to PENDING. The server deduplicates using the stable batchId, so redelivery is safe.
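The transitions in the table can be modelled without Room at all. The sketch below is a minimal in-memory stand-in, not the SDK's actual DAO: `EventStore`, `StoredEvent`, `claimBatch`, and `complete` are hypothetical names, and a plain list plays the role of the events table. It assumes the batchId is stored on each row so that a batch reclaimed after a crash is re-sent under the same id.

```kotlin
import java.util.UUID

enum class EventStatus { PENDING, IN_FLIGHT, DEAD_LETTER }
enum class SendResult { SUCCESS, RETRYABLE, PERMANENT }

// Hypothetical row shape; a real Room entity would carry payload and timestamps too.
data class StoredEvent(
    val id: Long,
    val name: String,
    var status: EventStatus = EventStatus.PENDING,
    var batchId: String? = null,
    var retryCount: Int = 0,
)

// In-memory stand-in for the Room table (assumption: single-threaded access).
class EventStore {
    val rows = mutableListOf<StoredEvent>()

    // Step 3: claim PENDING rows as IN_FLIGHT *before* sending.
    // Rows that already carry a batchId (reset after a crash) keep it.
    fun claimBatch(limit: Int): String? {
        val pending = rows.filter { it.status == EventStatus.PENDING }
        val first = pending.firstOrNull() ?: return null
        val existing = first.batchId
        val batch = if (existing != null) {
            pending.filter { it.batchId == existing }
        } else {
            pending.filter { it.batchId == null }.take(limit)
        }
        val batchId = existing ?: UUID.randomUUID().toString()
        batch.forEach {
            it.status = EventStatus.IN_FLIGHT
            it.batchId = batchId
        }
        return batchId
    }

    // Steps 5a / 5b / 5c.
    fun complete(batchId: String, result: SendResult) {
        val batch = rows.filter { it.batchId == batchId }
        when (result) {
            SendResult.SUCCESS -> rows.removeAll(batch)
            SendResult.RETRYABLE -> batch.forEach {
                it.status = EventStatus.PENDING
                it.retryCount += 1
            }
            SendResult.PERMANENT -> batch.forEach { it.status = EventStatus.DEAD_LETTER }
        }
    }

    // Startup check: orphaned IN_FLIGHT rows become PENDING again.
    fun resetInFlightToPending() {
        rows.filter { it.status == EventStatus.IN_FLIGHT }
            .forEach { it.status = EventStatus.PENDING }
    }
}
```

Claiming, calling `resetInFlightToPending()` to simulate a process kill, and claiming again returns the same batchId, which is what lets the server deduplicate the redelivery.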
There are four independent flush triggers. They all funnel into the same FlushEngine.flush() coroutine, which is idempotent: concurrent triggers are fine, because the second call exits immediately if a flush is already in progress.
| Trigger | Mechanism | When fires | Why needed |
|---|---|---|---|
| Batch size | Room count(PENDING) ≥ 100 | During active use (heavy event rate) | Prevents unbounded memory/disk growth |
| Timer | WorkManager PeriodicWorkRequest every 15 min | Even if app is in background or killed | Guarantees delivery even for low-traffic apps |
| App background | ProcessLifecycleOwner ON_STOP | User leaves app (home, task switch) | Flushes session-end events before process may be killed |
| Network restored | ConnectivityManager.NetworkCallback | Device comes back online after offline period | Delivers backlog immediately when connection returns |
💡 WorkManager is the only trigger that survives process death. The other three require the app to be running. WorkManager's NetworkType.CONNECTED constraint means the timer-based flush also only runs when connectivity is available, so there are no wasted wake-ups.
The SDK is composed of six classes with strict one-directional dependencies. Each class owns exactly one responsibility and is independently testable. The diagram below shows every class, its key methods, and how they wire together at runtime.
All six classes are composed inside a single AnalyticsSdk container that is built once in Application.onCreate(). There is no service locator or global state beyond the Analytics object itself; every dependency is explicit and injected via the constructor, making each class independently unit-testable with fakes.
Analytics.track()
The capture layer is the only public surface of the SDK. It must return in under 1ms on the main thread: no I/O, no locks, no suspension. The sequence below shows exactly what happens on each track() call, and how the event moves from the caller into Room asynchronously.
SessionManager maintains a single @Volatile field for the current session ID so track() can read it lock-free. The Mutex only activates on session creation or expiry, a rare path. Using elapsedRealtime() instead of wall clock means the timeout is immune to the user changing the device time.
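One way to picture this is the sketch below. It is an illustration under stated assumptions rather than the SDK's real class: the clock is injected as a lambda (on Android it would be `SystemClock::elapsedRealtime`), the 30-minute timeout is an assumed default, and a plain JVM lock stands in for the coroutine Mutex so the example needs no coroutine dependency.

```kotlin
import java.util.UUID

class SessionManager(
    private val elapsedRealtime: () -> Long,        // injected monotonic clock (assumption)
    private val timeoutMs: Long = 30 * 60 * 1000L,  // assumed 30-minute inactivity window
) {
    @Volatile private var sessionId: String = UUID.randomUUID().toString()
    @Volatile private var lastActivityMs: Long = elapsedRealtime()
    private val lock = Any()

    // Expired if idle too long, or if the monotonic clock went backwards (device reboot).
    private fun isExpired(now: Long): Boolean =
        now < lastActivityMs || now - lastActivityMs > timeoutMs

    fun currentSessionId(): String {
        val now = elapsedRealtime()
        if (isExpired(now)) {
            // Rare path: only session rotation takes the lock.
            synchronized(lock) {
                if (isExpired(now)) {
                    sessionId = UUID.randomUUID().toString()
                    lastActivityMs = now
                }
            }
        }
        // Common path: two volatile accesses, no locking.
        lastActivityMs = now
        return sessionId
    }
}
```

The double check inside the lock keeps two threads that observe expiry at the same moment from each minting a new session.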
The flush engine reads from Room, marks events IN_FLIGHT, sends to the server, and deletes on success. Every step is designed so it can be interrupted and replayed safely. The sequence below shows one complete flush cycle including the failure paths.
AnalyticsTransport is deliberately stateless: it receives a batch, sends it, and returns a typed SendResult. All retry state (attempt count, backoff timing) lives inside this function call, not in any field. This means it can be swapped for a fake in tests with zero ceremony, and the flush engine doesn't need to know anything about HTTP.
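The backoff timing can be a pure function of the attempt number, which fits the stateless design. The sketch below is one possible shape, not the SDK's actual policy: the 1-second base, 60-second cap, and the use of full jitter are all assumptions added for illustration (the text only specifies exponential backoff).

```kotlin
import kotlin.random.Random

// Delay before retry number `attempt` (0-based). All defaults are assumed values.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 1_000L,
    maxMs: Long = 60_000L,
    random: Random = Random.Default,
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Double the base per attempt; clamp the shift so the value cannot overflow.
    val exponential = baseMs shl minOf(attempt, 20)
    val capped = minOf(exponential, maxMs)
    // Full jitter: pick uniformly in [0, capped] so devices do not retry in lockstep.
    return random.nextLong(capped + 1)
}
```

Jitter matters most after a server outage, when many devices would otherwise retry on the same schedule.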
ConsentGuard has two jobs: provide a zero-cost synchronous check on the track() hot path (via @Volatile), and react to consent changes via a DataStore Flow. PiiScrubber is pure (same input always gives the same output, no state), making it trivially testable.
🚨 Scrub at capture, not at flush. PII that reaches Room can be exfiltrated via ADB backup, crash dumps, or debug logging. The scrubbing boundary is the track() call: nothing downstream should ever see raw PII fields.
WorkManager is the only flush trigger that survives process death. The three in-process triggers (batch size, lifecycle, network callback) all require the app to be running. AnalyticsFlushWorker is a thin wrapper: it holds no state and just delegates to FlushEngine.flush().
These are the eight scenarios that separate senior engineers from staff. Each one has caused real production incidents in analytics systems at scale.
The app is killed by the OS (OOM killer, user force-stop, or system resource pressure) while an HTTP request is in flight. The events are marked IN_FLIGHT in Room but no server acknowledgement was received. On the next app launch, these events remain IN_FLIGHT indefinitely: never retried, never delivered.
Fix: On every SDK initialisation (in Application.onCreate()), run eventDao.resetInFlightToPending() on the SDK's background dispatcher, and do not let any flush start until it has completed. Running it synchronously on the main thread would add disk I/O to cold start, and Room rejects main-thread queries by default. The reset converts all orphaned IN_FLIGHT events back to PENDING so they're included in the next flush. Because the rows keep the batchId they were claimed under, the server can deduplicate if the original request actually completed: at-least-once delivery is preserved with no double-counting.
The user manually sets their device clock backward or forward, a common trick to exploit time-limited promotions. If you use System.currentTimeMillis() as the only timestamp, your event timeline will have events with impossible ordering: a button tap 3 days before the session started, or events timestamped in the future. This corrupts funnel analysis irreparably.
Fix: Store two timestamps per event: deviceTimestamp (System.currentTimeMillis()) and elapsedMs (SystemClock.elapsedRealtime()). The elapsed time is monotonic since device boot and cannot be altered by the user. The server computes the correct wall-clock time as: serverReceivedAt - (batchSentElapsed - eventElapsedMs). This gives a monotonically consistent event timeline regardless of device clock manipulation.
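The correction formula is small enough to write out directly. The function below is a sketch of the server-side calculation; the function name is hypothetical, and returning null when the age is negative is an added assumption to cover a reboot between capture and send, where the two elapsed readings come from different boots and cannot be compared.

```kotlin
// Reconstructs an event's wall-clock time from the server's own clock.
// Returns null when the elapsed readings are not comparable (assumed reboot),
// in which case the caller would fall back to deviceTimestamp.
fun correctedEventTimeMs(
    serverReceivedAtMs: Long,
    batchSentElapsedMs: Long,
    eventElapsedMs: Long,
): Long? {
    // How long the event sat on the device before the batch was sent.
    val ageMs = batchSentElapsedMs - eventElapsedMs
    if (ageMs < 0) return null
    return serverReceivedAtMs - ageMs
}
```

For example, an event captured at elapsed time 30,000 ms and sent at 90,000 ms is 60 seconds old, so a batch received at 1,700,000,000,000 yields 1,699,999,940,000. Network transit time is ignored, which slightly overstates how recent each event is.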
The SDK sends a batch, the server processes it successfully and returns 200, but the response is lost in transit (network drops after the ACK leaves the server). The SDK times out, treats it as a transient failure, resets the batch to PENDING, and resends. The server now processes the same 100 events twice, inflating pageview counts, double-counting purchases, and corrupting cohort analysis.
Fix: Every batch carries a UUID batchId that is generated once and remains stable across retries. The server maintains a deduplication index keyed by (appKey, batchId) with a 7-day TTL. On receipt of a duplicate batchId, the server returns 200 immediately without reprocessing. This is a standard idempotency key pattern, the same mechanism used for payment deduplication.
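The server-side index can be sketched as a map from (appKey, batchId) to an expiry time. This is an in-memory illustration only: `BatchDeduplicator` and `shouldProcess` are hypothetical names, and a production ingestion service would more plausibly use a shared store with native key expiry than a per-instance map that is swept on every call.

```kotlin
class BatchDeduplicator(
    private val ttlMs: Long = 7L * 24 * 60 * 60 * 1000,   // 7-day TTL from the text
    private val nowMs: () -> Long = System::currentTimeMillis,
) {
    // (appKey, batchId) -> expiry timestamp
    private val seen = HashMap<Pair<String, String>, Long>()

    /** Returns true if the batch is new and should be processed, false for a duplicate. */
    @Synchronized
    fun shouldProcess(appKey: String, batchId: String): Boolean {
        val now = nowMs()
        // Sweep expired keys (O(n); acceptable only for a sketch).
        seen.entries.removeAll { it.value <= now }
        val key = appKey to batchId
        if (key in seen) return false
        seen[key] = now + ttlMs
        return true
    }
}
```

In both cases the HTTP handler answers 200; the return value only decides whether the events enter the pipeline.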
A game or animation-heavy app calls track() 50 to 100 times per second (frame events, position updates, collision callbacks). The in-memory Channel fills up, Room write throughput is exceeded, the database grows at 1 MB/min, and the app lags due to I/O contention on the SQLite WAL.
Fix: Two mechanisms in concert. First, declare high-frequency event types as sampled in the SDK config, e.g. SamplingConfig("frame_rendered", rate = 0.01f) means only 1% of these events are tracked. The track() function checks the sampling rate before inserting into the Channel. Second, aggregate events client-side: instead of individual "scroll_pixel" events, emit "scroll_session" with totalPx and durationMs on scroll end. This reduces event volume by 99% with no loss of analytical signal.
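The sampling check itself can be very small. The sketch below assumes `SamplingConfig` is a simple name-plus-rate pair, as the example in the text suggests; the `Sampler` class and the rule that unconfigured events always pass are illustrative assumptions.

```kotlin
import kotlin.random.Random

data class SamplingConfig(val eventName: String, val rate: Float)

class Sampler(
    configs: List<SamplingConfig>,
    private val random: Random = Random.Default,
) {
    // Rates are clamped to [0, 1] once, at construction time.
    private val rates: Map<String, Float> =
        configs.associate { it.eventName to it.rate.coerceIn(0f, 1f) }

    fun shouldTrack(eventName: String): Boolean {
        // Events without a sampling entry are always tracked (assumption).
        val rate = rates[eventName] ?: return true
        // nextFloat() is in [0, 1), so rate 1.0 always passes and rate 0.0 never does.
        return random.nextFloat() < rate
    }
}
```

When sampling is applied, the rate should travel with the event (or be known to the server) so counts can be scaled back up during analysis.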
A field worker uses the app in an area without connectivity for 72 hours. The SDK accumulates thousands of events in Room. After 3 days, the analytics database is 200 MB, which competes with the host app's own Room databases, the image cache, and the OS low-storage threshold. Once the device runs low on storage, the OS may start restricting background operations.
Fix: Enforce a hard disk quota of 50 MB (configurable at SDK init). Before every Room insert, check current database file size. When the quota is exceeded, drop the oldest 500 PENDING events to make room. Android's bundled SQLite is typically built without support for ORDER BY / LIMIT on DELETE, so express the eviction as a subquery: DELETE FROM analytics_events WHERE id IN (SELECT id FROM analytics_events WHERE status = 'PENDING' ORDER BY deviceTimestamp ASC LIMIT 500), where id is assumed to be the table's primary key. Log a quota_exceeded meta-event (which itself is exempt from the quota) so the server knows data was trimmed. Run VACUUM to reclaim SQLite file space, outside any transaction and sparingly, since it rewrites the whole file. This is intentional lossy behaviour: analytics data should never crash or degrade the host app.
The user opens the privacy settings and toggles off analytics while the app is running. Under GDPR, withdrawing consent (Article 7(3)) means collection must stop, and an erasure request (Article 17) requires stored personal data to be deleted without undue delay. In practice the SDK should act at once, not at the next app restart. Events currently in the Channel, in Room, and potentially in-flight to the server must all be purged.
Fix: The ConsentGuard exposes a DataStore Flow observed by the SDK. When it emits analyticsEnabled = false, the SDK executes a four-step purge: (1) set the cached consent flag so new track() calls drop immediately, (2) clear the in-memory Channel buffer, (3) delete all rows from Room in a single transaction, (4) cancel all WorkManager tasks tagged "analytics_flush". If an HTTP request is currently in flight, the response is ignored on arrival, because the SDK checks consent before acting on any HTTP callback. Send a DELETE request to the server to purge server-side data if the ingestion API supports it.
You ship SDK v2 which adds a required experimentGroup field to all events. Events from SDK v1 devices don't have this field. The server-side pipeline expects experimentGroup and throws a parse error, sending all v1 events to a dead-letter queue. Weeks of data from users who haven't updated are silently lost.
Fix: Every event carries a schemaVersion: Int field, set to the SDK version that produced it. The server applies a schema registry pattern: each schemaVersion maps to a transformer that fills in defaults for missing fields before the event enters the pipeline. This means old events are always processable. On the client, Room migration via Migration(1, 2) adds new columns with sensible defaults so the database doesn't break on upgrade. Never make new fields required in the server schema; always provide server-side defaults for absent fields.
A junior engineer on the host app team calls Analytics.track() inside a RecyclerView.onBindViewHolder() loop for 200 items during fast fling. If the SDK performs any I/O, even a SharedPreferences read to check consent, the main thread stalls. At 200 items × 1ms of I/O each, that is 200ms of lag, causing jank. A heavier SDK that hits Room synchronously causes an ANR (5s main thread block threshold on Android).
Fix: The track() function must be provably non-blocking. The consent check reads from a @Volatile in-memory field updated by a background coroutine, never from DataStore directly. The Channel trySend() is non-blocking by design. No Room access, no SharedPreferences, no synchronisation primitives on the main-thread path. In debug builds, install a StrictMode thread policy with detectDiskReads(), detectDiskWrites(), and detectNetwork() plus penaltyDeath(), so any regression that adds I/O to the hot path surfaces immediately. StrictMode.noteSlowCall() is not a substitute: it reports a violation every time it is reached rather than detecting I/O.
- Can describe batching to reduce network calls
- Knows WorkManager for background jobs
- Aware of offline persistence with Room
- Understands basic retry logic on failure
- Mentions threading: doesn't block main thread
- Can define basic Event and Session models
- Designs the full state machine: PENDING → IN_FLIGHT → SENT → DEAD_LETTER
- Articulates why IN_FLIGHT state prevents event loss on process death
- Uses bounded Channel with DROP_OLDEST for back-pressure
- Separates flush triggers (4 signals) and explains each tradeoff
- Handles dual timestamps for clock skew resilience
- Designs PII scrubbing at capture time, not flush time
- Discusses idempotency keys and server-side deduplication
- Designs the SDK as a reusable library with a clean public API surface
- Discusses schema versioning and backwards-compatible server pipelines
- Covers sampling strategies for high-frequency event sources
- Addresses GDPR/CCPA consent with immediate purge guarantees
- Designs for multi-process apps (separate DB connection per process)
- Discusses cold-start telemetry: SDK must not delay Application.onCreate()
- Considers SDK size budget: analytics shouldn't add more than 200 KB to APK
- Proposes a testing strategy: fake transport layer, Room in-memory, StrictMode validation
The answers below cover the questions most commonly asked at Google, Meta, Flipkart, and Swiggy for senior Android roles.
track() is called from the UI thread: RecyclerView bind, button click handlers, Fragment lifecycle callbacks. Any I/O on the main thread risks ANR (Android kills the app if the main thread is blocked for 5 seconds). Even a 5ms SharedPreferences read repeated in a RecyclerView loop of 200 items equals 1 second of jank. The correct pattern is to write to an in-memory, non-blocking structure (a Channel with trySend()) and let a background coroutine drain it to Room asynchronously. The consent check must also be a @Volatile in-memory read, never a DataStore or SharedPreferences read on the call path.
At-most-once: an event is sent once and if lost (network failure), it is not retried. Zero duplicates guaranteed, but events can be dropped. At-least-once: an event is retried until the server acknowledges it. Events are guaranteed to arrive but may arrive more than once. Analytics SDKs need at-least-once delivery: a missing purchase event or funnel drop-off corrupts your metrics. Duplicates are handled server-side via idempotency keys (batchId). Exactly-once delivery would require distributed transactions which are impractical on a mobile client.
Without IN_FLIGHT, the flush engine has no way to know which events are currently being sent. A second flush trigger that fires while a request is still in progress would select the same PENDING events again and send them under a different batchId, which the server cannot deduplicate. IN_FLIGHT marks a "claim" on a set of events for a specific batchId. On startup, any orphaned IN_FLIGHT events (from a crashed previous session) are reset to PENDING and re-sent with the same batchId, allowing the server to deduplicate safely.
Two strategies in combination: sampling and client-side aggregation. Sampling means the SDK only tracks 1 in N events for declared high-frequency event types, configured at SDK init via a SamplingConfig. For positional/animation events, client-side aggregation is better: instead of tracking every frame, track a single "scroll_session_ended" event with aggregated metrics (totalScrollPx, duration, direction changes). The bounded Channel(capacity=2000, DROP_OLDEST) also provides a hard circuit-breaker: if the drain coroutine stalled while events arrived at 100 per second, the buffer would fill in 20 seconds and then drop the oldest, self-regulating the pipeline without any special logic.
SDK initialisation in Application.onCreate() must be sub-millisecond. Avoid any I/O, Room.build(), or OkHttp client creation on the main thread. Use lazy initialisation: the AnalyticsDatabase and OkHttpClient are built on first access inside a background coroutine, not at init time. The public Analytics object is created synchronously (just allocating the Channel and CoroutineScope), but the heavy lifting is deferred. A ContentProvider (like Firebase's FirebaseInitProvider) can auto-initialise the SDK without requiring a call in Application.onCreate(), but this must also be verified with Systrace to confirm it doesn't add to TTID.
Events accumulate in Room with status PENDING. The WorkManager periodic task is scheduled every 15 minutes, but WorkManager does not start the worker while the NetworkType.CONNECTED constraint is unsatisfied, so no battery is wasted. When connectivity is restored, two things happen: the NetworkCallback fires an immediate flush, and the constrained WorkManager task becomes eligible to run. The flush engine processes batches of 100 events, deleting each confirmed batch before reading the next. For 72 hours of events, the flush may take several minutes, all in the background. A disk quota check prevents the 72-hour backlog from ever exceeding 50 MB by evicting the oldest events if the threshold is reached.
Always at track() time, before the event enters the Channel. If you scrub at flush time, PII data is stored unredacted in Room between capture and flush. Room databases are accessible via ADB backup, crash reporting tools (if you log Room contents), and if the device is compromised. Scrubbing at the capture boundary ensures PII never reaches any persistent store. The scrubbing function is synchronous and CPU-only (SHA-256 hash or null replacement). It adds approximately 50-100 µs, which is still safely non-blocking on the main thread.
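A pure scrubber along these lines might look like the sketch below. The split into `hashedKeys` and `strippedKeys` is an assumed way of declaring PII fields, not something the text specifies, and stripped fields are removed entirely here rather than replaced with null.

```kotlin
import java.security.MessageDigest

class PiiScrubber(
    private val hashedKeys: Set<String>,    // fields replaced by their SHA-256 hex digest
    private val strippedKeys: Set<String>,  // fields removed entirely
) {
    // Pure function: no I/O, no shared state, same input gives same output.
    fun scrub(properties: Map<String, Any?>): Map<String, Any?> =
        properties
            .filterKeys { it !in strippedKeys }
            .mapValues { (key, value) ->
                if (key in hashedKeys && value != null) sha256Hex(value.toString()) else value
            }

    private fun sha256Hex(input: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(input.toByteArray(Charsets.UTF_8))
            .joinToString("") { "%02x".format(it) }
}
```

An unsalted hash of a low-entropy value such as an email address is pseudonymisation rather than anonymisation, so whether hashing or stripping is appropriate depends on the field.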
The consent revocation and the flush run concurrently on different coroutines, so you need a cooperative cancellation approach. The ConsentGuard sets a @Volatile analyticsEnabled = false immediately, then launches a coroutine to purge Room. The flush engine checks consentGuard.isAnalyticsEnabled() at the start of each batch loop iteration, so it exits after the current batch (which may already be in flight) completes. The in-flight batch response is ignored if consent is revoked before it arrives. This is a best-effort approach: GDPR requires erasure "without undue delay" rather than instantaneously, which leaves room for this short propagation window. For stricter compliance, cancel the OkHttp call directly using Call.cancel().
Multi-process apps (a main process + a :background process for a service, for example) have a critical constraint: Room cannot be safely shared across processes with write access from both. Each process gets its own in-process SQLite WAL writer. Two solutions: (1) Single writer process: only the main process writes to Room; the background process sends events to the main process via AIDL or ContentProvider-backed IPC, and the main process inserts them. (2) Separate databases per process with a periodic merge operation: simpler, but it causes duplicate session IDs and makes cross-process funnel analysis harder. Option 1 is preferred for correctness. The ContentProvider approach (exposing an insert() URI) is the cleanest Android-native IPC mechanism for this use case.
Some events (payment confirmations, crash signals, security events) should be sent immediately rather than waiting for the next batch. Design a dual-path system: track() for standard events (goes through Channel → Room → batch flush) and trackImmediate() for critical events (writes to Room then immediately triggers a single-event flush coroutine, bypassing the batch size and timer conditions). The trackImmediate() path still writes to Room first to preserve durability; it just skips the wait. On the server, separate the ingestion endpoint for critical events so they can be prioritised in the processing pipeline without being blocked by high-volume standard event batches.
Testing requires controlled substitution at every layer. Use a fake transport that records batches without making network calls; verify event names, property values, and batch sizing in unit tests. Use Room in-memory database (Room.inMemoryDatabaseBuilder()) for fast integration tests without disk I/O. For the flush engine, mock SystemClock to test timer-triggered flushes without actually waiting 15 minutes. Use StrictMode in instrumented tests with detectDiskReads() and detectDiskWrites() on the main thread; this catches any regression that adds I/O to the track() hot path. For WorkManager, use TestWorkerBuilder to run the flush worker synchronously in tests. Add a deliberate process-kill simulation test: insert events, mark them IN_FLIGHT, restart the SDK, verify they become PENDING again.
Analytics event JSON is highly repetitive: field names like "eventId", "sessionId", "schemaVersion" are repeated in every event in a batch of 100. gzip achieves 5-8× compression on typical event payloads, reducing a 40 KB batch to ~6 KB. This matters for three reasons: (1) faster upload on slow connections, (2) lower mobile data usage (important in price-sensitive markets), and (3) lower egress cost on the server side at scale (millions of devices × hundreds of events/day = significant bandwidth). OkHttp transparently decompresses gzip responses but does not compress request bodies on its own; a small application interceptor that wraps the body and sets Content-Encoding: gzip (the GzipRequestInterceptor recipe in OkHttp's documentation) handles it in a few lines, with significant real-world impact.
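The effect of repeated field names is easy to check with the JDK's own gzip streams. The sketch below uses a made-up four-field event shape from `sampleBatchJson` purely for illustration; the ratio on real payloads depends on how varied the property values are, so no specific figure is claimed for this sample.

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

fun gzip(bytes: ByteArray): ByteArray {
    val out = ByteArrayOutputStream()
    // use {} closes the stream, which also writes the gzip trailer.
    GZIPOutputStream(out).use { it.write(bytes) }
    return out.toByteArray()
}

fun gunzip(bytes: ByteArray): ByteArray =
    GZIPInputStream(bytes.inputStream()).use { it.readBytes() }

// Hypothetical event shape, chosen only to mimic the repeated field names.
fun sampleBatchJson(count: Int): String =
    (1..count).joinToString(separator = ",", prefix = "[", postfix = "]") { i ->
        """{"eventId":"evt-$i","sessionId":"session-1234","schemaVersion":2,"name":"screen_view"}"""
    }
```

Comparing `sampleBatchJson(100).toByteArray().size` with the size of its `gzip(...)` output shows the saving for a given payload, and `gunzip` confirms the round trip is lossless.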
A session is a contiguous period of user engagement, typically defined as activity within a rolling 30-minute window. Session edge cases: (1) App backgrounded and resumed: if the app was in the background for less than 30 minutes, it should continue the existing session, not start a new one. Use ProcessLifecycleOwner to track background time via elapsedRealtime(). (2) Device restart: elapsedRealtime() resets to zero on reboot, so always start a new session after a device restart regardless of the timestamp comparison. (3) User identity change: if Analytics.identify(userId) is called mid-session (user logs in), consider starting a new session to avoid attributing pre-login events to the identified user. (4) Clock change during session: don't use wall clock for timeout calculation; always use elapsedRealtime().
Every event carries a schemaVersion integer. The server maintains a schema registry mapping each version to a transformer function. When SDK v2 adds a new required field, the server-side transformer for v1 events fills in a sensible default (e.g., experimentGroup = "unknown"). Room database migrations use Migration(1, 2) { database -> database.execSQL("ALTER TABLE analytics_events ADD COLUMN schemaVersion INTEGER NOT NULL DEFAULT 1") } to add new columns without dropping old events. The key principle: never make a field required on the server that didn't exist in older SDK versions. Always deploy server-side schema changes before releasing the SDK version that produces them (expand-then-migrate pattern).
Register an Application.ActivityLifecycleCallbacks in the SDK initialisation. In onActivityResumed(activity), automatically call track("screen_view", mapOf("screen_name" to activity.localClassName)). For Fragment tracking, attach a FragmentManager.FragmentLifecycleCallbacks recursively to every Activity via the lifecycle callback. The screen name can be derived from the Fragment class name, or from a custom annotation (@ScreenName("Home Feed")) if more readable names are needed. For Jetpack Compose with Navigation, instrument via a NavController.OnDestinationChangedListener registered with addOnDestinationChangedListener(). The host app opts in to auto-capture at SDK init; it should be off by default since some apps have confidential screen names.
This is a classic bootstrap problem. Some events ("app_opened", "splash_screen_shown") are logically the first events in any session, but they happen before Analytics.init() is called in Application.onCreate(). Solution: implement a pre-init buffer. Before initialisation, track() calls write to a simple in-memory list (not a Channel, not Room, just an ArrayList). When init() is called, the pre-init buffer is drained into the real Channel first, preserving event ordering. This buffer has a hard cap of 50 events and is discarded if init() is never called (e.g., if the SDK is not configured). The timestamps are captured at track() call time, so they're accurate even if replayed after init.
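A minimal version of that buffer is sketched below. `PreInitBuffer` and `attach` are hypothetical names, the sink lambda stands in for the real Channel, and two choices are assumptions the text leaves open: events beyond the cap are dropped newest-first (so the earliest events such as "app_opened" survive), and a simple lock guards the list for brevity.

```kotlin
class PreInitBuffer<E>(private val capacity: Int = 50) {
    private val pending = ArrayList<E>()
    private var sink: ((E) -> Unit)? = null

    @Synchronized
    fun track(event: E) {
        val attached = sink
        if (attached != null) {
            // After init: forward directly to the real pipeline.
            attached(event)
            return
        }
        // Before init: buffer up to the cap, silently dropping anything beyond it.
        if (pending.size < capacity) pending.add(event)
    }

    // Called once from init(): replay buffered events in order, then go live.
    @Synchronized
    fun attach(realSink: (E) -> Unit) {
        pending.forEach(realSink)
        pending.clear()
        sink = realSink
    }
}
```

Because `attach` drains under the same lock that `track` takes, an event tracked during initialisation cannot overtake the buffered ones.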
AlarmManager can fire at exact times, but exact alarms wake the device, cost battery, are deferred under Doze unless you use the allow-while-idle variants, and need a special permission from Android 12 onward. Foreground Service requires a persistent notification, which is unacceptable UX for an analytics SDK that should be invisible. WorkManager is designed exactly for deferrable background work: it respects Doze mode and App Standby, coalesces work to minimise wake-ups, uses JobScheduler on API 23+ (battery-optimal), survives app restarts, and supports constraint-based execution (only run when network is available). For analytics, the 15-minute minimum interval is fine; there's no requirement to flush in real-time. WorkManager is the only correct answer here for a production SDK.
The SDK fetches a sampling configuration from a remote config endpoint at initialisation and caches it in DataStore. The config maps event names (or categories) to sampling rates: {"scroll_event": 0.01, "button_tap": 1.0, "purchase": 1.0}. The track() function looks up the rate for the event name and uses a fast LCG random number generator (not a shared java.util.Random instance, whose atomic seed updates contend across threads on the hot path) to decide whether to accept the event. The remote config TTL should be short enough to respond to production incidents (e.g., a new event type flooding the pipeline) but long enough to avoid excessive network calls; a 1-hour cache with background refresh is typical. Always default to samplingRate = 1.0 if the remote config is unavailable, so you don't silently drop events on first install.
Events are moved to DEAD_LETTER status after exceeding maxRetryCount (typically 5). They are excluded from all future flush queries. The SDK should: (1) emit a meta-event (sdk_dead_letter) with the count and event names. This reaches the server only if the SDK itself is working, providing a signal that something is wrong with specific event types. (2) Purge DEAD_LETTER events after 7 days via a scheduled cleanup query, since they're consuming disk space with no hope of delivery. (3) Log the failure reason locally (via the host app's logging hook if provided) so developers can diagnose whether it was a 4xx schema error or a persistent network issue. Never silently drop DEAD_LETTER events before logging them.
SDK size budget is a real constraint: adding 2 MB to an APK can noticeably hurt install conversion rates. Strategies: (1) Minimise transitive dependencies. Avoid pulling in OkHttp if the host app already uses it: declare it as compileOnly and require the host to provide it, or use the platform's HttpURLConnection as a fallback. (2) ProGuard rules. Ship a consumer-rules.pro file that keeps only the public API and lets R8 shrink the rest. (3) No Gson/Moshi. Write a minimal JSON serialiser for the fixed event schema using JSONObject or a handwritten serialiser to avoid pulling in a full JSON library. (4) Reuse Jetpack libraries the host app most likely already ships rather than bundling custom equivalents: DataStore instead of custom SharedPreferences wrappers, WorkManager instead of a custom scheduler. Measure the AAR size with ./gradlew assembleRelease and track it in CI. A well-optimised analytics SDK should add under 200 KB to the release APK.