Design a Real-Time Chat Application
A mobile-first system design breakdown of a WhatsApp-style chat app — covering offline-first architecture, WebSocket lifecycle management, delivery receipts, and media handling on Android.
Start your interview by defining the functional and non-functional requirements. For a chat application, functional requirements describe what the system does, while non-functional requirements describe the system qualities — things like "messages should be delivered in under 500ms" or "the app should work fully offline."
Prioritise the top 3 functional requirements. Listing more shows your product thinking, but clearly mark everything else as "below the line" so the interviewer knows you won't be including it in your design. Check in to see if the interviewer wants to move anything above the line.
- Users should be able to send and receive text messages in real time (1:1 chat)
- Users should see delivery receipts — Sent (✓), Delivered (✓✓), Read (✓✓ blue)
- Messages should queue and deliver even when the user is temporarily offline
Below the line:

- Group chat (>2 participants)
- End-to-end encryption
- Voice / video calling
- Message reactions and threads
- The system should deliver messages in under 500ms on a stable network connection
- The system should work fully offline — messages queue locally and sync on reconnect
- The system should be battery-efficient — no aggressive polling or persistent foreground services
- The app is read-heavy; the local DB is the single source of truth for the UI at all times

Below the line:

- Sub-50ms latency (a local DB + WebSocket gives us <200ms realistically)
- Message search and full-text indexing
- Compliance / data retention policies
Here's how it might look on your whiteboard: write "Functional Requirements" and "Non-Functional Requirements" with a horizontal line separating core from out-of-scope. Tell the interviewer explicitly what you're deprioritising and why.
On Android, the mobile client owns significantly more responsibility than in a web app. The device must manage its own connection state, persist data locally for offline access, handle background sync, and deliver push notifications when the process is killed. Before drawing architecture boxes, establish this contract clearly.
The core insight that separates strong candidates: the local Room database is the single source of truth. The UI never reads from the network directly. It only observes the DB. The repository layer syncs the DB with the server silently in the background.
Before designing the architecture, align on the data model. The two central entities are Message and Conversation. The trickiest design decision here is the dual-ID system on Message:
```kotlin
// Two-ID pattern — the most important design decision in this system
@Entity(tableName = "messages")
data class MessageEntity(
    @PrimaryKey val localId: String,   // UUID assigned by device at send time
    val serverId: String?,             // null until server ACKs; used for ordering
    val conversationId: String,
    val senderId: String,
    val text: String?,
    val mediaUri: String?,             // local path initially, then CDN URL
    val status: MessageStatus,         // PENDING | SENT | DELIVERED | READ | FAILED
    val createdAt: Long,               // client timestamp (ms)
    val serverTs: Long?                // used for canonical ordering after sync
)

@Entity(tableName = "conversations")
data class ConversationEntity(
    @PrimaryKey val id: String,
    val participantIds: String,        // JSON array
    val lastMessageId: String?,
    val unreadCount: Int,
    val updatedAt: Long
)
```
The localId is generated by the device the instant the user taps Send. This allows the message bubble to render immediately without waiting for any network response. The serverId arrives later via the WebSocket ACK frame and is written back to the same DB row.
The dual-ID system enables optimistic UI — the message appears instantly, before any server confirmation. This is how WhatsApp, iMessage, and Telegram all work. The localId acts as an idempotency key: even if the network request is retried, the server deduplicates by localId and never creates a duplicate message.
When the app launches, it connects to the chat server via a persistent WebSocket and opens a reactive stream from Room that the UI observes. All writes — both from the local user and from the server — go through Room first, so the UI always renders from a consistent local state.
Let's walk through exactly what each layer does:
- UI Layer: Jetpack Compose screens observe `StateFlow` from ViewModels. They never call network methods directly — they only call ViewModel functions.
- Domain Layer: Use cases encapsulate business logic. `SendMessageUseCase` assigns a `localId`, writes to Room, and triggers the outbox worker.
- ChatRepository: The central coordinator. It writes incoming WebSocket frames to Room and reads from Room for all UI queries. Room → UI is a reactive `Flow`.
- Room Database: The only source the UI reads from. This makes the UI identical whether online or offline — it always renders from local data.
- WebSocketManager: Maintains the persistent connection. Routes incoming frames to the Repository. Handles reconnection with exponential backoff.
- OutboxWorker: A WorkManager task constrained to `NETWORK_CONNECTED`. Drains the local outbox table and sends pending messages. Auto-retries on failure.
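To make the division of responsibility concrete, here is a sketch of what `SendMessageUseCase` might look like, stripped of Android types so it stands alone. The `MessageDao` and `OutboxScheduler` interfaces are illustrative stand-ins for the Room DAO and WorkManager enqueue call, not classes from the article:

```kotlin
import java.util.UUID

enum class MessageStatus { PENDING, SENT, DELIVERED, READ, FAILED }

data class Message(
    val localId: String,        // idempotency key, generated on-device
    val conversationId: String,
    val text: String,
    val status: MessageStatus,
    val createdAt: Long
)

// Stand-ins for the Room DAO and the WorkManager enqueue call
interface MessageDao { fun insert(msg: Message) }
interface OutboxScheduler { fun enqueue() }

class SendMessageUseCase(
    private val dao: MessageDao,
    private val outbox: OutboxScheduler
) {
    // Write to the local DB first, then trigger the outbox — never the other way round.
    fun send(conversationId: String, text: String): Message {
        val msg = Message(
            localId = UUID.randomUUID().toString(),
            conversationId = conversationId,
            text = text,
            status = MessageStatus.PENDING,     // UI renders a clock icon for this state
            createdAt = System.currentTimeMillis()
        )
        dao.insert(msg)    // the UI's Flow fires here — bubble appears instantly
        outbox.enqueue()   // the network send happens later, in the background
        return msg
    }
}
```

Note the ordering: the DB write precedes the network trigger, which is exactly what makes the send path offline-safe.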
When a user taps Send, we want the message to appear in the UI immediately — regardless of network state. Here's the exact sequence:
```kotlin
// OutboxWorker — WorkManager drains pending messages
class OutboxWorker(ctx: Context, params: WorkerParameters) :
    CoroutineWorker(ctx, params) {

    override suspend fun doWork(): Result {
        val pending = messageDao.getPendingMessages()
        pending.forEach { msg ->
            try {
                webSocketManager.send(msg.toWsFrame())
                // Don't mark SENT here — wait for server ACK via WebSocket
            } catch (e: Exception) {
                if (runAttemptCount >= 3) {
                    messageDao.updateStatus(msg.localId, MessageStatus.FAILED)
                    return Result.failure()
                }
                return Result.retry() // WorkManager uses exponential backoff
            }
        }
        return Result.success()
    }
}
```
The critical question isn't just "use WebSockets" — it's how to manage the connection lifecycle on Android, where the OS aggressively kills background processes. Here's the technology trade-off:
| Option | Latency | Battery | Complexity | Decision |
|---|---|---|---|---|
| HTTP Polling (5s interval) | ~5s | 🔴 High | Low | Rejected |
| HTTP Long Polling | <1s | Medium | Medium | Rejected |
| Server-Sent Events (SSE) | <200ms | Low | Low | Acceptable |
| WebSocket (OkHttp) | <100ms | Low | Medium | ✅ Chosen |
| gRPC Streaming | <100ms | Low | High | Good alternative |
The lifecycle rule: connect when the app comes to foreground, disconnect cleanly on background. When the app is backgrounded or killed, FCM push notifications wake it up for new messages. This avoids the battery drain of a persistent foreground service.
On connection failure, do not retry immediately — when millions of clients lose connectivity at once (e.g. during a server deploy), simultaneous reconnects create a thundering herd that can DDoS your own server. Instead, use exponential backoff with jitter: delay = min(2ⁿ × 1000ms, 30000ms) + random(0, 1000ms).
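The formula above can be sketched as a pure function. The `Random` parameter is injected only to make the jitter testable; in production code you would use the default source:

```kotlin
import kotlin.math.min
import kotlin.math.pow
import kotlin.random.Random

// delay = min(2^attempt × 1000ms, 30000ms) + random(0, 1000ms)
fun reconnectDelayMs(attempt: Int, rng: Random = Random.Default): Long {
    val base = min(2.0.pow(attempt) * 1000.0, 30_000.0).toLong()
    val jitter = rng.nextLong(0, 1000)  // spreads reconnects across clients
    return base + jitter
}
```

The cap at 30s keeps a long-offline client from waiting minutes once the server is back; the jitter is what breaks up the herd.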
The key principle: never block the message send on media upload. The image renders immediately from the local file. Upload happens in the background. This is the same approach used by WhatsApp, Telegram, and Signal.
- User picks image: Copy to app-internal storage. Compress to ≤300KB with a target 1080px width. Generate a `localUri`.
- Render from local URI: Message bubble shows the image from disk immediately. Status = `PENDING`.
- MediaUploadWorker runs: Chunked multipart upload to S3 via a pre-signed URL. Stores the last uploaded byte so resumable uploads survive network interruptions.
- Upload complete: DB row updated with the CDN URL. Coil swaps the image source transparently — no flicker, no reload.
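The resume logic can be sketched as a pure function: given the persisted offset of the last uploaded byte, compute the chunks still to send. The `Chunk` type and chunking scheme here are illustrative, not a specific S3 API:

```kotlin
data class Chunk(val start: Long, val endExclusive: Long)

// After an interruption, restart from uploadedBytes instead of byte 0.
fun remainingChunks(fileSize: Long, uploadedBytes: Long, chunkSize: Long): List<Chunk> {
    require(chunkSize > 0)
    val chunks = mutableListOf<Chunk>()
    var offset = uploadedBytes.coerceIn(0, fileSize)
    while (offset < fileSize) {
        val end = minOf(offset + chunkSize, fileSize)
        chunks.add(Chunk(offset, end))  // each chunk becomes one multipart request
        offset = end
    }
    return chunks
}
```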
⚠️ Common pitfall: Don't use the original file from MediaStore for upload — the URI may become invalid if the user deletes the photo mid-upload. Always copy to your app's own internal storage first, then upload from there.
The HLD shows what components exist. The LLD shows exactly how they talk to each other — method calls, data transformations, and state changes at every step. Below are three precise flows you should be able to draw and explain in an interview.
Start here. Draw this on the whiteboard in the first 5 minutes to anchor the entire conversation. Every other flow is a zoom-in of one arrow on this diagram.
This is the most important flow to master. Every method call, every state change, in exact order.
1. The ViewModel calls `SendMessageUseCase`. Assigns `localId = UUID.randomUUID()`, status = `PENDING`.
2. The use case inserts the row into Room. The UI's `Flow<List<Message>>` fires immediately — bubble appears on screen with a clock icon (PENDING).
3. The use case enqueues `OutboxWorker`, constrained to `NETWORK_CONNECTED`. If offline, WorkManager queues it and waits. If online, it runs immediately.
4. `OutboxWorker` sends the frame via `webSocket.send()`. Does not mark as SENT yet.
5. The server persists the message, assigns a `serverId`, and sends back an ACK frame on the same WebSocket connection.
6. `WebSocketManager` emits the frame onto its SharedFlow. Repository collects it.
7. The Repository updates the Room row: `serverId` set, status → SENT. UI's Flow fires again — clock icon becomes ✓ (single tick).

The receive path is simpler because all the heavy lifting (outbox, retry) is on the sender side. The receiver's job is: decode the frame → write to Room → let the UI react.

1. A frame arrives: `WebSocketListener.onMessage()` fires on OkHttp's internal thread.
2. The frame is emitted onto a SharedFlow on `Dispatchers.IO`. Repository is already collecting this SharedFlow in a coroutineScope.
3. The Repository writes the message to Room. `insertOrIgnore` is idempotent — if the same message arrives twice (reconnect scenario), no duplicate is created.

Each message row follows a strict one-way state machine. The receipt flows are the most commonly asked follow-up in interviews — draw this clearly.
```kotlin
// Repository collecting WebSocket frames and updating status
class ChatRepository @Inject constructor(
    private val dao: MessageDao,
    private val wsManager: WebSocketManager,
    private val scope: CoroutineScope
) {
    init {
        scope.launch(Dispatchers.IO) {
            wsManager.incomingFrames.collect { frame ->
                when (frame.type) {
                    WsFrameType.MESSAGE -> handleIncoming(frame)
                    WsFrameType.ACK -> dao.updateAck(frame.localId, frame.serverId, MessageStatus.SENT)
                    WsFrameType.DELIVERED -> dao.updateStatus(frame.serverId, MessageStatus.DELIVERED)
                    WsFrameType.READ -> dao.updateStatus(frame.serverId, MessageStatus.READ)
                }
            }
        }
    }

    private suspend fun handleIncoming(frame: WsFrame) {
        dao.insertOrIgnore(frame.toEntity())
        // Send DELIVERED receipt back immediately after DB write
        wsManager.send(WsFrame.deliveredAck(frame.serverId))
    }

    // UI observes this — reactive, no manual refresh needed
    fun messages(conversationId: String): Flow<List<MessageEntity>> =
        dao.getMessages(conversationId) // Room returns Flow automatically
}
```
On reconnect, the server may re-deliver messages the client already received. Using INSERT OR IGNORE (Room's OnConflictStrategy.IGNORE) means duplicate frames are silently dropped. The UI never shows a duplicate bubble — no extra deduplication logic needed anywhere else.
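A minimal in-memory model makes the idempotency property easy to test. This is a stand-in for Room's `OnConflictStrategy.IGNORE`, not the real DAO:

```kotlin
// Second insert with the same primary key (localId) is silently dropped,
// mirroring INSERT OR IGNORE semantics.
class MessageStore {
    private val byLocalId = LinkedHashMap<String, String>() // localId -> text

    fun insertOrIgnore(localId: String, text: String): Boolean {
        if (byLocalId.containsKey(localId)) return false // duplicate frame dropped
        byLocalId[localId] = text
        return true
    }

    fun count(): Int = byLocalId.size
}
```

Because the dedup happens at the storage layer, no caller anywhere in the app needs to remember to check for duplicates.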
Use Paging 3 with RemoteMediator. Recent messages load from Room instantly. When the user scrolls up past the local cache boundary, RemoteMediator fetches older pages from the REST API, writes them to Room, and Paging 3 re-emits. The UI always reads from Room — the pagination source switch is invisible to the user.
Don't send a receipt event per message — this floods the WebSocket. Instead, batch them: when the user opens a chat screen, send a single READ_ACK frame with the newest messageId they've seen. The server marks all messages up to that ID as read. This reduces receipt traffic by ~90%.
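The batching rule collapses to a pure function: N unread messages become one ACK carrying only the newest message's ID. The frame and field names here are illustrative:

```kotlin
data class Unread(val messageId: String, val serverTs: Long)
data class ReadAck(val conversationId: String, val upToMessageId: String)

// One READ_ACK per screen-open, not one per message.
fun batchReadAck(conversationId: String, unread: List<Unread>): ReadAck? {
    val newest = unread.maxByOrNull { it.serverTs } ?: return null
    return ReadAck(conversationId, newest.messageId)
}
```

The server interprets `upToMessageId` as "everything at or before this point is read", which is why picking the newest message is sufficient.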
Firebase Cloud Messaging (FCM) delivers a data-only push (not a notification push) to the device. Android wakes the app's FirebaseMessagingService, which connects the WebSocket, fetches missed messages via REST, writes them to Room, and shows a local notification. This is battery-safe because FCM uses the system-level push channel — no background process required.
Use the Signal Protocol (also used by WhatsApp). Each device generates a key pair on first launch. The public key is uploaded to the server. On first message, the sender fetches the recipient's public key and performs an X3DH key exchange to derive a shared secret. From that point, every message is encrypted on-device using the Double Ratchet algorithm. The server never sees plaintext.
E2E encryption introduces a key distribution problem: what if a user installs the app on a new device? They need to re-establish keys with every contact. WhatsApp handles this by requiring the user to re-verify contacts' safety numbers. This is a UX trade-off you should surface in the interview.
Use Kotlin Multiplatform (KMM). The Repository layer, use cases, data models, and Room queries (via SQLDelight on iOS) can be shared. Only the UI layer (Compose on Android, SwiftUI on iOS) and platform-specific code (WorkManager, FCM) stay separate. This is a strong signal at Staff+ level interviews.
Your answer to this question will be evaluated differently depending on the role you're interviewing for. Here's what each level needs to demonstrate:
- Clean MVVM architecture with Repository pattern
- Room as local DB, Retrofit for REST
- Basic offline support with WorkManager
- FCM for push notifications
- Can articulate delivery receipt states
- Dual-ID / outbox pattern explained clearly
- WebSocket lifecycle tied to app foreground state
- Chunked resumable media upload
- Paging 3 + RemoteMediator for history
- Batched read receipts to reduce traffic
- E2E encryption trade-offs (Signal Protocol)
- KMM for cross-platform code sharing
- Performance monitoring + regression alerting
- Group chat fan-out strategies
- Multi-device session management
These are the most frequently asked follow-up questions in real Chat App system design interviews at Google, Swiggy, Flipkart, and CRED. Each one is a potential 10-minute rabbit hole — know them cold.
Reading from the network directly in the ViewModel breaks offline support, introduces loading states in the UI, and makes the UI depend on network availability. Room acts as a local cache that the UI always reads from — the same query works whether you're online or offline. The Repository syncs Room with the server silently in the background. This is the repository pattern and the foundation of offline-first architecture. The UI is always fast because local DB reads are microseconds, not hundreds of milliseconds.
If you wait for the server to generate an ID before inserting into Room, the user sees a loading spinner after tapping Send — which feels sluggish. The localId (a UUID generated on-device) lets you insert into Room immediately, render the bubble, and send to the server in the background. When the server ACK arrives, you write back the serverId to the same row. The localId also acts as an idempotency key — if the network request is retried, the server ignores duplicate frames with the same localId.
If you call the API directly from the ViewModel and the user closes the app mid-send, the coroutine is cancelled and the message is lost. WorkManager survives process death — it stores the work request in its own SQLite DB and re-executes it when the app restarts or connectivity returns. It also handles NETWORK_CONNECTED constraints automatically, so you don't need to manage connectivity callbacks yourself. For anything that must complete eventually regardless of app lifecycle, WorkManager is the right tool.
Use exponential backoff with jitter: delay = min(2ⁿ × 1000ms, 30000ms) + random(0–1000ms). The jitter prevents thousands of clients from reconnecting simultaneously after a server restart (thundering herd). Tie the connection to the app's foreground state using ProcessLifecycleOwner — connect on ON_START, disconnect on ON_STOP. When backgrounded, rely on FCM push to wake the app instead of keeping a persistent connection. Also register a ConnectivityManager.NetworkCallback to reconnect immediately when network becomes available, rather than waiting for the next backoff interval.
On reconnect: (1) The WebSocket connects and the server pushes any missed messages that arrived while the client was offline. (2) For a 3-day gap, the server may have too many messages to push via WebSocket — instead, the client calls a REST GET /messages?since={lastKnownServerTs} endpoint to fetch the backlog and writes them all to Room. (3) Any outbox messages (PENDING rows in Room) are immediately picked up by WorkManager and sent. The key is storing lastKnownServerTs persistently in DataStore so the sync-from point survives app restarts.
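A sketch of the catch-up step, assuming the server exposes its log keyed by `serverTs` (the names are illustrative; the real call is the REST endpoint above):

```kotlin
data class ServerMsg(val serverId: String, val serverTs: Long)

// Everything after the persisted checkpoint is "missed"; the checkpoint
// advances to the newest fetched message (persist it in DataStore).
fun backlogSince(serverLog: List<ServerMsg>, lastKnownServerTs: Long): Pair<List<ServerMsg>, Long> {
    val missed = serverLog.filter { it.serverTs > lastKnownServerTs }.sortedBy { it.serverTs }
    val newCheckpoint = missed.lastOrNull()?.serverTs ?: lastKnownServerTs
    return missed to newCheckpoint
}
```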
Sending a receipt per message is fine for low-traffic chats but doesn't scale. If a conversation has 100 unread messages, opening it would trigger 100 READ receipt frames simultaneously. Instead, batch them: send a single READ_ACK{ conversationId, upToMessageId } frame when the chat screen becomes visible. The server marks all messages up to that ID as read. This reduces receipt traffic by ~90%. For DELIVERED receipts, send one per incoming message immediately after writing to Room — this is unavoidable since delivery is per-message.
Use INSERT OR IGNORE (Room's OnConflictStrategy.IGNORE) when inserting incoming messages. The localId is the primary key — if the same frame arrives twice (reconnect scenario, server retry), the second insert is silently dropped. On the sender side, the outbox pattern ensures the message is in Room before any network call, so the bubble is never duplicated regardless of how many times WorkManager retries. The key insight: make every write idempotent at the DB level rather than trying to deduplicate at the UI level.
Group chat introduces fan-out: one message must be delivered to N recipients. Two strategies: (1) Fan-out on write (server-side) — when the server receives a message, it immediately pushes to all online group members' WebSocket connections and queues FCM for offline ones. Simple for the client, scales to ~100 members. (2) Fan-out on read (client-side) — the server stores one copy and clients pull. Simpler server, but more complex client sync logic. For the client, group messages add a groupId field and read receipts become per-member (you need to track who has read, not just whether the message was read).
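Fan-out on write can be modelled as a pure function of the group's membership and current presence. This is a toy model of the routing decision, not a server implementation:

```kotlin
data class FanOut(val socketPushes: Set<String>, val fcmQueue: Set<String>)

// Online members get an immediate WebSocket push; offline members are
// queued for FCM so their devices can be woken later.
fun fanOutOnWrite(senderId: String, members: Set<String>, online: Set<String>): FanOut {
    val recipients = members - senderId     // never echo back to the sender
    return FanOut(
        socketPushes = recipients.filter { it in online }.toSet(),
        fcmQueue = recipients.filterNot { it in online }.toSet()
    )
}
```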
Never block the message send on media upload. The flow: (1) Copy image to app-internal storage, compress to ≤300KB. (2) Insert a message row with mediaUri = localPath, status = PENDING — bubble renders immediately from local file. (3) MediaUploadWorker uploads to S3 via a pre-signed URL in the background using chunked multipart upload. Store the last uploaded byte offset in DataStore so uploads resume after interruption. (4) On success, update the row with the CDN URL. Coil swaps the image source transparently. The receiver downloads from CDN on first open and caches to disk.
When the app is killed, the WebSocket is gone. The server detects the disconnect and switches to FCM. Send a data-only push (not a notification push) — this wakes the app's FirebaseMessagingService even when the app is killed. In onMessageReceived(): (1) Connect the WebSocket briefly. (2) Fetch missed messages via REST GET /messages?since=lastTs. (3) Write to Room. (4) Show a local notification using NotificationManager. Use data push not notification push so you control the notification content (unread count, sender name) rather than FCM controlling it. Handle notification grouping for multiple conversations using NotificationCompat.InboxStyle.
Use Paging 3 with RemoteMediator. The PagingSource reads from Room (fast, local). When the user scrolls past the oldest locally cached message, RemoteMediator.load() fires and fetches the next page from REST GET /messages?before={oldestLocalServerId}&limit=50. Write the fetched page to Room. Paging 3 automatically emits the updated list — the UI scrolls smoothly with no manual handling. Use cursor-based pagination (by serverId or serverTs), never offset-based — offset pagination breaks when new messages are inserted during scrolling.
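The cursor invariant can be sketched as a pure function over an in-memory list. The real source is the REST endpoint, but the rule is the same: filter strictly before the cursor, newest first, so concurrent inserts can never shift the window:

```kotlin
data class Msg(val serverId: String, val serverTs: Long)

// beforeTs == null means "first page" (the newest messages).
fun pageBefore(all: List<Msg>, beforeTs: Long?, limit: Int): List<Msg> =
    all.asSequence()
        .filter { beforeTs == null || it.serverTs < beforeTs }
        .sortedByDescending { it.serverTs }
        .take(limit)
        .toList()
```

Contrast with offset pagination: if a new message arrives between page loads, `OFFSET 50` now points at a different row and the user sees a duplicate or a gap. A `serverTs` cursor is immune to this.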
Never use the client timestamp (createdAt) as the canonical ordering key. Client clocks can be wrong by minutes or days. Use serverTs — the timestamp assigned by the server when it persists the message — for ordering. For display, show the client timestamp (so "just now" is accurate), but sort by serverTs. Be careful with the SQL: in SQLite, NULL sorts first under ASC, so a plain ORDER BY serverTs ASC would put PENDING rows (serverTs is null) at the top. Use ORDER BY (serverTs IS NULL), serverTs ASC, localId ASC — this keeps PENDING messages at the bottom, and localId acts as a deterministic tiebreaker.
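The same ordering rule expressed as an in-memory comparator, which is handy for unit-testing the sort logic even though the real ordering lives in the Room query:

```kotlin
data class Row(val localId: String, val serverTs: Long?)

// Sort key 1: synced rows (serverTs != null) before pending ones.
// Sort key 2: canonical server timestamp.
// Sort key 3: localId as a deterministic tiebreaker.
val canonicalOrder = compareBy<Row>(
    { it.serverTs == null },            // false sorts before true
    { it.serverTs ?: Long.MAX_VALUE },
    { it.localId }
)
```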
Use the Signal Protocol (X3DH + Double Ratchet). Key changes: (1) On first launch, generate an identity key pair and a set of one-time pre-keys. Upload public keys to the server's key distribution service. (2) Before the first message to a user, fetch their public keys and run X3DH to derive a shared session key. (3) Encrypt every message on-device before sending. The server only ever sees ciphertext. (4) Room stores encrypted blobs — you decrypt on read, in the ViewModel before mapping to UI models. Major trade-off: key management becomes complex. New device onboarding, key rotation, and message backup all require careful design. The server cannot moderate content.
Typing indicators are ephemeral — never persist them to Room. The flow: (1) When the user types, send a TYPING{ conversationId } WebSocket frame. (2) Throttle the send — emit at most one TYPING frame per 500ms window, not one per keystroke. (3) The server forwards the event to the recipient's WebSocket. (4) On receipt, show the indicator and start a 5-second timer. If another TYPING frame arrives, reset the timer. If it expires, hide the indicator. (5) Send a TYPING_STOP frame when the user clears the input or sends the message. Store the typing state in a simple MutableStateFlow<Boolean> in the ViewModel — no DB involved.
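The receiver-side visibility rule can be written as a pure function of event timestamps, which makes it trivially testable without any timers. The parameter names and the 5-second TTL follow the steps above:

```kotlin
// Visible only if the last TYPING frame is recent and no TYPING_STOP
// has been seen since it.
fun isTypingVisible(
    lastTypingAt: Long?,
    lastStopAt: Long?,
    now: Long,
    ttlMs: Long = 5_000
): Boolean {
    if (lastTypingAt == null) return false
    if (lastStopAt != null && lastStopAt >= lastTypingAt) return false // explicit stop wins
    return now - lastTypingAt < ttlMs                                  // otherwise expire after TTL
}
```

In the ViewModel, a ticking clock (or a delayed coroutine) re-evaluates this function and pushes the result into the `MutableStateFlow<Boolean>`.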
Each device maintains its own WebSocket connection identified by a deviceId. The server maintains a mapping of userId → [deviceId1, deviceId2, ...]. When a message is sent, the server fans out to all active WebSocket connections for that user. For offline devices, FCM is registered per-device so each gets its own push token. The client-side Room DB is per-device — messages sync independently on each device using lastSyncedTs. The trickiest part: read receipts. If a user reads a message on their tablet, the phone should also mark it as read. The server broadcasts a READ_SYNC event to all other devices of the same user.
The conversation list is a Room query with a reactive Flow. Use a Room @Query with a JOIN between the conversations and messages tables: `SELECT c.*, m.text AS lastMessageText FROM conversations c LEFT JOIN messages m ON m.localId = c.lastMessageId ORDER BY c.updatedAt DESC`. Every time a new message is inserted, Room emits on this Flow automatically — the conversation list reorders and shows the new preview without any manual refresh. Unread count is a column on the conversations table, incremented on incoming message insert and reset to 0 when the chat screen opens.
Three layers: (1) Unit tests — Test SendMessageUseCase with a fake Repository. Test ChatViewModel using runTest + Turbine to assert StateFlow emissions. Test ChatRepository with a fake DAO and fake WebSocketManager. (2) Integration tests — Test Room DAO with an in-memory Room database (Room.inMemoryDatabaseBuilder) using runTest. Test the outbox flow with a TestListenableWorkerBuilder for WorkManager. (3) UI tests — Use MockWebServer (OkHttp) to simulate WebSocket frames and assert Compose UI reactions. Use Hilt's @UninstallModules + @TestInstallIn to replace real dependencies with fakes.
Delete for me: Soft-delete in the local Room DB — add a deletedForMe: Boolean flag. The Room query filters these out. Never actually delete the row (it may be needed for receipt sync). Delete for everyone: Send a DELETE{ serverId } WebSocket frame. The server marks the message as deleted in its DB and fans out a DELETE event to all recipients' devices. On receipt, the client sets deletedForEveryone = true in Room. The UI shows "This message was deleted" in place of the content. The content itself can be nulled out in Room. Hard limit: WhatsApp allows delete-for-everyone only within 60 hours — enforce this on the server.
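The server-side window check is a one-liner worth writing down explicitly (the 60-hour figure follows the WhatsApp limit cited above):

```kotlin
// Delete-for-everyone is only valid within a fixed window of the original send.
const val DELETE_WINDOW_MS: Long = 60L * 60 * 60 * 1000  // 60 hours

fun canDeleteForEveryone(sentAtMs: Long, nowMs: Long): Boolean =
    nowMs - sentAtMs <= DELETE_WINDOW_MS
```

Enforce this on the server, not just in the client UI — a modified client could otherwise send DELETE frames for arbitrarily old messages.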
The Android client itself doesn't change at scale — it still manages one WebSocket and one Room DB. The client-side concerns at scale are: (1) Reconnect storms — exponential backoff with jitter prevents all 1M clients hitting the server simultaneously after an outage. (2) Battery efficiency — the foreground/background WebSocket lifecycle pattern keeps battery impact minimal. (3) DB growth — implement a message retention policy: delete messages older than 30 days from Room (keep on server). Use a periodic WorkManager task for DB housekeeping. (4) Memory — Paging 3 ensures only the visible window of messages is in memory, not all 10,000.
Use Kotlin Multiplatform (KMM). The shareable layer includes: Repository, Use Cases, data models, and the outbox pattern logic. Room is replaced by SQLDelight (which generates type-safe Kotlin code from SQL for both Android and iOS). kotlinx.coroutines works on both platforms. The WebSocket client can use Ktor's WebSocket client (multiplatform). What stays platform-specific: UI (Compose on Android, SwiftUI on iOS), WorkManager (use BackgroundTasks on iOS), and FCM (use APNs on iOS). This approach typically reduces business logic duplication by 60–70%, but adds build complexity and requires the team to be comfortable with KMM's maturity limitations.