fix(connectors): contentDeferred pattern + validation fixes across all connectors by waleedlatif1 · Pull Request #3793 · simstudioai/sim

waleedlatif1 · 2026-03-26T22:09:44Z

Summary

Converted all 10 connectors to contentDeferred: true pattern — listDocuments now returns lightweight metadata stubs (no content download), and content is fetched lazily via getDocument only for new/changed documents. This prevents task timeouts on large syncs (root cause of Google Drive, Fireflies failures).
Ran full validation audit on all 10 connectors against their respective API docs. Fixed all critical and warning issues found.
Updated add-connector and validate-connector skills to document the contentDeferred best practice.

Connectors converted to contentDeferred

Connector	contentHash strategy	Notable fixes
Google Drive	`gdrive:${id}:${modifiedTime}`	orderBy, precise maxFiles, byte-length truncation
OneDrive	`onedrive:${id}:${lastModifiedDateTime}`	orderBy for deterministic pagination
SharePoint	`sharepoint:${id}:${lastModifiedDateTime}`	byte-length size check
Dropbox	`dropbox:${id}:${content_hash}`	uses Dropbox native content_hash
Notion	`notion:${pageId}:${last_edited_time}`	code/equation blocks, empty page fallback, concurrency reduction
Confluence	`confluence:${pageId}:${versionNumber}`	syncContext cloudId caching, label concurrency reduction
Gmail	`gmail:${threadId}:${historyId}`	joinTagArray for labels
Obsidian	stub: `obsidian:stub:${path}:${syncRunId}`, doc: `obsidian:${path}:${mtime}`	forced re-fetch per sync (API lacks mtime in listing)
Evernote	`evernote:${guid}:${updated}`	retryOptions threaded through Thrift RPC calls
GitHub	`git-sha:${sha}` (pre-existing)	added contentDeferred: false to getDocument

Infrastructure changes

sync-engine.ts: added syncRunId to syncContext for Obsidian change detection
confluence/utils.ts: replaced raw fetch() with fetchWithRetry, added retryOptions parameter
oauth.ts: added supportsRefreshTokenRotation: false for Dropbox

Test plan

Trigger sync for Google Drive connector — should complete without timeout
Trigger sync for each converted connector — documents should appear without refresh
Verify lastSyncDocCount matches actual KB document count after sync
Verify previously stuck "processing" documents get retried on next sync
Run bun run lint — passes clean
Run bunx tsc --noEmit — no errors in changed files

vercel · 2026-03-26T22:09:49Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Mar 27, 2026 1:49am

cursor · 2026-03-26T22:09:50Z

PR Summary

Medium Risk
Touches the sync engine and multiple connector implementations, changing listing/hydration and deletion behavior; regressions could cause missed updates, extra reprocessing, or skipped deletions if hashes/metadata are wrong.

Overview
Standardizes the contentDeferred connector pattern across the repo. listDocuments for several connectors now returns lightweight metadata stubs (content: '', contentDeferred: true) with metadata-derived contentHash values, and getDocument fetches/returns full content with contentDeferred: false.

Improves sync execution safety and reliability. The sync engine now de-dupes externalIds, hydrates deferred docs while preserving consistent hashes/metadata, enqueues processing per batch (instead of one large enqueue at end), and adds safeguards around deletion reconciliation (skip when listing was capped; skip large deletion spikes unless full sync). Documentation/checklists for adding/validating connectors were updated to require content deferral and metadata-based hashing, and OAuth/utility fixes include fetchWithRetry in Confluence cloud-id lookup and explicit Dropbox refresh token rotation support config.

^{Written by Cursor Bugbot for commit fcde037. Configure here.}

greptile-apps · 2026-03-26T22:13:13Z

Greptile Summary

This PR converts all 10 knowledge-base connectors from an eager content-download model to a lazy contentDeferred: true pattern where listing returns lightweight metadata stubs and content is fetched on-demand via getDocument only for new or changed documents. The sync engine gains per-batch hydration, post-hydration hash early-out, a 50% deletion safety threshold, and correct listingCapped gating of deletion reconciliation. All previously flagged issues have been resolved.

Confidence Score: 4/5

PR is safe to merge; all previously flagged critical concerns are resolved and only minor P2 efficiency observations remain.

All prior review concerns (Evernote retryOptions, listingCapped propagation, Confluence cloudId, deletion reconciliation) are correctly addressed. The contentDeferred pattern is implemented consistently. Score is 4 rather than 5 because the Gmail historyId concern could cause an invisible performance regression worth a quick confirmation.

apps/sim/connectors/gmail/gmail.ts — verify historyId is present in threads.list responses, or the skip-getDocument optimisation never fires for Gmail.

Important Files Changed

Filename	Overview
apps/sim/lib/knowledge/connectors/sync-engine.ts	Major refactor: per-batch deferred hydration, syncRunId injection, post-hydration hash early-out, 50% deletion safety threshold, and listingCapped-gated deletion reconciliation. Logic is correct.
apps/sim/connectors/gmail/gmail.ts	Converted to contentDeferred; historyId availability in threads.list should be verified — if absent, getDocument is called for all threads every sync (correct but unoptimised).
apps/sim/connectors/google-drive/google-drive.ts	Clean conversion; adds orderBy, precise effectivePageSize, and correct byte-length UTF-8 truncation. listingCapped set on limit hit.
apps/sim/connectors/confluence/confluence.ts	Removes body.storage from listing, reduces label concurrency, fetches labels only in getDocument; blogpost fallback and cloudId caching correct.
apps/sim/connectors/notion/notion.ts	Adds code/equation block rendering, empty-page fallback to title, archived-page early return; listingCapped set in all three listing paths.
apps/sim/connectors/evernote/evernote.ts	retryOptions correctly threaded to all Thrift RPC callers; stub uses evernote:guid:updated hash; empty-note early return added.
apps/sim/connectors/obsidian/obsidian.ts	syncRunId-based stub hash forces per-sync hydration intentionally; getDocument populates full metadata and mtime-based hash for correct post-hydration skip.
apps/sim/connectors/dropbox/dropbox.ts	Uses Dropbox native content_hash with server_modified fallback; clean stub conversion; listingCapped added.
apps/sim/connectors/onedrive/onedrive.ts	Added $orderby=lastModifiedDateTime desc to initial and subfolder URLs; Microsoft's @odata.nextLink preserves query params so continuation pages are consistent.
apps/sim/connectors/sharepoint/sharepoint.ts	Clean stub conversion; correct byte-length truncation; listingCapped added; old double-truncation bug removed.
apps/sim/tools/confluence/utils.ts	Replaced raw fetch() with fetchWithRetry; added optional retryOptions parameter; backward-compatible.
apps/sim/lib/oauth/oauth.ts	Adds supportsRefreshTokenRotation: false for Dropbox to prevent refresh token double-invalidation.
apps/sim/connectors/github/github.ts	Minimal change: adds contentDeferred: false to getDocument return and stubs _syncContext parameter.

Sequence Diagram

sequenceDiagram
    participant SE as sync-engine
    participant C as Connector
    participant API as External API
    participant DB as Database
    participant Q as ProcessingQueue

    SE->>C: listDocuments(cursor, syncContext)
    C->>API: List metadata (no content download)
    API-->>C: [{id, title, contentHash_stub}]
    C-->>SE: ExternalDocumentList (contentDeferred: true)

    SE->>SE: Build pendingOps (add/update via stub hash vs stored hash)

    loop SYNC_BATCH_SIZE chunks
        SE->>SE: Split into deferredOps + readyOps
        par Hydrate deferred docs
            SE->>C: getDocument(id, syncContext)
            C->>API: Fetch full content for single doc
            API-->>C: Full document
            C-->>SE: ExternalDocument (contentDeferred: false)
        end
        SE->>SE: Post-hydration hash check (skip re-embed if hash unchanged)
        SE->>DB: addDocument / updateDocument
        DB-->>SE: DocumentData[]
        SE->>Q: processDocumentsWithQueue(batchDocs)
    end

    SE->>SE: Deletion reconciliation (skip if listingCapped and not fullSync)
    SE->>DB: hardDeleteDocuments(removedIds)

Comments Outside Diff (1)

apps/sim/connectors/gmail/gmail.ts, line 287-301 (link)

historyId may not be present in threads.list response

The Gmail users.threads.list API returns Thread objects with id and snippet in list responses; historyId is typically only populated when you call threads.get. Because historyId is typed as optional here, stub hashes will consistently resolve to gmail:${id}: (empty suffix), which will never match the stored hash produced by getDocument (gmail:${id}:${realHistoryId}).

The practical consequence: every thread is classified as a pending update on every sync, so getDocument is called for all threads regardless of whether they changed. The post-hydration hash comparison still prevents re-embedding (docsUnchanged is incremented correctly), so correctness is maintained — but the intended optimisation of skipping getDocument for unchanged threads never fires.

If historyId is indeed absent from list responses, using a constant sentinel like 'pending' would be cleaner than an empty-suffix fallback, making it explicit that the post-hydration check drives all skip logic.

_{Reviews (11): Last reviewed commit: "fix(evernote): thread retryOptions throu..." | Re-trigger Greptile}

…outs

waleedlatif1 · 2026-03-26T22:15:36Z

@greptile

waleedlatif1 · 2026-03-26T22:15:39Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…d fix validation issues All 10 connectors now use contentDeferred: true in listDocuments, returning lightweight metadata stubs instead of downloading content during listing. Content is fetched lazily via getDocument only for new/changed documents, preventing Trigger.dev task timeouts on large syncs. Connector-specific fixes from validation audit: - Google Drive: metadata-based contentHash, orderBy for deterministic pagination, precise maxFiles, byte-length size check with truncation warning - OneDrive: metadata-based contentHash, orderBy for deterministic pagination - SharePoint: metadata-based contentHash, byte-length size check - Dropbox: metadata-based contentHash using content_hash field - Notion: code/equation block extraction, empty page fallback to title, reduced CHILD_PAGE_CONCURRENCY to 5, syncContext parameter - Confluence: syncContext caching for cloudId, reduced label concurrency to 5 - Gmail: use joinTagArray for label tags - Obsidian: syncRunId-based stub hash for forced re-fetch, mtime-based hash in getDocument, .trim() on vaultUrl, lightweight validateConfig - Evernote: retryOptions threaded through apiFindNotesMetadata and apiGetNote - GitHub: added contentDeferred: false to getDocument, syncContext parameter Infrastructure: - sync-engine: added syncRunId to syncContext for Obsidian change detection - confluence/utils: replaced raw fetch with fetchWithRetry, added retryOptions - oauth: added supportsRefreshTokenRotation: false for Dropbox - Updated add-connector and validate-connector skills with contentDeferred docs Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-26T23:00:30Z

@greptile

waleedlatif1 · 2026-03-26T23:00:34Z

@cursor review

apps/sim/connectors/evernote/evernote.ts

apps/sim/connectors/gmail/gmail.ts

apps/sim/connectors/obsidian/obsidian.ts

apps/sim/connectors/confluence/confluence.ts

apps/sim/connectors/google-drive/google-drive.ts

…tions, UTF-8 safety - Sync engine: merge metadata from getDocument during deferred hydration, so Gmail/Obsidian/Confluence tags and metadata survive the stub→full transition - Evernote: pass retryOptions {retries:3, backoff:500} from listDocuments and getDocument callers into apiFindNotesMetadata and apiGetNote - Google Drive + SharePoint: safe UTF-8 truncation that walks back to the last complete character boundary instead of splitting multi-byte chars Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-26T23:16:25Z

@greptile

waleedlatif1 · 2026-03-26T23:16:29Z

@cursor review

maxRetries/initialDelayMs instead of retries/backoff to match the RetryOptions interface from lib/knowledge/documents/utils. Co-Authored-By: Claude Opus 4.6 <[email protected]>

apps/sim/connectors/gmail/gmail.ts

apps/sim/connectors/obsidian/obsidian.ts

…s after hydration - Merge title from getDocument during deferred hydration so Gmail documents get the email Subject header instead of the snippet text - After hydration, compare the hydrated contentHash against the stored DB hash — if they match, skip the update. This prevents Obsidian (and any connector with a force-refresh stub hash) from re-uploading and re-processing unchanged documents every sync Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-26T23:32:04Z

@cursor review

waleedlatif1 · 2026-03-26T23:32:07Z

@greptile

…merge sourceUrl Three sync engine gaps identified during audit: 1. Duplicate externalId guard: if a connector returns the same externalId across pages (pagination overlap), skip the second occurrence to prevent unique constraint violations on add and double-uploads on update. 2. Deletion reconciliation: previously required explicit fullSync or syncMode='full', meaning docs deleted from the source accumulated in the KB forever. Now runs on all non-incremental syncs (which return ALL docs). Includes a safety threshold: if >50% of existing docs (and >5 docs) would be deleted, skip and warn — protects against partial listing failures. Explicit fullSync bypasses the threshold. 3. sourceUrl merge: hydration now picks up sourceUrl from getDocument, falling back to the stub's sourceUrl if getDocument doesn't set one. Co-Authored-By: Claude Opus 4.6 <[email protected]>

apps/sim/lib/knowledge/connectors/sync-engine.ts

apps/sim/connectors/confluence/confluence.ts

apps/sim/connectors/gmail/gmail.ts

…e maxFiles guard - Confluence: use `version?.number` directly (undefined) in metadata instead of `?? ''` (empty string) to prevent Number('') = 0 passing NaN check in mapTags. Hash still uses `?? ''` for string interpolation. - Google Drive: add early return when previouslyFetched >= maxFiles to prevent effectivePageSize <= 0 which violates the API's pageSize requirement (1-1000). Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-27T01:06:15Z

@cursor review

waleedlatif1 · 2026-03-27T01:06:16Z

@greptile

apps/sim/connectors/confluence/confluence.ts

apps/sim/lib/knowledge/connectors/sync-engine.ts

apps/sim/connectors/obsidian/obsidian.ts

…iation - Confluence: fetchLabelsForPages now tries both /pages/{id}/labels and /blogposts/{id}/labels, preventing label loss when getDocument hydrates blogpost content (previously returned empty labels on 404). - Sync engine: skip deletion reconciliation when listing was capped (maxFiles/maxThreads). Connectors signal this via syncContext.listingCapped. Prevents incorrect deletion of docs beyond the cap that still exist in source. fullSync override still forces deletion for explicit cleanup. - Google Drive & Gmail: set syncContext.listingCapped = true when cap is hit. Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-27T01:23:26Z

@greptile

waleedlatif1 · 2026-03-27T01:23:30Z

@cursor review

… caps OneDrive, Dropbox, SharePoint, Confluence (v2 + CQL), and Notion (3 listing functions) now set syncContext.listingCapped = true when their respective maxFiles/maxPages limit is hit. Without this, the sync engine's deletion reconciliation would run against an incomplete listing and incorrectly hard-delete documents that exist in the source but fell outside the cap window. Co-Authored-By: Claude Opus 4.6 <[email protected]>

apps/sim/connectors/dropbox/dropbox.ts

waleedlatif1 · 2026-03-27T01:32:40Z

All review comments have been addressed and resolved:

listingCapped flag (50182d1): Added syncContext.listingCapped = true to all 7 connectors that compute hitLimit (OneDrive, Dropbox, SharePoint, Confluence v2+CQL, Notion x3). Deletion reconciliation now correctly skips capped listings.
Confluence blogpost labels (6ffc0e3): fetchLabelsForPages now tries both /pages/{id}/labels and /blogposts/{id}/labels.
Confluence version metadata (bc73154): Uses undefined instead of empty string for metadata, preserving correct Number.isNaN filtering in mapTags.
Google Drive maxFiles guard (bc73154): Early return when previouslyFetched >= maxFiles.
Obsidian per-sync fetch: Intentional trade-off — API lacks mtime in directory listing, post-hydration hash skip avoids expensive re-embedding.

All threads resolved. Retriggering reviews.

waleedlatif1 · 2026-03-27T01:32:44Z

@cursor review

waleedlatif1 · 2026-03-27T01:32:45Z

@greptile

apps/sim/connectors/evernote/evernote.ts

…ebooks All calls to apiListTags and apiListNotebooks in both listDocuments and getDocument now pass retryOptions for consistent retry protection across all Thrift RPC calls. Co-Authored-By: Claude Opus 4.6 <[email protected]>

waleedlatif1 · 2026-03-27T01:49:41Z

Fixed the Evernote retryOptions gap (fcde037) — all apiListTags and apiListNotebooks calls now pass retry options consistently. All review comments across all rounds have been addressed and resolved.

waleedlatif1 · 2026-03-27T01:49:46Z

@cursor review

waleedlatif1 · 2026-03-27T01:49:47Z

@greptile

apps/sim/connectors/confluence/confluence.ts

waleedlatif1 · 2026-03-27T02:04:11Z

Resolved the Confluence labels comment — label-only edits not being detected is a pre-existing limitation of Confluence's version-based change detection (labels don't increment version.number), not a regression from this PR. The v2 stub labels: [] is correct because labels are fetched and merged during hydration via the already-fixed blogpost-aware fetchLabelsForPages. All review threads are resolved.

waleedlatif1 · 2026-03-27T02:04:14Z

@cursor review

waleedlatif1 · 2026-03-27T02:04:15Z

@greptile

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

fix(knowledge): enqueue connector docs per-batch to survive sync time…

4c19754

…outs

waleedlatif1 force-pushed the waleedlatif1/fix-connector-sync-timeout branch from 32a04dc to 4c19754 Compare March 26, 2026 22:14

vercel bot temporarily deployed to Preview March 26, 2026 22:14 Inactive

cursor bot reviewed Mar 26, 2026

View reviewed changes

waleedlatif1 changed the title ~~fix(knowledge): enqueue connector docs per-batch to survive sync timeouts~~ fix(connectors): contentDeferred pattern + validation fixes across all connectors Mar 26, 2026

vercel bot deployed to Preview March 26, 2026 22:58 View deployment

greptile-apps bot reviewed Mar 26, 2026

View reviewed changes

apps/sim/connectors/evernote/evernote.ts Outdated Show resolved Hide resolved

cursor bot reviewed Mar 26, 2026

View reviewed changes

apps/sim/connectors/gmail/gmail.ts Show resolved Hide resolved

apps/sim/connectors/obsidian/obsidian.ts Show resolved Hide resolved

apps/sim/connectors/confluence/confluence.ts Show resolved Hide resolved

apps/sim/connectors/google-drive/google-drive.ts Outdated Show resolved Hide resolved

vercel bot temporarily deployed to Preview March 26, 2026 23:14 Inactive

fix(evernote): use correct RetryOptions property names

eb78814

maxRetries/initialDelayMs instead of retries/backoff to match the RetryOptions interface from lib/knowledge/documents/utils. Co-Authored-By: Claude Opus 4.6 <[email protected]>

vercel bot temporarily deployed to Preview March 26, 2026 23:19 Inactive

cursor bot reviewed Mar 26, 2026

View reviewed changes

apps/sim/connectors/gmail/gmail.ts Show resolved Hide resolved

apps/sim/connectors/obsidian/obsidian.ts Show resolved Hide resolved

vercel bot temporarily deployed to Preview March 26, 2026 23:31 Inactive

vercel bot temporarily deployed to Preview March 26, 2026 23:40 Inactive

cursor bot reviewed Mar 26, 2026

View reviewed changes

apps/sim/lib/knowledge/connectors/sync-engine.ts Show resolved Hide resolved

apps/sim/connectors/confluence/confluence.ts Show resolved Hide resolved

apps/sim/connectors/gmail/gmail.ts Show resolved Hide resolved

vercel bot temporarily deployed to Preview March 27, 2026 01:05 Inactive