Skip to content

fix(connectors): contentDeferred pattern + validation fixes across all connectors#3793

Merged
waleedlatif1 merged 11 commits intostagingfrom
waleedlatif1/fix-connector-sync-timeout
Mar 27, 2026
Merged

fix(connectors): contentDeferred pattern + validation fixes across all connectors#3793
waleedlatif1 merged 11 commits intostagingfrom
waleedlatif1/fix-connector-sync-timeout

Conversation

@waleedlatif1
Copy link
Collaborator

@waleedlatif1 waleedlatif1 commented Mar 26, 2026

Summary

  • Converted all 10 connectors to contentDeferred: true patternlistDocuments now returns lightweight metadata stubs (no content download), and content is fetched lazily via getDocument only for new/changed documents. This prevents task timeouts on large syncs (root cause of Google Drive, Fireflies failures).
  • Ran full validation audit on all 10 connectors against their respective API docs. Fixed all critical and warning issues found.
  • Updated add-connector and validate-connector skills to document the contentDeferred best practice.

Connectors converted to contentDeferred

Connector contentHash strategy Notable fixes
Google Drive gdrive:${id}:${modifiedTime} orderBy, precise maxFiles, byte-length truncation
OneDrive onedrive:${id}:${lastModifiedDateTime} orderBy for deterministic pagination
SharePoint sharepoint:${id}:${lastModifiedDateTime} byte-length size check
Dropbox dropbox:${id}:${content_hash} uses Dropbox native content_hash
Notion notion:${pageId}:${last_edited_time} code/equation blocks, empty page fallback, concurrency reduction
Confluence confluence:${pageId}:${versionNumber} syncContext cloudId caching, label concurrency reduction
Gmail gmail:${threadId}:${historyId} joinTagArray for labels
Obsidian stub: obsidian:stub:${path}:${syncRunId}, doc: obsidian:${path}:${mtime} forced re-fetch per sync (API lacks mtime in listing)
Evernote evernote:${guid}:${updated} retryOptions threaded through Thrift RPC calls
GitHub git-sha:${sha} (pre-existing) added contentDeferred: false to getDocument

Infrastructure changes

  • sync-engine.ts: added syncRunId to syncContext for Obsidian change detection
  • confluence/utils.ts: replaced raw fetch() with fetchWithRetry, added retryOptions parameter
  • oauth.ts: added supportsRefreshTokenRotation: false for Dropbox

Test plan

  • Trigger sync for Google Drive connector — should complete without timeout
  • Trigger sync for each converted connector — documents should appear without refresh
  • Verify lastSyncDocCount matches actual KB document count after sync
  • Verify previously stuck "processing" documents get retried on next sync
  • Run bun run lint — passes clean
  • Run bunx tsc --noEmit — no errors in changed files

@vercel
Copy link

vercel bot commented Mar 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 27, 2026 1:49am

Request Review

@cursor
Copy link

cursor bot commented Mar 26, 2026

PR Summary

Medium Risk
Touches the sync engine and multiple connector implementations, changing listing/hydration and deletion behavior; regressions could cause missed updates, extra reprocessing, or skipped deletions if hashes/metadata are wrong.

Overview
Standardizes the contentDeferred connector pattern across the repo. listDocuments for several connectors now returns lightweight metadata stubs (content: '', contentDeferred: true) with metadata-derived contentHash values, and getDocument fetches/returns full content with contentDeferred: false.

Improves sync execution safety and reliability. The sync engine now de-dupes externalIds, hydrates deferred docs while preserving consistent hashes/metadata, enqueues processing per batch (instead of one large enqueue at end), and adds safeguards around deletion reconciliation (skip when listing was capped; skip large deletion spikes unless full sync). Documentation/checklists for adding/validating connectors were updated to require content deferral and metadata-based hashing, and OAuth/utility fixes include fetchWithRetry in Confluence cloud-id lookup and explicit Dropbox refresh token rotation support config.

Written by Cursor Bugbot for commit fcde037. Configure here.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR converts all 10 knowledge-base connectors from an eager content-download model to a lazy contentDeferred: true pattern where listing returns lightweight metadata stubs and content is fetched on-demand via getDocument only for new or changed documents. The sync engine gains per-batch hydration, post-hydration hash early-out, a 50% deletion safety threshold, and correct listingCapped gating of deletion reconciliation. All previously flagged issues have been resolved.

Confidence Score: 4/5

PR is safe to merge; all previously flagged critical concerns are resolved and only minor P2 efficiency observations remain.

All prior review concerns (Evernote retryOptions, listingCapped propagation, Confluence cloudId, deletion reconciliation) are correctly addressed. The contentDeferred pattern is implemented consistently. Score is 4 rather than 5 because the Gmail historyId concern could cause an invisible performance regression worth a quick confirmation.

apps/sim/connectors/gmail/gmail.ts — verify historyId is present in threads.list responses, or the skip-getDocument optimisation never fires for Gmail.

Important Files Changed

Filename Overview
apps/sim/lib/knowledge/connectors/sync-engine.ts Major refactor: per-batch deferred hydration, syncRunId injection, post-hydration hash early-out, 50% deletion safety threshold, and listingCapped-gated deletion reconciliation. Logic is correct.
apps/sim/connectors/gmail/gmail.ts Converted to contentDeferred; historyId availability in threads.list should be verified — if absent, getDocument is called for all threads every sync (correct but unoptimised).
apps/sim/connectors/google-drive/google-drive.ts Clean conversion; adds orderBy, precise effectivePageSize, and correct byte-length UTF-8 truncation. listingCapped set on limit hit.
apps/sim/connectors/confluence/confluence.ts Removes body.storage from listing, reduces label concurrency, fetches labels only in getDocument; blogpost fallback and cloudId caching correct.
apps/sim/connectors/notion/notion.ts Adds code/equation block rendering, empty-page fallback to title, archived-page early return; listingCapped set in all three listing paths.
apps/sim/connectors/evernote/evernote.ts retryOptions correctly threaded to all Thrift RPC callers; stub uses evernote:guid:updated hash; empty-note early return added.
apps/sim/connectors/obsidian/obsidian.ts syncRunId-based stub hash forces per-sync hydration intentionally; getDocument populates full metadata and mtime-based hash for correct post-hydration skip.
apps/sim/connectors/dropbox/dropbox.ts Uses Dropbox native content_hash with server_modified fallback; clean stub conversion; listingCapped added.
apps/sim/connectors/onedrive/onedrive.ts Added $orderby=lastModifiedDateTime desc to initial and subfolder URLs; Microsoft's @odata.nextLink preserves query params so continuation pages are consistent.
apps/sim/connectors/sharepoint/sharepoint.ts Clean stub conversion; correct byte-length truncation; listingCapped added; old double-truncation bug removed.
apps/sim/tools/confluence/utils.ts Replaced raw fetch() with fetchWithRetry; added optional retryOptions parameter; backward-compatible.
apps/sim/lib/oauth/oauth.ts Adds supportsRefreshTokenRotation: false for Dropbox to prevent refresh token double-invalidation.
apps/sim/connectors/github/github.ts Minimal change: adds contentDeferred: false to getDocument return and stubs _syncContext parameter.

Sequence Diagram

sequenceDiagram
    participant SE as sync-engine
    participant C as Connector
    participant API as External API
    participant DB as Database
    participant Q as ProcessingQueue

    SE->>C: listDocuments(cursor, syncContext)
    C->>API: List metadata (no content download)
    API-->>C: [{id, title, contentHash_stub}]
    C-->>SE: ExternalDocumentList (contentDeferred: true)

    SE->>SE: Build pendingOps (add/update via stub hash vs stored hash)

    loop SYNC_BATCH_SIZE chunks
        SE->>SE: Split into deferredOps + readyOps
        par Hydrate deferred docs
            SE->>C: getDocument(id, syncContext)
            C->>API: Fetch full content for single doc
            API-->>C: Full document
            C-->>SE: ExternalDocument (contentDeferred: false)
        end
        SE->>SE: Post-hydration hash check (skip re-embed if hash unchanged)
        SE->>DB: addDocument / updateDocument
        DB-->>SE: DocumentData[]
        SE->>Q: processDocumentsWithQueue(batchDocs)
    end

    SE->>SE: Deletion reconciliation (skip if listingCapped and not fullSync)
    SE->>DB: hardDeleteDocuments(removedIds)
Loading

Comments Outside Diff (1)

  1. apps/sim/connectors/gmail/gmail.ts, line 287-301 (link)

    historyId may not be present in threads.list response

    The Gmail users.threads.list API returns Thread objects with id and snippet in list responses; historyId is typically only populated when you call threads.get. Because historyId is typed as optional here, stub hashes will consistently resolve to gmail:${id}: (empty suffix), which will never match the stored hash produced by getDocument (gmail:${id}:${realHistoryId}).

    The practical consequence: every thread is classified as a pending update on every sync, so getDocument is called for all threads regardless of whether they changed. The post-hydration hash comparison still prevents re-embedding (docsUnchanged is incremented correctly), so correctness is maintained — but the intended optimisation of skipping getDocument for unchanged threads never fires.

    If historyId is indeed absent from list responses, using a constant sentinel like 'pending' would be cleaner than an empty-suffix fallback, making it explicit that the post-hydration check drives all skip logic.

Reviews (11): Last reviewed commit: "fix(evernote): thread retryOptions throu..." | Re-trigger Greptile

@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/fix-connector-sync-timeout branch from 32a04dc to 4c19754 Compare March 26, 2026 22:14
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…d fix validation issues

All 10 connectors now use contentDeferred: true in listDocuments, returning
lightweight metadata stubs instead of downloading content during listing.
Content is fetched lazily via getDocument only for new/changed documents,
preventing Trigger.dev task timeouts on large syncs.

Connector-specific fixes from validation audit:
- Google Drive: metadata-based contentHash, orderBy for deterministic pagination,
  precise maxFiles, byte-length size check with truncation warning
- OneDrive: metadata-based contentHash, orderBy for deterministic pagination
- SharePoint: metadata-based contentHash, byte-length size check
- Dropbox: metadata-based contentHash using content_hash field
- Notion: code/equation block extraction, empty page fallback to title,
  reduced CHILD_PAGE_CONCURRENCY to 5, syncContext parameter
- Confluence: syncContext caching for cloudId, reduced label concurrency to 5
- Gmail: use joinTagArray for label tags
- Obsidian: syncRunId-based stub hash for forced re-fetch, mtime-based hash
  in getDocument, .trim() on vaultUrl, lightweight validateConfig
- Evernote: retryOptions threaded through apiFindNotesMetadata and apiGetNote
- GitHub: added contentDeferred: false to getDocument, syncContext parameter

Infrastructure:
- sync-engine: added syncRunId to syncContext for Obsidian change detection
- confluence/utils: replaced raw fetch with fetchWithRetry, added retryOptions
- oauth: added supportsRefreshTokenRotation: false for Dropbox
- Updated add-connector and validate-connector skills with contentDeferred docs

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1 waleedlatif1 changed the title fix(knowledge): enqueue connector docs per-batch to survive sync timeouts fix(connectors): contentDeferred pattern + validation fixes across all connectors Mar 26, 2026
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

…tions, UTF-8 safety

- Sync engine: merge metadata from getDocument during deferred hydration,
  so Gmail/Obsidian/Confluence tags and metadata survive the stub→full transition
- Evernote: pass retryOptions {retries:3, backoff:500} from listDocuments and
  getDocument callers into apiFindNotesMetadata and apiGetNote
- Google Drive + SharePoint: safe UTF-8 truncation that walks back to the last
  complete character boundary instead of splitting multi-byte chars

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

maxRetries/initialDelayMs instead of retries/backoff to match the
RetryOptions interface from lib/knowledge/documents/utils.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…s after hydration

- Merge title from getDocument during deferred hydration so Gmail
  documents get the email Subject header instead of the snippet text
- After hydration, compare the hydrated contentHash against the stored
  DB hash — if they match, skip the update. This prevents Obsidian
  (and any connector with a force-refresh stub hash) from re-uploading
  and re-processing unchanged documents every sync

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

…merge sourceUrl

Three sync engine gaps identified during audit:

1. Duplicate externalId guard: if a connector returns the same externalId
   across pages (pagination overlap), skip the second occurrence to prevent
   unique constraint violations on add and double-uploads on update.

2. Deletion reconciliation: previously required explicit fullSync or
   syncMode='full', meaning docs deleted from the source accumulated in
   the KB forever. Now runs on all non-incremental syncs (which return
   ALL docs). Includes a safety threshold: if >50% of existing docs
   (and >5 docs) would be deleted, skip and warn — protects against
   partial listing failures. Explicit fullSync bypasses the threshold.

3. sourceUrl merge: hydration now picks up sourceUrl from getDocument,
   falling back to the stub's sourceUrl if getDocument doesn't set one.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e maxFiles guard

- Confluence: use `version?.number` directly (undefined) in metadata instead
  of `?? ''` (empty string) to prevent Number('') = 0 passing NaN check in
  mapTags. Hash still uses `?? ''` for string interpolation.
- Google Drive: add early return when previouslyFetched >= maxFiles to prevent
  effectivePageSize <= 0 which violates the API's pageSize requirement (1-1000).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

…iation

- Confluence: fetchLabelsForPages now tries both /pages/{id}/labels and
  /blogposts/{id}/labels, preventing label loss when getDocument hydrates
  blogpost content (previously returned empty labels on 404).
- Sync engine: skip deletion reconciliation when listing was capped
  (maxFiles/maxThreads). Connectors signal this via syncContext.listingCapped.
  Prevents incorrect deletion of docs beyond the cap that still exist in source.
  fullSync override still forces deletion for explicit cleanup.
- Google Drive & Gmail: set syncContext.listingCapped = true when cap is hit.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

… caps

OneDrive, Dropbox, SharePoint, Confluence (v2 + CQL), and Notion (3 listing
functions) now set syncContext.listingCapped = true when their respective
maxFiles/maxPages limit is hit. Without this, the sync engine's deletion
reconciliation would run against an incomplete listing and incorrectly
hard-delete documents that exist in the source but fell outside the cap window.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

All review comments have been addressed and resolved:

  • listingCapped flag (50182d1): Added syncContext.listingCapped = true to all 7 connectors that compute hitLimit (OneDrive, Dropbox, SharePoint, Confluence v2+CQL, Notion x3). Deletion reconciliation now correctly skips capped listings.
  • Confluence blogpost labels (6ffc0e3): fetchLabelsForPages now tries both /pages/{id}/labels and /blogposts/{id}/labels.
  • Confluence version metadata (bc73154): Uses undefined instead of empty string for metadata, preserving correct Number.isNaN filtering in mapTags.
  • Google Drive maxFiles guard (bc73154): Early return when previouslyFetched >= maxFiles.
  • Obsidian per-sync fetch: Intentional trade-off — API lacks mtime in directory listing, post-hydration hash skip avoids expensive re-embedding.

All threads resolved. Retriggering reviews.

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

…ebooks

All calls to apiListTags and apiListNotebooks in both listDocuments and
getDocument now pass retryOptions for consistent retry protection across
all Thrift RPC calls.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@waleedlatif1
Copy link
Collaborator Author

Fixed the Evernote retryOptions gap (fcde037) — all apiListTags and apiListNotebooks calls now pass retry options consistently. All review comments across all rounds have been addressed and resolved.

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

Resolved the Confluence labels comment — label-only edits not being detected is a pre-existing limitation of Confluence's version-based change detection (labels don't increment version.number), not a regression from this PR. The v2 stub labels: [] is correct because labels are fetched and merged during hydration via the already-fixed blogpost-aware fetchLabelsForPages. All review threads are resolved.

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1 waleedlatif1 merged commit 4c474e0 into staging Mar 27, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/fix-connector-sync-timeout branch March 27, 2026 02:04
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant