@meirk-brd

Summary

Integrate Bright Data into SIM with tools, block UI, and API routes. Adds all dataset tools from Bright Data, registers them in the tool registry and Bright Data block, and adds a scoped Bright Data. Includes Bright Data icon updates and dataset polling with a 10‑minute timeout.

Fixes #N/A

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • cd apps/sim && bun x tsc --noEmit -p tsconfig.brightdata.json
  • Manual: ran dev server and validated Bright Data search + scrape‑markdown, dataset trigger/polling flow
  • Lint (Bright Data scope) pending: bun x biome check --write --unsafe apps/sim/app/api/tools/brightdata apps/sim/tools/brightdata apps/sim/blocks/blocks/brightdata.ts apps/sim/tools/registry.ts apps/sim/blocks/registry.ts

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

Example of dataset usage (there are 40+ datasets):

image

Search:
image

Scrape:
image

@vercel

vercel bot commented Jan 14, 2026

@meirk-brd is attempting to deploy a commit to the Sim Team on Vercel.

A member of the Team first needs to authorize it.

@greptile-apps

greptile-apps bot commented Jan 14, 2026

Greptile Summary

  • Adds comprehensive Bright Data integration with 47 tools: 45 dataset tools covering major platforms (Amazon, LinkedIn, Facebook, TikTok, etc.), plus web scraping and search functionality
  • Creates a Bright Data block with conditional UI fields for 49+ operations and registers tools in the central registry following established integration patterns
  • Implements API routes for dataset polling with 10-minute timeout, markdown scraping, and search engine functionality with proper error handling and logging

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/tools/brightdata/index.ts | New barrel export file exporting 45 dataset tools, 2 utility tools, and types for the Bright Data integration |
| apps/sim/blocks/blocks/brightdata.ts | New Bright Data block configuration with extensive conditional UI fields, operation mapping, and tool integration for 49+ dataset operations |
| apps/sim/app/api/tools/brightdata/dataset/route.ts | New API route implementing dataset triggering and 10-minute polling mechanism for async Bright Data operations |
| apps/sim/tools/registry.ts | Updated to register 47 new Bright Data tools using consistent naming conventions in central tool registry |
| apps/sim/blocks/registry.ts | Updated to register the BrightDataBlock in the central block registry following alphabetical ordering |

Confidence score: 4/5

  • This PR adds significant functionality with minimal risk due to comprehensive integration following established patterns
  • Score reflects well-structured implementation but concerns with complex conditional field logic and potential timeout handling edge cases
  • Pay close attention to apps/sim/blocks/blocks/brightdata.ts for the complex conditional field configuration and apps/sim/app/api/tools/brightdata/dataset/route.ts for polling timeout logic

Sequence Diagram

sequenceDiagram
    participant User
    participant BrightDataBlock
    participant API
    participant BrightDataDatasetAPI
    participant BrightDataAPI
    participant DatasetPoller

    User->>BrightDataBlock: "Select operation and input parameters"
    BrightDataBlock->>API: "Route request based on operation type"
    
    alt Dataset Operations
        API->>BrightDataDatasetAPI: "POST /api/tools/brightdata/dataset"
        BrightDataDatasetAPI->>BrightDataAPI: "Trigger dataset with datasetId"
        BrightDataAPI-->>BrightDataDatasetAPI: "Return snapshot_id"
        BrightDataDatasetAPI->>DatasetPoller: "Poll snapshot status every 1s"
        DatasetPoller->>BrightDataAPI: "GET snapshot status"
        BrightDataAPI-->>DatasetPoller: "Status: running/building/starting"
        loop Until Complete or 10min timeout
            DatasetPoller->>BrightDataAPI: "Check snapshot status"
            BrightDataAPI-->>DatasetPoller: "Status update"
        end
        DatasetPoller-->>BrightDataDatasetAPI: "Final dataset results"
        BrightDataDatasetAPI-->>API: "Dataset response with data"
    else Scrape Markdown
        API->>BrightDataAPI: "POST /api/tools/brightdata/scrape-markdown"
        BrightDataAPI-->>API: "Markdown content and metadata"
    else Search Engine
        API->>BrightDataAPI: "POST /api/tools/brightdata/search-engine"
        BrightDataAPI-->>API: "Search results array"
    end
    
    API-->>BrightDataBlock: "Processed response"
    BrightDataBlock-->>User: "Results with data/markdown/search results"
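
The polling loop in the diagram could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in dataset/route.ts: `pollSnapshot`, `PollOptions`, the `ready` and `failed` status names, and the injectable `sleep`/`now` parameters are all hypothetical. Only the 1s interval, the 10-minute budget, and the running/building/starting statuses come from the summary above.

```typescript
// Hypothetical status names: the diagram only confirms starting/building/running.
type SnapshotStatus = 'starting' | 'building' | 'running' | 'ready' | 'failed'

interface PollOptions {
  intervalMs?: number // delay between status checks (diagram: 1s)
  timeoutMs?: number // overall budget (PR description: 10 minutes)
  sleep?: (ms: number) => Promise<void> // injectable so tests need not wait
  now?: () => number // injectable clock
}

const realSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Polls until the snapshot leaves its in-progress states or the time budget is spent.
async function pollSnapshot(
  fetchStatus: () => Promise<SnapshotStatus>,
  opts: PollOptions = {}
): Promise<SnapshotStatus | 'timeout'> {
  const intervalMs = opts.intervalMs ?? 1000
  const timeoutMs = opts.timeoutMs ?? 10 * 60 * 1000
  const sleep = opts.sleep ?? realSleep
  const now = opts.now ?? Date.now
  const deadline = now() + timeoutMs

  while (now() < deadline) {
    const status = await fetchStatus()
    if (status === 'ready' || status === 'failed') return status
    await sleep(intervalMs)
  }
  return 'timeout'
}
```

Passing the clock and sleep function as options keeps the timeout branch testable without waiting ten minutes, which is one way to exercise the edge cases the confidence note mentions.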

@greptile-apps greptile-apps bot left a comment

55 files reviewed, 6 comments


return NextResponse.json({
  markdown,
  url,
  title: title || undefined,

style: Redundant check - title is already undefined if falsy, so || undefined is unnecessary

Suggested change
- title: title || undefined,
+ title,

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
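
Whether the two forms are interchangeable depends on the type of `title`, which is not visible in this excerpt. The sketch below is an illustration, not code from the PR: `withFallback` and `shorthand` are hypothetical helpers, and it assumes the response body follows `JSON.stringify` semantics. If `title` can be an empty string or `null`, `|| undefined` drops the key from the output while the shorthand keeps it.

```typescript
// Hypothetical helpers mirroring the two variants of the `title` field.
function withFallback(title: string | null | undefined) {
  return { title: title || undefined } // any falsy value becomes undefined
}

function shorthand(title: string | null | undefined) {
  return { title } // value is passed through unchanged
}

// JSON.stringify omits undefined properties but keeps '' and null.
console.log(JSON.stringify(withFallback(''))) // {}
console.log(JSON.stringify(shorthand(''))) // {"title":""}
console.log(JSON.stringify(shorthand(null))) // {"title":null}
console.log(JSON.stringify(shorthand(undefined))) // {}
```

If `title` is only ever a non-empty string or `undefined`, both variants serialize identically and the shorter form is safe.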

Path: apps/sim/app/api/tools/brightdata/scrape-markdown/route.ts
Line: 80:80

Comment on lines +92 to +93
const maxCount = Number.isFinite(maxResults) ? Number(maxResults) : undefined
const results = maxCount ? normalizedResults.slice(0, maxCount) : normalizedResults

style: Redundant slice when maxResults is undefined - normalizedResults already contains all results
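
Two details of the quoted lines are worth checking regardless of style: `Number.isFinite` does not coerce its argument, so a numeric string such as '5' yields no limit, and a `maxCount` of 0 is falsy and also returns everything. A minimal sketch of one way to make both cases explicit follows. `limitResults` is a hypothetical helper, and treating 0 or invalid input as "no limit" is an assumption chosen to match the original truthiness check.

```typescript
// Hypothetical helper; not a name from the PR.
function limitResults<T>(results: T[], maxResults?: number | string): T[] {
  // Convert explicitly: Number.isFinite('5') is false because it never coerces.
  const parsed = maxResults === undefined || maxResults === '' ? NaN : Number(maxResults)

  // Assumption: non-integer, zero, or negative limits mean "return everything".
  if (!Number.isInteger(parsed) || parsed <= 0) return results

  return results.slice(0, parsed)
}

console.log(limitResults(['a', 'b', 'c'], 2).length) // 2
console.log(limitResults(['a', 'b', 'c'], '2').length) // 2
console.log(limitResults(['a', 'b', 'c']).length) // 3
```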

Path: apps/sim/app/api/tools/brightdata/search-engine/route.ts
Line: 92:93

Comment on lines +21 to +26
num_of_comments: {
  type: 'string',
  required: false,
  visibility: 'user-or-llm',
  description: 'Number of comments to fetch (default: 10)',
},

style: Using string type for num_of_comments parameter is inconsistent - this should be number type since it represents a count

Suggested change
- num_of_comments: {
-   type: 'string',
-   required: false,
-   visibility: 'user-or-llm',
-   description: 'Number of comments to fetch (default: 10)',
- },
+ num_of_comments: {
+   type: 'number',
+   required: false,
+   visibility: 'user-or-llm',
+   description: 'Number of comments to fetch (default: 10)',
+ },

Is there a specific reason this numeric parameter needs to be a string rather than a number?

Path: apps/sim/tools/brightdata/dataset_youtube_comments.ts
Line: 21:26

Comment on lines +49 to +51
if (body.num_of_comments === undefined) {
  body.num_of_comments = '10'
}

logic: This logic is flawed - params.num_of_comments is undefined when not provided, but body.num_of_comments was just assigned that undefined value on line 46

Suggested change
- if (body.num_of_comments === undefined) {
-   body.num_of_comments = '10'
- }
+ if (params.num_of_comments === undefined) {
+   body.num_of_comments = '10'
+ }
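
If `body.num_of_comments` was assigned directly from `params.num_of_comments`, the original check and the suggested one behave the same. An alternative that avoids the assign-then-patch pattern is to apply the default at assignment time. The sketch below is only an illustration: `CommentsParams` and `buildBody` are hypothetical names, and the `url` field is assumed from the block's URL input.

```typescript
// Hypothetical parameter shape; the real tool params live in dataset_youtube_comments.ts.
interface CommentsParams {
  url: string
  num_of_comments?: string
}

function buildBody(params: CommentsParams): Record<string, string> {
  return {
    url: params.url,
    // ?? falls back only on null/undefined, so an explicit value such as '0' is kept.
    num_of_comments: params.num_of_comments ?? '10',
  }
}

console.log(buildBody({ url: 'https://example.com' }).num_of_comments) // 10
console.log(buildBody({ url: 'https://example.com', num_of_comments: '25' }).num_of_comments) // 25
```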

Path: apps/sim/tools/brightdata/dataset_youtube_comments.ts
Line: 49:51

Comment on lines +49 to +51
if (body.days_limit === undefined) {
  body.days_limit = '3'
}

style: the conditional check for undefined occurs after the property is already assigned to the body object - consider checking params.days_limit directly before assignment

Suggested change
- if (body.days_limit === undefined) {
-   body.days_limit = '3'
- }
+ if (params.days_limit === undefined) {
+   body.days_limit = '3'
+ } else {
+   body.days_limit = params.days_limit
+ }

Path: apps/sim/tools/brightdata/dataset_google_maps_reviews.ts
Line: 49:51

Comment on lines +149 to +200
{
  id: 'url',
  title: 'Dataset URL',
  type: 'short-input',
  placeholder: 'https://example.com',
  condition: {
    field: 'operation',
    value: [
      'dataset_amazon_product',
      'dataset_amazon_product_reviews',
      'dataset_amazon_product_search',
      'dataset_walmart_product',
      'dataset_walmart_seller',
      'dataset_ebay_product',
      'dataset_homedepot_products',
      'dataset_zara_products',
      'dataset_etsy_products',
      'dataset_bestbuy_products',
      'dataset_linkedin_person_profile',
      'dataset_linkedin_company_profile',
      'dataset_linkedin_job_listings',
      'dataset_linkedin_posts',
      'dataset_linkedin_people_search',
      'dataset_crunchbase_company',
      'dataset_zoominfo_company_profile',
      'dataset_instagram_profiles',
      'dataset_instagram_posts',
      'dataset_instagram_reels',
      'dataset_instagram_comments',
      'dataset_facebook_posts',
      'dataset_facebook_marketplace_listings',
      'dataset_facebook_company_reviews',
      'dataset_facebook_events',
      'dataset_tiktok_profiles',
      'dataset_tiktok_posts',
      'dataset_tiktok_shop',
      'dataset_tiktok_comments',
      'dataset_google_maps_reviews',
      'dataset_google_shopping',
      'dataset_google_play_store',
      'dataset_apple_app_store',
      'dataset_reuter_news',
      'dataset_github_repository_file',
      'dataset_yahoo_finance_business',
      'dataset_x_posts',
      'dataset_zillow_properties_listing',
      'dataset_booking_hotel_listings',
      'dataset_youtube_profiles',
      'dataset_youtube_comments',
      'dataset_reddit_posts',
      'dataset_youtube_videos',
    ],

logic: This large condition array excludes the 'dataset_npm_package' and 'dataset_pypi_package' operations, although both are included in the DATASET_TOOL_MAP. Should npm and pypi package datasets also require a URL input, or do they only need the package_name field?
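
A mismatch like this can be caught mechanically by comparing the map's keys against a field's condition list. The sketch below is an assumption-laden illustration: `missingFromCondition`, `toolMapSample`, and `urlConditionSample` are hypothetical, the tool ids are placeholders, and the real DATASET_TOOL_MAP and condition array live in apps/sim/blocks/blocks/brightdata.ts.

```typescript
// Hypothetical excerpt of the operation-to-tool map; values are placeholder ids.
const toolMapSample: Record<string, string> = {
  dataset_amazon_product: 'placeholder_tool_id_1',
  dataset_npm_package: 'placeholder_tool_id_2',
  dataset_pypi_package: 'placeholder_tool_id_3',
}

// Hypothetical excerpt of the URL field's condition values.
const urlConditionSample = ['dataset_amazon_product']

// Returns operations present in the tool map but absent from a field's condition list.
function missingFromCondition(
  toolMap: Record<string, string>,
  conditionValues: string[]
): string[] {
  const covered = new Set(conditionValues)
  return Object.keys(toolMap).filter((op) => !covered.has(op))
}

// Logs the two package operations, which are not covered by the URL condition.
console.log(missingFromCondition(toolMapSample, urlConditionSample))
```

A check of this kind only reports the gap. Whether the omission is intentional, because the package datasets take package_name instead of a URL, is still a question for the author.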

Path: apps/sim/blocks/blocks/brightdata.ts
Line: 149:200
