The WebScrapper datasource is a powerful tool for automatically extracting content from websites. This guide walks you through configuring and using WebScrapper to collect web data and import it into your workspace.
Getting Started
1. Creating a New WebScrapper Import
To create a new import:
- Navigate to the Datasources section
- Click "Add New Datasource" or select an existing WebScrapper datasource
- Click "+ New Import" to configure a new import
Basic Configuration
URL Configuration
- Start URL: The specific URL where crawling will begin. This is the entry point for the scraper.
- Example:
https://en.wikipedia.org/wiki/Artificial_intelligence
Crawling Parameters
- Max Crawl Depth: Controls how deep the crawler will navigate from the starting URL.
- 0: Only crawls the starting URL
- 1: Includes pages directly linked from the starting URL
- 2: Includes pages linked from those first-level pages
- 3: Goes three levels deep (maximum)
Advanced Configuration
- Max Pages: Limits the total number of pages crawled.
- Enable "Limit Max Pages" to set a specific limit
- Recommended for large websites to prevent excessive crawling
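To make these two limits concrete, the sketch below shows how a depth bound and a page cap constrain a breadth-first crawl. It illustrates the behavior described above and is not the WebScrapper implementation; fetch_links is a hypothetical helper that returns the URLs linked from a page.

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=1, max_pages=20):
    """Breadth-first crawl bounded by depth and total page count.

    fetch_links(url) is a hypothetical helper that downloads a page
    and returns the URLs it links to.
    """
    visited = set()
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:  # depth 0 crawls only the start URL
            for link in fetch_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return visited
```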
Content Relevance
- Relevance Keywords: Keywords that determine which pages are more important to crawl.
- Pages containing these keywords receive higher priority
- Separate multiple keywords with commas
- Example:
AI, machine learning, neural networks
- Keywords Weight: How strongly to prioritize pages with keywords.
- 0.0: Ignore keywords completely
- 1.0: Prioritize keywords above all other factors
- 0.7: (Default) Balances keyword matching with other factors
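One way to picture how these two settings interact is as a blended priority score. The sketch below is only an approximation of the idea, not the product's actual ranking formula; base_score stands in for whatever other signals the crawler uses.

```python
def page_priority(page_text, keywords, keywords_weight=0.7, base_score=0.5):
    """Blend keyword matching with other crawl signals (illustrative only).

    keywords_weight=0.0 ignores keywords entirely; 1.0 ranks by keywords alone.
    """
    text = page_text.lower()
    hits = sum(1 for kw in keywords if kw.strip().lower() in text)
    keyword_score = hits / len(keywords) if keywords else 0.0
    return keywords_weight * keyword_score + (1 - keywords_weight) * base_score

# A page mentioning two of the three keywords scores higher than one with none
page_priority("Advances in machine learning and neural networks",
              ["AI", "machine learning", "neural networks"], keywords_weight=0.7)
```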
URL Patterns
- URL Patterns to Include: Restricts which URLs will be crawled based on patterns.
- Use * as a wildcard
- Example:
/products/*
matches all pages in the products directory
- Use * alone or leave empty to include all URLs
- Separate multiple patterns with commas
- URL Patterns to Exclude: Specify URL patterns that should NOT be crawled.
- Example:
/admin/*, /login/
excludes admin pages and the login page
- Separate multiple patterns with commas
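Include and exclude patterns of this kind can be approximated with shell-style wildcard matching. The following sketch illustrates how such patterns might be applied to each discovered URL; the product's exact matching rules may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Check a URL's path against include/exclude patterns (illustrative).

    Patterns use * as a wildcard, e.g. "/products/*". An empty include
    list means every URL is included.
    """
    path = urlparse(url).path
    if exclude_patterns and any(fnmatch(path, p.strip()) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, p.strip()) for p in include_patterns)
    return True

# "/products/widget-1" passes, "/admin/users" is excluded
url_allowed("https://example.com/products/widget-1",
            include_patterns=["/products/*"], exclude_patterns=["/admin/*", "/login/"])
url_allowed("https://example.com/admin/users",
            include_patterns=[], exclude_patterns=["/admin/*", "/login/"])
```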
Content Selection
- Content CSS Selector: CSS selector that defines which content to extract from pages.
- This limits both crawling and content extraction scope—any content outside these selectors will be ignored.
- Example:
article.content, .main, .data-container
- Elements to Exclude: CSS selector for elements to remove from processing.
- This works like the Content CSS Selector but in reverse—specified elements will be excluded from both markdown generation and crawling.
- Example:
#ads, .cookies
removes ad containers and cookie banners
- Target Elements: CSS selectors for specific content extraction.
- These elements will be used for markdown generation while still allowing the crawler to process all page links and media.
- Example:
article.content, .main, .data-container
- Tags to Exclude: HTML tags to skip during content extraction.
- These tags will be ignored during markdown generation but still checked for crawlable links.
- Example:
nav
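To see what these selectors do in practice, here is a minimal extraction sketch using BeautifulSoup. It mimics the selector semantics described above and is not the WebScrapper pipeline itself; the default selector values are taken from the examples in this section.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_content(html,
                    content_selector="article.content, .main, .data-container",
                    exclude_selector="#ads, .cookies",
                    exclude_tags=("nav",)):
    """Extract text from the selected content areas only (illustrative sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop excluded elements and tags before generating any output
    for element in soup.select(exclude_selector):
        element.decompose()
    for tag in exclude_tags:
        for element in soup.find_all(tag):
            element.decompose()
    # Keep only the content matched by the content selector
    blocks = [el.get_text(" ", strip=True) for el in soup.select(content_selector)]
    return "\n\n".join(blocks)
```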
Proxy Settings
- Enable Proxy: Toggle to use a proxy server for web scraping requests
- When enabled, additional proxy configuration fields will appear
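Conceptually, enabling a proxy routes the scraper's HTTP requests through the proxy server you configure. As a rough illustration of the idea (outside the product, using the Python requests library with a placeholder proxy address):

```python
import requests

# Placeholder proxy address; substitute your own proxy configuration
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get(
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    proxies=proxies,
    timeout=30,
)
```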
Import Settings
- Workspace: Select the workspace where the scraped content will be imported
- Frequency (minutes): Set how often the scraper should run
- Set to 0 for manual triggering only
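As a rough sketch of the frequency semantics only (not how the platform schedules imports internally): a non-zero value corresponds to a periodic run, while 0 means the import runs only when you trigger it.

```python
import time

def run_on_schedule(run_import, frequency_minutes):
    """frequency_minutes = 0: manual only; otherwise run periodically."""
    if frequency_minutes == 0:
        return  # trigger run_import() manually when needed
    while True:
        run_import()
        time.sleep(frequency_minutes * 60)
```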
Best Practices
- Start Small: Begin with a shallow crawl depth and limited pages to test
- Refine Gradually: Expand your configuration after confirming initial results
- Use Content Selection: Apply CSS selectors and tag exclusions to control which content is extracted and processed from each page
- Use Relevance Keywords: For large sites, use keywords to prioritize content
- Respect Website Rules: Avoid aggressive crawling that might overload sites
- Check Results: Regularly review imported content to ensure quality
Troubleshooting
- Empty Results: Check URL patterns and content selectors
- Too Much Content: Reduce max depth or pages, or add selection/exclusion patterns
- Irrelevant Content: Refine CSS selectors to target specific content areas
- Import Failures: Check the site's robots.txt rules or try using a proxy
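When diagnosing import failures, it can help to check the site's robots.txt yourself before adjusting the configuration. A minimal check with Python's standard library, run outside the platform:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
# False here means the site's robots.txt disallows crawling this URL
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Artificial_intelligence"))
```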
Example Configuration
For crawling Wikipedia articles about AI:
- Start URL:
https://en.wikipedia.org/wiki/Artificial_intelligence
- Max Crawl Depth: 1
- Max Pages: 20
- Relevance Keywords:
machine learning, neural network, deep learning
- Keywords Weight: 0.7
- Content CSS Selector:
main
- Elements to Exclude:
.sidebar, .vector-column-end, .vector-page-toolbar, .vector-body-before-content, .navigation-not-searchable
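For reference, the same example expressed as a single configuration block. The field names below are descriptive stand-ins for the form fields above, not a documented API payload:

```python
example_import = {
    "start_url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "max_crawl_depth": 1,
    "max_pages": 20,
    "relevance_keywords": ["machine learning", "neural network", "deep learning"],
    "keywords_weight": 0.7,
    "content_css_selector": "main",
    "elements_to_exclude": ".sidebar, .vector-column-end, .vector-page-toolbar, "
                           ".vector-body-before-content, .navigation-not-searchable",
}
```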