The WebScrapper datasource is a powerful tool for automatically extracting content from websites. This guide walks you through configuring and using WebScrapper to collect web data and import it into your workspace.
Getting Started
1. Creating a New WebScrapper Import
To create a new import:
- Navigate to the Datasources section
- Click "Add New Datasource" or select an existing WebScrapper datasource
- Click "+ New Import" to configure a new import
Basic Configuration
URL Configuration
- Start URL: The specific URL where crawling will begin. This is the entry point for the scraper.
- Example:
https://en.wikipedia.org/wiki/Artificial_intelligence
Crawling Parameters
- Max Crawl Depth: Controls how deep the crawler will navigate from the starting URL.
- 0: Only crawls the starting URL
- 1: Includes pages directly linked from the starting URL
- 2: Includes pages linked from those first-level pages
- 3: Goes three levels deep (maximum)
Advanced Configuration
- Max Pages: Limits the total number of pages crawled.
- Enable "Limit Max Pages" to set a specific limit
- Recommended for large websites to prevent excessive crawling
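To make these two limits concrete, the sketch below shows how a depth bound and a page cap constrain a breadth-first crawl. It illustrates the behavior described above and is not the WebScrapper implementation; fetch_links is a hypothetical helper that returns the URLs linked from a page.

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=1, max_pages=20):
    """Breadth-first crawl bounded by depth and total page count.

    fetch_links(url) is a hypothetical helper that downloads a page
    and returns the URLs it links to.
    """
    visited = set()
    queue = deque([(start_url, 0)])  # (url, depth from the start URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:  # depth 0 crawls only the start URL
            for link in fetch_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return visited
```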
Content Relevance
- Relevance Keywords: Keywords that determine which pages are more important to crawl.
- Pages containing these keywords receive higher priority
- Separate multiple keywords with commas
- Example:
AI, machine learning, neural networks
- Keywords Weight: How strongly to prioritize pages with keywords.
- 0.0: Ignore keywords completely
- 1.0: Prioritize keywords above all other factors
- 0.7: (Default) Balances keyword matching with other factors
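One way to picture how these two settings interact is as a blended priority score. The sketch below is only an approximation of the idea, not the product's actual ranking formula; base_score stands in for whatever other signals the crawler uses.

```python
def page_priority(page_text, keywords, keywords_weight=0.7, base_score=0.5):
    """Blend keyword matching with other crawl signals (illustrative only).

    keywords_weight=0.0 ignores keywords entirely; 1.0 ranks by keywords alone.
    """
    text = page_text.lower()
    hits = sum(1 for kw in keywords if kw.strip().lower() in text)
    keyword_score = hits / len(keywords) if keywords else 0.0
    return keywords_weight * keyword_score + (1 - keywords_weight) * base_score

# A page mentioning two of the three keywords scores higher than one with none
page_priority("Advances in machine learning and neural networks",
              ["AI", "machine learning", "neural networks"], keywords_weight=0.7)
```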
URL Patterns
- URL Patterns to Include: Restricts which URLs will be crawled based on patterns.
- Use * as a wildcard
- Example:
/products/*
matches all pages in the products directory
- Use * alone or leave empty to include all URLs
- Separate multiple patterns with commas
- URL Patterns to Exclude: Specify URL patterns that should NOT be crawled.
- Example:
/admin/*, /login/
excludes admin pages and the login page
- Separate multiple patterns with commas
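Include and exclude patterns of this kind can be approximated with shell-style wildcard matching. The following sketch illustrates how such patterns might be applied to each discovered URL; the product's exact matching rules may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Check a URL's path against include/exclude patterns (illustrative).

    Patterns use * as a wildcard, e.g. "/products/*". An empty include
    list means every URL is included.
    """
    path = urlparse(url).path
    if exclude_patterns and any(fnmatch(path, p.strip()) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, p.strip()) for p in include_patterns)
    return True

# "/products/widget-1" passes, "/admin/users" is excluded
url_allowed("https://example.com/products/widget-1",
            include_patterns=["/products/*"], exclude_patterns=["/admin/*", "/login/"])
url_allowed("https://example.com/admin/users",
            include_patterns=[], exclude_patterns=["/admin/*", "/login/"])
```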
Content Selection
- Content CSS Selector: CSS selector that defines which content to extract from pages.
- This limits both crawling and content extraction scope—any content outside these selectors will be ignored.
- Example:
article.content, .main, .data-container
- Elements to Exclude: CSS selector for elements to remove from processing.
- This works like the Content CSS Selector but in reverse—specified elements will be excluded from both markdown generation and crawling.
- Example:
#ads, .cookies
removes ad containers and cookie banners
- Target Elements: CSS selectors for specific content extraction.
- These elements will be used for markdown generation while still allowing the crawler to process all page links and media.
- Example:
article.content, .main, .data-container
- Tags to Exclude: HTML tags to skip during content extraction.
- These tags will be ignored during markdown generation but still checked for crawlable links.
- Example:
nav
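To see what these selectors do in practice, here is a minimal extraction sketch using BeautifulSoup. It mimics the selector semantics described above and is not the WebScrapper pipeline itself; the default selector values are taken from the examples in this section.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_content(html,
                    content_selector="article.content, .main, .data-container",
                    exclude_selector="#ads, .cookies",
                    exclude_tags=("nav",)):
    """Extract text from the selected content areas only (illustrative sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop excluded elements and tags before generating any output
    for element in soup.select(exclude_selector):
        element.decompose()
    for tag in exclude_tags:
        for element in soup.find_all(tag):
            element.decompose()
    # Keep only the content matched by the content selector
    blocks = [el.get_text(" ", strip=True) for el in soup.select(content_selector)]
    return "\n\n".join(blocks)
```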
Proxy Settings
- Enable Proxy: Toggle to use a proxy server for web scraping requests
- When enabled, additional proxy configuration fields will appear
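Conceptually, enabling a proxy routes the scraper's HTTP requests through the proxy server you configure. As a rough illustration of the idea (outside the product, using the Python requests library with a placeholder proxy address):

```python
import requests

# Placeholder proxy address; substitute your own proxy configuration
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get(
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    proxies=proxies,
    timeout=30,
)
```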
Import Settings
- Workspace: Select the workspace where the scraped content will be imported
- Frequency (minutes): Set how often the scraper should run
- Set to 0 for manual triggering only
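As a rough sketch of the frequency semantics only (not how the platform schedules imports internally): a non-zero value corresponds to a periodic run, while 0 means the import runs only when you trigger it.

```python
import time

def run_on_schedule(run_import, frequency_minutes):
    """frequency_minutes = 0: manual only; otherwise run periodically."""
    if frequency_minutes == 0:
        return  # trigger run_import() manually when needed
    while True:
        run_import()
        time.sleep(frequency_minutes * 60)
```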
Best Practices
- Start Small: Begin with a shallow crawl depth and limited pages to test
- Refine Gradually: Expand your configuration after confirming initial results
- Use Content Selection: Apply CSS selectors and tag exclusions to control which content is extracted and processed from each page
- Use Relevance Keywords: For large sites, use keywords to prioritize content
- Respect Website Rules: Avoid aggressive crawling that might overload sites
- Check Results: Regularly review imported content to ensure quality
Troubleshooting
- Empty Results: Check URL patterns and content selectors
- Too Much Content: Reduce max depth or pages, or add selection/exclusion patterns
- Irrelevant Content: Refine CSS selectors to target specific content areas
- Import Failures: Check the site's robots.txt rules or try using a proxy
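When diagnosing import failures, it can help to check the site's robots.txt yourself before adjusting the configuration. A minimal check with Python's standard library, run outside the platform:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
# False here means the site's robots.txt disallows crawling this URL
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Artificial_intelligence"))
```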
Example Configuration
For crawling Wikipedia articles about AI:
- Start URL:
https://en.wikipedia.org/wiki/Artificial_intelligence
- Max Crawl Depth: 1
- Max Pages: 20
- Relevance Keywords:
machine learning, neural network, deep learning
- Keywords Weight: 0.7
- Content CSS Selector:
main
- Elements to Exclude:
.sidebar, .vector-column-end, .vector-page-toolbar, .vector-body-before-content, .navigation-not-searchable
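For reference, the same example expressed as a single configuration block. The field names below are descriptive stand-ins for the form fields above, not a documented API payload:

```python
example_import = {
    "start_url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "max_crawl_depth": 1,
    "max_pages": 20,
    "relevance_keywords": ["machine learning", "neural network", "deep learning"],
    "keywords_weight": 0.7,
    "content_css_selector": "main",
    "elements_to_exclude": ".sidebar, .vector-column-end, .vector-page-toolbar, "
                           ".vector-body-before-content, .navigation-not-searchable",
}
```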