# Scraping Web Data

{% embed url="https://youtu.be/xcRdYpzAEf0" %}

This guide walks through the process of scraping webpage content in MindStudio and using that content in a custom AI agent. The example agent extracts article content from a URL and turns it into a LinkedIn post.

## Use Case Overview

We’ll build a **URL to LinkedIn Post** agent that:

1. Collects a URL from the user.
2. Scrapes the content from that page.
3. Uses AI to generate a LinkedIn post based on the page content.

## Step 1: Create a User Input for the URL

1. Add a **User Input** block to your workflow.
2. Choose the **Short Text** input type.
3. Name the variable: `url`
4. Set the label: `Enter the URL you'd like to write a LinkedIn post about`
5. Add placeholder text:\
   `e.g., https://www.theverge.com/...`
6. Enable **URL validation** to ensure the input is a proper URL.
7. (Optional) Set a **test value** for debugging, like a real article URL.
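
To make the validation step concrete, here is a minimal Python sketch of the kind of check a URL validator might perform. It is an illustration only: the function name `looks_like_url` is hypothetical, and the rule it applies (an `http` or `https` scheme plus a host) is an assumption, not MindStudio's actual validation logic.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """Return True if the value has an http(s) scheme and a host part.

    Hypothetical helper for illustration; the real validation rules
    used by the User Input block are not documented here.
    """
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://www.theverge.com/tech", "not a URL"):
        print(candidate, "->", looks_like_url(candidate))
```

A check like this rejects free text such as `not a URL` before the scraping step runs, which is why validating at the input stage keeps later blocks from failing on bad data.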

## Step 2: Scrape the Webpage

1. Add a **Scrape URL** block.
2. In the URL field, use the variable: `{{ url }}`
3. Set the **output variable** name: `scraped_content`
4. Choose **Output Format**: `Text only`
5. Enable **Auto-enhance** to improve scraping reliability.
6. Keep the **Default scraper** selected (Firecrawl is also available if needed).
7. Leave **Screenshot** disabled unless required.

The block will now extract and store webpage content into the `scraped_content` variable.
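
As a rough mental model of what a `Text only` output means, the sketch below reduces an HTML string to its visible text using only the Python standard library. It is an assumption-laden illustration: the names `TextExtractor` and `html_to_text` are hypothetical, and the Scrape URL block's real extraction (including Auto-enhance) is likely more sophisticated.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = "<h1>Title</h1><script>var x = 1;</script><p>Body text</p>"
    print(html_to_text(sample))
```

The result plays the role of `scraped_content` in this guide: plain text with markup and scripts stripped, which is easier for a language model to read than raw HTML.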

## Step 3: Generate AI Output

1. Add a **Generate Text** block.
2. Write your prompt, including the scraped content:

   ```
   Write an attention-grabbing LinkedIn post based on the following article:
   <content>{{ scraped_content }}</content>
   ```
3. Choose an appropriate model (e.g., Claude 3.5 Haiku).
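
The `{{ scraped_content }}` placeholder is filled in with the variable's value before the prompt reaches the model. The Python sketch below shows one way such substitution could work. The function `render_prompt` and its regular expression are hypothetical and only approximate the idea; MindStudio's own templating may behave differently.

```python
import re

# Matches placeholders such as {{ url }} or {{scraped_content}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with its value from variables.

    Hypothetical helper for illustration; raises KeyError when a
    placeholder has no matching variable.
    """

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value for variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = (
        "Write an attention-grabbing LinkedIn post based on the "
        "following article:\n<content>{{ scraped_content }}</content>"
    )
    print(render_prompt(template, {"scraped_content": "Example article text."}))
```

Wrapping the variable in `<content>` tags, as the prompt above does, gives the model a clear boundary between your instructions and the scraped article.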

## Step 4: Test the Agent

1. Click **Preview** and open the draft agent.
2. Try inputting an invalid value (like `not a URL`) to confirm validation works.
3. Enter a valid URL or use the test value.
4. The agent will:
   * Scrape the page.
   * Analyze the content.
   * Generate a LinkedIn post for you to copy or repurpose.

## Recap and Best Practices

* Use the **Scrape URL** block to pull live content from a webpage.
* Always validate user input when collecting URLs.
* Store scraped data in a clearly named variable for easy reuse.
* Use the “Text only” output format for general analysis, or “JSON” for structured use cases.
* Auto-enhance improves scraping accuracy on dynamic or complex websites.

You can further extend this workflow by adding post-processing steps or integration blocks to share or save the generated content.

