# Tutorial: Export Crawl Data

https://api-docs.lumar.io/docs/graphql/tutorials/export-crawl-data

This tutorial covers how to export crawl data from Lumar, including report downloads for bulk export and pagination for programmatic access.

## Option 1: Report downloads (recommended for bulk export)

Report downloads generate a downloadable file (CSV or other formats) containing the full dataset for a report. This is the most efficient way to export large amounts of data.

### Step 1: Create a report download

Use the `createReportDownload` mutation to request a file. You can specify which metrics (columns) to include and apply filters.

```graphql
mutation CreateReportDownload($input: CreateReportDownloadInput!) {
  createReportDownload(input: $input) {
    reportDownload {
      ...ReportDownloadDetails
    }
  }
}

fragment ReportDownloadDetails on ReportDownload {
  id
  status
  outputType
  # ...other fields you want to retrieve
}
```

**Variables:**

```json
{
  "input": {
    "crawlId": "TjAwNUNyYXdsMTc2NjI0MQ",
    "reportTemplateCode": "all_pages",
    "reportTypeCode": "Basic",
    "selectedMetrics": ["url", "httpStatusCode"],
    "fileName": "url-error-status-codes",
    "filter": { "httpStatusCode": { "ge": 500 } }
  }
}
```

**Response:**

```json
{
  "data": {
    "createReportDownload": {
      "reportDownload": {
        "id": "TjAxNFJlcG9ydERvd25sb2FkNzU4Njg3",
        "status": "Generating",
        "outputType": "CsvZip"
      }
    }
  }
}
```

### Step 2: Poll for download completion

The report download starts with a `Generating` status. Poll until it reaches `Completed`, then use the `fileURL` to download the file.

```graphql
query GetReportDownloadStatus($reportDownloadId: ObjectID!) {
  node(id: $reportDownloadId) {
    ... on ReportDownload {
      id
      status
      outputType
      fileURL
      createdAt
    }
  }
}
```

**Variables:**

```json
{
  "reportDownloadId": "TjAxNFJlcG9ydERvd25sb2FkNzU4Njg3"
}
```

**Response:**

```json
{
  "data": {
    "node": {
      "id": "TjAxNFJlcG9ydERvd25sb2FkNzU4Njg3",
      "status": "Completed",
      "outputType": "CsvZip",
      "fileURL": "https://storage.example.com/reports/download.csv.zip",
      "createdAt": "2025-01-15T10:00:00.000Z"
    }
  }
}
```

```typescript
async function waitForDownload(reportDownloadId: string): Promise<string> {
  while (true) {
    const result = await executeQuery(STATUS_QUERY, { reportDownloadId });
    const download = result.data.node;

    if (download.status === "Completed") {
      // The field is named fileURL (capital URL) in the schema.
      return download.fileURL;
    }
    if (download.status === "Failed") {
      throw new Error("Report download failed");
    }

    console.log(`Status: ${download.status}. Checking again in 10s...`);
    await new Promise(resolve => setTimeout(resolve, 10000));
  }
}
```

### Step 3: Download the file

The `fileURL` is a signed URL that you can download using any HTTP client:

```bash
curl -o report.csv.zip "SIGNED_FILE_URL_HERE"
```

## Option 2: Paginated API queries

For smaller datasets or when you need real-time access, paginate through the API directly.
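The snippets in both options call an `executeQuery` helper that isn't shown. A minimal sketch using the `fetch` API built into Node 18+ — note that the endpoint URL, the `x-auth-token` header name, and the `LUMAR_API_TOKEN` placeholder below are illustrative assumptions, not documented values:

```typescript
// Assumed values -- substitute the endpoint and credentials for your account.
const GRAPHQL_ENDPOINT = "https://api.lumar.io/graphql";
const LUMAR_API_TOKEN = "YOUR_API_TOKEN";

// Build the JSON body of a GraphQL request (kept pure so it is easy to test).
function buildGraphQLBody(query: string, variables: Record<string, unknown>): string {
  return JSON.stringify({ query, variables });
}

// Minimal executeQuery sketch: POST the query, return the parsed JSON response.
async function executeQuery(query: string, variables: Record<string, unknown>): Promise<any> {
  const response = await fetch(GRAPHQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-auth-token": LUMAR_API_TOKEN, // header name is an assumption
    },
    body: buildGraphQLBody(query, variables),
  });
  if (!response.ok) {
    throw new Error(`GraphQL request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Swap in whatever authentication scheme your account uses; the only contract the loops in this tutorial rely on is that `executeQuery(query, variables)` resolves to the parsed JSON response.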
```graphql
query ExportCrawlUrls($crawlId: ObjectID!, $cursor: String) {
  getReportStat(
    input: { crawlId: $crawlId, reportTemplateCode: "all_pages" }
  ) {
    crawlUrls(first: 500, after: $cursor) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        url
        httpStatusCode
        pageTitle
        wordCount
        fetchTime
      }
      totalCount
    }
  }
}
```

**Variables:**

```json
{
  "crawlId": "TjAwNUNyYXdsMTU4MzI0NQ",
  "cursor": null
}
```

**Response:**

```json
{
  "data": {
    "getReportStat": {
      "crawlUrls": {
        "pageInfo": {
          "hasNextPage": true,
          "endCursor": "NTAw"
        },
        "nodes": [
          {
            "url": "https://www.example.com/",
            "httpStatusCode": 200,
            "pageTitle": "Home - Example",
            "wordCount": 1250,
            "fetchTime": 0.42
          }
        ],
        "totalCount": 2186
      }
    }
  }
}
```

### Pagination loop

```typescript
async function exportAllUrls(crawlId: string): Promise<any[]> {
  const allUrls: any[] = [];
  let cursor: string | null = null;
  let hasNextPage = true;

  while (hasNextPage) {
    const result = await executeQuery(EXPORT_QUERY, { crawlId, cursor });
    const connection = result.data.getReportStat.crawlUrls;

    allUrls.push(...connection.nodes);
    hasNextPage = connection.pageInfo.hasNextPage;
    cursor = connection.pageInfo.endCursor;

    console.log(`Fetched ${allUrls.length} / ${connection.totalCount} URLs`);
  }

  return allUrls;
}
```

## Tips for large datasets

- **Use report downloads** for datasets over 10,000 URLs. Paginating through tens of thousands of records via the API is slow and consumes your rate limit budget.
- **Select only the metrics you need** in `selectedMetrics` to reduce file size.
- **Apply filters** to limit the export to relevant URLs (e.g., only broken pages or a specific segment).
- **Use `first: 500`** as a reasonable page size when paginating via the API. Larger page sizes increase response time.

## Next steps

- [Generate Report Downloads](../generate-report-downloads.md) -- detailed reference for the report download workflow.
- [Filtering](../filtering.md) -- apply filters to narrow down exported data.
- [Pagination](../pagination.md) -- full cursor-based pagination reference.
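One footnote on the pagination loop: for exports approaching the 10,000-URL threshold, accumulating every node in an in-memory array can be avoided by serializing each page as it arrives, for example to newline-delimited JSON (NDJSON). A sketch of the serialization step — the `toNdjson` helper is ours, not part of the Lumar API:

```typescript
// Serialize one page of crawl URL nodes as newline-delimited JSON:
// one JSON object per line, each line a complete record.
function toNdjson(nodes: Array<Record<string, unknown>>): string {
  return nodes.map(node => JSON.stringify(node)).join("\n") + "\n";
}
```

Inside the `while (hasNextPage)` loop, append `toNdjson(connection.nodes)` to a file (e.g. with Node's `fs.appendFileSync`) instead of pushing into `allUrls`, so memory use stays constant regardless of crawl size.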