# Retrieving Raw Data

## Using the correct query

There are two ways to retrieve raw data (URLs, links, sitemaps, etc.) from Lumar:

- This page describes how to download _all_ data from a datasource in a single request. The download cannot be filtered or sorted, but it is the most efficient way to access all data.
- The [Get URL Data](get-url-data) guide describes how to retrieve defined metrics for URLs in the crawl. That query can be filtered, sorted, etc. but requires you to paginate URLs 100 at a time. It is perfect for getting a sample of the available data, but is not well suited to getting _all_ data for a crawl.

## Downloading all raw data

During a typical crawl, Lumar may produce millions of rows of URL data and hundreds of millions of rows of link data. The GraphQL API lets you access 100 of these rows per page, so retrieving information about _all_ URLs or links that way means paginating through hundreds of thousands of requests, which is not an efficient way to download data.

In the background, Lumar stores crawl data in Parquet files. Parquet is a compressed, columnar format that is widely supported by data lake and query systems. To give clients easy access to their full datasets, we make these Parquet files available for direct download.

The sample query below returns three properties (`expiresAt`, `datasourceName`, `files`) for the `crawl_urls` datasource (URLs in the crawl). Remove the `datasourceName` filter to see the other available data (links, sitemaps, etc.).

```graphql
query getParquet {
  getCrawl(id: 1234) {
    parquetFiles(datasourceName: "crawl_urls") {
      expiresAt
      datasourceName
      files
    }
  }
}
```

**Response:**
```json
{
  "data": {
    "getCrawl": {
      "parquetFiles": [
        {
          "expiresAt": "2021-03-27T06:59:25.000Z",
          "datasourceName": "crawl_urls",
          "files": [
            "https://ds-odinapi-s3-odins3bucket-resources-staging-use1.s3.amazonaws.com/odin_crawl_urls_efs/odin_crawl_urls_parquet_staging_1612640/20210210_065907_02503_agbm6_bucket-00000?AWSAccessKeyId=AKIARIBNDXU57UDU6UZ7&Expires=1616828364&Signature=frga36vn5JoQdmzUtqHHXWzeZ%2Fk%3D"
          ]
        }
      ]
    }
  }
}
```

The response contains authenticated links to download the requested files. The links are valid for 7 days from the time they are generated, so data should be downloaded promptly.
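The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names are hypothetical, the GraphQL endpoint and authentication headers are left as caller-supplied parameters because this page does not specify them, and the code assumes that `expiresAt` marks the moment the signed links stop working.

```python
from __future__ import annotations

import json
import shutil
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlsplit

# Same query as above, with the crawl ID left as a placeholder.
QUERY = """
query getParquet {
  getCrawl(id: %d) {
    parquetFiles(datasourceName: "crawl_urls") {
      expiresAt
      datasourceName
      files
    }
  }
}
"""


def fetch_parquet_listing(endpoint: str, headers: dict, crawl_id: int) -> dict:
    """POST the query and return the decoded JSON body.

    `endpoint` and `headers` (including authentication) are assumptions
    supplied by the caller; they are not defined on this page.
    """
    body = json.dumps({"query": QUERY % crawl_id}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", **headers},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unexpired_files(listing: dict, now: datetime | None = None) -> list[str]:
    """Return the download URLs whose `expiresAt` is still in the future."""
    now = now or datetime.now(timezone.utc)
    urls = []
    for entry in listing["data"]["getCrawl"]["parquetFiles"]:
        # Replace the trailing "Z" so older Python versions can parse it.
        expires_at = datetime.fromisoformat(entry["expiresAt"].replace("Z", "+00:00"))
        if expires_at > now:
            urls.extend(entry["files"])
    return urls


def download(url: str, target_dir: Path) -> Path:
    """Stream one signed URL to disk, named after the last URL path segment."""
    target = target_dir / Path(urlsplit(url).path).name
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A typical run would call `fetch_parquet_listing(...)`, pass the result to `unexpired_files(...)`, and then call `download(url, Path("data"))` for each returned URL. Streaming with `shutil.copyfileobj` avoids holding a large file in memory.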

## Using a Parquet file

Parquet is an industry-standard format for big data storage and analysis. A single file will typically contain all data from a given datasource. You can access and analyse the data in any standard Parquet-compatible system.

Some of our favourite readers are:

- NodeJS: [node-duckdb](https://www.npmjs.com/package/node-duckdb) - a Lumar-maintained Node wrapper for DuckDB that lets you run SQL queries over a Parquet file without first loading it into a database
- NodeJS: [parquetjs-lite](https://www.npmjs.com/package/parquetjs-lite) - a parquet reader that allows extraction of records from the file
- Python: [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html) can natively read parquet files into a DataFrame
- Python: [parquet-python](https://pypi.org/project/parquet/) is a native parquet reader for Python
- AWS: Parquet is widely supported in AWS’s ecosystem - [S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html), [Athena](https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/), EMR, and other analysis services
- Datalakes: Parquet can be natively ingested/read into [BigQuery](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet), [Azure](https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files), [Snowflake](https://docs.snowflake.com/en/user-guide/semistructured-intro.html#what-is-parquet)