Concepts

Information is transmitted exclusively via HTTP. Each request should result in an HTTP status code of 200, unless an error occurs, in which case the status code will fall within the 4xx or 5xx range, depending on the nature of the error.

The data served over HTTP consists of two main concepts: Snapshots and Feeds.

Page format

Pages utilize the multipart media type for formatting, as specified in RFC 2046.

To separate the entities within a multipart document, the page's Content-Type header must specify a boundary: multipart/mixed; boundary="<random-boundary>". The boundary is a unique string that does not occur in the content and that delimits each entity. Producers should use multipart/mixed as the subtype, but consumers must be prepared to handle any other multipart subtype.

NOTE: According to RFC 2046, empty pages are not possible.
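As an illustration of the page format, the following sketch splits a page into its entities with Python's standard email package, which implements RFC 2046 multipart parsing. The function name parse_page and the shape of the returned dictionaries are assumptions made for this example and are not part of the specification.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def parse_page(content_type: str, body: bytes) -> list[dict]:
    """Split a multipart page into its entities (selected headers plus raw body)."""
    # The parser learns the boundary from the page's Content-Type header,
    # so that header is prepended to the raw body before parsing.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    # is_multipart() is true for any multipart subtype, not only multipart/mixed.
    if not message.is_multipart():
        raise ValueError("page is not a multipart document")
    entities = []
    for part in message.get_payload():
        entities.append({
            "content_type": part.get_content_type(),
            "last_modified": parsedate_to_datetime(part["Last-Modified"]),
            "body": part.get_payload(decode=True),
        })
    return entities


page_body = (
    b"--rdm-bny\r\n"
    b"Content-Type: text/plain\r\n"
    b"Last-Modified: Thu, 05 Oct 2023 03:00:13 GMT\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"Hello\r\n"
    b"--rdm-bny\r\n"
    b"Content-Type: text/plain\r\n"
    b"Last-Modified: Thu, 05 Oct 2023 03:00:14 GMT\r\n"
    b"Content-Length: 8\r\n"
    b"\r\n"
    b"Snapshot\r\n"
    b"--rdm-bny--\r\n"
)
entities = parse_page('multipart/mixed; boundary="rdm-bny"', page_body)
```

Because the check relies on is_multipart() rather than on the exact subtype, the sketch also accepts pages that use a subtype other than multipart/mixed.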

Required and optional entity headers

Each entity within a page must include specific headers.

Name            Required  Example
Content-Type    yes       Content-Type: text/plain
Last-Modified   yes       Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Content-Length  no        Content-Length: 42

Content-Length

The optional Content-Length header specifies the size of the entity's content in bytes. It helps producers determine whether the current page has reached its capacity, in which case a new page is created for subsequent entities.
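A producer could use the accumulated Content-Length values to decide when to start a new page. The sketch below is one way to do this. MAX_PAGE_BYTES, PageBuilder, and paginate are hypothetical names, and the limit of 1024 bytes is an arbitrary value chosen for the example. The specification does not prescribe a page capacity.

```python
from dataclasses import dataclass, field

MAX_PAGE_BYTES = 1024  # hypothetical capacity; not defined by the specification


@dataclass
class PageBuilder:
    """Collects entity bodies and tracks their combined Content-Length."""
    bodies: list[bytes] = field(default_factory=list)
    size: int = 0

    def fits(self, body: bytes) -> bool:
        # A page without entities always accepts one, because pages cannot be empty.
        return not self.bodies or self.size + len(body) <= MAX_PAGE_BYTES

    def add(self, body: bytes) -> None:
        self.bodies.append(body)
        self.size += len(body)


def paginate(bodies: list[bytes]) -> list[PageBuilder]:
    """Distribute entity bodies over pages, opening a new page when one is full."""
    pages: list[PageBuilder] = []
    for body in bodies:
        if not pages or not pages[-1].fits(body):
            pages.append(PageBuilder())
        pages[-1].add(body)
    return pages
```

An entity larger than the limit still gets a page of its own, which keeps every page non-empty.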

Additional headers for feed entities

Name            Required  Example
Content-ID      yes       Content-ID: a-random-content-id-or-hash
Operation-Type  yes       Operation-Type: http-equiv=PUT

Content-ID

The Content-ID header is used to uniquely identify an entity within a feed. It is required for every feed entity and must be unique within a feed.

Operation-Type

The Operation-Type header indicates the operation that was performed on the entity. Possible values are PUT, DELETE, and PATCH, given in the form http-equiv=<value>, for example Operation-Type: http-equiv=PUT.
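A consumer might map the header value to an enumeration before acting on it. The following sketch assumes the value always has the form http-equiv=<operation>, as in the header table. Operation and parse_operation_type are names invented for this example.

```python
from enum import Enum


class Operation(Enum):
    PUT = "PUT"
    DELETE = "DELETE"
    PATCH = "PATCH"


def parse_operation_type(value: str) -> Operation:
    """Parse a header value such as 'http-equiv=PUT' into an Operation."""
    key, separator, operation = value.strip().partition("=")
    if not separator or key.strip().lower() != "http-equiv":
        raise ValueError(f"unexpected Operation-Type value: {value!r}")
    # Operation(...) raises ValueError for anything other than PUT, DELETE, PATCH.
    return Operation(operation.strip().upper())
```

Rejecting unknown values early keeps a consumer from silently ignoring an operation it does not understand.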

Examples:

Snapshot page:

--<random-boundary>
Content-Type: text/plain
Last-Modified: Thu, 05 Oct 2023 03:00:13 GMT
Content-Length: 5

Hello
--<random-boundary>
Content-Type: text/plain
Last-Modified: Thu, 05 Oct 2023 03:00:14 GMT
Content-Length: 8

Snapshot
--<random-boundary>--

Feed page:

--<random-boundary>
Operation-Type: http-equiv=PUT
Content-Type: text/plain
Content-ID: <1-A@random-content-id>
Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Content-Length: 5

hello
--<random-boundary>
Operation-Type: http-equiv=PUT
Content-Type: text/plain
Content-ID: <1-B@random-content-id>
Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Content-Length: 4

Feed
--<random-boundary>--
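
On the producer side, pages like the ones above could be assembled as in the following sketch. choose_boundary and render_page are hypothetical helpers. The sketch always emits Content-Length, although that header is optional, and it uses CRLF line endings as RFC 2046 prescribes.

```python
import secrets


def choose_boundary(bodies: list[bytes]) -> str:
    """Pick a random boundary that does not occur in any entity body."""
    while True:
        boundary = secrets.token_hex(16)
        if all(boundary.encode("ascii") not in body for body in bodies):
            return boundary


def render_page(entities: list[tuple[dict[str, str], bytes]], boundary: str) -> bytes:
    """Serialize (headers, body) pairs into the body of a multipart page."""
    if not entities:
        raise ValueError("a page must contain at least one entity")
    delimiter = b"--" + boundary.encode("ascii")
    chunks = []
    for headers, body in entities:
        header_lines = [f"{name}: {value}" for name, value in headers.items()]
        header_lines.append(f"Content-Length: {len(body)}")
        chunks.append(delimiter + b"\r\n")
        chunks.append("\r\n".join(header_lines).encode("ascii") + b"\r\n\r\n")
        chunks.append(body + b"\r\n")
    # The closing delimiter carries two trailing hyphens.
    chunks.append(delimiter + b"--\r\n")
    return b"".join(chunks)
```

The boundary returned by choose_boundary also has to be announced in the page's own Content-Type header so that consumers can split the page again.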

Snapshot Overview

This section outlines the structure of a snapshot, which is composed of a snapshot index and multiple pages. The snapshot index is detailed in Snapshot Index Format and serves as a directory of page URLs. Each page, as defined in Page Format, contains one or more entities.

Required headers

Each snapshot page must include the following headers.

Name           Required  Example
Content-Type   yes       Content-Type: multipart/mixed; boundary="rdm-bny"
Last-Modified  yes       Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT

Snapshot Index Format

The snapshot index is a JSON document delivered over HTTP with the content type application/json. All page links in the index must be absolute URLs.

JSON Schema

{
    "$id": "https://www.datareplication.io/spec/snapshot/index.schema.json",
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Snapshot index",
    "description": "The index page of a snapshot with all the corresponding links to the pages of the snapshot.",
    "type": "object",
    "additionalProperties": true,
    "required": [
        "id",
        "createdAt",
        "pages"
    ],
    "properties": {        
        "id": {
            "type": "string",
            "description": "Unique identifier of the snapshot.",
            "examples": [
                "12345678",
                "1695841775",
                "341f9312-f27c-22d5-e712-223454980097"
            ]
        },
        "createdAt": {
            "type": "string",
            "format": "date-time",
            "description": "Creation time of the snapshot in ISO 8601 format.",
            "examples": [
                "2023-11-27T20:52:17.000Z"
            ]
        },
        "pages": {
            "type": "array",
            "description": "List of absolute page URLs.",
            "examples": [
                [
                    "https://example.datareplication.io/snapshot/12345678/page/0",
                    "https://example.datareplication.io/snapshot/12345678/page/1",
                    "https://example.datareplication.io/snapshot/12345678/page/2"
                ]
            ],
            "items": {
                "type": "string"
            }
        }
    }
}

Additional properties are allowed.

Example of the snapshot index format

{
    "id": "12345678",
    "createdAt": "2023-09-27T20:52:17.000Z",
    "pages": [
        "https://example.datareplication.io/12345678/page/1",
        "https://example.datareplication.io/12345678/page/2",
        "https://example.datareplication.io/12345678/page/3"
    ]
}
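
A consumer could check an index like the one above before fetching any pages. The sketch below verifies the three required properties, checks that every page URL is absolute, and parses createdAt. load_snapshot_index and the keys of the returned dictionary are assumptions made for this example. It is a minimal check and does not replace full JSON Schema validation.

```python
import json
from datetime import datetime
from urllib.parse import urlparse


def load_snapshot_index(text: str) -> dict:
    """Parse a snapshot index and check the constraints described above."""
    index = json.loads(text)
    for key in ("id", "createdAt", "pages"):
        if key not in index:
            raise ValueError(f"snapshot index is missing required property {key!r}")
    for url in index["pages"]:
        parts = urlparse(url)
        # An absolute URL has both a scheme and a network location.
        if not (parts.scheme and parts.netloc):
            raise ValueError(f"page URL is not absolute: {url!r}")
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    created_at = datetime.fromisoformat(index["createdAt"].replace("Z", "+00:00"))
    return {"id": index["id"], "created_at": created_at, "pages": list(index["pages"])}
```

Properties beyond the required three are ignored rather than rejected, in line with the schema allowing additional properties.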

Feed

A feed is a sequence of entities that chronicle modifications to a resource, organized into multiple pages. Each page contains one or more entities.

Typically, a page in a feed includes three Link HTTP headers:

  • self: the URL of the current page
  • prev: the URL of the preceding page
  • next: the URL of the subsequent page

NOTE: The initial page lacks a prev link, and the final page omits a next link. Additional details on this can be found in RFC 8288.
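
To follow these links, a consumer has to extract them from the Link headers. The sketch below assumes the RFC 8288 syntax, in which the target URL is enclosed in angle brackets and followed by a rel parameter. parse_links is a hypothetical helper, and the regular expression is deliberately simple: it expects rel to be the first parameter after the URL and ignores all other parameters.

```python
import re

# Matches '<url>; rel="value"' as well as the unquoted form '<url>; rel=value'.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?')


def parse_links(header_values: list[str]) -> dict[str, str]:
    """Map each link relation (self, prev, next) to its target URL."""
    links: dict[str, str] = {}
    for value in header_values:
        for url, relations in LINK_PATTERN.findall(value):
            # A single rel parameter may list several space-separated relations.
            for relation in relations.split():
                links[relation.lower()] = url
    return links


links = parse_links([
    '<https://example.com/feed/2>; rel="self"',
    '<https://example.com/feed/1>; rel="prev", <https://example.com/feed/3>; rel=next',
])
```

A missing prev or next key in the result then corresponds to the first or the last page of the feed.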

Refer to Page Format for the structure of each page, which accommodates one or more entities.

Required headers

Every feed page requires the headers listed below, with the exception of Link; rel=next and Link; rel=prev, which are optional.

Name            Required  Example
Content-Type    yes       Content-Type: multipart/mixed; boundary="rdm-bny"
Last-Modified   yes       Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Link; rel=self  yes       Link: <https://example.com/feed/hash>; rel="self"
Link; rel=next  no        Link: <https://example.com/feed/hash>; rel="next"
Link; rel=prev  no        Link: <https://example.com/feed/hash>; rel="prev"

Finding an entry point

Each feed consumer needs to find a valid starting point. This can be done either by reading the feed from the beginning or by using a snapshot as the starting point. In the latter case, the snapshot's creation date serves as the entry point.

Pagination

To ensure efficient navigation and access for feed consumers, providers need to support pagination. Consumers crawl the feed using HEAD requests and use GET requests to retrieve specific pages. Think of the feed as a doubly linked list of pages, where each page has a prev and a next link.

NOTE: Links must be consistent, i.e. the prev and next links of adjacent pages must match and the feed must not form a loop.
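
The consistency rule can be checked mechanically. The following sketch walks the next links of an in-memory model of the feed, in which each page URL maps to a dictionary with optional "prev" and "next" entries. Both the model and the name check_links are assumptions made for this example. A real check would read the links from the pages' Link headers.

```python
def check_links(pages: dict[str, dict[str, str]], start: str) -> int:
    """Walk next links from start and return the number of pages visited.

    Raises ValueError if adjacent pages disagree or if the feed forms a loop.
    """
    seen: set[str] = set()
    current = start
    while current is not None:
        if current in seen:
            raise ValueError(f"feed forms a loop at {current}")
        seen.add(current)
        following = pages[current].get("next")
        # The next page must point back to the page that links to it.
        if following is not None and pages[following].get("prev") != current:
            raise ValueError(f"prev link of {following} does not match {current}")
        current = following
    return len(seen)
```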

Last-Modified

Each entity must have a Last-Modified header. Its value must use the timestamp format of the HTTP Last-Modified header.

NOTE: The Last-Modified timestamps must be monotonically increasing across all pages.
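
A producer or a test suite could verify this ordering as sketched below. The sketch takes the Last-Modified values of all entities in feed order. It assumes that equal timestamps on consecutive entities are acceptable, since the feed page example contains two entities with the same Last-Modified value. is_monotonic is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime


def is_monotonic(last_modified_values: list[str]) -> bool:
    """Check that HTTP-date timestamps never decrease in the given order."""
    timestamps = [parsedate_to_datetime(value) for value in last_modified_values]
    # Compare each timestamp with its successor; equal values are allowed.
    return all(earlier <= later for earlier, later in zip(timestamps, timestamps[1:]))
```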

Immutability of pages

Treat published pages as immutable once created, with a few specific exceptions:

  • New entities can be added to the most recent page if it doesn’t have a next link.
  • A next link can be introduced to the most recent page, after which it becomes immutable.
  • A prev link may be removed from an older page as part of a gradual cleanup process.

Consumer Algorithm

  • Crawl back through the feed via prev links, using the timestamp to decide where to stop.

  • Walk forward from that point:

    • Skip everything older than the timestamp.
    • With a Content-ID: start at the first entity after the one with that Content-ID. This gives exactly-once semantics.
    • Without a Content-ID: start at the first entity with the given timestamp. This gives at-least-once semantics.
  • Forward-crawling via page timestamps is allowed as an optimization, although it is probably of limited use.

  • Consumers SHOULD report errors:

    • The Content-ID is missing.
    • The timestamps are not old enough.
  • How different entry points affect this algorithm is still an open question.
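
The outline above can be read in more than one way. The following sketch shows one possible interpretation on an in-memory model of the feed. Entity, Page, and consume are hypothetical names, and errors are reported by raising LookupError. The sketch crawls back until it finds a page whose first entity is older than the timestamp, or the first page of the feed, and then walks forward.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Entity:
    content_id: str
    last_modified: datetime


@dataclass
class Page:
    entities: list[Entity]
    prev: Optional[str] = None
    next: Optional[str] = None


def consume(pages: dict[str, Page], latest: str, timestamp: datetime,
            content_id: Optional[str] = None) -> list[Entity]:
    """Return the entities still to be processed for the given entry point."""
    # Crawl back while the page starts at or after the timestamp.
    url = latest
    while pages[url].entities[0].last_modified >= timestamp and pages[url].prev is not None:
        url = pages[url].prev
    # Assumption: if even the oldest reachable entity is newer than the
    # timestamp, earlier entities may have been missed, so this is an error.
    if pages[url].entities[0].last_modified > timestamp:
        raise LookupError("timestamps not old enough")

    # Walk forward and skip everything older than the timestamp.
    candidates: list[Entity] = []
    current: Optional[str] = url
    while current is not None:
        candidates += [e for e in pages[current].entities if e.last_modified >= timestamp]
        current = pages[current].next

    if content_id is None:
        # At-least-once: entities carrying the timestamp itself are delivered again.
        return candidates
    ids = [e.content_id for e in candidates]
    if content_id not in ids:
        raise LookupError("missing content ID")
    # Exactly-once: continue with the first entity after the known Content-ID.
    return candidates[ids.index(content_id) + 1:]
```

A real consumer would fetch pages over HTTP instead of reading them from a dictionary, and might prefer logging over raising when it reports these errors.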

Resume consumption

After fully consuming the feed, it’s important for the consumer to note the Last-Modified date of the last entity. This date then becomes the new entry point for future consumption. Additionally, to avoid processing the same entity multiple times, consumers are advised to record the Content-ID of the last entity they processed.
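
The two recorded values could be persisted between runs as a small checkpoint. The sketch below stores them as JSON. The Checkpoint class and the property names lastModified and contentId are assumptions made for this example and are not defined by the specification.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


@dataclass
class Checkpoint:
    """Last-Modified and Content-ID of the last entity a consumer processed."""
    last_modified: datetime
    content_id: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps({
            # usegmt=True yields the HTTP-date form ending in "GMT";
            # it requires a timezone-aware UTC datetime.
            "lastModified": format_datetime(self.last_modified, usegmt=True),
            "contentId": self.content_id,
        })

    @classmethod
    def from_json(cls, text: str) -> "Checkpoint":
        data = json.loads(text)
        return cls(parsedate_to_datetime(data["lastModified"]), data.get("contentId"))


checkpoint = Checkpoint(
    parsedate_to_datetime("Mon, 27 Nov 2023 03:10:00 GMT"),
    "<1-B@random-content-id>",
)
restored = Checkpoint.from_json(checkpoint.to_json())
```

Storing the timestamp in the same HTTP-date form that the feed uses avoids a separate conversion step when the consumer resumes.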