Concepts
Information is transmitted exclusively via HTTP. Each request should result in an HTTP status code of 200, unless an error occurs, in which case the status code will fall within the 4xx or 5xx range, depending on the nature of the error.
The data served over HTTP consists of two main concepts: Snapshots and Feeds.
Page format
Pages utilize the multipart media type for formatting, as specified in RFC 2046.
To differentiate between entities within a multipart document, a Content-Type header specifying a boundary is required: multipart/mixed; boundary="<random-boundary>".
This boundary, a unique string not found in the content, separates each entity.
While the primary subtype for multipart documents should be multipart/mixed, consumers should also handle any other multipart subtype.
NOTE: According to RFC 2046, empty pages are not possible.
Required and optional entity headers
Each entity within a page must include specific headers.
Name | Required | Example |
---|---|---|
Content-Type | yes | Content-Type: text/plain |
Last-Modified | yes | Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT |
Content-Length | no | Content-Length: 42 |
Content-Length
The optional Content-Length header specifies the size of the entity’s content in bytes.
It helps producers determine whether the current page has reached its capacity, in which case a new page is created for subsequent entities.
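One way a producer might use entity sizes to decide when to start a new page is sketched below. The `Entity` type, the `paginate` function, and the `max_page_bytes` limit are hypothetical names chosen for illustration; the specification does not prescribe a page capacity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Entity:
    content: bytes

    @property
    def content_length(self) -> int:
        # The value a producer would send in the Content-Length header.
        return len(self.content)


def paginate(entities: Iterable[Entity], max_page_bytes: int) -> List[List[Entity]]:
    """Group entities into pages whose summed content size stays within max_page_bytes."""
    pages: List[List[Entity]] = []
    current: List[Entity] = []
    current_size = 0
    for entity in entities:
        # Start a new page when adding this entity would exceed the budget.
        # A page is never left empty, so an oversized entity gets a page of its own.
        if current and current_size + entity.content_length > max_page_bytes:
            pages.append(current)
            current, current_size = [], 0
        current.append(entity)
        current_size += entity.content_length
    if current:
        pages.append(current)
    return pages
```

Because empty pages are not possible, the sketch only closes a page once it already holds at least one entity.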
Additional headers for feed entities
Name | Required | Example |
---|---|---|
Content-ID | yes | Content-ID: a-random-content-id-or-hash |
Operation-Type | yes | Operation-Type: http-equiv=PUT |
Content-ID
The Content-ID header is used to uniquely identify an entity within a feed.
It is required for every feed entity and must be unique within a feed.
Operation-Type
The Operation-Type header is used to indicate the operation that was performed on the entity.
Possible values are: PUT, DELETE, PATCH.
In the examples in this document, the value is written in the form http-equiv=PUT.
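A consumer might parse this header along the following lines. This is a minimal sketch that assumes values always follow the `http-equiv=<METHOD>` shape seen in the examples; the `Operation` enum and `parse_operation_type` function are hypothetical names, not part of the specification.

```python
from enum import Enum


class Operation(Enum):
    PUT = "PUT"
    DELETE = "DELETE"
    PATCH = "PATCH"


def parse_operation_type(header_value: str) -> Operation:
    """Parse an Operation-Type header value such as 'http-equiv=PUT'."""
    key, sep, method = header_value.strip().partition("=")
    # Assumption: the value has the form 'http-equiv=<METHOD>' as in the examples.
    if sep != "=" or key.strip().lower() != "http-equiv":
        raise ValueError(f"unexpected Operation-Type value: {header_value!r}")
    try:
        return Operation(method.strip().upper())
    except ValueError:
        raise ValueError(f"unsupported operation: {method!r}") from None
```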
Examples:
Snapshot page:
--<random-boundary>
Content-Type: text/plain
Last-Modified: Thu, 05 Oct 2023 03:00:13 GMT
Content-Length: 5

Hello
--<random-boundary>
Content-Type: text/plain
Last-Modified: Thu, 05 Oct 2023 03:00:14 GMT
Content-Length: 8

Snapshot
--<random-boundary>--
Feed page:
--<random-boundary>
Operation-Type: http-equiv=PUT
Content-Type: text/plain
Content-ID: <1-A@random-content-id>
Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Content-Length: 5

hello
--<random-boundary>
Operation-Type: http-equiv=PUT
Content-Type: text/plain
Content-ID: <1-B@random-content-id>
Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT
Content-Length: 4

Feed
--<random-boundary>--
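A page like the ones above could be split into entities with Python's standard-library MIME parser. The sketch below is one possible approach, not a reference implementation; `parse_page`, `SAMPLE_FEED_PAGE`, and the `rdm-bny` boundary are illustrative, and the sample body assumes an empty line between each entity's headers and its content, as RFC 2046 requires.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def parse_page(content_type: str, body: bytes) -> list:
    """Split a multipart page into entities using the standard-library MIME parser."""
    # The parser needs the page-level Content-Type (with its boundary) ahead of the body.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    if not message.is_multipart():
        raise ValueError("page is not a multipart document")
    entities = []
    for part in message.get_payload():
        entities.append({
            "content_type": part.get_content_type(),
            "content_id": part["Content-ID"],          # None for snapshot entities
            "operation_type": part["Operation-Type"],  # None for snapshot entities
            "last_modified": parsedate_to_datetime(part["Last-Modified"]),
            "content": part.get_payload(decode=True),  # entity content as bytes
        })
    return entities


SAMPLE_FEED_PAGE = (
    b"--rdm-bny\r\n"
    b"Operation-Type: http-equiv=PUT\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-ID: <1-A@random-content-id>\r\n"
    b"Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--rdm-bny\r\n"
    b"Operation-Type: http-equiv=PUT\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-ID: <1-B@random-content-id>\r\n"
    b"Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n"
    b"Feed\r\n"
    b"--rdm-bny--\r\n"
)

entities = parse_page('multipart/mixed; boundary="rdm-bny"', SAMPLE_FEED_PAGE)
```

The sketch does not use Content-Length; the boundary alone delimits each entity.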
Snapshot Overview
This document outlines the structure of a snapshot, which is composed of a snapshot index and multiple pages. The snapshot index is detailed in Snapshot Index Format and serves as a directory of page URLs. Each page, as defined in Page Format, contains one or more entities.
Required headers
Each snapshot page must include the following mandatory headers.
Name | Required | Example |
---|---|---|
Content-Type | yes | Content-Type: multipart/mixed; boundary="rdm-bny" |
Last-Modified | yes | Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT |
Snapshot Index Format
The snapshot index, structured as a JSON document, is delivered over HTTP with the content type application/json.
All linked pages within the index must contain absolute URLs.
JSON Schema
{
  "$id": "https://www.datareplication.io/spec/snapshot/index.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Snapshot index",
  "description": "The index page of a snapshot with all the corresponding links to the pages of the snapshot.",
  "type": "object",
  "additionalProperties": true,
  "required": [
    "id",
    "createdAt",
    "pages"
  ],
  "properties": {
    "id": {
      "type": "string",
      "description": "Unique identifier of the snapshot.",
      "examples": [
        "12345678",
        "1695841775",
        "341f9312-f27c-22d5-e712-223454980097"
      ]
    },
    "createdAt": {
      "type": "string",
      "format": "date-time",
      "description": "Creation time of the snapshot in ISO 8601 format.",
      "examples": [
        "2023-11-27T20:52:17.000Z"
      ]
    },
    "pages": {
      "type": "array",
      "description": "List of absolute page URLs.",
      "examples": [
        "https://example.datareplication.io/snapshot/12345678/page/0",
        "https://example.datareplication.io/snapshot/12345678/page/1",
        "https://example.datareplication.io/snapshot/12345678/page/2"
      ],
      "items": {
        "type": "string"
      }
    }
  }
}
Additional properties are allowed.
Example of the snapshot index format
{
  "id": "12345678",
  "createdAt": "2023-09-27T20:52:17.000Z",
  "pages": [
    "https://example.datareplication.io/12345678/page/1",
    "https://example.datareplication.io/12345678/page/2",
    "https://example.datareplication.io/12345678/page/3"
  ]
}
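A consumer could load and sanity-check such an index roughly as follows. This is a sketch using only the standard library rather than a full JSON Schema validator; `SnapshotIndex` and `parse_snapshot_index` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List
from urllib.parse import urlparse


@dataclass
class SnapshotIndex:
    id: str
    created_at: datetime
    pages: List[str]


def parse_snapshot_index(document: str) -> SnapshotIndex:
    """Parse a snapshot index and check the required properties and absolute page URLs."""
    data = json.loads(document)
    for key in ("id", "createdAt", "pages"):
        if key not in data:
            raise ValueError(f"missing required property: {key}")
    for url in data["pages"]:
        parsed = urlparse(url)
        # An absolute URL carries both a scheme and a host.
        if not (parsed.scheme and parsed.netloc):
            raise ValueError(f"page URL is not absolute: {url}")
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 on, so normalise it.
    created_at = datetime.fromisoformat(data["createdAt"].replace("Z", "+00:00"))
    return SnapshotIndex(data["id"], created_at, list(data["pages"]))


index = parse_snapshot_index("""
{
  "id": "12345678",
  "createdAt": "2023-09-27T20:52:17.000Z",
  "pages": [
    "https://example.datareplication.io/12345678/page/1",
    "https://example.datareplication.io/12345678/page/2",
    "https://example.datareplication.io/12345678/page/3"
  ]
}
""")
```

Unknown properties are ignored, in line with `additionalProperties: true` in the schema.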
Feed
A feed is a sequence of entities that chronicle modifications to a resource, organized into multiple pages. Each page contains one or more entities.
Typically, a page in a feed includes three Link HTTP headers:
- self: the URL of the current page
- prev: the URL of the preceding page
- next: the URL of the subsequent page
NOTE: The initial page lacks a prev link, and the final page omits a next link. Additional details on Link headers can be found in RFC 8288.
Refer to Page Format for the structure of each page, which accommodates one or more entities.
Required headers
Every feed page must include the following headers.
The Link; rel=next and Link; rel=prev headers are optional.
Name | Required | Example |
---|---|---|
Content-Type | yes | Content-Type: multipart/mixed; boundary="rdm-bny" |
Last-Modified | yes | Last-Modified: Mon, 27 Nov 2023 03:10:00 GMT |
Link; rel=self | yes | Link: <https://example.com/feed/hash>; rel=self |
Link; rel=next | no | Link: <https://example.com/feed/hash>; rel=next |
Link; rel=prev | no | Link: <https://example.com/feed/hash>; rel=prev |
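The Link headers of a page could be read with a small parser such as the one below. It is a sketch that assumes the RFC 8288 syntax, with the URL in angle brackets followed by a `rel` parameter, and ignores all other link parameters; `parse_link_headers` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable

# Matches '<URL>; rel=name' or '<URL>; rel="name"'. Other link parameters are ignored.
_LINK_RE = re.compile(r'<([^>]*)>\s*;[^,]*?\brel\s*=\s*"?([^";,\s]+)"?', re.IGNORECASE)


def parse_link_headers(values: Iterable[str]) -> Dict[str, str]:
    """Map relation types (self, prev, next) to URLs for a list of Link header values."""
    links: Dict[str, str] = {}
    for value in values:
        # One header value may carry several comma-separated links.
        for url, rel in _LINK_RE.findall(value):
            links[rel.lower()] = url
    return links


links = parse_link_headers([
    "<https://example.com/feed/2>; rel=self",
    '<https://example.com/feed/1>; rel="prev", <https://example.com/feed/3>; rel="next"',
])
```

A missing `prev` or `next` key in the result then signals the first or the last page of the feed.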
Finding an entry point
Each feed consumer needs to find a valid starting point. This can be determined by reading the feed from the beginning or by using a snapshot as a starting point. In the latter case, the snapshot’s creation date (createdAt in the snapshot index) serves as the entry point.
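When starting from a snapshot, the ISO 8601 creation date has to be compared against HTTP-style Last-Modified values. A minimal sketch of that conversion, with `entry_point_from_snapshot` as a hypothetical helper name:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def entry_point_from_snapshot(created_at: str) -> str:
    """Convert a snapshot's createdAt value into an HTTP-date string."""
    # Normalise a trailing 'Z' because fromisoformat() rejects it before Python 3.11.
    moment = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    # HTTP dates are always expressed in GMT, so convert any other offset first.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


entry_point = entry_point_from_snapshot("2023-09-27T20:52:17.000Z")
```

Fractional seconds are dropped in the conversion, because HTTP dates have a resolution of one second.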
Pagination
To ensure efficient navigation and access for feed consumers, providers need to paginate the feed.
Consumers crawl through the feed using HEAD requests, which return a page’s headers without its body, and GET requests, which retrieve specific pages.
Think of the feed as a doubly-linked list of pages, where each page has a prev and next link.
NOTE: Links must be consistent, i.e. the prev and next links of adjacent pages must match and the feed must not form a loop.
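The consistency rule can be checked mechanically. The sketch below works on an in-memory map from page URL to that page's `prev` and `next` links (a hypothetical representation; a real consumer would obtain the links from HEAD responses) and tolerates a missing `prev` link.

```python
from typing import Dict, Optional


def check_links(pages: Dict[str, Dict[str, Optional[str]]], first_url: str) -> int:
    """Walk next links from first_url, checking prev/next agreement and absence of loops.

    Returns the number of pages visited.
    """
    seen = set()
    previous: Optional[str] = None
    current: Optional[str] = first_url
    while current is not None:
        if current in seen:
            raise ValueError(f"feed forms a loop at {current}")
        seen.add(current)
        links = pages[current]
        prev = links.get("prev")
        # If a page has a prev link, it must point back at the page we came from.
        if previous is not None and prev is not None and prev != previous:
            raise ValueError(f"prev link of {current} does not match {previous}")
        previous, current = current, links.get("next")
    return len(seen)


feed = {
    "https://example.com/feed/1": {"prev": None, "next": "https://example.com/feed/2"},
    "https://example.com/feed/2": {"prev": "https://example.com/feed/1", "next": "https://example.com/feed/3"},
    "https://example.com/feed/3": {"prev": "https://example.com/feed/2", "next": None},
}
visited = check_links(feed, "https://example.com/feed/1")
```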
Last-Modified
Each entity must have a Last-Modified header, formatted using the timestamp format for HTTP Last-Modified headers (for example, Mon, 27 Nov 2023 03:10:00 GMT).
NOTE: The timestamps must be monotonically increasing across all pages.
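A producer or consumer could verify the ordering with a check like the one below. It is a sketch that assumes equal timestamps on consecutive entities are acceptable, as in the feed page example where both entities share one Last-Modified value; `is_monotonic` is a hypothetical name.

```python
from email.utils import parsedate_to_datetime
from typing import Iterable


def is_monotonic(last_modified_values: Iterable[str]) -> bool:
    """Return True if the HTTP-date values never decrease in the given order."""
    previous = None
    for value in last_modified_values:
        current = parsedate_to_datetime(value)
        # Assumption: equal timestamps are allowed, only a decrease is an error.
        if previous is not None and current < previous:
            return False
        previous = current
    return True
```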
Immutability of pages
Treat published pages as immutable once created, with a few specific exceptions:
- New entities can be added to the most recent page if it doesn’t have a next link.
- A next link can be introduced to the most recent page, after which it becomes immutable.
- A prev link may be removed from an older page as part of a gradual cleanup process.
Consumer Algorithm
- Crawl back via timestamp.
- Walk forward:
  - Skip everything older than the timestamp.
  - With content ID: start at the first entity after the content ID (exactly-once semantics).
  - Without content ID: start at the first entity with the given timestamp (at-least-once semantics).
- Forward-crawling via page timestamps is allowed as an optimization,
  - but probably not that useful?
- Consumers SHOULD report errors:
  - missing content ID
  - timestamps not old enough
- Musings on different entry points?
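The crawl-back and walk-forward steps might look as follows. This is a sketch over an in-memory model, not a definitive implementation: `Entity`, `Page`, `consume`, and the `fetch` callback are hypothetical, the page-level Last-Modified is assumed to equal the timestamp of the page's newest entity, and only the "missing content ID" error is reported.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterator, List, Optional


@dataclass
class Entity:
    content_id: str
    last_modified: datetime
    content: bytes


@dataclass
class Page:
    last_modified: datetime  # assumed: timestamp of the newest entity on the page
    entities: List[Entity]
    prev: Optional[str] = None
    next: Optional[str] = None


def consume(
    fetch: Callable[[str], Page],
    latest_url: str,
    since: datetime,
    after_content_id: Optional[str] = None,
) -> Iterator[Entity]:
    """Yield the entities a consumer still has to process."""
    # Crawl back via prev links until a page is entirely older than `since`
    # or the first page is reached. A real consumer could use HEAD requests here.
    page = fetch(latest_url)
    while page.prev is not None and page.last_modified >= since:
        page = fetch(page.prev)

    # Walk forward, skipping everything older than `since`.
    started = after_content_id is None
    while True:
        for entity in page.entities:
            if entity.last_modified < since:
                continue
            if not started:
                # With a content ID, resume right after the matching entity.
                if entity.content_id == after_content_id:
                    started = True
                continue
            yield entity
        if page.next is None:
            break
        page = fetch(page.next)

    if not started:
        raise LookupError(f"content ID not found: {after_content_id}")
```

Called without `after_content_id`, the generator yields every entity at or after the timestamp, so an entity may be seen again (at-least-once). With it, consumption resumes right after the recorded entity (exactly-once).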
Resume consumption
After fully consuming the feed, it’s important for the consumer to note the Last-Modified date of the last entity.
This date then becomes the new entry point for future consumption.
Additionally, to avoid processing the same entity multiple times, consumers are advised to record the Content-ID of the last entity they processed.
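The two recorded values could be persisted as a small checkpoint. The sketch below stores them as JSON; the `Checkpoint` type, its field names, and the storage format are illustrative assumptions, not part of the specification.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Checkpoint:
    last_modified: str         # Last-Modified value of the last processed entity
    content_id: Optional[str]  # Content-ID of the last processed entity, if recorded


def dump_checkpoint(checkpoint: Checkpoint) -> str:
    """Serialise the checkpoint so it survives a consumer restart."""
    return json.dumps(asdict(checkpoint))


def load_checkpoint(text: str) -> Checkpoint:
    """Restore a checkpoint written by dump_checkpoint()."""
    data = json.loads(text)
    return Checkpoint(data["last_modified"], data.get("content_id"))


saved = dump_checkpoint(
    Checkpoint("Mon, 27 Nov 2023 03:10:00 GMT", "<1-B@random-content-id>")
)
restored = load_checkpoint(saved)
```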