Audio Transcripts
NPR transcribes the audio content of many of its shows. This data is provided to users via the website as well as through third parties such as LexisNexis. In order to represent this data, CDS supports representing the full text of an audio asset through linked transcripts. This section goes through how CDS expects users to publish an audio asset with a linked transcript.
How it all fits together - audio and transcripts
Audio assets live in the assets array of a publishable document (reminder: documents with the publishable profile have the assets array), with pointers to them living in the audio array (the audio array is part of the has-audio profile). Transcripts, on the other hand, live as stand-alone documents. How do these two get linked together?
This is where the transcriptLink and transcriptWebPageLink fields of an audio asset come in.
First we start with the transcript document
The transcript profile documentation has some basic information on how to get started with making the necessary JSON for a transcript. We’ll go into a bit more detail in this tutorial.
For this tutorial we’ll use a basic transcript. Assume that the audio file at https://npr-audio-host.org/audio-files/the-podcast/episode-1.mp3 has the one scripted line “Hello this is a transcript” (not much of a real-world example, since most transcripts have multiple lines of text, but we’re keeping it simple here). We’d represent the transcript for such a show in CDS like this:
{
  "id": "doc-transcript-123456",
  "profiles": [
    {
      "href": "/v1/profiles/transcript",
      "rels": [ "type" ]
    },
    {
      "href": "/v1/profiles/document"
    }
  ],
  "text": "<Speaker> Hello this is a transcript"
}
Let’s break things down.
id: This is a unique ID that includes your Document Prefix. For help on what that means for you, visit the “Publishing” section of Getting Started. Note that this is the document ID for the transcript that sits stand-alone in CDS. No one can find this transcript unless they have a link to /v1/documents/doc-transcript-123456 (more on this later).
profiles: The array containing our “type” profile /v1/profiles/transcript and the necessary base class profile /v1/profiles/document. See the “Publishing” section of Getting Started for more.
text: This is a string, meaning any length of text. It contains the entire transcript. Usually, at NPR, this means more than the one line in our example. When there are multiple lines, we’d put line breaks, more speaker handles, etc., in that string.
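As a minimal sketch of the multi-line case, the Python below assembles a transcript document shaped like the example above from a list of speaker/line pairs. The helper name build_transcript_document, the speaker names, the second line of dialogue, and the use of a newline as the line break are all assumptions for illustration; the source only says that text is a single string holding the whole transcript.

```python
import json


def build_transcript_document(doc_id, lines):
    """Return a transcript document dict shaped like the example above.

    `lines` is a list of (speaker, spoken_text) pairs. The "<speaker> text"
    layout and the newline separator are assumptions that mirror the
    one-line example; they are not a documented CDS format.
    """
    text = "\n".join(f"<{speaker}> {spoken}" for speaker, spoken in lines)
    return {
        "id": doc_id,
        "profiles": [
            {"href": "/v1/profiles/transcript", "rels": ["type"]},
            {"href": "/v1/profiles/document"},
        ],
        "text": text,
    }


transcript = build_transcript_document(
    "doc-transcript-123456",
    [
        ("Host", "Hello this is a transcript"),
        ("Guest", "Thanks for having me"),  # hypothetical second line
    ],
)
print(json.dumps(transcript, indent=2))
```

Keeping the profiles fixed inside the helper means only the ID and the spoken lines vary from one transcript to the next.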
With this inserted into CDS we now have a beautifully crafted transcript document that lives at the address https://content.api.npr.org/v1/documents/doc-transcript-123456. But wait! You might be asking:
- How would anyone know how to get this document? What about automated systems scraping CDS?
- How do we map this document to the document representing the story the transcript is transcribing?
Read on to learn more about pointing to the transcript via the power of Hypermedia.
Pointing users to your transcript with transcriptLink
Hypermedia is the practice of connecting the elements of an API to one another through links. CDS applies this pattern to transcripts by placing the address of the transcript document inside an audio asset, which lives in the document representing the podcast episode, story, or other content being transcribed.
If we have a document inside of CDS that represents our transcript, then we link it to an audio asset via transcriptLink. transcriptLink follows the same link type schema used elsewhere in CDS: it has the properties href and rels.
Let’s take a look at the example below where we’ve snipped out a single audio asset from a CDS document representing a podcast episode:
{
  "id": "12345",
  "profiles": [
    {
      "href": "/v1/profiles/audio"
    },
    {
      "href": "/v1/profiles/document"
    }
  ],
  "title": "Episode 1: How we made a podcast",
  "enclosures": [
    {
      "href": "https://npr-audio-host.org/audio-files/the-podcast/episode-1.mp3",
      "rels": [
        "sponsored",
        "tracked"
      ],
      "type": "audio/mpeg",
      "fileSize": 50000000
    }
  ],
  "duration": 360,
  "isAvailable": true,
  "isDownloadable": true,
  "isStreamable": true,
  "availabilityMessage": "This audio is available.",
  "transcriptLink": {
    "href": "/v1/documents/doc-transcript-123456",
    "rels": [
      "extension",
      "always-display"
    ]
  },
  "embeddedPlayerLink": "https://the-podcast.com/episode-1/player.html"
}
There’s a lot going on here, but what we want to focus on is the transcriptLink field of the body. This field is an object that adheres to the internal-document-link schema. What’s important to note about that schema here is:
- You must include an href field inside the transcriptLink object
- The value of href must be in the pattern /v1/documents/{id} where {id} is the transcript document ID inside of CDS
As you may recall, we already have an internal document to link to: the transcript we made earlier. The href /v1/documents/doc-transcript-123456 tells anyone looking at this JSON that they can get the transcript for this audio at that location.
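To illustrate how a consumer might follow that link, here is a small sketch that turns the relative href into a full address. The host is the one shown in the transcript address earlier in this tutorial; the function name transcript_url is hypothetical, and no request is actually made, since how a client authenticates with CDS is outside the scope of this section.

```python
from urllib.parse import urljoin

# Host taken from the transcript address shown earlier in this tutorial.
CDS_BASE_URL = "https://content.api.npr.org"


def transcript_url(audio_asset):
    """Return the full address of the linked transcript, or None if absent."""
    link = audio_asset.get("transcriptLink")
    if link is None:
        return None
    # The href is relative to the CDS host, so join the two.
    return urljoin(CDS_BASE_URL, link["href"])


audio_asset = {
    "id": "12345",
    "transcriptLink": {
        "href": "/v1/documents/doc-transcript-123456",
        "rels": ["extension", "always-display"],
    },
}
print(transcript_url(audio_asset))
```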
Some more details:
- You must have a rels array with at least the extension attribute inside the array
- Optionally you can have the always-display rel set inside rels. This tells the NPR site and other platforms to force displaying the transcript.
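The rules above can be sketched as a pre-publish check. This is a hedged illustration, not the actual internal-document-link schema: the function name transcript_link_problems is hypothetical, and the regular expression assumes a document ID contains no slashes or whitespace, which the source does not state.

```python
import re

# Assumption: a document ID has no "/" or whitespace in it.
HREF_PATTERN = re.compile(r"^/v1/documents/[^/\s]+$")


def transcript_link_problems(link):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    href = link.get("href")
    if not isinstance(href, str) or not HREF_PATTERN.match(href):
        problems.append("href must look like /v1/documents/{id}")

    rels = link.get("rels")
    if not isinstance(rels, list) or "extension" not in rels:
        problems.append('rels must be an array containing "extension"')

    return problems


good_link = {
    "href": "/v1/documents/doc-transcript-123456",
    "rels": ["extension", "always-display"],  # always-display is optional
}
bad_link = {"href": "https://example.org/transcript.html"}

print(transcript_link_problems(good_link))
print(transcript_link_problems(bad_link))
```

Returning a list of problems rather than raising on the first one lets a publisher see every issue with a link in a single pass.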
Addendum: transcriptWebPageLink
There is an additional field that audio assets can have: transcriptWebPageLink. This link has no rels array; it is only used to point to the full URI of the web page displaying the transcript. This differs from transcriptLink above, which links the actual CDS transcript document to the audio asset.
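As a rough sketch of the difference, the snippet below checks a transcriptWebPageLink value. It assumes the field is an object with a single href property, which is inferred from the description above rather than confirmed by a schema. The page URL and the function name is_valid_transcript_web_page_link are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_transcript_web_page_link(link):
    """True if the link holds a full http(s) URI and carries no rels array."""
    href = link.get("href")
    if not isinstance(href, str):
        return False
    parts = urlparse(href)
    is_full_uri = parts.scheme in ("http", "https") and bool(parts.netloc)
    return is_full_uri and "rels" not in link


# Hypothetical web page that displays the transcript to readers.
web_page_link = {"href": "https://the-podcast.com/episode-1/transcript"}

print(is_valid_transcript_web_page_link(web_page_link))
```

A relative CDS path such as /v1/documents/doc-transcript-123456 would fail this check, because that value belongs in transcriptLink.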