Audio Transcripts

NPR transcribes the audio content of many of its shows. These transcripts are provided to users via the website as well as through third parties such as LexisNexis. To represent this data, CDS supports attaching the full text of an audio asset through linked transcripts. This section walks through how CDS expects users to publish an audio asset with a linked transcript.

How it all fits together - audio and transcripts

Audio assets live in the assets array of a publishable document (reminder: documents with the publishable profile have the assets array), with pointers to them living in the audio array (the audio array is part of the has-audio profile). Transcripts, on the other hand, live as stand-alone documents. How do these two get linked together?

This is where the transcriptLink and transcriptWebPageLink fields inside of an audio asset come in.

First we start with the transcript document

The transcript profile documentation has some basic information on how to get started with making the necessary JSON for a transcript. We’ll thread the needle a bit more in this tutorial.

For this tutorial we’ll use a basic transcript. Assume that the audio file for https://npr-audio-host.org/audio-files/the-podcast/episode-1.mp3 has the one scripted line of “Hello this is a transcript” (not much of a real world example - most transcripts have multiple lines of text, but we’re keeping it simple here). We’d represent a transcript for such a show in CDS like this:

{
    "id": "doc-transcript-123456",
    "profiles": [
        {
            "href": "/v1/profiles/transcript",
            "rels": [ "type" ]
        },
        {
            "href": "/v1/profiles/document"
        }
    ],
    "text": "<Speaker> Hello this is a transcript"
}

Let’s break things down.

id: This is a unique ID that includes your Document Prefix. For help on what that all means to you, visit the “Publishing” section of Getting Started. Note that this is the document ID for the transcript that sits stand-alone in CDS. No one can find this transcript unless they have a link to /v1/documents/doc-transcript-123456 - more on this later.

profiles: The array containing our “type” profile /v1/profiles/transcript and the necessary base class profile /v1/profiles/document. See the “Publishing” section of Getting Started for more.

text: This is a string of any length. It contains the entirety of the transcript. Usually, at NPR, this means more than the one line in our example. When there are multiple lines, we’d put line breaks, more speaker handles, etc., in that string.
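As a rough illustration of the fields described above, here is a minimal Python sketch that assembles the same transcript document and prints it as JSON. The helper name build_transcript_document is hypothetical, and joining speaker lines with a newline in the "<Speaker> text" shape is an assumption based on the one-line example; check the transcript profile documentation for the formatting CDS actually expects.

```python
import json


def build_transcript_document(doc_id, lines):
    """Assemble a transcript document shaped like the example above.

    `lines` is a list of (speaker, spoken_text) pairs.
    """
    # Assumption: one "<Speaker> text" entry per line, separated by newlines.
    text = "\n".join(f"<{speaker}> {spoken}" for speaker, spoken in lines)
    return {
        "id": doc_id,
        "profiles": [
            # The "type" profile for transcripts.
            {"href": "/v1/profiles/transcript", "rels": ["type"]},
            # The base document profile.
            {"href": "/v1/profiles/document"},
        ],
        "text": text,
    }


document = build_transcript_document(
    "doc-transcript-123456",
    [("Speaker", "Hello this is a transcript")],
)
print(json.dumps(document, indent=4))
```

Passing more (speaker, text) pairs produces a longer text string while the id and profiles stay the same.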

With this inserted into CDS we now have a beautifully crafted transcript document that lives at the address https://content.api.npr.org/v1/documents/doc-transcript-123456. But wait! You might be asking:

  • How would anyone know how to get this document? What about automated systems scraping CDS?
  • How do we map this document to the document representing the story the transcript is transcribing?

Read on to learn more about pointing to the transcript via the power of Hypermedia.

Hypermedia is a pattern in which the elements of an API are connected to each other via links. CDS implements this pattern for transcripts by placing the above-mentioned address of the transcript document inside an audio asset, which lives in the document representing the podcast episode, story or other content being transcribed.

If we have a document inside of CDS that represents our transcript, we link it to an audio asset via transcriptLink. transcriptLink follows the same link schema as other links in CDS: it has the properties href and rels.

Let’s take a look at the example below where we’ve snipped out a single audio asset from a CDS document representing a podcast episode:

{
  "id": "12345",
  "profiles": [
    {
      "href": "/v1/profiles/audio"
    },
    {
      "href": "/v1/profiles/document"
    }
  ],
  "title": "Episode 1: How we made a podcast",
  "enclosures": [
    {
      "href": "https://npr-audio-host.org/audio-files/the-podcast/episode-1.mp3",
      "rels": [
        "sponsored",
        "tracked"
      ],
      "type": "audio/mpeg",
      "fileSize": 50000000
    }
  ],
  "duration": 360,
  "isAvailable": true,
  "isDownloadable": true,
  "isStreamable": true,
  "availabilityMessage": "This audio is available.",
  "transcriptLink": {
    "href": "/v1/documents/doc-transcript-123456",
    "rels": [
      "extension",
      "always-display"
    ]
  },
  "embeddedPlayerLink": "https://the-podcast.com/episode-1/player.html"
}

There’s a lot going on here, but what we want to focus on is the transcriptLink field of the body. This field is an object that adheres to the internal-document-link schema. What’s important to note about that schema here is:

  • You must include an href field inside the transcriptLink object
  • The value of href must be in the pattern /v1/documents/{id} where {id} is the transcript document ID inside of CDS

Recall the transcript document we made earlier: it is the target of this internal link. The href /v1/documents/doc-transcript-123456 tells anyone reading this JSON that they can get the transcript for this audio at that location.
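To make that concrete, here is a small Python sketch of how a client might turn the relative href into the full transcript address. The function name transcript_url is hypothetical, and the base URL is taken from the transcript address shown earlier in this tutorial; a real client would also need to handle authentication and the HTTP request itself, which are not covered here.

```python
from urllib.parse import urljoin

# Host taken from the transcript address used earlier in this tutorial.
CDS_BASE_URL = "https://content.api.npr.org"


def transcript_url(audio_asset, base_url=CDS_BASE_URL):
    """Return the absolute URL of the linked transcript, or None if absent."""
    link = audio_asset.get("transcriptLink")
    if not link or "href" not in link:
        # This audio asset has no linked transcript document.
        return None
    # Resolve the relative href against the API host.
    return urljoin(base_url, link["href"])


asset = {
    "id": "12345",
    "transcriptLink": {
        "href": "/v1/documents/doc-transcript-123456",
        "rels": ["extension", "always-display"],
    },
}
print(transcript_url(asset))
```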

Some more details:

  • You must have a rels array that contains at least the extension rel
  • Optionally, you can add the always-display rel to rels. This tells the NPR site and other platforms to force display of the transcript.
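The rules listed above for transcriptLink can be sketched as a small pre-publish check in Python. This is only an illustration: the function name and the regular expression are assumptions based on the description in this tutorial, not the actual internal-document-link schema, which remains the authority.

```python
import re

# Assumed reading of the "/v1/documents/{id}" pattern described above.
HREF_PATTERN = re.compile(r"^/v1/documents/[^/]+$")


def transcript_link_problems(link):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    href = link.get("href")
    if not isinstance(href, str):
        problems.append("href is required")
    elif not HREF_PATTERN.match(href):
        problems.append("href must look like /v1/documents/{id}")

    rels = link.get("rels")
    if not isinstance(rels, list) or "extension" not in rels:
        problems.append('rels must be an array containing "extension"')

    return problems


good_link = {
    "href": "/v1/documents/doc-transcript-123456",
    "rels": ["extension", "always-display"],
}
bad_link = {"href": "https://example.org/transcript", "rels": ["always-display"]}

print(transcript_link_problems(good_link))
print(transcript_link_problems(bad_link))
```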

Audio assets can have one additional field: transcriptWebPageLink. This link has no rels array; it is only used to point to the full URI of a web page that displays the transcript. That is different from transcriptLink above, which links the actual CDS transcript document to the audio asset.
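To illustrate the difference between the two fields, here is a Python sketch that reads both from an audio asset. It assumes transcriptWebPageLink is an object with a single href property, which this tutorial implies but does not show; the web page URL and the function name transcript_targets are hypothetical.

```python
def transcript_targets(audio_asset):
    """Collect the CDS document path and the web page URL for a transcript."""
    # Either field may be missing, so fall back to an empty object.
    doc_link = audio_asset.get("transcriptLink") or {}
    page_link = audio_asset.get("transcriptWebPageLink") or {}
    return {
        # Relative path to the transcript document inside CDS.
        "document": doc_link.get("href"),
        # Full URI of a page that displays the transcript (assumed shape).
        "web_page": page_link.get("href"),
    }


asset = {
    "id": "12345",
    "transcriptLink": {
        "href": "/v1/documents/doc-transcript-123456",
        "rels": ["extension"],
    },
    # Hypothetical URL, used only for illustration.
    "transcriptWebPageLink": {
        "href": "https://the-podcast.com/episode-1/transcript",
    },
}
print(transcript_targets(asset))
```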


© 2024 npr