XTranscript is an online tool for converting transcripts saved in mainstream document formats (such as Microsoft Word, Open Document, PDF or TXT) into a lightweight XML format.

It has been developed in the field of linguistics to enable combined qualitative and quantitative studies of spoken language. Converting transcripts into XML allows for powerful and mature XML processing tools, such as XPath and XQuery, to be used to search or summarise features of the transcripts.


Features

XTranscript currently offers two configurations for the conversion process:

Part-of-speech (grammatical) tagging can also be performed for English texts using the Stanford CoreNLP library. The Stanford tagger uses the Penn Treebank tag set. Summary of the tag set.

If you are interested in converting your own notation, please get in touch and we'll see what we can do.

Results of Conversion

Given the text:

    1   BOB:    he's pretty special
    2           still beat everybody else
    3   SALLY:  yeah
    ...

The resulting XML file will have the structure:

    <transcript file="example.doc">
        <stext>
            <u who="BOB" n="1" line="1">
                he 's pretty special
                still beat everybody else
            </u>
            <u who="SALLY" n="2" line="3">
                yeah
            </u>
            ...
        </stext>
    </transcript>

Each utterance has a who attribute identifying the speaker and an n attribute giving its number. Where possible, the line number from the original file is included in a line attribute.
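
As a rough illustration of how this structure can be searched, the sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. It is not part of XTranscript; the sample string simply repeats the output shown above with the "..." placeholders left out, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample output in the structure shown above (the "..." placeholders are omitted).
SAMPLE = """<transcript file="example.doc">
    <stext>
        <u who="BOB" n="1" line="1">
            he 's pretty special
            still beat everybody else
        </u>
        <u who="SALLY" n="2" line="3">
            yeah
        </u>
    </stext>
</transcript>"""

root = ET.fromstring(SAMPLE)

# Count utterances per speaker using the who attribute.
utterances_per_speaker = Counter(u.get("who") for u in root.iter("u"))

# Select one speaker's utterances with an XPath attribute predicate,
# collapsing the indentation whitespace inside each utterance.
bob_utterances = [
    " ".join(u.text.split()) for u in root.findall(".//u[@who='BOB']")
]

print(dict(utterances_per_speaker))
print(bob_utterances)
```

Dedicated tools such as BaseX or eXistdb accept full XPath and XQuery, so the same kind of query can be written there without a host language.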

With part-of-speech tagging turned on, word elements will also be included:

    <transcript file="example.doc">
        <stext>
            <u who="BOB" n="1" line="1">
                <w n="1.1" pos="PRP" lemma="he">he</w>
                <w n="1.2" pos="VBZ" lemma="be">'s</w>
                <w n="1.3" pos="RB" lemma="pretty">pretty</w>
                <w n="1.4" pos="JJ" lemma="special">special</w>
                ...
            </u>
            ...
        </stext>
    </transcript>

Note that only the transcribed words are stored as text content in the XML; all other data is stored in attributes. This makes extracting the textual content easier and also allows the XML files to be used in corpus linguistics software (such as AntConc and WordSmith).
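
The following sketch shows one way this separation of text and attributes can be used, again with Python's standard library rather than anything provided by XTranscript. It assumes the tagged output has exactly the shape shown above (with the "..." placeholders removed), counts part-of-speech tags, and recovers the plain text by reading only the text content.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tagged sample in the structure shown above (the "..." placeholders are omitted).
TAGGED = """<transcript file="example.doc">
    <stext>
        <u who="BOB" n="1" line="1">
            <w n="1.1" pos="PRP" lemma="he">he</w>
            <w n="1.2" pos="VBZ" lemma="be">'s</w>
            <w n="1.3" pos="RB" lemma="pretty">pretty</w>
            <w n="1.4" pos="JJ" lemma="special">special</w>
        </u>
    </stext>
</transcript>"""

root = ET.fromstring(TAGGED)

# Frequency of each part-of-speech tag, read from the pos attribute.
pos_counts = Counter(w.get("pos") for w in root.iter("w"))

# itertext() yields text content only, so attribute values never appear here.
plain_text = " ".join("".join(root.itertext()).split())

# Lemmas for one speaker, selected with an XPath attribute predicate.
bob_lemmas = [w.get("lemma") for w in root.findall(".//u[@who='BOB']/w")]

print(dict(pos_counts))
print(plain_text)
print(bob_lemmas)
```

Because the words are the only text content, the plain-text step needs no knowledge of the attribute names.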

Error Messages

Messages will be displayed in the XTranscript interface or as tags at the end of the generated XML file. Some of these will give information about errors that occurred during the conversion process.

A common warning you may see relates to the 'autoclosing' of annotations. This happens when only an opening or a closing bracket was found where both are required. In that case, XTranscript finds the longest span of text that the annotation can be applied to. This is a suitable solution for overlaps, which may be written as:

    4   BOB:    won another [medal
    5   SALLY:              [yeah

Autoclosing may also happen when different annotations overlap, which would produce invalid XML unless a change is made. In these cases, one of the annotations will be moved so that it is contained within the other. For example:

    6   SALLY:  still [beat (everybody] yes)

The parentheses will be closed at the end of 'everybody', i.e. where the square brackets end. To deal with these situations accurately, either the original file or the generated XML would need to be edited (e.g. to create two sets of parentheses instead of one, but please note the effect this would have on quantitative studies).

These changes are made to ensure that the resulting file is 'well-formed', which allows the XML to be used with other XML processing software. XTranscript checks the generated XML and informs the user whether it is well-formed. If this check reports an error, you will most likely need to contact us so that we can look into the problem in more detail.
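
To make the idea of well-formedness concrete, here is a minimal sketch of such a check using Python's standard XML parser. It is not XTranscript's own check. The element names overlap and paren are hypothetical stand-ins for the square-bracket and parenthesis annotations in line 6 above, since the actual element names XTranscript generates are not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical markup: the annotations cross, as in the original notation.
crossing = (
    '<u who="SALLY">still <overlap>beat <paren>everybody</overlap>'
    " yes</paren></u>"
)

# Hypothetical markup after autoclosing: the parentheses end where the
# square brackets end, so one element is fully contained in the other.
nested = (
    '<u who="SALLY">still <overlap>beat <paren>everybody</paren></overlap>'
    " yes</u>"
)

print(is_well_formed(crossing))  # False: the closing tags are mismatched
print(is_well_formed(nested))    # True: the elements are properly nested
```

The same function can be applied to the contents of a generated file before loading it into other XML software.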

Related Projects

The Corpus of Academic Spoken English (CASE) is being compiled by a team of researchers at Saarland University. Birmingham City University is one partner whose students are taking part in the study. The first iteration of XTranscript was developed for use in the CASE project.

Useful Publications about XML

Hardie, A. (2014) 'Modest XML for corpora: Not a standard, but a suggestion'. ICAME Journal (38) 73-103.

Ruehlemann, C., A. Bagoutdinov & M.B. O'Donnell (2015) 'Modest XPath and XQuery for corpora: Exploiting deep XML annotation'. ICAME Journal (39) 47-84. Companion website and resources.

Useful Software for Working with XML

BaseX: An open-source database system for searching XML documents.

oXygen: A powerful commercial XML editor with search features.

W3C XML Validator: An online service to validate (and find errors in) XML documents.

Never underestimate the value of a good text editor. These provide some features (like syntax highlighting) to help with editing XML files: Notepad++ (Windows), BBEdit (Mac OSX).

eXistdb: An open-source system (which runs as a web service) for editing and searching XML.

Notes