XTranscript is an online tool for converting transcripts saved in mainstream document formats (such as Microsoft Word, Open Document, PDF or TXT) into a lightweight XML format.
It has been developed in the field of linguistics to enable combined qualitative and quantitative studies of spoken language. Converting transcripts into XML allows for powerful and mature XML processing tools, such as XPath and XQuery, to be used to search or summarise features of the transcripts.
On this Page
- Results of Conversion
- Error Messages
- Related Projects
- Useful Publications
- Useful Software
XTranscript currently offers two configurations for the conversion process:
- Basic: Utterances will be detected and the text will be tokenised (split into words).
- Conversation Analysis notation: In addition to the utterances, Jefferson notation (and a few known extensions) will be identified and recorded in XML elements.
If you are interested in converting your own notation then please get in touch and we'll see what we can do.
Results of Conversion
Given the text:
1 BOB: he's pretty special
2 still beat everybody else
3 SALLY: yeah
...

The resulting XML file will have the structure:
<transcript file="example.doc">
  <stext>
    <u who="BOB" n="1" line="1">
      he 's pretty special still beat everybody else
    </u>
    <u who="SALLY" n="2" line="3">
      yeah
    </u>
    ...
  </stext>
</transcript>

Each utterance has a who attribute identifying the speaker and an n attribute giving its number. Where possible, the line number from the original file is recorded in a line attribute.
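As a rough illustration of how this basic output might be queried, the sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sample string is hand-written to mirror the structure shown above and is not real XTranscript output. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample mirroring the structure above (assumption: real
# XTranscript output may contain further elements or attributes).
SAMPLE = """\
<transcript file="example.doc">
  <stext>
    <u who="BOB" n="1" line="1"> he 's pretty special still beat everybody else </u>
    <u who="SALLY" n="2" line="3"> yeah </u>
  </stext>
</transcript>"""

root = ET.fromstring(SAMPLE)

# Count whitespace-separated tokens per speaker.
tokens_per_speaker = Counter()
for u in root.findall("./stext/u"):
    text = "".join(u.itertext())
    tokens_per_speaker[u.get("who")] += len(text.split())

# Select one speaker's utterances with an XPath attribute predicate.
bob_lines = [u.get("line") for u in root.findall(".//u[@who='BOB']")]

for speaker, count in tokens_per_speaker.items():
    print(speaker, count)
print("BOB's utterances start on lines:", bob_lines)
```

For larger collections, or for queries beyond ElementTree's XPath subset, a full XPath or XQuery processor is likely to be a better fit.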
With part-of-speech tagging turned on, word elements will also be included:
<transcript file="example.doc">
  <stext>
    <u who="BOB" n="1" line="1">
      <w n="1.1" pos="PRP" lemma="he">he</w>
      <w n="1.2" pos="VBZ" lemma="be">'s</w>
      <w n="1.3" pos="RB" lemma="pretty">pretty</w>
      <w n="1.4" pos="JJ" lemma="special">special</w>
      ...
    </u>
    ...
  </stext>
</transcript>

Note that only the text is stored as text elements in the XML, and all other data is stored within attributes. This makes extracting textual content easier and also allows the XML files to be used in Corpus Linguistic analysis software (such as AntConc and WordSmith).
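Because the annotation lives in attributes, simple quantitative summaries need only a few lines of code. The following is a minimal sketch, again using Python's standard xml.etree.ElementTree module. It assumes the w, pos and lemma names shown above, and the sample is hand-written and limited to the four words given in the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample limited to the four words shown above; the elided
# parts of the original example are deliberately left out.
TAGGED = """\
<transcript file="example.doc">
  <stext>
    <u who="BOB" n="1" line="1">
      <w n="1.1" pos="PRP" lemma="he">he</w>
      <w n="1.2" pos="VBZ" lemma="be">'s</w>
      <w n="1.3" pos="RB" lemma="pretty">pretty</w>
      <w n="1.4" pos="JJ" lemma="special">special</w>
    </u>
  </stext>
</transcript>"""

root = ET.fromstring(TAGGED)

# Frequency of each part-of-speech tag across all word elements.
pos_counts = Counter(w.get("pos") for w in root.iter("w"))

# Word forms carrying a particular tag, via an XPath attribute predicate.
adjectives = [w.text for w in root.findall(".//w[@pos='JJ']")]

# Lemmas are read from attributes, independently of the surface forms.
lemmas = [w.get("lemma") for w in root.iter("w")]

# Text content only: concatenate all text and normalise whitespace.
plain_text = " ".join("".join(root.itertext()).split())

print(pos_counts)
print(adjectives)
print(lemmas)
print(plain_text)
```

The last step shows why keeping annotation out of the text content is convenient: the running text can be recovered without any knowledge of the attributes.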
Error Messages
Messages will be displayed in the XTranscript interface or as tags at the end of the generated XML file. Some of these will give information about errors that occurred during the conversion process.
A common warning you may see relates to the 'autoclosing' of annotations. This happens when only an opening or closing bracket was found where both are required. In this case, XTranscript finds the longest span of text that the annotation can be applied to. This is a suitable solution for overlaps, which may be written as:
4 BOB: won another [medal
5 SALLY: [yeah

Autoclosing may also happen when different annotations overlap, which would generate invalid XML without a change being made. In these cases, one of the annotations will be moved so that it is contained within the other. For example:
6 SALLY: still [beat (everybody] yes)

The parentheses will be closed at the end of 'everybody', i.e. where the square brackets end. To accurately deal with these situations, either the original file or the generated XML would need to be edited (e.g. to create two sets of parentheses instead of one, but please note the effect this would have on quantitative studies).
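The behaviour described above can be approximated with a simple stack. The sketch below is not XTranscript's actual implementation. The function name nest_annotations and the element names overlap and unclear are hypothetical, chosen only for illustration. It closes any inner annotation when an outer one ends, drops a closing bracket that has nothing left to close, and closes whatever is still open at the end of the line.

```python
from xml.sax.saxutils import escape

# Hypothetical element names; XTranscript's real output vocabulary may differ.
BRACKETS = {"[": ("overlap", "]"), "(": ("unclear", ")")}
CLOSERS = {close: name for name, close in BRACKETS.values()}


def nest_annotations(line: str) -> str:
    """Turn bracket notation into properly nested XML-style markup."""
    out = []
    open_names = []  # stack of annotation names that are currently open
    for ch in line:
        if ch in BRACKETS:
            name = BRACKETS[ch][0]
            open_names.append(name)
            out.append(f"<{name}>")
        elif ch in CLOSERS:
            name = CLOSERS[ch]
            if name in open_names:
                # Close inner annotations first so that tags never cross.
                while open_names:
                    top = open_names.pop()
                    out.append(f"</{top}>")
                    if top == name:
                        break
            # Otherwise the closer is dropped: its annotation is not open.
        else:
            out.append(escape(ch))
    # Anything still open is auto-closed at the end of the line.
    while open_names:
        out.append(f"</{open_names.pop()}>")
    return "".join(out)


print(nest_annotations("still [beat (everybody] yes)"))
# still <overlap>beat <unclear>everybody</unclear></overlap> yes
print(nest_annotations("won another [medal"))
# won another <overlap>medal</overlap>
```

In the first call the parenthesised span ends at 'everybody' rather than after 'yes', which is the loss of information the paragraph above warns about.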
These changes are made to ensure that the resulting file is 'well-formed', which allows the XML to be used with other XML processing software. XTranscript performs a check of the generated XML to inform the user if it is well-formed or not. There is a good chance that if this check produces an error, you will need to contact us to look into the problem in more detail.
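A comparable check can be run locally on a downloaded file. The sketch below uses Python's standard xml.etree.ElementTree parser and is an assumption about how such a check could be done, not a description of the check XTranscript itself performs. The function name check_well_formed and the element names in the examples are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str) -> tuple[bool, str]:
    """Return (True, "well-formed") or (False, parser error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return False, str(err)
    return True, "well-formed"


# Properly nested annotations parse without error.
nested = "<u><overlap>beat <unclear>everybody</unclear></overlap> yes</u>"
# Crossing annotations are rejected by any XML parser.
crossing = "<u><overlap>beat <unclear>everybody</overlap> yes</unclear></u>"

for label, text in [("nested", nested), ("crossing", crossing)]:
    ok, message = check_well_formed(text)
    print(label, ok, message)
```

Well-formedness only covers syntax. It does not confirm that the annotations reflect what the transcriber intended.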
Related Projects
The Corpus of Academic Spoken English (CASE) is being compiled by a team of researchers at Saarland University. Birmingham City University is one of the partners whose students are taking part in the study. The first iteration of XTranscript was developed for use in the CASE project.
Useful Publications about XML
Hardie, A. (2014) 'Modest XML for corpora: Not a standard, but a suggestion'. ICAME Journal (38) 73-103.
Ruehlemann, C., A. Bagoutdinov & M.B. O'Donnell (2015) 'Modest XPath and XQuery for corpora: Exploiting deep XML annotation'. ICAME Journal (39) 47-84. Companion website and resources.
Useful Software for Working with XML
BaseX: An open-source database system for searching XML documents.
oXygen: A powerful commercial XML editor with search features.
W3C XML Validator: An online service to validate (and find errors in) XML documents.
eXist-db: An open-source system (which runs as a web service) for editing and searching XML.
- Style (bold, italic, underline) can only be extracted from Microsoft Word and Open Document formats.
- The maximum file size for uploads is 50MB.
- XTranscript does not encrypt your data at any point and it will store the uploaded and converted files on the RDUES server for a period of time. Get in touch if you need to convert files securely.
- XTranscript should be considered experimental software. We provide no guarantee of availability and users may experience errors. Please let us know if you do, so we can try to fix them.