XML to tsv using XSLT

I’m working a lot with gene expression data, for instance from The Cancer Genome Atlas (TCGA). One big challenge is to investigate biological variation in different samples when it is potentially maked by variation in the way the samples for collected, stored or processed.

When 20,000 genes are measured simultaneously, how do you know?. The way you usually look at this is to perform some sort of dimensionality reduction on the data set and look if they cluster by the batch they were collected or analysed in. The TCGA assigns barcodes to their samples that provide a lot of information on that like the Plate ID.

However, there is other information that is not in the barcode directly, like the date the samples was shipped for sequencing. The GDC Data Portal provides additional information in the Biospecimen Supplement files, but extracting this information is not trivial for these XML files.

The information I was looking for was a table between the assigned barcode and the shipment date, which were hidden in a structure like this:

<?xml version="1.0" encoding="UTF-8"?>
<bio:tcga_bcr ...>
  <bio:patient>
    <bio:samples>
      <bio:sample>
        <bio:portions>
          <bio:portion>
            <bio:analytes>
              <bio:analyte>
                <bio:aliquots>
                  <bio:aliquot>
  <bio:bcr_aliquot_barcode>TCGA-x</bio:bcr_aliquot_barcode>
  <bio:bcr_year_of_shipment>2011<bio:year_of_shipment>
                  </bio:aliquot>
                </bio:aliquots>
              </bio:analyte>
            </bio:analytes>
          </bio:portion>
        </bio:portions>
      </bio:sample>
    </bio:samples>
  </bio:patient>
</bio:tcga_bcr>

One way to convert XML-formatted information into a table is to use XSLT. There are nice tutorials on how to use it, but I did not know how to handle XML name spaces until recently, which the GDC files contain.

The way to solve this is rather simple: look at all the xmlns: attributes in the XML header that you are interested in, and add them to your xsl:stylesheet tag. Then it’s just a matter of referencing them correctly, looping through your tags and printing out the values in a tab-separated way with a newline at the end.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:bio="http://tcga.nci/bcr/xml/biospecimen/2.7"
    xmlns:admin="http://tcga.nci/bcr/xml/administration/2.7">
<xsl:output method="text" indent="no"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/">

  <xsl:for-each select="bio:tcga_bcr/bio:patient/bio:samples/bio:sample">
  <xsl:for-each select="bio:portions/bio:portion">
  <xsl:for-each select="bio:analytes/bio:analyte">
  <xsl:for-each select="bio:aliquots/bio:aliquot">

    <xsl:value-of select="bio:bcr_aliquot_barcode"/>
    <xsl:text>&#x9;</xsl:text>
    <xsl:value-of select="bio:year_of_shipment"/>
    <xsl:text>-</xsl:text>
    <xsl:value-of select="bio:month_of_shipment"/>
    <xsl:text>&#xa;</xsl:text>

  </xsl:for-each>
  </xsl:for-each>
  </xsl:for-each>
  </xsl:for-each>

</xsl:template>
</xsl:stylesheet>

This is of course much less error-prone that using a regular expression. And produces a table for a single file.

The data I downloaded from the GDC contained 500 files, so I used the find command to execute the XSLT transformation on all of them:

tar -xf gdc_download_20170505.tar.gz
find gdc_download_20170505 -name "\*.xml" -exec xsltproc my.xslt {} \;

In the end, I got a table of a few thousand barcodes and the date they were shipped as tab-separated values.

Barcode Date
TCGA-AA-A01C-01A-01D-A008-01 2010-03
TCGA-AA-A01C-01A-01D-A00D-09 2010-03
TCGA-AA-A01C-01A-01D-A00G-10 2010-04
TCGA-AA-A01C-01A-01D-A077-02 2010-09