The HUGO Gene Nomenclature Committee have just published new guidance on naming genes: the twists of DNA and RNA that express the traits of living things. The guidance contains one entry that has caught the interest of the data world. Under "Scenarios that may merit a symbol change" the guidance includes:
Symbols that affect data handling and retrieval. For example, all symbols that autoconverted to dates in Microsoft Excel have been changed (for example, SEPT1 is now SEPTIN1; MARCH1 is now MARCHF1);
This comes a few years after work that found that approximately 20% of genetics papers with supporting data in Excel suffer from this ‘auto-correction’ problem, potentially affecting the analysis that has been carried out.
As the authors of that paper note:
Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software. The description of this problem and workarounds were first highlighted over a decade ago —nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature.
This issue highlights two important considerations for anyone developing data standards:
- standards and software interact – and you need to know your users and the tools they will use.
- your data will end up in Excel. Accept it and design with this in mind.
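One way to act on this when drafting a standard is to screen proposed identifiers for anything a spreadsheet might read as a date. The sketch below is a rough heuristic in Python, not a reproduction of Excel's actual conversion rules: the month list, the pattern, and the function names `looks_like_date` and `flag_risky_symbols` are all assumptions made for illustration.

```python
import re

# Month names and abbreviations that a spreadsheet might read as a month.
# This list is an assumption for illustration, not Excel's documented rules.
MONTH_TOKENS = [
    "JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG",
    "SEP", "SEPT", "OCT", "NOV", "DEC",
    "JANUARY", "FEBRUARY", "MARCH", "APRIL", "JUNE", "JULY",
    "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER",
]

# A month token, an optional separator, then one or two digits, and nothing else.
DATE_LIKE = re.compile(
    r"^(?:" + "|".join(MONTH_TOKENS) + r")[-/ ]?(\d{1,2})$",
    re.IGNORECASE,
)


def looks_like_date(symbol: str) -> bool:
    """Return True if the symbol resembles 'month + day' (heuristic only)."""
    match = DATE_LIKE.match(symbol.strip())
    if match is None:
        return False
    # Only day numbers 1-31 could plausibly be read as a date.
    return 1 <= int(match.group(1)) <= 31


def flag_risky_symbols(symbols):
    """Return the symbols that the heuristic considers date-like."""
    return [s for s in symbols if looks_like_date(s)]


if __name__ == "__main__":
    candidates = ["SEPT1", "SEPTIN1", "MARCH1", "MARCHF1", "BRCA1"]
    print(flag_risky_symbols(candidates))
```

Run against the examples from the guidance, this flags the old symbols SEPT1 and MARCH1 and passes their replacements, SEPTIN1 and MARCHF1. A check like this is only a first filter: it says nothing about symbols that get converted to floating-point numbers, and the real test is still opening a sample file in the tools your users rely on.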
Filed under: things to draw on when I finally get around to writing a book on data standards