TrialGrid Blog

Unicode

Tue 24 October 2017
By Ian Sparks

If you haven't heard of Unicode you have certainly seen it. You are seeing it now, since Unicode is the standard for encoding the characters displayed in web browsers and on computers in general. As of this writing, version 10 of the standard includes more than 136,000 characters from multiple writing systems, and Medidata Rave supports Unicode both for study designs and for data collection. So what is the problem?

Actually, there is no problem so long as you know what characters from the Unicode standard are being used in your study, where they are and how they display and appear in outputs.

Unicode in Study Design

If you are building your study in Japanese or localizing it to Russian, Armenian or Greek then having the full set of Unicode characters available is vital. For studies in English you may want to stick to the set of 128 characters known as ASCII (A-Z, a-z, 0-9 and common punctuation symbols). But sometimes you can be surprised by characters that aren՚t what you think they are…

Did you spot those alternative characters hiding in the last sentence?

characters that aren՚t what you think they are…

vs:

characters that aren't what you think they are...

Still can't see it? Hint: It's the ՚ and the … The differences are (or at least, may be) subtle on the screen but when we render them in a Rave PDF they appear quite different:

Apostrophe and Ellipsis

It is very hard for the human eye to distinguish between these characters as browsers render them, but they are different characters. The font that Rave uses to display characters can't render all 136,000 possible characters, so it is best (in English studies at least) to stick to the limited ASCII set that all fonts cover well.
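Listing the numeric code point of each character makes the difference visible even when the glyphs look identical. The following is a minimal C# sketch of that idea; the class name `CodePointDump` and the method `Describe` are hypothetical names chosen for illustration and are not part of Rave or TrialGrid.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Rave or TrialGrid.
public static class CodePointDump
{
    // Returns one entry per non-ASCII UTF-16 code unit in the input,
    // formatted like "U+2026 at index 8".
    public static List<string> Describe(string s)
    {
        var found = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            // ASCII covers code points 0-127; anything higher is reported
            if (s[i] > 127)
            {
                found.Add(string.Format("U+{0:X4} at index {1}", (int)s[i], i));
            }
        }
        return found;
    }
}
```

Run against the two sentences above, the plain version returns an empty list because the ASCII apostrophe is U+0027 and each period is U+002E. The lookalike version returns two entries: the ellipsis is the single character U+2026, and the apostrophe lookalike has a code point well above 127 (it appears to be U+055A, the Armenian apostrophe).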

Be especially wary of text that is copied and pasted from web pages, Word and Excel or PDF documents. It is very tempting to copy verbatim from a Protocol document, but word processors use all kinds of character variants to make writing look better on the screen or in print. You can't even trust the spaces in these documents, because Unicode defines at least 20 different "empty" space characters of different widths, including one that has no width at all (i.e. it is invisible!).

Tip: TrialGrid Diagnostic 70 will identify and highlight non-ASCII characters, even invisible ones.
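For readers who want to check text for odd-width or invisible spaces themselves, here is a rough C# sketch. It relies on the standard .NET `char.GetUnicodeCategory` method. The class name `SpaceScan` and the method `FindUnusualSpaces` are hypothetical, and this is not how Diagnostic 70 is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration only; not part of Rave or TrialGrid.
public static class SpaceScan
{
    // Returns the indexes of characters that look like (or hide as) a space
    // but are not the ordinary ASCII space U+0020.
    public static List<int> FindUnusualSpaces(string s)
    {
        var hits = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c == ' ')
            {
                continue; // the ordinary ASCII space is fine
            }
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            // SpaceSeparator covers the no-break space, en/em spaces and similar;
            // Format covers zero-width characters such as U+200B.
            if (cat == UnicodeCategory.SpaceSeparator || cat == UnicodeCategory.Format)
            {
                hits.Add(i);
            }
        }
        return hits;
    }
}
```

For example, `SpaceScan.FindUnusualSpaces("Visit\u00A0Date\u200B")` flags index 5 (a no-break space) and index 10 (a zero-width space), while `"Visit Date"` with an ordinary space returns an empty list. The Format category also includes other invisible characters such as the soft hyphen, so the sketch errs on the side of flagging too much.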

Unicode in Study Data

If unexpected characters in a study design can cause strange PDF outputs, unexpected or unwanted characters in the clinical data can be real poison. A study that collects data in English might expect all of its text data to be ASCII. However, Rave will accept any Unicode character entered into a text field, so the same problems with copied and pasted content can occur. Rave is 100% Unicode compatible, so it will happily take, store and output any Unicode content, but SAS and other analysis programs may have to be configured to accept non-ASCII content.

In English studies you want to identify non-ASCII content at the point of entry. This can only be done with a Custom Function that looks at the content of a text field and determines whether any of its characters are outside the ASCII range. A quick web search will turn up simple code that returns true if it finds a non-ASCII character in the input string:

    // Take the string from datapoint.Data or datapoint.StandardValue
    string s = "characters that aren՚t what you think they are…";

    foreach (char c in s)
    {
        // ASCII covers code points 0-127; anything higher is non-ASCII
        if (c > 127)
        {
            return true;
        }
    }
    return false;

Tip: TrialGrid contains a CQL extension that makes this as easy as using FieldName.IsNotAscii in an Edit Check.

Summary

Rave handles Unicode really well, and web browsers are very good at displaying a wide range of Unicode characters. But not all characters can be displayed by all systems, so be careful what you put into your study design and what you collect in your study data. Being able to copy and paste text between systems is great for productivity but can have unintended consequences.