If you haven't heard of Unicode, you have certainly seen it. You are seeing it now, since Unicode is the standard for
the encoding of characters viewable in web browsers and on computers in general. As of this writing, version 10 of the
standard includes more than 136,000 characters from multiple writing systems, and Medidata Rave supports the Unicode
standard both for study designs and for data collection. So what is the problem?
Actually, there is no problem so long as you know what characters from the Unicode standard are being used in your
study, where they are and how they display and appear in outputs.
Unicode in Study Design
If you are building your study in Japanese or localizing it to Russian, Armenian or Greek then having the full set of
Unicode characters to use is vital. For studies in English you may want to stick to the set of 128 characters known as
ASCII (a-z, A-Z, 0-9, symbols and control codes). But sometimes you can be surprised by characters that aren՚t what you think they are…
Did you spot those alternative characters hiding in the last sentence?
characters that aren՚t what you think they are…
characters that aren't what you think they are...
Still can't see it? Hint: It's the ՚ and the … The differences are (or at least, may be) subtle on the screen but
when we render them in a Rave PDF they appear quite different:
It is very hard for the human eye to distinguish between these characters the way they are rendered in browsers, but
they are different characters. The font that Rave uses to display characters won't have a way to render all 136,000
possible characters, so it is best (in English studies at least) to stick to the limited ASCII set of characters
that all fonts cover well.
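To see exactly which characters are hiding in a piece of design text, it can help to list each non-ASCII character with its position and Unicode code point. The following is only a minimal sketch: the class and method names (NonAsciiReport, Find) are made up for illustration and are not part of Rave or TrialGrid, and the sample string uses U+055A (Armenian apostrophe) and U+2026 (horizontal ellipsis) as example lookalikes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not a Rave or TrialGrid API.
public static class NonAsciiReport
{
    // Returns one entry per non-ASCII UTF-16 code unit, e.g. "index 4: U+055A".
    public static List<string> Find(string text)
    {
        var hits = new List<string>();
        if (text == null) return hits;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c > 127) // anything above 127 is outside the ASCII range
            {
                hits.Add(string.Format("index {0}: U+{1:X4}", i, (int)c));
            }
        }
        return hits;
    }
}

// Example usage:
//   NonAsciiReport.Find("aren\u055At they are\u2026")
//   yields "index 4: U+055A" and "index 15: U+2026"
```

Characters outside the Basic Multilingual Plane (many emoji, for example) are stored as two UTF-16 code units, so this sketch would report them as two entries.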
Be especially wary of text that is cut and pasted from web pages, Word and Excel or from PDF documents. It is very
tempting to copy verbatim from a Protocol document but word processors use all kinds of character variants to make
writing look better on the screen or in print. You can't even trust the spaces in these documents because Unicode defines
at least 20 different "empty" space characters of different widths including one that has no width at all (i.e. it is
invisible).
Tip: TrialGrid Diagnostic 70 will identify and highlight non-ASCII characters, even invisible ones
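Space variants and invisible characters can also be checked for in code. The sketch below is one possible approach, not how TrialGrid Diagnostic 70 works: it uses the standard .NET CharUnicodeInfo.GetUnicodeCategory call to flag any non-ASCII character in the SpaceSeparator category (such as the non-breaking space, U+00A0) or the Format category (such as the zero width space, U+200B). The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not a Rave or TrialGrid API.
public static class SpaceCheck
{
    // True if the text contains a space-like or invisible formatting
    // character other than the ordinary ASCII space (U+0020).
    public static bool HasUnusualSpace(string text)
    {
        if (string.IsNullOrEmpty(text)) return false;

        foreach (char c in text)
        {
            if (c <= 127) continue; // ASCII, including the ordinary space

            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            if (cat == UnicodeCategory.SpaceSeparator || cat == UnicodeCategory.Format)
            {
                return true;
            }
        }
        return false;
    }
}
```

Note that the Format category also covers other invisible characters such as the soft hyphen and the byte order mark, which are usually just as unwelcome in an English study design.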
Unicode in Study Data
If unexpected characters in study design can cause strange PDF outputs, unexpected or unwanted characters in the
clinical data can be real poison. A study that collects data in the English language might expect that all the
text data in the study is in ASCII. However, Rave will accept any Unicode character entered into a text field,
so the same problems of cut and pasted content can occur. Rave is 100% Unicode compatible so it will happily take,
store and output any Unicode content but SAS and other analysis programs may have to be set to accept non-ASCII
In English studies you want to identify non-ASCII content at the point of entry. This can only be done with a
Custom Function that looks at the content of a text field and determines if any of the characters are outside the
ASCII range. A quick search of the web will throw up simple code which will return true if it finds a non-ASCII
character in the input string:
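//Take string from datapoint.Data or datapoint.StandardValue
string s = "characters that aren՚t what you think they are…";
foreach (char c in s)
{
    if (((int)c) > 127)
        return true; // found a character outside the ASCII range
}
return false; // every character is ASCII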
Tip: TrialGrid contains a CQL extension that makes this as easy as using FieldName.IsNotAscii in an Edit Check.
Rave handles Unicode really well and web browsers are very good at displaying a wide range of Unicode characters but
not all characters can be displayed by all systems so be careful what you put into your study design and what you
collect in your study data. Being able to cut and paste text between systems is great for productivity but can have