Managing Your Research Data

Good data management from the start can streamline your data collection and analysis. It can also help you repurpose your data for additional studies and share your data with collaborators.

A poorly executed data management plan, however, can bog down your efforts and may even impede you from completing your intended analysis.

Here are some tips to help translational investigators develop data management plans for proposed and new projects.


Video Resources on Data Management

If you would like to view video presentations on topics related to data management, click on the links below:

Last Things First: Let Your Analytic Plan Shape Data Management

Make sure that you have formally mapped every piece of data you collect to your analytic plan and to your safety monitoring or reporting requirements, and confirm that the planned data collection will supply everything you need. This will help you avoid two related ills:

  • Collecting Too Much Data: Although it is tempting to collect as much data as possible (who knows what you might find!), more is not always better (beware of what you find!). Rather than producing intriguing new discoveries, you may find that extraneous data require extra work to extract and manage, and ultimately produce spurious, undesirable associations that you are now at pains to explain. Consider also whether adding data leads to tests for additional associations that require you to adjust for multiple comparisons, potentially rendering your core hypothesis test non-significant (for example, under a Bonferroni correction, testing ten associations at an overall alpha of 0.05 requires p < 0.005 for each individual test).
  • Missing Critical Data: It is heartbreaking to invest your efforts as well as those of your staff and your subjects in collecting a well-validated seven-item assessment instrument at three time points, only to find that item #4 was inadvertently left off the third time point, and the response scale on item #6 changed from the first time point to the second. Yet another reason to avoid collecting too much data is that trying to manage too many data elements makes it easy to overlook tiny critical errors in the data you care most about.

It is a good idea to map your data and analytic plan both “forward” (mapping concepts/aims to data) and “backward” (mapping data to concepts/use).
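To make this two-way mapping concrete, here is a minimal, hypothetical sketch in Python that checks a study's aims against its planned data elements in both directions. The aims and field names are illustrative placeholders, not part of any CCTSI tool.

    # Hypothetical sketch: map aims to data elements, then check coverage
    # both ways. All aims and field names below are illustrative placeholders.
    aims_to_fields = {
        "Aim 1: glycemic control": {"hba1c", "glucose", "visit_date"},
        "Aim 2: functional status": {"sf36_score", "visit_date"},
        "Safety monitoring": {"adverse_event", "visit_date"},
    }
    planned_fields = {"hba1c", "glucose", "sf36_score", "visit_date", "eye_color"}

    # Forward: every aim must be fully supplied by the planned data collection.
    for aim, needed in aims_to_fields.items():
        missing = needed - planned_fields
        if missing:
            print(f"{aim} is missing critical data: {sorted(missing)}")

    # Backward: every planned field should serve at least one aim.
    used = set().union(*aims_to_fields.values())
    for extraneous in sorted(planned_fields - used):
        print(f"No mapped use for '{extraneous}'; consider dropping it.")

Running this flags both ills at once: "Safety monitoring" lacks a planned adverse_event field, while eye_color is collected but serves no aim.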

It is critical to consult your statistician early, well before you implement your data collection plan. Statistical consultants can provide "a fresh set of eyes," identifying problems that the investigator may have overlooked. More importantly, they can propose alternative approaches that can vastly improve the power and quality of your analysis. Statistical consultation is available through the Colorado Biostatistics Consortium (CBC).

Keep Well-Documented Data Dictionaries

Make sure you create a formal data dictionary for each of your datasets, and update them when any changes are made. At the very least, the dictionary should describe the following for each field in the dataset:

  • Field Name (variable name): Although some database management systems constrain the length of field names, don't shorten field names unnecessarily. ICLR may be a cute name for the "eye color" field, but you and your colleagues will be much more likely to remember what EYECOLOR represents when you return to the dataset months or years later.
  • Description: A label or descriptor that clearly identifies the contents of the field
  • Type: Will the data be numeric, character, a date, or something else? If numeric, is it an integer or a real number? If a date, what date format (MMDDYYYY, Julian, etc.)?
  • Key "metadata" (data about data) are particularly critical. When possible, define:
      - The coding scheme: the metric used (e.g. micrograms) or the values represented (e.g. 1=sluggish reflex, 2=normal reflex, 3=brisk reflex).
      - A standard representation of the measurement: the relevant LOINC or SNOMED code (see "Use Standard Instruments and Representations" below).
      - How the measurement was obtained: the assay used, or the algorithm for a derived value such as creatinine clearance.
      - Upper and lower bounds: these facilitate range checks that can be used at both data entry and data validation to ensure data integrity (e.g. making sure you don't have 8-year-old mothers or 45-year-old infants in your dataset).
      - How missing data are represented: if you choose a special value (as is commonly done in social science research, e.g. 97=subject declined to answer, 98=subject did not know), use it consistently throughout (not 97 for one field and 997 for another). Never use zero to indicate missing data; zero is especially likely to be misinterpreted as an actual value rather than as missing data.

View an example of a data dictionary

Your data dictionary will be a crucial resource not only as you conduct your current research project, but also if you later wish to share your data with collaborators or other researchers.
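As a minimal sketch of what a machine-readable data dictionary might look like, here is a hypothetical Python version with bounds and missing-value codes. The field definitions and the validate helper are illustrative assumptions, not a CCTSI-provided tool.

    # Hypothetical sketch of a data dictionary with range and missing-data
    # checks. All field definitions are illustrative; adapt them to your study.
    DATA_DICTIONARY = {
        "EYECOLOR": {
            "description": "Subject eye color",
            "type": str,
            "coding": {"1": "brown", "2": "blue", "3": "green", "4": "other"},
            "missing": "97",            # 97 = subject declined to answer
        },
        "MOMAGE": {
            "description": "Mother's age at delivery (years)",
            "type": int,
            "lower": 12, "upper": 55,   # bounds enable range checks
            "missing": 998,
        },
    }

    def validate(field, value):
        """Return a list of problems with `value`, per the dictionary entry."""
        spec = DATA_DICTIONARY[field]
        if value == spec.get("missing"):
            return []                   # explicitly coded missing data is fine
        problems = []
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        if "lower" in spec and isinstance(value, (int, float)) and \
                not spec["lower"] <= value <= spec["upper"]:
            problems.append(f"{field}: {value} outside [{spec['lower']}, {spec['upper']}]")
        if "coding" in spec and value not in spec["coding"]:
            problems.append(f"{field}: {value!r} not in coding scheme")
        return problems

    print(validate("MOMAGE", 8))        # flags the 8-year-old mother
    print(validate("MOMAGE", 998))      # coded missing value passes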

Use Standard Instruments and Representations

Don't develop measures from scratch if you can use existing ones. Using standard measures will allow your findings to be compared meaningfully with those of others, and will allow you to reuse your datasets later. Of course, it is vital that these instruments not be modified, or they will no longer be "validated" or comparable. For demographics and health status, consider using instruments from national agencies such as the CDC (e.g. BRFSS and NHANES) and AHRQ. If you need help selecting appropriate behavioral measures to include in your study, in the near future you will be able to review and select common behavioral and social science measures from the Grid-Enabled Measures (GEM) database in caBIG (the cancer Biomedical Informatics Grid). If you must develop new metrics based on questionnaires or scales, consider consulting a psychometrician to ensure these metrics are reasonably well validated.

Investigators are also increasingly recognizing the benefits of incorporating standard representations of data such as laboratory values (like hemoglobin A1c or glucose), diseases (such as sarcoidosis), and symptoms and findings (such as shortness of breath). Incorporating the following standards in your datasets will greatly improve their reusability in the future, making it easier for you to collaborate with others and allowing you to contribute your data to local and national repositories:

  • For laboratory values, LOINC® (Logical Observation Identifiers Names and Codes) is preferred (e.g. hemoglobin A1c = 17855-8).
  • For diseases, symptoms, and findings, SNOMED-CT is preferred (e.g. sarcoidosis = 24369008, shortness of breath = 267036007).
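As a small, hypothetical illustration of carrying these codes alongside your variables (the variable names are placeholders; the codes are those cited above):

    # Hypothetical sketch: record the standard coding system and code for each
    # study variable so the dataset stays interpretable when shared or pooled.
    STANDARD_CODES = {
        "hba1c": ("LOINC", "17855-8"),
        "sarcoidosis": ("SNOMED-CT", "24369008"),
        "shortness_of_breath": ("SNOMED-CT", "267036007"),
    }

    def annotate(record):
        """Pair each variable's value with its (coding system, code), if any."""
        return {var: {"value": val, "standard": STANDARD_CODES.get(var)}
                for var, val in record.items()}

    print(annotate({"hba1c": 6.8, "shortness_of_breath": True}))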

The CCTSI Informatics Core is eager to assist you in incorporating standard representations into your datasets! For assistance, please contact Steve Ross.

Excel Is Not the Answer: Choosing a Database Structure

Whichever database management system you choose, using spreadsheets such as Excel for data entry and storage is never a good idea. Protect yourself by keeping original primary data in a robust database, which can be exported to a spreadsheet or statistical package for analysis without corrupting the underlying data. Among the many problems with Excel for data entry and storage are:

  • It is much too easy to corrupt data in Excel. If you make the common error of sorting on a single column and forgetting to undo the change before saving the dataset, your dataset is now hopelessly corrupted and unrecoverable.
  • Excel doesn't provide facilities for storage of metadata.
  • Range checking/data validation is possible but cumbersome.
  • Keeping all the data on a single spreadsheet encourages protected health information (PHI) to be mixed with non-PHI, which can create privacy and security concerns.

While MS Access does solve some of the data entry and storage problems inherent to Excel, it does not meet HIPAA standards for security, including standards related to authorization, authentication, and audit controls. For additional information, see http://www.ucdenver.edu/academics/research/AboutUs/regcomp/hipaa/.

Instead of using Excel or Access, all translational researchers are strongly encouraged to use REDCap (Research Electronic Data Capture) for data entry and storage. Information about using REDCap is available at http://redcapinfo.ucdenver.edu. These services are free (underwritten) for translational projects conducted through one of the clinical translational research centers (CTRCs) upon review by the appropriate scientific review committee (such as the Scientific Advisory Review Committee, SARC), and are also available at nominal cost for all other interested investigators.

Why use REDCap?

  • REDCap provides the ability for you or your delegate to implement case report forms and surveys without the need for a programmer.
  • Data stored on REDCap are assured of being compliant with COMIRB and HIPAA standards for security.
  • By using REDCap, you are protected from data loss, because (1) data are backed up regularly and automatically and (2) changes to data are logged, creating an audit trail indicating which data were changed, by whom, and when. With this information, unwanted changes can be undone if necessary.
  • Because REDCap is Web-accessible, co-investigators from other sites can access the data without having to join a virtual private network.
  • REDCap provides an expanding set of tools for analysis and visualization of data.
  • Local assistance on the UC Denver Anschutz Medical Campus is readily available from REDCap@ucdenver.edu.

REDCap does have some shortcomings, however, and there may be instances in which use of an alternative database management system may be indicated:

  • REDCap does not currently have all of the functionality needed for trials whose data will be submitted to the FDA for a new drug or medical device application (i.e., it is not 21 CFR Part 11 enabled).
  • Because REDCap makes it easy to create data collection forms, it offers limited flexibility in form design. (By contrast, Access allows greater flexibility; beware, however, of the temptation to fiddle endlessly with the appearance of Access data collection forms.)
  • Data entry into REDCap is currently possible only when a live Web connection is available. Offline (asynchronous) data collection will be available in mid-2010.

Note that REDCap is a data collection tool. To generate reports from data that have been collected in REDCap, one must first download the data to an analysis package. REDCap makes such downloads very easy and can produce analysis files for a variety of packages, including SAS, SPSS, R, and Excel.
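Exports are normally run from the REDCap web interface, but if the API is enabled for your project (an assumption for this sketch; check with REDCap@ucdenver.edu), the same download can be scripted. The URL and token below are placeholders.

    # Hypothetical sketch: scripted export of REDCap records for analysis.
    # Assumes the API is enabled for your project; URL and token are placeholders.
    import requests

    response = requests.post(
        "https://redcap.example.edu/api/",   # placeholder REDCap API endpoint
        data={
            "token": "YOUR_API_TOKEN",       # issued per project by REDCap admins
            "content": "record",             # export records
            "format": "csv",                 # also: json, xml
            "type": "flat",                  # one row per record
        },
        timeout=30,
    )
    response.raise_for_status()

    # Save as CSV for import into SAS, SPSS, R, or Excel.
    with open("redcap_export.csv", "w", encoding="utf-8") as f:
        f.write(response.text)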

If you would like to see a more detailed comparison of Excel, Access, and REDCap, click here. To discuss whether REDCap is suitable for your project, please contact us at REDCap@ucdenver.edu.

Storage: Finding a Safe Place for Your Data to Abide

In considering where data abide, keep in mind two types of data, both of which must be stored in a manner that protects the privacy and security of your subjects:

  • The original data you have collected, which must be protected from corruption and stored in a robust, audited, and recoverable system.
  • Analytic datasets you derived from your original data, which can be manipulated at will using tools like SAS, SPSS, and Excel.

Given the need to protect your original data, it is clear that the hard drive of a desktop computer (much less the hard drive of a laptop or a thumb drive) is completely unsuitable for storage of original data. If you do not use REDCap, make sure that your data are stored on a secured UC Denver server (generally maintained by your Division or Department). The data on secured servers are backed up regularly, firewalled and password protected, and subject to an audit trail which can identify who has accessed your data. For questions on data storage options, please contact Thomas.Yaeger@ucdenver.edu.

While corruption and recoverability are less significant issues for analytic datasets, privacy and security remain major considerations. There are severe penalties for privacy breaches, which in some cases require the University to report breaches to news organizations. Unencrypted PHI must never be stored outside of a secured server; you as an investigator are personally liable if you store unencrypted PHI on a laptop or thumb drive that is stolen. Try to perform all your analyses on the secured server if you can; in the event of a suspected data breach, audit trails can determine whether a reportable breach has actually occurred. A virtual private network (VPN) connection can allow you to access these data remotely. If you must store analytic data elsewhere, make sure that (1) the data are stripped of all PHI, (2) the data are encrypted, or (3) both. Your department or division should be able to provide you with drive encryption software.
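As a small, hypothetical illustration of option (1), the sketch below drops direct identifiers from an extract before it leaves the secured server. The column names are placeholders; your protocol and COMIRB determine what actually counts as PHI.

    # Hypothetical sketch: strip direct identifiers from an analytic extract.
    # Column names are placeholders; your protocol defines what counts as PHI.
    import csv

    PHI_COLUMNS = {"name", "mrn", "ssn", "date_of_birth", "address"}

    with open("extract_with_phi.csv", newline="") as src, \
         open("extract_deidentified.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        keep = [c for c in reader.fieldnames if c not in PHI_COLUMNS]
        writer = csv.DictWriter(dst, fieldnames=keep)
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in keep})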