Formatting Recommendations for Databases to be Shared via the PD-DOC
1. Decide what data elements you wish to collect and at what time intervals. Organize that information into Case Report Forms (CRFs), and incorporate as many of the PD-DOC CORE-DS CRFs into your study as are appropriate to your research.
2. When choosing your software, consider current local expertise (e.g. available data management personnel), ease of use, error checking capabilities, software licensing costs, and the type of data being collected. Check whether the database software can export the data into other formats (e.g. SAS, comma-separated values [CSV] files) for ease of sharing.
3. A relational database is a collection of relations (frequently called tables). Your database should be organized into multiple tables corresponding to your CRFs. A table is a set of data elements (values) that is organized using a model of vertical columns (which are identified by their name) and horizontal rows. A table has a specified number of columns (usually corresponding to questions on a CRF, e.g. diagnosis, year of onset), but can have any number of rows (usually but not necessarily one row for each subject in the study). The subject ID uniquely identifies a research subject’s row of data in each table. By adhering to the rules of the relational model you ensure that your data can be transferred to other relational database systems or other software packages relatively easily. Another acceptable format is storing the data directly in SAS and not using a relational database.
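As an illustration of the table layout described in item 3, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical and are not part of the PD-DOC CORE-DS; the sketch only shows one table per CRF keyed by subject ID, and an export to CSV for sharing.

```python
import csv
import io
import sqlite3

# In-memory database for illustration only. All table names, column names
# and values below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographics (
        subject_id INTEGER PRIMARY KEY,   -- study-specific ID, no names or initials
        birth_year INTEGER
    );
    CREATE TABLE diagnosis (
        subject_id INTEGER PRIMARY KEY REFERENCES demographics(subject_id),
        diagnosis  TEXT,
        onset_year INTEGER
    );
""")
conn.execute("INSERT INTO demographics VALUES (1001, 1950)")
conn.execute("INSERT INTO diagnosis VALUES (1001, 'PD', 2005)")


def export_table_csv(conn, table):
    """Return one table as CSV text, with the column names as the header row.

    `table` is interpolated into the SQL, so pass only fixed, trusted names.
    """
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY subject_id")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()


diagnosis_csv = export_table_csv(conn, "diagnosis")
```

Declaring the subject ID as the primary key makes the database itself reject a second row with the same ID in a one-row-per-subject table. A table with several rows per subject (e.g. one per visit) would need a different key, such as subject ID plus visit number.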
4. In order to maintain subject confidentiality, use a study-specific subject ID (e.g. numeric code) that is unique to the individual. No two participants should have the same subject ID. Don’t use names or initials.
5. Create your supporting documentation at the same time you create your database, while the details are still clear; this also saves time at the end of the study:
a. Prepare a tabular summary of activities and CRFs completed at specified time intervals during the study. See sample Schedule of Activities.
b. Prepare a Data Dictionary, which includes the variable name, label, acceptable values and whether the variable is derived or not. Derived variables should include a clear description of how they are calculated. Some software packages (e.g. SAS) can print a Table of Contents that can be modified into a Data Dictionary.
c. Create a set of Annotated CRFs (optional). Each CRF has the variable name written next to each question on the form to create an easy visual reference.
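A Data Dictionary of the kind described in item 5b can also be kept in machine-readable form so that it doubles as an error check. The sketch below is one possible layout, not a PD-DOC format: the `Variable` class, the `find_errors` and `undocumented_derived` helpers, and the example entries are all hypothetical.

```python
from collections.abc import Container
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variable:
    """One Data Dictionary entry: name, label, acceptable values, derivation."""
    name: str
    label: str
    acceptable: Container          # e.g. a set of codes or a range of integers
    derived: bool = False
    derivation: Optional[str] = None   # how a derived variable is calculated


# Hypothetical example entries.
DATA_DICTIONARY = [
    Variable("onset_year", "Year of PD onset", range(1900, 2101)),
    Variable(
        "ageGT89",
        "Aged 90 or older at enrollment (0=No, 1=Yes)",
        {0, 1},
        derived=True,
        derivation="1 if age at enrollment is 90 or older, otherwise 0",
    ),
]


def find_errors(record, dictionary):
    """Return names of variables whose non-missing value is not acceptable."""
    return [
        v.name
        for v in dictionary
        if record.get(v.name) is not None and record[v.name] not in v.acceptable
    ]


def undocumented_derived(dictionary):
    """Return names of derived variables that lack a calculation description."""
    return [v.name for v in dictionary if v.derived and not v.derivation]
```

For example, `find_errors({"onset_year": 1850, "ageGT89": 1}, DATA_DICTIONARY)` flags `onset_year`, because 1850 falls outside the range assumed in this sketch.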
6. In order to maintain subject confidentiality, all HIPAA identifiers (e.g. text fields, date fields) will need to be deleted from the database prior to sharing. Creating the derived variables needed to maintain subject confidentiality when you set up your database allows you to include them in any interim error checking procedures and so assure their accuracy. Some derived variable definitions that can be used for de-identification are listed below:
a. Age = Enrollment Date (entry date into the study) minus Birth Date, expressed in completed years.
For example, 08/20/2008 – 12/03/1938 = 69 years old.
b. ageGT89 (0=No, 1=Yes): an indicator created for subjects 90 or older at the time of enrollment into the study. If a subject is 90 or older, their corresponding age variable and birth year must be set to missing prior to data sharing. This is a HIPAA requirement.
c. For all dates, create a variable that contains just the year (e.g. enrollment year, birth year, visit year).
e. For concomitant medications, create a variable for the year a medication was started and/or stopped and for the number of days on the medication. A similar set of variables can be created for adverse experiences (e.g. duration of AE).
f. If concomitant medications were coded by a drug coding dictionary (e.g. WHO), the verbatim text should be deleted and the drug dictionary term used in its place. The text for drug indication should be reviewed/modified for information that could possibly identify an individual study subject.
g. Other derived variables can be defined and included in the database at the discretion of the study PI (e.g. duration of PD, age at diagnosis).
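The derived variables in item 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the function names and dictionary keys are hypothetical, age is counted in completed years, and days on medication is taken as the stop date minus the start date, which a study may prefer to define differently (e.g. counting both days).

```python
from datetime import date


def age_in_years(birth: date, enrollment: date) -> int:
    """Completed years between the birth date and the enrollment date."""
    years = enrollment.year - birth.year
    # Subtract one if the birthday has not yet occurred in the enrollment year.
    if (enrollment.month, enrollment.day) < (birth.month, birth.day):
        years -= 1
    return years


def deidentify(birth: date, enrollment: date) -> dict:
    """Build the shareable derived variables; None stands for 'missing'."""
    age = age_in_years(birth, enrollment)
    age_gt89 = 1 if age >= 90 else 0
    return {
        "age": None if age_gt89 else age,                # missing for 90 or older
        "ageGT89": age_gt89,
        "birth_year": None if age_gt89 else birth.year,  # missing for 90 or older
        "enrollment_year": enrollment.year,
    }


def days_on_medication(start: date, stop: date) -> int:
    """Assumed definition: stop date minus start date, in days."""
    return (stop - start).days


example = deidentify(date(1938, 12, 3), date(2008, 8, 20))
```

Here `example["age"]` is 69, matching the worked example in item 6a, because the subject's 2008 birthday falls after the enrollment date.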