Workshop Data transformation 2008 Nov 4
Data Transformation Workshop Day 2 04 November 2008
DAY 2 AM
Presentation Ricardo Pietrobon - Computational Ontologies in the process of streamlining biomedical data analysis
http://www.researchonresearch.org
Presentation: http://docs.google.com/Presentation?id=ah943np39zch_1227dtpnvhcj
Support networks - coordinate databases for various networks, templates for manuscript writing etc
Research Use Case 1.
Looking at papers and 'getting a data dictionary': the research question in a standard format, plus data analysis details
Actors: clinical researcher and a statistician
Data dictionary = a dataset?
Assume everything is a dataset - even for a prospective study.
Secondary data analysis - so no new variables
All analysis in consultation with statistician
Input=indication of stat method
The format for returning data to the researcher is standardized, which helps speed up analysis
Manuscripts are written online, using Google Docs
Mock Data dictionary - Carey paper *add link
Variable name, in a spreadsheet or in a relational database. CRF - study questionnaire. The questions are the variable descriptions. All completely custom per paper. No vocabulary-level standardization. Is a std way to present the data dictionary. Have a local electronic data capture system - builds a data dictionary when data is captured.
Quality measures: percent missing, percent implausible.
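The two quality measures could be computed per variable roughly as follows. This is a minimal sketch, not the group's actual data capture system: the function name quality_measures, the use of None for a missing value, the plausible-range bounds, and the choice of all records as the denominator are assumptions made for illustration.

```python
from typing import Optional, Sequence


def quality_measures(
    values: Sequence[Optional[float]],
    plausible_min: float,
    plausible_max: float,
) -> dict:
    """Return percent missing and percent implausible for one variable.

    Assumptions (not from the workshop notes): a missing value is None,
    and a value is implausible when it falls outside the given range.
    Both percentages use the total number of records as denominator.
    """
    total = len(values)
    if total == 0:
        return {"percent_missing": 0.0, "percent_implausible": 0.0}
    missing = sum(1 for v in values if v is None)
    implausible = sum(
        1
        for v in values
        if v is not None and not (plausible_min <= v <= plausible_max)
    )
    return {
        "percent_missing": 100.0 * missing / total,
        "percent_implausible": 100.0 * implausible / total,
    }


# Hypothetical "age" variable with one missing and one implausible value.
ages = [34, None, 51, 260, 47]
report = quality_measures(ages, plausible_min=0, plausible_max=120)
```

Storing the result next to each field in the data dictionary would keep the quality measures alongside the field name and CRF question.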
For prospective sets create a CRF - has potential to be a data dictionary
90% of studies captured electronically. Vocab and variables not std within the group when using that SW
Existing datasets - Meta data repository - Database of Databases. Data dictionary per study in the database. Wiki type mechanism to edit the data dictionary. This is a database of metadata for databases. Study level info and data dictionary - field name, question from CRF and responses, measures of quality (not available for all studies).
Originally funded by NIA - manual entry of data, converting paper systems - slow. Needs to be available. Then asked researchers to add the info. Poor response from the researchers. Now assisted text mining, extracting fields - limited abilities for text mining - just literature searching, not mining study docs or CRFs due to lack of ability.
Outcome is a research question in a std format.
Title, database, years used, hypothesized conclusions, variables of different types, strata, inclusion criteria, exclusion criteria, text on stat methods, tables of data
Desire to stdize the research process to limit iterations of analysis and writing -> faster.
Hypothesized conclusions: There is a disconnect between hypotheses and real intentions (Feinstein). Need to know what are the final conclusions, not nec. what the hypothesis is. Forget about conclusions being right or wrong. Hence the term 'hypothesized conclusions'.
Did some studies looking at thought process of researchers, they think from the end of the research process.
Variables: names of these are study specific and from the data dictionary (originally derived from CRF).
Outcomes - dependent variables
Predictors - main independent variables (clinical researchers focus on this)
Potential confounders - everything else
Strata - a certain stratum might be important
Inclusion/Exclusion criteria - study design
Use a diagram to make sure that all the above variables are communicated to all members of the team
Tables and graphics - the study has to have mock tables. This is std in survey design and used here at the start of the study not the end. Essential even as when all variables etc are known, this helps communicate with the statistician and the researcher. Allows presentation of the results in a format that a given journal will accept.
Question diagram has a function in communication: Researcher->Statistician and Statistician->Researcher
Hypothesized conclusion must match the variables - allows internal consistency checking - QD axis. Many published papers are missing this info. All this is done completely manually. Would like an ontology to help with this.
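The internal consistency check could be sketched in code as below. The notes describe this as a manual step, so everything here is an assumption for illustration: the QuestionDiagram class, its field names, the idea that the variables mentioned in the hypothesized conclusion are listed by hand, and the specific rules in check_consistency.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class QuestionDiagram:
    """Hypothetical container for the fields of a question diagram."""

    hypothesized_conclusion: str
    # Variable names the conclusion refers to (assumed to be listed by hand).
    conclusion_variables: Set[str]
    outcomes: Set[str]
    predictors: Set[str]
    confounders: Set[str] = field(default_factory=set)
    strata: Set[str] = field(default_factory=set)


def check_consistency(qd: QuestionDiagram, data_dictionary: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    declared = qd.outcomes | qd.predictors | qd.confounders | qd.strata
    # Every variable given a role must exist in the data dictionary.
    for name in sorted(declared - data_dictionary):
        problems.append(f"'{name}' is not in the data dictionary")
    # Every variable in the conclusion must have been given a role.
    for name in sorted(qd.conclusion_variables - declared):
        problems.append(f"'{name}' appears in the conclusion but has no role")
    if not qd.conclusion_variables & qd.outcomes:
        problems.append("the conclusion mentions no outcome variable")
    if not qd.conclusion_variables & qd.predictors:
        problems.append("the conclusion mentions no predictor variable")
    return problems


# Hypothetical study: variable names are invented for the example.
dictionary = {"age", "smoker", "bmi", "readmitted"}
qd = QuestionDiagram(
    hypothesized_conclusion="Smokers are more likely to be readmitted.",
    conclusion_variables={"smoker", "readmitted"},
    outcomes={"readmitted"},
    predictors={"smoker"},
    confounders={"age", "bmi"},
)
problems = check_consistency(qd, dictionary)
```

Because one dataset can carry several question diagrams, the same data dictionary set can be reused across checks.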
A single dataset can have multiple question diagrams and multiple papers. A single dataset can have 5000 variables.
Establishing communication between statistician and the scientist:
Break info about tests down so that clinicians can understand the stats. Clinicians think in terms of previous examples. Use a 'layers of information' principle, with 5 different layers.
Layer 1: What a method is and what it does. Use example hypothesized conclusion from a previous question diagram
Layer 2: Input, what types of variables - has a relationship with the question diagram variables and the methods section of the paper
Var Structure: continuous, categorical, nominal, ordinal, time dependent
Var Role: outcome, predictor, potential confounder, ...
Layer 3: Output - tables and graphics, examples of what the method can provide ..
Layer 4: Previous publications (teach by example). Application in another area can promote reuse
Layer 5: Additional references - Wikipedia, books, articles, datasets and scripts (to teach biostatistics)
Use case 2 (Pietrobon 2004): had a method and was looking for a dataset to apply it to, e.g. item response theory
All publications work in this way.
Course in outcomes research in different universities
See paper streamlining biomedical research analysis - IEEE Carvalho and Pietrobon 2008
Using OBI
Variable ontology aligned with the research question - maybe follow the question diagram framework
E.g. Structure of the variable, role of the variable in the context of the research question
Analysis methods aligned with the input and output in the layers
- statistical tests etc
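Aligning analysis methods with their inputs can be pictured as a lookup keyed by the structure of the outcome and of the main predictor. The catalogue below is an illustrative sketch with four simplified textbook pairings; it is not the Zelig characterization or an OBI representation, and the names METHODS and candidate_methods are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple

# Simplified variable structures; real methods have further assumptions.
ANY = frozenset({"continuous", "dichotomous", "nominal", "ordinal"})
CATEGORICAL = frozenset({"dichotomous", "nominal"})


class MethodInput(NamedTuple):
    method: str
    outcome_structures: FrozenSet[str]
    predictor_structures: FrozenSet[str]


METHODS = [
    MethodInput("two-sample t-test", frozenset({"continuous"}), frozenset({"dichotomous"})),
    MethodInput("linear regression", frozenset({"continuous"}), ANY),
    MethodInput("chi-squared test of independence", CATEGORICAL, CATEGORICAL),
    MethodInput("logistic regression", frozenset({"dichotomous"}), ANY),
]


def candidate_methods(outcome: str, predictor: str) -> List[str]:
    """List the methods whose declared inputs accept this pair of structures."""
    return [
        m.method
        for m in METHODS
        if outcome in m.outcome_structures and predictor in m.predictor_structures
    ]


# A continuous outcome compared across two groups.
matches = candidate_methods("continuous", "dichotomous")
```

Keeping the accepted structures as sets per role mirrors the Var Structure and Var Role split in Layer 2, so the variable structure has to be known before a method is selected.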
Test bed with databases where OBI can be integrated. Using Semantic MediaWiki, could use the ontology there. There are wiki forms where one can attach an ontology. Usability tests etc. Interested in implementation.
End of ppt--------------
TB:Lot of overlap with the data transformation branch, good use case, different perspective, useful for us to look at the structure.
RB:We haven't spent a lot of time on inputs and outputs yet
RB:We have been working on Zelig - 50 methods, working through these to formalize their inputs, most of those methods not used.
TB:We talked about how to classify the methods and we are still struggling with that - output/input
JM:We can do that by all.
HP:Need to add the restrictions - and we need to know what those are
RB:Is there a downside?
TB:New focus for us.
JM:We've talked about data having roles, this will be a problem, not allowed in BFO. E.g. a variable can't have a predictor role in BFO
MC:We need to know what we want to represent as variables. These will be in IO.
AI:Dump out the definitions of the variables used at Duke - these are defined in templates RP
RP:Not everything is covered - there are exceptions to that structure
PRS:We have worked on different branches, I found it important to go over the study descriptions. When we define the perturbations in bio systems we have a similar set up, different vocab. We are missing variable structure. Continuous etc. Relates back to where variables live. Question of how applying a method affects the study design. Think things are aligned already with OBI
RP:Structures in a data dictionary can be changed. Think need to know the variable structure prior to selecting a method
PRS:Plan and dt branches need cross talk
JM:How can we break this down. Variables, if you have these we can add them. We have already talked to AR about this. Will post to them to get a response.
RP:Continue the methods mapping inputs - stat methods to their inputs.
TB:We can take that as a concrete example. Cover the hypothesis, methods, and the results in OBI. We can start on the methods.
HP:We may need to consider modelling methods not some process, if these are not equivalent
RP:Package Zelig page has 50 methods, we have been going through them
TB:Database of database is about clinical research, yes
Day 2 PM
Discussion on the Zelig work that RP has done categorizing inputs and outputs and how to use that in the context of OBI.
JM:We can look at your use cases and develop competency questions.
RP:My experience for clinical research is that we need a few tests
HP:Can we focus on a core set that we can add to OBI and see what the intersection is with what we have already and what Monnie did preparing the hypothesis tests?
Working on the hypothesis test document:
RP:Many of these fall into one category e.g. Chi squared becomes exact with an option.
HP:The current parentage doesn't need to be correct
RP:We did not include paired in our characterization, we did have parametric and non parametric.
RP:What's quantitative and qualitative?
MM:ordinal is quantitative
Discussion on the classification we used vs the one RP used.
MM:We can change the classification if needed
Qualitative changes to nominal
MM:Dichotomous is a subset of nominal
Discussion on whether we need nominal ordinal or another classification
MM:The response variable was in mind when we built the sheet.
RP:Univariate and multivariate - epidemiologists use these differently and would call this bivariate.
MM:In stats it is about the number of response variables, one response variable is univariate
We upload this as a Google Doc so that we can edit it.
AI:quantitative variable - we need a definition that separates from ordinal - MM
AI: Alt term:categorical variable. Needs two child terms: dichotomous variable, polychotomous variable and definitions RP
AI:look up other types of chi squared and define them MM
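The two proposed child terms of categorical variable could be told apart by the number of distinct levels, as in the sketch below. The rule (exactly two levels versus more than two) is the usual textbook reading and an assumption here, since the definitions were still an open action item; the function name categorical_subtype is hypothetical.

```python
from typing import Hashable, Iterable


def categorical_subtype(values: Iterable[Hashable]) -> str:
    """Classify a categorical variable by its number of distinct levels."""
    # Missing values (None) are ignored when counting levels.
    levels = {v for v in values if v is not None}
    if len(levels) == 2:
        return "dichotomous variable"
    if len(levels) > 2:
        return "polychotomous variable"
    return "undetermined (fewer than two observed levels)"


# Hypothetical responses from two CRF questions.
smoker = categorical_subtype(["yes", "no", "yes", None])
blood_group = categorical_subtype(["A", "B", "O", "AB", "A"])
```

Counting observed levels only works when every level appears in the data; a data dictionary that lists the allowed responses would be a safer source.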
HP:Do we lose anything by using hypothesis testing as a generic objective. We may want to be more precise.
JM:not much. We can always categorize.
HP:The sheet is complete. Parametric test goes somewhere else. We think this goes under plan/planned process.
AI:Propose the relevant terms from the header row of the spreadsheet to the plan/planned process branch. PRS

