Data Stewardship and the SGP Data Package
The SGP project is working to assemble a massive amount of new geochemical data (e.g., shale chemistry, trace metal isotopes) for the Neoproterozoic through Paleozoic eras. In scale, with on the order of millions of analytical results, this data approaches what has traditionally been called ‘big data’, yet it is still managed and analyzed with relatively straightforward (though custom) relational database systems.
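To make the relational approach concrete, here is a minimal sketch of how samples and their analytical results might be stored and queried. It uses Python's built-in sqlite3 module; the table names, column names, and values are hypothetical illustrations, not the SGP project's actual schema or data.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the SGP project's real database design.
SCHEMA = """
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    lithology TEXT NOT NULL,
    age_ma    REAL              -- interpreted age in millions of years
);
CREATE TABLE result (
    result_id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
    analyte   TEXT NOT NULL,
    value     REAL NOT NULL,
    unit      TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(1, "shale", 635.0), (2, "shale", 520.0)],
    )
    conn.executemany(
        "INSERT INTO result (sample_id, analyte, value, unit) VALUES (?, ?, ?, ?)",
        [(1, "Mo", 2.5, "ppm"), (1, "U", 3.0, "ppm"), (2, "Mo", 40.0, "ppm")],
    )
    conn.commit()
    return conn


def analyte_by_lithology(conn, analyte, lithology):
    """Join results to their samples and return them oldest first."""
    cursor = conn.execute(
        """
        SELECT s.sample_id, s.age_ma, r.value, r.unit
        FROM result AS r
        JOIN sample AS s ON s.sample_id = r.sample_id
        WHERE r.analyte = ? AND s.lithology = ?
        ORDER BY s.age_ma DESC
        """,
        (analyte, lithology),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in analyte_by_lithology(demo, "Mo", "shale"):
        print(row)
```

Keeping one row per analytical result, linked to its sample by a key, is what lets a plain relational design scale to millions of results without specialized infrastructure.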
In addition to this new data, the SGP project is assembling a vast amount of existing multi-proxy sedimentary geochemical data from a wide range of sources. This data is then used to answer a variety of research questions.
SGP’s goal is to provide tools and support that allow a broad community of researchers to access and use this data. Ultimately, the aim is to make the data available in permanent repositories, including as part of geochemical databases.
The most important aspect of data stewardship is visibility into the lifecycle of data, in particular its lineage: where it originated, what happens to it, how it moves, and where it ends up in business contexts. This makes it possible to trace errors or problems in the use of data (say, for analytics) back to their source.
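The idea of tracing a problem back to its source can be sketched as a small graph walk. This is an illustrative model only: the `LineageNode` class, the `trace_to_sources` helper, and the file names are hypothetical and do not come from any particular stewardship tool.

```python
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    """One dataset in a lineage graph (hypothetical structure)."""
    name: str
    operation: str                      # how this dataset was produced
    parents: list = field(default_factory=list)


def trace_to_sources(node):
    """Walk upstream and return the names of all origin datasets."""
    sources = []
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue                    # skip nodes reached by more than one path
        seen.add(id(current))
        if current.parents:
            stack.extend(current.parents)
        else:
            sources.append(current.name)  # no parents: this is an origin
    return sorted(sources)


if __name__ == "__main__":
    # Made-up lineage: two raw files are merged, then aggregated into a report.
    lab = LineageNode("lab_export.csv", "ingest")
    lit = LineageNode("literature_table.csv", "ingest")
    merged = LineageNode("merged_results", "join", [lab, lit])
    report = LineageNode("analytics_report", "aggregate", [merged])
    print(trace_to_sources(report))
```

If a figure in the report looks wrong, the walk above narrows the search to the two raw files it was ultimately derived from.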
Visibility also includes understanding the relationships between the data sets in use. For example, a single data set may be used to fit multiple statistical models that produce the same result. Data stewards therefore need to understand how these models relate to one another and be able to explain that relationship.
This is particularly important in the context of the big data that has become prevalent in modern business environments. Data stewardship is necessary to manage these large, complex collections of structured and unstructured information, which are spread across storage volumes throughout an organization. Data stewards also need to know how to access this data and its associated metadata in order to make sense of the information they manage.
The sgpdata package contains classes, functions, and data to help users calculate and analyze student growth percentiles and percentile growth projections/trajectories using large-scale, longitudinal education assessment data. The package uses quantile regression to estimate the conditional density associated with each student’s achievement history and then derives percentile growth projections/trajectories from that estimate. The package also provides an exemplar data set in WIDE format that models the data format expected by lower-level functions such as studentGrowthPercentiles and studentGrowthProjections.
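To give a feel for WIDE-format data and for what a growth percentile expresses, the sketch below ranks each student's current score against peers with a similar prior score. It is written in Python purely for illustration and is a crude empirical stand-in, not the quantile regression the package performs. The column names, scores, `peer_percentiles` function, and `band_width` parameter are all invented for this example.

```python
from collections import defaultdict

# WIDE format: one row per student, one column per variable and year.
# Column names and scores are invented for illustration.
wide_rows = [
    {"ID": 1, "SS_2022": 400, "SS_2023": 420},
    {"ID": 2, "SS_2022": 405, "SS_2023": 450},
    {"ID": 3, "SS_2022": 410, "SS_2023": 430},
    {"ID": 4, "SS_2022": 500, "SS_2023": 510},
    {"ID": 5, "SS_2022": 505, "SS_2023": 540},
]


def peer_percentiles(rows, prior, current, band_width=50):
    """Percentile rank of each student's current score among peers whose
    prior score falls in the same band (simplified; not quantile regression)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[prior] // band_width].append(row)

    result = {}
    for peers in groups.values():
        for row in peers:
            below = sum(p[current] < row[current] for p in peers)
            equal = sum(p[current] == row[current] for p in peers)
            # Mid-rank percentile: ties (including the student) count as half.
            result[row["ID"]] = round(100 * (below + 0.5 * equal) / len(peers))
    return result


if __name__ == "__main__":
    print(peer_percentiles(wide_rows, "SS_2022", "SS_2023"))
```

Quantile regression replaces the hard score bands used here with a smooth model of the whole score distribution conditional on prior achievement, which is why it copes better with sparse or uneven data.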
To use the sgpdata package, users must install R, an open-source programming language that can be downloaded free of charge from the CRAN website. R has many functions, and using them effectively requires some familiarity with the language. Those who are new to R should consider reading the SGP Data Analysis Vignette for guidance. In addition, the sgpdata package requires a database to store its analysis results, and this database must be configured with the appropriate data types.