[seek-dev] Re: Thoughts on data input into Kepler: and 'R'
Matt Jones
jones at nceas.ucsb.edu
Fri Jul 9 08:42:04 PDT 2004
Dan,
Thanks. I agree with your assessment, and we have plans in the works for
substantially upgrading the EML data access. Currently it is really
just a proof of concept. The way it is implemented (loading all data in
RAM, for example) is not scalable and will need to be fixed. Other
inefficiencies abound as well, such as retrieving data and metadata from
the server multiple times (i.e., no caching). Jing is actively in the
process of redesigning the data access mechanisms in Kepler and EcoGrid so
that standard query (e.g., joins) and resultset (e.g., cursors)
operations are available. This should give us pretty good scalability.
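To make the caching point concrete, here is a minimal Python sketch of the kind of client-side cache that would avoid repeated server round trips. The names (EcoGridCache, the identifier, the fetch function) are hypothetical, not actual EcoGrid API:

```python
class EcoGridCache:
    """Hypothetical client-side cache keyed by document identifier.

    Avoids contacting the server repeatedly for the same data/metadata.
    """

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # function that actually contacts the server
        self._store = {}         # identifier -> document bytes

    def get(self, identifier):
        if identifier not in self._store:
            self._store[identifier] = self._fetch(identifier)
        return self._store[identifier]

# Usage: wrap a (simulated) server call once, then reuse.
calls = []
def fake_server_fetch(doc_id):
    calls.append(doc_id)          # record each real server hit
    return b"<eml>...</eml>"

cache = EcoGridCache(fake_server_fetch)
cache.get("knb.123.1")
cache.get("knb.123.1")   # second call is served from cache; no server hit
```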
I also agree that we'll need alternative ways to view the data beyond
just chunking it up into one record per fire. We've discussed 'table at a
time' delivery, and your suggestion of 'vector at a time' delivery is
also a good idea. Both R and Matlab are vector-oriented languages and
could benefit from this sort of data delivery.
Most traditional stats programs such as SAS and SPSS take a more
relational view of data, and so that is the initial perspective we
adopted. We'll also have to deal with some real mismatches in how
relational data might be processed and how spatial data would fit into
such a flow.
Thanks for the thoughts,
Matt
Daniel Higgins wrote:
> Thoughts on data input into Kepler: and 'R'
>
> Looking into the use of 'R' inside Kepler has resulted in some
> thoughts/questions regarding just where we get the data that R (and
> Kepler) processes. I am presenting some of these ideas here for
> discussion/comments.
>
> First, consider the EML200DataSource actor. This actor uses an EML
> description to locate a data source and configures itself to have one
> output port for each column (attribute) in the data table (the entity).
> A sequence of tokens is then output through these ports, one for each
> row in the data table. The sequence of tokens out of each port is a
> data stream that could come from any of a variety of sources (database,
> file, etc.) and could conceptually handle very large data sources.
> Currently, however, the whole data table is read before the output
> stream is created. All the information ends up in local RAM, limiting
> the amount of information that can exist in a table. Also, the
> attributes are output on different ports, so that the very concept of a
> table is sort of lost (not to mention the possibility of a very large
> and confusing number of potential ports).
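A small Python sketch of the current behavior described above, with a list of dicts standing in for the data table and plain lists standing in for the per-attribute output ports (nothing here is actual Kepler/Ptolemy API):

```python
# The whole table is read into memory first, then each column is
# emitted as a separate token stream (one "port" per attribute).
table = [
    {"site": "A", "temp": 12.1},
    {"site": "B", "temp": 14.3},
]

# One output stream per column -- the row association is lost.
ports = {}
for row in table:                 # entire table already in RAM
    for name, value in row.items():
        ports.setdefault(name, []).append(value)

# ports == {"site": ["A", "B"], "temp": [12.1, 14.3]}
```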
>
> It would seem that alternative outputs would sometimes be useful. For
> example, one could output a Ptolemy record for each row in the table. In
> this row-oriented output, a column name would be associated with each
> value. Another possibility would be to use a column oriented approach
> which would create an array for each column and then a record which
> associated a name with each column array. A single record would thus
> represent each table. (Note that we could do this within Kepler by
> adding a SequenceToArray actor on the output of the existing
> EML200DataSource and then creating a Record associating each array with
> a name.)
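Both alternatives can be sketched quickly in Python; a dict of lists stands in for a Ptolemy record of arrays, and none of this is actual Kepler/Ptolemy API:

```python
table = [
    {"site": "A", "temp": 12.1},
    {"site": "B", "temp": 14.3},
]

# Row-oriented: one record (column name -> value) per row.
row_records = [dict(row) for row in table]

# Column-oriented: one record per table, each field an array --
# roughly what chaining SequenceToArray onto each port would yield.
column_record = {
    name: [row[name] for row in table] for name in table[0]
}
# column_record == {"site": ["A", "B"], "temp": [12.1, 14.3]}
```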
>
> This last idea is suggested by the way R reads data. Typically, a
> read.table() function is applied to a text file (or URL) to create a
> 'data frame' object. Each column in the data frame is a vector that can
> be individually manipulated. R "operates on named data structures",
> usually a 'vector', which is an ordered collection (most often numbers).
> This corresponds to the Ptolemy 'array', which is an ordered collection
> of tokens of the same type. Thus, it would be nice if we could just hand
> an 'R' actor a PTII array and have it converted to an 'R' vector.
> However, in an interactive 'R' session, data is usually either entered
> as command line strings or read from a file or URL. (Connections to
> databases or binary files/connections are also possible.) So, it might
> be useful to have an EMLDataSource that either created a local file or
> returned a URL for the data. An 'R' actor (script) could then just read
> this file/url as its datasource.
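A rough Python sketch of such an EMLDataSource variant; write_table_for_r is a hypothetical helper, and only the read.table() call shown in the comment is real R API:

```python
import csv
import os
import tempfile

def write_table_for_r(rows, fieldnames):
    """Hypothetical helper: materialize a data table as a local
    delimited file that an R script can load with read.table()."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path   # hand this path (or a file:// URL) to the R actor

path = write_table_for_r(
    [{"site": "A", "temp": "12.1"}, {"site": "B", "temp": "14.3"}],
    ["site", "temp"],
)
# The R side would then be:
#   df <- read.table(path, header=TRUE, sep=",")
```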
>
> [We might also consider using some code from Morpho, which stored large
> data tables as random-access files and allowed us to display very large
> tables without having everything in RAM.]
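The random-access idea can be sketched in Python with fixed-width records and seek(); the record width and helper names here are hypothetical, not Morpho's actual code:

```python
import os
import tempfile

RECORD = 16  # fixed record width in bytes (hypothetical)

def write_records(path, rows):
    """Write each row as a fixed-width, space-padded record."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(row.encode().ljust(RECORD))

def read_record(path, i):
    """Fetch row i by seeking, without loading the whole table into RAM."""
    with open(path, "rb") as f:
        f.seek(i * RECORD)
        return f.read(RECORD).decode().rstrip()

path = tempfile.mkstemp()[1]
write_records(path, ["A,12.1", "B,14.3", "C,9.8"])
rec2 = read_record(path, 2)   # -> "C,9.8"
os.remove(path)
```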
>
--
-------------------------------------------------------------------
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------