[seek-dev] Re: Thoughts on data input into Kepler: and 'R'
Matt Jones
jones at nceas.ucsb.edu
Fri Jul 9 08:42:04 PDT 2004
Dan,
Thanks. I agree with your assessment, and we have plans in the works for
substantially upgrading the EML data access. Currently it is really
just a proof of concept. The way it is implemented (loading all data in
RAM, for example) is not scalable and will need to be fixed. Other
inefficiencies abound as well, such as retrieving data and metadata from
the server multiple times (i.e., no caching). Jing is actively in the
process of redesigning the data access mechanisms in Kepler and EcoGrid so
that standard query (e.g., joins) and resultset (e.g., cursors)
operations are available. This should give us pretty good scalability.
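To make the caching point concrete, here is a minimal Python sketch of the kind of client-side cache that would avoid repeated server round trips. The names (EcoGridCache, the identifier, the fetch function) are hypothetical, not actual EcoGrid API:

```python
class EcoGridCache:
    """Hypothetical client-side cache keyed by document identifier.

    Avoids contacting the server repeatedly for the same data/metadata.
    """

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # function that actually contacts the server
        self._store = {}         # identifier -> document bytes

    def get(self, identifier):
        if identifier not in self._store:
            self._store[identifier] = self._fetch(identifier)
        return self._store[identifier]

# Usage: wrap a (simulated) server call once, then reuse.
calls = []
def fake_server_fetch(doc_id):
    calls.append(doc_id)          # record each real server hit
    return b"<eml>...</eml>"

cache = EcoGridCache(fake_server_fetch)
cache.get("knb.123.1")
cache.get("knb.123.1")   # second call is served from cache; no server hit
```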
I also agree that we'll need alternative ways to view the data beyond
just chunking it up into one record per fire. We've discussed 'table at a
time' delivery, and your suggestion of 'vector at a time' delivery is
also a good idea. Both R and Matlab are vector-oriented languages and
could benefit from this sort of data delivery.
Most traditional stats programs such as SAS and SPSS take a more
relational view of data, and so that is the initial perspective we
adopted. We'll also have to deal with some real mismatches in how
relational data might be processed and how spatial data would fit into
such a flow.
Thanks for the thoughts,
Matt
Daniel Higgins wrote:
> Thoughts on data input into Kepler: and 'R'
>
> Looking into the use of 'R' inside Kepler has resulted in some
> thoughts/questions regarding just where we get the data that R (and
> Kepler) processes. I am presenting some of these ideas here for
> discussion/comments.
>
> First, consider the EML200DataSource actor. This actor uses an EML
> description to locate a data source and configures itself to have one
> output port for each column (attribute) in the data table (the entity).
> A sequence of tokens is then output through these ports, one for each
> row in the data table. The sequence of tokens out of each port is a
> data stream that could come from any of a variety of sources (database,
> file, etc.) and could conceptually handle very large data sources.
> Currently, however, the whole data table is read before the output
> stream is created. All the information ends up in local RAM, limiting
> the amount of information that can exist in a table. Also, the
> attributes are output on different ports, so that the very concept of a
> table is sort of lost (not to mention the possibility of a very large
> and confusing number of potential ports).
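A small Python sketch of the current behavior described above, with a list of dicts standing in for the data table and plain lists standing in for the per-attribute output ports (nothing here is actual Kepler/Ptolemy API):

```python
# The whole table is read into memory first, then each column is
# emitted as a separate token stream (one "port" per attribute).
table = [
    {"site": "A", "temp": 12.1},
    {"site": "B", "temp": 14.3},
]

# One output stream per column -- the row association is lost.
ports = {}
for row in table:                 # entire table already in RAM
    for name, value in row.items():
        ports.setdefault(name, []).append(value)

# ports == {"site": ["A", "B"], "temp": [12.1, 14.3]}
```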
>
> It would seem that alternative outputs would sometimes be useful. For
> example, one could output a Ptolemy record for each row in the table. In
> this row-oriented output, a column name would be associated with each
> value. Another possibility would be to use a column oriented approach
> which would create an array for each column and then a record which
> associated a name with each column array. A single record would thus
> represent each table. (Note that we could do this within Kepler by
> adding a SequenceToArray actor on the output of the existing
> EML200DataSource and then creating a Record associating each array with
> a name.)
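Both alternatives can be sketched quickly in Python; a dict of lists stands in for a Ptolemy record of arrays, and none of this is actual Kepler/Ptolemy API:

```python
table = [
    {"site": "A", "temp": 12.1},
    {"site": "B", "temp": 14.3},
]

# Row-oriented: one record (column name -> value) per row.
row_records = [dict(row) for row in table]

# Column-oriented: one record per table, each field an array --
# roughly what chaining SequenceToArray onto each port would yield.
column_record = {
    name: [row[name] for row in table] for name in table[0]
}
# column_record == {"site": ["A", "B"], "temp": [12.1, 14.3]}
```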
>
> This last idea is suggested by the way R reads data. Typically, a
> read.table() function is applied to a text file (or URL) to create a
> 'data frame' object. Each column in the data frame is a vector that can
> be individually manipulated. R "operates on named data structures",
> usually a 'vector', which is an ordered collection (most often numbers).
> This corresponds to the Ptolemy 'array', which is an ordered collection
> of tokens of the same type. Thus, it would be nice if we could just hand
> an 'R' actor a PTII array and have it converted to an 'R' vector.
> However, in an interactive 'R' session, data is usually either entered
> as command line strings or read from a file or URL. (Connections to
> databases or binary files/connections are also possible.) So, it might
> be useful to have an EMLDataSource that either created a local file or
> returned a URL for the data. An 'R' actor (script) could then just read
> this file/url as its datasource.
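A rough Python sketch of such an EMLDataSource variant; write_table_for_r is a hypothetical helper, and only the read.table() call shown in the comment is real R API:

```python
import csv
import os
import tempfile

def write_table_for_r(rows, fieldnames):
    """Hypothetical helper: materialize a data table as a local
    delimited file that an R script can load with read.table()."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path   # hand this path (or a file:// URL) to the R actor

path = write_table_for_r(
    [{"site": "A", "temp": "12.1"}, {"site": "B", "temp": "14.3"}],
    ["site", "temp"],
)
# The R side would then be:
#   df <- read.table(path, header=TRUE, sep=",")
```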
>
> [We might also consider using some code from Morpho, which stored large
> data tables as random-access files and allowed us to display very large
> tables without having everything in RAM.]
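The random-access idea can be sketched in Python with fixed-width records and seek(); the record width and helper names here are hypothetical, not Morpho's actual code:

```python
import os
import tempfile

RECORD = 16  # fixed record width in bytes (hypothetical)

def write_records(path, rows):
    """Write each row as a fixed-width, space-padded record."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(row.encode().ljust(RECORD))

def read_record(path, i):
    """Fetch row i by seeking, without loading the whole table into RAM."""
    with open(path, "rb") as f:
        f.seek(i * RECORD)
        return f.read(RECORD).decode().rstrip()

path = tempfile.mkstemp()[1]
write_records(path, ["A,12.1", "B,14.3", "C,9.8"])
rec2 = read_record(path, 2)   # -> "C,9.8"
os.remove(path)
```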
>
--
-------------------------------------------------------------------
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------