[seek-dev] RE: resultset question

Fri Apr 23 10:59:34 PDT 2004

Dave and I spent some time thinking about this and arrived at a place 
similar to #4, but took it a little further: we changed how the 
resultset is defined and made a minor change to the query.

The main issue has to do with the consumers of the resultset coming back 
from an Ecogrid query.

How does a consumer interpret the results in a meaningful way?
What can be done to help generic consumers and SMS?

The issue at the moment is that the contents of the <record> element are 
basically a blob and anything goes. For example:
1) Metacat returns a bunch of param elements containing the data.
2) DiGIR returns a bunch of namespace-qualified elements containing 
the data.
3) The SRB doesn't even have any data in the record; only the identifier 
attr is meaningful.

We need to provide a mechanism for the contents to be interpreted. To do 
this we will add four things to the existing resultset schema:
1) One or more <namespace> elements in the metadata - these provide the 
namespace(s) for the new <returnfield> element
2) A new <returnfield> element
3) A "name" attribute for the returnfield element (basically the same as 
Peter's 'xpath' attr), which is a unique name within the record and may be 
meaningful for wherever the data came from.
4) A "type" attribute for the returnfield element that describes the type 
of data contained in the returnfield

The most important and powerful part of the new additions is the "type" 
attr. This enables the value to be interpreted. Most of the time it can 
be described by a schema definition type, for example "xsi:string" etc. 
Or it could be a URL that points to a schema definition document. This 
means the value of the returnfield element could be anything from a 
string or integer to an entire XML document.
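
As a rough illustration of how a consumer might act on the "type" attr, 
here is a minimal Python sketch. The CONVERTERS table and the interpret() 
function are hypothetical names made up for this example, the table only 
covers the two type names that appear in the examples in this message, 
and the example.org URL is a placeholder. Anything the consumer does not 
recognize (for instance a URL pointing at a schema document) is simply 
passed through as text.

```python
# Hypothetical converters, keyed by the type names used in this message.
CONVERTERS = {
    "xsi:string": str,
    "xsi:int": int,
}


def interpret(value, type_name):
    """Convert a returnfield's text using its type attr.

    Types that are not in the table (e.g. a URL that points to a schema
    definition document) are returned unchanged as text.
    """
    convert = CONVERTERS.get(type_name)
    return convert(value) if convert else value


# Values taken from the DiGIR example in this message.
longitude = interpret("121", "xsi:int")
species = interpret("PEROMYSCUS LEUCOPUS NOVEBORACENSIS", "xsi:string")

# Placeholder URL standing in for a schema-document type (an assumption).
chunk = interpret("<eml>...</eml>", "http://example.org/some-schema.xsd")

print(type(longitude).__name__, longitude)
print(type(species).__name__, species)
print(type(chunk).__name__, chunk)
```

A real consumer would probably want a richer table and some error 
handling for values that do not match their declared type.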

(Note that the namespace attr has been removed from the record element)

The new namespace elements in the metadata provide a way for the value of 
the name attr and the type attr to be interpreted.

Here is an example of a Metacat resultset as it is returned today:
<rs:resultset system="http://knb.ecoinformatics.org" resultsetId="eml.001"
  xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
  <resultsetMetadata>
    <sendTime>2004-03-10T13:47:26-0600</sendTime>
    <startRecord>1</startRecord>
    <endRecord>14</endRecord>
    <recordCount>14</recordCount>
  </resultsetMetadata>
  <record number="1"
          system="http://dev.nceas.ucsb.edu"
          identifier="obfs2.379.1"
          namespace="eml://ecoinformatics.org/eml-2.0.0"
          lastModifiedDate="2003-11-02T11:07:43-0600"
          creationDate="2003-11-02T11:07:43-0600">
      <param name="/eml/dataset/keywordSet/keyword">seasonality</param>
      <param name="/eml/dataset/keywordSet/keyword">macroalgal bloom</param>
      <param name="/eml/dataset/keywordSet/keyword">green tide</param>
      <param name="/eml/dataset/keywordSet/keyword">Ulva</param>
      <param name="/eml/dataset/creator/individualName/surName">Nelson</param>
      <param name="/eml/dataset/keywordSet/keyword">biomass</param>
      <param name="/eml/dataset/keywordSet/keyword">algal blooms</param>
      <param name="/eml/dataset/title">Armitage Bay Ulvoid Algal Biomass and Species Composition</param>
      <param name="/eml/dataset/keywordSet/keyword">Enteromorpha</param>
      <param name="/eml/dataset/keywordSet/keyword">Ulvaria</param>
  </record>

Here is an example of the same resultset as described by the new approach:
<rs:resultset system="http://knb.ecoinformatics.org" resultsetId="eml.001"
  xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
  <resultsetMetadata>
    <sendTime>2004-03-10T13:47:26-0600</sendTime>
    <startRecord>1</startRecord>
    <endRecord>14</endRecord>
    <recordCount>14</recordCount>
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
  </resultsetMetadata>
  <record number="1"
          system="http://dev.nceas.ucsb.edu"
          identifier="obfs2.379.1"
          lastModifiedDate="2003-11-02T11:07:43-0600"
          creationDate="2003-11-02T11:07:43-0600">
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">seasonality</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">macroalgal bloom</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">green tide</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
      <returnfield name="/eml/dataset/creator/individualName/surName" type="xsi:string">Nelson</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">biomass</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">algal blooms</returnfield>
      <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Enteromorpha</returnfield>
      <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulvaria</returnfield>
  </record>

Note how we can now interpret the resultset in a much more meaningful 
way. Also, note that there are two new namespace elements: one contains 
a "prefix" attr, the other does not. The one without becomes the default 
namespace for unqualified values in the name and type attrs.
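
To make the default-namespace rule concrete, here is a minimal Python 
sketch of how a generic consumer might resolve the name and type attrs 
against the <namespace> elements. It is only an illustration under some 
assumptions: SAMPLE is a trimmed copy of the Metacat example above, and 
read_namespaces(), qualify() and read_fields() are hypothetical helper 
names, not part of any existing Ecogrid code.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the proposed Metacat resultset.
SAMPLE = """\
<rs:resultset system="http://knb.ecoinformatics.org" resultsetId="eml.001"
  xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
  <resultsetMetadata>
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
  </resultsetMetadata>
  <record number="1" system="http://dev.nceas.ucsb.edu" identifier="obfs2.379.1">
    <returnfield name="/eml/dataset/title" type="xsi:string">Armitage Bay Ulvoid Algal Biomass and Species Composition</returnfield>
    <returnfield name="/eml/dataset/keywordSet/keyword" type="xsi:string">Ulva</returnfield>
  </record>
</rs:resultset>
"""


def read_namespaces(root):
    """Map prefix -> namespace URI; the element without a prefix is stored under ''."""
    return {
        ns.get("prefix", ""): (ns.text or "").strip()
        for ns in root.findall("resultsetMetadata/namespace")
    }


def qualify(value, namespaces):
    """Return (namespace URI, local part) for a name or type attr value."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix], local
    # No declared prefix: fall back to the default (unprefixed) namespace.
    return namespaces.get(""), value


def read_fields(root):
    """Collect every returnfield with its name and type resolved."""
    namespaces = read_namespaces(root)
    fields = []
    for record in root.findall("record"):
        for rf in record.findall("returnfield"):
            name_ns, name = qualify(rf.get("name"), namespaces)
            type_ns, type_name = qualify(rf.get("type"), namespaces)
            fields.append({
                "identifier": record.get("identifier"),
                "name": name, "name_ns": name_ns,
                "type": type_name, "type_ns": type_ns,
                "value": rf.text,
            })
    return fields


fields = read_fields(ET.fromstring(SAMPLE))
for f in fields:
    print(f["name_ns"], f["name"], f["type_ns"], f["type"], f["value"])
```

The same loop should work unchanged for the DiGIR and SRB resultsets, 
since the consumer only looks at the metadata and the returnfield attrs.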

Here is the before and after for the DiGIR query:
Before:
<rs:resultset resultsetId="foo.1.1"
    system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 
../../src/xsd/resultset.xsd">
    <resultsetMetadata>
        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
        <startRecord>1</startRecord>
        <endRecord>2</endRecord>
        <recordCount>2</recordCount>
    </resultsetMetadata>
     <record number="1"

system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
             identifier="mvz1"
             namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
             lastModifiedDate="2003-03-03T10:42:13"
             creationDate="2003-03-03T10:42:13">
        <darwin:ScientificName>PEROMYSCUS LEUCOPUS 
NOVEBORACENSIS</darwin:ScientificName>
        <darwin:Longitude>121</darwin:Longitude>
        <darwin:Latitude>33</darwin:Latitude>
     </record>

After:
<rs:resultset resultsetId="foo.1.1"
    system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 
../../src/xsd/resultset.xsd">

    <resultsetMetadata>
        <sendTime>2003-05-02T16:45:50-09:00</sendTime>
        <startRecord>1</startRecord>
        <endRecord>2</endRecord>
        <recordCount>2</recordCount>

        <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
        <namespace prefix="xsi">http://www.w3.org/2001/XMLSchema-instance</namespace>
    </resultsetMetadata>

    <record number="1"

system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
             identifier="mvz1"
             lastModifiedDate="2003-03-03T10:42:13"
             creationDate="2003-03-03T10:42:13">
        <returnfield name="ScientificName" type="xsi:string">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnfield>
        <returnfield name="Longitude" type="xsi:int">121</returnfield>
        <returnfield name="Latitude" type="xsi:int">33</returnfield>
    </record>

Here is the SRB's before and after:
Before:
<rs:resultset system="http://knb.ecoinformatics.org" 
resultsetId="SeekSRB_001"
 xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"  >
 <resultsetMetadata>
   <sendTime>2004-04-16T11:02:12-0500</sendTime>
   <startRecord>1</startRecord>
   <endRecord>2</endRecord>
   <recordCount>2</recordCount>
 </resultsetMetadata>
 <record number="1"
         system="http://srb.sdsc.edu"
         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
         namespace="srb://srb.sdsc.edu"
         lastModifiedDate="2003-11-30T13:04:59-0600"
         creationDate="2003-11-30T13:04:58-0600">
 </record>

After:
<rs:resultset system="http://knb.ecoinformatics.org" 
resultsetId="SeekSRB_001"
 xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"  >
 <resultsetMetadata>
   <sendTime>2004-04-16T11:02:12-0500</sendTime>
   <startRecord>1</startRecord>
   <endRecord>2</endRecord>
   <recordCount>2</recordCount>
   <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
 </resultsetMetadata>
 <record number="1"
         system="http://srb.sdsc.edu"
         identifier="/home/testuser.sdsc/SeekTestArea/Lesli Model::0"
         lastModifiedDate="2003-11-30T13:04:59-0600"
         creationDate="2003-11-30T13:04:58-0600">
  <returnfield name="location" type="xsi:string">/home/testuser.sdsc/SeekTestArea/Lesli Model::0</returnfield>
 </record>
------------------------------------------------------------------------
The Query
About the only difference between the old query and the new one is this: 
if the returnfield values and the concept attr values do not have a 
namespace prefix, then the prefix should be dropped from the namespace 
element; conversely, they should carry the prefix if the namespace 
element has one. For example:

<?xml version="1.0" encoding="UTF-8"?>
<egq:query queryId="test.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 
../../src/xsd/query.xsd">
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <returnfield>/eml/dataset/title</returnfield>

    <returnfield>/eml/dataset/creator/individualName/surName</returnfield>
    <returnfield>/eml/dataset/pubDate</returnfield>
    <returnfield>/eml/dataset/keywordSet/keyword</returnfield>
    <title>Soils metadata query</title>
    <AND>
        <OR>
            <condition operator="LIKE" concept="title">%soil%</condition>
            <condition operator="NOT LIKE" 
concept="title">%dirt%</condition>
        </OR>
        <OR>
            <condition operator="LIKE" concept="surName">%Jones%</condition>
            <condition operator="LIKE" 
concept="surName">%Vieglais%</condition>
        </OR>
    </AND>
</egq:query>
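
For what it's worth, here is a minimal Python sketch of how a service 
might pull the namespace, the returnfields and the conditions out of a 
query like the one above. QUERY is a trimmed copy of that example and 
summarize_query() is a hypothetical name; a missing "prefix" attr on the 
namespace element is reported as None, which is how this sketch signals 
that the returnfield and concept values are unqualified.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example query from this message.
QUERY = """\
<egq:query queryId="test.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1">
    <namespace>eml://ecoinformatics.org/eml-2.0.0</namespace>
    <returnfield>/eml/dataset/title</returnfield>
    <returnfield>/eml/dataset/keywordSet/keyword</returnfield>
    <title>Soils metadata query</title>
    <AND>
        <OR>
            <condition operator="LIKE" concept="title">%soil%</condition>
            <condition operator="NOT LIKE" concept="title">%dirt%</condition>
        </OR>
    </AND>
</egq:query>
"""


def summarize_query(xml_text):
    """Return the namespace, returnfields and conditions of a query document."""
    root = ET.fromstring(xml_text)
    ns = root.find("namespace")
    return {
        "namespace": ns.text.strip(),
        # None means the returnfield and concept values are unqualified.
        "prefix": ns.get("prefix"),
        "returnfields": [rf.text.strip() for rf in root.findall("returnfield")],
        # iter() walks the nested AND/OR groups, flattening the conditions.
        "conditions": [
            (c.get("concept"), c.get("operator"), c.text)
            for c in root.iter("condition")
        ],
    }


summary = summarize_query(QUERY)
print(summary["namespace"], summary["prefix"])
print(summary["returnfields"])
print(summary["conditions"])
```

Note that flattening the conditions throws away the AND/OR grouping, so 
this is only good for inspection, not for actually running the query.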
------------------------------------------------------------------------

We can either discuss this via email, or think about it and discuss it 
further during our phone meeting.

Rod

Chad Berkley wrote:

> Hi,
>
> Sorry for my late reply...we've been busy with a morpho release.  
> thanks for getting me in gear, Rod.
>
> In metacat, we only return leaf nodes (i.e. the text node child of a 
> CDATA element like in response 4 below).  The returnfield 
> functionality was originally meant as a convenient way to return 
> enough information for a meaningful resultset to display, say, on a 
> web page.  It was not meant to return whole document chunks for 
> further processing.  I can see how this would be useful, but it would 
> require returning a namespace defined chunk so that a parser would 
> know what to do with it.  Metacat currently uses the returnfields to 
> build the resultset table, then a request must be made for the whole 
> document in order to do further processing.
>
> Looking at the responses 1-3 below, to me, they are all invalid and 
> potentially problematic.  without a namespace to parse those xml 
> chunks off of, the parser is left to just do well-formedness checking 
> and any query into these document chunks may fail because we don't 
> know what to expect to get back before doing the processing (e.g. an 
> xpath query).
>
> So I guess to make a short answer long, I agree with Peter's 
> assessment of sticking with response 4 (which is basically what 
> metacat has done all along).
>
> chad
>
>
> Rod Spears wrote:
>
>> Is anyone better qualified than me, going to address Peter's questions?
>>
>> Please someone respond, thanks.
>>
>> Rod
>>
>>
>> Peter McCartney wrote:
>>
>>> it has to be well formed no matter what. so the question is really 
>>> how can we identify a namespace for the result set when the content 
>>> we stick in there has no hope of being valid? further, how can we 
>>> define  a set of rules for how the results are to be evaluated 
>>> against that namespace yet not be valid?
>>> request 1: '*/creator/individualName/surname', '/eml/dataset
>>>  
>>> Rule1: "content must appear in minimal xml tree needed to accommodate 
>>> the information"
>>>  
>>> Rule2: "content must appear in a potentially valid xml tree that 
>>> invalidates only due to other required elements missing."
>>>  
>>> Rule3: "content must appear in a tree that places it in the correct node 
>>> ancestry for the declared namespace."
>>>  
>>>  
>>> response 1: meets 1 and 3 and is well formed. Requires just 
>>> knowledge of parent ancestry to build.
>>> <eml>
>>>     <dataset>
>>>     <creator>
>>>         <individualName>
>>>                 <surname>mccartney</surname>
>>>                 <surname>jones</surname>
>>>         </individualName>
>>>     </creator>
>>> </dataset>
>>> </eml>
>>>  
>>> response 2: meets 1, 2 and 3 and is well formed. Requires knowledge 
>>> of ancestry and index (ie jones is in creator[2] of dataset[1] )
>>> <eml>
>>>     <dataset>
>>>     <creator>
>>>         <individualName>
>>>                 <surname>mccartney</surname>
>>>         </individualName>
>>>     </creator>
>>>     <creator>
>>>         <individualName>
>>>                 <surname>jones</surname>
>>>         </individualName>
>>>     </creator>
>>>   </dataset>
>>> </eml>
>>>  
>>>  
>>> response 3: meets 3 and is not well formed. requires knowledge of 
>>> ancestry.
>>>  
>>> <eml>
>>>     <dataset>
>>>     <creator>
>>>         <individualName>
>>>                 <surname>mccartney</surname>
>>>         </individualname>
>>>     </creator>
>>> </dataset>
>>> <eml>
>>>     <dataset>
>>>     <creator>
>>>         <individualName>
>>>                 <surname>jones</surname>
>>>         </individualname>
>>>     </creator>
>>> </dataset>
>>> </eml>
>>>  
>>> and just a reminder of where we originally started from 
>>> (approximately)  
>>> response 4: meets no rule, cannot be validated, but conveys all the 
>>> information to generate format 1 or 3 above using a string tokenizer 
>>> and a jDOM. but not option 2.
>>> <resultset namespace=eml......>
>>>     <returnfield 
>>> xpath="dataset/creator/individualname/surname">mccartney</returnfield>
>>>     <returnfield 
>>> xpath="dataset/creator/individualname/surname">jones</returnfield>
>>> </resultset>
>>>  
>>> I think we should really ask whether we are making ourselves deal 
>>> with some very complicated rules for really no gain in 
>>> functionality. None of the results will be valid according to the 
>>> name space. All of them are valid if i make up my own namespace for 
>>> the result set.  Unless we can hold our selves to the standard where 
>>> any code or xsl written for the schema will successfuly process the 
>>> result set (#2 is the closest to that, but depending on how loose 
>>> the code is, all three could work or none could work), why shouldnt 
>>> we opt for the easiest rule to comply with?
>>>  
>>>  
>>> Peter McCartney (peter.mccartney at asu.edu 
>>> <mailto:peter.mccartney at asu.edu>)
>>> Center for Environmental-Studies
>>> Arizona State University
>>>  
>>>
>>>     -----Original Message-----
>>>     *From:* Saritha Bhandarkar
>>>     *Sent:* Friday, April 09, 2004 10:28 AM
>>>     *To:* 'seek-dev'
>>>     *Cc:* Jing Tao; Peter McCartney; Saritha Bhandarkar
>>>     *Subject:* resultset question
>>>
>>>     Hi,
>>>
>>>     I had a question about the resultset to be returned by Xanthoria.
>>>
>>>     The schema of the resultset specifies that a record is of type
>>>     "AnyRecordType" and optionally it may have some element content
>>>     from the record. Now, my question here is, if I am to return the
>>>     elements specified in the <returnfields> of the query, for the 
>>> matching records (that is from the matching
>>>     eml file), do I need to send it in eml format,  with only relevant
>>>     values for requested fields and no values for the fields which are
>>>     not requested? Or is it enough to return only the requested fields
>>>     with their values, as well-formed xml? Can someone please brief me
>>>     on the contents of a record in resultsetType?
>>>
>>>     Thanks,
>>>
>>>     Saritha
>>>
>>>     
>>>     
>>>     
>>>     
>>>     Saritha Bhandarkar
>>>
>>>     Research Assistant
>>>
>>>     Center for Environmental Studies
>>>
>>>     ASU-Tempe AZ
>>>
>>>     saritha.bhandarkar at asu.edu <mailto:saritha.bhandarkar at asu.edu>
>>>
>>>     
>>>     
>>
>>
>> -- 
>> Rod Spears
>> Biodiversity Research Center
>> University of Kansas
>> 1345 Jayhawk Boulevard
>> Lawrence, KS 66045, USA
>> Tel: 785 864-4082, Fax: 785 864-5335
>>
>
>