Measurement scale in EML

Matt Jones jones at nceas.ucsb.edu
Mon Feb 28 09:31:01 PST 2005


Hi Xiaoping,

As Peter mentioned, the problems you describe have come up before.  See 
below for some additional recommendations, beyond Peter's, from my 
personal perspective.

Xiaoping Wang wrote:
> Dear Matt and Peter:
> 
> I have seen a lot of discussions recently on issues about measurement 
> scale and temporal coverage.  They are very helpful for our better 
> understanding of EML.  The following are my questions and concerns I 
> raised during my work on our EML-based metadata.
> 
> 1. About the Measurement scale
> 
> The measurementScale is a little bit confusing.  I spent a lot of time 
> working on the measurementScale for nominal data.  Here I want to give 
> you an example of how I use the measurementScale to describe nominal 
> data in our dataset, and you can see whether my implementation is based 
> on a correct understanding of this element.
> 
> We have a data table with four columns (attributes): recordID, 
> variable_name, variable_unit, and variable_value.  The values for the 
> variable_name column include certain measurements for the chemical and 
> physical properties of sea water such as temperature, salinity, 
> nitrate, ...  The following is a sample piece of my EML file for this 
> dataset.
> <attribute>
>     <attributeName>varName</attributeName>
>     <attributeDefinition>Name of chemical or physical property measured</attributeDefinition>
>     <storageType>String</storageType>
>     <measurementScale>
>         <nominal>
>             <nonNumericDomain>
>                 <enumeratedDomain>
>                     <codeDefinition>
>                         <code>T</code>
>                         <definition>Temperature, unit: C</definition>
>                     </codeDefinition>
>                     <codeDefinition>
>                         <code>S</code>
>                         <definition>Salinity, unit: PPT</definition>
>                     </codeDefinition>
>                     <codeDefinition>
>                         <code>ST</code>
>                         <definition>Sigma-T, unit: KG/M**3</definition>
>                     </codeDefinition>
>                 </enumeratedDomain>
>             </nonNumericDomain>
>         </nominal>
>     </measurementScale>
> </attribute>
> <attribute>
>     <attributeName>varUnit</attributeName>
>     <attributeDefinition>Unit of chemical or physical property measured</attributeDefinition>
>     <storageType>String</storageType>
>     <measurementScale>
>         <nominal>
>             <nonNumericDomain>
>                 <textDomain>
>                     <definition>*</definition>
>                 </textDomain>
>             </nonNumericDomain>
>         </nominal>
>     </measurementScale>
> </attribute>
> 
> My questions / concerns are:
> (1) Is it suitable to use enumeratedDomain element to describe varName?
Yes, that is fine, although if you wanted it to be free text that would 
be OK too (just use textDomain instead of enumeratedDomain).  Encoding 
the unit information in the varName code definitions is redundant if you 
have the same unit information in the varUnit column.

> 
> (2) For the varUnit, I don't think it is necessary to include 
> measurementScale element.  However, since the measurementScale is an 
> required field, I have to put something there in order to pass the EML 
> validation.  So I put a "*" sign for the definition element.  I have 
> seen some other similar cases in which the EML metadata developers use a 
> "*" for the definition element.  Obviously, the measurementScale content 
> described here tells no useful information about the varUnit.
The use of the '*' is inappropriate.  The field is required because the 
authors of EML thought the information was important.  In this case, I 
think the definition should say that the values are names of units.  One 
major thing that is missing here is that you don't use the EML Unit 
Dictionary when choosing your unit definitions.  This eliminates the 
major advantage of EML, which is its ability to provide quantitative 
information about units.  If there is a 1:1 correspondence between your 
units and the EML unit dictionary, I think it would be good to define 
varUnit as an enumerated domain and, for each of your units, provide the 
EML standard name for the unit in the definition.  This would help in 
translating, although it is unlikely that anyone could use this in 
automated systems because it's such a non-standard use of the EML 
descriptors.
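
A minimal sketch of that enumerated domain, written in Python so the
mapping is explicit.  Everything here is an assumption for illustration:
the local unit strings are taken from the definitions in your varName
example, "celsius" and "kilogramsPerCubicMeter" are names I believe are
in the EML unit dictionary (please check them against your EML release),
and PPT is left out because I am not assuming it has a 1:1 match.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from local unit strings to EML unit dictionary names.
# Both keys and values are illustrative; verify each name against the
# unit dictionary shipped with your EML release.
LOCAL_TO_EML_UNIT = {
    "C": "celsius",
    "KG/M**3": "kilogramsPerCubicMeter",
}


def build_unit_domain(mapping):
    """Return an <enumeratedDomain> element with one code per local unit."""
    domain = ET.Element("enumeratedDomain")
    for local_unit, eml_unit in sorted(mapping.items()):
        code_def = ET.SubElement(domain, "codeDefinition")
        ET.SubElement(code_def, "code").text = local_unit
        # The definition says the value is a unit name and gives the
        # corresponding EML standard unit.
        ET.SubElement(code_def, "definition").text = (
            "Name of a unit; EML standard unit: " + eml_unit
        )
    return domain


if __name__ == "__main__":
    domain = build_unit_domain(LOCAL_TO_EML_UNIT)
    print(ET.tostring(domain, encoding="unicode"))
```

The resulting element would go inside the nonNumericDomain of the
varUnit attribute, in place of the textDomain with the '*'.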

In general, this model of variablename, varunit, value is a non-standard 
use of the relational model, as the attributes do not really represent a 
single type.  The relational model is generally intended to have 
attributes that contain a semantically homogeneous set of values.  In 
your case this is not true, unless considered from a meta-level.  So, I 
think you are using the relational model as a schema language itself. 
This significantly complicates use of the data in standard analytical 
systems (e.g., SAS, Splus, R, Matlab) -- they basically all require 
different views of the data, as described in Peter's note.  Personally I 
think that documenting these more traditional views, if you have them, 
would be far more useful to scientists who wish to analyze the data. 
That would have the added benefit of being better described by EML 
structures.  Documenting your "meta-level" schema isn't particularly 
informative because the information in one attribute is so heterogeneous.
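
To make the "traditional view" concrete, here is a rough Python sketch
that pivots name/unit/value rows into one record per sample with one
column per variable.  It assumes each row carries a key identifying the
sample it belongs to (a hypothetical sample_id; your recordID may or may
not play that role) and that each variable is always reported in a
single unit.  The function name and row layout are mine, not part of
EML or of your dataset.

```python
from collections import defaultdict


def long_to_wide(rows):
    """Pivot (sample_id, var_name, var_unit, var_value) rows.

    Returns (wide, units): wide maps sample_id -> {var_name: value},
    units maps var_name -> the single unit used for that variable.
    """
    wide = defaultdict(dict)
    units = {}
    for sample_id, var_name, var_unit, var_value in rows:
        # Each wide column must be homogeneous, so reject mixed units.
        known_unit = units.setdefault(var_name, var_unit)
        if known_unit != var_unit:
            raise ValueError(
                "variable %r has mixed units: %r and %r"
                % (var_name, known_unit, var_unit)
            )
        wide[sample_id][var_name] = float(var_value)
    return dict(wide), units


if __name__ == "__main__":
    rows = [
        ("s1", "T", "C", "12.5"),
        ("s1", "S", "PPT", "33.1"),
        ("s2", "T", "C", "11.0"),
    ]
    wide, units = long_to_wide(rows)
    print(wide)
    print(units)
```

In the wide view each variable becomes its own attribute with one unit,
which is the shape that the EML attribute structures describe directly.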

> 
> 2. About the information of metadata itself
> 
> Based on my understanding of EML schemas, the only information 
> associated with the metadata itself is the information about metadata 
> provider(s).  However, my supervisors and I think that it is important 
> to provide other metadata information, such as when the metadata document 
> was created, whether further update of the metadata is needed, and if the 
> answer is yes, the metadata update frequency and the date of last update.  
> Those pieces of information are particularly important in the case when 
> the endDate value for the dataset from on-going projects is going to 
> change, because first they can remind metadata providers / developer 
> when they should update their metadata, and second they can tell 
> metadata users if the metadata document provides the most current 
> information about the dataset described.
Sure.  In hindsight, I think we should have included these metadata 
information fields, particularly the timestamp fields.  But we do have 
some related fields that describe ongoing data collection.  Take a look 
at /eml/dataset/maintenance/description and 
/eml/dataset/maintenance/maintenanceUpdateFrequency.  The latter is 
probably what you want.  Any fields that you want but that don't exist in 
the schema can be put in the "/eml/additionalMetadata" field, so you 
always have that as a recourse.  If you have specific recommendations 
for fields that are needed you could send them to 
eml-dev at ecoinformatics.org and we'll try to get them into plans for a 
future release.

> 
> 3. About the temporal coverage
> 
> We have many metadata records with uncertain endDate because the new 
> data are being continuously loaded into the dataset.  Whenever new data 
> are loaded, we have to change the values for end date, number of 
> records, and/or size of table.  I am wondering when you can 
> provide a solution for this issue.
Personally I think this is a good thing.  At any given point in time 
there is a finite amount of data available, and the metadata should 
describe that.  If you have an automated data collection process, then 
you would simply have to update your metadata as part of that process. 
The number of records, table size, and checksum are useful for people 
who download your data to verify that they received it without error.  
The end date for temporal coverage provides valuable discovery 
information, and should simply be made to match the data that you 
release.
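
As a rough illustration of updating the metadata as part of the load
process, here is a Python sketch that rewrites the end date and record
count from the data being released.  The XML is a stripped-down
fragment, not a complete valid EML document, and the element paths
follow my reading of the schema, so check them against your own files.
The function name refresh_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Stripped-down fragment for illustration only; a real EML document has
# namespaces and many more required elements.
EML_FRAGMENT = """
<dataset>
  <coverage>
    <temporalCoverage>
      <rangeOfDates>
        <beginDate><calendarDate>2004-01-01</calendarDate></beginDate>
        <endDate><calendarDate>2004-12-31</calendarDate></endDate>
      </rangeOfDates>
    </temporalCoverage>
  </coverage>
  <dataTable>
    <numberOfRecords>100</numberOfRecords>
  </dataTable>
</dataset>
"""


def refresh_metadata(dataset, observation_dates):
    """Set endDate and numberOfRecords to match the released data."""
    end = dataset.find(
        "coverage/temporalCoverage/rangeOfDates/endDate/calendarDate"
    )
    count = dataset.find("dataTable/numberOfRecords")
    end.text = max(observation_dates).isoformat()
    count.text = str(len(observation_dates))
    return dataset


if __name__ == "__main__":
    dataset = ET.fromstring(EML_FRAGMENT)
    refresh_metadata(dataset, [date(2005, 1, 15), date(2005, 2, 27)])
    print(ET.tostring(dataset, encoding="unicode"))
```

The same step could also recompute the table size and checksum, so that
every released version of the data has metadata that matches it.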

> 
> In addition, I found from John's email that you had a KNB data 
> management workshop early this year.  I am very interested in this kind 
> of workshop, particularly workshops associated with the use of Metacat.  If 
> you have this type of workshop in the future, please let me know.
Yeah, we had one in February.  We announce these opportunities on 
various web sites and mailing lists.  You should subscribe to 
ecoinfo at ecoinformatics.org and watch http://seek.ecoinformatics.org in 
particular for announcements.

Like Peter I also recommend that you get involved in the ongoing 
improvements related to EML.  Your feedback and contributions would be 
extremely valuable.  Good luck.  Let us know if you have more questions.

Matt
> 
> Thank you very much for your support!
> 
> Xiaoping Wang
> 
> PMEL /NOAA

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------



More information about the Eml-dev mailing list