[seek-dev] Re: notes on eml (Your comments and suggestions will be greatly appreciated!)
Matt Jones
jones at nceas.ucsb.edu
Mon Feb 10 13:15:57 PST 2003
Thanks, Jenny. Tony also sent this to me last week, but I had not had a
chance to reply until I got back from my travels. Thank you for
examining EML so closely -- few people seem to take the time to do so.
My replies to your comments are interspersed within your comments below.
> Jenny Wang (jwang at sdsc.edu) wrote:
> 1. Other data formats than tabular
>
> EML documents metadata for binary or ASCII files (i.e., just tabular
> data); what about metadata for XML files, HTML files, or data published
> through web services? As more applications export or exchange
> information in XML format, and even some original data are recorded
> directly in XML files, their DTDs or XML Schemas are very important
> information that needs to be documented and provided to users.
Actually, EML supports a variety of entity types, including tabular
data, spatial vector and raster data, and others. This is modularly
implemented, and so is extensible. For details, examine all of the EML
schema ComplexTypes that reference EntityGroup. In addition, for any
given type of entity, for example tabular data, there is a separation
between the logical model (e.g., what attributes are present), and the
physical model (e.g., how the table is serialized to disk). So, for
example, one can have a table serialized in CSV format described in
the physical module using "textFormat", and the same table serialized in
XML format and described in the physical module using
"externallyDefinedFormat". One could argue that adding a container to
place the actual schema/dtd in externallyDefinedFormat would be a good
idea. However, even with such a schema, the exact mapping between the
physical format and the logical model is still ambiguous, and that's a
hard one to solve. Our "textFormat" description defines the mapping to
the logical structure, but I'm not sure how this would work for XML.
Needless to say, I think it would be important to be able to have full
XML serialization support for entities in EML.
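As a sketch of that logical/physical separation, the same logical table
might carry two alternative physical descriptions. Element names below
follow EML 2.x conventions but this is an illustration, not a normative
fragment; verify against the eml-physical schema:

```xml
<!-- One logical entity, two physical serializations (illustrative only) -->
<physical>
  <objectName>airtemp.csv</objectName>
  <dataFormat>
    <textFormat>
      <numHeaderLines>1</numHeaderLines>
      <attributeOrientation>column</attributeOrientation>
      <simpleDelimited>
        <fieldDelimiter>,</fieldDelimiter>
      </simpleDelimited>
    </textFormat>
  </dataFormat>
</physical>
<!-- ...versus the same table serialized as XML: -->
<physical>
  <objectName>airtemp.xml</objectName>
  <dataFormat>
    <externallyDefinedFormat>
      <formatName>XML (site-defined schema)</formatName>
    </externallyDefinedFormat>
  </dataFormat>
</physical>
```

Note that the textFormat description fully defines the mapping to the
logical attributes, while the externallyDefinedFormat leaves that
mapping unspecified, which is exactly the ambiguity discussed above.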
> 2. Data source query ability
>
> EML describes some metadata of relational databases and datasets
> behind applications like JSPs. But it only describes the information
> about the data, no query abilities on a data source provided. For
> example, some site may have data in Oracle tables, but only queries
> over some views are allowable. Then the users or applications would
> not know how to properly query the data source. Another example,
> NTL (North Temperate Lake) site exports its data through JSPs. In
> the sample EML document for a climate dataset collected at Noble
> F. Lee Municipal Airport, descriptions about the attributes are
> provided, and the URL where the JSPs are located is given too. But
> no information about the flow of the JSPs is given. Without a human
> user reading and understanding the interface, an application cannot
> know how to translate a query into an interface-specific query
> (filling in the forms correctly), since there is no description in
> the EML file about accessing the data through this interface.
Querying on custom interfaces is a difficult thing to describe
generally, so we decided to make a simple distinction between simple
services where the entire query can be expressed as a URL, and more
complex systems where more detailed knowledge of the application is
required. In EML, these more detailed applications are described in the
physical distribution under "online/connection". As the semantics of the
application are difficult to generalize, we decided that we would
basically allow one to name an application and provide the connection
parameters needed to query using that interface. The application
protocol is described in natural language, but a machine must
recognize it by name in order to really make use of it. This is
severely lacking for many cases. We had extensive discussions about the
use of languages like WSDL in place of our current connection
descriptors, but the general feeling among the EML developers was that
WSDL was too immature, and its future uncertain. So, we decided to wait
on a full implementation of this functionality until we were able to
experiment with it further. Your input into what is needed would be
welcome.
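For concreteness, a connection description along these lines might look
like the following. The scheme name and parameters are hypothetical,
and the exact element structure should be checked against the EML
physical module:

```xml
<online>
  <connection>
    <connectionDefinition>
      <schemeName>jdbc-oracle</schemeName>
      <description>JDBC connection to the site's queryable Oracle
        views (scheme name is hypothetical)</description>
      <parameterDefinition>
        <name>hostName</name>
        <definition>DNS name of the database server</definition>
      </parameterDefinition>
    </connectionDefinition>
    <parameter>
      <name>hostName</name>
      <value>db.example.edu</value>
    </parameter>
  </connection>
</online>
```

As noted above, the description is natural language only; a client that
does not recognize the scheme name can do nothing useful with it.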
> 3. Automated data access support for applications
>
> EML does not seem to provide enough information to help applications
> directly and automatically access the data.
>
> In the physical module, the distribution element provides a URL as
> the only information for a DBMS or other system that holds the data.
> Without further information about what DBMS it is, what version,
> etc., how can an application set up the right connection (e.g., what
> driver to use) to the system?
Use "online/connection" instead of "online/url". See my notes on your
point (2) as well, as they apply.
> Again taking the sample EML document for Noble F. Lee Municipal
> Airport from the NTL site as an example, in the physical module the
> distribution part gives only the URL where the query interface
> through JSPs is running; lots of HTML forms (for example, which
> fields to retrieve, time boundaries, output format) need to be
> filled in before retrieving the data.
That the NTL site decided to provide only a URL that doesn't directly
access the data is their decision. They had and have the opportunity to
provide more detailed connection information through
"online/connection", and they opted not to.
> If EML were only meant to help users interpret the data properly,
> then it need not be in XML format; regular text files would suffice,
> as those sites or ClimDB provide in the introduction or description
> parts of their web sites. If it is going to support automated data
> integration, automated data access is necessary first. E.g.,
> providing connection and query abilities for DBMSs, and attaching
> calling messages and parameters for JSPs and web services, seem
> needed.
Agreed. I argued pretty hard for including WSDL descriptions in EML,
but didn't get much support back then. We switched to our current
"online/connection" as a compromise. I think this area still needs lots
of work. However, as we are trying to get people to deploy EML
throughout the community, we have promised not to make any changes that
are backwards incompatible. So we will need to think carefully about
how to do this. In the interim, you could use the flexible
"additionalMetadata" section to point a WSDL description at an
"online/connection" element, and see how far that gets you.
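An interim arrangement of that sort might look roughly like this. The
id and the embedded-WSDL placement are hypothetical; additionalMetadata
accepts arbitrary content alongside a "describes" pointer:

```xml
<online>
  <connection id="ntl-climate-query">
    <!-- named protocol and connection parameters as usual -->
  </connection>
</online>
<!-- ...elsewhere in the same document: -->
<additionalMetadata>
  <describes>ntl-climate-query</describes>
  <!-- hypothetical: embed or reference a WSDL description here -->
  <wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
    <!-- operations and messages for the query interface -->
  </wsdl:definitions>
</additionalMetadata>
```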
> 4. Data integration based on EML
>
> It provides information for each attribute existing in a dataset,
> which contains attribute name, label, definition, storage type,
> measurement scale (unit: standard unit or custom unit, precision,
> domain: number type, min value, max value), accuracy, etc.
>
> The problem is that, besides the many standard units (transformation
> among them is doable), there are custom units. How can an integration
> system understand all custom units and know how to transform among
> them?
Custom units are not free-form. They must be defined in the document
(probably in "additionalMetadata") using an STMML definition. These
STMML definitions do two main things: 1) they define a "unitType" which
is a class of units that share the same dimensionality, and 2) they
define a specific "unit" with its conversion information to get back to
the canonical SI unit for that unitType. So, although the standard
dictionary only supplies a few units for the unitType "energy", one can
create new "energy" units (e.g., BTU) that are self-describing with
respect to their base SI unit (e.g., joule). So, to map custom units to
standard units, one will need to parse and traverse the STMML logic. In
addition, one can determine the relationship between unitTypes by
examining the dimensionality of each unit. We aren't convinced that
everything we need is there, but it's a pretty good start. I'd like to
hear your opinions once you evaluate the STMML part of the system.
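For illustration, a BTU definition along these lines (attribute names
follow the STMML usage in the EML unit dictionary; verify against
eml-unitDictionary.xml, and note the multiplier is approximate):

```xml
<stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml">
  <!-- unitType "energy" has the canonical SI unit "joule" -->
  <stmml:unit id="britishThermalUnit" name="britishThermalUnit"
              unitType="energy" parentSI="joule"
              multiplierToSI="1055.06">
    <stmml:description>British thermal unit; multiply by
      multiplierToSI to obtain joules</stmml:description>
  </stmml:unit>
</stmml:unitList>
```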
> Another problem: how can an integration system correctly understand
> the attributes' meaning and their relationships? For example,
> AVG_AIR_TEMP's definition in the airport sample is "air temperature
> at 1.5m height in Gill shelter"; possibly somebody else would
> give the definition of the same attribute as "air temperature in
> Gill shelter 150 cm high", or "air temperature at 1.5m height in
> shelter each hour" in another EML file for a dataset at the same
> site or a different site. Notice there is no description of the
> AVG aggregation range in the definition for AVG_AIR_TEMP in the
> airport sample. It is hard for an application to tell whether they
> mean the same thing or not. If monthly temperature is misunderstood
> as daily and integrated with daily temperature, analysis results on
> the data would not make sense.
Yep. Attribute definitions contain a lot of the semantic information
needed to do integration. You are dealing with very simple climate
data; the problem gets far, far worse with real ecological data,
especially once taxonomy and classification problems get dragged in. We
are working on a simple system to add "semantic labels" to the
attributes, where the labels are drawn from an identified ontology.
However, this work was not mature enough to include in EML, so we are
experimenting with it in "additionalMetadata" using the reference
pointers, and hope to add it in a future EML version once the SEEK
project has worked out some of the thorny issues and found something
that works well.
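An experiment along those lines might look like the following. To be
clear, "semanticLabel" is not an EML element; this is purely a
hypothetical sketch of using additionalMetadata's reference pointers to
attach an ontology term to an attribute:

```xml
<attribute id="attr.AVG_AIR_TEMP">
  <!-- normal attribute content: name, definition, unit, etc. -->
</attribute>
<!-- ...elsewhere in the document: -->
<additionalMetadata>
  <describes>attr.AVG_AIR_TEMP</describes>
  <!-- hypothetical label drawn from an identified ontology -->
  <semanticLabel ontology="http://example.org/eco-ontology">
    MeanHourlyAirTemperature
  </semanticLabel>
</additionalMetadata>
```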
> Adding semantic mapping from an attribute to a data standard may
> be a solution. For example, if an attribute uses a custom unit,
> then how to translate it into a standard unit should be described
> in a way that an integration system can understand and provided
> within a semantic mapping element under this attribute.
Much of this is already provided by our use of STMML in EML. There are
far harder integration problems than simple unit conversion. But even
unit conversion can be hard: knowing how to convert between two units
mathematically does not mean that the conversion will produce a
meaningful result. Basically, determining the legitimacy of
the integration of two attributes depends only shallowly on the data
type and units, and much more fundamentally on the requirements of the
analysis to which the integrated data will be put. We are explicitly
dealing with this issue in SEEK.
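To make the shallow part concrete: the mechanical half of unit
conversion is simple once the STMML multipliers have been parsed. The
sketch below (Python, with a hand-built table standing in for a parsed
STMML dictionary) converts to the canonical SI unit and checks
dimensional compatibility via the shared unitType; it deliberately says
nothing about whether a conversion is analytically meaningful.

```python
# Hypothetical stand-in for a parsed STMML unit dictionary:
# unit name -> (unitType, canonical SI unit, multiplier to SI).
UNIT_TABLE = {
    "britishThermalUnit": ("energy", "joule", 1055.06),
    "calorie": ("energy", "joule", 4.184),
    "celsius": ("temperature", "kelvin", 1.0),  # offset units need an
                                                # additive term as well
}

def to_si(value, unit):
    """Convert a value in `unit` to the canonical SI unit of its unitType."""
    unit_type, si_unit, factor = UNIT_TABLE[unit]
    return value * factor, si_unit

def convertible(unit_a, unit_b):
    """Units are interconvertible only if they share a unitType
    (i.e., the same dimensionality)."""
    return UNIT_TABLE[unit_a][0] == UNIT_TABLE[unit_b][0]
```

So `to_si(1.0, "calorie")` yields joules, while `convertible` correctly
refuses to relate energy and temperature units; everything beyond this
(statistical assumptions, sampling scale) is the hard part.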
> 5. Analysis support
>
> EML has a module, eml_software, for providing general information
> that describes the software needed to view a dataset, or process
> it. This part is based on the Open Software Description (OSD). But it
> does not include interface specification that is necessary for an
> application to use the software.
>
> Possibly, we can extend this part to include input and output
> specifications, and the information necessary for accessing it. Also,
> we can map each attribute of the input and output to a standard data
> format; then it would be possible to use this software or analysis
> routine with other datasets, including integrated datasets, since
> automated translation of a dataset into the required format could be
> done with explicit mapping information.
Agreed. The software module was a placeholder. We have developed a
pipeline language for formally defining the inputs and outputs of
analyses and models as part of our Monarch project. This language is
not yet mature enough for inclusion in an EML release to the ecological
community, but the SEEK Analysis and Modeling System component will
definitely have such an extension as one of its products. I'd be happy
to review our current pipeline work with you, show you our pipeline
schemas, and review the Monarch system if it would be useful to you. We
are currently in the process of comparing our language to others like
MOML to see if there is a need for an independent language, or whether
we can adopt all or part of a language like MOML. So, basically, what
you describe is exactly what we were funded to do in the SEEK proposal.
> 6. EML related tools
>
> The ongoing projects on developing EML related tools are mainly
> focused on providing metadata editor, generating, storing,
> querying, displaying and reformatting EML documents, or manipulating
> local data or data on KNB, including Metacat, Morpho, Xanthoria,
> Xylographa. However, there is not much effort on making use of EML
> documents, or making EML useful for ecological applications. The only
> proposed application of EML so far is data quality control. E.g.,
> when a user enters data into the computer, if a value is out of
> range, the system informs them by comparing against the min and max
> described in a corresponding EML document. This is supported in any
> DBMS through constraints.
Actually, not really true. Our Monarch system is a general-purpose
analysis and modeling system that is metadata-driven. It supports
arbitrary statistical approaches to data analysis, and is extensible to
include arbitrary backend systems for executing analyses and models. We
currently are supporting SAS and Matlab, and plan to have support for R
and some simulation models in the future. We are also considering
wrapping ESRI tools. In addition, Peter McCartney's group is developing
software that analyzes data based on EML input as well, although he'll
need to provide the details of that (I think it's called Xylopia).
Finally, Wade Sheldon at GCE has been developing analytical tools that
are metadata-driven as part of his Matlab toolbox, and I believe he plans
on converting it to use EML in addition to the current FLED metadata
spec that he supports.
> 7. Further work
>
> 1) Extend it to describe more formats of data, like data in XML,
> data wrapped by web services.
Already possible. May need to provide a container for the schema, but
this still leaves the physical to logical mapping ambiguous. Web
services can be named in "online/connection". See above.
> 2) Extend it to describe data source query abilities for data
> sources maintained in DBMSs or hidden behind query interfaces, and
> provide information for automated data access.
Agreed. We'd like to experiment with the inclusion of WSDL and other
interface definitions directly in EML.
> 3) Extend it to describe interface of common analysis routines in
> the community.
Agreed. Monarch does this by allowing people to wrap analytical steps
and models in our pipeline language, and then describe an analytical
workflow as a directed graph of these analytical steps. SEEK will
continue this work.
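The "directed graph of analytical steps" idea can be sketched
independently of any particular pipeline language. Below, a
hypothetical Step type declares named inputs and outputs, and a small
scheduler orders steps so each runs only after its inputs exist; this
illustrates the concept, and is not the Monarch language itself.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """An analytical step with named data inputs and outputs."""
    name: str
    inputs: list
    outputs: list

def run_order(steps):
    """Topologically order steps: a step is runnable once every one of
    its inputs is either external or produced by an earlier step."""
    all_outputs = {o for s in steps for o in s.outputs}
    # Inputs that no step produces are treated as externally supplied.
    produced = {i for s in steps for i in s.inputs} - all_outputs
    order, pending = [], list(steps)
    while pending:
        ready = [s for s in pending if set(s.inputs) <= produced]
        if not ready:
            raise ValueError("cycle or unsatisfiable input")
        for s in ready:
            order.append(s.name)
            produced |= set(s.outputs)
            pending.remove(s)
    return order
```

For example, a step that merges climate and land-cover data will always
be scheduled before the analysis step that consumes the merged table,
regardless of the order in which the steps were declared.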
> 4) Add semantic mapping for attributes that are not in standard
> format, which describes how to transform the attribute into a
> standard format in order to really integrate datasets from different
> sides. This standard format may be represented as a XML schema, in
> which only standard terms, standard units and standard definitions
> are used. The mapping may contain two parts, one is the corresponding
> path in the data standard schema, and another is the corresponding
> conversion function. If the semantic mapping does not appear, by
> default, it is assumed that there exists direct correspondence to
> a path with its end called the same in the data standard schema.
Again, already possible. STMML provides this function for unit
conversion. See my notes below about issues with a global view. Also,
it is not clear to me that converting everything to XML is a good idea,
given the processing overhead it can involve.
> This approach requires heavy work on the data standard schema,
> but it would support automatic integration using the data standard
> schema as the global view. A customized global view could be easily
> converted from this standard global view. The sites have full
> autonomy over the representation of the real site data and metadata;
> any definitions and units they like can be used.
Unit conversion is a small part of what is needed for automated
integration. More significant are statistical assumptions (e.g.,
constant variance) and analytical restrictions (e.g., sampling scale
invariance). The SEEK project has proposed to develop an ecological
constraint language (ECOCL2) for expressing the pre- and post-
constraints of analytical steps in order to solve this problem. We
would welcome your input and participation in this effort. In addition,
the idea of a global view is not particularly realistic in ecology.
Although incredibly simple physical data like climate data might be
amenable to such an approach, the heterogeneity and subtle issues
associated with integrating more traditional ecological data would be a
significant barrier to developing a truly global view. SEEK instead
intends to allow people to dynamically create integrated views of
subsets of the data based on the pre- and post- constraints of
analytical steps. I think this is a more realistic approach.
> 5) Make EML compatible with other related metadata standards,
> like FGDC and ESML in similar domains. This will save effort on
> restructuring existing metadata in other standards and make it
> possible to use the development achievements for those standards,
> e.g., some ESML APIs for image manipulation.
Whole parts of EML are adopted from FGDC, ISO, and Dublin Core
standards, so it is quite compatible with them. We have a developer
working on developing transformation tools that map EML documents to the
NBII Biological Data Profile, which is a superset of the FGDC CSDGM.
McCartney, I think, has been working on an FGDC-to-EML converter. ESML
is an easier language to deal with, as it is such a small subset of the
metadata in EML, but a converter there would probably also be useful.
Using the ESML API tools would be a definite win. Would you like to do
this mapping, in one or both directions?
> 6) Collect and build translation functions among commonly used data
> formats, e.g., different units, different image representations
> (BIL, BIP, BSQ, etc.).
>
Sure. Seems to me like existing software already does this. Can't we
just package this up as a step in a system like Monarch by wrapping some
of the ESRI tools? I think McCartney's Xylopia system does some of this
type of thing already too, but I really don't know much about it.
> 7) Develop an integration tool based on the data standard as a
> global view, the translation functions, and extended EML. It provides
> integrated query over the distributed diverse data sources, necessary
> data transformation abilities, and ability to call local or remote
> analysis services to accomplish a task including heterogeneous data
> gathering, preprocessing and analysis.
Sounds good. That's the goal of SEEK in a nutshell. The issues I
raised above on barriers to data integration will be significant here.
> 8) Make these functionalities' APIs available to the community, including
> extracting specific information from an EML document, translating
> data between different formats, common analysis routines, for easy
> use and reuse by other applications.
Sounds good.
> 9) Demonstrate the usage of the integration tool and APIs
Sounds good.
Thanks for your in-depth analysis. I really was only able to take a
quick pass at your comments, so feel free to contact me or log onto IRC
(irc.ecoinformatics.org) on channel "#seek" if you want to discuss this
further. Once you look over some of the EML features that you
overlooked (like STMML and "online/connection"), I'd like to hear what
you think of them.
Thanks!
Matt
Jenny Wang wrote:
> Hi, Matt and Peter,
>
> How are you? Hope everything is going well with you!
>
> The attached is my notes on EML with data integration. Prof. Goguen,
> Bertram, Tony and I are trying to make use of EML for ecological
> applications involving data and model integration in the SEEK project
> context. I did not mention any good aspects of EML in the notes,
> actually only those aspects that need to be extended and the further
> effort needed to make EML useful for applications, from my point of
> view. Peter already got it from Tony at the LTER meeting last week.
> Some may be wrong, or you may not agree with some of it. Your
> insights on making EML good for data integration, and your comments
> and suggestions on my notes, are very valuable and will be greatly
> appreciated!
>
> We are thinking that we may take some small parts of EML, like the
> attribute part and the distribution part, extending them and
> demonstrating their use with a case application (e.g., integrating
> land cover indices with climate variables, using the same data as
> those used in the "Integrated Features" part of the Spatial Data
> Workbench, but taking a different approach; you can reach SDW at
> http://sdw.sdsc.edu). But your input will be extremely important to
> this, and to any EML-related effort in the future.
>
> Many thanks for your time and help! Looking forward to hearing from
> you two,
>
> Jenny
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************