[seek-dev] Re: notes on eml (Your comments and suggestions will be greatly appreciated!)
Matt Jones
jones at nceas.ucsb.edu
Mon Feb 10 13:15:57 PST 2003
Thanks, Jenny. Tony also sent this to me last week, but I had not had a
chance to reply until I got back from my travels. Thank you for
examining EML so closely -- few people seem to take the time to do so.
My replies to your comments are interspersed within your comments below.
> Jenny Wang (jwang at sdsc.edu) wrote:
> 1. Other data formats than tabular
>
> EML documents metadata for binary or ASCII files (i.e., just tabular
> data); what about metadata for XML files, HTML files, or data published
> through web services? As more applications export or exchange
> information in XML format, and even some original data are recorded
> directly in XML files, their DTDs or XML Schemas are very important
> information that needs to be documented and provided to users.
Actually, EML supports a variety of entity types, including tabular
data, spatial vector and raster data, and others. This is modularly
implemented, and so is extensible. For details, examine all of the EML
schema ComplexTypes that reference EntityGroup. In addition, for any
given type of entity, for example tabular data, there is a separation
between the logical model (e.g., what attributes are present), and the
physical model (e.g., how the table is serialized to disk). So, for
example, one can have a table serialized in CSV format described in
the physical module using "textFormat", and the same table serialized in
XML format and described in the physical module using
"externallyDefinedFormat". One could argue that adding a container to
place the actual schema/dtd in externallyDefinedFormat would be a good
idea. However, even with such a schema, the exact mapping between the
physical format and the logical model is still ambiguous, and that's a
hard one to solve. Our "textFormat" description defines the mapping to
the logical structure, but I'm not sure how this would work for XML.
Needless to say, I think it would be important to be able to have full
XML serialization support for entities in EML.
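As a sketch of that logical/physical separation, the same logical table
might carry two alternative physical descriptions. Element names below
follow EML 2.x conventions but this is an illustration, not a normative
fragment; verify against the eml-physical schema:

```xml
<!-- One logical entity, two physical serializations (illustrative only) -->
<physical>
  <objectName>airtemp.csv</objectName>
  <dataFormat>
    <textFormat>
      <numHeaderLines>1</numHeaderLines>
      <attributeOrientation>column</attributeOrientation>
      <simpleDelimited>
        <fieldDelimiter>,</fieldDelimiter>
      </simpleDelimited>
    </textFormat>
  </dataFormat>
</physical>
<!-- ...versus the same table serialized as XML: -->
<physical>
  <objectName>airtemp.xml</objectName>
  <dataFormat>
    <externallyDefinedFormat>
      <formatName>XML (site-defined schema)</formatName>
    </externallyDefinedFormat>
  </dataFormat>
</physical>
```

Note that the textFormat description fully defines the mapping to the
logical attributes, while the externallyDefinedFormat leaves that
mapping unspecified, which is exactly the ambiguity discussed above.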
> 2. Data source query ability
>
> EML describes some metadata of relational databases and datasets
> behind applications like JSPs. But it only describes the information
> about the data, no query abilities on a data source provided. For
> example, some site may have data in Oracle tables, but only queries
> over some views are allowable. Then the users or applications would
> not know how to properly query the data source. Another example,
> NTL (North Temperate Lake) site exports its data through JSPs. In
> the sample EML document for a climate dataset collected at Noble
> F. Lee Municipal Airport, descriptions about the attributes are
> provided, and the URL where the JSPs are located is given too. But
> no information about the flow of the JSPs is given. Without a human
> user reading and understanding the interface, an application cannot
> know how to translate a query into an interface-specific query
> (filling in the forms correctly), since there is no description in
> the EML file about accessing the data through this interface.
Querying on custom interfaces is a difficult thing to describe
generally, so we decided to make a simple distinction between simple
services where the entire query can be expressed as a URL, and more
complex systems where more detailed knowledge of the application is
required. In EML, these more detailed applications are described in the
physical distribution under "online/connection". As the semantics of the
application are difficult to generalize, we decided that we would
basically allow one to name an application and provide the connection
parameters needed to query using that interface. The application
protocol is described in natural language, but a machine must
recognize it by name in order to really make use of it. This is
severely lacking for many cases. We had extensive discussions about the
use of languages like WSDL in place of our current connection
descriptors, but the general feeling among the EML developers was that
WSDL was too immature, and its future uncertain. So, we decided to wait
on a full implementation of this functionality until we were able to
experiment with it further. Your input into what is needed would be
welcome.
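For concreteness, a connection description along these lines might look
like the following. The scheme name and parameters are hypothetical,
and the exact element structure should be checked against the EML
physical module:

```xml
<online>
  <connection>
    <connectionDefinition>
      <schemeName>jdbc-oracle</schemeName>
      <description>JDBC connection to the site's queryable Oracle
        views (scheme name is hypothetical)</description>
      <parameterDefinition>
        <name>hostName</name>
        <definition>DNS name of the database server</definition>
      </parameterDefinition>
    </connectionDefinition>
    <parameter>
      <name>hostName</name>
      <value>db.example.edu</value>
    </parameter>
  </connection>
</online>
```

As noted above, the description is natural language only; a client that
does not recognize the scheme name can do nothing useful with it.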
> 3. Automated data access support for applications
>
> EML does not seem to provide enough information to help applications
> directly and automatically access the data.
>
> In the physical module, the distribution element provides a URL as
> the only information for a DBMS or other system that holds the data.
> Without further information about what DBMS it is, what version,
> etc., how can an application set up the right connection (e.g., what
> driver to use) to the system?
Use "online/connection" instead of "online/url". See my notes on your
point (2) as well, as they apply.
> Again taking the sample EML document for Noble F. Lee Municipal
> Airport from the NTL site as an example, in the physical module the
> distribution part gives only the URL where the query interface
> through JSPs is running; lots of HTML forms (for example, which
> fields to retrieve, time boundaries, output format) need to be
> filled in before retrieving the data.
That the NTL site decided to provide only a URL that doesn't directly
access the data is their decision. They had and have the opportunity to
provide more detailed connection information through
"online/connection", and they opted not to.
> If EML were only meant to help users interpret the data properly,
> then it need not be in XML format; regular text files would suffice,
> as those sites or ClimDB provide in the introduction or description
> parts of their web sites. If it is going to support automated data
> integration, automated data access is necessary first. E.g.,
> providing connection and query abilities for DBMSs, and attaching
> calling messages and parameters for JSPs and web services, seem
> needed.
Agreed. I argued pretty hard for including WSDL descriptions in EML,
but didn't get much support back then. We switched to our current
"online/connection" as a compromise. I think this area still needs lots
of work. However, as we are trying to get people to deploy EML
throughout the community, we have promised not to make any changes that
are backwards incompatible. So we will need to think carefully about
how to do this. In the interim, you could use the flexible
"additionalMetadata" section to point a WSDL description at an
"online/connection" element, and see how far that gets you.
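An interim arrangement of that sort might look roughly like this. The
id and the embedded-WSDL placement are hypothetical; additionalMetadata
accepts arbitrary content alongside a "describes" pointer:

```xml
<online>
  <connection id="ntl-climate-query">
    <!-- named protocol and connection parameters as usual -->
  </connection>
</online>
<!-- ...elsewhere in the same document: -->
<additionalMetadata>
  <describes>ntl-climate-query</describes>
  <!-- hypothetical: embed or reference a WSDL description here -->
  <wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
    <!-- operations and messages for the query interface -->
  </wsdl:definitions>
</additionalMetadata>
```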
> 4. Data integration based on EML
>
> It provides information for each attribute existing in a dataset,
> which contains attribute name, label, definition, storage type,
> measurement scale (unit: standard unit or custom unit, precision,
> domain: number type, min value, max value), accuracy, etc.
>
> The problem is that, besides the many standard units (transformation
> among them is doable), there are custom units. How can an integration
> system understand all custom units and know how to transform among
> them?
Custom units are not free-form. They must be defined in the document
(probably in "additionalMetadata") using an STMML definition. These
STMML definitions do two main things: 1) they define a "unitType" which
is a class of units that share the same dimensionality, and 2) they
define a specific "unit" with its conversion information to get back to
the canonical SI unit for that unitType. So, although the standard
dictionary only supplies a few units for the unitType "energy", one can
create new "energy" units (e.g., BTU) that are self-describing with
respect to their base SI unit (e.g., joule). So, to map custom units to
standard units, one will need to parse and traverse the STMML logic. In
addition, one can determine the relationship between unitTypes by
examining the dimensionality of each unit. We aren't convinced that
everything we need is there, but it's a pretty good start. I'd like to
hear your opinions once you evaluate the STMML part of the system.
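For illustration, a BTU definition along these lines (attribute names
follow the STMML usage in the EML unit dictionary; verify against
eml-unitDictionary.xml, and note the multiplier is approximate):

```xml
<stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml">
  <!-- unitType "energy" has the canonical SI unit "joule" -->
  <stmml:unit id="britishThermalUnit" name="britishThermalUnit"
              unitType="energy" parentSI="joule"
              multiplierToSI="1055.06">
    <stmml:description>British thermal unit; multiply by
      multiplierToSI to obtain joules</stmml:description>
  </stmml:unit>
</stmml:unitList>
```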
> Another problem: how can an integration system correctly understand
> the attributes' meaning and their relationships? For example,
> AVG_AIR_TEMP's definition in the airport sample is "air temperature
> at 1.5m height in Gill shelter"; possibly somebody else would
> give the definition of the same attribute as "air temperature in
> Gill shelter 150 cm high", or "air temperature at 1.5m height in
> shelter each hour" in another EML file for a dataset at the same
> site or a different site. Notice there is no description of the
> AVG aggregation range in the definition for AVG_AIR_TEMP in the
> airport sample. It is hard for an application to tell whether they
> mean the same thing or not. If monthly temperature is misunderstood
> as daily and integrated with daily temperature, analysis results on
> the data would not make sense.
Yep. Attribute definitions contain a lot of the semantic information
needed to do integration. You are dealing with very simple climate
data; the problem gets far, far worse with real ecological data,
especially once taxonomy and classification problems get dragged in. We
are working on a simple system to add "semantic labels" to the
attributes, where the labels are drawn from an identified ontology.
However, this work was not mature enough to include in EML, so we are
experimenting with it in "additionalMetadata" using the reference
pointers, and hope to add it in a future EML version once the SEEK
project has worked out some of the thorny issues and found something
that works well.
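An experiment along those lines might look like the following. To be
clear, "semanticLabel" is not an EML element; this is purely a
hypothetical sketch of using additionalMetadata's reference pointers to
attach an ontology term to an attribute:

```xml
<attribute id="attr.AVG_AIR_TEMP">
  <!-- normal attribute content: name, definition, unit, etc. -->
</attribute>
<!-- ...elsewhere in the document: -->
<additionalMetadata>
  <describes>attr.AVG_AIR_TEMP</describes>
  <!-- hypothetical label drawn from an identified ontology -->
  <semanticLabel ontology="http://example.org/eco-ontology">
    MeanHourlyAirTemperature
  </semanticLabel>
</additionalMetadata>
```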
> Adding semantic mapping from an attribute to a data standard may
> be a solution. For example, if an attribute uses a custom unit,
> then how to translate it into a standard unit should be described
> in a way that an integration system can understand and provided
> within a semantic mapping element under this attribute.
Much of this is already provided by our use of STMML in EML. There are
far harder integration problems than simple unit conversion. But even
unit conversion can be hard: knowing how to convert between two units
mathematically does not mean that the conversion will produce a
meaningful result. Basically, determining the legitimacy of
the integration of two attributes depends only shallowly on the data
type and units, and much more fundamentally on the requirements of the
analysis to which the integrated data will be put. We are explicitly
dealing with this issue in SEEK.
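To make the shallow part concrete: the mechanical half of unit
conversion is simple once the STMML multipliers have been parsed. The
sketch below (Python, with a hand-built table standing in for a parsed
STMML dictionary) converts to the canonical SI unit and checks
dimensional compatibility via the shared unitType; it deliberately says
nothing about whether a conversion is analytically meaningful.

```python
# Hypothetical stand-in for a parsed STMML unit dictionary:
# unit name -> (unitType, canonical SI unit, multiplier to SI).
UNIT_TABLE = {
    "britishThermalUnit": ("energy", "joule", 1055.06),
    "calorie": ("energy", "joule", 4.184),
    "celsius": ("temperature", "kelvin", 1.0),  # offset units need an
                                                # additive term as well
}

def to_si(value, unit):
    """Convert a value in `unit` to the canonical SI unit of its unitType."""
    unit_type, si_unit, factor = UNIT_TABLE[unit]
    return value * factor, si_unit

def convertible(unit_a, unit_b):
    """Units are interconvertible only if they share a unitType
    (i.e., the same dimensionality)."""
    return UNIT_TABLE[unit_a][0] == UNIT_TABLE[unit_b][0]
```

So `to_si(1.0, "calorie")` yields joules, while `convertible` correctly
refuses to relate energy and temperature units; everything beyond this
(statistical assumptions, sampling scale) is the hard part.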
> 5. Analysis support
>
> EML has a module, eml_software, for providing general information
> that describes the software needed to view a dataset, or process
> it. This part is based on the Open Software Description (OSD). But it
> does not include interface specification that is necessary for an
> application to use the software.
>
> Possibly, we can extend this part to include input and output
> specifications, and the information necessary for accessing it. Also,
> we can map each attribute of the input and output to a standard data
> format; then it would be possible to use this software or analysis
> routine with other datasets, including integrated datasets, since
> automated translation of a dataset into the required format could be
> done with explicit mapping information.
Agreed. The software module was a placeholder. We have developed a
pipeline language for formally defining the inputs and outputs of
analyses and models as part of our Monarch project. This language is
not yet mature enough for inclusion in an EML release to the ecological
community, but the SEEK Analysis and Modeling System component will
definitely have such an extension as one of its products. I'd be happy
to review our current pipeline work with you, show you our pipeline
schemas, and review the Monarch system if it would be useful to you. We
are currently in the process of comparing our language to others like
MOML to see if there is a need for an independent language, or whether
we can adopt all or part of a language like MOML. So, basically, what
you describe is exactly what we were funded to do in the SEEK proposal.
> 6. EML related tools
>
> The ongoing projects on developing EML related tools are mainly
> focused on providing metadata editor, generating, storing,
> querying, displaying and reformatting EML documents, or manipulating
> local data or data on KNB, including Metacat, Morpho, Xanthoria,
> Xylographa. However, there is not much effort on making use of EML
> documents, or making EML useful for ecological applications. The only
> proposed application of EML so far is data quality control. E.g.,
> when a user enters data into the computer, if a value is out of
> range, the system informs them by comparing against the min and max
> described in a corresponding EML document. This is supported in any
> DBMS through constraints.
Actually, not really true. Our Monarch system is a general-purpose
analysis and modeling system that is metadata-driven. It supports
arbitrary statistical approaches to data analysis, and is extensible to
include arbitrary backend systems for executing analyses and models. We
currently are supporting SAS and Matlab, and plan to have support for R
and some simulation models in the future. We are also considering
wrapping ESRI tools. In addition, Peter McCartney's group is developing
software that analyzes data based on EML input as well, although he'll
need to provide the details of that (I think it's called Xylopia).
Finally, Wade Sheldon at GCE has been developing analytical tools that
are metadata-driven as part of his Matlab toolbox, and I believe he plans
on converting it to use EML in addition to the current FLED metadata
spec that he supports.
> 7. Further work
>
> 1) Extend it to describe more formats of data, like data in XML,
> data wrapped by web services.
Already possible. May need to provide a container for the schema, but
this still leaves the physical to logical mapping ambiguous. Web
services can be named in "online/connection". See above.
> 2) Extend it to describe data source query abilities for data
> sources maintained in DBMSs or hidden behind query interfaces, and
> provide information for automated data access.
Agreed. We'd like to experiment with the inclusion of WSDL and other
interface definitions directly in EML.
> 3) Extend it to describe interface of common analysis routines in
> the community.
Agreed. Monarch does this by allowing people to wrap analytical steps
and models in our pipeline language, and then describe an analytical
workflow as a directed graph of these analytical steps. SEEK will
continue this work.
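The "directed graph of analytical steps" idea can be sketched
independently of any particular pipeline language. Below, a
hypothetical Step type declares named inputs and outputs, and a small
scheduler orders steps so each runs only after its inputs exist; this
illustrates the concept, and is not the Monarch language itself.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """An analytical step with named data inputs and outputs."""
    name: str
    inputs: list
    outputs: list

def run_order(steps):
    """Topologically order steps: a step is runnable once every one of
    its inputs is either external or produced by an earlier step."""
    all_outputs = {o for s in steps for o in s.outputs}
    # Inputs that no step produces are treated as externally supplied.
    produced = {i for s in steps for i in s.inputs} - all_outputs
    order, pending = [], list(steps)
    while pending:
        ready = [s for s in pending if set(s.inputs) <= produced]
        if not ready:
            raise ValueError("cycle or unsatisfiable input")
        for s in ready:
            order.append(s.name)
            produced |= set(s.outputs)
            pending.remove(s)
    return order
```

For example, a step that merges climate and land-cover data will always
be scheduled before the analysis step that consumes the merged table,
regardless of the order in which the steps were declared.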
> 4) Add semantic mapping for attributes that are not in standard
> format, which describes how to transform the attribute into a
> standard format in order to really integrate datasets from different
> sides. This standard format may be represented as a XML schema, in
> which only standard terms, standard units and standard definitions
> are used. The mapping may contain two parts, one is the corresponding
> path in the data standard schema, and another is the corresponding
> conversion function. If the semantic mapping does not appear, by
> default, it is assumed that there exists direct correspondence to
> a path with its end called the same in the data standard schema.
Again, already possible. STMML provides this function for unit
conversion. See my notes below about issues with a global view. Also,
it is not clear to me that converting everything to XML is a good idea,
given the processing overhead it can involve.
> This approach requires heavy work on the data standard schema,
> but it would support automatic integration using the data standard
> schema as the global view. A customized global view could be easily
> converted from this standard global view. The sites have full
> autonomy over the representation of the real site data and metadata;
> any definitions and units they like can be used.
Unit conversion is a small part of what is needed for automated
integration. More significant are statistical assumptions (e.g.,
constant variance) and analytical restrictions (e.g., sampling scale
invariance). The SEEK project has proposed to develop an ecological
constraint language (ECOCL2) for expressing the pre- and post-
constraints of analytical steps in order to solve this problem. We
would welcome your input and participation in this effort. In addition,
the idea of a global view is not particularly realistic in ecology.
Although incredibly simple physical data like climate data might be
amenable to such an approach, the heterogeneity and subtle issues
associated with integrating more traditional ecological data would be a
significant barrier to developing a truly global view. SEEK instead
intends to allow people to dynamically create integrated views of
subsets of the data based on the pre- and post- constraints of
analytical steps. I think this is a more realistic approach.
> 5) Make EML compatible with other related metadata standards,
> like FGDC and ESML in similar domains. This will save effort on
> restructuring existing metadata in other standards and make it
> possible to use the development achievements for those standards,
> e.g., some ESML APIs for image manipulation.
Whole parts of EML are adopted from FGDC, ISO, and Dublin Core
standards, so it is quite compatible with them. We have a developer
working on developing transformation tools that map EML documents to the
NBII Biological Data Profile, which is a superset of the FGDC CSDGM.
McCartney, I think, has been working on an FGDC-to-EML converter. ESML
is an easier language to deal with, as it is such a small subset of the
metadata in EML, but a converter there would probably also be useful.
Using the ESML API tools would be a definite win. Would you like to do
this mapping, in one or both directions?
> 6) Collect and build translation functions among commonly used data
> formats, e.g., different units, different image representations
> (BIL, BIP, BSQ, etc.).
>
Sure. Seems to me like existing software already does this. Can't we
just package this up as a step in a system like Monarch by wrapping some
of the ESRI tools? I think McCartney's Xylopia system does some of this
type of thing already too, but I really don't know much about it.
> 7) Develop an integration tool based on the data standard as a
> global view, the translation functions, and extended EML. It provides
> integrated query over the distributed diverse data sources, necessary
> data transformation abilities, and ability to call local or remote
> analysis services to accomplish a task including heterogeneous data
> gathering, preprocessing and analysis.
Sounds good. That's the goal of SEEK in a nutshell. The issues I
raised above on barriers to data integration will be significant here.
> 8) Make these functionalities' APIs available to the community, including
> extracting specific information from an EML document, translating
> data between different formats, common analysis routines, for easy
> use and reuse by other applications.
Sounds good.
> 9) Demonstrate the usage of the integration tool and APIs
Sounds good.
Thanks for your in-depth analysis. I really was only able to take a
quick pass at your comments, so feel free to contact me or log onto IRC
(irc.ecoinformatics.org) on channel "#seek" if you want to discuss this
further. Once you look over some of the EML features that you
overlooked (like STMML and "online/connection"), I'd like to hear what
you think of them.
Thanks!
Matt
Jenny Wang wrote:
> Hi, Matt and Peter,
>
> How are you? Hope everything is going well with you!
>
> The attached is my notes on EML with data integration. Prof. Goguen,
> Bertram, Tony and I are trying to make use of EML for ecological
> applications involving data and model integration in the SEEK project
> context. I did not mention any good aspects of EML in the notes,
> actually only those aspects that need to be extended and the further
> effort needed to make EML useful for applications, from my point of
> view. Peter already got it from Tony at the LTER meeting last week.
> Some may be wrong, or you may not agree with some of it. Your
> insights on making EML good for data integration, and your comments
> and suggestions on my notes, are very valuable and will be greatly
> appreciated!
>
> We are thinking that we may take some small parts of EML, like the
> attribute part and the distribution part, extending them and
> demonstrating their use with a case application (e.g., integrating
> land cover indices with climate variables, using the same data as
> those used in the "Integrated Features" part of the Spatial Data
> Workbench, but taking a different approach; you can reach SDW at
> http://sdw.sdsc.edu). But your input will be extremely important to
> this, and to any EML-related effort in the future.
>
> Many thanks for your time and help! Looking forward to hearing from
> you two,
>
> Jenny
--
*******************************************************************
Matt Jones jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/ Fax: 425-920-2439 Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
Interested in ecological informatics? http://www.ecoinformatics.org
*******************************************************************