[seek-kr-sms] more notes from wednesday

Bertram Ludaescher ludaesch at sdsc.edu
Fri Sep 3 06:37:29 PDT 2004


Seems like VLDB is starting to be fun after all ;-)

Thanks for sharing your notes!

(and making them googleable too ;-)

Bertram

Shawn Bowers writes:
 > 
 > Just a note ... bertram you might get a kick out of the last paper
 > 
 > shawn
 > 
 > ---------------
 > 
 > Supporting semantics-based semantic matching in RDBMS
 > -----------------------------------------------------
 > NOTE: This was part of the industrial session.
 > 
 >   Example: Restaurant guide application
 > 
 >   "ONT_EXPAND" operator: returns a set of rows, each filled in
 >   from the ontology
 > 
 >   Retrieves a subset of the ontology indirectly
 > 
 >   ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
 > 
 >   ONT_RELATED operator
 > 
 >     takes term1, reltype, term2, ontoname; returns integer
 > 
 >     if term1 and term2 are related via reltype, returns 1, otherwise 0
 > 
 >     reltype can be 'isa', or a boolean expr. like 'isa or eqv'
 > 
 >   ONT_DISTANCE operator: takes a label, returns the length of the
 >   shortest path
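 >     A rough Python sketch of the shortest-path semantics (the toy
 >     'isa' edges below are mine, not the paper's):

```python
from collections import deque

# Toy 'isa' edges for a hypothetical cuisine ontology (illustrative only).
ISA = {
    "tex-mex": ["mexican"],
    "mexican": ["latin american"],
    "latin american": ["western"],
    "italian": ["western"],
}

def ont_distance(term1, term2):
    """Length of the shortest 'isa' path from term1 to term2, or -1."""
    queue = deque([(term1, 0)])
    seen = {term1}
    while queue:
        term, dist = queue.popleft()
        if term == term2:
            return dist
        for parent in ISA.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return -1

print(ont_distance("tex-mex", "western"))  # 3
```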
 > 
 >     select * from served_food where ONT_RELATED( 'cuisine', 'isa',
 >       'western', ... )
 > 
 >     query ontology directly to find relationships
 > 
 >   ont_expand: takes term1, reltype, term2, ontoname, returns a table
 >   of <term1, property, term2, distance, path> tuples
 > 
 >      nulls represent wildcards (ugh)
 > 
 >   select * from table( ont_expand( null, 'isa', 'latin american',
 >     'cuisine_ontology' ) )
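 >   A Python analogue of the wildcard call above: walk the transitive
 >   'isa' chains and emit <term1, property, term2, distance> rows (toy
 >   ontology, mine, not the paper's):

```python
# Hypothetical single-parent 'isa' chains (illustrative only).
ISA = {
    "tex-mex": "mexican",
    "mexican": "latin american",
    "cuban": "latin american",
    "latin american": "western",
}

def ont_expand(term1, reltype, term2):
    """Return (term1, reltype, term2, distance) rows; None is a wildcard."""
    rows = []
    for start in ISA:
        node, dist = start, 0
        while node in ISA:                 # follow the transitive chain
            node = ISA[node]
            dist += 1
            if node == term2 and (term1 is None or term1 == start):
                rows.append((start, reltype, term2, dist))
    return rows

for row in ont_expand(None, "isa", "latin american"):
    print(row)
```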
 > 
 >   implementation
 > 
 >     simple rels handled by table lookup
 > 
 >     transitive rels handled using Oracle's CONNECT BY, which
 >     allows traversing hierarchical data stored in tables
 > 
 >     symmetric relationships handled as part of the above steps
 > 
 >     more complex OWL (DL and Lite) inferencing is handled by an
 >     initial preprocessing phase done as part of loading the OWL
 >     ontology
 > 
 >       NOTE: I was just reading in the OWL API paper why this has
 >       generally been considered a bad idea, mainly due to editing /
 >       updating the ontology (which is of course inevitable)
 > 
 >     speeding up these queries
 > 
 >       an "extensible" indexing type.  precompute transitive closures
 > 
 >       looks like a regular create index statement, but add parameters
 >       that you want to index (ontology=cuisine_ontology), e.g.
 > 
 >       index table: (cuisine, row_id)
 > 
 >   performance results
 > 
 >     oracle 9i, lots of terms (he went too fast through that slide...)
 > 
 >     performance is linear for ont_expand
 > 
 >   applications for semantic matching
 > 
 >     homeland security application
 > 
 >       a person x has rented a truck, and a person y has bought
 >       fertilizers and x and y stay at the same address
 > 
 >       select * from Activity_x, Activity_y
 >       where x.action = 'rent' and y.action = 'buy' and ...
 >       ONT_RELATED( x.object, 'isa', ... )
 > 
 >     a life science example
 > 
 >       ... see paper
 > 
 > 
 >   conclusions
 > 
 >     ontologies must be stored in databases for efficient maintenance,
 >     sharing, and scalability
 > 
 >     equally important is the capability of ontology-based semantic
 >     matching
 > 
 >     future work: address ontology merging and supporting defined rules
 > 
 >   NOTE: this is exactly the kind of integration GEON is doing ...
 > 
 > Similarity Search for Web Services
 > ----------------------------------
 > Xin Luna Dong (UW)
 > 
 >   This is "Woogle"
 > 
 >   Web service search
 > 
 >     first-generation web-service search engines do keyword search on
 >     web-service descriptions (WSDL?)
 > 
 >       binding point, web central, salcentral, web service of the day,
 >       remote methods, etc.
 > 
 >       they say this isn't enough
 > 
 >     doesn't capture enough underlying semantics
 > 
 >     does not accurately specify user's information needs
 > 
 >     also, they return only the description of a whole service, not
 >     the individual operations
 > 
 >       need to "drill" down
 > 
 >   how to improve this?
 > 
 >     offer more flexibility by supporting similarity search
 > 
 >     via semantics (not just keywords)
 > 
 >   example
 > 
 >     four services, all take zip code (but different outputs)
 > 
 >     some outputs are inputs to others
 > 
 >   searching with woogle
 > 
 >     returns list of operations with similar functionalities, a list
 >     with similar inputs, a list with similar outputs, a list of those
 >     that can be composed
 > 
 >   elementary problems
 > 
 >     operation matching: given a web-service operation, return a list
 >     of similar operations
 > 
 >     input/output matching: given the i/o of a web-service op, return a
 >     list of web-service ops with similar i/os
 > 
 >   can we apply previous work?
 > 
 >     software comp. matching: requires knowledge of the implementation;
 >     we only know the interface
 > 
 >     schema matching: similarity on diff. granularity; web-services are
 >     more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
 >     this work? this doesn't make sense)
 > 
 >     text-document matching: nope, this won't work either, they say:
 >     only a couple of sentences, etc., and web services are structured
 >     (NOTE: so why not the previous...)
 > 
 >     NOTE: this is a little hard to follow; their reasoning here
 > 
 >   Approach:
 > 
 >     Part 1: Exploit structure (wsdl, op name + description, i/o
 >     param names, etc.)
 > 
 >       cluster i/o param names ...
 > 
 >   Clustering parameter names
 > 
 >     heuristic: parameter terms tend to express the same concept if
 >     they occur together often
 > 
 >     strategy: cluster parameter terms into concepts based on their
 >     co-occurrence
 > 
 >       given terms p and q, similarity from p to q:
 > 
 >         sim(p->q) = P(q|p)
 > 
 >       ideal clustering: high cohesion and low correlation; cohesion
 >       measures intra-cluster term similarity, correlation measures
 >       inter-cluster term similarity; cohesion/corr score = avg(coh) /
 >       avg(corr)
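 >       The similarity and cohesion measures might be computed along
 >       these lines (the parameter-name sets are invented for
 >       illustration):

```python
from itertools import combinations

# Hypothetical parameter-name sets, one per web-service operation.
OPERATIONS = [
    {"city", "state", "zip"},
    {"city", "state", "temperature"},
    {"zip", "temperature"},
]

def sim(p, q):
    """sim(p -> q) = P(q | p): fraction of p's occurrences where q co-occurs."""
    with_p = [op for op in OPERATIONS if p in op]
    if not with_p:
        return 0.0
    return sum(q in op for op in with_p) / len(with_p)

def cohesion(cluster):
    """Average pairwise (directed) similarity inside a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(sim(p, q) + sim(q, p) for p, q in pairs) / (2 * len(pairs))

print(sim("city", "state"), cohesion({"city", "state"}))
```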
 > 
 >     algorithm: a series of refinements of the classic agglomerative
 >     clustering; basic agglomerative clustering: merge clusters ...
 > 
 >     cohesion condition: each term in the result cluster is close to
 >     most (half) of the other terms in the cluster. Refined algo: merge
 >     clusters
 > 
 >     a bunch of examples ... sound pretty good, but it is not apparent
 >     how what they are doing actually solves these examples (probably
 >     have to read the paper)
 >     (WARNING: I am not a big fan of these kinds of "soft" approaches
 >     for schema matching; so my comments may be a bit biased, but I
 >     have found these approaches have big "wow" factors, so I am in the
 >     minority a bit)
 > 
 >   measure top-k precision
 > 
 >     benchmark: 25 web-service ops; from several domains; with
 >     diff. input/output, etc.
 > 
 >     top-k precision for op matching is 85% (quoted as "pretty high").
 > 
 >     ignoring structure or just text matching, results "not that good"
 > 
 >     (WARNING: those statistically inclined should look away now)
 > 
 >   measure precision/recall
 > 
 >     8 web-services and 15 in/out
 > 
 >     6 domains, diff. popularity, etc.
 > 
 >     woogle (compared to only description or structure) is better
 > 
 > 
 >   conclusions
 > 
 >     defined primitives for web-service search
 > 
 >     algorithms for sim. search on web-service ops
 > 
 >     etc.
 > 
 >   on-going work
 > 
 >     template search
 > 
 >     specify input and output and brief description of operations
 > 
 >     (I want the input to be city state, output to be weather) -- isn't
 >     this another "AI complete" problem? They claim it is pretty close
 >     ...
 > 
 >     they claim what they have is better than google, since you can ask
 >     questions like "what is the temperature in seattle" -- since
 >     presumably they can compose and run the services
 > 
 > 
 > Data sharing through query translation in autonomous sources
 > ------------------------------------------------------------
 > A. Kementsietsidis
 > 
 >   A network of autonomous sources
 > 
 >   each can have: its own data types (genes, proteins, articles); its
 >   own schema; its own vocabulary (gene names, prot. names, etc.); its
 >   own set of data-sharing sources (a subset of the sources)
 > 
 >   treat each source as an access
 > 
 >   requirements: share data without a common schema and where schemas
 >   haven't been reconciled; translate and propagate queries between
 >   sources
 > 
 >   use mapping tables (sigmod last year)
 > 
 >     associates values between sources; can be multiple tables;
 >     associate values within or across domains; recorded associations
 >     are m:n in general; they support variables to express associations
 >     compactly
 >     want to use mapping tables in structured query translation
 > 
 >   query lang: SPJ, no negation; selection is a disjunction or
 >   conjunction of atoms
 > 
 >   characterize translations: given q and q' and a mapping table m, we
 >   define what it means for a translation to be sound and complete
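 >   A minimal sketch of what a value-level translation could look like:
 >   rewrite a selection over the source's vocabulary into one over the
 >   values the mapping table associates with it (the gene/protein names
 >   below are made up):

```python
# Hypothetical m:n mapping table associating gene names to protein names.
MAPPING = [
    ("g1", "p1"),
    ("g1", "p2"),
    ("g2", "p3"),
]

def translate_selection(values):
    """Translate 'gene IN values' at the source into a target condition."""
    targets = sorted({t for s, t in MAPPING if s in values})
    return ("protein", targets)

attr, vals = translate_selection({"g1"})
print(f"{attr} IN {vals}")  # protein IN ['p1', 'p2']
```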
 > 
 >   compute translations: propose efficient algorithms to compute
 >   translations
 > 
 >   with multiple mapping tables, decide which mapping tables to use
 >   during translation. solution: let the query guide the selection
 >   process
 > 
 >   our algorithms manipulate queries and mapping tables. it proves
 >   beneficial to have a uniform representation for both
 >   constructs. solution: represent queries as T-queries (tableau-like
 >   rep. of queries that uses the same formalism with mapping tables)
 > 
 >   reduce run-time execution costs
 > 
 >   solution: reuse previous translations
 > 
 >   seems pretty simple...
 > 
 > 
 > 
 > Linear Road: A stream data management benchmark
 > -----------------------------------------------
 > 
 >   contribs
 > 
 >     novel sdms benchmark, some implementations
 > 
 >   inspiration
 > 
 >     variable tolling
 > 
 >       cars have gps, emit "position reports"
 > 
 >       tolls = f(speed, congestion, accidents, ...)
 > 
 >   linear road
 > 
 >     every car emits a "position report" every 30 sec
 > 
 >     set of 5 continuous, historical queries
 > 
 >     each w/ max response time constraint
 > 
 >     l-rating: number of expressways supported (1000 PRs/sec each)
 > 
 >   challenges/solution
 > 
 >     semantically valid input (streams should be meaningful
 >     time-series)
 > 
 >       input generated by traffic simulator (MITSim)
 > 
 >     proper performance metrics (not "completion time" but "response
 >     time")
 > 
 >       L-rating: supported load while meeting requirements
 > 
 >     many correct results possible (answer can depend on evolving
 >     historical state)
 > 
 >       validation tools
 > 
 >     no query language standard (expressed in a high-level lang)
 > 
 >   caveat
 > 
 >     a stress test, not a real traffic model
 > 
 >     simplistic assumptions: global clock, no position interpolation,
 >     etc.
 > 
 >   backdrop
 > 
 >     linear city: 10 horizontal expressways, each 100 miles long,
 >     divided into segments
 > 
 >     3 lanes, 1 entrance and exit ramp
 > 
 >   input
 > 
 >     simulator produces 3 hrs of traffic data
 > 
 >       each car: one position report every 30 seconds
 > 
 >     data consists of: position reports (car, time, position data)
 >     piggyback ...
 > 
 > 
 >   toll alert query
 > 
 >     car crosses into new segment
 > 
 >     charge toll for seg #i; calculate toll for seg #i+1 based on
 >     previous-minute stats: avg speed, # of cars, accidents nearby; 5
 >     sec. response time
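 >     The toll rule could be mocked up like this (the formula and
 >     constants are invented; the benchmark defines its own):

```python
def segment_toll(avg_speed, num_cars, accident_nearby):
    """Toy variable toll: free-flowing or accident-adjacent segments are free."""
    if accident_nearby or avg_speed >= 40 or num_cars <= 50:
        return 0.0
    # Hypothetical congestion pricing, growing with the number of cars.
    return 0.02 * (num_cars - 50) ** 2

print(segment_toll(avg_speed=30, num_cars=60, accident_nearby=False))  # 2.0
```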
 > 
 >   accident alert query
 > 
 >     car emits 4 reports from same position -> "stopped"
 > 
 >     two cars stopped at same position -> "accident"
 > 
 >     alert all vehicles entering a segment ...
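 >     The stopped/accident rule, sketched over a stream of position
 >     reports (my own data structures, not the benchmark's):

```python
from collections import defaultdict, deque

REPORTS_TO_STOP = 4  # same position 4 times in a row -> "stopped"

history = defaultdict(lambda: deque(maxlen=REPORTS_TO_STOP))  # car -> positions
stopped_at = defaultdict(set)                                 # position -> cars

def on_position_report(car, position):
    """Feed one report; return the position if it now hosts an accident."""
    hist = history[car]
    hist.append(position)
    if len(hist) == REPORTS_TO_STOP and len(set(hist)) == 1:
        stopped_at[position].add(car)
        if len(stopped_at[position]) >= 2:    # two stopped cars -> accident
            return position
    return None

for _ in range(4):
    on_position_report("car1", 1000)
    result = on_position_report("car2", 1000)
print(result)  # 1000
```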
 > 
 >   historical query
 > 
 >     account balance (5 sec response)
 > 
 >       what do i owe? more than 1 correct result (depends on when
 >       answered)
 > 
 >     daily expenditure (10 sec response)
 > 
 >       what did i spend on xway x on day d? day d any day in last 10
 >       weeks
 > 
 >     travel time/toll estimate (30 sec. response)
 > 
 >       predict travel from p1 to p2 on xway x, day d, time t. use last
 >       10 weeks for prediction
 > 
 >   implementations
 > 
 >     Aurora
 > 
 >   comments
 > 
 >     some assumptions don't make sense: e.g., requiring two stationary
 >     cars to signal a wreck; it is more of a stress test
 > 
 > 
 > Query languages and data models for database sequences and data
 > streams
 > ---------------------------------------------------------------
 > Carlo Zaniolo
 > 
 >   most dsms's use sql; queries that span both streams and dbs will be
 >   easier
 > 
 >   but sql suffers from major new problems; it was designed for
 >   persistent data, not transient data
 > 
 >   even for persistent data, sql is far from ideal. areas not
 >   supported: data mining, sequence queries
 > 
 >   this talk covers shortcomings of sql on data streams, and how to
 >   overcome them
 > 
 >   blocking operators
 > 
 >     current dbms's make heavy use of blocking operators
 > 
 >     characterize blocking and non-blocking operators
 > 
 >   partial ordering
 > 
 >     let S be a sequence [t1, ..., tn]
 > 
 >     then, [t1, ..., tk] is the presequence of S of length k, denoted
 >     Sk
 > 
 >     if for some k, L = Sk, we say that L is a presequence of S
 > 
 >     presequence defines a partial order
 > 
 >     generalizes to subset when order and duplicates are immaterial
 > 
 >     the empty sequence is a subsequence of every other sequence
 > 
 >   theorem: queries can be expressed via nonblocking computations iff
 >   they are monotonic
 > 
 >   a query lang. L can express a given set of functions on its input;
 >   to avoid blocking queries, only the monotonic functions expressible
 >   by L should be allowed on data streams
 > 
 >   But are ALL of them expressible using the nb-operators of L?
 > 
 >   L contains nb-operators and blocking operators: only the former can
 >   be used on data streams -- so, can those express all the monotonic
 >   queries expressible in L?
 > 
 >   L nb-complete ...
 > 
 >   example
 > 
 >     bidstream(item, bidval, time)
 > 
 >     select item
 >     from bidstream
 >     group by item
 >     having sum(bidval) > 100000;
 > 
 >     this is monotonic. it can be expressed in a language containing
 >     suitable query ops, but not in sql-2, which is not nb-complete;
 >     thus sql-2 is ill-suited for continuous queries on data streams
 > 
 >       i think he is saying sum is implemented as a blocking operator,
 >       but it doesn't have to be implemented this way
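 >       Operationally the monotonicity is easy to see: a nonblocking
 >       evaluation can emit each item as soon as its running sum of
 >       bids crosses the threshold, and never retracts anything. A
 >       Python sketch of the idea:

```python
from collections import defaultdict

def hot_items(bidstream, threshold=100000):
    """Nonblocking evaluation: yield each item once its bid total > threshold."""
    totals = defaultdict(int)
    emitted = set()
    for item, bidval, time in bidstream:
        totals[item] += bidval
        if totals[item] > threshold and item not in emitted:
            emitted.add(item)
            yield item  # once emitted, never retracted: the query is monotonic

stream = [("a", 60000, 1), ("b", 50000, 2), ("a", 50000, 3)]
print(list(hot_items(stream)))  # ['a']
```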
 > 
 >   relational algebra
 > 
 >     nonmonotonic ra operators: set difference and division
 > 
 >     we are left with: select, project, join, and union. can these
 >     express all FO monotonic queries?
 > 
 >     some interesting temporal queries: coalesce and until
 > 
 >       they are expressible in ra (by double negation)
 > 
 >       they are monotonic
 > 
 >       they cannot be expressed in nb-relational algebra
 > 
 >   "review"
 > 
 >     sql's lack of expressivity is a major problem for database-centric
 >     apps
 > 
 >     these problems are significantly more serious for data streams,
 >     since only monotonic queries are supported; etc.
 > 
 >   embedding sql in a prog. lang.
 > 
 >     ops not expressible in sql can be expressed in pl
 > 
 >     but cursors are 'pull-based' and not usable on streams
 > 
 >       cannot hold the current tuple and all that follow waiting for
 >       the pl to request it by a getNext statement
 > 
 >   punctuation: the source of the data inserts punctuations to denote
 >   the end of certain kinds of data, e.g., when the auction closes on
 >   item95
 > 
 >     queries that find the total volume of bids for item95 can then be
 >     answered, but there is little benefit for our example query
 >     (because you have to wait) ...
 > 
 >     windows as synopses.  approximations, e.g., compute the continuous
 >     sum.
 > 
 >   user defined aggregates as a solution
 > 
 >     supported by Obj-Rel systems, where they are defined in an
 >     external PL
 > 
 >     Aurora supports a specific sublanguage
 > 
 >     can be defined in SQL itself; native extensibility (but ... )
 > 
 >   example
 > 
 >     aggregate myavg(next int) : Real
 >     ( table state(tsum int, cnt int);
 > 
 >       initialize: ( ... insert into state values (next, 1) )
 > 
 >       iterate: ( update state set tsum = tsum + next, cnt = cnt + 1 ...)
 > 
 >       terminate: insert into return select tsum / cnt from state
 >     )
 > 
 >   theorem
 > 
 >     sql with uda's defined in sql is turing complete --- i.e., it can
 >     express every possible db query
 > 
 >     this result holds if we allow both blocking and nb uda's
 > 
 >     but what about data streams, where we can only use nb operators?
 > 
 >       there are query languages that are turing complete on stored
 >       data but not on data streams ...
 > 
 >     this is called a "tumbling window" -- various kinds of windows
 >     used with data streams
 > 
 >   nb-udas
 > 
 >     those with an empty, or missing, terminate clause
 > 
 >     theorem: sql with nb-udas is nb-complete: it can express all
 >     monotonic queries
 > 
 >     nb-udas natively in sql: the cornerstone of their expressive
 >     stream language ESL, which supports time-series queries, data
 >     mining, etc. well
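 >     The nb-uda idea in Python terms: with no terminate step, the
 >     aggregate emits a value on every input tuple instead of blocking
 >     until end-of-stream (my own illustration, not ESL syntax):

```python
def running_avg(stream):
    """Nonblocking average: emit the current average after every tuple."""
    tsum, cnt = 0, 0
    for next_val in stream:     # the 'iterate' step of the UDA
        tsum += next_val
        cnt += 1
        yield tsum / cnt        # emitted here, so no 'terminate' is needed

print(list(running_avg([2, 4, 6])))  # [2.0, 3.0, 4.0]
```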
 > 
 >   conclusion
 > 
 >     the db approach shows promise, but the challenges should not be
 >     underestimated
 > 
 >     we provided a formal analysis of the language problems of
 >     sql-based cqls
 > 
 >     proposed an effective solution: udas that make sql turing
 >     complete, but nb-complete (on data streams)
 > 
 >   ESL is up and running on the stream mill system
 >   (http://wis.cs.ucla.edu)
 > 
 >   here come the questions ...
 > 
 >       dave maier and jennifer widom go to the mike...
 > 
 >       as they walk there, dave says: "you go ahead; do you want me to
 >       lift you up to the mike?" (the fun begins ... :-)
 > 
 >       widom: agrees with approach; disagrees that sql is a stream
 >       query language: quote "I haven't seen anyone propose this, have
 >       you?" ... "no one has proposed sql semantics, maybe syntax, but not
 >       semantics"
 > 
 >       carlo: "then why use sql syntax?"
 > 
 >       widom: "interesting results, but wrong wrapping"
 > 
 >       carlo: "your comment shows some form of lack of thought"
 > 
 >       carlo: "if i was an nsf reviewer i would scratch it"
 > 
 >       widom: "oh, that was you!"
 > 
 >       widom: "does esl extend atlas?"
 > 
 >       ... didn't get answer ....
 > 
 >       widom: "ok, here comes mister punctuation"
 > 
 >       dave: "assuming sql would be implemented in a certain way
 >       underneath; would it be possible to translate it into operators
 >       implemented in a different way?", i.e., as non-blocking...
 > 
 >       carlo: "that is what we propose; they can do it"
 > 
 >       dave: "then you have the problem of translating into nb
 >       operators; then I have to do it"
 > 
 > 
 > 
 > _______________________________________________
 > seek-kr-sms mailing list
 > seek-kr-sms at ecoinformatics.org
 > http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms


