Sifter: an open economy aggregator for Econix

soycamo edited this page Sep 13, 2010 · 1 revision

Making sense of new economies

Draft 0.1

This document summarizes high-level data structures and relationships. It is anticipated that this will form the basis for a web-based client application as an initial implementation of the Econix approach to open standards for economic information.

Data flow

  1. Data pools
    • Data resides in multiple external data pools for each data type
    • Eg. Craigslist, eBay, monster.com, etc. for wanted listings
  2. Data feeds
    • We assume that data is made available through RSS/Atom feeds
    • These feeds will (in general) be constructed using searches or other user-specific criteria
    • Data pools may create restrictions on the feeds in order to minimize resource use or for other purposes
    • The content of the feed is XHTML semantically marked up using microformats
  3. Aggregator client
    • The client regularly polls the feeds and updates its internal database (note: a separate database for each user is recommended)
    • New data is parsed into internal data structures, and relevant relationships are created
    • When the user accesses the “dashboard”, the internal database is searched and data is filtered, prioritized, and presented using the user’s preferences
    • These preferences will likely
      • start as one of a number of preset defaults
      • be refined by the user over time
      • be able to be saved as an arbitrary number of reference values
      • be able to be “tuned” dynamically using an easy user interface
      • eventually be able to be “learned” using neural-net-style learning and fuzzy logic
      • eventually be able to be partially or entirely copied from friends
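
The polling step of the aggregator client could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: it assumes Atom feeds whose entries carry an `id`, stores raw entries in a hypothetical `entries` table in a per-user SQLite database, and leaves fetching over HTTP and microformat parsing of the XHTML content out of scope. The function name `ingest_atom` is invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def ingest_atom(conn, feed_xml):
    """Store feed entries not seen before; return the ids that were new."""
    # Hypothetical table layout, assumed for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id TEXT PRIMARY KEY, title TEXT, content TEXT)"
    )
    new_ids = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id is None:
            continue  # without an id we cannot tell new entries from old ones
        seen = conn.execute(
            "SELECT 1 FROM entries WHERE id = ?", (entry_id,)
        ).fetchone()
        if seen:
            continue
        conn.execute(
            "INSERT INTO entries (id, title, content) VALUES (?, ?, ?)",
            (
                entry_id,
                entry.findtext(ATOM + "title", default=""),
                entry.findtext(ATOM + "content", default=""),
            ),
        )
        new_ids.append(entry_id)
    conn.commit()
    return new_ids
```

Calling `ingest_atom` again with the same feed returns an empty list, so only the newly stored ids would need to be passed on to the parsing stage.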

Data parsing (for initial version)

Incoming feeds are parsed as follows (only the most important data elements are listed, and this list is to be expanded; see the microformat descriptions for full details):

  • hListing
    • listing action (offer, wanted, trade, meet, etc.)
    • lister, in hCard format
    • item info (location, picture, etc.)
    • description
    • tags
  • hReview
    • type (product | business | event | person | place | website | url)
    • item info (name, photo, url, can be hCard for people/businesses, hCalendar for events)
    • reviewer. (hCard)
    • rating. optional. fixed point integer [1.0-5.0], with optional alternate worst (default:1.0) and/or best (default:5.0), also fixed point integers, and explicit value.
    • description. optional. text with optional valid HTML markup.
    • tags. optional. keywords or phrases, using rel-tag, each with optional rating.
  • hCard
    • fn (ie. full name)
    • url, email, tel (maybe best practice is for url to be an openid?)
    • adr, geo (for location)
    • title, role, org
    • category (ie. tag)
  • hCalendar
    • start, [end | duration].
    • summary
    • location, geo
    • url
    • category (ie. tag)
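
Because an hReview rating can carry its own worst and best bounds, ratings from different feeds may need to be put on a common scale before they are compared. One possible way to do that is sketched here. Mapping onto the range 0 to 1 is an assumption made for this example, and `normalize_rating` is a hypothetical helper name.

```python
def normalize_rating(value, worst=1.0, best=5.0):
    """Map a rating onto 0..1 using its declared worst and best bounds.

    The defaults mirror the hReview defaults listed above (worst 1.0, best 5.0).
    """
    if best == worst:
        raise ValueError("best and worst must differ")
    score = (value - worst) / (best - worst)
    # Clamp, in case a feed reports a value outside its own declared bounds.
    return min(1.0, max(0.0, score))
```

With the default bounds, a rating of 4.0 maps to 0.75. A rating of 3 on a scale declared as worst 0, best 10 maps to 0.3.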

Other data formats may be added or included. For example, the hReview format can be used to encode trust information (ie., a review of a person, with a rating, potentially broken down into distinct tagged ratings). However, trust information might be obtained in other ways as well: one could use XFN or OpenSocial to obtain information about “friends”, which can be assumed to have a certain kind of trust/rating (eg. a maximal rating on a user-specified tag, such as “friend”).

Data structures and relationships

This first section is a bit verbose… If you wish, skip down to “Data structures” proper.

Once incoming data is parsed, it will be related and added to the existing internal database. The key challenges in this operation are:

  • To identify references to the same people / organizations. For example, the hListing:lister of a service and the hReview:item that rates that service provider.
    • The most general way to do this is to establish probable links between entries in the DB that remain separate. For example (based on preferences), having the same email might give an identity-probability of 0.99; the same url 0.90; and the same full name 0.80. Then, entries that are linked by probabilities in excess of some value will be displayed to the user appropriately (eg. if greater than 0.90, the client behaves as if the entries describe the same person; or there could be a gradation in which color-coding indicates relative confidence in the underlying identity). This might be complemented by an (optional) feature that lets the user manually approve or reject likely matches.
    • Alternatively, but much less robustly, incoming data could be merged with existing data when it reaches a threshold of probable identity.
    • Confidence in identity is likely to be a significant concern, especially as the system grows in use. There are various ways this can be addressed: the user can select how trusted various data pools are, and/or microformatted data can be digitally signed. Though they would require an additional handshaking process, OpenID and OAuth also offer potential solutions to the question of how to be confident of identity.
  • To identify references to the same things / services / events. The “fuzzy logic” approach described above is perhaps even more appropriate here.
    • When trying to match the object of a listing with that of a review, tags may need to be considered as well as name, url, and description fields. A match is frequently likely to be uncertain (but still useful).
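
The probable-link idea for people can be sketched as follows, using the example probabilities given above (email 0.99, url 0.90, full name 0.80). Taking the highest probability among the matching fields is an assumption made for this sketch; the text above does not say how several matching fields should be combined. The `Person` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Example per-field probabilities from the text; in practice these
# would come from the user's preferences.
FIELD_CONFIDENCE = {"email": 0.99, "url": 0.90, "fn": 0.80}


@dataclass
class Person:
    fn: Optional[str] = None
    email: Optional[str] = None
    url: Optional[str] = None


def identity_probability(a, b, confidence=FIELD_CONFIDENCE):
    """Return the highest confidence among fields on which a and b agree."""
    best = 0.0
    for field_name, probability in confidence.items():
        value_a = getattr(a, field_name)
        value_b = getattr(b, field_name)
        if value_a and value_b and value_a.strip().lower() == value_b.strip().lower():
            best = max(best, probability)
    return best


def treat_as_same(a, b, threshold=0.90):
    """Apply the display rule: strictly above the threshold counts as the same."""
    return identity_probability(a, b) > threshold
```

The two entries stay separate in the database; only the computed link is stored or displayed, which is what keeps this approach more robust than merging records.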

In addition, some information will likely be entered directly into the client, in a way that makes more sense as part of the database than as a preference. Most importantly, this includes trust in reviewers, whether in data pools as a whole (discussed above) or in particular people and/or streams. In other words, users will need to be able to directly set up trust in third-party raters. Moreover, there may be a need for an interface that lets users attach tags to a particular entry (perhaps as part of a neural-net learning process whereby tags are recognized as relative synonyms).

Data structures:

  • People / organizations
  • Listings
  • Reviews
  • Events
  • Items (note: this is where the “objects” of listings and reviews are entered; the hProduct microformat might help structure this, but it should be at least as general as the hListing / hReview formats are)

Relationships:

  • Probabilistic identities between entries within a single table: eg., People entries 3 and 7 are 90% likely the same; 3 and 34 are 75%; 7 and 34 are 85%.
  • 1-1 subject or object relations between tables: eg. between each Listing (or Review) and a lister (or reviewer) in People, and an item (ie. service, product, event, etc.) in Items or Events
  • Maybe further derived relationships for performance reasons? Eg. add a table of tags with links to each entry thus tagged.
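
One way the tables and relationships above might be laid out is sketched here with SQLite. All table and column names are assumptions for illustration, the columns are reduced to a bare minimum, and reviews are simplified to point only at Items. The `identity_links` table holds the probabilistic identities, one row per pair within a single table.

```python
import sqlite3

# Hypothetical minimal schema; real tables would carry many more columns.
SCHEMA = """
CREATE TABLE people   (id INTEGER PRIMARY KEY, fn TEXT, email TEXT, url TEXT);
CREATE TABLE items    (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE events   (id INTEGER PRIMARY KEY, summary TEXT, start TEXT);
CREATE TABLE listings (id INTEGER PRIMARY KEY, action TEXT,
                       lister_id INTEGER REFERENCES people(id),
                       item_id INTEGER REFERENCES items(id));
CREATE TABLE reviews  (id INTEGER PRIMARY KEY, rating REAL,
                       reviewer_id INTEGER REFERENCES people(id),
                       item_id INTEGER REFERENCES items(id));
CREATE TABLE identity_links (
    table_name TEXT, a_id INTEGER, b_id INTEGER, probability REAL,
    PRIMARY KEY (table_name, a_id, b_id),
    CHECK (a_id < b_id)  -- store each unordered pair once
);
"""


def open_db(path=":memory:"):
    """Create a database with the sketch schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def probable_matches(conn, table_name, entry_id, threshold):
    """Return (other_id, probability) pairs linked to entry_id at or above threshold."""
    rows = conn.execute(
        "SELECT CASE WHEN a_id = ? THEN b_id ELSE a_id END, probability "
        "FROM identity_links "
        "WHERE table_name = ? AND (a_id = ? OR b_id = ?) AND probability >= ? "
        "ORDER BY probability DESC",
        (entry_id, table_name, entry_id, entry_id, threshold),
    )
    return rows.fetchall()
```

With the example links from the list above (3 and 7 at 0.90, 3 and 34 at 0.75, 7 and 34 at 0.85), asking for the matches of People entry 7 at a threshold of 0.8 would return entries 3 and 34, in that order.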

Preferences (simple version)

  • Tags are weighted with importance (ie. each tag entry has a weight)
  • Can enter “search terms” or “standing desires” or the like: these will do fulltext searches (maybe on parsing, all words are added to a standing “keywords” table, with relationships)
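
The standing “keywords” table could be approximated in memory as an inverted index, sketched here. The tokenization rule (lowercased runs of ASCII letters and digits) and the requirement that every query word must match are assumptions for this example, and `KeywordIndex` is a hypothetical name.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KeywordIndex:
    """Maps each word to the ids of the entries that contain it."""

    def __init__(self):
        self._entries_by_word = defaultdict(set)

    def add(self, entry_id, text):
        # Called once per entry at parse time.
        for word in tokenize(text):
            self._entries_by_word[word].add(entry_id)

    def search(self, query):
        """Return ids of entries containing every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        matches = [self._entries_by_word.get(word, set()) for word in words]
        return set.intersection(*matches)
```

A “standing desire” would then simply be a stored query string that is re-run against the index whenever new entries arrive.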

Data analysis (priorities / filters)

  • All reviews about a particular person/item/etc. are gathered.
    • Probable matches are considered together, in a configurable way. Eg., if X and Y are 85% likely the same, then reviews about X could be treated as if they’re reviews of Y (it’s above a binary threshold) or only with 85% confidence (ie. only affects overall priority 85% as much).
  • Reviews about reviewers (and reviews about reviewers of reviewers, etc.) are gathered. How these are used is highly configurable:
    • One likely approach is to use rating-quality transitively (ie. if I directly rate Alice 80% as a reviewer, and she rates Bob 50% as a reviewer, then I give Bob’s reviews 0.80 × 0.50 = 40% weight); this weights reviewers more heavily the closer they are to me.
    • Another approach would be to account for “rater uncertainty” in some way: ie. I trust Bob 50% ± 10%.
  • These reviews are then averaged, accounting for the weighting of their raters.
  • Then these ratings are modified by the weights of tags and/or search keywords, and a priority is determined. This is the basis for presenting prioritized data to the user.
  • For performance reasons at least, a running copy of prioritized high-value data is likely to be maintained and edited, rather than re-creating from scratch on each refresh.
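
The transitive approach and the weighted averaging described above can be sketched as follows. Several choices here are assumptions rather than part of the design: trust and ratings are expressed on a 0 to 1 scale, trust along a chain is the product of its links, the strongest chain wins when several exist, chains are cut off at `max_depth`, and tag weights are applied by multiplying with the highest weight among an item's tags. All function names are hypothetical.

```python
def transitive_trust(direct, me, max_depth=3):
    """Derive trust in reviewers reachable from `me`.

    direct maps each rater to {ratee: trust}, with trust in 0..1.
    """
    trust = {me: 1.0}
    frontier = [me]
    for _ in range(max_depth):
        next_frontier = []
        for rater in frontier:
            for ratee, link in direct.get(rater, {}).items():
                candidate = trust[rater] * link
                if candidate > trust.get(ratee, 0.0):
                    trust[ratee] = candidate
                    next_frontier.append(ratee)
        frontier = next_frontier
    del trust[me]
    return trust


def weighted_rating(reviews, trust):
    """Average (reviewer, rating) pairs, weighting each by trust in its reviewer."""
    total = sum(trust.get(reviewer, 0.0) for reviewer, _ in reviews)
    if total == 0:
        return None  # no trusted reviewer has rated this item
    weighted = sum(trust.get(reviewer, 0.0) * rating for reviewer, rating in reviews)
    return weighted / total


def priority(rating, tags, tag_weights, default_weight=1.0):
    """Scale a rating by the highest preference weight among the item's tags."""
    if rating is None:
        return 0.0
    weight = max(
        (tag_weights.get(tag, default_weight) for tag in tags),
        default=default_weight,
    )
    return rating * weight
```

For the example in the text, `transitive_trust({"me": {"alice": 0.8}, "alice": {"bob": 0.5}}, "me")` gives Alice a trust of 0.8 and Bob a trust of 0.4. Rater uncertainty and the 85% identity confidence mentioned above are not modelled in this sketch.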

Scraps

These are unused but potentially useful.

  • Identities, linking multiple entries within and among the entity tables. These take various forms, because what the “same” means is quite fuzzy. For example, it’s the “same” if it has a similar name: “Red & Black”, “Red and Black”, “The Red & Black”.

  • For some purposes, the “same” service is one that has similar tags — eg. “restaurant”, “cafe”, “eatery”, etc., or more specifically: “vegan”, “pizza”, etc.
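
The similar-name case could be handled with a simple normalization step before comparison, sketched here. The specific rules (lowercasing, treating “&” as “and”, dropping a leading “the”, ignoring punctuation) are assumptions chosen to cover the three example names, and `normalize_name` is a hypothetical helper.

```python
import re


def normalize_name(name):
    """Reduce a name to a canonical form for fuzzy comparison."""
    lowered = name.lower().replace("&", " and ")
    words = re.findall(r"[a-z0-9]+", lowered)
    if words and words[0] == "the":
        words = words[1:]  # a leading article rarely distinguishes two names
    return " ".join(words)
```

All three example names normalize to `red and black`, so they would compare as equal. Tag-based similarity (eg. “restaurant” versus “cafe”) would need a separate synonym mechanism that is not sketched here.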