Date: Mon, 08 May 2000 15:32:52 -0700
From: | Eric Armstrong |
eric.armstrong@eng.sun.com Reply-To: unrev-II@egroups.com |
To: | unrev2 unrev-II@egroups.com |
Subject: | Collaborative Documents Requirements, v0.6 |
Requirements for a collaborative-document system. (Aprox. equiv to Open HyperDocument System.)
OverviewThis is a lengthy document aimed at adducing the requirements for a subset of an eventual Dynamic Knowledge Repository (DKR). The subset described is for a collaborative document system, which Doug describes as an "Open HyperDocument System" (OHS). The goal of this document is to show how such a system fits into a DKR framework, detail its requirements, and point to a couple of extensions that move it in the direction of a full DKR.
This document has the following sections:
A fully functional DKR will need to manage many different kinds of things:
Since the general outline of a DKR seems to depend on the problem domain it is targeted for, it seems reasonable to focus attention on the elements they have in common.
This set of requirements will focus on what is perhaps the major common feature: Documents -- in particular, Collaborative Documents, and the need to interact via email to construct them.
Other important areas that will need attention include the integration of multimedia objects (including animations, simulations, audio, video, and the like) as well as the critical functions of abstract knowledge representation, inference engines, model-building functions, and the integration of other executable programs. But here, we'll focus on Collaborative Documents.
MotivationA wide variety of email and forum-based discussions occur on a host of topics every day. In each of these discussions, important information frequently surfaces, but that information is hard to capture where you need it.
Document production systems, on the other hand, simplify the task of creating complex documents but make it hard to gather and integrate feedback.
For example the DKR discussions have identified several possible starting points for such a system. That kind of feedback occurs naturally in an email system, as opposed to a document production system, but each of the pointers was buried in a separate email. It required lengthy search to gather them together (below), and the list may not even be complete!
To act as a foundation for a DKR, a Collaborative Document System (CDS?) needs to combine the best features of:
In the DKR discussion, we've seen pointers to several possible starting points for such a system. Those are contained in the References post, in the Bootstrap section. (They many possible starting points listed in the post desperately need short synopses and evaluations.)
General CharacteristicsThe lengthy list of starting points, the difficulty of creating it, and the rapidity with which it goes out of date, combine to suggest several obvious requirements for the system: It needs to be composed of information nodes that are hierarchical, mailable, linkable, and evaluable (more on those subjects in a moment).
Each of those requirements leads in turn to other requirements. The major requirements are listed here and explained below:
General Functional Requirements
These are the general requirements for how the system must operate, to be effective.
This document, like the list of starting points mentioned earlier, is heavily hierarchical in nature -- as are most technical documents. These facts further underscore the need for a hierarchical system.
For example, this email message should exist in outline form. It should be easy to add and remove entries to various sections: for example, the list of starting points given above.
However, the hierarchy should function using XML-sytle "entity references" that copy the target contents into the displayed document, "inline". That permits multiple references to the same node. The result is effectively a lattice of information nodes, where any one view of it is hierarchical.
To be strictly correct, the underlying data structure will be a directed graph. In reality, it will be bidirectional, and it will typically turn out to have cyclic loops. Although it would be nice to avoid that, it is probably unavoidable.
The "network" nature of the graph results from the property that allows a document-segment (node or tree) to be used in multiple places. In each "document" that makes such an access, however, the view is hierarchical. The hierarchy is a view of the graph, and a "document" is really a structured collection of nodes from the data base.
Unlike HTML, where references to other documents occurs only with links, references to other nodes and trees in this system will typically occur as "includes". The effect of the inclusions will be to make the material will appear inline, as though it were part of the original document.
Although "hard" links to objects will be needed at times, in most cases the link to the "Requirements Document" should be a "soft" link -- that is, an indirect link that points to the latest version. That means never having to worry about looking at an old version of the spec.
Each node in the hierarchy needs to be versioned, so that previous information is available. In addition, the task of displaying differences becomes essentially trivial.
It must be possible to "publish" the whole document or sections of it by "posting" it. It must also be possible to create replies for individual sections, and then "post" them all at one time.
At a minimum, every node in the system has two hierarchies descending from it. One is a list of content nodes that comprise the hierarchical document. The other is a list of reviewer comments. (Some comments will be specific to the information in that node, others will be intended as general comments for that section of the document.)
Other sub-element lists may found to be desirable in the future, so the system should be "open-ended" in allowing other sublists to be added, identified, and accessed.
Rather than using a central "repository", the system should employ the major strengths of email systems, namely: fast access on local systems and the robust nature of the system as a result of having redundant copies on many different systems. The system will be more space intensive than email systems, but storage costs are dropping precipitously, and future technologies paint an even brighter picture.
To mitigate the short-term need for storage space, it should be possible to set individual storage policies. For example, a user will most likely not want to keep previous versions of any documents they are not personally involved in authoring.
It must also be possible to add names to the authoring list. Name removal should probably be limited to the original author. For those cases when the original author is no longer part of the system, it should be possible to make a copy of the document and name a new primary author.
When a new version of a document arrives, differences are highlighted. Old-version information becomes accessible through links (if saved). Differences are always against the last version that was visited. If a section of the document was never visited, the most recent version of that section is displayed on the first visit. If several iterations have taken place since the last visit, the cumulative differences are shown. (Again, node-versioning makes this user-friendly feature fairly trivial.)
XMLTreeDiff at IBM Alphaworks (Lars Martin)
Clearly support for web links is desirable, as shown by the links to the various possible starting points in the References post. [Note: Each of those should be evaluated against this requirements list, and used to modify these requirements.]
Indirect links are needed, both to link to a list of related nodes, and to link to the latest version of a node.
It must be possible to categorize nodes (and possibly links). For IBIS-style discussions, for example, node types include (at a minimum) question, alternative, pro, con, endorsement, and decision.
For material that is included "in line" in the original document, typing implies the ability to choose which kinds of linked-information to include. For example, in addition to the current version, one might choose to display previous versions and/or all commentary.
For material that is displayed in separate windows, typing allows the secondary windows to automatically display material of a given type. (For example, in Rod Welch's "contract alignment" example, the secondary window might automatically display the meeting minutes that are linked to particular phrases in a contract. Lines might be automatically drawn from sections of the minutes to sections of the contract. Other links in the documents, however, would be ignored.
It should be possible to construct an initial design document using queries of the form "give me all design notes corresponding to the features we decided to implement in the current version of the functional specification.
The many possible starting points in the References list highlights the need for evaluablility. It should be possible, not only to reply with a comment on any item in those lists, but also to add an evaluation, much as Amazon.com keeps evaluations for books. That feature is arguably their greatest contribution to ecommerce, and the DKR should make use of it. It should also be possible to order list items using relative evaluations. That lets the most promising starting point float to the top of the list.
Not all lists should be ordered by evaluation, however. For example, the sequence of requirements has been chosen to provide the most natural "bridge" from one to the next. So evaluation-ordering must be an option.
Ideally, it should also be possible to "weight" an evaluation, perhaps by adding a "yay" or "nay" to an existing evaluation.
When displaying an evaluation, where evaluators can choose a value from 1..5, it might make sense to display the average, the number of evaluations, and the distribution. A distribution like
...for example, would show a highly polarized response, even though the "average" was 3.
The system must increase the ability of multiple people, working collaboratively, to generate up to date and accurate revisions.
For any given document, there are several classes of interaction:
The 3rd group consists of people who suggest an alternative wording or organization. Those "suggestions" take the form of a modified copy of the original. One of the document authors may then agree to use that formulation in place of the original, or may simply keep it as commentary.
The 4th group consists of the fully-collaborative authoring group. The original author must be able to add other individuals to the document, or to subsections of it. (An author registered for a given node has authoring privileges throughout the hierarchy anchored at that node.)
Every information node that is created should be automatically attributed to it's author. When a new version of a node is created, all of the people who sent comments should be contained in a "reviewer" list. When a suggestion is accepted, the author of the suggested node should go into a "contributor" list in the parent node and be added to the "author" list for the current node. It should be possible to identify all of the reviewers, contributors, and authors for the whole document and for each section of it.
When new versions of a document are created, material would be included by pointing to it, keeping attributions intact. The system must accelerate that process. It should be possible to start a new document in one of two ways:
These are requirements for the system as a whole.
The system must be "open" in the sense that a user is not constrained to using a particular editor, email system, or central server. The specifications for interaction with the system should be freely available, along with a reference implementation to use as a basis. As much as possible, conformance with existing standards (XML, XHTML, HTTP, email) is desirable. (The tricky decisions, of course, will be between required features and standard protocols that don't support them.)
The server and client systems that implement the DKR must also be fully *extensible*. In other words, the same characteristics of hierarchy, versioning, and revisability (use of most recent version) that apply to the documents must apply to the system itself.
That extensibility can be accomplished with a "dispatch table" that names the class to use for each kind of object that needs to be created. In conjunction with open sourcing, that architecture allows a user to extend (subclass) an existing class and then use the extended version in place of the original. In addition, upgrades can occur dynamically, while the system is in operation, while allowing for modular downgrades when extensions don't work out.
Security in such a system becomes an issue, unfortunately. The system should employ whatever mechanisms exist or can be constructed to help prevent trojan horse attacks, back door attacks, and other security breaches in an open source system.
For example, Christine Peterson described Apache's process as having something like 45 reviewers, 3 of whom recommend the inclusion and none of whom object, before new code is added to the system.
Email is fundamentally the right interface for such a system, because information comes to you, the information is organized into threads, and you can edit/reply from within the same application you use to view the information.
(Email's major weaknesses stem from the fact that even though the interface is appropriate, the underlying data structures are not. But the hierarchy inherent in the specified system will rectify those flaws, eliminating the redundancy inherent in email responses and allowing for thread-summaries.)
However, the factor that makes email central to one's daily activities is the wide variety of inputs you receive. Email is inherently "project neutral". You get email on every topic under the sun, including personal and professional interests. It represents "one stop shopping" for your information needs. (The Web, on the other hand, provides nicer storefronts, but you have to go visit the store to find what you want.)
In a sense, the "firewall" requirement is in itself a partition. In an organization like the Standford Research Center (SRI), for example, there is a need to create a project-specific partition, so that only only other members of the project team ever see that information. On the other hand, there is a wide area of shared expertise (computer expertise, management expertise, administrative expertise) that can be shared among all members of the organization.
In a similar vein, the "email interface model" implies the need for multiple partitions -- one for each project or interest area, for example. The degree to which you "cross-fertilize" between the partitions should then be up to you.
Looking Ahead: Some DKR RequirementsThese additional requirements begin to move the system towards a DKR.
With respect to security, there is also the issue of "firewall" capability. The DKR must allow professionals in many different organizations to contribute and share knowledge. That knowledge may largely be in the form of published papers and the means to locate and access them, but it represents a high-degree of inter-organizational co-operation, at the level of the individual professional.
The DKR will also be handy for individual projects, though. The mechanisms will support collaborative designs and "on demand" education as to corporate procedures, for example. But that information must remain *inside* the firewall, inaccessible to competitors.
In the ideal scenario, it will also be possible to "publish" information stored in the inner repository at strategic times, rather like publishing a technical paper that gives the design of the system. But until then, the firewall must remain intact.
Eventually, the system must become a *teaching* tool. It must follow the concept of "Education on Demand", intelligently supplying the user with the information needed, and educating that user, whatever their initial background. (Within reasonable limits.)
Outline of Operational RequirementsThis is an outline of functional operations for the system:
A better alternative, if feasible, would be attributions attached to every phrase in the node. That requirement creates a third category of containment for the node, consisting of the text that makes it up. When originally created, there would only be one long phrase, and it's author. When others make changes, the text would be broken up into segments. That's the same architecture most editors use internally, anyway, but it would require storing a lot more information, putting it together to display the node, and taking it into account when copying and pasting.
Each node in the system should be able to track the following information:
After the initial version of the data/object structures has been nailed down, they need to be run through a series of use case scenarios, with the data manipulations defined for each. The goal of the process will be to refine the data structures, looking for weaknesses or necessary reorganizations. [Note: Some scenarios may need to be tabled as unsuitable for the initial system.]
--Requirements rows, Technology columns --Must-Have, Nice-To-Have, Optional categories --Y/N cells &/or evaluation cells --Adding a new technology suggests additional "must have" feature
A hierarchical system is created from only two relationships:
One wonders what such a system will look like after it begins to be extended with thousands of additional relationships.
It boggles the mind.
Sincerely,
Eric Armstrong
eric.armstrong@eng.sun.com