move matrix-generic documentation from synapse/docs into new matrix-doc project

2014-10-09 20:30:40 +02:00 · 2014-10-09 20:30:40 +02:00 · 556e3f8a71
commit 556e3f8a71
parent 06723d7465
8 changed files with 3721 additions and 0 deletions
--- a/drafts/definitions.rst
+++ b/drafts/definitions.rst
@ -0,0 +1,53 @@
+Definitions
+===========
+
+# *Event* -- A JSON object that represents a piece of information to be
+distributed to the room. The object includes a payload and metadata,
+including a `type` used to indicate what the payload is for and how to process
+it. It also includes one or more references to previous events.
+
+# *Event graph* -- Events and their references to previous events form a
+directed acyclic graph. Every event must be a descendant of the first event in
+a room, except in a few special circumstances.
+
+# *State event* -- A state event is an event that has a non-null string valued
+`state_key` field. It may also include a `prev_state` key referencing exactly
+one state event with the same type and state key, in the same event graph.
+
+# *State tree* -- A state tree is a tree formed by a collection of state events
+that have the same type and state key (all in the same event graph).
+
+# *State resolution algorithm* -- An algorithm that takes a state tree as input
+and selects a single leaf node.
+
+# *Current state event* -- The leaf node of a given state tree that has been
+selected by the state resolution algorithm.
+
+# *Room state* / *state dictionary* / *current state* -- A mapping of the pair
+(event type, state key) to the current state event for that pair.
+
+# *Room* -- An event graph and its associated state dictionary. An event is in
+the room if it is part of the event graph.
+
+# *Topological ordering* -- The partial ordering that can be extracted from the
+event graph due to it being a DAG.
+
+(The state definitions are purposely slightly ill-defined, since if we allow
+deleting events we might end up with multiple state trees for a given event
+type and state key pair.)
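
The definitions above can be sketched in Python. This is a minimal illustration, not an implementation of the protocol: the `Event` class, its `event_id` and `prev_events` fields, and both helper functions are hypothetical names introduced here. In particular, `room_state` simply lets the last state event in topological order win, which only stands in for the state resolution algorithm on a linear history.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical in-memory event; only `type` and `state_key` are
    field names taken from the definitions, the rest are assumptions."""
    event_id: str
    type: str
    prev_events: tuple = ()          # references to previous events
    state_key: Optional[str] = None  # a non-null string makes this a state event

    @property
    def is_state(self) -> bool:
        return self.state_key is not None


def topological_order(events):
    """Return event IDs in one order consistent with the event graph."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {e.event_id: set(e.prev_events) for e in events}
    return list(TopologicalSorter(graph).static_order())


def room_state(events):
    """Map (event type, state key) to a state event.

    Assumption: the last state event in topological order wins. This is a
    placeholder for the state resolution algorithm.
    """
    by_id = {e.event_id: e for e in events}
    state = {}
    for event_id in topological_order(events):
        event = by_id.get(event_id)
        if event is not None and event.is_state:
            state[(event.type, event.state_key)] = event
    return state
```

Because the graph only defines a partial order, `topological_order` returns just one of possibly many valid orderings when the history has branches.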
+
+Federation specific
+-------------------
+# *(Persistent data unit) PDU* -- An encoding of an event for distribution over
+the server-to-server protocol.
+
+# *(Ephemeral data unit) EDU* -- A piece of information that is sent between
+servers and doesn't encode an event.
+
+Client specific
+---------------
+# *Child events* -- Events that reference a single event in the same room
+independently of the event graph.
+
+# *Collapsed events* -- Events whose JSON object includes all the child events
+that reference them.
--- a/drafts/human-id-rules.rst
+++ b/drafts/human-id-rules.rst
@ -0,0 +1,79 @@
+This document outlines the format for human-readable IDs within Matrix.
+
+Overview
+--------
+UTF-8 is quickly becoming the standard character encoding on the web. As
+such, Matrix requires that all strings MUST be encoded as UTF-8. However,
+using Unicode as the character set for human-readable IDs is troublesome. There
+are many different characters which appear identical to each other, but would
+identify different users. In addition, there are non-printable characters which
+cannot be rendered by the end-user. This opens up a security vulnerability with
+phishing/spoofing of IDs, commonly known as a homograph attack.
+
+Web browsers encountered this problem when Internationalized Domain Names were
+introduced. A variety of checks were put in place in order to protect users. If
+an address failed the check, the raw punycode would be displayed to disambiguate
+the address. Similar checks are performed by home servers in Matrix. However,
+Matrix does not use punycode representations, and so does not show raw punycode
+on a failed check. Instead, home servers must outright reject these misleading
+IDs.
+
+Types of human-readable IDs
+---------------------------
+There are two main human-readable IDs in question:
+
+- Room aliases
+- User IDs
+ 
+Room aliases look like ``#localpart:domain``. These aliases point to opaque,
+non-human-readable room IDs. These pointers can change, so there is already an
+issue present with the same ID pointing to a different destination at a later
+date.
+
+User IDs look like ``@localpart:domain``. These represent actual end-users, and
+unlike room aliases, there is no layer of indirection. This presents a much
+greater concern with homograph attacks. 
+
+Checks
+------
+- Similar to the checks performed by web browsers.
+- Blacklisted characters (e.g. non-printable characters) are not allowed.
+- A mix of language sets other than the 'preferred' language is not allowed.
+- Language sets are taken from the CLDR dataset.
+- IDs are treated in segments (localpart, domain).
+- Additional restrictions for ease of processing IDs:
+   - Room alias localparts MUST NOT have ``#`` or ``:``.
+   - User ID localparts MUST NOT have ``@`` or ``:``.
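
As a rough sketch, two of the checks listed above (forbidden delimiter characters and non-printable characters) could look like the following in Python. The function name, the `kind` labels, and the choice of Unicode general categories used to approximate "non-printable" are all assumptions made for illustration. The mixed-language-set check is not covered, since it needs CLDR data that the Python standard library does not provide.

```python
import unicodedata

# Characters each ID type's localpart MUST NOT contain, per the list above.
FORBIDDEN = {"room_alias": "#:", "user_id": "@:"}

# Assumption: "non-printable" is approximated by the Unicode general
# categories for control, format, surrogate, private-use and unassigned
# code points.
NON_PRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}


def localpart_problems(kind, localpart):
    """Return a list of human-readable reasons why `localpart` fails."""
    problems = []
    for ch in localpart:
        if ch in FORBIDDEN[kind]:
            problems.append(f"forbidden character {ch!r}")
        elif unicodedata.category(ch) in NON_PRINTABLE_CATEGORIES:
            problems.append(f"non-printable character U+{ord(ch):04X}")
    return problems
```

An empty list means the localpart passed these two checks only; a home server would still need to run the remaining checks before accepting the ID.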
+
+Rejecting
+---------
+- Home servers MUST reject room aliases which do not pass the check, both on 
+  GETs and PUTs.
+- Home servers MUST reject user ID localparts which do not pass the check, both
+  on creation and on events.
+- Any home server whose domain does not pass this check MUST use its punycode
+  domain name instead of the IDN, to prevent other home servers from rejecting
+  it.
+- The error code is ``M_FAILED_HUMAN_ID_CHECK``. This is generic enough to cover
+  both failures due to homograph attacks and failures due to forbidden
+  characters such as ``:``.
+- The error message MAY give further information about which characters were
+  rejected and why.
+- The error message SHOULD contain a ``failed_keys`` key which contains an array
+  of strings representing the keys which failed the check, e.g.::
+  
+    failed_keys: [ user_id, room_alias ]
+  
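
A rejection response carrying this information might be built as in the sketch below. Only the error code and the ``failed_keys`` key come from the rules above; the `errcode` and `error` key names, the default message text, and the function itself are assumptions for illustration.

```python
import json


def human_id_error(failed_keys, detail=""):
    """Build a hypothetical JSON error body for a failed human-ID check."""
    body = {
        # The key names `errcode` and `error` are assumptions.
        "errcode": "M_FAILED_HUMAN_ID_CHECK",
        "error": detail or "Human-readable ID failed the check",
        # Names of the keys that failed the check.
        "failed_keys": list(failed_keys),
    }
    return json.dumps(body, sort_keys=True)
```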
+Other considerations
+--------------------
+- Basic security: Informational key on the event attached by HS to say "unsafe 
+  ID". Problem: clients can just ignore it, and since it will appear only very
+  rarely, easy to forget when implementing clients.
+- Moderate security: Requires client handshake. Forces clients to implement
+  a check, else they cannot communicate with the misleading ID. However, this is
+  extra overhead in both client implementations and round-trips.
+- High security: Outright rejection of the ID at the point of creation / 
+  receiving event. Point of creation rejection is preferable to avoid the ID
+  entering the system in the first place. However, malicious HSes can just allow
+  the ID. Hence, other home servers must reject them if they see them in events.
+  Client never sees the problem ID, provided the HS is correctly implemented.
+- The high security option was chosen: the client doesn't need to worry about
+  it, and there is no additional protocol complexity aside from rejection of an
+  event.
--- a/drafts/state_resolution.rst
+++ b/drafts/state_resolution.rst
@ -0,0 +1,51 @@
+State Resolution
+================
+This section describes why we need state resolution and how it works.
+
+
+Motivation
+-----------
+We want to be able to associate some shared state with rooms, e.g. a room name
+or members list. This is done by having a current state dictionary that maps
+the pair (event type, state key) to an event.
+
+However, since the servers involved in the room are distributed, we need to be
+able to handle the case when two (or more) servers try to update the state at
+the same time. This is done via the state resolution algorithm.
+
+
+State Tree
+------------
+Each state event contains a reference to the state it is trying to replace.
+These relations form a tree where the current state is one of the leaf nodes.
+
+Note that state events are events, and so are part of the PDU graph. Thus we
+can be sure that (modulo the internet being particularly broken) we will see
+all state events eventually.
+
+
+Algorithm requirements
+----------------------
+We want the algorithm to have the following properties:
+
+- Since we aren't guaranteed what order we receive state events in, except that
+  we see parents before children, the state resolution algorithm must not depend
+  on the order and must always come to the same result. 
+- If we receive a state event whose parent is the current state, then the
+  algorithm will select it.
+- The algorithm does not depend on internal state, ensuring all servers should
+  come to the same decision.
+
+These three properties mean it is enough to keep track of the current state and
+compare it with any new proposed state, rather than having to keep track of all
+the leaves of the tree and recompute across the entire state tree.
+
+
+Current Implementation
+----------------------
+The current implementation works as follows. Upon receipt of a newly proposed
+state change, we first find the common ancestor of the proposed state and the
+current state. We then take the maximum of the users' power levels across each
+branch; if one branch's maximum is higher, that branch is selected as the
+current state. Otherwise, we check whether one chain is longer than the other
+and, if so, choose that one. If that also fails, we concatenate all the PDU IDs
+in each branch, take a SHA1 hash of each, and compare the hashes to select the
+winner.
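
The three tie-breaking steps can be sketched as a single comparison key in Python. This is an illustration under stated assumptions, not the actual implementation: the `StateEvent` class and its fields are hypothetical, each branch is assumed to be the non-empty list of state events since the common ancestor, and the text does not say which hash wins, so the larger hex digest is chosen arbitrarily here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StateEvent:
    """Hypothetical state event as seen by the resolver."""
    pdu_id: str
    power_level: int  # power level of the user who sent the event


def _branch_key(branch):
    """Comparison key for one branch of events since the common ancestor."""
    joined = "".join(e.pdu_id for e in branch)
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    # Compared in order: highest power level, chain length, then hash.
    # Assumption: the larger hash wins the final tie-break.
    return (max(e.power_level for e in branch), len(branch), digest)


def resolve(current_branch, proposed_branch):
    """Return the branch whose leaf should become the current state."""
    return max(current_branch, proposed_branch, key=_branch_key)
```

Since the key depends only on the contents of the two branches, swapping the arguments gives the same winner, which matches the requirement that the result must not depend on the order in which state events arrive.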