Vervis aims to be a decentralized project hosting platform, with support for server federation and distributed repository storage. But it's still very young, and at the time of writing only minimal features exist for hosting Git repositories.

Vervis is a RESTful web application written in Haskell, using the Yesod web framework.

The code is at: http://hub.darcs.net/fr33domlover/vervis

The material below is old, don't take it too seriously. For relevant progress tracking, bug reports, etc. see the tickets.

I'm going to launch an experimental preview instance soon. I'll publish its URL here when I do.

Old Stuff

The idea comes from the idea collection page:

Cool way to learn CGI and make something useful: gitweb / gitolite / full dev platform, with a generic VCS API and backends for at least Git and Darcs. Then you can see all repos - git and darcs - in the same list, and manage access policies etc. in a single config file. Idea for a beginning: understand CGI and make a "vcsweb" script that combines gitweb and darcsweb. Preferably write it in Haskell (gitweb is in Perl and darcsweb is in Python)

Some links to start with:

Initial Plans

Eventually Vervis should use Afifon for federation, but for now I assume it's the common kind of development platform, i.e. a standalone server.

The web UI isn't the core of the system, and is instead one of many possible specific frontends. The web UI is just an "HTML frontend" to Vervis. Here is an initial architecture (made with Asciio):

   .-------------.                   .------------.  .-----------------.
   | Web Browser |                   | GUI client |  | Terminal client |
   '-------------'                   '------------'  '-----------------'
          |                                 |               |
          |                                 |               |
          |                                 |               |
          |                                 |               |
          v                                 v               v
.---------------------.                  .----------------------.
| Web Server Node     |                  | Instance Server Node |
|   .-----------.     |                  |  .-----------------. |
|   |  Web UI   |     |                  |  |     Web API     | |
|   | HTML, CSS |     |       .------------>| HTTP, XML, JSON | |
|   '-----------'     |       |          |  '-----------------' |
| .-----------------. |       |          |  .----------------.  |
| |  Web Server     | |       |          |  |  Service API   |  |
| | Lighttpd, Nginx |---------'          |  '----------------'  |
| '-----------------' |                  |  .----------------.  |
'---------------------'                  |  |      API       |  |
                                         |  '----------------'  |
                                         |  .----------------.  |
                                         |  |    Instance    |  |
                                         |  '----------------'  |
                                         '----------------------'

More concepts:

  • A general data model
  • A storage scheme for the database (SQL, RDF, Smaoin) based on the general one
  • A programming language model based on the general one
  • API for querying repos and database, and for making modifications

There are generally 2 kinds of API actions:

  • Actions which operate on a repo
  • Actions which operate on the database

For the first kind, there needs to be a uniform interface for handling the various VCSs. Possible styles of the interface:

  • Unified: Support just the things shared by all backends. Pros: Very simple and straightforward to use. Cons: Ignores the special capabilities of specific VCSs, even when shared by many of them.
  • Separate: Support each VCS using a separate interface, taking advantage of the full feature set it provides. Pros: Provides access to all the features. Cons: No abstraction, cumbersome, hard to add operations.
  • Combined: Support a core of things shared by all backends, but also allow for features to be optional (e.g. Darcs doesn't support in-repo branches right now) and provide room for VCS-specific additional features. Pros: Provides abstraction without ignoring the more advanced specific capabilities. Cons: Need to think how to make operations easy to write and integrate with a good design of the "optional" and "specific" kinds of features in the interface.

Vervis will be targeting the combined approach.
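To make the combined approach concrete, here's a hypothetical sketch: a typeclass with a mandatory core, plus optional capabilities exposed as Maybe so a backend can decline them. All names here (Vcs, vcsBranches, the toy Darcs and Git types) are illustrative, not an actual Vervis API.

```haskell
data Commit = Commit
    { commitAuthor  :: String
    , commitMessage :: String
    } deriving (Show, Eq)

class Vcs v where
    -- Core operations every backend must support.
    vcsName :: v -> String
    vcsLog  :: v -> IO [Commit]
    -- Optional capability: Nothing means "this backend doesn't support it".
    vcsBranches :: v -> Maybe (IO [String])

-- Toy Darcs backend: no in-repo branches, matching the note above.
data Darcs = Darcs

instance Vcs Darcs where
    vcsName _     = "darcs"
    vcsLog _      = return []
    vcsBranches _ = Nothing

-- Toy Git backend: branches are supported.
data Git = Git

instance Vcs Git where
    vcsName _     = "git"
    vcsLog _      = return []
    vcsBranches _ = Just (return ["master"])
```

A caller can then pattern-match on the Maybe to decide whether to render, say, a branch list in the UI, without the core interface growing per-VCS special cases.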

Questions to answer:

  1. Which VCSs should Vervis support?
  2. What are the common concepts of VCSs? See Wikipedia
  3. What are the common concepts of the chosen VCSs? Make a data model

Before I go to Wikipedia, here's a short list of common VCSs:

  • Git
  • Darcs
  • Mercurial
  • Monotone
  • Fossil
  • SVN (old?)
  • CVS (old?)
  • Bazaar (obsolete?)
  • GNU Arch (?)

And some concepts I see in various VCSs: record, branch, repo, record title, record description, record author, patch, author, change, change type, file/folder add/remove, file edit, file/folder move/rename, author name, author email, tag, committer, record sequence graph, record type, tarball archive

Now let's see Wikipedia.

First, an initial list of VCSs to read about, from which to pick the ones to be supported. Remember that support means more software in the system, and systems with integrated UIs probably won't benefit much from having Vervis.

  • ArX
  • Bazaar
  • Codeville
  • Darcs
  • Fossil
  • Git
  • GNU Arch
  • Mercurial
  • Monotone

Impressions:

  • ArX: Latest release in 2007, I'm not sure anyone even uses it
  • Bazaar: Wikipedia cites Bazaar not being very active, but projects use it
  • Codeville: Abandoned
  • Darcs: OK
  • Fossil: SQLite seems to be the major user. Has web UI, but worth checking
  • Git: OK
  • GNU Arch: Deprecated
  • Mercurial: OK, seems to be used by many projects, including major ones
  • Monotone: OK, but I2P seems to be the only major project using it

Decision: VCSs will be added in 2 steps. First, high-priority ones. Then, later, at some point, perhaps gradually, the lower-priority ones will be added too.

High priority:

  • Darcs
  • Git
  • Mercurial

Low priority:

  • Bazaar
  • Monotone
  • Maybe Fossil?

General VCS concepts from Wikipedia: Revision control#Common vocabulary.

Interesting Haskell packages, trying to collect the better ones:

A data model: model.dia.

More info and plans still exist in my notebook.

UPDATE: I was thinking recently about the relation between decentralized and distributed systems, and about the good and bad sides of these concepts. See in decent.

Generation 1: Simple Git Repo View

It's been a while since I wrote the initial plans above. A decentralized dev platform is now one of the project ideas I'm most interested in trying. I started reading the Yesod book today, and I'd like to first create a foundation that is independent of a web UI.

The general idea is: Create a Yesod subsite that is essentially a simple partial gitweb clone. Or even a standalone web app. Yesod is involved here simply because I want to learn it, but it can be anything else.

For the beginning:

  1. No web UI, just functions that generate the data for the UI
  2. A single such function which generates the main view, i.e. list of repos
  3. Doesn't matter much which Git library I use, as long as it works well

Step 1: Generate repo list

I'm going to write a library with a function which does the following:

  1. Take a project root dir from the user, i.e. a directory path under which zero or more Git repos are located
  2. Read the repos and generate a simple record for each
  3. Return a list of these records

In this step, nothing Git related really needs to be done. Just iterate over the subdirectories at the given path and return a list of their names.
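A minimal sketch of that iteration, assuming the function name listRepos is mine (the directory and filepath packages ship with GHC; listDirectory already skips "." and ".."):

```haskell
import Control.Monad (filterM)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

-- Take a project root dir and return the names of its subdirectories,
-- each assumed to be a repo. Plain files are filtered out.
listRepos :: FilePath -> IO [FilePath]
listRepos root = do
    entries <- listDirectory root
    filterM (\e -> doesDirectoryExist (root </> e)) entries
```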

Done.

Step 2: Generate basic repo details

Assume each subdir found is a git repo. For each such repo, return the following:

  1. Name
  2. Time of last change
  3. How much time passed since then

There are several Git related packages, and they do different things. For example, libgit runs high-level operations, such as making commits. Other packages explore the data stored in .git/ and don't touch the working directory at all. I could also use some wrapper which supports multiple VCSs, but for now let's pick something Git-specific which can read data from .git/:

  • gitlib is fast, has multiple backends and seems to be able to do a variety of things but is also somewhat low level, not documented well and its API isn't very coherent.
  • hlibgit2 is probably - no Haddock so hard to say - a thin wrapper around the libgit2 C library.
  • hit - Implements the Git repo format directly in Haskell. Operates only on repo data, not working dir or index
  • libgit - wraps the git binary with a simple Haskell API for common user operations.
  • ght - functions for examining a git repo, nearly no documentation and no updates since 2011
  • vcswrapper - license issue since it's a GPL library, i.e. my CC0 code can't use it. It wraps Git, Mercurial and SVN using their binaries. It doesn't actually unify them under a single API, but it's a simple API. It provides roughly what libgit does, i.e. read and write user operations. Note about GPL: It seems my code can be CC0, but people releasing it together with the library will have to release the combination under the GPL. Something like that.
  • filestore - API for versioned file stores, supports Git, Darcs and Mercurial using their binaries. It allows to create files, modify files, get file contents, examine history, search and more.

It's hard to say which VCS wrapper tool would be needed for my case, since it's much too early to tell what kind of VCS operations I will need to perform from Haskell code. And since the operation I currently need, getting the datetime of the last commit, is commonly available, I have a wide choice among the listed libraries.

Perhaps in the long term it can be useful to write a library which provides a VCS API suited for project hosting platforms.

Decision: I picked hit for the first try.

Done.

Step 3: Basic records

Let's expand the data model a bit. It could be nice to make it possible to load data in a modular manner from a DB or from git config or from other files like Gitolite does. So let's just have data as Haskell types and leave serialization for a bit later.

Our server has a name, for example hub.vervis.org. There is a list of users of the server. Each user has a case-insensitive username, for example fr33domlover. There are also groups of users. These are simply named lists of users. The server has a list of namespaces, each of which is either a user or a group. Under each namespace, there can be zero or more repositories. Each such repo is simply a name in the records.

On the disk, there is a directory containing one subdirectory for each user and group. Under these, there are git repo directories. For now at least, assume they are bare, i.e. have no working tree.

Each repo also has a mailing list address and an IRC channel, and both are optional, and users have a name field for possibly real name, and an email address field.
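The records just described could look like this as plain Haskell types; the field names are my own placeholders, optional fields use Maybe, and String stands in for whatever text type gets used later:

```haskell
data User = User
    { userIdent :: String        -- case-insensitive username, e.g. "fr33domlover"
    , userName  :: Maybe String  -- possibly a real name
    , userEmail :: String
    } deriving Show

data Group = Group
    { groupIdent   :: String
    , groupMembers :: [String]   -- simply a named list of usernames
    } deriving Show

-- A namespace is either a user or a group.
data Namespace = NsUser User | NsGroup Group deriving Show

data Repo = Repo
    { repoIdent       :: String
    , repoMailingList :: Maybe String  -- optional
    , repoIrcChannel  :: Maybe String  -- optional
    } deriving Show
```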

Given a list of namespaces and the repos under them, return a list of pairs, in which each pair consists of a "namespace/repo" string and how much time passed since the last change as a friendly string.
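That deliverable, sketched with plain tuples; timeAgo is a deliberately crude illustration of the "friendly string", not a final formatting policy:

```haskell
-- Render a duration in seconds as a rough human-friendly string.
timeAgo :: Integer -> String
timeAgo s
    | s < 60    = show s ++ " seconds ago"
    | s < 3600  = show (s `div` 60) ++ " minutes ago"
    | s < 86400 = show (s `div` 3600) ++ " hours ago"
    | otherwise = show (s `div` 86400) ++ " days ago"

-- Given (namespace, repo, seconds since last change) triples, return
-- ("namespace/repo", friendly duration) pairs for the main view.
repoRows :: [(String, String, Integer)] -> [(String, String)]
repoRows = map (\(ns, repo, secs) -> (ns ++ "/" ++ repo, timeAgo secs))
```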

Done.

Step 4: Monad

Viewing repos is nice, but we're going to need a complete API for managing projects, including repos and tickets and merge requests and so on. Let's avoid passing context information everywhere by defining the Vervis monad. It can be simple RWST on top of IO, hidden in a newtype. The environment will contain things that don't change, maybe for example the repo root dir, and the state will contain what can change, such as user details.

I read a bit about the option of using ReaderT with IORef instead of StateT, and I decided I'll use StateT and RWST as long as they work well.

In this step, just define the monad.
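A minimal sketch of what that definition might look like; the Env and VervisState fields are placeholders of my own, and RWST comes from the transformers package bundled with GHC:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.Trans.RWS (RWST, asks, runRWST)

-- Read-only environment: things that don't change.
data Env = Env { envRepoRoot :: FilePath }

-- Mutable state: things that can change during a session.
data VervisState = VervisState { stCounter :: Int }

-- RWST on top of IO, hidden in a newtype; () is the (unused) writer.
newtype Vervis a = Vervis (RWST Env () VervisState IO a)
    deriving (Functor, Applicative, Monad)

runVervis :: Env -> VervisState -> Vervis a -> IO (a, VervisState)
runVervis env st (Vervis act) = do
    (x, st', ()) <- runRWST act env st
    return (x, st')

-- Example action: read the repo root from the environment.
askRepoRoot :: Vervis FilePath
askRepoRoot = Vervis (asks envRepoRoot)
```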

Done.

Step 5: Persistence

For now, no RDF and no SQL. Not even acid-state for now. Just use JSON files. The json-state package I made can help with that a lot.
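The json-state API isn't reproduced here; as a dependency-free stand-in showing just the shape of the idea - dump the whole state to a file, read it back later - here's a sketch using Show/Read in place of JSON:

```haskell
-- Placeholder state type; deriving Read/Show gives free (de)serialization.
data ServerState = ServerState { stateUsers :: [String] }
    deriving (Show, Read, Eq)

saveState :: FilePath -> ServerState -> IO ()
saveState path = writeFile path . show

loadState :: FilePath -> IO ServerState
loadState path = fmap read (readFile path)
```

json-state additionally handles things like saving on a debounced schedule, which this toy version ignores entirely.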

Done.

(Cancelled) Step 6v1: Whole project operations

In this step I'll add basic project operations, such as repo creation and deletion. These will allow me to create repos freely and use them to test all the other operations I'll add in the next steps.

Right now the data model works like this:

  • A namespace, which is a group or a user, has a list of projects
  • A project has a list of repos

Since most projects need a single repo, later I'll make the API and the UI make that use case easy, but for now it's not important. The operations to implement in this step are:

  • Create a new user
  • Delete a user
  • Create a new project
  • Delete a project
  • Create a new repo
  • Delete a repo

Working...

Problem: It's boring. I code a lot during night shifts, and boring coding makes me sleepy. The reason for boredom is that the operations here contain mainly 2 things:

  1. State changes, mostly HashMap operations right now
  2. Git repo operations, in this step only repo creation (initRepo) and repo deletion (which is simple directory removal, not git-specific)

Since these are all trivial, and will probably be done using persistent anyway, they become boring.

So... what's next, then? I did implement a single operation, user creation. Maybe move to persist at this point? I see two directions:

  1. Read a bit about persist, see if I can/should use it without Yesod
  2. Implement all basic git related operations gitweb supports

Testing the git operations conveniently will require the DB layer, which is why I wanted to handle it first, so I'll read about persistent now.

Step 6v2: Database

Use persistent. See if I can start using it without Yesod, and whether it's easy to use Yesod later or it's better to use Yesod from now on.

I checked, and it's possible. persistent works well without Yesod. Therefore, in this step, do the following:

  1. Split the code into submodules
  2. Support persistence for the data model using SQLite

Done.

Step 7: Main view query

Currently in a single simple page, there should be a query action which returns a table of repos:

  • Column 1: Namespace names
  • Column 2: Project names
  • Column 3: Repo names
  • Column 4: Repo time-agos, i.e. time duration since last change

The steps for that are hopefully:

  1. Get (namespace, project, repo) tuples using a single SQL query
  2. Construct repo path for each and determine last change time
  3. Return a structure of the results

In addition to that, add uniqueness constraints to the data model and anything else persistent offers which I should be using. In particular, see if there is an elegant way to make ident unique across users and groups, i.e. have unique namespaces which forbid the existence of a user and a group with the same ident.

Hmmm the uniqueness thing needs some work. Here's basically what we have right now:

User
    ident Text

    UniqueUserIdent ident

Group
    ident Text

    UniqueGroupIdent ident

Project
    ident Text
    --unique mapping to the namespace the project belongs to??

Repo
    ident   Text
    project ProjectId

    UniqueRepo ident project

The problem is that users and groups share the same namespace (first component of repo URL path), and this setup:

  1. Doesn't enforce it
  2. Doesn't (currently) have a way for a project to tell to which user/group it belongs. I can do that using a separate table, but it would be a slightly dirty hack and wouldn't solve issue 1

Below I'll try to build a solution by having a separate table for the ident namespace.

Ident
    ident Text

    UniqueIdent ident

User
    ident IdentId

    UniqueUserIdent ident

Group
    ident IdentId

    UniqueGroupIdent ident

Project
    ident     Text
    namespace IdentId

    UniqueProject ident namespace

Repo
    ident   Text
    project ProjectId

    UniqueRepo ident project

The above seems to solve both problems, but creates a new one. The solutions:

  1. A single table for ident names with the name field being unique
  2. Each project refers to the namespace. To determine which user or group that namespace really is, find a record in the user or group table which has the same namespace

The problem is that it's possible to have an ident name which is neither a user nor a group, or that is both a user and a group. Therefore finding the user/group for a namespace requires checking two tables, getting two Maybe results and checking at runtime that exactly one of them is Just and the other is Nothing. This isn't guaranteed statically. Another way is to simply put a <|> between the Maybes. This quietly handles the "both" case but still needs to report failure in the "neither" case.
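The runtime check just described can be sketched with plain Maybes; the names (Owner, resolveNamespace) are illustrative. Unlike the quiet <|> variant, this makes both failure cases explicit:

```haskell
-- The owner a namespace ident resolves to, carrying its ident name.
data Owner = OwnerUser String | OwnerGroup String
    deriving (Show, Eq)

-- Take the two Maybe results of looking the ident up in the user table
-- and the group table, and require exactly one of them to be a hit.
resolveNamespace :: Maybe String -> Maybe String -> Either String Owner
resolveNamespace (Just u) Nothing  = Right (OwnerUser u)
resolveNamespace Nothing  (Just g) = Right (OwnerGroup g)
resolveNamespace Nothing  Nothing  = Left "neither a user nor a group"
resolveNamespace (Just _) (Just _) = Left "both a user and a group"
```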

Hmmm maybe using SQL's "on delete" I can avoid the "neither" case? In other words, make an ident name be deleted when not in use?

After looking at some docs, it seems I can use DeleteCascade. It means that I can delete an ident name and cause all references to it to be deleted recursively. Here's another idea: For each ident name, store whether it's a user or a group. A simple boolean. When a user/group is added, set that field in the newly added ident name, and on query use it to determine which table to use for the user/group lookup.

Hmmm feels a bit ugly. Another idea: See if I can figure out the data model Gogs uses.

Ah, Gogs is different. Organization pages are under the org namespace. Maybe a quick hack or something.

Actually, this raises a bigger issue: How should namespaces really be organized? Should I really have a shared space for users and groups? Is there a good reason to do that? So far the reason is simply the URL path structure; is that a good reason?

Let's examine GNU social. Specifically, the instance at gnusocial.no. Since this is a social tool for microblogging, it's expected to be user oriented. The path for users is <domain>/<user> while the path for groups is <domain>/group/<group>. What about GitLab? I can't find a group table, but there's a user table and a namespace table and a project table.

Another idea: Define an enum for namespace types and use that in the namespace table.

Another idea: Have different namespaces, it's not such a big deal. After all the only reason for sharing namespaces seems to be short and simple URLs. We can still have things like /o/orgname and /u/username and be fine.

DECISION: Separate namespaces for now.

Action items:

  • Use separate namespaces
  • Describe folder structure
  • Generate the simple view data

HOLD ON! I missed something. If I use separate namespaces, then each ident name belongs to exactly one user or group, but I have a new problem: How does a project say to which namespace it belongs? I can use a custom field but it won't be a foreign key which SQL or Persistent will understand.

Here's a new idea. Right now users and groups share only one thing: They all have ident names. But theoretically, we could have a "namespace" concept which keeps everything that's relevant to both users and groups, e.g. full name and email and website, and separate user and group concepts for the specific fields. This is somewhat like a class hierarchy in OOP.

As to the ident uniqueness issue, I can prepare error pages which say there's an issue in the database. Actually in that case it's best to warn the maintainer/developer somehow. The point is, in valid code these cases shouldn't happen anyway.

NEW DECISION: Try a single namespace for all project owners.

Initial suggested folder structure:

.
├── people
│   └── john
│       └── myproj
│           └── myrepo
└── groups
    └── ourgroup
        └── ourproj
            └── ourrepo

There is however an issue here: When finding the repo location for a repo in the DB, it's not enough to know the sharer: You must determine whether the sharer is a person or a group. This leads to complications. I'll probably need to make 2 queries. And if I want the main view to list all sharers sorted alphabetically, I can't do efficient pagination like that.

New structure:

.
├── john
│   └── myproj
│       └── myrepo
└── ourgroup
    └── ourproj
        └── ourrepo

Done.

Step 8: Minimal Yesod

In this step, use the simple Yesod examples from the book to turn my code into a minimal Yesod app which returns the main view data in some format, even plain text is fine. JSON is better, and an HTML table will be great.

Done. Returns just an HTML table, no JSON yet.

Step 9: Use stack

Manage dep versions using stack. Drop the cabal sandbox. Learn how to use stack, get used to the new workflow and commands.

Done.

Step 10: Database

Read in the Yesod book and the Persistent wiki about setting up a PostgreSQL database for a Yesod app. Take a look at Snowdrift's sdm code. Write some basic instructions for myself. For now it's enough to have a development database managed by my regular user. The production database can come later.

Actually, I already did this for my MediaGoblin instance. Check their instructions too.

I'm starting with hints based on the MediaGoblin installation documentation. By the way, those docs are released via CC0 so I can freely copy them as-is without worrying.

Note: Throughout the documentation I use the sudo command to indicate that an instruction requires elevated user privileges to run. You can issue these commands as the root user if you prefer. If you need help configuring sudo, see sudo.

Install the PostgreSQL server and command-line client:

$ sudo apt-get install postgresql postgresql-client

The installation process will create a new system user named postgres, which will have privileges sufficient to manage the database. We will create a new database user with restricted privileges and a new database owned by our restricted database user for our Vervis instance.

In this example, the database user will be vervis and the database name will be vervis too.

We'll add these entities by first switching to the postgres account:

$ sudo su - postgres

Enter the following createuser and createdb commands at that prompt. They will run as the postgres user. We'll create the vervis database user first:

With password:

$ createuser --no-createdb --no-createrole --no-superuser --encrypted --pwprompt vervis

No password (if you run the app as a user by the same name as the DB user):

$ createuser --no-createdb --no-createrole --no-superuser vervis

Then we'll create the database where all of our Vervis data will be stored:

$ createdb --encoding=UTF8 --owner=vervis vervis

where the first vervis is the database owner and the second vervis is the database name.

Type exit to exit from the postgres user account:

$ exit

caution:: Where is the password?

These steps enable you to authenticate to the database in a password-less manner via local UNIX authentication, provided you run the Vervis application as a user with the same name as the user you created in PostgreSQL.

That was the part based on MediaGoblin. I looked at sdm too. What it does is generate passwords for the database users and insert them into the config files. So my suggestion is as follows: Either create a database role named fr33domlover (I mean, your operating system username) and then you don't need a password, or use vervis and just keep the password somewhere safe (just in case the DB becomes available over the network for any reason).

I'm creating a database user vervis_dev and a database by the same name.

Done.

Step 11: Yesod Scaffoldings

Continue reading in the Yesod book. Understand how the scaffoldings work and what they look like. Create them using a PostgreSQL template and stack new.

Done.

Step 12: Integrate the repo table

I have a scaffolded site now, but it isn't using my git related code. I'd like to do the following:

  1. Continue reading in the Yesod book, understand auth and templates and handlers better
  2. Update models and settings to reflect my previous work
  3. Clear the database and let Persistent create tables
  4. Add a route and a handler for the repo table

Done.

Step 13: Simple user login

The authentication system is implemented. There is no UI for adding a new user yet, but login and logout pages exist. However:

  • There are no users yet
  • When login fails, a strange technical error message is displayed

Therefore:

  1. Read in the Yesod book about sessions
  2. Add a user using GHCi
  3. Fix the error
  4. Verify auth works

It seems there is a bug with CSRF protection, a feature included by default in the scaffolding. I'm disabling it for now. With the feature disabled, authentication seems to work perfectly. Login and logout.

What's Next

There are many many things I can do now, since the basics are in place. Instead of writing tasks like a diary, I'm now going to start managing the work using tickets, here in the wiki. Hopefully, in time, I'll migrate them to a Vervis instance.

See tickets.