bims: 2nd Knight grant

This is the second grant solicitation to the Knight Prototype fund, of 2016‒11‒16. Thomas Krichel and Ross Mounce wrote it. Here we show the questions and the answers. The sum was $35000. We did not have to write a budget at that stage. The application was declined.

1. Describe what you will make (150 words)

We want to build an open-access project to improve current awareness of the biomedical literature. This includes the contents available from the National Library of Medicine's PubMed database. This database grows by about 20000 items a week. The type of documents described is quite broad. It includes material that is only of interest to specialists. But there are also papers that are of interest to the general public.

Biomed news will be a system that contains reports. Each report will announce new additions to PubMed on a one-issue-per-week basis. Each report will be specific to a subject and an audience. Each report will be headed by a selector. Selectors will make decisions on what papers are announced in the report. Report issues will be circulated to subscribers via email, published on the biomed news web site, put into RSS feeds and announced via Twitter.

2. What problem are you trying to solve? (200 words)

The project may appear unrealistic, because we are asking selectors to look through thousands of documents a week. But it's not a lot of work for them if we are using machine learning to help them. We know that from experience. Since 1998, Thomas Krichel has been running the "NEP: New Economics Papers" project at http://nep.repec.org. It's basically the same system as we propose here. It uses data collected by the RePEc digital library for economics. It takes about 10 minutes for an editor to work through an issue. NEP produces close to 100 reports. It has over 75,000 subscriptions.

There are technical challenges. We need to take account of the fact that PubMed is 25 times the size of RePEc. Nobody could possibly look at all the 20000 papers we would get in a week. That means we need to make a preselection. Before we can train for a report, we need selectors to give us sample data. Capturing that requires a new system that we have to build. That said, we will definitely be able to launch, because we can use an adapted version of the selector interface of NEP.
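The preselection step could be sketched as follows. This is a minimal illustration in Python, not the method ernad actually uses, which this proposal does not describe. It assumes a naive Bayes scorer over title words; the function names, the sample titles and their labels are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Count word frequencies per class from (title, selected) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for title, selected in samples:
        doc_counts[selected] += 1
        word_counts[selected].update(tokenize(title))
    return word_counts, doc_counts


def score(title, model):
    """Log-odds that a title is relevant, using add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total = {c: sum(word_counts[c].values()) for c in (True, False)}
    # Prior log-odds from the (smoothed) class frequencies in the sample.
    result = math.log((doc_counts[True] + 1) / (doc_counts[False] + 1))
    for word in tokenize(title):
        p_yes = (word_counts[True][word] + 1) / (total[True] + len(vocab) + 1)
        p_no = (word_counts[False][word] + 1) / (total[False] + len(vocab) + 1)
        result += math.log(p_yes / p_no)
    return result


def preselect(titles, model, limit):
    """Return the `limit` highest-scoring titles, best first."""
    return sorted(titles, key=lambda t: score(t, model), reverse=True)[:limit]


# Hypothetical sample data, as a selector for a malaria report might supply it.
samples = [
    ("Malaria vaccine trial in children", True),
    ("Artemisinin resistance in malaria parasites", True),
    ("Knee replacement surgery outcomes", False),
    ("Dental implant failure rates", False),
]
model = train(samples)

# Hypothetical new additions for the week.
candidates = [
    "Malaria parasites and mosquito control",
    "Dental surgery outcomes in adults",
    "Hip implant surgery",
]
shortlist = preselect(candidates, model, 1)
```

Here the selector sees only the shortlist instead of the full weekly intake. A real system would need far more sample data per report and would likely use abstracts and other metadata as well as titles.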

3. Who do you intend to impact with the project and how do you understand their needs? (200 words)

We have two types of users.

The first type of users are the selectors. We expect selectors to be academics, patients of a certain disease, or journalists. All selectors have to stay on top of the literature in their field. PubMed is the premier source for many of them. The traditional way to access PubMed is through keyword searches. They are tedious to use. They do not produce good results unless the topic has a very specific jargon. We are convinced that the machine learning will bring more relevant papers out than, say, PubMed alerts. The first benefit to selectors is for them to get better information about the topic. But there is a second benefit for selectors. Selectors will be publicly acknowledged. Thus selectors get name recognition as thought leaders in a group of people who have a similar interest.

The second type of users are the readers of report issues. These users can comprise anybody with an interest in a biomedical issue. Anybody can sign up at no cost. We make all the reports available as open data for reuse, in simple XML files.

4. Please list team members and their qualifications (400 words)

There are two team members. We both feel it is important to understand where we are coming from.

Thomas Krichel was born in Germany in 1965. He studied economics in France. When he worked as a research assistant in the UK, he became interested in the infrastructure of publishing in economics. In 1993, he published the first electronic research paper in economics on a gopher server he had access to. In 1997, he founded a system called RePEc for the publication of economics research papers. It's a non-proprietary publishing system that over 1700 publishers of economics papers contribute to. In 1999, he built the RePEc Author Service. This was the first ever service to allow authors to register their papers. In 2007, he released a co-authorship visualization site called CollEc. Among his work in RePEc, the most salient to this application is NEP: New Economics Papers, which he created in 1998.

Thomas Krichel has a history of creating free but self-sustaining datasets and services. RePEc, NEP and other systems he founded are non-proprietary and independent of external funding. They sustain themselves by leveraging the contributions of self-interested contributors. Thomas Krichel thus has a history of using initial grant support to build freely available yet self-sustaining systems.

Dr Ross Mounce has a PhD in Evolutionary Biology and is a scientific advisor with the ContentMine project (http://contentmine.org/), mining and republishing facts from academic literature. His research focus is on automating robust and reproducible identification and re-use of facts from research articles to ameliorate impediments to effective knowledge synthesis. In 2012, Ross was one of the first awardees of a Panton Fellowship, for the promotion of open data in science, by the Open Knowledge Foundation.

5. What progress, if any, have you made on this project (200 words)

Thomas Krichel founded the "NEP: New Economics Papers" system in 1998. The first NEP issue had 24 papers. At that scale, manually selecting relevant papers was adequate. In 2003, Thomas designed a purpose-built report composition tool. The system is called ernad. It introduced machine learning into the selection process. Thomas has been maintaining ernad ever since.

Starting in 2014, Thomas refactored ernad to make it a more general system. It can now run several services. Each service is customizable for look and feel. In 2015, Thomas concentrated on improving the way the machine learning works. This is critical to scaling it up to the level that PubMed would require. The Open Library Society is a PubMed vendor. We have access to daily updates of PubMed data. We have indexed the data. We find that this year, there is an average of 3827 new papers a day. Thus we have measured the extent of the scaling problem.

Some of our friends are ready to be selectors. But we can’t advertise the system until we have built it and ironed out most of the bugs.

This was the body of the proposal. Applicants were allowed to attach some media to make their point. So Ross made this demo video.