Imagine you stumble into a medical conference and decide you want to socialize with physicians. Your process may be to approach a random physician, speak to them for a bit, and then have them randomly select one of their colleagues whom you can speak to next. If you continued like this forever, how much time would you spend talking to each person? You might spend the most time talking with the person with the most connections, but depending on how the data is structured, that might not be the case. Let's take a look!

Step 1: Get ahold of the data you want to analyze and format it as an edge list in a CSV file. An edge list has two columns, each representing a node in the graph. For example, a row like 103, 105 might indicate that physician 103 is connected to physician 105.
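
The first few rows of a hypothetical edge list (the IDs here are made up) might look like this:

physician,colleague
103,105
103,110
105,110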

Step 2: Install R, as well as the igraph package.
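
If R is already installed, the igraph package can be pulled from CRAN with a single command:

install.packages("igraph")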

Step 3: Build a graph with the igraph package from your CSV edge list. Running the function below prompts the user to select a CSV file from the system and builds a graph.

library(igraph)
graphFromEdgeList <- function(){
  dat <- read.csv(file.choose(), header = TRUE) # choose an edge list in .csv file format
  graph.data.frame(dat, directed = TRUE)
}

Step 4: Manipulate the graph using some of the techniques from Google's PageRank paper and build a transition matrix.

Before moving to step 5, let's take a quick look at some graph theory. An adjacency matrix is a representation of a graph where an entry (for example, [103, 105]) is 1 if physician 103 is connected to physician 105 and 0 if they aren't connected. To get the probability of randomly transitioning from a given node to any one of its connections, you divide 1 by that physician's number of connections. If physician 103 has 5 connections, there is a 1/5, or 0.20, chance of physician 103 directing you to 105. The matrix of all these probabilities is called a transition matrix.
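
As a quick illustration, here is a hypothetical three-physician network and the transition matrix built from it:

# hypothetical network: 103 talks to 104 and 105, 104 talks to 105, 105 talks to 103
A <- rbind(c(0, 1, 1),
           c(0, 0, 1),
           c(1, 0, 0))
P <- A / rowSums(A) # divide each row by its number of connections
P[1, ]              # 0.0 0.5 0.5 -- a 50% chance of 103 sending you to 104 or to 105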

But what happens if physician 103 is connected to 104 but 104 is connected to nobody? Or what if 104 is only connected to 105 and 105 is only connected to 104? These two cases, known as dangling nodes (leaves) and periodic subgraphs, can distort the results of a random walk because the walker can get stuck and never reach the rest of the graph. When Google was working on its PageRank algorithm, it came up with some tricks to ensure that the graph being analyzed has neither of these properties. First, if a node has no outbound connections, connect it to every node in the graph with equal probability. Second, give the walker a 20% chance of jumping to a uniformly random node at any time. The code to build such a transition matrix from a graph object is below:

# takes a directed graph
# returns a modified n x n transition matrix
#   the returned matrix is modified so that any node with no outbound edges is
#   connected to every node in the graph at a uniform weight.
#   every node is connected, so the graph is guaranteed not to have hidden periodic subgraphs
randomChatterMatrix <- function(G){
  A <- get.adjacency(G) # warning: this matrix can be quite large
  N <- nrow(A)
  r <- c()

  for (i in 1:N){
    s <- sum(A[i, ])

    if (s == 0){
      # connect dangling nodes to every other node in the graph at equal probability
      # manipulating A inside the loop is not performant because it copies A every time,
      # so keep a vector of rows to bulk update later
      r <- c(r, i)
    } else {
      # since s varies row to row, perform the operation on the spot
      A[i, ] <- A[i, ] / s
    }
  }

  if (length(r) > 0) A[r, ] <- 1/N # bulk update all zero rows
  m <- as.matrix(A)
  (.8 * m) + (.2 * 1/N) # 80% of the time follow an edge, 20% jump to a uniformly random node
}

Step 5: Compute the random walk probabilities for each physician. Now that we have a transition matrix to work with, we can use linear algebra to answer our original question: how much time will you spend talking to each person? It just so happens that when you're dealing with a real square matrix with positive entries, the eigenvector corresponding to its largest eigenvalue gives us exactly that information. Since our matrix describes transitions out of each node, we want the dominant left eigenvector, which is the same as the eigenvector of the transposed matrix. Using the functions above, the code to compute it looks like this:

g <- graphFromEdgeList()
r <- randomChatterMatrix(g)
eigen_vect <- Re(eigen(t(r))$vectors[, 1]) # dominant eigenvector; drop the zero imaginary part
probs <- eigen_vect / sum(eigen_vect)      # normalize so the probabilities sum to 1
print(probs) # let's see what we got!
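
As an optional sanity check, the stationary probabilities should be left unchanged by one more step of the walk:

max(abs(t(r) %*% probs - probs)) # should be very close to zero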

Below is a histogram for a small subset of Doximity physicians, where the X axis shows the proportion of time spent talking to a physician and the Y axis shows how many physicians fall into that range. The results are heavily skewed: you would randomly chat with most physicians for about the same small amount of time, but a few stand out as people you would spend significantly more time chatting with.
[Figure: histogram of physician chatting time]
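
A plot along these lines can be reproduced directly from the probabilities computed above (the bin count here is arbitrary):

hist(probs, breaks = 50,
     xlab = "Proportion of time spent talking to a physician",
     ylab = "Number of physicians",
     main = "Physician chatting time")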

And just like that, you're able to quantify exactly how much time you would spend with each physician. This new piece of information is interesting on its own, but it can also be the start of many more fun data science exercises with R!

When consumers first get acquainted with an API, they'll often turn to its documentation. It's the natural starting point for most consumers, especially given how prominent APIs are in development today. Why, then, do we start by writing tests for code? The short answer: we don't.

At Doximity, we often write the documentation for our APIs before writing the first line of code. Tom Preston-Werner actually wrote about the benefits of this type of documentation in his blog post Readme Driven Development. Over time, we've found a way of streamlining this workflow to offer benefits for both developers and consumers alike.

The Workflow

First, we write a draft of the documentation, including an example request and response. After receiving edits from the consumer, our iOS team, we make the necessary revisions. This feedback loop ensures all stakeholders, project manager included, agree on a final product. (This is similar to the process that user interface design mocks go through before development.) After this loop is complete, we write the failing tests that support and drive the code as we normally would.

Apiary

To help us with this process we use the Apiary service, which hosts our documentation. But calling it a mere host misses the point. Apiary champions a documentation format, API Blueprint, which standardizes the way documentation is written. After comparing a few products in this space, like Swagger and RAML, we decided on Apiary. The ease of learning API Blueprint's syntax, especially for less technical people, was too appealing to pass up.
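
To give a feel for the syntax, here is a minimal sketch of a blueprint for a hypothetical colleagues endpoint (the resource and fields are invented for illustration):

FORMAT: 1A

# Physicians API

## Colleagues [/physicians/103/colleagues]

### List Colleagues [GET]

+ Response 200 (application/json)

        {
          "colleagues": [105, 110]
        }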

Apiary also runs a mock web server for the documented endpoints. This server responds with the JSON found in the documentation's examples. That is a powerful feature: it lets the iOS team start developing against the API as soon as we finish the documentation. Two other tools are logical derivatives of the mock API. The first is the API proxy, which proxies requests through Apiary to any server, such as a staging server; when debugging a request or its response, it helps to compare the expected with the actual, and to that end Apiary provides a diff. The second lets consumers send requests to a production server while they are reading an API's docs: by adding an authorization header, the browser sends the request to the real server.
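
For example, once a blueprint like the one above is published, a consumer can hit the mock server with nothing more than curl (the hostname below is hypothetical; Apiary assigns one per project):

curl https://physicians.apiary-mock.com/physicians/103/colleagues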

Final Thoughts

There are many other natural benefits to driving design this way. Our documentation serves as a transparent boundary between our teams as we grow as a company, and our development team can stub the responses from services; Apiary can help with that. Figuring out which team owns a bug goes from fuzzy to clear. We can even write tests for the documentation itself. The initial time spent writing documentation is well worth the cost if we can continue to have a meeting of the minds later on.

There is a fantastic Thoughtbot article, written by Caleb, about signing commits (among other items, like emails). I presented it to our team as an excellent opportunity to provide some authenticity and ensure provenance.

If you don't have the time to follow along with Caleb, I'm going to attempt to tl;dr his article here. However, I highly recommend referring back to the original article.

Signing a commit proves you yourself made those changes. This is advantageous for a number of reasons that you can learn about from horror stories.

To get setup, run these commands:

brew install gpg2 gpg-agent pinentry-mac
gpg2 --gen-key

Use RSA with a 4096-bit key. If this is your first key, set it to expire in 1 year; that way lost passphrases, forgotten keys, and so on eventually expire on their own. If you use PGP regularly, a key that doesn't expire isn't unreasonable, as long as you generate a revocation certificate and store it somewhere separate. For now, pick 1 year.

After you follow the prompts, generate a revocation certificate, especially if your key doesn't expire.

gpg2 --output revoke.asc --gen-revoke you@example.com # use the email address you attached to the key

Follow the prompts and tell GnuPG you're giving no reason, since you're pre-generating it. Seriously, you need this certificate. If you lose it, you're hosed, so store it safely. Printing it as a QR code is highly recommended.

Finally, make this automatic for git by adding it to your gitconfig. This is the best part and was only recently added to git. Run gpg-agent so you only have to enter the secret key's passphrase once.
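
A minimal setup looks something like the following, where YOUR_KEY_ID is the id shown by gpg2 --list-secret-keys:

git config --global user.signingkey YOUR_KEY_ID
git config --global gpg.program gpg2
git config --global commit.gpgsign true # sign every commit automatically (newer versions of git)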

If you made it this far, consider exchanging and signing each other's keys at your organization to unlock the full power of the web of trust.

Machine Learning Made Simple with Ruby

How is it possible to make automatic classification work properly without resorting to external prediction services? Starting with Bayesian classification, you can use the Ruby gem classifier-reborn to create a Latent Semantic Indexer. Hands on!

Thinking in React

Pete Hunt walks you through the process of creating a React.js application, explaining the process and how to think the React.js way.

Go and Ruby-FFI

How to write a shared library in Go that can be loaded by Ruby-FFI.

Profiling & Optimizing in Go

Transcript of a talk going through the tools and strategies for profiling and optimizing Go.

Best practices for a new Go developer

Read what Gophers from across the world have to say to the question — “What best practices are most important for a new Go developer to learn and understand?”

Microservices Resource Guide

Martin Fowler’s guide to microservices is a collection of recommended articles, presentations, and materials regarding the subject.

Practical Persistence in Go: Organising Database Access

In this post the author takes a look at four different methods for organizing your code and structuring access to your database connection pool.

Getting Started with Rails 5's ActionCable and Websockets

An introduction to Rails 5’s new feature, ActionCable.

A Neural Network in 11 lines of Python

A bare-bones neural network implementation that describes the inner workings of backpropagation.

Stack Overflow: Replacing a 32-bit Loop Count Variable with 64-bit Introduces Crazy Performance Deviations

A very interesting low-level discussion about optimization and how empirical optimization can also be misleading.