• Poldracklab, and Informatics

    Why, hello there! I hear you are a potentially interested graduate student, and perhaps you are interested in data structures and/or imaging methods? If so, why you’ve come to the right place! My PI Russ Poldrack recently wrote a nice post to advertise the Poldrack lab for graduate school. Since I’m the first (and so far, only) student out of BMI to make it through (and proudly graduate from the Poldracklab), I’d like to do the same and add on to some of his comments. Please feel free to reach out to me if you want more detail, or have specific questions.

    Is graduate school for you?

    Before we get into the details, you should be sure about this question. Graduate school is a long, challenging process, and if you don’t feel it in your heart that you want to pursue some niche passion or learning for that long, given the opportunity cost of making a lower income bracket “salary” for 5 years, look elsewhere. If you aren’t sure, I recommend taking a year or two to work as an RA (research assistant) in a lab doing something similar to what you want to do. If you aren’t sure what you want to study (my position when I graduated from college in 2009), then I recommend an RAship in a lab that will maximize your exposure to many interesting things. If you have already answered the harder questions about the kind of work that gives you meaning, and can say a resounding “YES!” to graduate school, then please continue.

    What program should I apply to?

    Russ laid out a very nice description of the choices, given that you are someone that is generally interested in psychology and/or informatics. You can study these things via Biomedical Informatics (my program), Neuroscience, or traditional Psychology. If you want to join Poldracklab (which I highly recommend!) you would probably be best choosing one of these programs. I will try to break it down into categories as Russ did.

    • Research: This question is very easy for me to answer. If you have burning questions about human brain function, cognitive processes, or the like, and are less interested in the data structures or methods to get you answers to those questions, don’t be in Biomedical Informatics. If you are more of an infrastructure or methods person, and your idea of fun is programming and building things, you might on the other hand consider BMI. That said, there is huge overlap between the three domains. You could likely pursue the exact same research in all three, and what it really comes down to is what you want to do after, and what kind of courses you want to take.
    • Coursework: Psychology and neuroscience have a solid set of core requirements that will give you background and expertise in understanding neurons, (what we know) about how brains work, and (some) flavor of psychology or neuroscience. The hardest course (I think) in neuroscience is NBIO 206, a medical school course (I took as a domain knowledge course) that will have you studying spinal pathways, neurons, and all the different brain systems in detail. It was pretty neat, but I’m not sure it was so useful for my particular research. Psychology likely will have you take basic courses in Psychology (think Cognitive, Developmental, Social, etc.) and then move up to smaller seminar courses and research. BMI, on the other hand, has firm but less structured requirements. For example, you will be required to take core Stats and Computer Science courses, and core Informatics courses, along with some “domain of knowledge.” The domain of knowledge, however, can be everything from genomics to brains to clinical science. The general idea is that we learn to be experts in understanding systems and methods (namely machine learning) and then apply that expertise to solve “some” problem in biology or medicine. Hence the name “Bio-medical” Informatics.
    • Challenge: As someone who took Psychology courses in College and then jumped into Computer Science / Stats in graduate school, I can assuredly say that I found the latter much more challenging. The Psychology and Neuroscience courses I’ve taken (a few at Stanford) tend to be project and writing intensive with tests that mainly require lots of memorization. In other words, you have a lot of control over your performance in the class, because working hard consistently will correlate with doing well. On the other hand, the CS and Stats courses tend to be problem set and exam intensive. This means that you can study hard and still take a soul crushing exam, work night and day on a problem set, get a 63% (and question your value as a human being), and then go sit on the roof for a while. TLDR: graduate courses, especially at Stanford, are challenging, and you should be prepared for it. You will learn to detach your self-worth from some mark on a paper, and understand that you are building up an invaluable toolbelt to start to build the foundation of your future career. You will realize that classes are challenging for everyone, and if you work hard (go to problem sessions, do your best on exams, ask for help when you need it) people tend to notice these things, and you’re going to make it through. Matter of fact, once you make it through it really is sunshine and rainbows! You get to focus on your research and build things, which basically means having fun all the time :)
    • Career: It’s hard not to notice that most who graduate from BMI, if they don’t continue as a postdoc or professor in academia, get some pretty awesome jobs in industry or what I call “academic industry.” The reason is that the training is perfect for the trendy job of “data scientist,” and so coming out of Stanford with a PhD in this area, especially with some expertise in genomics, machine learning, or medicine, is a highly sought after skill set, and a sound choice given indifference. You probably would only do better with Statistics, Computer Science, or Engineering. If you are definitely wanting to stay in academia and/or Psychology, you would be fine in any of the three programs. However, if you are unsure about your future wants and desires (academia or industry?) you would have a slightly higher leg up with BMI, at least on paper.
    • Uncertainty: We all change our minds sometimes. If you have decided that you love solving problems using algorithms but are unsure about imaging or brain science, then I again recommend BMI, because you have the opportunity to rotate in multiple labs, and choose the path that is most interesting to you. There are (supposed to be) no hard feelings, and no strings attached. You show up, bond (or not) with the lab, do some cool work (finish or not, publish or not) and then move on.
    • Admission: Ah, admissions, what everyone really wants to know about! I think most admissions are a crapshoot - you have a lot of highly and equally qualified individuals, and the admissions committees apply some basic thresholding of the applications, and then go with gut feelings, offer interviews to 20-25 students (about 1/5 or 1/6 of the total maybe?) and then choose the most promising or interesting bunch. From a statistics point of view, BMI is a lot harder to be admitted to (I think). I don’t have complete numbers for Psychology or Neuroscience, but the programs tend to be bigger, and they admit about 2-3X the number of students. My year in BMI, the admissions rate was about 4-5% (along the lines of 6 accepted for about 140-150 applications) and the recently published statistics cite 6 accepted for 135 applications. This is probably around a 5% admissions rate, which is pretty low. So perhaps you might just apply to both, to maximize your chances for working with Poldracklab!
    • Support: Support comes down to the timing of having people looking out for you during your first (and second) year experiences, and this is where BMI is very different from the other programs. You enter BMI and go through what are called “rotations” (three is about average) before officially joining a lab (usually by the end of year two). This period also happens to be the highest stress time of the graduate curriculum, and if a student is going to feel unsupported, overworked, or sad, it is most likely to happen during this time. I imagine this would be different in Psychology, because you are part of a lab from Day 1. In this case, the amount of support that you get is highly dependent on your lab. Another important component of making this decision is asking yourself if you are the kind of person that likes having a big group of people sharing the same space, always available for feedback, or if you are more of a loner. I was an interesting case because I am strongly a loner, and so while the first part of graduate school felt a little bit like I was floating around in the clouds, it was really great to be grounded for the second part. That said, I didn’t fully take advantage of the strong support structure that Poldracklab had to offer. I am very elusive, and continued to float when it came to pursuing an optimal working environment (which for me wasn’t sitting at a desk in Jordan Hall). You would only find me in the lab for group meetings, and because of that I probably didn’t bond or collaborate with my lab to the maximum that I could. However, it’s worth pointing out that despite my different working style, I was still made to feel valued and involved in many projects, another strong indication of a flexible and supportive lab.


    How is Poldracklab different from other labs?

    Given some combination of interest in brain imaging and methods, Poldracklab is definitely your top choice at Stanford, in my opinion. I had experience with several imaging labs during my time, and Poldracklab was by far the most supportive, resource providing, and rich in knowledge and usage of modern technology. Most other labs I visited were heavily weighted to one side - either far too focused on some aspect of informatics at a detriment to what was being studied, or too far into answering a specific question and relying heavily on plugging data into opaque, sometimes poorly understood processing pipelines. In Poldracklab, we wrote our own code, we constantly questioned if we could do it better, and the goal was always important. Russ was never controlling or limiting in his advising. He provided feedback whenever I asked for it, brought together the right people for a discussion when needed, and let me build things freely and happily. We were diabolical!

    What does an advisor expect of me?

    I think it’s easy to have an expectation that, akin to secondary school, Medical School, or Law School, you sign up for something, go through a set of requirements, pop out of the end of the conveyor belt of expectation, and then get a gold star. Your best strategy will be to throw away all expectation, and follow your interests and learning like a mysterious light through a hidden forest. If you get caught up in trying to please some individual, or some set of requirements, you are both selling yourself and your program short. The best learning and discoveries, in my opinion, come from the mind that is a bit of a drifter and explorer.

    What kind of an advisor is Russ?

    Russ was a great advisor. He is direct, he is resourceful, and he knows his stuff. He didn’t impose any kind of strict control over the things that I worked on, the papers that I wanted to publish, or even how frequently we met. It was very natural to meet when we needed to, and I always felt that I could speak clearly about anything and everything on my mind. When I first joined it didn’t seem to be a standard to do most of our talking on the white board (and I was still learning to do this myself to move away from the “talking head” style meeting), but I just went for it, and it made the meetings fun and interactive. He is the kind of advisor that is also spending his weekends playing with code, talking to the lab on Slack, and let’s be real, that’s just awesome. I continued to be amazed and wonder how in the world he did it all, still catching the Caltrain to make the ride all the way back to the city every single day! Lab meetings (unless it was a talk that I wasn’t super interested in) were looked forward to because people were generally happy. The worst feeling is having an advisor that doesn’t remember what you talked about from week to week, can’t keep up with you, or doesn’t know his or her stuff. It’s unfortunately more common than you think, because being a PI at Stanford, and keeping your head above the water with procuring grants, publishing, and maintaining your lab, is stressful and hard. Regardless, Russ is so far from the bad advisor phenotype. I’d say in a heartbeat he is the best advisor I’ve had at Stanford, on equal par with my academic advisor (who is also named Russ!), who is equally supportive and carries a magical, fun quality. I really was quite lucky when it came to advising! One might say, Russ to the power of two lucky!

    Do I really need to go to Stanford?

    All this said, if you know what you love to do, and you pursue it relentlessly, you are going to find happiness and fulfillment, and there is no single school that is required for that (remember this?). I felt unbelievably blessed when I was admitted, but there are so many instances when opportunities are presented by sheer chance, or because you decide that you want something and then take proactive steps to pursue it. Just do that, and you’ll be ok :)

    In a nutshell

    If you pursue what you love, maximize fun and learning, take some risk, and never give up, graduate school is just awesome. Poldracklab, for the win. You know what to do.

    ·

  • Thesis Dump

    I recently submitted my completed thesis (cue albatross soaring, egg hatching, sunset roaring), but along the way I wanted a simple way to turn it into a website. I did this for fun, but it proved to be useful because my advisor wanted some of the text and didn’t want to deal with LaTeX. I used Overleaf because it had a nice Stanford template, and while it still pales in comparison to the commenting functionality that Google Docs offers, it seems to be the best currently available collaborative, template-based, online LaTeX editor. If you are familiar with it, you also know that you have a few options for exporting your documents. You can of course export the code (meaning .tex files for text, .bib for the bibliography, and .sty for styles, all zipped up), or you can have Overleaf compile it for you and download as PDF.

    Generating a site

    The task at hand is to convert something in LaTeX to HTML. If you’ve used LaTeX before, you know that there are already tools for that (hdlatex and docs). The hard part in this process was really just installing dependencies, a task that Docker is well suited for. Thus, I generated a Docker image that extracts files from the Overleaf zip, runs an hdlatex command to generate a static version of the document with appropriate links, and then you can push the static files to your Github pages, and you’re done! I have complete instructions in the README, and you can see the final generated site. It’s nothing special, basically just white bread that could use some improvement, but it meets its need for now. The one tiny feature is that you can specify the Google Font that you want by changing one line in generate.sh (default is Source Serif Pro):

        
    
    docker exec $CONTAINER_ID python /code/generate.py "Space Mono"
    
    

    Note that “Space Mono” is provided as a command line argument, overriding the Source Serif Pro default in generate.sh. Here is a look at the final output with Source Serif Pro:


    Advice for Future Students

    The entire thesis process wasn’t really as bad as many people make it out to be. Here are some pointers for those who are in the process of writing their theses, or who haven’t yet started.

    • Choose a simple, well-scoped project. Sure, you could start your dream work now, but it will be a lot easier to complete a well defined project, nail your PhD, and then save the world after. I didn’t even start the work that became my thesis until about a year and a half before the end of graduate school, so don’t panic if you feel like you are running out of time.
    • Early in graduate school, focus on papers. The reason is that you literally can have a paper be an entire chapter, and boom, there alone you’ve banged out 20-30 pages! Likely you will want to rewrite some of the content to have a different organization and style, but the meat is high quality. Having published work in a thesis is a +1 for the committee because it makes it easy for them to consider the work valid.
    • Start with an outline, and write a story around it. The biggest “new writing” I had to do for mine was an introduction with sufficient background and meat to tie all the work that I had done together. Be prepared to change this story, depending on feedback from your committee. I had started with a theme of “reproducible science,” but ultimately finished with a more niche, focused project.
    • For the love of all that is good, don’t put your thesis into LaTeX until AFTER it’s been edited, reviewed, and you’ve defended, made changes, and then have had your reading committee edit it again. I made the mistake of having everything ready to go for my defense, and going through another round of edits was a nightmare afterward. Whatever you do, there is going to be a big chunk of time that must be devoted to converting a Google Doc into LaTeX. I chose to do it earlier, but the cost of that is something that is harder to change later. If I did this again, I would just do this final step when it’s intended for: at the end!
    • Most importantly, graduate school isn’t about a thesis. Have fun, take risks, and spend much more time doing those other things. The thesis I finished, to be completely honest, is pretty niche, dry, and might only be of interest to a few people in the world. The real richness in graduate school, for me, was everything but the thesis! I wrote a poem about this a few months ago for a talk, and it seems appropriate to share it here:
    
    I don't mean to be facetious,
    but graduate school is not about a thesis.
    To be tenacious, tangible, and thoughtful,
    for inspired idea you must be watchful.
    The most interesting things you might miss
    because they can come with a scent of risk.
     
    In this talk I will tell a story,
    of my thinking throughout this journey.
    I will try to convince you, but perhaps not
    that much more can be learned and sought
    if in your work you are not complacent,
    if you do not rely on others for incent.
    When you steer away from expectation,
    your little car might turn into innovation.
     
    Graduate school between the lines,
    has hidden neither equation nor citation.
    It may come with a surprise -
    it's not about the dissertation.
    
    


    Uploading Warnings

    A quick warning - the downloaded PDF wasn’t considered by the Stanford online Axess portal to be a “valid PDF”:


    and before you lose your earlobes, know that if you open the PDF in any official Adobe Reader (I used an old version of Adobe Reader on a Windows machine) and save it again, the upload will work seamlessly! Also don’t panic when you first try to do this, period, and see this message:


    As the message says, if you come back in 5-10 minutes the page will exist!

    ·

  • Pokemon Ascii Avatar Generator

    An avatar is a picture or icon that represents you. In massive online multiplayer role playing games (MMORPGs) your “avatar” refers directly to your character, and the computer gaming company Origin Systems took this symbol literally in its Ultima series of games by naming the lead character “The Avatar.”

    Internet Avatars

    If you are a user of this place called the Internet, you will notice in many places that an icon or picture “avatar” is assigned to your user. Most of this is thanks to a service called Gravatar that makes it easy to generate a profile that is shared across sites. For example, in developing Singularity Hub I found that there are many Django plugins that make adding a user avatar to a page as easy as adding an image with a source (src) like https://secure.gravatar.com/avatar/hello.
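
    Under the hood, Gravatar builds that URL from an MD5 hash of the trimmed, lowercased email address. Here is a minimal Python sketch of the idea (the email address is made up, and the Django plugins wrap all of this for you):

    import hashlib

    def gravatar_url(email, style="retro", size=200):
        # Gravatar hashes the trimmed, lowercased email address with MD5,
        # and "d" / "s" select the default image style and size
        email_hash = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
        return "https://secure.gravatar.com/avatar/%s?d=%s&s=%s" % (email_hash, style, size)

    # e.g., gravatar_url("vsoch@example.com") --> a "retro" style avatar URL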

    The final avatar might look something like this:

    This is the “retro” design, and in fact we can choose from one of many:


    Command Line Avatars?

    I recently started making a command line application that would require user authentication. To make it more interesting, I thought it would be fun to give the user an identity, or minimally, something nice to look at when starting up the application. My mind immediately drifted to avatars, because an access token required for the application could equivalently be used as a kind of unique identifier, and a hash generated to produce an avatar. But how can we show any kind of graphic in a terminal window?


    Ascii to the rescue!

    Remember chain emails from the mid 1990s? There was usually some message compelling you to send the email to ten of your closest friends or face immediate consequences (cue diabolical flames and screams of terror). And on top of being littered with exploding balloons and kittens, ascii art was a common thing.

    
    
     __     __        _           _                     _ 
     \ \   / /       | |         | |                   | |
      \ \_/ /__  __ _| |__     __| | __ ___      ____ _| |
       \   / _ \/ _` | '_ \   / _` |/ _` \ \ /\ / / _` | |
        | |  __/ (_| | | | | | (_| | (_| |\ V  V / (_| |_|
        |_|\___|\__,_|_| |_|  \__,_|\__,_| \_/\_/ \__, (_)
                                                   __/ |  
                                                  |___/   
    
    


    Pokemon Ascii Avatars!

    I had a simple goal - to create a command line based avatar generator that I could use in my application. Could there be any cute, sometimes scheming characters that could be helpful toward this goal? Pokemon!! Of course :) Thus, the idea for the pokemon ascii avatar generator was born. If you want to skip the fluff and description, here is pokemon-ascii.

    Generate a pokemon database

    Using the Pokemon Database I wrote a script that produces a data structure that is stored with the module, and makes it painless to retrieve meta data and the ascii for each pokemon. The user can optionally run the script again to re-generate/update the database. It’s quite fun to watch!


    The Pokemon Database has a unique ID for each pokemon, and so those IDs are the keys for the dictionary (the json linked above). I also store the raw images, in case they are needed and not available, or in case (in the future) we want to generate the ascii programmatically (for example, to change the size or characters). I chose this “pre-generate” strategy over creating the ascii from the images on the fly because it’s slightly faster, but there are definitely good arguments for doing the latter.
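
    To make that concrete, an entry in the dictionary looks conceptually like the sketch below; the field names are illustrative, and the module’s actual keys may differ:

    # Illustrative sketch of the database structure: keyed by the unique
    # Pokemon Database ID, with the pre-generated ascii and metadata stored
    # alongside (actual field names in the module may differ).
    pokemons = {
        "25": {
            "name": "Pikachu",
            "ascii": "@@@@%%%...",        # pre-generated ascii art
            "img": "images/25.png",       # raw image, kept for re-generation
        },
    }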


    Method to convert image to ascii

    I first started with my own intuition, and decided to read in an image using the Image class from PIL, converting the RGB values to integers, and then mapping the integers onto the space of ascii characters, so each integer is assigned an ascii character. I had an idea to look at the number of pixels that were represented in each character (to get a metric of how dark/gray/intense each one was), that way the integer with value 0 (no color) could be mapped to an empty space. I would be interested if anyone has insight for how to derive this information. The closest thing I came to was determining the number of bytes that are needed for different data types:

    
    # String
    "s".__sizeof__()
    38
    
    # Integer
    x=1
    x.__sizeof__()
    24
    
    # Unicode
    unicode("s").__sizeof__()
    56
    
    # Boolean
    True.__sizeof__()
    24
    
    # Float
    float(x).__sizeof__()
    24
    
    

    Interesting, a float is equivalent to an integer. What about if there are decimal places?

    
    float(1.2222222222).__sizeof__()
    24
    
    

    Nuts! I should probably not get distracted here. I ultimately decided it would be most reasonable to just make this decision visually. For example, the @ character is a lot thicker than a ., so it would be farther to the right in the list. My first efforts rendering a pokemon looked something like this:


    I then was browsing around, and found a beautifully done implementation. The error in my approach was not normalizing the image first, and so I was getting a poor mapping between image values and characters. With the normalization, my second attempt looked much better:
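
    To make the normalization step concrete, here is a small sketch in the same spirit (my own simplification, not the implementation linked above): convert to grayscale, normalize the pixel values, and index into a ramp of characters ordered from sparse to dense:

    from PIL import Image

    # Characters ordered from "sparse" to "dense" (a visual judgment call)
    CHARS = [" ", ".", ":", "-", "=", "+", "*", "#", "%", "@"]

    def image_to_ascii(path, new_width=60):
        image = Image.open(path).convert("L")              # grayscale
        width, height = image.size
        new_height = int(height / float(width) * new_width)
        image = image.resize((new_width, new_height))
        pixels = list(image.getdata())
        # Normalize pixel values so the full character range gets used
        lo, hi = min(pixels), max(pixels)
        span = float(hi - lo) or 1.0
        chars = [CHARS[int((p - lo) / span * (len(CHARS) - 1))] for p in pixels]
        rows = ["".join(chars[i:i + new_width]) for i in range(0, len(chars), new_width)]
        return "\n".join(rows)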


    I ultimately modified this code slightly to account for the fact that characters tend to be thinner than they are tall. This meant that, even though the proportion / size of the image was “correct” when rescaling it, the images always looked too tall. To adjust for this, I modified the function to scale the new width by a factor of 2:

    
    def scale_image(image, new_width):
        """Resizes an image preserving the aspect ratio.
        """
        (original_width, original_height) = image.size
        aspect_ratio = original_height/float(original_width)
        new_height = int(aspect_ratio * new_width)
    
        # This scales it wider than tall, since characters are biased
        new_image = image.resize((new_width*2, new_height))
        return new_image
    
    

    Huge thanks, and complete credit, goes to the author of the original code, and a huge thanks for sharing it! This is a great example of why people should share their code - new and awesome things can be built, and the world generally benefits!

    Associate a pokemon with a unique ID

    Now that we have ascii images, each associated with a number from 1 to 721, we would want to be able to take some unique identifier (like an email or name) and consistently return the same image. I thought about this, and likely the basis for all of these avatar generators is to use the ID to generate a HASH, and then have a function or algorithm that takes the hash and maps it onto an image, or (cooler) selects from some range of features (e.g., nose, mouth, eyes) to generate a truly unique avatar. I came up with a simple algorithm to do something like this. I take the hash of a string, and then use modulus to get the remainder of that number divided by the number of pokemon in the database. This means that, given that the database doesn’t change, and given that the pokemon have unique IDs in the range of 1 to 721, you should always get the same remainder, and this number will correspond (consistently!) with a pokemon ascii. The function is pretty simple, it looks like this:

    
    # numpy is used for the modulus, and the database helpers live in pokemon.master
    from pokemon.master import catch_em_all, get_pokemon
    import numpy

    def get_avatar(string, pokemons=None, print_screen=True, include_name=True):
        '''get_avatar will return a unique pokemon for a specific avatar based on the hash
        :param string: the string to look up
        :param pokemons: an optional database of pokemon to use
        :param print_screen: if True, will print ascii to the screen (default True) and not return
        :param include_name: if True, will add name (minus end of address after @) to avatar
        '''
        if pokemons is None:
            pokemons = catch_em_all()

        # The IDs are numbers between 1 and the max
        number_pokemons = len(pokemons)
        pid = numpy.mod(hash(string), number_pokemons)
        pokemon = get_pokemon(pid=pid, pokemons=pokemons)
        avatar = pokemon[pid]["ascii"]
        if include_name:
            avatar = "%s\n\n%s" % (avatar, string.split("@")[0])
        if print_screen:
            print avatar
        else:
            return avatar
    
    

    …and the function get_pokemon takes care of retrieving the pokemon based on the id, pid.

    Why?

    On the surface, this seems very silly, however there are many good reasons that I would make something like this. First, beautiful, or fun details in applications make them likable. I would want to use something that, when I fire it up, subtly reminds me that in my free time I am a Pokemon master. Second, a method like this could be useful for security checks. A user could learn some image associated with his or her access token, and if this ever changed, he/she would see a different image. Finally, a detail like this can be associated with different application states. For example, whenever there is a “missing” or “not found” error returned for some function, I could show Psyduck, and the user would learn quickly that seeing Psyduck means “uhoh.”

    There are many more nice uses for simple things like this, what do you think?

    Usage

    The usage is quite simple, and this is taken straight from the README:

    
          usage: pokemon [-h] [--avatar AVATAR] [--pokemon POKEMON] [--message MESSAGE] [--catch]
    
          generate pokemon ascii art and avatars
    
          optional arguments:
            -h, --help         show this help message and exit
            --avatar AVATAR    generate a pokemon avatar for some unique id.
            --pokemon POKEMON  generate ascii for a particular pokemon (by name)
            --message MESSAGE  add a custom message to your ascii!
            --catch            catch a random pokemon!
    
    
    


    Installation

    You can install directly from pip:

    
          pip install pokemon
    
    

    or for the development version, clone the repo and install manually:

    
    
          git clone https://github.com/vsoch/pokemon-ascii
          cd pokemon-ascii
          sudo python setup.py sdist
          sudo python setup.py install
    
    


    Produce an avatar

    Just use the --avatar tag followed by your unique identifier:

    
          pokemon --avatar vsoch
    
          @@@@@@@@@@@@@*,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@,:::::::::,@@@:,S+@@@@@@@#:::::::,:@@@@@@@@@@@@@@@@
          @@@@@@@@:::::::::****@@**..@@@@@:::::::::::::S@@@@@@@@@@@@@@
          @@@@@@@::::***********#..%.S@@#::::::::::*****S@@@@@@@@@:?@@
          @@@@@@@*****?%%%%?***....SSSS#*****************#@@@@@@@@,::@
          @@@@@@@%***S%%%%%.#*S.SSSSSS?.****?%%%%%#*******@@@@@#:::::.
          @@@@@@@@****%%%%#...#SSSS#%%%?***?%%%%%%%******%@@@@%::::***
          @@@@@@@@@+****SSSSSS?SSS.%...%#SS#%....%.******@@@@@?*******
          @@@@@@@@@@@@#SSSSSSS#S#..%...%%.....%.%.******@@@@@@@#**.**@
          @@@@@@@@@@@..SSSSS..?#.%..%.......%.%.******#@@@@@@@@@@S%,@@
          @@@@@@@@@@#%........................%****%,@@@@@@@@@@@?%?@@@
          @@@@@@@@@@.#*@@@%.................%%%......SSS.SSS%@#%%?@@@@
          @@@@@@@@@%+*@@?,.%....%%.,@,@@,*%.%%%..%%%%.%....%...?#@@@@@
          @@@@@@@@@:*.@#+?,%%%%%.%,@??#@@@**......%........%...%%*@@@@
          @@@@@@@@@@.*.@##@...%%.+@##@?,@@****.....%...%....?...%%@@@@
          @@@@@@@@@@@.**+**#++SS***,*#@@@*****%%.%.......%%........@@@
          @@@@@@@@@@@@************************..........%%.%%...%%*@@@
          @@@@@@@@@@@@@@,?**?+***************.%........#....%%%%%%@@@@
          @@@@@@@@@@@@@@@@@%#*+....*******..%%..%%.....%%%%%%%%%%@@@@@
          @@@@@@@@@@@@@@+%%%%%%??#%?%%%???.%%%%%...%%%.**.**#%%%@@@@@@
          @@S#****.?......%%%%%%?%@@@@@:#........%?#*****.#***#@@@@@@@
          @***%.*S****S**.%%%%%%@@@@@@@S....%%..%@@+*+..%**+%#.@@@@@@@
          @%%..*..*%%+++++%%%%@@@@@@@...%.%%.%.%@@@,+#%%%%++++@@@@@@@@
          @:+++#%?++++++++%%@@.**%**.****#..%%%,@@@@@S+.?++++@@@@@@@@@
          @@@++?%%%?+++++#@@@**.********S**%%%@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@%++++?@@@@@@S%%*#%.**%%**..+%@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@@@@++++++++.S++++#@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@@@@@++%%%%?+++++@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@@@@@@*+#%%%+++%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    
          vsoch
    
    

    You can also use the functions from within Python:

    
          from pokemon.skills import get_avatar
     
          # Just get the string!
          avatar = get_avatar("vsoch",print_screen=False)
          print avatar
    
          # Remove the name at the bottom, print to screen (default)
          avatar = get_avatar("vsoch",include_name=False)
    
    


    Randomly select a Pokemon

    You might want to just randomly get a pokemon! Do this with the --catch command line argument!

    
          pokemon --catch
    
          @%,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          .????.@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          .???????S@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          :?????????#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          *?????????????*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @???????#?????###@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@,*.??#
          @?????,##,S???#####@@@@@@@@@@@@@@@@@@@@@@@@@@S##????????????
          @?????*,,,,,,########@@@@@@@@@@@@@@@@@:###????????????????#@
          @##????,,,,,,,,,#####@@@@@@@@@@@@@.######?????#?:#????????@@
          @####?#,,,,,,,,,,,##@@@@@@@@@@@@@@#######*,,,,,*##+?????+@@@
          @######,,,,,,,,,,,S@@@@@@@@@@@@@@#.,,,,,,,,,,,,,,:?####@@@@@
          @######,,,,,,,,,,%@@,S.S.,@@@@@@@,,,,,,,,,,,,,,,######@@@@@@
          @@#####,,,,,,,,.,,,,,,,,,,,,,,,*#,,,,,,,,,,,,,.#####:@@@@@@@
          @@@@@@@@@@.#,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,######@@@@@@@@@
          @@@@@@@@@,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,+######@@@@@@@@@@
          @@@@@@@@%,,,,,++:,,,,,,,,,,,,,,,,,,,,,@@:.######:@@@@@@@@@@@
          @@@@@@@:,,,:##@@@#,,,,,,,,,,,,?@S#,,,,,,@@@@@@@@@@@@@@@@@@@@
          @@@@@@@?,,,#######,,,,,,,,,,,#.@:##,,,:?@@@@@@@@@@@@@@@@@@@@
          @@@@@@@.,,S,??%?*,,,,,,,,,,,,####?%+,::%@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@?..*+,,,,,,*,,,,,,,,,,,+#S,::::*@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@%..*,,,,,,,,,,,,,,,,,,,:.*...%@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@.**::*::::::,,:::::::+.....@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@.@@@@?:**:::*::::::::::*...@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@?,,,,,,,,,:,##S::::**:::S#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@.,,,,,,:S#?##?########:#****#,@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@,%:*%,??#,,,,:*S##**:..****:,.*@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@+,,,,,,,,,,,,,,,,,,*...*:,.,@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@+,,,,,,,,,,,,,,,,,,?@@@@@*#?@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@*,,,,,,,,,,,,,,,,,,.@#########?@@@@@@@@@@@@@@@@
          @@@@@@@@@@@@@.*:,,,,,,,,,,,,,,:.##%,?#####????:@@@@@@@@@@@@@
          @@@@@@@@@@@@@@?.....*******....S@@@@@@:##?????@@@@@@@@@@@@@@
          @@@@@@@@@@@@@@S.+..********...#%@@@@@@@@@##,@@@@@@@@@@@@@@@@
          @@@@@@@@@@@#*,,,,*.#@@@@@@@..*:,,*S@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@@@+@,%,,,#@@@@@@@@@@,S,,,%,,:@@@@@@@@@@@@@@@@@@@@@@@
    
          Pichu
    
    
    

    You can equivalently use the --message argument to add a custom message to your catch!

    
          pokemon --catch --message "You got me!"
    
          @@@@@@@@@*.@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@...+@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@@@@@++++@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          :..+,@@+.+++%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @..++++S++++++.?...@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@:S.S+SSS.S%++.+++@@@@@@@@@@+.%.@@@@@@@@@@@@@@@@@@@@@@@@@@
          @@@@:SSSSSSSSSS,@@@@@@@,:,:.SS+.....+.@@@@@@@@@@@@@@@@@@@@@@
          @@@@,:%SS++SS.,.%,:,S,,,,+..%.........S.@@@@@@@@@@@@@@@@@@@@
          @@@@@,:*...:,,+,.,,,,,,,*%%%++++..+++SSS+@@@@@@@@@@@@@@@@@@@
          @@@@@@,,.....%:,,,:.:.,:.%%.SSSS++SS+%+S%,+@@@@@@@@@@@@@@@@@
          @@@@@@@*.....S...***+,,,%..%++,?SSS.%.%%%:,.,@@@@@@@@@@@@@@@
          @@@@@@@@,+**........,,,,....++S@,+%..#..%,,S..@@@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@*..:,,,,,%..%++S%%.%%.S%%,,*+.+@@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@S,,,,,,,,,%%%..SS..%?%%%,,,S+...@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@S.:::::::::%.%%S...%%%%:::*.....**@@@@@@@@@@
          @@@@@@@@@@@@@@@@.%%..:::::::S%%.?%%%%%:::....**,S,,:@@@@@@@@
          @@@@@@@@@@@@@@:::*%%%%?..*:::,.%%%%.,:*.%@@.*:,,,:,,S@....@@
          @@@@@@@@@@@@@:,:,::*.?%%%%%%?+*%%?.?%%%%%+@@,,,,,,,.++%++@@@
          @@@@@@@@@@@@@@*,,,,,**...*%%%%%%%%%%?++++++.@,,,,,SS+SS++@@@
          @@@@@@@@@@@@@,,.,S,,,,:....***%%?%++++++++++.@.,,+SSSSS.S+@@
          @@@@@@@@@@@@,,SSSS..:.%,:*..?%%??%%++++++.+S+@@@.S..%S.%.S++
          @@@@@@@@@@@,,S.....S::*.@@@%%%%@?%%#+++++%%%?S@@@@@.%.,@@...
          @@@@@@@@@@@:,,?.%%%::::@@@...%.@?.%.++++.+%%%%.@@@@..++@@@@@
          @@@@@@@@@@S,.%%.:,,,,,S@@@@@.?@@+SS,S..........@@@@@,@@@@@@@
          @@@@@@@@@@@+S...++.,,:@@@@@@@@@@@@@@@%....SSS+SS@@@@@@@@@@@@
    
          You got me!
    
    

    You can also catch pokemon in your python applications. If you are going to be generating many, it is recommended to load the database once and provide it to the function, otherwise it will be loaded each time.

    
          from pokemon.master import catch_em_all, get_pokemon
    
          pokemons = catch_em_all()
          catch = get_pokemon(pokemons=pokemons)
    
    
    

    I hope that you enjoy pokemon-ascii as much as I did making it!

    ·

  • How similar are my operating systems?


    A question that has spun out of one of my projects that I suspect would be useful in many applications but hasn’t been fully explored is comparison of operating systems. If you think about it, for the last few decades we’ve generated many methods for comparing differences between files. We have md5 sums to make sure our downloads didn’t poop out, and command line tools to quickly look for differences. We now have to take this up a level, because our new level of operation isn’t on a single “file”, it’s on an entire operating system. It’s not just your Mom’s computer, it’s a container-based thing (e.g., Docker or Singularity for non sudo environments) that contains a base OS plus additional libraries and packages. And then there is the special sauce, the application or analysis that the container was birthed into existence to carry out. It’s not good enough to have “storagey places” to dump these containers, we need simple and consistent methods to computationally compare them, organize them, and let us explore them.

    Similarity of images means comparing software

    An entire understanding of an “image” (or more generally, a computer or operating system) comes down to the programs installed, and files included. Yes, there might be various environmental variables, but I would hypothesize that the environmental variables found in an image have a rather strong correlation with the software installed, and we would do pretty well to understand the guts of an image from the body without the electricity flowing through it. This would need to be tested, but not quite yet.

    Thus, since we are working in Linux land, our problem is simplified to comparing file and folder paths. Using some software that I’ve been developing I am able to quickly derive lists of both of those things (for example, see here), and as a matter of fact, it’s not very hard to do the same thing with Docker (and I plan to do this en-masse soon).

    Two levels of comparisons: within and between images

    To start my thinking, I simplified this idea into two different comparisons. We can think of each file path like a list of sorted words. Comparing two images comes down to comparing these lists. The two comparisons we are interested in are:

    • Comparing a single file path to a second path, within the same image, or from another image.
    • Comparing an entire set of file paths (one image) to a (?different) set (a second image).

    I see each of these mapping nicely to a different goal and level of detail. Comparing a single path is a finer operation that is going to be useful for getting a detailed understanding of the differences between two images, and within one image it is going to let me optimize the comparison algorithm by first removing redundant paths. For example, take a look at the paths below:

    
          ./usr/include/moar/6model
          ./usr/include/moar/6model/reprs
    
    

    We don’t really need the first one because it’s represented in the second one. However, if some Image 1 has the first but not the second (and we are doing a direct comparison of things) we would miss this overlap. Thus, since I’m early in developing these ideas, I’m just going to choose the slower, less efficient method of not filtering anything yet. So how are we comparing images anyway?

    Three frameworks to start our thinking

    Given that we are comparing lists of files and/or folders, we can approach this problem in three interesting ways:

    1. Each path is a feature thing. I’m just comparing sets of feature things.
    2. Each path is list of parent –> child relationships, and thus each set of paths is a graph. We are comparing graphs.
    3. Each path is a document, and the components of the path (folders to files) are words. The set of paths is a corpus, and I’m comparing different corpora.


    Comparison of two images

    I would argue that this is the first level of comparison, meaning the rougher, higher level comparison that asks “how similar are these two things, broadly?” In this framework, I want to think about the image paths like features, and so a similarity calculation can come down to comparing two sets of things, and I’ve made a function to do this. It comes down to a ratio between the things they have in common (intersect) over the entire set of things:

    
          score = 2.0*len(intersect) / (len(pkg1)+len(pkg2))
    
    
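    (This is the Sørensen-Dice coefficient, if you like names for things.) With each image reduced to a set of paths, the whole calculation is only a few lines of Python; a simplified sketch of the idea, not the exact function in the package:

    def similarity(paths1, paths2):
        '''2 * |intersection| / (|set 1| + |set 2|), bounded between 0 and 1.'''
        paths1, paths2 = set(paths1), set(paths2)
        intersect = paths1.intersection(paths2)
        return 2.0 * len(intersect) / (len(paths1) + len(paths2))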

    I wasn’t sure if “the entire set of things” should include just folder paths, just files paths, or both, and so I decided to try all three approaches. As I mentioned previously, it also would need to be determined if we can further streamline this approach by filtering down the paths first. I started running this on my local machine, but then realized how slow and lame that was. I then put together some cluster scripts in a giffy, and the entire thing finished before I had finished the script to parse the result. Diabolical!


    I haven’t had a chance to explore these comparisons in detail yet, but I’m really excited, because there is nice structure in the data. For example, here is the metric comparing images using both files and folders:

    A shout out to plotly for the amazingly easy to use python API! Today was the first time I tried it, and I was very impressed how it just worked! I’m used to coding my own interactive visualizations from scratch, and this was really nice. :) I’m worried there is a hard limit on the number of free graphs I’m allowed to have, or maybe the number of views, and I feel a little squirmy about having it hosted on their server… :O

    Why do we want to compare images?

    Most “container storage” places don’t do a very good job of understanding the guts inside. If I think about Docker Hub, or Github, there are a ton of objects (scripts, containers, etc.) but the organization is largely manual with some search feature that is (programmatically) limited to the queries you can do. What we need is a totally automated, unsupervised way of categorizing and organizing these containers. I want to know if the image I just created is really similar to others, or if I could have chosen a better base image. This is why we need a graph, or a mapping of the landscape of images - first to understand what is out there, and then to help people find what they are looking for, and map what they are working on into the space. I just started this pretty recently, but here is the direction I’m going to stumble in.


    Generating a rough graph of images

    The first goal is to get an even bigger crapton of images, and try to do an estimate of the space. Graphs are easy to work with and visualize, so instead of sets (as we used above) let’s now talk about this in a graph framework. I’m going to try the following:

    1. Start with a big list of (likely) base containers (e.g., Docker library images)
    2. Derive similarity scores based on the rough approach above. We can determine likely parents / children based on one image containing all the paths of another plus more (a child), or a subset of the paths of the other (a parent). This will give us a bunch of tiny graphs, and pairwise similarity scores for all images (see the sketch after this list).
    3. Within the tiny graphs, define potential parent nodes (images) as those that have not been found to be children of any other images.
    4. For all neighbors / children within a tiny graph, do the equivalent comparison, but now on the level of files to get a finer detail score.
    5. Find a strategy to connect the tiny graphs. The similarity scores can do well to generate a graph of all nodes, but we would want a directional graph with nice detail about software installed, etc.
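
    Here is a toy sketch of the containment heuristic in step 2, assuming each image has already been reduced to a set of paths (the function and argument names are mine, not the package’s):

    def guess_relationship(name1, paths1, name2, paths2):
        '''A containment heuristic: if one image's paths are a strict subset
        of the other's, guess that the smaller image is the parent.'''
        paths1, paths2 = set(paths1), set(paths2)
        if paths1 < paths2:
            return "%s is a likely parent of %s" % (name1, name2)
        if paths2 < paths1:
            return "%s is a likely parent of %s" % (name2, name1)
        # otherwise fall back to the similarity score from above
        score = 2.0 * len(paths1 & paths2) / (len(paths1) + len(paths2))
        return "no containment, similarity %.2f" % score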



    The last few points are kind of rough, because I’m not ready yet to think about how to fine tune the graph given that I need to build it first. I know a lot of researchers think everything through really carefully before writing any code or trying things, but I don’t have patience for planning and not doing, and like jumping in, starting building, and adjusting as I go. On second thought, I might even want to steer away from Singularity to give this a first try. If I use Docker files that have a clear statement about the “parent” image, that means that I have a gold standard, and I can see how well the approach does to find those relationships based on the paths alone.

    Classifying a new image into this space

    Generating a rough heatmap of image similarity (and you could make a graph from this) isn’t too wild an idea, as we’ve seen above. The more challenging, and the reason that this functionality is useful, is quickly classifying a new image into this space. Why? I’d want to, on the command line, get either a list or open a web interface to immediately see the differences between two images. I’d want to know if the image that I made is similar to something already out there, or if there is a base image that removes some of the redundancy for the image that I made. What I’m leading into is the idea that I want visualizations, and I want tools. Our current understanding of an operating system looks like this:


    Yep, that’s my command line. Everything that I do, notably in Linux, I ssh, open a terminal, and I’ll probably type “ls.” If I have two Linuxy things like containers, do we even have developed methods for comparing them? Do they have the same version of Python? Is one created from the other? I want tools and visualization to help me understand these things.

    We don’t need pairwise comparisons - we need bases

    It would be terrible if, to classify a new image into this space, we had to compare it to every image in our database. We don’t need to, because we can compare it to some set of base images (the highest level of parent nodes that don’t have parents), and then classify it into the graph by walking down the tree, following the most similar path(s). These “base” images we might determine easily based on something like Dockerfiles, but I’d bet we can find them with an algorithm. To be clear, a base image is a kind of special case, for example, those “official” Docker library images like Ubuntu, or Nginx, or postgres that many others are likely to build off of. They are likely to have few to no parent images themselves. It is likely the case that people will add on to base images, and it is less likely they will subtract from them (when is the last time you deleted stuff from your VM when you were extending another image?). Thus, a base image can likely be found by doing the following:

    • Parse a crapton of Docker files, and find the images that are most frequently used (a rough sketch of this follows the list)
    • Logically, an image that extends some other image is a child of that image. We can build a graph/tree based on this
    • We can cut the tree at some low branch to define a core set of bases.
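
    A very rough sketch of that first bullet, assuming a pile of Dockerfiles already sitting on disk (the paths in the comment are made up):

    import collections

    def count_base_images(dockerfile_paths):
        '''Count how often each image appears in a FROM line.'''
        counts = collections.Counter()
        for path in dockerfile_paths:
            with open(path) as dockerfile:
                for line in dockerfile:
                    if line.strip().upper().startswith("FROM "):
                        counts[line.strip().split()[1]] += 1
        return counts

    # e.g., count_base_images(glob.glob("dockerfiles/*/Dockerfile")).most_common(10)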

    Questions and work in progress!

    I was working on something entirely different when I stumbled on this interesting problem. Specifically, I want a programmatic way to automatically label the software in an image. In order to do this, I need to derive interesting “tags.” An interesting tag is basically some software that is installed on top of the base OS. You see how this developed - I needed to derive a set of base OS, and I needed a way to compare things to them. I’ll get back to that, along with the other fun project that I’ve started to go with this - developing visualizations for comparing operating systems! This is for another day! If you are interested in the original work, I am developing a workflow interface using Singularity containers called Singularity Hub. Hubba, hubba!

    ·

  • Service Worker Resource Saver

    If you are like me, you probably peruse a million websites in a day. Yes, you’re an internet cat! If you are a “tabby” then you might find links of interest and leave open a million tabs, most definitely to investigate later (really, I know you will)! If you are an “Octocat” then “View Source” is probably your right click of choice, and you are probably leaving open a bunch of raw “.js” or “.css” files to look at something later. If you are an American cat, you probably have a hodge-podge of random links and images. If you are a perfectionist cat (siamese?), you might spend an entire afternoon searching for the perfect image of a donut (or other thing), and have some sub-optimal method for saving them. Not that I’ve ever done that…

    TLDR: I made a temporary stuff saver using service workers. Read on to learn more.


    How do we save things?

    There are an ungodly number of ways to keep lists of things, specifically Google Docs and Google Drive are my go-to places, and many times I like to just open up a new email and send myself a message with said lists. For more permanent things I’m a big fan of Google Keep and Google Save, but this morning I found a use case that wouldn’t quite be satisfied by any of these things. I had a need to keep things temporarily somewhere. I wanted to copy paste links to images and be able to see them all quickly (and save my favorites), but not clutter my well organized and longer term Google Save or Keep with these temporary lists of resources.

    Service Workers, to the rescue!

    This is a static URL that uses a service worker with the postMessage interface to send messages back and forth between a service worker and a static website. This means that you can save and retrieve links, images, and script URLs across windows and sessions! This is pretty awesome, because perhaps when you have a user save stuff you rely on the cache, but what happens if they clear it? You could use some kind of server, but what happens when you have to host things statically (Github pages, I’m looking at you!). There are so many simple problems where you have some kind of data in a web interface that you want to save, update, and work with across pages, and service workers are perfect for that. Since this was my first go, I decided to do something simple and make a resource saver. This demo is intended for Chrome, and I haven’t tested in other browsers. To modify, stop, or remove workers, visit chrome://serviceworker-internals.

    How does it work?

    I wanted a simple interface where I could copy paste a link, and save it to the cache, and then come back later and click on a resource type to filter my resources:

    I chose material design (lite) because I’ve been a big fan of its flat simplicity, and clean elements. I didn’t spend too much time on this interface design. It’s pretty much some buttons and an input box!

    The gist of how it works is this: you check if the browser can support service workers:

    
    if ('serviceWorker' in navigator) {
      // Supported! Go ahead and register the worker (shown in the next snippet)
    } else {
      Stuff.setStatus('Ruh roh! This browser does not support service workers.');
    }
    
    

    Note that the “Stuff” object is simply a controller for adding / updating content on the page. Given that we have browser support, we then register a particular javascript file, our service controller commands, to the worker:

    
      navigator.serviceWorker.register('service-worker.js')
        // Wait until the service worker is active.
        .then(function() {
          return navigator.serviceWorker.ready;
        })
        // ...and then show the interface for the commands once it's ready.
        .then(showCommands)
        .catch(function(error) {
          // Something went wrong during registration. The service-worker.js file
          // might be unavailable or contain a syntax error.
          Stuff.setStatus(error);
        });
    
    

    The magic of what the worker does, then, is encompassed in the “service-worker.js” file, which I borrowed from Google’s example application. This is important to take a look over and understand, because it defines different event listeners (for example, “activate” and “message”) that describe how our service worker will handle different events. If you look through this file, you are going to see a lot of the function “postMessage”, and actually, this is the service worker API way of getting some kind of event from the browser to the worker. It makes sense, then, that if you look in our javascript file, which has different functions fire off when the user interacts with buttons on the page, you are going to see a lot of calls to a function saveMessage that opens up a MessageChannel and sends our data to the worker. It’s like browser ping pong, with data instead of ping pong balls. You can open the console of the demo and type in any of “MessageChannel”, “sendMessage” or “postMessage” to see the functions in the browser:


    If we look closer at the sendMessage function, it starts to make sense what is going on. What is being passed and forth are Promises, which help developers (a bit) with the callback hell that is definitive of Javascript. I haven’t had huge experience with using Promises (or service workers), but I can tell you this is something to start learning and trying out if you plan to do any kind of web development:

    
    function sendMessage(message) {
      // This wraps the message posting/response in a promise, which will resolve if the response doesn't
      // contain an error, and reject with the error if it does. If you'd prefer, it's possible to call
      // controller.postMessage() and set up the onmessage handler independently of a promise, but this is
      // a convenient wrapper.
      return new Promise(function(resolve, reject) {
        var messageChannel = new MessageChannel();
        messageChannel.port1.onmessage = function(event) {
          if (event.data.error) {
            reject(event.data.error);
          } else {
            resolve(event.data);
          }
        };
    
        // This sends the message data as well as transferring messageChannel.port2 to the service worker.
        // The service worker can then use the transferred port to reply via postMessage(), which
        // will in turn trigger the onmessage handler on messageChannel.port1.
        // See https://html.spec.whatwg.org/multipage/workers.html#dom-worker-postmessage
        navigator.serviceWorker.controller.postMessage(message,
          [messageChannel.port2]);
      });
    }
    
    

    The documentation is provided from the original example, and it’s beautiful! The simple functionality I added is to parse the saved content into different types (images, script/style and other content)


    …as well as download a static list of all of your resources (for quick saving).


    More content-specific link rendering

    I’m wrapping up playing around for today, but wanted to leave a final note. As usual, after an initial bout of learning I’m unhappy with what I’ve come up with, and want to minimally comment on the ways it should be improved. I’m just thinking of this now, but it would be much better to have one of the parsers detect video links (from youtube or what not) and have them rendered in a nice player. It would also make sense to have a share button for one or more links, and parsing into a data structure to be immediately shared, or sent to something like a Github gist. I’m definitely excited about the potential for this technology in web applications that I’ve been developing. For example, in some kind of workflow manager, a user would be able to add functions (or containers, in this case) to a kind of “workflow cart” and then when he/she is satisfied, click an equivalent “check out” button that renders the view to dynamically link them together. I also imagine this could be used in some way for collaboration on documents or web content, although I need to think more about this one.

    Demo the Stuff Saver

    ·

  • Neo4J and Django Integration

    What happens when a graph database car crashes into a relational database car? You get neo4-django, of course! TLDR: you can export cool things like this from a Django application:

    Neo4j-Django Gist

    I’ve been working on the start of version 2.0 of the Cognitive Atlas, and the process has been a little bit like stripping a car, and installing a completely new engine while maintaining the brand and look of the site. I started with pages that looked like this:


    meaning that fixing up this site comes down to inferring the back end functionality from this mix of Javascript / HTML and styling, and turning them into Django templates working with views that have equivalent functionality.

    Neo For What?

    Neo4J is a trendy graph database that emerged in 2007, but I didn’t really hear about it until 2014 or 2015, when I played around with it to visualize the nidm data model, a view of the Cognitive Atlas, and the NIF ontology (which seems like it was too big to render in a gist). It’s implemented in Java, and what makes it a “graph” database is the fact that it stores nodes and relationships. This is a much different model than traditional relational databases, which work with things in tables. There are pros and cons to each, but for a relatively small graph that warrants basic similarity metrics, graph visualizations, and an API, I thought Neo4j was a good fit. Now let’s talk about how I got it to work with Django.
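
    To make “nodes and relationships” concrete, here is a minimal sketch (not the Cognitive Atlas code) that creates two nodes and a relation by sending a cypher statement to Neo4j’s HTTP transactional endpoint with requests. The URL, credentials, and node names are placeholders for a local Neo4j 2.x/3.x server:

    
    import requests
    
    # Placeholder local endpoint and credentials (Neo4j 2.x/3.x transactional API)
    url = "http://localhost:7474/db/data/transaction/commit"
    query = ("CREATE (c1:concept {name:'working memory'})-[:KINDOF]->"
             "(c2:concept {name:'memory'}) RETURN c1, c2")
    
    payload = {"statements": [{"statement": query}]}
    response = requests.post(url, json=payload, auth=("neo4j", "neo4j"))
    print(response.json())
    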

    Django Relational Models

    Django is based on the idea of models. A model is a class of objects directly linked to objects in the relational database, so if I want to keep track of my pet marshmallows, I might make a django application called “marshdb” and I can do something like the following:

    
    from django.db import models
    
    class Marshmallow(models.Model):
        is_toasted = models.BooleanField(default=True)
        name = models.CharField(max_length=30)
    
    

    and then I can search, query, and interact with my marshmallow database with very intuitive functionality:

    
    from marshdb.models import Marshmallow
    
    # All the marshmallows!
    nomnom = Marshmallow.objects.all()
    
    # Find me the toasted ones
    toasted_mallows = Marshmallow.objects.filter(is_toasted=True)
    
    # How many pet marshmallows do I have?
    marshmallow_count = Marshmallow.objects.count()
    
    # Find Larry
    larry = Marshmallow.objects.get(name="Larry")
    
    
    

    Neo4Django?

    Django is fantastic - it makes it possible to create an entire site and database backend, along with whatever plugins you might want, in even the span of a weekend! My first task was to figure out how to integrate a completely different kind of database into this relational infrastructure. Django provides ample detail on how to instantiate your own models, but plugging in an entirely different kind of database is not trivial. I found neo4django, but it wasn’t updated for newer versions of Django, and it didn’t seem to take a clean and simple approach to integrating Neo4j. Instead, I decided to come up with my own solution.

    Step 1: Dockerize!

    Deployment and development are much easier with Docker, period. Need neo4j running via Docker? Kenny Bastani (holy cow he’s just in San Mateo! I could throw a rock at him!) has a solution for that! Basically, I bring in the neo4j container:

    
    graphdb:
      image: kbastani/docker-neo4j:latest
      ports:
       - "7474:7474"
       - "1337:1337"
      links:
       - mazerunner
       - hdfs
    
    

    and then link it to a docker image that is running the Django application:

    
    uwsgi:
        image: vanessa/cogat-docker
        command: /code/uwsgi.sh
        restart: always
        volumes:
            - .:/code
            - /var/www/static
        links:
            - postgres
            - graphdb
    
    

    You can look at the complete docker-compose file, and Kenny’s post on the mazerunner integration for integrating graph analytics with Apache Spark.

    This isn’t actually the interesting part, however. The fun and interesting bit is getting something that looks like a Django model for the user to interact with, but entirely isn’t :).

    Step 2: The Query Module

    As I said previously, I wanted this to be really simple. I created a Node class that includes the same basic functions as a traditional Django model (get, all, filter, etc.), and added a few new ones:

    
        def link(self,uid,endnode_id,relation_type,endnode_type=None,properties=None):
            '''link will create a new link (relation) from a uid to a relation, first confirming
            that the relation is valid for the node
            :param uid: the unique identifier for the source node
            :param endnode_id: the unique identifier for the end node
            :param relation_type: the relation type
            :param endnode_type: the type of the second node. If not specified, assumed to be same as startnode
            :param properties: properties to add to the relation
            '''
    
    

    … blarg blarg blarg

    
    
        def cypher(self,uid,lookup=None,return_lookup=False):
            '''cypher returns a data structure with nodes and relations for an object to generate a gist with cypher
            :param uid: the node unique id to look up
            :param lookup: an optional lookup dictionary to append to
            :param return_lookup: if true, returns a lookup with nodes and relations that are added to the graph
            '''
            base = self.get(uid)[0]
    
    
    

    and then I might instantiate it like this for the “Concept” node:

    
    class Concept(Node):
    
        def __init__(self):
            self.name = "concept"
            self.fields = ["id","name","definition"]
            self.relations = ["PARTOF","KINDOF","MEASUREDBY"]
            self.color = "#3C7263" # sea green
    
    

    and you can see that generally, I just need to define the fields, relations, and name of the node in the graph database to get it working. Advanced functionality that might be needed for specific node types can be implemented for those particular classes.

    Functionality for any node in the graph can be added to the “Node” class. The function “link” for example, will generate a relationship between an object and some other node, and “cypher” will produce node and link objects that can be rendered immediately into a neo4j gist. This is where I see the intersection of Django and Neo4j - adding graph functions to their standard model. Now how to visualize the graph? I like developing my own visualizations, and made a general, searchable graph run by the application across all node types:
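
    As a concrete (hypothetical) example of how these classes get used, here is a usage sketch. The lookup ids, the filter signature, and the import path are assumptions rather than the actual API, but the general pattern is the one described above:

    
    # A hypothetical usage sketch of the classes above
    # from cognitive.apps.atlas.query import Concept   # assumed import path
    
    concept = Concept()
    
    everything = concept.all()                          # all concept nodes
    memory = concept.filter(field="name", value="memory")
    node = concept.get("trm_12345")[0]                  # get returns a list of matches
    
    # the graph-specific additions
    concept.link(uid="trm_12345", endnode_id="trm_67890", relation_type="KINDOF")
    graph = concept.cypher("trm_12345")                 # nodes + relations for a gist
    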


    However I realized that a user is going to want more power than that to query, make custom views, and further, share them. The makers of Neo4j were smart, and realized that people might want to share snippets of code as github gists to make what they call a graph gist. I figured, why not generate the cypher code at a URL that can immediately be rendered into a preview, and then optionally exported and saved by the user? The awesome part of this is that it offloads the computation of the graph from the Cognitive Atlas server, and you can save views of the graph. For example, here is a gist that shows a view of the working memory fMRI task paradigm. If you’re a visual learner, you can learn from looking at the graph itself:


    You can see example cypher queries, with results rendered into clean tables:


    and hey, you can write your own queries against the graph!


    This is a work in progress and it’s not perfect, but I’m optimistic about the direction it’s going in. If more ontologies / graph representations of knowledge were readily explorable, and sharable in this way, the semantic web would be a lot easier to understand and navigate.

    Relational Database Help

    Why then should we bother to use a relational database via Django? I chose this strategy because it keeps the model of the Cognitive Atlas separate from any applications deploying or using it. It provides a solid infrastructure for serving a RESTful API:


    and basic functionalities like storing information about users, and any (potential) future links to automated methods to populate it, etc.

    General Thinking about Visualization and Services

    This example gets at a general strategy that is useful to consider when building applications, and that is the idea of “outsourcing” some of your analysis or visualization to third parties. In the case of things that just need a web server, you might store code (or text) in a third party service like Github or Dropbox, and use something like Github Pages or another third party to render a site. In the case of things that require computation, you can take advantage of Continuous Integration to do much more than run tests. In this example, we outsourced a bit of computation and visualization. In the case of developing things that are useful for people, I sometimes think it is more useful to build a generic “thing” that can take some standard data object (eg, some analysis result, data, or text file) and render it into a more programmatic data structure that can plug into (some other tool), making it relatable to other individuals’ general “things.” I will spend some time in another post to more properly articulate this idea, but the general takeaway is that as a user you should be clever when you are looking for a certain functionality, and as a developer you should aim to provide general functions that have wide applicability.

    Cognitive Atlas 2.0

    The new version of the Cognitive Atlas has so far been a fun project I’ve worked on in free time, and I would say you can expect to see cool things develop in the next year or two, even if I’m not the one to push the final changes. In the meantime, I encourage all researchers working with behavioral or cognitive paradigms, perhaps using the Experiment Factory or making an assertion about a brain map capturing a cognitive paradigm in the NeuroVault database, to do this properly by defining their paradigms and cognitive concepts in the current version of the Cognitive Atlas. If you have feedback or want to contribute to developing this working example of integrating Neo4j and Django, please jump in. Even a cool idea would be a fantastic contribution. Time to get back to work! Well, let’s just call this “work,” since I can’t say I’m doing much more than walking around and smiling like an idiot in this final lap of graduate school. :)


    ·

  • The Elusive Donut

    Elusive Donut

    A swirl of frosting and pink
    really does make you think
    Take my hunger away, won’t ‘ut?
    Unless you’ve browsed an elusive donut!

    elusive donut

    ·

  • Interactive Components for Visualizations

    If you look at most interactive visualizations that involve something like D3, you tend to see lots of circles, bars, and lines. There are several reasons for this. First, presenting information simply and cleanly is optimal to communicate an idea. If you show me a box plot that uses different Teletubbies as misshapen bars, I am likely going to be confused, a little angry, and miss the data lost in all the tubby. Second, basic shapes are the default icons built into these technologies, and any deviation from them takes extra work.

    Could there be value in custom components?

    This begs the question - if we were able to, would the world of data science specific to generating interactive visualizations be improved with custom components? I imagine the answer to this question is (as usual), “It depends.” The other kind of common feature you see in something like D3 is a map. The simple existence of a map, and an ability to plot information on it, adds substantially to our ability to communicate something meaningful about a pattern across geographical regions. The human mind is quicker to understand something with geographic salience overlayed on a map than the same information provided in some kind of chart with labels corresponding to geographic regions. Thus, I see no reason that we cannot have other simple components for visualizations that take advantage of a familiarity that brings some intuitive understanding of a domain or space.

    A Bodymap

    My first idea (still under development) was to develop a simple map for the human body. I can think of so many domains crossing medicine, social media, and general science that have a bodygraphic salience. I was in a meeting with radiologists many weeks ago, and I found it surprising that there weren’t standard templates for an entire body (we have them for brain imaging). A standard template for a body in the context of radiology is a much different goal than one for a visualization, but the same reality rings true. I decided that a simple approach would be to take a simple representation, transform it into a bunch of tiny points, and then annotate different sets of points with different labels (classes). The labels can then be selected dynamically with any kind of web technology (d3, javascript, jquery, etc.) to support an interactive visualization. For example, we could parse a set of documents, extract mentions of body parts, and then use the Bodymap like a heatmap to show the prevalence of the different terms.
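
    As a toy sketch of that last idea (the term list and documents here are made up), counting mentions of body terms across documents is all you need to drive a heatmap over the Bodymap:

    
    import re
    from collections import Counter
    
    terms = ["head", "shoulder", "knee", "calf"]
    documents = ["my head hurts and my knee aches",
                 "knee surgery went better than expected"]
    
    # count whole-word mentions of each term across all documents
    counts = Counter()
    for doc in documents:
        for term in terms:
            counts[term] += len(re.findall(r"\b%s\b" % term, doc.lower()))
    
    print(counts)  # Counter({'knee': 2, 'head': 1, 'shoulder': 0, 'calf': 0})
    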

    Generating an svg pointilism from any png image

    My first task was to be able to take any png image and turn it into a set of points. I first stupidly opened up Inkscape and figured out how to use the clone tool to generate a bunch of points. Thankfully I realized quickly that before I made my BodyMap tool, I needed general functions for working with images and svg. I am in the process of creating svgtools for this exact goal! For example, with this package you can transform a png image into an svg (pointilism thing) with one function:

    
    from svgtools.generate import create_pointilism_svg
    
    # Select your png image!
    png_image = "data/body.png"
    
    # uid_base will be the id of the svg
    # sample rate determines the space between points (larger --> more space)
    
    create_pointilism_svg(png_image,uid_base="bodymap",
                                    sample_rate=8,width=330,height=800,
                                    output_file="data/bodymap.svg")
    
    

    script

    I expect to be adding a lot of different (both manual and automated) methods here for making components, so keep watch of the package if interested.

    This allowed me to transform this png image:

    into a “pointilism svg” (this is for a sampling rate of 8, meaning that I add more space between the points)

    actual svg can be seen here

    Great! Now I need some way to label the paths with body regions, so I can build stuff. How to do that?

    Terms and relationships embedded in components

    We want to be able to (manually and automatically) annotate svg components with terms. This is related to a general idea that I like to think about - how can we embed data structures in visualizations themselves? An svg object (a scalable vector graphic) is in fact just an XML document, which is also a data structure for holding (you guessed it, data!). Thus, if we take a set of standard terms and relationships between them (i.e., an ontology), we can represent the terms as labels in an image, and the relationships by the relationship between the objects (eg, “eye” is “part of” the “head” is represented by way of the eye literally being a part of the head!). My first task, then, was to take terms from the Foundational Model of Anatomy (FMA) and use them as tags for my BodyMap.

    A little note about ontologies - they are usually intended for a very specific purpose. For example, the FMA needs to be detailed enough for use in science and medicine. However, if I’m extracting “body terms” from places like Twitter or general prose, I can tell you with almost certainty that you might find a term like “calf” but probably not “gastrocnemius.” My first task was to come up with a (very simple) list of terms from the FMA that I thought would be likely to be seen in general conversation or places like the Twitterverse. It’s not an all-encompassing set, but it’s a reasonable start.

    Annotation of the BodyMap

    I then had my svg for annotation, and I had my terms; how to do the annotation? I built myself a small interface for exactly this goal. You load your svg images and labels, and then draw circles around points you want to select; for example, here I have selected the head:

    and then you can select terms from your vocabulary:

    and click annotate! The selection changes to indicate that the annotation has been done.

    Selecting a term and clicking “view” will highlight the annotation, in case you want to see it again. When you are finished, you can save the svg, and see that the annotation is present for the selected paths via an added class attribute:
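
    Under the hood it’s nothing magical: an svg is just XML, so the annotation is simply a class attribute added to the selected elements. Here is a sketch of that idea in python (the element ids and the term are made up, and the real interface does this in the browser):

    
    import xml.etree.ElementTree as ET
    
    SVG_NS = "http://www.w3.org/2000/svg"
    ET.register_namespace("", SVG_NS)            # keep the svg namespace clean on write
    
    tree = ET.parse("data/bodymap.svg")
    selected = {"bodymap_101", "bodymap_102"}    # hypothetical point ids
    
    # tag the selected paths with the chosen term as a class attribute
    for path in tree.getroot().iter("{%s}path" % SVG_NS):
        if path.get("id") in selected:
            path.set("class", "head")
    
    tree.write("data/bodymap_annotated.svg")
    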

    This is the simple functionality that I desired for this first round, and I imagine I’ll add other things as I need them. And again, ideally we will have automated methods to achieve these things in the long run, and we would also want to be able to take common data structures and images, convert them seamlessly into interactive components, and maybe even have a database for users to choose from. Imagine if we had a database of standard components for use, we could use them as features to describe visualizations, and get a sense of what the visualization is representing by looking at it statically. We could use methods from image processing and computer vision to generate annotated components automatically, and blur the line between what is data and what is visual. Since this is under development and my first go, I’ll just start by doing this annotation myself. I just created the svgtools package and this interface today, so stay tuned for more updates!

    annotation interface demo

    ·

  • Visualizations, Contain yourselves!

    Visualizing things is really challenging. The reason is that it’s relatively easy to make a visualization that is too complex for what it’s trying to show, and much harder to make a visualization catered to a specific analysis problem. Simplicity is usually the best strategy, but while standard plots (e.g., scatter, box and whisker, histogram) are probably ideal for publications, they aren’t particularly fun to think about. You also have the limitations of your medium and where you paint the picture. For example, a standard web browser will get slow when you try to render ~3000 points with D3. In these cases you are either trying to render too many, or you need a different strategy (e.g., render points on canvas, trading interactivity for a larger number of points).

    I recently embarked on a challenge to visualize a model defined at every voxel in the brain (a voxel is a little 3D cube of brain landscape associated with an X,Y,Z coordinate). Why would I want to do this? I won’t go into details here, but with such models you could predict what a statistical brain map might look like based on cognitive concepts, or predict a set of cognitive concepts from a brain map. This work is still being prepared for publication, but we needed a visualization because the diabolical Poldrack is giving a talk soon, and it would be nice to have some way to show output of the models we had been working on. TLDR: I made a few Flask applications and shoved them into Docker containers with all necessary data, and this post will review my thinking and design process. The visualizations are in no way “done” (whatever that means) because there are details and fixes remaining.

    Step 1: How to cut the data

    We have over 28K models, each built from a set of ~100 statistical brain maps (yes, tiny data) with 132 cognitive concepts from the Cognitive Atlas. When you think of the internet, it’s not such big data, but it’s still enough to make putting it in a single figure challenging. Master Poldrack had sent me a paper from the Gallant Lab, and directed me to Figure 2:

    Gallant lab figure 2

    I had remembered this work from the HIVE at Stanford, and what I took away from it was the idea for the strategy. If we wanted to look at the entire model for a concept, that’s easy, look at the brain maps. If we want to understand all of those brain maps at one voxel, then the visualization needs to be voxel-specific. This is what I decided to do.

    Step 2: Web framework

    Python is awesome, and the trend for neuroimaging analysis tools is moving toward Python dominance. Thus, I decided to use a small web framework called Flask that makes data –> server –> web almost seamless. It takes a template approach, meaning that you write views for a python-based server to render, and they render using jinja2 templates. You can literally make a website in under 5 minutes.
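
    As a minimal sketch (not the actual application), the whole Flask pattern is a view function rendered from a jinja2 template, assuming a templates/index.html exists next to the script:

    
    from flask import Flask, render_template
    
    app = Flask(__name__)
    
    @app.route("/")
    def index():
        # Flask looks for templates/index.html by default
        return render_template("index.html")
    
    if __name__ == "__main__":
        app.run(debug=True)
    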

    Step 3: Data preparation

    This turned out to be easy. I could generate either tab delimited or python pickled (think a serialized python object) files, and store them with the visualizations in their respective Github repos.
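
    For example (a sketch with made up data), pandas can write either format with one call each:

    
    import pandas
    
    df = pandas.DataFrame({"voxel": [1, 2, 3], "weight": [0.3, -0.1, 0.0]})
    df.to_csv("weights.tsv", sep="\t")   # tab delimited
    df.to_pickle("weights.pkl")          # python pickle
    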

    Regions from the AAL Atlas

    At first, I generated views to render a specific voxel location, some number from 1..28K that corresponded with an X,Y,Z coordinate. The usability of this is terrible. Is someone really going to remember that voxel N corresponds to “somewhere in the right Amygdala?” Probably not. What I needed was a region lookup table. I hadn’t decided yet how it would work, but I knew I needed to make it. First, let’s import some bread and butter functions!

    
    import numpy
    import pandas
    import nibabel
    import requests
    import xmltodict
    from nilearn.image import resample_img
    from nilearn.plotting import find_xyz_cut_coords
    
    

    The requests library is important for getting anything from a URL into a python program. nilearn is a nice machine learning library for python (that I usually don’t use for machine learning at all, but rather for the helper functions), and xmltodict will do exactly that: convert an xml file into a superior data format :). First, we are going to use the NeuroVault REST API to obtain both a nice brain map and the labels for it. In the wrapper script that runs this particular python script, we have already downloaded the brain map itself, and now we are going to load it, resample it to 4mm voxels (to match the data in our model), and then associate a label with each voxel:

    
    # OBTAIN AAL2 ATLAS FROM NEUROVAULT
    
    data = nibabel.load("AAL2_2.nii.gz")
    img4mm = nibabel.load("MNI152_T1_4mm_brain_mask.nii.gz")
    
    # Use nilearn to resample - nearest neighbor interpolation to maintain atlas
    aal4mm = resample_img(data,interpolation="nearest",target_affine=img4mm.get_affine())
    
    # Get labels
    labels = numpy.unique(aal4mm.get_data()).tolist()
    
    # We don't want to keep 0 as a label
    labels.sort()
    labels.pop(0)
    
    # OBTAIN LABEL DESCRIPTIONS WITH NEUROVAULT API
    url = "http://neurovault.org/api/atlases/14255/?format=json"
    response = requests.get(url).json()
    
    

    We now have a json object with a nice path to the labels xml! Let’s get that file, convert it to a dictionary, and then parse away, Merrill.

    
    # This is an xml file with label descriptions
    xml = requests.get(response["label_description_file"])
    doc = xmltodict.parse(xml.text)["atlas"]["data"]["label"]  # convert to a superior data structure :)
    
    

    Pandas is a module that makes nice data frames. You can think of it like a numpy matrix, but with nice row and column labels, and functions to sort and find things.
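
    As a tiny sketch of that “sort and find” functionality (the values here are made up):

    
    import pandas
    
    df = pandas.DataFrame({"name": ["Amygdala_L", "Amygdala_R"], "value": [41, 42]})
    
    df.sort_values("name")      # sort rows by a column
    df[df.value == 42]          # find rows matching a condition
    df.loc[0, "name"]           # index by row label and column name
    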

    
    # We will store region voxel value, name, and a center coordinate
    regions = pandas.DataFrame(columns=["value","name","x","y","z"])
    
    # Count is the row index, fill in data frame with region names and indices
    count = 0
    for region in doc:
        regions.loc[count,"value"] = int(region["index"]) 
        regions.loc[count,"name"] = region["name"] 
        count+=1
    
    

    I didn’t actually use this in the visualization, but I thought it might be useful to store a “representative” coordinate for each region:

    
    # USE NILEARN TO FIND REGION COORDINATES (the center of the largest activation connected component)
    for region in regions.iterrows():
        label = region[1]["value"]
        roi = numpy.zeros(aal4mm.shape)
        roi[aal4mm.get_data()==label] = 1
        nii = nibabel.Nifti1Image(roi,affine=aal4mm.get_affine())
        x,y,z = [int(x) for x in find_xyz_cut_coords(nii)]
        regions.loc[region[0],["x","y","z"]] = [x,y,z]
    
    

    and then save the data to file, both the “representative” coords, and the entire aal atlas as a squashed vector, so we can easily associate the 28K voxel locations with regions.

    
    # Save data to file for application
    regions.to_csv("../data/aal_4mm_region_coords.tsv",sep="\t")
    
    # We will also flatten the brain-masked imaging data into a vector,
    # so we can select a region x,y,z based on the name
    region_lookup = pandas.DataFrame(columns=["aal"])
    region_lookup["aal"] = aal4mm.get_data()[img4mm.get_data()!=0]
    
    region_lookup.to_pickle("../data/aal_4mm_region_lookup.pkl")
    
    

    script

    For this first visualization, that was all that was needed in the way of data prep. The rest of the files I already had on hand, nicely formatted, from the analysis code itself.

    Step 4: First Attempt: Clustering

    My first idea was to do a sort of “double clustering.” I scribbled the following into an email late one night:

    …there are two things we want to show. 1) is relationships between concepts, specifically for that voxel. 2) is the relationship between different contrasts, and then how those contrasts are represented by the concepts. The first data that we have that is meaningful for the viewer are the tagged contrasts. For each contrast, we have two things: an actual voxel value from the map, and a similarity metric to all other contrasts (spatial and/or semantic). A simple visualization would produce some clustering to show to the viewer how the concepts are similar / different based on distance. The next data that we have “within” a voxel is information about concepts at that voxel (and this is where the model is integrated). Specifically - a vector of regression parameters for that single voxel. These regression parameter values are produced via the actual voxel values at the map (so we probably would not use both). What I think we want to do is have two clusterings - first cluster the concepts, and then within each concept bubble, show a smaller clustering of the images, clustered via similarity, and colored based on the actual value in the image (probably some shade of red or blue).

    Yeah, please don’t read that. The summary is that I would show clusters of concepts, and within each concept cluster would be a cluster of images. Distance on the page, from left to right, would represent the contribution of the concept cluster to the model at the voxel. This turned out pretty cool:

    You can mouse over a node, which is a contrast image (a brain map) associated with a particular cognitive concept, and see details (done by way of tipsy). Only concepts that have a weight (weight –> importance in the model) that is not zero are displayed (and this reduces the complexity of the visualization quite a bit), and the nodes are colored and sized based on their value in the original brain map (red/big –> positive, and blue/small –> negative):

    You can use the controls in the top right to expand the image, save as SVG, link to the code, or read about the application:

    You can also select a region of choice from the dropdown menu, which uses select2 to complete your choice. At first I showed the user the voxel location I selected as “representative” for the region, but I soon realized that there were quite a few large regions in the AAL atlas, and that it would be incorrect and misleading to select a representative voxel. To embrace the variance within a region but still provide meaningful labels, I implemented it so that a user can select a region, and a random voxel from the region is selected:

    
        ...
        # Look up the value of the region
        value = app.regions.value[app.regions.name==name].tolist()[0]
        
        # Select a voxel coordinate at random
        voxel_idx = numpy.random.choice(app.region_lookup.index[app.region_lookup.aal == value],1)[0]
    
        return voxel(voxel_idx,name=name)
    
    

    Typically, Flask view functions return… views :). In this case, the view returned is the original one that I wrote (the function is called voxel) to render a view based on a voxel id (from 1..28K). The user just sees a dropdown to select a region:

    Finally, since multiple images can be tagged with the same concept, you can mouse over a concept label to highlight its nodes in the visualization, and mouse over a node to highlight the concepts associated with that image. We also obtain a sliced view of the image from NeuroVault to show to the user.

    Check out the full demo

    Step 5: Problems with First Attempt

    I first thought it was a pretty OK job, until my extremely high-standard brain started to tell me how crappy it was. The first problem is that the same image is shown for every concept it’s relevant for, and that’s both redundant and confusing. It also makes no sense at all to be showing an entire brain map when the view is defined for just one voxel. What was I thinking?

    The second problem is that the visualization isn’t intuitive. It’s a bunch of circles floating in space, and you have to read the “about” very carefully to say “I think I sort of get it.” I tried to use meaningful things for color, size, and opacity, but it doesn’t really give you a sense of anything other than, maybe, magnetic balls floating in gray space.

    I thought about this again. What a person really wants to know, quickly, is:

    1) Which cognitive concepts are associated with the voxel?
    2) How much?
    3) How do the concepts relate in the ontology?

    I knew very quickly that the biggest missing component was some representation of the ontology. How was “recognition” related to “memory” ? Who knows! Let’s go back to the drawing table, but first, we need to prepare some new data.

    Step 6: Generating a Cognitive Atlas Tree

    A while back I added some functions to pybraincompare to generate d3 trees from ontologies, or anything you could represent with triples. Let’s do that with the concepts in our visualization to make a simple json structure that has nodes with children.
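
    The end result is a nested structure in which every node has a name and a list of children, roughly like the sketch below (written here as a python dict; the actual pybraincompare output may differ in its exact fields):

    
    tree = {
        "name": "BASE",
        "children": [
            {"name": "MEMORY",
             "children": [
                 {"name": "WORKING MEMORY", "children": []},
                 {"name": "LONG TERM MEMORY", "children": []},
             ]},
            {"name": "PERCEPTION", "children": []},
        ],
    }
    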

    
    from pybraincompare.ontology.tree import named_ontology_tree_from_tsv
    from cognitiveatlas.datastructure import concept_node_triples
    import pickle
    import pandas
    import json
    import re
    
    

    First we will read in our images, and we only need to do this to get the image contrast labels (a contrast is a particular combination / subtraction of conditions in a task, like “looking at pictures of cats minus baseline”).

    
    # Read in images metadata
    images = pandas.read_csv("../data/contrast_defined_images_filtered.tsv",sep="\t",index_col="image_id")
    
    

    The first thing we are going to do is generate a “triples data structure,” a simple format I came up with that pybraincompare can understand and use to render any kind of graph into the tree. It looks like this:

    
    ## STEP 1: GENERATE TRIPLES DATA STRUCTURE
    
    '''
      id    parent  name
      1 none BASE                   # there is always a base node
      2 1   MEMORY                  # high level concept groups
      3 1   PERCEPTION              
      4 2   WORKING MEMORY          # concepts
      5 2   LONG TERM MEMORY
      6 4   image1.nii.gz           # associated images (discovered by way of contrasts)
      7 4   image2.nii.gz
    '''
    
    

    Each node has an id, a parent, and a name. For the next step, I found the unique contrasts represented in the data (we have more than one image for contrasts), and then made a lookup to find sets of images based on the contrast.

    
    # We need a dictionary to look up image lists by contrast ids
    unique_contrasts = images.cognitive_contrast_cogatlas_id.unique().tolist()
    
    # Images that do not match the correct identifier will not be used (eg, "Other")
    expression = re.compile("cnt_*")
    unique_contrasts = [u for u in unique_contrasts if expression.match(u)]
    
    image_lookup = dict()
    for u in unique_contrasts:
       image_lookup[u] = images.index[images.cognitive_contrast_cogatlas_id==u].tolist()
    
    

    To make the table I showed above, I had added a function to the Cognitive Atlas API python wrapper called concept_node_triples.

    
    output_triples_file = "../data/concepts.tsv"
    
    # Create a data structure of tasks and contrasts for our analysis
    relationship_table = concept_node_triples(image_dict=image_lookup,output_file=output_triples_file)
    
    

    The function includes the contrast images themselves as nodes, so let’s remove them from the data frame before we generate and save the JSON object that will render into a tree:

    
    # We don't want to keep the images on the tree
    keep_nodes = [x for x in relationship_table.id.tolist() if not re.search("node_",x)]
    relationship_table = relationship_table[relationship_table.id.isin(keep_nodes)]
    
    tree = named_ontology_tree_from_tsv(relationship_table,output_json=None)
    pickle.dump(tree,open("../data/concepts.pkl","w"))
    json.dump(tree,open("../static/concepts.json",'w'))
    
    

    script

    Boum! Ok, now back to the visualization!

    Step 7: Second Attempt: Tree

    For this attempt, I wanted to render a concept tree in the browser, with each node in the tree corresponding to a cognitive concept, and colored by the “importance” (weight) in the model. As before, red would indicate positive weight, and blue negative (this is a standard in brain imaging, by the way). To highlight the concepts that are relevant for the particular voxel model, I decided to make the weaker nodes more transparent, and nodes with no contribution (weight = 0) completely invisible. However, I would maintain the tree structure to give the viewer a sense of distance in the ontology (distance –> similarity). This tree would also solve the problem of understanding relationships between concepts. They are connected!

    As before, mousing over a node provides more information:

    and the controls are updated slightly to include a “find in page” button:

    Which, when you click on it, brings up an overlay where you can select any cognitive concepts of your choice with clicks, and they will light up on the tree!

    If you want to know the inspiration for this view, it’s a beautiful installation at the Stanford Business School that I’m very fond of:



    The labels were troublesome, because if I rendered too many it was cluttered and unreadable, and if I rendered too few it wasn’t easy to see what you were looking at without mousing over things. I found a rough function that helped a bit, but my quick fix was to simply limit the labels shown based on the number of images (count) and the regression parameter weight:

    
    
        // Add concept labels
        var labels = node.append("text")
            .attr("dx", function (d) { return d.children ? -2 : 2; })
            .attr("dy", 0)
            .classed("concept-label",true)
            .style("font","14px sans-serif")
            .style("text-anchor", function (d) { return d.children ? "end" : "start"; })
            .html(function(d) { 
                // Only show label for larger nodes with regression parameter >= +/- 0.5
                if ((counts[d.nid]>=15) && (Math.abs(regparams[d.nid])>=0.5)) {
                    return d.name
                }
            });
    
    

    Check out the full demo

    Step 8: Make it reproducible

    You can clone the repo on your local machine and run the visualization with native Flask:

    
        git clone https://github.com/vsoch/cogatvoxel
        cd cogatvoxel
        python index.py
    
    

    Notice anything missing? Yeah, how about installing dependencies, and what if the version of python you are running isn’t the one I developed it in? Eww. The easy answer is to Dockerize! It was relatively easy to do, I would use docker-compose to grab an nginx (web server) image, and my image vanessa/cogatvoxeltree built on Docker Hub. The Docker Hub image is built from the Dockerfile in the repo, which installs dependencies, maps the code to a folder in the container called /code and then exposes port 8000 for Flask:

    
    FROM python:2.7
    ENV PYTHONUNBUFFERED 1
    RUN apt-get update && apt-get install -y \
        libopenblas-dev \
        gfortran \
        libhdf5-dev \
        libgeos-dev
    
    MAINTAINER Vanessa Sochat
    
    RUN pip install --upgrade pip
    RUN pip install flask
    RUN pip install numpy
    RUN pip install gunicorn
    RUN pip install pandas
    
    ADD . /code
    WORKDIR /code
    
    EXPOSE 8000
    
    

    Then the docker-compose file uses this image, along with the nginx web server (this is pronounced “engine-x” and I’ll admit it took me probably 5 years to figure that out).

    
    web:
      image: vanessa/cogatvoxeltree
      restart: always
      expose:
        - "8000"
      volumes:
        - /code/static
      command: /usr/local/bin/gunicorn -w 2 -b :8000 index:app
    
    nginx:
      image: nginx
      restart: always
      ports:
        - "80:80"
      volumes:
        - /www/static
      volumes_from:
        - web
      links:
        - web:web
    
    

    It’s probably redundant to again expose port 8000 in my application (the top section called “web”), and to add /www/static to the web server’s static volumes. To make things easy, I decided to use gunicorn to manage serving the application. There are many ways to skin a cat, there are many ways to run a web server… I hope you choose web servers over skinning cats.

    That’s about it. It’s a set of simple Flask applications to render data into a visualization, and it’s containerized. To be honest, I think the first is a lot cooler, but the second is on its way to a better visualization for the problem at hand. There is still a list of things that need fixing and tweaking (for example, not giving the user control over the threshold for showing the node and links is not ok), but I’m much happier with this second go. On that note, I’ll send a cry for reproducibility out to all possible renderings of data in a browser…

    Visualizations, contain yourselves!

    ·

  • Wordfish: tool for standard corpus and terminology extraction

    If pulling a thread of meaning from woven text
    is that which your heart does wish.
    Not so absurd or seemingly complex,
    if you befriend a tiny word fish.

    wordfish

    I developed a simple tool for standard extraction of terminology and corpus, Wordfish, that is easily deployed to a cluster environment. I’m graduating (hopefully, tentatively, who knows) soon, and because publication is unlikely, I will write about the tool here, in case it is useful to anyone. I did this project for fun, mostly because I found DeepDive to be overly complicated for my personal goal of extracting a simple set of terms from a corpus in the case that I couldn’t define relationships a priori (I wanted to learn them from the data). Thus I used neural networks (word2vec) to learn term relationships based on their context. I was able to predict reddit boards for different mental illness terms with high accuracy, and it sort of ended there because I couldn’t come up with a good application in Cognitive Neuroscience, and no “real” paper is going to write about predicting reddit boards. I was sad to not publish something, but then I realized I am empowered to write things on the internet. :) Not only that, I can make up my own rules. I don’t have to write robust methods with words, I will just show and link you to code. I might even just use bulletpoints instead of paragraphs. For results, I’ll link to ipython notebooks. I’m going to skip over the long prose and trust that if you want to look something up, you know how to use Google and Wikipedia. I will discuss the tool generally, and show an example of how it works. First, an aside about publication in general - feel free to skip this if you aren’t interested in discussing the squeaky academic machine.

    Why sharing incomplete methods can be better than publication

    It’s annoying that there is not a good avenue, or even more so, that it’s not desired or acceptable, to share a simple (maybe even incomplete) method or tool that could be useful to others in a different context. Publication requires a meaningful application. It’s annoying that, as researchers, we salivate for these “publication” things when the harsh reality is that this slow, inefficient process results in yet another PDF/printed thing with words on a page, offering some rosy description of an analysis and result (for which typically minimal code or data is shared) that makes claims that are over-extravagant in order to be sexy enough for publication in the first place (I’ve done quite a bit of this myself). A publication is a static thing that, at best, gets cited as evidence by another paper (and likely the person making the citation did not read the paper to do it full justice). Maybe it gets parsed from pubmed in someone’s meta analysis to try and “uncover” underlying signal across many publications that could have been transparently revealed in some data structure in the first place. Is this workflow really empowering others to collaboratively develop better methods and tools? I think not. Given the lack of transparency, I’m coming to realize that it’s much faster to just share things early. I don’t have a meaningful biological application. I don’t make claims that this is better than anything else. This is not peer reviewed by three random people who give it a blessing, like from a rabbi. I understand the reasons for these things, but the process of conducting research, namely hiding code and results toward that golden nugget publication PDF, seems so against a vision of open science. Under this context, I present Wordfish.

    Wordfish: tool for standard corpus and terminology extraction


    Abstract

    The extraction of entities and relationships between them from text is becoming common practice. The availability of numerous application program interfaces (APIs) to extract text from social networks, blogging platforms, feeds, and standard sources of knowledge is continually expanding, offering an extensive and sometimes overwhelming source of data for the research scientist. While large corporations might have exclusive access to data and robust pipelines for easily obtaining the data, the individual researcher is typically granted limited access, and commonly must devote substantial amounts of time to writing extraction pipelines. Unfortunately, these pipelines are usually not extendable beyond the dissemination of any result, and the process is inefficiently repeated. Here I present Wordfish, a tiny but powerful tool for the extraction of corpus and terms from publicly available sources. Wordfish brings standardization to the definition and extraction of terminology sources, providing an open source repository for developers to write plugins that extend their specific terminologies and corpus to the framework, and giving research scientists an easy way to select from these corpus and terminologies to perform extractions and drive custom analysis pipelines. To demonstrate the utility of this tool, I use Wordfish in a common research framework: classification. I build deep learning models to predict Reddit boards from post content with high accuracy. I hope that a tool like Wordfish can be extended to include substantial plugins, and can allow easier access to ample sources of textual content for the researcher, and a standard workflow for developers to add a new terminology or corpus source.

    Introduction

    While there is much talk of “big data,” when you peek over your shoulder and look at your colleague’s dataset, there is a pretty good chance that it is small or medium sized. When I wanted to extract terms and relationships from text, I went to DeepDive, the ultimate powerhouse to do this. However, I found that setting up a simple pipeline required database and programming expertise. I have this expertise, but it is tenuous. I thought that it should be easy to do some kind of NLP analysis, and combine across different corpus sources. When I started to think about it, we tend to reuse the same terminologies (eg, an ontology) and corpus (pubmed, reddit, wikipedia, etc), so why not implement an extraction once, and then provide that code for others? This general idea would make a strong distinction between a developer, meaning an individual best suited to write the extraction pipeline, and the researcher, an individual best suited to use it for analysis. This sets up the goals of Wordfish: to extract terms from a corpus, and then do some higher level analysis, and make it standardized and easy.

    Wordfish includes data structures that can capture an input corpus or terminology, and provides methods for retrieval and extraction. Then, it allows researchers to create applications that interactively select from the available corpus and terminologies, deploy the applications in a cluster environment, and run an analysis. This basic workflow is possible and executable without needing to set up an entire infrastructure and re-writing the same extraction scripts that have been written a million times before.

    Methods

    The overall idea behind the infrastructure of wordfish is to provide terminologies, corpus, and an application for working with them in a modular fashion. This means that Wordfish includes two things, wordfish-plugins and wordfish-python. Wordfish plugins are modular folders, each of which provides a standard data structure to define extraction of a corpus, terminology or both. Wordfish python is a simple python module for generating an application, and then deploying the application on a server to run analyses.

    Wordfish Plugins

    A wordfish plugin is simply a folder with typically two things: a functions.py file to define functions for extraction, and a config.json that is read by wordfish-python to deploy the application. We can look at the structure of a typical plugin:

      plugin
            functions.py
            __init__.py
            config.json
    

    Specifically, functions.py (the file in the plugin folder that stores these functions) defines the following:

    1) extract_terms: function to call to return a data structure of terms
    2) extract_text: function to call to return a data structure of text (corpus)
    3) extract_relations: function to call to return a data structure of relations

    The requirement of every functions.py is an import of general functions from wordfish-python that will save a data structure for a corpus, terminology, or relationships:

    
    	# IMPORTS FOR ALL PLUGINS
    	from wordfish.corpus import save_sentences
    	from wordfish.terms import save_terms
    	from wordfish.terms import save_relations
    	from wordfish.plugin import generate_job
    
    

    The second requirement is a function, go_fish, which is the main function to be called by wordfish-python under the hood. In this function, the user writing the plugin can make as many calls to generate_job as necessary. A call to generate_job means that a slurm job file will be written to run a particular function (func) with a specified category or extraction type (e.g., terms, corpus, or relations). This second argument helps the application determine how to save the data. A go_fish function might look like this:

    
    	# REQUIRED WORDFISH FUNCTION
    	def go_fish():    
    	    generate_job(func="extract_terms",category="terms")
    	    generate_job(func="extract_relations",category="relations")
    
    

    The above will generate slurm job files to be run to extract terms and relations. If input arguments are required for the function, the specification can look as follows:

    
    generate_job(func="extract_relations",inputs={"terms":terms,"maps_dir":maps_dir},category="relations",batch_num=100)
    
    

    where inputs is a dictionary with keys as variable names and values as the variable values. The addition of the batch_num variable also tells the application to split the extraction into a certain number of batches, corresponding to SLURM jobs. This is needed in the case that running a node on a cluster is limited to some amount of time, and the user wants to further parallelize the extraction.

    Extract terms

    Now we can look in more detail at the extract_terms function. For example, here is this function for the Cognitive Atlas. The extract_terms function will return a json structure of terms:

    
    	def extract_terms(output_dir):
    
    	    terms = get_terms()
    	    save_terms(terms,output_dir=output_dir)
    
    

    You will notice that the extract_terms function uses another function that is defined in functions.py, get_terms. The user is free to include in the wordfish-plugin folder any number of additional files or functions that assist toward the extraction. Here is what get_terms looks like:

    
    	def get_terms():
    	    terms = dict()
    	    concepts = get_concepts()
    
    	    for c in range(len(concepts)):
    		concept_id = concepts[c]["id"]
    		meta = {"name":concepts[c]["name"],
    		        "definition":concepts[c]["definition_text"]}
    		terms[concept_id] = meta
    	    return terms
    
    

    This example is again from the Cognitive Atlas, and we are parsing cognitive concepts into a dictionary of terms. For each cognitive concept, we are preparing a dictionary (JSON data structure) with fields name and definition. We then put that into another dictionary, terms, with the key as the unique id. This unique id is important in that it will be used to link between term and relation definitions. You can assume that the other functions (e.g., get_concepts) are defined in the functions.py file.
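
    The resulting terms structure looks something like this (the ids and definitions below are made up, just to show the shape):

    
    terms = {
        "trm_001": {"name": "working memory",
                    "definition": "active maintenance of information over a short delay"},
        "trm_002": {"name": "recognition",
                    "definition": "judging that a stimulus has been encountered before"},
    }
    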

    Extract relations

    For extract_relations we return a tuple of the format (term1_id,term2_id,relationship):

        
    	def extract_relations(output_dir):
    
    	    links = []
    	    terms = get_terms()
    	    concepts = get_concepts()
    
    	    for concept in concepts:
    		if "relationships" in concept:
    		    for relation in concept["relationships"]:   
    		        relationship = "%s,%s" %(relation["direction"],relation["relationship"])
    		        tup = (concept["id"],relation["id"],relationship) 
    		        links.append(tup)
    
    	    save_relations(terms,output_dir=output_dir,relationships=links)
    
    

    Extract text

    Finally, extract_text returns a data structure with some unique id and a blob of text. Wordfish will parse and clean up the text. The data structure for a single article is again, just JSON:

    
                corpus[unique_id] = {"text":text,"labels":labels}
    
    

    Fields include the actual text, and any associated labels that are important for classification later. The corpus (a dictionary of these data structures) gets passed to save_sentences:

    
                save_sentences(corpus_input,output_dir=output_dir)
    
    

    More detail is provided in the wordfish-plugin README

    The plugin controller: config.json

    The plugin is understood by the application by way of a folder’s config.json, which might look like the following:

    
      [
            {
              "name": "NeuroSynth Database",
              "tag": "neurosynth",
              "corpus": "True",
              "terms": "True",
              "labels": "True",
              "relationships": "True",
              "dependencies": {
                                "python": [
                                            "neurosynth",
                                            "pandas"
                                          ],
                                "plugins": ["pubmed"]
                              },
              "arguments": {
                               "corpus": "email"
                           },
              "contributors": ["Vanessa Sochat"],
              "doi": "10.1038/nmeth.1635"
            }
      ]
    
    

    1) name: a human readable description of the plugin

    2) tag: a term (no spaces or special characters) that corresponds with the folder name in the plugins directory. This is a unique id for the plugin.

    3) corpus/terms/relationships: boolean, each “True” or “False” should indicate if the plugin can return a corpus (text to be parsed by wordfish) or terms (a vocabulary to be used to find mentions of things), or relations (relationships between terms). This is used to parse current plugins available for each purpose, to show to the user.

    4) dependencies: should include “python” and “plugins.” Python corresponds to python packages that are dependencies, and these are installed by the overall application. Plugins refers to other plugins that are required, such as pubmed. This is an example of a plugin that does not offer to extract a specific corpus, terminology, or relations, but can be included in an application for other plugins to use. In the example above, the neurosynth plugin requires retrieving articles from pubmed, so the plugin developer specifies needing pubmed as a plugin dependency.

    5) arguments: a dictionary with (optionally) corpus and/or terms. The user will be asked for these arguments to run the extract_text and extract_terms functions.

    6) contributors: a name/orcid ID or email of researchers responsible for creation of the plugins. This is for future help and debugging.

    7) doi: a reference or publication associated with the resource. Needed if it’s important for the creator of the plugin to ascribe credit to some published work.

    Best practices for writing plugins

    Given that a developer is writing a plugin, it is generally good practice to print to the screen what is going on and how long it might take, as a courtesy to the user, in case something needs review or debugging.

    “Extracting relationships, will take approximately 3 minutes”

    The developer should also use clear variable names and well documented, well spaced functions (one liners are great in python, but it’s sometimes more understandable to the reader to write out a loop), and attribute credit for code that is not his or her own. Generally, the developer should just follow good practice as a coder and human being.

    Functions provided by Wordfish

    While most users and clusters have internet connectivity, it cannot be assumed, and attempting to access an online resource without it could trigger an error. If a plugin has functions that require connectivity, Wordfish provides a function to check:

    
          from wordfish.utils import has_internet_connectivity
          if has_internet_connectivity():
              # Do analysis
    
    

    If the developer needs a github repo, Wordfish has a function for that:

    
          from wordfish.vm import download_repo
          repo_directory = download_repo(repo_url="https://github.com/neurosynth/neurosynth-data")
    
    

    If the developer needs a general temporary place to put things, tempfile is recommended:

    
          import tempfile
          tmpdir = tempfile.mkdtemp()
    
    

    Wordfish has other useful functions for downloading data, or obtaining a url. For example:

    
          from wordfish.utils import get_url, get_json
          from wordfish.standards.xml.functions import get_xml_url
          myjson = get_json(url)
          webpage = get_url(url)
          xml = get_xml_url(url)
    
    


    Custom Applications with Wordfish Python

    The controller, wordfish-python, is a flask application that provides the user (who just wants to generate an application) with an interactive web interface for doing so. It is summarized nicely in the README:

    Choose your input corpus, terminologies, and deployment environment, and an application will be generated to use deep learning to extract features for text, and then entities can be mapped onto those features to discover relationships and classify new texty things. Custom plugins will allow for dynamic generation of corpus and terminologies from data structures and standards of choice from wordfish-plugins You can have experience with coding (and use the functions in the module as you wish), or no experience at all, and let the interactive web interface walk you through generation of your application.

    Installation can be done via github or pip:

    
          pip install git+git://github.com/word-fish/wordfish-python.git
          pip install wordfish
    
    

    And then the tool is called to open up a web interface to generate the application:

    
        wordfish
    
    

    The user then selects terminologies and corpus.

    And a custom application is generated, downloaded as a zip file in the browser. A “custom application” means a folder that can be dropped into a cluster environment, and run to generate the analysis.



    Installing in your cluster environment

    The user can drop the folder into a home directory of the cluster environment, and run the install script to install the package itself and generate the output folder structure. The only argument that needs to be supplied is the base of the output directory:

    
          WORK=/scratch/users/vsochat/wordfish
          bash install.sh $WORK
    
    

    All scripts for the user to run are in the scripts folder here:

    
          cd $WORK/scripts
    
    

    Each of these files corresponds to a step in the pipeline, and is simply a list of commands to be run in parallel. The user can use launch, or submit each command to a SLURM cluster. A basic script is provided to help submit jobs to a SLURM cluster, and this could be tweaked to work with other clusters (e.g., SGE).

    Running the Pipeline

    After the installation of the custom application is complete, the install script simply runs run.py, which generates all output folders and running scripts. The user then has a few options for running:

    1) submit the commands in serial, locally. The user can run a job file with bash, e.g., bash run_extraction_relationships.job
    2) submit the commands to a launch cluster, with something like launch -s run_extraction_relationships.job
    3) submit the commands individually to a SLURM cluster. This means reading in the file and submitting each command with a line like sbatch -p normal -j myjob.job [command line here] (a sketch of this option follows below)
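    For option 3, here is a rough sketch of what that submission loop might look like; the partition name and job naming are assumptions about your cluster, and the basic SLURM submission script shipped in the scripts folder is the canonical reference:

        # Submit each command in a .job file as its own SLURM job (hypothetical sketch).
        import subprocess

        job_file = "run_extraction_relationships.job"   # one command per line

        with open(job_file, "r") as handle:
            commands = [line.strip() for line in handle if line.strip()]

        for index, command in enumerate(commands):
            # --wrap lets sbatch run a shell command without writing a separate script
            subprocess.call(["sbatch", "-p", "normal", "-J", "wordfish_%s" % index,
                             "--wrap", command])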

    Output structure

    The jobs are going to generate output to fill in the following file structure in the project base folder, which again is defined as an environment variable when the application is installed (files that will eventually be produced are shown):

    
          WORK
                  corpus
                      corpus1
                          12345_sentences.txt
                          12346_sentences.txt
                      corpus2
                          12345_sentences.txt
                          12346_sentences.txt
                  terms
                      terms1_terms.txt
                      terms2_relationships.txt
    
                  scripts
                      run_extraction_corpus.job
                      run_extraction_relationships.job
                      run_extraction_terms.job
    
    

    The folders are generated dynamically by the run.py script for each corpus and terms plugin, based on the tag variable in the plugin’s config.json. Relationships, being associated with terms, are stored in the equivalent folder; that step is only separate because not all terms plugins have relationships defined. The corpora are kept separate at this step because the output has not been parsed into any standard unique id space. Wordfish currently does not do this, but if more sophisticated applications are desired (for example, with a relational database), that would be a good strategy to take.

    Analysis

    Once the user has files for corpus and terms, he could arguably do whatever he pleases with them. However, I found the word2vec neural network to be incredibly easy and cool, and have provided a simple analysis pipeline to use it. This example will merge all terms and corpora into a common framework, and then show examples of how to do basic comparisons and vector extraction (custom analysis scripts can be based off of this). We will do the following:

    1) Merge terms and relationships into a common corpus
    2) For all text extract features with deep learning (word2vec)
    3) Build classifiers to predict labels from vectors

    Word2Vec Algorithm

    First, what is a word2vec model? Generally, Word2Vec is a neural network implementation that allows us to learn word embeddings from text, specifically a matrix of words by some N features that best predict the neighboring words of each term. This is an interesting way to model a text corpus because it captures not just the occurrence of words but their context, and we can do something like compare the term “anxiety” across different contexts. If you want equations, see this paper.

    The problem Wordfish solves

    Wordfish currently implements Word2Vec, an unsupervised model. Applications like DeepDive take the approach that a researcher knows what he or she is looking for, requiring definition of entities as a first step before their extraction from a corpus. This is not ideal when a researcher has no idea about these relationships, or lacks either positive or negative training examples. DeepDive also has computational requirements that are unrealistic for many researchers; for example, the Stanford Parser is required to determine parts of speech and perform named entity recognition. While this approach is suitable for a large scale operation to mine very specific relationships between well-defined entities in a corpus, for the single researcher who wants to do simpler natural language processing, and perhaps doesn’t know what kind of relationships or entities to look for, it is too much. This researcher may want to search for some terms of interest across a few social media sources, and build models to predict one type of text content from another. The researcher may want to extract relationships between terms without having a good sense of what they are to begin with, and defining entities and relationships and then writing scripts to extract both should not be a requirement. While it is reasonable to ask modern day data scientists to partake in small amounts of programming, substantial setting up of databases and writing of extraction pipelines should not be a requirement. The different approach taken by Wordfish is to provide plugins for the user to interactively select corpus and terminology, deploy their custom application in their computational environment of choice, and perform extraction using the tools that are part of their normal workflows, which might be a local command line or computing cluster.

    Even when the DeepDive approach makes sense, the reality is that setting up the infrastructure to deploy DeepDive is really hard. When we think about it, the two applications are solving entirely different problems. All we really want to do is discover how terms are related in text. We could probably do OK giving DeepDive a list of terms, but then having to “know” something about the features we want to extract, and having positive and negative cases for training, is really annoying. If it’s annoying for a very simple toy analysis (finding relationships between cognitive concepts), I can’t even imagine how that annoyance will scale when there are multiple terminologies to consider, different relationships between the terms, and a complete lack of positive and negative examples to validate. This is why I created Wordfish: I wanted an unsupervised approach that required minimal setup to get to the result. Let’s talk a little more about the history of Word2Vec from this paper.

    The N-Gram Model

    The N-gram model (I think) is a form of hidden Markov Model where we model P(word) given the words that came before it. The authors note that N-gram models work very well on large data, but on smaller datasets, more complex methods can make up for the lack of data. However, it follows logically that a more complex model on a large dataset gives us the best of all possible worlds. Thus, people started using neural networks for these models instead.

    simple models trained on huge amounts of data outperform complex systems trained on less data.

    The high level idea is that we are going to use neural networks to represent words as vectors, word “embeddings.” Training is done with stochastic gradient descent and backpropagation.

    How do we assess the quality of word vectors?

    Similar words tend to be close together, and given a high dimensional vector space, multiple representations/relationships can be learned between words (see top of page 4). We can also perform algebraic computations on vectors and discover cool relationships, for example, the vector for V(King) - V(Man) + V(Woman) is close to V(Queen). The most common metric to compare vectors seems to be cosine distance. The interesting thing about this approach reported here is that by combining individual word vectors we can easily represent phrases, and learn new interesting relationships.

    Two different algorithm options

    You can implement a continuous bag of words (CBOW) or skip-gram model:

    1) CBOW: predicts the word given the context (other words)
    2) skip-gram: predicts other words (context) given a word (this seems more useful for what we want to do)

    They are kind of like inverses of one another, and the best way to show this is with a picture:

    [Figure: CBOW vs. skip-gram algorithms]
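    Since I used gensim for the underlying models (more on that below), here is a minimal, hedged sketch of training each variant on toy sentences; the parameters are arbitrary, and parameter names differ across gensim versions (newer releases use vector_size instead of size and put similarity queries under model.wv):

        # Toy illustration of CBOW vs. skip-gram with gensim; values are arbitrary.
        from gensim.models import Word2Vec

        sentences = [["anxiety", "is", "a", "common", "disorder"],
                     ["depression", "and", "anxiety", "often", "cooccur"]]

        cbow_model = Word2Vec(sentences, size=50, window=5, min_count=1, sg=0)      # CBOW
        skipgram_model = Word2Vec(sentences, size=50, window=5, min_count=1, sg=1)  # skip-gram

        print(skipgram_model.most_similar("anxiety"))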

    Discarding Frequent Words

    The paper notes that very frequent words carry little information, and that during training, frequent words are discarded with a probability based on their frequency. This probability is used in a sampling procedure when choosing words to train on, so the most frequent words are less likely to be chosen. For more details, see here, and search Google.
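    As I recall from the paper (worth double checking there), each occurrence of a word w_i is discarded during training with probability

        P(w_i) = 1 - \sqrt{ t / f(w_i) }

    where f(w_i) is the frequency of the word in the corpus and t is a small threshold (on the order of 10^-5).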

    Building Word2Vec Models

    First, we will train simple word2vec models on the different corpora. To do this we can import functions from Wordfish, which is installed by the application we generated above.

    
    	from wordfish.analysis import build_models, save_models, export_models_tsv, load_models, extract_similarity_matrix, export_vectors, featurize_to_corpus
    	from wordfish.models import build_svm
    	from wordfish.corpus import get_corpus, get_meta, subset_corpus
    	from wordfish.terms import get_terms
    	from wordfish.utils import mkdir
    	import os
    
    

    Installation of the application also writes the environment variable WORDFISH_HOME to your bash profile, so we can reference it easily:

    
    	base_dir = os.environ["WORDFISH_HOME"]
    
    

    It is generally good practice to keep all components of an analysis well organized in the output directory. It makes sense to store analyses, models, and vectors:

    
    	# Setup analysis output directories
    	analysis_dir = mkdir("%s/analysis" %(base_dir))
    	model_dir = mkdir("%s/models" %(analysis_dir))
    	vector_dir = mkdir("%s/vectors" %(analysis_dir))
    
    

    Wordfish then has nice functions for generating a corpus, meaning removing stop words and excess punctuation, and performing the typical steps in NLP analyses. The function get_corpus returns a dictionary, with the key being the unique id of the corpus (the folder name, the tag of the original plugin). We can then use the subset_corpus function if we want to split a corpus into different groups (defined by the labels we specified in the initial data structure):

    
    	# Generate more specific corpus by way of file naming scheme
    	corpus = get_corpus(base_dir)
    	reddit = corpus["reddit"]
    	disorders = subset_corpus(reddit)
    	corpus.update(disorders)
    
    

    We can then train corpus-specific models, meaning word2vec models.

    
    	# Train corpus specific models
    	models = build_models(corpus)
    
    

    Finally, we can export models to tsv, export vectors, and save the models so we can easily load them again.

    
    	# Export models to tsv, export vectors, and save
    	save_models(models,base_dir)
    	export_models_tsv(models,base_dir)
    	export_vectors(models,output_dir=vector_dir)
    
    

    I want to note that I used gensim for the model learning and some of the methods. The work and examples from Dato are great!

    Working with models

    Wordfish provides functions for easily loading a model that is generated from a corpus:

    
    model = load_models(base_dir)["neurosynth"]
    
    

    You can then do simple things, like find the most similar words for a query word:

    
    	model.most_similar("anxiety")
    	# [('aggression', 0.77308839559555054), 
    	#   ('stress', 0.74644440412521362), 
    	#   ('personality', 0.73549789190292358), 
    	#   ('excessive', 0.73344630002975464), 
    	#   ('anhedonia', 0.73305755853652954), 
    	#   ('rumination', 0.71992391347885132), 
    	#   ('distress', 0.7141801118850708), 
    	#   ('aggressive', 0.7049030065536499), 
    	#   ('craving', 0.70202392339706421), 
    	#   ('trait', 0.69775849580764771)]
    
    

    It’s easy to see that corpus context is important - here are the most similar terms for “anxiety” in the “reddit” corpus:

    
    	model = load_models(base_dir)["reddit"]
    	model.most_similar("anxiety")
    	# [('crippling', 0.64760375022888184), 
    	# ('agoraphobia', 0.63730186223983765), 
    	# ('generalized', 0.61023455858230591), 
    	# ('gad', 0.59278655052185059), 
    	# ('hypervigilance', 0.57659250497817993), 
    	# ('bouts', 0.56644737720489502), 
    	# ('depression', 0.55617612600326538), 
    	# ('ibs', 0.54766887426376343), 
    	# ('irritability', 0.53977066278457642), 
    	# ('ocd', 0.51580017805099487)]
    
    

    Here are examples of performing addition and subtraction with vectors:

    
    	model.most_similar(positive=['anxiety',"food"])
    	# [('ibs', 0.50205761194229126), 
    	# ('undereating', 0.50146859884262085), 
    	# ('boredom', 0.49470821022987366), 
    	# ('overeating', 0.48451068997383118), 
    	# ('foods', 0.47561675310134888), 
    	# ('cravings', 0.47019645571708679), 
    	# ('appetite', 0.46869537234306335), 
    	# ('bingeing', 0.45969703793525696), 
    	# ('binges', 0.44506731629371643), 
    	# ('crippling', 0.4397256076335907)]
    
    	model.most_similar(positive=['bipolar'], negative=['manic'])
    	# [('nos', 0.36669495701789856), 
    	# ('adhd', 0.36485755443572998), 
    	# ('autism', 0.36115738749504089), 
    	# ('spd', 0.34954413771629333), 
    	# ('cptsd', 0.34814098477363586), 
    	# ('asperger', 0.34269329905509949), 
    	# ('schizotypal', 0.34181860089302063), 
    	# ('pi', 0.33561226725578308), 
    	# ('qualified', 0.33355745673179626), 
    	# ('diagnoses', 0.32854354381561279)]
    
    	model.similarity("anxiety","depression")
    	#0.67751728687122414
    
    	model.doesnt_match(["child","autism","depression","rett","cdd","young"])
    	#'depression'
    
    

    And to get the raw vector for a word:

    
    	model["depression"]
    
    
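    To my understanding, the similarity scores above are cosine similarities, which you can also compute yourself from the raw vectors. A small sketch with numpy, assuming model[word] returns a numpy array (as it does for gensim models):

        # Compute cosine similarity between two word vectors by hand (sketch).
        import numpy

        def cosine_similarity(vector_a, vector_b):
            return numpy.dot(vector_a, vector_b) / (numpy.linalg.norm(vector_a) *
                                                    numpy.linalg.norm(vector_b))

        anxiety = model["anxiety"]
        depression = model["depression"]
        print(cosine_similarity(anxiety, depression))  # should roughly match model.similarity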

    Extracting term similarities

    To extract a pairwise similarity matrix, you can use the function extract_similarity_matrix. These are the data driven relationships between terms that the Wordfish infrastructure provides:

    
    	# Extract a pairwise similarity matrix
    	wordfish_sims = extract_similarity_matrix(models["neurosynth"])
    
    

    Classification

    Finally, here is an example of predicting neurosynth abstract labels using the pubmed neurosynth corpus. We first want to load the model and metadata for neurosynth, meaning labels for each text:

    
    	model = load_models(base_dir,"neurosynth")["neurosynth"]
    	meta = get_meta(base_dir)["neurosynth"]
    
    

    We can then use the featurize_to_corpus method to get labels and vectors from the model, and the build_svm function to build a simple, cross validated classifier to predict the labels from the vectors:

    
    	vectors,labels = featurize_to_corpus(model,meta)
    	classifiers = build_svm(vectors=vectors,labels=labels,kernel="linear")
    
    

    The way this works is to take a new post from reddit with an unknown label, use the Word2Vec word embeddings as a lookup, and generate a vector for the new post by taking the mean of its word embeddings. It’s a simple approach that could be improved upon, but it seemed to work reasonably well.
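    As a hedged sketch of that lookup-and-average step (the tokenization and variable names here are simplifications of my own; the real logic lives in featurize_to_corpus and build_svm):

        # Represent a new post as the mean of its word embeddings (simplified sketch).
        import numpy

        def featurize_post(post, model):
            vectors = []
            for word in post.lower().split():       # naive tokenization
                if word in model:                   # skip words outside the vocabulary
                    vectors.append(model[word])
            return numpy.mean(vectors, axis=0) if vectors else None

        new_post = "I have been feeling anxious and cannot sleep"
        vector = featurize_post(new_post, model)
        # A trained classifier (for example from build_svm) could then predict the board:
        # predicted_label = classifier.predict([vector])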

    Classification of Disorder Using Reddit

    A surprisingly untapped resource is Reddit, a forum with different “boards” that each provide a place to write about a topic of interest. It has largely gone unnoticed that individuals use Reddit to seek social support and advice; for example, the Depression board is predominantly filled with posts from individuals with Depression writing about their experiences, and the Computer Science board might be predominantly questions or interesting facts about computers or being a computer scientist. From the mindset of a research scientist who might be interested in Reddit as a source of language, a Reddit board can be thought of as a context. Individuals who post to the board, whether or not they have an “official” status related to the board, are expressing language in the context of the topic. Thus, it makes sense that we can “learn” a particular language context that is relevant to the board, and possibly use the understanding of this context to identify it in other text. Thus, I built 36 word embedding models across 36 Reddit boards, each representing the language context of the board, or specifically, the relationships between the words. I used these models to look at the context of words across different boards. I also built one master “reddit” model, and used this model in the classification framework discussed previously.

    The classification framework was applied in two settings - predicting reddit boards from reddit posts, and doing the same but using the neurosynth corpus as the Word2Vec model (the idea being that papers about cognitive neuroscience and mental illness might produce word vectors that are more relevant for reddit boards about mental illness groups). For both of these, the high level idea is that we want to predict a board (grouping) based on a model built from all of reddit (or some other corpus). The corpus used to derive the word vectors gives us the context - meaning the relationships between terms (and this is done across all boards with no knowledge of classes or board types) - and then we can take each entry and calculate an average vector for it by averaging the vectors of the word embeddings present in the sentence. Specifically we:

    
    1) generate a word embeddings model (M) for the entire reddit corpus (resulting vocabulary is size N=8842) 
    2) for each reddit post (having a board label like "anxiety"):
    - generate a vector that is the average of its word embeddings in M

    Then for each pair of boards (for example, “anxiety” and “atheist”):

    1) subset the data to all posts for “anxiety” and “atheist”
    2) randomly hold out 20% for testing, and use the remaining 80% for training
    3) build an SVM to distinguish the two classes, for each of the rbf, linear, and poly kernels
    4) save accuracy metrics (a sketch of these steps with scikit-learn follows below)
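    Here is a minimal sketch of those steps with scikit-learn; the vectors and labels arrays are assumed to already hold the mean post vectors and board labels for the two boards being compared, and the import path for train_test_split depends on your scikit-learn version (newer releases use sklearn.model_selection):

        # Pairwise board classification with a held-out test set (sketch).
        from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
        from sklearn.svm import SVC

        results = dict()
        for kernel in ["rbf", "linear", "poly"]:
            train_X, test_X, train_y, test_y = train_test_split(vectors, labels, test_size=0.2)
            classifier = SVC(kernel=kernel)
            classifier.fit(train_X, train_y)
            results[kernel] = classifier.score(test_X, test_y)  # accuracy on the held out 20%

        print(results)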

    Results

    How did we do?

    Can we classify reddit posts?

    The full result has accuracies that are mixed. What we see is that some boards can be distinguished well, and some not. When we extend this to use the neurosynth database to build the model, we don’t do as well, likely because the corpus is much smaller, and we remember from the paper that a larger corpus tends to do better.

    Can we classify neurosynth labels?

    A neurosynth abstract comes with a set of labels for terms that are (big hand waving motions) “enriched.” Thus, given labels for a paragraph of text (corresponding to the neurosynth term), I should be able to build a classifier that can predict the term. The procedure is the same as above: an abstract is represented as the mean of all of its word embeddings. The results are also pretty good for a first go, but I bet we could do better with a multi-class model or approach.

    Do different corpora provide different context (and thus different term relationships)?

    This portion of the analysis used the Word2Vec models generated for specific reddit boards. Before I delved into classification, I had just wanted to generate matrices that show relationships between words, based on a corpus of interest. I did this for NeuroSynth, as well as for a large sample (N=121,862) of reddit posts across 34 boards, including disorders and random boards like “politics” and “science.” While there was interesting signal in the different relationship matrices, really the most interesting thing we might look at is how a term’s relationships vary with the context. As a matter of fact, I would say context is extremely important to think about. For example, someone talking about “relationships” in the context of “anxiety” is different from someone talking about “relationships” in the context of “sex” (or not). I didn’t upload these all to github (I have over 3000, one for each neurosynth term), but it’s interesting to see how a particular term changes across contexts.

    Each matrix (pdf file) in the folder above is one term from neurosynth. What the matrix for a single term shows is different contexts (rows) and the relationship to all other neurosynth terms (columns). Each cell value shows the similarity of the global term (in the specified context) to the column term. The cool thing for most of these is that we see general global patterns, meaning that the context is similar, but then there are slight differences. I think this is hugely cool and interesting and could be used to derive behavioral phenotypes. If you would like to collaborate on something to do this, please feel free to send me an email, tweet, Harry Potter owl, etc.
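    As a rough sketch of how one of these context-by-term matrices could be assembled (the board names are examples, and I am assuming extract_similarity_matrix returns a labeled pandas data frame of term-by-term similarities; check the wordfish source for the actual return type):

        # Build a contexts-by-terms matrix for a single term of interest (hypothetical sketch).
        import pandas

        term = "anxiety"
        contexts = ["reddit", "neurosynth", "depression", "politics"]  # example corpus/board tags

        rows = []
        for context in contexts:
            similarities = extract_similarity_matrix(models[context])  # assumed pandas DataFrame
            rows.append(similarities.loc[term])                        # this term versus all other terms

        context_matrix = pandas.DataFrame(rows, index=contexts)
        context_matrix.to_csv("%s_contexts.tsv" % term, sep="\t")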

    Conclusions

    Wordfish provides standard data structures and an easy way to extract terms and corpora, perform a classification analysis, or extract similarity matrices for terms. It’s missing a strong application. We don’t have anything suitable in cognitive neuroscience, at least off the bat, and if you have ideas, I’d love to chat. It’s very easy to write a plugin to extend the infrastructure to another terminology or corpus. We can write one of those silly paper things. Or just have fun, which is much better. The application to deploy Wordfish plugins, and the plugins themselves, are open source, meaning that they can be collaboratively developed by users. That means you! Please contribute!

    Limitations

    The limitations have to do with the fact that this is not a finished application. Much fine tuning could be done toward a specific goal, or to answer a specific question. I usually develop things with my own needs in mind, and add functionality as I go and it makes sense.

    Database

    For my application, it wasn’t a crazy idea to store each corpus entry as a text file, and I had only a few thousand terms. Thus, I was content using flat text files to store data structures. I had plans for integration of “real” databases, but the need never arose. This would not be ideal for much larger corpora, for which using a database would be optimal. If the need for a larger corpus arose, I would add this functionality to the application.

    Deployment Options

    Right now the only option is to generate a folder and install on a cluster, and this is not ideal. Better would be options to deploy to a local or cloud-hosted virtual machine, or even a Docker image. This is another future option.

    Data

    It would eventually be desired to relate analyses to external data, such as brain imaging data. For example, NeuroVault is a database of whole-brain statistical maps with annotations for terms from the cognitive atlas, and we may want to bring in maps from NeuroVault at some point. Toward this aim a separate wordfish-data repo has been added. Nothing has been developed here yet, but it’s in the queue.

    And this concludes my first un-paper paper. It took an afternoon to write, and it feels fantastic. Go fish!

    ·

  • So Badly

    I want it so badly, I can’t breathe. The last time I had this feeling was right after interviewing at Stanford. My heart ached for the realization that the opportunity to learn informatics was everything I had ever wanted, and the chances were so slim of getting it. Before I had any prowess in programming and didn’t know what a normal distribution was, I had the insight that immersion in the right environment would push and challenge me exactly in the way I was hungry for. It is only when we surround ourselves by individuals who are more skilled, and in an environment with opportunity to try new things and take risks, that we grow in skill and in ourselves. When I compare the kind of “things” that I tried to build back in my first quarter to what I build now, I have confidence that I was right about this particular instinct. Now, in my last year or so of graduate school, I am again faced with uncertainty about the future. I am almost thirty - I feel so old, and in this context a part of me is tired of constantly needing to prove that I am good enough. I am defined by what I like to call the goldfish property. I am so devoted to the things that I love, mainly learning about new infrastructures and creating beautiful applications, and my stubbornness is so great that I will tend to grow toward the size of my tank. I have the confidence that, despite at any moment not having prowess in some domain, if presented with challenge, and especially in an environment where I can observe the bigger fish, I will grow. What scares me is the fact that in order to gain entry to the larger ocean, we are judged as little fish. I am also terrified by a general realization about the importance of the choice of an environment. Each step slightly away from being around the kind of people that are amazing at the things I want to be amazing at is a small dimming of the light inside my heart, and too many steps away means a finality of regret and a present life of doing some sets of tasks that, maybe one might be good at, but they do not satisfy the heart.

    This is why this crucial next step is so terrifying. I have confidence in the things that I want to be amazing at. I’m also a weird little fish: I don’t care about pursuing weekend fun, going on trips, learning about different cultures, or even starting a family. I just want to immerse my brain in a computer and build things. The challenges that I want are building infrastructure, meaning databases, cluster and cloud environments, virtual machines… essentially applications. I go nuts over using version control (Github), APIs, Continuous Integration, and working on software projects. I like to think about data structures, and standards, and although my training in machine learning supplements that, and I enjoy learning new algorithms that I can use in my toolbox, I don’t have drive to find ways to better optimize those algorithms. I want to be the best version of myself that I can possibly be - to return home at the end of each day and feel that I’ve milked out every last ounce of energy and effort is the ultimate satisfaction. I want to be able to build anything that I can dream of. To have and perform some contained skillset is not good enough. I don’t think I’ll ever feel good enough, and although this can sometimes sound disheartening, it is the ultimate driver towards taking risks to try new things that, when figured out, lead to empowerment and joy. The terrifying thing about remaining in a small research environment is that I won’t be pushed, or minimally have opportunity, to become a badass at these things. I might use them, but in context of neuroscientists or biologists, to already feel like one of the bigger fish in using these technologies fills me with immense sadness. I’m perhaps proficient in neuroscience, but I’m not great because I’m more interested in trying to build tools that neuroscientists can use. Any individual can notice this in him or herself. We tend to devote time and energy to the things that we obsessively love, and while the other things might be necessary to learn or know, it’s easy to distinguish these two buckets because one grows exponentially, effortlessly, and the other one changes only when necessary. Growing in this bucket is an essential need, critical for happiness and fulfillment, and to slow down this growth and find oneself in a job performing the same skill set with no opportunity for growth leads to screaming inside one’s head, and feeling trapped.

    So, in light of this uncertainty, and upcoming change, I feel squirmy. I want to be an academic software engineer, but if I am rooted in academia I am worried about being a big fish, and I would not grow. I am an engineer at heart, and this does not line up with the things that an individual in academia is hungry to learn, which are more rooted in biological problems than the building of things. However, if I were to throw myself into some kind of industry, I feel like I am breaking a bond and established relationship and trust with the people and problems that I am passionate about solving. I love academia because it gives me freedom to work on different and interesting problems every day, freedom and control over my time and work space, and freedom to work independently. But on the other hand, it is much more challenging to find people that are great at the kinds of modern technologies that I want to be great at. It is much less likely to be exposed to the modern, bleeding edge technology that I am so hungry to learn, because academia is always slightly behind, and the best I can do is read the internet, and watch conference videos for hours each evening, looking for tidbits of cool things that I can find an excuse to try in some research project. Academia is also painfully slow. There is a pretty low bar for being a graduate student. We just have to do one thesis project, and it seems to me that most graduate students drag their feet to watch cells grow in petri dishes, complain about being graduate students, and take their sweet time supplemented with weekend trips to vineyards and concerts in San Francisco. I’ve said this before, but if a graduate student only does a single thesis project, unless it is a cure for cancer, I think they have not worked very hard. I also think it’s unfortunate when someone who might have been more immersed does not get chosen in favor of someone else who was a better interviewer. But I digress. The most interesting things in graduate school are the projects that one does for fun, or the projects that one helps his or her lab with, as little extra bits. But these things still feel slow at times, and the publication process is the ultimate manifestation of this turtleness. When I think about this infrastructure, and that the ultimate products that I am to produce are papers, this feels discouraging. Is that the best way to have an impact, for example, in reproducible science? It seems that the sheer existence of Github has done more for reproducibility across all domains than any published, academic efforts. In that light, perhaps the place to solve problems in academia is not in academia, but at a place like Github. I’m not sure.

    This dual need for a different environment and a want to solve problems in a particular domain makes it seem like I don’t have any options. The world is generally broken into distinct, “acceptable” and well paved paths for individuals. When one graduates, he or she applies for jobs. The job is either in academia or industry. It is well scoped and defined. Many others have probably done it before, and the steps of progression after entry are also logical. In academia you either move through postdoc-ship into professor-dome, or you become one of those things called a “Research Associate” which doesn’t seem to have an acceptable path, but is just to say “I want to stay in academia but there is no real proper position for what I want to do, so this is my only option.” What is one to do, other than to create a list of exactly the things to be desired in a position, and then figure out how to make it? The current standard options feel lacking, and choosing either established path would not be quite right. If I were to follow my hunger to learn things more rooted in a “building things” direction, this would be devastating in breaking trust and loyalty with people that I care a lot about. It also makes me nervous to enter a culture that I am not familiar with, namely the highly competitive world of interviewing for some kind of engineering position. The thought of being a little fish sits on my awareness and an overwhelming voice says “Vanessa, you aren’t good enough.” And then a tiny voice comes in and questions that thought, and says, “Nobody is good enough, and it doesn’t matter, because that is the ocean where you will thrive. You would build amazing things.” But none of that matters if you are not granted entry into the ocean, and this granting is no easy thing. It is an overwhelming, contradictory feeling. I want to sit in a role that doesn’t exist - this academic software developer; there is no avenue to have the opportunities and environment of someone who works at a place like Google but still work on problems like reproducible science. I don’t know how to deal with it at this point, and it is sitting on my heart heavily. I perhaps need to follow my own instinct and try to craft a position that does not exist, one that I know is right. I must think about these things.

    ·

  • The Academic Software Developer

    To say that I have very strong feelings about standards and technology used in academic research would be a gross understatement. Our current research practices and standards for publication, sharing of data and methods, and reproducible science are embarrassingly bad, and it’s our responsibility to do better. As a graduate student, it seemed that the “right place” to express these sentiments would be my thesis, and so I poured my heart out into some of the introduction and later chapters. It occurred to me that writing a thesis, like many standard practices in academia, is dated to be slow and ineffective - a document that serves only to stand as a marker in time to signify completion of some epic project, to only have eyes laid upon it by possibly 4-6 people, and probably not even that many, as I’ve heard stories of graduate students copy pasting large amounts of nonsense between an introduction and conclusion and getting away with it. So should I wait many months for this official pile of paper to be published to some Stanford server to be forgotten about before it even exists? No thanks. Let’s talk about this here and now.

    This reproducibility crisis comes down to interpretation - the glass can be half full or half empty, but it doesn’t really matter because at the end of the day we just need to pour more water in the stupid glass, or ask why we are wasting our time evaluating and complaining about the level of water when we could be digging wells. The metric itself doesn’t even matter, because it casts a shadow of doubt not only on our discoveries, but on our integrity and capabilities as scientists. Here we are tooting on about “big data” and publishing incremental changes to methods when what we desperately need is paradigm shifts in the most basic, standard practices for conducting sound research. Some people might throw their hands up and say “It’s too big of a problem for me to contribute.” or “The process is too political and it’s unlikely that we can make any significant change.” I would suggest that change will come slowly by way of setting the standard through example. I would also say that our saving grace will come by way of leadership and new methods and infrastructure to synthesize data. Yes, our savior comes by way of example from software development and informatics.

    Incentives for Publication

    It also does not come as a surprise that the incentive structure for conducting science and publishing is a little broken. The standard practice is to aggressively pursue significant findings to publish, and if it’s not significant, then it’s not sexy, and you can file it away in the “forgotten drawer of shame.” In my short time as a graduate student, I have seen other graduate students, and even faculty, anguish over the process of publication. I’ve seen graduate students want to get out as quickly as possible, willing to do just about anything “for that one paper.” The incentive structure renders otherwise rational people into publication-hungry wolves that might even want to turn garbage into published work by way of the science of bullshit. As a young graduate student it is stressful to encounter these situations and know that it goes against what you consider to be a sound practice of science. It is always best to listen to your gut about these things, and to pursue working with individuals that have the highest of standards. This is only one of the reasons that Poldrack Lab is so excellent. But I digress. Supposing that our incentives are in check, what about the publications themselves?

    Even when a result makes it as far as a published paper, the representation of results as a static page does not stand up to our current technological capabilities. Why is it that entire careers can be made out of parsing Pubmed to do different flavors of meta-analysis, and a large majority of results seem to be completely overlooked or eventually forgotten? Why is a result a static thing that does not get updated as our understanding of the world, and availability of data, changes? We pour our hearts out into these manuscripts, sometimes making claims that are larger than the result itself, in order to make the paper loftier than it actually is. While a manuscript should be presented with an interesting story to capture the attention of others who may not have interest in a topic, it still bothers me that many results can be over-sensationalized, and other important results, perhaps null or non-significant findings, are not shared. Once the ink has dried on the page, the scientist is incentivized to focus on pursuit of the next impressive p-value. In this landscape, we don’t spend enough time thinking about reproducible science. What does it mean, computationally, to reproduce a result? Where do I go to get an overview of our current understanding for some question in a field without needing to read all published research since the dawn of time? It seems painfully obvious to me that continued confidence in our practice of research requires more standardization and best practices for methods and infrastructure that lead to such results. We need informed ways to compare a new claim to everything that came before it.

    Lessons from Software Development and Informatics

    Should this responsibility for a complete restructuring of practices, the albatross for the modern scientist, be his burden? Probably this is not fair. Informatics, a subset of science that focuses on the infrastructure and methodology of a scientific discipline, might come to his aid. I came into this field because I’m not driven by answering biological questions, but by building tools. I’ve had several high status individuals tell me at different times that someone like myself does not belong in a PhD program, and I will continue to highly disagree. There is a missing level across all universities, across all of academia, and it is called the Academic Software Developer. No one with such a skillset in their right mind would stay in academia when they could be paid two to three fold in industry. Luckily, some of us either don’t have a right mind, or are just so incredibly stubborn about this calling that a monetary incentive structure is less important than the mission itself. We need tools to both empower researchers to assess the reproducibility of their work, and to derive new reproducible products. While I will not delve into some of the work I’ve done in my graduate career that is in line with this vision (let’s save that for thesis drivelings), I will discuss some important observations about the academic ecosystem, and make suggestions for current scientists to do better.

    Reproducibility and Graduate Students

    Reproducibility goes far beyond the creation of a single database to deposit results. Factors such as careful documentation of variables and methods, how the data were derived, and dissemination of results unify to embody a pattern of sound research practices that have previously not been emphasized. Any single step in an analysis pipeline that is not properly documented, or does not allow for a continued life cycle of a method or data, breaks reproducibility. If you are a graduate student, is this your problem? Yes it is your problem. Each researcher must think about the habits and standards that he or she partakes in from the initial generation of an idea through the publishing of a completed manuscript. On the one hand, I think that there is already a great burden on researchers to design sound experiments, conduct proper statistical tests, and derive reasonable inferences from those tests. Much of the disorganization and oversight to sound practices could be resolved with the advent of better tools such as resources for performing analysis, visualizing and capturing workflows, and assessing the reproducibility of a result. On the other hand, who is going to create these tools? The unspoken expectation is that “This is someone else’s problem.” Many seem to experience tunnel vision during graduate school. There is no reality other than the individual’s thesis, and as graduate students we are protected from the larger problems of the community. I would argue that the thesis is rather trivial, and if you spend most of your graduate career working on just one project, you did not give the experience justice. I don’t mean to say that the thesis is not important, because graduation does not happen without its successful completion. But rather, graduate school is the perfect time to throw yourself into learning, collaborating on projects, and taking risks. If you have time on the weekends to regularly socialize, go to vineyards, trips, and consistently do things that are not obsessively working on the topic(s) that you claimed to be passionate about when applying, this is unfortunate. If you aim to get a PhD toward the goal of settling into a comfy, high income job that may not even be related to your research, unless you accomplished amazing things during your time as a young researcher, this is also unfortunate. The opportunity cost of these things is that there is probably someone else in the world that would have better taken advantage of the amazing experience that is being a graduate student. The reason I bring this up is because we should be working harder to solve these problems. With this in mind, let’s talk about tiny things that we can do to improve how we conduct research.

    The components of a reproducible analysis

    A reproducible analysis, in its truest definition, must be easy to do again. This means several key components for the creation and life cycle of the data and methods:

    • complete documentation of data derivation, analysis, and structure
    • machine accessible methods and data resources
    • automatic integration of data, methods, and standards

    A truly reproducible analysis requires the collection, processing, documentation, standardization, and sound evaluation of a well-scoped hypothesis using large data and openly available methods. From an infrastructural standpoint this extends far beyond requiring expertise in a domain science and writing skills, calling for prowess in high performance computing, programming, database and data structure generation and management, and web development. Given initiatives like the Stanford Center for Reproducible Neuroscience, we may not be too far off from “reproducibility as a service.” This does not change the fact that reproducibility starts on the level of the individual researcher.

    Documentation

    While an infrastructure that manages data organization and analysis will immediately provide documentation for workflow, this same standard must trickle into the routine of the average scientist before and during the collection of the input data. The research process is not an algorithm, but rather a set of cultural and personal customs that starts from the generation of new ideas, and encompasses preferences and style in reading papers and taking notes, and even personal reflection. Young scientists learn through personal experience and immersion in highly productive labs with more experienced scientists to advise their learning. A lab at a prestigious University is like a business that exists only by way of having some success with producing research products, and so the underlying assumption is that the scientists in training should follow suit. The unfortunate reality is that the highly competitive nature of obtaining positions in research means that the composition of a lab tends to weigh heavily toward individuals early in their research careers, with a prime focus on procuring funding for grants to publish significant results to find emotional closure in establishing security of their entire life path thus far. In this depiction of a lab, we quickly realize that the true expertise comes by way of the Principal Investigator, and the expectation of a single human being to train his or her entire army while simultaneously driving innovative discovery in his or her field is outrageous. Thus, it tends to be the case that young scientists know that it’s important to read papers, take notes, and immerse themselves in their passion, but their method of doing this comes by way of personal stumbling to a local optimum, or embodying the stumbling of a slightly larger fish.

    Levels of Writing

    A distinction must be made between a scientist pondering a new idea, testing code for a new method, and archiving a procedure for future lab-mates to learn from. We can define different levels of writing based on the intended audience (personal versus shared), and level of privacy (private versus public). From an efficiency standpoint, the scientist has much to gain by instilling organization and recording procedure in personal learning and data exploration, whether it be public or private. A simple research journal provides a reliable means to quickly turn a discovery into a published piece of work. This is an example of personal work, and private may mean that it is stored on an individual’s private online Dropbox, Box, or Google Drive, and public may mean that it is written about on a personal blog or forum. Keeping this kind of documentation, whether it is private or public, can help an individual to keep better track of ideas and learning, and be a more efficient researcher. Many trainees quickly realize the need to record ideas, and stumble on a solution without consciously thinking ahead to what kind of platform would best integrate with a workflow, and allow for future synthesis and use of the knowledge that is recorded.

    In the case of shared resources, for computational labs that work primarily with data, an online platform with appropriate privacy and backup is an ideal solution over more fragile solutions such as paper or documents on a local machine. The previously named online platforms for storing documents (Box, Dropbox, and Google Drive), while not appropriate for PI or proprietary documents, are another reasonable solution toward the goal of shared research writing. These platforms are optimized for sharing amongst a select group, and again without conscious decision making, are commonly the resources that labs use in an unstructured fashion.

    Documentation of Code

    In computational fields, it is typically the case that the most direct link to reproducing an analysis is not perusing through research prose, but by way of obtaining the code. Writing is just an idealistic idea and hope until someone has programmed something. Thus, a researcher in a computational field will find it very hard to be successful if he or she is not comfortable with version control. Version control keeps a record of all changes through the life cycle of a project. It allows for the tagging of points in time to different versions of a piece of software, and going back in time. These elements are essential for reproducible science practices that are based on sharing of methods and robust documentation of a research process. It takes very little effort for a researcher to create an account with a version control service (for example, http://www.github.com), and typically the biggest barriers to this practice are cultural. A researcher striving to publish novel ideas and methods is naturally going to be concerned over sharing ideas and methods until they have been given credit for them. It also seems that researchers are terrified of others finding mistakes. I would argue that if the process is open and transparent, coding is collaborative, and peer review includes review of code, finding a bug (oh, you are a human and make mistakes every now and then?) is rather trivial and not to be feared. This calls for a change not only in infrastructure, but research culture, and there is likely no way to do that other than by slow change of incentives and example over time. It should be natural for a researcher, when starting a new project, to immediately create a repository to organize its life-cycle. While we cannot be certain that services like Github, Bitbucket, and Sourceforge are completely reliable and will exist ad infinitum, this basic step can minimally ensure that work is not lost to a suddenly dead hard-drive, and methods reported in the text of a manuscript can be immediately found in the language that produced the result. Researchers have much to gain in being able to collaboratively develop methods and thinking by way of slowly gaining expertise in using these services. If a computational graduate student is not comfortable and established in using Github by the end of his or her graduate career, this is a failure in his or her training as a reproducible scientist.

    On the level of documentation in the code itself, this is often a personal, stylistic process that varies by field. An individual in the field of computer science is more likely to have training in algorithms and proper use of data structures and advanced programming ideas, and is more likely to produce computationally efficient applications based on bringing together a cohesive set of functions and objects. We might say this kind of research scientist, by way of choosing to study computer science to begin with, might be more driven to develop tools and applications, and unfortunately for academia will ultimately be most rewarded for pursuing a job in industry. This lack of “academic software developers,” as noted previously, is arguably the prime reason that better, domain-specific, tools do not exist for academic researchers. A scientist that is more driven to answer biological questions sees coding as a means to procure those answers, and is more likely to produce batch scripts that use software or functions provided by others in the field. In both cases, we gripe over “poorly documented” code, which on the most superficial level suggests that the creator did not add a proper comment to each line explaining what it means. An epiphany that sometimes takes years to realize is the idea that documentation of applications lives in the code itself. The design, choice of variable names and data structures, spacing of the lines and functions, and implementation decisions can render a script easy to understand, or a mess of characters that can only be understood by walking through each line in an interactive console. Scientists in training, whether aiming to build elegant tools or simple batch scripts, should be aware of these subtle choices in the structure of their applications. Cryptic syntax and non-intuitive processes can be made up for with a (sometimes seemingly) excessive amount of commenting. The ultimate goal is to make sure that a researcher’s flow of thinking and process is sufficiently represented in his programming outputs.

    Documentation Resources for Scientists

    A salient observation is that the resources below are all service-oriented, web-based tools. The preference for Desktop software such as Microsoft Word or Excel is founded on the fact that Desktop software tends to provide a better user interface (UI) and functionality. However, the current trend is that the line is blurring between Desktop and browser, and with the growing trend of browser-based offline tools that work with or without an internet connection, it is only a matter of time until there will be no benefit to using a Desktop application over a web-based one. Research institutions have taken notice of the benefit of using these services for scientists, and are working with some of these platforms to provide “branded” versions for their scientists. Stanford University provides easy access to wikis, branded “Box” accounts for labs to share data, along with interactive deployment of Wordpress blogs for individuals and research groups to deploy blogs and websites for the public. Non-standard resources might include an online platform for writing and sharing LaTeX documents (http://www.overleaf.com), for collecting and sharing citations (http://www.paperpile.com, http://www.mendeley.com), and for communicating about projects and daily activity (http://www.slack.com) or keeping track of projects and tasks (http://www.asana.com).

    This link between local and web-based resources continues to be a challenge that is helped with better tools. For example, automated documentation tools (e.g., Sphinx for Python) can immediately transform comments hidden away in a Github repository into a clean, user friendly website for reading about the functions. Dissemination of a result, to both other scientists and the public, is just as important as (if not more important than) generation of the result, period. An overlooked component toward understanding of a result is providing the learner with more than a statistical metric reported in a manuscript, but a cohesive story to put the result into terms that he or she can relate to. The culture of publication is to write in what sounds like “research speak,” despite the fact that humans learn best by way of metaphor and story. What this means is that it might be common practice to, along with a publication, write a blog post and link to it. This is not to say that results should be presented as larger than they really are, but put into terms that are clear and understandable for someone outside of the field. Communication about results to other researchers and the public is an entire thesis in itself, but minimally scientists must have power to relate their findings to the world via an internet browser. Right now, that means a simple text report and prose to accompany a thought, or publication. Our standards for dissemination of results should reflect modern technology. We should have interactive posters for conferences, theses and papers immediately parsed for sophisticated natural language processing applications, and integration of social media discussion and author prose to accompany manuscripts. A scientist should be immediately empowered to publish a domain-specific web report that includes meaningful visualization and prose for an analysis. It might be interactive, including the most modern methods for data visualization and sharing. Importantly, it must integrate seamlessly into the methodology that it aims to explain, and associated resources that were used to derive it. It’s up to us to build these tools. We will try many times, and fail many times. But each effort is meaningful. It might be a great idea, or inspire someone. We have to try harder, and we can use best practices from software development to guide us.

    The Academic Software Developer

    The need for reproducible science has brought with it the emerging niche of the “academic software developer,” an individual who is both a research scientist and a full-stack software developer, and is well suited to develop applications for specific domains of research. This is a space that exists between research and software development, and the niche that I choose to operate in. The Academic Software Developer, in the year 2016, is thinking very hard about how to integrate large data, analysis pipelines, and structured terminology and standards into web-friendly things. He or she is using modern web technology including streaming data, server-side JavaScript, Python frameworks and cloud resources, and Virtual Machines. He or she is building and using Application Programming Interfaces, Continuous Integration, and version control. Speaking personally, I am continually experimenting with my own research, trying new things and trying to do a little bit better each time. I want to be a programming badass, and have ridiculous skill using Docker, Neo4j, Web Components, and building Desktop applications to go along with the webby ones. It is not satisfying to stop learning, or to see such amazing technology being developed down the street and have painful awareness that I might never be able to absorb it all. My work can always be better, and perhaps the biggest strength and burden of having such stubbornness is this feeling that efforts are never good enough. I think this can be an OK way to be, given the understanding that it is OK for things to not work the first time. I find this process of learning, trying and failing, and trying again, to be exciting and essential for life fulfillment. There is no satisfaction at the end of the day if there are not many interesting, challenging problems to work on.

    I don’t have an “ending” for this story, but I can tell you briefly what I am thinking about. Every paper should be associated with some kind of “reproducible repo.” This could mean one (or more) of several things, depending on the abilities of the researcher and importance of the result. It may mean that I can deploy an entire analysis with the click of a button, akin to the recently published MyConnectome Project. It may mean that a paper comes with a small web interface linking to a database and API to access methods and data, as I attempted even for my first tiny publication. It could be a simple interactive web interface hosted with analysis code on a Github repo to explore a result. We could use continuous integration outside of its scope to run an analysis, or programmatically generate a visualization using completely open source data and methods (APIs). A published result is almost useless if care is not taken to make it an actionable, implementable thing. I’m tired of static text being the output of years of work. As a researcher I want some kind of “reactive analysis”: an assertion a researcher makes about a data input answering some hypothesis, with notification about a change in results when the state of the world (data) changes. I want current “research culture” to be more open to the business and industry practice of using data from unexpected places beyond Pubmed and limited self-report metrics that are somehow “more official” than someone writing about their life experience informally online. I am not convinced that the limited number of datasets that we pass around and protect, not sharing until we’ve squeezed out every last inference, are somehow better than the crapton of data that is sitting right in front of us in unexpected internet places. Outside of a shift in research culture, generation of tools toward this vision is by no means an easy thing to do. Such desires require intelligent methods and infrastructure that must be thought about carefully, and built. But we don’t currently have these things, and we have already fallen far behind the standard in industry, which probably comes by way of having more financial resources. What do we have? We have ourselves. We have our motivation, and skillset, and we can make a difference. My hope is that other graduate students have equivalent awareness to take responsibility for making things better. Work harder. Take risks, and do not be complacent. Take initiative to set the standard, even if you feel like you are just a little fish.

    ·

  • Got Santa on the Brain?

    Got Santa on the Brain? We do at Poldracklab! Ladies and gentlemen, we start with a nice square Santa:

    We convert his colors to integer values…

    And after we do some substantial research on the Christmas Spirit Network, we use our brain science skills and…

    We’ve found the Christmas Spirit Network!

    What useless nonsense is this?

    I recently had fun generating photos from tiny pictures of brains, and today we are going to flip that on its head (artbrain!). My colleague had an idea to do something spirited for our NeuroVault database, and specifically, why not draw pictures onto brainmaps? I thought this idea was excellent. How about a tool that can do it? Let’s go!

    Reading Images in Python

    In my previous brainart I used the standard PIL library to read in a jpeg image, and didn’t pay any attention to the fact that many images come with an additional fourth dimension, an “alpha” layer that determines image transparency. This is why we can have transparency in a png, and not in a jpg, at least per my limited understanding. With this in mind, I wanted to test different ways of reading in a png, and minimally choose to ignore the transparency. We can use PyPNG to read in the png:

    
        import numpy
        import png
        import os
        import itertools

        # png_image is the path to the (square) png we want to read
        pngReader = png.Reader(filename=png_image)
        # asDirect returns (width, height, rows, info), so the first value is
        # the column count and the second is the row count
        column_count, row_count, pngdata, meta = pngReader.asDirect()
    
    

    The “meta” variable is a dictionary that holds metadata about the image:

    
        meta
         {'alpha': True,
          'bitdepth': 8,
          'greyscale': False,
          'interlace': 0,
          'planes': 4,
          'size': (512, 512)}
    
        bitdepth=meta['bitdepth']
        plane_count=meta['planes']
    
    

    Right off the bat this gives us a lot of power to filter or understand the image that the user chose to read in. I’m not going to be restrictive and will let everything come in, because I’m more interested in the errors that might be triggered. It’s standard practice to freak out when we see an error, but debugging is one of my favorite things to do, because we can generally learn a lot from errors. We then want to use numpy to reshape the image into something that is intuitive to index, with dimensions (X,Y,RGBA):

    
        # Stack each row from the iterator into a 2D array with shape
        # (rows, columns * planes); note that this consumes the pngdata iterator
        image_2d = numpy.vstack(itertools.imap(numpy.uint16, pngdata))
        # Reshape into (rows, columns, planes); if planes == 4, the last
        # plane is the alpha layer
        image_3d = numpy.reshape(image_2d, (row_count, column_count, plane_count))
    
    

    The pngdata variable is an iterator, so it gets consumed as we stack the rows. If you want to look at one row in isolation when you are testing this, after generating the variable (and before stacking it) you can just do:

    
    pngdata.next()
    
    

    To spit it out to the console. Then when we want to reference each of the Red, Green, and Blue layers, we can do it like this:

    
        R = image_3d[:,:,0]
        G = image_3d[:,:,1]
        B = image_3d[:,:,2]
    
    

    And the alpha layer (transparency) is here:

    
        # The alpha plane is the last of the four planes (index 3)
        A = image_3d[:,:,3]
    
    

    Finally, since we want to map this onto a brain image (which doesn’t support separate color channels), I used a simple equation that I found on StackOverflow to convert each pixel to a single integer value:

    
        # Pack the R, G, B channels into a single integer per pixel, casting
        # up first so that the bit shifts do not overflow the uint16 values
        rgb = R.astype(numpy.uint32)
        rgb = (rgb << 8) + G
        rgb = (rgb << 8) + B
    
    

    For the final brain map, I normalized to a Z score to give positive and negative values, because then the Papaya Viewer will by default detect the positive and negative values, and give you two color maps to play with. Check out an example here, or just follow instructions to make one yourself! I must admit I threw this together rather quickly, and only tested on two square png images, so be on the lookout for them bugs. Merry Christmas and Happy Holidays from the Poldracklab! We found Santa on the brain, methinks.
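
    As a postscript, here is a rough sketch of that final step, assuming nibabel and using made-up file names (this is not the exact script behind the tool): the packed rgb values from above are z-scored and written into the voxels of a standard-space brain mask.

    import nibabel
    import numpy

    # load a standard-space brain mask (hypothetical file name) and prepare
    # an empty volume with the same shape
    mask = nibabel.load("MNI152_T1_2mm_brain_mask.nii.gz")
    mask_data = mask.get_fdata()
    output = numpy.zeros(mask.shape)

    # z-score the packed rgb values from the previous step
    values = rgb.flatten().astype(numpy.float64)
    zscores = (values - values.mean()) / values.std()

    # write as many pixels as will fit into the voxels inside the mask
    voxels = numpy.where(mask_data != 0)
    n = min(len(zscores), len(voxels[0]))
    output[tuple(axis[:n] for axis in voxels)] = zscores[:n]

    nibabel.save(nibabel.Nifti1Image(output, affine=mask.affine), "santa_brain.nii.gz")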

    ·

  • Behind Her Nose

    It has been said that to fall in love is to see for the first time. The upturn of an eye is the deepest of blues emit from a single stringed instrument. The definition of mystery is a brief brush by the softest of skin on the inner side of the wrist. The promise of a new day is the hope that you sense in the cautious grasp of an outreached hand. These were the subtitles that called to her in metaphor and painted her life a romantic composition that brought forth beauty in the details. When others saw a point of color, she saw a tiny universe. Behind every subtle detail was a rich emotion, the full story of an intensive day that ended with the draw of a breath. If at one point it had overwhelmed her, she had learned to step into the stream of information and not fight being carried off into the current of story that surrounds the people that she cared about.

    Today it was a cold, rushing flow. The gingerbread air freshener fell from the mailbox into the bushes, and she didn’t reach to pick him up. His squishy rubber body was plastered with a plastic white smile that suggested warm evenings by a fire, and family. How cold and inhumane. She sometimes wanted to be mean. She imagined it would be so satisfying to throw a nasty glance to someone that had hurt her, or turn a stone shoulder to one of the many ornery assholes of the world. She tried this every so often, and the outcome was emotional pain on her part. How squishy could she be to feel shame for being so awful to her gingerbread air freshener? Probably just as squishy as he was.

    Her arms felt insanely long and her hands infinitely far away, and she seemed to remember some questions online that asked about strange body perceptions like that. The problem with making such boldly stated assertions about normality and then ascribing people with labels given some ill-defined questionnaire is that normality is an illusion, and such assertions only serve to encourage people to package themselves up and not reveal any experience or cognition that might regard them as broken, or not quite right. The irony is that the set of fundamental human qualities includes a desire to be understood, and this rests on having some ability to connect with other people. Such a simple thing can bring a soul back from sadness into finding acceptance, yet in our desire to “help” such people we discourage revelations and intimacy, and drive them deeper into emotional isolation. Nuts. The girl realized that she was standing in her doorway, again finding herself lost in her own thoughts. She was sure they would have a question about that too.

    The kitchen was dark upon entering, only with a small spot of light shining on the pile of blankets where she slept on the blue tile. Everything she looked at spoke to her in metaphor. It is usually the case that for one’s life, the color that the metaphor paints its experience, is determined by the most pressing issue in the person’s conscious or unconscious mind. The girl was intensely distracted with an upcoming event: to make a declaration of her ultimate purpose before it was to be discovered by way of trial and error, by learning and growth. It was a backwards way to go about things, to need to sign off on a future vision and not simply declare that she wanted to continue her growth happening in the present. Thus, this was the metaphor that painted her empty kitchen that day. She did not care about a future title or role, but rather wanted to craft the perfect environment for such learning and growth, and one that would maximize those moments of brilliance when the multiple flows of information that were her constant state of thoughts came together in beautiful symphony, producing a swift and elegant efficiency that gave the feeling of the ultimate fulfillment. She was tired of being probed and tested for the things she knew, and what she did not. There were always infinite things that she would and could possibly not know. She was brilliant sometimes, and completely broken and useless in other things, and she could only promise to do her best. She wanted her best to be enough. And so she looked for this feeling in her kitchen. To be like the spoon that falls between the oven and the sink: not quite fit right to be used for cooking, but falling short of where all the other utensils were spending their time.

    She wondered what it might be like to have a firm belonging in the sink, or even granted the celebrity status to get a pass into the dishwasher. In times like these, her strategy for evaluating the degree of belonging was to pretend to be someone else. The answers came easily, flooding her mind with a particularly blue and colorful river of emotional thought. “I can’t stand being with her,” thought the girl, and the statement was so assertive that she rushed out of the room to give action to its salience. It was clear that it would be easy to be overwhelmed by her intensity, a constant beat of urgency like a gust of wind just strong enough to topple one’s balance and force taking a necessary step. Her relaxed state was a breath held, over a decade ago, and forgotten. If it were to be found and rush out, with it might come particles of experience that had been swirling inside her like a dust storm for years. It’s no wonder she didn’t belong in the sink. She was less of a utensil, and more of the water rushing from the faucet.

    She exited stage left from this thought, and escaped from it by closing the kitchen door behind her, collapsing onto a pile of blankets on the floor. She for some reason was averse to purchasing a bed, because it seemed like just another expensive item that would be disposed of too quickly to justify the cost, and resorted to sleeping on piles of soft things thrown on the floor in different rooms, wherever she happened to sit with her computer. She rolled over 180 degrees to face the ceiling, and threw back her arms over her head. She closed her eyes. The tiny fan that powered the entire air flow in her apartment provided a soothing noise to fight against the flashes that were the constant behind those closed eyes. It was easy to see why keeping company with others was so challenging for this explosion of a human being. Rationalization, intelligent thought, and most standard cognitive and behavioral approaches could not reason with her natural rhythm, and she found this funny. It might have seemed overwhelming to others, but was not really a big deal for her after so much life experience with it. Most insight comes from this strange kind of intense thinking, the kind that combines unexpected things in sometimes hilarious ways. She searched for some well chosen thought that would serve as a mental blankie to bring the rhythm of her mind to a different key. Sometimes she chose motivational phrases, or even Latin. Today it was two words. Just. Breathe.

    It was now getting dark, and she imagined herself reduced to a dark silhouette against the setting sun from behind the blinds. It is assumed that which is right in front of us can be seen, and she found this logical and true for most tangible things. But for all that defined her in this current setting, the longest of fingers, scraggly hair and mantis arms, she came to realize that for the non-tangible, she could not be seen. And in the times like these, when her future was based on some brief evaluation by a stranger, when she wanted more than anything to be seen, it was even more unlikely. It would take an abyss of time and careful words articulated with story to properly describe whatever encompassed her spirit, and this was not feasible. It was even worse when she thought about love. It had been a painful experience, time and time again, to come to terms that despite her assumptions, the people that she loved could not distinguish her from a crowd. It is interesting how, when the seemingly different domains of life are broken into their component Legos, the fundamental human desires are equivalent. She wanted to be understood, and she wanted to be valued, and this was true across things. For this next step in her life, perhaps a different strategy was needed. To make an assertion that an external thing should see and value her in some way was not the right way to go about it.

    It has been said that to fall in love is to see for the first time. It can be added to that wisdom that to be able to glow brilliantly and distinguish oneself from a crowd, one might choose to be blind. It is only when we close our eyes to the expectations of the rest of the world that we can grow beautifully, and immensely. She realized that, for this next step in her life, she did not want to try and be seen, but to just continue growing, without expectation. She wrinkled her nose at the discovery of such a lovely place in the now darkness.

    ·

  • Reproducible Analyses: the MyConnectome Project

    Reproducibility of an analysis means having total transparency about the methods, and sharing of data so that it can be replicated by other scientists. While it used to be acceptable to detail a method in a manuscript, modern technology demands that we do better than that. Today coincides with the publication of the MyConnectome Project, a multi-year endeavour by Russ Poldrack to collect longitudinal behavioral, brain imaging, and genomics data to do the first comprehensive phenotyping of a single human, including imaging data. Russ came to Stanford in the Fall of 2014 when I started working with (and eventually joined) Poldracklab, at which point the data acquisition had finished, analyses were being wrapped up, and we were presented with the larger problem of packaging this entire thing up for someone else to run. We came up with the MyConnectome Results web interface and a completely reproducible pipeline, which will be the topic of this post. Russ has written about this process, and I’d like to supplement those ideas with more detail about the development of the virtual machine itself.

    What is a reproducible analysis?

    From a development standpoint, we want intelligent tools that make it easy to package an entire workflow into something usable by other scientists. The easy answer to this is using a virtual machine, where one has control over the operating system and software. This is the strategy that we took, using a package called vagrant that serves as a wrapper around virtualbox. This means that it can be deployed locally on a single user’s machine, or on some cloud to be widely available. During our process of packaging the entire thing up, we had several key areas to think about:

    Infrastructure

    While it may seem like one cohesive thing, we are dealing with three things: the analysis code that does all the data processing, the virtual machine to deploy this code in an environment with the proper software and dependencies, and the web interface to watch over things running, and keep the user informed.

    Server and Framework

    A web framework is what we might call the combination of some “back-end” of things running on the server communicating with the “front-end,” or what is seen in the browser. Given substantial analyses, we can’t rely just on front-end technologies like JavaScript and HTML. The choice of a back-end was easy in this case: since most neuroimaging analysis tends to be in python, we went with a python-based framework called “Flask.” Flask is something I fell in love with, not only because it was in python, but because it gave complete freedom to make any kind of web application you could think of in a relatively short amount of time. Unlike its more controlled sibling framework Django, which has some basic standards about defining models and working with a database, Flask lets you roll your own whatever. I like to think of Django as a Mom minivan, and Flask as the Batmobile. The hardest part of deployment with Flask was realizing how hairy setting up web servers from scratch was, and worrying about security and server usage. Everything in this deployment was installed from scratch and custom set up; however, I think if I did it again I would move to a container-based architecture (e.g., Docker), for which there are many pre-built system components that can be put together with a tool called docker-compose, like Legos to build a castle. We also figured out how to manage potentially large traffic with Elastic Load Balancing. ELB can take multiple instances of a site (in different availability zones, ideally) and do as promised, and balance the load to the individual servers. If you’ve never done any kind of work on Amazon Web Services (AWS) before, I highly recommend it. The tools for logging, setting up alerts, permissions, and all of the things necessary to get a server up and running are very good. I’ve used Google Cloud as well, and although the console is different, it is equally powerful. There are pros and cons to each, discussion of which is outside the scope of this post.
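
    To give a flavor of what this looks like, here is a minimal sketch of a Flask application of the sort that sits behind an interface like this one (the routes and the progress dictionary are hypothetical, not the actual MyConnectome code):

    from flask import Flask, jsonify

    app = Flask(__name__)

    # stand-in for however the real interface tracks analysis progress
    progress = {"completed": 0, "total": 100}

    @app.route("/")
    def home():
        # the real interface renders a full template with the progress bar,
        # log tabs, and links to results
        return "Analysis results would render here"

    @app.route("/status")
    def status():
        # the front-end can poll an endpoint like this to update the page
        return jsonify(progress)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)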

    Data

    Data are the flour and eggs of the analysis cake, and if too big, are not going to fit in the pantry. When the data are too big to be packaged with the virtual machine, as was our case, option two is to provide it as an external download. However, there is still the issue that many analyses (especially involving genomic data) are optimal for cluster environments, meaning lots of computers, and lots of memory. As awesome as my Lenovo Thinkpad is, many times when I run analyses in a cluster environment I calculate out how long the same thing would take to run in serial on a single computer, and it’s something ridiculous like 8 months. Science may feel slow sometimes, but I don’t think even the turtle-iest of researchers want to step out for that long to make a sandwich. Thus, for the purposes of reproducing the analyses, in these cases it makes sense to provide some intermediate level of data. This is again the strategy that we took, and I found it amazing how many bugs could be triggered by something as simple as losing an internet connection, or a download server that is somewhat unreliable. While there is technology expanding to connect applications to clustery places, there are still too many security risks to open up an entire supercomputer to the public at large. Thus, for the time being, the best solution seems to be putting your data somewhere with backup, and reliable network connectivity.

    Versions

    As a developer, my greatest fear is that dependencies change, and down the line something breaks. This unfortunately happened to us (as Russ mentions in his post) when we downloaded the latest python mini computational environment (miniconda) and the update renamed the folder to “miniconda2” instead of miniconda. The entire virtual machine broke. We learned our lesson, but it drives home the point that any reproducible workflow must take into account software and package versions, and be able to obtain them reliably.
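
    A minimal sketch of one way to guard against this, with hypothetical version pins, is to check the environment against the versions the analysis was built with and fail fast when anything has drifted:

    import importlib

    # hypothetical pins; a real workflow would pin everything it depends on
    PINNED = {"numpy": "1.10.1", "nibabel": "2.0.1"}

    def check_versions(pins=PINNED):
        problems = []
        for package, expected in pins.items():
            module = importlib.import_module(package)
            found = getattr(module, "__version__", "unknown")
            if found != expected:
                problems.append("%s: expected %s, found %s" % (package, expected, found))
        if problems:
            raise RuntimeError("Version drift detected:\n" + "\n".join(problems))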

    Expect Errors

    With so many different software packages, inputs, and analyses coming together, and the potential variability of the user’s internet connection, there is never complete certainty of a virtual machine running cleanly from start to finish. A pretty common error is that the user’s internet connection blips, and for some reason a file is not downloaded properly, or is completely missing. A reproducible repo must be able to be like a ship, and take on water in some places without sinking. A good reproducible workflow must be able to break a leg, and finish the marathon.
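
    One simple defense, sketched below with a hypothetical URL handling and retry policy, is to wrap every download in a retry loop so that a single failed request does not end the run:

    import time
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve  # Python 2

    def download_with_retry(url, destination, attempts=3, wait=10):
        for attempt in range(1, attempts + 1):
            try:
                urlretrieve(url, destination)
                return destination
            except IOError as error:
                print("Attempt %s failed for %s: %s" % (attempt, url, error))
                time.sleep(wait)
        raise RuntimeError("Could not download %s after %s attempts" % (url, attempts))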

    Timing

    The interface is nice in keeping the user updated about an estimated time remaining. We accomplished this by running the analyses through completion to come up with a set of initial times associated with the generation of each output file. Since these files are generated reliably in the same order, we could generate a function to return the time associated with the index of the output file farthest along in the list. This means that if there is an error and a file is not produced, the original estimated time may be off a bit, but the total time remaining will be calculated based on files that do not exist after the index of the most recently generated file. In other words, it can adjust properly when files are missing due to error. It’s a rather simple system that can be greatly improved upon, but it seemed to work.
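
    Here is a minimal sketch of that estimator; the output file names and benchmark times below are made up for illustration:

    import os

    # hypothetical outputs in the order they are generated, with times (in
    # seconds) measured once on a complete run
    expected_outputs = ["qc_report.html", "rsfmri_maps.nii.gz", "wgcna_modules.csv"]
    benchmark_seconds = [120, 3600, 900]

    def time_remaining(output_dir):
        exists = [os.path.exists(os.path.join(output_dir, name))
                  for name in expected_outputs]
        # index of the output file farthest along in the list
        done = [index for index, present in enumerate(exists) if present]
        last_done = max(done) if done else -1
        # add up benchmark times of later files that are still missing
        return sum(seconds for index, seconds in enumerate(benchmark_seconds)
                   if index > last_done and not exists[index])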

    Communication

    As the user watches a percent-completed bar increase with an estimated time remaining, different links to analyses items change from gray to green to indicate completion. The user can also look at the log tab to see outputs to the console. We took care to arrange the different kinds of analyses in the order they are presented in the paper, but the user has no insight beyond that. An ideal reproducible workflow would give the user insight into what is actually happening, not just in an output log, but in a clean interface with descriptions and explanations of inputs and outputs. It might even include comments from the creator about parameters and analysis choices. How would this be possible? The software could read in comments from code, and the generator of the repo would be told to leave notes about what is going on in the comments. The software would then need to be able to track which lines are currently being executed in a script, and report comments appropriately. A good reproducible workflow comes with ample, even excessive, documentation, and there is no doubt about why something is happening at any given point.
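
    One rough way to prototype that idea in python (just a sketch, not something the interface currently does) is to use the built-in trace hook to surface the nearest comment above the line currently being executed:

    import linecache
    import sys

    def comment_for(filename, lineno):
        # walk upward from the current line until we hit a comment
        for line_number in range(lineno, 0, -1):
            line = linecache.getline(filename, line_number).strip()
            if line.startswith("#"):
                return line.lstrip("# ")
        return None

    def narrate(frame, event, arg):
        if event == "line":
            note = comment_for(frame.f_code.co_filename, frame.f_lineno)
            if note:
                print("Currently: %s" % note)
        return narrate

    # sys.settrace(narrate)  # enable before running the analysis script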

    Template Interfaces

    The front page is a navigation screen to link to all analyses, and it updates in real time to keep the user informed about what is completed. An interactive D3 banner across the top of the screen was the very first component I generated specifically for this interface, inspired by a static image on Russ’ original site. While custom, hard coded elements are sometimes appropriate, I much prefer to produce elements that can be customized for many different use cases. Although these elements serve no purpose other than to add a hint of creativity and fun, I think taking the time and attention for these kinds of details makes applications a little bit special, more enjoyable for the user, and thus more likely to be used.

    The output and error log page is a set of tabs that read in dynamically from an output text file. The funny thing about these logs is that what gets classified as “error” versus “output” is largely determined by the applications outputting the messages, and I’m not sure that I completely agree with all of these classifications. I found myself needing to check both logs when searching for errors, and realizing that the developer can’t rely on the application classification to return reliable messages to the user. Some higher level entity would need to more properly catch errors and warnings, and present them in a more organized fashion than a simple dump of a text file on a screen. It’s not terrible because it worked well to debug the virtual machine during development, but it’s a bit messy.

    The interactive data tables page uses the jQuery DataTables library to make nicely paginated, sortable tables of results. I fell in love with these simple tables when I first laid eyes on them during my early days of developing for NeuroVault. When you don’t know any better, the idea of having a searchable, sortable, and dynamic table seems like magic. It still does. The nice thing about science is that regardless of the high-tech visualizations and output formats, many results are arguably still presented best in a tabular format. Sometimes all we really want to do is sort based on something like a p-value. However, not everything is fit for a table. I don’t think the researcher deploying the workflow should have to match his or her results to the right visualization type - the larger idea here is that the visualization of any kind of result must be sensitive to the output data type. Our implementation was largely hard coded for each individual output, whether that be an ipython notebook or R Markdown rendered to HTML, a graphic, PDF, or a table. Instead of this strategy, I can envision a tool that sees an “ipynb” and knows to install a server (or point a file to render at one) and if it sees a csv or tsv file, it knows to plug it into a web template with an interactive table. In this light, we can rank the “goodness” of a data structure based on how easy it is to convert from its raw output to something interpretable in a web browser. Something like a PDF, tsv, graphic, or JSON data structure gets an A+. A brain image that needs a custom viewer or a data structure that must be queried (e.g., RDF or OWL) does not fare as well, but arguably the tool deploying the analysis can be sensitive to even complex data formats. Finally, all directories should be browsable, as we accomplished with Flask-Autoindex.
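
    A sketch of what that sensitivity might look like, with a purely hypothetical mapping from output file extension to rendering strategy:

    # hypothetical mapping from output type to how it gets shown in the browser
    RENDERERS = {
        ".ipynb": "render the notebook as HTML",
        ".html": "serve as-is",
        ".csv": "interactive DataTables table",
        ".tsv": "interactive DataTables table",
        ".pdf": "embed in the page",
        ".png": "inline image",
        ".nii.gz": "hand off to a brain viewer like Papaya",
    }

    def choose_renderer(path):
        for extension, renderer in RENDERERS.items():
            if path.endswith(extension):
                return renderer
        return "fall back to a browsable directory listing"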

    Download

    On the simplest level, outputs should be easy to find, and viewable for interpretation in some fashion. It also might make sense to provide zipped versions of data outputs for the user to quickly download from the virtual machine, in the case of wanting to do something on a local computer or share the data.

    Usage and Reproducibility Metrics

    As all good developers do, we copy-pasted some Google Analytics code into our page templates so that we could keep track of visitors and usage. However, releasing a reproducible workflow of this type that is to be run on some system with a web connection offers so much more opportunity for learning about reproducible analyses. In the case that it’s not on a virtual machine (e.g., someone just ran the myconnectome python package on their computer) we could have a final step to upload a report of results somewhere so that we could compare across platforms. We could track usage over time, and see if there are some variables we didn’t account for that lead to variance in our results. The entire base of “meta data about the analysis” is another thing altogether that must be considered.

    Next steps: the Reproducible Repo

    Throughout this process, the cognition that repeatedly danced across my mind was “How do I turn this into something that can be done with the click of a button?” Could this process somehow be automated, and can I plug it into Github? I did a fun bit of work to make a small package called visci, and it’s really just a low level python module that provides a standard for plugging some set of outputs into some template that can be rendered via continuous integration (or other platform). This seems like one of the first problems to solve to make a more intelligent tool. We would want to be able to take any workflow, hand it over to some software that can watch it run, and then have that software plug the entire thing into a virtual machine that can immediately be deployed to reproduce the analyses. I have not yet ventured into playing more with this idea, but most definitely will in the future, and look forward to the day when it’s standard practice to provide workflows in immediately reproducible formats.

    ·

  • BrainArt. No, really!

    There have been several cases in the last year when we’ve wanted to turn images into brains, or use images in some way to generate images. The first was for the Poldrack and Farah Nature piece on what we do and don’t know about the human brain. It came out rather splendid:

    Nature526

    And I had some fun with this in an informal talk, along with generating a badly needed animated version to do justice to the matrix reference. The next example came with the NeuroImage cover to showcase brain image databases:

    NeuroImage

    This is closer to a “true” data image because it was generated from actual brain maps, and along the lines of what I would want to make.

    BrainArt

    You can skip over everything and just look at the gallery, or the code. It’s under development and there are many things I am not happy with (detailed below), but it does pretty well for this early version. For example, here is “The Scream”:

    This isn’t just any static image. Let’s look a little closer…

    Matter of fact, each “pixel” is a tiny brain:

    And when you see them interactively, you can click on any brain to be taken to the data from which it was generated in the NeuroVault database. BrainArt!

    Limitations

    The first version of this generated the image lookup tables (the “database”) from a standard set of matplotlib color maps. This means we had a lot of red, blue, green, purple, and not a lot of natural colors, or dark “boring” colors that are pretty important for images. For example, here was an original rendering of a face that clearly shows the missing colors:

    UPDATE 12/6/2015: The color tables were extended to include brainmaps of single colors, and the algorithm modified to better match to colors in the database:

    The generation could still be optimized. It’s really slow. Embarrassingly, I have for loops. The original implementation did not generate x and y to match the specific sampling rate specified by the user, and this has also been fixed.

    I spent an entire weekend doing this, and although I have regrets about not finishing “real” work, this is pretty awesome. I should have more common sense and not spend so much time on something no one will use except for me… oh well! It would be fantastic to have different color lookup tables, or even sagittal and/or coronal images. Feel free to contribute if you are looking for some fun! :)

    How does it work?

    The package works by way of generating a bunch of axial brain slices using the NeuroVault API (script). I did this to generate a database and lookup tables of images with black and white backgrounds, and these images (served from github) are used in the function. You first install it:

    
    pip install brainart
    
    

    This will place an executable, ‘brainart’, in your system folder. Use it!

    
    brainart --input /home/vanessa/Desktop/flower.jpg
    
    # With an output folder specified
    brainart --input /home/vanessa/Desktop/flower.jpg --output-folder /home/vanessa/Desktop
    
    

    It will open in your browser, and tell you the location of the output file (in tmp), if you did not specify. Type the name of the executable without any args to see your options.

    Color Lookup Tables

    The default package comes with two lookup tables, which are generated from a combination of matplotlib color maps (for the brains with multiple colors) and single hex values (the single colored brains for colors not well represented in matplotlib). Currently, choice of a color lookup table just means choosing a black or white background, and in the future could be extended to color schemes or different brain orientations. The way to specify this:

    
    brainart --input /home/vanessa/Desktop/flower.jpg --color-lookup black
    
    

    Selection Value N

    By default, the algorithm randomly selects from the top N sorted images with color value similar to the pixel in your image. For those interested, it just takes the minimum of the sorted sums of absolute value of the differences (I believe this is a Manhattan Distance). There is a tradeoff in this “N” value - larger values of N mean more variation in both color and brain images, which makes the image more interesting, but may not match the color as well. You can adjust this value:

    
    brainart --input /home/vanessa/Desktop/flower.jpg --N 100
    
    

    Adding more brain renderings per color would allow for specifying a larger N and giving variation in brains without deviating from the correct color, but then the database would be considerably larger, and the computation time would increase. The obvious fix is to streamline the computation and add more images, but I’m pretty happy with it for now and don’t see this as an urgent need.
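
    To make the matching step concrete, here is a minimal sketch of the idea; the lookup table variables here are hypothetical stand-ins for the package’s actual data structures (my guess is a representative color per brain image): compute the Manhattan (L1) distance between the pixel color and every color in the database, and randomly pick among the N closest.

    import random
    import numpy

    def match_brain(pixel_rgb, lookup_rgb, lookup_paths, N=36):
        # lookup_rgb: (n_images, 3) array of representative colors
        # lookup_paths: the corresponding brain image files
        distances = numpy.abs(lookup_rgb - numpy.array(pixel_rgb)).sum(axis=1)
        closest = numpy.argsort(distances)[:N]
        return lookup_paths[random.choice(closest)]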

    Sampling Rate

    You can also modify the sampling rate to produce smaller images. The default is every 15 pixels, which seems to generally produce a good result. Take a look in the gallery at “girl with pearl” huge vs. the other versions to get a feel for what I mean. To change this:

    
    brainart --input /home/vanessa/Desktop/flower.jpg --sample 100
    
    

    Contribute!

    The gallery is the index file hosted on the github pages for this repo. See instructions for submitting something to the gallery. While I don’t have a server to host generation of these images dynamically in the browser, something like this could easily be integrated into NeuroVault for users to create art from their brainmaps, but methinks nobody would want this except for me :)

    ·

  • Brain Matrix

    I wanted to make a visualization to liken the cognitive concepts in the cognitive atlas to The Matrix, and so I made the brain matrix without branding. Canvas was the proper choice for this fun project in that I needed to render the entire visualization quickly and dynamically, and while a complete review of the code is not needed, I want to discuss two particular challenges:

    Rendering an svg element into d3

    The traditional strategy of adding a shape to a visualization, meaning appending a data object to it, looks something like this:

    
    svg.selectAll("circle")
        .data(data)
        .enter()
        .append("svg:circle")
        .attr("cy",function(d,i){ return 30*d.Y })
        .attr("cx",function(d,i){ return 30*d.X })
        .attr("r", 10)
        .attr("fill","yellow")
        .attr("stroke-width",10)            
    
    

    You could then get fancy, and instead append an image:

    
    svg.selectAll("svg:image")
        .data(data)
        .enter()
        .append("svg:image")
        .attr("y",function(d,i){ return 30*d.Y })
        .attr("x",function(d,i){ return 30*d.X })
        .attr('width', 20)
        .attr('height', 24)
        .attr("xlink:href","path/to/image.png")
    
    

    The issue, of course, with the above is that you can’t do anything dynamic with an image, beyond maybe adding click or mouse-over functions, or changing basic styling. I wanted to append lots of tiny pictures of brains, and dynamically change the fill, and svg was needed for that. What to do?

    1. Create your svg

    I created my tiny brain in Inkscape, and made sure that the entire thing was represented by one path. I also simplified the path as much as possible, since I would be adding it just under 900 times to the page, and didn’t want to explode the browser. I then added it directly into my HTML. How? An SVG image is just a text file, so open it up in a text editor, and copy-paste away, Merrill! Note that I didn’t bother to hide it, however you could easily do that by giving it a class of “hidden” or setting the display of the div to “none.”

    2. Give the path an id

    We want to be able to “grab” the path, and so it needs an id. Here is the id: I called it “brainpath”. Yes, my creativity in the wee hours of the morning, when making this seems like a great idea, is lacking. :)

    3. Insert the paths

    Instead of appending a “circle” or an “svg:image,” we want a “path.” Also note that the link (“svg:a”) is appended first, so it will be parent to the path (and thus work).

    
    svg.selectAll("path")
        .data(data)
        .enter()
        .append("svg:a")
            .attr("xlink:href", function(d){return "http://www.cognitiveatlas.org/term/id/" + d.id;})
        ...
    
    

    I then chose to add a group (“svg:g”), and this is likely unnecessary, but I wanted to attribute the mouse over functions (what are called the “tips”) to the group.

    
        ...
        .append("svg:g")
        .on('mouseout.tip', tip.hide)
        .on('mouseover.tip', tip.show)
        ...
    
    

    Now, we append the path! Since we need to get the X and Y coordinate from the input data, this is going to be a function. Here is what we do. We first need to “grab” the path that we embedded in the svg, and note that I am using JQuery to do this:

    
    var pathy = $("#brainpath").attr("d")
    
    

    What we are actually doing is grabbing just the data element, which is called d. It’s a string of numbers separated by spaces.

    
    m 50,60 c -1.146148,-0.32219 -2.480447,-0.78184 -2.982912,-1.96751 ...
    
    

    When I first did this, I just returned the data element, and all 810 of my objects rendered in the same spot. I then looked for some X and Y coordinate in the path element, but didn’t find one! And then I realized, the coordinate is part of the data:

    
    m 50,60...
    
    

    Those first two numbers after the m! That is the coordinate! So we need to change it. I did this by splitting the data string by an empty space

    
    var pathy = $("#brainpath").attr("d").split(" ")
    
    

    getting rid of the old coordinate, and replacing it with the X and Y from my data:

    
    pathy[1] = 50*d.Y + "," + 60*d.X;
    
    

    and then returning it, making sure to again join the list (Array) into a single string. The entire last section looks like this:

    
        ...
        .append("svg:path")
        .attr("d",function(d){
           var pathy = $("#brainpath").attr("d").split(" ")
           pathy[1] = 50*d.Y + "," + 60*d.X;
           return pathy.join(" ")
         })
        .attr("width",15)
        .attr("height",15)
    
    

    and the entire thing comes together to be this!

    
    svg.selectAll("path")
        .data(data)
        .enter()
        .append("svg:a")
            .attr("xlink:href", function(d){return "http://www.cognitiveatlas.org/term/id/" + d.id;})
        .append("svg:g")
        .on('mouseout.tip', tip.hide)
        .on('mouseover.tip', tip.show)
        .append("svg:path")
        .attr("d",function(d){
           var pathy = $("#brainpath").attr("d").split(" ")
           pathy[1] = 50*d.Y + "," + 60*d.X;
           return pathy.join(" ")
         })
        .attr("width",15)
        .attr("height",15)
    
    

    Embedding an image into the canvas

    Finally, for the cognitive atlas version I wanted to embed the logo, somewhere. When I added it to the page as an image, and adjusted the div to have a higher z-index, an absolute position, and then the left and top coordinates set to where I wanted the graphic to display, it showed up outside of the canvas. I then realized that I needed to embed the graphic directly into the canvas, and have it drawn each time as well. To do this, first I made the graphic an image object:

    
    var background = new Image();
    background.src = "data/ca.png";
    
    

    Then in my draw function, I added a line to draw the image, ctx.drawImage where I wanted it. The first argument is the image variable (background), the second and third are the page coordinates, and the last two are the width and height:

    
    var draw = function () {
      ctx.fillStyle='rgba(0,0,0,.05)';
      ctx.fillRect(0,0,width,height);
      var color = cacolors[Math.floor(Math.random() * cacolors.length)];
      ctx.fillStyle=color;
      ctx.font = '10pt Georgia';
      ctx.drawImage(background,1200,150,200,70);
      var randomConcept = concepts[Math.floor(Math.random() * concepts.length)];
      // draw the concept at each column's current y position, then advance
      // the column down the canvas (resetting once it falls past the bottom)
      yPositions.map(function (y, index) {
        ctx.fillText(randomConcept, index * 10, y);
        yPositions[index] = (y > height + Math.random() * 10000) ? 0 : y + 10;
      });
    };
    
    

    Pretty neat! The rest is pretty straightforward, and you can look at the code to see. I think that d3 is great, and that it could be a lot more powerful manipulating custom svg graphics over standard circles and squares. However, it still has challenges when you want to render more than a couple thousand points in the browser. Anyway, this is largely useless, but I think it’s beautiful. Check it out, in the cognitive atlas and blue brain versions.

    ·

  • Nifti Drop

    The biggest challenge with developing tools for neuroimaging researchers is the sheer size of the data. A brain is like a mountain. We’ve figured out how to capture it with neuroimaging, and then store the image that we capture in a nifty little file format (that happens to be called nifti). But what happens when we want to share or manipulate our mountains in that trendy little place called the internet? That’s really hard. We rely on big servers to host the files first, and then do stuff. This is an OK strategy when you have brain images that are big and important (e.g., to go with a manuscript), but it’s still not standard practice to use a web interface for smaller things.

    Drag and Drop Nifti

    I want to be able to do the smaller stuff in my web-browser. It’s amazing how many posts I see on different neuroimaging lists about doing basic operations with headers or the data, and visualization is so low priority that nobody gives it the time of day. One of my favorite things to do is develop web-based visualizations. Most of my work relies on some python backend to do stuff first, and then render with web-stuffs. I also tried making a small app to show local images in the browser, or as a default directory view. But this isn’t good enough. I realized some time ago that these tools are only going to be useful if they are drag and drop. I want to be able to load, view header details, visualize in a million cool ways, and export different manipulations of my data without needing to clone a github repo, log in anywhere, or do anything beyond dragging a file over a box. So this weekend, I started some basic learning to figure out how to do that. This is the start of Nifti-drop:

    #DEMO

    It took me an entire day to learn about the FileReader, and implement it to read a nifti file. Then I realized that most of those files are compressed, and it took me another day to get that working. My learning resources were nifti-js (npm), the nifti standard, FileReader, and papaya. It is 100% static, as it has to be, being hosted on github pages!

    What I want to add

    Right now it just displays header data alongside an embarrassingly hideous histogram. I am planning to integrate the following:

    Papaya Viewer: is pretty standard these days, and my favorite javascript viewer, although the code-base is very complicated, and I can barely get my head around it.

    NIDM Export: meaning that you can upload some contrast image, select your contrast from the Cognitive Atlas, and export a data object that captures those things.

    Visualization: Beyond a basic viewer, I am excited about being able to integrate different visualization functions into an interface like this, so you can essentially drop a map, click a button, get a plot, and download it.

    NeuroVault: is near and dear to my heart, and this is the proper place to store statistical brain maps for longevity, sharing, publication, etc. An interface with some kind of nifti-drop functionality could clearly integrate with something like NeuroVault or NeuroSynth.

    Feedback

    Please post feedback as comments, or even better, as github issues so I will see them the next time I jump into a work session.

    ·

  • For Brother and Jane

    Before the light of this evening
    Into a memory does wane
    I would like to share a sentiment
    for Brother and Jane

    The nostalgia within us all
    on this fifth of September
    is not entirely in this moment
    but in the things we remember

    We step in and out of ourselves
    Like an old pair of shoes
    And the biggest insight to life may be
    that our happiness we do so choose

    When Brother was small
    A curly haired Matthew
    Not much taller than a hip
    and I called him Cashew

    With imagination we built
    tiny ships out of paper
    he had a slight head tilt
    he was rice and cheese caper

    We lived in our heads
    and that made adolescence hard
    Brother receded into himself
    put up an emotional guard

    It was ten years later
    during college at UMASS
    when he grew into himself
    and I realized this time had passed

    And then he went to Spain
    and interpret how you will this sign
    Brother became a debonnaire
    and developed a taste for red wine

    And during his time in med school
    a memory not mine but true
    Brother laid eyes on the lovely Jane
    and thought, “I choose you.”

    And such happiness in my heart
    to see this love unfold
    such joy in my family
    for this story to be told

    And it is not something to fear
    to grow up and old
    because destiny is not decided
    But rather foretold

    I love you so much,
    From fire start to fading ember
    Remember, remember, the fifth of September

    ·

  • Reproducibility: Science and Art

    I was thinking about the recent “Reproducibility Project” and how it humbly revealed that science is hard, and we really aren’t certain about much at all, let alone having confidence in a single result. Then I started to wonder, are there other domains that, one might say, “got this reproducibility thing down pat?” The answer, of course, is yes. Now, I’m extremely skeptical of this domain, but the first one that comes to mind is art. Can we learn any lessons for science from art? The metaphor is clear: It was technology that gave power to reproducible art. We can imagine a single statistical result as an effort to capture a scene. A master painter, if for a second you might superficially say his effort was to reproduce such a scene, went out of work with the advent of the photograph. A photographer who develops film was replaced by the printer, and then digital photography exploded the quantity to insurmountable numbers. This is not to say that we should change our expectations for the quality of the photograph. Any kind of result is still valued based on the statistical decisions and experimental design, just as a single picture can be valued based on the quality of the production medium and decisions for the layout. However, the key difference is figuring out how to produce such a result with the quality that we desire en-masse. And just like the photograph, the answer is in the technology. If we want reproducible science, we need better infrastructure, and tools, whether that comes down to the processing of the data or dissemination of the results. If we want reproducibility, the responsibility cannot be in the hands of a single artist, or a single scientist. We must do better than that, or things will largely stay the same: slow progress and limited learning. I cannot say that it is not valuable to ask good biological questions, but this kind of thing is what makes me much more passionate about building tools than pondering such questions.

    ·