Beginning Python for Bioinformatics

Rebuilding

Paulo Nuin — Tue, 28 Dec 2010 16:50:54 +0000

The blog engine (WP) was hacked last week and I’m still rebuilding some of its features. Sorry for any inconvenience.

Why I left Biostar, but I still like Stackoverflow

Paulo Nuin — Wed, 01 Dec 2010 15:44:08 +0000

About eight months ago I started using Biostar as I saw it as a great opportunity to exchange some ideas, concepts, tips in biology and bioinformatics. I even mentioned the website in this space, trying to bring more people to the mix; at the time the community wasn’t big enough, and some days went by without any question being posted.

But a couple of months ago my interest started to go down the drain. I don’t know if it was the constant barrage of next-generation sequencing questions every day, the infantile blog/Twitter posts from members competing for points, or maybe the lack of votes for some answers that I posted (that’s selfish on my part, I admit). But at some point it seemed that the website had turned into a competition of CVs or knowledge, very different from what I could see in the various Stackoverflow spin-offs or on the main site. I guess the turning point, or the moment I realized that the scientific community (at least in bioinformatics and related fields) will never be the same as the programming and statistical ones, was the time I gave an answer that got fewer votes than the one saying “it’s not possible”.

Maybe the problem is me: I don’t like cliques, don’t mind helping people for nothing, and don’t care about reputation. I didn’t care about how many points I had, and used the down-vote to actually vote down answers that I didn’t see as pertinent (if you have never used those sites, every down-vote removes one point from your score). I still think that Biostar is a great idea, and I hope it becomes a great resource for all the bio fields. Maybe if the community gets big enough, maybe if you don’t see the same group of people that you see everywhere else, it might become a better place to hang out online. But right now, I’m over it.

Preview of Django 1.1 Testing and Debugging

Paulo Nuin — Thu, 20 May 2010 21:34:41 +0000

Packt Publishing invited me to review Django 1.1 Testing and Debugging by Karen M. Tracey. They also kindly provided a free chapter that you can download from the link below. A full review will be posted as soon as I finish the book.

preview chapter – Chapter No.3 “Testing 1, 2, 3: Basic Unit Testing”

Initial impressions about Bioinformatics Programming using Python

Paulo Nuin — Sun, 02 May 2010 01:59:17 +0000

Last week I made a 5 book order at Amazon and one of them was Bioinformatics Programming Using Python: Practical Programming for Biological Data (Animal Guide) by Mitchell L Model.

I started reading the book late Friday night, and I’m on the third chapter, where there’s an introduction to sequences. So far, I have found the book very confusing, especially as it claims to be a book for people with no programming background. The examples are OK, but there’s a very messy mixture of Python interpreter and standalone script usage, as the author jumps back and forth between them. Another thing is that some examples are explained in detail, including the line number, while for others you depend on the code’s docstring to understand them.

So far, I’m not impressed. The initial Python sequence example is a set, and in this chapter there are already some functional programming concepts, which can be quite challenging for someone who has never programmed in their life. And in the second chapter the reader sees a ternary operator. Another criticism is that in the preface the author suggests using Python 3 instead of 2, which might add to the frustration of the beginner when a module cannot be installed.

I will continue reading it and post whenever I have a more complete overview of the book.

Python for Bioinformatics by Sebastian Bassi: a (short) review

Paulo Nuin — Mon, 19 Apr 2010 14:01:08 +0000

I promised some time ago to post a complete review of Python for Bioinformatics (Chapman & Hall/CRC Mathematical & Computational Biology) by Sebastian Bassi. It’s long overdue, but the delay allowed me to get better acquainted with the book and its contents.

I can only say that I highly recommend this book, especially for the biologist who is beginning in bioinformatics or Python (or both). I cannot compare it to any other Python and bioinformatics book (I’m planning to buy another one), but I can say that I learned a thing or two from Sebastian’s book. Evidently it is not a perfect book, as some of the explanations are a little bit rushed and might be difficult for a beginner. At the same time, this is a very carefully thought-out and planned book and has more than enough for one to learn Python and apply the language to solve biological problems. I really liked the BioPython section, and this section made me use BioPython for the first time. Some of the BioPython examples in the book are light years ahead of the examples on the tool’s website.

Lastly, I would like to congratulate Sebastian for his work and effort in putting together a nice tome for Python and Bioinformatics. It’s a valuable resource for everyone in the field and certainly will help spread Python in the community.

Biostar: bioinformatics community

Paulo Nuin — Sat, 17 Apr 2010 18:26:08 +0000

Biostar is a bioinformatics community on the StackExchange network. It’s still small and not a lot of questions are asked and answered every day, so we need more people participating. If you are new to bioinformatics, or are just curious about the newest trends in the field, help us grow.

The real value of blogging

Paulo Nuin — Sat, 10 Apr 2010 18:45:03 +0000

A couple of days ago I posted an entry here called ‘The “sickest” Python code I’ve ever created’. It’s code that does some file management for proteomics data, with a different set of inputs each time you run it.

The “sickest” part of the title is that it was a small challenge for me. I’ve been away from actual hard-core coding for quite some time, and you lose some of the gist of the thing with time. Mostly, nowadays, I write simple scripts that don’t require any kind of advanced skills (in any language) and I don’t worry that much about releasing code or about ultra-fast performance. I knew from the time I posted that a lot of people would jump in to help and teach me, as I was aware it wasn’t the most elegant code out there, nor the most Pythonic. What also helped me is that my Python/Bioinformatics blog is indexed on Planet Python, and the audience is far more hard-core Python than I ever dreamed of getting by myself.

But the real deal is that I believe it would be much more difficult for me to get some positive feedback or even an answer if I had posted bits of my code on an online forum, community or list. Every time I used one of these methods, I either got no answer, got schooled for not posting in the right format, or somebody replied that no one knew how to do it. That is the real value of blogging, and the value is even higher if your audience knows more than you do. I appreciate every comment I got on that post and on others too. I learned things that I wasn’t able to learn from computer books and online tutorials (yes, I searched sometimes before reading the comments, and sometimes after).

The “sickest” Python code I’ve ever created 1

Paulo Nuin — Fri, 09 Apr 2010 01:59:54 +0000

But, I guess, it can be easily refactored/enhanced/despised by the audience that read or have access to this blog via Planet Python. Anyway, for someone like me, whose main task now is not to generate tons of code and lines, I think the code (or part of it) that I will present below is quite good. Feel free to comment, criticize and say bad and good things about it.

We needed a script that would take files coming out of protein search engines and compare the peptides and protein sequences, their abundance and some other characteristics. We had a combination of protein and peptide files, with a list of proteins (one protein per line in a tab-delimited file) that was related to a list of peptides in another file (one peptide per line, with multiple peptides/lines related to one protein in the original list). Also, each line in both files had more than 50 columns, and 8 or 16 of them were the values we wanted to extract. I say 8 or 16 of them because we didn’t know how many would be output each time, as it would depend on the number of samples per run (4 to 8 samples).

So, we had a couple of issues: we didn’t know how many proteins would be output (actually found) in each file, we didn’t know how many peptides would be found for each protein, and we didn’t know beforehand how many samples would be run at once. One good thing is that the 8-16 columns of values were fixed, always in the same position and with empty cells if no value was registered there. And we had a fourth problem: usually the sample assignments would be random, meaning a control could come in the first value column or in the last. And a fifth: we didn’t know beforehand (the tech knew) how many treatments would be run each time. A treatment could be a different experimental condition, a sample grouping or some other extraneous factor. An extra issue is that we would need to compare multiple files, get protein and peptide abundances in all of them at the same time and finally compare each treatment.

Basically, in order to create a universal script we needed something flexible enough to handle whatever the experiments threw at us. As a first step we decided to use a YAML file that could be filled in by the experimental researchers with sample assignments, treatments, etc. The YAML would look like this

B0:
  - 114: A
  - 115: D
  - 116: B
  - 117: C

B1:
  - 114: C
  - 115: A
  - 116: D
  - 117: B

In this file B0 and B1 would be the result file names, 114 is the column/channel where the sample was run, and A, B, C and D are the treatments. With this set, our objective was to get all proteins and their peptides for treatment A in files B0 and B1, do some calculation and then compare them to all proteins and peptides from treatments B, C and D, also extracted from files B0 and B1.

First step was to get the names of the treatments from the YAML file

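def get_treatments(mapping):
    # collect the unique treatment names across all result files
    treats = set()
    for entry in mapping:
        for t in mapping[entry]:
            # each item is a one-key dict such as {114: 'A'}
            treats.add(list(t.values())[0])

    return treats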

where mapping is the content loaded from the YAML file. We used a set for storage because sets have unique items, and treatment names can vary from file to file. In the code above we basically go through the YAML and read the value of each entry.
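To make the data flow concrete, here is a minimal sketch of calling the function. It assumes the YAML has already been parsed into nested Python objects (for example with PyYAML's `yaml.safe_load`, which the post does not show); the dictionary literal below is a reconstruction from the YAML example, not data from the original script.

```python
def get_treatments(mapping):
    """Return the set of unique treatment names in a parsed mapping."""
    treats = set()
    for entry in mapping:
        for channel in mapping[entry]:
            # each item is a one-key dict such as {114: 'A'}
            treats.add(list(channel.values())[0])
    return treats


# Hypothetical parsed form of the YAML example: file name -> list of {channel: treatment}
mapping = {
    'B0': [{114: 'A'}, {115: 'D'}, {116: 'B'}, {117: 'C'}],
    'B1': [{114: 'C'}, {115: 'A'}, {116: 'D'}, {117: 'B'}],
}

treatments = get_treatments(mapping)
print(sorted(treatments))  # ['A', 'B', 'C', 'D']
```

Because the result is a set, a treatment that appears in both files is stored only once.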

We then needed a class to store protein information, and that is where the story got hairy. With all my (lack of) experience, I decided to use exec statements to deal with all the uncertainty in the experimental data details. I didn’t have the treatment names beforehand (or in a fixed immutable list), didn’t have the columns (channels) that were being used at the time, and I had to correctly assign each protein abundance (area) to its place. In the end our class looked like this

class Protein():
    """Class Protein, stores all the information about channels and areas, name and accession"""
    def __init__(self, accession, name, treatments):
        self.accession = accession
        self.name = name

        #ratio channels are called based on their name
        for i in treatments:
            exec('self.%s = []' %i)
            exec('self.area%s = []' %i)

    def add_to_channel(self, channel, peptide):
        exec('self.%s.append(peptide)' % channel)

    def add_to_area(self, channel, area):
        exec('self.area%s.append(area)' % channel)

In order to be faithful to this blog’s name, I will explain how the code above is supposed to work. First, exec is a Python statement that supports dynamic execution of code. In our case above it was used to name the objects, so we would be able to access them by name in subsequent functions. Let’s take this for example

for i in treatments:
    exec('self.%s = []' %i)
    exec('self.area%s = []' %i)

In this snippet we were trying to create lists called (for the YAML file above) A, B, C and D, and another set of lists called areaA, areaB, areaC and areaD. Let’s say for another experiment we would have treatments “Control”, “Low” and “High” and so on.

The next two functions use the same approach, with exec, this time appending to the freshly created lists. This way it’s easy to control what the user is throwing at us.

I don’t know if this is the best approach possible, or whether or not it is harmful. Maybe experts reading this have better ideas, and I would appreciate them. We will check the rest of the script next time.
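One possible alternative, offered only as a sketch and not as the code used in the original script, is to keep the per-treatment lists in dictionaries keyed by treatment name. This avoids exec entirely, so a treatment name containing spaces or other characters that are not valid in a Python identifier would not break the class. The attribute names `channels` and `areas`, and the accession and peptide values in the usage lines, are made up for illustration.

```python
class Protein(object):
    """Stores peptides and areas per treatment without using exec."""

    def __init__(self, accession, name, treatments):
        self.accession = accession
        self.name = name
        # one list per treatment, keyed by the treatment name
        self.channels = dict((t, []) for t in treatments)
        self.areas = dict((t, []) for t in treatments)

    def add_to_channel(self, channel, peptide):
        self.channels[channel].append(peptide)

    def add_to_area(self, channel, area):
        self.areas[channel].append(area)


# placeholder values, only to show the calls
protein = Protein('P00001', 'example protein', ['A', 'B', 'C', 'D'])
protein.add_to_channel('A', 'PEPTIDEK')
protein.add_to_area('A', 1234.5)
print(protein.channels['A'], protein.areas['A'])
```

If attribute-style access such as `protein.A` is preferred, the built-ins `setattr(self, name, [])` and `getattr(self, name)` do the same job as the exec calls without building code from strings.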

Python Testing Beginner’s Guide, review

Paulo Nuin — Thu, 04 Mar 2010 03:46:35 +0000

I posted about a week ago that Packt Publishing had invited me to review Python Testing Beginner’s Guide by Daniel Arbuckle. Having finished reading the book (I must admit that I haven’t tried all the code in it), I can say that I have an excellent initial impression of the book.

PTBG is not a long book and the topic is divided into 10 chapters and one appendix. One of the first things that I liked about the book is that there’s no introduction (or something similar) to Python. It goes straight to the point, assuming that you have a good understanding of the language and everything that surrounds it. In the past I was frustrated with some “Introduction to X with Python” books that wasted precious space talking over and over about a topic, learning Python, that is better covered in many other books. PTBG does not waste time and space before introducing its main topic, which is testing, and in my opinion that’s the best approach, even though it might look a little bit abrupt to some.

The language and text in the book are clear and very pleasant. PTBG is a very well written book and I really enjoyed its style. The first chapters of the book cover Python testing using doctests. For someone like me who hasn’t written many tests in the normal software development workflow (I know I should write more tests), this section seems like a really nice introduction to the topic, with well-thought-out, real-life-like examples and a good flow in the explanation of the different features. One small complaint that I have is that for a beginner the code listed in the examples might sometimes seem a little bit confusing, and maybe the addition of line numbers would have helped a bit here. But at the same time I understand that this is the normal style of some Packt books.

After the doctests section, PTBG gets into more advanced techniques, briefly covering mock objects with Mocker, then moving on to unittest and nose. The latter is a Python tool that allows for managing, running and automating tests. Also covered is Twill, another third-party library that allows for testing of web applications.

One full chapter is devoted to test-driven development, with a complete walkthrough of this approach. This gives a wrap-up of most of the techniques and modules covered in the book, but there’s still space for another chapter that shows how beautifully doctests, unittest and nose can be fully integrated and help the development of applications using the test-driven approach.

Overall, I really enjoyed PTBG. As I mentioned, test-driven development was never a high priority in the applications I usually developed with Python. But this book can certainly be a good starting point for Python testing beginners to incorporate these techniques into their usual development workflow. Scientific software is also a perfect niche for this type of approach and we should do what is possible in order to avoid the nightmares of the past.

Preview of Python Testing Beginner’s Guide

Paulo Nuin — Mon, 22 Feb 2010 16:00:53 +0000

I was invited by Packt Publishing to review Python Testing Beginner’s Guide by Daniel Arbuckle. This is a book on one of the most important aspects of scientific programming (even though the majority of scientific software doesn’t have any testing routines): code testing, checking if your code actually does what it is intended to do. I can say I’m not really an expert on testing, so I guess I’m the right audience for it:

You’ll learn about several of Python’s automated testing tools, and you’ll learn about the philosophies and methodologies that they were designed to support, like unit testing and test-driven development. When you’re done, you’ll be able to produce thoroughly tested code faster and more easily than ever before, and you’ll be able to do it in a way that doesn’t distract you from your “real” programming.

Packt also supplied a preview/sample chapter that you can download here.

I hope to get a review ready by the end of the week, before the Ontario Institute for Cancer Research retreat; otherwise I will try to post a full review next week.

Preliminary review of Python for Bioinformatics by Sebastian Bassi

Paulo Nuin — Sun, 03 Jan 2010 19:25:00 +0000

Let me start by saying that Python for Bioinformatics (Chapman & Hall/Crc Mathematical & Computational Biology) is a massive book, massive in the sense that it contains a lot of material. I still haven’t had enough time to check everything, but I’m well into the first section of the book, which gives an initial view of Python and how to program in it.

The initial section of the book is well written (I’m not going to criticize the book in terms of good/poor English, as I’m not well qualified to do that), and gives a clear perspective on how to program in Python for scientists, who are the main target demographic of the book. Of course, it always helps to have some basic knowledge of command line shells, but the book also includes some explanations of IDLE and other Python-capable IDEs. I cannot say that I read this section with enough care and attention, but what I can say is that you won’t miss a beat with PfB, as it has more material than I expected. I still have to start on the more advanced topics, like BioPython and so forth, which I plan to do in the coming month, and as I don’t have a lot of experience with BioPython, I’m looking forward to it.

On the other hand I have a small-ish complaint, that maybe is more about style than substance. I don’t like the design of the book, the way the code interleaves with the text and the way the code explanations are presented. Most of the code blocks are followed by a careful explanation, but this explanation works as a figure label for the code block. That is quite annoying because there are too many stops in the text fluidity as one tends to lose attention to it (my case, not exactly everyone’s).

Another minor detail is the use of “he” every time scientists are referred to (one example is on page 3, in the second sentence of the introduction). The (politically) correct form would be “he or she” or “she or he” (but that’s OK with me).

I will try to post more complete reviews of the sections that I don’t master. I would also like to thank Sebastian for sending me a copy of the book.

This is (more or less) the end

Paulo Nuin — Sat, 23 May 2009 02:19:01 +0000

So, I’m closing the blog, maybe for good, maybe not. I haven’t been updating it and some other responsibilities are consuming my spare time.

I would like to thank everyone that contributed, commented and read it. You have my deepest appreciation. My work is fulfilled if I helped at least one person along the way.

Cheers
Paulo

PS: there’s the wiki, so register and help me improve it.
PS II: sorry that I couldn’t finish the last project. Maybe some other time.

Wiki

Paulo Nuin — Mon, 11 May 2009 22:43:31 +0000

I’m slowly moving the posts from the blog to a wiki. It makes it easier to display post series and allows people to modify/enhance/discuss.

The wiki address is http://wiki.genedrift.org.

Managing a simple database with Python, SQLite and wxPython, 8

Paulo Nuin — Wed, 22 Apr 2009 15:04:17 +0000

Thanks to the comments and suggestions on the last post, it’s now possible to make a more Pythonic and clearly generic database update class. Let’s check how the “generic” update/edit entry function currently looks:

def update_data(self, values_list):
    '''edits and updates fields'''

    if sys.platform == 'darwin':
        (cursor, database) = link_db(self.db_path)
    else:
        (cursor, database) = link_db()

    cursor.execute("""UPDATE bac SET projects = ?, comments = ?, temperature = ?, cell = ?, box = ?, tubes = ?,
        chromosome = ?, sdate = ?, clone = ?, source = ?, location1 = ?, startpos = ?, endpos = ?,
        gene = ?, genelink = ?, dnaex = ?, validation = ?, pcr = ?, refs = ?, antibiotic = ? WHERE idbac = ?""",
        (values_list['projects'], values_list['comments'], values_list['temperature'], values_list['cell'], values_list['box'], values_list['tubes'],
         values_list['chromo'], values_list['date'], values_list['clone'], values_list['source'], values_list['location'], values_list['start'],
         values_list['end'], values_list['gene'], values_list['genelink'], values_list['dna'], values_list['validation'], values_list['pcr'],
         values_list['refs'], values_list['antibiotic'], values_list['idbac']))

    database.commit()
    database.close()

which is really ugly and, although it works, is not really useful outside this small project. Based on the comments, the best option was to use placeholders and a dictionary, similar to the approach used in the insert data function. The idea is to pre-format a string that has both the field name to be updated and a placeholder (for instance :idbac) that will receive the value

update = ','.join(['%s=:%s' % (y, y) for y in values_list])

where update is the string we want and values_list is the dictionary with all the key-value pairs. I tried this approach, using this structure in the generic function, but then I decided that the best alternative was to put this join in the derived class function and pre-populate the string with the values and then send this string directly to the update function. In the end I opted to use this

update = ','.join(['%s=\"%s\"' % (y, values_list[y]) for y in values_list])

The latter is slightly different from what was suggested. The original one would pair each key from the dictionary with a named placeholder, making for instance sdate=:sdate. With all these placeholders, just pass the dictionary and you have all the values inserted. This would be handy if the string was being created in the “generic” function. If we move this to the derived class, we can use the alternative, keeping in mind that the values passed should be surrounded by quotes, otherwise the SQL UPDATE statement will have problems with spaces and other foreign characters that should not be there. So instead of placeholders we will have gene="PTEN" and we can attach this joined string to the actual command. We can then move all the machinery out of the “generic” function, which can be written as

def update_data(self, update_string):
    '''edits and updates fields'''

    if sys.platform == 'darwin':
        (cursor, database) = link_db(self.db_path)
    else:
        (cursor, database) = link_db()
    cursor.execute(update_string)

    database.commit()
    database.close()

That’s it, very elegant (we will see the derived class in the next post). And finishing our generic class, we would need a delete function, so the user can eliminate entries that he/she doesn’t want anymore. It’s also a very simple function

def delete_data(self, delete_string):
    '''deletes one field'''

    if sys.platform == 'darwin':
        (cursor, database) = link_db(self.db_path)
    else:
        (cursor, database) = link_db()
    cursor.execute(delete_string)

    database.commit()
    database.close()

We will check the delete string next time. Again, I would like to thank everyone for all the comments; they have been really helpful to me.
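For comparison, here is a small sketch of the named-placeholder variant suggested in the comments, run against an in-memory SQLite database. The three-column `bac` table and its values are a cut-down stand-in invented for the example, not the real schema. The point it illustrates is that sqlite3 binds the values itself, so a value that contains a double quote does not need any manual quoting.

```python
import sqlite3

# Hypothetical, much smaller version of the bac table
database = sqlite3.connect(':memory:')
cursor = database.cursor()
cursor.execute("CREATE TABLE bac (idbac INTEGER PRIMARY KEY, gene TEXT, comments TEXT)")
cursor.execute("INSERT INTO bac (idbac, gene, comments) VALUES (1, 'TP53', '')")

values_list = {'idbac': 1, 'gene': 'PTEN', 'comments': 'the "good" clone'}

# build e.g. gene=:gene,comments=:comments from every key except the row id
fields = [key for key in values_list if key != 'idbac']
update = ','.join('%s=:%s' % (key, key) for key in fields)
update_string = 'UPDATE bac SET %s WHERE idbac=:idbac' % update

# sqlite3 fills the named placeholders from the dictionary
cursor.execute(update_string, values_list)
database.commit()

row = cursor.execute("SELECT gene, comments FROM bac WHERE idbac = 1").fetchone()
print(row)
database.close()
```

With the hand-quoted string, the same comment would produce comments="the "good" clone", which is not a valid SQL statement, so the quoting approach is only safe when the values are known not to contain quote characters.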

Previously in the series:
Part 1
Part 2
Part 3
Part 4
Part 5
Part 6
Part 7

Managing a simple database with Python, SQLite and wxPython, 7 (includes a question)

Paulo Nuin — Mon, 20 Apr 2009 17:21:59 +0000

And we’re back. After a couple of weeks of inactivity we will get back to our small soap opera of Python, wxPython and SQLite. Continuing with our database management code, let’s check two other functions that have changed since the first version of the code. The first one is the insert_data function, which now looks like this

def insert_data(self, values_list, insert_string):
    '''inserts data in the database'''

    if sys.platform == 'darwin':
        (cursor, database) = link_db(self.db_path)
    else:
        (cursor, database) = link_db()

    cursor.execute(insert_string % self.table_name, values_list)

    database.commit()
    database.close()

Basically no changes, apart from the obvious check for the current running operating system, which was explained in the last post. The other function to check is update_data. This function is new and wasn’t in the first version, but as can be seen it has a problem as a “generic” function, because it contains information pertaining to the table and database being used in the interface. This function basically receives the information that needs to be updated in the table’s fields and, by using SQL’s UPDATE ... SET, edits and updates data in the changed fields. I have tried several different syntaxes to make the execute call generic, mainly trying to pre-format the string, without success. If anyone reading this can help, I’d appreciate it.

def update_data(self, values_list):
    '''edits and updates fields'''

    if sys.platform == 'darwin':
        (cursor, database) = link_db(self.db_path)
    else:
        (cursor, database) = link_db()

    cursor.execute("""UPDATE bac SET projects = ?, comments = ?, temperature = ?, cell = ?, box = ?, tubes = ?,
        chromosome = ?, sdate = ?, clone = ?, source = ?, location1 = ?, startpos = ?, endpos = ?,
        gene = ?, genelink = ?, dnaex = ?, validation = ?, pcr = ?, refs = ?, antibiotic = ? WHERE idbac = ?""",
        (values_list['projects'], values_list['comments'], values_list['temperature'], values_list['cell'], values_list['box'], values_list['tubes'],
         values_list['chromo'], values_list['date'], values_list['clone'], values_list['source'], values_list['location'], values_list['start'], values_list['end'],
         values_list['gene'], values_list['genelink'], values_list['dna'], values_list['validation'], values_list['pcr'],
         values_list['refs'], values_list['antibiotic'], values_list['idbac']))

    database.commit()
    database.close()

Anyway, I will explain the logic of the command (OK as a stopgap, but not as a definitive solution). values_list is a dictionary that is passed to the function and contains the field names as keys and the new/changed information as values. The execute method simply binds the value of each key to the matching placeholder in the update string, which is then sent to the database and table to be changed. Everything is committed and the database is closed.

As this is a “generic” function from a “generic” class, the ideal scenario would be for the function to receive a pre-formatted string with all the information, as in the insert data function, and update the information in the database.
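One way such a pre-formatted string could be produced is sketched below. This is only a suggestion under stated assumptions: the helper name `build_update` and its arguments are invented for the example, and it assumes the dictionary keys match the column names exactly (which is not the case in the function above, where for instance 'chromo' maps to chromosome).

```python
def build_update(table_name, key_field, values):
    """Return an UPDATE string with ? placeholders and the matching parameters."""
    # sort the fields so the statement is the same on every run
    fields = sorted(k for k in values if k != key_field)
    assignments = ', '.join('%s = ?' % f for f in fields)
    update_string = 'UPDATE %s SET %s WHERE %s = ?' % (table_name, assignments, key_field)
    # parameters follow the order of the placeholders; the key goes last
    params = [values[f] for f in fields] + [values[key_field]]
    return update_string, params


update_string, params = build_update('bac', 'idbac', {'idbac': 7, 'gene': 'PTEN', 'box': 3})
print(update_string)  # UPDATE bac SET box = ?, gene = ? WHERE idbac = ?
print(params)         # [3, 'PTEN', 7]
```

The generic function would then only need to call cursor.execute(update_string, params). Table and field names are interpolated into the string, so they should come from the program itself and never from user input.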

I would like to thank in advance anyone that can comment on this. Next time we will continue checking the generic class and finalize this part in order to start with the interface build process.

Previously in the series:
Part 1
Part 2
Part 3
Part 4
Part 5
Part 6

Managing a simple database with Python, SQLite and wxPython, 6

Paulo Nuin — Tue, 31 Mar 2009 17:06:08 +0000

Let’s get back to our SQLite and wxPython project. We haven’t seen anything on wxPython yet, and we will check the interface only on the next post. For now, let’s see some extra code added to the SQLite access class. Remember that we have a generic class and one class derived from it that would work on accessing specific tables in our database file.

When we last covered the db access routines, there was no search for an entry (the function returned everything in the table no matter what), there was no update function in case someone wanted to modify an entry and there was no delete method if you wanted to delete something. In the meantime, I added all of this functionality (and some more) to the generic class and extended it to the class derived from it. Let’s check how the generic class looks now (you will notice that there is an issue in one of the methods; if someone can help me, I’d appreciate it. More details later.)

class DB_Generic():
    '''generic class to add DB functionality'''
    def __init__(self, table_name, db_path = ''):
        #par= name of the table to be used
        self.table_name = table_name
        if len(db_path) > 0:
            self.db_path = db_path
            print(db_path)

    def get_data_generic(self, range = 1, bac_to_get = 0):
        '''gets the data from the database'''       

        if sys.platform == 'darwin':
            (cursor, database) = link_db(self.db_path)
        else:
            (cursor, database) = link_db()

        if range == 1:
            cursor.execute("""SELECT * FROM %s""" % self.table_name)
        elif range == 2:
            cursor.execute("""SELECT * FROM %s where idbac = %d""" % (self.table_name, bac_to_get))

        table_data = cursor.fetchall()
        raw_data = []
        for i in table_data:
            raw_data.append(list(i))

        self.table_data = raw_data
        database.close()

    def insert_data(self, values_list, insert_string):
        '''inserts data in the database'''

        if sys.platform == 'darwin':
            (cursor, database) = link_db(self.db_path)
        else:
            (cursor, database) = link_db()

        cursor.execute(insert_string % self.table_name, values_list)

        database.commit()
        database.close()

    def update_data(self, values_list):
        '''edits and updates fields'''

        if sys.platform == 'darwin':
            (cursor, database) = link_db(self.db_path)
        else:
            (cursor, database) = link_db()

        #change this to generic!!!!!!!!!!!!
        cursor.execute("""UPDATE bac SET projects = ?, comments = ?, temperature = ?, cell = ?, box = ?, tubes = ?,
            chromosome = ?, sdate = ?, clone = ?, source = ?, location1 = ?, startpos = ?, endpos = ?,
            gene = ?, genelink = ?, dnaex = ?, validation = ?, pcr = ?, refs = ?, antibiotic = ? WHERE idbac = ?""",
        (values_list['projects'], values_list['comments'], values_list['temperature'], values_list['cell'], values_list['box'], values_list['tubes'],
         values_list['chromo'], values_list['date'], values_list['clone'], values_list['source'], values_list['location'], values_list['start'], values_list['end'],
         values_list['gene'], values_list['genelink'], values_list['dna'], values_list['validation'], values_list['pcr'],
         values_list['refs'], values_list['antibiotic'], values_list['idbac']))

        database.commit()
        database.close()

    def delete_data(self, delete_string):
        '''deletes one field'''

        if sys.platform == 'darwin':
            (cursor, database) = link_db(self.db_path)
        else:
            (cursor, database) = link_db()
        cursor.execute(delete_string)

        database.commit()
        database.close()

In the next couple of posts we’ll dissect each function and see what’s going on. The class definition wasn’t changed, so we start with get_data_generic

def get_data_generic(self, range = 1, bac_to_get = 0):
	'''gets the data from the database'''       

	if sys.platform == 'darwin':
		(cursor, database) = link_db(self.db_path)
	else:
		(cursor, database) = link_db()

	if range == 1:
		cursor.execute("""SELECT * FROM %s""" % self.table_name)
	elif range == 2:
		cursor.execute("""SELECT * FROM %s where idbac = %d""" % (self.table_name, bac_to_get))

	table_data = cursor.fetchall()
	raw_data = []
	for i in table_data:
		raw_data.append(list(i))

	self.table_data = raw_data
	database.close()

The first difference we notice here is the sys.platform usage. This is required if we intend to package our application as an OS X app, using py2app. When a Python/wxPython application is packaged in OS X, the actual application executable sits inside a directory named after the application (or whatever you set up). In our case we don’t provide a way for the Python script to receive the path and name of the database on the command line, as we expect it to be in the executable’s current directory. Because of that we need to provide a “config” file (in our case a one-line text file with the database path) inside the application wrapper, something we will see at the end of the series.
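The platform check and the one-line config file can be sketched on their own. This is only an illustration under assumptions: the function name resolve_db_path, the file name db_path.txt and the paths are hypothetical, and the platform is passed as a parameter so both branches can be tried on any machine. It is not the code the series ends up with.

```python
import os
import sys
import tempfile

def resolve_db_path(config_file, default_path, platform=sys.platform):
    """Pick the database path: read it from a one-line config file on OS X,
    otherwise fall back to the default (hypothetical helper)."""
    if platform == 'darwin':
        with open(config_file) as handle:
            return handle.readline().strip()
    return default_path

# Write a throwaway config file so the round trip can be shown.
config_dir = tempfile.mkdtemp()
config_file = os.path.join(config_dir, 'db_path.txt')
with open(config_file, 'w') as handle:
    handle.write('/tmp/lab.db\n')

mac_path = resolve_db_path(config_file, 'lab.db', platform='darwin')
other_path = resolve_db_path(config_file, 'lab.db', platform='win32')
```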

Another modification here is the range parameter and the addition of the bac_to_get parameter. Notice that both parameters have a default value assigned to them. This means that they are optional: the function call can pass them or not. If the caller doesn’t pass them, their values will be the ones assigned in the function declaration. So, if we are interested in getting all bacs, range will have the value of 1 and we don’t need to worry about it. If we want a specific bac we pass range as 2 and then pass the ID in bac_to_get.

A final change/addition is that we added a new select statement for the cases when range equals 2. This time we are adding the bac ID to be returned.
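The optional parameters and the two SELECT branches can be tried in isolation. A minimal sketch, assuming an in-memory database and a cut-down bac table with made-up clone names; the standalone function get_rows is hypothetical and not part of the application. Note that it binds the ID with a ? placeholder instead of formatting it into the string, which is a swapped-in technique, not what the method above does.

```python
import sqlite3

def get_rows(cursor, table_name, range=1, bac_to_get=0):
    """Return all rows (range == 1) or a single bac (range == 2)."""
    # `range` keeps the name used in the post; it shadows the built-in
    # only inside this function.
    if range == 1:
        cursor.execute("SELECT * FROM %s" % table_name)
    elif range == 2:
        cursor.execute("SELECT * FROM %s WHERE idbac = ?" % table_name,
                       (bac_to_get,))
    return [list(row) for row in cursor.fetchall()]

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("CREATE TABLE bac (idbac INTEGER PRIMARY KEY, clone TEXT)")
cursor.executemany("INSERT INTO bac (clone) VALUES (?)",
                   [("clone-a",), ("clone-b",)])

all_bacs = get_rows(cursor, "bac")                        # defaults: every row
one_bac = get_rows(cursor, "bac", range=2, bac_to_get=2)  # a single bac
database.close()
```

Calling get_rows with only the cursor and table name relies on the defaults; naming the arguments in the second call makes it clear which optional value is being overridden.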

Previously in the series:
Part 1
Part 2
Part 3
Part 4
Part 5

RoR commits

Paulo Nuin — Sun, 15 Mar 2009 16:59:18 +0000

Just illustrating my point (or lack thereof), an animation about the commits of RoR to its repository. Notice the jump after it was migrated to GitHub.

Ruby on Rails from Ilya Grigorik on Vimeo.

Sorry for the non-Python post.

BioPython and CVS

Paulo Nuin — Fri, 13 Mar 2009 19:25:38 +0000

I start this post with an apology. I usually don’t rant or vent here; those are feelings that I usually reserve for my personal blog.

I don’t use BioPython, never used it. I have it installed on my systems, but I never wrote a piece of code importing BioPython routines. But I subscribe to their mailing lists, both user and developer. I may have written to the list once, and otherwise I just follow the discussions there.

Since last year one of the main topics has been the possibility of moving BioPython from CVS to another version control system. Yes, you read it right. It’s 2009 and BioPython uses CVS as their version control system. Soon, CVS will be like typewriters and LPs to young developers. The last stable release of CVS was sometime in 2005, which in interwebs time is equivalent to something like 1972. Since 2005, Subversion has taken the world of version control by storm, and Git is also getting very strong, not to mention Bazaar, Darcs, Mercurial and some others that I might not be aware of.

This is a discussion that has been dragging on for some time on the list. And it’s a shame, a clear lack of leadership from whoever is (not) leading the project. BioRuby is Git, BioPerl SVN and BioPython is CVS, because they “need to care for the legacy developers”. It’s like MSFT keeping two copies of the Notepad executable because they needed to cater to legacy applications, but on a different scale of course. With the current Python steam in the non-bioinformatics and bioinformatics communities it is very sad to see BioPython not evolving (before you ask me, no, I’m not interested in helping, not the way things are now). Perl, which is a language forever in waiting for its holy grail (Perl 6), has a strong community behind it and, more importantly, an excellent leadership that’s not scared of making decisions.

So, if you’re still using CVS, it’s 2009!

Managing a simple database with Python, SQLite and wxPython, 5

Paulo Nuin — Tue, 03 Mar 2009 00:23:42 +0000

We have seen how to connect, get and insert data (at least theoretically) in the database. Now, a little note about the SQL engine of choice here: SQLite. The main characteristic of SQLite databases is that they are self-contained files. SQLite also does not require an installation, works without a server and works pretty well on most operating systems.

Basically, for the type of application we’re developing here, SQLite seems ideal. It eliminates a lot of infrastructure that would be needed if we were working with MySQL or PostgreSQL. We don’t need a server, or to know how to configure users or manage the databases and tables. All we need is contained in a single file that can be transported from system to system and can be accessed from the computers used in the lab, mainly XP and OS X. Also some web frameworks (Rails and Django, for instance) can use SQLite, so in the end we can have a desktop application and a web application accessing the same file without extra configuration.

Now the database created for this application has 8 tables and almost no relationships among them. SQLite allows the creation of relationships but in our case only a couple were required. For the table we are using at the moment (bac) there is no need for relationships, although there are some fields that could benefit from a more relational structure. Also, SQLite doesn’t have the same data types that are found in the bigger SQL engines. All values are stored as text, integer, real (floating point numbers), null or blob (raw data: what you store is what you get). As declared types, you can set columns as Boolean and Date, for instance, and SQLite will accept them. If you have no experience in creating databases, let’s check again the table we are using in this small project. First, I would recommend the use of some SQLite database editor. You can find pretty good ones for any computer system and there is even a Firefox extension that allows you to edit some files. Editors make it easier to generate the SQL table creation scripts and make it easier to visualize what we are doing.

So, the table bac looks like

CREATE TABLE bac
(idbac INTEGER PRIMARY KEY,
clone Text,
sdate Date,
source Text,
gene Text,
chromosome Text,
startpos Integer,
endpos Integer,
antibiotic Text,
location1 Text,
temperature Integer,
tubes Integer,
box Integer,
cell Integer,
dnaex Boolean,
validation Boolean,
pcr Boolean,
projects Text,
comments Text,
genelink Text,
refs Text);
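A quick way to see how SQLite treats these declared types is to ask it with typeof(). The sketch below assumes an in-memory database, a cut-down version of the table and made-up values; it is an illustration, not part of the application.

```python
import sqlite3

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("""CREATE TABLE bac
(idbac INTEGER PRIMARY KEY,
sdate Date,
startpos Integer,
dnaex Boolean)""")

# Made-up values: a date string, a number passed as text, and a Python bool.
cursor.execute("INSERT INTO bac (sdate, startpos, dnaex) VALUES (?, ?, ?)",
               ("2009-03-02", "1500", True))

# typeof() reports the storage class SQLite actually used for each value.
cursor.execute("SELECT typeof(sdate), typeof(startpos), typeof(dnaex) FROM bac")
stored_types = cursor.fetchone()
database.close()
```

Here the date stays a string, the text "1500" is converted to an integer by the column's declared type, and the boolean is stored as the integer 1, so the application has to interpret dates and booleans itself.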

If you go back to our last post, you will see that in the insert statement there is no mention of the idbac field. We don’t actually insert any value there; the values that populate this field are created automatically. And idbac is our primary key, meaning it’s the unique identifier of each bac we insert in this table. In SQLite an integer primary key is automatically incremented whenever values are inserted in the table. So our first insertion will create idbac 1, the second will create idbac 2 and so on.
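The automatic numbering is easy to check. A minimal sketch, assuming an in-memory database and a two-column version of the table with made-up clone names: we never supply idbac, and read back the key SQLite assigned through the cursor's lastrowid attribute.

```python
import sqlite3

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("CREATE TABLE bac (idbac INTEGER PRIMARY KEY, clone TEXT)")

new_ids = []
for clone in ("clone-a", "clone-b", "clone-c"):
    # idbac is left out of the statement; SQLite fills it in.
    cursor.execute("INSERT INTO bac (clone) VALUES (?)", (clone,))
    new_ids.append(cursor.lastrowid)

database.commit()
database.close()
```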

I’m not going to go into details about database development and administration, but it’s usual and safe to create tables with an auto-incrementing integer primary key. These fields, apart from making it easier to identify records, make access to such records faster and are great when relationships among tables are set. Let’s say that we had a column user in our bac table. And let’s say we had a user table with two columns: user_id and name, user_id being an auto-increment primary key. The user column in bac could be linked with the user_id column in the user table, in what we call a one-to-many relationship (one user can insert as many bacs as he wants). One day we want to know who is actually working in the lab and we want to check how many bacs were catalogued by each user. We can easily search the user table and extract information from bacs at the same time thanks to the relationship between the tables. And the result should be returned quite quickly, as we are only searching integers.
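The user example can be sketched as follows. Everything here is hypothetical, since the real bac table has no user column: the two cut-down tables, the names and the clone labels are made up for illustration, and the count is done with a LEFT JOIN and GROUP BY.

```python
import sqlite3

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, name TEXT)")
cursor.execute("""CREATE TABLE bac
(idbac INTEGER PRIMARY KEY,
clone TEXT,
user INTEGER REFERENCES user(user_id))""")

# Two made-up users; Ana gets user_id 1 and Bob gets user_id 2.
cursor.executemany("INSERT INTO user (name) VALUES (?)", [("Ana",), ("Bob",)])
cursor.executemany("INSERT INTO bac (clone, user) VALUES (?, ?)",
                   [("clone-a", 1), ("clone-b", 1), ("clone-c", 2)])

# One user, many bacs: count the bacs catalogued by each user.
cursor.execute("""SELECT user.name, COUNT(bac.idbac)
FROM user LEFT JOIN bac ON bac.user = user.user_id
GROUP BY user.user_id
ORDER BY user.name""")
bacs_per_user = cursor.fetchall()
database.close()
```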

All the other fields/columns in our table are straightforward to understand. They are basically related to the type of data they need to store. validation is a boolean because the bac might have been validated or not, just like dnaex (DNA extraction). At the same time, the number of tubes stored in the freezer will always be an integer. So, why is temperature an integer? Because we can only store bacs in two types of freezers: -80 (ultra freezers) or -20 (regular freezers that we can have at home), and we don’t need to worry about fractional numbers.

Well, this is a very short and limited explanation of tables and SQLite. The web is full of resources about it, so next time we will get back to Python.

Previously in the series:
Part 1
Part 2
Part 3
Part 4

Managing a simple database with Python, SQLite and wxPython, 4

Paulo Nuin — Mon, 02 Mar 2009 16:58:09 +0000

Let’s continue building our small db app. As mentioned in the previous post, we now need to derive a specific class from our generic SQLite access class. To do this we just have to declare a new class with DB_Generic as its parent.

class Bac(DB_Generic):

This new class is called Bac because it’s linked to the bac table in our database file. A side note: bacs are Bacterial Artificial Chromosomes and are used in different molecular biology techniques. In our case the bacs have incorporated human DNA segments and are mainly used as probes for studies of deletions, duplications, etc.

Now, back to our Python code: as soon as we derive from our generic class, the class we create has access to all the methods of the parent class (through self), but we still need to add functionality and expose other methods that can be called on a Bac object.

Our derived class will be

class Bac(DB_Generic):
    def __init__(self):
        self.bac_data = []
        DB_Generic.__init__(self, 'bac')

    def get_data(self):
        return self.get_data_generic()

    def load_data(self):
        pass

    def add_data(self, values_list):
        insert_string = """INSERT INTO %s (projects, comments, temperature, cell, box, tubes, chromosome, sdate, clone, source,
        location1, startpos, endpos, gene, genelink, dnaex, validation, pcr, refs, antibiotic)
        VALUES (:projects, :comments, :temperature, :cell, :box, :tubes, :chromo, :date, :clone, :source, :location, :start,
        :end, :gene, :genelink, :dna, :validation, :pcr, :refs, :antibiotic)"""
        self.insert_data(values_list, insert_string)

Pretty simple so far, as we don’t have a lot of declared methods. Let’s check one by one

def __init__(self):
    DB_Generic.__init__(self, 'bac')

The only line is the initialization required by the parent class, and we’re passing the value that is the table to be accessed.

def get_data(self):
	self.get_data_generic()
	return self.table_data

The get_data function returns all the elements in our table (so far, we still don’t have an elegant range option) and has one too many lines in it. We will get rid of some useless code here in the future, but it’s OK the way it is. Basically this code calls get_data_generic from the parent class and gets all the values stored in the table.

There is a function not yet complete that will load data, and will be used in the future. And the last one is the function that actually adds the data to the table with a SQL insert statement

def add_data(self, values_list):
	insert_string = """INSERT INTO %s (projects, comments, temperature, cell, box, tubes, chromosome, sdate, clone, source,
	location1, startpos, endpos, gene, genelink, dnaex, validation, pcr, refs, antibiotic)
	VALUES (:projects, :comments, :temperature, :cell, :box, :tubes, :chromo, :date, :clone, :source, :location, :start,
	:end, :gene, :genelink, :dna, :validation, :pcr, :refs, :antibiotic)"""
	self.insert_data(values_list, insert_string)

In this function, we have a large string with all the SQL insert options. A SQL insert statement is divided into two parts, one where you point where to insert the values and another where you input the values. Usually simple insert statements will have this structure

INSERT INTO my_table_name (table_column1, table_column2) VALUES (value1, value2);

So, we have the table we want to insert values into, its columns and the values we set for each column. Once executed, this will put value1 into table_column1 and value2 into table_column2. The actual syntax can vary a bit for different SQL engines but the structure is identical in most cases. Pretty simple.

For our insert string above, there are some aspects that deserve attention. Again note the triple quotes around the statement. This makes sure that the multi-line string is kept as written and parsed correctly. We also have a %s for the table name, which will be filled in by the parent class function that inserts values, then a list of the columns in the table and then a list of values to insert. And why do the values to be inserted have this :value syntax? Because we are previously storing the values in a dictionary, and the “:” indicates that we need to get the dictionary value for the corresponding key.
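How the :name placeholders pick values out of a dictionary can be seen in a small standalone run. This is a sketch under assumptions: an in-memory database, a table cut down to three data columns and made-up values. The parent class insert_data method is not reproduced here; the %s substitution and the execute call are done directly.

```python
import sqlite3

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("""CREATE TABLE bac
(idbac INTEGER PRIMARY KEY, clone TEXT, startpos INTEGER, endpos INTEGER)""")

insert_string = """INSERT INTO %s (clone, startpos, endpos)
VALUES (:clone, :start, :end)"""

# Dictionary keys match the :names, not the column names.
values_list = {'clone': 'clone-a', 'start': 100, 'end': 2500}

# The table name is filled in first; sqlite3 then binds each :name
# to the dictionary value stored under that key.
cursor.execute(insert_string % 'bac', values_list)
database.commit()

cursor.execute("SELECT clone, startpos, endpos FROM bac")
stored = cursor.fetchone()
database.close()
```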

The insert string and the list of values (actually a dictionary, not the best variable/object name I must admit) are then sent to the parent class to be inserted. Storing the values to be inserted in a dictionary is OK for a one-time insert, where the values are obtained from a form. If you are parsing a large CSV or TSV file, it’s better to put the rows in a list and insert them all at once.
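For the bulk case, one option is the cursor's executemany method, which runs the same statement once per row of a list. A minimal sketch, assuming an in-memory database, a cut-down table and a made-up CSV held in a string instead of a file:

```python
import csv
import io
import sqlite3

# Made-up CSV content; in practice this would come from a file on disk.
csv_text = "clone,startpos,endpos\nclone-a,100,2500\nclone-b,300,4200\n"

# Collect every row in a list of tuples first.
rows = [(record['clone'], int(record['startpos']), int(record['endpos']))
        for record in csv.DictReader(io.StringIO(csv_text))]

database = sqlite3.connect(":memory:")
cursor = database.cursor()
cursor.execute("""CREATE TABLE bac
(idbac INTEGER PRIMARY KEY, clone TEXT, startpos INTEGER, endpos INTEGER)""")

# One call inserts the whole list, followed by a single commit.
cursor.executemany("INSERT INTO bac (clone, startpos, endpos) VALUES (?, ?, ?)",
                   rows)
database.commit()

cursor.execute("SELECT COUNT(*) FROM bac")
row_count = cursor.fetchone()[0]
database.close()
```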

We’re progressing. Next we will take a look at some simple SQL table structure and then move on to create the form to insert the values and check the table.

Previously in the series:
Part 1
Part 2
Part 3

evi