Top 100 tags and contributors on StackOverflow

statistics April 28th, 2012

Big data is all over the place, and StackOverflow is another wonderful source of a large dataset. As of this moment, the site has more than 10 million questions and around 1.3 million users. StackOverflow publishes all of its questions, answers and voting information through its Creative Commons data dump. The dump consists of a number of XML files, and I consider it a very rich and interesting dataset.

I wanted to build a suggestion engine on top of this dataset. The idea is to suggest other geeks in the StackOverflow community whose technical interests and expertise are similar to those of a given user (assuming that the user has a StackOverflow account). My features are based entirely on the tags in which a particular user has participated (questions/comments/answers). While building the tool, I thought it might be fun to post the top contributors for each tag.
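To give a feel for what "similar technical interests" could mean with tag-based features, here is a minimal sketch. It assumes each user is described by a map from tag to participation count and compares two users with cosine similarity. Both the `TagProfile` type and the choice of cosine similarity are assumptions made for illustration, not necessarily what the final tool uses.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical feature vector: tag -> number of times the user participated in it.
using TagProfile = std::map<std::string, double>;

// Cosine similarity between two tag profiles.
// 1.0 means the same mix of tags, 0.0 means no tag in common (or an empty profile).
double cosineSimilarity(const TagProfile& a, const TagProfile& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& entry : a) {
        normA += entry.second * entry.second;
        auto it = b.find(entry.first);
        if (it != b.end()) {
            dot += entry.second * it->second;  // only shared tags contribute
        }
    }
    for (const auto& entry : b) {
        normB += entry.second * entry.second;
    }
    if (normA == 0.0 || normB == 0.0) {
        return 0.0;  // avoid dividing by zero for empty profiles
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

A suggestion engine along these lines would compute this score between the given user and every other user and return the highest-scoring ones.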

I decided to measure contribution as the number of instances of participation, where asking a question, answering and commenting all count as participation. Yes, I heard you: this does double count when the same person participates more than once in a given question.
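The counting rule can be sketched in a few lines of C++. This is only an illustration under assumptions: the `Event` struct and the `topContributors` function are hypothetical names, and the real pipeline would first have to parse the XML files of the data dump and attach each answer and comment to the tags of its question, which is not shown here. Every event counts once, so a user who asks, answers and comments on the same question is counted three times for each of its tags.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: one act of participation (question, answer or comment)
// by `user` on a question carrying the tag `tag`.
struct Event {
    std::string tag;
    std::string user;
};

// For every tag, return the user with the most participation events and that count.
// Repeated participation by the same user on the same question is counted each time.
// On a tie, the user whose name sorts first wins.
std::map<std::string, std::pair<std::string, int>>
topContributors(const std::vector<Event>& events) {
    // counts[tag][user] = number of participation events
    std::map<std::string, std::map<std::string, int>> counts;
    for (const Event& e : events) {
        ++counts[e.tag][e.user];
    }

    std::map<std::string, std::pair<std::string, int>> top;
    for (const auto& tagEntry : counts) {
        std::pair<std::string, int> best("", 0);
        for (const auto& userEntry : tagEntry.second) {
            if (userEntry.second > best.second) {
                best = std::make_pair(userEntry.first, userEntry.second);
            }
        }
        top[tagEntry.first] = best;
    }
    return top;
}
```

Sorting the resulting tags by their top count and keeping the first 100 gives a list in the shape of the one in this post.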

Below is the list of the top 100 tags and contributors:

You can also download the full set of 30K tags and contributors to review offline.

Tags | Display Name | Contribution Count
c# | Jon Skeet | 8542
android | CommonsWare | 5103
jquery | Nick Craver | 4614
java | BalusC | 4558
python | Alex Martelli | 3639
.net | Jon Skeet | 3106
php | Pekka | 2935
asp.net-mvc | Darin Dimitrov | 2844
sql | OMG Ponies | 2263
sql-server | gbn | 2232
jsf | BalusC | 2081
c++ | Jerry Coffin | 2067
javascript | Nick Craver | 2043
asp.net | Darin Dimitrov | 1833
jsp | BalusC | 1790
django | Daniel Roseman | 1770
asp.net-mvc-3 | Darin Dimitrov | 1593
cocoa | Peter Hosey | 1565
xslt | Dimitre Novatchev | 1564
git | VonC | 1561
silverlight | AnthonyWJones | 1478
c | R.. | 1422
wcf | marc_s | 1400
linq | Jon Skeet | 1391
google-app-engine | Nick Johnson | 1347
html | Quentin | 1331
mysql | Quassnoi | 1294
winforms | Hans Passant | 1277
maven-2 | Pascal Thivent | 1236
css | thirtydot | 1234
servlets | BalusC | 1233
entity-framework | Ladislav Mrnka | 1212
swing | camickr | 1187
eclipse | VonC | 1185
iphone | TechZen | 1178
jqgrid | Oleg | 1172
bash | Dennis Williamson | 1132
hibernate | Pascal Thivent | 1125
scala | Daniel C. Sobral | 1105
asp.net-mvc-2 | Darin Dimitrov | 1104
objective-c | Dave DeLong | 1079
wpf | H.B. | 1057
flex | www.Flextras.com | 1020
xml | Dimitre Novatchev | 1003
windows-phone-7 | Matt Lacey | 993
regex | Tim Pietzcker | 989
tsql | gbn | 971
matlab | gnovice | 964
perl | Sinan Ünür | 956
oracle | Gary Myers | 945
delphi | Mason Wheeler | 872
ms-access | Remou | 859
r | Dirk Eddelbuettel | 830
sql-server-2005 | gbn | 829
nhibernate | Diego Mijelshon | 811
mod-rewrite | Gumbo | 800
spring | skaffman | 800
core-data | TechZen | 793
xpath | Dimitre Novatchev | 784
jsf-2.0 | BalusC | 761
ruby | the Tin Man | 757
f# | Tomas Petricek | 750
jpa | Pascal Thivent | 747
orm | Pascal Thivent | 713
sql-server-2008 | gbn | 701
security | Rook | 669
xhtml | Jitendra Vyas | 658
drupal | googletorp | 648
ruby-on-rails | apneadiving | 646
powershell | Keith Hill | 630
vb.net | Hans Passant | 624
mercurial | Ry4an | 615
entity-framework-4 | Ladislav Mrnka | 611
.htaccess | Gumbo | 611
generics | Jon Skeet | 609
windows-mobile | ctacke | 608
visual-studio | JaredPar | 603
rest | Darrel Miller | 591
haskell | Don Stewart | 589
shell | Dennis Williamson | 585
compact-framework | ctacke | 554
web-services | John Saunders | 550
ruby-on-rails-3 | AnApprentice | 543
multithreading | Jon Skeet | 540
linux | Ignacio Vazquez-Abrams | 539
emacs | Trey Jackson | 537
magento | clockworkgeek | 510
jaxb | Blaise Doughan | 499
performance | Mike Dunlavey | 494
winapi | Hans Passant | 493
windows | Hans Passant | 486
apache | Gumbo | 479
version-control | VonC | 464
entity-framework-4.1 | Ladislav Mrnka | 457
cakephp | deceze | 452
database | HLGEM | 450
postgresql | Frank Heikens | 449
wordpress | songdogtech | 447
ajax | Darin Dimitrov | 445

Amazon Data Science Competition

Machine Learning April 14th, 2012

MLSP, the machine learning and signal processing conference, typically hosts a contest along with paper submissions and poster sessions. This year it has partnered with Amazon on a contest built around training a classifier. The deadline is May 17th. For more information, visit http://mlsp2012.conwiz.dk/index.php?id=43

Lapack, Blas & Armadillo preview

Machine Learning January 7th, 2010

Machine learning research, like many fields of engineering, relies on MATLAB for prototyping and development. MATLAB costs a good amount of money, and when your application needs production quality you are likely to move away from it.

What makes MATLAB a good, quick prototyping platform is its easy-to-use matrix and linear algebra operations. Beyond prototyping, execution speed and production standards require your code to be implemented in C, C++, C#, Java or one of the numerous new languages that pop up each morning. In this post we will see how to implement matrix operations in C++ (the ones you thought were only possible in MATLAB) with the least programming effort.

BLAS:

BLAS (Basic Linear Algebra Subprograms) is a library of highly optimized routines for vector and matrix operations. BLAS forms the backbone of numerous libraries that are built to facilitate 'MATLAB-like' vector operations. The reference implementation of BLAS is written in Fortran.
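To make "vector and matrix operations" concrete, here is a naive reference version of one operation BLAS provides: the general matrix multiply C = alpha * A * B + beta * C, which is what the BLAS routine `dgemm` computes in double precision. The function name `naiveGemm` and the row-major `std::vector` storage are assumptions made for this sketch. A tuned BLAS produces the same result but uses blocking and hardware-specific optimizations to run much faster.

```cpp
#include <cstddef>
#include <vector>

// Naive reference for C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all stored row-major in flat vectors.
void naiveGemm(std::size_t m, std::size_t n, std::size_t k,
               double alpha, const std::vector<double>& A,
               const std::vector<double>& B,
               double beta, std::vector<double>& C) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;  // dot product of row i of A and column j of B
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```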

LAPACK:

While BLAS contains optimized routines for the most fundamental operations, more sophisticated functions such as matrix decompositions and factorizations are grouped into a library called LAPACK (Linear Algebra PACKage). LAPACK routines are fast because they build their fundamental operations on BLAS. LAPACK is written in Fortran as well.
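As an example of the kind of factorization that lives in LAPACK, here is a naive, unoptimized Cholesky factorization in plain C++. LAPACK offers this operation as the routine `dpotrf`. The sketch below is not how LAPACK implements it, and the name `naiveCholesky` and the row-major storage are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive Cholesky factorization of a symmetric positive definite n x n matrix A
// (row-major). Returns the lower-triangular L with A = L * L^T, or an empty
// vector if A is not positive definite.
std::vector<double> naiveCholesky(const std::vector<double>& A, std::size_t n) {
    std::vector<double> L(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: remove the contribution of the columns already computed.
        double diag = A[j * n + j];
        for (std::size_t p = 0; p < j; ++p) {
            diag -= L[j * n + p] * L[j * n + p];
        }
        if (diag <= 0.0) {
            return std::vector<double>();  // not positive definite
        }
        L[j * n + j] = std::sqrt(diag);

        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double sum = A[i * n + j];
            for (std::size_t p = 0; p < j; ++p) {
                sum -= L[i * n + p] * L[j * n + p];
            }
            L[i * n + j] = sum / L[j * n + j];
        }
    }
    return L;
}
```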

ATLAS:

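ATLAS (Automatically Tuned Linear Algebra Software) tunes its build parameters to the machine it is compiled on and provides an optimized BLAS along with a subset of LAPACK routines, which in turn speeds up LAPACK. Apart from the packages mentioned here, there are numerous variants of such libraries that claim to outperform one another.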

The question that naturally arises is how usable these packages are. The good news is that you can use them from C++ through a package called Armadillo. Armadillo is a C++ linear algebra library that provides most basic vector and matrix operations on its own. For the more complicated operations, such as the matrix factorizations implemented in LAPACK, Armadillo offers a nicer interface for calling those functions from your C++ code.

To give a quick overview:

Just for the sake of comparing code between MATLAB and C++ with Armadillo (with LAPACK and BLAS installed on the machine, of course), we will write a small matrix multiplication example here. We will create a random 5×5 matrix, multiply it by itself and print the result.

MATLAB:

A = rand(5,5);

A = A*A;

A

C++ code:

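#include <armadillo>
using namespace arma;

int main() {
    mat A = randu<mat>(5, 5);  // 5x5 matrix of uniform random values (called rand in early Armadillo releases)
    A = A * A;                 // matrix product, as in MATLAB
    A.print("A = ");
    return 0;
}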

Now that does not look all that different, does it? Armadillo thus provides a nice set of classes that can perform most of the functions that are part of the core MATLAB software. MATLAB is certainly a great tool for quick prototyping and visualization, but I am sure these open source packages are giving the companies a run for their money. In future posts on this blog we will continue with the art of converting your MATLAB code to C++ and the details of using Armadillo, LAPACK and BLAS for that purpose.