Saturday, November 16, 2013

DatosPublicos.org.es

I have created a website with a viewer for analyzing the data of Spain's General State Budget (Presupuestos Generales del Estado).

I had long wanted to find that information and build something like this. The truth is that it is extremely hard to find data in machine-readable formats. Most published data comes as PDFs or web pages, which are impractical to process. It turns out there are foundations such as OKFN that work to keep public data easily accessible, and it turns out (surprise!) that Spain fares quite badly in the ranking of openness and accessibility of public data. The most interesting part, spending, is still hard to track down. It would be great to be able to easily see who is awarded which contracts, as is already possible in other countries such as Slovakia.


DatosPublicos.org.es
Per-capita spending in each autonomous community, 2006-2012. How come Navarre has such a tall bar? Does a Navarrese really cost us twice as much as a Murcian?!

Eventually I found that the people at Civio had already done something very similar, processing the budget PDFs for ¿Dónde van mis impuestos?, and had published their data. Thanks to them I was able to put this website together. I wonder... what could we see by analyzing the BOE and the various regional and provincial gazettes?

The site still has plenty of room for improvement, but it is a start :D. The tool used for data analysis on DatosPublicos.org.es is a project called CubesViewer, written by yours truly. I hope someone finds it interesting.

Thanks to Mateo and Pablo, who have supported the idea and are contributing the hosting and the domain.

Sunday, February 10, 2013

Hooverphonic (Cover) - Mad about you


We have recorded a short version of "Mad about you" by Hooverphonic. This is the second track I have produced:




Vocals: Xiana Teimoy
Instruments and production: J Montes

You can download the audio and browse the tracks from my SoundCloud profile.

While I'm at it, here is a reminder of the previous track I published: J - Nowadays.

Saturday, January 19, 2013

Wikipedia Facts


What can we find if we download and generate some statistics about Wikipedia?

1) Overview: As of January 2nd, 2013, Wikipedia has 13 057 082 entries (the Encyclopaedia Britannica totals 228 274 entries, according to Wikipedia itself). There are almost as many redirections as actual articles:

Wikipedia Articles (blue: articles, green: redirections)
2) Articles: Let's look only at the contents of real articles (no redirections). That is more than 7.1 million articles:

The average article belongs to 2.6 categories and has 2.31 links to pages outside the site and 33.86 links internal to Wikipedia. Thousands of articles feature thousands of links. In total, there are 16 413 888 external links and 240 751 315 internal links. Enough to get lost for a while!

The average size is 4 634 characters (roughly 110 words per article), and the total size of the article text is 32 GB. This is raw text only; images are not included.
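As a quick sanity check, the totals and averages quoted above can be compared with each other. The sketch below only reuses the figures from the text; the one added assumption is that a character takes roughly one byte, which is not something measured here.

```python
# Figures quoted in the text above.
external_links = 16_413_888
internal_links = 240_751_315
avg_external = 2.31
avg_internal = 33.86
avg_chars = 4_634

# Dividing each total by its per-article average should give back the
# number of articles (a bit more than 7.1 million).
articles_from_external = external_links / avg_external
articles_from_internal = internal_links / avg_internal

# Average size times article count approximates the total text size,
# assuming about one byte per character (an assumption, not a measurement).
approx_total_gb = avg_chars * articles_from_internal / 1e9

print(f"{articles_from_external:,.0f} articles implied by external links")
print(f"{articles_from_internal:,.0f} articles implied by internal links")
print(f"{approx_total_gb:.1f} GB of text implied by the average size")
```

Both link counts point to roughly the same number of articles, and the implied text size lands close to the 32 GB quoted above.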


3) Content

This is one of the most striking results of all. I searched for certain words within the text of the articles and assigned each word a score. The following diagram shows how many articles are defined by a particular word (the word with the most occurrences). This has been shown before, but the results still seem astonishing to me:




Perhaps we should start thinking of how airily we use the term "war".
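The scoring idea can be sketched roughly as follows. This is only an illustration, not the code used for the chart: the keyword list is hypothetical (the actual words are not listed here), and the score is assumed to be a plain count of whole-word occurrences.

```python
import re
from collections import Counter

# Hypothetical keyword list; the words actually searched for are not given.
KEYWORDS = ["war", "peace", "love", "money"]


def dominant_keyword(text, keywords=KEYWORDS):
    """Return the keyword with the most whole-word occurrences, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in keywords)
    if not counts:
        return None
    # On a tie, most_common() keeps the keyword that appeared first.
    return counts.most_common(1)[0][0]


def articles_per_keyword(texts):
    """Count how many texts are 'defined' by each keyword."""
    tally = Counter()
    for text in texts:
        winner = dominant_keyword(text)
        if winner is not None:
            tally[winner] += 1
    return tally


# Made-up sample texts standing in for article bodies.
sample = [
    "The war ended, but the war economy and the money stayed.",
    "A treaty of peace, then peace again.",
    "Nothing relevant here.",
]
result = articles_per_keyword(sample)
print(dict(result))
```

Articles that contain none of the keywords are simply left out of the tally.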


4) Geography

I can only report on the articles last updated by anonymous users. For what it's worth, this is how updates to real articles (by anonymous users) were distributed among the continents. This population includes 490 080 articles:




5) Updates: This result surprised me quite a bit. The following graph shows the year and quarter in which articles were last updated (separating redirections, in yellow, from articles, in blue). Apparently, a huge percentage of articles were updated during the last quarter of 2012, which could mean that Wikipedia is very lively and frequently updated. Still, this value seems too high to me, so I wonder whether some automatic process is updating Wikipedia articles.



Unfortunately, I can't get the "creation date" of articles as the normal Wikipedia dump doesn't include that information.
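Grouping last-update times by year and quarter is simple to sketch. The snippet below assumes timestamps in the ISO 8601 UTC form used in Wikipedia dumps (for example 2012-11-03T09:15:00Z); the function names and sample values are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def year_quarter(timestamp):
    """Map a timestamp such as '2012-11-03T09:15:00Z' to '2012-Q4'."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    quarter = (moment.month - 1) // 3 + 1
    return f"{moment.year}-Q{quarter}"


def updates_per_quarter(timestamps):
    """Count how many timestamps fall into each year-quarter bucket."""
    return Counter(year_quarter(ts) for ts in timestamps)


# Made-up sample timestamps.
sample_timestamps = [
    "2012-11-03T09:15:00Z",
    "2012-12-31T23:59:59Z",
    "2012-03-01T00:00:00Z",
    "2009-07-20T12:00:00Z",
]
buckets = updates_per_quarter(sample_timestamps)
print(sorted(buckets.items()))
```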

6) Titles

The average title is 26.9 characters long. There are entries starting with every character you can think of: Ɣ, ¢, £, § ... We can also see how entries are distributed across letters. Surprisingly, the digits 1 and 2 have more entries than the letters Q, X, Y or Z, and even more than U and V.
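Both figures (average title length and entries per initial character) come from a single pass over the titles. A minimal sketch, with invented sample titles and a hypothetical helper name, assuming a non-empty list of titles:

```python
from collections import Counter


def title_stats(titles):
    """Return (average title length, Counter of upper-cased first characters)."""
    titles = [t for t in titles if t]  # skip empty titles
    average_length = sum(len(t) for t in titles) / len(titles)
    first_chars = Counter(t[0].upper() for t in titles)
    return average_length, first_chars


# Made-up sample titles.
sample_titles = ["War", "1984 (novel)", "2 Fast 2 Furious", "Quark", "£sd"]
average_length, first_chars = title_stats(sample_titles)
print(average_length, first_chars.most_common())
```

Upper-casing the first character merges "q" and "Q" into one bucket; digits and symbols are left unchanged.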


7) Processing

I have done the analysis and charts with CubesViewer OLAP Data Explorer, an open source data explorer that I published a couple of weeks ago.

Processing the Wikipedia export file (a 39.6 GB XML file) took my computer more than 18 hours, although it used only one core and I didn't spend any time optimizing it.

source: 8,95GB 18:35:54 [ 140kB/s]
bzcat: 39,6GB 18:35:57 [ 621kB/s] 
articles: 13,1M 18:35:57 [ 195/s]    
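For anyone wanting to try something similar, a dump of this size has to be read as a stream rather than loaded into memory. The sketch below is not the script used for this post; it is a minimal single-core example built on Python's standard library, run here against a tiny in-memory sample. The element names (page, redirect) follow the MediaWiki export format, and the sample data is invented.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_pages(stream):
    """Stream a MediaWiki XML export and count articles and redirections."""
    counts = {"articles": 0, "redirections": 0}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        is_redirect = any(local_name(child.tag) == "redirect" for child in elem)
        counts["redirections" if is_redirect else "articles"] += 1
        elem.clear()  # drop the page's children and text to limit memory use
    return counts


# Tiny in-memory sample standing in for the real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>War</title><revision><text>...</text></revision></page>
  <page><title>WAR</title><redirect title="War" />
    <revision><text>#REDIRECT [[War]]</text></revision></page>
</mediawiki>"""

sample_counts = count_pages(io.BytesIO(SAMPLE))
print(sample_counts)
```

With a real dump, the same function can be fed a decompressing file object, for example the one returned by bz2.open(path, "rb"), where the path is whatever the dump was saved as.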

Sunday, January 13, 2013

New open source application: CubesViewer


I have recently been working on a data exploration and visualization tool, and I am very happy to announce the public release of this new project as open source.

It is called CubesViewer, and it is an Online Analytical Processing (OLAP) exploration tool. In plain terms, it allows people to design and produce reports and charts about many kinds of data that can be extracted from a database (like contracts, invoices, climate, demography, scientific production, Wikipedia articles, public spending, logistics...).



I wanted to use a simple OLAP server, and I found the Cubes project, a fantastic lightweight OLAP server that includes everything I needed.

It's always nice to publish Open Source software. I hope this is of use to people.

Features:
  • User Interface allowing for multiple views on-screen.
  • Cube explorer providing drilldown and cut operations.
  • Supports dimension hierarchies and date filtering.
  • Different types of charts and diagrams.
  • View management, sharing and saving.
  • Modular and extensible.
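For readers new to OLAP, the two operations named above can be illustrated in a few lines of plain Python. This is a conceptual sketch only: it does not use the Cubes or CubesViewer APIs, and the fact rows, dimension names and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical fact rows, invented for illustration only.
FACTS = [
    {"year": 2011, "region": "North", "amount": 100},
    {"year": 2011, "region": "South", "amount": 50},
    {"year": 2012, "region": "North", "amount": 120},
    {"year": 2012, "region": "South", "amount": 70},
]


def cut(facts, dimension, value):
    """Cut: keep only the facts whose dimension equals the given value."""
    return [f for f in facts if f[dimension] == value]


def drilldown(facts, dimension, measure="amount"):
    """Drilldown: sum a measure, broken down by each value of a dimension."""
    totals = defaultdict(int)
    for f in facts:
        totals[f[dimension]] += f[measure]
    return dict(totals)


by_year = drilldown(FACTS, "year")
north_by_year = drilldown(cut(FACTS, "region", "North"), "year")
print(by_year, north_by_year)
```

A cut narrows the data to a slice, and a drilldown breaks an aggregate down along a dimension; combining the two gives the kind of filtered breakdown shown in a report or chart.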

    CubesViewer project: https://github.com/jjmontesl/cubesviewer