sklearn and large datasets

I’ve used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method which you mention. Some behave better than others though. You can find the methodology, the case study and … Read more
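
A minimal sketch of the shared partial_fit loop, with small random batches standing in for chunks streamed from disk (the actual Kaggle file reading and feature extraction are not shown, and the batch sizes here are arbitrary):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    classes = np.array([0, 1])      # every class must be declared on the first call
    clf = SGDClassifier()

    # Stand-in for streaming a 30 GB file: each "chunk" is a small random batch,
    # so only one batch ever has to fit in memory at a time.
    for _ in range(10):
        X_chunk = rng.normal(size=(1000, 20))
        y_chunk = rng.integers(0, 2, size=1000)
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

    print(clf.score(X_chunk, y_chunk))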

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Using numpy.memmap you create arrays directly mapped into a file: import numpy; a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000, 1000)). Here you will see a 762 MB file created in your working directory. You can treat it as a conventional array: a += 1000. It is possible even to assign more arrays to the same file, controlling … Read more
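
Spelled out, and with the offset parameter (in bytes) that lets a second array map a later region of the same file; the second shape below is arbitrary, chosen only to show the idea:

    import numpy as np

    # Creates a ~762 MB file in the working directory.
    a = np.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000, 1000))
    a += 1000            # behaves like an ordinary ndarray
    a.flush()            # push the changes out to the file

    # A second array mapped into the same file, starting right after a's bytes;
    # with mode='r+' numpy extends the file if the mapped region is larger.
    b = np.memmap('test.mymemmap', dtype='float32', mode='r+',
                  shape=(50000, 1000), offset=a.nbytes)
    b[:] = -1.0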

How to create a large pandas dataframe from an sql query without running out of memory?

As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk: sql = "SELECT * FROM My_Table"; for chunk in pd.read_sql_query(sql, engine, chunksize=5): print(chunk). Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
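
A fuller version of the same pattern; the connection string, table name and chunk size are placeholders, and each chunk is an ordinary DataFrame you can aggregate or write out before the next one is fetched:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://user:password@localhost/mydb')  # placeholder
    sql = 'SELECT * FROM My_Table'

    total_rows = 0
    for chunk in pd.read_sql_query(sql, engine, chunksize=50000):
        total_rows += len(chunk)       # or aggregate / append to disk here

    print(total_rows)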

Calculating and saving space in PostgreSQL

“Column Tetris” Actually, you can do something, but this needs deeper understanding. The keyword is alignment padding. Every data type has specific alignment requirements. You can minimize space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space: CREATE TABLE t ( e int2 … Read more
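
One way to see the padding effect yourself, sketched from Python with psycopg2 (the connection string and the two toy layouts are assumptions, not the answer's own example): alternating int2 and int8 columns forces 6 bytes of padding before each int8, while grouping the int8 columns first packs the rows tighter.

    import psycopg2

    conn = psycopg2.connect('dbname=test user=postgres')   # placeholder connection
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute('DROP TABLE IF EXISTS t_padded; DROP TABLE IF EXISTS t_packed')
    # Same five columns, only the order differs.
    cur.execute('CREATE TABLE t_padded (a int8, b int2, c int8, d int2, e int8)')
    cur.execute('CREATE TABLE t_packed (a int8, c int8, e int8, b int2, d int2)')

    for tbl in ('t_padded', 't_packed'):
        cur.execute(
            f'INSERT INTO {tbl} (a, b, c, d, e) '
            f'SELECT g, 1, g, 2, g FROM generate_series(1, 100000) g'
        )
        cur.execute('SELECT pg_relation_size(%s)', (tbl,))
        print(tbl, cur.fetchone()[0], 'bytes')   # t_padded comes out larger

    conn.close()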

Best way to delete millions of rows by ID

It all depends … This assumes no concurrent write access to the involved tables; otherwise you may have to lock tables exclusively, or this route may not be for you at all. Delete all indexes (possibly except the ones needed for the delete itself). Recreate them afterwards. That’s typically much faster than incremental updates to indexes. Check … Read more
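
As a rough sketch of that sequence from Python (psycopg2, the table and index names, and the staged-ID join used for the delete itself are all assumptions here, not the answer's exact script):

    import psycopg2
    from psycopg2.extras import execute_values

    ids_to_delete = list(range(1, 1_000_001))   # stand-in for the real IDs

    conn = psycopg2.connect('dbname=test user=postgres')   # placeholder connection
    cur = conn.cursor()

    # 1. Drop secondary indexes so the delete doesn't maintain them row by row.
    cur.execute('DROP INDEX IF EXISTS big_table_some_col_idx')

    # 2. Stage the IDs and delete with a single join instead of per-row statements.
    cur.execute('CREATE TEMP TABLE del_ids (id bigint PRIMARY KEY)')
    execute_values(cur, 'INSERT INTO del_ids (id) VALUES %s',
                   [(i,) for i in ids_to_delete])
    cur.execute('DELETE FROM big_table b USING del_ids d WHERE b.id = d.id')

    # 3. Recreate the dropped indexes once, afterwards.
    cur.execute('CREATE INDEX big_table_some_col_idx ON big_table (some_col)')

    conn.commit()
    conn.close()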