sklearn and large datasets

I’ve used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method which you mention. Some behave better than others though. You can find the methodology, the case study and … Read more
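
A minimal sketch of the shared partial_fit loop, with small random batches standing in for chunks streamed from disk (the actual Kaggle file reading and feature extraction are not shown, and the batch sizes here are arbitrary):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    classes = np.array([0, 1])      # every class must be declared on the first call
    clf = SGDClassifier()

    # Stand-in for streaming a 30 GB file: each "chunk" is a small random batch,
    # so only one batch ever has to fit in memory at a time.
    for _ in range(10):
        X_chunk = rng.normal(size=(1000, 20))
        y_chunk = rng.integers(0, 2, size=1000)
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

    print(clf.score(X_chunk, y_chunk))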

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Using numpy.memmap you create arrays directly mapped into a file: import numpy; a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000, 1000)). Here you will see a 762 MB file created in your working directory. You can treat it as a conventional array: a += 1000. It is possible even to assign more arrays to the same file, controlling … Read more
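
Spelled out, and with the offset parameter (in bytes) that lets a second array map a later region of the same file; the second shape below is arbitrary, chosen only to show the idea:

    import numpy as np

    # Creates a ~762 MB file in the working directory.
    a = np.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000, 1000))
    a += 1000            # behaves like an ordinary ndarray
    a.flush()            # push the changes out to the file

    # A second array mapped into the same file, starting right after a's bytes;
    # with mode='r+' numpy extends the file if the mapped region is larger.
    b = np.memmap('test.mymemmap', dtype='float32', mode='r+',
                  shape=(50000, 1000), offset=a.nbytes)
    b[:] = -1.0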

How to create a large pandas dataframe from an sql query without running out of memory?

As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk: sql = "SELECT * FROM My_Table"; for chunk in pd.read_sql_query(sql, engine, chunksize=5): print(chunk). Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
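
A fuller version of the same pattern; the connection string, table name and chunk size are placeholders, and each chunk is an ordinary DataFrame you can aggregate or write out before the next one is fetched:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://user:password@localhost/mydb')  # placeholder
    sql = 'SELECT * FROM My_Table'

    total_rows = 0
    for chunk in pd.read_sql_query(sql, engine, chunksize=50000):
        total_rows += len(chunk)       # or aggregate / append to disk here

    print(total_rows)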

Calculating and saving space in PostgreSQL

“Column Tetris” Actually, you can do something, but this needs deeper understanding. The keyword is alignment padding. Every data type has specific alignment requirements. You can minimize space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space: CREATE TABLE t ( e int2 … Read more
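
One way to see the padding effect yourself, sketched from Python with psycopg2 (the connection string and the two toy layouts are assumptions, not the answer's own example): alternating int2 and int8 columns forces 6 bytes of padding before each int8, while grouping the int8 columns first packs the rows tighter.

    import psycopg2

    conn = psycopg2.connect('dbname=test user=postgres')   # placeholder connection
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute('DROP TABLE IF EXISTS t_padded; DROP TABLE IF EXISTS t_packed')
    # Same five columns, only the order differs.
    cur.execute('CREATE TABLE t_padded (a int8, b int2, c int8, d int2, e int8)')
    cur.execute('CREATE TABLE t_packed (a int8, c int8, e int8, b int2, d int2)')

    for tbl in ('t_padded', 't_packed'):
        cur.execute(
            f'INSERT INTO {tbl} (a, b, c, d, e) '
            f'SELECT g, 1, g, 2, g FROM generate_series(1, 100000) g'
        )
        cur.execute('SELECT pg_relation_size(%s)', (tbl,))
        print(tbl, cur.fetchone()[0], 'bytes')   # t_padded comes out larger

    conn.close()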

Best way to delete millions of rows by ID

It all depends … This assumes no concurrent write access to the involved tables; otherwise you may have to lock tables exclusively, or this route may not be for you at all. Delete all indexes (possibly except the ones needed for the delete itself). Recreate them afterwards. That’s typically much faster than incremental updates to indexes. Check … Read more
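
As a rough sketch of that sequence from Python (psycopg2, the table and index names, and the staged-ID join used for the delete itself are all assumptions here, not the answer's exact script):

    import psycopg2
    from psycopg2.extras import execute_values

    ids_to_delete = list(range(1, 1_000_001))   # stand-in for the real IDs

    conn = psycopg2.connect('dbname=test user=postgres')   # placeholder connection
    cur = conn.cursor()

    # 1. Drop secondary indexes so the delete doesn't maintain them row by row.
    cur.execute('DROP INDEX IF EXISTS big_table_some_col_idx')

    # 2. Stage the IDs and delete with a single join instead of per-row statements.
    cur.execute('CREATE TEMP TABLE del_ids (id bigint PRIMARY KEY)')
    execute_values(cur, 'INSERT INTO del_ids (id) VALUES %s',
                   [(i,) for i in ids_to_delete])
    cur.execute('DELETE FROM big_table b USING del_ids d WHERE b.id = d.id')

    # 3. Recreate the dropped indexes once, afterwards.
    cur.execute('CREATE INDEX big_table_some_col_idx ON big_table (some_col)')

    conn.commit()
    conn.close()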