Using Python Iterparse For Large XML Files

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove the element's descendants, and it also removes the element's preceding siblings.

    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants

… Read more
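The excerpt above is cut off, so here is a minimal, self-contained sketch of the same idea: process each element, then free it. It is not Liza Daly's lxml code. It swaps in the standard library's xml.etree.ElementTree and clears the root after each record, as in the effbot article linked in the docstring. The function name fast_iter_stdlib and the tag parameter are hypothetical, and the sketch assumes the records of interest are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def fast_iter_stdlib(source, tag, func, *args, **kwargs):
    """Call func(elem, ...) for every <tag> element, freeing memory as we go.

    Hypothetical stdlib variant of the fast_iter idea; assumes the <tag>
    elements are direct children of the document root.
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is always the "start" of the root element.
    _, root = next(context)
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            func(elem, *args, **kwargs)
            count += 1
            # Drop the element's own children, text and attributes ...
            elem.clear()
            # ... and drop already-processed siblings held by the root.
            root.clear()
    return count


if __name__ == "__main__":
    data = b"<root><item id='1'>a</item><item id='2'>b</item></root>"
    fast_iter_stdlib(io.BytesIO(data), "item", lambda e: print(e.get("id"), e.text))
```

Because the root is cleared after each record, memory use stays roughly constant regardless of file size, at the cost of not being able to look back at earlier records.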

Java : Read last n lines of a HUGE file

The simplest way I found is to use ReversedLinesFileReader from the Apache Commons IO API. It returns the lines of a file from bottom to top, and you can set n_lines to control how many lines are read.

    import org.apache.commons.io.input.ReversedLinesFileReader;

    File file = new File("D:\\file_name.xml");
    int n_lines = 10;
    int counter =

… Read more

Searching for a string in a large text file – profiling various methods in python

Variant 1 is great if you need to launch many sequential searches. Since set is internally a hash table, it’s rather good at search. It takes time to build, though, and only works well if your data fit into RAM. Variant 3 is good for very big files, because you have plenty of address space … Read more
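As a rough illustration of the trade-off described above, here is a minimal sketch of two approaches. The first builds a set of lines, as Variant 1 is described. The second uses a memory-mapped search, which is only an assumption about what Variant 3 refers to, since the excerpt is cut off before saying so. The function names build_line_set and mmap_contains are hypothetical.

```python
import mmap


def build_line_set(path):
    """Load every line into a set: slow to build, fast for repeated lookups.

    Only sensible when the whole file fits in RAM.
    """
    with open(path, "r", encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def mmap_contains(path, needle):
    """Search for the bytes `needle` without reading the file into memory.

    Note: this is a substring search, not a whole-line match, and
    mmap raises ValueError on an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle) != -1
```

With the set, each later lookup is a plain membership test such as `"beta" in lines`, so the build cost is paid once and amortized over many searches. The memory-mapped version has no build step, which suits one-off searches in files too large to load.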

How to efficiently write large files to disk on background thread (Swift)

Performance depends on whether or not the data fits in RAM. If it does, then you should use NSData writeToURL with the atomically option turned on, which is what you're doing. Apple's notes about this being dangerous when "writing to a public directory" are completely irrelevant on iOS because there are no public directories. That section … Read more

How do I read a large CSV file with Scala Stream class?

Just use Source.fromFile(…).getLines as you already stated. That returns an Iterator, which is already lazy. (You'd use Stream as a lazy collection where you wanted previously retrieved values to be memoized, so you can read them again.) If you're getting memory problems, then the problem will lie in what you're doing after getLines. Any operation … Read more

Git with large files

Update 2017: Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet" (i.e., the Windows code base, which is approximately 3.5M files and, when checked in to a Git repo, results in a repo of about 300GB, and produces 1,760 daily "lab builds" across … Read more
