Node Read Number on Each Line of File
This blog post has an interesting inspiration point. Last week, someone in one of my Slack channels posted a coding challenge he'd received for a developer position with an insurance technology company.
It piqued my interest as the challenge involved reading through very large files of data from the Federal Election Commission and displaying back specific information from those files. Since I've not worked much with raw data, and I'm always up for a new challenge, I decided to tackle this with Node.js and see if I could complete the challenge myself, for the fun of it.
Here are the four questions asked, and a link to the data set that the program was to parse through.
- Write a program that will print out the total number of lines in the file.
- Notice that the 8th column contains a person's name. Write a program that loads in this data and creates an array with all name strings. Print out the 432nd and 43243rd names.
- Notice that the 5th column contains a form of date. Count how many donations occurred in each month and print out the results.
- Notice that the 8th column contains a person's name. Create an array with each first name. Identify the most common first name in the data and how many times it occurs.
Link to the data: https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip
When you unzip the folder, you should see one main .txt file that's 2.55GB and a folder containing smaller pieces of that main file (which is what I used while testing my solutions before moving on to the main file).
Not too terrible, right? Seems achievable. So let's talk about how I approached this.
The Two Original Node.js Solutions I Came Up With
Processing large files is nothing new to JavaScript; in fact, the core functionality of Node.js includes a number of standard solutions for reading and writing to and from files.
The most straightforward is fs.readFile(), wherein the whole file is read into memory and then acted upon once Node has read it. The second option is fs.createReadStream(), which streams the data in (and out), similar to other languages like Python and Java.
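To make that distinction concrete, here is a minimal sketch of both approaches; the ./sample.txt path and the chunk logging are placeholders for illustration:

```js
const fs = require('fs');

// Option 1: fs.readFile() pulls the entire file into memory before the callback runs
fs.readFile('./sample.txt', 'utf8', (err, data) => {
  if (err) throw err;
  console.log(`Read ${data.length} characters in one shot`);
});

// Option 2: fs.createReadStream() hands the file over in chunks as they arrive
fs.createReadStream('./sample.txt', 'utf8')
  .on('data', (chunk) => console.log(`Received a chunk of ${chunk.length} characters`))
  .on('end', () => console.log('Done streaming'));
```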
The Solution I Chose to Run With & Why
Since my solution needed to involve such things as counting the total number of lines and parsing through each line to get donation names and dates, I chose to use the second method: fs.createReadStream(). Then, I could use the rl.on('line', ...) function to get the necessary data from each line as I streamed through the document.
It seemed easier to me than having to split apart the whole file once it was read in and run through the lines that way.
Node.js CreateReadStream() & ReadFile() Code Implementation
Below is the code I came up with using Node.js's fs.createReadStream() function. I'll break it down below.
The very first things I had to do to set this up were import the required modules from Node.js: fs (file system), readline, and stream. These imports allowed me to then create an instream and outstream and then the readline.createInterface(), which would let me read through the stream line by line and print out data from it.
I also added some variables (and comments) to hold various bits of data: a lineCount, a names array, a donation array and object, and a firstNames array and dupeNames object. You'll see where these come into play a little later.
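The original embedded snippet isn't reproduced here, but a minimal sketch of that setup looks roughly like this; the ./itcont.txt path and the exact variable shapes are my assumptions based on the description above:

```js
const fs = require('fs');
const readline = require('readline');
const stream = require('stream');

// './itcont.txt' is a placeholder for the unzipped FEC data file
const instream = fs.createReadStream('./itcont.txt');
const outstream = new stream.Stream();
const rl = readline.createInterface(instream, outstream);

// Holders for the answers, filled in while streaming line by line
let lineCount = 0;            // total number of lines in the file
const names = [];             // every full name from the 8th column
const firstNames = [];        // just the first names
const dupeNames = {};         // first name -> number of occurrences
const dateDonationCount = []; // a YYYY-MM string for every donation
```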
Inside of the rl.on('line', ...) function, I was able to do all of my line-by-line data parsing. In here, I incremented the lineCount variable for each line it streamed through. I used the JavaScript split() method to parse out each name and added it to my names array. I further reduced each name down to just the first name, while accounting for middle initials, multiple names, etc., with the help of the JavaScript trim(), includes() and split() methods. And I sliced the year and month out of the date column, reformatted those into a more readable YYYY-MM format, and added them to the dateDonationCount array.
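A sketch of that line handler, under a couple of assumptions I'm making here: the columns are pipe-delimited (as the FEC bulk files are), names come in as "LAST, FIRST MIDDLE", and the raw date is an MMDDYYYY string:

```js
rl.on('line', (line) => {
  lineCount++;

  // Assumption: pipe-delimited columns
  const columns = line.split('|');

  // 8th column (index 7): the donor's full name, e.g. "SMITH, JANE A"
  const name = columns[7];
  names.push(name);

  // Keep only the first name, dropping middle initials / extra names
  const afterComma = (name.split(', ')[1] || '').trim();
  const firstName = afterComma.includes(' ') ? afterComma.split(' ')[0] : afterComma;
  firstNames.push(firstName);

  // 5th column (index 4), assumed MMDDYYYY: reformat to YYYY-MM
  const rawDate = columns[4];
  dateDonationCount.push(`${rawDate.slice(4, 8)}-${rawDate.slice(0, 2)}`);
});
```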
In the rl.on('close', ...) function, I did all the transformations on the data I'd gathered into arrays and console.logged out all my data for the user to see.
The lineCount and the names at the 432nd and 43,243rd indexes required no further manipulation. Finding the most common name and the number of donations for each month was a little trickier.
For the most common first name, I first had to create an object of key-value pairs for each name (the key) and the number of times it appeared (the value), then I transformed that into an array of arrays using the ES6 function Object.entries(). From there, it was a simple task to sort the names by their value and print the largest value.
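Roughly, the close handler starts like this; the zero-based indexes (431 and 43242 for the 432nd and 43,243rd names) and the exact log wording are my own choices for the sketch:

```js
rl.on('close', () => {
  console.log(`Total lines: ${lineCount}`);
  console.log(`432nd name: ${names[431]}, 43,243rd name: ${names[43242]}`);

  // Tally each first name, then sort the [name, count] pairs by count, descending
  firstNames.forEach((name) => {
    dupeNames[name] = (dupeNames[name] || 0) + 1;
  });
  const [topName, topCount] = Object.entries(dupeNames).sort((a, b) => b[1] - a[1])[0];
  console.log(`The most common first name is ${topName}, which occurs ${topCount} times.`);
});
```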
Donations also required me to make a similar object of key-value pairs and create a logDateElements() function where I could use ES6's string interpolation to nicely display the keys and values for each donation month. Then I created a new Map(), transforming the dateDonations object into an array of arrays, and looped through each array calling the logDateElements() function on it. Whew! Not quite as simple as I first thought.
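Continuing inside the same 'close' handler, the donation counting might look like this; the dateDonations reduction and the log message are my best guess at the shape the text describes:

```js
// Collapse the YYYY-MM strings into { 'YYYY-MM': count } pairs
const dateDonations = dateDonationCount.reduce((acc, month) => {
  acc[month] = (acc[month] || 0) + 1;
  return acc;
}, {});

// String interpolation keeps the output readable
function logDateElements(month, count) {
  console.log(`There were ${count} donations in ${month}.`);
}

// Turn the object into a Map and log each month's total
const donationMonths = new Map(Object.entries(dateDonations));
donationMonths.forEach((count, month) => logDateElements(month, count));
```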
But it worked. At least with the smaller 400MB file I was using for testing…
After I'd done that with fs.createReadStream(), I went back and also implemented my solutions with fs.readFile() to see the differences. Here's the code for that, but I won't go through all the details here, since it's pretty similar to the first snippet, just more synchronous looking (unless you actually use the fs.readFileSync() function, JavaScript will still run this code just as asynchronously as all its other code, not to worry).
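A minimal sketch of the fs.readFile() version, assuming the same placeholder path and pipe-delimited columns as before:

```js
const fs = require('fs');

fs.readFile('./itcont.txt', 'utf8', (err, data) => {
  if (err) throw err;

  // The whole file is in memory now, so the lines can be handled as a plain array
  const lines = data.split('\n');
  console.log(`Total lines: ${lines.length}`);

  const names = lines.map((line) => line.split('|')[7]);
  console.log(`432nd name: ${names[431]}, 43,243rd name: ${names[43242]}`);

  // ...the first-name and donation-month tallies proceed exactly as in the streaming version
});
```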
If you'd like to see my full repo with all my code, you can see it here.
Initial Results from Node.js
With my working solution, I added the file path for the 2.55GB monster file into the readFileStream.js file, and watched my Node server crash with a JavaScript heap out of memory error.
As it turns out, although Node.js is streaming the file input and output, in between it is still attempting to hold the entire file contents in memory, which it can't do with a file that size. Node can hold up to 1.5GB in memory at one time, but no more.
So neither of my current solutions was up for the full challenge.
I needed a new solution. A solution for even larger datasets running through Node.
The New Data Streaming Solution
I found my solution in the form of EventStream, a popular NPM module with over 2 million weekly downloads and a promise "to make creating and working with streams easy".
With a little help from EventStream's documentation, I was able to figure out how to, once again, read the file line by line and do what needed to be done, hopefully in a way that was more CPU friendly to Node.
EventStream Code Implementation
Here's my new code using the NPM module EventStream.
The biggest change was the pipe commands at the start of the file. All of that syntax is the way EventStream's documentation recommends you break up the stream into chunks delimited by the \n character at the end of each line of the .txt file.
The only other thing I had to change was the names answer. I had to fudge that a little bit, since if I tried to add all 13MM names into an array, I once again hit the out of memory issue. I got around it by only collecting the 432nd and 43,243rd names and adding them to their own array. Not quite what was being asked, but hey, I had to get a little creative. A sketch of the EventStream version follows below.
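Here is a rough sketch of that EventStream pipeline, under the same pipe-delimiter assumption; the indexedNames array and the exact line-number checks are my own way of expressing the "only collect those two names" workaround described above:

```js
const fs = require('fs');
const es = require('event-stream');

let lineCount = 0;
const indexedNames = []; // only the 432nd and 43,243rd names, to stay under Node's memory limit

fs.createReadStream('./itcont.txt')
  // Break the incoming chunks back into individual lines on '\n'
  .pipe(es.split())
  .pipe(
    es.mapSync((line) => {
      lineCount++;

      // Grab only the two requested names instead of holding all ~13MM in memory
      if (lineCount === 432 || lineCount === 43243) {
        indexedNames.push(line.split('|')[7]);
      }
      // ...the first-name and donation-month tallies work the same way as before
    })
      .on('error', (err) => console.error('Error while reading the file.', err))
      .on('end', () => {
        console.log(`Total lines: ${lineCount}`);
        console.log(`Requested names: ${indexedNames.join(' and ')}`);
      })
  );
```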
Results from Node.js & EventStream: Round 2
Ok, with the new solution implemented, I once more fired up Node.js with my 2.55GB file, with my fingers crossed that this would work. Check out the results.
Success!
Conclusion
In the end, Node.js's pure file and big data handling functions fell a little short of what I needed, but with just one extra NPM package, EventStream, I was able to parse through a massive dataset without crashing the Node server.
Stay tuned for part two of this series, where I compare my three different ways of reading data in Node.js with performance testing to see which one is truly superior to the others. The results are pretty eye-opening, especially as the data gets larger…
Thanks for reading. I hope this gives you an idea of how to handle large amounts of data with Node.js. Claps and shares are very much appreciated!
If you enjoyed reading this, you may also enjoy some of my other blogs:
- Postman vs. Insomnia: Comparing the API Testing Tools
- How to Use Netflix's Eureka and Spring Cloud for Service Registry
- Jib: Getting Expert Docker Results Without Any Knowledge of Docker
Source: https://itnext.io/using-node-js-to-read-really-really-large-files-pt-1-d2057fe76b33