Real data, real problems
This week has been quite difficult and I’ve had to deal with real world problems in terms of creating, reading and inserting data.
Let me put you in context, a colleague and I managed to find a mentor who will accompany us on our way to improve our SQL skills and thus be prepared to take the job opportunities that will be presented to us in the near future.
Among the tasks that the mentor left us was to download 4 csv files of more than 1 GB each, create a PostgreSQL table for each file and load the information from the csv in the corresponding table.
On paper it sounds easy since this is something that can be done relatively quickly with the python pandas library, but since I was working in a virtual machine with windows things got complicated in such a way that the only quick way I found to install python went through Anaconda and its jupyteer notebooks.
Next I had to read the csv documents so with the help of python and thanks to the command
with open(‘file path’) as f:
I could see the type of encoding that was required to work, in this case: UTF-16.
the next problem was being able to save all the records of the files in the tables, what happens is that since they’re such large files, PostgreSQL cannot handle something of those sizes in memory, so I had to use some strategies such as the library SQLAlchemy or directly pandas with its chunksize option within its read_csv method, you can segment the file into chunks and pass the information to PostgreSQL bit by bit.
I really liked facing real problems in the world of data, I’m looking forward to seeing what other challenges await me.