Updated on 2022-07-29

Loading data from thousands of JSON files

In this project, I was presented with a dataset of over 250k JSON files. The files come from an NFT (non-fungible token) project, a blockchain technology that has been getting a lot of attention lately. The official CardaNinjas project offers a collection of more than 1,000 unique characters: the ninja Johnny in various situations.

However, I used the complete set of combinations generated by the team, which contains 256,093 different Johnnies, one per JSON file, each holding the character's name and characteristics. Analyzing data spread across that many files is difficult, so using Python I loaded all of it into MongoDB, a document-oriented NoSQL database, and built a simple dashboard in PowerBI to filter Johnnies by their characteristics.

All the files of the project are on GitHub.

Preparing the development environment

For this project I used Python 3.10, the MongoDB 5.0.6 database and PowerBI, Microsoft's self-service BI application.

As I was using Windows 10, I needed to download the ODBC driver and set up a DSN (Data Source Name). After that, it was also necessary to install the BI Connector so that PowerBI could connect to MongoDB properly.
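Under the hood, the BI Connector runs a mongosqld process that exposes a MySQL-compatible SQL endpoint in front of MongoDB, and the ODBC DSN points at that endpoint rather than at MongoDB itself. As a rough sketch (paths and options depend on your installation), starting it against the local instance looks like this:

            # Start the BI Connector against the local MongoDB instance.
            # By default it listens on 127.0.0.1:3307, which is the address
            # the ODBC DSN should be configured with.
            mongosqld --mongo-uri "mongodb://localhost:27017"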

Execution

Using the MongoClient class from the pymongo package, we create the database connection:

            from pymongo import MongoClient

            # Connect to the local MongoDB instance
            client = MongoClient('localhost', 27017)
            db = client['ninja']
            collection = db['johnny']
            

So we will use the johnny collection from the ninja database. After that, the code below specifies the directory and counts how many .json files there are in it:

            from glob import iglob

            jsonPath = './json256k/'

            # Count the .json files without building the full file list in memory
            countFiles = sum(1 for i in iglob(jsonPath + "*.json"))
            print("There are {} files in {} directory.".format(countFiles, jsonPath))
            

And we have the result:

            There are 256093 files in ./json256k/ directory.

With the code below, we read the JSON files and load them into a pandas dataframe:

              import pandas as pd

              def gen_dataframe(count_files, json_path):
                  df_data = pd.DataFrame()
                  for i in range(count_files):
                      # Each file holds one Johnny; the attributes array yields one row per trait
                      df = pd.DataFrame(pd.read_json(f"{json_path}{str(i+1)}.json"))
                      df['Name'] = f"Johnny #{str(i+1).zfill(6)}"
                      # Promote each trait's value to its own column
                      df['Background'] = df['attributes'][0]['value']
                      df['Legs'] = df['attributes'][1]['value']
                      df['Body'] = df['attributes'][2]['value']
                      df['Eyes'] = df['attributes'][3]['value']
                      df['Head'] = df['attributes'][4]['value']
                      df['Arms'] = df['attributes'][5]['value']
                      df['Pets'] = df['attributes'][6]['value']
                      # Drop the raw columns and collapse the per-trait rows into a single row
                      df.drop(['attributes', 'dna', 'edition'], axis=1, inplace=True)
                      df.drop_duplicates(inplace=True)
                      df_data = df_data.append(df)
                  return df_data
            
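Calling the function with the values obtained earlier builds the complete dataframe. A small usage sketch; since rows are appended one file at a time, this step takes a while for 256,093 files:

              # Build the dataframe from all the JSON files in the directory
              df_data = gen_dataframe(countFiles, jsonPath)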

That way, each row of the resulting dataframe describes one Johnny through its Name, Background, Legs, Body, Eyes, Head, Arms and Pets attributes.


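For reference, each source file follows the usual NFT metadata layout implied by the code above: a name, a dna hash, an edition number and an attributes array with one entry per trait. The record below is purely illustrative; only the field names name, dna, edition, attributes and value come from the code, while trait_type and the concrete values are assumptions:

            {
                "name": "Johnny",
                "dna": "e4d9a73c51b2f8a6",
                "edition": 1,
                "attributes": [
                    {"trait_type": "Background", "value": "Blue"},
                    {"trait_type": "Legs", "value": "Short"},
                    {"trait_type": "Body", "value": "Tall"},
                    {"trait_type": "Eyes", "value": "Happy"},
                    {"trait_type": "Head", "value": "Bandana"},
                    {"trait_type": "Arms", "value": "Crossed"},
                    {"trait_type": "Pets", "value": "Hedgehog"}
                ]
            }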

And finally, we convert the dataframe into a list of records and run the code below to insert them into MongoDB:

            # Each dataframe row becomes one document in the johnny collection
            collection.insert_many(df_data.to_dict('records'))

Inside the MongoDB shell, we can see the number of documents loaded into the database:

            > db.johnny.find().count()
            256093
          

A simple test to filter the data can be done with the command below in the MongoDB terminal:

            > db.johnny.find({"Body":"Tall", "Pets":"Hedgehog"}).count()
            2513
          
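The same filter can be run from Python with pymongo's count_documents, using the collection object created earlier:

            # Count Johnnies that are Tall and have a Hedgehog as a pet
            print(collection.count_documents({"Body": "Tall", "Pets": "Hedgehog"}))  # 2513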

You can filter the Body as "Tall" and the Pets as "Hedgehog" in the dashboard below, and you will see the same value: 2,513.

Dashboard

This is a simple dashboard where you can filter the information by the characteristics of each character. It was not feasible to embed all 256 thousand images in the dashboard.


There's a button to view fullscreen in the lower right corner. Or click here to view it in a separate tab.

Project information

  • Category: Data Engineering - ETL
  • Project date: 25 March 2022
  • Project files: On GitHub