Kaizen + Joy | 🦄1337code grinder
Kaizen Path | #100DaysofCodeSmith

Follow

Kaizen Path | #100DaysofCodeSmith

Follow

How to Write a JSON Parser in JS

A simple tutorial

Kaizen + Joy | 🦄1337code grinder's photo
Kaizen + Joy | 🦄1337code grinder
·May 1, 2022·

10 min read

How to Write a JSON Parser in JS
Play this article

Table of contents

  • Understanding the Documentation
  • Breaking down the Problem into Parts
  • Writing Helper Functions
  • Functions for JS Primitive Types
  • Functions for Composite Data Types
  • Next steps:

CodeSmith asked students to write a JSON Parser as part of the Precourse assignment for entering the immersive residency program. Here is my approach and solution.

If you would like to skip straight to the code, you can click here. Would be grateful for any feedback to refactor and make it better!


Understanding the Documentation

I have never taken a class on compilers / interpreters before, and did not know what a tokenizer was or AST. Before tackling this problem, I went down a bit of a Wikipedia rabbit-hole trying to understand these concepts, and spending lots of time trying to understand an abstract framework for tackling this problem, before realizing that I was spending too much time this way, and that the best way to start this problem is just to begin coding and testing functions, one at a time, to build up to the larger solution.

While I was researching the problem, I ran into this blog post by Tai Li Hau, which pointed me to the direction of understanding the Syntax Diagram (or Railroad Diagram) in the JSON.org documentation.

I spent a bit more time on Wikipedia learning the grammar/mechanics of a Syntax Diagram, and then decided to work on my solution to the JSON Parser problem following these diagrams, and translating them directly into Javascript functions:

1.png For ex, a white space is defined in the JSON documentation as these 5 options. (Since the white space is optional in JSON, omitting it altogether is the first option.)

3.png A value can be any of these data types, preceded and followed by a white space.

It’s also critical to understand the bigger picture of how these data types are nested within one another. The McKeeman Form diagram represents the same information as the Railroad diagram in text form.

It helpfully classifies escape and hex characters that need to be defined in a String and various components of a Number:

4.png A number could be an integer, fraction, or exponent.

Breaking down the Problem into Parts

In order to break this function down into parts, it is important to create a hierarchy of the simplest to most complex data types. It’s best to start by defining the primitive data types (Number, Boolean, Null) before moving on to defining the complex data types (String, Array, Object). This is because the more complex data types will need to call on methods to parse the simpler data types within.

Simplest to Most Complex –> Boolean, 2. Null, 3. Number, 4. String, 5. Array, 6. Object

Approaching the Boolean or Null function required some kind of text-reading function that triggers a True parser function when the character “T” is read, or a False parser function when the character “F” is read, or a Null parser function when the character “F” is read.

So in order to read the characters, one at a time, the next thing I needed to do was to write some helper functions.

Writing Helper Functions

I decided to put all my helper functions for reading characters from the JSON string into a “Reader” object with the following methods: index – this property keeps track of the index number of the character in the string that is being read.

  • nextChar() – for reading the next character in the string, while incrementing the index by one.
  • peek() – allows me to get the value of the next character in the string, without incrementing the index, or “eating” the next character as Hau called it.
  • eatWS() – “eats” the next character if it’s a white space by incrementing the index by one, and moving on to the next character.
  • nextNonWS() – loops through next characters until a non-white-space character is found
  • position() – a getter method that returns the value of the index

The Reader object returns these properties, which can be called by each of the Data Type functions to move through the string, like this:

reader.nextChar()
reader.peek()
reader.eatWS()

You also need to define a Value function, which detects which sort of data type the Reader is encountering at each step/index, and “eats the white space” before and after each value. I implemented this as one big switch/case statement. This is the primary function I call in my main JSON parser function to move through the string.

8.png

Functions for JS Primitive Types

Booleans & Null

After writing and testing these helper functions for reading the character stream, I was ready to move on to writing functions for the primitive types.

I combined True, False, and Null in one function called:

parseWord(word, reader)

which takes an argument “word” depending on what character is being detected in the reader stream. The function checks the next three letters in the stream to make sure they are as expected; if not, it throws an error.

I then defined three functions parseTrue(), parseFalse(), and parseNull() that call on the parseWord() function, passing in the respective words as arguments.

Numbers

Defining numbers is quite a bit trickier than defining booleans and null. It took a little while to understand what this diagram means, and the sequence in which optional values are being presented for formulating a number:

5.png

Here’s a breakdown of what this diagram is saying:

  • A number can be negative and start with a “-” or a digit between 1-9.

  • If the number is a fraction, it can start with a 0 instead, followed necessarily by a “.” (In JSON, you can’t have decimals that start with a period, such as .9340, which reads without a problem in plain Javscript. Every decimal, called a “fraction” in JSON, must start with a 0.xxx) After the period follows more digits (0-9).

  • A number can be an exponent in scientific notation if it starts with a decimal (number, period, number), followed by “E” or “e”, followed by a “-” or “+”, and ending with digits. For example, a 2-decimal scientific notation displays 12345678901 as 1.23E+10, which is 1.23 times 10 to the 10th power.

Writing this out as a Javascript function:

function isDigit(c) {
 return c != null && c >= '0' && c <= '9';
}
// ===== parseNumber() function =====
function parseNumber(reader) {
 let parsed = [];
 let c = reader.nextChar();
// read negative number
 if (c === '-') {
  parsed.push('-');
  c = reader.nextChar();
 }
// read whole number
 if (c >= '1' && c <= '9') {
  parsed.push(c);
  while (isDigit(c = reader.peek())) {
   parsed.push(c);
   reader.nextChar();
  }
 } else if (c === '0') {
  parsed.push(c);
  c = reader.peek();
  if (c !== '.') {
   throw makeError(c, reader);
  }
 } else {
  throw makeError(c, reader);
 }
// read fraction
 if (c === '.') {
  parsed.push(c);
  reader.nextChar();
  while (isDigit(c = reader.peek())) {
   parsed.push(c);
   reader.nextChar();
  }
 }
// read exponent
 if (c === 'e' || c === 'E') {
  parsed.push(c);
  reader.nextChar();
  c = reader.peek();
  if (c === '-' || c === '+') {
   c = reader.nextChar();
   parsed.push(c);
  }
  while (isDigit(c = reader.nextChar())) {
   parsed.push(c);
  }
 }
 return parseFloat(parsed.join(''));
}
// ===== end of function =====

Functions for Composite Data Types

After defining these primitive data types, I went on to define composite data types: String, Array, and Object – in that order.

Each of these three data types start with a character " or [ or { which continue onwards until the end of this object is denoted with a " or ] or }.

Strings

The tricky part about writing strings is that it can be “escaped” using the “/” character. The JSON documentation gives 9 possible escape characters, which I implemented using a switch / case statement.

The final case, “/u” is a hex character, which is defined by 4 alphanumeric characters made up of either the numbers 0-9, or the letters “a-f” or “A-F”. If the reader encounters an escape character, it needs to add “/” plus whatever character follows it into the string. (This requires adding another “/” to parse the character.) Otherwise, the String function can directly add the character into the string, including if the character is a Number.

6.png

Code snippet for escape character case: unicode / HEX

''' while ((c = reader.nextChar()) !== '"') { if (c === '\') { c = reader.nextChar() switch (c) { ... case 'u': let uni = []; for (let i = 0; i < 4; i++) { c = reader.nextChar(); if (c != null && ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) { uni.push(c); } else { throw makeError(c, reader); } } '''

Arrays

The main syntactical feature to note about arrays is that there is a comma in between values in an array. After each value, you have to reader.peek() to see if there is a comma or a “]” following that value.

NOTE: Using the reader.peek()method to check beforereader.nextChar()allows this function to navigate cases of nested arrays.

7.png

function parseArray(reader) {
 let c = reader.nextChar();
 if (c !== '[') {
  throw makeError(c, reader);
 }
 let parsed = [];
 reader.eatWS();
 while ((c = reader.peek()) !== ']') {
  parsed.push(value(reader));
  c = reader.peek();
  if (c === ',') {
   reader.nextChar();
  } else if (c !== ']') {
   throw makeError(c, reader);
  }
 }
 reader.nextChar();
 return parsed;
}

Objects

Objects have a slightly more complex syntax than arrays, and you must make sure to have your parseString() method working before you write your ``` parseObject()


The first string in an object can be parsed and stored in a variable as the “key”. This is followed by a colon. Then the next string can be parsed and stored in a “value” variable. The returned object will then assign the value its key property. If the value is followed by a comma, another key-value pair will follow, until a “}” closes the the object. [See full code in [Replit](https://replit.com/@Kai-LinLin/JSONparser?utm_campaign=%23100%20Days%20of%20Code&utm_medium=email&utm_source=Revue%20newsletter#index.js).]

![2.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1651795850426/e4trYjNIL.png align="left")


## Nesting

I found that by writing these functions in a clear, modular way, nesting actually takes care of itself, and there was not really any additional functions I needed to write to pass the CS nesting tests. The only thing to be wary of is the times when I’m calling

reader.peek() and forgetting to call reader.nextChar() ``` after to move the index up. This could lead to an error in nesting, such as an issue I struggled with for several hours, due to dropping the last property in an object after a nested object.

However, if written correctly, the functions will call each other and execute each set of instructions in the proper local execution context, without needing to manually keep track of the number of levels of nesting you are doing. For example, for one of the tests: JSONParser(‘{ “a”: { “b”: 1 }, “c”: 2 }’

parseObject()
``` is being called inside of

parseString() , which is being called by parseValue() , after being triggered inside the larger parseObject() , which moves through the object with parseString() ``` This could lead to weird errors, which requires tons of agonizing tracing of variable values up and down stacks to debug. But if all of the functions are written clearly, nesting takes care of itself.

JSON Parser Solution:

Here’s all my code for this problem here. It passes all the CodeSmith tests. But I’ll keep working to refactor this code, and would appreciate any code review / feedback on ways to further optimize performance. Thank you! 🙏

FINAL NOTE

In the process of working through these nested functions, I had to get a lot more comfortable with using the Debugger in VS Code, to step through and into some of the wild loops that are being created here. I was running into so many bugs at first. To use the debugger, I would set breakpoints in the middle of the while loops, to check the values of the variables at each level of the call stack.

This was much clearer to understand than using the Quokka extension which prints all the looped variable values in-line, but would often start with a “…” because of how long the loops would run if you don’t set a breakpoint in the debugger. (Essentially impossible to print the starting values of a long loop in a single line with Quokka, and difficult to understand what’s really happening inside the loop). I would not have been able to untangle this mess without taking the detour (and time) to get more comfortable with the VS Code debugger.

Next steps:

In an upcoming post, after I finish the other Precourse requirements, I would like to write another solution using a tokenizer, with a solid understanding of Abstract Syntax Trees (AST), lexical analysis (lexing), compilers and interpreters.

Here’s the course on Front End Masters that I am watching to better understand this topic:

 
Share this