Can GPT-4 Pass a Software Engineer Coding Assignment?

Key Takeaways:

  • GPT-4 demonstrated a clear understanding of Python and Rust syntax and constructs, utilizing language-specific features to provide concise, efficient, and easy-to-read solutions for easy, medium, and hard-difficulty problems.

  • GPT-4’s solutions effectively addressed the problem statements, showcasing its proficiency in generating solutions in different programming languages. However, its performance was not flawless, as it sometimes struggled to address edge cases and to achieve optimal efficiency.

  • As a study tool, GPT-4 has the potential to be an invaluable resource for students to learn programming concepts and develop problem-solving skills. However, these tools must be used responsibly and ethically so that they support, rather than replace, one’s own learning progress.

  • GPT-4’s capabilities raise concerns about potential cheating on coding assessments, necessitating more sophisticated and in-depth evaluation methods in the hiring process, such as live coding interviews or pair programming sessions.

  • To ensure the responsible and ethical use of GPT-4 in software engineering education and hiring, it is crucial to promote a culture of academic integrity and implement proactive measures to mitigate the risks associated with cheating and misuse.

Introduction

Artificial intelligence (AI) has advanced rapidly in recent years, producing increasingly sophisticated models like OpenAI’s GPT-4. As AI models become more powerful, researchers, educators, and employers are keen to understand their potential applications and limitations.

Graph via OpenAI

GPT-4 performs at a high level on many subject tests, such as the AP exams. But can it complete the coding assignments typically given to software engineers?

Purpose of the Research

This research aims to assess GPT-4’s ability to solve programming problems that software engineers commonly encounter during interviews and on the job. By examining GPT-4’s performance across a range of problem difficulties and programming languages, we aim to identify its strengths, weaknesses, and potential implications for the future of software engineering education and hiring practices.

Additionally, we will touch on the ethical considerations related to using AI models like GPT-4 to test programming skills in educational and professional contexts.

Graph via OpenAI

OpenAI has claimed that GPT-4 reaches a factual accuracy of up to around 70% on code. Let us see whether that holds up.

GPT-4 Overview and Capabilities

GPT-4, or Generative Pre-trained Transformer 4, is a powerful AI model developed by OpenAI. As an autoregressive language model, it is designed to generate human-like text by predicting the next word in a sequence based on the context of previous words. GPT-4’s massive scale and training data enable it to perform various tasks, such as answering questions, writing essays, and generating code snippets.

GPT-4’s predecessors, including GPT-3, have demonstrated impressive capabilities in generating code and solving simple programming problems. However, their performance in more complex coding tests and assignments remains a subject of ongoing research.

In this blog post, we will put GPT-4 to the test by evaluating its performance on various coding problems, ranging from easy to hard, to better understand its potential as a tool for software engineers and educators.

As we explore GPT-4’s performance on software engineering tests and coding assignments, it’s crucial to establish a structured and rigorous approach to our investigation.

Methodology

Selection of Coding Problems

To objectively assess GPT-4’s ability to tackle software engineering coding assignments, we selected a diverse set of problems that cover various aspects of algorithmic challenges typically faced by software developers. We focused on three categories of problems: easy, medium, and hard, to get a comprehensive understanding of the model’s capabilities across different levels of difficulty.

The problems were chosen from reputable sources such as LeetCode, HackerRank, and other competitive programming platforms, ensuring their relevance to real-world software engineering tasks.

Evaluation Criteria

To evaluate GPT-4’s performance on the selected coding problems, we established a set of criteria that would allow us to objectively measure the quality of the generated solutions. These assessment criteria include:

Correctness: The solution should produce the expected output for all valid inputs, adhering to the problem constraints and requirements.

Efficiency: The solution should exhibit optimal or near-optimal time and space complexity, ensuring it can handle large inputs within reasonable time limits.

Readability: The generated code should be easy to read and understand, following established coding conventions and practices.

Robustness: The solution should gracefully handle edge cases and unexpected inputs without causing crashes or producing incorrect results.

Knowledge: The solution should be generated in the two programming languages: Python and Rust. This will test GPT-4’s ability to adapt its solutions across different languages, showcasing its versatility and depth of knowledge in various programming paradigms.

By assessing GPT-4’s solutions using these criteria, we can better understand its strengths and weaknesses in solving software engineering coding assignments.

The selection of Python and Rust as the two programming languages for this study was driven by a combination of factors, including their syntax, popularity, and the preferences of the developer community.

Syntax Criteria: Python and Rust were chosen for their distinct and expressive syntax styles. Python is known for its simplicity and readability, which makes it an excellent choice for demonstrating GPT-4’s ability to generate clean, easily understandable code.

On the other hand, Rust is designed for performance and safety, with a syntax that emphasizes explicitness and strong typing. By evaluating GPT-4’s performance in both languages, we can assess its adaptability to different syntax styles and programming paradigms.

Popularity and Developer Preferences: According to recent developer surveys, Python is one of the most popular programming languages, consistently ranking among the top choices for developers across various fields, including web development, data science, and machine learning. Its versatility and ease of use make it an ideal candidate for this study.

While not as widespread as Python, Rust has gained a reputation as one of the most loved programming languages among developers. Its focus on safety and performance and its innovative features like ownership and borrowing make it an increasingly popular choice for systems programming and other performance-critical applications.

Including Rust in this study allows us to gauge GPT-4’s proficiency in a cutting-edge language beloved by the developer community.

By selecting Python and Rust as our target languages, we aim to demonstrate GPT-4’s capability to generate solutions that cater to the preferences and requirements of a diverse range of developers while showcasing its ability to adapt to different programming languages and paradigms.

GPT-4 Solution Generation

We provided GPT-4 with a clear and concise problem statement for each selected coding problem, along with any necessary input and output constraints. We then prompted the model to generate a solution in a specific programming language (Python or Rust).

To ensure that GPT-4 had ample opportunity to generate high-quality solutions, we repeated this process multiple times for each problem, varying the prompts slightly to explore different approaches the model might take. We then selected the best solution generated by GPT-4 for each problem based on the evaluation criteria outlined earlier.
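
The selection step can be pictured as a small scoring harness. The sketch below is illustrative only and is not the tooling used for this post: `pick_best`, `count_passed`, the two candidate functions, and the test cases are all hypothetical. It scores correctness only, whereas the criteria above also weigh efficiency, readability, and robustness.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical test cases: (input array, expected sum of even numbers).
TEST_CASES: List[Tuple[List[int], int]] = [
    ([1, 2, 3, 4, 5, 6], 12),
    ([], 0),
    ([-4, -3, 7], -4),
]

def count_passed(candidate: Callable[[List[int]], int]) -> int:
    """Count how many test cases a candidate solves without raising."""
    passed = 0
    for arr, expected in TEST_CASES:
        try:
            if candidate(list(arr)) == expected:
                passed += 1
        except Exception:
            # A crash simply counts as a failed case.
            pass
    return passed

def pick_best(candidates: Sequence[Callable[[List[int]], int]]) -> Callable[[List[int]], int]:
    """Return the candidate that passes the most test cases (first one wins ties)."""
    return max(candidates, key=count_passed)

# Two hypothetical model outputs for the same prompt.
def candidate_a(arr):
    return sum(x for x in arr if x % 2 == 0)

def candidate_b(arr):
    # Buggy on purpose: ignores negative even numbers.
    return sum(x for x in arr if x > 0 and x % 2 == 0)

best = pick_best([candidate_b, candidate_a])
print(best.__name__)  # candidate_a
```

Here `candidate_b` passes two of the three cases and `candidate_a` passes all three, so the harness keeps `candidate_a`.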

In the following sections, we will discuss GPT-4’s performance on the selected coding tests and problems, analyze its solutions, and draw conclusions about the model’s capabilities in the context of software engineering coding test assignments.

Easy Difficulty Problem: Sum of Even Numbers

Problem Description

Given an array of integers, find the sum of all even numbers in the array.

Input: An array of n integers, where 1 <= n <= 10^4 and the integers range from -10^4 to 10^4.

Output: Return the sum of all even numbers in the input array.

Example:
Input: [1, 2, 3, 4, 5, 6]

Output: 12

Explanation: The even numbers in the array are 2, 4, and 6. Their sum is 12.

Problem Difficulty and Target Audience

The “Sum of Even Numbers” problem is an easy-level coding problem suitable for beginners in programming or students who have recently started learning a programming language. It tests basic programming skills, such as looping through an array, conditional statements, and simple arithmetic operations.

This problem is a great starting point for students new to programming, as it helps them understand how to traverse arrays and work with conditional statements. It also serves as a foundation for complex problems requiring similar techniques, such as iterating through arrays and performing computations based on specific conditions.

Students with some programming experience should be able to solve this problem with little difficulty. However, it may still be a good exercise for those looking to practice their coding skills and gain confidence in their abilities. Additionally, it allows students to familiarize themselves with the syntax and structure of different programming languages, as we will be evaluating GPT-4’s solutions in Python and Rust.

GPT-4’s Solution and Analysis

Python Solution:
def sum_of_even_numbers(arr):
    return sum(x for x in arr if x % 2 == 0)
Rust Solution:
fn sum_of_even_numbers(arr: &[i32]) -> i32 {
    arr.iter().filter(|&&x| x % 2 == 0).sum()
}
Explanation:

GPT-4 clearly understands the respective language syntax and constructs for both Python and Rust solutions, effectively utilizing language-specific features.

In the Python solution, GPT-4 passes a generator expression to the sum function, which iterates through the input array arr. It filters the elements by checking whether the remainder after division by 2 is 0 (i.e., the number is even) and then sums the filtered elements to produce the final result.

Similarly, in the Rust solution, GPT-4 uses the iter() method to create an iterator over the input array arr, followed by the filter() method with a closure that checks if the number is even. Finally, the sum() method is used to sum the filtered elements and return the result.

Both solutions are concise, efficient, and easy to read, showcasing GPT-4’s proficiency in generating solutions in different programming languages. Furthermore, the generated solutions effectively address the problem statement, exhibiting optimal time complexity (O(n)) and handling edge cases gracefully. (Explanation generated by ChatGPT via GPT-4.)
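
To make the edge-case claim concrete, here is a brief check of the Python solution on a few inputs. The inputs are our own illustrative picks, not cases from the original evaluation, and the function body is repeated so the snippet runs on its own.

```python
def sum_of_even_numbers(arr):
    # Same logic as the solution above: keep even values, then add them up.
    return sum(x for x in arr if x % 2 == 0)

# Example from the problem statement.
assert sum_of_even_numbers([1, 2, 3, 4, 5, 6]) == 12

# No even numbers: sum() of an empty sequence is 0.
assert sum_of_even_numbers([1, 3, 5]) == 0

# Negative evens count too, because -4 % 2 == 0 in Python.
assert sum_of_even_numbers([-4, -3, 10]) == 6

# Largest input allowed by the constraints: 10^4 values of 10^4.
assert sum_of_even_numbers([10**4] * 10**4) == 10**8
```

The last case also suggests why i32 is sufficient in the Rust version: the largest possible sum under the stated constraints, 10^8, is well below i32::MAX.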

Medium Difficulty Problem: Valid Binary Search Tree

Problem Description

The task is to determine whether a given binary tree is a valid binary search tree (BST). A binary search tree is a binary tree data structure with the following properties:

The value of each node in the tree is greater than the values of all the nodes in its left subtree.

The value of each node in the tree is less than the values of all the nodes in its right subtree.

Both the left and right subtrees must also be binary search trees.

The input will be the root of the binary tree, and the output should be a boolean value indicating whether the tree is a valid binary search tree or not.

Example:

Input:

    5
   / \
  1   7
     / \
    6   8

“Did it just draw a binary tree in markdown?” Yes, yes, it did. Continue

Output: True

This example input represents a valid binary search tree because it satisfies all the above BST properties.

This medium-difficulty problem is typically suited for students who understand the fundamentals of binary trees, recursion, and data structures. The problem requires a deeper understanding of the properties of binary search trees and the ability to navigate through the tree structure to verify these properties.

Students who have completed a few data structure and algorithm courses or have practiced solving tree-based problems on LeetCode should be able to solve this problem. However, it can be challenging for beginners who are just starting to learn about binary trees and binary search trees.

The problem serves as a good exercise for intermediate-level students to refine their problem-solving skills, as it helps reinforce their knowledge of tree traversal techniques and binary search tree properties.

GPT-4’s Solution and Analysis

Python Solution:
def isValidBST(self, root):
    # Check that every node lies strictly between the bounds set by its ancestors.
    def helper(node, lower=float('-inf'), upper=float('inf')):
        if not node:
            return True
        val = node.val
        if val <= lower or val >= upper:
            return False
        if not helper(node.right, val, upper):
            return False
        if not helper(node.left, lower, val):
            return False

        return True
    return helper(root)
Rust Solution:
use std::cell::RefCell;
use std::rc::Rc;
use std::i64;
fn is_valid_bst(root: Option<Rc<RefCell<TreeNode>>>) -> bool {
    fn helper(node: &Option<Rc<RefCell<TreeNode>>>, min: i64, max: i64) -> bool {
        if let Some(n) = node {
            let n = n.borrow();
            let val = n.val as i64;
            if val <= min || val >= max {
                return false;
            }
            if !helper(&n.left, min, val) || !helper(&n.right, val, max) {
                return false;
            }
        }
        return true
    }
    helper(&root, i64::MIN, i64::MAX)
}

GPT-4 provides valid solutions for both Python and Rust. The solutions take a recursive approach to traverse the binary tree, ensuring that the properties of a binary search tree are maintained. The helper function takes the current node, a lower, and an upper bound as arguments.

The algorithm checks whether the current node’s value lies between the lower and upper bounds at each recursive step. If not, the function returns False, indicating that the given tree is not a valid binary search tree.

The function then makes recursive calls to the left and right subtrees, updating the lower and upper bounds based on the current node’s value.

For the left subtree, the current node’s value is used as the new upper bound, and for the right subtree, the current node’s value is used as the new lower bound. This ensures that each node’s value is greater than all the nodes in its left and less than all in its right subtree.

Both the Python and Rust solutions exhibit optimal O(n) time complexity, visiting each node exactly once. The recursive approach makes the code concise and easy to read while effectively handling edge cases and ensuring correctness.
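
As a quick illustration of why the bounds matter, the sketch below runs the same bounds-checking idea on the example tree and on a tree that a parent-child-only comparison would wrongly accept. The `TreeNode` class is an assumption modeled on the LeetCode definition, and `is_valid_bst` is a standalone restatement of the helper above rather than GPT-4’s exact output.

```python
class TreeNode:
    """Minimal binary tree node (assumed shape, modeled on LeetCode's)."""
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def is_valid_bst(node, lower=float("-inf"), upper=float("inf")):
    # An empty subtree is always valid.
    if node is None:
        return True
    # Every value must lie strictly between the bounds inherited from its ancestors.
    if not (lower < node.val < upper):
        return False
    # Left subtree: values must stay below node.val; right subtree: above it.
    return (is_valid_bst(node.left, lower, node.val)
            and is_valid_bst(node.right, node.val, upper))

# The example tree from the problem description.
valid = TreeNode(5, TreeNode(1), TreeNode(7, TreeNode(6), TreeNode(8)))

# Each child is fine relative to its parent (4 < 7 < 8),
# but 4 sits in the right subtree of 5, so the tree is not a BST.
invalid = TreeNode(5, TreeNode(1), TreeNode(7, TreeNode(4), TreeNode(8)))

print(is_valid_bst(valid))    # True
print(is_valid_bst(invalid))  # False
```

Carrying the bounds down the recursion is what catches the second tree: by the time the check reaches node 4, the lower bound is already 5.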

Hard Difficulty Problem: Trapping Rain Water

We will examine GPT-4’s performance on a difficult problem as we wrap up. This will provide a comprehensive understanding of its capabilities in tackling complex coding challenges.

Problem Description

The Trapping Rain Water problem involves calculating the amount of water trapped in an array of non-negative integers, where each integer represents the height of a wall. The width of each wall is 1, and the water is trapped between the walls. The goal is to determine the total volume of water that can be trapped between these walls after rain.
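
The description is easier to follow with a concrete input. The sketch below is a direct, unoptimized restatement of the definition, not GPT-4’s solution: the water above each position is the lower of the tallest walls on either side (counting the position itself), minus the wall’s own height. The function name and the sample input are illustrative.

```python
def trapped_water_brute_force(height):
    """O(n^2) reference implementation that follows the definition directly."""
    total = 0
    for i in range(len(height)):
        tallest_left = max(height[: i + 1])   # tallest wall at or left of i
        tallest_right = max(height[i:])       # tallest wall at or right of i
        total += min(tallest_left, tallest_right) - height[i]
    return total

# Walls of height 3, 0, 2, 0, 4: water pools above positions 1, 2, and 3
# (3 + 1 + 3 units), bounded by the wall of height 3 on the left.
print(trapped_water_brute_force([3, 0, 2, 0, 4]))  # 7
```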

This problem is considered hard because it requires a solid understanding of data structures such as stacks, or of techniques such as dynamic programming, along with a strong grasp of algorithms to optimize time and space complexity.

GPT-4’s Solution and Analysis

To evaluate GPT-4’s performance on the Trapping Rain Water problem, we will analyze its solutions in both Python and Rust.

Python Solution:

from typing import List

class Solution:
    def trap(self, height: List[int]) -> int:
        if len(height) < 3:
            return 0

        left_max = [0] * len(height)
        right_max = [0] * len(height)

        left_max[0] = height[0]
        for i in range(1, len(height)):
            left_max[i] = max(left_max[i - 1], height[i])


        right_max[-1] = height[-1]
        for i in range(len(height) - 2, -1, -1):
            right_max[i] = max(right_max[i + 1], height[i])


        trapped_water = 0
        for i in range(len(height)):
            trapped_water += min(left_max[i], right_max[i]) - height[i]

        return trapped_water
Rust Solution:
pub fn trap(height: Vec<i32>) -> i32 {
    if height.len() < 3 {
        return 0;
    }

    let mut left_max = vec![0; height.len()];
    let mut right_max = vec![0; height.len()];


    left_max[0] = height[0];
    for i in 1..height.len() {
        left_max[i] = left_max[i - 1].max(height[i]);
    }


    right_max[height.len() - 1] = height[height.len() - 1];
    for i in (0..height.len() - 1).rev() {
        right_max[i] = right_max[i + 1].max(height[i]);
    }


    let mut trapped_water = 0;
    for i in 0..height.len() {
        trapped_water += (left_max[i].min(right_max[i]) - height[i]).abs();
    }

    trapped_water
}

The Python and Rust solutions demonstrate GPT-4’s ability to handle complex problem-solving in multiple languages. Both use dynamic programming, precomputing the running maxima from each side, to reach O(n) time at the cost of O(n) extra space. They also exhibit good readability and adhere to each language’s syntax and coding conventions.

In both solutions, GPT-4 first checks if the input list has fewer than three elements, in which case it returns 0, as no water can be trapped. It then initializes two lists, left_max and right_max, to store the maximum heights to the left and right of each position. The algorithm populates these lists by iterating through the height list in two passes, one from left to right and the other from right to left. Finally, the trapped water volume is calculated by iterating through the height list once more and summing the difference between the minimum left and right maxes at each position and the height at that position.
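
The two auxiliary lists cost O(n) extra space. A well-known alternative, shown here as a sketch and not as part of GPT-4’s output, keeps only two pointers and two running maxima, which brings extra space down to O(1) while staying O(n) in time. The function name `trap_two_pointers` is ours.

```python
from typing import List

def trap_two_pointers(height: List[int]) -> int:
    """Two-pointer variant: O(n) time, O(1) extra space."""
    left, right = 0, len(height) - 1
    left_max = right_max = 0
    trapped = 0
    while left < right:
        if height[left] < height[right]:
            # Some wall to the right is at least as tall as left_max,
            # so the water level at `left` is set by left_max.
            left_max = max(left_max, height[left])
            trapped += left_max - height[left]
            left += 1
        else:
            # Symmetric case: the water level at `right` is set by right_max.
            right_max = max(right_max, height[right])
            trapped += right_max - height[right]
            right -= 1
    return trapped

print(trap_two_pointers([3, 0, 2, 0, 4]))  # 7
```

The trade-off is readability: the three-pass version mirrors the definition of the problem more directly, while this one relies on an invariant that takes a moment to convince yourself of.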

Overall, GPT-4’s solutions to the Trapping Rain Water problem indicate its strong capabilities in handling difficult coding problems, demonstrating its potential as a study tool and problem-solving assistant for advanced computer science students and software developers.

Discussion

GPT-4’s Performance Across Different Problem Difficulties

Our evaluation of GPT-4’s ability to solve coding problems of varying difficulty showed that its performance was generally consistent with the expectations for each difficulty level. For the easy difficulty problem, GPT-4 generated correct and efficient solutions in both Python and Rust.

As the difficulty increased to medium and hard problems, GPT-4 continued to produce viable solutions, although in some cases, the generated code required minor adjustments or further optimization.

It is important to note that GPT-4’s performance was not flawless, and in some cases it struggled to address edge cases and to achieve optimal efficiency. However, considering its ability to generate solutions for diverse problems, GPT-4 has demonstrated impressive capabilities as an AI language model in software engineering tasks.

I found that it passed 100% of the free LeetCode easy and medium-difficulty questions, the kind found in a serious programming test or interview. However, GPT-4’s solutions failed on 2 of the hard algorithm questions, so there is still room for improvement.

GPT-4 failing an edge case in LeetCode

Limitations and Challenges in GPT-4’s Problem-solving Capabilities

While GPT-4 has shown promise in solving coding problems, some limitations and challenges are associated with its problem-solving capabilities. First, GPT-4’s generated solutions may not always adhere to best coding practices, potentially leading to less readable or maintainable code.

Additionally, GPT-4 might not consistently generate solutions with optimal efficiency, which could be problematic for performance-critical applications or large-scale systems.

Moreover, GPT-4’s ability to understand complex problem statements and extract the relevant information is not always reliable. It may require additional guidance or clarification to fully comprehend a given problem statement.

Implications for Software Engineering Education and Hiring

The findings from our research have several implications for software engineering education and hiring. GPT-4’s ability to generate solutions to coding problems suggests that AI-driven tools like GPT-4 can be valuable resources for students to learn programming concepts and develop problem-solving skills.

However, using these tools responsibly and ethically is essential: AI-generated solutions only support learning progress when the student also understands the underlying concepts.

In the hiring process, GPT-4’s capabilities raise concerns about the potential for candidates to cheat on coding assessments using AI-generated solutions.

To combat this, employers should consider adopting more sophisticated and in-depth evaluation methods, such as live coding interviews or pair programming sessions, to better assess a candidate’s coding skills, true understanding, and ability to apply programming concepts in real-world scenarios.

Ultimately, GPT-4 serves as a powerful study tool and resource for software engineering students and professionals alike, but it should be utilized responsibly and ethically to ensure its benefits are realized without compromising the integrity of the education and hiring processes.

Ethical Considerations

GPT-4’s Use as a Study Tool

The application of GPT-4 as a study tool in software engineering and programming education has the potential to be an invaluable resource for students. By providing a means for students to access solutions and explanations for coding problems, GPT-4 can aid in reinforcing programming concepts and improving problem-solving and technical communication skills. However, students must use GPT-4 responsibly and as a supplement to their learning rather than relying solely on AI-generated solutions. Understanding each solution’s underlying concepts and thought processes is essential for long-term success and growth as a software engineer.

When using GPT-4 as a study tool, it is vital to keep the following ethical considerations in mind:

Always strive to comprehend the logic behind the AI-generated solution rather than blindly copying or memorizing it.

Use GPT-4-generated solutions as a reference, but practice solving problems independently to develop critical thinking and problem-solving skills.

Acknowledge the limitations and potential biases in AI-generated solutions and consult other resources or experts when in doubt.

Cheating and Misuse of GPT-4 in Coding Interviews and Assignments

The availability of powerful AI tools like GPT-4 raises concerns about cheating and misuse in coding interviews and assignments. Candidates might be tempted to use GPT-4-generated solutions to pass coding assessments without truly understanding or possessing the required skills. This dishonest behavior undermines the integrity of the hiring and recruitment process and can lead to unqualified candidates entering the workforce.

To mitigate the risks associated with cheating and the misuse of GPT-4, both educators and employers must take proactive measures, such as:

Implementing more sophisticated evaluation methods, such as live coding interviews, pair programming sessions, technical interviews, or project-based assessments, to better gauge a candidate’s skills, true understanding, and ability to apply programming concepts.

Encouraging a culture of academic integrity and promoting the ethical use of AI-driven tools in educational settings.

Regularly updating coding assessments and problem sets to reduce the likelihood of candidates finding ready-made solutions.

By addressing these ethical considerations, we can ensure that GPT-4 and similar AI tools are utilized responsibly and ethically, ultimately enhancing the software engineering education and hiring landscape while preserving their integrity.

Paragraphs from OpenAI

Is GPT-4 released?

Yes. OpenAI released its long-anticipated GPT-4 on March 14, 2023. GPT-4 can respond to both images and text, and OpenAI describes it as more accurate and better informed than earlier models.

Will GPT-4 be free?

ChatGPT itself has a free version, but GPT-4 is available through ChatGPT Plus, the paid tier, at $20 a month.

Why can’t I use ChatGPT in a coding interview?

Using ChatGPT in a coding interview is not advisable for several reasons:

  1. Honesty and Integrity: Coding interviews assess your programming skills, problem-solving abilities, and understanding of computer science concepts. Using ChatGPT during an interview would be considered cheating, as it undermines the purpose of the assessment and falsely represents your abilities.

  2. Long-term Consequences: Even if you successfully pass the interview using ChatGPT, you may struggle to perform well without the necessary skills and knowledge. This could lead to poor job performance, difficulty in career growth, and potential termination.

  3. Slow Output and Code Formatting Errors: Waiting on the tool during an interview suggests that you either don’t know the material or are waiting on something or someone. The question may also be asked verbally, so the AI may not capture all of it. As a result, it may invent libraries that do not exist or introduce slight inaccuracies.

  4. Unreliability of AI-generated Solutions: Although ChatGPT is a powerful AI tool, it may not always generate the most efficient or accurate solutions. Relying on it during a coding interview might result in incorrect or suboptimal answers, negatively impacting your interview performance.

Instead of using ChatGPT during a coding interview, it’s better to use it as a study tool to help you learn and reinforce programming concepts, practice problem-solving, and prepare for interviews ethically and responsibly. This will lead to long-term success and growth in your software engineering career.

Conclusion

Our research on GPT-4’s ability to solve software engineering coding assignments across various difficulty levels and programming languages has provided valuable insights into its capabilities and limitations. While GPT-4 can be useful for learning and reinforcing programming concepts, it is not a replacement for genuine understanding and skill development. It is important for students and professionals alike to focus on honing their programming abilities and critical thinking skills rather than relying on AI-generated solutions. By maintaining integrity during interviews and assignments, you can ensure long-term success in your software engineering career and contribute meaningfully to the ever-evolving field of computer science.

Updated on November 20, 2023… OpenAI released the newest version of GPT-4 and Code Interpreter. Long story short, all the coding problems are now fully solved by AI…