Eric Schorn
Technical Director
Cryptography Services Practice, NCC Group
April 15, 2023
This executable blog post is the third in a series related to machine learning and explores code generation from a 16-billion-parameter large language model (LLM). After a brief look under the hood at the LLM structure and parameter allocation, we generate a variety of Python functions and make observations related to code quality and security. As with OpenAI's ChatGPT, Google's Bard, Meta's LLaMA and GitHub's Copilot, we will see some fantastic capabilities and a few intriguing misses. The results demonstrate that human expertise remains crucial, whether it be in engineering the prompt or in evaluating the generated code for suitability.
The model used here was introduced in the paper titled CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis from late 2022, with code on GitHub, and is supported via the Hugging Face framework. As the paper suggests, in some respects this model's performance is comparable to that of the OpenAI Codex model which powered the GitHub Copilot product. Since this technology is still in its infancy, any misadventures should be interpreted as opportunities for further research and the need for human technical expertise, rather than as criticism.
You can run everything for yourself by loading this .ipynb file into any Jupyter-based notebook system. The goal for the code below is to utilize state-of-the-art models while maximizing simplicity, understandability, and accessibility. This is consistent with the two prior posts in the blog series.
The code starts here with importing standard machine learning frameworks, implementing a 'pretty printer' support function to help manage voluminous output, and then reporting key version information.
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from torchinfo import summary
import warnings; warnings.filterwarnings("ignore") # Pesky Pytorch artifact
def pprint(text, start=0, stop=20): # For pretty-printing with line numbers
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if i < start or i > stop: continue
        print(f"{i:>3d}: {line}")

print("PyTorch version {} with Transformers version {}".format(
    torch.__version__, transformers.__version__))
PyTorch version 1.13.1.post200 with Transformers version 4.26.1
Now the pretrained model can be loaded along with its matching tokenizer, and the total number of model parameters reported. This model was initially trained on 825GB of English text, then training continued on a subset of the BigQuery dataset containing multiple programming languages, and finally training concluded on a large amount of Python source code. While this Jupyter notebook does not require a GPU, the model does consume roughly 80GB of memory (with smaller but less capable models also available). The total parameter count is reported via our 'pretty printer' pprint() function which extracts line 419 of the model summary. The full summary is included at the very end of this post. This parameter count includes some ancillary functionality so we will only be working with the most significant digits below.
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-16B-mono", pad_token_id=50256)
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-16B-mono")
model_summary = summary(model, depth=5).__str__()
pprint(model_summary, 419, 419) # Extract one specific line of output
419: Total params: 16,032,155,648
So, how does one 'spend' 16 billion parameters in an LLM?
Essentially all of the most recent and popular LLMs are based on the transformer architecture. This revolutionary architecture was launched with the paper titled Attention Is All You Need in 2017. While transformers have three broad categories (the encoder, the decoder, and the encoder-decoder), the decoder category has arguably seen the most recent publicity and is what we are using here. The basic organization is shown in the classic diagram below sourced from the detailed background available on Wikipedia.
The decoder variant of a transformer is constrained to the components shown on the above right without any of those shown on the above left (which is the encoder). The larger block is repeated many times -- 34 in our case. As suggested by the bottom-most label, during inference the output of the model is repeatedly appended to the input and the model rerun for each output token.
The tokenizer maps an input prompt string and reference vocabulary to an input list of tokens (which are just integers). Incredibly, the transformer model simply consumes the list of tokens and performs about a (figurative) trillion multiply-accumulate operations as data moves through each layer, which finally results in a prediction for the next most likely set of tokens. The sampled output token is then appended to the input token list and the prediction process rerun again and again. Ultimately, this results in the original list of tokens with the output list appended to it, which is run through a reverse tokenizer to recover the result text.
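To make the loop just described concrete, the sketch below (a simplified, hypothetical helper, not the generate() function defined later in this post) shows greedy decoding: predict the most likely next token, append it to the token list, and rerun the model.

import torch

def greedy_decode_sketch(model, tokenizer, prompt, max_new_tokens=8):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                           # (1, sequence length, vocabulary size)
        next_id = logits[0, -1].argmax()                     # greedy choice of the next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)   # append and rerun
    return tokenizer.decode(ids[0], skip_special_tokens=True)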
Let's look at the construction of each of the following system components (the tokenizer, the word embeddings, the repeated transformer blocks, and the output layer) while examining how 16 billion parameters are 'spent':
The tokenizer is not strictly a part of the transformer model and does not consume parameters itself. Rather, it is a parsing and mapping function that is given a reference vocabulary set along with an input text string and produces a list of tokens corresponding to the words found in the input. The reference vocabulary can loosely be thought of as a Python set of words that each have a 'reference number' (which will later become an index to an embedding array). For example, the sentence "The cat and the dog went to the beach" may correspond to [1, 76, 2, 1, 67, 42, 5, 1, 99] if the word the maps to 1, the word cat maps to 76, the word beach maps to 99, and so forth -- note the repeated the becomes a repeated 1. With a little thought, it becomes clear that there is considerable additional complexity around capitalization, verb tenses, word stems, punctuation and vocabulary size constraints which are outside the scope of this blog post.
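As a toy illustration of this mapping (with a made-up vocabulary, not the real tokenizer used below), the example sentence could be tokenized with nothing more than a dictionary lookup:

vocab = {"the": 1, "and": 2, "to": 5, "went": 42, "dog": 67, "cat": 76, "beach": 99}
sentence = "The cat and the dog went to the beach"
tokens = [vocab[word] for word in sentence.lower().split()]
print(tokens)  # [1, 76, 2, 1, 67, 42, 5, 1, 99] -- the repeated 'the' becomes a repeated 1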
The LLM we are using here utilizes Byte-Pair Encoding tokenization as described by Hugging Face. Below we see the vocabulary size reported, followed by a test prompt. This test prompt is tokenized and the resulting list of tokens shown. Each token is converted back to its vocabulary word and printed within pipes to show segmentation. Note how some tokens, such as the first two 4299 and 825 (which relate to def), may or may not consume their adjacent space. Rare words and nonsense are split, as are some number sequences. Finally, some arbitrary vocabulary tokens are chosen, decoded, and printed -- here we see some punctuation and word suffixes appear. All of our prompts below will be passed through this tokenizer, and it is interesting to note that this tokenizer is not Python specific.
print(f"Tokenizer vocabulary size is {tokenizer.vocab_size}")
test_prompt = "def def hullo(): test 1 2 3456 (4+5) muchago printf"
encoded = tokenizer.encode(test_prompt)
decoded = "|".join([tokenizer.decode(x) for x in encoded])
print(f"Test prompt: {test_prompt} \nEncoded: {encoded}\nRound trip: |{decoded}|")
arb_vocab = [tokenizer.decode(x) for x in [1,2,5,10,20,50,100,200,500,1000]]
print(f'Arb vocab: |{"|".join(arb_vocab)}|')
Tokenizer vocabulary size is 50257
Test prompt: def def hullo(): test 1 2 3456 (4+5) muchago printf 
Encoded: [4299, 825, 23644, 78, 33529, 1332, 352, 362, 513, 29228, 357, 19, 10, 20, 8, 881, 3839, 30812]
Round trip: |def| def| hull|o|():| test| 1| 2| 3|456| (|4|+|5|)| much|ago| printf|
Arb vocab: |"|#|&|+|5|S|�||ine|ale|
Word embeddings are the initial stage of the transformer model and offer the capability of expanding the meaning of each vocabulary token. Effectively, they are a learned mapping from a single token number to a real-valued word vector. Consider an array with each row corresponding to the address of a vocabulary word and the row contents (or columns) corresponding to the learned word vector of 'embedding dimension'. We present each token to the array as a row address, the array does a lookup, and returns a word vector. This is repeated for all input tokens. The transformer has a vocabulary size of 51,200 and the model has an embedding dimension of 6,144. As 'pretty printed' below, this translates into 51,200 * 6,144 or 315M learned parameters (rounded to the nearest million).
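A minimal standalone sketch of this lookup (using a freshly initialized table rather than the loaded model's learned weights) is shown below; note that the parameter count matches the embedding line in the summary extract that follows.

import torch

embedding = torch.nn.Embedding(num_embeddings=51200, embedding_dim=6144)  # ~1.2GB of float32 weights
tokens = torch.tensor([4299, 825, 23644])      # a few token numbers from the earlier test prompt
word_vectors = embedding(tokens)               # one 6,144-wide vector per input token
print(word_vectors.shape)                      # torch.Size([3, 6144])
print(sum(p.numel() for p in embedding.parameters()))  # 314572800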
pprint(model_summary, 0, 3) # Table headings
pprint(model_summary, 5, 5) # Embedding stats
  0: ===========================================================================
  1: Layer (type:depth-idx)                                   Param #
  2: ===========================================================================
  3: CodeGenForCausalLM                                       --
  5: │    └─Embedding: 2-1                                    314,572,800
The transformer architecture involves a large number of repeated blocks arranged one after another like a pipeline. Each block has two primary elements: a self-attention mechanism, and two fully-connected feed-forward layers.
The self-attention mechanism involves three equal-sized matrices called Q for query, K for key and V for value. The matrices are effectively multiplied together to calculate the attention appropriate to each element of the subsequent layer inputs. With an embedding dimension of 6,144 and three square matrices, this requires 113M learned parameters. The results are then projected through another square matrix of size 6,144 * 6,144, resulting in another 38M learned parameters.
The two fully connected layers consist of two matrices with an input and output size equal to the embedded dimension of 6,144, but with an additional and larger internal dimension. In this case it is 24,576. As a result we have 6,144 * 24,576 or 151M learned parameters. The second matrix reverses the dimensions, which requires another 151M learned parameters.
This whole block is repeated 34 times in this model, resulting in 34 * (113M + 38M + 2 * 151M) or a total of 15.4B learned parameters for the repeated blocks. The few parameters reported below relating to the normalization layer are not dealt with here.
pprint(model_summary, 0, 3) # Table headings
pprint(model_summary, 8, 19) # Block stats
  0: ===========================================================================
  1: Layer (type:depth-idx)                                   Param #
  2: ===========================================================================
  3: CodeGenForCausalLM                                       --
  8: │    │    └─CodeGenBlock: 3-1                            --
  9: │    │    │    └─LayerNorm: 4-1                          12,288
 10: │    │    │    └─CodeGenAttention: 4-2                   --
 11: │    │    │    │    └─Dropout: 5-1                       --
 12: │    │    │    │    └─Dropout: 5-2                       --
 13: │    │    │    │    └─Linear: 5-3                        113,246,208
 14: │    │    │    │    └─Linear: 5-4                        37,748,736
 15: │    │    │    └─CodeGenMLP: 4-3                         --
 16: │    │    │    │    └─Linear: 5-5                        151,019,520
 17: │    │    │    │    └─Linear: 5-6                        151,001,088
 18: │    │    │    │    └─NewGELUActivation: 5-7             --
 19: │    │    │    │    └─Dropout: 5-8                       --
The output layer is a relatively straightforward fully-connected feed-forward classifier. It is presented with 6,144 inputs from the last stacked block described above and returns the predicted probability of each of the 51,200 output tokens. This requires a matrix of 6,144 * 51,200 resulting in 315M parameters. The output is ultimately sampled and put into a "reverse tokenizer" for final text output. Note that the vocabulary size is not precisely matched between the tokenizer, input, and output -- this is an extraneous artifact for our purposes here.
pprint(model_summary, 0, 3) # Table headings
pprint(model_summary, 417, 417) # Output layer stats
  0: ===========================================================================
  1: Layer (type:depth-idx)                                   Param #
  2: ===========================================================================
  3: CodeGenForCausalLM                                       --
417: ├─Linear: 1-2                                            314,624,000
This gives us a total of 315M + 15.4B + 315M = 16B parameters!
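The arithmetic above can be double-checked with a few lines of Python (biases and the per-block layer normalization parameters are ignored, which accounts for the small difference from the reported total):

d, vocab, inner, blocks = 6144, 51200, 24576, 34

embedding = vocab * d                     # 314,572,800
attention = 3 * d * d + d * d             # Q/K/V matrices plus the output projection
mlp       = 2 * d * inner                 # the two feed-forward matrices
stack     = blocks * (attention + mlp)    # ~15.4B for the repeated blocks
output    = d * vocab                     # the final classifier

print(f"{embedding + stack + output:,}")  # 16,030,629,888 -- roughly the 16B reported above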
Now we will survey code samples generated by the 16-billion-parameter model described above. The helper function below will be central to this: the given prompt is tokenized, the tokens injected into the model repeatedly, and the final output tokens decoded into a returned result string. Note that beam search means the model tracks the 10 most probable candidate sequences at each step rather than greedily committing to a single token, which improves the results significantly.
def generate(prompt):
    input_ids = tokenizer(prompt.strip(" \n"), return_tensors="pt")
    generated_ids = model.generate(**input_ids, max_new_tokens=256, do_sample=False,
                                   num_beams=10, temperature=0.75)
    result = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return result
All of the code generation samples will begin with a prompt involving the target function signature and a docstring that describes the desired behavior. The generate() function implemented above will utilize this to produce model output which is presented through the 'pretty print' function pprint() (where the two numerical arguments indicate what line range to print).
A simple prompt to stimulate the generation of a unit conversion helper function is supplied below. The model generates what it thinks is the most likely completion, and is allowed to continue running for a few extra lines, almost as if daydreaming about related code. It is interesting to see that the functionality is correct, and that the subsequent functions are related, correct, have appropriate names and also have type hints. As the training material has relatively minimal type hints, we drop this aspect moving forward in favor of a more specific docstring.
pprint(generate("""
def min2sec(minutes: float):
    \"""Converts the input minutes into seconds.\""" """), 0, 18)
  0: def min2sec(minutes: float):
  1:     """Converts the input minutes into seconds."""
  2:     return minutes * 60
  3: 
  4: 
  5: def sec2min(seconds: float):
  6:     """Converts the input seconds into minutes."""
  7:     return seconds / 60
  8: 
  9: 
 10: def min2hour(minutes: float):
 11:     """Converts the input minutes into hours."""
 12:     return minutes / 60
 13: 
 14: 
 15: def hour2min(hours: float):
 16:     """Converts the input hours into minutes."""
 17:     return hours * 60
 18: 
A prompt to stimulate the generation of another helper function, derived from a challenge in Project Euler, is supplied below. It is interesting that the docstring contains the word "and" while the generated code includes the logical-OR operator on line 4. Whether the code is actually correct requires some context and human thought. If the operator should instead be logical-AND, then the loop range() and contents could be significantly simplified. A comprehensive (human) code review would investigate the potential differences between intent, docstring and implementation across the overall application.
pprint(generate("""
def sum_of_multiples(a, b, c):
    \"""Finds the sum of all the multiples of a and b below c.\""" """), 0, 7)
  0: def sum_of_multiples(a, b, c):
  1:     """Finds the sum of all the multiples of a and b below c."""
  2:     total = 0
  3:     for i in range(1, c):
  4:         if i % a == 0 or i % b == 0:
  5:             total += i
  6:     return total
  7: 
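If the intent really was logical-AND (i.e. multiples of both a and b), the hinted-at simplification might look like the hand-written sketch below, which sums multiples of lcm(a, b) with an arithmetic series rather than a loop (the function name is ours):

from math import lcm

def sum_of_common_multiples(a, b, c):
    """Sums the multiples of both a and b below c."""
    m = lcm(a, b)
    k = (c - 1) // m               # how many multiples of m lie below c
    return m * k * (k + 1) // 2    # m + 2m + ... + km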
A prompt to stimulate the generation of a function to validate legal email addresses is supplied below. While the docstring is simple and complete, the task is actually very complex and somewhat open-ended as evidenced by RFC 6530, RFC 6531, RFC 6532 and RFC 6533. Despite an unusual logical inversion involving the if condition and return values, and even without considering internationalization, Unicode and normalization, the generated code incorrectly validates joe@..nccgroup..helloThereThisIsATest. Furthermore, CVE-2021-31409 "Unsafe validation RegEx in EmailValidator component" suggests there can be problems lurking below the surface when using regular expressions to process attacker-controlled input.
pprint(generate("""
def validate_email(address):
    \"""Validates a legal email address.\""" """), 0, 5)
  0: def validate_email(address):
  1:     """Validates a legal email address."""
  2:     if not re.match(r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*$", address):
  3:         return False
  4:     return True
  5: 
Elliptic curves are an extremely specialized topic for which a very concise prompt can be supplied. A prompt to stimulate the generation of a function to calculate the group law is supplied below. While the resulting code makes some assumptions on the operand types, input validity and a supporting double() helper function, it works through the necessary corner cases properly and is indeed correct overall! While human expertise is required to determine whether affine, projective, Jacobian or other co-ordinate systems may be the most efficient for a given context, in this basic instance the generated code is a win!
pprint(generate("""
def add(p1, p2):
    \"""Adds two elliptic curve points in affine coordinates.\""" """), 0, 16)
  0: def add(p1, p2):
  1:     """Adds two elliptic curve points in affine coordinates."""
  2:     if p1 == INFINITY:
  3:         return p2
  4:     if p2 == INFINITY:
  5:         return p1
  6:     x1, y1 = p1
  7:     x2, y2 = p2
  8:     if x1 == x2 and y1 == y2:
  9:         return double(p1)
 10:     elif x1 == x2:
 11:         return INFINITY
 12:     s = (y2 - y1) / (x2 - x1)
 13:     x3 = s**2 - x1 - x2
 14:     y3 = s * (x1 - x3) - y1
 15:     return (x3, y3)
 16: 
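For completeness, the assumed double() helper might look like the hand-written sketch below, under the same assumptions as the generated code: points are (x, y) tuples whose coordinates support division, INFINITY is a sentinel for the point at infinity, and the curve coefficient A (from y^2 = x^3 + A*x + B) is available as a global.

def double(p):
    """Doubles an elliptic curve point in affine coordinates."""
    if p == INFINITY:
        return INFINITY
    x, y = p
    if y == 0:                       # vertical tangent: the result is the point at infinity
        return INFINITY
    s = (3 * x**2 + A) / (2 * y)     # slope of the tangent line
    x3 = s**2 - 2 * x
    y3 = s * (x - x3) - y
    return (x3, y3)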
A prompt to stimulate the generation of a function that tests for prime numbers is supplied below. The initial cases tested on lines 2, 4 and 6 are correct. The loop range() endpoint on line 9 for testing factors correctly involves the square root of the operand. However, this loop is inefficient in that (for example) an input of 733 will be tested for divisibility by 3, 9, and 27 among other factors, but the latter two tests cannot succeed if the former test fails (so these latter two tests would be redundant). This issue is somewhat common, so many implementations test a longer list of primes on line 6 so that the loop on line 9 can take much larger steps. In the end, for the size of integers normally used in cryptographic applications, this sort of deterministic test that checks all potential factors would 'never complete' and thus a probabilistic primality test would be required. As before, this requires human expertise to evaluate the suitability of generated code.
pprint(generate("""
def is_prime(a):
    \"""Determines whether the input a is a prime number.\""" """), 0, 13)
  0: def is_prime(a):
  1:     """Determines whether the input a is a prime number."""
  2:     if a < 2:
  3:         return False
  4:     elif a == 2:
  5:         return True
  6:     elif a % 2 == 0:
  7:         return False
  8:     else:
  9:         for i in range(3, int(a ** 0.5) + 1, 2):
 10:             if a % i == 0:
 11:                 return False
 12:         return True
 13: 
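For comparison, a hand-written sketch of the probabilistic approach mentioned above (a Miller-Rabin test with random bases, not something the model generated) could look like the following:

import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin test: composite results are certain, prime results are probabilistic."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:                # write n - 1 as d * 2^r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False             # a is a witness that n is composite
    return True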
A prompt to stimulate the generation of a function to calculate the square root of an integer modulo N is supplied below. The resulting code continues at length but its display is cut off at line 21 for brevity. It is interesting to see input validation done here, although assert may be more appropriate for debug rather than production. Based on the model's generated implementation of is_prime() above and the docstring specifying an odd prime, line 7 needs to be adjusted to prevent a value for N of 2. The actual logic implemented on line 10+ is simply a lookup table. While each result is indeed correct, this function is not usable in a real-world context and it is difficult to imagine the training that led to this point.
pprint(generate("""
def sqrt(input, N):
    \"""
    Calculates the square root of the input integer modulo N, where N is an
    odd prime.
    \""" """), 0, 21)
  0: def sqrt(input, N):
  1:     """
  2:     Calculates the square root of the input integer modulo N, where N is an
  3:     odd prime.
  4:     """
  5:     assert isinstance(input, int)
  6:     assert isinstance(N, int)
  7:     assert N > 1
  8:     assert is_prime(N)
  9: 
 10:     if input == 0:
 11:         return 0
 12:     elif input == 1:
 13:         return 1
 14:     elif input == 4:
 15:         return 2
 16:     elif input == 9:
 17:         return 3
 18:     elif input == 16:
 19:         return 4
 20:     elif input == 25:
 21:         return 5
A similar prompt to stimulate the generation of a function to calculate the square root modulo N is supplied below, but with an additional hint involving the standard Tonelli-Shanks algorithm. Oddly, our input validation now raises the correct exception variants, although the constant on line 11 should be 2 (as also noted above). However, the look-up style logic that follows is gibberish. The strong hint for the correct algorithm is completely missed.
pprint(generate("""
def sqrt(input, N):
    \"""
    Calculates the square root of the positive input integer modulo N using
    the Tonelli-Shanks algorithm, where N is an odd prime.
    \""" """), 0, 22)
  0: def sqrt(input, N):
  1:     """
  2:     Calculates the square root of the positive input integer modulo N using
  3:     the Tonelli-Shanks algorithm, where N is an odd prime.
  4:     """
  5:     if not isinstance(input, int):
  6:         raise TypeError("Input must be an integer.")
  7:     if not isinstance(N, int):
  8:         raise TypeError("N must be an odd prime integer.")
  9:     if input < 0:
 10:         raise ValueError("Input must be a positive integer.")
 11:     if N <= 1:
 12:         raise ValueError("N must be an odd prime integer.")
 13:     if not is_prime(N):
 14:         raise ValueError("N must be an odd prime integer.")
 15:     if input == 0:
 16:         return 0
 17:     if input == 1:
 18:         return 1
 19:     if input == 2:
 20:         return 2
 21:     if input == 3:
 22:         return 3
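For reference, a hand-written sketch of the Tonelli-Shanks algorithm the prompt was hinting at is shown below; it assumes an odd prime modulus p and raises a ValueError when no square root exists.

def tonelli_shanks(n, p):
    """Returns r such that r*r == n (mod p) for an odd prime p."""
    n %= p
    if n == 0:
        return 0
    if pow(n, (p - 1) // 2, p) != 1:          # Euler criterion: no square root exists
        raise ValueError("n is not a quadratic residue modulo p")
    if p % 4 == 3:                            # easy case
        return pow(n, (p + 1) // 4, p)
    q, s = p - 1, 0
    while q % 2 == 0:                         # write p - 1 as q * 2^s with q odd
        q //= 2
        s += 1
    z = 2
    while pow(z, (p - 1) // 2, p) != p - 1:   # find a quadratic non-residue
        z += 1
    m, c, t, r = s, pow(z, q, p), pow(n, q, p), pow(n, (q + 1) // 2, p)
    while t != 1:
        i, t2 = 0, t
        while t2 != 1:                        # find the least i with t^(2^i) == 1
            t2 = pow(t2, 2, p)
            i += 1
        b = pow(c, 1 << (m - i - 1), p)
        m, c, t, r = i, pow(b, 2, p), (t * b * b) % p, (r * b) % p
    return r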
Below we supply a prompt to stimulate the generation of a function for integer deserialization which has traditionally been an error-prone area. There are no special cases specified except for returning the deserialized integer modulo N. The generated code does not check for the input bytearray type, and would return the wrong result when given a list of large integers. Additionally, there are a host of invisible assumptions that make this code problematic, such as the big-endian format, the potential for egregiously oversized inputs leading to a (downstream) denial of service, empty input, and the unsigned (versus signed) result.
pprint(generate("""
def deserialize(input, N):
    \"""Deserializes a bytearray input into an integer modulo N.\""" """), 0, 6)
  0: def deserialize(input, N):
  1:     """Deserializes a bytearray input into an integer modulo N."""
  2:     x = 0
  3:     for c in input:
  4:         x = (x << 8) | c
  5:     return x % N
  6: 
Below, the prompt is tightened with additional constraints such as an exact length of the input bytearray type. It is encouraging to see the resulting correct input validation as well as the use of the built-in from_bytes() method with explicit choices for little-endian and unsigned.
However, what we originally wanted when writing this blog post was a function that raised an exception when the deserialized integer was equal to or larger than N. Repeated 'engineering' of the prompt was unsuccessful at generating correct code. This sort of constraint is absolutely necessary -- see Technical Advisory – Arbitrary Signature Forgery in Stark Bank ECDSA Libraries (CVE-2021-43572, CVE-2021-43570, CVE-2021-43569, CVE-2021-43568, CVE-2021-43571) and table 5 in Taming the many EdDSAs relating to non-canonical encodings. So we can approach the functionality we want, but not actually generate it.
# Unable to prompt for code that raises exception for result >= N
pprint(generate("""
def deserialize(bytes, N):
\"""
Deserializes a little-endian array of exactly 16 bytes into an
unsigned integer modulo N. This function raises a TypeError for
any input type other than a bytearray. This function raises a
ValueError for an incorrectly sized input.
\""" """), 0, 12)
  0: def deserialize(bytes, N):
  1:     """
  2:     Deserializes a little-endian array of exactly 16 bytes into an
  3:     unsigned integer modulo N. This function raises a TypeError for
  4:     any input type other than a bytearray. This function raises a
  5:     ValueError for an incorrectly sized input.
  6:     """
  7:     if not isinstance(bytes, bytearray):
  8:         raise TypeError("Input must be a bytearray")
  9:     if len(bytes)!= 16:
 10:         raise ValueError("Input must be exactly 16 bytes")
 11:     return int.from_bytes(bytes, "little", signed=False) % N
 12: 
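For completeness, the behavior we were unable to coax out of the model -- the same validation plus rejection of any non-canonical (not fully reduced) value -- might be hand-written as follows (the function name is ours):

def deserialize_canonical(data, N):
    """Deserializes exactly 16 little-endian bytes; rejects any value >= N."""
    if not isinstance(data, bytearray):
        raise TypeError("Input must be a bytearray")
    if len(data) != 16:
        raise ValueError("Input must be exactly 16 bytes")
    value = int.from_bytes(data, "little", signed=False)
    if value >= N:
        raise ValueError("Encoded value must be canonical (less than N)")
    return value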
Here encryption-related functionality is considered, starting with a prompt to stimulate the generation of a key derivation function. Presumably the passphrase would be validated prior to this for minimum length, required characters, etc. However, section 5.1.1.2 of NIST SP 800-63B describes the need to also normalize Unicode strings prior to hashing to avoid interoperability problems. The generated code below specifies 100,000 iterations but this is likely not enough as Django currently specifies 720,000. Further, sha256 is not considered memory-hard per NIST guidance. Finally, the docstring specifies an output length of 16 bytes but sha256 returns 32 bytes. As an aside, line 9 should arguably read return salt, key to return a better separated tuple and remove (visual) ambiguity around addition.
pprint(generate("""
def pbkdf(passphrase):
    \"""
    This key derivation function converts a passphrase into an ecryption key.
    The input passphrase is a unicode string and the output key is a bytearray
    of length 16.
    \""" """), 0, 10)
  0: def pbkdf(passphrase):
  1:     """
  2:     This key derivation function converts a passphrase into an ecryption key.
  3:     The input passphrase is a unicode string and the output key is a bytearray
  4:     of length 16.
  5:     """
  6:     salt = bytearray(os.urandom(16))
  7:     passphrase = passphrase.encode('utf-8')
  8:     key = hashlib.pbkdf2_hmac('sha256', passphrase, salt, 100000)
  9:     return salt + key
 10: 
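Pulling the observations above together, a hand-written sketch (not model output) might normalize the passphrase, raise the iteration count, request an explicit 16-byte output, and return the salt and key as a tuple:

import hashlib, os, unicodedata

def pbkdf_sketch(passphrase):
    """Derives a 16-byte key from a unicode passphrase; returns (salt, key)."""
    normalized = unicodedata.normalize("NFKC", passphrase).encode("utf-8")
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", normalized, salt, 720_000, dklen=16)
    return salt, key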
Here a prompt is supplied to stimulate the generation of a function that performs encryption. While it is encouraging to see AES utilized, the ECB mode is likely a lurking disaster as "ECB is not recommended for use in cryptographic protocols." As an aside, smaller models return a function where encryption is performed using a single XOR function.
pprint(generate("""
def encrypt(data, key):
    \"""
    Encrypts an input data string with a key. The key must be a bytearray
    of length 16.
    \""" """), 0, 7)
  0: def encrypt(data, key):
  1:     """
  2:     Encrypts an input data string with a key. The key must be a bytearray
  3:     of length 16.
  4:     """
  5:     aes = AES.new(key, AES.MODE_ECB)
  6:     return aes.encrypt(data)
  7: 
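A hand-written sketch of what a reviewer would more likely want here -- AES in GCM mode via PyCryptodome (which the generated code also appears to assume), a random 12-byte nonce, and the authentication tag carried with the ciphertext -- is shown below:

import os
from Crypto.Cipher import AES

def encrypt_gcm(data, key):
    """Encrypts data with AES-GCM; returns nonce || ciphertext || tag."""
    nonce = os.urandom(12)                           # 96-bit nonce per NIST SP 800-38D
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return nonce + ciphertext + tag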
Now a more detailed prompt is supplied for decryption that uses AES in GCM mode. While the most common initialization vector (IV) length is 96 bits or 12 bytes per section 8.2 of NIST SP 800-38D, the generated code below somehow connects it to the block size which is 16 bytes. Similarly, the cipher.verify() operation on line 9 suggests a fixed tag size of 128 bits or 16 bytes, while the specification gives a broader range. Overall, this function looks very good but is still a little less than ideal.
pprint(generate("""
def aes_decrypt(ciphertext, key):
    \"""
    Decrypts the input ciphertext with a key using AES in GCM mode. The key
    must be a bytearray of length 16.
    \""" """), 0, 13)
  0: def aes_decrypt(ciphertext, key):
  1:     """
  2:     Decrypts the input ciphertext with a key using AES in GCM mode. The key
  3:     must be a bytearray of length 16.
  4:     """
  5:     iv = ciphertext[:AES.block_size]
  6:     cipher = AES.new(key, AES.MODE_GCM, iv)
  7:     plaintext = cipher.decrypt(ciphertext[AES.block_size:])
  8:     try:
  9:         cipher.verify(plaintext[-16:])
 10:         return plaintext[:-16]
 11:     except ValueError:
 12:         raise ValueError("Invalid key or ciphertext!")
 13: 
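Adjusting the generated function per the comments above (a 12-byte nonce rather than the block size, and decrypt_and_verify() so the tag is checked before any plaintext is released) gives a hand-written sketch along these lines:

from Crypto.Cipher import AES

def aes_decrypt_gcm(ciphertext, key):
    """Decrypts nonce || ciphertext || tag as produced by the encryption sketch above."""
    nonce, body, tag = ciphertext[:12], ciphertext[12:-16], ciphertext[-16:]
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    try:
        return cipher.decrypt_and_verify(body, tag)
    except ValueError:
        raise ValueError("Invalid key or ciphertext!")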
If the prospect of increasing the productivity of lesser-skilled programmers is concerning, perhaps automating test case generation would be a better initial step. Here a prompt is supplied to stimulate the generation of a function that calculates the greatest common denominator of its two integer inputs -- a common programming challenge. The additional validation involving 0 is arguably correct, and this is quite nice code.
# def greatest_common_denominator(a, b):
pprint(generate("""
def gcd(a, b):
    \"""
    Finds the greatest common denominator of two positive non-zero integers
    a and b. This function raises a TypeError if both inputs are not integers.
    \""" """), 0, 20)
  0: def gcd(a, b):
  1:     """
  2:     Finds the greatest common denominator of two positive non-zero integers
  3:     a and b. This function raises a TypeError if both inputs are not integers.
  4:     """
  5:     if not isinstance(a, int) or not isinstance(b, int):
  6:         raise TypeError
  7:     if a < 0 or b < 0:
  8:         raise ValueError
  9:     if a == 0 and b == 0:
 10:         raise ValueError
 11:     if a == 0:
 12:         return b
 13:     if b == 0:
 14:         return a
 15:     if a < b:
 16:         a, b = b, a
 17:     while b!= 0:
 18:         a, b = b, a % b
 19:     return a
 20: 
The prompt below is a freeform departure from the preceding instances, and simply asks for test cases to be generated. This gives another very nice result, albeit with an uninspiring choice of test data. Clearly, a human would need to adapt the test vector values.
pprint(generate("""
Write a unit test for a function that calculates the greatest common
denominator of two positive integers. """), 0, 23)
  0: Write a unit test for a function that calculates the greatest common
  1: denominator of two positive integers. 
  2: """
  3: 
  4: import unittest
  5: from gcd import gcd
  6: 
  7: 
  8: class TestGcd(unittest.TestCase):
  9:     def test_gcd_of_1_and_1(self):
 10:         self.assertEqual(1, gcd(1, 1))
 11: 
 12:     def test_gcd_of_2_and_2(self):
 13:         self.assertEqual(2, gcd(2, 2))
 14: 
 15:     def test_gcd_of_3_and_3(self):
 16:         self.assertEqual(3, gcd(3, 3))
 17: 
 18:     def test_gcd_of_4_and_4(self):
 19:         self.assertEqual(4, gcd(4, 4))
 20: 
 21:     def test_gcd_of_5_and_5(self):
 22:         self.assertEqual(5, gcd(5, 5))
 23: 
We have now looked under the hood of an LLM to understand the basic architecture, where and how 16 billion parameters are spent, and the results of asking the model to generate a variety of code fragments. At first impression, the capabilities are absolutely stunning.
However, as the above examples have shown, human expertise is pivotal in correctly specifying the LLM prompt as well as in evaluating the generated code output. The higher-level application context typically remains unexpressed and unexamined by the model. The natural tendency for time-pressured developers to judge "that looks about right", along with commercial pressures to "ship it when it seems to work", will in the best case result in a house of cards deployed into production. In the worst case, obvious CVEs are generated.
At this point, code generation by LLMs appears to change the landscape for human expertise and deep thought, but it only increases, rather than decreases, their importance.
The author would like to thank Aleksandar Kircanski and Eli Sohl for their human expertise and detailed review. All issues remain with the author. The full model summary is shown below.
pprint(model_summary, 0, 2000)
0: =========================================================================== 1: Layer (type:depth-idx) Param # 2: =========================================================================== 3: CodeGenForCausalLM -- 4: ├─CodeGenModel: 1-1 -- 5: │ └─Embedding: 2-1 314,572,800 6: │ └─Dropout: 2-2 -- 7: │ └─ModuleList: 2-3 -- 8: │ │ └─CodeGenBlock: 3-1 -- 9: │ │ │ └─LayerNorm: 4-1 12,288 10: │ │ │ └─CodeGenAttention: 4-2 -- 11: │ │ │ │ └─Dropout: 5-1 -- 12: │ │ │ │ └─Dropout: 5-2 -- 13: │ │ │ │ └─Linear: 5-3 113,246,208 14: │ │ │ │ └─Linear: 5-4 37,748,736 15: │ │ │ └─CodeGenMLP: 4-3 -- 16: │ │ │ │ └─Linear: 5-5 151,019,520 17: │ │ │ │ └─Linear: 5-6 151,001,088 18: │ │ │ │ └─NewGELUActivation: 5-7 -- 19: │ │ │ │ └─Dropout: 5-8 -- 20: │ │ └─CodeGenBlock: 3-2 -- 21: │ │ │ └─LayerNorm: 4-4 12,288 22: │ │ │ └─CodeGenAttention: 4-5 -- 23: │ │ │ │ └─Dropout: 5-9 -- 24: │ │ │ │ └─Dropout: 5-10 -- 25: │ │ │ │ └─Linear: 5-11 113,246,208 26: │ │ │ │ └─Linear: 5-12 37,748,736 27: │ │ │ └─CodeGenMLP: 4-6 -- 28: │ │ │ │ └─Linear: 5-13 151,019,520 29: │ │ │ │ └─Linear: 5-14 151,001,088 30: │ │ │ │ └─NewGELUActivation: 5-15 -- 31: │ │ │ │ └─Dropout: 5-16 -- 32: │ │ └─CodeGenBlock: 3-3 -- 33: │ │ │ └─LayerNorm: 4-7 12,288 34: │ │ │ └─CodeGenAttention: 4-8 -- 35: │ │ │ │ └─Dropout: 5-17 -- 36: │ │ │ │ └─Dropout: 5-18 -- 37: │ │ │ │ └─Linear: 5-19 113,246,208 38: │ │ │ │ └─Linear: 5-20 37,748,736 39: │ │ │ └─CodeGenMLP: 4-9 -- 40: │ │ │ │ └─Linear: 5-21 151,019,520 41: │ │ │ │ └─Linear: 5-22 151,001,088 42: │ │ │ │ └─NewGELUActivation: 5-23 -- 43: │ │ │ │ └─Dropout: 5-24 -- 44: │ │ └─CodeGenBlock: 3-4 -- 45: │ │ │ └─LayerNorm: 4-10 12,288 46: │ │ │ └─CodeGenAttention: 4-11 -- 47: │ │ │ │ └─Dropout: 5-25 -- 48: │ │ │ │ └─Dropout: 5-26 -- 49: │ │ │ │ └─Linear: 5-27 113,246,208 50: │ │ │ │ └─Linear: 5-28 37,748,736 51: │ │ │ └─CodeGenMLP: 4-12 -- 52: │ │ │ │ └─Linear: 5-29 151,019,520 53: │ │ │ │ └─Linear: 5-30 151,001,088 54: │ │ │ │ └─NewGELUActivation: 5-31 -- 55: │ │ │ │ └─Dropout: 5-32 -- 56: │ │ └─CodeGenBlock: 3-5 -- 57: │ │ │ └─LayerNorm: 4-13 12,288 58: │ │ │ └─CodeGenAttention: 4-14 -- 59: │ │ │ │ └─Dropout: 5-33 -- 60: │ │ │ │ └─Dropout: 5-34 -- 61: │ │ │ │ └─Linear: 5-35 113,246,208 62: │ │ │ │ └─Linear: 5-36 37,748,736 63: │ │ │ └─CodeGenMLP: 4-15 -- 64: │ │ │ │ └─Linear: 5-37 151,019,520 65: │ │ │ │ └─Linear: 5-38 151,001,088 66: │ │ │ │ └─NewGELUActivation: 5-39 -- 67: │ │ │ │ └─Dropout: 5-40 -- 68: │ │ └─CodeGenBlock: 3-6 -- 69: │ │ │ └─LayerNorm: 4-16 12,288 70: │ │ │ └─CodeGenAttention: 4-17 -- 71: │ │ │ │ └─Dropout: 5-41 -- 72: │ │ │ │ └─Dropout: 5-42 -- 73: │ │ │ │ └─Linear: 5-43 113,246,208 74: │ │ │ │ └─Linear: 5-44 37,748,736 75: │ │ │ └─CodeGenMLP: 4-18 -- 76: │ │ │ │ └─Linear: 5-45 151,019,520 77: │ │ │ │ └─Linear: 5-46 151,001,088 78: │ │ │ │ └─NewGELUActivation: 5-47 -- 79: │ │ │ │ └─Dropout: 5-48 -- 80: │ │ └─CodeGenBlock: 3-7 -- 81: │ │ │ └─LayerNorm: 4-19 12,288 82: │ │ │ └─CodeGenAttention: 4-20 -- 83: │ │ │ │ └─Dropout: 5-49 -- 84: │ │ │ │ └─Dropout: 5-50 -- 85: │ │ │ │ └─Linear: 5-51 113,246,208 86: │ │ │ │ └─Linear: 5-52 37,748,736 87: │ │ │ └─CodeGenMLP: 4-21 -- 88: │ │ │ │ └─Linear: 5-53 151,019,520 89: │ │ │ │ └─Linear: 5-54 151,001,088 90: │ │ │ │ └─NewGELUActivation: 5-55 -- 91: │ │ │ │ └─Dropout: 5-56 -- 92: │ │ └─CodeGenBlock: 3-8 -- 93: │ │ │ └─LayerNorm: 4-22 12,288 94: │ │ │ └─CodeGenAttention: 4-23 -- 95: │ │ │ │ └─Dropout: 5-57 -- 96: │ │ │ │ └─Dropout: 5-58 -- 97: │ │ │ │ └─Linear: 5-59 113,246,208 98: │ │ │ │ └─Linear: 5-60 37,748,736 99: │ │ │ └─CodeGenMLP: 4-24 
-- 100: │ │ │ │ └─Linear: 5-61 151,019,520 101: │ │ │ │ └─Linear: 5-62 151,001,088 102: │ │ │ │ └─NewGELUActivation: 5-63 -- 103: │ │ │ │ └─Dropout: 5-64 -- 104: │ │ └─CodeGenBlock: 3-9 -- 105: │ │ │ └─LayerNorm: 4-25 12,288 106: │ │ │ └─CodeGenAttention: 4-26 -- 107: │ │ │ │ └─Dropout: 5-65 -- 108: │ │ │ │ └─Dropout: 5-66 -- 109: │ │ │ │ └─Linear: 5-67 113,246,208 110: │ │ │ │ └─Linear: 5-68 37,748,736 111: │ │ │ └─CodeGenMLP: 4-27 -- 112: │ │ │ │ └─Linear: 5-69 151,019,520 113: │ │ │ │ └─Linear: 5-70 151,001,088 114: │ │ │ │ └─NewGELUActivation: 5-71 -- 115: │ │ │ │ └─Dropout: 5-72 -- 116: │ │ └─CodeGenBlock: 3-10 -- 117: │ │ │ └─LayerNorm: 4-28 12,288 118: │ │ │ └─CodeGenAttention: 4-29 -- 119: │ │ │ │ └─Dropout: 5-73 -- 120: │ │ │ │ └─Dropout: 5-74 -- 121: │ │ │ │ └─Linear: 5-75 113,246,208 122: │ │ │ │ └─Linear: 5-76 37,748,736 123: │ │ │ └─CodeGenMLP: 4-30 -- 124: │ │ │ │ └─Linear: 5-77 151,019,520 125: │ │ │ │ └─Linear: 5-78 151,001,088 126: │ │ │ │ └─NewGELUActivation: 5-79 -- 127: │ │ │ │ └─Dropout: 5-80 -- 128: │ │ └─CodeGenBlock: 3-11 -- 129: │ │ │ └─LayerNorm: 4-31 12,288 130: │ │ │ └─CodeGenAttention: 4-32 -- 131: │ │ │ │ └─Dropout: 5-81 -- 132: │ │ │ │ └─Dropout: 5-82 -- 133: │ │ │ │ └─Linear: 5-83 113,246,208 134: │ │ │ │ └─Linear: 5-84 37,748,736 135: │ │ │ └─CodeGenMLP: 4-33 -- 136: │ │ │ │ └─Linear: 5-85 151,019,520 137: │ │ │ │ └─Linear: 5-86 151,001,088 138: │ │ │ │ └─NewGELUActivation: 5-87 -- 139: │ │ │ │ └─Dropout: 5-88 -- 140: │ │ └─CodeGenBlock: 3-12 -- 141: │ │ │ └─LayerNorm: 4-34 12,288 142: │ │ │ └─CodeGenAttention: 4-35 -- 143: │ │ │ │ └─Dropout: 5-89 -- 144: │ │ │ │ └─Dropout: 5-90 -- 145: │ │ │ │ └─Linear: 5-91 113,246,208 146: │ │ │ │ └─Linear: 5-92 37,748,736 147: │ │ │ └─CodeGenMLP: 4-36 -- 148: │ │ │ │ └─Linear: 5-93 151,019,520 149: │ │ │ │ └─Linear: 5-94 151,001,088 150: │ │ │ │ └─NewGELUActivation: 5-95 -- 151: │ │ │ │ └─Dropout: 5-96 -- 152: │ │ └─CodeGenBlock: 3-13 -- 153: │ │ │ └─LayerNorm: 4-37 12,288 154: │ │ │ └─CodeGenAttention: 4-38 -- 155: │ │ │ │ └─Dropout: 5-97 -- 156: │ │ │ │ └─Dropout: 5-98 -- 157: │ │ │ │ └─Linear: 5-99 113,246,208 158: │ │ │ │ └─Linear: 5-100 37,748,736 159: │ │ │ └─CodeGenMLP: 4-39 -- 160: │ │ │ │ └─Linear: 5-101 151,019,520 161: │ │ │ │ └─Linear: 5-102 151,001,088 162: │ │ │ │ └─NewGELUActivation: 5-103 -- 163: │ │ │ │ └─Dropout: 5-104 -- 164: │ │ └─CodeGenBlock: 3-14 -- 165: │ │ │ └─LayerNorm: 4-40 12,288 166: │ │ │ └─CodeGenAttention: 4-41 -- 167: │ │ │ │ └─Dropout: 5-105 -- 168: │ │ │ │ └─Dropout: 5-106 -- 169: │ │ │ │ └─Linear: 5-107 113,246,208 170: │ │ │ │ └─Linear: 5-108 37,748,736 171: │ │ │ └─CodeGenMLP: 4-42 -- 172: │ │ │ │ └─Linear: 5-109 151,019,520 173: │ │ │ │ └─Linear: 5-110 151,001,088 174: │ │ │ │ └─NewGELUActivation: 5-111 -- 175: │ │ │ │ └─Dropout: 5-112 -- 176: │ │ └─CodeGenBlock: 3-15 -- 177: │ │ │ └─LayerNorm: 4-43 12,288 178: │ │ │ └─CodeGenAttention: 4-44 -- 179: │ │ │ │ └─Dropout: 5-113 -- 180: │ │ │ │ └─Dropout: 5-114 -- 181: │ │ │ │ └─Linear: 5-115 113,246,208 182: │ │ │ │ └─Linear: 5-116 37,748,736 183: │ │ │ └─CodeGenMLP: 4-45 -- 184: │ │ │ │ └─Linear: 5-117 151,019,520 185: │ │ │ │ └─Linear: 5-118 151,001,088 186: │ │ │ │ └─NewGELUActivation: 5-119 -- 187: │ │ │ │ └─Dropout: 5-120 -- 188: │ │ └─CodeGenBlock: 3-16 -- 189: │ │ │ └─LayerNorm: 4-46 12,288 190: │ │ │ └─CodeGenAttention: 4-47 -- 191: │ │ │ │ └─Dropout: 5-121 -- 192: │ │ │ │ └─Dropout: 5-122 -- 193: │ │ │ │ └─Linear: 5-123 113,246,208 194: │ │ │ │ └─Linear: 5-124 37,748,736 195: │ │ │ └─CodeGenMLP: 4-48 -- 196: │ │ │ │ └─Linear: 
5-125 151,019,520 197: │ │ │ │ └─Linear: 5-126 151,001,088 198: │ │ │ │ └─NewGELUActivation: 5-127 -- 199: │ │ │ │ └─Dropout: 5-128 -- 200: │ │ └─CodeGenBlock: 3-17 -- 201: │ │ │ └─LayerNorm: 4-49 12,288 202: │ │ │ └─CodeGenAttention: 4-50 -- 203: │ │ │ │ └─Dropout: 5-129 -- 204: │ │ │ │ └─Dropout: 5-130 -- 205: │ │ │ │ └─Linear: 5-131 113,246,208 206: │ │ │ │ └─Linear: 5-132 37,748,736 207: │ │ │ └─CodeGenMLP: 4-51 -- 208: │ │ │ │ └─Linear: 5-133 151,019,520 209: │ │ │ │ └─Linear: 5-134 151,001,088 210: │ │ │ │ └─NewGELUActivation: 5-135 -- 211: │ │ │ │ └─Dropout: 5-136 -- 212: │ │ └─CodeGenBlock: 3-18 -- 213: │ │ │ └─LayerNorm: 4-52 12,288 214: │ │ │ └─CodeGenAttention: 4-53 -- 215: │ │ │ │ └─Dropout: 5-137 -- 216: │ │ │ │ └─Dropout: 5-138 -- 217: │ │ │ │ └─Linear: 5-139 113,246,208 218: │ │ │ │ └─Linear: 5-140 37,748,736 219: │ │ │ └─CodeGenMLP: 4-54 -- 220: │ │ │ │ └─Linear: 5-141 151,019,520 221: │ │ │ │ └─Linear: 5-142 151,001,088 222: │ │ │ │ └─NewGELUActivation: 5-143 -- 223: │ │ │ │ └─Dropout: 5-144 -- 224: │ │ └─CodeGenBlock: 3-19 -- 225: │ │ │ └─LayerNorm: 4-55 12,288 226: │ │ │ └─CodeGenAttention: 4-56 -- 227: │ │ │ │ └─Dropout: 5-145 -- 228: │ │ │ │ └─Dropout: 5-146 -- 229: │ │ │ │ └─Linear: 5-147 113,246,208 230: │ │ │ │ └─Linear: 5-148 37,748,736 231: │ │ │ └─CodeGenMLP: 4-57 -- 232: │ │ │ │ └─Linear: 5-149 151,019,520 233: │ │ │ │ └─Linear: 5-150 151,001,088 234: │ │ │ │ └─NewGELUActivation: 5-151 -- 235: │ │ │ │ └─Dropout: 5-152 -- 236: │ │ └─CodeGenBlock: 3-20 -- 237: │ │ │ └─LayerNorm: 4-58 12,288 238: │ │ │ └─CodeGenAttention: 4-59 -- 239: │ │ │ │ └─Dropout: 5-153 -- 240: │ │ │ │ └─Dropout: 5-154 -- 241: │ │ │ │ └─Linear: 5-155 113,246,208 242: │ │ │ │ └─Linear: 5-156 37,748,736 243: │ │ │ └─CodeGenMLP: 4-60 -- 244: │ │ │ │ └─Linear: 5-157 151,019,520 245: │ │ │ │ └─Linear: 5-158 151,001,088 246: │ │ │ │ └─NewGELUActivation: 5-159 -- 247: │ │ │ │ └─Dropout: 5-160 -- 248: │ │ └─CodeGenBlock: 3-21 -- 249: │ │ │ └─LayerNorm: 4-61 12,288 250: │ │ │ └─CodeGenAttention: 4-62 -- 251: │ │ │ │ └─Dropout: 5-161 -- 252: │ │ │ │ └─Dropout: 5-162 -- 253: │ │ │ │ └─Linear: 5-163 113,246,208 254: │ │ │ │ └─Linear: 5-164 37,748,736 255: │ │ │ └─CodeGenMLP: 4-63 -- 256: │ │ │ │ └─Linear: 5-165 151,019,520 257: │ │ │ │ └─Linear: 5-166 151,001,088 258: │ │ │ │ └─NewGELUActivation: 5-167 -- 259: │ │ │ │ └─Dropout: 5-168 -- 260: │ │ └─CodeGenBlock: 3-22 -- 261: │ │ │ └─LayerNorm: 4-64 12,288 262: │ │ │ └─CodeGenAttention: 4-65 -- 263: │ │ │ │ └─Dropout: 5-169 -- 264: │ │ │ │ └─Dropout: 5-170 -- 265: │ │ │ │ └─Linear: 5-171 113,246,208 266: │ │ │ │ └─Linear: 5-172 37,748,736 267: │ │ │ └─CodeGenMLP: 4-66 -- 268: │ │ │ │ └─Linear: 5-173 151,019,520 269: │ │ │ │ └─Linear: 5-174 151,001,088 270: │ │ │ │ └─NewGELUActivation: 5-175 -- 271: │ │ │ │ └─Dropout: 5-176 -- 272: │ │ └─CodeGenBlock: 3-23 -- 273: │ │ │ └─LayerNorm: 4-67 12,288 274: │ │ │ └─CodeGenAttention: 4-68 -- 275: │ │ │ │ └─Dropout: 5-177 -- 276: │ │ │ │ └─Dropout: 5-178 -- 277: │ │ │ │ └─Linear: 5-179 113,246,208 278: │ │ │ │ └─Linear: 5-180 37,748,736 279: │ │ │ └─CodeGenMLP: 4-69 -- 280: │ │ │ │ └─Linear: 5-181 151,019,520 281: │ │ │ │ └─Linear: 5-182 151,001,088 282: │ │ │ │ └─NewGELUActivation: 5-183 -- 283: │ │ │ │ └─Dropout: 5-184 -- 284: │ │ └─CodeGenBlock: 3-24 -- 285: │ │ │ └─LayerNorm: 4-70 12,288 286: │ │ │ └─CodeGenAttention: 4-71 -- 287: │ │ │ │ └─Dropout: 5-185 -- 288: │ │ │ │ └─Dropout: 5-186 -- 289: │ │ │ │ └─Linear: 5-187 113,246,208 290: │ │ │ │ └─Linear: 5-188 37,748,736 291: │ │ │ └─CodeGenMLP: 4-72 -- 292: │ │ 
│ │ └─Linear: 5-189 151,019,520 293: │ │ │ │ └─Linear: 5-190 151,001,088 294: │ │ │ │ └─NewGELUActivation: 5-191 -- 295: │ │ │ │ └─Dropout: 5-192 -- 296: │ │ └─CodeGenBlock: 3-25 -- 297: │ │ │ └─LayerNorm: 4-73 12,288 298: │ │ │ └─CodeGenAttention: 4-74 -- 299: │ │ │ │ └─Dropout: 5-193 -- 300: │ │ │ │ └─Dropout: 5-194 -- 301: │ │ │ │ └─Linear: 5-195 113,246,208 302: │ │ │ │ └─Linear: 5-196 37,748,736 303: │ │ │ └─CodeGenMLP: 4-75 -- 304: │ │ │ │ └─Linear: 5-197 151,019,520 305: │ │ │ │ └─Linear: 5-198 151,001,088 306: │ │ │ │ └─NewGELUActivation: 5-199 -- 307: │ │ │ │ └─Dropout: 5-200 -- 308: │ │ └─CodeGenBlock: 3-26 -- 309: │ │ │ └─LayerNorm: 4-76 12,288 310: │ │ │ └─CodeGenAttention: 4-77 -- 311: │ │ │ │ └─Dropout: 5-201 -- 312: │ │ │ │ └─Dropout: 5-202 -- 313: │ │ │ │ └─Linear: 5-203 113,246,208 314: │ │ │ │ └─Linear: 5-204 37,748,736 315: │ │ │ └─CodeGenMLP: 4-78 -- 316: │ │ │ │ └─Linear: 5-205 151,019,520 317: │ │ │ │ └─Linear: 5-206 151,001,088 318: │ │ │ │ └─NewGELUActivation: 5-207 -- 319: │ │ │ │ └─Dropout: 5-208 -- 320: │ │ └─CodeGenBlock: 3-27 -- 321: │ │ │ └─LayerNorm: 4-79 12,288 322: │ │ │ └─CodeGenAttention: 4-80 -- 323: │ │ │ │ └─Dropout: 5-209 -- 324: │ │ │ │ └─Dropout: 5-210 -- 325: │ │ │ │ └─Linear: 5-211 113,246,208 326: │ │ │ │ └─Linear: 5-212 37,748,736 327: │ │ │ └─CodeGenMLP: 4-81 -- 328: │ │ │ │ └─Linear: 5-213 151,019,520 329: │ │ │ │ └─Linear: 5-214 151,001,088 330: │ │ │ │ └─NewGELUActivation: 5-215 -- 331: │ │ │ │ └─Dropout: 5-216 -- 332: │ │ └─CodeGenBlock: 3-28 -- 333: │ │ │ └─LayerNorm: 4-82 12,288 334: │ │ │ └─CodeGenAttention: 4-83 -- 335: │ │ │ │ └─Dropout: 5-217 -- 336: │ │ │ │ └─Dropout: 5-218 -- 337: │ │ │ │ └─Linear: 5-219 113,246,208 338: │ │ │ │ └─Linear: 5-220 37,748,736 339: │ │ │ └─CodeGenMLP: 4-84 -- 340: │ │ │ │ └─Linear: 5-221 151,019,520 341: │ │ │ │ └─Linear: 5-222 151,001,088 342: │ │ │ │ └─NewGELUActivation: 5-223 -- 343: │ │ │ │ └─Dropout: 5-224 -- 344: │ │ └─CodeGenBlock: 3-29 -- 345: │ │ │ └─LayerNorm: 4-85 12,288 346: │ │ │ └─CodeGenAttention: 4-86 -- 347: │ │ │ │ └─Dropout: 5-225 -- 348: │ │ │ │ └─Dropout: 5-226 -- 349: │ │ │ │ └─Linear: 5-227 113,246,208 350: │ │ │ │ └─Linear: 5-228 37,748,736 351: │ │ │ └─CodeGenMLP: 4-87 -- 352: │ │ │ │ └─Linear: 5-229 151,019,520 353: │ │ │ │ └─Linear: 5-230 151,001,088 354: │ │ │ │ └─NewGELUActivation: 5-231 -- 355: │ │ │ │ └─Dropout: 5-232 -- 356: │ │ └─CodeGenBlock: 3-30 -- 357: │ │ │ └─LayerNorm: 4-88 12,288 358: │ │ │ └─CodeGenAttention: 4-89 -- 359: │ │ │ │ └─Dropout: 5-233 -- 360: │ │ │ │ └─Dropout: 5-234 -- 361: │ │ │ │ └─Linear: 5-235 113,246,208 362: │ │ │ │ └─Linear: 5-236 37,748,736 363: │ │ │ └─CodeGenMLP: 4-90 -- 364: │ │ │ │ └─Linear: 5-237 151,019,520 365: │ │ │ │ └─Linear: 5-238 151,001,088 366: │ │ │ │ └─NewGELUActivation: 5-239 -- 367: │ │ │ │ └─Dropout: 5-240 -- 368: │ │ └─CodeGenBlock: 3-31 -- 369: │ │ │ └─LayerNorm: 4-91 12,288 370: │ │ │ └─CodeGenAttention: 4-92 -- 371: │ │ │ │ └─Dropout: 5-241 -- 372: │ │ │ │ └─Dropout: 5-242 -- 373: │ │ │ │ └─Linear: 5-243 113,246,208 374: │ │ │ │ └─Linear: 5-244 37,748,736 375: │ │ │ └─CodeGenMLP: 4-93 -- 376: │ │ │ │ └─Linear: 5-245 151,019,520 377: │ │ │ │ └─Linear: 5-246 151,001,088 378: │ │ │ │ └─NewGELUActivation: 5-247 -- 379: │ │ │ │ └─Dropout: 5-248 -- 380: │ │ └─CodeGenBlock: 3-32 -- 381: │ │ │ └─LayerNorm: 4-94 12,288 382: │ │ │ └─CodeGenAttention: 4-95 -- 383: │ │ │ │ └─Dropout: 5-249 -- 384: │ │ │ │ └─Dropout: 5-250 -- 385: │ │ │ │ └─Linear: 5-251 113,246,208 386: │ │ │ │ └─Linear: 5-252 37,748,736 387: │ │ │ └─CodeGenMLP: 
4-96 -- 388: │ │ │ │ └─Linear: 5-253 151,019,520 389: │ │ │ │ └─Linear: 5-254 151,001,088 390: │ │ │ │ └─NewGELUActivation: 5-255 -- 391: │ │ │ │ └─Dropout: 5-256 -- 392: │ │ └─CodeGenBlock: 3-33 -- 393: │ │ │ └─LayerNorm: 4-97 12,288 394: │ │ │ └─CodeGenAttention: 4-98 -- 395: │ │ │ │ └─Dropout: 5-257 -- 396: │ │ │ │ └─Dropout: 5-258 -- 397: │ │ │ │ └─Linear: 5-259 113,246,208 398: │ │ │ │ └─Linear: 5-260 37,748,736 399: │ │ │ └─CodeGenMLP: 4-99 -- 400: │ │ │ │ └─Linear: 5-261 151,019,520 401: │ │ │ │ └─Linear: 5-262 151,001,088 402: │ │ │ │ └─NewGELUActivation: 5-263 -- 403: │ │ │ │ └─Dropout: 5-264 -- 404: │ │ └─CodeGenBlock: 3-34 -- 405: │ │ │ └─LayerNorm: 4-100 12,288 406: │ │ │ └─CodeGenAttention: 4-101 -- 407: │ │ │ │ └─Dropout: 5-265 -- 408: │ │ │ │ └─Dropout: 5-266 -- 409: │ │ │ │ └─Linear: 5-267 113,246,208 410: │ │ │ │ └─Linear: 5-268 37,748,736 411: │ │ │ └─CodeGenMLP: 4-102 -- 412: │ │ │ │ └─Linear: 5-269 151,019,520 413: │ │ │ │ └─Linear: 5-270 151,001,088 414: │ │ │ │ └─NewGELUActivation: 5-271 -- 415: │ │ │ │ └─Dropout: 5-272 -- 416: │ └─LayerNorm: 2-4 12,288 417: ├─Linear: 1-2 314,624,000 418: =========================================================================== 419: Total params: 16,032,155,648 420: Trainable params: 16,032,155,648 421: Non-trainable params: 0 422: ===========================================================================